Integration of heterogeneous data for protein ontology database using semantic web technology

Author(s): Li, Xiang
Date Accessioned: 2019-05-20T13:19:18Z
Date Available: 2019-05-20T13:19:18Z
Publication Date: 2018
SWORD Update: 2019-02-15T17:02:32Z
Abstract: As the volume and diversity of data and the desire to share them increase, we inevitably encounter two problems: combining heterogeneous data generated from many different but related sources, and providing users with a unified view of the combined data set. Data integration systems facilitate information access and reuse by providing a common access point and a more complete view of the available information. The Semantic Web, a widely adopted system, provides the requisite technologies to make such integration possible: 1) an abstract model for the relational graphs: RDF; 2) a query language adapted for the relational graphs: SPARQL; and 3) various technologies to characterize the relationships and categorize resources: RDFS, OWL, etc.
PRO databases draw on data sources that provide orthology, annotation, and mapping information, as well as sequence-related data, including amino acid and splice variants and multiple sequence alignments. The PRO website is currently hosted in two places: the University of Delaware for the entry page and visualization, and Georgetown University for text search and browsing. The dual-site structure requires that data files be duplicated and overlap, which creates a website maintenance issue. To streamline the update process and remove redundancy, we explored simplifying the data integration for the PRO database using Semantic Web technology. In this process, the heterogeneous data were converted into RDF triples and integrated into a Virtuoso RDF triple store. Furthermore, we developed a Virtuoso/SPARQL-based search engine for full-scale text search and hierarchy browsing on the PRO website. Tests show that its performance is similar to that of the Apache Lucene-based search engine currently in use. We also developed RESTful APIs for programmatic access to the PRO database using the OpenAPI Specification and Django REST framework.
In conclusion, Semantic Web technologies such as RDF and SPARQL are suitable for data integration. Heterogeneous data in the PRO database are structured and simplified by using RDF triples so that search efficiency can be improved. In addition, the thesis presents the design and implementation of the RESTful APIs in detail, along with application examples. The thesis aims to provide a clear description of the heterogeneous data integration process and the API design and implementation process that can serve as a reference in the field of bioinformatics.
Advisor: Chen, Chuming
Degree: M.S.
Department: University of Delaware, Center for Bioinformatics and Computational Biology
Department: University of Delaware, Department of Computer and Information Sciences
Unique Identifier: 1101900478
URL: http://udspace.udel.edu/handle/19716/24179
Language: en
Publisher: University of Delaware
URI: https://search.proquest.com/docview/2194351418?accountid=10457
Title: Integration of heterogeneous data for protein ontology database using semantic web technology
Type: Thesis
Files
Original bundle
Name:
Li_udel_0060M_13585.pdf
Size:
5.51 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
2.22 KB
Format:
Item-specific license agreed upon to submission