Integration of heterogeneous data for protein ontology database using semantic web technology

Date
2018
Publisher
University of Delaware
Abstract
As the volume and diversity of data grow, along with the desire to share them, we inevitably face two problems: combining heterogeneous data generated by many different but related sources, and providing users with a unified view of the combined data set. Data integration systems facilitate information access and reuse by providing a common access point and a more complete view of the available information. The Semantic Web, a widely adopted framework, provides the requisite technologies to make such integration possible: 1) an abstract model for relational graphs, RDF; 2) a query language adapted for relational graphs, SPARQL; and 3) various technologies to characterize relationships and categorize resources, such as RDFS and OWL.

The Protein Ontology (PRO) database draws on data sources that provide orthology, annotation, and mapping information, as well as sequence-related data, including amino acid and splice variants and multiple sequence alignments. The PRO website is currently hosted in two places: the University of Delaware for the entry page and visualization, and Georgetown University for text search and browsing. This dual-site structure requires data files to be duplicated, with overlapping content, which creates website maintenance issues. To streamline the update process and remove redundancy, we explored simplifying data integration for the PRO database using Semantic Web technology. In this process, the heterogeneous data were converted into RDF triples and integrated into a Virtuoso RDF triple store. Furthermore, a Virtuoso/SPARQL-based search engine supporting full-scale text search and hierarchy browsing for the PRO website was developed. Tests show that its performance is similar to that of the Apache Lucene-based search engine currently in use. We also developed RESTful APIs for programmatic access to the PRO database using the OpenAPI Specification and Django REST framework.

In conclusion, Semantic Web technologies such as RDF and SPARQL are suitable for data integration. Heterogeneous data in the PRO database are structured and simplified using RDF triples so that search efficiency can be improved. In addition, the thesis presents the design and implementation of the RESTful APIs in detail, along with application examples. It aims to provide a clear description of the heterogeneous data integration process and of the API design and implementation process that can serve as a reference in the field of bioinformatics.