Browsing by Author "Jiang, Xiangying"
Now showing 1 - 3 of 3
Item
Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)
(Oxford University Press, 2017-03-24) Jiang, Xiangying; Ringwald, Martin; Blake, Judith; Shatkay, Hagit
The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification.
Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area.

Item
Toward effective biomedical document classification for supporting the biocuration workflow
(University of Delaware, 2020) Jiang, Xiangying
Scientific literature is an important source of knowledge supporting biomedical research. The large and rapidly increasing number of publications makes automated biomedical document classification useful and essential in biomedical research. Effective biomedical document classifiers are especially needed in biodatabases such as the Mouse Genome Informatics (MGI) database, FlyBase and UniProt, as much of the information in such databases is manually collected from publications. This is a slow, labor-intensive process that can benefit from automation.
We propose machine learning methods for addressing biomedical document classification for supporting the biodatabase workflow. We present our work in the context of the Gene Expression Database (GXD) in MGI, which is the largest comprehensive dataset concerning expression information in the mouse. We first develop a simple yet effective classifier employing statistical feature selection, aiming to identify publications relevant to GXD over a large balanced dataset. However, biodatabases are typically highly imbalanced. To address class imbalance, we then present a modified meta-classification framework employing clustering-based under-sampling along with our feature selection strategies. Notably, the majority of previously proposed biomedical document classifiers use only text information extracted from the title and abstract of the publication. However, as our group and several others noted, images provide substantial information for determining the topics discussed in the publications.
As such, improving on the method for imbalanced biomedical document classification described above, we introduce a classification scheme incorporating features gathered from image captions, in addition to those obtained from titles-and-abstracts. Experimental results demonstrate that our proposed classification frameworks effectively address biomedical document classification for supporting the biodatabase curation workflow.

Item
Utilizing image and caption information for biomedical document classification
(Bioinformatics, 2021-07-12) Li, Pengyuan; Jiang, Xiangying; Zhang, Gongbo; Trabucco, Juan Trelles; Raciti, Daniela; Smith, Cynthia; Ringwald, Martin; Marai, G. Elisabeta; Arighi, Cecilia; Shatkay, Hagit
Motivation: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.
Results: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods.
The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to overall improved classification performance.
Availability and implementation: Source code and the list of PMIDs of the publications in our datasets are available upon request.
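
The two integration methods named in the abstract above (one larger concatenated vector versus a meta-classification scheme) can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract says is available only upon request. The nearest-centroid classifier is a toy stand-in for whatever learner the paper uses, and the feature vectors, labels and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


class CentroidClassifier:
    """Toy nearest-centroid binary classifier (assumed stand-in for a real learner)."""

    def fit(self, X: Sequence[Vector], y: Sequence[int]) -> "CentroidClassifier":
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            # Column-wise mean of all training vectors carrying this label.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def score(self, x: Vector) -> float:
        """Positive when x is closer to the class-1 (relevant) centroid."""
        d0 = sum((a - b) ** 2 for a, b in zip(x, self.centroids[0]))
        d1 = sum((a - b) ** 2 for a, b in zip(x, self.centroids[1]))
        return d0 - d1

    def predict(self, x: Vector) -> int:
        return 1 if self.score(x) > 0 else 0


def concatenate(*sources: Vector) -> Vector:
    """Method 1: join the per-source vectors into a single larger vector."""
    joined: Vector = []
    for source in sources:
        joined.extend(source)
    return joined


def fit_meta(sources: Sequence[Sequence[Vector]], y: Sequence[int]
             ) -> Tuple[List[CentroidClassifier], CentroidClassifier]:
    """Method 2: one base classifier per source, then a classifier over their scores.

    For brevity the base scores come from the training data itself; a real
    setup would use held-out predictions to avoid overfitting the meta level.
    """
    bases = [CentroidClassifier().fit(X, y) for X in sources]
    meta_X = [[b.score(X[i]) for b, X in zip(bases, sources)] for i in range(len(y))]
    return bases, CentroidClassifier().fit(meta_X, y)


def predict_meta(bases: Sequence[CentroidClassifier], meta: CentroidClassifier,
                 doc_sources: Sequence[Vector]) -> int:
    return meta.predict([b.score(x) for b, x in zip(bases, doc_sources)])


# Hypothetical toy data: four documents, three information sources each.
figure_words = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
abstracts = [[0.7, 0.3], [0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]
labels = [1, 1, 0, 0]  # 1 = relevant, 0 = irrelevant

# Method 1: a single classifier over the concatenated representation.
concat_X = [concatenate(f, c, a) for f, c, a in zip(figure_words, captions, abstracts)]
concat_model = CentroidClassifier().fit(concat_X, labels)

# Method 2: meta-classification over per-source base classifiers.
bases, meta = fit_meta([figure_words, captions, abstracts], labels)

new_doc = [[1.0, 0.0], [0.85, 0.15], [0.65, 0.35]]
other_doc = [[0.0, 1.0], [0.1, 0.9], [0.3, 0.7]]
print(concat_model.predict(concatenate(*new_doc)), predict_meta(bases, meta, new_doc))
```

Concatenation lets a single model see all features at once, whereas the meta-classification variant keeps each source in its own model and learns only how to weigh their outputs. Which works better is an empirical question that this sketch does not answer.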