Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

Author(s): Jiang, Xiangying
Author(s): Ringwald, Martin
Author(s): Blake, Judith
Author(s): Shatkay, Hagit
Ordered Author: Xiangying Jiang, Martin Ringwald, Judith Blake and Hagit Shatkay
UD Author: Jiang, Xiangying
UD Author: Shatkay, Hagit
Date Accessioned: 2018-06-07T14:37:31Z
Date Available: 2018-06-07T14:37:31Z
Copyright Date: Copyright © The Author(s) 2017
Publication Date: 2017-03-24
Description: Publisher's PDF
Abstract: The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area.
Department: University of Delaware. Department of Computer and Information Sciences.
Citation: Jiang, Xiangying, et al. "Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)." Database 2017.1 (2017).
DOI: 10.1093/database/bax017
ISSN: 1758-0463
URL: http://udspace.udel.edu/handle/19716/23552
Language: en_US
Publisher: Oxford University Press.
dc.rights: Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
dc.source: Database: The Journal of Biological Databases and Curation
dc.source.uri: https://academic.oup.com/database
Title: Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)
Type: Article
Files
Original bundle
Name: Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)_1493413970T1607.pdf
Size: 410.25 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.22 KB
Format: Item-specific license agreed upon to submission