Author
VAN AUKEN, KIMBERLY - California Institute Of Technology
Schaeffer, Mary
MCQUILTON, PETER - University Of Cambridge
LAULEDERKIND, STANLEY - Medical College Of Wisconsin
LI, DONGHUI - Carnegie Institute - Stanford
WANG, SHUR-JEN - Medical College Of Wisconsin
HAYMAN, G. THOMAS - Medical College Of Wisconsin
TWEEDIE, SUSAN - University Of Cambridge
ARIGHI, CECILIA - University Of Delaware
DONE, JAMES - California Institute Of Technology
Submitted to: Database: The Journal of Biological Databases and Curation
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 7/3/2014
Publication Date: 7/3/2014
Publication URL: http://handle.nal.usda.gov/10113/59644
Citation: Van Auken, K., Schaeffer, M.L., McQuilton, P., Laulederkind, S.J., Li, D., Wang, S., Hayman, G., Tweedie, S., Arighi, C.N., Done, J., et al. 2014. BC4GO: a full-text corpus for the BioCreative IV GO Task. Database: The Journal of Biological Databases and Curation. 2014:1-9. Available: http://database.oxfordjournals.org/content/2014/bau074
Interpretive Summary: With the release of the maize genome sequence in 2009, researchers in plant breeding and basic biology want to know the functional information encoded in the sequence. Experimental information about function in the peer-reviewed scientific literature cannot currently be linked to the genome sequence data without careful reading by highly skilled curators. This is a major bottleneck in assigning experimentally confirmed function to genes in genome databases, as the task is both expensive and time-consuming. An alternative is to develop a computational solution that uses digital copies of public literature to extract the functional information assigned to genes and provide it in a form usable by model organism genome databases. This study reports a body of literature developed as a test data set for natural language processing software developers. The data set was manually annotated by curators from several well-established plant and animal model genome databases, including TAIR (Arabidopsis, a model plant), MaizeGDB (corn), FlyBase (fruit fly), WormBase (nematode) and RGDB (rat). The annotations included specific function and evidence code terms from a controlled vocabulary called the "Gene Ontology" (GO), which is an international standard for assigning function to genes in any organism.
The annotated data set was used in the 2013 BioCreative IV task, in which an international group of software teams worked toward automating manual curation. The data set is publicly accessible, as only articles with free public access are included. An interesting finding is that the full text of articles is required to properly assign gene function with evidence codes.
Technical Abstract: Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database (MOD) groups. Because it is done manually, this task is time-consuming and labor-intensive, and is thus considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered accuracy comparable to that of human annotators. One recognized challenge in developing such systems is the lack of marked passage-level evidence text that provides the basis for making GO annotations. To this end, we aim to create a corpus that includes the GO evidence text along with the three essential elements of GO annotations: 1) a gene or gene product, 2) a GO term and 3) a GO evidence code. To ensure our results are consistent with real-life GO annotation data, we recruited a team of eight professional GO curators from the biocuration community and asked them to follow their routine GO annotation protocols. With the aid of a web-based annotation tool, our annotators marked up nearly 4,000 unique text passages in 200 full-text articles, where on average each unique GO term is annotated with four different evidence text passages. Our corpus analysis shows that most of the evidence text occurs in the body of the article, while as little as 12% appears in the abstracts. This result demonstrates the necessity of using full text for text mining GO terms.
Through its use as the official data set for the BioCreative IV GO (BC4GO) task, we expect our unique BC4GO corpus to become a valuable resource for the BioNLP research community.
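As an illustration of the annotation structure described in the abstract, the following is a minimal Python sketch of one evidence passage carrying the three GO annotation elements (gene or gene product, GO term, GO evidence code), together with the kind of abstract-versus-body tally behind the 12% figure. It is not the corpus's actual file format or the authors' analysis code: the class name `EvidencePassage`, the `section` field, the function `abstract_fraction`, and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvidencePassage:
    """One marked-up passage plus the three GO annotation elements.

    Field names are hypothetical; they are not taken from the corpus schema.
    """
    gene: str           # gene or gene product
    go_term: str        # GO term identifier
    evidence_code: str  # GO evidence code, e.g. IDA or IMP
    text: str           # the evidence text passage
    section: str        # assumed label: "abstract" or "body"


def abstract_fraction(passages: Sequence[EvidencePassage]) -> float:
    """Return the share of passages located in the abstract (0.0 if empty)."""
    if not passages:
        return 0.0
    in_abstract = sum(1 for p in passages if p.section == "abstract")
    return in_abstract / len(passages)


# Invented example records, not drawn from the BC4GO corpus.
examples = [
    EvidencePassage("geneA", "GO:0006915", "IMP", "placeholder passage 1", "abstract"),
    EvidencePassage("geneA", "GO:0006915", "IMP", "placeholder passage 2", "body"),
    EvidencePassage("geneB", "GO:0005634", "IDA", "placeholder passage 3", "body"),
    EvidencePassage("geneB", "GO:0005634", "IDA", "placeholder passage 4", "body"),
]

if __name__ == "__main__":
    print(f"{abstract_fraction(examples):.0%} of example passages are in the abstract")
```

Keeping the passage location alongside each annotation is what makes a comparison between abstract-only and full-text mining possible; a real analysis would read these records from the released corpus files rather than from hand-written examples.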