
Title: BC4GO: a full-text corpus for the BioCreative IV GO Task

VAN AUKEN, KIMBERLY - California Institute Of Technology
Schaeffer, Mary
MCQUILTON, PETER - University Of Cambridge
LAULEDERKIND, STANLEY - Medical College Of Wisconsin
LI, DONGHUI - Carnegie Institute - Stanford
WANG, SHUR-JEN - Medical College Of Wisconsin
HAYMAN, G. THOMAS - Medical College Of Wisconsin
TWEEDIE, SUSAN - University Of Cambridge
ARIGHI, CECILIA - University Of Delaware
DONE, JAMES - California Institute Of Technology

Submitted to: Database: The Journal of Biological Databases and Curation
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 7/3/2014
Publication Date: 7/3/2014
Publication URL:
Citation: Van Auken, K., Schaeffer, M.L., McQuilton, P., Laulederkind, S.J., Li, D., Wang, S., Hayman, G., Tweedie, S., Arighi, C.N., Done, J., et al. 2014. BC4GO: a full-text corpus for the BioCreative IV GO Task. Database: The Journal of Biological Databases and Curation. 2014:1-9. Available:

Interpretive Summary: With the release of the maize genome sequence in 2009, researchers in plant breeding and basic biology want to know what functional information the sequence encodes. Experimental information about gene function in the peer-reviewed scientific literature cannot currently be linked to genome sequence data without careful reading by highly skilled curators. Because this reading is both expensive and time-consuming, it is a major bottleneck in assigning experimentally confirmed functions to genes at model organism genome databases. An alternative is a computational solution that uses digital copies of public literature to extract the functional information assigned to genes and provides it in a form usable by model organism genome databases. This study reports a body of literature developed as a test data set for developers of natural language processing software. The data set was manually annotated by curators from several well-established plant and animal model organism genome databases, including TAIR (Arabidopsis, a model plant), MaizeGDB (corn), FlyBase (fruit fly), WormBase (nematode) and RGD (rat). The annotations include specific function and evidence code terms from a controlled vocabulary called the "Gene Ontology" (GO), an international standard for assigning function to genes in any organism. The annotated data set was used in a 2013 BioCreative IV task, in which an international group of software teams worked toward automating manual curation. The data set is publicly accessible because it includes only articles with free public access. An interesting finding is that the full text of articles is required to properly assign gene function with evidence codes.

Technical Abstract: Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database (MOD) groups. Due to its manual nature, this task is time-consuming and labor-intensive, and thus considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy that is comparable to human annotators. One recognized challenge in developing such systems is the lack of marked passage-level evidence text that provides the basis for making GO annotations. To this end, we aim to create a corpus that includes the GO evidence text along with the three essential elements of GO annotations: 1) a gene or gene product, 2) a GO term and 3) a GO evidence code. To ensure our results are consistent with real-life GO annotation data, we recruited a team of eight professional GO curators from the biocuration community, and asked them to follow their routine GO annotation protocols. With the aid of a web-based annotation tool, our annotators marked up nearly 4,000 unique text passages in 200 full-text articles, where on average each unique GO term is annotated with four different evidence text passages. Our corpus analysis shows that most of the evidence text occurs in the body of the article, while as little as 12% appears in the abstracts. This result demonstrates the necessity of using full text for text mining GO terms. Through its use as the official data set for the BioCreative IV GO (BC4GO) task, we expect our unique BC4GO corpus to become a valuable resource for the BioNLP research community.
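
As an illustration of the annotation structure described in the abstract, the following is a minimal Python sketch of how one passage-level record could be represented: an evidence passage tied to a gene, a GO term and a GO evidence code. The class name, field names, helper functions and the toy records are all hypothetical; they are not the corpus's actual file format, and the toy values are not taken from the corpus. The two helpers compute the kinds of statistics the abstract reports (share of unique passages found in abstracts, and unique passages per GO term).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidencePassage:
    """One passage-level annotation (hypothetical layout, not the corpus format)."""
    gene: str           # gene or gene product
    go_term: str        # GO term identifier
    evidence_code: str  # GO evidence code, e.g. IDA or IMP
    text: str           # evidence text marked by the curator
    section: str        # "abstract" or "body" (assumed labels)


def abstract_fraction(passages):
    """Fraction of unique evidence passages located in the abstract."""
    unique = {(p.section, p.text) for p in passages}
    if not unique:
        return 0.0
    in_abstract = sum(1 for section, _ in unique if section == "abstract")
    return in_abstract / len(unique)


def passages_per_go_term(passages):
    """Number of unique evidence passages supporting each GO term."""
    by_term = defaultdict(set)
    for p in passages:
        by_term[p.go_term].add(p.text)
    return {term: len(texts) for term, texts in by_term.items()}


# Toy records for illustration only; genes and sentences are invented.
TOY_PASSAGES = [
    EvidencePassage("geneA", "GO:0006915", "IMP", "Loss of geneA blocks cell death.", "abstract"),
    EvidencePassage("geneA", "GO:0006915", "IMP", "Mutants showed no dying cells.", "body"),
    EvidencePassage("geneA", "GO:0006915", "IDA", "Staining confirmed the phenotype.", "body"),
    EvidencePassage("geneB", "GO:0005634", "IDA", "The fusion protein was nuclear.", "body"),
]

if __name__ == "__main__":
    print(abstract_fraction(TOY_PASSAGES))
    print(passages_per_go_term(TOY_PASSAGES))
```

Keeping the section label on each passage is what makes an abstract-versus-body comparison possible; a record without it could still carry the gene, GO term and evidence code, but not where the evidence text was found.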