Location: Plant, Soil and Nutrition Research
Title: k-mer grammar uncovers maize regulatory architecture
|MEJIA-GUERRA, MARIA KATHERINE - Cornell University|
|Buckler, Edward - Ed|
Submitted to: BMC Plant Biology
Publication Type: Review Article
Publication Acceptance Date: 2/21/2019
Publication Date: 3/15/2019
Citation: Mejia-Guerra, M., Buckler IV, E.S. 2019. k-mer grammar uncovers maize regulatory architecture. Biomed Central (BMC) Plant Biology. 19:103. https://doi.org/10.1186/s12870-019-1693-2.
Interpretive Summary: The portions of the genome that regulate gene expression accumulate variation between individuals, and this variation is key to explaining quantitative traits of interest in agriculture. Consequently, it is valuable to “annotate” the regulatory regions of the genome across larger populations, and so determine the specific sequence variants responsible for differences in gene expression and, ultimately, for variation in traits of interest such as yield. However, identifying which parts of the genome take part in regulating gene expression relies on experimental methods that take an enormous amount of time, effort, and money. Because of that, the “annotation” of regulatory regions can only be done in a few individuals of a species of interest. In this project, we aimed to determine the predictive characteristics of regulatory sequences using the experimental data available for a few individuals, and then to build models that allow us to extrapolate the “annotations” to large populations or to other related species. To do so, we borrowed machine learning methods from the field of Natural Language Processing (NLP), an active area of research that combines artificial intelligence, computer science, and linguistics to handle large amounts of text for a variety of tasks. NLP methods have demonstrated high accuracy in automatically extracting the relevant terms that differentiate between two authors, or between different newspaper sections, without relying on knowledge of the oddities of human languages. These modeling approaches overcome the challenge of having to generate new, expensive experimental data to accurately annotate intergenic regions across maize lines. These accurate models should therefore help breeders transfer this information more effectively from the reference maize line onto other genotypes, which, together with the current low cost of genotyping, shows potential to accelerate the breeding process.
Technical Abstract: Only a small percentage of the genome sequence is involved in the regulation of gene expression, but identifying this portion biochemically is expensive and laborious. In species like maize, with diverse intergenic regions and many repetitive elements, this is an especially challenging problem. While regulatory regions are rare, they do have characteristic chromatin contexts and sequence organization (the grammar) with which they can be identified. We developed a computational framework to exploit this sequence arrangement. The models learn to classify regulatory regions based on sequence features (k-mers). To do this, we borrowed two approaches from the field of natural language processing: (1) "bag-of-words", which is commonly used for differentially weighting key words in tasks like sentiment analysis, and (2) a vector-space model using word2vec (vector-k-mers), which captures semantic and linguistic relationships between words. We built "bag-of-k-mers" and "vector-k-mers" models that distinguish between regulatory and non-regulatory regions with an accuracy above 90%. Our "bag-of-k-mers" models achieved higher overall accuracy, while the "vector-k-mers" models were more useful in highlighting key groups of sequences within the regulatory regions. These models now provide powerful tools to annotate regulatory regions in other maize lines beyond the reference, at low cost and with high accuracy.
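To make the "bag-of-k-mers" idea concrete, the following is a minimal Python sketch of how a DNA sequence can be treated as a "document" whose "words" are overlapping k-mers. It is an illustration only, not the published pipeline: the function names, the toy sequences, and the choice of k = 3 are assumptions made here, and the k-mer weighting and classifier used in the study are not specified in this summary, so they are not shown.

```python
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq, skipping any that contain N."""
    seq = seq.upper()
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if "N" not in seq[i:i + k]
    ]


def bag_of_kmers(seq, k):
    """Count k-mer occurrences, ignoring their order (the 'bag')."""
    return Counter(kmers(seq, k))


def to_vector(bag, vocabulary):
    """Turn a bag into a fixed-length count vector over a shared vocabulary."""
    return [bag.get(kmer, 0) for kmer in vocabulary]


# Hypothetical toy sequences; real inputs would be genomic regions
# labeled as regulatory or non-regulatory.
toy_sequences = ["ACGTACGT", "TTTTACGA"]
k = 3  # assumed value, chosen only to keep the example small

bags = [bag_of_kmers(s, k) for s in toy_sequences]
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]

for seq, vec in zip(toy_sequences, vectors):
    print(seq, dict(zip(vocabulary, vec)))
```

Count vectors of this kind could then be reweighted and passed to any standard classifier. A vector-k-mers model would instead learn a dense embedding for each k-mer from its neighboring k-mers, which is why it is better suited to grouping related sequences.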