Research Project: Enhancing Plant Genome Function Maps Through Genomic, Genetic, Computational and Collaborative Research

Location: Plant, Soil and Nutrition Research

2015 Annual Report

1: Apply computational, genomic, genetic and/or systems biology approaches to develop new models for plant genome structure and organization that advance our understanding of plant evolution and diversity.
1.1: Establish an integrated reference genome resource for plant genomes.
1.2: Analysis and visualization of genotypic, epigenomic, and functionally phenotypic diversity.
1.3: Comparative genomics: analysis of plant genomes (stewardship of reference resource) and visualization informed by evolutionary histories.
2: Analyze and develop genome level regulatory network models that focus on and integrate the processes underlying plant development and responses to environmental change.
2.1: Develop genome-wide functional networks for the model plant genome Arabidopsis.
2.2: Crop GRNs to support functional prediction for agriculturally relevant phenotypes.
3: Collaborate, develop and implement new standards for the management and analysis of plant genomic, genetic and phenotypic information to facilitate integration and interoperability between biological databases.
4: Facilitate the use of genomic and genetic data, information, and tools for germplasm improvement, thus empowering ARS scientists and partners to use a new generation of computational tools and resources.

We propose to leverage emerging and standard computational and experimental approaches, building on existing and newly developed resources to support stewardship of plant genome reference sequences, genome annotations and gene networks. This will support development of a common standard platform for comparative genomic analysis and visualization. The enriched genome annotations will include controlled vocabularies to describe metadata and primary data associated with comparative phylogenomics, epigenetics, and population-based phenotypes. The proposed research in gene networks is directed at the development and validation of gene regulatory networks (GRNs). The network view of the underlying molecular processes will enhance the fundamental biological understanding of development and abiotic stress responses and their relationship to agronomic traits. The computationally predicted and experimentally verified sub-networks, combined with the prioritized regulatory gene targets, will provide focal points for further research at the gene-by-gene level. They will be integrated with the suite of genetic resources obtained from Objective 1, including SNPs and orthology mapping, and thus will be a resource for breeders and researchers engaged in molecular breeding approaches and segregation analysis. Genome-wide network reconstructions will help quantify and characterize genotype-to-phenotype relationships. We propose to leverage and build upon existing infrastructure to manage and analyze plant genomic, genetic, and phenotypic data. The resources will focus on the delivery of anticipated products from Objectives 1 and 2 with a focus on plant datasets, but much of the software will be species-agnostic, making the resources developed from the project usable by a broader audience working on other organisms, including animals, insects, and fish, that are relevant to agriculture, human health, and a sustainable environment.

Progress Report
Apply computational, genomic, genetic and systems biology approaches to develop new models for plant genome structure and organization that advance our understanding of plant evolution and diversity. In the last year, our primary focus has been to evaluate methodology to support the development of reference plant genomes utilizing existing and newly emerging sequencing technologies, and to use this understanding to better support the standardized representation of plant genomes in Gramene and Ensembl Plants. Both Ensembl Plants and Gramene serve as knowledge portals for crops and model plant species. Ensembl Plants is a web portal that integrates variant, functional, expression, marker and comparative data, while Gramene serves as a curated, online data source for comparative functional genomics in crops and model plant species. In the last year at Gramene, six new plant genomes were added: Brassica oleracea, Theobroma cacao, Oryza longistaminata, Oryza rufipogon, Leersia perrieri, and Ostreococcus lucimarinus. In addition, five existing genomes were updated: Hordeum vulgare, Medicago truncatula, Oryza barthii, Oryza meridionalis, and Triticum aestivum. Genetic variation data were also updated for five genomes: Zea mays, Sorghum bicolor, Triticum aestivum, Solanum lycopersicum, and Hordeum vulgare. We established a new collaboration with the European Variation Archive aimed at developing variation data and metadata submission standards to support easy access to and archiving of genetic variation data. Gramene announced four major releases that included software updates and primarily targeted the comparative analysis pipelines that support whole genome alignments and protein-based gene trees.

Sequencing technology has progressed rapidly, outpacing Moore's law in cost and speed over the last 7-8 years.
Consequently, the opportunities to utilize this technology to support new reference genomes and diversity profiling are constantly evolving. In the last year, our focus has been on using this technology to evaluate hybrid approaches to support reference genome assembly, improve transcript models, and survey a sorghum EMS population. Current commercial sequencing platforms generate two classes of reads: short reads (100-500 bp) and long reads (2-17 kb). While the longer reads provide more information, they are more error-prone and costlier to generate per bp, making them less viable in the near term. Generally, the size and complexity of the genome of interest determine which of the two sequencing approaches is technically and financially feasible. In the last year, we have been evaluating assemblies of rice and maize genomes using both sequencing approaches. One of the highlights of this effort has been the collaboration with Pacific Biosciences to support genome assembly and transcript profiling of the maize B73 line using the company's new sequencing chemistry. This newly released chemistry supports an average read length of 10-15 kb, with the longest reads greater than 40 kb. We have successfully generated 65X coverage of maize B73 and are currently evaluating software to support a reference de novo assembly. Preliminary de novo assemblies look promising, but further modification of the assemblers will be required to optimize the final product. We have also been evaluating the use of this technology to support transcript profiling (used to quantify gene expression), with a focus on full-length transcripts and multiple isoforms. Again working closely with Pacific Biosciences, we are evaluating protocols for multiplexing different plant tissues and for size selection, to optimize the number of genes and tissue isoforms sampled.
Preliminary analyses of six maize tissue types have increased the mean number of isoforms for a gene in maize from two to five. These longer isoforms will be used to support improved annotations for maize.

Analyze and develop genome level regulatory network models that focus on and integrate the processes underlying plant development and responses to environmental change. In the last year, work has focused on expanding the resources associated with miRNA and nitrogen use efficiency (NUE) networks. Specifically, we have expanded resources in Arabidopsis and maize used to measure binding of transcription factor proteins that turn on gene expression to the promoters of genes they activate. We have generated transcription profiling data sets in both maize and Arabidopsis. For the Arabidopsis miRNA network, the bulk of the effort has been in manuscript preparation and follow-up experiments. We anticipate submitting a manuscript in the fall of 2015. The current network consists of 5,376 transcription factor protein-DNA interactions, based on a screen of 180 promoters of miRNAs, their targets, and some highly connected transcription factors (TFs). We have updated the post-transcriptional network between miRNAs and their targets, and have used publicly available spatio-temporal transcriptomics data to assign functions to the TFs in the network. The existing network exhibits a high level of interconnection and hierarchical organization. Working with collaborators, we have begun to generate resources to support a regulatory network for NUE. The preliminary network consisted of more than 60 promoter screens. Using the Arabidopsis NUE network and integration of public expression data, we have generated a candidate list of genes for functional analyses. Moving from a model plant species, Arabidopsis, to a crop species, maize, we have generated projections of both networks.
We are in the process of validating the functions of 1,200 maize TFs; we have validated nearly 85% of these. As a proof of concept, we are initially focused on screening maize miRNA 399 and 398 promoters, as these are relatively small families. In the next year, we will continue to expand upon additional miRNA and NUE focused screens. In support of the maize NUE screens, we have designed and carried out a hydroponics-based transcript profiling (RNA-Seq) experiment with our collaborators for low vs. normal nitrogen, generating more than 30 different libraries. The sequencing data are currently under evaluation and are being used to identify the TFs to support the yeast one-hybrid (Y1H) library screening.

Collaborate, develop and implement new standards for the management and analysis of plant genomic, genetic and phenotypic information to facilitate integration and interoperability between biological databases. Breakthroughs in imaging and sequencing technologies have led to new opportunities to generate reference genome sequences for a majority of species. These have also resulted in massive challenges to manage, analyze, share and draw insights from the thousands of trillions of data points that are being generated. “Big Data” in biology will require a paradigm shift. Data is no longer sparse. Cultural changes will be required to shift resources from the generation of data to its management and sharing. With our colleagues, we are evaluating and developing national high performance computing resources to support the storage and analysis of plant genome and phenotype data. Such initiatives include collaborations with the DOE Systems Biology Knowledgebase (KBase) and the NSF iPlant Collaborative (iPlant). KBase is an open-source, open-architecture framework for reproducible and collaborative computational systems biology. One of the primary objectives of KBase is to enable more accurate models of dynamic cellular systems for plants and microbes.
Working closely with KBase in the last year, we developed functionality to build plant metabolic models based on sequences of transcripts annotated with metabolic functions from the PlantSEED project. Another ongoing collaboration with KBase focuses on building an RNA-Seq pipeline, studying the expression networks of poplar root, leaf and xylem in response to abiotic stresses (drought, heat, cold and high salinity), and attempting to discover potential candidate genes and their allelic variations in Populus trichocarpa var. Nisqually and in hybrid poplar varieties. iPlant offers an open-source, comprehensive and foundational infrastructure to support plant biology research. Working closely with iPlant in the last year, we updated models for association, ecology and evolution modeling, as well as workflows to support metabolic profiling. We also added variant calling functionality based on resequenced genomic data from soybean, rice and sorghum (from SoyKB, IRRI and ICRISAT, respectively). In addition, we continued to deliver webinars, workshops and training to support the scientific community via the Gramene, KBase and iPlant infrastructure.

Facilitate the use of genomic and genetic data, information, and tools for germplasm improvement, thus empowering ARS scientists and partners to use a new generation of computational tools and resources. In the last year, we participated in the ARS-led effort to develop the SciNet platform. Specifically, we have supported the review and implementation of the Science DMZ, the design and draft implementation of the high performance cluster, and four workshops to support science engagement within the agency.

1. De novo short read genome assembly of rice enables discovery of genes that exist in diverse varieties. The complex history of rice domestication gave rise to five subpopulations, each with specialized traits and cultivation geographies. Many large-scale structural differences among the genomes of different rice varieties have resulted in substantial variation in gene content. The full complement of genes that exist in diverse rice varieties is currently poorly understood. ARS researchers and collaborators in Ithaca, New York, produced new reference assemblies for two additional rice varieties belonging to the indica and aus subpopulations. Using Nipponbare (a japonica strain) as a positive control, scientists demonstrated that low-cost sequencing combined with advanced computational methods could yield gene-enriched de novo assemblies of other rice varieties with high accuracy and completeness. Examination of several known QTLs revealed differences in haplotype structure between the varieties. These differences correlated with known phenotypic differences that influence traits such as crop yield, submergence tolerance, phosphorus uptake, and hybrid sterility. Remarkably, each of the three varieties possesses several megabases of genomic sequence that are absent from the other two, and these unique sequences harbor hundreds of genes. This research can be applied to other high-value crops and is significant because it demonstrates the ability to generate low-cost reference genome assemblies for small to moderate sized plant genomes, and to use these data to capture information on novel gene space that harbors genes associated with agronomic traits.

2. Development of cyberinfrastructure for the agricultural life sciences. Our world is changing rapidly. The human population is increasing, while arable land and fisheries are decreasing and food cultivation is being diverted for fuel production. Climate instability and energy sustainability are impacting agricultural and ecological systems, while concomitant changes in land-use patterns affect global biodiversity. In order to successfully address these issues, we need to understand how the appearance, physiology and behavior of organisms are shaped by the interactions between their genetic makeup and the environment. Although these global challenges are sobering, the efforts to respond productively will lead to new, exciting science, provided that the computational infrastructure is in place to handle the necessary datasets, analyses, interpretation of results, and dissemination of knowledge. Advances in biological research technology have enabled scientists to amass unprecedented amounts of data, and many researchers find themselves drowning in this sea of data. ARS scientists in Ithaca, New York, are partners in the iPlant Collaborative, a large, 10-year national effort targeted at developing cyberinfrastructure that provides scientists and educators with ready access to needed software and analysis tools. This work is significant because the distributed systems give scientists democratic access to high performance computing by “bringing the infrastructure to the data”, which enables fast computing while reducing the bandwidth required to transfer large amounts of next-generation sequencing data. Now, for the first time, a scientist can assemble a rice genome and characterize its genes, all in the same day.

3. Investigating novel genetic diversity and functional characterization of sorghum populations. Genetic variation, whether due to natural or induced mutations, is the raw material for plant breeding. Recent studies show that genetic diversity within cultivated crop varieties is reduced due to artificial selection during plant breeding. Mutagenesis can generate novel variations, which in turn can be introduced into a breeding population. Ethyl methanesulfonate (EMS) is a chemical used to efficiently generate high-density mutations in genomes, which are conventionally identified by techniques that can detect single nucleotide mismatches. ARS scientists in Ithaca, New York, and Lubbock, Texas, sequenced 256 different mutant lines of sorghum to 16X coverage of the whole genome using short read sequencing approaches and discovered more than 1.8 million canonical G/C to A/T mutations, affecting more than 94% of the genes annotated in the sorghum genome. Based on comparisons to the diversity captured in the existing genotyped sorghum collection, greater than 96% of the induced mutations are novel. This work is significant because the results demonstrate that a collection of sequenced EMS mutant lines can be used efficiently to discover new traits and their underlying causal mutations. This germplasm, and the functional insights that can be derived from it, can be used to accelerate sorghum breeding as well as directed breeding of germplasm in other crop species.
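
The "canonical G/C to A/T" criterion in item 3 can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the pipeline used in the study: the function names and the toy variant list are hypothetical, and a real analysis would read variant calls from files and apply quality filters first.

```python
# Canonical EMS-induced changes: G->A on one strand, which reads as C->T
# on the complementary strand. Both count as G/C to A/T transitions.
CANONICAL_EMS = {("G", "A"), ("C", "T")}


def is_canonical_ems(ref: str, alt: str) -> bool:
    """Return True if a single-nucleotide change is a G/C to A/T transition."""
    return (ref.upper(), alt.upper()) in CANONICAL_EMS


def canonical_fraction(variants) -> float:
    """Fraction of (ref, alt) pairs that are canonical EMS transitions."""
    variants = list(variants)
    if not variants:
        return 0.0
    hits = sum(is_canonical_ems(ref, alt) for ref, alt in variants)
    return hits / len(variants)


# Hypothetical toy calls for illustration only; not data from the study.
example_calls = [("G", "A"), ("C", "T"), ("A", "G"), ("G", "T")]
print(canonical_fraction(example_calls))
```

A filter of this kind separates likely EMS-induced changes from background variation or sequencing errors before downstream analysis.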

Review Publications
Wang, L., Ware, D., Lushbough, C., Merchant, N., Stein, L. 2014. A genome-wide association study platform built on iPlant cyber-infrastructure. Concurrency and Computation: Practice and Experience. DOI: 10.1002/cpe.3236.
Kumari, S., Ware, D. 2013. Genome-wide computational prediction and analysis of core promoter elements across plant monocots and dicots. PLoS One. 8(10):e79011.
Zwickl, D.J., Stein, J.C., Wing, R.A., Ware, D., Sanderson, M.J. 2014. Disentangling methodological and biological sources of gene tree discordance on Oryza (Poaceae) chromosome 3. Systematic Biology. DOI: 10.1093/sysbio/syu027.
Noutsos, C., Perera, M., Nikolau, B.J., Seaver, S., Ware, D. 2015. Metabolomic profiling of the nectars of Aquilegia pubescens and A. canadensis. PLoS One. 10(5):e0124501.
Dharmawardhana, P., Ren, L., Amarasinghe, V., Monaco, M.K., Thomason, J., Ravenscroft, D., McCouch, S., Ware, D., Jaiswal, P. 2013. A genome scale metabolic network for rice and accompanying analysis of tryptophan, auxin and serotonin biosynthesis regulation under biotic stress. Rice. 6(1):1-15.
Schatz, M.C., Maron, L.G., Stein, J.C., Hernandez, W.A., Gurtowski, J., Biggers, E., Lee, H., Kramer, M., Antoniou, E., Ghiban, E., Wright, M.H., Chia, J., Ware, D., McCouch, S.R., McCombie, W.R. 2014. New whole genome de novo assemblies of three divergent strains of rice (O. sativa) documents novel gene space of aus and indica. Genome Biology. 15:506-521.
Luo, M., Gu, Y.Q., You, F., Deal, K., Ma, Y., Hu, Y., Huo, N., Wang, Y., Wang, J., Chen, S., Jorgensen, C., Zhang, Y., McGuire, P., Pasternak, S., Stein, J., Ware, D., Kramer, M., McCombie, W., Kianian, S., Martis, M., Mayer, K., Sehgal, K., Li, W., Gill, B., Bevan, M., Simkova, H., Dolezel, J., Weining, S., Lazo, G.R., Anderson, O.D., Dvorak, J. 2013. A 4-gigabase physical map unlocks the structure and evolution of the complex genome of Aegilops tauschii, the wheat D-genome progenitor. Proceedings of the National Academy of Sciences. 110(19):7940-7945.
Monaco, M.K., Stein, J., Naithani, S., Wei, S., Dharmawardhana, P., Kumari, S., Amarasinghe, V., Youens-Clark, K., Thomason, J., Preece, J., Pasternak, S., Olson, A., Jiao, Y., Lu, Z., Bolser, D., Kerhornou, A., Staines, D., Walts, B., Wu, G., D'Eustachio, P., Haw, R., Croft, D., Kersey, P., Stein, L., Jaiswal, P., Ware, D. 2014. Gramene 2013: Comparative plant genomics resources. Nucleic Acids Research. 42:D1193-D1199.
Kersey, P.J., Allen, J.E., Christensen, M., Davis, P., Falin, L.J., Grabmueller, C., Hughes, D.S., Humphrey, J., Kerhornou, A., Khobova, J., Langridge, N., McDowall, M.D., Maheswari, U., Maslen, G., Nuhn, M., Ong, C.K., Paulini, M., Pedro, H., Toneva, I., Tuli, M., Walts, B., Williams, G., Wilson, D., Youens-Clark, K., Monaco, M.K., Stein, J., Wei, X., Ware, D., Bolser, D.M., Howe, K.L., Kulesha, E. 2014. Ensembl Genomes 2013: scaling up access to genome-wide data. Nucleic Acids Research. 42:D546-D552.