Location: Plant, Soil and Nutrition Research
2010 Annual Report
1a. Objectives (from AD-416)
Biological research benefits extraordinarily from the integration of many different types of data both within and between species. The first specific objective of the proposal builds on existing and emerging data sets, providing resources to characterize, track and ultimately identify sequences associated with agronomically important traits. The second objective addresses infrastructure to manage, visualize and distribute complex datasets. The research makes use of four methodologies: data integration, software development, genome annotation, and evolutionary analysis. Throughout the proposal, the objectives build upon each other. Combined, they hold greater potential for providing a knowledge base for improving agricultural varieties.
1. Enhance our knowledge of plant genome structure, organization and evolution through computational and experimental approaches.
2. Develop and implement standards for plant genome databases. This includes development of vocabulary, methods, database structures and visualization software to facilitate data integration and interoperability.
1b. Approach (from AD-416)
We propose to leverage computational and experimental approaches, building on existing and newly developed resources to create standardized baseline comparative maps and genome annotations across plant genomes, with an emphasis on crop grasses and other agriculturally important species as well as model genomes. As part of this work we will leverage existing infrastructure and build upon it to deliver data management and visualization tools for sequence, map, diversity, and phenotype data sets.
3. Progress Report
Tools were developed to enhance the existing genome annotation infrastructure, and methodology was improved to identify sequence variation and regulatory sequences using the recently available short-read sequencing technology. Improvements in the computational resources included updates to existing software, evaluation of new software, new development, improvements to visualization and performance, review of existing standards and data sets, and the continuation of existing collaborations as well as the creation of new ones with other bioinformatics projects, including Ensembl, Plant Ensembl, GMOD, MaizeGDB, GrainGenes and iPlant.
Over the past year, our group has been working towards large-scale analysis of next-generation sequence reads. While one approach aligns individual reads to well-established reference sequences, the generation of de novo assemblies from the data can prove powerful for the identification of biological content and structural variation not captured by primary sequencing. This is particularly crucial in the complex genomes of plants, where a reference sequence is often not available. In this period, we have adopted ABySS as the assembly platform of choice, as it leverages distributed computing environments to parallelize the compute-intensive assembly algorithm. In maize, we are using this approach to identify novel sequences that harbor full-length cDNA sequences that were not mapped to the reference sequence (B73 RefGen_v2). Using multiple sets of evidence, such as the NAM-based genetic map (from the Maize Diversity Project) and orthology to rice and sorghum, we are presently developing a method to anchor such novel sequences within the reference sequence so as to provide context for the unmapped genes within the reference chromosomes.
We have been collaborating with scientists at Cold Spring Harbor Laboratory to generate a de novo assembly of Solanum pimpinellifolium, the wild progenitor of the domesticated Heinz variety of tomato (Solanum lycopersicum). We generated both a whole-genome assembly based on multiple insert libraries of the organism and a transcriptome assembly. The data is currently being analyzed for structural variation and gene characterization. We are using a similar approach to study the wild Muscadine grape (Vitis rotundifolia) and its relationship with the domesticated grape variety (Vitis vinifera), whose reference sequence and preliminary annotations are available. By aligning the novel contigs of the wild variety to the reference sequence, we can present a preliminary picture of the gene order in the genome.
In addition to the de novo assembly work, of note for this period is the enhancement of baseline annotations for the Maize Sequencing and Maize HapMap projects. The primary gene and variation annotations were used as a reference to develop a 100K array for expression profiling and a 55K genotyping platform for maize. These resources serve as tools to understand the genomic contribution to phenotypic variation, which in turn can be used to inform future breeding strategies.
1. Characterization of the Maize B73 genome. Many plant genomes are polyploid or have experienced an ancestral polyploidy event. In the last year, through analysis of the reference B73 genome, an ancient tetraploid, ARS scientists at the Robert W. Holley Center for Agriculture & Health in Ithaca, NY were able to construct maps representing the two ancestral genomes that contributed to the ancestral polyploidy event. The maps suggest that in the reduction to the current diploid state, there was a preferential loss of one of the parental genomes over the other. When genes were lost, they were more likely to be lost as groups of genes than as single genes. This data suggests that there are active mechanisms by which a plant retains memory of the genetic history of a parent through generations and that it can continue to selectively retain one parent of origin over another. In the analysis, we found several examples where two copies of a gene were retained, one from each parent. Genes retained in two copies showed some bias toward regulatory function, including transcription factors, kinases and phosphatases. Non-coding microRNA genes were also more likely to be retained. MicroRNA genes are known to function as post-transcriptional regulators of gene transcripts. The retention of regulatory genes supports their role as major players in plant adaptation. This work was done in collaboration with scientists at Cold Spring Harbor Laboratory, Washington University, and the University of Arizona.
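The two observations above (one ancestral genome losing more genes than the other, and losses occurring in groups rather than singly) can be illustrated with a minimal Python sketch. It is not the method used in the project: the function name `retention_summary`, the presence/absence encoding, and the toy data are all assumptions made for illustration.

```python
from itertools import groupby

def retention_summary(presence):
    """Summarize gene retention along one ancestral subgenome.

    `presence` is a list of booleans in ancestral gene order (hypothetical
    encoding): True if the gene is still present, False if it was lost.
    """
    retained = sum(presence)
    # Lengths of runs of consecutive lost genes (False values).
    loss_runs = [len(list(run)) for kept, run in groupby(presence) if not kept]
    return {
        "retained_fraction": retained / len(presence),
        "loss_runs": loss_runs,
        # Number of loss events that removed more than one adjacent gene.
        "multi_gene_losses": sum(1 for n in loss_runs if n > 1),
    }

# Toy, made-up data: ten ancestral gene positions in each subgenome.
subgenome_1 = [True, True, True, False, True, True, True, True, False, True]
subgenome_2 = [True, False, False, False, True, True, False, False, True, True]

s1 = retention_summary(subgenome_1)
s2 = retention_summary(subgenome_2)

# Positions where both copies survive (retained duplicate pairs).
both_retained = sum(a and b for a, b in zip(subgenome_1, subgenome_2))

print(s1)
print(s2)
print("retained in both:", both_retained)
```

In this toy input the second subgenome keeps fewer genes and loses them in multi-gene runs, which is the kind of pattern the report describes. A real analysis would need orthology-based maps and a statistical test, neither of which is shown here.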
2. The development of a Maize HapMap. The Maize HapMap project aims to construct a high-resolution, integrated single nucleotide polymorphism (SNP), insertion-deletion (indel) and copy-number variation (CNV) map of Zea mays. Such a map will facilitate association mapping of complex traits in maize and, in doing so, accelerate breeding efforts aimed at agricultural sustainability. The approach taken by ARS scientists at the Robert W. Holley Center for Agriculture & Health in Ithaca, NY to create this variation map was to use a next-generation sequencing platform developed by Illumina, Inc. to sequence, at a high depth of coverage, the genomes of carefully selected maize inbred lines. Phase 1 of the project, completed in 2009, focused on the parental lines of the maize nested association mapping (NAM) population. Through a collaborative effort, more than 3 million variations were identified between the B73 accession and 25 additional accessions of maize that are part of the NAM panel. The variations were used to identify genetic variation for marker-assisted breeding strategies and directly contributed to the development of the 55K Illumina genotyping array. The variations were also used to identify ~130 regions in the maize genome that have been under selection and are likely important for agriculture. In addition, when the patterns of variation were compared with genomic and genetic positions, the information suggested that ~15% of the annotated genes were in regions of lower recombination rates. Along with signatures of residual heterozygosity, this suggests there may still be regions of the maize germplasm whose existing genetic variation is underutilized. The basic research described in this project provides baseline resources for translational research, including information on the function of a gene locus that can be used for candidate gene selection and the identification of variation to be used for marker-assisted breeding.
3. Digital gene expression signatures for maize development. The ability to profile quantitative changes in expression for all genes simultaneously can help us understand genes that work together to coordinate plant development. Comparing expression between a normal plant and one that carries a change or mutation in a single gene allows ARS scientists at the Robert W. Holley Center for Agriculture & Health in Ithaca, NY to identify the differences that result from that single change. In this work, we developed and tested a framework for analysis of digital gene expression (DGE) profiles using ultra-high-throughput sequencing technology and the newly assembled B73 maize reference genome. For proof of concept, we used this technique to identify differences in gene expression profiles in immature maize (Zea mays) ears between the wild type and the mutant RAMOSA (RA), which harbors a defective gene that affects the developmental fate of axillary meristems and, consequently, alters branching patterns. Genetic control of branching, especially in ears where kernels are borne, has clear relevance to crop improvement with respect to seed number and harvesting ability. Overall, 86% of short-read sequences were anchored to the maize genome sequence and 37,117 known genes were identified, 66% of which were detected above our threshold for statistical testing. We used comparative genomics to leverage existing information from Arabidopsis and rice in functional analyses of differentially expressed maize genes. Results from this study provide a basis for analysis of short-read expression data in maize and resolved specific expression signatures that will help define mechanisms of action for the RA3 gene.
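As a rough illustration of the counting step behind these figures, the following Python sketch tallies anchored reads per gene and keeps genes that meet a minimum read count. The function name, the `min_reads` value, and the toy read assignments are hypothetical; the report does not state the actual threshold or the software used. The last line simply works out what 66% of 37,117 genes comes to (about 24,500, given that the percentage is rounded).

```python
from collections import Counter

def genes_above_threshold(gene_hits, min_reads):
    """Count reads per gene and keep genes with at least `min_reads` reads."""
    counts = Counter(gene_hits)
    return {gene: n for gene, n in counts.items() if n >= min_reads}

# Toy, made-up read assignments: None means the read did not anchor.
read_assignments = ["geneA", "geneA", "geneB", None,
                    "geneA", "geneC", "geneB", None]

anchored = [g for g in read_assignments if g is not None]
anchored_fraction = len(anchored) / len(read_assignments)

# Hypothetical threshold of 2 reads, chosen only for this example.
testable = genes_above_threshold(anchored, min_reads=2)

# Figures from the report: 66% of the 37,117 identified genes.
approx_testable_genes = round(0.66 * 37117)

print(anchored_fraction, testable, approx_testable_genes)
```

A real DGE analysis would also normalize for library size and apply a statistical test for differential expression between wild type and mutant; those steps are outside this sketch.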
4. Characterization of a regulatory network in the root stele. In multicellular organisms like plants, expression of genes is under complex regulation that controls when, where and how much of the gene is expressed. A great deal of the control is due to two classes of genes: transcription factors (TFs) and microRNAs (miRNAs). Transcription factors are protein-coding genes that bind to the promoters of genes to promote or repress transcription in a spatially restricted manner. MicroRNAs are a class of non-coding genes that can further refine the spatial expression of these transcription factors. In the last year, ARS scientists at the Robert W. Holley Center for Agriculture & Health in Ithaca, NY have used a combination of experimental approaches to identify, within specific cell types, the interactions between transcription factors and miRNAs, and so establish a gene regulatory network (GRN) of transcription factor and miRNA expression in roots. This network consists of 103 interactions between 64 TFs and 8 miRNAs and is the largest network of its type in plants. Our data provides an understanding of the regulatory complexity in the root stele. This information can be used by breeders to inform the selection of candidate genes involved in root development, as well as the choice of promoters (native rather than artificial) to alter gene expression levels in the root.
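A network of this kind can be represented as a directed graph whose nodes are typed as TF or miRNA. The Python sketch below shows one possible data structure; the class, method names, and node names (`TF_1`, `miR_a`, and so on) are placeholders invented for illustration and do not correspond to genes or software from the project.

```python
from collections import defaultdict

class RegulatoryNetwork:
    """Directed graph of regulator -> target interactions (illustrative only)."""

    def __init__(self):
        self.node_type = {}              # node name -> "TF" or "miRNA"
        self.targets = defaultdict(set)  # regulator -> set of target names

    def add_node(self, name, kind):
        if kind not in ("TF", "miRNA"):
            raise ValueError(f"unknown node type: {kind}")
        self.node_type[name] = kind

    def add_interaction(self, regulator, target):
        # Both ends must be declared first so that node counts stay accurate.
        for name in (regulator, target):
            if name not in self.node_type:
                raise KeyError(f"undeclared node: {name}")
        self.targets[regulator].add(target)

    def interaction_count(self):
        return sum(len(t) for t in self.targets.values())

    def count_nodes(self, kind):
        return sum(1 for k in self.node_type.values() if k == kind)

# Toy network with placeholder names.
grn = RegulatoryNetwork()
grn.add_node("TF_1", "TF")
grn.add_node("TF_2", "TF")
grn.add_node("miR_a", "miRNA")
grn.add_interaction("TF_1", "TF_2")   # TF binding another TF's promoter
grn.add_interaction("TF_1", "miR_a")  # TF binding a miRNA gene's promoter
grn.add_interaction("miR_a", "TF_2")  # miRNA refining TF expression

print(grn.interaction_count(), grn.count_nodes("TF"), grn.count_nodes("miRNA"))
```

Loaded with the real data, the same three counts would be expected to read 103 interactions, 64 TFs and 8 miRNAs, as stated above.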
Nelson, R., Avraham, S., Shoemaker, R.C., May, G., Ware, D., Gessler, D.D. 2009. Applications and Methods Utilizing the Simple Semantic Web Architecture and Protocol (SSWAP) for Bioinformatics Resource Discovery and Disparate Data and Service Integration. BioMed Central (BMC) BioData Mining. 10:309.