Title: Large-scale atlas of microarray data reveals biological landscape of gene expression in Arabidopsis

HE, FEI - Brookhaven National Laboratory
YOO, SHINJAE - Brookhaven National Laboratory
WANG, DAIFENG - Yale University
KUMARI, SUNITA - Cold Spring Harbor Laboratory
GERSTEIN, MARK - Yale University
Ware, Doreen
MASLOV, SERGEI - Brookhaven National Laboratory

Submitted to: Plant Journal
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 3/25/2016
Publication Date: 6/1/2016
Citation: He, F., Yoo, S., Wang, D., Kumari, S., Gerstein, M., Ware, D., Maslov, S. 2016. Large-scale atlas of microarray data reveals biological landscape of gene expression in Arabidopsis. Plant Journal. 86(6):472-480.

Interpretive Summary: The major contribution of this study is an integrated dataset of more than 6,000 expression profiling samples, for each of which metadata such as tissue type, growth condition, and developmental stage were manually curated. The key finding is that the tissue type of a given sample can be predicted from its transcriptomic data. This can help characterize samples of unknown origin and verify database annotations.

Technical Abstract: Transcriptome datasets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples have become routine in the plant research community, they are often hampered by missing metadata or by differences in annotation style between labs. In this study, we carefully selected and integrated 6,057 Arabidopsis microarray expression samples from 304 experiments deposited in NCBI GEO. Metadata such as tissue type, growth condition, and developmental stage were manually curated for each sample. We then studied the global expression landscape of the integrated dataset and found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even across different growth conditions or developmental stages. Root has the most distinct transcriptome relative to aerial tissues, but the transcriptome of cultured root is more similar to those of aerial tissues, as cultured root samples have lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted from its expression profile, opening the door to automatic metadata extraction and facilitating re-use of plant transcriptome data. As a proof of principle, we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified the accuracy of our predictions against the samples’ metadata provided by the authors.
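
The abstract does not name the classification method, so the following is only an illustrative sketch of how tissue type might be predicted from an expression profile. It assumes a nearest-centroid rule with Pearson correlation as the similarity measure. That choice, the function names (pearson, tissue_centroids, predict_tissue), and the four-gene toy values are all hypothetical and are not taken from the study.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def tissue_centroids(profiles, tissues):
    """Average the profiles of each curated tissue label, gene by gene."""
    grouped = defaultdict(list)
    for profile, tissue in zip(profiles, tissues):
        grouped[tissue].append(profile)
    return {
        tissue: [sum(gene_values) / len(gene_values) for gene_values in zip(*rows)]
        for tissue, rows in grouped.items()
    }


def predict_tissue(profile, centroids):
    """Assign the tissue whose centroid correlates best with the profile."""
    return max(centroids, key=lambda tissue: pearson(profile, centroids[tissue]))


# Toy, made-up values for four genes; real profiles span thousands of genes.
train_profiles = [
    [9.0, 2.0, 1.0, 5.0],
    [8.5, 2.5, 1.5, 5.5],
    [2.0, 9.0, 7.0, 5.0],
    [1.5, 8.0, 7.5, 4.5],
]
train_tissues = ["root", "root", "leaf", "leaf"]

centroids = tissue_centroids(train_profiles, train_tissues)
prediction = predict_tissue([8.0, 3.0, 2.0, 5.0], centroids)
print(prediction)
```

In this sketch the curated tissue labels play the role of training annotations, and an unlabeled sample is assigned to whichever tissue it most resembles. This mirrors the observation that samples of the same tissue are more similar to each other than to samples of other tissues, but it should not be read as the authors' actual pipeline.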