Publication : USDA ARS

Publication #271297

Title: Really big data: Processing and analysis of large datasets

Author

	Cole, John
	NEWMAN, S - Genus
	FOERTTER, F - Genus
	AGUILAR, I - Collaborator
	COFFEY, M - Scottish Agricultural College

Submitted to: Journal of Animal Science
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/13/2011
Publication Date: 3/1/2012
Citation: Cole, J.B., Newman, S., Foertter, F., Aguilar, I., Coffey, M. 2012. Really big data: Processing and analysis of large datasets. Journal of Animal Science. 90(3):723-733.

Interpretive Summary: Modern animal breeding datasets are large and getting larger, due in part to the recent availability of high-density DNA marker and sequence data. High-performance computing methods are needed to efficiently store and analyze those data, and those methods will depend on sound software development practices. Storage requirements for genotypes are modest, but full-sequence data will require much more space. Files from genetic evaluations may consume a lot of space because results from multiple runs must be kept. There is interest in new health and management traits, and it will take many years to collect enough observations to produce accurate genetic evaluations. Analytical tools developed for large datasets may help identify unexpected relationships in the data, and improved visualization tools also will provide insights into the data. Genomic selection requires a lot of computing power, and recent work shows that single-step approaches have similar requirements to traditional methods. Processing time could be reduced using custom libraries and parallel computing. Large datasets also create challenges for the delivery of genetic evaluations, which must be overcome in a way that does not disrupt the transition from conventional to genomic evaluations. Processing time is important, especially as real-time systems for on-farm decision-making are developed. The ultimate value of these systems is to decrease time-to-results in research, increase accuracy in genomic evaluations, and accelerate rates of genetic improvement.

Technical Abstract: Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidly. There also is growing interest in the collection of new health and management traits, and it will take many years to collect enough measurements to produce accurate genetic evaluations. Tools developed for large datasets may help identify interesting relationships in the data, and improved tools for graphically presenting data will provide insights. Large datasets also create challenges for the delivery of genetic evaluations that must not be allowed to disrupt the transition from conventional to genomic evaluations. The ultimate value of these systems is to speed up research, increase the accuracy of genomic evaluations, and accelerate rates of genetic improvement.