Publication : USDA ARS

Title: An algorithm for deciding the number of clusters and validating using simulated data with application to exploring crop population structure

Author

	NEWELL, MARK - Samuel Roberts Noble Foundation, Inc
	COOK, DIANNE - Iowa State University
	HOFMANN, HEIKE - Iowa State University
	Jannink, Jean-Luc

Submitted to: Annals of Applied Statistics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/20/2013
Publication Date: 12/23/2013
Citation: Newell, M., Cook, D., Hofmann, H., Jannink, J. 2013. An algorithm for deciding the number of clusters and validating using simulated data with application to exploring crop population structure. Annals of Applied Statistics. 7:1898-1916.

Interpretive Summary: Populations of individuals often have a group structure in which individuals within a group are more closely related to each other than they are to individuals outside the group. A first step in exploring population structure in crop plants and other organisms is to define the number of groups that exist for a given population. The genetic marker data sets being generated have become increasingly large over time and commonly have more markers than individuals. This manuscript proposes an algorithm for deciding the number of groups and validates the algorithm on simulated data sets varying in both the number of groups and their divergence. The algorithm was then tested on six empirical data sets across three small grain species. Validation on simulated sets coupled with testing on empirical sets suggests that the algorithm can be used for a wide variety of genetic data sets.

Technical Abstract: A first step in exploring population structure in crop plants and other organisms is to define the number of subpopulations that exist for a given data set. The genetic marker data sets being generated have become increasingly large over time and commonly fall into the high-dimension, low sample size (HDLSS) setting. An algorithm for deciding the number of clusters is proposed and validated on simulated data sets that vary in both the level of structure and the number of clusters, covering the range of variation observed empirically. The algorithm was then tested on six empirical data sets across three small grain species. The algorithm uses bootstrapping and three methods of clustering, and defines the optimum number of clusters based on a common criterion, Hubert’s gamma statistic. Validation on simulated sets coupled with testing on empirical sets suggests that the algorithm can be used for a wide variety of genetic data sets.
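
The general idea in the abstract (bootstrap the data, cluster each resample with several methods, and score every candidate number of clusters with Hubert's gamma) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published algorithm: the abstract does not name the three clustering methods, so single, complete, and average linkage stand in for them here; the bootstrap is assumed to resample individuals; and the function names (`hubert_gamma`, `agglomerate`, `choose_k`) are hypothetical. Hubert's gamma is computed in its normalized form, the Pearson correlation between pairwise distances and a 0/1 indicator of whether a pair falls in different clusters.

```python
import math
import random
from itertools import combinations


def hubert_gamma(points, labels):
    """Normalized Hubert's gamma: Pearson correlation between pairwise
    distances and a 0/1 flag marking pairs split across clusters."""
    dists, split = [], []
    for i, j in combinations(range(len(points)), 2):
        dists.append(math.dist(points[i], points[j]))
        split.append(0.0 if labels[i] == labels[j] else 1.0)
    n = len(dists)
    mean_d, mean_s = sum(dists) / n, sum(split) / n
    cov = sum((d - mean_d) * (s - mean_s) for d, s in zip(dists, split))
    var_d = sum((d - mean_d) ** 2 for d in dists)
    var_s = sum((s - mean_s) ** 2 for s in split)
    if var_d == 0 or var_s == 0:
        return 0.0  # correlation undefined (e.g. a single cluster)
    return cov / math.sqrt(var_d * var_s)


def agglomerate(points, k, linkage):
    """Naive agglomerative clustering: merge the closest pair of clusters
    (as judged by `linkage`) until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pair_d = [math.dist(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
            d = linkage(pair_d)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


# ASSUMPTION: stand-ins for the three (unnamed) clustering methods.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda d: sum(d) / len(d),
}


def choose_k(points, k_range, n_boot=20, seed=0):
    """Pick the k whose gamma, summed over bootstrap resamples and
    clustering methods, is largest."""
    rng = random.Random(seed)
    totals = {k: 0.0 for k in k_range}
    for _ in range(n_boot):
        # ASSUMPTION: resample individuals (rows) with replacement.
        sample = [rng.choice(points) for _ in points]
        for k in totals:
            for linkage in LINKAGES.values():
                labels = agglomerate(sample, k, linkage)
                totals[k] += hubert_gamma(sample, labels)
    return max(totals, key=totals.get)
```

A call such as `choose_k(marker_scores, range(2, 8))` would return the candidate number of clusters with the highest aggregate gamma, where `marker_scores` is a hypothetical list of numeric tuples, one per individual. The naive clustering here scales poorly, so for realistic marker data a library implementation would be the more practical choice.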

Plant, Soil and Nutrition Research: Ithaca, NY