Lincare, John - Winstep Software Technologies
Submitted to: Communications in Biometry and Crop Science (CBCS)
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/19/2010
Publication Date: 7/15/2010
Citation: Simko, I., Lincare, J.M. 2010. Combining partially ranked data in plant breeding and biology: II. Analysis with Rasch model. Communications in Biometry and Crop Science (CBCS). 2010, Vol 5, Pages 56-65.
Interpretive Summary: In genetics, biology, and breeding, it is often useful to combine observations from numerous experiments into a single dataset. For example, in association mapping studies, the phenotypic data used for analysis are often collected from experiments conducted over several years or obtained from germplasm databases. However, combining data from different years, locations, laboratories, and databases is challenging because not all of the independent variables (e.g., plant accessions) are common to all experiments. Moreover, laboratories often use their own rating scales, and ratings on different scales cannot be combined by standard statistical approaches. This situation creates a need for methodologies that can combine datasets with only partial overlap and dissimilar rating scales. To combine data from dissimilar rating scales into a single aggregated rating, the absolute values from each test can be replaced with relative rankings. If there are two or more rankings of the same elements, there may be enough information to construct interval measures of the distances between elements. We illustrate the application of the method with four examples based on real datasets.
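The rank-replacement step described in the summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the experiment names, accession names, scores, and the helper `rank_within` are all hypothetical, and ties are given the average rank, a convention the summary does not specify. The subsequent Rasch analysis is not shown.

```python
# Hypothetical ratings: two experiments on different scales with partial overlap.
# All names and values are invented for illustration.
experiments = {
    "trial_A": {"acc1": 7.0, "acc2": 3.0, "acc3": 5.0},     # e.g. a 1-9 scale
    "trial_B": {"acc2": 40.0, "acc3": 65.0, "acc4": 90.0},  # e.g. a percent scale
}


def rank_within(scores):
    """Rank items within one experiment (1 = lowest score).

    Tied items share the average of the ranks they span (an assumed convention).
    """
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to the end of the run of items tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1
    return ranks


# Each experiment is ranked separately, so the original scales no longer matter.
ranked = {name: rank_within(scores) for name, scores in experiments.items()}

for name, ranks in ranked.items():
    print(name, ranks)
```

In this toy data, acc2 and acc3 appear in both trials, and it is this kind of overlap that lets separate rankings be linked into a common ordering.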
Technical Abstract: Many years of breeding experiments, germplasm screening, and molecular biology experimentation have generated volumes of sequence, genotype, and phenotype information that have been stored in public data repositories. These resources afford genetic and genomic researchers the opportunity to handle and analyze raw data from multiple laboratories and study groups whose research interests revolve around a common or closely related trait. However, although such data sets are widely available for secondary analysis, their heterogeneous nature often precludes their direct combination and joint exploration. Integration of phenotype information across multiple studies and databases is challenging due to variations in the measurement instruments, endpoint classifications, and biological material employed by each investigator. In the present work, we demonstrate how the Rasch measurement model can surmount these problems. The model accommodates data sets with partially overlapping variables, large numbers of missing data points, and dissimilar ratings of phenotypic endpoints. The model also makes it possible to quantify the extent of heterogeneity between data sets. Biologists can use the model in a data-mining process to obtain combined ratings from various databases and other sources. Subsequently, these ratings can be used for selecting desirable material or (in combination with genotypic information) for mapping genes involved in a particular trait. The model is not limited to genetics and breeding and can be applied in many other areas of biology and agriculture.
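For orientation, the simplest (dichotomous) form of the Rasch model is given below. The abstract does not say which variant of the model the authors apply to ranked data, so this is background only; the symbols are the conventional ones and are not taken from the paper: \theta_n is the measure of element n, \delta_i is the difficulty of item i, and X_{ni} is the observed response.

```latex
P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
\qquad\Longleftrightarrow\qquad
\ln\frac{P(X_{ni} = 1)}{P(X_{ni} = 0)} = \theta_n - \delta_i
```

Because the log-odds depend only on the difference \theta_n - \delta_i, the estimated measures lie on an interval (logit) scale.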