Submitted to: Communications in Biometry and Crop Science (CBCS)
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/19/2010
Publication Date: 7/15/2010
Citation: Simko, I., Pechenick, D.A. 2010. Combining partially ranked data in plant breeding and biology: I. Rank aggregating methods. Communications in Biometry and Crop Science (CBCS), volume 5, pages 41-55.
Interpretive Summary: In biological research, data from multiple studies, experiments, or trials are frequently combined prior to statistical analysis. However, combining absolute values by standard statistical approaches is not always possible or desirable. In such cases, the absolute values can be replaced with relative ranks, and the ranks combined into a single aggregate ranking. Rank aggregation techniques range from simple methods based on averages to complex methods that employ advanced computational approaches. In the present work we concentrate on methods that allow partially ranked data from multiple studies to be aggregated. This type of data arises in genetics, breeding, pathology, ecology, and other areas of the life sciences, where different studies use different subsets of individuals (e.g., plant cultivars). In plant breeding, for example, subsets of cultivars are evaluated annually in multiple locations. However, sources of variability (biological, analytical, and others) lead to measurements that are not always comparable across studies. We show how different methods for the aggregate ranking of partially ranked data can be applied to combine heterogeneous data.
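As an illustration of the simplest kind of technique mentioned in the summary, one based on averages, the following minimal Python sketch aggregates partial rankings by averaging each cultivar's scaled rank over the studies in which it appears. It is not the procedure evaluated in the paper; the function name `aggregate_by_mean_rank`, the scaling of ranks to the interval [0, 1], and the toy cultivar labels are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_by_mean_rank(studies):
    """Aggregate partial rankings by the mean of scaled ranks.

    `studies` is a list of rankings; each ranking lists a subset of
    cultivars ordered from best to worst. (Hypothetical helper, not
    taken from the paper.)
    """
    scores = defaultdict(list)
    for ranking in studies:
        n = len(ranking)
        for position, cultivar in enumerate(ranking):
            # Scale the rank to [0, 1] so that studies with different
            # numbers of cultivars contribute on a comparable scale.
            scaled = position / (n - 1) if n > 1 else 0.5
            scores[cultivar].append(scaled)
    mean_score = {c: sum(v) / len(v) for c, v in scores.items()}
    # Lower mean score means a better aggregate rank; ties break by name.
    return sorted(mean_score, key=lambda c: (mean_score[c], c))


# Toy data: three studies, each ranking a different subset of cultivars.
studies = [
    ["A", "B", "C"],
    ["B", "D"],
    ["A", "C", "D"],
]
aggregate = aggregate_by_mean_rank(studies)
print(aggregate)
```

Averaging is fast and easy to explain, but it treats every study as equally informative and ignores which cultivars were actually compared with one another.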
Technical Abstract: Combining heterogeneous data from plant breeding trials into a single dataset can be challenging, especially if observations have been made only on partially overlapping sets of accessions, or if evaluations were done with different rating scales. In the present work we propose combining such data by means of aggregate ranking approaches. To test the performance of 13 aggregate ranking methods, we simulated 16 types of datasets that resemble those observed in plant breeding trials. The aggregate ranking methods were evaluated using both distance-based measures (Kendall’s tau and Spearman’s rho) and the number of rank violations caused by a proposed aggregate ranking. Our analysis indicates that methods based on Bradley-Terry or Rasch models performed better than the other tested methods when factors such as fitness of aggregate rankings, time required for analyses, and ability to analyze weak rankings were considered. Verification of the approach on real data from 19 studies indicated a substantial increase in significance (the P-value dropped by a factor of 100,000) when linkage between a marker and a trait was based on aggregated data rather than on each of the individual trials. The ability to combine heterogeneous data from independent studies has important ramifications for data analysis in association studies. Results from our study indicate that this kind of meta-analysis is more powerful than individual analyses.
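To make the two ideas named in the abstract concrete, a Bradley-Terry based aggregate ranking and the count of rank violations, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used in the study: each partial ranking is broken into pairwise "wins", strengths are fitted with the standard minorization-maximization update, and a small pseudo-count `prior` (one fractional win and one fractional loss against a virtual opponent of strength 1) is added so that cultivars that never win or never lose still get finite strengths. All function names, the `prior` value, and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_wins(studies):
    """Count how often each cultivar is ranked above another."""
    wins = defaultdict(int)
    for ranking in studies:
        # combinations() preserves order, so `better` precedes `worse`.
        for better, worse in combinations(ranking, 2):
            wins[(better, worse)] += 1
    return wins


def fit_bradley_terry(studies, iterations=200, prior=0.1):
    """Fit Bradley-Terry strengths with MM updates (illustrative sketch)."""
    wins = pairwise_wins(studies)
    items = sorted({c for ranking in studies for c in ranking})
    strength = {c: 1.0 for c in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Numerator: total wins of i plus the pseudo-win.
            total_wins = prior
            # Denominator: comparisons weighted by current strengths,
            # including two pseudo-comparisons with the virtual opponent.
            denom = 2 * prior / (strength[i] + 1.0)
            for j in items:
                if i == j:
                    continue
                n_ij = wins[(i, j)] + wins[(j, i)]
                total_wins += wins[(i, j)]
                denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        strength = updated
    return strength


def count_rank_violations(aggregate, studies):
    """Count study-level pairs that the aggregate ranking reverses."""
    position = {c: k for k, c in enumerate(aggregate)}
    return sum(
        1
        for ranking in studies
        for better, worse in combinations(ranking, 2)
        if position[better] > position[worse]
    )


# Toy data: partially overlapping studies with one conflicting pair (B, C).
studies = [
    ["A", "B", "C"],
    ["B", "D"],
    ["A", "C", "D"],
    ["C", "B"],
]
strength = fit_bradley_terry(studies)
aggregate = sorted(strength, key=lambda c: (-strength[c], c))
violations = count_rank_violations(aggregate, studies)
print(aggregate, violations)
```

Because studies 1 and 4 disagree about B and C, no aggregate ranking can satisfy every study here, so at least one rank violation is unavoidable in this toy example.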