Title: Random forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle
Submitted to: Journal of Dairy Science
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: June 20, 2013
Publication Date: August 12, 2013
Citation: Yao, C., Spurlock, D., Armentano, L., Page, D., Vandehaar, M., Bickhart, D.M., Weigel, K. 2013. Random forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle. Journal of Dairy Science. 96(10):6716-6729.
Interpretive Summary: Feed efficiency is economically important in the beef and dairy cattle industries. To study feed efficiency on a genomic basis, we used residual feed intake as the phenotype and the Random Forests algorithm to estimate SNP effects. We report the top 25 pairwise epistatic interactions between SNPs and the top 188 SNPs ranked by combined additive and epistatic effects. When mapped to Bos taurus assembly UMD 3.1, a significantly higher percentage of the top 188 SNPs than of all genome-wide markers fell within residual feed intake QTL reported in beef cattle, and 74 of the 188 SNPs mapped to 68 annotated genes.
Technical Abstract: Feed efficiency is an economically important trait in the beef and dairy cattle industries. Residual feed intake (RFI) is a measure of partial efficiency that is independent of production level per unit of body weight. The objective of this study was to identify significant associations between single nucleotide polymorphism (SNP) markers and RFI in dairy cattle using the Random Forests (RF) algorithm. Genomic data included 42,275 SNP genotypes for 395 Holstein cows, whereas phenotypic measurements were daily RFI from 50 to 150 days postpartum. Residual feed intake was defined as the difference between an animal’s feed intake and the average intake of its cohort, after adjustment for year and season of calving, year and season of measurement, age at calving nested within parity, days in milk, milk yield, body weight, and body weight change. Random Forests is a widely used machine learning algorithm that has been applied to classification and regression problems. By analyzing the tree structures produced within RF, the 25 most frequent pairwise SNP interactions were reported as possible epistatic interactions. The importance scores generated by RF take into account both the main effects of variables and the interactions between variables, and the most negative value of all importance scores can be used as the cutoff level for declaring SNP effects significant. When SNPs were ranked by importance score, 188 surpassed the threshold; of these, 38 mapped to RFI QTL regions reported in a previous study in beef cattle, and 2 were also detected by a genome-wide association study in beef cattle. The proportion of SNPs located in RFI QTL was significantly higher among the top 188 SNPs chosen by RF than among all 42,275 genome-wide markers. Pathway analysis indicated that many of the top 188 SNPs are in genomic regions that contain annotated genes with biological functions that may influence RFI.
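The importance-score cutoff described above can be illustrated with a minimal sketch. It assumes the cutoff is the magnitude of the most negative importance score, a common reading of this heuristic that the abstract does not spell out. The function name, the SNP names, and the scores are hypothetical toy values, not data from the study.

```python
def select_by_negative_cutoff(importance):
    """Return (cutoff, SNPs above the cutoff, sorted by decreasing importance).

    Assumption: the cutoff is the absolute value of the most negative
    importance score; negative scores are treated as noise around zero.
    """
    most_negative = min(importance.values())
    # Fallback (an assumption of this sketch): if no score is negative, use 0.
    cutoff = abs(most_negative) if most_negative < 0 else 0.0
    selected = sorted(
        (snp for snp, score in importance.items() if score > cutoff),
        key=lambda snp: importance[snp],
        reverse=True,
    )
    return cutoff, selected


# Hypothetical importance scores for six SNPs.
toy_scores = {
    "snp_a": 0.90,
    "snp_b": 0.12,
    "snp_c": -0.20,
    "snp_d": 0.35,
    "snp_e": -0.05,
    "snp_f": 0.20,
}
cutoff, selected = select_by_negative_cutoff(toy_scores)
```

In this toy case the most negative score is -0.20, so only SNPs scoring strictly above 0.20 are retained.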
Frequently occurring ancestor-descendant SNP pairs can be explored as possible epistatic effects for further study. The importance scores generated by RF can be used effectively to identify SNPs with large additive or epistatic effects and informative QTL. The consistency between the results of our study and those of previous studies in beef cattle indicates that the genetic architecture of RFI in dairy cattle might be similar to that of beef cattle.
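Counting ancestor-descendant SNP pairs across the trees of a forest might look roughly like the sketch below. The tree representation (nested tuples of split SNP, left subtree, right subtree, with None for a leaf), the function names, and the choice to count each pair at most once per tree are all assumptions made for illustration. The abstract does not describe the study's actual procedure at this level of detail.

```python
from collections import Counter


def ancestor_descendant_pairs(tree, ancestors=()):
    """Yield (ancestor, descendant) SNP pairs along every root-to-leaf path.

    A tree is a hypothetical nested tuple (snp, left, right); None is a leaf.
    """
    if tree is None:
        return
    snp, left, right = tree
    for ancestor in ancestors:
        if ancestor != snp:  # skip a SNP paired with itself
            yield (ancestor, snp)
    for child in (left, right):
        yield from ancestor_descendant_pairs(child, ancestors + (snp,))


def most_frequent_pairs(forest, top_n):
    """Count each pair once per tree and return the top_n most frequent."""
    counts = Counter()
    for tree in forest:
        counts.update(set(ancestor_descendant_pairs(tree)))
    return counts.most_common(top_n)


# Toy forest of three small trees (hypothetical SNP names).
toy_forest = [
    ("snp1", ("snp2", None, None), ("snp3", None, None)),
    ("snp1", ("snp2", ("snp3", None, None), None), None),
    ("snp1", ("snp2", None, None), None),
]
top_pairs = most_frequent_pairs(toy_forest, top_n=2)
```

Pairs that recur across many trees would be candidates for follow-up, not confirmed interactions.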