Publication #171046


Van Tassell, Curtis - Curt
Cregan, Perry

Submitted to: BMC Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 1/6/2006
Publication Date: 1/6/2006
Citation: Matukumalli, L.K., Grefenstette, J.J., Van Tassell, C.P., Choi, I., Cregan, P.B. 2006. Application of machine learning in SNP discovery. BMC Bioinformatics. 6(7):4.

Interpretive Summary: Single nucleotide polymorphisms (SNPs) are single-base differences observed in the DNA sequences of different individuals. SNP discovery supports genetic analysis of individual disease propensity (pharmacogenomics) and productivity improvement in plant and animal species. Current SNP prediction software is not reliable enough to be used unchecked, so each predicted SNP is usually verified manually by an expert, which is very expensive. To reduce the need for expert intervention, we have developed new software (SNP-PHAGE-ML) that uses machine learning to distinguish true polymorphisms from false predictions. This software will facilitate high-throughput SNP discovery. The open-source software developed to implement these techniques will be made publicly available.

Technical Abstract: Alongside whole-genome sequencing projects, major efforts are now being devoted to identifying sequence variations and haplotypes among different individuals or species. The output of computational tools that identify SNPs from sequence data must be annotated by an expert to reject false SNPs. A machine learning (ML) program that confirms polymorphisms can reduce this expert intervention, and with it cost and time. The PolyBayes program was used to analyze polymorphisms across several inbred soybean genotypes. Its prediction accuracy was only 50%, even for polymorphisms to which PolyBayes assigned a probability of 100%. We carefully selected a set of 10 parameters that can influence the expert's decision and used 2,417 expert-evaluated polymorphisms identified by PolyBayes (1,066 true, 1,351 false) to train the ML program C4.5. The prediction accuracy was 90.6%. We then optimized the parameters and re-evaluated the polymorphisms that the ML program had predicted incorrectly, which increased the prediction accuracy to 97.7%. The optimized parameters were tested on a larger data set of 17,590 expert-evaluated polymorphisms (2,445 true, 15,145 false); the average prediction accuracy in five-fold cross-validation was 97.3%. This program, together with a web interface for viewing sequence assemblies, was implemented as part of the SNP pipeline (SNP-PHAGE).
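
As a rough illustration of the five-fold cross-validation mentioned in the abstract, the Python sketch below splits a labelled data set into five folds, fits a model on four of them, scores it on the fifth, and averages the five accuracies. It is not the SNP-PHAGE code. The 10 parameters are not listed in this record, so the sketch uses only the label counts from the abstract (2,445 true, 15,145 false) and swaps in a majority-class baseline where the authors used C4.5. All function names (k_fold_indices, train_majority, cross_validate) are hypothetical.

```python
import random
from collections import Counter

# Label counts taken from the abstract's larger data set.
N_TRUE, N_FALSE = 2445, 15145
labels = [True] * N_TRUE + [False] * N_FALSE


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_majority(train_idx, labels):
    """Hypothetical stand-in for C4.5: remember the most common training label."""
    return Counter(labels[i] for i in train_idx).most_common(1)[0][0]


def cross_validate(labels, k=5, seed=0):
    """Return (mean accuracy, per-fold accuracies) over k held-out folds."""
    folds = k_fold_indices(len(labels), k, seed)
    per_fold = []
    for held_out in folds:
        held = set(held_out)
        # Train only on examples outside the held-out fold.
        train_idx = [i for i in range(len(labels)) if i not in held]
        predicted = train_majority(train_idx, labels)
        # Score the fitted model on the held-out fold alone.
        correct = sum(labels[i] == predicted for i in held_out)
        per_fold.append(correct / len(held_out))
    return sum(per_fold) / k, per_fold


mean_acc, per_fold = cross_validate(labels, k=5)
print(f"majority-class baseline, 5-fold mean accuracy: {mean_acc:.3f}")
```

Because 15,145 of the 17,590 polymorphisms are false, a model that always answers "false" already scores about 86% under this procedure, which gives a reference point for reading the reported 97.3%.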