
Research Project: Developing a Machine Learning Tool for Genomic Selection

Location: Genetic Improvement for Fruits & Vegetables Laboratory

Project Number: 8042-21000-023-011-S
Project Type: Non-Assistance Cooperative Agreement

Start Date: Sep 1, 2022
End Date: Aug 30, 2024

Objective:
Most modern plant breeding programs utilize marker-assisted selection (MAS). However, marker discovery in non-model crops is hampered by the lack of ideal populations and of complete phenotypic and genotypic data. These shortcomings can be partially addressed through the application of machine learning (ML). Our objectives are to: 1) develop an effective ML-based algorithm, with an emphasis on deep learning architectures, tailored for genomic selection; and 2) release a simple-to-implement tool for users who are familiar with plant breeding and MAS but lack the background to construct an ML-based pipeline from scratch.

Approach:
Preliminary research at ARS has shown that effective ML models for MAS can be built despite imperfections in the dataset. In addition, ARS preliminary work using a mixed population identified epistatic loci that contribute to field rot resistance in cranberry. These data, together with data from ongoing projects, will be used to develop machine learning algorithms that are superior to those currently available. Briefly, loci identification and marker selection will be performed through a machine learning approach, initially using random forest regression, followed by further analysis with more complex neural networks. Single-nucleotide polymorphisms (SNPs) from genotyping-by-sequencing (GBS) data aligned to our cranberry reference genome (487 Mb, 124 contigs, N50 15 Mb) will be used as features. Markers will be nominated by the importance of their contribution to the tree structure. When single markers are not sufficient to predict phenotypes, combinations of top-rated contributing markers will be selected based on their combined R² value.
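
The marker nomination steps described in the approach can be illustrated with a minimal sketch. It is not the project's pipeline: it assumes scikit-learn and NumPy, simulates SNP dosages (0/1/2) in place of real GBS data, and places two hypothetical causal loci at columns 3 and 7. Markers are ranked by random forest impurity-based importance, and pairs of top-ranked markers are then scored by the in-sample R² of an ordinary least-squares fit, which is one possible reading of "combined R² value". All function names and parameter values are illustrative.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def simulate_genotypes(n_samples=300, n_snps=30, seed=0):
    """Simulate SNP dosages (0/1/2) and a phenotype driven by two loci."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
    # Hypothetical causal loci 3 and 7: additive effects plus an
    # epistatic (interaction) term, with Gaussian noise on top.
    y = 1.5 * X[:, 3] + 1.5 * X[:, 7] + 1.0 * X[:, 3] * X[:, 7]
    y += rng.normal(0.0, 0.5, size=n_samples)
    return X, y


def rank_markers(X, y, n_top=5, seed=0):
    """Return indices of the n_top SNPs by random forest importance."""
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return [int(i) for i in order[:n_top]]


def best_combination(X, y, candidates, size=2):
    """Score every marker combination of the given size by in-sample R^2."""
    best_combo, best_r2 = None, -np.inf
    for combo in combinations(candidates, size):
        cols = list(combo)
        r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if r2 > best_r2:
            best_combo, best_r2 = combo, r2
    return best_combo, best_r2


X, y = simulate_genotypes()
top_markers = rank_markers(X, y)
combo, r2 = best_combination(X, y, top_markers)
print("top markers by importance:", top_markers)
print("best pair:", combo, "combined R^2:", round(r2, 3))
```

In-sample R² is only comparable across combinations of the same size, because adding markers never lowers it; with real data, a cross-validated score would likely be a safer basis for choosing among marker sets of different sizes.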