Research Project: PCHI: Improving Protein Digestibility in Pulses with Higher-throughput Assays and Sequence-based Prediction

Location: Potato, Pulse and Small Grains Quality Research

Project Number: 3060-21650-002-074-S
Project Type: Non-Assistance Cooperative Agreement

Start Date: Sep 1, 2025
End Date: Dec 31, 2026

Objective:
1) Develop and deploy higher-throughput assays for pulse protein digestibility in chickpea and lima bean breeding pipelines and in a multi-parental cowpea population.
2) Profile proteomes and amino acids in cooked seed and liquid digesta to evaluate amino acid bioaccessibility and create a reference set of protein digestibility scores in key reference genome lines.
3) Use AI-based protein models to develop sequence-based prediction of digestible protein content and identify genetic factors underlying variation in percent digestibility.

Approach:
Overall, this project aims to enhance the sustainability of the global food supply through optimized production of pulses (via more efficient allocation of protein resources towards digestible nutrition at a given level of agronomic input and yield) and to promote increased consumption of pulses through enhanced functionality of whole pulses. We will characterize protein digestibility across key chickpea, lima bean, and cowpea lines by adapting a 96-well plate-based pepsin digestion method for use in pulses. We will use sample sets from chickpea, lima bean, and cowpea field trials conducted over the past three years in Davis: approximately 80 lima bean and 30 chickpea breeding lines, and 168 lines from a cowpea multi-parent advanced-generation intercross (MAGIC) population. Based on existing grain macronutrient and genotypic data (Fig. 1), we will select 30 lines from each of the three crops, chosen to represent genetic and phenotypic variation, and evaluate them for protein digestibility. We will then select the crop with the greatest additive genetic variance for percent protein digestibility, as measured by the assay described above, for more detailed follow-up analyses. We will compare the sensitivity of the four methods (Diatta-Holgate, Osborne, static, and dynamic) in capturing genotypic variation and rankings. Proteomic profiling of cooked seed (before digestion) and of liquid digesta after dynamic digestion will then be conducted for key chickpea, lima bean, and cowpea reference genome lines. For the AI-based protein models, we will train a cross-genera machine learning (ML) model to predict the digestibility of individual seed storage proteins (SSPs) from sequence data using the protein language model ESM3. To do so, we will input the translated amino acid sequences into ESM3 and extract an embedding for each protein; these embeddings are latent representations of protein features. We will then split the in vitro digestibility scores (scaled from 0 to 1) from the proteomics data in Objective 2 into training (80%) and test (20%) sets and train an ML model to predict in vitro digestibility from the ESM3 embeddings. We plan to test both classical ML models (e.g., Random Forest and XGBoost) and deep learning models (e.g., convolutional neural networks) to identify the model with the highest predictive ability, assessed via the correlation between predicted and measured digestibility in the test set. We will use the models’ feature importance scores to infer sequences with particular biological relevance to digestibility. Finally, we will test the model’s applicability to breeding pipelines by predicting the digestible protein content of diverse genotypes.
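
The evaluation step described above (an 80/20 split followed by the correlation between predicted and measured digestibility in the test set) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the embeddings and digestibility scores are synthetic random numbers rather than ESM3 output or project data, the k-nearest-neighbor regressor is only a dependency-free placeholder for the Random Forest, XGBoost, or neural network models named in the approach, and the function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and return (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def knn_predict(train_x, train_y, query, k=5):
    """Placeholder model: mean score of the k nearest training embeddings."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def run_demo(n_proteins=200, n_dims=4, seed=42):
    """Build synthetic data, split 80/20, and score the placeholder model."""
    rng = random.Random(seed)
    # Synthetic stand-ins for per-protein embeddings (NOT real ESM3 output).
    embeddings = [[rng.gauss(0, 1) for _ in range(n_dims)] for _ in range(n_proteins)]
    # Synthetic 0-1 "digestibility" scores: a logistic function of two
    # embedding dimensions plus noise, so that there is signal to learn.
    scores = [
        1 / (1 + math.exp(-(1.5 * e[0] - 1.0 * e[1] + rng.gauss(0, 0.3))))
        for e in embeddings
    ]

    train_idx, test_idx = split_indices(n_proteins, test_fraction=0.2, seed=seed)
    train_x = [embeddings[i] for i in train_idx]
    train_y = [scores[i] for i in train_idx]

    predicted = [knn_predict(train_x, train_y, embeddings[i]) for i in test_idx]
    measured = [scores[i] for i in test_idx]
    return train_idx, test_idx, pearson(predicted, measured)


train_idx, test_idx, r = run_demo()
print(f"train={len(train_idx)} test={len(test_idx)} test-set Pearson r={r:.3f}")
```

Holding out the test set before any model fitting keeps the reported correlation an estimate of performance on proteins the model has not seen, so candidate models can be compared on that one statistic.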