Location: Crop Improvement and Genetics Research
Title: Structural variability of Pfam domains based on AlphaFold2 predictions
Author:
Poretsky, Elly - Oak Ridge Institute For Science And Education (ORISE)
Andorf, Carson
Sen, Taner
Submitted to: Proteins: Structure, Function, and Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 7/3/2025
Publication Date: 7/22/2025
Citation: Poretsky, E., Andorf, C.M., Sen, T.Z. 2025. Structural variability of Pfam domains based on AlphaFold2 predictions. Proteins: Structure, Function, and Bioinformatics. https://doi.org/10.1002/prot.70021.
DOI: https://doi.org/10.1002/prot.70021
Interpretive Summary: Functional genomics aims to understand the biological roles of genes and gene products (especially proteins) in order to control and manipulate biological processes and enhance desirable traits. These traits include improved abiotic and biotic stress resistance in humans, animals, plants, and microbes. Protein domains are the functional building blocks of proteins and can be used to predict protein function. Experimentally solved protein structures, along with genome-scale, high-quality protein structure predictions, have enabled the extensive detection of structural domains used for protein function prediction. This work focuses on predicted domains to analyze the structural diversity of protein structures and to measure the inaccuracies associated with predictions. The study shows that protein domains possess inherent structural variability, often with different secondary structure compositions, even within the same protein families. More accurate protein domain prediction workflows are therefore needed.
Technical Abstract: Understanding the biological functions of genes and gene products, especially proteins, is one of the main goals of functional genomics. Such understanding will help control and manipulate biological processes to enhance desirable traits, including improved abiotic and biotic stress resistance in humans, animals, plants, and microbes. Protein domains, considered to be the functional building blocks of proteins, have been used extensively to predict protein function.
Experimentally solved protein structures, together with genome-scale, high-quality protein structure predictions from methods such as AlphaFold2, have enabled the extensive detection of structural domains used for protein function prediction. Despite this, sequence-based approaches for protein function prediction, including protein domain prediction with resources such as the Pfam database, remain popular due to their reliability, low cost, and ease of use. Although the sequence variability of Pfam domains has been reported in several studies, the structural variability of Pfam domains has been understudied. In this work, we focus on predicted Pfam domains, in the context of AlphaFold2-predicted structures, to analyze the structural diversity of Pfam domains and the possible implications for imprecisions that may lead to inaccurate functional predictions. For this study, we extracted the structural portion corresponding to each Pfam domain from the predicted structures of the 16 model organism proteomes in the AlphaFold2 database. An analysis of the distribution of assigned secondary structures within Pfam domain families revealed that, while the majority of Pfam domain families were assigned secondary structures, many families contained between 20% and 40% Pfam domain structures with no assigned secondary structure, with some cases reaching up to 100% of Pfam family members. This suggested that a substantial number of predicted Pfam domains are structurally variable. To better understand the structural variability within Pfam domain families, we used FoldSeek to cluster the individual Pfam domain structures. A comparison of the FoldSeek cluster representatives revealed that most cluster representatives are structurally similar, with an average TM-score of 0.6 between cluster representatives.
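
The secondary-structure tally described above might be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes each extracted domain already has a DSSP-style per-residue string, that `-`, a space, and `C` mark residues with no assigned secondary structure, and that a domain counts as unstructured only when every residue is unassigned. The family labels `FAM_A` and `FAM_B` and the function names are hypothetical.

```python
from collections import defaultdict

# Assumption: these codes mean "no secondary structure assigned" for a residue.
UNASSIGNED = set("- C")


def has_no_secondary_structure(ss_string):
    """Return True when every residue in the domain is unassigned."""
    return all(code in UNASSIGNED for code in ss_string)


def unstructured_fraction_by_family(domains):
    """Map each family to the fraction of its domains lacking secondary structure.

    domains: iterable of (family_id, ss_string) pairs.
    """
    totals = defaultdict(int)
    unstructured = defaultdict(int)
    for family_id, ss_string in domains:
        totals[family_id] += 1
        if has_no_secondary_structure(ss_string):
            unstructured[family_id] += 1
    return {fam: unstructured[fam] / totals[fam] for fam in totals}


# Toy input with hypothetical family labels and made-up strings.
example = [
    ("FAM_A", "--HHHHHH--"),
    ("FAM_A", "----------"),
    ("FAM_A", "-EEEE--EEEE-"),
    ("FAM_A", "--HHHH----"),
    ("FAM_B", "------"),
]
fractions = unstructured_fraction_by_family(example)
```

In this toy input one of four `FAM_A` domains and the single `FAM_B` domain are fully unassigned, which mirrors the 20% to 40% and 100% cases mentioned in the abstract only in spirit.
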
To better capture the structural diversity, we applied agglomerative clustering to merge the FoldSeek cluster representatives, which effectively reduced the average TM-score between cluster representatives to 0.6. We then provided several case studies showing that Pfam domain clustering can facilitate the detection of structural variability in Pfam domain families, in cases where families are divided among different predominant structural clusters or contain structural outliers. The accuracy of domain predictions and their associated functions is an ongoing challenge in the field of protein bioinformatics. In this study, we used two popular prediction applications/resources, AlphaFold2 and Pfam, to demonstrate the inherent imprecision of protein domain predictions by comparing their predicted structures. Our study shows that even within the same Pfam families, Pfam domains possess inherent structural variability, often with different secondary structure compositions. Our study demonstrates the need to develop more accurate protein domain prediction workflows.
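
The merging of cluster representatives by agglomerative clustering can be illustrated with a minimal sketch. The abstract does not state the linkage criterion or cutoff, so average linkage and a TM-score threshold of 0.5 are assumptions here, as are the symmetric score matrix, the toy values, and the function name `agglomerative_merge`.

```python
def agglomerative_merge(tm, threshold=0.5):
    """Average-linkage agglomerative merging on a pairwise TM-score matrix.

    tm: square, symmetric list of lists; tm[i][j] is the TM-score between
        representatives i and j (symmetry is an assumption of this sketch).
    threshold: merging stops once the most similar pair of clusters has a
        mean TM-score below this value (assumed cutoff).
    Returns a list of clusters, each a list of representative indices.
    """
    clusters = [[i] for i in range(len(tm))]

    def mean_tm(a, b):
        # Mean TM-score over all cross-cluster pairs (average linkage).
        return sum(tm[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = mean_tm(clusters[x], clusters[y])
                if score > best_score:
                    best_score, best_pair = score, (x, y)
        if best_score < threshold:
            break  # remaining clusters are structurally distinct
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy matrix: representatives 0 and 1 are similar, as are 2 and 3.
tm_scores = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.8],
    [0.2, 0.3, 0.8, 1.0],
]
merged = agglomerative_merge(tm_scores, threshold=0.5)
```

Because similar representatives are merged first, the clusters that remain are the dissimilar ones, so the average TM-score between the surviving groups is lower than between the original representatives.
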
