Location: Subtropical Plant Pathology Research
Title: ChemFM as a scaling law guided foundation model pre-trained on informative chemicals
Author
CAI, FEIYAN - Clemson University
HANNA, KATELIN - Clemson University
TZENG, TZENG-RONG - Clemson University
Duan, Yong Ping
LIU, LIN - Georgia Institute Of Technology
PILLA, SRIKANTH - University Of Delaware
LI, GANG - Clemson University
LUO, FENG - Clemson University
Submitted to: Popular Publication
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/5/2025
Publication Date: 12/18/2025
Citation: Cai, F., Hanna, K., Tzeng, T.J., Duan, Y., Liu, L., Pilla, S., Li, G., Luo, F. ChemFM as a scaling law guided foundation model pre-trained on informative chemicals. Communications Chemistry. 9:3. 2025. https://doi.org/10.1038/s42004-025-01793-8.
DOI: https://doi.org/10.1038/s42004-025-01793-8
Interpretive Summary: Over the past decade, artificial intelligence has revolutionized research methodologies across scientific disciplines, including chemistry. One promising approach to addressing the challenges that AI currently faces is the development of foundation models. These models are pre-trained on large unannotated datasets, often using weakly supervised or unsupervised methods to extract complex, general-domain features, which allows them to be fine-tuned for various downstream tasks with minimal additional training. In this work, we introduced ChemFM, a 3-billion-parameter foundation model designed for chemicals that can be fine-tuned for various chemical design and property prediction tasks. Using causal language modeling, ChemFM was trained on SMILES strings from 178 million molecules in the UniChem database. ChemFM effectively learned SMILES syntax as well as the internal relationships between atoms and bonds within molecules, enabling its adaptation to various downstream tasks. We first validated ChemFM on 34 property prediction datasets spanning pharmaceutical, physicochemical, and bioactivity domains, showing consistent outperformance over existing approaches across all datasets. Moreover, ChemFM demonstrated superior performance in screening for potential antibiotics, highlighting its potential to advance real-world drug discovery. ChemFM also exhibited flexibility and versatility in conditional molecular generation tasks.
Unlike previous approaches that required training separate models for each condition or condition combination, ChemFM allowed the training of a single unified model capable of handling all variations of condition combinations. The unified model not only achieved strong generative performance but also enabled effective control and matching of flexibly specified conditions. Furthermore, we demonstrated that ChemFM can be seamlessly integrated with existing sequence editing-based methods for reaction prediction, resulting in state-of-the-art performance on 4 reaction prediction tasks, including both forward synthesis and retrosynthesis. ChemFM can be leveraged for diverse chemical research endeavors and may significantly advance chemistry research.
Technical Abstract: Artificial intelligence (AI) has significantly advanced computational chemistry research across various tasks. However, traditional AI methods often rely on task-specific model designs and training, which constrain both the scalability of model size and generalization across different tasks. Here, we introduce ChemFM, a large foundation model specifically developed for chemicals. ChemFM comprises 3 billion parameters and is pre-trained on 178 million molecules using self-supervised causal language modeling to extract generalizable molecular representations. This model can be adapted to diverse downstream chemical applications using either full-parameter or parameter-efficient fine-tuning methods. ChemFM consistently outperforms state-of-the-art task-specific AI models across all tested tasks. Notably, it achieves up to 67.48% performance improvement across 34 property prediction benchmarks, up to 33.80% reduction in mean average deviation between conditioned and actual properties of generated molecules in conditional molecular generation tasks, and up to 3.7% top-1 accuracy improvement across 4 reaction prediction datasets.
Moreover, ChemFM demonstrates superior performance in predicting antibiotic activity and cytotoxicity, highlighting its potential to advance the discovery of novel antibiotics. We anticipate that ChemFM will significantly advance chemistry research by providing a foundation model capable of effectively generalizing across a broad range of tasks with minimal additional training.
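As a rough illustration of what causal language modeling on SMILES strings means, the sketch below shows how one SMILES string could be turned into next-token training pairs. The regex tokenizer, the `<s>` and `</s>` markers, and the function names are assumptions made for this example only; they are not ChemFM's actual tokenizer or training code.

```python
import re

# Hypothetical tokenizer (an assumption for illustration, not ChemFM's):
# bracket atoms such as [NH4+] first, then two-letter halogens, then single characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def causal_lm_pairs(smiles, bos="<s>", eos="</s>"):
    """Build (inputs, targets) for next-token prediction.

    The target at position i is the token that follows the input at position i,
    so the model only ever conditions on tokens to its left.
    """
    tokens = [bos] + tokenize(smiles) + [eos]
    return tokens[:-1], tokens[1:]


# Acetyl chloride as a small example molecule.
inputs, targets = causal_lm_pairs("CC(=O)Cl")
for given, predicted in zip(inputs, targets):
    print(f"given {given!r:>6} predict {predicted!r}")
```

In self-supervised pre-training of this kind, no labels are needed beyond the molecule strings themselves, since each target is simply the input sequence shifted by one position.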
