The 2024 Nobel Prize in Chemistry was awarded in part to Demis Hassabis and John Jumper of Google DeepMind for developing AlphaFold2, which addressed a decades-long challenge in biology: predicting the three-dimensional structure of a protein from its amino acid sequence.
In the post-AlphaFold era, the next pivotal frontier in protein science is protein function. Proteins must exhibit high activity, selectivity, and stability to become viable commercial products, yet predicting protein function remains a formidable challenge. An often-cited empirical observation is that mutating even 1% of a protein's sequence carries roughly a 95% chance of abolishing its activity. Interestingly, AlphaFold2 often predicts little structural change for such mutations, underscoring that structure is necessary but not sufficient for function.
To address this challenge, Professor Liang Hong from Shanghai Jiao Tong University led a collaborative team comprising researchers from the Natural Sciences Research Institute, School of Physics and Astronomy, School of Pharmacy, Zhangjiang Institute for Advanced Study, School of Life Science and Technology, Shanghai AI Laboratory, East China University of Science and Technology, and ShanghaiTech University. Over several years, the team focused on data collection, cleaning, labeling, and AI model exploration to develop the Pro-series, a general AI framework for protein design. Recently, their work, titled "A General Temperature-Guided Language Model to Design Proteins of Enhanced Stability and Activity," was published in Science Advances.
Through wet-lab experiments, the Pro-PRIME model (Protein language model for Intelligent Masked Pretraining and Environment prediction) demonstrated exceptional performance. For five target proteins (Figure 1), the model identified beneficial single-point mutations at a hit rate above 30% among its top-45 zero-shot predictions, roughly ten times higher than traditional high-throughput random screening. These optimizations spanned various functions, including catalytic activity, thermal stability, extreme pH tolerance, and the ability to synthesize non-natural products. Additionally, using a few-shot fine-tuning approach, the model required fewer than 100 wet-lab samples and only 2-4 iterative design cycles to generate highly functional protein mutants. For example, a T7 RNA polymerase mutant exhibited a 12.8°C increase in melting temperature (Tm) and a nearly fourfold increase in activity compared to the wild type. Remarkably, some designed mutants outperformed commercial products that have dominated the market for over a decade.
Figure 1. Wet-lab results for Pro-PRIME across five proteins. The top three proteins underwent single-point mutations, while Cas12a and T7 RNA polymerase achieved 10-15 multi-point mutations within four dry-wet experimental iterations.
Pro-PRIME predicts protein mutant performance without relying on prior experimental data. It leverages a "temperature-aware" language model trained on a dataset of 96 million protein sequences annotated with temperature information. The training process combines a token-level masked language modeling (MLM) objective with a sequence-level optimal growth temperature (OGT) prediction task. A multi-task learning framework introduces a correlation loss term to align the token-level and sequence-level tasks, enabling the model to better capture temperature-related features. This methodology naturally biases Pro-PRIME toward assigning higher scores to sequences with enhanced thermal stability and biological activity.
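The multi-task objective described above can be sketched as follows. This is a minimal numpy illustration, assuming an MSE regression head for OGT and a Pearson-correlation alignment term; the exact form of Pro-PRIME's correlation loss and the task weights are not given in the article, so those details are assumptions.

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation of two 1-D arrays (batch statistics).
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() /
                 (np.sqrt((xc ** 2).sum() * (yc ** 2).sum()) + 1e-12))

def multitask_loss(seq_mlm_losses, ogt_pred, ogt_true, lam=0.1, mu=0.1):
    """Combined pretraining loss over a batch of sequences.

    seq_mlm_losses: (B,) per-sequence masked-LM cross-entropy
    ogt_pred, ogt_true: (B,) predicted and labeled growth temperatures
    lam, mu: task weights (hypothetical values, not from the paper)
    """
    mlm = seq_mlm_losses.mean()                 # token-level MLM task
    ogt = np.mean((ogt_pred - ogt_true) ** 2)   # sequence-level OGT task
    # Correlation term: push per-sequence likelihood (negative MLM loss)
    # to co-vary with the temperature label, coupling the two tasks.
    corr = pearson(-seq_mlm_losses, ogt_true)
    return mlm + lam * ogt + mu * (1.0 - corr)
```

When per-sequence likelihood already tracks the temperature labels, the correlation term vanishes and only the two base losses remain, which is the alignment behavior the framework is designed to reward.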
In practice, Pro-PRIME employs zero-shot prediction to test a small set of single-point mutations, followed by iterative supervised learning based on experimental data to predict multi-point mutations. Within four wet-lab iterations and with fewer than 100 experimental tests, Pro-PRIME successfully designed several high-performing protein mutants.
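In code, those two stages might look like the sketch below: a masked-marginal log-odds score for single-point mutations, and a deliberately simple additive least-squares fit standing in for the supervised fine-tuning step. The real model learns higher-order (epistatic) effects; the additive form, the function names, and the mutation-label encoding here are assumptions for illustration.

```python
import numpy as np

def zero_shot_score(log_probs, pos, wt_aa, mut_aa, vocab):
    """Masked-marginal score for a single-point mutation:
    log P(mutant residue) - log P(wild-type residue) at the masked
    position. A standard language-model scoring rule, used here as a
    stand-in for Pro-PRIME's zero-shot predictor."""
    return log_probs[pos, vocab[mut_aa]] - log_probs[pos, vocab[wt_aa]]

def fit_effects(measured):
    """Fit one additive effect per single mutation by least squares.
    measured: {frozenset of mutation labels: wet-lab fitness value}."""
    muts = sorted({m for k in measured for m in k})
    X = np.array([[1.0 if m in k else 0.0 for m in muts] for k in measured])
    y = np.array(list(measured.values()))
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(muts, w))

def rank_candidates(effects, candidates, top_k=3):
    # Score each multi-point candidate as the sum of learned effects.
    scored = [(sum(effects.get(m, 0.0) for m in c), c) for c in candidates]
    scored.sort(key=lambda t: -t[0])
    return scored[:top_k]
```

In each dry-wet iteration, the top-ranked candidates would be synthesized and measured, and the new labels fed back into `fit_effects` before the next round of proposals.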
Figure 2. Pro-PRIME's pretraining approach, zero-shot prediction method, and dry-wet iterative strategy.
In evaluations against public mutation databases (ProteinGym and ΔTm) across 283 proteins, Pro-PRIME demonstrated superior predictive performance compared to state-of-the-art models. It also excelled in predicting wild-type protein melting temperatures (Tm) and optimal enzymatic reaction temperatures (Topt).
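Benchmarks such as ProteinGym typically report rank correlation between model scores and measured fitness. A minimal Spearman implementation is sketched below (no tie handling, an assumed simplification; published benchmarks use tie-aware ranking).

```python
import numpy as np

def spearman(pred, true):
    """Spearman rank correlation between predicted scores and measured
    values. Assumes all values are distinct (no ties)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a), dtype=float)
        return r
    x, y = ranks(np.asarray(pred)), ranks(np.asarray(true))
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() /
                 np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```

A perfectly monotone relationship between predictions and measurements yields 1.0, regardless of the scale of either quantity, which is why rank correlation is the standard metric for zero-shot fitness prediction.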
Wet-lab validations involved five proteins: LbCas12a, T7 RNA polymerase, creatinase, an artificial nucleic acid polymerase, and a specific nanobody heavy chain variable region. Among the top 30-45 AI-predicted single-point mutations tested, over 30% showed significant improvements in key properties, such as thermal stability, enzymatic activity, antigen-antibody binding affinity, non-natural nucleic acid polymerization, or tolerance to extreme alkaline conditions. For some proteins, the positive mutation rate exceeded 50%. Furthermore, Pro-PRIME demonstrated the ability to combine neutral single-point mutations into beneficial multi-point mutations, highlighting its capacity to learn higher-order mutational effects, a long-standing difficulty for traditional protein engineering.
These results underscore Pro-PRIME's broad applicability in protein engineering.
Pro-PRIME introduces a novel approach to designing protein mutants, significantly improving efficiency and accuracy without relying on extensive experimental data. By reducing dependence on experimental screening, Pro-PRIME not only increases success rates in mutant design but also addresses engineering challenges that traditional methods cannot solve. The model’s ability to predict multiple attributes of proteins provides researchers with a valuable tool, even in unfamiliar protein domains.
This technology has broad applications in industrial and pharmaceutical fields, particularly in scenarios requiring proteins to withstand extreme temperatures or environmental conditions. With continued development, Pro-PRIME is poised to revolutionize protein engineering by lowering costs, accelerating product development, and expanding the field's boundaries.
Moreover, Pro-PRIME's correlation-based multi-task pretraining framework offers a blueprint for integrating biophysical priors into future large-model pretraining efforts.
In summary, Pro-PRIME combines deep learning and large-scale data to provide an efficient and practical pathway for protein engineering. It enhances success rates for stability and activity design, improves experimental efficiency in resource-constrained settings, and paves the way for breakthroughs in scientific research and industrial applications.
This study was led by Professor Liang Hong (Natural Sciences Research Institute, School of Physics and Astronomy, and Zhangjiang Institute for Advanced Study), Pan Tan (Shanghai AI Laboratory), Jia Liu (ShanghaiTech University), and Jie Song (Hangzhou Institute for Advanced Studies, CAS) as corresponding authors. Co-first authors include Fan Jiang (Ph.D. student, Shanghai Jiao Tong University), Mingchen Li (intern, Shanghai AI Laboratory), Jiajun Dong (ShanghaiTech University), Yuanxi Yu and Banghao Wu (Shanghai Jiao Tong University), and Xinyu Sun (University of Science and Technology of China). The study was supported by the National Natural Science Foundation of China (12104295), Shanghai Municipal Science and Technology Commission (23JS1400600), Shanghai Jiao Tong University Innovation Fund (21X010200843), Chongqing Major Science and Technology Innovation Project (CSTB2022TIAD-STX0017), Shanghai AI Laboratory, and the High-Performance Computing and Student Innovation Centers of Shanghai Jiao Tong University.