CodonBERT large language model for mRNA vaccines [METHODS]

Sizhen Li1,4, Saeed Moayedpour1,4, Ruijiang Li1, Michael Bailey1, Saleh Riahi1, Lorenzo Kogler-Anele1, Milad Miladi2, Jacob Miner2, Fabien Pertuy2, Dinghai Zheng2, Jun Wang2, Akshay Balsubramani2, Khang Tran2, Minnie Zacharia2, Monica Wu2, Xiaobo Gu2, Ryan Clinton2, Carla Asquith2, Joseph Skaleski2, Lianne Boeglin2, Sudha Chivukula2, Anusha Dias2, Tod Strugnell2, Fernando Ulloa Montoya3, Vikram Agarwal2, Ziv Bar-Joseph1 and Sven Jager1 1Digital R&D, Sanofi, Cambridge, Massachusetts 02141, USA; 2mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA; 3mRNA Center of Excellence, Sanofi, 69280 Marcy L'Etoile, France

4 These authors contributed equally to this work.

Corresponding authors: zivbj@cs.cmu.edu, sven.jager@sanofi.com

Abstract

mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormous number of mRNAs. The actual mRNA sequence can have a large impact on several properties, including expression, stability, and immunogenicity. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts. CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.
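The codon-level input described above can be illustrated with a minimal tokenization sketch. This is not code from the CodonBERT implementation; the function names and vocabulary construction are illustrative assumptions, showing only how a coding sequence is split into non-overlapping 3-mer (codon) tokens and mapped to integer ids for a model.

```python
# Minimal sketch of codon-level tokenization (illustrative; not the
# CodonBERT codebase). Assumes the input is a coding sequence whose
# length is a multiple of 3.

def codon_tokenize(seq: str) -> list[str]:
    """Split a coding sequence into codon tokens (non-overlapping 3-mers)."""
    seq = seq.upper().replace("U", "T")  # normalize RNA to a DNA alphabet
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [seq[i : i + 3] for i in range(0, len(seq), 3)]

# A codon vocabulary has 4**3 = 64 tokens; a real model would add
# special tokens ([CLS], [MASK], etc.) on top of these.
VOCAB = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(seq: str) -> list[int]:
    """Map codon tokens to integer ids suitable as model input."""
    return [TOKEN_TO_ID[tok] for tok in codon_tokenize(seq)]

# Example: the start codon AUG followed by an alanine codon GCU
ids = encode("AUGGCU")  # two tokens, "ATG" and "GCT"
```

Tokenizing at the codon level keeps the reading frame explicit, so synonymous substitutions (different codons for the same amino acid) remain distinguishable to the model, unlike single-nucleotide tokenization.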

Received December 15, 2023. Accepted June 25, 2024.
