AI Model Optimizes Protein Drug Production in Yeast, Could Cut Development Costs

MIT chemical engineers developed a large language model that optimizes codon sequences for protein production in industrial yeast, boosting production efficiency for six proteins, including human growth hormone and a monoclonal antibody used to treat cancer.

MIT chemical engineers have used artificial intelligence to streamline the development of new protein manufacturing processes, an advance that could reduce the overall cost of developing and manufacturing protein drugs. The study appears this week in the Proceedings of the National Academy of Sciences.

Industrial yeasts are a powerhouse of protein production, used to manufacture vaccines, biopharmaceuticals, and other useful compounds. Using a large language model (LLM), the MIT team analyzed the genetic code of the industrial yeast Komagataella phaffii — specifically, the codons that it uses. There are multiple possible codons, or three-letter DNA sequences, that can be used to encode a particular amino acid, and the patterns of codon usage are different for every organism.

The new MIT model learned those patterns for K. phaffii and then used them to predict which codons would work best for manufacturing a given protein. This allowed the researchers to boost the efficiency of the yeast's production of six different proteins, including human growth hormone and a monoclonal antibody used to treat cancer.

"Having predictive tools that consistently work well is really important to help shorten the time from having an idea to getting it into production. Taking away uncertainty ultimately saves time and money," says a senior author of the study, the Raymond A. and Helen E. St. Laurent Professor of Chemical Engineering at MIT, a member of the Koch Institute for Integrative Cancer Research, and faculty co-director of the MIT Initiative for New Manufacturing (MIT INM).

Yeasts such as K. phaffii and Saccharomyces cerevisiae (baker's yeast) are the workhorses of the biopharmaceutical industry, producing billions of dollars of protein drugs and vaccines every year. To engineer yeast for industrial protein production, researchers take a gene from another organism, such as the insulin gene, and modify it so that the microbe will produce the encoded protein in large quantities. This requires coming up with an optimal DNA sequence for the yeast cells, integrating it into the yeast's genome, devising favorable growth conditions for it, and finally purifying the end product.

For new biologic drugs — large, complex drugs produced by living organisms — this development process might account for 15 to 20 percent of the overall cost of commercializing the drug. "Today, those steps are all done by very laborious experimental tasks," the senior author says. "We have been looking at the question of where could we take some of the concepts that are emerging in machine learning and apply them to make different aspects of the process more reliable and simpler to predict."

In this study, the researchers set out to optimize the sequence of DNA codons that make up the gene for a protein of interest. There are 20 naturally occurring amino acids but 64 possible codons, 61 of which encode amino acids and three of which signal a stop, so most amino acids can be encoded by more than one codon. Each codon is read by a transfer RNA (tRNA) molecule, which carries the corresponding amino acid to the ribosome, where amino acids are strung together into proteins.
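The redundancy described above can be made concrete with a short Python sketch. It is illustrative only and not taken from the study: it assumes the standard genetic code (NCBI translation table 1), and the names `CODON_TABLE` and `synonymous_codons` are chosen here for the example.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), written compactly:
# codons are enumerated in T, C, A, G order and "*" marks a stop codon.
# Assumption: the standard code is used purely for illustration.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous codons under the amino acid they encode (stops excluded).
synonymous_codons = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        synonymous_codons[aa].append(codon)

# One line per amino acid: one-letter code, number of codons, the codons.
for aa in sorted(synonymous_codons):
    print(aa, len(synonymous_codons[aa]), " ".join(synonymous_codons[aa]))
```

In this table only methionine (M) and tryptophan (W) have a single codon, while arginine, leucine, and serine have six each, which is why a single protein can be written as a very large number of different DNA sequences.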

Different organisms use each of these codons at different rates, and designers of engineered proteins often optimize the production of their proteins by choosing the codons that occur the most frequently in the host organism. However, this doesn't necessarily produce the best results. If the same codon is always used to encode arginine, for example, the cell may run low on the tRNA molecules that correspond to that codon.
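A minimal sketch of that conventional "most frequent codon" strategy shows why it can over-use a single codon. This is not the MIT model or any commercial tool. The two host sequences are made-up toy data, the function names are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumed for illustration); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_counts(coding_sequences):
    """Count how often each codon appears across in-frame coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def most_frequent_codon_per_aa(counts):
    """For each amino acid, pick the codon seen most often in the host."""
    best = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return best


def naive_optimize(protein, best):
    """Encode a protein by always using the host's most frequent codon."""
    return "".join(best[aa] for aa in protein)


# Toy "host genes" (made-up data): AGA encodes arginine three times, CGT once.
host_genes = ["ATGAGAAGAGCTTAA", "ATGAGACGTGCTTAA"]
best = most_frequent_codon_per_aa(codon_counts(host_genes))

# Both arginines (R) in the target receive the same codon, AGA.
print(naive_optimize("MRRA", best))  # ATGAGAAGAGCT
```

Because the lookup is made one amino acid at a time, every arginine in the target protein is assigned the same codon regardless of its neighbors, which is the kind of repetition that could strain the supply of the matching tRNA.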

To take a more nuanced approach, the MIT team deployed a type of large language model known as an encoder-decoder. Instead of analyzing text, the researchers used it to analyze DNA sequences and learn the relationships between codons that are used in specific genes. Their training data, which came from a publicly available dataset from the National Center for Biotechnology Information, consisted of the amino acid sequences and corresponding DNA sequences for all of the approximately 5,000 proteins naturally produced by K. phaffii.
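One way to picture such training data is as paired token sequences: amino acids as the model's input and the codons the yeast actually uses as the target. The sketch below is an assumption about how a pair might be prepared. The article does not describe the team's tokenization or preprocessing, the function name `to_training_pair` is hypothetical, and the example gene is made up.

```python
from itertools import product

# Standard genetic code (assumed for illustration); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(t): aa for t, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def to_training_pair(cds):
    """Split a coding sequence into (amino-acid tokens, codon tokens).

    The amino acids would serve as the encoder input and the native codons
    as the decoder target. A trailing stop codon, if present, is dropped.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons and CODON_TABLE[codons[-1]] == "*":
        codons = codons[:-1]
    amino_acids = [CODON_TABLE[c] for c in codons]
    return amino_acids, codons


# Made-up example gene: Met-Arg-Arg-Ala followed by a stop codon.
source_tokens, target_tokens = to_training_pair("ATGAGACGTGCTTAA")
print(source_tokens)  # ['M', 'R', 'R', 'A']
print(target_tokens)  # ['ATG', 'AGA', 'CGT', 'GCT']
```

Note that the two arginines in this toy gene use different codons (AGA and CGT). Context-dependent choices of this kind are what a sequence model can pick up and a per-amino-acid frequency table cannot.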

"The model learns the syntax or the language of how these codons are used," the senior author says. "It takes into account how codons are placed next to each other, and also the long-distance relationships between them."

Once the model was trained, the researchers asked it to optimize the codon sequences of six different proteins, including human growth hormone, human serum albumin, and trastuzumab, a monoclonal antibody used to treat cancer. They also generated optimized sequences of these proteins using four commercially available codon optimization tools. The researchers inserted each of these sequences into K. phaffii cells and measured how much of the target protein each sequence generated. For five of the six proteins, the sequences from the new MIT model worked the best.
