Researchers at the National Institute of Standards and Technology (NIST) have developed a new statistical tool that they have used to predict protein function. Not only could it help with the difficult task of altering proteins in practically useful ways, but it also works by methods that are fully interpretable, an advantage over the conventional artificial intelligence (AI) that has aided protein engineering in the past.
The new tool, called LANTERN, could be useful in work ranging from producing biofuels to improving crops to developing new disease treatments. Proteins, as building blocks of biology, are a key element in all of these tasks. But while it’s relatively easy to make changes to the DNA strand that serves as the blueprint for a particular protein, determining which specific base pairs (rungs on the DNA ladder) are the keys to producing a desired effect remains a challenge. Finding these keys has been the purview of AI built with deep neural networks (DNNs), which, while effective, are notoriously opaque to human understanding.
Described in a new paper published in the Proceedings of the National Academy of Sciences, LANTERN shows the ability to predict the genetic edits needed to create useful differences in three different proteins. One is the spike-shaped protein from the surface of the SARS-CoV-2 virus that causes COVID-19; understanding how changes in the DNA can alter this spike protein might help epidemiologists predict the future of the pandemic. The other two are well-known laboratory workhorses: the LacI protein from the E. coli bacterium and the green fluorescent protein (GFP) used as a marker in biology experiments. By selecting these three subjects, the NIST team was able to show not only that their tool works, but also that its results are interpretable, a key feature for industry, which needs predictive methods that help with understanding the underlying system.
“We have an approach that is fully interpretable and also has no loss in predictive power,” said Peter Tonner, a statistician and computational biologist at NIST and the principal developer of LANTERN. “There’s a widespread assumption that if you want one of those things, you can’t have the other. We’ve shown that sometimes you can have both.”
The problem the NIST team is tackling can be pictured as interacting with a complex machine that has a huge control panel filled with thousands of unlabeled switches: the device is a gene, a strand of DNA that codes for a protein; the switches are base pairs on the strand. The switches all affect the output of the device in one way or another. If your job is to make the machine work differently in a specific way, which switches should you flip?
Because the answer may require changes in multiple base pairs, scientists must flip a combination of them, measure the result, then choose a new combination and measure again. The number of permutations is daunting.
“The number of possible combinations could be greater than the number of atoms in the universe,” Tonner said. “You could never measure all possibilities. It’s a ridiculously large number.”
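To give a feel for the scale Tonner describes, here is a rough back-of-the-envelope sketch. It assumes each position on the strand can hold one of four bases and uses the commonly cited order-of-magnitude estimate of 10^80 atoms in the observable universe; both the estimate and the helper names are illustrative assumptions, not figures from the NIST paper.

```python
# Back-of-the-envelope count of possible DNA sequences.
# ASSUMPTION: ~10**80 atoms in the observable universe is a rough,
# commonly cited order of magnitude, not a number from the paper.
ATOMS_IN_UNIVERSE = 10**80
BASES_PER_POSITION = 4  # A, C, G, T


def sequence_count(n_positions: int) -> int:
    """Number of distinct sequences over n_positions base positions."""
    return BASES_PER_POSITION ** n_positions


def positions_to_exceed(limit: int) -> int:
    """Smallest number of positions whose sequence count exceeds limit."""
    n = 0
    while sequence_count(n) <= limit:
        n += 1
    return n


threshold = positions_to_exceed(ATOMS_IN_UNIVERSE)
print(f"Sequences outnumber the atom estimate at {threshold} positions")
```

Under these assumptions the crossover arrives at well under 200 positions, which is why measuring every combination exhaustively is out of the question.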
Because of the sheer amount of data involved, DNNs have been given the task of sorting through a set of data and predicting which base pairs need to be flipped. At this they have proved successful, as long as you don’t ask for an explanation of how they get their answers. They are often described as “black boxes” because their inner workings are inscrutable.
“It’s really hard to understand how DNNs make their predictions,” said NIST physicist David Ross, one of the co-authors of the paper. “And that’s a big deal if you want to use those predictions to come up with something new.”
LANTERN, on the other hand, is explicitly designed to be understandable. Part of its explainability comes from its use of interpretable parameters to represent the data it analyzes. Rather than allowing the number of these parameters to grow extraordinarily large and often inscrutable, as is the case with DNNs, each parameter in LANTERN’s calculations has a purpose that is meant to be intuitive, so that users understand what these parameters mean and how they affect LANTERN’s predictions.
The LANTERN model represents protein mutations using vectors, commonly used mathematical tools often represented visually as arrows. Each arrow has two properties: the direction implies the effect of the mutation, while the length indicates how strong that effect is. If two proteins have vectors pointing in the same direction, LANTERN indicates that the proteins have a similar function.
The directions of these vectors often correspond to biological mechanisms. LANTERN, for example, learned a direction related to protein folding in all three datasets the team studied. (Folding plays a critical role in how a protein functions, so identifying this factor across datasets was an indication that the model is functioning as intended.) When making predictions, LANTERN simply adds these vectors together, a method that users can trace when examining its predictions.
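The vector picture described above can be sketched in a few lines. This is only an illustration of the idea, not LANTERN's code or API: the mutation names, the two-dimensional vectors, and the helper functions are all hypothetical, and a real model would learn its vectors from data.

```python
import math

# HYPOTHETICAL 2-D vectors for three made-up mutations. In a real model
# these would be learned from measurements; the values are illustrative.
MUTATION_VECTORS = {
    "mut_A": (0.9, 0.1),
    "mut_B": (1.8, 0.2),   # same direction as mut_A, twice the length
    "mut_C": (-0.1, 0.7),  # a different direction
}


def add_vectors(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return tuple(sum(components) for components in zip(*vectors))


def length(v):
    """Vector length: how strong the effect is."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(u, v):
    """Near 1.0 when two vectors point the same way (similar effect)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (length(u) * length(v))


def variant_vector(mutations):
    """A multi-mutation variant is the sum of its mutations' vectors."""
    return add_vectors([MUTATION_VECTORS[m] for m in mutations])


a, b, c = (MUTATION_VECTORS[k] for k in ("mut_A", "mut_B", "mut_C"))
print("A vs B direction:", round(cosine_similarity(a, b), 3))
print("A vs C direction:", round(cosine_similarity(a, c), 3))
print("A + C combined:", variant_vector(["mut_A", "mut_C"]))
```

Because the combined prediction is just a sum, a user can inspect each term to see which mutation moved the result and in which direction.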
Other labs had already used DNNs to make predictions about which switch flips would make useful changes to the three proteins involved, so the NIST team decided to compare LANTERN with the results of the DNNs. The new approach wasn’t just good enough; according to the team, it achieves a new state of the art in predictive accuracy for these kinds of problems.
“LANTERN equaled or outperformed nearly all alternative approaches in prediction accuracy,” Tonner said. “It outperforms all other approaches in predicting changes to LacI, and it has comparable predictive accuracy for GFP for all but one. For SARS-CoV-2, it has higher predictive accuracy than all alternatives except one type of DNN, which matched LANTERN’s accuracy but did not beat it.”
LANTERN calculates which sets of switches have the greatest effect on a particular characteristic of the protein — folding stability, for example — and summarizes how the user can tweak that characteristic to achieve a desired effect. In a way, LANTERN transforms the many switches on our machine’s panel into a few simple dials.
“It reduces thousands of switches to maybe five little dials you can turn,” Ross said. “It tells you that the first dial will have a big effect, the second a different but smaller effect, the third an even smaller one, and so on. So as an engineer, it tells me that I can focus on the first and second dial to get the result I need. LANTERN explains all this for me, and it’s incredibly helpful.”
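The "few dials, ordered by importance" idea can be illustrated with a toy ranking. The sketch below assumes each mutation has already been reduced to a handful of coordinates, one per dial, and uses the variance of effects along each dial as a simple stand-in for its importance. The numbers are invented, and the variance criterion is an assumption for illustration; the article does not say how LANTERN itself ranks its dimensions.

```python
from statistics import pvariance

# HYPOTHETICAL effects of five made-up mutations along three "dials".
# Each row is one mutation; each column is one dial.
EFFECTS = [
    (1.2, 0.3, 0.05),
    (-0.8, 0.4, 0.00),
    (0.5, -0.2, 0.02),
    (-1.1, -0.5, -0.03),
    (0.2, 0.1, 0.01),
]


def rank_dials(effects):
    """Return (dial_index, spread) pairs, largest spread first.

    ASSUMPTION: spread is the population variance of the mutation
    effects along a dial, used here as a proxy for its importance.
    """
    columns = list(zip(*effects))
    spread = [pvariance(column) for column in columns]
    order = sorted(range(len(spread)), key=lambda i: spread[i], reverse=True)
    return [(i, spread[i]) for i in order]


ranking = rank_dials(EFFECTS)
for dial, spread in ranking:
    print(f"dial {dial}: spread {spread:.4f}")
```

With a ranking like this in hand, an engineer could concentrate on the top one or two dials and treat the rest as minor corrections, which is the workflow Ross describes.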
Rajmonda Caceres, a scientist at MIT’s Lincoln Laboratory who is familiar with the method behind LANTERN, said she appreciates the tool’s interpretability.
“Not many AI methods applied to biology have been explicitly designed for interpretability,” said Caceres, who is not affiliated with the NIST study. “When biologists see the results, they can see which mutation is contributing to the change in the protein. This level of interpretation allows for more interdisciplinary research, because biologists can understand how the algorithm is learning and they can generate further insights about the biological system under study.”
Tonner said that while he is pleased with the results, LANTERN is not a panacea for AI’s explainability problem. Exploring alternatives to DNNs more broadly would benefit the entire effort to create explainable, reliable AI, he said.
“In the context of predicting genetic effects on protein function, LANTERN is the first example of something that can match DNNs in predictive power while still being fully interpretable,” Tonner said. “It offers a specific solution to a specific problem. We hope it can be applied to others and that this work inspires the development of new interpretable approaches. We do not want predictive AI to remain a black box.”