A celebrated AI has learned a new trick: doing chemistry

(THE CONVERSATION) Artificial intelligence has changed the way science is done by enabling researchers to analyze the vast amounts of data that modern scientific instruments generate. It can find a needle in a million haystacks of information and, using deep learning, it can learn from the data itself. AI is accelerating progress in gene hunting, medicine, drug design and the creation of organic compounds.

Deep learning uses algorithms, often neural networks trained on large amounts of data, to extract information from new data. It is very different from traditional computing with its step-by-step instructions. Instead, it learns from data. Deep learning is much less transparent than traditional computer programming and leaves open important questions: what has the system learned, what does it know?
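The contrast between step-by-step instructions and learning from data can be shown with a toy example. This is only a minimal sketch, not deep learning itself: it fits a single number by gradient descent, whereas a real neural network adjusts millions of them. The function names, the example data and the learning rate are all made up for illustration.

```python
def traditional_rule(x: float) -> float:
    """Traditional programming: the programmer writes the rule explicitly."""
    return 2.0 * x


def learn_weight(examples, steps: int = 200, learning_rate: float = 0.01) -> float:
    """Learn w in y = w * x from (x, y) examples by gradient descent."""
    w = 0.0  # start with no knowledge of the rule
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
        w -= learning_rate * grad
    return w


# Hypothetical training data: the rule "double it" is never written down,
# it is only implied by the examples.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
learned_w = learn_weight(examples)

print("Hand-written rule for x = 5:", traditional_rule(5.0))
print("Learned rule for x = 5:", learned_w * 5.0)
```

Even in this tiny case the learned program holds only a number, not an explanation. That is the transparency problem in miniature: to find out what the system knows, you have to probe it with questions.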

As a chemistry professor, I like to design tests with at least one difficult question that stretches the students’ knowledge, to determine whether they can combine different ideas and synthesize new ideas and concepts. We came up with such a question for the poster child of AI advocates, AlphaFold, which solved the protein-folding problem.

Protein folding

Proteins are present in all living organisms. They give structure to cells, catalyze reactions, transport small molecules, digest food and do much more. They are made up of long chains of amino acids, like beads on a string. But for a protein to do its job in the cell, it must twist and bend into a complex three-dimensional structure, a process called protein folding. Misfolded proteins can lead to disease.

In his acceptance speech for the 1972 Nobel Prize in chemistry, Christian Anfinsen postulated that it should be possible to calculate the three-dimensional structure of a protein from the order of its building blocks, the amino acids.

Just as the order and spacing of the letters in this article give it meaning and message, so the order of the amino acids determines the identity and shape of the protein, which results in its function.

Due to the inherent flexibility of the amino acid building blocks, a typical protein can adopt an estimated 10 to the power of 300 different shapes. This is a huge number, more than the number of atoms in the universe. But within a millisecond, each protein in an organism will fold into its own specific shape: the lowest-energy arrangement of all the chemical bonds that make up the protein. Change just one amino acid out of the hundreds of amino acids typically found in a protein and it can misfold and stop working.
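To get a feel for these numbers, here is a rough back-of-the-envelope sketch. Only the figure of 10 to the power of 300 shapes comes from the text above. The atom count, the sampling rate and the age of the universe are order-of-magnitude assumptions added purely for illustration.

```python
# Figure taken from the article: estimated number of shapes for a typical protein.
POSSIBLE_SHAPES = 10**300

# Assumptions for illustration only (not from the article):
ATOMS_IN_UNIVERSE = 10**80            # commonly quoted order-of-magnitude estimate
SHAPES_TRIED_PER_SECOND = 10**13      # hypothetical rate of trying out shapes
AGE_OF_UNIVERSE_SECONDS = 4 * 10**17  # roughly 13.8 billion years


def exhaustive_search_seconds(shapes: int, rate_per_second: int) -> int:
    """Seconds needed to try every shape once at a fixed rate."""
    return shapes // rate_per_second


search_seconds = exhaustive_search_seconds(POSSIBLE_SHAPES, SHAPES_TRIED_PER_SECOND)
universe_ages = search_seconds // AGE_OF_UNIVERSE_SECONDS

print("Shapes outnumber atoms:", POSSIBLE_SHAPES > ATOMS_IN_UNIVERSE)
print(f"Brute-force search time: about 10^{len(str(search_seconds)) - 1} seconds")
print(f"That is about {universe_ages:.1e} times the age of the universe")
```

Under these assumptions, trying every shape one by one would take vastly longer than the universe has existed, yet real proteins fold in about a millisecond. Whatever proteins do, they are not searching at random.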


For 50 years, computer scientists have been trying to solve the protein folding problem, with little success. Then in 2016 DeepMind, an AI subsidiary of Google’s parent company Alphabet, started its AlphaFold program. It used the Protein Data Bank, which contains the experimentally determined structures of more than 150,000 proteins, as its training set.

In less than five years, AlphaFold had solved the protein folding problem, or at least the most useful part of it: determining the protein structure from the amino acid sequence. AlphaFold does not explain how the proteins fold so quickly and accurately. It was a big win for AI, because not only did it gain tremendous scientific prestige, it was also a major scientific advancement that could affect everyone’s lives.

Thanks to programs like AlphaFold2 and RoseTTAFold, researchers like me can determine the three-dimensional structure of proteins in an hour or two from the sequence of amino acids that make up the protein, at no cost. Before AlphaFold2, we had to crystallize the proteins and solve the structures using X-ray crystallography, a process that took months and cost tens of thousands of dollars per structure.

We now also have access to the AlphaFold protein structure database, where DeepMind has deposited the 3D structures of nearly all proteins found in humans, mice, and more than 20 other species. To date, they have solved over a million structures and plan to add 100 million more this year alone. The knowledge of proteins has skyrocketed. The structure of half of all known proteins is likely to be documented by the end of 2022, including many new unique structures associated with new useful functions.

Think like a chemist

AlphaFold2 was not designed to predict how proteins would interact with each other, but it has been able to model how individual proteins combine to form large complex units composed of multiple proteins. We had a challenging question for AlphaFold: did the structural training set teach it some chemistry? Could it tell if amino acids would react with each other, a rare but important event?

I am a computational chemist interested in fluorescent proteins. These are proteins found in hundreds of marine organisms such as jellyfish and coral. Their glow can be used to light up and study diseases.

There are 578 fluorescent proteins in the Protein Data Bank, of which 10 are “broken” and do not fluoresce. Proteins rarely attack themselves, a process called autocatalytic post-translational modification, and it is very difficult to predict which proteins will react with themselves and which will not.

Only a chemist with a significant amount of knowledge of fluorescent proteins would be able to tell from the amino acid sequence alone which fluorescent proteins have the right sequence to undergo the chemical transformations necessary to make them fluorescent. When we presented AlphaFold2 with the sequences of 44 fluorescent proteins that are not in the Protein Data Bank, it folded the fixed fluorescent proteins differently from the broken ones.

The result surprised us: AlphaFold2 had learned some chemistry. It had discovered which amino acids in fluorescent proteins do the chemistry that makes them glow. We suspect that the Protein Data Bank training set and multiple sequence alignments allow AlphaFold2 to “think” like chemists and search for the amino acids needed to react with each other to make the protein fluorescent.

A folding program that learns some chemistry from its training set also has broader implications. What else can be gained from other deep learning algorithms by asking the right questions? Can facial recognition algorithms find hidden markers for diseases? Could algorithms designed to predict consumer spending patterns also find a propensity for petty theft or cheating? And most importantly, is this capability, and similar leaps in skill in other AI systems, desirable?
