A 50 year old Protein Folding Problem and the AI that Solved it

16 min read · Jan 1, 2022

Proteins are the molecular machinery of life. They do everything from digesting our food to acting as photoreceptors in our eyes. Proteins build proteins, and proteins also break down proteins. Built from a unique sequence of amino acids, each protein has a specific 3D structure that allows it to do work. How a protein goes from a string of amino acids to a compact structure that acts as a nano-machine is called the protein folding problem. Even if we know the amino acid sequence of a protein, we can’t easily work out the final 3D structure, because the number of possible chemical interactions is astronomically large. After five decades with little progress, the problem saw a breakthrough when DeepMind’s AI, AlphaFold, achieved the highest prediction accuracy ever recorded.

WHAT IS A PROTEIN?

Proteins are essential for life. They are among the most abundant and most diverse organic molecules. Along with being an essential nutrient and a fuel source, protein is also one of the building blocks of the body’s tissues. Other building blocks include carbohydrates, lipids and nucleic acids. Pretty much everything in our cells is carried out by proteins. You can think of proteins as micro-machines. Just as the structure of a lever or a pulley allows it to do work, the 3D structure of a protein allows it to do the specific task it’s made for. And because of their tremendous structural diversity, proteins can serve a significantly wider variety of functions than any other type of molecule in the body. A single cell can contain thousands of different proteins, each with its own function. Enzymes, like the ones that aid in digestion, are made of protein. The specific 3D structure of an individual enzyme allows it to make certain reactions happen more easily than they would have otherwise. Nearly all enzymes are proteins (a few, called ribozymes, are made of RNA). DNA polymerase, for example, is an enzyme that adds nucleotides to DNA during replication. The condition albinism can result from an inherited defect in the enzyme that catalyzes the formation of melanin from the molecule DOPA (dihydroxyphenylalanine).

Proteins like collagen, tubulin and keratin also help provide structure to cells and tissues. Collagen provides structure and tensile strength to tissues like tendons and ligaments. Tubulin assembles into microtubules, which help form the cytoskeleton, a complex and dynamic network of protein filaments present in the cytoplasm of our cells. Keratin is found in the outer layer of dead cells in the epidermis, where it helps prevent water loss through the skin. Other proteins like actin and myosin are used in muscle contraction. Proteins like albumin carry other molecules around. Proteins like hemoglobin carry oxygen to cells and tissues. Antibodies are also proteins.

Proteins also act as receptors on cells. These receptors can be for numerous things. For example, almost every cell in our body expresses MHC (Major Histocompatibility Complex) receptors that distinguish it from foreign cells. [Learn more about MHC receptors and the role they play in our immune system in my Innate and Adaptive Immunity blog] Some receptors act as receivers of signals by binding to extracellular molecules. Some receptors are embedded in the cell membrane and open like gates when the right molecules bind to them. Sometimes the things binding to the receptors are proteins themselves, like hormones. In these cases, proteins act as both the chemical messengers and their destination receptors.

Insulin, glucagon and growth hormone are examples of protein hormones. Insulin, for example, is a hormone released by the pancreas when there is excess glucose in the blood. Insulin is then responsible for opening the gates of cells to allow glucose in.

The specific 3D structures of insulin and of the receptor it binds to are what allow it to open the doors of cells so that glucose flowing in the bloodstream can be taken in. Below is a closer representation of what the 3D structure of insulin looks like.

It looks fairly complicated. What are the building blocks of protein itself? Chemically, these large, complex molecules are made from chains of amino acids.

WHAT ARE AMINO ACIDS?

Amino acids are organic compounds that have an amino group (NH2) and a carboxyl group (COOH) attached to the same central carbon, along with a side chain (R) that is unique to each amino acid.

More than 500 naturally occurring amino acids exist, but only 20 of them are specified by the standard genetic code. All living organisms use the same 20 amino acids to make thousands of different proteins. Here is a compilation of the 20 amino acids with their unique side chains:

HOW ARE PROTEINS MANUFACTURED?

So, the 20 amino acids are arranged in various combinations to make chains that are hundreds or even thousands of amino acids long. Each protein has its own unique sequence of amino acids. But how are these proteins manufactured, and how do cells know what order to arrange the amino acids in? The sequence of amino acids is derived from the genetic code. This flow of information from DNA to RNA to protein is called the Central Dogma. I go over this path (nucleotides of DNA → sequence of amino acids → protein) in my Genetic Code blog.

There are five different nucleotide bases. DNA uses four: adenine (A), thymine (T), guanine (G) and cytosine (C). RNA uses uracil (U) instead of thymine. Adenine pairs with thymine and cytosine pairs with guanine. DNA is a double helix, so every gene sits on one strand with a complementary strand bound to it, and the cell uses that complementary strand as the template when it copies a gene into RNA. Cells decode the message by reading nucleotides in groups of three called codons. The relationship between codons and amino acids is the genetic code.
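The pairing rules above are simple enough to play with in a few lines of Python. This is only a toy sketch: the 9-letter DNA stretch is made up for illustration, and the function names are my own, not part of any library.

```python
# Pairing rules for DNA: A pairs with T, C pairs with G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complementary_strand(strand):
    """Return the strand that pairs with `strand`, read in the opposite direction."""
    return "".join(DNA_PAIRS[base] for base in reversed(strand))

def transcribe(coding_strand):
    """mRNA carries the same letters as the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")

def split_into_codons(mrna):
    """Split an mRNA string into consecutive groups of three letters."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

coding = "ATGGAGTGA"  # hypothetical stretch of a gene, made up for this example
mrna = transcribe(coding)

print(complementary_strand(coding))  # TCACTCCAT
print(mrna)                          # AUGGAGUGA
print(split_into_codons(mrna))       # ['AUG', 'GAG', 'UGA']
```

Reading the strand in reverse inside `complementary_strand` reflects the fact that the two strands of the double helix run in opposite directions.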

Most codons specify an amino acid. Notice that there are several ways to code for one amino acid. The reason is that each codon is three letters long and each letter can be one of four nucleotides, which gives 4 × 4 × 4 = 64 possible three-letter combinations. Sixty-one of these 64 codons are assigned to the 20 different amino acids, and the remaining three are stop signals. A huge benefit of this redundancy is that a mutation sometimes makes no difference, because the new three-letter combination still translates to the same amino acid. That matters because changing even one of the hundreds of amino acids in a protein can sometimes change its structure and function. For example, sickle cell anemia is caused by a single nucleotide mutation in the gene for hemoglobin. This changes one amino acid from glutamic acid to valine. This single change makes hemoglobin molecules stick together into stiff fibers that bend red blood cells into a sickle shape. The sickled cells carry oxygen less efficiently and get stuck in narrow vessels, causing occlusion and pain.
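As a quick check on the counting: four nucleotides in three positions give 4³ = 64 codons. The sketch below builds the standard codon table from a compact 64-letter string (one-letter amino acid codes listed in U, C, A, G order, with `*` for stop) and counts how many codons map to each amino acid. The variable names are my own.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE))                        # 64
print(stop_codons)                             # ['UAA', 'UAG', 'UGA']
print(len(codons_per_amino_acid))              # 20
print(codons_per_amino_acid["L"])              # 6 (leucine has six codons)
print(CODON_TABLE["GAA"], CODON_TABLE["GAG"])  # E E (a silent change)
print(CODON_TABLE["GAG"], CODON_TABLE["GUG"])  # E V (the sickle cell change)
```

The last two lines show both sides of the redundancy story: GAA and GAG both mean glutamic acid (E), so that swap is harmless, while GAG to GUG swaps glutamic acid for valine (V).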

There is also one start codon, AUG, which codes for methionine (so every newly made protein chain starts with this amino acid), and three stop codons, UAA, UAG and UGA, which indicate where to stop translation. A gene in the DNA can be anywhere from hundreds to hundreds of thousands of base pairs (nucleotides) long. This translates to a string of hundreds or thousands of amino acids. This sequence is called the primary structure of the protein. Some proteins are made of more than one chain. Insulin, for example, has two chains of amino acids, each with its own primary structure. Hemoglobin has four polypeptide (protein) chains.
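Here is a minimal sketch of how reading from the start codon to a stop codon turns mRNA into a primary structure. The mRNA string is made up for illustration, the `translate` function is my own simplification, and a real cell does a lot more than this (it ignores, for instance, everything about how the ribosome finds the right AUG).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Read codons from the first AUG until a stop codon; return one-letter amino acids."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing gets translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon reached
        protein.append(amino_acid)
    return "".join(protein)

mrna = "CCAUGGUGCAUCUGACUUAAGG"  # hypothetical mRNA fragment
print(translate(mrna))  # MVHLT
```

The output string (M for methionine, V for valine, and so on) is exactly what the text calls the primary structure: just the order of amino acids, with no shape information yet.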

Once you have the primary structure, atoms in the backbone of the polypeptide chain (the chain minus the R side chains, essentially the part that is identical for every amino acid) form hydrogen bonds with each other to create secondary structures. The most common secondary structures are β pleated sheets and α helices. The carbonyl O of one amino acid bonds with the amino H of another. In an α helix, each amino acid is bonded to the one that is 4 further down the chain. This pattern creates a helix shape. In β pleated sheets, two or more segments of the polypeptide chain line up side by side and form bonds between them. (The 2 α and 2 β subunits of hemoglobin are something different: those are names for its four whole chains, not for helices and sheets.)

The overall three-dimensional structure of the protein is called the tertiary structure. Tertiary structure is formed by interactions between the R side groups, the unique chemical groups of each amino acid. R groups come in a wide variety, so they can form many different types of chemical bonds and interactions. That variety is what makes the tertiary structure so difficult to predict.

If a protein is made of only one polypeptide chain, then this tertiary structure is its final structure. But many proteins, like insulin and hemoglobin, are made of multiple polypeptide chains. The interactions between these individual chains create the ultimate quaternary structure of the protein. DNA polymerase III, the enzyme complex that synthesizes new strands of DNA in bacteria such as E. coli, has a quaternary structure built from 10 different kinds of polypeptide chains!

Whenever a cell needs a specific protein to perform a job, it transcribes the gene for that protein from DNA into messenger RNA (mRNA) inside the nucleus. mRNA, unlike DNA, is able to leave the nucleus. Once outside, the mRNA can be translated into a string of amino acids. The chain of amino acids makes a polypeptide, which folds into the 3D structure of the required protein. This entire production line is run by ribosomes, molecular machines that are themselves built from proteins and RNA. Check out my Central Dogma blog to get a much more detailed version of this.

SUMMARY OF PROTEIN STRUCTURES

A protein can unfold back to its primary structure if exposed to abnormal temperatures or pH. Sometimes this is reversible and the chain can refold into its original structure. Other times it is irreversible, as with the albumin found in egg whites. Heating an egg changes the structure of albumin irreversibly; as we all know, a cooked egg can’t go back to being uncooked. The change of a protein from its 3D structure to its unfolded polypeptide chain is called denaturation. Most denatured proteins are nonfunctional because they have lost their 3D structure.

WHAT IS THE PROTEIN FOLDING PROBLEM?

So, we now know that proteins can be simplified into a string of amino acids. The code for this sequence of amino acids can be found in the genes in our DNA. Take insulin, for example. It is a protein whose entire amino acid sequence we know. But we still need bacteria like E. coli to manufacture insulin for medical use. Yes, that’s how insulin is made. We insert the insulin gene into E. coli DNA and use its cellular machinery to produce insulin for us. This is similar to how viruses work: they hijack a cell’s machinery to make their own proteins.

Why can’t we just attach one amino acid to the next in the right sequence and make the protein ourselves? Turns out it’s not that easy. In his acceptance speech for the 1972 Nobel Prize in Chemistry, Christian Anfinsen famously postulated that we should be able to determine the 3D structure of any protein if we knew its amino acid sequence. This sparked a five-decade quest to predict a protein’s 3D structure. Each sequence of hundreds or thousands of amino acids can be folded in MANYYYY different ways. By one estimate, a single primary sequence can have 10³⁰⁰ possible configurations!! In 1969, Cyrus Levinthal noted that it would take longer than the age of the universe to try all the possible configurations that one primary sequence can have. But somehow the cellular machinery of our body knows exactly how to fold it, sometimes in milliseconds. This contradiction is referred to as Levinthal’s paradox. We know that cells use helper proteins called chaperones to guide some chains through the folding process, but the exact mechanisms are still not fully understood.
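To get a feel for the scale of Levinthal’s argument, here is some back-of-the-envelope arithmetic. The 10³⁰⁰ figure is the one quoted above. The sampling speed (one configuration every tenth of a picosecond) is purely my assumption for illustration, and the age of the universe is rounded to 13.8 billion years. Everything is done in powers of ten because the raw numbers are too big to be meaningful.

```python
import math

LOG10_CONFIGURATIONS = 300        # 10^300 configurations, the figure quoted in the text
SAMPLES_PER_SECOND = 1e13         # ASSUMPTION: one configuration tried every 0.1 picoseconds
AGE_OF_UNIVERSE_YEARS = 1.38e10   # roughly 13.8 billion years
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Work with exponents (log10) instead of the astronomically large numbers themselves.
log10_seconds = LOG10_CONFIGURATIONS - math.log10(SAMPLES_PER_SECOND)
log10_years = log10_seconds - math.log10(SECONDS_PER_YEAR)
log10_universe_ages = log10_years - math.log10(AGE_OF_UNIVERSE_YEARS)

print(f"Trying every configuration: about 10^{log10_years:.1f} years")
print(f"That is about 10^{log10_universe_ages:.1f} times the age of the universe")
```

Even with that generous sampling speed, the brute-force search comes out at more than 10²⁷⁹ years, which is why a protein that folds in milliseconds cannot be trying configurations one by one.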

HOW CAN WE IDENTIFY PROTEIN STRUCTURES NOW?

If we have the protein, we can use X-ray crystallography or NMR spectroscopy to identify its structure. In X-ray crystallography, high-powered X-rays are aimed at a tiny crystal containing trillions of identical proteins. The crystal scatters the X-rays onto an electronic detector, which acts like a digital camera. This is repeated from many angles to reconstruct a 3D image of the protein. There’s also a newer method called cryo-electron microscopy. Proteins whose structures have been determined are listed in the Protein Data Bank. So far, that bank has a collection of around 170,000 protein structures. But from the genomes of living things we have gathered a list of 180 million protein sequences! These primary sequences are registered in the Universal Protein Resource (UniProt).

But despite all the methods available to us for determining protein structures, we can’t do this for all 180 million (and counting) protein sequences. That’s because all of these methods are extremely expensive and time consuming, and none of them determines the 3D structure purely from the primary sequence. Also, some proteins, like membrane proteins, are hard to crystallize. Hence began the ultimate competition to drive creativity in this field.

CASP COMPETITION

Critical Assessment of Protein Structure Prediction (CASP) is a worldwide experiment that has been taking place every 2 years since 1994. Founded by John Moult and Krzysztof Fidelis, the competition aims to find a reliable method of predicting the 3D structure of a protein from its primary sequence of amino acids. Participants must blindly predict the structures of the proteins, and these predictions are then compared to ground-truth experimental data when it becomes available.

CASP uses GDT (Global Distance Test) to measure the accuracy of structure predictions. GDT ranges from 0 to 100. It can essentially be defined as the percentage of amino acids that lie within a threshold distance of their correct positions. The best scores in the competition remained around 40 GDT until AlphaFold hit a high of about 60 GDT in 2018. Then, in 2020, came AlphaFold 2, with a median accuracy of 92.4 GDT across all targets!! AlphaFold 2’s margin of error is comparable to the width of an atom. This was unprecedented. A scientific marvel. Even for the hardest challenges, AlphaFold 2 achieved a median GDT of 87. This breakthrough is especially valuable for predicting the structures of proteins that are hard to determine experimentally, such as membrane proteins. And it occurred decades before many would have predicted. All thanks to DeepMind’s AlphaFold. Today AlphaFold is available for free to the whole world. How did AlphaFold do it?
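Before getting to that, here is a simplified sketch of the GDT idea in Python. It assumes the predicted and experimental structures are already laid on top of each other (the real GDT calculation also searches for the best superposition, which I skip here), and it averages the "percentage within the threshold" over cutoffs of 1, 2, 4 and 8 angstroms, the way the commonly reported GDT_TS score does. The coordinates are made up.

```python
import math

def percent_within(predicted, experimental, cutoff):
    """Percentage of residues whose predicted position is within `cutoff` angstroms."""
    close = sum(
        1 for p, e in zip(predicted, experimental) if math.dist(p, e) <= cutoff
    )
    return 100.0 * close / len(experimental)

def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average of the within-cutoff percentages (simplified: no superposition search)."""
    scores = [percent_within(predicted, experimental, c) for c in cutoffs]
    return sum(scores) / len(scores)

# Hypothetical positions (in angstroms) for a toy 4-residue chain.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]

print(gdt_ts(predicted, experimental))  # 56.25
```

In this toy case the four residues are off by 0.5, 1.5, 3 and 9 angstroms, so 25%, 50%, 75% and 75% of them fall inside the four cutoffs, which averages to 56.25. A score above 90 means nearly every residue sits within about an angstrom or two of where the experiment says it should be.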

AlphaFold

AlphaFold’s success can be attributed to another big advancement of the 21st century: machine learning. To truly understand how AlphaFold gained such accuracy in predicting structures, we need to understand machine learning, which could be a whole blog in itself. But here, I will try my best to at least lay out the basic idea.

WHAT IS MACHINE LEARNING?

We can think of machine learning as essentially a magical tool that lets a computer learn on its own without extensive coding and supervision. An AI using machine learning algorithms is not very different from a baby learning about its world. We have had computers for a while now, and they are much better than us at storing data. But when it comes to flexible, energy-efficient processing, our own brain is still the most impressive computer in the world. What will it take to build a computer that can think and learn on its own? Well, for starters we need an unimaginable amount of processing power (e.g. tons of GPUs; fun fact, our brain runs on fewer watts than a light bulb, but until we figure out how to make “organic” AI we are gonna need a shit ton of computing power). But we also need a whole new way of coding. Most programs are coded using “if this, then that”. But not every problem is as simple as searching through a database of if-this-then-that rules. Certain problems like tackling language, optimization, or protein folding are more organic and require novel methods to solve. A toddler, for example, learns language by recognizing patterns. We don’t label every single possible object that exists with its name and expect a baby to learn that way. Babies learn their mother tongue without formal teaching, by recognizing patterns and categorizing words using those patterns. And they do this in a trial-and-error fashion. Machine learning is like a human brain on steroids. It learns, and it learns fast, as long as it sticks to one general category, like learning language (GPT), learning to play Go (AlphaGo), learning to play StarCraft (AlphaStar) or learning how proteins fold (AlphaFold). A supercomputer that can do it all is theoretically possible, but then we are talking about city-sized cubes filled with GPUs, unless we come up with a whole new material to build the computer with.

So, that’s what machine learning is. It’s still “if this, then that”, but the rules are no longer written by hand. Instead, the input passes through a neural network made of many hidden layers that feed into each other, finding patterns and re-weighing every intermediate result before spitting out a final output. For example, it’s no longer “if green, then apple”, or even “if green, small and round, then apple”, because that object could also be a ball or many other things. You can’t create an infinite list of if this + this + this… then that. Machine learning saves us the trouble of having to do that. Just as different characteristics of the apple are processed in different parts of the brain, a machine learning AI has these different parts, or “hidden layers”, to analyze each characteristic, and during training a feedback mechanism keeps adjusting them until the outputs come out right.
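To make the “hidden layer” idea a little less abstract, here is about the smallest network I can think of. It takes two yes/no features and answers “yes” only when exactly one of them is present, a pattern that no single weighted rule on the two inputs can capture but one hidden layer can. Big caveat: the numbers (weights) below are hand-picked by me for illustration. In real machine learning they would be learned from examples, not typed in.

```python
def step(x):
    """Simplest possible activation: fire (1) if the input is positive, else stay silent (0)."""
    return 1 if x > 0 else 0

def tiny_network(a, b):
    # Hidden layer: two units that each look at BOTH inputs.
    at_least_one = step(a + b - 0.5)  # fires if either feature is present
    both = step(a + b - 1.5)          # fires only if both features are present
    # Output layer combines the hidden units: "at least one, but not both".
    return step(at_least_one - both - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", tiny_network(a, b))
# 0 0 -> 0
# 0 1 -> 1
# 1 0 -> 1
# 1 1 -> 0
```

Each hidden unit answers a simple intermediate question, and the output unit combines those answers. Networks like AlphaFold do the same thing with millions of learned numbers and many more layers.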

BACK TO AlphaFold

AlphaFold was trained on around 170,000 protein sequences and structures from the Protein Data Bank. These proteins come from all kinds of living beings. As we just learnt, machine learning uses various hidden layers between the initial input and the final output. To understand AlphaFold’s hidden layers, we will refer to the diagram below.

AlphaFold takes the input primary sequence and runs it against different databases to construct a Multiple Sequence Alignment (MSA). Sequence alignment is a method of arranging DNA, RNA or protein sequences to find similarities. There’s a saying that “there is nothing new under the sun”. Primary sequences mutate and proteins change, but the key features that are essential to a protein’s function tend to be preserved under evolutionary pressure. An enzyme’s active site, for example, may make up only 5% of the entire structure, but it’s absolutely vital to the enzyme’s function. Purely by probability, most mutations in an enzyme’s gene will land in regions that do not code for the active site. If a mutation does occur in the active site or another structurally important site, the gene will be under evolutionary pressure to preserve the structure of the enzyme. For example, let’s say that in the active site the negatively charged amino acid glutamate is bound to a positively charged histidine. If the glutamate mutates to a positively charged amino acid, the histidine would then be under evolutionary pressure to change to a negatively charged amino acid in order to preserve the protein’s function. Pairs of positions that change together like this across species are a strong hint that the two amino acids sit close to each other in the folded protein. Here is an example of myoglobin proteins from four different animals. Though the structures look pretty much the same, only 25% of the primary sequence matches between the orange (human) and yellow (pigeon) versions.
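One classic way to spot positions that mutate together in an alignment is to measure the mutual information between columns. To be clear, this is not how AlphaFold itself works internally (its network learns these relationships rather than computing them with a formula), and the six aligned sequences below are made up. It is just a sketch of the co-evolution idea.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(col_a, col_b):
    """How much knowing the letter in one column tells you about the other (in bits)."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    total = 0.0
    for (a, b), count in count_ab.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return total

# Hypothetical alignment: one row per species, one column per position.
# Position 1 flips between E and K, and position 3 flips with it (H with E, D with K).
msa = ["MEAHG", "MKADG", "MELHG", "MKLDG", "MEAHG", "MKLDG"]

columns = list(zip(*msa))
scores = {
    (i, j): mutual_information(columns[i], columns[j])
    for i, j in combinations(range(len(columns)), 2)
}
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))  # (1, 3) 1.0
```

Positions 1 and 3 score highest because they always change together, which is the statistical fingerprint of two amino acids that need to stay compatible with each other. Fully conserved columns (positions 0 and 4) score zero: they never change, so there is nothing to correlate.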

Clockwise from top left: human, African elephant, blackfin tuna, pigeon

Image representing how mutations co-evolve to preserve structure.

So, this is one signal that AlphaFold uses to analyze primary sequences. It also runs the input sequence against a database of protein structures to find structures that are likely to be similar to the given sequence. Using this information, it identifies pairs of amino acids that are most likely to be in contact and builds an initial model of the structure called the “pair representation”.
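The simplest cousin of a pair representation is a contact map: a grid with one row and one column per amino acid, marking which pairs sit close together in space. AlphaFold’s actual pair representation stores much richer information for every pair, so treat this only as a sketch of the idea. The 8 angstrom cutoff is a commonly used convention that I am assuming here, and the coordinates describe a made-up chain that runs out and folds back on itself like a hairpin.

```python
import math

CONTACT_CUTOFF = 8.0  # angstroms; ASSUMPTION: a commonly used contact threshold

def contact_map(coords, cutoff=CONTACT_CUTOFF):
    """Grid where entry [i][j] is 1 if residues i and j are within `cutoff` of each other."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]

# Hypothetical hairpin: residues 0-3 run left to right, residues 4-7 run back above them.
coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]

contacts = contact_map(coords)
# Contacts between residues far apart in the sequence are the informative ones.
long_range = [
    (i, j)
    for i in range(len(coords))
    for j in range(i + 4, len(coords))
    if contacts[i][j]
]
print(long_range)
```

Residues 0 and 7 are at opposite ends of the sequence but end up next to each other in space, and that is exactly the kind of long-range pairing a structure predictor has to get right.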

The information from the MSA and the input sequence gets passed through a “transformer”, which spits out a 3D structure. This structure is then compared to the pair representation and changes are made accordingly. The model is further refined by running through the MSA analysis again. This process is repeated several times, as information is passed back and forth between the pair representation and the MSA until a final 3D structure for the protein is produced. The structure is represented as a spatial graph, with each amino acid having a point in 3D space. AlphaFold also assigns a confidence score to each amino acid’s predicted position. You may be wondering what a transformer is. It’s another tool of machine learning. GPT, the language AI, uses it as well. In fact, the T in GPT stands for transformer. The transformer, or in this case the Evoformer, essentially squeezes every ounce of information out of the MSA and the initial input, and learns to recognize which information is important to the final structure.

To be honest, nothing that AlphaFold does is novel in the sense that these are the very basic principles that other CASP competitors use too. It’s what our own human brain would use as well. It’s logic. Recognizing patterns in data is what it all boils down to. What sets AlphaFold apart is the machine learning algorithm, which makes this type of problem solving much more efficient than any other method. A machine learning AI can find patterns in far more data than a hand-coded program. It is still nothing compared to our brain in flexibility, but when it comes to sorting through huge amounts of data to find patterns, our brain falls behind. We can’t process 180 million protein sequences using only our mind power. Even worse, we don’t even know the structures for most of them.

Given all this, there is still much to learn. We have come on in leaps and bounds but are still far from perfect. Eventually, when we become masters of protein manufacturing, we will be able to create proteins that don’t exist yet, from primary sequences designed by AI, to do tasks we don’t have nano-machines for. We could tackle everything from making enzymes for plastic degradation to creating structures capable of carbon capture. We could also revolutionize medical treatments and make many health disorders obsolete. Being able to efficiently and accurately predict and manufacture proteins will also allow us to be better prepared for another pandemic, or even the current one. This is because all viruses are equipped with special proteins that give them certain perks. For example, the spike protein of SARS-CoV-2 allows it to unlock our cells. In 2020, AlphaFold was used to predict the structures of several SARS-CoV-2 proteins. And the potential goes beyond proteins. This achievement of AlphaFold is a celebration of the potential of AI as a scientific tool. I am confident that AI will become one of humanity’s most useful tools in expanding the frontiers of scientific knowledge.

Hope you found this post informative as well as interesting! :)