What We Don't Know
Protein folding
https://whatwedontknow.buzzsprout.com/
Hello everyone, welcome to the seventh episode of ‘What We Don’t Know’, a podcast that explores the boundaries of human knowledge, investigating the unanswered questions and theories that unravel them at the frontiers of science. During this podcast I hope to get you interested in new areas of science, maths and technology, teaching you about existing concepts and igniting a curiosity for the things we have yet to know.
This episode is about protein folding, specifically the protein folding problem that has pervaded biochemistry since 1960, when the first atomic-resolution protein structure was presented. First I will explain what proteins are, why they are important, what they are made of - proteins 101 - then begin unravelling the problem of how they fold. We will explore the motivations behind the problem and its greatest challenges. Finally we’ll consider the existing methods for determining protein structure, with particular focus on DeepMind’s AlphaFold, before finishing with the future of protein folding.
Along with carbohydrates, lipids and nucleic acids, proteins are some of the most important macromolecules in living organisms. They are large, complex machines underpinning nearly every biological process, from communication between organs and repair of damaged tissue, to transport of molecules and fighting infections as antibodies. Enzymes, which facilitate nearly all cellular chemical reactions, including the copying and transcription of DNA, are proteins. Many hormones which regulate the body, including insulin, which allows glucose to enter cells, are proteins. Myofibrillar proteins largely compose myofibrils, which bundle into muscle fibres, which make up muscles and enable movement. The list goes on.
The general structure of a protein has four orders. First is the primary structure. This is the sequence of amino acids that link to form the polypeptide chain, and it is determined by the sequence of DNA nucleotides in a gene. Transcription and translation convert this DNA sequence into an amino acid sequence, which forms a polypeptide chain when the amino group of one amino acid forms a peptide bond with the carboxyl group of another amino acid. Each amino acid has a carbon atom bonded with an amino group (NH2), a hydrogen, a carboxyl group (COOH) and a variant side chain R.
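For anyone reading along, here is a minimal Python sketch of that DNA-to-protein step. It assumes the input is the coding strand, read from its first base, and the codon table is a tiny hand-picked subset of the standard genetic code (just enough for a made-up example sequence), so treat it as an illustration rather than a working translator.

```python
# Tiny subset of the standard genetic code (mRNA codon -> one-letter amino acid).
# "*" marks a stop codon. The real table has 64 entries.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "GCU": "A",  # alanine
    "UAA": "*",  # stop
}


def transcribe(coding_dna: str) -> str:
    """Turn a coding-strand DNA sequence into mRNA by swapping T for U."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna: str) -> str:
    """Read the mRNA three bases at a time and build the amino acid chain."""
    mrna = transcribe(coding_dna)
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # KeyError if the codon is not in our subset
        if amino_acid == "*":
            break  # stop codon: the chain is finished
        chain.append(amino_acid)
    return "".join(chain)


# Hypothetical example gene: Met-Phe-Gly-Lys followed by a stop codon.
print(translate("ATGTTTGGCAAATAA"))
```

The string that comes out is the primary structure in one-letter amino acid codes; everything else in this episode is about what that string does next.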
Interactions between atoms of the polypeptide backbone, meaning everything except the R groups, give rise to the local folded structures that make up the secondary structure. The most common types are the alpha helix and the beta pleated sheet, held together by hydrogen bonds between the carbonyl oxygen of one amino acid and the amino hydrogen of another. Alpha helices resemble curled ribbons with R groups sticking out; beta pleated sheets resemble sheets (no prizes for guessing that) whose R groups extend above and below the plane of the sheet. A protein may contain only one type, or have sections of each, and unstructured regions between secondary structures are called ‘bends’.
The tertiary structure is where complexities arrive. This is the overall 3D structure of a protein. It is primarily due to interactions between R groups, which include hydrogen bonding, ionic bonding, dipole-dipole interactions, London dispersion forces, disulphide bonds, and hydrophobic interactions.
Finally, some proteins are made of multiple polypeptide chains, known as subunits, which come together under the influence of many of the same interactions that influence tertiary structure, especially weak interactions like hydrogen bonding and London dispersion forces. For example, haemoglobin has four subunits, and DNA polymerase has ten. Multiple amino acid chains assembled together define the quaternary structure.
We know over 200 million protein sequences. For only a tiny fraction of those do we know the corresponding 3D structure. Around 100,000 structures have been determined through an enormous experimental effort, but scientists would like to understand the 3D structures of all proteins, because a protein’s shape significantly impacts its function.
Some genetic mutations cause changes in amino acid sequence, so the protein folds incorrectly and no longer fulfils its function. Type 2 diabetes, Alzheimer’s, Parkinson’s, and ALS are all diseases linked to incorrectly folded proteins. For example, the 14 kDa protein α-synuclein is strongly associated with Parkinson’s, but we haven’t fully deciphered its biochemistry. Understanding how proteins fold, and, by extension, reliably predicting how a change in amino acid sequence affects a protein’s folded shape, would develop our understanding of how these diseases work. Then we could treat their symptoms more successfully, or even find cures.
If only there was a way to predict the 3D structure of a protein solely from its amino acid sequence. And therein lies the protein folding problem.
Amino acid sequences can be very, very long, and fold in many directions. In fact, Cyrus Levinthal estimated in the 1960s that a typical protein could fold in 10^140 ways. The true numbers are no less staggering: 10^50 for small proteins and 10^300 for large ones. That’s a lot of conformations - i.e. arrangements in space of a protein’s atoms - for a small string of monomers to choose from.
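To get a feel for numbers like these, here is a back-of-the-envelope sketch in Python. The inputs are assumptions picked purely for illustration (100 amino acids, 3 possible states each, and a protein trying 10^13 conformations per second); they are not Levinthal's own figures. The code works with logarithms because the counts themselves are too big to be comfortable.

```python
import math


def log10_conformations(n_residues: int, states_per_residue: int) -> float:
    """Return log10 of states_per_residue ** n_residues."""
    return n_residues * math.log10(states_per_residue)


def log10_years_to_search(log10_count: float, samples_per_second: float) -> float:
    """Return log10 of the years needed to visit every conformation once."""
    seconds_per_year = 365.25 * 24 * 3600
    return log10_count - math.log10(samples_per_second) - math.log10(seconds_per_year)


# Assumed, illustrative numbers: 100 residues, 3 states each, 1e13 samples per second.
log_count = log10_conformations(100, 3)
log_years = log10_years_to_search(log_count, 1e13)
print(f"about 10^{log_count:.1f} conformations")
print(f"about 10^{log_years:.1f} years to try them all")
```

Even with these modest assumptions, an exhaustive search would take vastly longer than the age of the universe, which is why the speed of real folding is so striking.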
Even more startling is the speed with which a protein does choose. And it chooses the same every time. Proteins, in general, do not have a large number of stable shapes, but prefer folding the same way each time.
There are two main ways to approach the protein folding problem. The first is to produce a model for why proteins fold one way and not another.
A major milestone in understanding the folding mechanism was Anfinsen’s thermodynamic hypothesis of 1962, where he postulated that the 3D structure, i.e. the native structure, of a protein is the most thermodynamically stable structure, depending only on the amino acid sequence and conditions of the solution, not on the kinetic folding route. In other words, proteins fold into their lowest energy shape, and do so based entirely on the sequence of amino acids.
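One way to make "the most thermodynamically stable structure" concrete is the Boltzmann distribution from statistical mechanics, which says that at equilibrium a state with free energy G is populated in proportion to exp(-G / RT). The sketch below applies it to three made-up states; the names and free energies are hypothetical, chosen only to show how strongly the lowest-energy state dominates.

```python
import math

GAS_CONSTANT = 1.987e-3  # kcal / (mol K)


def boltzmann_populations(free_energies: dict, temperature_k: float = 298.15) -> dict:
    """Return the equilibrium fraction of molecules in each state."""
    rt = GAS_CONSTANT * temperature_k
    lowest = min(free_energies.values())
    # Subtract the lowest energy first so the exponentials stay well behaved.
    weights = {name: math.exp(-(g - lowest) / rt) for name, g in free_energies.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical free energies in kcal/mol, relative to the unfolded chain.
states = {"native": -8.0, "misfolded": -5.0, "unfolded": 0.0}
for name, fraction in boltzmann_populations(states).items():
    print(f"{name}: {fraction:.4f}")
```

With these assumed numbers, a gap of only a few kcal/mol is enough for almost every molecule to sit in the native state.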
Not only did this greatly simplify the problem, but it meant that it didn’t matter how the protein was synthesised, because its folding depended only on the physical chemistry of its amino acids’ interactions, and accurate experiments could be done in test tubes. The hypothesis also suggested that proteins can refold after denaturation (when they lose their higher-order structure). It’s since been shown that some can refold alone, but others need assistance from chaperone proteins, such as chaperonins.
That was clearly not enough to solve the mechanism of protein folding. Firstly, is there a dominant factor explaining why any two proteins have different structures? There’s considerable evidence that hydrophobic interactions play a major role, but nothing is for certain.
There are also several models for how proteins fold, including the diffusion-collision model, where microdomain structures form, diffuse and collide to form larger structures, and the nucleation-condensation mechanism, where a diffuse transition state ensemble nucleates tertiary contacts. In other models, secondary structures assemble in a hierarchical order, structural units called foldons assemble step by step, or the protein searches for topomers, largely unfolded states with native-like topologies.
Scientists use diagrams all the time to convey complicated ideas. Protein folding certainly involves complicated ideas. After studies of the conformational space of protein folding, i.e. the energy landscape of all conformations a protein searches to find its native shape, scientists have concluded that proteins have funnel-shaped energy landscapes. This means there are many high-energy states and few low-energy states. Differently shaped funnels can convey rates of reaction, known as folding kinetics. Fast folding is a simple funnel; slow random searching is bumpy, like a golf course; kinetic trapping has moats and wells. The funnels demonstrate the distinction between simple classical chemical reactions, where reactant A becomes product B through a sequence of individual structures, and protein folding, where the reactant (the denatured state) is not a single structure, and instead transitions from disorder to order.
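The difference between a smooth funnel and a bumpy landscape can be sketched with a toy model. In the Python below, a "conformation" is just an integer, the native state is 0, and both energy functions are invented for illustration. The "protein" only ever steps to a neighbouring conformation with lower energy, which is a crude stand-in for real folding dynamics.

```python
def smooth_energy(x: int) -> float:
    """A simple funnel: energy falls steadily towards the native state at 0."""
    return abs(x)


def rugged_energy(x: int) -> float:
    """The same funnel with a bump at every third conformation (except 0)."""
    bump = 2.5 if (x % 3 == 0 and x != 0) else 0.0
    return abs(x) + bump


def greedy_fold(energy, start: int, max_steps: int = 1000) -> int:
    """Step to the lower-energy neighbour until no neighbour is lower."""
    x = start
    for _ in range(max_steps):
        best = min((x - 1, x + 1), key=energy)
        if energy(best) >= energy(x):
            return x  # stuck: either the native state or a kinetic trap
        x = best
    return x


print(greedy_fold(smooth_energy, 10))  # slides all the way down the funnel
print(greedy_fold(rugged_energy, 10))  # cannot get over the first bump
```

On the smooth funnel the walk ends at the native state; on the bumpy one it stalls where it started, which is the picture behind kinetic trapping.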
The search for accurate models of protein folding mechanisms is arduous. There’s a lot left to understand. Even so, our understanding has improved in leaps and bounds. A protein solves its large global optimisation problem - finding the lowest energy state - by completing a series of smaller local optimisation problems, using peptide fragments to assemble a native structure.
Remember when I said there were two ways to approach the problem? I’ve discussed the approach of model building, but now let’s consider the approach of analysing an array of protein structures and extracting patterns in how they fold.
Both ways need examples to study, so I’ll briefly mention how scientists collect examples. In X-ray crystallography, X-ray beams are fired at crystallized proteins and the way they scatter is recorded. This data is used to calculate where atoms are in the protein. Unfortunately, some proteins take months, or even years, to crystallize, so the method can be very slow. A newer method, cryo-electron microscopy, uses electron microscopy on deep-frozen proteins.
Whatever the method, experimentalists have been building a database of protein structures for many years. The Protein Data Bank (PDB) can then be used by algorithms.
DeepMind Technologies is an artificial intelligence company owned by Alphabet, Google’s parent company. In DeepMind’s paper on AlphaFold, published in Nature in 2021, they wrote that computational approaches are needed to enable large-scale structural bioinformatics. To predict folding, one could simulate the thermodynamics or kinetics of protein physics, with approximations, but high computational cost, context dependence and incomplete physical models make this very difficult for even moderate-sized proteins. The alternative, evolutionary approach analyses the evolutionary history of proteins, homology (i.e. similarity) to solved structures, and pairwise evolutionary correlations. Although the growth of the PDB and advances in genomic sequencing have helped, both approaches are unreliable, especially for proteins without solved close homologues.
Enter AlphaFold.
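In CASP, the Critical Assessment of Protein Structure Prediction, DeepMind’s AlphaFold scored around 60 out of 100 in 2018, and AlphaFold2 scored almost 90 in 2020. The score, known as the Global Distance Test, roughly measures the percentage of amino acids in a prediction that sit close to their positions in the experimentally determined structure. The target structures had been solved experimentally but not yet published, so nobody could peek at the answers.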
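CASP scores like these are usually reported as GDT_TS, which averages the percentage of amino acids that land within 1, 2, 4 and 8 angstroms of their experimental positions. Here is a simplified Python sketch. It assumes the predicted and experimental structures have already been laid on top of each other and that we are handed one distance per amino acid; the real measure also searches over many superpositions, which this toy version skips.

```python
def gdt_ts(distances_angstrom, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average, over the cutoffs, of the percentage of residues within each cutoff."""
    n = len(distances_angstrom)
    percentages = [
        100.0 * sum(d <= cutoff for d in distances_angstrom) / n
        for cutoff in cutoffs
    ]
    return sum(percentages) / len(percentages)


# Hypothetical per-residue errors (in angstroms) for a four-residue toy protein.
errors = [0.5, 1.5, 3.0, 9.0]
print(gdt_ts(errors))
```

Using several cutoffs means a prediction gets partial credit for being roughly right, while only near-perfect placements count towards the tightest cutoff.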
How does it work? AlphaFold is a neural network, a form of machine learning algorithm loosely modelled on neurons in the human brain. A network’s output depends on the strength of connections between virtual neurons - of course, the neurons are modules of a computer program, not cells. Training on a collection of data, a neural network alters these connection strengths in a process of trial and error, in order to reach a certain goal, perhaps minimising a global cost function. The network’s architecture is usually specific to the problem at hand.
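To show what "altering connection strengths to minimise a cost function" looks like at the smallest possible scale, here is a toy Python example with a single connection strength w. The data, learning rate and number of steps are all made up, and this is nothing like AlphaFold's actual architecture; it only illustrates the training loop.

```python
def train_single_weight(xs, ys, learning_rate=0.05, steps=100):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # initial connection strength
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad  # step downhill on the cost function
    return w


# Made-up training data that follows y = 2x exactly.
w = train_single_weight([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 6))
```

A real network repeats the same idea across millions of connection strengths at once.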
AlphaFold works in two steps. The first stage takes inputs, the primary amino acid sequence and aligned sequences of homologous proteins, and runs them through repeated layers of a neural network block named Evoformer, to produce an array representing processed multiple sequence alignments and an array representing residue pairs.
But what does that mean? As with other CASP programs, it starts by comparing a protein’s amino acid sequence with similar ones in the database - that refers to multiple sequence alignments, or MSAs. It looks for pairs of amino acids that tend to change together across related sequences, despite not being next to each other in the chain, which implies they are closely located in the target folded protein. Then the neural network predicts the distance between this pair of amino acids when the protein folds. A parallel network predicts the angles of joints between consecutive amino acids.
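A classic way to spot amino acids that are linked across related sequences is to measure how strongly two columns of an alignment vary together, for example with mutual information. The Python sketch below does that on a tiny invented alignment. It is only the statistical idea behind pairwise evolutionary correlations, not what AlphaFold's network computes, and the sequences are hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(msa)
    count_a = Counter(seq[col_a] for seq in msa)
    count_b = Counter(seq[col_b] for seq in msa)
    count_ab = Counter((seq[col_a], seq[col_b]) for seq in msa)
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


# Hypothetical aligned sequences: columns 0 and 4 change together (A with K, G with E),
# while columns 1 to 3 never change.
toy_msa = ["ARNDK", "GRNDE", "ARNDK", "GRNDE"]
print(mutual_information(toy_msa, 0, 4))  # co-varying pair
print(mutual_information(toy_msa, 0, 2))  # one column is fully conserved
```

A high score for two columns far apart in the chain is a hint that those positions touch in the folded protein, because a mutation in one tends to be compensated by a mutation in the other.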
However, this set of distances and angles may be physically impossible, so in the second stage, AlphaFold creates a 3D folding conformation, which is physically possible but practically random. Its initial state has rotations set at the identity and positions at the origin. This undergoes gradient descent, an optimisation method that iteratively refines this structure until it becomes as similar as possible to the predictions from stage one.
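The refinement idea can be sketched with a toy version of the problem: three "amino acids" on a line, a set of target distances standing in for the stage-one predictions, and plain gradient descent nudging the positions until the distances match. Everything here (one dimension, the targets, the learning rate, the starting positions) is an assumption for illustration; the real system works with 3D positions and rotations.

```python
def refine_positions(positions, target_distances, learning_rate=0.05, steps=200):
    """Nudge 1D positions until pairwise distances match the targets."""
    x = list(positions)
    for _ in range(steps):
        grads = [0.0] * len(x)
        for (i, j), target in target_distances.items():
            diff = x[i] - x[j]
            error = abs(diff) - target
            sign = 1.0 if diff >= 0 else -1.0
            # Derivative of (|x_i - x_j| - target)^2 with respect to x_i and x_j.
            grads[i] += 2 * error * sign
            grads[j] -= 2 * error * sign
        x = [xi - learning_rate * g for xi, g in zip(x, grads)]
    return x


# Hypothetical "stage one" predictions: neighbours 1 unit apart, ends 2 units apart.
targets = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}
# Start from a squashed, nearly collapsed chain.
refined = refine_positions([0.0, 0.1, 0.2], targets)
print([round(v, 3) for v in refined])
```

The chain starts almost collapsed and stretches out until every pair sits at its target distance.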
AlphaFold has several things going for it, not least the high degree of accuracy. Its strengths include a new output representation and associated loss that enables end-to-end structure prediction, rather than subunits of a protein that must be assembled afterwards, and an emphasis on iterative, i.e. repeated, refinement. It trained on PDB data with supervised learning then underwent self-distillation. This is where a trained network predicts the structure of more sequences, producing a new dataset, which can be incorporated into the whole dataset for a new round of training from scratch. The extra data enhances accuracy.
However, it has flaws. AlphaFold cannot yet be used for drug discovery, because its accuracy, measured as a root-mean-squared deviation between predicted and experimental atom positions, is about 1.6 angstroms, about the length of a chemical bond. For truly reliable insights into protein chemistry or drug design, scientists need atomic positions accurate to within 0.3 angstroms. AlphaFold also performs poorly for multi-protein complexes. These include ribosomes, ion channels, and polymerases, arguably some of the more interesting proteins to understand. Also, any neural network learns from existing data. This data must be produced in the first place, and patterns underrepresented in the PDB will not be picked up by the network, so it may be unable to predict structures with unrepresented folds.
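For reference, a root-mean-squared deviation is simple to compute once two structures are lined up. The Python sketch below assumes the predicted and experimental atoms are already matched one to one and superimposed, and the coordinates are invented for the example.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-squared deviation between matched 3D atom positions."""
    if len(predicted) != len(experimental):
        raise ValueError("both structures need the same number of atoms")
    squared = [
        sum((p - e) ** 2 for p, e in zip(atom_p, atom_e))
        for atom_p, atom_e in zip(predicted, experimental)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical coordinates in angstroms: the second atom is off by 2 angstroms.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(predicted, experimental))
```

Because the errors are squared before averaging, a few badly placed atoms can dominate the result.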
Even so, AlphaFold is a huge step in protein biochemistry. The predictions it makes can be used in conjunction with other methods to explore protein folding, for example by simplifying the calculations from cryo-electron microscopy. It is also a milestone in the development of practical neural networks.
Mohammed AlQuraishi, of Harvard Medical School, made a neural network in 2019, demonstrating that DeepMind is not the only one making progress in computationally-driven folding predictions. His algorithm uses a mathematical function to calculate structures in a single stage. It can predict structures in milliseconds rather than hours or days, but technical difficulties meant it performed poorly at CASP13.
In this episode I explained what proteins are, why they are important, and the orders of protein structure. Then I explored the protein folding problem - whether we can predict the 3D structure from a sequence of amino acids - and what mechanisms drive folding. There are multiple ways of determining native shape, but machine learning, and the AlphaFold neural networks in particular, have revolutionised the field, and I considered the strengths and weaknesses of different tools.
Understanding proteins is not just an interesting riddle to solve. Proteins are perhaps the most fundamental molecules in biology. They are the building blocks of all life.
If we can predict how proteins fold using only their amino acid sequence, we can discover the 3D structures of essential proteins that would otherwise take years to understand. Our knowledge of basic biology would increase tenfold, and with it, our knowledge of protein-driven diseases, catalysing rapid improvements in healthcare. Not only would there be improvements in drug creation, but also in artificial design and synthesis of proteins. In this world, we could make proteins that stimulate the immune system to fight cancer, or a universal flu vaccine, or proteins that break down plastic.
And thanks to biochemists, software engineers, and all the people working to crack the protein folding problem, we are closer to this world than ever before.
Thank you for listening.
References:
- https://www.khanacademy.org/science/biology/macromolecules/proteins-and-amino-acids/a/orders-of-protein-structure
- https://youtu.be/KpedmJdrTpY
- https://youtu.be/yhJWAdZl-Ck
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2443096/
- https://arxiv.org/abs/1306.1372
- https://cs.stackexchange.com/questions/128493/is-protein-folding-really-np-hard-and-how-to-show-that
- https://www.nature.com/articles/s41586-021-03819-2
- https://occamstypewriter.org/scurry/2020/12/02/no-deepmind-has-not-solved-protein-folding/
- https://www.nature.com/articles/d41586-019-01357-6