Protein Families Deciphered: Machine Learning Categorizes Proteins Based on Their Functions

Models like AlphaFold have made great strides in finding protein shapes, which determine their biological functions. New work separated proteins into functional families without considering their shapes.

What’s new: A team led by Maxwell L. Bileschi classified protein families using a model (called ProtCNN) and a process (called ProtREP) that used that model’s representations to address families that included fewer than 10 annotated examples. The project was a collaboration between Google, BigHat Biosciences, Cambridge University, European Molecular Biology Laboratory, Francis Crick Institute, and MIT.

Key insight: A neural network that has been trained on an existing database of proteins and their families can learn to assign a protein to a family directly. However, some families offer too few labeled examples to learn from. In such cases, an average representation of a given family’s members can provide a standard of comparison to determine whether other proteins fall into that family.

How it works: The authors trained a ResNet on a database of nearly 137 million proteins and nearly 18,000 family classifications.

  • The authors trained the model to classify proteins in roughly 13,000 families that each contained 10 or more examples.
  • Taking representations from the second-to-last layer, they averaged the representations of proteins in each family.
  • At inference, they compared an input protein’s representation with each family’s average representation. They chose the family whose average matched most closely according to cosine similarity.
  • In addition, they built an ensemble of 19 trained ResNets that determined classifications by majority vote.

Results: The ensemble model achieved accuracy of 99.8 percent, higher than both comparing representations (99.2 percent) and the popular method known as BLASTp (98.3 percent). When classifying members of low-resource families, the representation-comparison method achieved 85.1 percent accuracy. Applying the ensemble to unlabeled proteins increased the number of labeled proteins in the database by nearly 10 percent — more than the number of annotations added to the database over the past decade.

Why it matters: New problems don’t always require new methods. Many unsolved problems — in biology and beyond — may yield to well established machine learning approaches such as few-shot learning techniques.

We’re thinking: Young people, especially, ought to appreciate this work. After all, it’s pro-teen.