Life’s instruction book is written in a genetic code, yet its origins have stayed stubbornly unclear. A new study looks backward through protein history and finds that tiny two-amino-acid units called dipeptides carry a surprising amount of that story.
Dipeptides are simple pairs, but their patterns across species turn out to line up with how the genetic code likely expanded.
The team examined 4.3 billion dipeptides across 1,561 complete protein sets, also known as proteomes, to reconstruct a timeline of early protein features.
The work was led by Gustavo Caetano-Anollés at the University of Illinois Urbana-Champaign (UIUC). He and colleagues linked these protein patterns to known steps in the translation system that turns genes into proteins.
A dipeptide is just two amino acids linked by a peptide bond. That sounds basic, yet how often different dipeptides show up through evolution can reveal what kinds of building blocks were favored when proteins first took shape.
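To make the idea of dipeptide frequencies concrete, here is a minimal Python sketch that counts overlapping two-residue windows in protein sequences written in one-letter code. The function names and the toy sequences are illustrative assumptions, not the study's actual pipeline or data.

```python
from collections import Counter


def dipeptide_counts(sequence):
    """Count overlapping two-residue windows in a one-letter protein sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def dipeptide_frequencies(sequences):
    """Pool counts over many sequences and normalize them to fractions."""
    total = Counter()
    for seq in sequences:
        total.update(dipeptide_counts(seq))
    n = sum(total.values())
    return {dp: count / n for dp, count in total.items()} if n else {}


# Made-up sequences standing in for a proteome (assumption, not real data).
toy_proteome = ["MLSLLS", "SLLSM"]
freqs = dipeptide_frequencies(toy_proteome)

for dp, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(dp, round(freq, 3))
```

A sequence of length n contributes n - 1 overlapping dipeptides, which is how billions of them can accumulate across many proteomes.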
Proteins do not build themselves from DNA alone. Enzymes called aminoacyl tRNA synthetases connect each amino acid to its matching transfer RNA (tRNA), and the ribosome pairs each tRNA’s three-base anticodon with the corresponding codon on messenger RNA to assemble a chain in the correct order.
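The base-pairing logic behind anticodon matching can be sketched in a few lines of Python. This is a simplified illustration that assumes strict Watson-Crick pairing and ignores wobble pairing and modified bases; the function name is hypothetical.

```python
# Strict Watson-Crick RNA pairing only (assumption: no wobble, no modified bases).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon_for(codon):
    """Return the anticodon (written 5'->3') that pairs with an mRNA codon (5'->3')."""
    codon = codon.upper()
    if len(codon) != 3 or any(base not in COMPLEMENT for base in codon):
        raise ValueError("codon must be three RNA bases from A, C, G, U")
    # The two strands pair antiparallel, so reverse the codon before complementing.
    return "".join(COMPLEMENT[base] for base in reversed(codon))


print(anticodon_for("AUG"))  # the start codon, which pairs with anticodon CAU
```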
The team compared three independent timelines. First, they built a phylogenetic tree from dipeptide frequencies across the tree of life.
Second, they mapped those results onto a tree of protein domain structures, which act like reusable parts of proteins.
Finally, the experts checked whether this protein picture fits what we know about tRNA history. They found that the order in which amino acids became part of the code matches the order in which dipeptides rose in abundance.
Many dipeptides have a mirror partner made by reversing the order of the amino acids. The team saw a repeated synchronicity, where a dipeptide and its reverse often appear near the same time on the evolutionary timeline.
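The pairing of a dipeptide with its reverse is easy to express in code. The sketch below enumerates every dipeptide and its mirror partner, then measures how far apart the two sit on a timeline. The age values are invented for illustration on an assumed 0-to-1 scale and do not come from the study.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def reverse_pairs():
    """List each dipeptide with its reversed partner, skipping self-mirrors like 'AA'."""
    pairs = set()
    for a, b in product(AMINO_ACIDS, repeat=2):
        if a != b:
            pairs.add(tuple(sorted((a + b, b + a))))
    return sorted(pairs)


def pair_gaps(ages):
    """Absolute age difference for every pair where both members have an age."""
    return {
        (x, y): abs(ages[x] - ages[y])
        for x, y in reverse_pairs()
        if x in ages and y in ages
    }


# Hypothetical relative ages (0 = oldest, 1 = youngest); not real results.
toy_ages = {"LS": 0.05, "SL": 0.07, "AG": 0.30, "GA": 0.32, "WC": 0.90}
gaps = pair_gaps(toy_ages)

for pair, gap in gaps.items():
    print(pair, round(gap, 2))
```

Of the 400 possible dipeptides, 20 read the same in both directions, which leaves 190 distinct mirror pairs. Small gaps within those pairs would correspond to the synchronicity described above.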
“The synchronous appearance of dipeptide-antidipeptide sequences along the dipeptide chronology supported an ancestral duality of bidirectional coding operating at the proteome level,” wrote Caetano-Anollés.
Biochemists have found evidence that, before the modern genetic code was fully specified in the anticodon loop, an earlier set of rules operated in the acceptor stem of tRNA. That older set is often called an operational code, and it governs how synthetases recognize and charge tRNAs.
In the new work, dipeptide patterns point to the same early phases, with amino acids like leucine and serine dominating the earliest stage, then others joining as specificity improved.
This two-phase picture supports a longstanding hypothesis that the two classes of synthetase enzymes may have roots on complementary strands of the same ancestral gene. That would naturally produce paired signals, which fits the twin rise of dipeptide and anti-dipeptide partners.
Minimal enzyme cores for synthetases, often called urzymes, can catalyze the key activation step for amino acids even when pared down to about a hundred residues.
These minimal systems help make sense of an early world where proteins and RNAs cooperated with fewer parts.
The new dipeptide chronology suggests that early proteins borrowed recurring two-amino-acid patterns that supported folding and catalysis, then diversified.
The authors also mapped when heat-related protein features show up. They report that predictors of thermostability tend to appear later in the dipeptide timeline, which implies that early proteins formed in milder settings rather than boiling extremes.
One widely cited analysis found that the combined abundance of seven amino acids, the IVYWREL set, correlates with the optimal growth temperature of many microbes.
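The quantity behind that correlation is simply the fraction of residues drawn from the seven-letter set. A minimal sketch of how it could be computed follows; the function name is an assumption, and the code deliberately stops at the fraction rather than guessing at any fitted temperature formula.

```python
IVYWREL = set("IVYWREL")  # Ile, Val, Tyr, Trp, Arg, Glu, Leu


def ivywrel_fraction(sequence):
    """Fraction of residues in a one-letter protein sequence that fall in the IVYWREL set."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(1 for residue in seq if residue in IVYWREL) / len(seq)


# Toy sequence for illustration only.
print(ivywrel_fraction("AIAV"))
```

In practice the fraction would be pooled over a whole proteome and compared across organisms with different growth temperatures.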
A 2021 tool that classifies thermophilic proteins using dipeptide propensities also points toward later development of heat-tolerant patterns.
If early protein structure favored certain dipeptides, those patterns can guide today’s protein engineering.
This suggests that some building blocks are historically robust, which is useful when you are trying to design enzymes that work reliably in living cells.
The research also offers a fresh cross-check for genome annotation and bioinformatics pipelines. Features that are old, widely shared, and tied to folding constraints tend to be more portable across organisms.
Do dipeptide frequencies really carry deep time signals, or are they mostly shaped by modern lifestyle and ecology?
The team reduced the influence of lifestyle and ecology by sampling widely across Archaea, Bacteria, and Eukarya, then asking whether three different datasets converge on the same story.
Could the mirrored rise of dipeptide pairs come from codon quirks rather than ancient bidirectional genes? Perhaps in part, but the link to complementary synthetase classes, plus prior evidence for operational code rules, argues that the signal is not a statistical fluke.
Protein domain families show distinct dipeptide fingerprints. That can be turned into predictive models that flag when a new protein likely belongs to a particular fold, even before a good 3D structure is available.
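As a rough illustration of how a dipeptide fingerprint could feed a predictive model, the sketch below compares a query sequence with reference sequences using cosine similarity and picks the closest match. This is a toy nearest-match classifier under stated assumptions, not the method used in the study; the family names and sequences are invented.

```python
import math
from collections import Counter


def fingerprint(sequence):
    """Normalized dipeptide frequency profile of a one-letter protein sequence."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {dp: c / total for dp, c in counts.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest_family(query, references):
    """Name of the reference whose fingerprint is most similar to the query's."""
    q = fingerprint(query)
    return max(references, key=lambda name: cosine(q, fingerprint(references[name])))


# Hypothetical reference sequences, one per made-up family.
toy_references = {
    "family_A": "LSLSLSLSLS",
    "family_B": "GAGAGAGAGA",
}
best = closest_family("SLSLSL", toy_references)
print(best)
```

A real model would train on many sequences per fold and use a proper classifier, but the input representation, a 400-dimensional dipeptide profile, would look much like this.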
Synthetic biology can also test specific dipeptide swaps forecast by the timeline. If a swap at a key site changes stability or activity as expected, that will keep sharpening the map of which ancient patterns still matter in modern proteins.
The study is published in the Journal of Molecular Biology.