Although the sequencing of the human genome was heralded as a breakthrough when it was announced to the world in 2001, the truth was that large sections of chromosomes had not actually been sequenced in full at that time. Twenty years later, and with the help of improved technology for reading longer sections of DNA with high accuracy, scientists have finally been able to fill in the gaps in all 23 chromosomes. The results, published today in the journal Science, reveal hitherto unknown secrets about human evolution and ancestry.
When the current study began, not a single human chromosome had been fully sequenced. The missing information lay mostly in the regions around each centromere, the site where a constriction divides the chromosome into two “arms.” The DNA in this region is characterized by multiple, repeating sequences of code that are difficult to tell apart, or to join up correctly, when read in short sections. Thanks to new technology that allows ultra-long sections of DNA to be sequenced, it has now been possible to establish the genetic code of all the human chromosomes, from end to end.
“Uncovering the complete sequence of these formerly missing regions of the genome told us so much about how they’re organized, which was totally unknown for many chromosomes,” said Nicolas Altemose, a postdoctoral fellow at the University of California, Berkeley, and a co-author of four new papers about the completed genome. “Before, we just had the blurriest picture of what was there, and now it’s crystal clear down to single base pair resolution.”
Altemose is first author of one paper that describes the base pair sequences around the centromere. A paper explaining the methods used for the sequencing will appear in the April 1 print edition of the journal Science, while Altemose’s centromere paper and four others describing what the new sequences tell us are summarized in the journal, with the full papers posted online. Four companion papers, including one for which Altemose is co-first author, will also appear online April 1 in the journal Nature Methods.
Until the completion of this study, the reference human genome, known as GRCh38, was used by doctors when searching for mutations linked to disease, as well as by scientists looking at the evolution of human genetic variation. However, large gaps remained unsequenced, not only in the area of the centromeres, but also near the telomeres – the terminal ends of each chromosome – and around the genes that code for ribosomal proteins. This meant that not all variations of human genes had been identified, limiting the understanding of genetic diseases.
The newly completed human genome, referred to as T2T-CHM13, was sequenced and analyzed by a team of more than 100 scientists under the auspices of the Telomere-to-Telomere Consortium, or T2T, named for the telomeres that cap the ends of all chromosomes. The consortium’s gapless version of all 22 autosomes and the X sex chromosome identifies 3.055 billion base pairs, or nucleotide pairs, the units from which chromosomes and genes are built. These base pairs form the genetic “code” that determines the activities of cells and is passed on to future generations.
Within the 3.055 billion base pairs, the team also identified 19,969 protein-coding genes. Of these, about 2,000 were new, but most of them were no longer functional. In total, around 115 of the newly identified genes were probably functional. The T2T team also found about 2 million additional gene variants in the human genome, 622 of which occur in medically relevant genes. These will help doctors to diagnose genetically based illnesses.
“In the future, when someone has their genome sequenced, we will be able to identify all of the variants in their DNA and use that information to better guide their health care,” said Adam Phillippy, one of the leaders of T2T and a senior investigator at the National Human Genome Research Institute (NHGRI) of the National Institutes of Health. “Truly finishing the human genome sequence was like putting on a new pair of glasses. Now that we can clearly see everything, we are one step closer to understanding what it all means.”
DNA sequences around the centromere hint at evolutionary past
One area that has brought new understanding is the genetic sequences around the centromere region. The T2T team used the new sequencing techniques to locate the place within the centromere where the kinetochore is found. This is a disc-shaped protein structure that is crucial when duplicated chromosomes are pulled apart into “daughter” cells during the process of cell division. When the chromosomes are separated successfully, this ensures that each daughter cell has a complete set of chromosomes.
“When this goes wrong, you end up with missegregated chromosomes, and that leads to all kinds of problems,” said Altemose. “If that happens in meiosis, that means you can have chromosomal anomalies leading to spontaneous miscarriage or congenital diseases. If it happens in somatic cells, you can end up with cancer – basically, cells that have massive misregulation.”
What the researchers found in and around the centromeres were layers of new nucleotide sequences overlaying layers of older sequences as if, through evolutionary time, new centromere regions have developed repeatedly to bind to the kinetochore. The older regions are characterized by having more mutations and deletions, indicating that they are not functional and are no longer used by the cell. The newer sequences where the kinetochore binds, however, are much less variable, and more functional.
Protein-DNA interactions are critical
Altemose is also interested in finding and understanding the areas in the chromosomes where proteins interact with DNA. “Without proteins, DNA is nothing,” said Altemose, who earned a Ph.D. in bioengineering jointly from UC Berkeley and UC San Francisco in 2021, after having received a D.Phil. in statistics from Oxford University.
“DNA is a set of instructions with no one to read it if it doesn’t have proteins around to organize it, regulate it, repair it when it’s damaged and replicate it,” said Altemose. “Protein-DNA interactions are really where all the action is happening for genome regulation, and being able to map where certain proteins bind to the genome is really important for understanding their function.”
The DNA analysis that led to the complete sequencing of the 22 autosomal chromosomes and an X chromosome was conducted on one individual’s genome only. Subsequently, the researchers also sequenced a complete Y chromosome from another individual, a project that took almost as long as sequencing the rest of the genome, Altemose said. The analysis of this new Y chromosome sequence will appear in a future publication.
DNA sequences used to trace human lineages
Altemose and his team, which included UC Berkeley project scientist Sasha Langley, also used the new reference genome, T2T-CHM13, to compare the centromeric DNA of 1,600 individuals from around the world. The results revealed major differences in both the sequence and copy number of repetitive DNA around the centromeres. Previous studies have shown that when groups of ancient humans migrated out of Africa to the rest of the world, they took only a small sample of genetic variants with them. Altemose and his team confirmed that this pattern is also seen in the variation of nucleotide sequences around the centromeres.
“What we found is that in individuals with recent ancestry outside the African continent, their centromeres, at least on chromosome X, tend to fall into two big clusters, while most of the interesting variation is in individuals who have recent African ancestry,” Altemose said. “This isn’t entirely a surprise, given what we know about the rest of the genome. But what it suggests is that if we want to look at the interesting variation in these centromeric regions, we really need to have a focused effort to sequence more African genomes and do complete telomere-to-telomere sequence assembly.”
Altemose also noted that DNA sequences around the centromere could possibly be used to trace human lineages back to our common ape ancestors. “As you move away from the site of the active centromere, you get more and more degraded sequence, to the point where if you go out to the furthest shores of this sea of repetitive sequences, you start to see the ancient centromere that, perhaps, our distant primate ancestors used to bind to the kinetochore,” Altemose said. “It’s almost like layers of fossils.”
The important role of ultra-long-read nanopore sequencing
The success of the T2T team’s sequencing project is due to improved techniques for sequencing long, continuous stretches of DNA, which help when determining the order of highly repetitive stretches of chromosomes. Among these is PacBio’s HiFi sequencing, which can read lengths of more than 20,000 base pairs with high accuracy.
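To see why read length matters for repetitive DNA, consider a minimal Python sketch. It is a toy illustration only, not the consortium’s assembly method: the function names, the 20-base repeat unit, the 1,000-base “genome” and the error-free reads are all assumptions made for this example. It measures what fraction of reads of a given length have a sequence that occurs only once, and so could be placed unambiguously.

```python
import random
from collections import Counter


def build_toy_genome(seed: int = 0) -> str:
    """Return a 1,000-base toy sequence: random flank + tandem repeat + random flank."""
    rng = random.Random(seed)

    def random_bases(n: int) -> str:
        return "".join(rng.choice("ACGT") for _ in range(n))

    repeat_unit = random_bases(20)   # hypothetical 20-base repeat unit
    left_flank = random_bases(200)
    right_flank = random_bases(200)
    # 30 copies of the unit form a 600-base repeat array between the flanks.
    return left_flank + repeat_unit * 30 + right_flank


def unique_read_fraction(genome: str, read_len: int) -> float:
    """Fraction of error-free reads of length read_len whose sequence occurs exactly once."""
    # Take one read starting at every possible position in the genome.
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for read in reads if counts[read] == 1) / len(reads)


if __name__ == "__main__":
    genome = build_toy_genome()
    for read_len in (50, 700):
        frac = unique_read_fraction(genome, read_len)
        print(f"read length {read_len}: {frac:.0%} of reads have a unique sequence")
```

In this toy genome, every 50-base read that falls wholly inside the 600-base repeat array is identical to at least one other read, so its position cannot be worked out from its sequence alone. A 700-base read is longer than the whole array and always includes flanking sequence that anchors it in place.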
“These new long-read DNA sequencing technologies are just incredible; they’re such game changers, not only for this repetitive DNA world, but because they allow you to sequence single, long molecules of DNA,” Altemose said. “You can begin to ask questions at a level of resolution that just wasn’t possible before, not even with short-read sequencing methods.”
Altemose plans to explore the centromeric regions further in future research, using an improved technique he and colleagues at Stanford developed to locate the sites on the chromosome that are bound by proteins, similar to how the kinetochore binds to the centromere. Meanwhile, the T2T consortium will be working on developing a reference genome that represents all of humanity, rather than just one individual.
It is envisaged that the information from the fully sequenced human genome will contribute to our understanding of chromosome functioning, human disease and genomic variation, as well as giving us clues to patterns of human evolution and movement around the planet.