In early 2020, a few months after the start of the Covid-19 pandemic, scientists were able to sequence the complete genome of SARS-CoV-2, the virus responsible for the infection. While many of its genes were already known at that time, the full complement of protein-coding genes was not resolved.
Now, after performing an extensive comparative genomics study, researchers at MIT have generated what they describe as the most accurate and complete gene annotation of the SARS-CoV-2 genome. In their study, which appears today in Nature Communications, they confirmed several protein-coding genes and found that a few others that had been suggested as genes do not code for any protein.
“We were able to use this powerful comparative genomics approach for evolutionary signatures to uncover the true functional protein-coding content of this extremely important genome,” says Manolis Kellis, senior author of the study and a professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) as well as a member of the Broad Institute of MIT and Harvard.
The research team also analyzed nearly 2,000 mutations that have arisen in different isolates of SARS-CoV-2 since it began infecting humans, allowing them to assess how important these mutations may be in changing the virus’s ability to evade the immune system or become more infectious.
The SARS-CoV-2 genome comprises nearly 30,000 RNA bases. Scientists have identified several regions known to encode protein-coding genes, based on their similarity to protein-coding genes found in related viruses. A few other regions were suspected of encoding proteins, but they had not been definitively classified as protein-coding genes.
To determine which parts of the SARS-CoV-2 genome actually contain genes, the researchers performed a type of study known as comparative genomics, in which they compared the genomes of similar viruses. The SARS-CoV-2 virus belongs to a subgenus of viruses called Sarbecovirus, most of which infect bats. The researchers carried out their analysis on SARS-CoV-2, SARS-CoV (which caused the SARS outbreak in 2003), and 42 strains of bat sarbecoviruses.
Kellis previously developed computational techniques to do this type of analysis, which his team also used to compare the human genome with the genomes of other mammals. The techniques are based on the analysis of the conservation of certain DNA or RNA bases between species and on the comparison of their evolutionary patterns over time.
Using these techniques, the researchers confirmed six protein-coding genes in the SARS-CoV-2 genome in addition to the five that are well established in all coronaviruses. They also determined that the region that encodes a gene called ORF3a also encodes an additional gene, which they named ORF3c. The gene has RNA bases that overlap with ORF3a but occur in a different reading frame. This kind of gene within a gene is rare in large genomes, but common in many viruses, whose genomes are under selective pressure to remain compact. The role of this new gene, as well as several other genes in SARS-CoV-2, is not yet known.
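To make the idea of overlapping reading frames concrete, here is a minimal Python sketch. It is an illustration only: the 18-base sequence is an invented toy, not part of the SARS-CoV-2 genome, and the helper name `translate` is hypothetical. The same bases are read in two frames, one base apart, and give two different amino acid strings.

```python
from itertools import product

# Standard genetic code, built in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with U
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str, frame: int) -> str:
    """Translate `rna` starting at offset `frame` (0, 1, or 2).

    Stop codons are shown as '*'; an incomplete trailing codon is dropped.
    """
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


# Invented toy sequence (an assumption for illustration, not viral data).
toy_rna = "AUGGAUGGCCUAAGUUGA"

print(translate(toy_rna, 0))  # MDGLS*  (one gene, read from the first base)
print(translate(toy_rna, 1))  # WMA*V   (contains a second, nested "MA*")
```

In frame 0 the toy sequence reads as a single open reading frame. Shifted by one base, the same stretch contains a separate start codon (M) and stop codon (*), which is the sense in which one gene can sit inside another.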
The researchers also showed that five other regions that had been proposed as possible genes do not code for functional proteins, and they ruled out the possibility that any more conserved protein-coding genes remain to be discovered.
“We have analyzed the entire genome and are convinced that there are no other conserved protein-coding genes,” says Irwin Jungreis, lead author of the study and a research scientist at CSAIL. “Experimental studies are needed to understand the functions of the uncharacterized genes, and by determining which ones are real, we allow other researchers to focus their attention on those genes rather than spending their time on something that doesn’t even get translated into protein.”
The researchers also recognized that many previous papers used not only incorrect gene sets, but sometimes also conflicting gene names. To remedy the situation, they brought the SARS-CoV-2 community together and presented a set of recommendations for naming SARS-CoV-2 genes, in a separate paper published a few weeks ago in Virology.
In the new study, the researchers also analyzed more than 1,800 mutations that have occurred in SARS-CoV-2 since it was first identified. For each gene, they compared how quickly that particular gene has evolved in the past with how it has evolved since the start of the current pandemic.
They found that in most cases, genes that evolved rapidly for long periods of time before the current pandemic have continued to do so, and those that tended to evolve slowly have maintained that trend. However, the researchers also identified exceptions to these patterns, which could shed light on how the virus has evolved as it adapted to its new human host, Kellis says.
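The comparison described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the statistic used in the study: the gene names, rates, and counts are made-up placeholders, and the fold-change threshold and the names `GeneRates` and `flag_accelerated` are hypothetical. Each gene's historical rate and its recent mutation density are scaled by the average across genes, and a gene is flagged when it has recently been changing much faster than its history would suggest.

```python
from dataclasses import dataclass


@dataclass
class GeneRates:
    name: str
    historical_rate: float  # long-term rate across related viruses (toy units)
    recent_mutations: int   # mutations observed since the pandemic began
    length: int             # gene length in bases


def flag_accelerated(genes: list[GeneRates], fold: float = 2.0) -> list[str]:
    """Return genes whose recent pace exceeds their historical pace by `fold`.

    Both measures are divided by their mean over all genes so that they are
    on a comparable, unitless scale before the ratio is taken.
    """
    recent = [g.recent_mutations / g.length for g in genes]
    mean_recent = sum(recent) / len(recent)
    mean_hist = sum(g.historical_rate for g in genes) / len(genes)

    flagged = []
    for gene, density in zip(genes, recent):
        relative_recent = density / mean_recent
        relative_hist = gene.historical_rate / mean_hist
        if relative_recent / relative_hist > fold:
            flagged.append(gene.name)
    return flagged


# Made-up placeholder values, not measurements from the study.
toy_genes = [
    GeneRates("gene_A", historical_rate=1.0, recent_mutations=10, length=1000),
    GeneRates("gene_B", historical_rate=0.5, recent_mutations=30, length=1000),
    GeneRates("gene_C", historical_rate=1.5, recent_mutations=15, length=1000),
]

print(flag_accelerated(toy_genes))  # ['gene_B']
```

Here gene_B is historically the slowest of the three but has the most recent mutations, so it is the only one flagged as an exception to its past pattern.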
In one example, the researchers identified a region of the nucleocapsid protein, which surrounds the viral genetic material, that showed many more mutations than expected from its historical evolutionary patterns. This protein region is also classified as a target of human B cells. Mutations in this region may therefore help the virus evade the human immune system, Kellis says.
“The most accelerated region of the entire SARS-CoV-2 genome is in the middle of this nucleocapsid protein,” he says. “We assume that variants that do not mutate this region are recognized by the human immune system and eliminated, whereas variants that randomly accumulate mutations in this region are in fact better able to evade the human immune system and remain in circulation.”
The researchers also analyzed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. Most of the mutations that make these variants more dangerous are in the spike protein, and help the virus spread faster and evade the immune system. However, each of these variants also carries other mutations.
“Each of these variants has over 20 other mutations, and it’s important to know which of them are likely to do something and which are not,” says Jungreis. “So we used our comparative genomics evidence to obtain a first-pass estimate of which of these mutations are likely to be important, based on which ones fall in conserved positions.”
The data could help other scientists focus their attention on the mutations that seem most likely to have significant effects on the infectivity of the virus, the researchers say. They have made the annotated gene set and their mutation classifications available in the University of California, Santa Cruz Genome Browser for other researchers to use.
“We can now go and study the evolutionary context of these variants and understand how the current pandemic fits into this larger story,” says Kellis. “For strains that have many mutations, we can see which of those mutations are likely to be host-specific adaptations, and which mutations are perhaps nothing to write home about.”
The research was funded by the National Human Genome Research Institute and the National Institutes of Health. Rachel Sealfon, a research scientist at the Flatiron Institute Center for Computational Biology, is also an author of the paper.