Interview with David Paez-Espino (JGI): Identifying novel viruses on Earth and in the human body

June 12th, 2018 by Amy Proal

Hey readers!

Viruses are the most abundant life form on the planet

This blog and my peer-reviewed papers explore how the human microbiome can impact chronic inflammatory disease processes. It is well understood that an extensive microbiome persists in the gut (hundreds of trillions of microbial cells in total!). In addition, extensive microbiome communities have been identified in every other human body site – from the brain, to the liver, to the lungs, to the bladder, to the placenta and beyond. We now realize that no part of the human body is sterile. This includes human blood, which is now understood to contain a microbiome (even in healthy individuals).

However, it’s very important to understand that even when all this new human microbiome knowledge is taken together, the global research community has still not yet identified, characterized or studied many of the microbes and viruses in the human body. In fact, almost every new human microbiome study turns up new species of bacteria, viruses and fungi that were previously unknown to science. For example, just a few months ago, Stephen Quake identified thousands of species of bacteria, viruses and fungi in human blood that were simply not known to exist before the study was performed.

Viruses are the most abundant life forms on the planet, but we know particularly little about human viruses. In other words, there are an incredible number of viruses in the human body that have not yet been identified. This is especially true of bacteriophages (viruses that infect bacteria). A recent study estimated that 31 billion bacteriophages traffic the body on a daily basis. The human body also harbors double-stranded DNA viruses and RNA viruses. Most of us are familiar with double-stranded DNA viruses like the herpes viruses. We are also familiar with RNA viruses like polio, measles and influenza.

But these well-known RNA/DNA viruses comprise just a tiny fraction of all the viruses capable of persisting in the human body. Many other related viruses are still in the process of being discovered!

It follows that we cannot study human chronic inflammatory disease without understanding that viruses we have not yet identified may play a role in many human disease processes. To do so would be like going to the rainforest, studying only 2% of the animals, and coming to conclusions about how the entire rainforest functions off that information alone.

Despite this fact, many doctors have been taught to test for only 10-20 well-known viruses in their human patients. If these viruses are not identified in the patient, it is assumed that a virus (or group of viruses) cannot be driving or contributing to the patient’s disease. We must work hard to change this assumption, because it keeps the medical/research communities from looking at a much broader picture of what might be going on.

David Paez-Espino

In order to best make this point, I interviewed David Paez-Espino. David works at the Joint Genome Institute in Walnut Creek, California, in the Department of Microbial Genomics and Metagenomics. He and his colleagues have created new, complex computational technologies that help identify both novel and known viruses. In fact, they have started a project called “Uncovering the Earth’s Virome” (viral communities). This project searches for novel viruses not just in the human body, but in ecosystems across the planet – including the ocean, the soil, the air, and in other animals. These viral sequences are stored in the world’s largest public viral database: IMG/VR.

Over just the past few years, Paez-Espino and team have identified and sequenced many, many new viruses. In a 2016 Nature paper, they unveiled more than 125,000 partial and complete viral genomes, which boosted the number of known viral genes in the world by 16-fold. By the time they published a follow-up paper in 2017, the contents of the IMG/VR database had doubled from the figure referenced in 2016. By January 2018 the database had tripled in size. At that point Paez-Espino stated the following:

“Among the million viral sequences predicted, we have now identified over 34,000 of them targeting several microbial taxa for the very first time, we’ve associated new viruses to known microbial genomes (culturable and unculturable), and the vast majority of the gene content (over 21 million genes in total) remains hypothetical or unknown, meaning that there is tremendous potential for new discoveries out of that gene pool.”


In this interview, David and I talk more about these findings. We talk about how his team has identified all kinds of viral entities and how they focus on the identification of less-studied viral groups, e.g. giant viruses, and virophages (viruses that co-infect eukaryotic cells along with giant viruses). These giant viruses were not previously known to persist in the human body. We also talk about how the study of novel viruses holds incredible potential to improve our understanding of chronic and rare inflammatory conditions.

Here is a short video clip of part of our conversation. The full interview is below and has been edited for accuracy:

Amy: Tell me about your job at JGI and what projects you’re working on now.

David: Over the past four years here at JGI I’ve started working on a project called “Uncovering the Earth’s Virome” – a very ambitious name, I know. We are trying to get a sense of what viruses are out there. To do that we are generating a comprehensive list or viral catalog of viruses from samples everywhere on Earth. I work in the Microbiome Data Science group (led by Dr. Nikos Kyrpides), where we look for novel or known viruses in environmental samples…all kinds of samples: we have metagenomes, we have metatranscriptomes. We also have some isolates. And we don’t just look for viruses – we also try to link the viruses to their specific habitats and to their specific hosts. Everything is computational. But then we try to validate our findings by working with other groups here at Berkeley or beyond.

Amy: Where do you get your initial viral data from (that you then use to identify novel viruses)?

David: The JGI sequences many of these samples through collaboration projects (CSPs and FICUS programs). But we also have part of our public repository database (called IMG/M) represented by other groups’ data. We developed a virus prediction pipeline to identify viral sequences directly out of those samples, and have created IMG/VR (the largest virus repository integrated by isolates and microbiome-derived viral sequences). And everything is public. We are mining millions and millions and millions of nucleotides. So we have terabytes of sequences. And this collection of viral data in IMG/VR is very diverse – it includes marine/aquatic samples, terrestrial samples (like soil), human microbiome samples, and even microbiomes from mice, ruminants, air, arthropods etc.

Amy: So let’s say I sent you a blood sample. When it comes to identifying new viruses in this sample: What is your technology able to identify versus what a technique like PCR could identify?

David: If you’re using PCR to identify viruses you’re using an approach where you already know what you’re looking for (you are biased towards viruses that are fully characterized and have been entered in known databases). So PCR is useful if you have a well-known virus already understood to be a human health threat. For example, we could identify Zika virus that way.  

But in our case, we have different approaches. The first is a general global viral discovery pipeline. It does not only rely on genes from known viruses. Instead, we use the genes from known viruses to identify novel viruses from environmental samples: think of these novel viruses as, let’s say, the cousins of viruses already in a known database (their genomes are similar to known viruses but with a certain degree of difference). It’s a homology-based pipeline, but one that’s more sophisticated than the usual BLAST approach. 

A T4 bacteriophage

Our second approach is to target very specific groups of viruses that are unique. These include specific families from single-stranded DNA viruses, double-stranded DNA viruses, single-stranded RNA viruses etc. In fact, we have developed specific tools that can identify hallmark genes associated with many types of viruses. These even include giant viruses. Giant viruses are eukaryotic viruses that have long genomes (sometimes 1 megabase, 2 megabases, even almost 3 megabases in size!). They also have very unique capsid proteins that can’t be found elsewhere. So we can develop and use these unique viral characteristics to create specific models for identifying similar viruses. This same process of identification goes for virophages.
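As a rough illustration of the “cousins of known viruses” idea David describes in his first approach, here is a toy Python sketch. It is not the JGI pipeline: it simply compares the short subsequences (k-mers) of a query against a set of reference sequences and labels the query by how similar the best match is. The function names, the reference sequence and the cutoff values are all assumptions made up for this example.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of a DNA sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(query, references, k=4, known_cutoff=0.9, relative_cutoff=0.3):
    """Label a query by its best k-mer similarity to any reference sequence.

    The cutoffs are arbitrary illustrative values, not thresholds used by
    IMG/VR or any published pipeline.
    """
    q = kmers(query, k)
    best_name, best_score = None, 0.0
    for name, ref in references.items():
        score = jaccard(q, kmers(ref, k))
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= known_cutoff:
        label = "known"
    elif best_score >= relative_cutoff:
        label = "novel relative"
    else:
        label = "no match"
    return label, best_name, best_score


# Hypothetical reference "database" with a single made-up sequence.
references = {"ref_virus_1": "ACGTTGCAAGCTTAGGCATC"}

# Same sequence with one substitution: similar, but not identical.
label, name, score = classify("ACGTTGCAATCTTAGGCATC", references)
```

A query identical to a reference is labelled “known”, a query with some differences lands in the “novel relative” band, and a query sharing no k-mers is reported as “no match”. Real homology searches work on protein families and statistical models rather than raw k-mer overlap, so treat this only as a picture of the general idea.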

Amy: What exactly is a virophage?

David: A virophage is a virus that co-infects a eukaryotic cell along with a giant virus. In the majority of cases the virophage is a parasite of a giant virus. But in some cases it seems that virophages only use part of the giant virus machinery to replicate within the final host (a eukaryotic cell that is often an alga or protozoan, as currently known). The infectious process is kind of like a Russian doll. There are many different levels of infection.

Amy: Where are you finding these giant viruses and virophages?

David: They seem to be everywhere. We are submitting a paper in which we found virophages in all kinds of environments including the human gut. Finding them in the gut was something people had speculated about but we finally proved it is true. Because most people don’t have the right tools to mine for virophages, or they don’t have the right tools to mine the database, or they don’t have samples from the right habitats or styles of life in the database. So a luxury we have here at JGI is that we can apply novel tools to a very wide variety of different samples – and we can pretty much find everything everywhere:)

Amy: Yes. It seems very important to stay up to date on the latest technologies for viral identification. Because if you don’t know HOW to look for a new virus you probably won’t find it.

David: Yes, if you only study already known viruses you may be missing part of a larger unexplored picture.

Amy: I’m interested in identifying novel viruses that might play a role in the disease ME/CFS. Let’s say I had blood samples from patients with ME/CFS. Can I send them to you so that you can use your technology to search for new viruses in the samples? How would that kind of collaboration work?

David: We have two ways to collaborate. First we have official calls for research. About 40-60% of those projects get accepted. But those projects are more along the lines of the DOE (energy, carbon cycling, and environmental viruses rather than human viruses). 

But for identification of novel human viruses: We are developing new tools for the discovery of RNA and DNA viruses. And we need validation of these tools. So we want to apply them to all kinds of samples as opposed to just those from the environment. Also because everything is connected we sometimes find environmental viruses in humans and vice versa. So if you’re interested in human viral discovery we don’t have official calls for that research. But you can talk to me and say, “Hey I have these samples and I’d like to know if your RNA viral pipeline could work for my samples.” Then I can talk to my boss, and if we all have this common interest or a common outcome (a paper, a patent) we can hopefully find a way to collaborate.

Amy: What are some of the most interesting viruses you’ve identified in the human body? Do you mostly have gut microbiome samples?

David: So as I mentioned before, all the data we get is deposited in the interactive IMG/VR database. On the database website you can access the habitat of the samples to see where our human viral samples come from. At the moment IMG/VR includes viruses found in five different human body subtypes: skin, urogenital, salivary, gut, and the respiratory tract. 

One interesting thing we’ve found is that viruses are very specific to their habitats. That means if you find a virus in a marine environment it’s very unusual to find it somewhere else. You can find it across the globe in different oceans but, if it’s a marine virus, it’s a marine virus. The same is true of human viruses. If you have a human virus it’s probably going to be only found in the human body. And if it’s a human gut virus, only in the human gut. So, we saw very few connections between the different body subtypes. The findings are similar to those reported in a separate recent paper: The healthy human phageome. In that study, healthy humans were found to have a certain pattern of viruses always present in the gut. We also found that to be true – there seem to be marker viruses that could indicate the health status of a person. Probably because such viruses are maintaining a balance with the bad bacteria.

Amy: And in disease there might be shifts away from that core viral state?

David: Yes, definitely. For example, we have a very recent paper (in editorial status) where we compare the virome of IBS mice to that of healthy mice. And we found a certain core pattern in the gut virome of the healthy mice and a very stochastic pattern in the gut virome of the IBS mice. Basically, totally different virome patterns identified between the two groups. And those patterns may give us some leads as to what happens in human IBS. Hopefully the paper will be out soon.

Amy: Wow, you have some great papers coming out.

David: Yes. We are also doing work related to phage therapy. We are trying to predict certain phages and then predict their bacterial hosts. And of course, many of these bacterial hosts are harmful to humans. That gives us the opportunity to engineer the phages in ways that might allow them to best kill the bad bacteria.  

So, we predict phages in the human body and then say “This viral (phage) sequence is probably infecting this particular bacterial host.” But we are collaborating with other groups here at Berkeley on the project because, if we do identify a bacterial host, we need to know if we can culture it in the lab. Culturing the bacterium allows us to test it. We need to do that because our computer-based predictions are pretty robust – but we must be 100% sure that what we predict computationally holds up in the lab. Also, we need to confirm that the phage sequences we predict are perfect before attempting to use them as therapeutics.

Amy: That’s great. That’s what the phage therapy community needs right now. Much more information on verified phage/bacteria relationships (that can be easily accessed in a database). It’s also great for our general understanding of the microbiome. Because if we fail to account for the phages in any microbiome ecosystem, we’re unlikely to really understand the community dynamics. 

David: Yes. For example, if you study cell biology there’s another layer beneath that knowledge. You also need to understand molecular biology, because you need to understand molecules in order to understand cells. Similarly, if you want to understand bacteria in a microbiome community, it’s important to understand that the way the bacteria behave is based on their relationship to the phages. For example, there is a theory called “kill the winner.” Sometimes a bacterium grows exponentially in an environment. If its levels come back to equilibrium it’s usually because there is a phage that’s killing it. This equilibrium is built into the foundation of the ecosystem. And without understanding viruses you can’t fully understand that equilibrium. So yes, it makes sense to think that way.   
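The boom-and-crash behavior David describes can be sketched with a classic predator-prey (Lotka-Volterra) model, with one bacterium as the prey and one phage as the predator. This is a deliberate simplification: the full “kill the winner” theory involves many competing strains, and the function name and all parameter values below are arbitrary assumptions chosen only to make the dynamics visible.

```python
def simulate_kill_the_winner(bacteria=10.0, phage=1.0, steps=8000, dt=0.001,
                             growth=1.0, infection=0.1, burst=0.05, decay=0.5):
    """Integrate a simple predator-prey model with fixed Euler steps.

    Bacteria grow exponentially unless phage infect them; phage multiply
    only when they encounter hosts and otherwise decay.
    Returns a list of (bacteria, phage) pairs, one per time step.
    """
    history = [(bacteria, phage)]
    for _ in range(steps):
        d_bacteria = growth * bacteria - infection * bacteria * phage
        d_phage = burst * bacteria * phage - decay * phage
        bacteria += d_bacteria * dt
        phage += d_phage * dt
        history.append((bacteria, phage))
    return history


history = simulate_kill_the_winner()
bacteria_series = [b for b, _ in history]
phage_series = [p for _, p in history]

# Time step at which each population reaches its maximum.
bacteria_peak_step = bacteria_series.index(max(bacteria_series))
phage_peak_step = phage_series.index(max(phage_series))
```

In this sketch the bacterial population climbs first, the phage population peaks shortly afterwards, and the bacteria are then driven back down. That ordering is the point: the “winner” is cut back by the virus that feeds on it.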

Amy: Yes that’s even true of an intervention like a probiotic. It would be helpful to understand how the bacteria in a probiotic might be modulated by neighboring phages.

So give me some numbers. I know it’s a ridiculous request. But about how many viruses are you finding in just the human samples you have so far?

David: So, starting with only DNA viruses…we have around a million sequences but some are redundant. So, if you only take the unique viral entities, we have about 400,000. Of those 400,000 about 30,000 or so are unique viral sequences coming from the human body.  

When it comes to RNA viruses, that’s a new ongoing project that we haven’t published on yet. But so far, we have around another 100,000 novel predicted RNA viral genomes. And about 20% of them are coming from human samples.
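To make the step from “a million sequences” to “unique viral entities” a little more concrete, here is a minimal Python sketch of exact dereplication. It treats a sequence and its reverse complement as the same entity and keeps the first copy of each. This is an assumption-laden toy: real dereplication typically clusters near-identical sequences by similarity rather than requiring exact matches, and the helper names here are invented for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Pick one representative for a sequence and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def dereplicate(sequences):
    """Keep the first occurrence of each distinct sequence.

    Two sequences count as duplicates if they are identical or if one is
    the reverse complement of the other (case-insensitive).
    """
    seen = set()
    unique = []
    for seq in sequences:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            unique.append(seq)
    return unique


# "AACGT" is the reverse complement of "ACGTT"; "acgtt" is a case variant.
unique = dereplicate(["ACGTT", "AACGT", "acgtt", "GGGCC"])
```

Here four input sequences collapse to two unique ones, `["ACGTT", "GGGCC"]`.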

Amy: Do you see an end in sight?

David: There must be an end in sight, right? But the problem with viruses is that they also recombine. It’s complicated because it’s very hard to define a viral species. For example, there are cases where two closely related viruses are trying to avoid predation by bacteria, and to do that they recombine and create a third virus. Then…that third virus can recombine with another virus. Another example is that two viruses infecting the same host, at the same time, can also recombine. So, I think that the possibilities are countless and it’s going to be very complicated. We are definitely not even close to scratching the surface at all.

Growth rate of virus identification and microbial host prediction for the new release of IMG/VR database. Growth over time in the total and unique number of viral sequences in IMG/VR. (Image composite by David Paez-Espino).

Amy: It’s daunting but also exciting. Because we understand very little about the human microbiome and most ecosystems. So it’s great to realize there’s so much more we can learn about what might be going on.  

David: Right. We have a collection of ~21 million different genes from the viruses we’ve collected so far in IMG/VR. And 80-90% of these viral genes are hypothetical or have unknown functions. 

This same pattern is even true with E. coli. People have been studying E. coli for decades but we still don’t know the function of ~30% of E. coli’s genes. So, imagine how little we know about these novel viruses that are predicted from fragments here or there. It’s very complicated to make any kind of estimation.

Amy: It’s mind-blowing. To clarify: E. coli is one of the most well-studied bacterial species and we still don’t know a great deal about it.

David: Yes. E. coli is so well-studied that you’d think all the metabolic pathways are perfectly covered. But if you actually look at the databases, people still have no idea about what 30% of the E. coli genome is doing. So, these are exciting times, yes.

Amy: It’s clear that no matter what you’re studying related to the human body, you need to account for these viruses and other microbes. Whether you’re an immunologist, a neurologist etc. We need to account for their DNA/RNA and metabolites.

David: We have advanced a lot in science. Especially thanks to early studies of the microbiome that created reference genomes. But, for example, I recently went to a conference that discussed the drinking water microbiome. And they were not even testing for viruses (phages) in the drinking water. They are looking at 16S rRNA sequences from about 20 known pathogens. And we are able to drink apparently safe water with just that knowledge. The basic necessities are covered. But – if we could understand even more about what microbes/viruses are in that water we could develop even better solutions.

This is especially true when it comes to human rare diseases. The majority of them could be caused by something we are not controlling. Medicine is doing a great job, but still – new knowledge about viruses can add to the picture. For example, now we understand that even some forms of cancer are caused by human microbiome dysbiosis accompanied by proliferation of a certain virus. So many of these novel or rare diseases could be better explained if we understand more about viruses and other microbes. 

This is also true for therapeutics (treatment). Now we can think about phages killing bacteria in a more precise way than antibiotics. Or we can use viruses that can be delivered to specific human tissue along with something that can access your immune system. You can design the virus to deliver something to the immune system “a la carte.” This precise medicine needs to evolve along with the new technologies.

My ideal job is applications. But I am also very pro basic science. For example – the CRISPR/Cas system. We are now engineering human cells from the use of bacterial genes that are designed to be a defense system against phage. So, the tremendous repertoire we have of novel genes can give us many new tools and technologies that we can manipulate at our will in the near future. We have been doing that for years, but the pace of discovery is now way, way faster. It’s a catalytic process: we are speeding up the different steps. 

Amy: Yes. The pace of discovery is incredible. I recently submitted a paper to a journal. They wrote back and asked me to give specific numbers for how many bacterial and viral species persist in the human body. I had to explain this wasn’t possible, and even if I made an estimate the number would change by the next month…or even the next day. 

David: Yes, for example, we have the largest viral database in this field. But even then, we’re only operating with assembled sequences larger than 5 kb. That represents less than 2% of all the sequences we have. Which is pretty much nothing. So, it’s not just the discovery pipeline that’s important. We can also go back and analyze metagenomes from 15 years ago. Because now we have different sequencing technology, we have different assembly technology, we have new binning algorithms, we have more robust models for gene annotation and functional predictions. And everything is still getting improved and improved and improved. So yes, as you said, there are no magic numbers. 
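The length cutoff David mentions is easy to picture in code. The sketch below keeps only assembled sequences longer than 5,000 bases and reports what fraction of the input survives. The function name and the tiny example data are hypothetical, and the strict “longer than” comparison is an assumption based on his wording.

```python
def filter_by_length(contigs, min_length=5000):
    """Keep contigs strictly longer than min_length bases.

    contigs maps a contig name to its sequence. Returns the kept contigs
    and the fraction of input contigs (by count) that passed the filter.
    """
    kept = {name: seq for name, seq in contigs.items() if len(seq) > min_length}
    fraction = len(kept) / len(contigs) if contigs else 0.0
    return kept, fraction


# Made-up contigs of different lengths (sequence content is irrelevant here).
contigs = {
    "contig_a": "A" * 6000,
    "contig_b": "A" * 1200,
    "contig_c": "A" * 300,
    "contig_d": "A" * 5000,
}
kept, fraction = filter_by_length(contigs)
```

With this toy input only `contig_a` passes, so a quarter of the contigs are kept. In David’s real data the surviving share is under 2%, which is why he stresses how much remains unexamined.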

Amy: Definitely no magic numbers:) David, thanks for speaking with me. I’m fascinated by your research and very excited to see where your findings lead. 













