The limits of virus discovery and how to overcome them/The envelope of sequence bioinformatics in 2022

Artem Babaian
Department of Molecular Genetics; Donnelly Centre for Cellular and Biomolecular Research; University of Toronto
Meeting: The limits of Virus Discovery, and how to overcome them

Over the past 14 years, public databases have archived >30 petabases (3x10^16 bases) of sequencing data in >10 million samples, a modern ark of Earth’s genetic diversity. Millions of these samples contain viral sequences, often captured incidentally to the goals of the original study.

Recently, we developed Serratus, an ultra-high-throughput cloud-computing architecture to explore the genetic diversity of RNA viruses. In 11 days, we processed 5.7 million sequencing datasets (10.2 petabases) to discover >130,000 novel RNA viruses, a 9.8-fold increase relative to the 15,000 known RNA viruses.

We will review the assortment of methods used in virus discovery, as well as how the limitations of each algorithm clash with the biology of RNA viruses. Then, we will look ahead at the novel algorithms that promise faster and deeper homology search.

Together, the advancement of sequence homology algorithms and the exponential growth of public sequencing data will drive the new “Platinum Age” of virus discovery. We aim to uncover 100 million RNA viruses by 2030.

Rayan Chikhi
Institut Pasteur & CNRS
Primer: The envelope of sequence bioinformatics in 2022

This mysterious title is meant to suggest the immense realm of possibilities in sequence bioinformatics today. There is an explosion of data, and computational biology research is struggling to keep pace despite having made great advances. This primer provides a high-level survey of sequence alignment, genome assembly, and large-scale sequence search. I cover the underlying aspects of the algorithms, their strengths and weaknesses, and also touch on their capacity to scale to the petabyte regime. I also provide a mini-primer on sequence assembly, as this is another significant (and underrated) source of sensitivity limits. Taken together, the different components of the primer lay the groundwork for understanding state-of-the-art algorithms for virus discovery at the petabyte scale. Additional resources:

"Also we, together with other researchers are in the process of organizing an "RdRp Summit" for the systematic and interoperable data standard for RNA viruses, if you're interested in computational standards for RNA virus classification please email: rdrp.summit at gmail dot com"