SemiBin – Using self-supervised deep learning for better metagenomics binning

Dr Luis Pedro Coelho, Big Data Biology Lab, Fudan University

Abstract: Microorganisms live in all environments on the globe and influence phenomena as varied as human health, soil (including soil used for food production), and climate change. Metagenomics is the technique whereby DNA sequencing is performed simultaneously on the multiple organisms present in a particular sample. Due to technological limitations, metagenomics returns a myriad of fragments, usually called "contigs", rather than complete genomes. Inferring which contigs belong to the same genome is called binning.

The two most popular approaches, reference-based and reference-independent binning, map to the machine learning concepts of supervised and unsupervised learning. Reference-based binning can work well for discovering variants of already-known species, but only reference-free methods can find novel ones. Moreover, a large fraction of contigs cannot be classified into known species (in many cases, these are the vast majority of contigs in a sample). SemiBin originated as an attempt to bridge the gap between these two approaches by developing a semi-supervised model in which supervision informs the process while still allowing novel groups to be inferred. While successful at recovering more genomes than the previous state of the art, this approach was computationally intensive. Recently, we have overcome this limitation by developing a completely self-supervised approach, which matches (and even surpasses) the high-quality results of the semi-supervised model at a fraction of the computational cost.
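As an illustration of what self-supervision can mean in this setting, the sketch below builds training pairs from the contigs alone, with no reference genomes. It assumes that two halves of one contig come from the same genome ("must-link") and that two randomly chosen distinct contigs usually come from different genomes ("cannot-link", a noisy assumption). The abstract does not specify how the training signal is constructed, so the function names, the `min_length` parameter, and the toy sequences are hypothetical and are not SemiBin's API.

```python
import random
from itertools import combinations


def must_link_pairs(contigs, min_length=4):
    """Split each sufficiently long contig in half.

    Both halves originate from the same contig, so they can be
    treated as a positive ("same genome") training pair.
    """
    pairs = []
    for seq in contigs.values():
        if len(seq) >= min_length:
            mid = len(seq) // 2
            pairs.append((seq[:mid], seq[mid:]))
    return pairs


def cannot_link_pairs(contigs, n_pairs, seed=0):
    """Sample pairs of distinct contigs as negative training pairs.

    Assumption: two random contigs usually belong to different
    genomes. This is noisy, since some pairs will share a genome.
    """
    rng = random.Random(seed)
    all_pairs = list(combinations(sorted(contigs), 2))
    chosen = rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
    return [(contigs[a], contigs[b]) for a, b in chosen]


# Toy input: real contigs are thousands of bases long.
contigs = {
    "contig_1": "ACGTACGTAC",
    "contig_2": "GGGCCCGGGC",
    "contig_3": "TTATTA",
}
positives = must_link_pairs(contigs)
negatives = cannot_link_pairs(contigs, n_pairs=2)
```

Pairs like these could then be used to train a model that places the members of a must-link pair close together and the members of a cannot-link pair far apart, after which the contigs are clustered in the learned space.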

Additionally, we have made the pipeline work with both the short-read datasets that still comprise the majority of metagenomics and the newer long-read methods that are increasingly used. Although the problems of clustering short-read and long-read data are conceptually identical, the properties of the data are sufficiently distinct that using a different approach for each achieves the best results.

SemiBin is available at https://github.com/BigDataBiology/SemiBin and the manuscript describing the semi-supervised version is available at https://doi.org/10.1038/s41467-022-29843-y. A preprint describing the recent updates is available at https://www.biorxiv.org/content/10.1101/2023.01.09.523201v1

Biography: Luis Pedro Coelho is the principal investigator (PI) of the Big Data Biology Lab at Fudan University. Previously, he worked as a postdoctoral researcher in Peer Bork’s group at the European Molecular Biology Laboratory (EMBL). He has a PhD from Carnegie Mellon University and an MSc from Instituto Superior Técnico in Lisbon. He currently uses computational methods to analyse microbial communities in different environments, such as the marine environment and the human gut.

More information about him can be found at https://luispedro.org.