BlobToolKit

Identification and analysis of non-target data in all Eukaryotic genome projects

The BlobToolKit project will develop software tools and protocols to aid in the separation of cobionts and removal of contaminants from genomic datasets prior to genome sequence assembly.

BlobToolKit follows on from the development of Blobology [1] and BlobTools [2] over the past few years in the Blaxter Lab at the University of Edinburgh. As development on BlobToolKit has only just begun, users interested in applying these approaches should continue to use Dom Laetsch's BlobTools package directly until the methods are fully incorporated into BlobToolKit.

If you'd like to get in touch about any aspect of the project, please tweet us @blaxterlab or @rjchallis, or email richard.challis@ed.ac.uk.

[1] Kumar et al. 2013. Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots. Frontiers in Genetics, 4:237
[2] Laetsch & Blaxter 2017. BlobTools: Interrogation of genome assemblies [version 1; referees: awaiting peer review]. F1000Research, 6:1287 (doi: 10.12688/f1000research.12232.1)

About

Genomics has become one of the cornerstones of biology. Knowing an organism's genome sequence immediately allows us to work out what kinds of biology it is capable of, and acts as a platform upon which we can build experiments to test, for example, the dynamics of gene activity during stress or disease. If genomes are the cornerstones, genome databases are the libraries built from these data that allow scientists to collaborate and build upon earlier successes. Genome sequencing is getting easier as the technologies behind it, new high-throughput sequencers and advanced computing, improve by leaps and bounds. The human genome cost $3 billion to sequence the first time round: now it would cost about $15,000. This reduction in cost has opened up genome sequencing to many research projects on new species, and there are now about 30,000 bacterial genomes and 3,000 eukaryotic genomes in public databases.

When genomes are contaminated, the genome databases, the reference libraries, are also contaminated, and the scientific process becomes muddied: errors can be made that affect many later steps in understanding the natural world, or exploiting it for bioscience. Obviously no scientist knowingly submits contaminated genome data to the central databases, but as genome sequencing projects become more common, more and more contaminated data are getting into the databases of record.

How does contamination happen? Organisms live in environments with other species, and it is often not possible or not advisable to separate these before making DNA to be sequenced. For example, most animals have bacteria in their guts, and getting rid of these before extracting DNA from a whole specimen of a tiny species is difficult. Similarly, plants naturally have communities of fungi and bacteria growing in and on their leaves and roots. In the case of symbiotic organisms, where the interaction is very intimate, the specimen is indivisible. The genomes of the different contributing species will be mixed up in the raw sequence data generated from such samples.

We propose to build a set of computational tools, BlobToolKit, that will identify contaminants. BlobToolKit will be useful both during the process of making new genomes for the first time (where the tools will separate out the different organisms in the mix of raw sequence data), and during reanalyses of existing genome assemblies.
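To give a flavour of the kind of signal such tools can use, here is a minimal sketch in Python. It is not the BlobTools or BlobToolKit method: the Blobology approach combines GC content and coverage with taxonomic annotation, whereas this toy example simply flags contigs whose GC content or read coverage sits far from the dataset median. The `Contig` record, the function names and both thresholds are assumptions made purely for illustration.

```python
import statistics
from collections import namedtuple

# Hypothetical record type for this sketch: a contig name, its sequence,
# and its mean read coverage.
Contig = namedtuple("Contig", ["name", "sequence", "coverage"])


def gc_fraction(sequence):
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def flag_outliers(contigs, gc_tolerance=0.10, coverage_fold=4.0):
    """Return names of contigs whose GC or coverage is far from the median.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if not contigs:
        return []
    gc_by_name = {c.name: gc_fraction(c.sequence) for c in contigs}
    median_gc = statistics.median(gc_by_name.values())
    median_cov = statistics.median(c.coverage for c in contigs)
    flagged = []
    for c in contigs:
        gc_off = abs(gc_by_name[c.name] - median_gc) > gc_tolerance
        cov_off = (c.coverage > median_cov * coverage_fold
                   or c.coverage < median_cov / coverage_fold)
        if gc_off or cov_off:
            flagged.append(c.name)
    return flagged


# Made-up example data: c4 has unusually high GC, c5 unusually high coverage.
EXAMPLE_CONTIGS = [
    Contig("c1", "ATATATGCAT", 30.0),
    Contig("c2", "ATGCATATAT", 28.0),
    Contig("c3", "AATTGCATTA", 32.0),
    Contig("c4", "GCGCGGCCGA", 31.0),
    Contig("c5", "ATTAGCATAT", 300.0),
]

if __name__ == "__main__":
    print(flag_outliers(EXAMPLE_CONTIGS))
```

A contig flagged this way is only a candidate for closer inspection. GC content and coverage also vary within a single genome, which is why taxonomic annotation is needed before anything is treated as a contaminant or cobiont.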

BlobToolKit will be made freely available as a standalone program, as a service on the internet, and as a system that will be plugged into the big public databases to report on possible contamination. The project, a collaboration between the University of Edinburgh and the European Bioinformatics Institute, aims, within 3 years, to have identified all the problems in "legacy" genomes already submitted to public databases, and to have in place a system that prevents further contamination happening.

BlobToolKit reports will be provided as part of the submission process to those scientists reporting genome assemblies, ensuring the exposure of our technology to its users. We will further promote BlobToolKit by publication of our results in open access journals, presentations and workshops at relevant meetings, discussion with standards organisations, delivering training workshops to interested groups of scientists, and maintaining a rich resource of training and tutorial materials on the web. Our aim is to steer the scientific community to a culture in which contamination in genome assembly is understood and expected, and in which freely available, versatile software tools for flagging and preventing contamination in the public record are widely known.

Tools

BlobTools

Dom Laetsch's BlobTools is currently the most up-to-date version available.

Demo

BlobToolKit will add interactive filtering and visualisations to the static reports generated by BlobTools. We will be linking to a demo as soon as we have the basic functionality implemented in a prototype ready for testing.

API

While BlobToolKit is being developed in Python and JavaScript, it will be built around a RESTful API (Application Programming Interface) to make the methods available to tools and services written in other languages.
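As a rough illustration of why a RESTful design makes methods available across languages, the sketch below serves a JSON summary over HTTP and reads it back with a plain HTTP client, using only the Python standard library. Everything in it is hypothetical: the BlobToolKit API has not yet been defined, so the `/dataset/<name>` route, the field names and the example data are assumptions for this example only.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

# Made-up example data; not real BlobToolKit output.
DATASETS = {"example": {"contigs": 5, "flagged": 2}}


class SketchHandler(BaseHTTPRequestHandler):
    """Serve JSON summaries at the hypothetical route /dataset/<name>."""

    def do_GET(self):
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "dataset" and parts[1] in DATASETS:
            status, payload = 200, DATASETS[parts[1]]
        else:
            status, payload = 404, {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_summary(base_url, name):
    """Fetch and decode one dataset summary; any HTTP client could do this."""
    opener = build_opener(ProxyHandler({}))  # ignore proxy settings for localhost
    with opener.open(f"{base_url}/dataset/{name}") as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Start the server on a free local port, make one request, shut down."""
    server = HTTPServer(("127.0.0.1", 0), SketchHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        host, port = server.server_address
        return fetch_summary(f"http://{host}:{port}", "example")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Because the client only needs to speak HTTP and parse JSON, the same request could be made from JavaScript, R or a command-line tool without any Python code on the client side.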

We are currently in the process of defining the API and will link to full documentation when we begin development. As with other aspects of the project, we would welcome feedback from potential users so please get in touch if you are interested in being involved in developing or testing the API.

Source code

BlobToolKit source code will be developed openly on GitHub. For now we only have a proof-of-concept visualisation that requires very specific input data. We will link to active repositories here when we begin development on BlobToolKit, interactive visualisations and the API.

Blog