Genomics has become one of the cornerstones of biology. Knowing an organism’s genome sequence immediately allows us to work out what kinds of biology it is able to do, and acts as a platform upon which we can build experiments to test, for example, the dynamics of gene activity during stress or disease. If genomes are the cornerstones, genome databases are the libraries built from these data that allow science to collaborate and build upon its successes. Genome sequencing is getting easier, as technologies improve by leaps and bounds: new, high throughput sequencers and advanced computing. The human genome cost $3 billion to sequence the first time round: now it would cost about $15,000. This reduction in cost has opened up genome sequencing to many research projects on new species, and there are now about 30,000 bacterial genomes and 3,000 eukaryotic genomes in public databases.
When genomes are contaminated, the genome databases, the reference libraries, are also contaminated, and the scientific process becomes muddied: errors can be made that affect many later steps in understanding the natural world, or exploiting it for bioscience. Obviously no scientist knowingly submits contaminated genome data to the central databases, but as genome sequencing projects become more common, more and more contaminated data are getting into the databases of record.
How does contamination happen? Organisms live in environments with other species, and it is often not possible or not advisable to separate these before making DNA to be sequenced. For example, most animals have bacteria in their guts, and getting rid of these before extracting DNA from a whole specimen of a tiny species is difficult. Similarly, plants naturally have communities of fungi and bacteria growing in and on their leaves and roots. In the case of symbiotic organisms, where the interaction is very intimate, the specimen is indivisible. The genomes of the different contributing species will be mixed up in the raw sequence data generated from such samples.
We propose to build a set of computational tools, BlobToolKit, that will identify contaminants. BlobToolKit will be useful both during the process of making new genomes for the first time (where they will separate out the different organisms in the mix of raw sequence data), and during reanalyses of existing genome assemblies.
BlobToolKit will be made freely available as a standalone program, as a service on the internet, and as a system that will be plugged into the big public databases to report on possible contamination. The project, a collaboration between the University of Edinburgh and the European Bioinformatics Institute, aims, within 3 years, to have identified all the problems in “legacy” genomes already submitted to public databases, and to have in place a system that prevents further contamination happening.
BlobToolKit reports will be provided as part of the submission process to those scientists reporting genome assemblies, ensuring the exposure of our technology to its users. We will further promote BlobToolKit by publication of our results in open access journals, presentations and workshops at relevant meetings, discussion with standards organisations, delivering training workshops to interested groups of scientists, and maintaining a rich resource of training and tutorial materials on the web. Our aim is to steer the scientific community to a culture in which contamination in genome assembly is understood and expected, and freely available and versatile software tools are known that can assist in the flagging and prevention of contamination in the public record.