Pipeline

The BlobToolKit Pipeline is a configurable Snakemake pipeline to run analyses on publicly available, International Nucleotide Sequence Database Collaboration (INSDC) registered eukaryotic genome assemblies. The Pipeline automates all steps in the generation of BlobDir datasets, including retrieval/formatting of public database files, retrieval of assembly-specific sequences and read files, read mapping, BLAST/Diamond searches and BUSCO analyses.

The Pipeline is also suitable for running on local assemblies. If files are available, the fetch_assembly and fetch_fastq steps can be omitted and the remainder of the Pipeline will use these local files (see Pipeline Tutorials for more details).

The datasets we have analysed using the Pipeline are available on a public instance of the BlobToolKit Viewer at blobtoolkit.genomehubs.org/view. This public Viewer instance includes a browsable summary of progress towards our goal of analysing all public assemblies, which is updated as we add new datasets and as the set of publicly available datasets continues to grow.

See the Pipeline Tutorials for more information on how to analyse assemblies using the Pipeline or check out our open-source code on GitHub.