The Pipeline is implemented as a Snakemake workflow that determines which jobs should be run to generate a complete BlobDir dataset based on the Pipeline configuration. The Pipeline can be run in the same way for both public and local assemblies, provided any local assembly and read files are available in the working directory.
Before you start, make sure you have the latest version of the Pipeline from the blobtoolkit/insdc-pipeline GitHub repository and have installed all dependencies listed in the README file. The GitHub repository is named insdc-pipeline as the Pipeline was developed for the analysis of publicly available (INSDC registered) datasets:
    mkdir -p /home/ubuntu/blobtoolkit
    cd /home/ubuntu/blobtoolkit
    git clone https://github.com/blobtoolkit/insdc-pipeline
Create a working directory. All results will be written to this directory and the final BlobDir dataset will be written as a subdirectory of the working directory. If running the Pipeline on a local assembly, copy the assembly FASTA and read FASTQ files into this directory:
    mkdir -p /path/to/working/directory
    cp Dalbomicans_v1.fasta /path/to/working/directory/
    cp library*.fastq.gz /path/to/working/directory/
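Before moving on, it can help to confirm that the input files really are in place. The following is a minimal sketch and not part of the Pipeline: `check_inputs` is a hypothetical helper, and it assumes that the assembly file is named after the assembly prefix (`<prefix>.fasta`) and that read files end in `.fastq.gz`, as in the example above.

```shell
# Hypothetical helper (not provided by the Pipeline): report whether a
# working directory holds a non-empty assembly FASTA and at least one
# gzipped FASTQ read file.
check_inputs() {
  local workdir=$1 assembly=$2
  local status=0 found_reads=0 f

  # Assumption: the assembly file is called <prefix>.fasta
  if [ ! -s "$workdir/$assembly.fasta" ]; then
    echo "missing or empty: $workdir/$assembly.fasta" >&2
    status=1
  fi

  # An unmatched glob stays as a literal string, so test that the path exists
  for f in "$workdir"/*.fastq.gz; do
    if [ -e "$f" ]; then
      found_reads=1
      break
    fi
  done
  if [ "$found_reads" -eq 0 ]; then
    echo "no *.fastq.gz read files in $workdir" >&2
    status=1
  fi

  return "$status"
}
```

For example, `check_inputs /path/to/working/directory Dalbomicans_v1` returns a non-zero status and prints what is missing if either kind of file is absent.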
Create/edit a YAML configuration file for the assembly (see the Pipeline configuration tutorial) and place it in the working directory:
    cat /path/to/working/directory/Dalbomicans_v1.yaml
    assembly:
      prefix: Dalbomicans_v1
      ...
    taxon:
      taxid: 7291
      name: Drosophila albomicans
    keep_intermediates: true
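Since the configuration file in this example is named after the assembly prefix, a quick sanity check is to confirm that the prefix inside the file matches the file name. The sketch below is an illustration only: `config_prefix` is a hypothetical helper, and it does a plain text match on the first `prefix:` key rather than parsing the YAML properly.

```shell
# Hypothetical helper: print the value of the first "prefix:" key in a file.
# This is a text match, not a YAML parser, so use it only as a rough check.
config_prefix() {
  sed -n 's/^[[:space:]]*prefix:[[:space:]]*//p' "$1" | head -n 1
}
```

Running `config_prefix /path/to/working/directory/Dalbomicans_v1.yaml` on the example file should print the prefix, which can then be compared with the file name by eye or in a script.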
Change directory into the Pipeline directory and activate the Conda environment described in the README file:
    cd /home/ubuntu/blobtoolkit/insdc-pipeline
    conda activate snake_env
Set some parameters as environment variables to make it clear which values can be changed to suit your local compute environment:
    WORKDIR=/path/to/working/directory       # The working directory created above
    ASSEMBLY=Dalbomicans_v1                  # The assembly prefix
    CONDA_DIR=/home/ubuntu/blobtoolkit/.conda  # A directory to contain the Conda environments for individual Snakemake rules
    THREADS=64                               # The maximum number of parallel threads to run
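If you are unsure what value to use for THREADS, one option is to cap it at the number of CPUs the machine reports. This is a sketch under assumptions: `cap_threads` is a hypothetical helper, and `getconf _NPROCESSORS_ONLN` is available on Linux and macOS but may not exist on other systems.

```shell
# Hypothetical helper: print the requested thread count, reduced to the
# number of online CPUs if the request is larger than that.
cap_threads() {
  local requested=$1 available
  available=$(getconf _NPROCESSORS_ONLN)
  if [ "$requested" -gt "$available" ]; then
    echo "$available"
  else
    echo "$requested"
  fi
}

# Ask for 64 threads, but never more than the machine has
THREADS=$(cap_threads 64)
```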
Run the Pipeline using the variables set above. Additional flags are used to print each command before it is run (-p) and to limit the number of concurrent rules using blobtools add to prevent multiple processes attempting to update the BlobDir at the same time (--resources btk=1):
    snakemake -p \
        --use-conda \
        --conda-prefix $CONDA_DIR \
        --directory $WORKDIR/ \
        --configfile $WORKDIR/$ASSEMBLY.yaml \
        --stats $ASSEMBLY.snakemake.stats \
        -j $THREADS \
        --resources btk=1
To perform a dry-run to check the config file and see which rules will be used without actually running the Pipeline, add the -n flag to the command above.
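One way to avoid retyping the command for the dry run and the real run is to wrap it in a shell function that passes any extra flags through to snakemake. `run_pipeline` is a hypothetical name, not something provided by the Pipeline. The flags are those from the command above, and the function relies on the WORKDIR, ASSEMBLY, CONDA_DIR and THREADS variables set earlier.

```shell
# Hypothetical wrapper around the snakemake command from this tutorial.
# Any arguments given to the function (for example -n) are appended
# to the command line unchanged.
run_pipeline() {
  snakemake -p \
    --use-conda \
    --conda-prefix "$CONDA_DIR" \
    --directory "$WORKDIR/" \
    --configfile "$WORKDIR/$ASSEMBLY.yaml" \
    --stats "$ASSEMBLY.snakemake.stats" \
    -j "$THREADS" \
    --resources btk=1 \
    "$@"
}
```

With this in place, `run_pipeline -n` performs the dry run and `run_pipeline` with no arguments starts the full run.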
The Pipeline is likely to take at least several hours to run, depending on the size and contiguity of your assembly and the number of BUSCO lineages. Once complete, the working directory will contain a number of analysis outputs and a BlobDir subdirectory that can be visualised in the BlobToolKit Viewer.