As a Snakemake workflow, the pipeline can be run on various types of cluster; however, some variables may need to be set explicitly. In particular, running the transferCompleted.smk workflow to validate generated BlobDirs and produce a set of static images requires Firefox and Selenium WebDriver, and may need to be run under Singularity using the BlobToolKit Docker image.
The commands below should work within typical job submission scripts on SGE and LSF clusters, and assume that the install instructions have been followed to set up a Conda environment. See Configuring the pipeline for more information on the configuration file options.
Set variables
Set the following variables, either in a single job script or in an array job wrapper:
```shell
SOFTWARE=/path/to/software
SCRATCH=/path/to/scratch
# BlobToolKit pipeline directory
export PIPELINE=$SOFTWARE/blobtoolkit/insdc-pipeline
# Working directory should contain the $ASSEMBLY.yaml config file
export WORKDIR=$SCRATCH/workdir
# Set a destination directory for completed analyses if running validation steps
export CLUSTER_DESTDIR=$SCRATCH/completed
# Directory for Snakemake to create conda environments
export CONDA_DIR=$SOFTWARE/.conda
# Add btk_env python packages to PYTHONPATH
export PYTHONPATH=$SOFTWARE/miniconda3/envs/btk_env/lib/python3.6/site-packages:$PYTHONPATH
# Control resource use - finer grained control available using a cluster.yaml file
export THREADS=32
export MULTICORE=8
export MAXCORE=16
```
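As a sanity check before launching Snakemake, a small helper function can fail fast if any required variable is unset. This is an illustrative sketch, not part of the pipeline; the function name `check_btk_vars` and the variable list are assumptions based on the block above:

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the pipeline): report any of the
# variables exported above that are unset or empty.
check_btk_vars() {
  missing=0
  for var in PIPELINE WORKDIR CONDA_DIR THREADS; do
    # Indirect lookup via eval keeps this POSIX-shell compatible
    eval "val=\"\${$var}\""
    if [ -z "$val" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_btk_vars || exit 1` near the top of the job script aborts early with a clear message instead of failing partway through a long Snakemake run.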
Additional variables are required if using Singularity:
```shell
# Make sure singularity is in the PATH
export PATH=$SOFTWARE/singularity/bin:${PATH}
# Directory for Snakemake to store singularity images
export SINGULARITY_DIR=$SOFTWARE/.singularity
```
Running Singularity as described below also requires a minimal copy of your /etc/passwd file:
```shell
$ getent passwd "$USER" > $HOME/singularity.etc.passwd
```
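The resulting file should contain exactly one passwd entry for the current user; binding it over /etc/passwd inside the container (as in the --singularity-args below) means only that user is visible there. A quick check, using an illustrative /tmp path:

```shell
# Write a one-line passwd file for the current user (illustrative path)
getent passwd "$(id -un)" > /tmp/singularity.etc.passwd
# The file should hold a single entry of the form user:x:uid:gid:...
cat /tmp/singularity.etc.passwd
```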
Run the Pipeline
Place the commands to run the pipeline in a job script (this can be the same script as the variables above if you are not running array jobs):
```shell
# Be sure to set the assembly name; this should match the YAML config filename
ASSEMBLY=assemblyName
# Make sure conda activate is available in the script
eval "$(conda shell.bash hook)"
# Activate the Conda environment
conda activate btk_env
# Run snakemake
snakemake -p \
    --use-conda \
    --conda-prefix $CONDA_DIR \
    --directory $WORKDIR/ \
    --configfile $WORKDIR/$ASSEMBLY.yaml \
    --latency-wait 60 \
    --rerun-incomplete \
    --stats $ASSEMBLY.snakemake.stats \
    -j $THREADS \
    -s $PIPELINE/Snakefile \
    --resources btk=1
# Test for success
if [ $? -ne 0 ]; then
    echo "ERROR: pipeline run failed"
    exit 1
fi
```
Use Singularity
Validation and image generation steps can be included in the same job script. The commands below also illustrate the additional variables required when using Singularity:
```shell
# Exported paths below are relative to the container filesystem and should not be changed
mkdir -p $CLUSTER_DESTDIR
export DESTDIR=/blobtoolkit/output
export PYTHONPATH=/home/blobtoolkit/miniconda3/envs/btk_env/lib/python3.7/site-packages:$PYTHONPATH
export PATH=/home/blobtoolkit/miniconda3/envs/btk_env/bin:$PATH
# Run the transferCompleted workflow
snakemake -p \
    --directory $WORKDIR/ \
    --configfile $WORKDIR/$ASSEMBLY.yaml \
    --latency-wait 60 \
    --rerun-incomplete \
    --stats $ASSEMBLY.snakemake.stats \
    -j $THREADS \
    -s $PIPELINE/transferCompleted.smk \
    --resources btk=1 \
    --use-singularity \
    --singularity-prefix $SINGULARITY_DIR \
    --singularity-args "-B $WORKDIR:/blobtoolkit/datasets \
                        -B $CLUSTER_DESTDIR:/blobtoolkit/output \
                        -B $HOME/singularity.etc.passwd:/etc/passwd" \
    --config destdir=$DESTDIR
# Test for success
if [ $? -ne 0 ]; then
    echo "ERROR: failed during transferCompleted"
    exit 1
fi
```
Submit rules as jobs
On a cluster with DRMAA support, Snakemake can submit each rule as a separate job. This allows resources to be used more efficiently, but the large number of jobs generated may affect priority calculations, depending on cluster policies.
This requires additional options in the Snakemake run command, including a cluster configuration file (set $CLUSTER_CONFIG to its path). An example cluster.yaml file is provided in the insdc-pipeline repository:
```shell
# Run snakemake
snakemake -p \
    --use-conda \
    --conda-prefix $CONDA_DIR \
    --directory $WORKDIR/ \
    --configfile $WORKDIR/$ASSEMBLY.yaml \
    --cluster-config $CLUSTER_CONFIG \
    --drmaa " -o {log}.o \
              -e {log}.e \
              -R \"select[mem>{cluster.mem}] rusage[mem={cluster.mem}] span[hosts=1]\" \
              -M {cluster.mem} \
              -n {cluster.threads} \
              -q {cluster.queue} " \
    --latency-wait 60 \
    --rerun-incomplete \
    --stats $ASSEMBLY.snakemake.stats \
    -j $THREADS \
    -s $PIPELINE/Snakefile \
    --resources btk=1
```
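The {cluster.mem}, {cluster.threads} and {cluster.queue} placeholders in the --drmaa string are filled from the cluster configuration file. The sketch below shows the general shape of such a file; the rule name, values and queue name are illustrative assumptions, not the contents of the repository's actual cluster.yaml:

```yaml
# Illustrative cluster.yaml sketch - see the insdc-pipeline repository
# for the real example file.
__default__:
  mem: 8000        # memory request filling {cluster.mem}
  threads: 1       # cores requested via -n {cluster.threads}
  queue: normal    # queue name passed to -q {cluster.queue}
some_rule:         # hypothetical rule name: overrides the defaults per rule
  mem: 32000
  threads: 16
```

Snakemake applies the __default__ entry to every rule and merges in any per-rule overrides, so only resource-hungry rules need their own entries.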