Install

The quickest way to install BlobToolKit is with pip. We recommend first setting up a new conda environment with python=3.9. Conda installs everything into a separate user directory, even on a cluster, so nothing else on the system is affected.

Install Conda

Conda can be installed on the command line using the Miniconda installer:

curl -L https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh > Miniconda3.sh; chmod +x Miniconda3.sh; ./Miniconda3.sh
# Open a new terminal window before using conda; the conda commands are not available in the current session

Install BlobToolKit

To create and activate a new conda environment called “btk” with python=3.9:

conda create -y -n btk -c conda-forge python=3.9
conda activate btk

To install blobtoolkit:

pip install "blobtoolkit[full]"

To test that the installation worked, type blobtools -h and you should see the help text for that command.
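
The check described above can be scripted. This is a minimal sketch, not part of BlobToolKit: `have_cmd` is a hypothetical helper name, and the list of tools is an assumption based on the commands used on this page.

```shell
# Return success if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# blobtools comes from the pip install above; diamond is used
# later on this page when formatting the UniProt database.
for tool in blobtools diamond; do
    if have_cmd "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool"
    fi
done
```

If blobtools is reported as missing, the most likely cause is that the btk environment is not active in the current shell.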

Databases

A local copy of the NCBI taxdump (newer format) is required for any features that use taxonomy information. Typical usage also requires copies of the NCBI nucleotide (nt) and UniProt databases. These can all be fetched automatically when running the BlobToolKit Pipeline. Alternatively, use the commands below to fetch copies for standalone use.

1. Fetch the NCBI Taxdump

mkdir -p taxdump;
cd taxdump;
curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar xzf -;
cd ..;
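
A quick sanity check after the download can catch a truncated or failed transfer. The sketch below assumes that nodes.dmp and names.dmp are the files of interest (nodes.dmp is the one passed to diamond in step 3); `check_taxdump` is a hypothetical helper, and the file list can be extended as needed.

```shell
# Check that a taxdump directory holds the expected files.
# The file list is an assumption; extend it to suit your workflow.
check_taxdump() {
    dir="${1:-taxdump}"
    status=0
    for name in nodes.dmp names.dmp; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_taxdump taxdump && echo "taxdump looks complete"` prints the message only when both files are present and non-empty.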

2. Fetch the nt database

mkdir -p nt
wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz" -P nt/ && \
        for file in nt/*.tar.gz; \
            do tar xf "$file" -C nt && rm "$file"; \
        done
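
The nt download is large, so it can help to know which archives failed to unpack. The following is a sketch of the same extract-then-delete loop written as a function that also reports failures and returns a non-zero status; `extract_archives` is a hypothetical name, not a BlobToolKit command.

```shell
# Extract every .tar.gz in a directory, removing each archive only
# after it has unpacked cleanly. Returns non-zero if any archive failed.
extract_archives() {
    dir="$1"
    failed=0
    for file in "$dir"/*.tar.gz; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$file" ] || continue
        if tar xf "$file" -C "$dir"; then
            rm "$file"
        else
            echo "failed to extract: $file (kept for retry)" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Calling `extract_archives nt` after the wget step would leave any damaged archive in place, so it can be downloaded again.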


3. Fetch and format the UniProt database

Formatting the UniProt database requires the NCBI taxdump to be downloaded and uncompressed in a sister directory, as described in step 1 above.

mkdir -p uniprot
wget -q -O uniprot/reference_proteomes.tar.gz \
 ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/$(curl \
     -vs ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/ 2>&1 | \
     awk '/tar.gz/ {print $9}')
cd uniprot
tar xf reference_proteomes.tar.gz

touch reference_proteomes.fasta.gz
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz

echo -e "accession\taccession.version\ttaxid\tgi" > reference_proteomes.taxid_map
zcat */*/*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map

diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_proteomes.taxid_map --taxonnodes ../taxdump/nodes.dmp -d reference_proteomes.dmnd
cd -
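
To make the taxid map step easier to follow, the sketch below runs the same grep and awk filter on two made-up idmapping rows. The accession ACC00001 and the IDs are illustrative only, and `to_taxid_map` is a hypothetical wrapper around the commands used above.

```shell
# Convert idmapping rows (accession, ID type, ID) into taxid_map rows.
# Same grep/awk steps as above, reading from standard input.
to_taxid_map() {
    grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}'
}

# Two made-up rows for one hypothetical accession; only the
# NCBI_TaxID row passes the filter.
printf 'ACC00001\tNCBI_TaxID\t9606\nACC00001\tGeneID\t12345\n' | to_taxid_map
# prints one line: ACC00001, ACC00001, 9606 and 0, separated by tabs
```

The accession is written twice because the map has separate accession and accession.version columns, and the gi column is filled with 0.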

4. Fetch any BUSCO lineages that you plan to use

mkdir -p busco
wget -q -O eukaryota_odb10.gz "https://busco-data.ezlab.org/v4/data/lineages/eukaryota_odb10.2020-09-10.tar.gz" \
        && tar xf eukaryota_odb10.gz -C busco