The quickest way to install BlobToolKit is with pip. We recommend first creating a new conda environment with python=3.9. Conda installs everything into a separate user directory, even on a cluster, so nothing else on the system is affected.
Install Conda
Conda can be installed on the command line using the Miniconda installer:
curl -L https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh > Miniconda3.sh; chmod +x Miniconda3.sh; ./Miniconda3.sh
# You must open a new terminal window before using conda as the commands will not be available in the current session
Install BlobToolKit
To create and activate a new conda environment called “btk” with python=3.9:
conda create -y -n btk -c conda-forge python=3.9
conda activate btk
To install blobtoolkit:
pip install "blobtoolkit[full]"
To test that the installation worked, run:
blobtools -h
You should see the help text for that command.
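Beyond the help text, it can be useful to confirm that the command-line tools used on this page are on the PATH. The following is a minimal sketch, not part of BlobToolKit: `have_cmd` is a hypothetical helper name, and the list of tools is simply taken from the commands that appear on this page.

```shell
# Hypothetical helper: return 0 if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Tools used on this page: blobtools (installed above) plus the
# utilities that the database commands rely on.
for cmd in blobtools curl wget tar diamond; do
    if have_cmd "$cmd"; then
        echo "found: $cmd"
    else
        echo "missing: $cmd (is the btk environment active?)"
    fi
done
```

A missing blobtools usually means the btk environment is not active in the current shell.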
Databases
A local copy of the NCBI taxdump (newer format) is required for any features that use taxonomy information. Typical usage also requires copies of the NCBI nucleotide (nt) and UniProt databases. These can all be fetched automatically when running the BlobToolKit Pipeline. Alternatively, use the commands below to fetch copies for standalone use.
1. Fetch the NCBI Taxdump
mkdir -p taxdump; cd taxdump; curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar xzf -; cd ..;
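Because the download is piped straight into tar, an interrupted transfer can leave a partial directory behind. The sketch below checks that the expected files exist and are non-empty. `check_files` is a hypothetical helper name; nodes.dmp is the file that the diamond makedb command on this page reads, and names.dmp is assumed to sit beside it in the archive.

```shell
# Hypothetical helper: return 0 only if every named file exists and is
# non-empty inside the given directory.
check_files() {
    dir=$1
    shift
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            return 1
        fi
    done
    return 0
}

# Assumption: the taxdump was unpacked into ./taxdump as in the command above.
if check_files taxdump nodes.dmp names.dmp; then
    echo "taxdump looks complete"
fi
```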
2. Fetch the nt database
mkdir -p nt
wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz" -P nt/ && \
  for file in nt/*.tar.gz; \
    do tar xf $file -C nt && rm $file; \
  done
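The extract-then-delete step can also be wrapped in a function so that it is easy to re-run after an interrupted download. This is a sketch under assumptions: `extract_archives` is a hypothetical name, and the only changes from the loop above are quoting the file names and skipping the case where no archive matches.

```shell
# Hypothetical helper: unpack every *.tar.gz in a directory, deleting each
# archive only if it extracted without error.
extract_archives() {
    dir=$1
    for file in "$dir"/*.tar.gz; do
        [ -e "$file" ] || continue   # the glob matched nothing
        tar xf "$file" -C "$dir" && rm "$file"
    done
}

# Safe to re-run: archives that were already unpacked are gone, and
# archives that failed to unpack are still present.
extract_archives nt
```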
3. Fetch and format the UniProt database
Formatting the UniProt database requires the NCBI taxdump to be downloaded and uncompressed in a sister directory, as described in step 1 above.
mkdir -p uniprot
wget -q -O uniprot/reference_proteomes.tar.gz \
 ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/$(curl \
     -vs ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/ 2>&1 | \
     awk '/tar.gz/ {print $9}')
cd uniprot
tar xf reference_proteomes.tar.gz
touch reference_proteomes.fasta.gz
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz
echo -e "accession\taccession.version\ttaxid\tgi" > reference_proteomes.taxid_map
zcat */*/*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map
diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_proteomes.taxid_map --taxonnodes ../taxdump/nodes.dmp -d reference_proteomes.dmnd
cd -
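To see what the taxid_map step produces, the grep and awk filter from the commands above can be run on a couple of sample rows. This is an illustrative sketch: `idmapping_to_taxid_map` is a hypothetical wrapper name, and the accession and identifiers are made up rather than taken from UniProt.

```shell
# Hypothetical wrapper around the filter used above: keep only NCBI_TaxID
# rows and emit accession, accession.version, taxid, gi (gi is always 0).
idmapping_to_taxid_map() {
    grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}'
}

# Two made-up idmapping rows (accession, ID type, value) for one accession.
# Only the NCBI_TaxID row survives the filter.
printf 'ACC00001\tNCBI_TaxID\t9606\nACC00001\tGeneID\t12345\n' | idmapping_to_taxid_map
# prints: ACC00001<TAB>ACC00001<TAB>9606<TAB>0
```

The accession is written twice because the same value fills both the accession and accession.version columns of the header.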
4. Fetch any BUSCO lineages that you plan to use
mkdir -p busco
wget -q -O eukaryota_odb10.gz "https://busco-data.ezlab.org/v4/data/lineages/eukaryota_odb10.2020-09-10.tar.gz" \
  && tar xf eukaryota_odb10.gz -C busco