The BlobTools approach uses BLAST hits to provide taxonomic annotation for each sequence in an assembly. When run using the BlobToolKit Pipeline, a wrapper script breaks long sequences into chunks to obtain a distribution of BLAST hits, and closely related taxa can be automatically filtered out. This filtering is particularly important for publicly available assemblies, as the BLAST databases may already contain the query sequence.
Example
Example BLAST and Diamond BLAST hit files for Strongyloides venezuelensis can be obtained from the BlobToolKit downloads site:
cd ~/BTK_TUTORIAL/FILES
curl https://blobtoolkit.genomehubs.org/download/ALIAS/S_v/S_venezuelensis_HH1/S_venezuelensis_HH1.blastn.nt_v5.root.1.minus.6247.out.gz | gunzip -c > ASSEMBLY_NAME.ncbi.blastn.out
curl https://blobtoolkit.genomehubs.org/download/ALIAS/S_v/S_venezuelensis_HH1/S_venezuelensis_HH1.diamond.reference_proteomes.root.1.minus.6247.out.gz | gunzip -c > ASSEMBLY_NAME.diamond.blastx.out
These files can be imported to assign taxonomic labels to the assembly contigs:
~/blobtoolkit/blobtools2/blobtools add \
--hits ~/BTK_TUTORIAL/FILES/ASSEMBLY_NAME.ncbi.blastn.out \
--hits ~/BTK_TUTORIAL/FILES/ASSEMBLY_NAME.diamond.blastx.out \
--taxrule bestsumorder \
--taxdump ~/blobtoolkit/taxdump \
~/BTK_TUTORIAL/DATASETS/ASSEMBLY_NAME
This example uses --taxrule bestsumorder, which assigns taxonomy based on the sum of bitscores for the hits to a given taxon in the first file, only using subsequent results files for contigs with no hits in a previous file.
Configuration
There are a number of configuration options when adding hits using blobtools add:
--hits – BLAST or Diamond blast file to import. [Required]
--hits-cols – Specify the BLAST file column order. [Default: 1=qseqid,2=staxids,3=bitscore,5=sseqid,10=qstart,11=qend,14=evalue]
--taxrule – Rule to use when assigning BLAST hits to taxa. [Default: bestsumorder]
--taxdump – Directory containing NCBI taxdump files. [Required]
--evalue – Set evalue cutoff when parsing hits file. [Default: 1]
--bitscore – Set bitscore cutoff when parsing hits file. [Default: 1]
--hit-count – Number of hits to parse when inferring taxonomy. [Default: 10]
--replace – Replace existing fields if present. [Default: false]
--hits
One or more BLAST or Diamond BLAST tabular output format files can be specified using the --hits flag. The order of columns in these files can be defined using --hits-cols (described below), but the default import format can be obtained with commands similar to:
NCBI blastn against the nt database (see Install BlobToolKit > Databases):
blastn -db nt \
-query ASSEMBLY_NAME.fasta \
-outfmt "6 qseqid staxids bitscore std" \
-max_target_seqs 10 \
-max_hsps 1 \
-evalue 1e-25 \
-num_threads 16 \
-out ASSEMBLY_NAME.ncbi.blastn.out
Diamond blast against UniProt (see Install BlobToolKit > Databases):
diamond blastx \
--query ASSEMBLY_NAME.fasta \
--db /path/to/uniprot.db.with.taxids \
--outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore \
--sensitive \
--max-target-seqs 1 \
--evalue 1e-25 \
--threads 16 \
> ASSEMBLY_NAME.diamond.blastx.out
Usage:
blobtools add \
--hits ASSEMBLY_NAME.ncbi.blastn.out \
--hits ASSEMBLY_NAME.diamond.blastx.out \
...
--hits-cols
Tabular BLAST output files with a different column order can be imported by specifying a comma-separated list of <column_number=field_name> pairs. The default value of 1=qseqid,2=staxids,3=bitscore,5=sseqid,10=qstart,11=qend,14=evalue matches the output from the example commands above. Columns not listed in the default setting are not used.
Usage:
blobtools add \
...
--hits-cols 1=qseqid,2=staxids,3=bitscore,5=sseqid,10=qstart,11=qend,14=evalue \
...
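As an illustration of how a --hits-cols value relates to a custom column order, the following sketch derives the column numbers from a list of output fields. The field list in outfmt_fields is a hypothetical example, not a layout recommended by BlobToolKit, and the variable names are made up for this sketch. Entries are emitted in the same field order as the default value.

```shell
# Field names in the order they appear in a hypothetical custom -outfmt string.
outfmt_fields="qseqid sseqid staxids bitscore evalue qstart qend"

# For each field named in the default --hits-cols value, find its 1-based
# column position in the custom layout and build the comma-separated list.
hits_cols=""
for wanted in qseqid staxids bitscore sseqid qstart qend evalue; do
  col=1
  for field in $outfmt_fields; do
    if [ "$field" = "$wanted" ]; then
      hits_cols="${hits_cols:+$hits_cols,}$col=$wanted"
    fi
    col=$((col + 1))
  done
done

# Print the flag and value to paste into a blobtools add command.
echo "--hits-cols $hits_cols"
```

The printed value can then be passed to blobtools add alongside the --hits file that was generated with that column order.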
--taxrule
BlobTools2 assigns a putative taxonomic origin to each scaffold at 8 ranks from superkingdom to species by aggregating hits in the --hits files. The --taxrule flag determines the rule to use when assigning BLAST hits to taxa. Two options are available: bestsum and bestsumorder. Of the taxa represented in the BLAST hits, bestsum assigns the taxon for which the sum of bitscores is greatest across all files, while the default bestsumorder taxrule uses the greatest sum of bitscores in the first file and only uses hits from subsequent files for scaffolds without a hit in previous files.
Usage:
blobtools add \
...
--taxrule bestsum \
...
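The idea of summing bitscores per taxon can be sketched with awk on a single hits file. This is a simplified illustration only, not the BlobTools2 implementation: it sums per taxid rather than per rank, ignores multiple files, and uses made-up contig names, taxids and scores.

```shell
# Hypothetical hits using the first three default columns
# (qseqid, staxids, bitscore); the taxids are arbitrary placeholders.
hits=$(mktemp)
printf 'contig1\t1001\t200\ncontig1\t1001\t150\ncontig1\t2002\t300\ncontig2\t2002\t120\n' > "$hits"

# Sum bitscores per (query, taxid) pair, then report the taxid with the
# largest summed bitscore for each query.
best=$(awk -F'\t' '
  { sum[$1 FS $2] += $3 }
  END {
    for (key in sum) {
      split(key, parts, FS)
      if (!(parts[1] in top) || sum[key] > top[parts[1]]) {
        top[parts[1]] = sum[key]
        taxid[parts[1]] = parts[2]
      }
    }
    for (q in top) print q "\t" taxid[q] "\t" top[q]
  }' "$hits" | sort)

echo "$best"
rm -f "$hits"
```

In this toy input, contig1 has its single strongest hit to taxid 2002, but the two hits to taxid 1001 have a larger combined bitscore, so a sum-based rule favours 1001.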
--taxdump
Taxonomy information must be loaded from a local copy of the NCBI taxdump (see Install BlobToolKit > Databases).
Usage:
blobtools add \
...
--taxdump ~/blobtoolkit/taxdump \
...
--evalue
BLAST results can be filtered based on an evalue cutoff that will be applied before the --taxrule. Any hits with an evalue higher (weaker) than the value specified will be excluded.
Usage:
blobtools add \
...
--evalue 1e-50 \
...
--bitscore
BLAST results can be filtered based on a bitscore cutoff that will be applied before the --taxrule. Any hits with a bitscore lower than the value specified will be excluded.
Usage:
blobtools add \
...
--bitscore 100 \
...
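To get a rough idea of how many hits would survive a given pair of --evalue and --bitscore cutoffs before running blobtools add, a hits file in the default column layout can be counted with awk. This is a preview sketch under assumptions: the rows are made-up, the row helper is hypothetical, and treating hits exactly at a cutoff as kept is a guess rather than documented BlobTools2 behaviour.

```shell
# Hypothetical helper that writes one 14-column row; only column 3
# (bitscore) and column 14 (evalue) are meaningful in this sketch.
row() { printf '%s\tTAXID\t%s\tq\ts\t0\t0\t0\t0\t0\t0\t0\t0\t%s\n' "$1" "$2" "$3"; }

hits=$(mktemp)
{
  row contig1 250 1e-80   # passes both cutoffs
  row contig2 90  1e-60   # bitscore too low
  row contig3 150 1e-30   # evalue too weak
} > "$hits"

# Count rows whose evalue is at or below the cutoff and whose bitscore
# is at or above the cutoff.
kept=$(awk -F'\t' -v max_evalue=1e-50 -v min_bitscore=100 '
  ($14 + 0) <= (max_evalue + 0) && ($3 + 0) >= (min_bitscore + 0) { n++ }
  END { print n + 0 }' "$hits")

echo "hits kept: $kept"
rm -f "$hits"
```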
--hit-count
By default, the 10 highest scoring hits to a given taxon will be used when applying the --taxrule. To use more or fewer hits, set the --hit-count accordingly.
Usage:
blobtools add \
...
--hit-count 5 \
...
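The effect of limiting the number of hits per taxon can be sketched with sort and awk. This is an illustration of the idea only, assuming hits are ranked by bitscore within each query and taxid; the input rows are made-up and the actual selection logic inside BlobTools2 may differ.

```shell
# Hypothetical hits using the first three default columns
# (qseqid, staxids, bitscore); the taxids are arbitrary placeholders.
hits=$(mktemp)
printf 'c1\t1001\t50\nc1\t1001\t300\nc1\t1001\t200\nc1\t2002\t80\n' > "$hits"

hit_count=2
tab=$(printf '\t')

# Order hits by query, then taxid, then descending bitscore, and keep
# only the first hit_count rows for each (query, taxid) pair.
top=$(sort -t "$tab" -k1,1 -k2,2 -k3,3nr "$hits" |
  awk -F'\t' -v n="$hit_count" '++seen[$1 FS $2] <= n')

echo "$top"
rm -f "$hits"
```

With hit_count set to 2, the weakest of the three hits to taxid 1001 is dropped, while the single hit to taxid 2002 is kept.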
--replace
If a blobtools add command would overwrite an existing field, the default behaviour is to issue a warning and not replace the existing field. To change this behaviour and allow existing fields to be overwritten, set the --replace flag.
Usage:
blobtools add \
...
--replace \
...