Filtering a dataset – BlobToolKit

Datasets can be filtered based on the values in any variable (e.g. GC proportion and length) or category field (e.g. assigned phylum), or by using a list of identifiers (sequence IDs). Filters may be applied to a complete dataset, so that a reduced dataset can be used without repeating analyses, or to assembly FASTA and read FASTQ files, to allow for reassembly and reanalysis. Filter parameters are all shared between BlobTools2 and the BlobToolKit Viewer, allowing interactive sessions to be reproduced on the command line.

Usage

blobtools filter [options] /path/to/BlobDir

Example

Filter an assembly FASTA file to exclude sequences shorter than 1000 bp:

~/blobtoolkit/blobtools2/blobtools filter \
    --param length--Min=1000 \
    --fasta ~/BTK_TUTORIAL/FILES/ASSEMBLY_NAME.fasta \
    ~/BTK_TUTORIAL/DATASETS/ASSEMBLY_NAME

Filter parameters and outputs can be combined. For example, to also remove sequences with no hit in the reference databases and print a set of summary statistics for the filtered assembly to the terminal:

~/blobtoolkit/blobtools2/blobtools filter \
     --param length--Min=1000 \
     --param bestsumorder_phylum--Keys=no-hit \
     --fasta ~/BTK_TUTORIAL/FILES/ASSEMBLY_NAME.fasta \
     --summary STDOUT \
     ~/BTK_TUTORIAL/DATASETS/ASSEMBLY_NAME

To reproduce interactive filtering from the Viewer, filter parameters can be specified using a query string copied from the browser address bar (either the query string portion alone or the entire URL can be used):

~/blobtoolkit/blobtools2/blobtools filter \
     --query-string "http://localhost:8080/view/all/dataset/ASSEMBLY_NAME/blob?length--Min=1000&bestsumorder_phylum--Keys=no-hit" \
     --fasta ~/BTK_TUTORIAL/FILES/ASSEMBLY_NAME.fasta \
     ~/BTK_TUTORIAL/DATASETS/ASSEMBLY_NAME

If your interactive session includes a selection, details of the selection are not captured in the URL query string. Instead, the selection can be exported as a JSON format list file that includes details of any filter parameters and a list of selected contigs (see Reproducing interactive sessions for more details). This file can be used to filter the assembly with the --json option:

~/blobtoolkit/blobtools2/blobtools filter \
     --json /path/to/exported_list_file.json \
     --fasta ~/BTK_TUTORIAL/FILES/ASSEMBLY_NAME.fasta \
     ~/BTK_TUTORIAL/DATASETS/ASSEMBLY_NAME

Configuration

Configuration options for blobtools filter can be considered in three groups:

Setting filter parameters:

--param – String of type param=value to specify individual filter parameters.
--query-string – List of param=value pairs (separated by &) from a URL query string.
--json – JSON format list file as generated by the BlobToolKit Viewer.
--list – Text file containing a space or newline separated list of identifiers.
--invert – Flag to invert the filter (exclude matching records).
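
As a rough illustration of the kind of input --list expects, the sketch below writes a newline-separated identifier file from the headers of a FASTA file. The working directory, file names and two-record demo FASTA are hypothetical, and the sketch assumes that each sequence identifier is the first whitespace-delimited word of its header line.

```shell
# Hypothetical demo input; substitute your own assembly FASTA.
workdir=$(mktemp -d)
printf '>contig_1 length=8\nACGTACGT\n>contig_2\nGGCCGGCC\n' > "$workdir/assembly.fasta"

# Keep header lines only, dropping the leading '>' and anything after the first word.
sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$workdir/assembly.fasta" > "$workdir/ids.txt"

# Show the identifier list, one identifier per line.
cat "$workdir/ids.txt"
```

A file like this could then be trimmed to the identifiers of interest and passed to blobtools filter with --list.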

Specifying files to filter:

--output – Path to directory to save a filtered copy of the BlobDir.
--fasta – FASTA format assembly file to be filtered.
--fastq – FASTQ format read file to be filtered (requires --cov).
- --cov – BAM/SAM/CRAM read alignment file.
--text – Generic text file to be filtered.
- --text-delimiter – Text file delimiter. [Default: whitespace]
- --text-id-column – Index of column containing identifiers (1-based). [Default: 1]
- --text-header – Flag to indicate first row of text file contains field names. [Default: False]
--suffix – String to be added to filtered filename. [Default: filtered]
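
The file options above might be combined as in the following sketch, which wraps two invocations in shell functions so that the paths can be supplied as arguments. The function names and all paths are hypothetical, and passing a literal comma to --text-delimiter is an assumption about how a CSV delimiter would be given.

```shell
# Filter a read file. --fastq requires --cov, so both are passed together.
# Arguments (all hypothetical paths): BlobDir, BAM/SAM/CRAM file, FASTQ file.
filter_reads() {
    blobtools filter \
        --param length--Min=1000 \
        --fastq "$3" \
        --cov "$2" \
        "$1"
}

# Filter a comma-separated text file that has a header row and
# sequence identifiers in its second column.
# Arguments (hypothetical paths): BlobDir, text file.
filter_table() {
    blobtools filter \
        --param length--Min=1000 \
        --text "$2" \
        --text-delimiter ',' \
        --text-id-column 2 \
        --text-header \
        "$1"
}
```

For example, `filter_reads /path/to/BlobDir reads.bam reads.fastq` would run the first command with those three paths.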

Generating summary data:

--summary – Filename for a JSON-format summary of the filtered dataset.
- --summary-rank – Taxonomic level for summary. [Default: phylum]
- --taxrule – Taxrule used when processing hits. [Default: bestsumorder]
--table – Filename for a tabular output of filtered dataset.
- --table-fields – Comma separated list of field IDs to include in the table output. Use ‘plot’ to include all plot axes. [Default: plot]
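
A possible way to request both kinds of summary output in one run is sketched below. The function name and the paths are hypothetical; the field IDs gc, length and bestsumorder_phylum are the ones used elsewhere on this page, and whether they exist depends on the fields present in your BlobDir.

```shell
# Print a phylum-level summary to the terminal and write a table of
# selected fields for the filtered dataset.
# Arguments (hypothetical paths): BlobDir, output table file.
summarise_filtered() {
    blobtools filter \
        --param length--Min=1000 \
        --param bestsumorder_phylum--Keys=no-hit \
        --summary STDOUT \
        --summary-rank phylum \
        --table "$2" \
        --table-fields gc,length,bestsumorder_phylum \
        "$1"
}
```

Omitting --table-fields would fall back to the default of plot, which includes all plot axes.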

`--param`

Individual param=value pairs can be specified to filter based on Variable or Category fields.

Variable params operate on numeric values. Available options for Variable fields, such as gc and length, are:

<field_id>--Min – Lowest value to include.
<field_id>--Max – Highest value to include.
<field_id>--Inv – Include values outside the range specified by --Min and --Max.
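
To make the range semantics concrete, the following sketch applies the same logic to a small made-up table with awk. It illustrates what the options are described as selecting and is not the BlobTools2 implementation; the contig names and GC values are invented for the example.

```shell
# Hypothetical two-column table: sequence identifier, GC proportion.
values='contig_1 0.25
contig_2 0.30
contig_3 0.42
contig_4 0.61'

# Equivalent of gc--Min=0.3 with gc--Max=0.5: keep values inside the inclusive range.
inside=$(printf '%s\n' "$values" |
    awk -v min=0.3 -v max=0.5 '$2 >= min && $2 <= max { print $1 }')

# Equivalent of also setting gc--Inv: keep values outside the range instead.
outside=$(printf '%s\n' "$values" |
    awk -v min=0.3 -v max=0.5 '$2 < min || $2 > max { print $1 }')

echo "inside:" $inside
echo "outside:" $outside
```

Here contig_2 sits exactly on the lower bound and is kept in the first case, because --Min is described as the lowest value to include.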

Category params operate on keys. Available options for Category fields, such as bestsumorder_phylum, are:

<field_id>--Keys – Comma-separated list of strings matching category names, or integers matching category keys, to exclude.
<field_id>--Inv – Include rather than exclude the categories listed in <field_id>--Keys.
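
The category options can be pictured in the same way. The sketch below mimics the described behaviour on a made-up table using awk; it is an illustration under the assumption that keys are matched as exact category names, not the BlobTools2 implementation.

```shell
# Hypothetical two-column table: sequence identifier, assigned phylum.
assignments='contig_1 Arthropoda
contig_2 no-hit
contig_3 Nematoda
contig_4 Arthropoda'

# Equivalent of bestsumorder_phylum--Keys=no-hit,Nematoda:
# drop sequences assigned to any listed category.
kept_default=$(printf '%s\n' "$assignments" | awk -v keys='no-hit,Nematoda' '
    BEGIN { n = split(keys, k, ","); for (i = 1; i <= n; i++) listed[k[i]] = 1 }
    !($2 in listed) { print $1 }')

# Equivalent of also setting bestsumorder_phylum--Inv:
# keep only sequences assigned to a listed category.
kept_inverted=$(printf '%s\n' "$assignments" | awk -v keys='no-hit,Nematoda' '
    BEGIN { n = split(keys, k, ","); for (i = 1; i <= n; i++) listed[k[i]] = 1 }
    ($2 in listed) { print $1 }')

echo "default:" $kept_default
echo "inverted:" $kept_inverted
```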

Usage:

blobtools filter \
    --param gc--Min=0.3 \
    --param bestsumorder_phylum--Keys=no-hit
    ...

`--query-string`

Lists of parameters can be specified using the URL query-string format (param1=value&param2=value). For convenience a complete URL can be copied from the browser address bar during an interactive Viewer session and pasted as a --query-string to reproduce the session on the command line.

Usage:

blobtools filter \
    --query-string "gc--Min=0.3&bestsumorder_phylum--Keys=no-hit"
    ...

or:

blobtools filter \
    --query-string "http://localhost:8080/view/all/dataset/ASSEMBLY_NAME/blob?gc--Min=0.3&bestsumorder_phylum--Keys=no-hit"
    ...
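
Since either form is accepted, extracting the query string by hand is optional, but the sketch below shows how the two forms relate and how the same pairs map onto individual --param arguments. The variable names are illustrative only, and the mapping to --param assumes the two options interpret param=value pairs identically.

```shell
# Full URL as copied from the browser address bar.
url='http://localhost:8080/view/all/dataset/ASSEMBLY_NAME/blob?gc--Min=0.3&bestsumorder_phylum--Keys=no-hit'

# Strip everything up to and including the first '?'.
query="${url#*[?]}"

# Rewrite each '&'-separated pair as its own --param argument.
params=$(printf '%s\n' "$query" | tr '&' '\n' | sed 's/^/--param /')

printf '%s\n' "$query"
printf '%s\n' "$params"
```

The first line printed is the bare query string; the remaining lines are the corresponding --param arguments, one per pair.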

`--json`

Selection-based filters are not captured in the query string. To reproduce an interactive selection on the command line, export the current selection from the Viewer as a list file, which can then be loaded using the --json option.

Usage:

blobtools filter \
    --json examples/list.json
    ...

`--invert`

All filters can be inverted to make them inclusive rather than exclusive.

Usage:

blobtools filter \
    --invert
    ...

`--output`

Use the --output flag to specify an output directory to create a filtered BlobDir dataset based on any specified filter parameters.

Usage:

blobtools filter \
    ...
    --output /path/to/filtered/BlobDir
    ...