Configuring the pipeline

The Pipeline needs to be configured to match your local compute environment and to set parameters specific to your assembly and read data. These configuration options are set in a YAML format file before the Pipeline is run. The Pipeline has been designed for use with publicly available assemblies, but most configuration options are equally applicable for use with local datasets. Starting with a configuration file for a publicly available Drosophila albomicans assembly, the configuration is divided into six sections: assembly, busco, reads, settings, similarity, taxon and keep_intermediates:

assembly:
  accession: GCA_000298335.1
  alias: DroAlb_1.0
  bioproject: PRJNA39511
  biosample: SAMN00003213
  level: scaffold
  scaffold-count: 26354
  span: 253560284
  prefix: ACVV01

busco:
  lineages:
    - diptera_odb9,
    - arthropoda_odb9
    - eukaryota_odb9
  lineage_dir: /path/to/busco/lineages 

reads:
  paired:
    - 
      - SRR026696
      - ILLUMINA
      - 482114248
      - ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026696/SRR026696_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026696/SRR026696_2.fastq.gz
    - 
      - SRR026697
      - ILLUMINA
      - 552054360
      - ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026697/SRR026697_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026697/SRR026697_2.fastq.gz
  single: []
  coverage:
    max: 100
    min: 0.5
  
settings:
  blobtools2_path: /path/to/blobtoolkit/blobtools2
  taxonomy: /path/to/blobtoolkit/taxdump/
  tmp: /tmp
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_overlap: 500
  chunk: 1000000

similarity:
  defaults:
    evalue: 1e-25
    max_target_seqs: 10
    root: 1
    mask_ids: 
      - 7215
  databases:
    - 
      local: /path/to/databases/ncbi_2019_08
      name: nt
      source: ncbi
      tool: blast
      type: nucl
    - 
      local: /ceph/software/databases/uniprot_2019_07
      max_target_seqs: 1
      name: reference_proteomes
      source: uniprot
      tool: diamond
      type: prot
  taxrule: bestsumorder

taxon:
  taxid: 7291
  name: Drosophila albomicans

keep_intermediates: true

`assembly`

assembly:
  accession: GCA_000298335.1
  alias: DroAlb_1.0
  bioproject: PRJNA39511
  biosample: SAMN00003213
  level: scaffold
  scaffold-count: 26354
  span: 253560284
  prefix: ACVV01

The assembly section for a publicly available assembly contains details of the assembly GCA accession and alias to allow the assembly FASTA to be retrieved from NCBI. Assembly bioproject and biosample are included to allow direct links to the source data from within the BlobToolKit Viewer. Assembly level can be set to scaffold, contig or chromosome depending on the top level to which the genome has been assembled. The scaffold-count and span provide metadata that can be checked against the values obtained when importing the assembly and span can be used to determine read coverage prior to mapping. Finally the prefix is set to the assembly WGS accession by convention and is used to determine the assembly filename. For local assemblies many of these fields may be omitted and the prefix should be set to the basename of your assembly FASTA file, e.g. if this were a local assembly saved as Dalbomicans_v1.fasta:

assembly:
  accession: draft
  level: scaffold
  scaffold-count: 26354
  span: 253560284
  prefix: Dalbomicans_v1

`busco`

busco:
  lineages:
    - diptera_odb9,
    - arthropoda_odb9
    - eukaryota_odb9
  lineage_dir: /path/to/busco/lineages

Running BUSCO analyses as part of the Pipeline is optional, but the section should be present in either case. A separate BUSCO analysis will be run for each of the listed lineages. If any of the listed lineages are not already available in the lineage_dir, they will be fetched automatically when the Pipeline is run. Valid lineage names are any lineages available from the BUSCO website. To run the Pipeline without any BUSCO analyses, the list of lineages should be left empty and a minimal busco section would be:

busco:
   lineage_dir: /path/to/busco/lineages
   lineages: []

`reads`

reads:
  paired:
    - 
      - SRR026696
      - ILLUMINA
      - 482114248
      - ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026696/SRR026696_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026696/SRR026696_2.fastq.gz
    - 
      - SRR026697
      - ILLUMINA
      - 552054360
      - ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026697/SRR026697_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR026/SRR026697/SRR026697_2.fastq.gz
  single: []
  coverage:
    max: 100
    min: 0.5

The reads section contains information on paired and/or single ended read files to be mapped to the genome. Either or both of these sections may be present, depending on the read files available for mapping, and both follow the same pattern. Each read file is represented by an array of up to four values: (i) name, (ii) sequencing platform, (iii) base count and (iv) remote file URL(s). For local files, only the first two values are required.

(i) For publicly available assemblies, the name should be the SRA accession and must match the accession in the remote file URL. For local assemblies the name should match the basename of your read FASTQ file. Local filenames must match the pattern name.fastq.gz for single end reads, name_subreads.fastq.gz for PacBio reads or name_1.fastq.gz and name_2.fastq.gz for paired end files.

(ii) The sequencing platform is used to set the appropriate parameters for read mapping using Minimap2. Valid values are “ILLUMINA”, “OXFORD_NANOPORE”, “PACBIO_SMRT” and “LS454”.

(iii) If set, the base count value is used in conjunction with the coverage subsection described below to determine whether read files should be subsampled prior to mapping.

(iv) For public assemblies, the remote file URL(s) from which the read files can be obtained must be set so they can be fetched by the Pipeline. for paired files, the two URLs should be separated by a semicolon as in the example above.

The coverage subsection allows max and min coverage values to be set based on base count values provided in the paired or single subsections. Files with a coverage (base count / assembly.span) greater than max will be subsampled using seqtk, while files with less than min coverage will not be mapped.

To map all reads in local PacBio read files named library1_subreads.fastq.gz and library2_subreads.fastq.gz, regardless of coverage, the reads section would look like:

reads:
  single:
    - 
      - library1
      - PACBIO_SMRT
    - 
      - library2
      - PACBIO_SMRT

To map all reads in local paired Illumina read files named readfile_1.fastq.gz and readfile_2.fastq.gz, the reads section would look like:

reads:
  paired:
    - 
      - readfile
      - ILLUMINA

settings

settings:
  blobtools2_path: /path/to/blobtoolkit/blobtools2
  taxonomy: /path/to/databases/taxonomy_2019_08
  tmp: /tmp
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_overlap: 500
  chunk: 1000000

Configuration options in the settings section are used to set some program/file locations that are not handled elsewhere and a number of parameters for the BLAST wrapper script. Most Pipeline dependencies are handled using Conda environments, however the blobtools2_path must be set to the location of this program on your local system.

The Pipeline requires a local copy of the NCBI taxonomy new_taxdump, which will be automatically fetched and added to the taxonomy directory. This file should match the taxonomy used in the nt database (see below) so it is useful to keep an explicit record of the date downloaded in the directory name. Finally, some database preparation steps can use a large amount of disk space so a suitable tmp directory should be specified.

The remaining parameters set defaults that need to be included in the file but will not normally need to be changed. These mostly relate to the BLAST wrapper script that splits long scaffold sequences into chunks before running BLAST searches to avoid subsequent taxonomic inference being based on a single region within the sequence. blast_chunk value determines a minimum length of the chunks that long scaffolds will be split into by the BLAST wrapper script. blast_max_chunks ensures that very long scaffolds will only be split into a limited number of chunks, for scaffolds longer than blast_chunk x blast-max_chunks, the chunk length will be increased so all subsequences are of equal size. The blast_overlap value sets a small overlap between chunks to allow for hits that span the break points. (The similarly named chunk parameter is unrelated and should no longer be required but this needs testing.)

`similarity`

similarity:
  defaults:
    evalue: 1e-25
    max_target_seqs: 10
    root: 1
    mask_ids: 
      - 7215
  databases:
    - 
      local: /path/to/databases/ncbi_2019_08
      name: nt
      source: ncbi
      tool: blast
      type: nucl
    - 
      local: /ceph/software/databases/uniprot_2019_07
      max_target_seqs: 1
      name: reference_proteomes
      source: uniprot
      tool: diamond
      type: prot
  taxrule: bestsumorder

The similarity section controls the databases and settings used for sequence similarity searches. The structure of this section reflects a slightly more flexible approach to database specification than is currently in use.

Similarity defaults are settings applied to all sequence similarity searches and in the example file includes the evalue and max_target_seqs values. The root value is the NCBI Taxonomy ID of the root of the clade that you wish to search against. For analyses of public datasets we always set this to 1 to include all taxa but it provides the option to limit a search to a single kingdom or phylum when analysing local assemblies. mask_ids is a list of Taxonomy IDs to for which all descendants will be excluded from the similarity searches. For the analysis of public assemblies we set this to the genus of the taxon being analysed (7215 is Drosophila) to avoid assigning taxonomy based on the data in the assembly that may already be in the database we are searching against. For analyses of local assemblies this should usually be set to be an empty list (i.e. mask_ids: []).

The databases subsection is a list of database-specific parameters. local is the path to a local copy of the database (that will be fetched/generated if it does not already exist). name is the database name, for NCBI BLAST searches this must match nt and be a version 5 database to support the taxonomy-based filtering discussed above. source is either “ncbi” or “uniprot” and determines where the database files should be fetched from. tool is either “blast” for NCBI BLAST or “diamond” and type is “nucl” for nucleotide or “proc” for protein. The flexibility that this implies has not been tested recently so it is best to leave all but local unchanged.

The taxrule specifies which BlobTools2 taxrule should be used to assign taxonomic labels to the assembly scaffolds. In practice, this should always be set to “bestsumorder” as Diamond searches are only performed for scaffolds with no NCBI BLAST hits.

`taxon`

taxon:
  taxid: 7291
  name: Drosophila albomicans

The taxon section should contain a NCBI taxid to allow full taxonomic information to be added to the dataset metadata. The taxon name should be the species, subspecies or strain name, as appropriate.

`keep_intermediates`

keep_intermediates: true

keep_intermediates is a flag that should be set to “true” if you wish to keep all intermediate files generated by the pipeline. When running the pipeline on public assemblies we set this flag to “false” so some large files are discarded once they have been used. For local assemblies, these files (e.g. read-alignment BAM files) are likely to be useful for further analyses.