Getting Started¶
The pipeline is built using Nextflow as its workflow manager.
Entry Points¶
Currently, there are three entry points for the Aquascope pipeline:
- QUALITY_ALIGN: for executing quality control, quality reporting, and alignment
- FREYJA_ONLY: for executing the freyja sub-workflow, including variant calling and abundance estimation
  - Requires aligned and trimmed BAM files as input
- AQUASCOPE: for executing both QUALITY_ALIGN and FREYJA_ONLY as an end-to-end analysis
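As a minimal sketch of choosing between the entry points listed above: the helper below only prints a command so it can be reviewed before anything is run. It assumes that the entry point is selected with Nextflow's `-entry` option and that the pipeline is launched as `CDCgov/aquascope` with the `singularity` profile. The source does not state the selection mechanism, so check the pipeline's own usage notes.

```shell
#!/usr/bin/env bash
# Print (not run) a Nextflow command for one of the three entry points.
# ASSUMPTION: the entry point is selected with Nextflow's -entry option.
aquascope_cmd() {
  entry="$1"
  case "$entry" in
    QUALITY_ALIGN|FREYJA_ONLY|AQUASCOPE) ;;
    *) echo "unknown entry point: $entry" >&2; return 1 ;;
  esac
  echo "nextflow run CDCgov/aquascope -entry $entry -profile singularity"
}

aquascope_cmd FREYJA_ONLY
```

Rejecting unknown names up front avoids launching a run with a mistyped entry point.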
Processes¶
- FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
- NanoPlot gives general quality metrics about your sequenced reads. It is a plotting tool for long-read sequencing data and alignments.
- Fastp is a tool designed to provide fast all-in-one preprocessing for FastQ files. It is developed in C++ with multithreading support for high performance.
- Qualimap examines sequencing alignment data in SAM/BAM files according to the features of the mapped reads. It provides an overall view of the data that helps to detect biases in the sequencing and/or mapping and eases decision-making for further analysis.
- Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.
- Samtools is a suite of programs for interacting with high-throughput sequencing data.
- ivarTrim: iVar uses primer positions supplied in a BED file to soft-clip primer sequences from an aligned and sorted BAM file. The reads are then trimmed based on a quality threshold (default: 20).
- AmpliconClip clips the ends of read alignments if they intersect with regions defined in a BED file. Although this tool was originally written for clipping read alignment positions that correspond to amplicon primer locations, it can also be used in other contexts.
- ivarVariantCalling: iVar uses the output of the samtools mpileup command to call variants, both single nucleotide variants (SNVs) and indels.
- Freyja performs variant calling using samtools and iVar on a BAM file and generates relative lineage abundances from the resulting variants and depths.
- MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
- Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Dependencies¶
- Install Nextflow (>=21.04.0)
- Install any necessary software, based on deployment strategy, visiting the docs for configuration-related information: Singularity
- The following software is also utilized:
  - python=3.9
  - samtools=1.21
  - fastqc=0.12.1
  - nanoplot=1.41.6
  - fastp=0.23.4
  - qualimap=2.3
  - minimap2=2.24
  - multiqc=1.21
  - freyja=1.5.2
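The Nextflow version requirement above can be checked before launching anything. The following is a sketch, assuming GNU `sort -V` is available and that `nextflow -version` prints a line containing `version X.Y.Z`; the `version_ge` helper is illustrative and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Succeed when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="21.04.0"
if command -v nextflow >/dev/null 2>&1; then
  # ASSUMPTION: the output contains a line such as "version 23.04.1 build 5866".
  installed="$(nextflow -version 2>/dev/null | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1)"
  if [ -n "$installed" ] && version_ge "$installed" "$required"; then
    echo "Nextflow $installed satisfies >= $required"
  else
    echo "Nextflow version '$installed' could not be confirmed as >= $required" >&2
  fi
else
  echo "nextflow not found on PATH" >&2
fi
```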
Core Nextflow arguments¶
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
-profile¶
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Multiple profiles can be loaded, for example: -profile test,docker. The order of arguments is important: profiles are loaded in sequence, so later profiles can overwrite earlier ones.
If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended.
- docker: A generic configuration profile to be used with Docker
- singularity: A generic configuration profile to be used with Singularity
- podman: A generic configuration profile to be used with Podman
- shifter: A generic configuration profile to be used with Shifter
- charliecloud: A generic configuration profile to be used with Charliecloud
- conda: A generic configuration profile to be used with Conda. Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
- test: A profile with a complete configuration for automated testing. It includes links to test data, so it needs no other parameters.
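Because profiles are loaded in the order given, it can help to build the comma-separated list explicitly. This is a small sketch: `join_profiles` is a hypothetical helper, launching the pipeline as `CDCgov/aquascope` is an assumption, and the command is printed instead of executed.

```shell
#!/usr/bin/env bash
# Join profile names with commas, preserving their order. Nextflow loads
# profiles in sequence, so a later one can override an earlier one.
join_profiles() {
  local IFS=,
  echo "$*"
}

profiles="$(join_profiles test singularity)"
# Printed rather than executed, so the command can be reviewed first.
echo "nextflow run CDCgov/aquascope -profile $profiles"
```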
-resume¶
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
-c¶
Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.
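As an illustration of a config file passed with -c, the sketch below writes a small file and prints the command that would use it. The file name `custom.config`, the resource values, and the `CDCgov/aquascope` launch target are assumptions chosen for the example. The `process` scope with `cpus` and `memory` is standard Nextflow configuration.

```shell
#!/usr/bin/env bash
# Write a small custom config and show how it would be passed with -c.
# The file name and the resource values are illustrative assumptions.
config_file="custom.config"
cat > "$config_file" <<'EOF'
// Default resources applied to every process unless overridden.
process {
    cpus   = 2
    memory = '4 GB'
}
EOF

# Printed rather than executed, so the command can be reviewed first.
echo "nextflow run CDCgov/aquascope -profile singularity -c $config_file"
```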
Nextflow memory requirements¶
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile):
export NXF_OPTS='-Xms1g -Xmx4g'
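One way to add this setting without duplicating it on repeated runs is sketched below. The `add_nxf_opts` helper is hypothetical. It writes the line with `export` so that the variable is passed on to the `nextflow` process. The demonstration uses a temporary file, so no real startup file is modified.

```shell
#!/usr/bin/env bash
# Append the NXF_OPTS setting to a startup file only if it is not
# already present, so running this twice does not duplicate the line.
add_nxf_opts() {
  rc_file="$1"
  line="export NXF_OPTS='-Xms1g -Xmx4g'"
  touch "$rc_file"
  if ! grep -qxF "$line" "$rc_file"; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}

# Demonstrated on a temporary file; pass ~/.bashrc to apply it for real.
demo_rc="$(mktemp)"
add_nxf_opts "$demo_rc"
add_nxf_opts "$demo_rc"   # second call finds the line and changes nothing
cat "$demo_rc"
```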
Reproducibility¶
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the CDCgov/aquascope releases page and find the latest version number - numeric only (e.g. 3.0.0). Then specify this when running the pipeline with -r (one hyphen), e.g. -r 3.0.0.
This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.
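A pinned run could be assembled as in the sketch below, which accepts only numeric tags of the form described above. The `pinned_cmd` helper, the `singularity` profile, and launching the pipeline as `CDCgov/aquascope` are assumptions for illustration. The command is printed, not executed.

```shell
#!/usr/bin/env bash
# Accept only numeric release tags such as 3.0.0, then print the command.
pinned_cmd() {
  version="$1"
  if ! printf '%s\n' "$version" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "expected a numeric version like 3.0.0, got: $version" >&2
    return 1
  fi
  echo "nextflow run CDCgov/aquascope -r $version -profile singularity"
}

pinned_cmd 3.0.0
```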