Genome Assembly – Practical | CDC Influenza Division Bioinformatics Workshop

Genome Assembly Practical Exercises

These exercises accompany the Genome Assembly and MIRA-NF module. They walk you through downloading FASTQ files from SRA, preparing samplesheets, running MIRA-NF, and building a complete wrapper script.

Exercises developed by Kristine Lacek

Exercise 1 — FASTQ Knowledge Check

Question:

How many lines does a single read occupy in a FASTQ file?

Attempt the answer; feedback will appear below.

Question:

What character appears on line 3 of every FASTQ record?

Attempt the answer; feedback will appear below.

Exercise 2 — FASTQ Download and Preparation

Create a working directory structure:

mkdir ~/MIRA_NGS
mkdir ~/MIRA_NGS/day3
cd ~/MIRA_NGS/day3

Using sra-toolkit and a loop, pull down the following samples from SRA:

SRR accession	Sample ID
SRR37675411	f58cd412
SRR37675412	f133a406
SRR37675413	e91ea59e

fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' <SRR_ACCESSION>

Rename the FASTQs according to the sample ID mapping, adding _R1 and _R2 suffixes.
Compress all FASTQs with gzip.
Move the compressed files to a subfolder called fastqs.

Question:

What two commands create ~/MIRA_NGS and then ~/MIRA_NGS/day3 inside it?

Attempt the answer; feedback will appear below.

Possible Solution

mkdir ~/MIRA_NGS
mkdir ~/MIRA_NGS/day3
cd ~/MIRA_NGS/day3

# Download from SRA
for i in SRR37675411 SRR37675412 SRR37675413; do
    fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' $i
done

# Create accessions mapping
cat > accessions.txt << EOF
SRR37675411,f58cd412
SRR37675412,f133a406
SRR37675413,e91ea59e
EOF

# Rename using the mapping
for i in $(cat accessions.txt); do
    mv $(echo $i | cut -f1 -d,)_1.fastq $(echo $i | cut -f2 -d,)_R1.fastq
    mv $(echo $i | cut -f1 -d,)_2.fastq $(echo $i | cut -f2 -d,)_R2.fastq
done

# Compress and organize
gzip *.fastq
mkdir fastqs
mv *.fastq.gz fastqs

Exercise 3 — Build a Samplesheet for MIRA-NF

Create a samplesheet using echo, ls, cut, uniq, and sed:

echo "sample_id,sample_type" > samplesheet.csv
ls fastqs | cut -f1 -d_ | uniq | sed "s/$/,Test/g" >> samplesheet.csv

Verify the samplesheet looks correct with cat samplesheet.csv.

Question:

What is the header line for an Illumina MIRA-NF samplesheet?

Attempt the answer; feedback will appear below.

Exercise 4 — Build a Mira-NF Execution Statement

Build a Mira-NF Docker run command. You can write it directly as a shell script (.sh file) to reuse in the future.

The first time you run this, it will pull the image from Docker Hub, which may take a few minutes. Subsequent runs will be faster.

Create a file called run_mira.sh using vim or another editor.

Add the following content, adjusting paths to match your setup:

#!/bin/bash
docker run \
 --privileged \
 -v ${PWD}:/data \
 cdcgov/mira-nf:v2.1.1 \
 nextflow run /MIRA-NF/main.nf \
     -profile mira_nf_container \
     --input /data/samplesheet.csv \
     --runpath /data \
     --outdir /data/mira-output \
     --e Flu-Illumina \
     --nextclade true

Make it executable and run:
```
chmod +x run_mira.sh
./run_mira.sh
```

Question:

Which Mira-NF flag specifies the experiment type (e.g., Flu-Illumina)?

Attempt the answer; feedback will appear below.

Question:

What Nextflow flag sets the compute environment profile (e.g., mira_nf_container)?

Attempt the answer; feedback will appear below.

Exercise 5 — Putting It All Together: Full Pipeline Script

For a second run, build a complete wrapper script (mira_wrapper.sh) that handles everything from downloading FASTQs to running MIRA-NF.

Use the following accessions:

SRR37675414,d969e179
SRR37675415,ba21bd1f
SRR37675416,ad336cc2
SRR37675417,89bb6967
SRR37675418,81ae9ee5
SRR37675419,73ceed0e
SRR37675420,6796f13d
SRR37675421,314ac5ba
SRR37675422,240aa994
SRR37675423,239e44e6

Your script should:

Download all FASTQs from SRA using a loop
Rename them using the sample ID mapping
Compress the FASTQs
Create a fastqs/ directory and move files into it
Generate the samplesheet automatically
Run MIRA-NF

Possible Solution

#!/bin/bash

# Download FASTQs from SRA
for i in $(cut -f1 -d, accessions.txt); do
    fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' $i
done

# Rename FASTQs using sample IDs
for i in $(cat accessions.txt); do
    mv $(echo $i | cut -f1 -d,)_1.fastq $(echo $i | cut -f2 -d,)_R1.fastq
    mv $(echo $i | cut -f1 -d,)_2.fastq $(echo $i | cut -f2 -d,)_R2.fastq
done

# Compress and organize
gzip *.fastq
mkdir fastqs
mv *.fastq.gz fastqs

# Generate samplesheet
echo "sample_id,sample_type" > samplesheet.csv
ls fastqs | cut -f1 -d_ | uniq | sed "s/$/,Test/g" >> samplesheet.csv

# Run MIRA-NF
docker run \
    --privileged \
    -v ${PWD}:/data \
    cdcgov/mira-nf:v2.1.1 \
    nextflow run /MIRA-NF/main.nf \
        -profile mira_nf_container \
        --input /data/samplesheet.csv \
        --runpath /data \
        --outdir /data/mira-output \
        --e Flu-Illumina \
        --nextclade true

Exercise 6 — Post-Pipeline Challenges

After MIRA-NF completes, extend your wrapper script with one or more of the following:

Copy the amended consensus.fasta to your home directory.
Use sed on the amended consensus FASTA to convert sample IDs into strain names.
Rename the aggregate_outputs directory to <runID>_outputs.
Count the number of samples that pass QC by grepping the aggregate_outputs/csv-reports/<runid>_summary.csv.
Challenge: Loop through the amended consensus.fasta and break it into 8 per-segment multifastas.

Possible Solutions

Copy consensus: cp outputs/aggregate_outputs/mira-reports/*amended_consensus.fasta ~/
Sed strain names: Use sed -s with an alternative delimiter (% or #) since strain names contain /
Count passing samples: grep -c "PASS" outputs/aggregate_outputs/csv-reports/*_summary.csv
Per-segment split: Use grep with segment names and redirect to individual files in a loop

Exercise 8 — Multiple Sequence Alignment Practical

Using the Mira output amended_consensus.fasta, perform the following:

Extract the H3 HA segments and align them with MAFFT. Do any sequences have gaps in your MSA?
Do the same for A_PB2 segments.
What happens if you try to align all HA segments (H1, H3, H5, and B) together?

Question:

What happens when you try to align H1, H3, H5, and B HA segments together with MAFFT?

Attempt the answer; feedback will appear below.

Possible Solution

# Extract H3 HA segments and align
grep -A1 HA_H3 aggregate_outputs/mira-reports/mira_pipeline_test_amended_consensus.fasta | sed "s/--//g" > H3.fasta
mafft H3.fasta > aligned_h3.fasta

Exercise 9 — MSA Pipeline Extension

Add to your Mira wrapper shell script the commands for extracting and aligning pass-QC sequences for:

HA: A_HA_H1, B_HA, A_HA_H3
NA: B_NA, A_NA_N1, A_NA_N2

Exercise 10 — BLAST Practical

Using NCBI BLAST, determine which season and region (Americas, Eurasia, Africa, Oceania) each HA segment most closely matches.
Using GISAID BLAST, how do your results differ?
Discussion: What are some factors that might influence your BLAST results?

Question:

For influenza, which BLAST database generally provides more complete results — NCBI or GISAID?

Attempt the answer; feedback will appear below.