Genome Assembly Practical Exercises

These exercises accompany the Genome Assembly and MIRA-NF module. They walk you through downloading FASTQ files from SRA, preparing samplesheets, running MIRA-NF, and building a complete wrapper script.


Exercises developed by Kristine Lacek

Exercise 1 — FASTQ Knowledge Check

Question:

How many lines does a single read occupy in a FASTQ file?

Attempt the answer; feedback will appear below.

Question:

What character appears on line 3 of every FASTQ record?

Attempt the answer; feedback will appear below.

Exercise 2 — FASTQ Download and Preparation

  1. Create a working directory structure:

    mkdir ~/MIRA_NGS
    mkdir ~/MIRA_NGS/day3
    cd ~/MIRA_NGS/day3
    
  2. Using sra-toolkit and a loop, pull down the following samples from SRA:

    SRR accession Sample ID
    SRR37675411 f58cd412
    SRR37675412 f133a406
    SRR37675413 e91ea59e
    fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' <SRR_ACCESSION>
    
  3. Rename the FASTQs according to the sample ID mapping, adding _R1 and _R2 suffixes.

  4. Compress all FASTQs with gzip.

  5. Move the compressed files to a subfolder called fastqs.

Question:

What two commands create ~/MIRA_NGS and then ~/MIRA_NGS/day3 inside it?

Attempt the answer; feedback will appear below.
Possible Solution
mkdir ~/MIRA_NGS
mkdir ~/MIRA_NGS/day3
cd ~/MIRA_NGS/day3

# Download from SRA
for i in SRR37675411 SRR37675412 SRR37675413; do
    fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' $i
done

# Create accessions mapping
cat > accessions.txt << EOF
SRR37675411,f58cd412
SRR37675412,f133a406
SRR37675413,e91ea59e
EOF

# Rename using the mapping
for i in $(cat accessions.txt); do
    mv $(echo $i | cut -f1 -d,)_1.fastq $(echo $i | cut -f2 -d,)_R1.fastq
    mv $(echo $i | cut -f1 -d,)_2.fastq $(echo $i | cut -f2 -d,)_R2.fastq
done

# Compress and organize
gzip *.fastq
mkdir fastqs
mv *.fastq.gz fastqs

Exercise 3 — Build a Samplesheet for MIRA-NF

  1. Create a samplesheet using echo, ls, cut, uniq, and sed:

    echo "sample_id,sample_type" > samplesheet.csv
    ls fastqs | cut -f1 -d_ | uniq | sed "s/$/,Test/g" >> samplesheet.csv
    
  2. Verify the samplesheet looks correct with cat samplesheet.csv.

Question:

What is the header line for an Illumina MIRA-NF samplesheet?

Attempt the answer; feedback will appear below.

Exercise 4 — Build a Mira-NF Execution Statement

Build a Mira-NF Docker run command. You can write it directly as a shell script (.sh file) to reuse in the future.

  • The first time you run this, it will pull the image from Docker Hub, which may take a few minutes. Subsequent runs will be faster.
  1. Create a file called run_mira.sh using vim or another editor.

  2. Add the following content, adjusting paths to match your setup:

    #!/bin/bash
    docker run \
     --privileged \
     --user $(id -u):$(id -g) \
     -v ${PWD}:/data \
     cdcgov/mira-nf:v2.1.0 \
     nextflow run /MIRA-NF/main.nf \
         -profile mira_nf_container \
         --input /data/samplesheet.csv \
         --runpath /data \
         --outdir /data/mira-output \
         --e Flu-Illumina \
         --nextclade true
    
  3. Make it executable and run:

    chmod +x run_mira.sh
    ./run_mira.sh
    

Question:

Which Mira-NF flag specifies the experiment type (e.g., Flu-Illumina)?

Attempt the answer; feedback will appear below.

Question:

What Nextflow flag sets the compute environment profile (e.g., mira_nf_container)?

Attempt the answer; feedback will appear below.

Exercise 5 — Putting It All Together: Full Pipeline Script

For a second run, build a complete wrapper script (mira_wrapper.sh) that handles everything from downloading FASTQs to running MIRA-NF.

Use the following accessions:

SRR37675414,d969e179
SRR37675415,ba21bd1f
SRR37675416,ad336cc2
SRR37675417,89bb6967
SRR37675418,81ae9ee5
SRR37675419,73ceed0e
SRR37675420,6796f13d
SRR37675421,314ac5ba
SRR37675422,240aa994
SRR37675423,239e44e6

Your script should:

  1. Download all FASTQs from SRA using a loop
  2. Rename them using the sample ID mapping
  3. Compress the FASTQs
  4. Create a fastqs/ directory and move files into it
  5. Generate the samplesheet automatically
  6. Run MIRA-NF
Possible Solution
#!/bin/bash

# Download FASTQs from SRA
for i in $(cut -f1 -d, accessions.txt); do
    fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' $i
done

# Rename FASTQs using sample IDs
for i in $(cat accessions.txt); do
    mv $(echo $i | cut -f1 -d,)_1.fastq $(echo $i | cut -f2 -d,)_R1.fastq
    mv $(echo $i | cut -f1 -d,)_2.fastq $(echo $i | cut -f2 -d,)_R2.fastq
done

# Compress and organize
gzip *.fastq
mkdir fastqs
mv *.fastq.gz fastqs

# Generate samplesheet
echo "sample_id,sample_type" > samplesheet.csv
ls fastqs | cut -f1 -d_ | uniq | sed "s/$/,Test/g" >> samplesheet.csv

# Run MIRA-NF
docker run \
    --privileged \
    -v ${PWD}:/data \
    --user $(id -u):$(id -g) \
    cdcgov/mira-nf:v2.1.0 \
    nextflow run /MIRA-NF/main.nf \
        -profile mira_nf_container \
        --input /data/samplesheet.csv \
        --runpath /data \
        --outdir /data/mira-output \
        --e Flu-Illumina \
        --nextclade true

Exercise 6 — Post-Pipeline Challenges

After MIRA-NF completes, extend your wrapper script with one or more of the following:

  1. Copy the amended consensus.fasta to your home directory.
  2. Use sed on the amended consensus FASTA to convert sample IDs into strain names.
  3. Rename the aggregate_outputs directory to <runID>_outputs.
  4. Count the number of samples that pass QC by grepping the aggregate_outputs/csv-reports/<runid>_summary.csv.
  5. Challenge: Loop through the amended consensus.fasta and break it into 8 per-segment multifastas.
Possible Solutions
  • Copy consensus: cp outputs/aggregate_outputs/mira-reports/*amended_consensus.fasta ~/
  • Sed strain names: Use sed -s with an alternative delimiter (% or #) since strain names contain /
  • Count passing samples: grep -c "PASS" outputs/aggregate_outputs/csv-reports/*_summary.csv
  • Per-segment split: Use grep with segment names and redirect to individual files in a loop

Exercise 7 — Samtools Practical

Navigate to the Mira output from your first run and perform the following tasks using samtools:

  1. View the first 10 lines of each A_HA.bam file across the three sample outputs. Approximately where in the gene are these reads?
  2. Do these BAM files need to be sorted?
  3. Remove the A_HA.bam.bai file, then re-index the HA BAM file with samtools index.
  4. Using ls -lah, how large is the A_HA.bam file? Using samtools view with redirect or the -o argument, convert A_HA.bam to A_HA.sam. How large is the SAM file?

Question:

If the first reads in a BAM file all map to position 1, does the file need to be sorted?

Attempt the answer; feedback will appear below.
Possible Solution
# View first 10 aligned reads
samtools view A_HA_H1.bam | head -10

# First reads all map to position 1 — the file is already sorted

# Remove and recreate the index
rm A_HA_H1.bam.bai
samtools index A_HA_H1.bam

# Convert BAM to SAM
samtools view -o A_HA_H1.sam A_HA_H1.bam

Exercise 8 — Multiple Sequence Alignment Practical

Using the Mira output amended_consensus.fasta, perform the following:

  1. Extract the H3 HA segments and align them with MAFFT. Do any sequences have gaps in your MSA?
  2. Do the same for A_PB2 segments.
  3. What happens if you try to align all HA segments (H1, H3, H5, and B) together?

Question:

What happens when you try to align H1, H3, H5, and B HA segments together with MAFFT?

Attempt the answer; feedback will appear below.
Possible Solution
# Extract H3 HA segments and align
grep -A1 HA_H3 aggregate_outputs/mira-reports/mira_pipeline_test_amended_consensus.fasta | sed "s/--//g" > H3.fasta
mafft H3.fasta > aligned_h3.fasta

Exercise 9 — MSA Pipeline Extension

Add to your Mira wrapper shell script the commands for extracting and aligning pass-QC sequences for:

  • HA: A_HA_H1, B_HA, A_HA_H3
  • NA: B_NA, A_NA_N1, A_NA_N2

Exercise 10 — BLAST Practical

  1. Using NCBI BLAST, determine which season and region (Americas, Eurasia, Africa, Oceania) each HA segment most closely matches.
  2. Using GISAID BLAST, how do your results differ?
  3. Discussion: What are some factors that might influence your BLAST results?

Question:

For influenza, which BLAST database generally provides more complete results — NCBI or GISAID?

Attempt the answer; feedback will appear below.