Genome Assembly Practical Exercises
These exercises accompany the Genome Assembly and MIRA-NF module. They walk you through downloading FASTQ files from SRA, preparing samplesheets, running MIRA-NF, and building a complete wrapper script.
Exercises developed by Kristine Lacek
Exercise 1 — FASTQ Knowledge Check
Question:
How many lines does a single read occupy in a FASTQ file?
Question:
What character appears on line 3 of every FASTQ record?
Exercise 2 — FASTQ Download and Preparation
-
Create a working directory structure:
mkdir ~/MIRA_NGS mkdir ~/MIRA_NGS/day3 cd ~/MIRA_NGS/day3 -
Using
sra-toolkitand a loop, pull down the following samples from SRA:SRR accession Sample ID SRR37675411 f58cd412 SRR37675412 f133a406 SRR37675413 e91ea59e fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' <SRR_ACCESSION> -
Rename the FASTQs according to the sample ID mapping, adding
_R1and_R2suffixes. -
Compress all FASTQs with
gzip. -
Move the compressed files to a subfolder called
fastqs.
Question:
What two commands create ~/MIRA_NGS and then ~/MIRA_NGS/day3 inside it?
Possible Solution
mkdir ~/MIRA_NGS
mkdir ~/MIRA_NGS/day3
cd ~/MIRA_NGS/day3
# Download from SRA
for i in SRR37675411 SRR37675412 SRR37675413; do
fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' $i
done
# Create accessions mapping
cat > accessions.txt << EOF
SRR37675411,f58cd412
SRR37675412,f133a406
SRR37675413,e91ea59e
EOF
# Rename using the mapping
for i in $(cat accessions.txt); do
mv $(echo $i | cut -f1 -d,)_1.fastq $(echo $i | cut -f2 -d,)_R1.fastq
mv $(echo $i | cut -f1 -d,)_2.fastq $(echo $i | cut -f2 -d,)_R2.fastq
done
# Compress and organize
gzip *.fastq
mkdir fastqs
mv *.fastq.gz fastqs
Exercise 3 — Build a Samplesheet for MIRA-NF
-
Create a samplesheet using
echo,ls,cut,uniq, andsed:echo "sample_id,sample_type" > samplesheet.csv ls fastqs | cut -f1 -d_ | uniq | sed "s/$/,Test/g" >> samplesheet.csv -
Verify the samplesheet looks correct with
cat samplesheet.csv.
Question:
What is the header line for an Illumina MIRA-NF samplesheet?
Exercise 4 — Build a Mira-NF Execution Statement
Build a Mira-NF Docker run command. You can write it directly as a shell script (.sh file) to reuse in the future.
- The first time you run this, it will pull the image from Docker Hub, which may take a few minutes. Subsequent runs will be faster.
-
Create a file called
run_mira.shusingvimor another editor. -
Add the following content, adjusting paths to match your setup:
#!/bin/bash docker run \ --privileged \ --user $(id -u):$(id -g) \ -v ${PWD}:/data \ cdcgov/mira-nf:v2.1.0 \ nextflow run /MIRA-NF/main.nf \ -profile mira_nf_container \ --input /data/samplesheet.csv \ --runpath /data \ --outdir /data/mira-output \ --e Flu-Illumina \ --nextclade true -
Make it executable and run:
chmod +x run_mira.sh ./run_mira.sh
Question:
Which Mira-NF flag specifies the experiment type (e.g., Flu-Illumina)?
Question:
What Nextflow flag sets the compute environment profile (e.g., mira_nf_container)?
Exercise 5 — Putting It All Together: Full Pipeline Script
For a second run, build a complete wrapper script (mira_wrapper.sh) that handles everything from downloading FASTQs to running MIRA-NF.
Use the following accessions:
SRR37675414,d969e179
SRR37675415,ba21bd1f
SRR37675416,ad336cc2
SRR37675417,89bb6967
SRR37675418,81ae9ee5
SRR37675419,73ceed0e
SRR37675420,6796f13d
SRR37675421,314ac5ba
SRR37675422,240aa994
SRR37675423,239e44e6
Your script should:
- Download all FASTQs from SRA using a loop
- Rename them using the sample ID mapping
- Compress the FASTQs
- Create a
fastqs/directory and move files into it - Generate the samplesheet automatically
- Run MIRA-NF
Possible Solution
#!/bin/bash
# Download FASTQs from SRA
for i in $(cut -f1 -d, accessions.txt); do
fastq-dump --split-3 --defline-seq '@$sn/$ri' --defline-qual '+' $i
done
# Rename FASTQs using sample IDs
for i in $(cat accessions.txt); do
mv $(echo $i | cut -f1 -d,)_1.fastq $(echo $i | cut -f2 -d,)_R1.fastq
mv $(echo $i | cut -f1 -d,)_2.fastq $(echo $i | cut -f2 -d,)_R2.fastq
done
# Compress and organize
gzip *.fastq
mkdir fastqs
mv *.fastq.gz fastqs
# Generate samplesheet
echo "sample_id,sample_type" > samplesheet.csv
ls fastqs | cut -f1 -d_ | uniq | sed "s/$/,Test/g" >> samplesheet.csv
# Run MIRA-NF
docker run \
--privileged \
-v ${PWD}:/data \
--user $(id -u):$(id -g) \
cdcgov/mira-nf:v2.1.0 \
nextflow run /MIRA-NF/main.nf \
-profile mira_nf_container \
--input /data/samplesheet.csv \
--runpath /data \
--outdir /data/mira-output \
--e Flu-Illumina \
--nextclade true
Exercise 6 — Post-Pipeline Challenges
After MIRA-NF completes, extend your wrapper script with one or more of the following:
- Copy the amended
consensus.fastato your home directory. - Use
sedon the amended consensus FASTA to convert sample IDs into strain names. - Rename the
aggregate_outputsdirectory to<runID>_outputs. - Count the number of samples that pass QC by grepping the
aggregate_outputs/csv-reports/<runid>_summary.csv. - Challenge: Loop through the amended
consensus.fastaand break it into 8 per-segment multifastas.
Possible Solutions
- Copy consensus:
cp outputs/aggregate_outputs/mira-reports/*amended_consensus.fasta ~/ - Sed strain names: Use
sed -swith an alternative delimiter (%or#) since strain names contain/ - Count passing samples:
grep -c "PASS" outputs/aggregate_outputs/csv-reports/*_summary.csv - Per-segment split: Use
grepwith segment names and redirect to individual files in a loop
Exercise 7 — Samtools Practical
Navigate to the Mira output from your first run and perform the following tasks using samtools:
- View the first 10 lines of each
A_HA.bamfile across the three sample outputs. Approximately where in the gene are these reads? - Do these BAM files need to be sorted?
- Remove the
A_HA.bam.baifile, then re-index the HA BAM file withsamtools index. - Using
ls -lah, how large is theA_HA.bamfile? Usingsamtools viewwith redirect or the-oargument, convertA_HA.bamtoA_HA.sam. How large is the SAM file?
Question:
If the first reads in a BAM file all map to position 1, does the file need to be sorted?
Possible Solution
# View first 10 aligned reads
samtools view A_HA_H1.bam | head -10
# First reads all map to position 1 — the file is already sorted
# Remove and recreate the index
rm A_HA_H1.bam.bai
samtools index A_HA_H1.bam
# Convert BAM to SAM
samtools view -o A_HA_H1.sam A_HA_H1.bam
Exercise 8 — Multiple Sequence Alignment Practical
Using the Mira output amended_consensus.fasta, perform the following:
- Extract the H3 HA segments and align them with MAFFT. Do any sequences have gaps in your MSA?
- Do the same for A_PB2 segments.
- What happens if you try to align all HA segments (H1, H3, H5, and B) together?
Question:
What happens when you try to align H1, H3, H5, and B HA segments together with MAFFT?
Possible Solution
# Extract H3 HA segments and align
grep -A1 HA_H3 aggregate_outputs/mira-reports/mira_pipeline_test_amended_consensus.fasta | sed "s/--//g" > H3.fasta
mafft H3.fasta > aligned_h3.fasta
Exercise 9 — MSA Pipeline Extension
Add to your Mira wrapper shell script the commands for extracting and aligning pass-QC sequences for:
- HA: A_HA_H1, B_HA, A_HA_H3
- NA: B_NA, A_NA_N1, A_NA_N2
Exercise 10 — BLAST Practical
- Using NCBI BLAST, determine which season and region (Americas, Eurasia, Africa, Oceania) each HA segment most closely matches.
- Using GISAID BLAST, how do your results differ?
- Discussion: What are some factors that might influence your BLAST results?
Question:
For influenza, which BLAST database generally provides more complete results — NCBI or GISAID?