Bioinformatics Programming – Exercises | CDC Influenza Division Bioinformatics Workshop

Bioinformatics Programming Exercises

These exercises accompany the Intro to Bioinformatics Programming module. They alternate between hands-on practice and knowledge checks.

Exercises developed by Kristine Lacek and Logan Fink

Exercise 1 — Pseudo Code Practical

Open a new .sh file for each of these problems and write out logical pseudo code as comments (#). This is to practice approaching a coding problem step by step, so there is no need to worry about syntax.

Think about which variables you might need to initialize and which logic gates to use.

#1. If you had a file that was a list of numbers, determine the mean

Possible Solution

Count up the number of entries in the file and store that number as a variable (e.g., total_entries).
Use a loop to read the file line by line, adding each number to a running sum, then store that sum as a variable (sum_entries).
Divide sum_entries by total_entries, and capture the final value as a float so it is not rounded to a whole integer.
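
Once the pseudo code is settled, the steps above could be sketched in bash roughly as follows. This assumes the file holds one whole number per line; the function name mean_of_file and the sample values are hypothetical, and awk handles the final division because bash arithmetic is integer-only.

```shell
#!/usr/bin/env bash
# mean_of_file (hypothetical helper): print the mean of a file of whole numbers,
# one per line, to two decimal places.
mean_of_file() {
  local total_entries=0 sum_entries=0 value
  # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
  while read -r value || [ -n "$value" ]; do
    [ -z "$value" ] && continue            # skip blank lines
    total_entries=$((total_entries + 1))
    sum_entries=$((sum_entries + value))
  done < "$1"
  if [ "$total_entries" -eq 0 ]; then
    echo "no numbers found" >&2
    return 1
  fi
  # bash arithmetic is integer-only, so awk performs the float division
  awk -v s="$sum_entries" -v n="$total_entries" 'BEGIN { printf "%.2f\n", s / n }'
}

# Example with a throwaway sample file (values are illustrative)
numbers_file=$(mktemp)
printf '%s\n' 2 4 9 > "$numbers_file"
mean_of_file "$numbers_file"   # prints 5.00
rm -f "$numbers_file"
```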

#2. Using the same file from question 1, calculate the percentage of numbers that are below 6

Possible Solution

Instantiate two variables, x and y, and set both to zero (x = count of values below 6, y = count of values 6 or above).
Write a function to determine whether a value is less than 6.
Iterate through the list and use each value as input to the function.
If the condition is true, increment x by 1; if false, increment y by 1.
Create a new variable z as x + y (total values), then divide x by z and capture the result as a float.
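
A minimal bash sketch of the counting steps above, again assuming one whole number per line. The function name percent_below and the sample values are hypothetical, and the threshold is passed as an argument so that 6 is not hard-coded.

```shell
#!/usr/bin/env bash
# percent_below (hypothetical helper): print the percentage of values in a file
# ($2) that are strictly below a threshold ($1), to one decimal place.
percent_below() {
  local threshold=$1 file=$2 x=0 y=0 z value
  while read -r value || [ -n "$value" ]; do
    [ -z "$value" ] && continue            # skip blank lines
    if [ "$value" -lt "$threshold" ]; then
      x=$((x + 1))                         # value is below the threshold
    else
      y=$((y + 1))                         # value is at or above the threshold
    fi
  done < "$file"
  z=$((x + y))                             # total number of values
  if [ "$z" -eq 0 ]; then
    echo "no numbers found" >&2
    return 1
  fi
  awk -v x="$x" -v z="$z" 'BEGIN { printf "%.1f\n", 100 * x / z }'
}

# Example with a throwaway sample file (values are illustrative)
numbers_file=$(mktemp)
printf '%s\n' 2 5 6 9 > "$numbers_file"
percent_below 6 "$numbers_file"   # prints 50.0
rm -f "$numbers_file"
```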

#3. If you had a file containing all of the flu samples tested over the course of a year, determine how many different flu subtypes appear in that list.

Possible Solution

Create an empty list to hold unique subtype values (e.g., unique_list).
Iterate through the subtype list and check whether each subtype is already in unique_list.
If it is not present, add it to unique_list.
When the loop finishes, count the entries in unique_list to get the number of distinct subtypes.
Alternative: sort the list and condense to unique values (pipeline approach, e.g., sort/uniq).
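
The list-based approach above might look like this in bash, assuming one subtype label per line. The function name count_unique_subtypes and the sample labels are hypothetical; the last command shows the sort/uniq alternative for comparison.

```shell
#!/usr/bin/env bash
# count_unique_subtypes (hypothetical helper): print how many distinct
# labels appear in a file with one label per line.
count_unique_subtypes() {
  local unique_list=() subtype seen found
  while read -r subtype || [ -n "$subtype" ]; do
    [ -z "$subtype" ] && continue          # skip blank lines
    found=0
    for seen in "${unique_list[@]}"; do    # is this subtype already recorded?
      if [ "$seen" = "$subtype" ]; then
        found=1
        break
      fi
    done
    if [ "$found" -eq 0 ]; then
      unique_list+=("$subtype")            # first time this subtype is seen
    fi
  done < "$1"
  echo "${#unique_list[@]}"
}

# Example with a throwaway sample file (labels are illustrative)
subtypes_file=$(mktemp)
printf '%s\n' H3N2 H1N1 H3N2 B H1N1 > "$subtypes_file"
count_unique_subtypes "$subtypes_file"     # prints 3
sort "$subtypes_file" | uniq | wc -l       # pipeline alternative, same count
rm -f "$subtypes_file"
```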

#4. Translate a DNA sequence into its possible protein sequences (keeping in mind all reading frames on both the coding and non-coding strands)

Possible Solution

Obtain a codon table for converting DNA codons into amino acids.
Starting at nucleotide 1, iterate in steps of 3 and translate codons to build a protein sequence.
Repeat from nucleotide 2 (frame 2).
Repeat from nucleotide 3 (frame 3).
Convert the DNA to its reverse complement.
Repeat translation for the three reverse-complement frames.
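
For reference, the six-frame steps above could be sketched in bash as follows. This assumes uppercase input containing only A, C, G, and T, and the standard genetic code; all function names are hypothetical. A codon containing any other character is reported as X, and a trailing partial codon is dropped.

```shell
#!/usr/bin/env bash
# Standard genetic code as a 64-character string, with codons ordered by
# base index T=0, C=1, A=2, G=3 (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop.
AMINO_ACIDS="FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# base_index (hypothetical helper): map one base to its index, or -1 if unknown.
base_index() {
  case "$1" in
    T) echo 0 ;;
    C) echo 1 ;;
    A) echo 2 ;;
    G) echo 3 ;;
    *) echo -1 ;;
  esac
}

# translate_frame (hypothetical helper): translate DNA ($1) starting at a
# 0-based offset ($2), three bases at a time.
translate_frame() {
  local dna=$1 offset=$2 protein="" i a b c
  for ((i = offset; i + 3 <= ${#dna}; i += 3)); do
    a=$(base_index "${dna:i:1}")
    b=$(base_index "${dna:i+1:1}")
    c=$(base_index "${dna:i+2:1}")
    if ((a < 0 || b < 0 || c < 0)); then
      protein+="X"                         # codon with an unexpected character
    else
      protein+=${AMINO_ACIDS:$((16 * a + 4 * b + c)):1}
    fi
  done
  echo "$protein"
}

# reverse_complement (hypothetical helper): reverse the sequence, then swap bases.
reverse_complement() {
  echo "$1" | rev | tr 'ACGT' 'TGCA'
}

# six_frames (hypothetical helper): print frames +1..+3 and -1..-3, one per line.
six_frames() {
  local dna=$1 rc frame
  rc=$(reverse_complement "$dna")
  for frame in 0 1 2; do
    echo "+$((frame + 1)) $(translate_frame "$dna" "$frame")"
  done
  for frame in 0 1 2; do
    echo "-$((frame + 1)) $(translate_frame "$rc" "$frame")"
  done
}

six_frames "ATGGCCTAA"   # first line printed is "+1 MA*"
```

Packing the codon table into one string keeps the sketch short; an associative array keyed by codon would be easier to read but needs bash 4 or later.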

Exercise 2 — Logic and Variables Practical

Retrieve the ordinal number file from GitHub (ordinal_check.sh) to complete the following exercises:

wget https://raw.githubusercontent.com/CDCgov/id-bioifx-workshop/refs/heads/main/practical/bash_practical_exercises/logic_and_variable_practical/ordinal_check.sh

Change the ordinal statement to execute as true if the number is greater than 50 and less than 100
Change the ordinal statement to execute as true if the number is less than 25 or greater than 75
Change the ordinal statement to execute as true if the number is greater than 1 and less than 10, or greater than or equal to 90 and less than 100

Possible Solution

Change the ordinal statement to execute as true if the number is greater than 50 and less than 100.

if [ "$n" -gt 50 ] && [ "$n" -lt 100 ]; then
  # amend echo statements to reflect exercise instructions (example message below)
  echo "$n is greater than 50 and less than 100"
fi

Change the ordinal statement to execute as true if the number is less than 25 or greater than 75.

if [ "$n" -lt 25 ] || [ "$n" -gt 75 ]; then
  # amend echo statements to reflect exercise instructions (example message below)
  echo "$n is less than 25 or greater than 75"
fi

Change the ordinal statement to execute as true if the number is greater than 1 and less than 10, or greater than or equal to 90 and less than 100.

if { [ "$n" -gt 1 ] && [ "$n" -lt 10 ]; } || { [ "$n" -ge 90 ] && [ "$n" -lt 100 ]; }; then
  # amend echo statements to reflect exercise instructions (example message below)
  echo "$n is between 1 and 10, or at least 90 and less than 100"
fi
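
One way to check a compound condition like the last one is to wrap it in a function and try values on both sides of each boundary. The sketch below is independent of ordinal_check.sh; the function name in_target_ranges and the test values are hypothetical.

```shell
#!/usr/bin/env bash
# in_target_ranges (hypothetical helper): succeed if n is greater than 1 and
# less than 10, or greater than or equal to 90 and less than 100.
in_target_ranges() {
  local n=$1
  # The braces group each "and" pair so the "or" applies between the two ranges.
  { [ "$n" -gt 1 ] && [ "$n" -lt 10 ]; } || { [ "$n" -ge 90 ] && [ "$n" -lt 100 ]; }
}

# Try values at and around each boundary
for n in 1 5 10 50 90 99 100; do
  if in_target_ranges "$n"; then
    echo "$n: true"
  else
    echo "$n: false"
  fi
done
```

Only 5, 90, and 99 report true here, because the "greater than" and "less than" bounds exclude 1, 10, and 100 while "greater than or equal to" includes 90.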

Exercise 3 — Loops Practical

You’ll need to make a directory that contains a series of files for the next exercise. Use the directions below to create the directory and download the files.

mkdir loops_practical && cd loops_practical; for ((i=1;i<=99;i++)); do wget https://raw.githubusercontent.com/CDCgov/id-bioifx-workshop/refs/heads/main/practical/bash_practical_exercises/loops_practical/cat${i}; done

For every file in the loops_practical directory, if the file is not empty, print the name of the file to stdout. (wc -c < filename gives the size of a file in bytes.)
For each file in the directory, check whether its first line is a shebang line; if so, print the filename to stdout.
Find the sick cat! (Hint: execute the files with shebangs!)

Question:

What is the number of the sick cat?

Possible Solution

Print each non-empty file name in the loops_practical directory.

for i in *; do
  bytesize=$(wc -c < "$i")
  if [ "$bytesize" -gt 0 ]; then
    echo "$i"
  fi
done
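
For the shebang check in the second task, one possible sketch is shown below. It tests whether the first two bytes of each file are "#!". The function names, file names, and file contents are hypothetical, and a throwaway directory stands in for loops_practical.

```shell
#!/usr/bin/env bash
# has_shebang (hypothetical helper): succeed if a file starts with "#!".
has_shebang() {
  [ "$(head -c 2 "$1")" = '#!' ]
}

# list_shebang_files (hypothetical helper): print the name of each regular
# file in a directory ($1) whose first line starts with a shebang.
list_shebang_files() {
  local f
  for f in "$1"/*; do
    [ -f "$f" ] || continue                # skip anything that is not a regular file
    if has_shebang "$f"; then
      basename "$f"
    fi
  done
}

# Example with a throwaway directory (names and contents are illustrative)
demo_dir=$(mktemp -d)
printf '#!/bin/bash\necho "meow"\n' > "$demo_dir/cat_a"
printf 'echo "no shebang here"\n' > "$demo_dir/cat_b"
: > "$demo_dir/cat_c"                      # empty file
list_shebang_files "$demo_dir"             # prints cat_a
rm -rf "$demo_dir"
```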

Find the “sick cat” by executing only non-empty bash-readable files and printing successful file names.
```
for i in *; do
  bytesize=$(wc -c < "$i")
  if [ "$bytesize" -gt 0 ]; then
    if bash "$i" 2>/dev/null; then
      echo "$i"
    fi
  fi
done
```
A less elegant way to find the sick cat would be to execute every file in the directory, and scroll through the outputs.
```
for i in *; do
    echo "$i"
    bash "$i"
done
```

Exercise 4 — Pipeline practical

Download the files “flu_types.txt”, “decode_the_secret_message.txt”, and “secret_message_key.txt” into a directory called “pipeline_practical” using the following instructions.

mkdir pipeline_practical && cd pipeline_practical; for i in decode_the_secret_message.txt flu_types.txt secret_message_key.txt; do wget https://raw.githubusercontent.com/CDCgov/id-bioifx-workshop/refs/heads/main/practical/bash_practical_exercises/pipeline_practical/${i} ; done

List the contents of a directory, pipe that output to word count to find how many files there are (may be helpful to use man wc to find out what wc can do)
In one line, sort the contents of flu_types and output a list of the unique values
Using the following instructions, decode the secret message (see if you can do it in a one-liner): 1%76q#948^4q5@23q2q492q07&/@i5#q#76
- Convert all the numbers and symbols into letters using the provided key (secret_message_key.txt)
- Reverse the order of the sequence
- Cut using “/” as delimiter, take the second field
- Convert all the letters into uppercase (Hint: you can translate [:lower:] into [:upper:])

Possible Solution

Count files in a directory using a pipe to wc.
```
ls | wc -l
```

Sort a file and return unique values (bonus: count unique values).

sort flu_types.txt | uniq
sort flu_types.txt | uniq | wc -l

Decode the secret message by converting characters, reversing, cutting field 2 by /, then uppercasing.

secret_message="$(cat decode_the_secret_message.txt)"
secret_key="$(cat secret_message_key.txt)"
# assumes secret_message_key.txt holds the replacement letters, in the same order as the symbols
echo "$secret_message" | tr '123456789@#0%^&q=' "$secret_key" | rev | cut -d"/" -f 2 | tr '[:lower:]' '[:upper:]'

One-line direct decode example.

echo "1%76q#948^4q5@23q2q492q07&/@i5#q#76" | tr '123456789@#0%^&q=' '!abehnoprstuwxy_' | rev | cut -d"/" -f 2 | tr '[:lower:]' '[:upper:]'
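
To see what each stage of the pipeline contributes, the same steps can be run one at a time with the intermediate result printed after each. This is a sketch only: the function name decode_steps is hypothetical, and the symbol and letter sets are copied from the one-liner above.

```shell
#!/usr/bin/env bash
# decode_steps (hypothetical helper): run the four decoding steps separately
# and print the result of each one.
# $1 = encoded message, $2 = symbols to replace, $3 = replacement letters
decode_steps() {
  local message=$1 symbols=$2 letters=$3 step1 step2 step3 step4
  step1=$(echo "$message" | tr "$symbols" "$letters")    # symbols -> letters
  step2=$(echo "$step1" | rev)                           # reverse the sequence
  step3=$(echo "$step2" | cut -d "/" -f 2)               # keep the second "/" field
  step4=$(echo "$step3" | tr '[:lower:]' '[:upper:]')    # uppercase
  echo "1: $step1"
  echo "2: $step2"
  echo "3: $step3"
  echo "4: $step4"
}

decode_steps '1%76q#948^4q5@23q2q492q07&/@i5#q#76' \
  '123456789@#0%^&q=' '!abehnoprstuwxy_'
```

The order of the stages matters: cutting before reversing would keep the wrong side of the "/".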