Day 4 of Training Workshop
The findings and conclusions in this presentation are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Use of trade names and commercial sources is for identification only and does not imply endorsement by the U.S. Department of Health and Human Services.
References to non-CDC sites on the Internet do not constitute or imply endorsement of these organizations or their programs by CDC or the U.S. Department of Health and Human Services. CDC is not responsible for the content of pages found at these sites.
Before we get started, let’s confirm your computer has Nextstrain set up properly. Open a terminal and run the following command:
You should see the following output:
Checking setup…
✔ yes: operating system is supported
✔ yes: runtime data dir doesn't have spaces
✔ yes: runtime appears set up
✔ yes: snakemake is installed and runnable
✔ yes: augur is installed and runnable
✔ yes: auspice is installed and runnable
Setting default runtime to conda.
All good! Set up of conda complete.
If you ran into an error when running the command, go to the following website.
Install the nextstrain-cli using the instructions for your operating system.
For the “Set up a Nextstrain runtime” section, follow the “Conda” instructions for your operating system. We’ll go around to help if you are running into issues.
Before moving on, we’ll go through the first part of the practical: downloading data from GISAID to use in our analysis.
Phylogenetics is the study of evolutionary relationships among organisms.
library(ape)
library(ggtree)
library(ggplot2)
library(dplyr)
`%notin%` <- Negate(`%in%`)
# Branch lengths are explicit so we can display them
tree_text <- "((A:1,B:0.5,(C:1.5,D:0.5):0.5):0.5,E:0.5);"
tree <- read.tree(text = tree_text)
tree <- ape::root(tree, "E")
# Basic ggtree base plot
base_plot <- ggtree(tree, branch.length = "branch.length") + theme_tree2()
tips_plot <- base_plot +
geom_tiplab(offset = 0.02) + # tip labels
geom_tippoint(size = 3, shape = 21, fill = "red", colour = "black") + # highlight tips
ggtitle("Tips highlighted")
tips_plot
branches_plot <- base_plot +
geom_tiplab(offset = 0.02) +
# place branch-length labels along branches; use aes(label=branch.length)
# round to two decimals for display
geom_label(aes(x = branch, label = round(branch.length, 2)), color = "red") +
# emphasise branches visually by increasing line width (optional)
# geom_segment2(aes(subset = !isTip), size = 1.2) +
ggtitle("Branch lengths labeled (highlighted)")
branches_plot
Ntip <- length(tree$tip.label)
root_node <- Ntip + 1
tree_df <- fortify(tree) %>%
mutate(branch_color = case_when(label == "E" ~ "purple",
.default = "black"))
root_plot <- ggtree(tree_df, aes(color=I(branch_color))) + theme_tree2() +
geom_tiplab(offset = 0.02) +
# mark the root with a star or large point
geom_point2(aes(subset = (node == !!root_node)), shape = 8, size = 5, colour = "purple") +
# optionally annotate "root" text
geom_text2(aes(subset = (node == !!root_node), label = "root"), hjust = -0.5, vjust = -1, size = 4, colour = "purple") +
geom_hilight(node=5, fill="purple", alpha=.6) +
geom_text2(aes(subset = (node == 5), label = "Outgroup"), hjust = -0.1, vjust = -2.0, size = 4, colour = "purple") +
ggtitle("Root and outgroup highlighted")
root_plot
Conceptually, tree rooting should be done:
Outgroup rooting provides a more robust evolutionary history and is generally preferred.
Strict Clock
Relaxed Clock
Clades are the fundamental groupings in phylogenetic trees. A clade is a monophyletic group of taxa: a single common ancestor and all of its descendants.
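The definition can be checked mechanically: find the most recent common ancestor (MRCA) of a group of tips, then confirm that the MRCA has no other descendants. The sketch below is a minimal base-R illustration, not how any particular package implements it. The internal node names (`n1`, `n2`, `root`) and helper functions are made up for this example; the topology mirrors the five-tip example tree used earlier.

```r
# Toy rooted tree stored as child -> parent (NA marks the root).
# Topology: ((A,B,(C,D)),E); internal node names are hypothetical.
parent <- c(A = "n1", B = "n1", C = "n2", D = "n2",
            n2 = "n1", n1 = "root", E = "root", root = NA)
tips <- c("A", "B", "C", "D", "E")

# All tips that descend from a node (a tip descends from itself)
descendant_tips <- function(node) {
  if (node %in% tips) return(node)
  children <- names(parent)[!is.na(parent) & parent == node]
  sort(unlist(lapply(children, descendant_tips), use.names = FALSE))
}

# Path from a node up to the root, inclusive
path_to_root <- function(node) {
  path <- node
  while (!is.na(parent[[node]])) {
    node <- parent[[node]]
    path <- c(path, node)
  }
  path
}

# MRCA: the first node shared by every tip's path to the root
mrca <- function(group) {
  common <- Reduce(intersect, lapply(group, path_to_root))
  common[1]
}

# A group is a clade when its MRCA has no descendants outside the group
is_clade <- function(group) {
  setequal(descendant_tips(mrca(group)), group)
}

is_clade(c("C", "D"))  # TRUE
is_clade(c("A", "C"))  # FALSE: their MRCA (n1) also leads to B and D
```

Here C and D form a clade, while A and C do not, because any group containing both must also include B and D to be monophyletic.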
The seasonal flu research community recently implemented a nomenclature and clade proposal system for tracking important emerging clades. These correspond to the “subclade” designations from Nextclade outputs.
Defining a subclade in Nextstrain (augur clades command):
| clade | gene | site | alt |
|---|---|---|---|
| A | HA1 | 45 | N |
| A | HA1 | 48 | I |
| A | nuc | 473 | T |
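One way to read a definition table like this: a virus meets the definition when it carries every listed allele. The base-R sketch below illustrates only that matching idea and is not augur's implementation. The two genotypes and the `matches_clade` helper are hypothetical, and the sketch does not handle rows whose gene column is `clade`.

```r
# Clade A definition, copied from the table above
clade_def <- data.frame(
  clade = c("A", "A", "A"),
  gene  = c("HA1", "HA1", "nuc"),
  site  = c(45, 48, 473),
  alt   = c("N", "I", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical genotypes: observed state at each "gene:site" position
virus_1 <- c("HA1:45" = "N", "HA1:48" = "I", "nuc:473" = "T")
virus_2 <- c("HA1:45" = "N", "HA1:48" = "T", "nuc:473" = "T")

# TRUE when the genotype carries every allele listed in the definition;
# positions missing from the genotype count as a mismatch
matches_clade <- function(genotype, definition) {
  keys <- paste(definition$gene, definition$site, sep = ":")
  observed <- unname(genotype[keys])
  all(!is.na(observed) & observed == definition$alt)
}

matches_clade(virus_1, clade_def)  # TRUE
matches_clade(virus_2, clade_def)  # FALSE: HA1:48 is T, not I
```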
Subclade K example:
| clade | gene | site | alt |
|---|---|---|---|
| K | clade | J.2.4 | |
| K | HA1 | 2 | N |
| K | HA1 | 144 | N |
| K | HA1 | 158 | D |
| K | HA1 | 160 | K |
| K | HA1 | 173 | R |
Influenza subclades are not equivalent to Pango (SARS-CoV-2) lineages.
The only requirement for a Pango lineage designation is that the group of viruses shares ancestry.
Lineage proposals are made through public submission.
This has led to the designation of 5.7K+ lineages. 🤯
Subclade proposal criteria:
Each of these components feeds into the calculation of normalized branch scores across a phylogenetic tree.
A threshold score of 1.0 is set.
library(networkD3)
library(data.tree)
library(htmlwidgets)
library(collapsibleTree)
library(yaml)
library(viridis)
# Create tree via a nested yaml string
yaml <- "
name: A (3C)
A.2 (3C.2):
  B (*A.2.1):
    B.1:
      B.1.1:
      B.1.2:
        B.1.2.1:
          D (*B.1.2.1.2):
        E (*B.1.2.2):
          E.1:
            F (*E.1.1):
              F.1:
                F.1.1:
            G (*E.1.2):
              G.1:
                G.1.1:
                  G.1.1.1:
                  G.1.1.2:
                G.1.2:
                G.1.3:
                  G.1.3.1:
                    J (*G.1.3.1.1):
                      J.1:
                        J.1.1:
                      J.2:
                        J.2.1:
                        J.2.2:
                        J.2.3:
                        J.2.4:
                          K (*J.2.4.1):
                        J.2.5:
                      J.3:
                      J.4:
                  G.1.3.2:
              G.2:
                G.2.1:
                G.2.2:
              G.3:
              G.4:
          E.2:
    B.2:
    B.3:
    B.4:
A.3 (3C.3):
  C (*A.3.1):
    C.1:
  A.3.2:
"
# Load the yaml tree into a list
h3List <- yaml.load(yaml)
# Transform into a data.tree object
h3Node <- as.Node(h3List, interpretNullAsList = TRUE)
# Assign a vector of colors for nodes
h3Colors <- viridis::turbo(h3Node$totalCount)
# Assign colors to tree nodes
i <- 1
h3Node$Do(function(node) {
node$color <- h3Colors[i]
i <<- i + 1
})
# Display collapsible tree
collapsibleTree(h3Node,
fill = "color",
tooltip = TRUE)
Quick break.
flowchart TD
A[Sequence selection and filtering]
style A text-align:center
B[Multiple sequence alignment]
style B text-align:center
C[Model selection]
style C text-align:center
D[Tree inference]
style D text-align:center
E[Tree evaluation and visualization]
style E text-align:center
A --> |The data that is <br/> included is **important**| B
B --> C
C --> |Also important for <br/> proper tree construction| D
D --> E
When starting the process of building a tree:
Consensus sequence data quality (not just NGS coverage)
What is the question you are trying to address?
Global surveillance?
Targeted regional analysis?
Country specific questions?
Timing of a newly emerged variant?
Examples of MSA software:
Assuming nucleotide sequence data as input:
Distance Based:
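Distance-based methods start from a matrix of pairwise distances between aligned sequences rather than from the individual sites. As a rough illustration, the base-R sketch below computes the simplest such measure, the p-distance (the proportion of aligned sites that differ). The three sequences are made up, gaps and ambiguous bases are not treated specially, and real analyses typically apply a substitution-model correction.

```r
# Toy aligned nucleotide sequences (made up for illustration)
aln <- c(
  s1 = "ATGCATGCAT",
  s2 = "ATGCATGCAA",
  s3 = "ATGTATGAAA"
)

# p-distance: proportion of aligned sites at which two sequences differ
p_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))  # sequences must be aligned
  mean(x != y)
}

# Fill a symmetric pairwise distance matrix
n <- length(aln)
dist_mat <- matrix(0, n, n, dimnames = list(names(aln), names(aln)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_mat[i, j] <- p_distance(aln[[i]], aln[[j]])
  }
}
dist_mat
```

A matrix like `dist_mat` is the kind of input that a distance-based tree-building algorithm would then cluster into a tree.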
Character Based:
Nextstrain is an open-source toolkit for analyzing and visualizing pathogen genomic data.
The two core parts of Nextstrain are:
flowchart TB
subgraph Inputs
direction LR
A[sequences.fasta]
B[metadata.tsv]
end
subgraph Augur
direction LR
C(filter)
D(align)
E(tree)
F(refine)
G(export)
C --> D
D --> E
E --> F
F --> G
end
Inputs --> Augur
Augur --> H[Build Dataset]
H --> I[Auspice]
%% Styling
classDef input fill:#ffffff,stroke:#555,rx:12,ry:12;
classDef step fill:#bfe3ff,stroke:#3b82c4,rx:14,ry:14;
classDef output fill:#ffffff,stroke:#555,rx:12,ry:12;
classDef final fill:#b9f5b0,stroke:#2e7d32,rx:14,ry:14;
class A,B input;
class C,D,E,F,G step;
class H output;
class I final;
Click on the following link which will direct you to the exercise we will be going through together.
We’ll go over features of your build and navigate it together.
Nextstrain narratives are markdown (.md) files that present results interactively in a slide-like format.
Click on the link above for detailed documentation on narratives.
For an example of a narrative, take a look at the Twenty years of West Nile virus narrative.
Open the example narrative file /profiles/gisaid/custom.md in VS Code.
At the top of the file is the narrative header:
For local builds, the dataset link is the prefix https://nextstrain.org/ followed by your build file name, without the .json extension and with underscores replaced by the / character (i.e., the custom_h3n2_ha.json dataset becomes https://nextstrain.org/custom/h3n2/ha).
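The file-name-to-link rule can be written as a small helper. This is a minimal base-R sketch; the function name `dataset_url` is made up for illustration and is not part of Nextstrain.

```r
# Build the narrative dataset link from a local build file name:
# drop the .json extension, then swap underscores for slashes.
dataset_url <- function(json_file, prefix = "https://nextstrain.org") {
  name <- sub("\\.json$", "", basename(json_file))
  paste0(prefix, "/", gsub("_", "/", name, fixed = TRUE))
}

dataset_url("custom_h3n2_ha.json")
# "https://nextstrain.org/custom/h3n2/ha"
```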
Basic slides are set up by providing copied links as first-level headers:
Copy the end of the link and add the prefix https://nextstrain.org/
Timeline animations can be used in narratives by copying the link associated with the animation.
The link will briefly show after clicking the “Play ▶️” button underneath the “Date Range” toolbar.
Images can be embedded into the narrative sidebar or main display from direct links or base64 conversion.
# [Image Embedding](https://nextstrain.org/custom/h3n2/ha)
Images can be embedded from the web in the side panel.

```auspiceMainDisplayMarkdown
### Or in the main body display via the `auspiceMainDisplayMarkdown` section.
For main body markdown display.

```
Note the use of the auspiceMainDisplayMarkdown block to provide markdown for the main narrative display.
Normal features of markdown such as:
Can all be used in both the narrative side panel and main display.
Again, the main display requires the auspiceMainDisplayMarkdown block for use.
Before moving on to the exercise, open a new terminal in the seasonal-flu-demo folder and start a new nextstrain shell session:
Then, run another build using the following command:
Repeat this process for a third build.
Open a new terminal, enter a nextstrain shell, and execute the following:
Develop a narrative walking through your current build focusing on viruses circulating in Oceania/Australia (custom_h3n2_ha).
We’ll take some time to compare the three builds we’ve created and discuss the caveats of sampling.