Intro to Phylogenetics Using Nextstrain

Day 4 of Training Workshop

Norman Hassell

Disclaimer

The findings and conclusions in this presentation are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Use of trade names and commercial sources is for identification only and does not imply endorsement by the U.S. Department of Health and Human Services.
References to non-CDC sites on the Internet do not constitute or imply endorsement of these organizations or their programs by CDC or the U.S. Department of Health and Human Services. CDC is not responsible for the content of pages found at these sites.

Section 1: Overview

Getting Started

Before we get started, let’s confirm your computer has Nextstrain set up properly. Open a terminal and run the following command:

nextstrain setup --set-default conda

You should see the following output:

Checking setup…
✔ yes: operating system is supported
✔ yes: runtime data dir doesn't have spaces
✔ yes: runtime appears set up
✔ yes: snakemake is installed and runnable
✔ yes: augur is installed and runnable
✔ yes: auspice is installed and runnable

Setting default runtime to conda.

All good!  Set up of conda complete.

Getting Started (Cont.)

If you ran into an error when running the command, go to the following website.

Install the nextstrain-cli using the instructions for your operating system.

For the “Set up a Nextstrain runtime” section, follow the “Conda” instructions for your operating system. We’ll go around to help if you are running into issues.

Getting Started (Cont.)

Before moving on, we’ll go through the first part of the practical: downloading data from GISAID to use in our analysis.

Intro to the Intro

Phylogenetics is the study of evolutionary relationships among organisms.

Phylogenetic tree of influenza A HAs. The 2 groups are colored cyan (group 1) and green (group 2), each of which can be further subdivided into 3 clades (H8, H9, and H12; H1, H2, H5, and H6; H11, H13, and H16) and 2 clades (H3, H4, and H14; H7, H10, and H15). https://www.pnas.org/doi/full/10.1073/pnas.0807142105

Phylogenetic trees of the eight individual gene segments of influenza A(H1N1)pdm09 viruses from sub-Saharan Africa from the 2011–2013 influenza seasons. https://www.nature.com/articles/s41598-024-70023-3

Basic Components of a Phylogenetic Tree (Tips)

library(ape)
library(ggtree)
library(ggplot2)
library(dplyr)

`%notin%` <- Negate(`%in%`)

# Branch lengths are explicit so we can display them
tree_text <- "((A:1,B:0.5,(C:1.5,D:0.5):0.5):0.5,E:0.5);"
tree <- read.tree(text = tree_text)
tree <- ape::root(tree, "E")

# Basic ggtree base plot
base_plot <- ggtree(tree, branch.length = "branch.length") + theme_tree2()

tips_plot <- base_plot +
  geom_tiplab(offset = 0.02) +        # tip labels
  geom_tippoint(size = 3, shape = 21, fill = "red", colour = "black") +  # highlight tips
  ggtitle("Tips highlighted")

tips_plot

Basic Components of a Phylogenetic Tree (Nodes)

nodes_plot <- base_plot +
  geom_tiplab(offset = 0.02) +
  # mark internal (non-tip) nodes: use aes(subset = !isTip)
  geom_point2(aes(subset = !isTip), size = 3.5, shape = 21, fill = "green", colour = "black") +
  ggtitle("Nodes highlighted")

nodes_plot

Basic Components of a Phylogenetic Tree (Branches)

branches_plot <- base_plot +
  geom_tiplab(offset = 0.02) +
  # place branch-length labels along branches; use aes(label=branch.length)
  # round to two decimals for display
  geom_label(aes(x = branch, label = round(branch.length, 2)), color = "red") +
  # emphasise branches visually by increasing line width (optional)
  # geom_segment2(aes(subset = !isTip), size = 1.2) +
  ggtitle("Branch lengths labeled (highlighted)")

branches_plot

Basic Components of a Phylogenetic Tree (Root)

Ntip <- length(tree$tip.label)
root_node <- Ntip + 1

tree_df <- fortify(tree) %>%
  mutate(branch_color = case_when(label == "E" ~ "purple",
                              .default = "black"))

root_plot <- ggtree(tree_df, aes(color=I(branch_color))) + theme_tree2() +
  geom_tiplab(offset = 0.02) +
  # mark the root with a star or large point
  geom_point2(aes(subset = (node == !!root_node)), shape = 8, size = 5, colour = "purple") +
  # optionally annotate "root" text
  geom_text2(aes(subset = (node == !!root_node), label = "root"), hjust = -0.5, vjust = -1, size = 4, colour = "purple") +
  geom_hilight(node=5, fill="purple", alpha=.6) +
  geom_text2(aes(subset = (node == 5), label = "Outgroup"), hjust = -0.1, vjust = -2.0, size = 4, colour = "purple") +
  ggtitle("Root and outgroup highlighted")

root_plot

Notes on Tree Rooting

Conceptually, a tree should be rooted in one of two ways:

Using a known outgroup (a taxon closely related to, but outside, the study group) to place the ancestor
Or using midpoint rooting, which sets the root at the midpoint of the longest path between any two taxa

Outgroup rooting provides a more robust evolutionary history, and is generally preferred.
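
To make the midpoint idea concrete, here is a minimal Python sketch that finds the longest tip-to-tip path on the toy tree from the earlier slides and reports the distance to its midpoint. It is an illustration only, not how ape or Augur implement rooting, and the internal node names (root, n1, n2) are made up for this example.

```python
from itertools import combinations

# Undirected edges (node, node, branch length) for the slide's toy tree
# "((A:1,B:0.5,(C:1.5,D:0.5):0.5):0.5,E:0.5);"
# Internal node names "root", "n1", "n2" are hypothetical labels.
EDGES = [
    ("root", "n1", 0.5), ("root", "E", 0.5),
    ("n1", "A", 1.0), ("n1", "B", 0.5), ("n1", "n2", 0.5),
    ("n2", "C", 1.5), ("n2", "D", 0.5),
]
TIPS = ["A", "B", "C", "D", "E"]


def build_adjacency(edges):
    """Map each node to a list of (neighbor, branch length) pairs."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    return adjacency


def distances_from(start, adjacency):
    """Path length from `start` to every other node (paths in a tree are unique)."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                stack.append(neighbor)
    return dist


def longest_tip_path(edges, tips):
    """Return the first tip pair with the greatest path length, and that length."""
    adjacency = build_adjacency(edges)
    best_pair, best_length = None, -1.0
    for a, b in combinations(tips, 2):
        d = distances_from(a, adjacency)[b]
        if d > best_length:
            best_pair, best_length = (a, b), d
    return best_pair, best_length


pair, length = longest_tip_path(EDGES, TIPS)
midpoint_distance = length / 2  # distance from either end of the path to the midpoint
print(pair, length, midpoint_distance)
```

On this toy tree the paths A to C and C to E tie at a length of 3.0, and the sketch reports the first one it finds. In both cases the midpoint lies 1.5 units from C, which is exactly the internal node joining C and D.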

Types of Trees

Molecular Clock Concept

Strict Clock

Assumes a constant rate across all lineages

Relaxed Clock

Allows rate variation across branches
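
As a rough illustration of the strict clock idea, the sketch below fits a single rate by ordinary least squares regression of root-to-tip divergence against sampling date. The dates and divergence values are hypothetical, and this is a simplification rather than the method used by any particular tool.

```python
def fit_strict_clock(dates, divergences):
    """Least-squares fit of divergence = rate * date + intercept."""
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(divergences) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, divergences))
    sxx = sum((x - mean_x) ** 2 for x in dates)
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    return rate, intercept


# Hypothetical tips: decimal sampling date and root-to-tip divergence (subs/site)
dates = [2020.0, 2021.0, 2022.0, 2023.0]
divergences = [0.010, 0.013, 0.016, 0.019]

rate, intercept = fit_strict_clock(dates, divergences)
root_date = -intercept / rate  # date at which the fitted divergence reaches zero
print(f"rate = {rate:.4f} subs/site/year, estimated root date = {root_date:.1f}")
```

A relaxed clock would instead allow each branch its own rate, which this single-slope fit cannot represent.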

The Concept of Clades

Clades are the fundamental grouping in phylogenetic trees. A clade is a monophyletic group of taxa: a single common ancestor and all of its descendants.
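
The definition can be checked mechanically. The following sketch, using the toy tree from the earlier slides with made-up internal node names, tests whether a set of tips is exactly the set of descendants of some node.

```python
# Toy tree from the earlier slides as parent -> children.
# Internal node names "root", "n1", "n2" are hypothetical labels.
CHILDREN = {
    "root": ["n1", "E"],
    "n1": ["A", "B", "n2"],
    "n2": ["C", "D"],
}


def tips_below(node, children):
    """Return the set of tip names descending from `node`."""
    if node not in children:  # a node without children is a tip
        return {node}
    tips = set()
    for child in children[node]:
        tips |= tips_below(child, children)
    return tips


def is_clade(taxa, children):
    """True if `taxa` is exactly the full set of tips below some node."""
    target = set(taxa)
    nodes = set(children) | {c for kids in children.values() for c in kids}
    return any(tips_below(node, children) == target for node in nodes)


# C and D are the only descendants of n2, so they form a clade.
print(is_clade({"C", "D"}, CHILDREN))
# The common ancestor of A and B (n1) also leads to C and D, so A + B alone is not a clade.
print(is_clade({"A", "B"}, CHILDREN))
```

Under this definition a single tip trivially counts as a clade, since it is the full set of tips below itself.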

Clades in Seasonal Influenza

Clades in Seasonal Influenza (cont.)

The seasonal flu research community recently implemented a nomenclature and clade proposal system for tracking important emerging clades. These correspond to the “subclade” designations from Nextclade outputs.

Clades in Seasonal Influenza (cont.)

Defining a subclade in Nextstrain (augur clades command):

clade	gene	site	alt
A	HA1	45	N
A	HA1	48	I
A	nuc	473	T

clade	gene	site	alt
K	clade	J.2.4
K	HA1	2	N
K	HA1	144	N
K	HA1	158	D
K	HA1	160	K
K	HA1	173	R

Clades in Seasonal Influenza (cont.)

Subclade K example:

clade	gene	site	alt
K	clade	J.2.4
K	HA1	2	N
K	HA1	144	N
K	HA1	158	D
K	HA1	160	K
K	HA1	173	R
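
To show how such a table can be read, here is a small Python sketch that parses the subclade K definition above and checks whether one sequence carries all of the listed amino acid states. This is a simplified, per-sequence check under our reading of the table (the row with "clade" in the gene column is taken to name the parent clade); it is not the tree-based logic that augur clades applies, and the example genotype is hypothetical.

```python
import csv
import io

# The subclade K definition from the slide, as tab-separated text
CLADE_TSV = (
    "clade\tgene\tsite\talt\n"
    "K\tclade\tJ.2.4\n"
    "K\tHA1\t2\tN\n"
    "K\tHA1\t144\tN\n"
    "K\tHA1\t158\tD\n"
    "K\tHA1\t160\tK\n"
    "K\tHA1\t173\tR\n"
)


def parse_clade_definitions(text):
    """Return {clade: {"parent": name or None, "sites": {(gene, site): alt}}}."""
    definitions = {}
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue
        clade, gene, site = row[0], row[1], row[2]
        entry = definitions.setdefault(clade, {"parent": None, "sites": {}})
        if gene == "clade":
            entry["parent"] = site  # assumed meaning: parent clade name
        else:
            entry["sites"][(gene, int(site))] = row[3]
    return definitions


def has_defining_states(genotype, definition):
    """True if the genotype matches every (gene, site) -> alt in the definition."""
    return all(genotype.get(key) == alt for key, alt in definition["sites"].items())


definitions = parse_clade_definitions(CLADE_TSV)

# Hypothetical genotype: amino acid observed at each defining position
genotype = {("HA1", 2): "N", ("HA1", 144): "N", ("HA1", 158): "D",
            ("HA1", 160): "K", ("HA1", 173): "R"}
print(definitions["K"]["parent"], has_defining_states(genotype, definitions["K"]))
```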

Subclades vs. Lineages (Pango 🤦)

Influenza subclades are not equivalent to Pango (SARS-CoV-2) lineages.

The only requirement for a Pango lineage designation is a group of viruses with shared ancestry.

Lineage proposals are made through public submission.

This has led to the designation of 5.7K+ lineages. 🤯

Subclade Proposal Process

Subclade proposal criteria:

Size: large groups should have a higher priority for designation.
Divergence: the more mutations that have accumulated relative to the breakpoint of the parent clade, the higher the priority of a novel clade.
Specific mutations: ideally, breakpoints sit on long branches carrying significant mutations. Such mutations are better characterized for well-studied segments/genomes.

Each of these components feeds into the calculation of normalized branch scores across a phylogenetic tree.

A threshold score of 1.0 is set.

Subclade Naming

The system of subclade naming uses a variation of the Pango nomenclature system.¹
Capital letters are used as aliases, and numbers separated by periods make up the retained part of the hierarchical name.
Names are shortened (aliased) after three hierarchical levels (e.g., D for C.1.1.1).
However, a new alias may be given earlier in the case of a rapid expansion or special designation.
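
The aliasing scheme can be unwound programmatically. The sketch below expands a short subclade name into its full hierarchical name by repeatedly replacing the leading alias letter. The alias table is read off the H3N2 HA example in this deck and is assumed to be accurate only for illustration.

```python
# Alias -> aliased hierarchical name, as read from the H3N2 HA example in this deck
ALIASES = {
    "B": "A.2.1",
    "C": "A.3.1",
    "D": "B.1.2.1.2",
    "E": "B.1.2.2",
    "F": "E.1.1",
    "G": "E.1.2",
    "J": "G.1.3.1.1",
    "K": "J.2.4.1",
}


def expand(name, aliases):
    """Replace the leading alias letter repeatedly until no alias remains."""
    head, _, rest = name.partition(".")
    while head in aliases:
        name = aliases[head] + ("." + rest if rest else "")
        head, _, rest = name.partition(".")
    return name


print(expand("K", ALIASES))
print(expand("B.1", ALIASES))
```

Expanding K this way walks back through J, G, E, and B to the root letter A, which shows why the short aliases are needed for readability.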

Example of Implementation Using H3N2 HA

library(networkD3)
library(data.tree)
library(htmlwidgets)
library(collapsibleTree)
library(yaml)
library(viridis)

# Create tree via a nested yaml string
yaml <- "
name: A (3C)
A.2 (3C.2):
  B (*A.2.1):
    B.1:
      B.1.1:
      B.1.2:
        B.1.2.1:
          D (*B.1.2.1.2):
        E (*B.1.2.2):
          E.1:
            F (*E.1.1):
              F.1:
                F.1.1:
            G (*E.1.2):
              G.1:
                G.1.1:
                  G.1.1.1:
                  G.1.1.2:
                G.1.2:
                G.1.3:
                  G.1.3.1:
                    J (*G.1.3.1.1):
                      J.1:
                        J.1.1:
                      J.2:
                        J.2.1:
                        J.2.2:
                        J.2.3:
                        J.2.4:
                          K (*J.2.4.1):
                        J.2.5:
                      J.3:
                      J.4:
                  G.1.3.2:
              G.2:
                G.2.1:
                G.2.2:
              G.3:
              G.4:
          E.2:
  B.2:
  B.3:
  B.4:
A.3 (3C.3):
  C (*A.3.1):
    C.1:
  A.3.2:
"

# Load the yaml tree into a list
h3List <- yaml.load(yaml)

# Transform into a data.tree object
h3Node <- as.Node(h3List, interpretNullAsList = TRUE)

# Assign a vector of colors for nodes
h3Colors <- viridis::turbo(h3Node$totalCount)

# Assign colors to tree nodes
i <- 1
h3Node$Do(function(node) {
  node$color <- h3Colors[i]
  i <<- i + 1
})

# Display collapsible tree
collapsibleTree(h3Node,
                fill = "color",
                tooltip = TRUE)

Section 2: Tree Building Overview

Quick break.

Phylogenetic Tree Building Steps

flowchart TD
    A[Sequence selection and filtering]
    style A text-align:center
    B[Multiple sequence alignment]
    style B text-align:center
    C[Model selection]
    style C text-align:center
    D[Tree inference]
    style D text-align:center
    E[Tree evaluation and visualization]
    style E text-align:center

    A -->  |The data that is <br/> included is **important**| B
    B --> C 
    C --> |Also important for <br/> proper tree construction| D
    D --> E

The overall steps of phylogenetic tree building are:

Sequence selection and filtering. This is one of the most critical steps in tree building and is often given less consideration than it deserves. The selection should follow directly from the focus of the analysis.
Multiple sequence alignment.
Model selection. This is also important for tree construction. An inappropriate model can produce inaccurate topologies and lead to false conclusions from your data.
Tree inference. The algorithm used to infer the tree can affect the topology, and the right method depends on your overall goals. A neighbor joining algorithm may be sufficient for a quick assessment, a maximum likelihood method offers more robust statistical support, and a Bayesian method supports robust phylogenetic reconstruction and divergence time estimation.
Tree evaluation and visualization. Lastly, visualizing and interpreting your tree is critical to hypothesis testing and evolutionary analysis.

Sequence Selection

When starting the process of building a tree, consider:

Consensus sequence data quality (not just NGS coverage)
What is the question you are trying to address?
- Global surveillance?
- Targeted regional analysis?
- Country specific questions?
- Timing of a newly emerged variant?
  - Be especially careful here. This requires intensive analysis and scrutiny. Understand the limitations of analysis.
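
As one way to picture these considerations, the sketch below keeps only records from a region of interest whose consensus sequence has a low fraction of ambiguous bases. The record fields (name, region, sequence) and the threshold are hypothetical; a real analysis would typically apply such criteria with augur filter and dedicated quality control tools.

```python
def ambiguous_fraction(sequence):
    """Fraction of positions that are not an unambiguous A, C, G, or T.

    Gaps and IUPAC ambiguity codes all count as ambiguous in this simple sketch.
    """
    sequence = sequence.upper()
    return sum(base not in "ACGT" for base in sequence) / len(sequence)


def select_records(records, region=None, max_ambiguous=0.01):
    """Keep records matching `region` (if given) with few ambiguous bases."""
    kept = []
    for record in records:
        if region is not None and record["region"] != region:
            continue
        if ambiguous_fraction(record["sequence"]) > max_ambiguous:
            continue
        kept.append(record)
    return kept


# Hypothetical records for illustration
records = [
    {"name": "s1", "region": "Oceania", "sequence": "ACGTACGTAC"},
    {"name": "s2", "region": "Oceania", "sequence": "ACGTNNNNAC"},
    {"name": "s3", "region": "Europe", "sequence": "ACGTACGTAC"},
]

selected = select_records(records, region="Oceania", max_ambiguous=0.05)
print([record["name"] for record in selected])
```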

Multiple Sequence Alignment (MSA)

MSA Algorithms

Examples of MSA software:

Model Selection

Assuming nucleotide sequence data as input:

Jukes-Cantor (simplest model)¹
HKY
TN93
GTR (Model implemented in Nextstrain flu pipelines)
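
For a sense of what a substitution model does, here is a minimal sketch of the Jukes-Cantor correction, which converts the observed proportion of differing sites p into an estimated distance d = -3/4 ln(1 - 4p/3). The two short sequences are hypothetical, and the richer models listed above add parameters that this sketch does not cover.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring positions with non-ACGT characters."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences differing at 1 of 10 sites
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
print(f"p = {p:.3f}, JC distance = {d:.4f}")
```

The corrected distance is slightly larger than p because the model accounts for multiple substitutions at the same site.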

Tree Inference - Algorithms

Distance Based:

Neighbor joining
UPGMA

Character Based:

Maximum Parsimony
Maximum Likelihood (used in Nextstrain pipelines)
Bayesian Inference
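
To illustrate how a distance-based method chooses what to join, the sketch below computes the neighbor joining Q criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and picks the pair with the lowest value. It covers only the first joining step, and the distance matrix is hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise distances between five taxa
DIST = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7,
    ("d", "e"): 3,
}
TAXA = ["a", "b", "c", "d", "e"]


def dist(i, j):
    """Symmetric lookup into DIST; the distance from a taxon to itself is 0."""
    if i == j:
        return 0
    return DIST[(i, j)] if (i, j) in DIST else DIST[(j, i)]


def nj_first_join(taxa):
    """Return the pair minimizing the neighbor joining Q criterion, plus all Q values."""
    n = len(taxa)
    totals = {i: sum(dist(i, k) for k in taxa) for i in taxa}
    q = {(i, j): (n - 2) * dist(i, j) - totals[i] - totals[j]
         for i, j in combinations(taxa, 2)}
    best = min(q, key=q.get)
    return best, q


best_pair, q_values = nj_first_join(TAXA)
print(best_pair, q_values[best_pair])
```

Here neighbor joining pairs a with b even though d and e have the smallest raw distance, which is the pair UPGMA would merge first. The Q criterion corrects for how far each taxon is from everything else.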

Tree Inference - Software

Tree Evaluation and Visualization

Nextstrain (auspice)
auspice.us (browser based auspice)
IcyTree (browser based)
FigTree
iTOL
Dendroscope
Archaeopteryx
Programmatic libraries:
- ETE Toolkit
- ggtree

Intro to Nextstrain

Nextstrain is an open source toolkit for analyzing and visualizing pathogen genomic data.

The two core parts of Nextstrain are:

Augur (modular bioinformatics toolkit)
Auspice (browser based visualization)

Nextstrain process overview

flowchart TB
    subgraph Inputs
        direction LR
        A[sequences.fasta]
        B[metadata.tsv]
    end

    subgraph Augur
    direction LR
        C(filter)
        D(align)
        E(tree)
        F(refine)
        G(export)
        C --> D
        D --> E
        E --> F
        F --> G
    end

    Inputs --> Augur
    Augur --> H[Build Dataset]
    H --> I[Auspice]

     %% Styling
    classDef input fill:#ffffff,stroke:#555,rx:12,ry:12;
    classDef step fill:#bfe3ff,stroke:#3b82c4,rx:14,ry:14;
    classDef output fill:#ffffff,stroke:#555,rx:12,ry:12;
    classDef final fill:#b9f5b0,stroke:#2e7d32,rx:14,ry:14;

    class A,B input;
    class C,D,E,F,G step;
    class H output;
    class I final;

The essential components of a Nextstrain workflow are shown in this diagram. As input, we generally provide a sequence file in FASTA format and a tabular metadata file describing those sequences. These files are then taken through the Augur pipeline. First, the “filter” step samples the data by defined criteria. The filtered sequences are then aligned to a provided reference in the “align” step, and a tree is built from this alignment using the “tree” command. The “refine” step then refines the tree using sequence metadata, removes outliers, and produces a time-scaled phylogeny. Finally, the “export” step combines all of the essential files needed for a Nextstrain build into one or more JSON files. These JSON files form a “build” dataset, which can then be viewed directly in Auspice.
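
The chain of steps described above can be sketched as a list of Augur command lines. The Python snippet below only assembles and prints the commands; it does not run them. All file names are hypothetical, only a minimal set of options is shown, and the actual seasonal flu workflow drives these steps through Snakemake with many more settings.

```python
import shlex


def augur_commands(sequences, metadata, reference, dataset_json):
    """Assemble a minimal filter -> align -> tree -> refine -> export chain.

    Intermediate file names are hypothetical placeholders.
    """
    return [
        # Filter criteria (dates, regions, subsampling) would be added to this step
        ["augur", "filter", "--sequences", sequences, "--metadata", metadata,
         "--output-sequences", "filtered.fasta"],
        ["augur", "align", "--sequences", "filtered.fasta",
         "--reference-sequence", reference, "--output", "aligned.fasta"],
        ["augur", "tree", "--alignment", "aligned.fasta", "--output", "tree_raw.nwk"],
        ["augur", "refine", "--tree", "tree_raw.nwk", "--alignment", "aligned.fasta",
         "--metadata", metadata, "--timetree",
         "--output-tree", "tree.nwk", "--output-node-data", "branch_lengths.json"],
        ["augur", "export", "v2", "--tree", "tree.nwk", "--metadata", metadata,
         "--node-data", "branch_lengths.json", "--output", dataset_json],
    ]


commands = augur_commands("sequences.fasta", "metadata.tsv",
                          "reference.fasta", "auspice/custom_h3n2_ha.json")
for command in commands:
    print(shlex.join(command))
```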

Section 3: Nextstrain Build Exercise

Click on the following link which will direct you to the exercise we will be going through together.

Section 4: Overview of Nextstrain Features and Navigation

Section 5: Communicating Results Using Nextstrain Narratives

Introduction to Nextstrain Narratives

Nextstrain narratives are markdown (.md) files that allow the communication of results interactively in a slide-like format.

Click on the link above for detailed documentation on narratives.

For an example of a narrative, take a look at the Twenty years of West Nile virus narrative.

Components of a Narrative: Header

Open the example narrative file profiles/gisaid/custom.md in VS Code.

At the top of the file is the narrative header:

---
title: Custom Local Hosted Narrative
authors: Norman Hassell
date: "2026-03-02"
dataset: "https://nextstrain.org/custom/h3n2/ha"
abstract: "
Custom narrative of custom build
"
---

For local builds, use the prefix https://nextstrain.org followed by your build file name (without the .json extension), with underscores replaced by the / character (e.g., the custom_h3n2_ha.json dataset becomes https://nextstrain.org/custom/h3n2/ha).
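
The naming rule can be written as a small helper. This is a sketch of the convention described above, with a hypothetical function name, and it assumes the build file is a main dataset JSON file.

```python
from pathlib import PurePosixPath


def dataset_url(build_file, prefix="https://nextstrain.org"):
    """Turn a build file name such as custom_h3n2_ha.json into a narrative dataset URL."""
    name = PurePosixPath(build_file).name  # drop any leading folder such as auspice/
    if name.endswith(".json"):
        name = name[: -len(".json")]
    return prefix + "/" + name.replace("_", "/")


print(dataset_url("auspice/custom_h3n2_ha.json"))
```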

Components of a Narrative: Header (Result)

Components of a Narrative: Basic Slides

Basic slides are set up by providing copied links as first level headers:

# [Custom Dataset Slide](https://nextstrain.org/custom/h3n2/ha?d=tree,map,frequencies&f_sample_type=sample&p=grid)

Example of a basic slide view.

The slide is filtered to a "Sample Type" of "sample".

Copy the end of the link from your browser (the dataset path and the query string after it) and add the prefix https://nextstrain.org/.
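
It can help to see what a slide link encodes. The sketch below splits the example slide URL into its dataset path and query parameters. Our reading of the parameters (d as the displayed panels, f_ prefixes as filters, p as the panel layout) is an interpretation based on the example, not a full description of the Auspice URL scheme.

```python
from urllib.parse import urlsplit, parse_qs

SLIDE_URL = ("https://nextstrain.org/custom/h3n2/ha"
             "?d=tree,map,frequencies&f_sample_type=sample&p=grid")


def describe_slide(url):
    """Break a narrative slide URL into dataset, panels, filters, and layout."""
    parts = urlsplit(url)
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "dataset": parts.path.strip("/"),
        "panels": query["d"].split(",") if "d" in query else [],
        "filters": {key[2:]: value.split(",")
                    for key, value in query.items() if key.startswith("f_")},
        "layout": query.get("p"),
    }


slide = describe_slide(SLIDE_URL)
print(slide)
```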

Components of a Narrative: Basic Slides (Result)

Components of a Narrative: Animations

Timeline animations can be used in narratives by copying the link associated with the animation.

The link will briefly show after clicking the “Play ▶️” button underneath the “Date Range” toolbar.

# [Custom Dataset Animations](https://nextstrain.org/custom/h3n2/ha?animate=2024-06-04,2025-12-20,0,0,30000&d=map&f_sample_type=sample&p=full)

Animations can be captured.

Components of a Narrative: Animations (Result)

Components of a Narrative: Images

Images can be embedded into the narrative sidebar or main display from direct links or base64 conversion.

# [Image Embedding](https://nextstrain.org/custom/h3n2/ha)

Images can be embedded from the web in the side panel.

![](https://www.esrf.fr/files/live/sites/www/files/news/general-old/general-2009/Influenza-Virus.jpg)

```auspiceMainDisplayMarkdown
### Or in the main body display via the `auspiceMainDisplayMarkdown` section.

For main body markdown display.

![Flu](https://www.embl.org/news/wp-content/uploads/2020/04/INFLUENZA_EDIT.jpg)
```

Note the usage of the auspiceMainDisplayMarkdown block to have markdown for the main narrative display.

Components of a Narrative: Images (Results)

Narratives Continued

Standard markdown features can all be used in both the narrative side panel and the main display:

Tables
Lists
Text formatting
HTML embedding

Again, the main display requires the auspiceMainDisplayMarkdown block.

Narrative Exercise

Before moving on to the exercise, open a new terminal in the seasonal-flu-demo folder and enter a new nextstrain shell session:

nextstrain shell .

Then, run another build using the following command:

nextstrain build . --configfile profiles/gisaid/custom_gisaid_global.yaml

Repeat this process for a third build.

Open a new terminal, enter a nextstrain shell, and execute the following:

nextstrain build . --configfile profiles/gisaid/custom_gisaid_norefs.yaml

Narrative Exercise

Copy the narrative file located at profiles/gisaid/custom.md into your auspice/ folder:

cp profiles/gisaid/custom.md auspice/custom.md

With your first build, which focuses on viruses circulating in Oceania/Australia (custom_h3n2_ha), modify the auspice/custom.md narrative markdown to add slides addressing the following:

For subclade K, what is the rate estimate for the tree overall vs. the rate estimate for this subclade?
Are there any reagent viruses in this subclade? If so, highlight them.
Add another narrative slide of your choice, highlighting something you’ve found.
Close out your narrative with a summary slide.

Narrative Exercise Answers

Copy the narrative practical answers file to your auspice/ folder to view the practical section answers:

cp profiles/gisaid/custom_practical.md auspice/custom_practical.md

View this narrative in the auspice viewer launched in your browser.

Or upload the narrative file and build JSON files into auspice.us.

How do your answers compare to the practical answers?

Section 6: Build Comparisons

We’ll take some time to compare the three different builds we’ve created in order to address the caveats of sampling.
