GISAID - EpiCoV • seqsender

Overview

GISAID, short for the Global Initiative on Sharing All Influenza Data, is an organization that manages a restricted-access database containing genomic sequence data of select viruses, primarily influenza viruses. The database has expanded to include the coronavirus responsible for the COVID-19 pandemic as well as other pathogens.

Prerequisites

For all GISAID submissions, seqsender uses GISAID’s Command Line Interface (CLI) tools to batch upload metadata and sequence data to their databases. Before performing a batch upload to the EpiCoV database, submitters must:

Download the EpiCoV CLI package from the GISAID Platform that is compatible with their machine (e.g., Linux, macOS, or Windows).

Unzip the downloaded package and store it in a subfolder called gisaid_cli within a submission directory of choice (e.g., submission_dir).

Required files

After obtaining the GISAID CLI for EpiCoV, submitters must also prepare the required files (such as config.yaml, metadata.csv, sequence.fasta, raw reads, etc.) and store them in a submission folder of choice (e.g., submission_name) within a parent submission directory (e.g., submission_dir). seqsender can then pick up the necessary files from that folder, generate the submission files, and batch upload them to the database of choice.

Here is a list of the required files, all of which are stored in the submission folder:

Config file in yaml format
Fasta file in fasta format
Metadata file in csv format
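
The layout described above can be checked before running seqsender. The following is a minimal sketch, not part of seqsender itself: the function name is hypothetical, and the file names config.yaml, metadata.csv, and sequence.fasta are assumed from the examples in this section, so adjust them if your files are named differently.

```python
from pathlib import Path

# File names assumed from the examples in this guide; adjust if yours differ.
REQUIRED_FILES = ("config.yaml", "metadata.csv", "sequence.fasta")


def missing_submission_items(submission_dir, submission_name):
    """Return the paths this guide expects that do not exist yet."""
    parent = Path(submission_dir)
    # The unzipped EpiCoV CLI package lives directly under the parent directory.
    expected = [parent / "gisaid_cli"]
    # The required files live in the submission folder inside the parent directory.
    expected += [parent / submission_name / name for name in REQUIRED_FILES]
    return [str(path) for path in expected if not path.exists()]
```

Calling missing_submission_items("submission_dir", "submission_name") returns an empty list when everything is in place, and otherwise lists the paths that still need to be created.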

Config file

The config file is a yaml file that briefly describes the submission and contains the user credentials that allow seqsender to authenticate with the database before uploading a submission.

NOTE:

To submit to NCBI only, remove the GISAID Submission (b) section from the config file. Conversely, to submit to GISAID only, remove the NCBI Submission (a) section.
Submission_Position determines the order in which seqsender submits to the databases. For instance, if GISAID is set to 1, seqsender will submit to GISAID first; once all samples have been assigned a GISAID accession number, seqsender will proceed to submit to NCBI. This order of submission ensures samples are linked correctly between the two databases.
Username and Password under the NCBI Submission (a) section are the credentials used to authenticate with the NCBI FTP server (not to be confused with an individual NCBI account). See PRE-REQUISITES for more details.
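
To illustrate how Submission_Position orders the databases, here is a small sketch. It does not reproduce seqsender's config schema or internals: the dictionary of database names to positions and the function name are hypothetical, chosen only to show that the database with the lowest position is submitted first.

```python
def submission_order(positions):
    """Sort database names by their Submission_Position value (1 goes first)."""
    return [name for name, _ in sorted(positions.items(), key=lambda item: item[1])]


# Hypothetical positions: GISAID is set to 1, so it is submitted before NCBI.
order = submission_order({"NCBI": 2, "GISAID": 1})
print(order)  # ['GISAID', 'NCBI']
```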

Fasta file

The fasta file contains the nucleotide sequences for all samples. See GenBank Fasta Format for more details.
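
Since the sequence identifiers in the metadata file have to correspond to the headers in the fasta file, it can help to compare the two before submitting. The sketch below is an illustration only, with hypothetical function names. It assumes each header consists solely of the identifier (everything after the ">"), and that you pass in the identifiers from whichever metadata column mirrors your fasta headers.

```python
def fasta_headers(fasta_text):
    """Return the header of each record, without the leading '>'."""
    return [
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]


def unmatched_names(fasta_text, metadata_names):
    """Compare fasta headers with metadata identifiers in both directions."""
    headers = set(fasta_headers(fasta_text))
    names = set(metadata_names)
    return {
        "only_in_fasta": sorted(headers - names),
        "only_in_metadata": sorted(names - headers),
    }
```

Both lists in the result are empty when every sequence has a metadata row and every metadata row has a sequence.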

Metadata file

The metadata worksheet is a comma-delimited (csv) file that contains the required attributes, which are useful for rapid analysis and trace-back of SARS-CoV-2 cases.

Here is a short description about the fields in the metadata worksheet.

Column_name	Description
sequence_name	Sequence identifier used in fasta file. This is used to create the fasta file for Genbank or GISAID.
organism	The most descriptive organism name for the samples. If relevant, you can search the organism name in the NCBI Taxonomy database. For FLU, organism must be “Influenza A Virus”. For COV, organism must be “Severe acute respiratory syndrome coronavirus 2”.
collection_date	The date on which the sample was collected; must be in the ISO format: YYYY-MM-DD. For example: 2020-03-25
authors	Citing authors. List of Last, First Middle, suffix separated by a semicolon “;” E.g.: “Baker, Howard Henry, Jr.; Powell, Earl Alexander, III.;”
gs-virus_name	For example: hCoV-19/Country/SampleID/YYYY. There are four parts delimited by the forward slash “/” character. (1) “hCoV-19”: despite common usage of virus synonyms such as SARS-CoV-2 or nCoV-19, this first part must remain “hCoV-19” verbatim (to ensure backwards compatibility with the EpiCoV database). (2) “Country” is the full name of the country of sample collection (e.g., Australia), including spaces; the one exception, for backwards compatibility, is “USA” for the United States of America. (3) “SampleID” is recommended to follow the format Loc-Lab-Number, where Loc is a location abbreviation (use the abbreviated state or province, such as “VIC” for Victoria, Australia, or “CA” for California, USA), Lab is a lab name abbreviation (e.g., “CDC” for Centres for Disease Control), and Number is a sample number or lab code (e.g., 02978 or S47y). (4) “YYYY” is the four-digit year of sample collection; it must match the YYYY in the collection_date value, otherwise a “date inplausible” error will occur. In this example, the virus_name could be hCoV-19/Australia/VIC-CDC-02978/2022 or hCoV-19/USA/CA-CDC-S47y/2022, respectively. NOTE: the virus_name field must exactly match the header of the respective sequence in the fasta file.
gs-type	For hCoV-19, this will always be “betacoronavirus”.
gs-passage	“Original” if the sample was sequenced directly from swabs, otherwise add the name of the cell line (e.g., “Vero”) used to culture the specimen.
gs-location	Format as “Continent / Country / Region / Sub-region”
gs-host	For clinical samples, this is “Human”. Otherwise add the species name of the organism from which the sample was originally sourced.
gs-gender	Synonym for “Biological sex”. Should be “Female”, “Male”, or “Other”
gs-patient_age	Age of the person from whom the specimen was collected, in years by default. Formats other than integer years are accepted, for example “0.5” (i.e., 6 months), “5 days”, or “7 months”. If no units are given, years are assumed.
gs-patient_status	E.g., “Hospitalized”, “Released”, “Live”, “Deceased”
gs-seq_technology	Add the sequencer brand and model. See a list of options here.
gs-orig_lab	Full name of laboratory from where sample originated.
gs-orig_lab_addr	Complete building address of laboratory from where sample originated.
gs-subm_lab	Full name of laboratory submitting this record to GISAID. See a list of options here.
gs-subm_lab_addr	Complete building address of the submitting laboratory.

NOTE: The prefix “gs-” is used to identify attributes for GISAID submissions.
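
The rules for gs-virus_name and collection_date in the table above can be checked locally before uploading. This is a minimal sketch under stated assumptions: the function name and the wording of the messages are hypothetical, and the checks cover only the rules described here (four parts, the “hCoV-19” prefix, and a year that matches collection_date), not everything GISAID validates.

```python
import re
from datetime import date


def check_virus_name(virus_name, collection_date):
    """Return a list of problems found in a gs-virus_name value."""
    parts = virus_name.split("/")
    if len(parts) != 4:
        return [f"expected 4 parts separated by '/', found {len(parts)}"]
    prefix, country, sample_id, year = parts

    problems = []
    if prefix != "hCoV-19":
        problems.append("first part must be 'hCoV-19'")
    if not country.strip():
        problems.append("country is empty")
    if not sample_id.strip():
        problems.append("sample ID is empty")

    # Require the literal YYYY-MM-DD shape, then confirm it is a real date.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", collection_date):
        problems.append("collection_date is not YYYY-MM-DD")
        return problems
    try:
        collected = date.fromisoformat(collection_date)
    except ValueError:
        problems.append("collection_date is not a valid calendar date")
        return problems

    if year != f"{collected.year:04d}":
        problems.append("year does not match collection_date")
    return problems
```

An empty list means the value passed these checks; for instance, check_virus_name("hCoV-19/Australia/VIC-CDC-02978/2022", "2022-03-25") reports no problems.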

Optional Attributes

To include additional attributes in EpiCoV submissions, prepend gs- to the name of each desired attribute (for example, specimen becomes gs-specimen). Here is a list of optional attributes:

Column_name	Description
add_location	e.g. Cruise Ship, Convention, Live animal market
add_host_info	e.g. Patient infected while traveling in
sampling_strategy	e.g. Sentinel surveillance (ILI), Sentinel surveillance (ARI), Sentinel surveillance (SARI), Non-sentinel-surveillance (hospital), Non-sentinel-surveillance (GP network), Longitudinal sampling on same patient(s), S gene dropout
specimen	e.g. Sputum, Alveolar lavage fluid, Oro-pharyngeal swab, Blood, Tracheal swab, Urine, Stool, Cloakal swab, Organ, Feces, Other
outbreak	Date, Location e.g. type of gathering, Family cluster, etc.
last_vaccinated	provide details if applicable
treatment	Include drug name, dosage
assembly_method	e.g. CLC Genomics Workbench 12, Geneious 10.2.4, SPAdes/MEGAHIT v1.2.9, UGENE v. 33, etc.
coverage	e.g. 70x, 1,000x, 10,000x (average)
provider_sample_id	Sample ID given by originating laboratory
subm_sample_id	Sample ID given by the submitting laboratory
consortium	Sequencing consortium the submitting lab is affiliated to
comment	Comment
comment_type	Comment icon
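
As an illustration of the gs- prefix rule for optional attributes, the sketch below renames the optional columns of a metadata header. The function name is hypothetical and this is not how seqsender itself processes the file; the attribute names are taken from the table above.

```python
# Optional attribute names as listed in this guide.
OPTIONAL_ATTRIBUTES = {
    "add_location", "add_host_info", "sampling_strategy", "specimen",
    "outbreak", "last_vaccinated", "treatment", "assembly_method",
    "coverage", "provider_sample_id", "subm_sample_id", "consortium",
    "comment", "comment_type",
}


def prefix_optional_columns(header):
    """Add the 'gs-' prefix to optional attribute names; leave other columns as-is."""
    return [
        f"gs-{name}" if name in OPTIONAL_ATTRIBUTES else name
        for name in header
    ]


columns = prefix_optional_columns(["sequence_name", "specimen", "gs-host", "coverage"])
print(columns)  # ['sequence_name', 'gs-specimen', 'gs-host', 'gs-coverage']
```

Columns that already carry the prefix, such as gs-host, are left unchanged because only the bare optional names are matched.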

You are now ready to install seqsender and batch upload your submission.

Any questions or issues? Please report them on our GitHub issue tracker.