NCBI - GenBank • seqsender

Overview

The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a part of the National Institutes of Health in the United States) as part of the International Nucleotide Sequence Database Collaboration (INSDC).

Before submitters can batch upload metadata and sequence data to the GenBank database using seqsender, they must prepare the required files (such as config.yaml, metadata.csv, sequence.fasta, raw reads, etc.) ahead of time and store them in a submission folder of their choice (e.g., submission_name) within a parent submission directory (e.g., submission_dir). seqsender can then pick up the necessary files in that folder, generate the submission files, and batch upload them to the database of choice.

Required files

Config file in a yaml format
Fasta file in a fasta format
Metadata file in a csv format

A quick look at where to store all of the required files
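As a rough illustration of that layout, the sketch below checks that a submission folder contains the files named in the overview. The function name `missing_required_files` is hypothetical and not part of seqsender, and the file names are only the examples used on this page; adjust them to match your own submission.

```python
from pathlib import Path

# Example file names taken from the overview; your own names may differ.
REQUIRED_FILES = ("config.yaml", "metadata.csv", "sequence.fasta")


def missing_required_files(submission_dir, submission_name, required=REQUIRED_FILES):
    """Return the required file names that are absent from the submission folder."""
    folder = Path(submission_dir) / submission_name
    return [name for name in required if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_required_files("submission_dir", "submission_name")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are in place.")
```

Running a check like this before invoking seqsender can catch a misplaced file early, although seqsender itself remains the authority on what it accepts.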

Config file

The config file is a yaml file that briefly describes the submission and contains the user credentials that allow seqsender to authenticate with the database before uploading a submission.

NOTE:

To submit to NCBI only, remove the GISAID Submission (b) section from the config file. Conversely, to submit to GISAID only, remove the NCBI Submission (a) section.
Submission_Position determines the order in which seqsender submits to the databases. For instance, if GISAID is set to 1, seqsender submits to GISAID first; once all samples have been assigned a GISAID accession number, seqsender proceeds to submit to NCBI. This order of submission ensures samples are linked correctly between the two databases.
Username and Password under the NCBI Submission (a) section are the credentials used to authenticate with the NCBI FTP server (not to be confused with an individual NCBI account). See PRE-REQUISITES for more details.

Fasta file

Fasta file contains nucleotide sequences for all samples. See Genbank Fasta Format for more details.
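Because the metadata file refers to sequences by identifier, it can help to confirm that the two files agree before submitting. The sketch below is one possible check. It assumes that the identifier is the first whitespace-delimited token of each fasta header and that it should equal the `sequence_name` column of the metadata file; both helper names are hypothetical and not part of seqsender.

```python
def fasta_ids(lines):
    """Yield the first whitespace-delimited token of each fasta header line."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                yield header.split()[0]


def unmatched_ids(fasta_lines, sequence_names):
    """Return (ids only in the fasta, names only in the metadata), both sorted."""
    ids = set(fasta_ids(fasta_lines))
    names = set(sequence_names)
    return sorted(ids - names), sorted(names - ids)


if __name__ == "__main__":
    example_fasta = [">sample1 some description", "ACGT", ">sample2", "GGCC"]
    only_fasta, only_metadata = unmatched_ids(example_fasta, ["sample2", "sample3"])
    print("In fasta but not metadata:", only_fasta)
    print("In metadata but not fasta:", only_metadata)
```

In practice the lines could come from an open file handle, and the names from the `sequence_name` column read with `csv.DictReader`.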

Metadata file

The metadata worksheet is a comma-delimited (csv) file that contains the required attributes, which are useful for the rapid analysis and trace back of Influenza A Virus or SARS-CoV-2 cases.

Here is a short description of the fields in the metadata worksheet.

Column_name	Description
sequence_name	Sequence identifier used in fasta file. This is used to create the fasta file for Genbank or GISAID.
organism	The most descriptive organism name for the samples. If relevant, you can search the organism name in the NCBI Taxonomy database. For FLU, organism must be “Influenza A Virus”. For COV, organism must be “Severe acute respiratory syndrome coronavirus 2”.
collection_date	The date on which the sample was collected; must be in the ISO format: YYYY-MM-DD. For example: 2020-03-25
authors	Citing authors. List of Last, First Middle, suffix separated by a semicolon “;” E.g.: “Baker, Howard Henry, Jr.; Powell, Earl Alexander, III.;”
ncbi-spuid	Submitter Provided Unique Identifiers. This is used to report back assigned accessions as well as for cross-linking objects within submission.
ncbi-spuid_namespace	If SPUID is used, spuid_namespace has to be provided. The values of spuid_namespace are from controlled vocabulary and need to be coordinated with NCBI prior to submission.
ncbi-bioproject	Associated BioProject accession number. For example: PRJNA217342
gb-seq_id	Identification to be used for the sequence in the FASTA.
gb-subm_lab	Full name of organization, institute, or laboratory, etc., who is submitting this record
gb-subm_lab_division	The division of organization, institute, or laboratory, etc., who is submitting this record
gb-subm_lab_addr	The address of organization, institute, or laboratory, etc., who is submitting this record
gb-publication_title	The title and relevant publication details (volume, issue, etc.) of a paper that discusses the submission. If left empty, the program will use the name of the submission as the title.
gb-publication_status	Options: “unpublished” or “in-press” or “published”
src-isolate	Identification or description of the specific individual from which this sample was obtained
src-country	Geographical origin of the sample; use the appropriate name from this list. Use a colon to separate the country or ocean from more detailed information about the location, e.g., “Canada: Vancouver” or “Germany: halfway down Zugspitze, Alps”. Entering multiple localities in one attribute is not allowed.
src-host	The natural (as opposed to laboratory) host to the organism from which the sample was obtained. Use the full taxonomic name, e.g., Homo sapiens
src-serotype	For Influenza A only; must be in format HxNx, Hx, Nx or mixed; where x is a numeral
src-isolation_source	Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived.
cmt-StructuredCommentPrefix	Structured comment keyword. For FLU use “FluData”, HIV use “HIV-DataBaseData”, and for COV and other organisms use “Assembly-Data”.
cmt-StructuredCommentSuffix	Structured comment keyword. For FLU use “FluData”, HIV use “HIV-DataBaseData”, and for COV and other organisms use “Assembly-Data”.
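Several of the fields above have a fixed format that can be checked mechanically. The following is a minimal sketch of such a check for `collection_date`, `gb-publication_status`, and `src-serotype`, based only on the descriptions in the table. The helper names are hypothetical, the sketch assumes `gb-publication_status` is always filled in, and it treats an empty `src-serotype` as acceptable because that field applies to Influenza A only.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
SEROTYPE = re.compile(r"H[0-9]+N[0-9]+|H[0-9]+|N[0-9]+|mixed")
PUBLICATION_STATUSES = {"unpublished", "in-press", "published"}


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_row(row):
    """Return a list of problems found in one metadata row (a dict of strings)."""
    problems = []
    if not is_iso_date(row.get("collection_date", "")):
        problems.append("collection_date must be in the format YYYY-MM-DD")
    if row.get("gb-publication_status", "") not in PUBLICATION_STATUSES:
        problems.append("gb-publication_status must be unpublished, in-press, or published")
    serotype = row.get("src-serotype", "")
    if serotype and not SEROTYPE.fullmatch(serotype):
        problems.append("src-serotype must be HxNx, Hx, Nx, or mixed")
    return problems


if __name__ == "__main__":
    example = {
        "collection_date": "2020-03-25",
        "gb-publication_status": "unpublished",
        "src-serotype": "H3N2",
    }
    print(check_row(example))
```

Rows read with `csv.DictReader` from metadata.csv can be passed to `check_row` one at a time; an empty list means none of these three checks failed.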

NOTE: The prefix “gb-” is used to identify attributes for GenBank submissions. The prefix “src-” is used to identify attributes for the Source Information Table. Likewise, the prefix “cmt-” is used to identify attributes for the Structured Comment Table.

To include additional attributes in the Source Information Table, just prepend src- to the desired attributes, e.g. src-subtype, src-passage, etc. See Genbank Source Table Modifier for more details.

To include additional attributes in the Structured Comment Table, just prepend cmt- to the desired attributes, and most importantly, the fields must be sandwiched between cmt-StructuredCommentPrefix and cmt-StructuredCommentSuffix. For example: cmt-StructuredCommentPrefix, cmt-Assembly Method, cmt-Coverage, ..., cmt-Sequencing Technology, cmt-StructuredCommentSuffix. See Genbank Structured Comment for more details.
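The ordering rule for structured comment fields can be checked against the header row of the metadata file. The sketch below assumes that "sandwiched" refers to column order in the csv header; the function name `misplaced_comment_columns` is hypothetical and not part of seqsender.

```python
PREFIX = "cmt-StructuredCommentPrefix"
SUFFIX = "cmt-StructuredCommentSuffix"


def misplaced_comment_columns(columns):
    """Return the cmt- columns that do not sit between the prefix and suffix columns."""
    extras = [
        (index, name)
        for index, name in enumerate(columns)
        if name.startswith("cmt-") and name not in (PREFIX, SUFFIX)
    ]
    if PREFIX not in columns or SUFFIX not in columns:
        # Without both boundary columns, no cmt- field can be correctly placed.
        return [name for _, name in extras]
    start, end = columns.index(PREFIX), columns.index(SUFFIX)
    return [name for index, name in extras if not (start < index < end)]


if __name__ == "__main__":
    header = [
        "sequence_name",
        "cmt-StructuredCommentPrefix",
        "cmt-Assembly Method",
        "cmt-Coverage",
        "cmt-Sequencing Technology",
        "cmt-StructuredCommentSuffix",
    ]
    print(misplaced_comment_columns(header))
```

An empty result means every additional cmt- column lies between the two boundary columns. If the suffix column appears before the prefix column, every additional cmt- column is reported.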

You are now ready to install seqsender and batch upload your submission.

Any questions or issues? Please report them on our GitHub issue tracker.