NCBI - SRA • seqsender

Overview

Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.

Before submitters can upload sequence read archives to the SRA database using seqsender, they must prepare the required files (such as config.yaml, metadata.csv, sequence.fasta, raw reads, etc.) ahead of time and store them in a submission folder of their choice (e.g., submission_name) within a parent submission directory (e.g., submission_dir). seqsender can then pick up the necessary files in that folder, generate the submission files, and batch upload them to the chosen database(s).

Required files

Config file in YAML format
Sequence read archives in BAM, SFF, HDF5, or FASTQ format
Metadata file in CSV format

A quick look at where to store all of the required files
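
The layout described above (a submission folder holding the config file, the metadata file, and a raw_reads subfolder) can be sanity-checked before running seqsender. The following is a minimal Python sketch and not part of seqsender: the function name missing_requirements, the use of the example file names config.yaml and metadata.csv as fixed names, and the list of file extensions are all assumptions made for illustration.

```python
from pathlib import Path
import tempfile

# Assumed layout, inferred from the text (names are the examples used above):
#   submission_dir/
#     submission_name/
#       config.yaml
#       metadata.csv
#       raw_reads/      <- BAM, SFF, HDF5, or FASTQ files

# Assumed file extensions for the accepted read formats; adjust as needed.
READ_SUFFIXES = (".bam", ".sff", ".hdf5", ".fastq", ".fastq.gz", ".fastq.bz2")


def missing_requirements(submission_folder):
    """Return a list of problems found in a submission folder (empty if none)."""
    folder = Path(submission_folder)
    problems = []
    for name in ("config.yaml", "metadata.csv"):
        if not (folder / name).is_file():
            problems.append(f"missing file: {name}")
    reads_dir = folder / "raw_reads"
    if not reads_dir.is_dir():
        problems.append("missing folder: raw_reads")
    elif not any(p.name.lower().endswith(READ_SUFFIXES) for p in reads_dir.iterdir()):
        problems.append("raw_reads contains no read files")
    return problems


if __name__ == "__main__":
    # Build a throwaway, deliberately incomplete submission folder and check it.
    with tempfile.TemporaryDirectory() as submission_dir:
        folder = Path(submission_dir) / "submission_name"
        (folder / "raw_reads").mkdir(parents=True)
        (folder / "config.yaml").touch()
        for problem in missing_requirements(folder):
            print(problem)
```

An empty list means the folder has the expected shape; it says nothing about whether the contents of the files are valid.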

Config file

The config file is a YAML file that briefly describes the submission and contains the user credentials that seqsender uses to authenticate with the database before uploading a submission.

NOTE:

To submit to NCBI only, remove the GISAID Submission (b) section from the config file. Conversely, to submit to GISAID only, remove the NCBI Submission (a) section.
Submission_Position determines the order in which seqsender submits to the databases. For instance, if GISAID is set to 1, seqsender will submit to GISAID first; once all samples have been assigned a GISAID accession number, seqsender will proceed to submit to NCBI. This order of submission ensures that samples are linked correctly between the two databases.
Username and Password under the NCBI Submission (a) section are the credentials used to authenticate with the NCBI FTP server (not to be confused with an individual NCBI account). See PRE-REQUISITES for more details.

Sequence read archives

Currently, NCBI accepts binary formats such as BAM, SFF, and HDF5 and text formats such as FASTQ. See SRA Submit Formats for more details.

NOTE:

Sequence read archives for all samples must be stored in a subfolder called raw_reads inside the submission folder of choice.

Metadata file

The metadata worksheet is a comma-delimited (csv) file that contains required attributes that are useful for the rapid analysis and trace-back of Influenza A Virus or SARS-CoV-2 cases.

Here is a short description of the fields in the metadata worksheet.

Column_name	Description
sequence_name	Sequence identifier used in fasta file. This is used to create the fasta file for Genbank or GISAID.
organism	The most descriptive organism name for the samples. If relevant, you can search the organism name in the NCBI Taxonomy database. For FLU, organism must be “Influenza A Virus”. For COV, organism must be “Severe acute respiratory syndrome coronavirus 2”.
collection_date	The date on which the sample was collected; must be in the ISO format: YYYY-MM-DD. For example: 2020-03-25
authors	Citing authors. List of Last, First Middle, suffix separated by a semicolon “;” E.g.: “Baker, Howard Henry, Jr.; Powell, Earl Alexander, III.;”
ncbi-spuid	Submitter Provided Unique Identifiers. This is used to report back assigned accessions as well as for cross-linking objects within submission.
ncbi-spuid_namespace	If SPUID is used, spuid_namespace has to be provided. The values of spuid_namespace are from controlled vocabulary and need to be coordinated with NCBI prior to submission.
ncbi-bioproject	Associated BioProject accession number. For example: PRJNA217342
sra-file_location	Location of raw reads files. Options: “local” or “cloud”
sra-file_name	Name of the raw read files. All file names must be unique and must not contain any sensitive information. Files can be compressed using gzip or bzip2 and may be submitted in a tar archive, but archiving and/or compressing your files is not required. Do not use zip! If there are multiple files, separate them with commas (“,”), e.g. “sample1_R1.fastq.gz, sample1_R2.fastq.gz”. Store files in /seqsender/data/raw_reads/ or provide the full path to the raw read files.
sra-library_name	Short unique identifier for sequencing library. Each name must be unique!
sra-instrument_model	Type of instrument model used for sequencing. See a list of options here.
sra-library_strategy	The sequencing technique intended for the library. See a list of options here.
sra-library_source	The type of source material that is being sequenced. See a list of options here.
sra-library_selection	The method used to select and/or enrich the material being sequenced. See a list of options here.
sra-library_layout	Whether to expect SINGLE or PAIRED end reads. Options: “single” or “paired”
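
The constraints in the table above (ISO collection dates, the “local”/“cloud” and “single”/“paired” options, no zip files, unique library names) can be checked before a submission is attempted. The following is a minimal Python sketch and not seqsender's own validation: the function names, the decision to treat every listed column as mandatory, and the placeholder values in the example row (such as EXAMPLE-NAMESPACE) are assumptions made for illustration.

```python
import csv
import io
import re
from datetime import date

# Column names taken from the table above; all are assumed to be mandatory.
REQUIRED_COLUMNS = [
    "sequence_name", "organism", "collection_date", "authors",
    "ncbi-spuid", "ncbi-spuid_namespace", "ncbi-bioproject",
    "sra-file_location", "sra-file_name", "sra-library_name",
    "sra-instrument_model", "sra-library_strategy", "sra-library_source",
    "sra-library_selection", "sra-library_layout",
]


def is_iso_date(value):
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate_row(row):
    """Return a list of problems for one metadata row (empty if none)."""
    values = {col: (row.get(col) or "").strip() for col in REQUIRED_COLUMNS}
    problems = [f"{col}: missing value" for col, v in values.items() if not v]
    if values["collection_date"] and not is_iso_date(values["collection_date"]):
        problems.append("collection_date: expected YYYY-MM-DD")
    if values["sra-file_location"] and values["sra-file_location"] not in {"local", "cloud"}:
        problems.append("sra-file_location: expected 'local' or 'cloud'")
    if values["sra-library_layout"] and values["sra-library_layout"].lower() not in {"single", "paired"}:
        problems.append("sra-library_layout: expected 'single' or 'paired'")
    file_names = [n.strip() for n in values["sra-file_name"].split(",") if n.strip()]
    if any(n.lower().endswith(".zip") for n in file_names):
        problems.append("sra-file_name: zip files are not accepted")
    return problems


def validate_metadata(csv_text):
    """Validate all rows; return {data_row_number: [problems]} for bad rows."""
    report = {}
    seen_libraries = set()
    for number, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        problems = validate_row(row)
        library = (row.get("sra-library_name") or "").strip()
        if library and library in seen_libraries:
            problems.append("sra-library_name: must be unique")
        seen_libraries.add(library)
        if problems:
            report[number] = problems
    return report


if __name__ == "__main__":
    # Illustrative row; identifiers and the namespace are placeholders.
    example = {
        "sequence_name": "sample1",
        "organism": "Severe acute respiratory syndrome coronavirus 2",
        "collection_date": "2020-03-25",
        "authors": "Baker, Howard Henry, Jr.; Powell, Earl Alexander, III.;",
        "ncbi-spuid": "sample1",
        "ncbi-spuid_namespace": "EXAMPLE-NAMESPACE",
        "ncbi-bioproject": "PRJNA217342",
        "sra-file_location": "local",
        "sra-file_name": "sample1_R1.fastq.gz, sample1_R2.fastq.gz",
        "sra-library_name": "sample1_lib",
        "sra-instrument_model": "Illumina MiSeq",
        "sra-library_strategy": "AMPLICON",
        "sra-library_source": "VIRAL RNA",
        "sra-library_selection": "PCR",
        "sra-library_layout": "paired",
    }
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerow(example)  # the csv module quotes fields containing commas
    print(validate_metadata(buffer.getvalue()))
```

Writing the file through the csv module rather than by hand matters here because the authors and sra-file_name fields themselves contain commas and must be quoted.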

NOTE: The prefix “sra-” is used to identify attributes for SRA submissions.

To include additional attributes in SRA submissions, prepend sra- to the desired attribute names, e.g. sra-loader, sra-platform, etc. See SRA metadata section for more details.
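
As a small illustration of the naming rule above, the sketch below adds the sra- prefix to extra attribute names before they are written as metadata columns. This is a hypothetical helper, not a seqsender function: the name add_sra_prefix and the attribute values shown are placeholders chosen for the example.

```python
def add_sra_prefix(attributes):
    """Return a copy of attributes in which every key starts with 'sra-'."""
    prefixed = {}
    for name, value in attributes.items():
        # Leave names that already carry the prefix untouched.
        key = name if name.startswith("sra-") else f"sra-{name}"
        prefixed[key] = value
    return prefixed


# Placeholder values; consult the SRA metadata documentation for valid ones.
extra_columns = add_sra_prefix({"platform": "ILLUMINA", "sra-loader": "placeholder-loader"})
print(sorted(extra_columns))  # ['sra-loader', 'sra-platform']
```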

You are now ready to install seqsender and batch upload your submission.

Any questions or issues? Please report them on our GitHub issue tracker.