Metadata

Outline

The metadata describe how the associated run data were obtained. They are composed of six objects: Submission, Study, Sample, Experiment, Run and Analysis. Each object is defined by its own XML schema, and the objects are related to each other. Accession numbers with distinct prefixes are assigned to Submission (DRA), Study (DRP), Sample (DRS), Experiment (DRX), Run (DRR) and Analysis (DRZ) objects. The metadata and accession number system are common to DRA, ERA and SRA. Assigned accession numbers can be cited in the relevant paper to refer to the records.
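As an illustration of the prefix scheme above, the following minimal sketch maps an accession number to its object type. The prefixes come from the text; the assumption that an accession is a prefix followed only by digits, and the function name `object_type`, are hypothetical and used here for illustration.

```python
import re
from typing import Optional

# Prefixes as listed in the text above.
PREFIX_TO_OBJECT = {
    "DRA": "Submission",
    "DRP": "Study",
    "DRS": "Sample",
    "DRX": "Experiment",
    "DRR": "Run",
    "DRZ": "Analysis",
}

# Assumption: an accession is one of the prefixes followed by digits only.
_ACCESSION_RE = re.compile(r"^(DR[APSXRZ])(\d+)$")


def object_type(accession: str) -> Optional[str]:
    """Return the metadata object type for a DRA-style accession, or None."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    return PREFIX_TO_OBJECT[match.group(1)]
```

For example, `object_type("DRR000001")` returns `"Run"`, while an accession with an ERA or SRA prefix returns `None` in this sketch because only the DRA prefixes are listed here.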

Metadata objects
Submission

A submission contains contact details of the submitter and directions for the release of data. The submission object defines whether the submitted metadata and data should become public immediately or remain confidential for a period of no more than two years. Once data have been released, they can be withdrawn from public access only by contacting us.

Study

The study is used to describe the study/project in detail. The study should have one or more experiments and may have one or more analyses. It contains a title, a project name and an abstract, as they would appear in the publication. After publication of the paper containing the submitted data, please add the PubMed ID to the Study.

Another important aspect of the study is the BioProject ID. BioProject IDs are assigned by the INSDC to large-scale genome sequencing, transcriptome and epigenetics projects. A BioProject ID is useful for grouping multiple submissions that may be archived in different databases. If you intend to submit higher-level data based on the raw reads, please submit a project and obtain a BioProject ID. See DDBJ BioProject.

Sample

The sample is used to describe taxonomic information of the sequenced samples. The mandatory fields are minimal. However, because the sample is one of the most important objects to describe biologically, it is highly recommended that "TAG-VALUE" pairs be generated to describe the sample in as much detail as possible (for example, strain - 1234). We recommend the adoption of GSC (Genomic Standards Consortium) terms for the TAG names where possible. For a full list of terms in the specific standards, please visit the GSC wiki.

Experiment

The experiment is used to describe each unique experimental setup, including instrument platform and model details, library preparation details, and any additional information required to correctly interpret the submitted data. Where any of these values differ between runs, a new experiment object must exist. Each experiment references a study and a sample by alias. It is recommended to demultiplex pooled data by barcode for submission.

Run

The run is used to describe the data files and their relationship with the experiments. For pooled data, it is recommended to submit demultiplexed data as different SRA runs.

Analysis

The analysis packages data associated with sequence read objects that are intended for downstream usage or that otherwise need an archival home. Examples include annotations and QC reports. Please submit the analyzed data to the appropriate INSDC-associated database (DDBJ or DOR) if there is one.

Metadata fields in metadata creation tool

Examples of metadata.

The items with * are required.
The items with ** are conditionally required for an element containing the items.


Submission

Center Name

Enter submitter's organization.

Center Name*

A submitter's center name (see the Center Name List). A center name abbreviation is required to submit data to DRA.

In the metadata creation tool, the center name is automatically filled with the account information.

The Center Name is an abbreviation used operationally by SRA and does not indicate ownership of the submission. The submitters listed in Submitter hold ownership of the submission.

Lab Name*
Laboratory name within the submitting institution. The Lab Name is pre-filled with the "Lab/Group", "Department (2)", "Department (1)" and "Organization" values of the D-way account. The text can be edited.

Hold Until

Specify how to release the data.

Hold Until*
Direct the DRA to release the record on or after the specified date. The submitter can set the hold date for a maximum of 2 years and can change the date before the record is released.
Immediate Release*
Direct the DRA to release the record immediately after submission is processed.
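The 2-year limit on the hold date can be checked before submission. The sketch below is a minimal example under stated assumptions: the limit is counted in calendar years from the submission date, and 29 February is mapped to 28 February in a non-leap target year. The text does not say how DRA counts the period, and the function names are hypothetical.

```python
from datetime import date

MAX_HOLD_YEARS = 2  # from the text: hold date can be set for a maximum of 2 years


def latest_hold_date(submitted: date) -> date:
    """Latest allowed hold date, assuming calendar years from submission."""
    target_year = submitted.year + MAX_HOLD_YEARS
    try:
        return submitted.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return submitted.replace(year=target_year, day=28)


def is_valid_hold_date(hold_until: date, submitted: date) -> bool:
    """True if hold_until is not before submission and within the limit."""
    return submitted <= hold_until <= latest_hold_date(submitted)
```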

Submitter

The DRA contacts the listed address(es) by e-mail regarding the submission. Include contact information for the PI and for the non-PI member(s) who actually submit the data. The contact information is not made public. If you want to display the contact information, enter it in the BioProject.

Name*
Name of submitter.
E-mail*
E-mail of submitter.

BioProject

BioProject ID*
Select a project registered to BioProject or submit a new project. For submission to BioProject, please refer to the BioProject Handbook.

BioSample

BioSample ID*
Select samples registered to BioSample or create and submit new samples. For submission to BioSample, please refer to BioSample Handbook.

Experiment

Alias
Name of the experiment designated by the archive. This alias is used to reference metadata objects without accession numbers.
BioSample Used*
Select the BioSample this experiment uses.
Title*
Short text that can be used to call out experiment records in searches or in displays. A title like "[Sequencing Instrument Model] [paired end] sequencing of [BioSample ID]" (for example, "Illumina HiSeq 2000 paired end sequencing of SAMD00025741") is automatically constructed. To enter user-defined titles, download Experiment metadata into a tab-delimited text file, edit title values and upload it.
Library Name
The submitter's name for this library.
Library Source*
The Library Source specifies the type of source material that is being sequenced.
Library Source Description
GENOMIC Genomic DNA (includes PCR products from genomic DNA).
TRANSCRIPTOMIC Transcription products or non genomic DNA (EST, cDNA, RT-PCR, screened libraries).
METAGENOMIC Mixed material from metagenome.
METATRANSCRIPTOMIC Transcription products from community targets.
SYNTHETIC Synthetic DNA.
VIRAL RNA Viral RNA.
OTHER Other, unspecified, or unknown library source material.
Library Selection*
Whether any method was used to select and/or enrich the material being sequenced.
Library Selection Description
RANDOM Random shearing only.
PCR Source material was selected by designed primers.
RANDOM PCR Source material was selected by randomly generated primers.
RT-PCR Source material was selected by reverse transcription PCR.
HMPR Hypo-methylated partial restriction digest.
MF Methyl Filtrated.
repeat fractionation Selection for less repetitive (and more gene rich) sequence through Cot filtration (CF) or other fractionation techniques based on DNA kinetics.
size fractionation Physical selection of size appropriate targets.
MSLL Methylation Spanning Linking Library.
cDNA complementary DNA.
cDNA_randomPriming
cDNA_oligo_dT
PolyA PolyA selection or enrichment for messenger RNA (mRNA); should replace cDNA enumeration.
Oligo-dT enrichment of messenger RNA (mRNA) by hybridization to Oligo-dT.
Inverse rRNA depletion of ribosomal RNA by oligo hybridization.
ChIP Chromatin immunoprecipitation.
MNase Micrococcal Nuclease (MNase) digestion.
DNAse Deoxyribonuclease (DNase) digestion.
Hybrid Selection Selection by hybridization in array or solution.
Reduced Representation Reproducible genomic subsets, often generated by restriction fragment size selection, containing a manageable number of loci to facilitate re-sampling.
Restriction Digest DNA fractionation using restriction enzymes.
5-methylcytidine antibody Selection of methylated DNA fragments using an antibody raised against 5-methylcytosine or 5-methylcytidine (m5C).
MBD2 protein methyl-CpG binding domain Enrichment by methyl-CpG binding domain.
CAGE Cap-analysis gene expression.
RACE Rapid Amplification of cDNA Ends.
MDA multiple displacement amplification.
padlock probes capture method Padlock Probes capture strategy to be used in conjunction with Bisulfite-Seq.
other Other library enrichment, screening, or selection process.
unspecified Library enrichment, screening, or selection is not specified.
Library Strategy*
Sequencing technique intended for this library.
Library Strategy Description
WGS Whole genome shotgun.
WGA Whole genome amplification.
WXS Random sequencing of exonic regions selected from the genome.
RNA-Seq Random sequencing of whole transcriptome.
miRNA-Seq Micro RNA and other small non-coding RNA sequencing.
ncRNA-Seq Capture of other non-coding RNA types, including post-translation modification types such as snRNA (small nuclear RNA) or snoRNA (small nucleolar RNA), or expression regulation types such as siRNA (small interfering RNA) or piRNA/piwi/RNA (piwi-interacting RNA).
ssRNA-seq strand-specific RNA sequencing
WCS Whole chromosome (or other replicon) shotgun.
CLONE Genomic clone based (hierarchical) sequencing.
POOLCLONE Shotgun of pooled clones (usually BACs and Fosmids).
AMPLICON Sequencing of overlapping or distinct PCR or RT-PCR products.
CLONEEND Clone end (5', 3', or both) sequencing.
FINISHING Sequencing intended to finish (close) gaps in existing coverage.
RAD-Seq Restriction Site Associated DNA Sequence
ChIP-Seq Direct sequencing of chromatin immunoprecipitates.
MNase-Seq Direct sequencing following MNase digestion.
DNase-Hypersensitivity Sequencing of hypersensitive sites, or segments of open chromatin that are more readily cleaved by DNaseI.
Bisulfite-Seq Sequencing following treatment of DNA with bisulfite to convert cytosine residues to uracil depending on methylation status.
EST Single pass sequencing of cDNA templates.
FL-cDNA Full-length sequencing of cDNA templates.
CTS Concatenated Tag Sequencing.
MRE-Seq Methylation-Sensitive Restriction Enzyme Sequencing strategy.
MeDIP-Seq Methylated DNA Immunoprecipitation Sequencing strategy.
MBD-Seq Direct sequencing of methylated fractions sequencing strategy.
Tn-Seq Gene fitness determination through transposon seeding.
FAIRE-seq Formaldehyde Assisted Isolation of Regulatory Elements
SELEX Systematic Evolution of Ligands by EXponential enrichment
RIP-Seq Direct sequencing of RNA immunoprecipitates (includes CLIP-Seq, HITS-CLIP and PAR-CLIP).
ChIA-PET Direct sequencing of proximity-ligated chromatin immunoprecipitates.
Hi-C Chromosome Conformation Capture technique where a biotin-labeled nucleotide is incorporated at the ligation junction, enabling selective purification of chimeric DNA ligation junctions followed by deep sequencing.
ATAC-seq Assay for Transposase-Accessible Chromatin (ATAC) strategy, used to study genome-wide chromatin accessibility. An alternative method to DNase-seq that uses an engineered Tn5 transposase to cleave DNA and to integrate primer DNA sequences into the cleaved genomic DNA.
Targeted-Capture
Tethered Chromatin Conformation Capture
Synthetic-Long-Read Binning and barcoding of large DNA fragments to facilitate assembly of the fragment.
Other Library strategy not listed.
Library Construction Protocol

Free form text describing the protocol by which the sequencing library was constructed. Please include protocols of DNA fragmentation, ligation and enrichment. If a library preparation kit is used, include the name and version (if any) of the kit (for example, Illumina Nextera DNA Library Preparation Kit).

Reference: Alnasir J, Shanahan HP. Investigation into the annotation of protocol sequencing steps in the sequence read archive. Gigascience. 2015 May 9;4:23. doi: 10.1186/s13742-015-0064-7. eCollection 2015. PMID: 25960871 (Open Access)

Instrument*
Select a sequencing instrument model.
Instrument Model
454 GS
454 GS 20
454 GS FLX
454 GS FLX+
454 GS FLX Titanium
454 GS Junior
Illumina Genome Analyzer
Illumina Genome Analyzer II
Illumina Genome Analyzer IIx
Illumina HiSeq 1000
Illumina HiSeq 1500
Illumina HiSeq 2000
Illumina HiSeq 2500
Illumina HiSeq 3000
Illumina HiSeq 4000
Illumina MiSeq
Illumina HiScanSQ
HiSeq X Five
HiSeq X Ten
NextSeq 500
NextSeq 550
Helicos HeliScope
AB SOLiD System
AB SOLiD System 2.0
AB SOLiD System 3.0
AB SOLiD 3 Plus System
AB SOLiD 4 System
AB SOLiD 4hq System
AB SOLiD PI System
AB 5500 Genetic Analyzer
AB 5500xl Genetic Analyzer
AB 5500xl-W Genetic Analysis System
Complete Genomics
MinION
GridION
PromethION
PacBio RS
PacBio RS II
Sequel
Ion Torrent PGM
Ion Torrent Proton
AB 310 Genetic Analyzer
AB 3130 Genetic Analyzer
AB 3130xL Genetic Analyzer
AB 3500 Genetic Analyzer
AB 3500xL Genetic Analyzer
AB 3730 Genetic Analyzer
AB 3730xL Genetic Analyzer
Spot Type*
Select a layout of reads in sequencing data files.
Spot Type Description
single Single read.
paired (FF) Paired reads with same direction.
paired (FR) Paired reads with opposite direction.
Nominal Length*
Size of the insert for Paired reads.
Nominal Sdev
Standard deviation of insert size.
Spot Length*

The read length in the submitted sequencing files. For mate pairs, this number includes both mates but does not include gap lengths.

  • When the spot length is constant, enter a constant value.
  • For 454 platforms producing reads with variable length, enter a constant flow count.
  • For fastq files with variable length, enter an average length.
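The rules above for fastq files can be sketched as follows. This is a minimal example, not the DRA's own procedure: it assumes a plain 4-line-per-record FASTQ stream for a single read file, rounds the average to the nearest integer, and does not cover the 454 flow-count case or summing both mates of a pair. The function names are hypothetical.

```python
from typing import Iterable, Iterator, TextIO


def read_lengths(handle: TextIO) -> Iterator[int]:
    """Yield the read length of each record in a 4-line-per-record FASTQ stream."""
    for index, line in enumerate(handle):
        if index % 4 == 1:  # the second line of each record is the sequence
            yield len(line.strip())


def spot_length(lengths: Iterable[int]) -> int:
    """Constant length if all reads agree, otherwise the rounded average."""
    total = 0
    count = 0
    shortest = longest = None
    for n in lengths:
        total += n
        count += 1
        shortest = n if shortest is None else min(shortest, n)
        longest = n if longest is None else max(longest, n)
    if count == 0:
        raise ValueError("no reads found")
    if shortest == longest:
        return longest  # constant read length
    return round(total / count)  # assumption: round the average to an integer
```

A file opened with `open(path)` can be passed directly to `read_lengths`, and the resulting iterator to `spot_length`, so the reads are never held in memory all at once.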

Run

Alias
Name of the run designated by the archive. This alias is used to reference metadata objects without accession numbers.
Title*
Short text that can be used to call out run records in searches or in displays. A title like "[Sequencing Instrument Model] [paired end] sequencing of [BioSample ID]" (for example, "Illumina HiSeq 2000 paired end sequencing of SAMD00025741") is automatically constructed. To enter user-defined titles, download Run metadata into a tab-delimited text file, edit title values and upload it.
Experiment Referenced*
Select the experiment this run belongs to.
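The automatically constructed title described for Experiment and Run can be reproduced when preparing user-defined titles in a tab-delimited file. The sketch below follows the pattern "[Sequencing Instrument Model] [paired end] sequencing of [BioSample ID]" from the text. How the tool words titles for single reads is not stated, so omitting the layout part for single reads is an assumption, and the function name is hypothetical.

```python
def default_title(instrument_model: str, biosample_id: str, paired: bool) -> str:
    """Build a title following the pattern quoted in the text."""
    parts = [instrument_model]
    if paired:
        parts.append("paired end")
    # Assumption: for single reads the layout part is simply left out.
    parts.append(f"sequencing of {biosample_id}")
    return " ".join(parts)
```

With the values from the example in the text, `default_title("Illumina HiSeq 2000", "SAMD00025741", True)` gives "Illumina HiSeq 2000 paired end sequencing of SAMD00025741".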

Data files for Run

Select data files for a Run.

Run/Analysis
Specify whether a data file belongs to the Run or Analysis. In the web submission form, this field is not editable and is filled in automatically according to the selected Run or Analysis. When uploading metadata as a tsv file, this field must be specified manually.
File Name*
The name of a sequence data file. Uploaded filenames are automatically filled in.
Run/Analysis contains files*
Select a Run to which the data file belongs.
File Type*
The sequence data file format. For the fastq files with variable read length, select 'generic_fastq'. For the fastq files with constant read length, select 'fastq'.

File Type Description
generic_fastq fastq files with variable read length
fastq fastq files with constant read length
sff 454 Standard Flowgram Format file
hdf5 PacBio hdf5 Format file
SOLiD_native SOLiD csfasta and qual files. # Support for this format is planned to be deprecated in May, 2017.
bam Binary SAM format for use by loaders that combine alignment and sequencing data
tab A tab-delimited table that maps "SN in SQ line of BAM header" to "reference fasta file"
reference_fasta Reference sequence file in single fasta format used to construct SRA archive file format. Filename must end with ".fa"
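The choice between 'fastq' and 'generic_fastq' depends only on whether the read length is constant. A minimal sketch of that rule, assuming the read lengths have already been collected (the function name is hypothetical):

```python
from typing import Iterable


def fastq_file_type(read_lengths: Iterable[int]) -> str:
    """Return 'fastq' for constant read length, 'generic_fastq' otherwise."""
    distinct = set(read_lengths)
    if not distinct:
        raise ValueError("no read lengths given")
    return "fastq" if len(distinct) == 1 else "generic_fastq"
```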
MD5 Checksum*
MD5 checksum of a sequence data file. How to obtain the MD5 checksum values.
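One way to obtain the MD5 checksum is with Python's standard `hashlib` module. The sketch below reads the file in chunks so that large sequence files are not loaded into memory at once; the function name and the 1 MiB chunk size are arbitrary choices for illustration.

```python
import hashlib


def md5_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned value is a 32-character lowercase hexadecimal string, the same form printed by common command-line tools such as `md5sum`.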

Analysis

Alias
Name of the analysis designated by the archive. This alias is used to reference metadata objects without accession numbers.
Title*
Title of the analysis object.
Description*
Describes the contents of the analysis.
Analysis Type*
Select an Analysis type. Submit alignment data to Run in bam format.
Analysis Type Description
De Novo Assembly A placement of sequences including trace, SRA, GI records into a multiple alignment from which a consensus is computed.
Sequence Annotation Per sequence annotation of named attributes and values.
Example: Processed sequencing data for submission to dbEST without assembly.
Reads have already been submitted to one of the sequence read archives in raw form.
The fasta data submitted under this analysis object result from the following treatments, which may serve to filter reads from the raw dataset:
    - sequencing adapter removal
    - low quality trimming
    - poly-A tail removal
    - strand orientation
    - contaminant removal.
Abundance Measurement Identify the tools and processing steps used to produce the abundance measurements (coverage tracks).

Data files for Analysis

Select data files for an Analysis.

Run/Analysis
Specify whether a data file belongs to the Run or Analysis. In the web submission form, this field is not editable and is filled in automatically according to the selected Run or Analysis. When uploading metadata as a tsv file, this field must be specified manually.
File Name*
The name of an analysis file.
Run/Analysis contains files*
Select an Analysis to which the data file belongs.
File Type*
The analysis data file format.
File Type Description
bam Binary form of the Sequence alignment/map format for read placements, from the SAM tools project.
See http://sourceforge.net/projects/samtools/.
tab A tab delimited text file that can be viewed as a spreadsheet. The first line should contain column headers.
ace Multiple alignment file output from the phred assembler and similar programs.
See http://www.phrap.org/consed/distributions/README.16.0.txt for a description of the ACE file format.
fasta Sequence data format indicating sequence base calls. The format is simple: a header line initiated with the > character, data lines following with base calls.
wig The wiggle (WIG) format allows display of continuous-valued data in track format. This display type is useful for GC percent, probability scores, and transcriptome data.
See http://genome.ucsc.edu/goldenPath/help/wiggle.html for a description of the Wiggle Track format.
BED BED format provides a flexible way to define the data lines that are displayed in an annotation track.
See http://genome.ucsc.edu/FAQ/FAQformat#format1 for a description of the BED format.
VCF Variant Call Format.
See http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41 for a description of the VCF format.
MAF Mutation Annotation Format
GFF General Feature Format
csv
tsv
MD5 Checksum*
MD5 checksum of an analysis data file. How to obtain the MD5 checksum values.

XML Schema

For full details of the metadata, please see XML Schemas (NCBI).