Frequently asked questions

FAQ: 21

Metadata*

Do I have to register a separate BioProject/BioSample for each genome I am sequencing?

If multiple cultured genomes are part of the same research effort, then they can belong to the same BioProject. However, each culture must be registered as a separate BioSample.

For metagenomic assemblies, where multiple genomes are assembled with high confidence from a single metagenomic sample, register one BioProject for the metagenomic assembly project and a BioSample for each sample of the metagenomic assembly.

Created: February 12, 2015; Last updated: December 13, 2016

What is the relationship between BioSamples, SRA Experiments, SRA Runs, and my data files?

BioSample is descriptive information about the biological source materials, or samples, used to generate experimental data in any of the primary data archives. Biological and technical replicates need to be registered as separate BioSamples distinguished by the "replicate" attribute having values such as "biological replicate 1" and "biological replicate 2".

Each SRA Experiment is a unique sequencing library for a specific sample. Importantly, much of the descriptive information that is displayed in the public record of your data is captured at the level of the DRA Experiment.

SRA Runs are simply a manifest of data file(s) that should be linked to a given sequencing library – no information present in the Run is displayed on the public record of your project. Note that all data files listed in a Run will be merged into a single SRA archive file (and fastq file for distribution), so files from different samples should not be grouped in the same Run. Paired-end data files (forward/reverse), conversely, MUST be listed in a single Run in order for the two files to be correctly processed as paired-end. Do not split the data of a paired-end library (for example, forward and reverse) across separate samples.

Created: June 4, 2014; Last updated: January 4, 2017

How do I import a BioProject or BioSample accession into the DRA?

BioProject and BioSample submissions must be made through the Submission Portal D-way. Once you begin a BioProject or BioSample submission, it will be assigned a temporary tracking ID (PSUB/SSUB[number], respectively) – this is not the final accession! Once a BioProject is complete, it is assigned an accession like PRJDB[number]. Once a BioSample submission is complete, each sample will receive an accession like SAMD[number]. When creating DRA experiments, please specify the PSUB ID or PRJDB[number] accession as your BioProject, and SSUB ID or SAMD[number] as your BioSample. Note that a given data file can be linked to a single BioSample only.

When sample preparation and sequencing are carried out by different research groups, a DRA Experiment can refer to BioProject and BioSample IDs obtained under another submission account. If you need to refer to external BioProject and BioSample IDs, contact the DRA team. When referencing external objects, please be aware that a data release can trigger the release of linked BioProject, BioSample and DRA submissions.

Created: June 4, 2014; Last updated: December 13, 2016

How many samples do I need for my DRA submission?

BioSample is descriptive information about the biological source materials, or samples, used to generate experimental data in any of the primary data archives. Biological and technical replicates are represented by separate BioSamples with distinct 'replicate' attributes, e.g., 'replicate = biological replicate 1'.

For environmental samples, each physical isolate should be considered a BioSample, whereas uniquely attributable reads within an isolate are not. Note that a given DRA data file can be linked to a single BioSample only.

Basic guidance for BioSample registration:
  • Register a separate BioSample for each unique source, e.g., RNA from the wings is a separate BioSample from RNA from the legs if those two sources were sequenced independently.
  • A genome assembly can have only one BioSample. For a genome assembled from reads of multiple BioSamples, register a new BioSample and indicate which other BioSamples were used to generate the assembly. For example, if the reads from a male and from a female were submitted to DRA separately but the reads were combined to assemble the genome, register a new BioSample for the male plus the female, providing the accessions of the male and the female BioSamples in the new BioSample registration. Example genome entry.
  • Endosymbionts: Because sequences are annotated by genome, one would need separate BioSamples for an insect and its endosymbiont. In the insect genome assembly submission, we recommend indicating that the endosymbiont’s BioSample is separate and references the insect BioSample.
Examples:
  • 23,000 unique 16S amplicons from a single seawater collection point - 1 BioSample (1 sample was collected and then analyzed to deduce 16S diversity)
  • 3 "identical" transgenic mice treated with the same drug as part of an experiment - 3 BioSamples (biological and technical replicates are represented by separate BioSamples)
  • To examine gene expression profiles, CHO cells infected with a virus and sampled at 0, 2, 4, and 8 hours post infection - 4 BioSamples (4 time points)
  • To analyze differences in gene expression levels, RNA-seq data from a single male anteater taken from the brain, heart, lungs, testes, and liver - 5 BioSamples (5 different tissues isolated)
Created: June 4, 2014; Last updated: December 13, 2016

How should I describe a pooled sample distinguished by barcode sequences in metadata?

Divide the sequence data files per sample and submit each file as a single BioSample-Experiment-Run set. If you need to describe the relationship between barcode sequences and samples, please describe it as free text in the Library Construction Protocol of the Experiment.

Created: January 23, 2014; Last updated: December 13, 2016

Since May 12, 2014, the DDBJ SRA has used BioProject instead of SRA Study. Please select the BioProject accession in the DRA submission system.

Created: January 23, 2014; Last updated: October 13, 2015

Is there a convenient way to make submissions containing many metadata objects?

When there are many Experiment and Run objects, they can be submitted as tab-delimited text files generated with a spreadsheet editor (for example, Excel). Please read the DRA Handbook.

Created: January 23, 2014; Last updated: December 13, 2016

Sequencing data*

How can I activate the "Validate data files" button?

When all sequencing data files listed in the Run metadata have been uploaded to the DRA server, the "Validate data files" button becomes clickable and users can start the validation process. If the button remains inactive after submitting metadata (status "metadata_submitted"), check for the following causes.
  • Not all data files listed in the Run metadata have been uploaded yet.
  • A file whose name contains spaces is not recognized.
  • An uploaded file placed inside a directory is not recognized.
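
The last two points can be checked locally before uploading. The following is a minimal sketch, not a DRA tool: the function name check_upload_name is hypothetical, and it tests only for whitespace and directory components in a file name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the DRA system).
# Returns 0 if the name looks acceptable, 1 if it contains whitespace
# or a directory component.
check_upload_name() {
  local name="$1"
  case "$name" in
    *[[:space:]]*)
      echo "whitespace in file name: $name" >&2
      return 1
      ;;
    */*)
      echo "file is inside a directory: $name" >&2
      return 1
      ;;
  esac
  return 0
}
```

For example, check_upload_name "my sample.fastq" fails, while check_upload_name sample_1.fastq.bz2 succeeds.
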
Created: October 5, 2015

How are my data files processed?

Uploaded data files are processed per Run. All files under a Run are merged into a single binary SRA file by using the SRA Toolkit. During this conversion, the length and format of all reads are checked.

Read names are edited: an identifier consisting of the DRR accession number and a serial number is automatically inserted. Original read names should be unique in a Run. The DRR accession number is used as the filename (example: DRR000001). If "generic_fastq" is selected for the filetype, read names are replaced with the DRR accession number and a serial number (example: DRR030615).

Example of read names:

@DRR000001.1 3060N:7:1:1116:340 length=36
GATGGTAAGATAGAAGCAGTTGAAGTTTACAAACCG
+DRR000001.1 3060N:7:1:1116:340 length=36
IIIII%IIIIIIIIII7IHII26:C6EI)+,9,%%*
@DRR000001.2 3060N:7:1:1114:186 length=36
GATATTGGCCTGCAGAAGTTCTTCCTGAAAGATGAT
+DRR000001.2 3060N:7:1:1114:186 length=36
IIIIIIIIIIIIIGI8IIDI6II;?:,+9+>.A1,I
@DRR000001.3 3060N:7:1:945:361 length=36
GTCAGGATCGGTCTCGCCTTTTAATAGAGGGAGATA
+DRR000001.3 3060N:7:1:945:361 length=36
IIIIIIIIIIIIIIII=3IIII>>I;-52/./+.I,

When "PAIRED" is selected in Experiment, paired reads are grouped in a Run.

The DRA generates fastq files from SRA files by using the SRA Toolkit and provides sequencing data in both file formats.

Up to three fastq files are provided for paired reads. Paired reads are divided into a file with "_1" (for example, DRR000001_1.fastq.bz2) and a file with "_2" (for example, DRR000001_2.fastq.bz2). Reads without a pair are provided in a file with neither "_1" nor "_2" (for example, DRR000001.fastq.bz2).

Created: December 25, 2014; Last updated: December 25, 2015

I cannot transfer my files with scp.

First, confirm the following basic points.

  • Authentication uses an SSH key, not a password.
  • The private key is the pair of the public key registered in your D-way submission account.
  • The private key file has read permission.
  • The passphrase for the private key is entered correctly.

When transferring data files with a private key generated on a different operating system, please check the format of the private key (see Convert private key).

In Unix/Mac OS X: Convert a key in the Windows PuTTY file format into the OpenSSH format.

In Windows WinSCP: Convert a key in the Unix/Mac OS X OpenSSH file format into the Windows PuTTY format.
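
To see which format a private key file is in, inspecting its first line is usually enough. The sketch below assumes that PuTTY keys begin with "PuTTY-User-Key-File-" and that OpenSSH/PEM keys begin with a "-----BEGIN ... PRIVATE KEY-----" line; the function name private_key_format is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess the format of a private key from its first line.
# Prints "putty", "openssh" (OpenSSH or PEM style) or "unknown".
private_key_format() {
  local first_line=""
  # read returns non-zero on a missing trailing newline; ignore that.
  IFS= read -r first_line < "$1" || true
  case "$first_line" in
    PuTTY-User-Key-File-*)
      echo "putty"
      ;;
    "-----BEGIN "*"PRIVATE KEY-----"*)
      echo "openssh"
      ;;
    *)
      echo "unknown"
      ;;
  esac
}
```

If the result is "putty" on Unix/Mac OS X, the key can typically be converted with the command-line puttygen tool, for example: puttygen key.ppk -O private-openssh -o id_rsa (assuming puttygen is installed; key.ppk and id_rsa are placeholder file names).
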

If all of these are correct, please ask your system administrators whether scp (port 22) is allowed.

Created: November 19, 2014; Last updated: February 12, 2015

What is an MD5 checksum and how do I compute it?

MD5 checksums are used by the DRA to verify the integrity of transmitted data. Please refer to the manual. An MD5 checksum is a 32-character hexadecimal string like the following.

bf4ac50dcd58bd2860dfac48c7fca348
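
A checksum can be computed with standard command-line tools. The following is a minimal sketch, assuming that either md5sum (GNU coreutils, typical on Linux) or md5 (Mac OS X/BSD) is installed; the function name file_md5 is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the MD5 checksum of a file as a
# 32-character hexadecimal string.
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    # md5sum prints "<checksum>  <file name>"; keep only the checksum.
    md5sum "$1" | awk '{ print $1 }'
  else
    # md5 -q prints only the checksum (Mac OS X/BSD).
    md5 -q "$1"
  fi
}
```

For example, file_md5 sample_1.fastq.bz2 prints the value to enter for that file in the Run metadata (sample_1.fastq.bz2 is a placeholder file name).
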

Created: June 4, 2014; Last updated: June 6, 2014

How do I deal with validation errors?

data excessive while validating formatter within short read archive module - cummulative length of reads data in file(s): 152 is greater than spot length declared in experiment: 76 in spot 'xxxx'

The Spot length value in the Experiment differs from the actual read length. For a paired library, enter the sum of the paired read lengths as the Spot length.
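
As a worked example, the message above reports 152 bases of read data against a declared spot length of 76, which would be consistent with a paired library of 76 + 76 bases whose Spot length was entered as 76 rather than 152. The sketch below derives the value from the files themselves. It assumes uncompressed four-line FASTQ records with constant read length, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: length of the first read (line 2) of a FASTQ file.
first_read_length() {
  awk 'NR == 2 { print length($0); exit }' "$1"
}

# Hypothetical helper: spot length of a paired library, i.e. the sum of
# the forward and reverse read lengths.
paired_spot_length() {
  local forward reverse
  forward=$(first_read_length "$1")
  reverse=$(first_read_length "$2")
  echo $(( forward + reverse ))
}
```

Usage: paired_spot_length forward.fastq reverse.fastq (placeholder file names).
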

fastq-load err: data inconsistent while validating formatter within short read archive module - cummulative length of reads data in file(s): 70 is less than spot length declared in experiment: 152, most probably mate-pair is absent in spot 'xxxx'

When 'fastq' is selected for the filetype in the Run, the read length must be constant and paired reads must appear in the same order in the paired files. If the fastq files do not meet these conditions, validation errors occur. In that case, change the filetype from 'fastq' to 'generic_fastq'.
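
The constant-length condition can be checked before choosing the filetype. A minimal sketch, assuming uncompressed four-line FASTQ records; the function name fastq_has_constant_length is hypothetical, and the check does not cover the ordering of paired reads.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) only if every read in a
# four-line-per-record FASTQ file has the same length.
fastq_has_constant_length() {
  awk 'NR % 4 == 2 {
         if (!seen) { len = length($0); seen = 1 }
         else if (length($0) != len) { bad = 1; exit }
       }
       END { exit bad + 0 }' "$1"
}
```

Usage: fastq_has_constant_length reads.fastq || echo "use generic_fastq" (reads.fastq is a placeholder file name).
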

constraint violated while executing function within virtual database module

Read names may not be unique within the Run.
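
Duplicated read names can be listed before re-submitting. A minimal sketch, assuming uncompressed four-line FASTQ records and taking the read name to be the first whitespace-delimited field of the header line (an assumption, not a documented DRA rule); the function name duplicate_read_names is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each read name that occurs more than once
# in a four-line-per-record FASTQ file (printed once per name).
duplicate_read_names() {
  awk 'NR % 4 == 1 {
         name = $1
         # seen[name]++ is 1 exactly on the second occurrence.
         if (seen[name]++ == 1) print substr(name, 2)
       }' "$1"
}
```

An empty output means that no duplicated names were found in that file.
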

path not found while accessing directory within file system module - no message text available

Files are not recognized. This error occurs in the following cases: "filename contains whitespace", "files are in sub-directories" and "fastq files are tar archived".

CheckSum Error

The md5 values in the Run differ from the actual md5 values. Check that the files are not corrupted and that the md5 values in the Run are correct.

Created: January 23, 2014; Last updated: February 2, 2015

Update*

How do I change the hold date?

Please login to the submission system and change the date. You can set the hold date for a maximum of 2 years, and this date may be brought forward or pushed back at any time.

We will send you an e-mail reminder 30 days before the scheduled release date, inviting you to postpone the release date as necessary.
Please see the video tutorial.

Created: January 23, 2014; Last updated: October 9, 2015

How do I add reference information?

DDBJ Sequence Database

See the relevant item in Data Updates/Corrections and contact us through this form with "Our paper was published" in [Subject].

DRA

Add publication information to the BioProject referenced by the relevant DRA submission. Contact the BioProject team to add the publication.

BioProject

Contact the BioProject team to add publication information. In general, citing the BioProject accession is not recommended.

BioSample

When sequencing data derived from relevant samples are deposited in DDBJ Sequence Database and DRA, please add publication information as described above.

For a publication about isolation and growth condition specifications of the organism/material, add the PubMed ID etc. to isol_growth_condt. For a primary genome report, please add the relevant PubMed ID etc. to ref_biomaterial.

If you want to add publications of other types, please contact the BioSample team.

Created: January 23, 2014; Last updated: September 5, 2016

Accession number*

Which accession numbers should be cited in publication?

A DRA submission is composed of the following objects, each with a unique prefix (see the Prefix Letter List).

  • Submission : DRA
  • BioProject (Study) : PRJD
  • Experiment : DRX
  • BioSample (Sample) : SAMD
  • Run : DRR
  • Analysis : DRZ
Metadata objects

Please cite the accession number(s) of the objects you want to refer to in your publication.

In general, do not cite the BioProject accession number.

Created: April 2, 2015; Last updated: October 13, 2015

I have not received accession numbers yet - is something wrong?

Please login to the submission system and check the status of your submission.

  • If the status is "metadata_submitted", you need to validate your data files by clicking the [Validate data files] button.
  • If the status is "data_error", please check the error messages of data validation, then modify the metadata and re-upload data files as necessary.
  • If the status is "data_validating", the DRA system is validating your data files. Validation of large files may take time.
  • The DRA team is reviewing the submissions.

Please contact the DRA team when necessary.

Created: January 23, 2014; Last updated: June 4, 2014

Downloading files*

How do I download files?

Download files from DDBJ ftp server at ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq.

wget

wget is a convenient way to download files over FTP.

wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/DRA000/DRA000001/DRX000001/DRR000001.fastq.bz2
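
The directory layout in the URL above appears to follow the pattern: first six characters of the submission accession, then the submission, Experiment and Run accessions. The sketch below builds such a URL under that assumption, which is inferred only from this single example and should be checked against the FTP site. The function name dra_fastq_url and the optional suffix argument ("_1" or "_2" for paired reads) are hypothetical, and the substring expansion requires bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the FTP URL of a fastq file from the
# submission (DRA), Experiment (DRX) and Run (DRR) accessions.
# Assumption: the top-level directory is the first six characters of the
# submission accession, as in DRA000/DRA000001/... above.
dra_fastq_url() {
  local submission="$1" experiment="$2" run="$3" suffix="${4:-}"
  local base="ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq"
  echo "${base}/${submission:0:6}/${submission}/${experiment}/${run}${suffix}.fastq.bz2"
}
```

Usage: wget "$(dra_fastq_url DRA000001 DRX000001 DRR000001)"
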

ascp

The Aspera ascp command line client can be downloaded here. Please select the correct operating system. The ascp command line client is distributed as part of the Aspera connect high-performance transfer browser plug-in.

Your command should look similar to this:

ascp -i <aspera connect SSH key> <option> -P 33001 anonftp@ascp.ddbj.nig.ac.jp:<file or files to download> <download location>

Example:

ascp -i <aspera connect SSH key> -QT -l 300m -P 33001 anonftp@ascp.ddbj.nig.ac.jp:/ddbj_database/dra/fastq/DRA000/DRA000001/DRX000001/DRR000001.fastq.bz2 .

Created: January 23, 2014; Last updated: June 4, 2014

Why is the number of reads in the fastq file less than that in the SRA file?

The DRA generates fastq files from the raw data SRA files by using fastq-dump in the NCBI SRA Toolkit with the following options.

fastq-dump -M 25 -E --skip-technical --split-3 -W <SRA file>

  • -M 25: Minimum read length to output is 25 (default is 25)
  • -E: No sequences starting or ending with >= 10N
  • --skip-technical: Dump only biological reads
  • --split-3: Legacy 3-file splitting for mate-pairs: first and second biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq, respectively. If only one biological read is present, it is placed in *.fastq.
  • -W: Apply left and right clips

Because reads are filtered and trimmed according to the above dumping conditions, the number of reads in the fastq file is generally less than that in the SRA file. Users can generate unfiltered and untrimmed fastq files by using the following fastq-dump options.

fastq-dump -M 1 --split-3 <SRA file>
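
To compare read counts yourself, the records in a fastq file can be counted. A minimal sketch, assuming four-line FASTQ records; count_fastq_reads is a hypothetical helper that reads uncompressed FASTQ text from standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count reads in FASTQ text read from standard input,
# assuming exactly four lines per record.
count_fastq_reads() {
  awk 'END { print int(NR / 4) }'
}
```

Usage: bzcat DRR000001.fastq.bz2 | count_fastq_reads
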

Created: January 23, 2014

Data transfer*

How do I transfer data files from the NIG supercomputer to my DRA directory?

If the private key was generated on Unix/Mac OS X

Transfer your private key to the NIG supercomputer (Linux). Next, transfer the files by executing the following command.

scp <Your Files> <D-way Login ID>@dradata.ddbj.nig.ac.jp:~/<Submission ID>

  • <Your Files> Files to be transferred.
    Ex: file1 file2 (file1 and file2), file* (all files whose filenames start with “file”)
  • <D-way Login ID> D-way Login ID (ex. drauser)
  • <Submission ID> Submission ID (ex. drauser-0003)

If the private key was generated on Windows PC

After the conversion of the key into the OpenSSH format used in Linux, transfer the private key to the supercomputer. Then, specify the private key using -i option of scp.

scp -i <Private Key> <Your Files> <D-way Login ID>@dradata.ddbj.nig.ac.jp:~/<Submission ID>

  • <Private Key> The private key file path (ex. /home/mishima/id.rsa) 
Created: December 12, 2014; Last updated: January 20, 2015

Data release*

How are linked BioProject/BioSample/sequence data released?

Linked BioProject, BioSample, DDBJ and DRA data are released as follows.

  • Release of a BioProject record DOES NOT trigger release of the other linked data.
  • Release of a BioSample record DOES NOT trigger release of the other linked data; however, it DOES trigger release of the referencing BioProject.
  • Release of DDBJ and DRA nucleotide sequence data DOES trigger release of the linked BioProject and BioSample records.

All metadata and sequencing data in a DRA submission are released at once.

Release of linked BioProject/BioSample/sequence records

DRA Handbook: Release of DRA
BioProject Handbook: Release of BioProject
BioSample Handbook: Release of BioSample

Created: December 15, 2014; Last updated: February 16, 2017

Contact form*

How can I contact you when the form is not available?

When the contact form to BioProject/BioSample/DRA/D-way/JGA is not available, please send an e-mail with the following information.

*Required

Name *
E-mail address *
Title *
D-way account
Accession number/Submission ID
Message *

Contacts (click the service name to send an e-mail)





Created: August 25, 2017