Trace Archive

Created: March 19, 2014; Last Updated: May 14, 2014

    Trace Archive overview

    DDBJ Trace Archive (DTA) is a permanent repository of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects. DTA is a member of the International Nucleotide Sequence Database Collaboration (INSDC) and collects the data in a collaboration with NCBI and EBI. NCBI Trace Archive issues and manages IDs.

    Released data can be searched and retrieved at the NCBI Trace Archive.

    Metadata

    There are fields that are required for specific combinations of STRATEGY and TRACE_TYPE_CODE. You may check requirements in the Validation Table. Metadata can be searched at the NCBI Trace Archive.

    Trace Archive RFC

    Required*
    May be required, depending upon the trace type and strategy employed*

    Metadata Field List

    ACCESSION
    DDBJ/EMBL/Genbank accession number. Type: varchar(30) Example: AC22227 The ACCESSION is assigned upon deposition to a public repository (DDBJ/EMBL/Genbank). This field will not be applicable to all trace types (primarily WGS). However, if this field contains a valid accession identifier, correlation between the primary sequence data (in Trace) and the secondary sequence data (in the public repository) is facilitated.
    AMPLIFICATION_FORWARD*
    The forward amplification primer sequence. Type: varchar(100) Example: GGATTCTGACTAACGAGC The field allows submitters to define the primers used to amplify templates for sequencing. This field is required when TRACE_TYPE_CODE=PCR or RT-PCR.
    AMPLIFICATION_REVERSE*
    The reverse amplification primer sequence. Type: varchar(100) Example: GGATTCTGACTAACGAGC The field allows submitters to define the primers used to amplify templates for sequencing. This field is required when TRACE_TYPE_CODE=PCR or RT-PCR.
    AMPLIFICATION_SIZE
    The expected amplification size for a pair of primers. Type: int Example: 500 The field allows submitters to define the expected amplification size for a pair of primers (defined in the AMPLIFICATION_FORWARD and AMPLIFICATION_REVERSE fields). This number should be given in base pairs. If TRACE_TYPE_CODE=PCR, the amplification size is based on amplification of genomic DNA. If TRACE_TYPE_CODE=RT-PCR, the amplification size is based on amplification of transcript.
    ANONYMIZED_ID
    Anonymous ID for an individual. Type: varchar(100) Example: 2222anonym Used in projects to maintain the anonymity of donors. In many cases, there may be a controlled access database that can map many anonymized_ids in the trace archive to a single individual id for which phenotypic information may be available.
    ATTEMPT
    Number of times the sequencing project has been attempted by the center and/or submitted to the Trace Archive. Type: tinyint(1-255) Example: 2
    BASE_FILE
    File name with base calls. Type: varchar(200) Example: ./mytraces/123clone.fasta Trace files which do not include the base calls must provide this information in a separate file. The file designations are recorded in the BASE_FILE field of the metadata file. If base calls are provided in separate files, the information in these files will overwrite any information in the trace (usually *.scf) file. If the base calls that would be provided in the BASE_FILE are the same as the information in the trace file, DO NOT PROVIDE THE FILE. If the center provides the BASE_FILE and QUAL_FILE, then the peak index information should also be provided in a PEAK_FILE.
    CENTER_NAME*
    Name of the sequencing center. Type: varchar(50) Example: WUGSC Sequencing centers wishing to submit data must contact the DDBJ Trace Archive administrators to determine a center abbreviation. This abbreviation is used in the CENTER_NAME field. This field has a controlled vocabulary. For the complete list of submitting centers see: http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?view=submitting_centers

    These center names are controlled separately from those of the Sequence Read Archive

    CENTER_PROJECT*
    Center defined project name. Type: varchar(100) Example: HBBB The CENTER_PROJECT reflects a sequencing center's internal designation for a specific sequencing project. This field can be useful for grouping related traces.
    CHEMISTRY
    Description of the chemistry used in the sequencing reaction. Type: varchar(50) Example: BIGDYEV3.0
    CHEMISTRY_TYPE
    Type of chemistry used in the sequencing reaction. Type: char(50) Example: P The CHEMISTRY_TYPE field uses a controlled list.
    Accepted values are:
    • Primer (p=primer)
    • Terminator (t=terminator)
    CHROMOSOME
    Chromosome to which the trace is assigned. Type: varchar(8) Example: 11 The CHROMOSOME field indicates to which chromosome a trace has been assigned. Gene names or cytogenetic positions are not appropriate substitutes for chromosome information.
    CLIP_QUALITY_LEFT
    Left clip of the read, in base pairs, based on quality analysis. Type: int Example: 56 The field indicates the base at the beginning of the sequence at which the read should be clipped due to poor quality sequence. The given value would be the first base of the high quality region of the trace.
    CLIP_QUALITY_RIGHT
    Right clip of the read, in base pairs, based on quality analysis. Type: int Example: 256 The field indicates the base at the end of the sequence at which the read should be clipped due to poor quality sequence. The given value would be the last base of the high quality region of the trace.
    CLIP_VECTOR_LEFT*
    Left clip of the read, in base pairs, based on vector sequence. Type: int Example: 75 The field indicates the base at the beginning of the sequence at which the read should be clipped due to vector sequence. The given value would be the first base of non-vector sequence. This field is required for almost all combinations of STRATEGY and TRACE_TYPE_CODE. This information can be omitted if the INSERT_FLANK_LEFT field is populated or TRACE_TYPE_CODE is PCR or RT-PCR.
    CLIP_VECTOR_RIGHT*
    Right clip of the read, in base pairs, based on vector sequence. Type: int Example: 275 The field indicates the base at the end of the sequence at which the read should be clipped due to vector sequence. The given value would be the last base of non-vector sequence. This field is required for almost all combinations of STRATEGY and TRACE_TYPE_CODE. This information can be omitted if the INSERT_FLANK_RIGHT field is populated or TRACE_TYPE_CODE is PCR or RT-PCR. NOTE: Many centers combine vector and quality analysis, and thus have only one set of clip values. In this case, the set of values should be placed in the CLIP_VECTOR_LEFT/CLIP_VECTOR_RIGHT fields.
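The clip fields describe the first and last base to keep. As a minimal sketch of how a consumer might apply them, the hypothetical helper below trims a base-call string. It assumes the clip values are 1-based and inclusive, which this page states explicitly only for the reference coordinate fields, so treat that as an assumption; `clip_read` is an illustrative name and not part of any Trace Archive tool.

```python
from typing import Optional


def clip_read(bases: str, clip_left: Optional[int], clip_right: Optional[int]) -> str:
    """Return the part of the read between the clip points.

    clip_left is taken as the first base to keep and clip_right as the
    last base to keep, both 1-based (an assumption, see the lead-in).
    A value of None means "no clip on that side".
    """
    start = 0 if clip_left is None else clip_left - 1
    end = len(bases) if clip_right is None else clip_right
    if start < 0 or end > len(bases) or start >= end:
        raise ValueError("clip values are outside the read or inverted")
    return bases[start:end]


# Keep bases 3 through 6 of a ten-base read.
trimmed = clip_read("ACGTACGTAC", 3, 6)
```

The same helper can be applied to either the quality clips or the vector clips, since both pairs are described in the same terms.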
    CLONE_ID*
    The name of the clone from which the trace was derived. Type: varchar(30) Example: RP23-1123F10 The CLONE_ID field is used to store the identifier related to an individual clone, for example a BAC clone, PAC clone or cDNA clone. If the clone is registered with the Clone Registry (http://www.ncbi.nlm.nih.gov/clone/), standard clone registry nomenclature (http://www.ncbi.nlm.nih.gov/clone/content/overview/) should be used.
    This field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=cDNA; TRACE_TYPE_CODE=Any
    STRATEGY=EST; TRACE_TYPE_CODE=Any
    STRATEGY=CLONEEND; TRACE_TYPE_CODE=CLONEEND
    STRATEGY=CLONE; TRACE_TYPE_CODE=Any
    STRATEGY=ENCODE; TRACE_TYPE_CODE=SHOTGUN; PrimerWalk; CLONEEND
    STRATEGY=FINISHING; TRACE_TYPE_CODE=Any
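A conditional requirement of this kind can be checked with a small lookup keyed on STRATEGY. The sketch below encodes the CLONE_ID combinations as read from the list above; the table layout, the function name `clone_id_required`, and the case-insensitive comparison are all assumptions made for illustration, and the Validation Table remains the authoritative source.

```python
# Hypothetical helper, not part of any official Trace Archive tool.
# Each STRATEGY maps to the TRACE_TYPE_CODE values that make CLONE_ID
# mandatory; None stands for "Any".
CLONE_ID_RULES = {
    "CDNA": None,
    "EST": None,
    "CLONEEND": {"CLONEEND"},
    "CLONE": None,
    "ENCODE": {"SHOTGUN", "PRIMERWALK", "CLONEEND"},
    "FINISHING": None,
}


def clone_id_required(strategy: str, trace_type_code: str) -> bool:
    """Return True if CLONE_ID is required for this combination."""
    key = strategy.strip().upper()
    if key not in CLONE_ID_RULES:
        return False
    allowed = CLONE_ID_RULES[key]
    # None means every TRACE_TYPE_CODE triggers the requirement.
    return allowed is None or trace_type_code.strip().upper() in allowed
```

The same table shape could hold the rules for other conditionally required fields, one dictionary per field.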
    CLONE_ID_LIST*
    Semicolon delimited list of clones if the STRATEGY is PoolClone. Type: varchar(30) Example: RP23-200A2;RP23-500P1 The field is used only if STRATEGY=PoolClone. In this case, the list of clones is provided as a semicolon delimited list. If the clones are registered with the Clone Registry (http://www.ncbi.nlm.nih.gov/clone/), standard clone registry nomenclature (http://www.ncbi.nlm.nih.gov/clone/content/overview/) should be used (see the CLONE_ID field). Note: The number of clones in the list is not limited, but each individual clone name within the list is limited to 30 bytes.
    This field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=PoolClone; TRACE_TYPE_CODE=Any
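A short sketch of how a submitter-side script might split and check such a list before submission. The function name `parse_clone_id_list` is hypothetical, and measuring the 30-byte limit on the UTF-8 encoding is an assumption, since the page does not name an encoding.

```python
def parse_clone_id_list(value: str, max_bytes: int = 30) -> list:
    """Split a semicolon delimited clone list and enforce the per-clone limit.

    Empty items (for example from a trailing ';') are ignored.
    """
    clones = [item.strip() for item in value.split(";") if item.strip()]
    for clone in clones:
        # Assumption: the 30-byte limit is measured on UTF-8 bytes.
        if len(clone.encode("utf-8")) > max_bytes:
            raise ValueError(f"clone name longer than {max_bytes} bytes: {clone!r}")
    return clones


clones = parse_clone_id_list("RP23-200A2;RP23-500P1")
```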
    COLLECTION_DATE*
    The full date, in "Mar 2 2006 12:00AM" format, on which an environmental sample was collected. Type: datetime Example: Mar 2 2006 12:00AM The field is used to define the date and time on which an environmental sample was collected.
    This field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Env Sample-Geo; TRACE_TYPE_CODE=Any
    STRATEGY=Env Sample-Host; TRACE_TYPE_CODE=Any
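The "Mar 2 2006 12:00AM" layout maps onto standard `strptime` directives. The following is a minimal sketch for checking a COLLECTION_DATE value before submission; the constant and function names are illustrative, and month and AM/PM parsing depend on the process locale, so an English or C locale is assumed.

```python
from datetime import datetime

# Abbreviated month, day of month, four-digit year, 12-hour clock with AM/PM.
COLLECTION_DATE_FORMAT = "%b %d %Y %I:%M%p"


def parse_collection_date(value: str) -> datetime:
    """Parse a COLLECTION_DATE string; raises ValueError if it does not match."""
    return datetime.strptime(value.strip(), COLLECTION_DATE_FORMAT)


collected = parse_collection_date("Mar 2 2006 12:00AM")
```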
    CVECTOR_ACCESSION
    Repository (DDBJ/EMBL/Genbank) accession identifier for the cloning vector. Type: varchar(50) Example: AY451994 The field holds the accession number for the cloning vector used. This cloning vector relates to the clone named in the CLONE_ID field.
    CVECTOR_CODE
    Center defined code for the cloning vector. Type: varchar(50) Example: PBACE3.6 The field holds the user defined identifier for the cloning vector. Submitters are encouraged to submit all vector sequence information to public repositories.
    DEPTH
    Depth (in meters) at which an environmental sample was collected. Type: float Example: 10M The field is applicable to water samples and earth samples. If the value of this field is NULL, it is anticipated the sample was taken from the surface of the environment. While this field is only applicable to environmental samples, it is not required.
    ELEVATION
    Elevation (in meters) at which an environmental sample was collected. Type: float Example: 500 If the value of this field is NULL it is assumed the data were obtained at sea level. The field is only applicable to some environmental sample data, but is not a required field.
    ENVIRONMENT_TYPE*
    Type of environment from which an environmental sample was collected. Type: varchar(250) Example: sea water The field is used to describe the specific environment from which an environmental sample was taken. While the LATITUDE and LONGITUDE fields describe the location, many types of environment could exist at this location (for example, soil, sludge, tree roots, etc).
    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Env Sample-Geo; TRACE_TYPE_CODE=Any
    EXTENDED_DATA
    Extra ancillary information wrapped in an EXTENDED_DATA block, where actual values are provided with a special <field> tag. Type: varchar() Example:
    <extended_data>
        <field name='SamplingSiteMonthChlorophyllLevel'>1.4 mg_mm</field>
        <field name='SamplingSiteYearlyChlorophyllLevel'>1.12 mg_mm</field>
        <field name='SamplingSiteYearlyChlorophyllLevelStdError'>0.19 mg_mm</field>
    </extended_data>
    The '=' sign and the field separator character '|' should be excluded from names and their values. No other validity checks will be performed on the data.
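A block like the one above can be generated rather than typed by hand. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `build_extended_data` is hypothetical, and the only check applied is the stated exclusion of '=' and '|' from names and values.

```python
import xml.etree.ElementTree as ET

FORBIDDEN_CHARS = ("=", "|")


def build_extended_data(fields: dict) -> str:
    """Serialize name/value pairs as an <extended_data> block."""
    root = ET.Element("extended_data")
    for name, value in fields.items():
        for text in (name, value):
            # The page excludes '=' and '|' from names and their values.
            if any(ch in text for ch in FORBIDDEN_CHARS):
                raise ValueError(f"'=' and '|' are not allowed: {text!r}")
        child = ET.SubElement(root, "field", {"name": name})
        child.text = value
    return ET.tostring(root, encoding="unicode")


block = build_extended_data({"SamplingSiteMonthChlorophyllLevel": "1.4 mg_mm"})
```

Note that ElementTree writes attribute values with double quotes, whereas the example above uses single quotes; both are equivalent XML.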
    FEATURE_ID_FILE
    File describing the features and their locations on a chip. Type: varchar(200) Example: ./mytraces/chip2.cdf The FEATURE_ID_FILE provides the location and sequence of the features for a given chip when TRACE_TYPE_CODE="CHIP".
    FEATURE_ID_FILE_NAME*
    Reference to a common FEATURE_ID_FILE which should be submitted first. Type: varchar(200) This field is required when TRACE_TYPE_CODE="CHIP".
    FEATURE_SIGNAL_FILE
    File giving the signal and variance for features on a chip. Type: varchar(200) Example: ./mytraces/chip2.signal The FEATURE_SIGNAL_FILE provides the signal and variance of signal for the features on a given chip when TRACE_TYPE_CODE="CHIP".
    FEATURE_SIGNAL_FILE_NAME*
    Reference to a common FEATURE_SIGNAL_FILE which should be submitted first. Type: varchar(200) This field is required when TRACE_TYPE_CODE="CHIP".
    GENE_NAME
    Gene name or some other common identifier. Type: varchar(100) Example: transporter 1 Free text. This field is mainly intended for STRATEGY='Re-sequencing' or 'ENCODE'. When a group is analyzing a particular gene, they may want to refer to that gene by its name or some other common identifier.
    HI_FILTER_SIZE
    The largest filter used to stratify an environmental sample. Type: varchar(50) Example: 50 micron The field is applicable only to environmental sample data but is not a required field.
    HOST_CONDITION
    The condition of the host from which an environmental sample was obtained. Type: varchar(100) Example: HIV-positive The field is only applicable to environmental sample data and is used to describe the condition (healthy, sick, etc) of the host from which a sample was taken.
    HOST_ID*
    Unique identifier for the specific host from which an environmental sample was taken. Type: varchar(100) Example: yerkes pedigree #C0479 'Clint' The field is only applicable to environmental sample data and is used to capture the unique name for the specific host from which a sample was obtained.
    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Env Sample-Host; TRACE_TYPE_CODE=Any
    HOST_LOCATION*
    Specific location on the host from which an environmental sample was collected. Type: varchar(100) Example: rumen The field is only applicable to environmental sample data and is used to describe the specific part of the host from which the sample was obtained, for example: dental plaque, hindgut, root surfaces.
    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Env Sample-Host; TRACE_TYPE_CODE=Any
    HOST_SPECIES*
    The host from which an environmental sample was obtained. Type: varchar(100) Example: Pan troglodytes The field is only applicable to environmental sample data.
    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Env Sample-Host; TRACE_TYPE_CODE=Any
    INDIVIDUAL_ID
    Publicly available identifier to denote a specific individual or sample from which a trace was derived. Type: varchar(100) Example: NA12345 The field provides a center specific unique id that can associate a specific trace to an individual. This will be used primarily for population based studies.
    INSERT_FLANK_LEFT*
    Flanking sequence at the cloning junction. Type: varchar(100) Example: AAGGTGCGATGCAGTGGCAGTAGCAGTGTCGACGTGACGATTCGTCCGGA The field should provide from 50 up to 100 bases of sequence (including linkers) to the left of the cloning junction. This information will allow users to perform their own vector trimming of reads. This field is required for almost all combinations of STRATEGY and TRACE_TYPE_CODE. This field can be omitted if CLIP_VECTOR_LEFT is populated. However, INSERT_FLANK_LEFT is the preferred choice. If there was no cloning step involved in the sequencing, please populate the field with 'NONE'.
    INSERT_FLANK_RIGHT*
    Flanking sequence at the cloning junction. Type: varchar(100) Example: AAGGCGCGATGCAGTGAGCGAGGCTGACGTCGGCTAGCGTCGCGTCGGGT The field should provide from 50 up to 100 bases of sequence (including linkers) to the right of the cloning junction. This information will allow users to perform their own vector trimming of reads. This field is required for almost all combinations of STRATEGY and TRACE_TYPE_CODE. This field can be omitted if CLIP_VECTOR_RIGHT is populated. However, INSERT_FLANK_RIGHT is the preferred choice. If there was no cloning step involved in the sequencing, please populate the field with 'NONE'. It is anticipated that if INSERT_FLANK_LEFT is populated, INSERT_FLANK_RIGHT will also be populated. It is not anticipated that a mixture of clip values and junction sequence will be specified (i.e. CLIP_VECTOR_LEFT and INSERT_FLANK_RIGHT populated for the same record).
    INSERT_SIZE*
    Expected size of the insert (referred to by the value in the TEMPLATE_ID field) in base pairs Type: int Example: 2000 The field indicates the expected insert size of the clone that is sequenced. It is understood that this is an estimate based upon the average insert sizes found in a given library. However, this information is critical for certain experiments, such as whole genome assembly.
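    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Any; TRACE_TYPE_CODE=WGS
    STRATEGY=Any; TRACE_TYPE_CODE=WCS
    STRATEGY=cDNA; TRACE_TYPE_CODE=CLONEEND
    STRATEGY=CLONEEND; TRACE_TYPE_CODE=CLONEEND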
    INSERT_STDEV*
    Approximate standard deviation of value in INSERT_SIZE field. Type: int Example: 200 The field reflects the approximate standard deviation of the insert size. It is understood that this information is an approximation and may change as better data is obtained.
    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Any; TRACE_TYPE_CODE=WGS
    STRATEGY=Any; TRACE_TYPE_CODE=WCS
    STRATEGY=cDNA; TRACE_TYPE_CODE=CLONEEND
    STRATEGY=CLONEEND; TRACE_TYPE_CODE=CLONEEND
    LATITUDE*
    The latitude (using standard GPS notation) of the site from which a sample was collected. Type: float Example: 54.736 The field is required to describe the collection of some environmental sample data. The latitude range is [-90,90], with the equator at 0; positive values are north of the equator.
    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Env Sample-Geo; TRACE_TYPE_CODE=Any
    LIBRARY_ID*
    The source of the clone identified in the CLONE_ID field Type: varchar(100) Example: RP23 The field documents the source library of the archival clone resource. Many genomic libraries have been registered with the Clone Registry (http://www.ncbi.nlm.nih.gov/clone) and the standard nomenclature (http://www.ncbi.nlm.nih.gov/clone/content/overview/) should be used for these libraries.
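    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=cDNA; TRACE_TYPE_CODE=Any
    STRATEGY=EST; TRACE_TYPE_CODE=Any
    STRATEGY=CLONEEND; TRACE_TYPE_CODE=CLONEEND
    STRATEGY=CLONE; TRACE_TYPE_CODE=Any
    STRATEGY=ENCODE; TRACE_TYPE_CODE=SHOTGUN; PrimerWalk; CLONEEND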
    LONGITUDE*
    The longitude (using standard GPS notation) of the site from which a sample was collected. Type: float Example: -86.403 The field is required to describe the collection of some environmental sample data. Longitude ranges from 0° at the Prime Meridian to +180° eastward and -180° westward.
    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Env Sample-Geo; TRACE_TYPE_CODE=Any
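The stated ranges for LATITUDE and LONGITUDE are easy to check before submission. A minimal sketch follows; `valid_coordinates` is a hypothetical helper name, and the bounds are treated as inclusive.

```python
def valid_coordinates(latitude: float, longitude: float) -> bool:
    """Check decimal-degree coordinates against the documented ranges.

    Latitude: -90 (south) to +90 (north), equator at 0.
    Longitude: -180 (west) to +180 (east), Prime Meridian at 0.
    """
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0


# The example values given for the two fields.
ok = valid_coordinates(54.736, -86.403)
```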
    LO_FILTER_SIZE
    The smallest filter size used to stratify an environmental sample. Type: varchar(50) Example: 25 micron The field is only applicable to environmental sample data but is not a required field.
    NCBI_PROJECT_ID
    BioProject ID generated by the INSDC. Type: int Example: 7 The NCBI_PROJECT_ID field links traces to the BioProject database, making it easy to retrieve sets of traces for each project. Genome sequencing centers may register their project with DDBJ BioProject prior to the submission of genomic sequence data. Submitters need not submit sequencing data at the time they register their project.
    ORGANISM_NAME*
    Description of species for BARCODE project from which trace is derived. Type: varchar(100) Example: Acanthocybium solandri The field is used to classify the read by species for BARCODE data, using the proper taxonomic name in accordance with the Taxonomy Browser. SPECIES_CODE="BARCODESPECIES" for all traces from this project. This field would be required for STRATEGY=BARCODE.
    PEAK_FILE
    Name of file that contains the list of peak values. Type: varchar(200) Example: ./mytraces/123clone.peak Consult the BASE_FILE field description for more information.
    PH
    The pH at which an environmental sample was collected. Type: float Example: 7.2 The field is only applicable to environmental sample data but is not a required field.
    PICK_GROUP_ID
    Id to group traces picked at the same time. Type: int Example: 939065
    PLACE_NAME
    Country in which the biological sample was collected and/or common name for a given location. Type: varchar(250) Example: Octopus Springs The field is applicable to environmental sample data, but is not required.
    PLATE_ID
    Submitter defined plate id. Type: varchar(32) Example: 203 The PLATE_ID and WELL_ID fields are intended to identify the storage location of the sequencing template (not the library well coordinate of an archival clone named in the CLONE_ID field). This may enable flipped or contaminated trays to be easily identified. If a particular experiment did not require the use of a plate, please populate this field with '0'.
    POPULATION_ID
    Center provided id to designate a population from which a trace (or group of traces) was derived. Type: varchar(100) Example: CEPH The field is used to capture center specific designations of groups of individuals. This will likely only be useful in population studies (usually STRATEGY=SNP).
    PREP_GROUP_ID
    ID that defines groups of traces prepared at the same time. Type: varchar(30) Example: A2
    PRIMER
    The primer sequence (used in the sequencing reaction). Type: varchar(200) Example: GAATACCTACGATCGCC The value of the field is the actual base sequence of the sequencing primer used. If a center uses a primer extensively, the primer sequence can be entered into the list of primer codes and the PRIMER_CODE field can be used instead.
    PRIMER_CODE
    Identifier for the sequencing primer used. Type: varchar(30) Example: Sp6
    PRIMER_LIST*
    A ';' delimited list of primers used in a mapping experiment (such as AFLP). Type: varchar(100) Example: AAGGTCTGCGCGTGTC;AGCTGCGTACGTAATCG; This field is required if STRATEGY="AFLP" and TRACE_TYPE_CODE="PCR".
    PROGRAM_ID*
    The program used to create the trace file. Type: varchar(100) Example: phred-19990722h The field is used to indicate the base calling program. This field is free text. Program name, version numbers or dates are very useful.
    More example values:
    • phred-19980904e
    • abi-3.1
    • ATQA
    • TraceTuner
    • Licor
    • Megabase
    • Beckman
    PROJECT_NAME
    Term by which to group traces from different centers based on a common project. Type: varchar(50) Example: New Project In this way sequencing centers that are working on the same large project can group all of the traces for this project using a common term. This field has a controlled vocabulary. Sequencing centers wishing to submit data must contact the DDBJ Trace Archive to determine a name that all members of the project agree on.
    QUAL_FILE
    Name of file containing the quality scores. Type: varchar(200) Example: ./mytraces/123clone.fasta.qs Trace files which do not include the quality scores must provide this information in a separate file. The file designations are recorded in the QUAL_FILE field of the metadata file. The actual quality scores are stored in the file designated in the QUAL_FILE field. If quality scores are provided in separate files, the information in these files will overwrite any information in the trace (usually *.scf) file. If the quality scores that would be provided in the QUAL_FILE are the same as the information in the trace file, DO NOT PROVIDE THE FILE. However, some formats do not include the quality scores; in that case these values must be provided as ancillary information. If the center provides the BASE_FILE and QUAL_FILE, then the peak index information should also be provided in a PEAK_FILE.
    REFERENCE_ACCESSION*
    Reference accession (use accession and version to specify a particular instance of a sequence) used as the basis for a re-sequencing project. For the Comparative strategy, it identifies the sequence used as the basis for primer design. Type: varchar(50) Example: NT_029829.1
    This field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Re-sequencing; Comparative; TRACE_TYPE_CODE=Any
    REFERENCE_ACC_MAX*
    Finish position for a particular amplicon in re-sequencing or comparative projects. Type: int Example: 30929 This field points to the finishing coordinate of the amplicon on the sequence described in the REFERENCE_ACCESSION field. All coordinates should be 1-based (i.e. sequences start at base 1, not base 0).
    This field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Re-sequencing; TRACE_TYPE_CODE=SHOTGUN; PCR; RT-PCR
    REFERENCE_ACC_MIN*
    Start position for a particular amplicon in re-sequencing or comparative projects. Type: int Example: 29829 This field points to the starting coordinate of the amplicon on the sequence described in the REFERENCE_ACCESSION field. All coordinates should be 1-based (i.e. sequences start at base 1, not base 0).
    This field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Re-sequencing; TRACE_TYPE_CODE=SHOTGUN; PCR; RT-PCR
    REFERENCE_OFFSET*
    Sequence offset of accession specified in REFERENCE_ACCESSION field to define the coordinate start position used as the basis for a re-sequencing project. Type: int Example: 1520899 This field points to the starting coordinate on the sequence described in the REFERENCE_ACCESSION field. All coordinates should be 1-based (i.e. sequences start at base 1, not base 0).
    This field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Re-sequencing; TRACE_TYPE_CODE=CHIP
    REFERENCE_SET_MAX
    Finish position for an entire re-sequencing region. This region may include several amplicons. Type: int Example: 29829 This field points to the finishing coordinate, on the sequence described in the REFERENCE_ACCESSION field, of an entire re-sequencing region. All coordinates should be 1-based (i.e. sequences start at base 1, not base 0). The REFERENCE_ACC_[MIN|MAX] and REFERENCE_SET_[MIN|MAX] should refer to the same REFERENCE_ACC.
    REFERENCE_SET_MIN
    Start position for an entire re-sequencing region. This region may include several amplicons. Type: int Example: 29829 This field points to the starting coordinate, on the sequence described in the REFERENCE_ACCESSION field, of an entire re-sequencing region. All coordinates should be 1-based (i.e. sequences start at base 1, not base 0). The REFERENCE_ACC_[MIN|MAX] and REFERENCE_SET_[MIN|MAX] should refer to the same REFERENCE_ACC.
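Because the reference coordinates are 1-based, they need converting before being used with 0-based tools. The sketch below shows the conversion to a Python slice and a containment check between an amplicon and its surrounding region. It assumes both ends are inclusive, which the page does not state outright; the function names are illustrative only.

```python
def amplicon_slice(acc_min: int, acc_max: int) -> slice:
    """Convert 1-based inclusive coordinates to a 0-based Python slice."""
    if acc_min < 1 or acc_max < acc_min:
        raise ValueError("coordinates must be 1-based with min <= max")
    return slice(acc_min - 1, acc_max)


def amplicon_within_region(acc_min: int, acc_max: int,
                           set_min: int, set_max: int) -> bool:
    """True if the amplicon (REFERENCE_ACC_MIN/MAX) lies inside the
    whole re-sequencing region (REFERENCE_SET_MIN/MAX)."""
    return set_min <= acc_min <= acc_max <= set_max


# Bases 2 through 4 of a toy reference sequence.
fragment = "ACGTACGT"[amplicon_slice(2, 4)]
```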
    RUN_DATE
    Date the sequencing reaction was run. Type: datetime Example: 2000-10-28
    RUN_GROUP_ID
    ID used to group traces run on the same machine. Type: varchar(30) Example: group2
    RUN_LANE
    Lane or capillary of the trace. Type: int Example: 1 The RUN_LANE field documents the specific lane or capillary on which a trace was obtained.
    RUN_MACHINE_ID
    ID of the specific sequencing machine on which a trace was obtained. Type: varchar(30) Example: machine2
    RUN_MACHINE_TYPE
    Type or model of machine on which a trace was obtained. Type: varchar(30) Example: ABI 310
    SALINITY
    The salinity at which an environmental sample was collected measured in parts per thousand units (promille). Type: float Example: 20 The field is only applicable to environmental sample data but is not a required field.
    SEQ_LIB_ID*
    Center specified M13/PUC library that is actually sequenced. Type: varchar(255) Example: 22194 The field is the center identifier for the M13/PUC based clone that is actually sequenced. This will allow grouping of traces by the actual ligation event and is applicable to most projects. This value will be unique within a given center.
    This field would be required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Any; TRACE_TYPE_CODE=SHOTGUN
    STRATEGY=Any; TRACE_TYPE_CODE=WGS/WCS
    SOURCE_TYPE*
    Source of the DNA. Type: varchar(50) Example: GENOMIC DNA The field consists of a code. Possible values are:
    • G=Genomic DNA (includes PCR products from genomic DNA)
    • N=Non Genomic DNA (EST, cDNA, RT-PCR, screened libraries)
    • VIRAL RNA=Viral RNA
    • SYNTHETIC=Synthetic DNA
    Accepted values are G, N, GENOMIC, NON GENOMIC, VIRAL RNA, SYNTHETIC
    SPECIES_CODE*
    Description of species from which trace is derived. Type: varchar(100) Example: Homo sapiens The field is used to classify the read by species, using proper taxonomic names where possible. This field currently is maintained as a controlled vocabulary. For a list of species currently contained within the Trace Archive, see: http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=stat&f=xml_list_species&m=obtain&s=species To submit a new species, please contact the DDBJ Trace Archive prior to submission. For cases in which the taxonomic origin of a specific trace is unclear, the taxonomic classification 'ENVIRONMENTAL SEQUENCE' can be used for environmental samples or 'ARTIFICIAL SEQUENCE' for artificial material.
    STRAIN*
    Strain from which a trace is derived. Type: varchar(50) Example: C57BL/6J STRAIN is required for STRATEGY="SNP".
    STRATEGY*
    Experimental STRATEGY. Type: varchar(50) Example: MODEL VERIFY Experimental STRATEGY used when obtaining the trace. It is proposed that this would be a controlled vocabulary, but that submitters would contribute to this list as needed to define various experiments and projects.

    • AFLP: Amplified Fragment Length Polymorphism
    • BARCODE: DNA sequence analysis of a uniform target gene to enable species identification
    • CCS: Concatenated cDNA sequencing
    • cDNA: Sequences generated in the process of sequencing cDNA clones
    • CF-S: Cot-filtered single/low-copy genomic DNA
    • CF-M: Cot-filtered moderately repetitive genomic DNA
    • CF-H: Cot-filtered highly repetitive genomic DNA
    • CF-T: Cot-filtered theoretical single-copy DNA
    • CLONE: Genomic clone based (hierarchical) sequencing
    • CLONEEND: Sequences generated from the end of a clone (BAC/PAC/Fosmid or cDNA)
    • Comparative: Sequences obtained using primers designed from related species
    • CTS: Concatenated Tag Sequencing
    • Env Sample-GEO: Geographically generated environmental sample
    • Env Sample-Host: Environmental samples collected from a specific host
    • EST: single pass sequencing of cDNA templates
    • FINISHING: a read specifically made for finishing, could be either BAC finishing or Whole Genome Assembly (WGA) finishing
    • MODEL VERIFY: Sequences obtained to verify proposed gene models
    • PoolClone: Pools of clones (BACs mostly)
    • SNP: Reads used for SNP identification
    • TARGETED LOCUS: Sequences obtained from templates generated by primers designed to amplify a specific genetic locus
    • Re-sequencing: Re-sequencing of targeted genomic regions
    • RT-PCR: Sequences obtained using templates generated by Reverse Transcriptase Polymerase Chain Reaction
    • WGA: Whole Genome Assembly
    SUBMISSION_TYPE*
    Type of submission. Type: varchar(50) Example: NEW The field allows the following values:
    • NEW: use to submit new data
    • UPDATE: use to renew traces and their ancillary information. Previous data will be saved with their TI's; new traces with the same trace_name's will receive new TI's and they will become active
    • UPDATEINFO: use to update or add ancillary information for already existing traces without re-submitting the entire package of data
    • WITHDRAW: use to withdraw traces
    SVECTOR_ACCESSION
    DDBJ/EMBL/Genbank accession of the sequencing vector. Type: varchar(50) Example: X52325
    SVECTOR_CODE
    Center defined code for the sequencing vector. Type: varchar(50) Example: pBluescript SK(+)
    TEMPERATURE
    The temperature (in °C) at which an environmental sample was collected. Type: float Example: 30 The field is only applicable to environmental sample data but it is not a required field.
    TEMPLATE_ID
    Submitter defined identifier for the sequencing template. Type: varchar(50) Example: HBBBA2211 The field is used to uniquely identify the actual template that is sequenced. This field, in conjunction with the TRACE_END field, can be used to identify traces that should be marked as 'mate_pairs' because they come from opposite ends of the same clone.
    TRACE_END
    Defines the end of the template contained in the read. Type: varchar(50) Example: F The field can have the following values:
    • F: FORWARD
    • R: REVERSE
    • N: UNKNOWN
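The mate-pair idea described under TEMPLATE_ID can be sketched as a grouping step: traces that share a TEMPLATE_ID and have opposite TRACE_END values are candidates for a pair. The function name `find_mate_pairs`, the record layout (plain dictionaries keyed by field name), and the rule of pairing only templates with exactly one forward and one reverse trace are all assumptions made for this illustration.

```python
from collections import defaultdict


def find_mate_pairs(records):
    """Return (TEMPLATE_ID, forward TRACE_NAME, reverse TRACE_NAME) tuples."""
    by_template = defaultdict(lambda: {"F": [], "R": []})
    for rec in records:
        # Accept both the one-letter code and the spelled-out value.
        end = rec.get("TRACE_END", "N").strip().upper()[:1]
        if end in ("F", "R"):
            by_template[rec["TEMPLATE_ID"]][end].append(rec["TRACE_NAME"])
    pairs = []
    for template_id, ends in by_template.items():
        # Only unambiguous templates are paired in this sketch.
        if len(ends["F"]) == 1 and len(ends["R"]) == 1:
            pairs.append((template_id, ends["F"][0], ends["R"][0]))
    return pairs


example_records = [
    {"TRACE_NAME": "t1", "TEMPLATE_ID": "HBBBA2211", "TRACE_END": "F"},
    {"TRACE_NAME": "t2", "TEMPLATE_ID": "HBBBA2211", "TRACE_END": "R"},
    {"TRACE_NAME": "t3", "TEMPLATE_ID": "HBBBA2212", "TRACE_END": "F"},
]
pairs = find_mate_pairs(example_records)
```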
    TRACE_FILE*
    Filename with the trace, relative to the top of the volume. Type: varchar(200) Example: ./traces/TRACE001.scf
    TRACE_FORMAT*
    Format of the trace file. Type: varchar(20) Example: scf The field can have the following values:
    • SCF - A standard file format for data from DNA sequencing instruments.
    • ABI - An ABI trace file is a binary file including the trace data and the sequence.
    TRACE_NAME*
    Center defined trace identifier. Type: varchar(250) Example: HBBBA1U2211 The field must be unique within a center, but is not required to be unique between centers. The combination of CENTER_NAME and TRACE_NAME acts as a unique key within the Trace Archive.
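Since a trace name only has to be unique within its center, a pre-submission duplicate check has to key on the center and the trace name together. The sketch below assumes records are plain dictionaries keyed by field name; `duplicate_keys` is a hypothetical helper.

```python
from collections import Counter


def duplicate_keys(records):
    """Return the (CENTER_NAME, TRACE_NAME) keys that occur more than once."""
    counts = Counter((rec["CENTER_NAME"], rec["TRACE_NAME"]) for rec in records)
    return sorted(key for key, n in counts.items() if n > 1)


batch = [
    {"CENTER_NAME": "WUGSC", "TRACE_NAME": "HBBBA1U2211"},
    {"CENTER_NAME": "OTHER", "TRACE_NAME": "HBBBA1U2211"},  # same name, other center: allowed
    {"CENTER_NAME": "WUGSC", "TRACE_NAME": "HBBBA1U2211"},  # true duplicate
]
dupes = duplicate_keys(batch)
```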
    TRACE_TYPE_CODE*
    Sequencing strategy by which the trace was obtained. Type: varchar(50) Example: wgs The field reflects the sequencing strategy used to obtain the trace.

    • CHIP: Sequences obtained using microarrays (also called DNA chips or gene chips)
    • CLONEEND: Sequences generated from the end of a large insert (BAC/PAC/Fosmid) or cDNA clone
    • EST: Single Pass Expressed Sequence Tag
    • HTP SELEX: High throughput SELEX
    • OTHER: Other than PCR, PrimerWalk, SHOTGUN or TRANSPOSON for FINISHING
    • PCR: Sequences obtained using templates generated by genomic Polymerase Chain Reaction
    • PrimerWalk: Sequences generated through a primer walking step
    • RT-PCR: Sequences obtained using templates generated by Reverse Transcriptase Polymerase Chain Reaction
    • SHOTGUN: Shotgun sequencing of clones (genomic or cDNA)
    • TRANSPOSON: Sequences obtained using templates generated by transposons
    • WCS: Whole Chromosome Shotgun
    • WGS: Whole Genome Shotgun
    TRANSPOSON_ACC*
    DDBJ/EMBL/Genbank accession for the transposon used in generating the sequencing template. Type: varchar(50) Example: X00913 The TRANSPOSON_ACC field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Any; TRACE_TYPE_CODE=TRANSPOSON
    TRANSPOSON_CODE*
    Center-defined code for the transposon used in generating the sequencing template. Type: varchar(50) Example: Mu transposon This field is required for the following combination of STRATEGY and TRACE_TYPE_CODE:
    STRATEGY=Any; TRACE_TYPE_CODE=TRANSPOSON
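
    The conditional requirement for the two transposon fields can be checked before submission with a few lines of code. The sketch below is an assumption-laden illustration: the function name missing_transposon_fields and the record dictionary are hypothetical, and the Validation Table remains the authoritative source for which fields are required.

```python
def missing_transposon_fields(record):
    """Return the transposon fields that are absent or empty in a record
    whose trace_type_code is TRANSPOSON; return [] for any other type."""
    if record.get("trace_type_code", "").upper() != "TRANSPOSON":
        return []
    required = ("transposon_acc", "transposon_code")
    return [name for name in required if not record.get(name)]


# Hypothetical records for illustration only.
complete = {"trace_type_code": "TRANSPOSON",
            "transposon_acc": "X00913",
            "transposon_code": "Mu transposon"}
incomplete = {"trace_type_code": "TRANSPOSON", "transposon_acc": "X00913"}
unrelated = {"trace_type_code": "WGS"}

print(missing_transposon_fields(complete))
print(missing_transposon_fields(incomplete))
print(missing_transposon_fields(unrelated))
```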
    WELL_ID
    Center-defined well identifier for the sequencing reaction. Type: varchar(50) Example: A1 The WELL_ID field, in combination with the PLATE_ID field, is used to define the storage location of the sequencing reaction (see the note with the PLATE_ID field). Typically, sequencing reactions are performed in standard microtiter dishes having either 96 or 384 wells (see standard configurations below).
    Standard 96 well microtiter configuration
    Standard 384 well microtiter configuration
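
    For reference, the well identifiers of the two standard plates can be enumerated programmatically. The sketch below assumes the conventional layouts (96 wells as rows A-H by columns 1-12, 384 wells as rows A-P by columns 1-24) and unpadded column numbers, as in the example value A1; the function name well_ids is hypothetical.

```python
import string


def well_ids(rows, columns):
    """Return well identifiers such as A1 ... H12, listed row by row."""
    return [f"{string.ascii_uppercase[r]}{c}"
            for r in range(rows)
            for c in range(1, columns + 1)]


plate_96 = well_ids(8, 12)    # rows A-H, columns 1-12
plate_384 = well_ids(16, 24)  # rows A-P, columns 1-24
print(plate_96[0], plate_96[-1], len(plate_96))
print(plate_384[0], plate_384[-1], len(plate_384))
```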

    Internal Fields List

    BASECALL_LENGTH
    Length of the trace in base pairs. Type: int Example: 396
    BASES_20
    Number of base pairs for which the quality score exceeds 20. Type: smallint Example: 50 Warning: Some depositions do not have quality scores. This is likely because the center submitted ABI files and did not provide quality calls separately.
    BASES_40
    Number of base pairs for which the quality score exceeds 40. Type: smallint Example: 50 Warning: Some depositions do not have quality scores. This is likely because the center submitted ABI files and did not provide quality calls separately.
    BASES_60
    Number of base pairs for which the quality score exceeds 60. Type: smallint Example: 50 Warning: Some depositions do not have quality scores. This is likely because the center submitted ABI files and did not provide quality calls separately.
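
    The three counters can be illustrated with a short calculation over a list of per-base quality scores. This is a sketch only: it assumes "exceeds" means strictly greater than the threshold, and the function name count_bases_above and the sample scores are hypothetical.

```python
def count_bases_above(quality_scores, threshold):
    """Count the bases whose quality score is strictly above the threshold."""
    return sum(1 for q in quality_scores if q > threshold)


# Made-up quality scores for a very short read.
quality_scores = [12, 25, 41, 60, 61, 20, 40]
summary = {f"BASES_{t}": count_bases_above(quality_scores, t)
           for t in (20, 40, 60)}
print(summary)
```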
    LOAD_DATE
    Date on which the data was loaded. Type: smalldatetime Example: Jan 8 2001 11:59AM
    MATE_PAIR
    TIs of the reads obtained from the other end of the same template. Type: int Example: 203682255 A mate pair is the pair of reads obtained from the two ends of the same template (FORWARD and REVERSE).
    REPLACED_BY
    TI that replaced the current TI as "active". Type: int Example: 304753779 This field points to the more recent data set. If the trace was updated, the field stores the TI of the new trace. If only ancillary information was updated, then replaced_by=0 and the field is not shown.
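
    A client that holds an older TI can follow REPLACED_BY to reach the most recent record. The sketch below is hypothetical throughout: it assumes the REPLACED_BY values have already been retrieved into a dictionary, that replacements may chain, and that 0 or a missing entry means "not replaced". The TI numbers are reused from the examples on this page purely for illustration.

```python
def resolve_current_ti(ti, replaced_by):
    """Follow REPLACED_BY links from `ti` until a record that has not
    been replaced is reached, and return that TI."""
    seen = set()
    while replaced_by.get(ti, 0) != 0:
        if ti in seen:  # guard against a malformed, circular chain
            raise ValueError(f"circular REPLACED_BY chain at TI {ti}")
        seen.add(ti)
        ti = replaced_by[ti]
    return ti


# Hypothetical mapping of TI -> REPLACED_BY.
replaced_by = {203682255: 304753779, 304753779: 0}
print(resolve_current_ti(203682255, replaced_by))
```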
    STATE
    Indicates the status of the trace. Type: varchar Example: active
    • active
    • updated
    • withdrawn
    TAXID
    NCBI Taxonomy ID. Type: int Example: 10090 This field links DDBJ Trace Archive with NCBI Taxonomy Browser.
    TI
    Unique internal identifier of the trace. Type: int Example: 304753779 It is assigned to a record at the loading stage, and any record or set of records can be retrieved by its identifiers.
    UPDATE_DATE
    Date on which the data was updated/replaced. Type: smalldatetime Example: Jul 19 2001 3:48PM This field is used to store the date of the last update.

    Submit trace data

    Create submission files

    The metadata file (TRACEINFO file) describes the submitted data and points to the location of the chromatograms. When extracted, every submission should have a single top directory, and all metadata files should be placed directly under it. If the submission contains trace files (in either SCF or ABI format), they should not appear in the top-level directory; create at least one subdirectory under the top directory and place all trace files there. We suggest naming subdirectories after the traces or the project. Nested subdirectories are allowed and are encouraged as a way to group traces. Below is an example of the submission directory hierarchy.

    Submission directory hierarchy example

    TOP_DIRECTORY/
    TOP_DIRECTORY/TRACEINFO
    TOP_DIRECTORY/traces/
    TOP_DIRECTORY/traces/FLJ/
    TOP_DIRECTORY/traces/FLJ/FLJA1U0001.scf
    TOP_DIRECTORY/traces/FLJ/FLJA1U0002.scf
    TOP_DIRECTORY/traces/FLJ/FLJA1U0003.scf
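
    The layout rules above (TRACEINFO directly under the top directory, trace files only in subdirectories) can be checked before upload. This is a minimal sketch under stated assumptions: the function name check_layout is hypothetical, and the set of trace file extensions is a guess, since this page only names the SCF and ABI formats.

```python
from pathlib import PurePosixPath

# Assumed file extensions for SCF and ABI trace files.
TRACE_SUFFIXES = {".scf", ".abi", ".ab1"}


def check_layout(paths, top="TOP_DIRECTORY"):
    """Return a list of layout problems for the given submission paths."""
    problems = []
    relative = [PurePosixPath(p).relative_to(top) for p in paths]
    if PurePosixPath("TRACEINFO") not in relative:
        problems.append("TRACEINFO is missing from the top directory")
    for path in relative:
        # A trace file with a single path component sits in the top directory.
        if path.suffix.lower() in TRACE_SUFFIXES and len(path.parts) < 2:
            problems.append(f"{path} is a trace file in the top directory")
    return problems


good = [
    "TOP_DIRECTORY/TRACEINFO",
    "TOP_DIRECTORY/traces/FLJ/FLJA1U0001.scf",
    "TOP_DIRECTORY/traces/FLJ/FLJA1U0002.scf",
]
bad = ["TOP_DIRECTORY/FLJA1U0001.scf"]
print(check_layout(good))
print(check_layout(bad))
```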

    The metadata file can be in either XML or tab-delimited format. The metadata requirements for specific combinations of STRATEGY and TRACE_TYPE_CODE are given in the Validation Table (spreadsheet format). Both types of metadata file can begin with a common fields section, which defines any values that are shared by all traces in the submission.

    Below are examples of TRACEINFO metadata files.

    TRACEINFO xml example

    <?xml version="1.0"?>
    <trace_volume>
       <common_fields>
          <center_name>CENTER NAME ACRONYM IS HERE</center_name>
          <center_project>FLJ</center_project>
          <source_type>N</source_type>
          <species_code>HOMO SAPIENS</species_code>
          <strategy>EST</strategy>
          <submission_type>NEW</submission_type>
          <trace_format>SCF</trace_format>
          <trace_type_code>EST</trace_type_code>
       </common_fields>
       <trace>
          <trace_name>F-3NB691000020</trace_name>
          <trace_file>./traces/F-3NB691000020.scf</trace_file>
          <clone_id>3NB691000020</clone_id>
          <library_id>3NB691</library_id>
          <template_id>3NB691000020</template_id>
       </trace>
       <trace>
          <trace_name>F-3NB691000033</trace_name>
          <trace_file>./traces/F-3NB691000033.scf</trace_file>
          <clone_id>3NB691000033</clone_id>
          <library_id>3NB691</library_id>
          <template_id>3NB691000033</template_id>
       </trace>
       <!-- more information -->
    </trace_volume>
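
    A TRACEINFO file with this structure can be generated with Python's standard xml.etree.ElementTree module. The sketch below is illustrative only: the function name build_traceinfo is hypothetical, the element names are taken from the example above, and ET.indent requires Python 3.9 or later. It does not validate the fields against the Validation Table.

```python
import xml.etree.ElementTree as ET


def build_traceinfo(common, traces):
    """Build a TRACEINFO XML document from a dict of common fields and
    a list of per-trace dicts, and return it as a string."""
    volume = ET.Element("trace_volume")
    common_el = ET.SubElement(volume, "common_fields")
    for name, value in common.items():
        ET.SubElement(common_el, name).text = value
    for trace in traces:
        trace_el = ET.SubElement(volume, "trace")
        for name, value in trace.items():
            ET.SubElement(trace_el, name).text = value
    ET.indent(volume, space="   ")  # pretty-print; Python 3.9+
    return '<?xml version="1.0"?>\n' + ET.tostring(volume, encoding="unicode")


common = {"center_project": "FLJ", "strategy": "EST",
          "trace_format": "SCF", "trace_type_code": "EST"}
traces = [
    {"trace_name": "F-3NB691000020",
     "trace_file": "./traces/F-3NB691000020.scf",
     "template_id": "3NB691000020"},
    {"trace_name": "F-3NB691000033",
     "trace_file": "./traces/F-3NB691000033.scf",
     "template_id": "3NB691000033"},
]
xml_text = build_traceinfo(common, traces)
print(xml_text)
```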

    TRACEINFO tab-delimited text example

    center_name = CENTER NAME ACRONYM IS HERE
    center_project = FLJ
    source_type = N
    species_code = HOMO SAPIENS
    strategy = EST
    submission_type = NEW
    trace_format = SCF
    trace_type_code = EST
    trace_name	clone_id	library_id	template_id	trace_file
    F-3NB691000020	3NB691000020	3NB691	3NB691000020	./traces/F-3NB691000020.scf
    F-3NB691000033	3NB691000033	3NB691	3NB691000033	./traces/F-3NB691000033.scf
    --- more information ---			
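
    The tab-delimited form can be read back with a small parser. This sketch infers the format from the example above (name = value lines for common fields, then a tab-separated header row, then one row per trace); the function name parse_traceinfo_text is hypothetical and the format details are assumptions rather than a specification.

```python
def parse_traceinfo_text(text):
    """Parse tab-delimited TRACEINFO text into (common_fields, traces)."""
    common, header, traces = {}, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("---"):
            continue  # skip blank lines and the placeholder row
        if header is None and "\t" not in line and "=" in line:
            # Common field section: "name = value"
            name, value = line.split("=", 1)
            common[name.strip()] = value.strip()
        elif header is None:
            header = line.split("\t")  # first tab-separated line is the header
        else:
            traces.append(dict(zip(header, line.split("\t"))))
    return common, traces


sample = (
    "center_project = FLJ\n"
    "trace_format = SCF\n"
    "trace_name\tclone_id\ttrace_file\n"
    "F-3NB691000020\t3NB691000020\t./traces/F-3NB691000020.scf\n"
    "--- more information ---\t\t\n"
)
common_fields, trace_rows = parse_traceinfo_text(sample)
print(common_fields)
print(trace_rows)
```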

    Upload submission files

    DTA creates a directory for data submission. Please contact the DTA team. Transfer files by SCP according to the manual.

    Submission directory example

    submission/submitter_id/dta/dta_submitter_id-0001
    The directory for DTA submissions is separate from those for the DDBJ Sequence Read Archive.

    Completion of submission

    Once the submission files are complete, DTA can keep the data private until the submitter instructs us to release them. After receiving the release instruction, DTA uploads the files to the NCBI Trace Archive. As soon as the data are loaded into the NCBI Trace Archive, TI numbers are assigned and the data become public.

    Please note that TI number assignment and data release are concurrent events.

    Update

    To update the records, please contact the DTA team.