Data file

Array-based data*

Overview*

This diagram shows how the raw data files, normalized/processed data files and data matrix files are related in a DOR experiment submission. Raw and processed data can be submitted either as one file per hybridization (hyb) or in a data matrix.

Raw data files, normalized/processed data files and data matrix files.

Data file support in the DOR can be divided into two categories: Affymetrix data files, and everything else. Non-Affymetrix data must be supplied as plain ASCII tab-delimited text, and the column headings must be left intact. MAGE-TAB uses the column headings within the file to identify which kind of file it is dealing with, and what the quantitation types are. The script will reformat recognized file types into Block Column-Block Row format, and then strip the feature coordinates and the column headings, in the process generating the necessary MAGE objects to describe the data.

For Affymetrix the raw data is the CEL file. For other platforms the raw data is the file which contains the signal intensities, background intensities, etc, for every spot on the array, e.g. GenePix .gpr file, Agilent Feature Extraction software .txt file.

Typical file formats of raw, normalized and data matrix files.
Data Type | File Format
Raw | CEL, gpr or .txt
Normalized | CHP, gpr or .txt
Combined Data Matrix File | .txt

Raw data files*

The following list gives a brief overview of how DOR recognizes different file formats. In each case, the data file row containing the column headings is identified by matching it against these sets of known column headings. Submit unedited raw data files.

Generic
Block Column/Block Row format files are recognized using the following column headings:
Block Column | Block Row | Column | Row
Affymetrix
MAGE-TAB recognizes and parses CEL and EXP files using both the old GDAC formats and the newer GCOS/XDA formats. These file formats are detected using the Affymetrix data file parser incorporated into the MAGE-TAB package. See below for notes on Affymetrix normalized data file formats.

GenePix
GenePix format files are recognized using the following column headings:
Block | Column | Row | X | Y
Agilent
A file containing these headings is recognized as Agilent format file:
Row | Col | PositionX | PositionY
ScanAlyze
The following column headings are recognized as being from a ScanAlyze format file:
GRID | COL | ROW | LEFT | TOP | RIGHT | BOT
ScanArray/QuantArray
ScanArray Express files are recognized from the following headings:
Array Column | Array Row | Spot Column | Spot Row | X | Y
while the older QuantArray format has these headings:
Array Column | Array Row | Column | Row
ArrayVision
The following column headings are recognized as indicating an ArrayVision format file:
Primary | Secondary
Newer "lg2" ArrayVision files are identified by the following column headings:
Spot labels
Spotfinder
Spotfinder files are recognized by the following column headings:
MC | MR | SC | SR
BlueFuse
A file containing the following headings is recognized as a BlueFuse file:
COL | ROW | SUBGRIDCOL | SUBGRIDROW
UCSF Spot
UCSF Spot files are recognized by the following column headings:
Arr-colx | Arr-coly | Spot-colx | Spot-coly
NimbleScan
NimbleScan files (Feature, Probe and Pair) all contain the following headings:
PROBE_ID | X | Y
Applied Biosystems
Files generated by Applied Biosystems software have the following headings:
Probe_ID | Gene_ID
Logical_row | Logical_col | Center_X | Center_Y
ImaGene
ImaGene files are recognized using the following columns:
Meta Column | Meta Row | Column | Row | Field | Gene ID
The ImaGene 3.0 format is also supported:
Meta_col | Meta_row | Sub_col | Sub_row | Name | Selected
CSIRO Spot
CSIRO Spot files contain the following columns:
grid_c | grid_r | spot_c | spot_r | indexs

Obviously, this method of determining which file type is being processed is not infallible. You are therefore encouraged to test your data files with MAGE-TAB and report any problems to us.
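
The heading-matching approach described above can be sketched in a few lines of Python. This is only an illustration, not the parser used by DOR or any MAGE-TAB tool: the function name `detect_format` and the `KNOWN_HEADINGS` table are hypothetical, only a handful of the formats listed above are included, and the sketch assumes that a row counts as the heading row when it contains every heading of a known set.

```python
import csv
import io

# Heading sets as read from the list above (subset only). The table and
# the function below are illustrative, not part of any DOR tool.
KNOWN_HEADINGS = {
    "Generic": {"Block Column", "Block Row", "Column", "Row"},
    "GenePix": {"Block", "Column", "Row", "X", "Y"},
    "Agilent": {"Row", "Col", "PositionX", "PositionY"},
    "ScanAlyze": {"GRID", "COL", "ROW", "LEFT", "TOP", "RIGHT", "BOT"},
    "BlueFuse": {"COL", "ROW", "SUBGRIDCOL", "SUBGRIDROW"},
}


def detect_format(text):
    """Return (format name, heading row index) for the first row that
    contains every heading of a known set, or (None, None)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for index, row in enumerate(reader):
        cells = {cell.strip() for cell in row}
        # Collect every known set fully contained in this row.
        matches = [
            (len(headings), name)
            for name, headings in KNOWN_HEADINGS.items()
            if headings <= cells
        ]
        if matches:
            # Assumption: the largest matching set is the most specific one.
            return max(matches)[1], index
    return None, None
```

Because the match is based on headings alone, two formats with overlapping heading sets can be confused, which is the fallibility noted above.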

Normalized data files*

Normalized data files may be submitted in any of the following formats. In addition, files may be parsed using a number of special column headings which can be used to designate a column containing reporter or composite element identifiers.

Generic normalized data:
If you have normalized data mapped to the identifiers used in your array design, you can simply use a single column containing those identifiers to include your data in the final MAGE-TAB. MAGE-TAB supports the use of either Reporter Names or Composite Element Names for this purpose. Please see these ADF help notes for a discussion on these identifier types. Thus, either of the following sets of column headers may be used.

Reporter REF | <QT1> | <QT2> | <QT3>
Composite Element REF | <QT1> | <QT2> | <QT3>

where <QT1>, <QT2> etc. are the names of your quantitation types.

Affymetrix normalized data:
MAGE-TAB recognizes and parses CHP files using both the old GDAC formats and the newer GCOS/XDA formats. In addition, Affymetrix data normalized by non-Affymetrix methods (e.g. RMA normalization) can be parsed. For such files, Composite Element Names (above) or either of the following sets of column headers may be used.

ProbeSet ID | <QT1> | <QT2> | <QT3>
ProbeSet Name | <QT1> | <QT2> | <QT3>

Again, <QT1>, <QT2> etc. are the names of your quantitation types.
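
As a rough illustration of the header layouts above, the following Python sketch checks the heading row of a normalized data file and extracts the quantitation type names. It is not the DOR parser: the function name `parse_normalized_header` is hypothetical, and the sketch assumes the file is tab-delimited with the identifier column first.

```python
# Identifier headings named in this section.
IDENTIFIER_HEADINGS = (
    "Reporter REF",
    "Composite Element REF",
    "ProbeSet ID",
    "ProbeSet Name",
)


def parse_normalized_header(header_line):
    """Split a tab-delimited heading row into
    (identifier heading, list of quantitation type names)."""
    cells = [cell.strip() for cell in header_line.rstrip("\r\n").split("\t")]
    # Assumption: the identifier column is the first column of the file.
    if cells[0] not in IDENTIFIER_HEADINGS:
        raise ValueError(f"unrecognized identifier heading: {cells[0]!r}")
    quantitation_types = cells[1:]
    if not quantitation_types or any(not name for name in quantitation_types):
        raise ValueError("every data column needs a quantitation type name")
    return cells[0], quantitation_types
```

For example, `parse_normalized_header("ProbeSet ID\tRMA signal\n")` returns the identifier heading together with a one-element list of quantitation types.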

Data Matrix files*

Normally, a MAGE-TAB document will have one data matrix where rows typically represent genes (though they may also represent other biological entities, such as exons or genomic locations), and columns typically represent samples or experimental conditions.

The main feature of data matrices is that columns in such matrices have references to Name objects in SDRF files, for instance to particular raw data files or particular samples. This enables mapping from biomaterials and their characteristics (especially experimental factor values) to individual processed data columns. Data matrix files accompanying an SDRF are annotated as such using the SDRF columns "Array Data Matrix File" and "Derived Array Data Matrix File". The formats of both types of data matrix are the same, and the only distinction between them is the type of data contained therein (unprocessed (raw) and normalized, respectively).

Syntactically, each data matrix file has two header rows, as shown in the following Figure. The first row contains references to objects named in the SDRF, such as hybridizations or scans (e.g., "Hybridization REF" followed by the hybridization names). The second row contains the names of the quantitation types, such as 'signal', 'p-value', or 'log ratio(Cy3/Cy5)' (from the MAGE-TAB perspective, these are simply labels that do not have to have a particular meaning, but normally should be defined in the data processing protocol). The left-most field on the second header row indicates the nature of the identifiers used in the first column, and may be one of the following:

  • "Reporter REF" or "Composite Element REF", indicating that each row maps to a "Reporter Name" or "Composite Element Name" in the ADF, respectively. It is anticipated that this will be the most common use for these data matrices.
  • A Term Source tag, expressed as "Term Source REF:<tag>" (e.g., "Term Source REF:ddbj", where "ddbj" is the Term Source Name), as defined in the IDF; this is used, for example, to map rows to gene annotation in public databases.
  • A genome build: "Coordinates REF:<version>" where the version build is defined in the same way as other Term Sources in the IDF (e.g., "Coordinates REF:ncbi34"). This heading is used to link row-level data to chromosome coordinates in the absence of gene-level annotation.

Where the row-level annotation is not taken from the array design described by an ADF, MAGE-TAB implementations may create virtual array designs to hold this information. Using this mapping, each column in the summary data matrix can be automatically and concisely annotated with its most important characteristics, such as experimental factor values.
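
The header rules above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the rows are assumed to be tab-delimited, and the first header row is assumed to hold the SDRF references, as in the examples later in this section.

```python
import re


def classify_identifier_heading(heading):
    """Classify the left-most field of the second header row."""
    if heading in ("Reporter REF", "Composite Element REF"):
        return ("adf", heading)
    match = re.fullmatch(r"(Term Source REF|Coordinates REF):(.+)", heading)
    if match:
        kind = "term_source" if match.group(1) == "Term Source REF" else "coordinates"
        # The part after the colon is the Term Source tag or genome build.
        return (kind, match.group(2))
    raise ValueError(f"unrecognized identifier heading: {heading!r}")


def read_matrix_header(first_row, second_row):
    """Return (reference type, identifier kind, [(reference, quantitation type)])."""
    refs = first_row.rstrip("\r\n").split("\t")
    quantitation_types = second_row.rstrip("\r\n").split("\t")
    if len(refs) != len(quantitation_types):
        raise ValueError("the two header rows must have the same number of fields")
    identifier_kind = classify_identifier_heading(quantitation_types[0])
    # Pair each data column's SDRF reference with its quantitation type.
    columns = list(zip(refs[1:], quantitation_types[1:]))
    return refs[0], identifier_kind, columns
```

Pairing each column's reference with its quantitation type is what lets a data column be traced back to a biomaterial and its experimental factor values through the SDRF.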

An example SDRF is shown in Figure (a), with the corresponding data matrices in Figures (b) and (d). The linking structure between (a), (b) and the ADF is illustrated in (c).

(a) SDRF
Sample Name | Characteristics [Organism] | Characteristics [OrganismPart] | Protocol REF | Hybridization Name | Array Design REF | Scan Name | Array Data Matrix File | Protocol REF | Derived Array Data Matrix File
liver 1 | Homo sapiens | liver | P-DORD-1 | hyb 1 | HG_U95A | Scan1 | CELData.txt | P-DORD-2 | Matrix.txt
kidney 1 | Homo sapiens | kidney | P-DORD-1 | hyb 2 | HG_U95A | Scan2 | CELData.txt | P-DORD-2 | Matrix.txt
brain 1 | Homo sapiens | brain | P-DORD-1 | hyb 3 | HG_U95A | Scan3 | CELData.txt | P-DORD-2 | Matrix.txt
(b) Data matrix examples - common "Array Data Matrix File" file ("CELData.txt") linked to the "Hybridization" in Figure (a).
Hybridization REF | hyb 1 | hyb 1 | hyb 2 | hyb 2 | hyb 3 | hyb 3
Reporter REF | CELIntensity | CELIntensityStdev | CELIntensity | CELIntensityStdev | CELIntensity | CELIntensityStdev
Gene 1 | i11 | sd11 | i21 | sd21 | i31 | sd31
Gene 2 | i12 | sd12 | i22 | sd22 | i32 | sd32
Gene 3 | i13 | sd13 | i23 | sd23 | i33 | sd33
... | ... | ... | ... | ... | ... | ...
Gene n | i1n | sd1n | i2n | sd2n | i3n | sd3n
(c) Links from the Data Matrix (b) to the Hybridization in the SDRF (a) and the Reporter in the ADF (example not given)
(d) Data matrix examples - common "Derived Array Data Matrix File" (Matrix.txt) linked to the "Scan" in Figure (a)
Scan REF | Scan1 | Scan1 | Scan2 | Scan2 | Scan3 | Scan3
Reporter REF | signal | p-value | signal | p-value | signal | p-value
Gene 1 | x11 | p11 | x21 | p21 | x31 | p31
Gene 2 | x12 | p12 | x22 | p22 | x32 | p32
Gene 3 | x13 | p13 | x23 | p23 | x33 | p33
... | ... | ... | ... | ... | ... | ...
Gene n | x1n | p1n | x2n | p2n | x3n | p3n
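
The link between a data matrix and the SDRF in these examples can be checked mechanically. The sketch below is illustrative only: the function name `unresolved_refs` is hypothetical, both files are assumed to be tab-delimited, and the rule that an "X REF" heading points at the SDRF column "X Name" is inferred from the examples above rather than taken from a specification.

```python
import csv
import io


def unresolved_refs(sdrf_text, matrix_first_row):
    """Return the matrix column references with no matching name in the SDRF."""
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    cells = matrix_first_row.rstrip("\r\n").split("\t")
    ref_type, refs = cells[0], cells[1:]
    if not ref_type.endswith(" REF"):
        raise ValueError(f"not a reference heading: {ref_type!r}")
    # Assumption: "Hybridization REF" refers to "Hybridization Name", and so on.
    sdrf_column = ref_type[: -len(" REF")] + " Name"
    if sdrf_column not in header:
        raise ValueError(f"SDRF has no column {sdrf_column!r}")
    index = header.index(sdrf_column)
    names = {row[index] for row in body if len(row) > index}
    return sorted(set(refs) - names)
```

An empty list means every column of the matrix resolves to a row of the SDRF; any name returned points at a column that cannot be annotated from the SDRF.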

Sequencing data*

Overview*

DOR accepts submissions of human non-identifiable functional genomic sequence data generated by next-generation sequencing methodologies. We accept data for studies that examine gene expression, gene regulation and epigenetics.

The following diagram shows how the raw data files, processed data files, and MAGE-TAB and SRA XML metadata files are related in a DOR experiment submission. The processed data files are submitted to the DOR with the provided MAGE-TAB. The accompanying raw data files, by contrast, are submitted to the DDBJ Sequence Read Archive (DRA) with SRA XML metadata converted from the MAGE-TAB. The DOR and DRA records are linked by accessions.

Broker submission from DOR to DRA.

Raw data files*

Raw data files will be uploaded to the DDBJ Sequence Read Archive (DRA). The raw data files should be the original sequence read and quality files, unfiltered and untrimmed. The names of these files should be referenced as appropriate in the MAGE-TAB spreadsheet. One raw data file is required for each SDRF row (assay/hybridization).

Barcode/Multiplexed Data: Submitters are required to split run files so that each barcoded sample ends up with a dedicated run file based on the barcode sequences.
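
As a rough sketch of this splitting step, the following Python function groups FASTQ records by barcode. It rests on several assumptions that may not hold for a given run: the reads are in four-line FASTQ format, the barcode sits at the start of each read sequence, and barcodes must match exactly. The function name and the `barcodes` mapping are hypothetical, and reads are left untrimmed in keeping with the requirement above.

```python
def split_by_barcode(fastq_lines, barcodes):
    """Group four-line FASTQ records by the barcode their sequence starts with.

    `barcodes` maps barcode sequence -> sample name. Records matching no
    barcode are collected under "unassigned".
    """
    groups = {sample: [] for sample in barcodes.values()}
    groups["unassigned"] = []
    lines = [line.rstrip("\n") for line in fastq_lines]
    # Step through complete records only: header, sequence, "+", qualities.
    for start in range(0, len(lines) - 3, 4):
        record = lines[start:start + 4]
        sequence = record[1]
        sample = next(
            (name for barcode, name in barcodes.items() if sequence.startswith(barcode)),
            "unassigned",
        )
        # The record is stored unchanged; no trimming is applied.
        groups[sample].append(record)
    return groups
```

Each group would then be written to its own run file, one per barcoded sample.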

Accepted file types and packaging instructions are summarized in the following table. The DRA website provides additional details.

Raw data files for high-throughput sequencing data.
Technology | Accepted File Type(s) | Notes
Illumina | _qseq | Contains base calls and phred-like quality scores per read. It is important to package these files in the form: <all data from one lane>.tar.gz
454 | sff | Contains flowgram (base call, phred quality score, flow value). The .sff files should reflect the sequencing run setup. If the entire picotitre plate was used, then one .sff file per run should be submitted. If the picotitre plate was divided into two or more regions, then a .sff file for each region should be submitted. If a .sff file contains more than one run, or more than one region in the run, please break up this file into constituent parts using the sfffile utility from the 'Off Rig' software package provided by Roche. The read names found in the .sff file are meaningful and reflect the addressing scheme for the picotitre plate as well as a globally unique run id. Please do not rewrite this name as such addressing information will be lost. The .sff file format is nearly optimal in terms of footprint, so there is little to be gained by further compressing them. Therefore, please provide .sff files uncompressed.
AB SOLiD | .csfasta and _QV.qual | SOLiD_native format. These should be organized into a compressed tar archive (.tar.gz) with all the files from one run constituting one tar file. For paired-ends data, two files of each file type will exist (F3 and R3).
Helicos | to be determined | Please contact us for instructions if you want to submit HeliScope data.

Fastq format files are also accepted, but raw data files listed in the above table are preferred.
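
The .tar.gz packaging mentioned in the table can be done with any tar tool. As one possible approach, the sketch below uses Python's standard `tarfile` module; the function name `package_files` and the choice to store files without their directory paths are illustrative assumptions, not DRA requirements.

```python
import tarfile
from pathlib import Path


def package_files(file_paths, archive_path):
    """Write the given files into one gzip-compressed tar archive."""
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for file_path in file_paths:
            file_path = Path(file_path)
            # Store only the file name so no local directory layout is kept.
            archive.add(str(file_path), arcname=file_path.name)
    return archive_path
```

Called with all the files from one lane (or one run) and an archive name ending in .tar.gz, this produces a single archive per lane or run. As the table notes, .sff files should be left uncompressed and not packaged this way.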

Processed data files*

Requirements for processed data files are not yet fully standardized and will depend on the nature of the experiment. Submit the processed data files from which the conclusions in associated manuscripts were drawn (e.g., filtered sequence reads with abundance counts, alignment files, graph files, peak files, etc.).