This diagram shows how the raw data files, normalized/processed data files and data matrix files are related in a DOR experiment submission. Raw and processed data can be submitted either as one file per hybridization (hyb) or in a data matrix.
Data file support in the DOR can be divided into two categories: Affymetrix data files, and everything else. Non-Affymetrix data must be supplied as plain ASCII tab-delimited text, and the column headings must be left intact. MAGE-TAB uses the column headings within the file to identify which kind of file it is dealing with, and what the quantitation types are. The script will reformat recognized file types into Block Column-Block Row format, and then strip the feature coordinates and the column headings, in the process generating the necessary MAGE objects to describe the data.
For Affymetrix, the raw data is the CEL file. For other platforms, the raw data is the file that contains the signal intensities, background intensities, etc., for every spot on the array, e.g. a GenePix .gpr file or an Agilent Feature Extraction software .txt file.
|Data Type||File Format|
|Raw||CEL, gpr or .txt|
|Normalized||CHP, gpr or .txt|
|Combined Data Matrix File||.txt|
The following list gives a brief overview of how DOR recognizes different file formats. In each case, the data file row containing the column headings is identified by matching it against these sets of known column headings. Submit unedited raw data files.
- Block Column/Block Row format files are recognized using the following column headings:
|Block Column||Block Row||Column||Row|
- MAGE-TAB recognizes and parses CEL and EXP files using both the old GDAC formats and the newer GCOS/XDA formats. These file formats are detected using the Affymetrix data file parser incorporated into the MAGE-TAB package. See below for notes on Affymetrix normalized data file formats.
- GenePix format files are recognized using the following column headings:
- A file containing these headings is recognized as Agilent format file:
- The following column headings are recognized as being from a ScanAlyze format file:
- ScanArray Express files are recognized from the following headings:
|Array Column||Array Row||Spot Column||Spot Row||X||Y|
|Array Column||Array Row||Column||Row|
- The following column headings are recognized as indicating an ArrayVision format file:
- Spotfinder files are recognized by the following column headings:
- A file containing the following headings is recognized as a BlueFuse file:
- UCSF Spot
- UCSF Spot files are recognized by the following column headings:
- NimbleScan files (Feature, Probe and Pair) all contain the following headings:
- Applied Biosystems
- Files generated by Applied Biosystems software have the following headings:
- ImaGene files are recognized using the following columns:
|Meta Column||Meta Row||Column||Row||Field||Gene ID|
- CSIRO Spot
- CSIRO Spot files contain the following columns:
Obviously, this method of determining which file type is being processed is not infallible. You are therefore encouraged to test your data files with MAGE-TAB and report any problems to us.
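As a rough illustration of the heading-matching approach described above, the following Python sketch scans tab-delimited text for the first row that contains one of the known heading sets. It is not the MAGE-TAB package's actual implementation: the function name `detect_format` and the `KNOWN_HEADINGS` table are hypothetical, and only two of the heading sets listed above are included.

```python
import csv
import io

# Heading sets copied from the list above. Only two formats are shown;
# this table is an illustrative assumption, not the full set.
KNOWN_HEADINGS = {
    "Block Column/Block Row": {"Block Column", "Block Row", "Column", "Row"},
    "ImaGene": {"Meta Column", "Meta Row", "Column", "Row", "Field", "Gene ID"},
}


def detect_format(text):
    """Return (format_name, header_row_index), or (None, None) if no row matches."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for index, row in enumerate(reader):
        cells = {cell.strip() for cell in row}
        for name, headings in KNOWN_HEADINGS.items():
            # A row is treated as the heading row when it contains
            # every heading of a known set (extra columns are allowed).
            if headings <= cells:
                return name, index
    return None, None
```

Because a match only requires that the known headings be present, a file with renamed or deleted columns will not be recognized, which is one reason to leave the column headings intact.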
Normalized data files may be submitted in any of the following formats. In addition, files may be parsed using a number of special column headings which can be used to designate a column containing reporter or composite element identifiers.
Generic normalized data:
If you have normalized data mapped to the identifiers used in your array design, you can simply use a single column containing those identifiers to include your data in the final MAGE-TAB. MAGE-TAB supports the use of either Reporter Names or Composite Element Names for this purpose. Please see these ADF help notes for a discussion on these identifier types. Thus, either of the following sets of column headers may be used.
|Composite Element REF||<QT1>||<QT2>||<QT3>|
where <QT1>, <QT2> etc. are the names of your quantitation types.
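A minimal sketch of reading such a generic normalized data file is shown below, assuming a single heading row followed by one row per identifier. The function name `read_normalized` is hypothetical, and the sketch accepts "Reporter REF" as the alternative identifier heading on the assumption that it mirrors "Composite Element REF".

```python
import csv
import io

# Assumed identifier headings for the first column.
IDENTIFIER_HEADINGS = ("Reporter REF", "Composite Element REF")


def read_normalized(text):
    """Return (identifier_heading, quantitation_types, rows).

    rows maps each identifier to a dict of {quantitation type: value}.
    Values are kept as strings exactly as they appear in the file.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader, None)
    if not header or header[0] not in IDENTIFIER_HEADINGS:
        raise ValueError(f"unexpected heading row: {header!r}")
    quantitation_types = header[1:]
    rows = {}
    for line in reader:
        if not line:
            continue  # skip blank lines
        rows[line[0]] = dict(zip(quantitation_types, line[1:]))
    return header[0], quantitation_types, rows
```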
Affymetrix normalized data:
MAGE-TAB recognizes and parses CHP files using both the old GDAC formats and the newer GCOS/XDA formats. In addition, Affymetrix data normalized by non-Affymetrix methods (e.g. RMA normalization) can be parsed. Either Composite Element Names (above) or either of the following sets of column headers may be used.
Again, <QT1>, <QT2> etc. are the names of your quantitation types.
Normally, a MAGE-TAB document will have one data matrix where rows typically represent genes (though they may also represent other biological entities, such as exons or genomic locations), and columns typically represent samples or experimental conditions.
The main feature of data matrices is that columns in such matrices have references to Name objects in SDRF files, for instance to particular raw data files or particular samples. This enables mapping from biomaterials and their characteristics (especially experimental factor values) to individual processed data columns. Data matrix files accompanying an SDRF are annotated as such using the SDRF columns "Array Data Matrix File" and "Derived Array Data Matrix File". The formats of both types of data matrix are the same, and the only distinction between them is the type of data contained therein (unprocessed (raw) and normalized, respectively).
Syntactically, each data matrix file has two header rows, as shown in the following Figure. The second row contains the names of the quantitation types, such as 'signal', 'p-value', or 'log ratio(Cy3/Cy5)' (from the MAGE-TAB perspective, these are simply labels that do not have to have a particular meaning, but normally should be defined in the data processing protocol). The left-most field on the second header row indicates the nature of the identifiers used in the first column, and may be one of the following:
- "Reporter REF" or "Composite Element REF" indicating that each row maps to "Reporter Name" or "Composite Element Name" in the ADF, respectively. It is anticipated that this will be the most common use for these data matrices.
- A Term Source tag, expressed as "Term Source REF:<tag>" (e.g., "Term Source REF:ddbj", where "ddbj" is the Term Source Name), as defined in the IDF; this is used, for example, to map rows to gene annotation in public databases.
- A genome build: "Coordinates REF:<version>" where the version build is defined in the same way as other Term Sources in the IDF (e.g., "Coordinates REF:ncbi34"). This heading is used to link row-level data to chromosome coordinates in the absence of gene-level annotation.
Where the row-level annotation is not taken from the array design described by an ADF, MAGE-TAB implementations may create virtual array designs to hold this information. Using this mapping each column in the summary data matrix can be automatically and concisely annotated by the most important characteristics, such as experimental factor values.
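The two-row header described above can be interpreted along the following lines. This is a sketch under stated assumptions rather than a reference parser: the function names are hypothetical, the header rows are assumed to have already been split on tabs, and the first-row label "Hybridization REF" used in the usage note is taken from the example in this section.

```python
def classify_identifier_heading(heading):
    """Classify the left-most field of the second header row."""
    heading = heading.strip()
    if heading in ("Reporter REF", "Composite Element REF"):
        return {"kind": heading, "tag": None}
    for prefix in ("Term Source REF", "Coordinates REF"):
        if heading.startswith(prefix + ":"):
            tag = heading[len(prefix) + 1:].strip()
            if tag:
                # The tag should match a Term Source Name defined in the IDF.
                return {"kind": prefix, "tag": tag}
    raise ValueError(f"unrecognized identifier heading: {heading!r}")


def parse_matrix_header(first_row, second_row):
    """Pair each data column's SDRF reference with its quantitation type."""
    if len(first_row) != len(second_row):
        raise ValueError("header rows have different lengths")
    return {
        "references": first_row[0],
        "identifier": classify_identifier_heading(second_row[0]),
        "columns": list(zip(first_row[1:], second_row[1:])),
    }
```

For example, calling `parse_matrix_header(["Hybridization REF", "hyb 1", "hyb 1"], ["Reporter REF", "signal", "p-value"])` pairs "hyb 1" with each of its two quantitation types, so every data column can be traced back to a name in the SDRF.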
An example SDRF is shown in Figure (a), with the corresponding data matrix in Figures (b) and (d). The linking structure between (a), (b) and the ADF is illustrated in (c).
|Derived Array Data
|liver 1||Homo sapiens||liver||P-DORD-1||hyb 1||HG_U95A||Scan1||CELData.txt||P-DORD-2||Matrix.txt|
|kidney 1||Homo sapiens||kidney||P-DORD-1||hyb 2||HG_U95A||Scan2||CELData.txt||P-DORD-2||Matrix.txt|
|brain 1||Homo sapiens||brain||P-DORD-1||hyb 3||HG_U95A||Scan3||CELData.txt||P-DORD-2||Matrix.txt|
|Hybridization REF||hyb 1||hyb 1||hyb 2||hyb 2||hyb 3||hyb 3|
DOR accepts submissions of human non-identifiable functional genomic sequence data generated by next-generation sequencing methodologies. We accept data for studies that examine gene expression, gene regulation and epigenetics.
The following diagram shows how the raw data files, processed data files, and MAGE-TAB and SRA XML metadata files are related in a DOR experiment submission. The processed data files are submitted to the DOR together with the provided MAGE-TAB. The accompanying raw data files, on the other hand, are submitted to the DDBJ Sequence Read Archive (DRA) with SRA XML metadata converted from the MAGE-TAB. The DOR and DRA records are linked by accessions.
Raw data files will be uploaded to the DDBJ Sequence Read Archive (DRA) database. The raw data files should be the original sequence read and quality files, unfiltered and untrimmed. The names of these files should be referenced as appropriate in the MAGE-TAB spreadsheet. One raw data file for each SDRF row (assay/hybridization) is required.
Barcode/Multiplexed Data: Submitters are required to split run files so that each barcoded sample ends up with a dedicated run file based on the barcode sequences.
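To illustrate what splitting by barcode means, here is a minimal Python sketch that groups reads by sample. It assumes in-line barcodes at the very start of each read and exact matching only; the function name, the record layout and the "unassigned" group are hypothetical, and in practice the demultiplexing tools supplied with the sequencing platform would normally be used. Reads are left untrimmed, in keeping with the requirement that raw data be unfiltered and untrimmed.

```python
def split_by_barcode(records, barcodes):
    """Group reads by the barcode found at the start of each sequence.

    records:  iterable of (name, sequence, quality) tuples.
    barcodes: mapping of barcode sequence -> sample name.
    Returns a dict of sample name -> list of unmodified records;
    reads matching no barcode are collected under "unassigned".
    """
    groups = {sample: [] for sample in barcodes.values()}
    groups["unassigned"] = []
    for name, sequence, quality in records:
        for barcode, sample in barcodes.items():
            if sequence.startswith(barcode):
                groups[sample].append((name, sequence, quality))
                break
        else:
            # No barcode matched this read.
            groups["unassigned"].append((name, sequence, quality))
    return groups
```

Each resulting group would then be written to its own run file, so that every barcoded sample has a dedicated file.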
Accepted file types and packaging instructions are summarized in the following table. The DRA website provides additional details.
|Technology||Accepted File Type(s)||Notes|
|Illumina||_qseq||Contains base calls and phred-like quality scores per read. It is important to package these files in the form: <all data from one lane>.tar.gz|
|454||sff||Contains flowgram (base call, phred quality score, flow value). The .sff files should reflect the sequencing run setup. If the entire picotitre plate was used, then one .sff file per run should be submitted. If the picotitre plate was divided into two or more regions, then a .sff file for each region should be submitted. If a .sff file contains more than one run, or more than one region in the run, please break up this file into constituent parts using the sfffile utility from the 'Off Rig' software package provided by Roche. The read names found in the .sff file are meaningful and reflect the addressing scheme for the picotitre plate as well as a globally unique run id. Please do not rewrite this name as such addressing information will be lost. The .sff file format is nearly optimal in terms of footprint, so there is little to be gained by further compressing them. Therefore, please provide .sff files uncompressed.|
|AB SOLiD||.csfasta and _QV.qual||SOLiD_native format. These should be organized into a compressed tar archive (.tar.gz) with all the files from one run constituting one tar file. For paired-ends data, two files of each file type will exist (F3 and R3).|
|Helicos||to be determined||Please contact us for instructions if you want to submit HeliScope data.|
Fastq format files are also accepted, but raw data files listed in the above table are preferred.
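The .tar.gz packaging requested for Illumina and SOLiD data can be produced with standard tools. The sketch below uses Python's standard `tarfile` module; the function name `package_files` and the choice to store files without their directory paths are assumptions made for illustration, not DRA requirements.

```python
import tarfile
from pathlib import Path


def package_files(files, archive_path):
    """Write the given files into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            path = Path(path)
            # Store each file under its base name only, so the archive
            # does not expose local directory structure.
            archive.add(path, arcname=path.name)
    return archive_path
```

One archive would be built per lane (Illumina) or per run (SOLiD), as described in the table above. The .sff files for 454 data should not be packaged this way, since they are to be submitted uncompressed.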
Requirements for processed data files are not yet fully standardized and will depend on the nature of the experiment. Submit the processed data files from which the conclusions in associated manuscripts were drawn (e.g., filtered sequence reads with abundance counts, alignment files, graph files, peak files, etc.).