MAGE-TAB

Introduction*

DOR uses the MAGE-TAB version 1.1 format. For the full specification, please refer to the MAGE-TAB Specification Version 1.1.

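The MAGE-TAB format uses a number of different files to capture information about a functional genomics experiment: the Investigation Description Format (IDF) file, the Sample and Data Relationship Format (SDRF) file, the Array Design Format (ADF) file, and the raw and processed data files.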

The IDF, SDRF, ADF and data matrix files should be in plain, tab-delimited text format.

The IDF file gives an overview of the experiment, including the experimental variables (factors) used, protocols, quality control strategy, publication information and contact details. The IDF file also includes a list of the sources of any controlled vocabulary terms used elsewhere in the MAGE-TAB document. These term sources may be fully-fledged ontologies (e.g. the MGED ontology), databases providing queryable accession numbers (e.g. ArrayExpress/DOR), or simply a file defining terms for local users.

The SDRF file describes the relationship between every step in the chain of biological materials used in the experiment through to the hybridization, and the acquisition and normalization of data. Experimental factors, protocols, protocol parameters and term sources defined in the IDF are referenced by the SDRF.

The ADF file provides the array-level annotation for the experiment, relating the row-level identifiers in the data files to biological sequence annotation. Array designs are usually deposited in the DOR separately from the experimental data, and in the case of commercial arrays may not need to be submitted to DOR at all. For details, see the ADF page.

The IDF file should contain a pointer to the SDRF via the SDRF File tag. Data files and ADF files are referenced from the SDRF table directly. All data files should be in a single directory or archive with no sub-directory structure.

Relationships between IDF, SDRF, ADF and raw and processed data files

An experimental data submission will usually consist of an IDF file, an SDRF file, and a series of data files. Typically there will be one raw data file per hybridization in an array-based experiment. In a sequencing-based experiment, typically there will be one raw data file per sample. Each hybridization may also have a normalized data file, or the final transformed data may be combined into a data matrix file.

Basic rules*

Blank lines containing zero or more spaces or tabs are permitted in any of these files. Lines starting with the # symbol are interpreted as comments and are ignored.

In MAGE-TAB, columns whose names end with "Name" (e.g., Sample Name) contain object identifiers (names). Object identifiers defined in the IDF (e.g., Protocol Name) should be referenced only in columns whose names end with "REF" (e.g., Protocol REF).

All of the MAGE-TAB components (IDF, ADF, SDRF and data matrices) allow for referencing ontology terms or database accessions from external sources. In each case the source of the term(s) is indicated by a separate Term Source REF field, which references a Term Source Name defined in the IDF.

IDF, SDRF and ADF documents contain data divided into columns and rows. Columns are separated by tab characters, while lines are separated by newlines and/or carriage returns. Fields within columns may be escaped by surrounding them with double quotes, indicating that any tab or newline characters contained therein are not to be interpreted as a field delimiter. Quote characters within fields must be escaped with a backslash. Note that column headers are also permitted to be enclosed in double quotes, but no characters other than spaces are permitted between the multiple keywords that comprise a column header.
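As a rough illustration of the rules above, the following Python sketch splits MAGE-TAB text into rows of fields. It skips blank and comment lines and honors double-quoted fields with backslash-escaped quotes. The function name read_mage_tab and the sample text are hypothetical, and the sketch assumes for simplicity that no quoted field contains a blank line or a line starting with #.

```python
import csv

def read_mage_tab(text):
    """Split MAGE-TAB text into rows of fields (illustrative sketch only)."""
    # Drop blank lines (only spaces/tabs) and comment lines starting with '#'.
    # Simplification: this filter also applies to lines inside quoted fields.
    lines = [
        line for line in text.splitlines(keepends=True)
        if line.strip() and not line.startswith("#")
    ]
    reader = csv.reader(
        lines,
        delimiter="\t",      # columns are separated by tabs
        quotechar='"',       # fields may be wrapped in double quotes
        escapechar="\\",     # quotes inside fields are backslash-escaped
        doublequote=False,
    )
    return list(reader)

# Hypothetical sample: one comment, one blank line, two data rows.
sample = (
    "# a comment line\n"
    "\n"
    "Investigation Title\tMy study\n"
    'Protocol Description\t"Mix\tthen \\"spin\\""\n'
)
rows = read_mage_tab(sample)
print(rows)
```

The quoted Protocol Description value keeps its embedded tab and quote characters as part of a single field.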

For more detail on the MAGE-TAB document format, please see the MAGE-TAB Specification Version 1.1.

IDF: Investigation Description Format*

The IDF component of a MAGE-TAB document provides top-level information concerning an investigation in a single tab-delimited text file. The IDF consists of a set of unique fields, each followed by its corresponding values. For example, "Experiment Description" should be followed by a free-text description of the experiment. Most of the fields in the IDF document can handle multiple values.

IDF format*

Field properties for IDF
MAGE-TAB Version	1.1
Investigation Title	Text
Experimental Design	Ontology term	Ontology term	...
Experimental Design Term Source REF	Term Source Name	Term Source Name	...
Experimental Design Term Accession Number	Term Accession Number	Term Accession Number	...
Experimental Factor Name	Text	Text	...
Experimental Factor Type	Ontology term	Ontology term	...
Experimental Factor Term Source REF	Term Source Name	Term Source Name	...
Experimental Factor Term Accession Number	Term Accession Number	Term Accession Number	...

Person Last Name	Text	Text	...
Person First Name	Text	Text	...
Person Mid Initials	Text	Text	...
Person Email	Text	Text	...
Person Phone	Text	Text	...
Person Fax	Text	Text	...
Person Address	Text	Text	...
Person Affiliation	Text	Text	...
Person Roles	Ontology term (semicolon-delimited list)	Ontology term (semicolon-delimited list)	...
Person Roles Term Source REF	Term Source Name	Term Source Name	...
Person Roles Term Accession Number	Term Accession Number	Term Accession Number	...

Quality Control Type	Ontology term	Ontology term	...
Quality Control Term Source REF	Term Source Name	Term Source Name	...
Quality Control Term Accession Number	Term Accession Number	Term Accession Number	...
Replicate Type	Ontology term	Ontology term	...
Replicate Term Source REF	Term Source Name	Term Source Name	...
Replicate Term Accession Number	Term Accession Number	Term Accession Number	...
Normalization Type	Ontology term	Ontology term	...
Normalization Term Source REF	Term Source Name	Term Source Name	...
Normalization Term Accession Number	Term Accession Number	Term Accession Number	...
Date of Experiment	Date (YYYY-MM-DD)
Public Release Date	Date (YYYY-MM-DD)

PubMed ID	PubMed ID	PubMed ID	...
Publication DOI	DOI	DOI	...
Publication Author List	Text	Text	...
Publication Title	Text	Text	...
Publication Status	Ontology term	Ontology term	...
Publication Status Term Source REF	Term Source Name	Term Source Name	...
Publication Status Term Accession Number	Term Accession Number	Term Accession Number	...
Experiment Description	Text

Protocol Name	ID	ID	...
Protocol Type	Ontology term	Ontology term	...
Protocol Term Source REF	Term Source Name	Term Source Name	...
Protocol Term Accession Number	Term Accession Number	Term Accession Number	...
Protocol Description	Text	Text	...
Protocol Parameters	Text (semicolon-delimited list)	Text (semicolon-delimited list)	...
Protocol Hardware	Text	Text	...
Protocol Software	Text	Text	...
Protocol Contact	Text	Text	...

SDRF File	Text

Term Source Name	Text tag as used in SDRF	Text tag as used in SDRF	...
Term Source File	URI	URI	...
Term Source Version	Text	Text	...

The second column indicates the type of entry expected for each row. Rows listing a single entry do not allow multiple values. Rows ending in "..." may hold multiple values in columns listed horizontally, one for each element described. For example, use as many Person Last Name columns as there are contacts for the investigation. Where multiple terms need to be entered into a single column, they should be separated by semicolons (e.g., Protocol Parameters, Person Roles). All semicolon-separated roles in a column must come from one ontology.
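The row layout described above can be sketched in Python as follows. This is a minimal sketch rather than a reference parser: the names parse_idf, too_many_values and SINGLE_VALUED are hypothetical, the sample values are invented, and quoted fields are not handled.

```python
# Tags that the IDF field descriptions say "can only have one value".
SINGLE_VALUED = {
    "Investigation Title",
    "Date of Experiment",
    "Public Release Date",
    "Experiment Description",
}

def parse_idf(text):
    """Map each IDF tag (first column) to the list of values that follow it."""
    idf = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        tag, *values = line.split("\t")
        # Trailing empty cells are padding, not values.
        while values and not values[-1].strip():
            values.pop()
        idf[tag.strip()] = [value.strip() for value in values]
    return idf

def too_many_values(idf):
    """Return single-valued tags that were given more than one value."""
    return sorted(tag for tag in SINGLE_VALUED if len(idf.get(tag, [])) > 1)

# Hypothetical IDF fragment with two contacts and one deliberate error.
sample_idf = (
    "Investigation Title\tExample study\tSecond title\n"
    "Person Last Name\tSato\tSuzuki\n"
    "Person Roles\tsubmitter;investigator\tdata_coder\n"
)
idf = parse_idf(sample_idf)
# One semicolon-delimited list of roles per person column.
roles_per_person = [value.split(";") for value in idf["Person Roles"]]
print(too_many_values(idf))
print(roles_per_person)
```

Each horizontal position corresponds to one element, so the second Person Roles cell belongs to the second Person Last Name.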

Note that fields containing ontology terms should indicate the origin of those terms using the relevant Term Source REF field. Dates should be supplied in the ISO format YYYY-MM-DD. An example IDF is given under "IDF examples".
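A date check along these lines might look like the sketch below. The helper name is_iso_date is hypothetical; it accepts only zero-padded YYYY-MM-DD strings that denote a real calendar date.

```python
from datetime import datetime

def is_iso_date(value):
    """Return True only for a valid, zero-padded YYYY-MM-DD date string."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # wrong layout or impossible date
    # strptime tolerates missing zero padding, so compare the round trip.
    return parsed.strftime("%Y-%m-%d") == value

print(is_iso_date("2011-01-01"))
print(is_iso_date("2011-1-1"))
```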

Comment fields can be included freely to add comments. The name associated with the comment is included in square brackets in the row name, and the value is entered in the body of the IDF. Types are not currently supported. An example use case for the IDF is Comment[Goal], which describes the goal of the study. DOR uses such fields for its own local implementation.

To specify bibliographic references accompanying the experiment, it is sufficient to enter just the PubMed ID for each citation into the IDF. Where a given article is not yet published, the available information should be given using the IDF fields shown.

IDF fields*

Investigation Title
The overall title of the investigation. This tag can only have one value.
Experimental Design
The experiment design types which are applicable to this study. Typically these terms should come from the MGED Ontology. The ExperimentDesignType subclasses are particularly useful here. See for example the list of BiologicalProperty terms available. Controlled vocabulary term.
Experimental Design Term Source REF
The source of the Experimental Design terms; this must reference one of the Term Source Names defined elsewhere in the IDF file (see below).
Experimental Design Term Accession Number
The accession number for this term, taken from the indicated Term Source.
Experimental Factor Name
A user-defined name for each experimental factor studied by the experiment. These experimental factors represent the variables within the investigation (e.g. growth condition, genotype, organism part, disease state). The actual values of these variables will be listed in the SDRF file, in "Factor Value [<factor name>]" columns. Used as an identifier within the MAGE-TAB document.
Experimental Factor Type
A term describing the type of each experimental factor. These terms will usually come from the MGED Ontology. The ExperimentalFactorCategory subclasses are particularly useful here. See for example the list of BioMaterialCharacteristicCategory terms available. Controlled vocabulary term.
Experimental Factor Term Source REF
The source of the Experimental Factor Type terms; this must reference one of the Term Source Names defined elsewhere in the IDF file (see below).
Experimental Factor Term Accession Number
The accession number for this term, taken from the indicated Term Source.
Person Last Name
The last name of each person associated with the experiment.
Person First Name
The first name of each person associated with the experiment.
Person Mid Initials
The middle initials of each person associated with the experiment.
Person Email
The email address of each person associated with the experiment. The contact information is not made public.
Person Phone
The telephone number of each person associated with the experiment. The contact information is not made public.
Person Fax
The Fax number of each person associated with the experiment. The contact information is not made public.
Person Address
The street address of each person associated with the experiment. The contact information is not made public.
Person Affiliation
The organization affiliation for each person associated with the experiment. This is used for public display.
Person Roles
The role(s) performed by each person. Typically these terms should come from the MGED Ontology. See for example the list of Roles terms. If more than one role is needed per person, the roles should be given as a semicolon (";") delimited list, for example: "submitter;data_coder;investigator". Controlled vocabulary term.
Person Roles Term Source REF
The source of the Person Roles terms; this must reference one of the Term Source Names defined elsewhere in the IDF file (see below).
Person Roles Term Accession Number
The accession number for this term, taken from the indicated Term Source.
Quality Control Type
The quality control procedures used. Typically these terms should come from the MGED Ontology. See for example the list of QualityControlDescriptionType terms. Controlled vocabulary term.
Quality Control Term Source REF
The source of the Quality Control Type terms; this must reference one of the Term Source Names defined elsewhere in the IDF file (see below).
Quality Control Term Accession Number
The accession number for this term, taken from the indicated Term Source.
Replicate Type
The replicate strategies used. Typically these terms should come from the MGED Ontology. See for example the list of ReplicateDescriptionType terms. Controlled vocabulary term.
Replicate Term Source REF
The source of the Replicate Type terms; this must reference one of the Term Source Names defined elsewhere in the IDF file (see below).
Replicate Term Accession Number
The accession number for this term, taken from the indicated Term Source.
Normalization Type
The normalization strategies used. Typically these terms should come from the MGED Ontology. See for example the list of NormalizationDescriptionType terms. Controlled vocabulary term.
Normalization Term Source REF
The source of the Normalization Type terms; this must reference one of the Term Source Names defined elsewhere in the IDF file (see below).
Normalization Term Accession Number
The accession number for this term, taken from the indicated Term Source.
Date of Experiment
The date on which the experiment was performed. The date should be entered in the "YYYY-MM-DD" format (e.g., 2011-01-01). This tag can only have one value.
Public Release Date
The date on which the experimental data will be or was released. The date should be entered in the "YYYY-MM-DD" format (e.g., 2011-01-01). This tag can only have one value.
PubMed ID
The PubMed IDs of the publication(s) associated with this investigation (where available).
Publication DOI
A Digital Object Identifier (DOI) for each publication (where available).
Publication Author List
The list of authors associated with each publication.
Publication Title
The title of each publication.
Publication Status
A term describing the status of each publication (e.g. "submitted", "in preparation", "published"). Controlled vocabulary term.
Publication Status Term Source REF
The source of the Publication Status terms; this must reference one of the Term Source Names defined elsewhere in the IDF file (see below).
Publication Status Term Accession Number
The accession number for this term, taken from the indicated Term Source.
Experiment Description
A short paragraph describing the experiment as free-text. This tag can only have one value.
Protocol Name
The names of the protocols used within the MAGE-TAB document. These will be referenced in the SDRF in the "Protocol REF" columns. Used as an identifier within the MAGE-TAB document.
Protocol Type
The type of the protocol, taken from a controlled vocabulary. Typically this term should come from the MGED Ontology. See for example the list of ExperimentalProtocolType terms. Controlled vocabulary term.
Protocol Description
A free-text description of the protocol. This text is included in a single tab-delimited field. If you wish to include tab or newline characters as part of this text, you must enclose the whole text within double quotes (").
Protocol Parameters
A semicolon-delimited list of parameter names; these names are used in the SDRF file (as "Parameter Value [<parameter name>]" headers) to list the values used for each protocol parameter. If more than one parameter was used for a given protocol, they should be separated with semicolons (";"). Used as an identifier within the MAGE-TAB document.
Protocol Hardware
The hardware used by the protocol.
Protocol Software
The software used by the protocol.
Protocol Contact
The name and contact details to be used for enquiries concerning the protocol.
Protocol Term Source REF
The source of the Protocol Type terms; this must reference one of the Term Source Names defined elsewhere in the IDF file (see below). Examples: MGED ontology, OBI.
Protocol Term Accession Number
The accession number for this term, taken from the indicated Term Source.
SDRF File
The name(s) of the SDRF file(s) accompanying this IDF file.
Term Source Name
The names of the Term Sources (ontologies or databases) used within the MAGE-TAB document. This name will be used in all corresponding "Term Source REF" fields. Examples: MGED Ontology, NCI MetaThesaurus, ArrayExpress. Used as an identifier within the MAGE-TAB document.
Term Source File
A filename or valid URI at which the Term Source may be accessed.
Term Source Version
The version of the Term Source used throughout the MAGE-TAB document.
Comment[<user-defined tag>]
A user-defined value which is associated with the investigation. For example, DOR uses "Comment [BioProject ID]" to record the BioProject ID; alternatively, tags such as "Comment [Goal]" might be used to indicate the purpose behind an investigation.
Comment[BioProject ID]
The BioProject ID of the associated project. This is used to group the related INSDC records. See the DDBJ BioProject website for details.
Comment[DRA accession]
The DRA accession number(s) of the associated raw sequencing reads. This field links the processed data in the DOR and raw data in the DRA. When the data set is submitted to the DOR, the DOR registers raw data to the DRA and fills in this field.
Comment[Center Name]
The center name of the associated DRA submission.
Comment[Laboratory Name]
The Laboratory name of the associated DRA submission.
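Several of the fields above state that a Term Source REF must reference one of the Term Source Names defined in the IDF. A minimal sketch of that check is shown below, assuming the IDF has already been read into a dictionary mapping each tag to its list of values. The function name undefined_term_sources and the sample values are hypothetical.

```python
def undefined_term_sources(idf):
    """Report Term Source REF values that no Term Source Name defines."""
    defined = set(idf.get("Term Source Name", []))
    problems = {}
    for tag, values in idf.items():
        if tag.endswith("Term Source REF"):
            # Empty cells are allowed; only non-empty references are checked.
            missing = sorted({v for v in values if v and v not in defined})
            if missing:
                problems[tag] = missing
    return problems

# Hypothetical IDF content: "OBI" is referenced but never defined.
idf = {
    "Experimental Design": ["time_series_design"],
    "Experimental Design Term Source REF": ["MO"],
    "Protocol Type": ["hybridization"],
    "Protocol Term Source REF": ["OBI"],
    "Term Source Name": ["MO"],
}
problems = undefined_term_sources(idf)
print(problems)
```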

IDF examples*

An example of an IDF document

SDRF: Sample and Data Relationship Format*

The most important concept behind the SDRF is the investigation design graph, where nodes correspond to biomaterials (e.g., samples, RNA extracts, labeled cDNA, etc.) or data objects (e.g., raw or normalized data files), and edges show the relationships between these objects. Attributes can be attached to nodes and to edges. The attributes are the descriptions of the biomaterial or data properties, e.g., sample descriptions attached to sample nodes, protocols attached to edges, raw data files attached to hybridizations. The attributes can be pointers to some longer descriptions or external objects, e.g., protocols described in the IDF file.

SDRF names and attributes, pointers to the other objects.

The SDRF file consists of a table in which each hybridization channel is represented by a row, and columns represent the steps of the experiment. The ordering of these columns is important, and should read left-to-right in chronological order. The overall organization of this table is shown below.

SDRF overall structure

Each block in the diagram above starts with a "Name" or "File" column (e.g. "Extract Name", "Array Data File"), followed by a set of attribute columns. Each block is separated from its predecessor by "Protocol REF" columns containing references to the "Protocol Name" values defined in the IDF.
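To make the block structure concrete, here is a small Python sketch that walks each SDRF row from left to right and turns it into graph edges, with the intervening Protocol REF values attached to each edge. The names NODE_COLUMNS and sdrf_edges and the sample rows are hypothetical. Image File is left out of the node list as a simplification.

```python
# Node columns as listed in this document ("Name" and "File" columns).
NODE_COLUMNS = {
    "Source Name", "Sample Name", "Extract Name", "Labeled Extract Name",
    "Hybridization Name", "Assay Name", "Scan Name", "Normalization Name",
    "Array Data File", "Derived Array Data File",
    "Array Data Matrix File", "Derived Array Data Matrix File",
}

def sdrf_edges(header, rows):
    """Return a set of (from_node, to_node, protocols) edges.

    Each node is a (column name, value) pair, so identical names used in
    different columns stay distinct.
    """
    edges = set()
    for row in rows:
        previous, protocols = None, []
        for column, value in zip(header, row):
            if column == "Protocol REF":
                if value:
                    protocols.append(value)
            elif column in NODE_COLUMNS and value:
                node = (column, value)
                if previous is not None:
                    edges.add((previous, node, tuple(protocols)))
                previous, protocols = node, []
            # Any other column is an attribute and does not affect the graph.
    return edges

# Hypothetical two-channel hybridization: one row per channel.
header = ["Source Name", "Protocol REF", "Extract Name", "Protocol REF",
          "Labeled Extract Name", "Label", "Protocol REF",
          "Hybridization Name", "Array Data File"]
rows = [
    ["liver 1", "extraction", "liver 1 RNA", "labeling",
     "liver 1 RNA Cy3", "Cy3", "hybridization", "hyb 1", "hyb1.txt"],
    ["brain 1", "extraction", "brain 1 RNA", "labeling",
     "brain 1 RNA Cy5", "Cy5", "hybridization", "hyb 1", "hyb1.txt"],
]
edges = sdrf_edges(header, rows)
print(len(edges))
```

Because both rows end in the same hybridization and data file, the final edge is shared and the two channels converge on one node.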

A further set of columns is used to specify the values for the variables ("experimental factors") within the experiment. These Factor Value[] columns reference the Experimental Factor Names defined in the IDF, and should be placed after the hybridization section (i.e., to the right of it, in or after the scanning, normalization and data section in the image above). The contents of these columns will usually duplicate those in a material Characteristics or a protocol Parameter Value column.
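The link between Factor Value[] columns and the IDF can be checked with a short sketch like the one below. It assumes the SDRF header row and the IDF Experimental Factor Name values are already available as Python lists. The function name unknown_factors and the example values are hypothetical.

```python
import re

# Matches "Factor Value[Tissue]" as well as "Factor Value [Tissue]".
FACTOR_HEADER = re.compile(r"^Factor Value\s*\[(?P<name>[^\]]+)\]")

def unknown_factors(sdrf_header, idf_factor_names):
    """List factor names used in the SDRF header but not defined in the IDF."""
    known = set(idf_factor_names)
    unknown = []
    for column in sdrf_header:
        match = FACTOR_HEADER.match(column)
        if match:
            name = match.group("name").strip()
            if name not in known:
                unknown.append(name)
    return unknown

sdrf_header = ["Source Name", "Characteristics[OrganismPart]",
               "Hybridization Name", "Factor Value[Tissue]",
               "Factor Value [Dose]"]
missing = unknown_factors(sdrf_header, ["Tissue"])
print(missing)
```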

SDRF sections*

Source

Sources are the starting material for the experiment. The section starts with a Source Name column, which will typically be followed by several Characteristics columns and a Material Type column.

Sample

Samples represent steps in the chain of treatments applied to the original Source.

Extract

Extracts refer to the extracted nucleic acid used in the experiment. If you need to represent separate nucleic acid extraction and chromatin immunoprecipitation steps in your SDRF, we recommend that you use two Extract steps.

Labeled Extract

The Labeled Extracts in an experiment are those materials which have been conjugated to a label of some kind, prior to hybridization on an array. A Label column must be included with the Labeled Extract Name column to indicate which label was used.

Assay/Hybridization

The assay/hybridization is a key section in the SDRF, since it connects the "materials" area of the SDRF to the "data" area. Describe assays that use arrays with Hybridization, and assays that do not use arrays with Assay. Note that the values in Assay Name/Hybridization Name columns may be used in Data Matrix files to link columns of data to individual assays/hybridizations.

Scan

If desired, the act of scanning the hybridized array may be represented as a distinct node in the experimental graph, and encoded in the SDRF using Scan Name columns. These columns are optional, but can be useful in cases where e.g. multiple scans have been made of a single hybridized array, but where the data files do not explicitly reflect this. Note that the values in Scan Name columns may be used in Data Matrix files to link columns of data to individual scanning events.

Array Data File

The raw data files generated by an investigation should be listed in an Array Data File column following the Hybridization Name and (optional) Scan Name columns.

Normalization

Similarly to the use of Scan Name columns above, it is possible to represent the act of normalizing your data independently from the listing of data files themselves. This is done using the optional Normalization Name column.

Derived Array Data File

The processed data files which have been derived from the raw data should be listed in a Derived Array Data File column. Note that this generally only applies to processed data arranged into one file per assay/hybridization (or scan, or normalization). If your files contain processed data columns for more than one assay/hybridization, you should reformat these into the MAGE-TAB Data Matrix format and include them instead in a Derived Array Data Matrix File column.

SDRF column headers*

Source Name
This column contains user-defined names for the Source materials. Used as an identifier within the MAGE-TAB document.
The following columns can be used to annotate Source Name columns: Characteristics[], Provider, Material Type, Description, Comment[].
Sample Name
This column contains user-defined names for each Sample material. Used as an identifier within the MAGE-TAB document.
The following columns can be used to annotate Sample Name columns: Characteristics[], Material Type, Description, Comment[].
Extract Name
This column contains user-defined names for each Extract material. Used as an identifier within the MAGE-TAB document.
The following columns can be used to annotate Extract Name columns: Characteristics[], Material Type, Description, Comment[].
Labeled Extract Name
This column contains user-defined names for each Labeled Extract material. Used as an identifier within the MAGE-TAB document.
The following columns can be used to annotate Labeled Extract Name columns: Characteristics[], Material Type, Description, Label, Comment[].
Hybridization Name
This column contains user-defined names for each Hybridization. Used as an identifier within the MAGE-TAB document.
The following columns can be used to annotate Hybridization Name columns
Assay Name
This column contains user-defined names for each Assay. "Assay Name" may be used instead of "Hybridization Name" to identify generic biological assays, such as rtPCR and sequencing. Note that this column should not be used for submission of regular microarray experiments to DOR. All Assay Name columns must be followed by a Technology Type column. Used as an identifier within the MAGE-TAB document.
The following columns can be used to annotate Assay Name columns
Scan Name
This optional column contains user-defined names for each Scan event. Used as an identifier within the MAGE-TAB document.
The following columns can be used to annotate Scan Name columns
Normalization Name
This optional column contains user-defined names for each Normalization event. Used as an identifier within the MAGE-TAB document.
The following columns can be used to annotate Normalization Name columns
Array Data File
This column contains a list of raw data files, one for each row of the SDRF file, linking these data files to their respective hybridizations. The following columns can be used to annotate Array Data File columns
Derived Array Data File
This column contains a list of processed data files, one for each row of the SDRF file, linking these data files to their respective hybridizations. The following columns can be used to annotate Derived Array Data File columns
Array Data Matrix File
This column contains a list of raw data matrix files, where data from multiple hybridizations is stored in a single file, and the data mapped to each hybridization via the Data Matrix format itself. The following columns can be used to annotate Array Data Matrix File columns
Derived Array Data Matrix File
This column contains a list of processed data matrix files, where data from multiple hybridizations is stored in a single file, and the data mapped to each hybridization (or scan, or normalization) via the Data Matrix format itself. The following columns can be used to annotate Derived Array Data Matrix File columns
Image File
This optional column contains a list of image files, one for each row of the SDRF file, linking these image files to their respective hybridizations. Note that DOR does not store image data due to size constraints on the database. If desired, you may use this column to include links to image files stored on your local webserver. The following columns can be used to annotate Image File columns
Array Design REF
This column contains references to the array design used for each hybridization. For DOR submissions this should be an ArrayExpress/DOR accession number, e.g. "A-DORD-1". Array Design REF columns can be annotated with a Term Source REF column, which points to the source of the array design referenced; however, for DOR submissions this should always be ArrayExpress, so this column is in effect ignored.
Protocol REF
This column contains references to Protocol Names defined in the IDF, or accession numbers of protocols already deposited with ArrayExpress/DOR. Protocol REF columns can be annotated with a Term Source REF column, which points to the source of the protocol referenced if it is not contained within the IDF; for DOR submissions this should always be ArrayExpress, and a suitable ArrayExpress Term Source should be defined in the IDF.
Characteristics[<category term>]
Controlled vocabulary term or measurement. Used as an attribute column following Source Name, Sample Name, Extract Name, or Labeled Extract Name. This column contains terms describing each material according to the characteristics category indicated in the column header. For example, a column headed "Characteristics[OrganismPart]" would contain individual OrganismPart terms. These terms may be user-defined (the default), from an external ontology source (indicated using a Term Source REF column), or a measurement (indicated using a Unit[] column). Characteristics[<category term>] columns can therefore be annotated with Unit[] and Term Source REF columns.
Provider
Used as an attribute column following Source Name. A free-text string identifying the organization or person from which the Source was obtained.
Material Type
Controlled vocabulary term. Used as an attribute column following Source Name, Sample Name, Extract Name, or Labeled Extract Name. This column contains terms describing the type of each material. For DOR submissions this term should be an instance of MaterialType from the MGED Ontology. Examples: whole_organism, organism_part, cell, total_RNA. Material Type columns can be annotated with a Term Source REF column, which in this case would point to the ontology (defined in the IDF) from which the Material Type terms are taken (the MGED Ontology in the examples above).
Label
Controlled vocabulary term. Used as an attribute column following Labeled Extract Name. The label compound which is conjugated to an Extract to create the Labeled Extract. For DOR submissions this term should be an instance of LabelCompound from the MGED Ontology. Examples: Cy3, Cy5, biotin, alexa_546. Label columns can be annotated with a Term Source REF column, which in this case would point to the ontology (defined in the IDF) from which the Label terms are taken (the MGED Ontology in the examples above).
Technology Type
Controlled vocabulary term. Used as an attribute column following Assay Name. This column contains terms describing the type of each generic (non-hybridization) assay. Example: high_throughput_sequencing. Technology Type columns can be annotated with a Term Source REF column, which in this case would point to the ontology (defined in the IDF) from which the Technology Type terms are taken.
Factor Value[<experimental factor name>]
Controlled vocabulary term or measurement. This column contains terms describing the experimental factor values (i.e., variables) for each row of the SDRF. The Experimental Factor Name to which it pertains (from the accompanying IDF) should be indicated in the column header. For example, if your IDF defines an Experimental Factor Name "Tissue", you could then use this factor in your SDRF as follows (assuming you had also defined the "Mouse Anatomy" term source in your IDF):
Factor Value[Tissue]	Term Source REF
brain	Mouse Anatomy
kidney	Mouse Anatomy
liver	Mouse Anatomy
intestine	Mouse Anatomy
pancreas	Mouse Anatomy
The terms in the column may be user-defined (the default), from an external ontology source (indicated using a Term Source REF column), or a measurement (indicated using a Unit[] column). In the example above, the column terms would be treated as describing organism parts. For more precise control over the treatment of these terms, the optional form "Factor Value [] ()" is available, e.g. "Factor Value [growth condition EF] (Nutrients)".
Performer
Used as an attribute column following Protocol REF. The name of the researcher or center that carried out the protocol. For a sequencing protocol, this is used as the run center name in the DRA submission.
Date
Used as an attribute column following Protocol REF. The date (and time, where available) upon which the protocol was performed, in the following format: YYYY-MM-DDThh:mm:ssZ (for example, 2008-09-12T16:27:27Z)
Parameter Value[<protocol parameter>]
Used as an attribute column following Protocol REF columns. This column contains values for the protocol parameters referenced in the column header. Parameter Value[] columns can be annotated with Unit[] columns. For example, if a Protocol Name "Array Hybridization" is defined in the accompanying IDF with Protocol Parameters "hyb temp;hyb volume", a Protocol REF column containing "Array Hybridization" may be followed by "Parameter Value[hyb temp]" and "Parameter Value[hyb volume]" columns.
Unit[<unit category>]
Controlled vocabulary term. Used as an attribute column following Characteristics[], Factor Value[] or Parameter Value[]. This column contains terms describing the unit(s) to be applied to the values in the preceding column. The type of unit is included in the column header, e.g. "Unit[TimeUnit]". These unit types should correspond to Unit subclasses from the MGED Ontology. Unit[] columns can be annotated with a Term Source REF column, which in this case would point to the ontology (defined in the IDF) from which the Unit terms are taken.
Description
Used as an attribute column following Source Name, Sample Name, Extract Name, or Labeled Extract Name. A free-text description to be attached to the corresponding material. To be used sparingly, if at all - most annotations should be provided using controlled vocabulary terms, using Characteristics[] columns.
Term Source REF
Used as an attribute column following any controlled vocabulary column (e.g., Characteristics[]), or any column allowing reference to external entities (e.g., Protocol REF). This column contains references to ontology or database Term Sources defined in the IDF, and from which the values in the previous column were taken. Term Source REF columns can be annotated with Term Accession Number columns.
Term Accession Number
Used as an attribute column following Term Source REF columns. This column contains the accession numbers from the term source used to identify the ontology or database terms in question. For example:
Source Name	Characteristics [DiseaseState]	Term Source REF	Term Accession Number
Sample 1	acute lymphocytic leukemia	NCI Metathesaurus	C0023449
(This example relies on the "NCI Metathesaurus" Term Source having been pre-defined in the IDF accompanying the SDRF.)
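As a rough illustration of how a reader of the file might use these columns, the sketch below pairs each controlled vocabulary value with the Term Source REF and Term Accession Number that follow it. The function name annotated_terms and the SDRF_TEXT fragment are hypothetical; the fragment simply restates the example above in tab-delimited form.

```python
import csv
import io

# Hypothetical SDRF fragment restating the example above (tab-delimited).
SDRF_TEXT = (
    "Source Name\tCharacteristics [DiseaseState]\t"
    "Term Source REF\tTerm Accession Number\n"
    "Sample 1\tacute lymphocytic leukemia\tNCI Metathesaurus\tC0023449\n"
)


def annotated_terms(sdrf_text):
    """Return (column, term, source, accession) for every Term Source REF column.

    The term is taken from the column immediately to the left of
    Term Source REF; the accession from the column immediately to the
    right, if that column is a Term Accession Number.
    """
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    results = []
    for row in body:
        for i, name in enumerate(header):
            if name != "Term Source REF" or i == 0:
                continue
            has_accession = (
                i + 1 < len(header) and header[i + 1] == "Term Accession Number"
            )
            accession = row[i + 1] if has_accession else None
            results.append((header[i - 1], row[i - 1], row[i], accession))
    return results


for column, term, source, accession in annotated_terms(SDRF_TEXT):
    print(column, term, source, accession, sep=" | ")
```

The lookup is purely positional, which is why the Term Source REF column has to sit immediately to the right of the term it qualifies.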
Comment[<comment name>]
This column can be used to annotate the main graph node and edge columns listed above. It is included as an extensibility mechanism, and should not generally be used to encode meaningful biological annotation. The column header should contain a name for the type of values included in the column.

SDRF examples*

SDRF examples

Summary of column headers in SDRF*

The "Name" and "File" node columns are linked by "Protocol REF" columns which represent the graph edges (Protocol REF is the only type of edge possible). Furthermore, each node and edge column may be associated with one or more attribute columns containing annotation, e.g., "Source Name" may be associated with "Provider", and "Parameter Value[]" with "Unit[]". In each case the attribute column follows immediately after the respective node or edge column. Similarly, where ontology terms are used, a "Term Source REF" column should follow immediately to the right of the column containing the actual ontology terms (see example).
The tables below summarize which columns can follow each node or edge column, and which attribute columns may in turn be used to annotate other attribute columns:

SDRF: Association of labels to identifiers - Node and Edge columns
Node/Edge | Associated nodes/attributes
Source Name | Characteristics, Provider, Material Type, Description, Comment
Sample Name | Characteristics, Material Type, Description, Comment
Extract Name | Characteristics, Material Type, Description, Comment
Labeled Extract Name | Characteristics, Material Type, Description, Label, Comment
Hybridization Name | Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Array Design File / REF, Technology Type, Comment
Assay Name | Technology Type, Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Array Design File / REF, Comment
Scan Name | Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Comment
Normalization Name | Derived Array Data File, Derived Array Data Matrix File, Comment
Array Data File | Comment
Derived Array Data File | Comment
Array Data Matrix File | Comment
Derived Array Data Matrix File | Comment
Image File | Comment
Array Design File / REF | Term Source REF, Comment
Protocol REF | Term Source REF, Parameter Value, Performer, Date, Comment
SDRF: Association of labels to identifiers - Attribute columns
Attribute | Associated attributes
Characteristics[] | Unit, Term Source REF
Provider | Comment
Material Type | Term Source REF
Technology Type | Term Source REF
Label | Term Source REF
Factor Value[]() | Unit, Term Source REF
Performer | Comment
Date |
Parameter Value[] | Unit, Comment, Term Source REF
Unit[] | Term Source REF
Description |
Term Source REF | Term Accession Number
Term Accession Number |
Comment[] |
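The associations summarized above can be treated as a simple lookup from a column to the columns allowed to follow it. The sketch below is one possible encoding, not an official validator; the names ALLOWED_FOLLOWERS, base_name and can_annotate are hypothetical, "Array Design File / REF" is expanded into its two column names, and the Protocol REF entry assumes that "Parameter" refers to Parameter Value[] columns.

```python
# Which columns may follow which, transcribed from the summary tables above.
_MATERIAL = {"Characteristics", "Material Type", "Description", "Comment"}
_DESIGN = {"Array Design File", "Array Design REF"}
_RAW = {"Array Data File", "Array Data Matrix File"}
_DERIVED = {"Derived Array Data File", "Derived Array Data Matrix File"}

ALLOWED_FOLLOWERS = {
    "Source Name": _MATERIAL | {"Provider"},
    "Sample Name": _MATERIAL,
    "Extract Name": _MATERIAL,
    "Labeled Extract Name": _MATERIAL | {"Label"},
    "Hybridization Name": _RAW | _DERIVED | _DESIGN | {"Technology Type", "Comment"},
    "Assay Name": _RAW | _DERIVED | _DESIGN | {"Technology Type", "Comment"},
    "Scan Name": _RAW | _DERIVED | {"Comment"},
    "Normalization Name": _DERIVED | {"Comment"},
    "Array Data File": {"Comment"},
    "Derived Array Data File": {"Comment"},
    "Array Data Matrix File": {"Comment"},
    "Derived Array Data Matrix File": {"Comment"},
    "Image File": {"Comment"},
    "Array Design File": {"Term Source REF", "Comment"},
    "Array Design REF": {"Term Source REF", "Comment"},
    "Protocol REF": {"Term Source REF", "Parameter Value", "Performer", "Date", "Comment"},
    "Characteristics": {"Unit", "Term Source REF"},
    "Provider": {"Comment"},
    "Material Type": {"Term Source REF"},
    "Technology Type": {"Term Source REF"},
    "Label": {"Term Source REF"},
    "Factor Value": {"Unit", "Term Source REF"},
    "Performer": {"Comment"},
    "Parameter Value": {"Unit", "Comment", "Term Source REF"},
    "Unit": {"Term Source REF"},
    "Term Source REF": {"Term Accession Number"},
}


def base_name(header):
    """Drop any bracketed qualifier: 'Characteristics[Age]' -> 'Characteristics'."""
    return header.split("[", 1)[0].strip()


def can_annotate(attribute_header, parent_header):
    """True if attribute_header is listed as an allowed follower of parent_header."""
    allowed = ALLOWED_FOLLOWERS.get(base_name(parent_header), set())
    return base_name(attribute_header) in allowed


print(can_annotate("Unit[TimeUnit]", "Parameter Value[hyb time]"))
print(can_annotate("Provider", "Sample Name"))
```

Columns with no entry in the lookup (Date, Description, Term Accession Number, Comment[]) are treated as accepting no further attributes.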

Ordering and Cardinality of SDRF column headers*

Element column headers in the SDRF, except for Protocol REF, must occur in the following order and with the following cardinalities. The attributes of an element or of another attribute must follow the attributed element or attribute without any intervening element or attribute. When an element or attribute has more than one attribute, there is no ordering defined for that set, except:

  • Factor Value: must occur after all element nodes and the attributes of those element nodes.
  • Comment: must immediately follow either the element or attribute node for which it is a Comment, or another such Comment. This permits an unambiguous association of a Comment with the element or attribute on which it comments.
  • Term Source REF: must immediately follow the ontology term for which it provides the source reference. This permits an unambiguous association of the Term Source REF to the ontology term.
Ordering and cardinality of column types in the SDRF
Element Nodes and Factor Values | Cardinality | Notes
Source Name | 0..1 |
Sample Name | 0..* |
Extract Name | 0..* |
Labeled Extract Name | 0..1 |
Hybridization Name | 0..1 | Either Hybridization Name or Assay Name can be present, but not both.
Assay Name | 0..1 | Either Assay Name or Hybridization Name can be present, but not both.
Scan Name | 0..* |
Image File | 0..* |
Array Data File | 0..* |
Array Data Matrix File | 0..* |
Normalization Name | 0..* |
Derived Array Data File | 0..* |
Derived Array Data Matrix File | 0..* |
Factor Value | 0..* |
Protocol REF | 0..* |
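The ordering and cardinality rules in the table above lend themselves to a mechanical check of an SDRF header row. The following is a minimal sketch under that reading of the table; check_elements and ELEMENT_ORDER are hypothetical names, attribute columns and Protocol REF are ignored, and the error messages are illustrative only.

```python
from collections import Counter

# (column name, maximum occurrences or None for unbounded), in required order.
ELEMENT_ORDER = [
    ("Source Name", 1),
    ("Sample Name", None),
    ("Extract Name", None),
    ("Labeled Extract Name", 1),
    ("Hybridization Name", 1),
    ("Assay Name", 1),
    ("Scan Name", None),
    ("Image File", None),
    ("Array Data File", None),
    ("Array Data Matrix File", None),
    ("Normalization Name", None),
    ("Derived Array Data File", None),
    ("Derived Array Data Matrix File", None),
    ("Factor Value", None),
]


def base_name(header):
    """Drop any bracketed qualifier: 'Factor Value[dose]' -> 'Factor Value'."""
    return header.split("[", 1)[0].strip()


def check_elements(header):
    """Return a list of ordering and cardinality problems in an SDRF header row."""
    rank = {name: i for i, (name, _) in enumerate(ELEMENT_ORDER)}
    limit = dict(ELEMENT_ORDER)
    # Keep only element columns; attributes and Protocol REF are not ranked.
    seen = [base_name(h) for h in header if base_name(h) in rank]

    errors = []
    for previous, current in zip(seen, seen[1:]):
        if rank[current] < rank[previous]:
            errors.append(f"{current} must not come after {previous}")

    counts = Counter(seen)
    for name, n in counts.items():
        if limit[name] is not None and n > limit[name]:
            errors.append(f"{name} occurs {n} times; at most {limit[name]} allowed")
    if counts["Hybridization Name"] and counts["Assay Name"]:
        errors.append("Hybridization Name and Assay Name cannot both be present")
    return errors


example_header = [
    "Source Name", "Characteristics[Organism]", "Protocol REF", "Extract Name",
    "Protocol REF", "Assay Name", "Technology Type", "Array Data File",
    "Factor Value[dose]",
]
print(check_elements(example_header))
```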
Cardinality of SDRF attributes with respect to their parent element
Attributes - all are optional | Cardinality | Notes
Characteristics | 0..* |
Provider | 0..1 |
Material Type | 0..1 |
Label | 0..1 |
Array Design File | 0..1 |
Array Design REF | 0..1 |
Technology Type | 0..1 | Is an attribute for an Assay Name, but can also be an attribute for a Hybridization Name.
Performer | 0..1 |
Date | 0..1 |
Parameter Value | 0..* |
Unit | 0..1 |
Description | 0..1 |
Term Source REF | 0..1 |
Term Accession Number | 0..1 |
Comment | 0..* |

Instruction for Sequencing Data*

The following items are required in addition to the IDF and SDRF information described above.

IDF*

  • A sequencing protocol (Protocol Type="sequencing") must be provided. This protocol must have a Protocol Hardware value specifying which sequencing instrument was used.
    List of sequencing instruments:
    454 GS, 454 GS 20, 454 GS FLX, 454 GS FLX Titanium, 454 GS Junior, Illumina Genome Analyzer, Illumina Genome Analyzer II, Illumina Genome Analyzer IIx, Illumina HiSeq 2000, Illumina HiSeq 1000, Illumina MiSeq, AB SOLiD System, AB SOLiD System 2.0, AB SOLiD System 3.0, AB SOLiD 4 System, AB SOLiD 4hq System, AB SOLiD PI System, AB SOLiD 5500, AB SOLiD 5500xl, Helicos HeliScope, Complete Genomics, PacBio RS, Ion Torrent PGM, unspecified.
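A submitter could check this requirement before upload with something like the sketch below. It assumes the usual IDF layout (one tag per row, followed by tab-delimited values, one column per protocol); the function name check_sequencing_protocol, the example IDF fragment and its protocol names are hypothetical.

```python
# Instrument names copied from the list above.
SEQUENCING_INSTRUMENTS = {
    "454 GS", "454 GS 20", "454 GS FLX", "454 GS FLX Titanium", "454 GS Junior",
    "Illumina Genome Analyzer", "Illumina Genome Analyzer II",
    "Illumina Genome Analyzer IIx", "Illumina HiSeq 2000", "Illumina HiSeq 1000",
    "Illumina MiSeq", "AB SOLiD System", "AB SOLiD System 2.0",
    "AB SOLiD System 3.0", "AB SOLiD 4 System", "AB SOLiD 4hq System",
    "AB SOLiD PI System", "AB SOLiD 5500", "AB SOLiD 5500xl",
    "Helicos HeliScope", "Complete Genomics", "PacBio RS", "Ion Torrent PGM",
    "unspecified",
}


def check_sequencing_protocol(idf_text):
    """Return problems with the sequencing protocol declared in an IDF."""
    rows = {}
    for line in idf_text.splitlines():
        cells = line.split("\t")
        if cells[0].strip():
            rows[cells[0].strip()] = [c.strip() for c in cells[1:]]

    types = rows.get("Protocol Type", [])
    hardware = rows.get("Protocol Hardware", [])
    sequencing = [i for i, t in enumerate(types) if t == "sequencing"]
    if not sequencing:
        return ["no protocol with Protocol Type 'sequencing'"]

    errors = []
    for i in sequencing:
        value = hardware[i] if i < len(hardware) else ""
        if value not in SEQUENCING_INSTRUMENTS:
            errors.append(f"protocol column {i + 1}: unknown Protocol Hardware {value!r}")
    return errors


# Hypothetical IDF fragment with two protocols; the second is the sequencing one.
EXAMPLE_IDF = (
    "Protocol Name\tP-EXTRACT\tP-SEQ\n"
    "Protocol Type\tnucleic acid extraction\tsequencing\n"
    "Protocol Hardware\t\tIllumina HiSeq 2000\n"
)
print(check_sequencing_protocol(EXAMPLE_IDF))
```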

SDRF*

  • Include Assay Name and Technology Type columns.
  • In the Protocol REF column before the Assay Name column, reference the sequencing protocol in the IDF.
  • The sequencing protocol should have a Performer - this is used as the run center name.
  • Raw files must go in the Array Data File column. In the following Comment[FILE_TYPE] column, select a file format from sff, Illumina_native_qseq, Illumina_native_fastq, SOLiD_native_csfasta, SOLiD_native_qual, Helicos_native. For the necessary raw data files, please see this page.
  • Add these four Comment[] columns after Extract Name in the SDRF to describe how the library was prepared.
    • Comment[LIBRARY_LAYOUT] - either SINGLE or PAIRED. When PAIRED, create the following columns and enter values.
      • Comment[ORIENTATION]
      • Comment[NOMINAL_LENGTH]
      • Comment[NOMINAL_SDEV]
    • Comment[LIBRARY_SOURCE] - one of GENOMIC, TRANSCRIPTOMIC, METAGENOMIC, METATRANSCRIPTOMIC, NON GENOMIC, SYNTHETIC, VIRAL RNA, OTHER.
    • Comment[LIBRARY_STRATEGY] - one of WGS, WXS, RNA-Seq, WCS, CLONE, POOLCLONE, AMPLICON, CLONEEND, FINISHING, ChIP-Seq, MNase-Seq, DNase-Hypersensitivity, Bisulfite-Seq, EST, FL-cDNA, CTS, MRE-Seq, MeDIP-Seq, MBD-Seq, OTHER.
    • Comment[LIBRARY_SELECTION] - one of RANDOM, PCR, RANDOM PCR, RT-PCR, HMPR, MF, CF-S, CF-M, CF-H, CF-T, MSLL, cDNA, ChIP, MNase, DNAse, Hybrid Selection, Reduced Representation, Restriction Digest, 5-methylcytidine antibody, MBD2 protein methyl-CpG binding domain, CAGE, RACE, size fractionation, other, unspecified.
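The library Comment[] values listed above can be checked per SDRF row with a small lookup. This is only a sketch: check_library_comments and example_row are hypothetical names, each row is assumed to be available as a dictionary keyed by column header, and values are compared exactly as spelled in the lists above.

```python
# Allowed values copied from the lists above.
LIBRARY_VOCAB = {
    "Comment[LIBRARY_LAYOUT]": {"SINGLE", "PAIRED"},
    "Comment[LIBRARY_SOURCE]": {
        "GENOMIC", "TRANSCRIPTOMIC", "METAGENOMIC", "METATRANSCRIPTOMIC",
        "NON GENOMIC", "SYNTHETIC", "VIRAL RNA", "OTHER",
    },
    "Comment[LIBRARY_STRATEGY]": {
        "WGS", "WXS", "RNA-Seq", "WCS", "CLONE", "POOLCLONE", "AMPLICON",
        "CLONEEND", "FINISHING", "ChIP-Seq", "MNase-Seq",
        "DNase-Hypersensitivity", "Bisulfite-Seq", "EST", "FL-cDNA", "CTS",
        "MRE-Seq", "MeDIP-Seq", "MBD-Seq", "OTHER",
    },
    "Comment[LIBRARY_SELECTION]": {
        "RANDOM", "PCR", "RANDOM PCR", "RT-PCR", "HMPR", "MF", "CF-S", "CF-M",
        "CF-H", "CF-T", "MSLL", "cDNA", "ChIP", "MNase", "DNAse",
        "Hybrid Selection", "Reduced Representation", "Restriction Digest",
        "5-methylcytidine antibody", "MBD2 protein methyl-CpG binding domain",
        "CAGE", "RACE", "size fractionation", "other", "unspecified",
    },
}

# Columns that must be filled in only when the layout is PAIRED.
PAIRED_ONLY = [
    "Comment[ORIENTATION]",
    "Comment[NOMINAL_LENGTH]",
    "Comment[NOMINAL_SDEV]",
]


def check_library_comments(row):
    """Return problems with the library Comment[] values of one SDRF row."""
    errors = []
    for column, allowed in LIBRARY_VOCAB.items():
        value = row.get(column, "")
        if value not in allowed:
            errors.append(f"{column}: {value!r} is not an allowed value")
    if row.get("Comment[LIBRARY_LAYOUT]") == "PAIRED":
        for column in PAIRED_ONLY:
            if not row.get(column, "").strip():
                errors.append(f"{column}: required when LIBRARY_LAYOUT is PAIRED")
    return errors


example_row = {
    "Comment[LIBRARY_LAYOUT]": "SINGLE",
    "Comment[LIBRARY_SOURCE]": "TRANSCRIPTOMIC",
    "Comment[LIBRARY_STRATEGY]": "RNA-Seq",
    "Comment[LIBRARY_SELECTION]": "cDNA",
}
print(check_library_comments(example_row))
```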

Platform Specific Attributes*

Include the following attributes as Comment[] columns after Assay Name in the SDRF: