BioProject Handbook 20150408

Created: March 6, 2015; Last updated: May 31, 2016

    BioProject

    Overview

    Purpose

    New sequencing technologies have significantly increased the volume of data that can be generated. Coupled with this, research is increasingly collaborative and data resulting from collaboration can include distinct types of data which may be submitted by more than one research group to more than one archival database.

    The BioProject resource organizes both the projects and the data from those projects which is deposited into several archival databases maintained by members of the INSDC. This allows searching by characteristics of these projects, using the project description and project content across the INSDC-associated databases.

    Overview of BioSample and BioProject integration with other DDBJ databases

    Project

    The definition of a 'project', a set of related data, is very flexible: it supports defining a complex project as well as various distinct sub-projects using different parameters.

    For example, BioProject records can be established for:

    • Genome sequencing and assembly
    • Metagenomes
    • Transcriptome sequencing and expression
    • Targeted locus sequencing
    • Genetic or RH Maps
    • Epigenetics
    • Phenotype or Genotype
    • Variation detection

    BioProject represents a submission, initiative, or group of data that is logically related in some manner, or is of interest to retrieve as a distinct dataset. A project may be identified in terms of distinctions in the type of data produced.

    Complex project

    By selecting multiple Project Data Types (for example, "Genome Sequencing" and "Transcriptome or Gene Expression"), multiple studies can be merged into a single project.

    For a project spanning multiple species, enter a taxonomic classification common to those species (e.g., the genus name).

    For Sample scope, Material and Capture, select "Other" if none of the available options is appropriate.

    A series of publications can be listed under Publication.

    Primary and Umbrella projects

    There are two basic types of projects: primary and umbrella projects.

    Primary project:

    Submitted projects which are intended to represent and be linked to current or future data submissions. Primary projects can be kept private.

    Umbrella project:

    Administrative project that is created to group multiple projects that are related by a single effort from a single submitter or group of submitters. Umbrella projects may be created automatically using a rule-based logic or may be created by database staff upon request or upon identification of a needed grouping. Umbrella projects cannot be kept private.

    Umbrella projects exist to provide an organizational structure to a large collaborative project and to group projects that are related via funding or submitting source or collaboration. Submitted primary projects are linked to data as it is submitted, and are linked to one or many umbrella projects. Submitted primary projects are not directly linked to other primary projects; they are linked indirectly by way of links to the umbrella project.

    Nucleotide sequence data cannot refer directly to an umbrella project. Sequence data are linked to an umbrella project via a primary project.

    BioProject hierarchy

    Definition of an umbrella project may be done in collaboration with a funding source. For example, there may be a top-most administrative project to represent the overarching initiative ("Genome Support Project"), with a secondary layer of primary projects defining core components of this initiative (reference genomes, rRNA sequencing, metagenomes, etc.).

    Some large initiatives are represented by more than one layer of umbrella projects (see Figure B below); for instance, a top-most level may identify the largest definition of the collaboration; a second level of umbrella projects identify the primary categories of data production; and finally a third layer represents the projects that actually generate the data that is submitted. The Human Microbiome project is an example of this type of complex hierarchy where the top-most project, PRJNA43021, represents the most inclusive definition of the initiative, and a secondary level (such as PRJNA28331) identifies a major sub-project to sequence multiple reference genomes each of which has a distinct project accession.

    Schematic diagram of BioProject hierarchies. (A)Two layers. (B)Three layers.
    Two layers (A)

    Initiatives may be organized as a single Umbrella project with one or many submitted projects that are connected to data. Example: Neanderthal Metagenome.

    Three layers (B)

    Very large initiatives which have distinct sub-projects may have two levels of Umbrella project. For example, a top-level Umbrella project groups all components of the initiative; mid-level Umbrella projects reflect two distinct branches of the project (such as sequencing vs. epigenetics); and several primary projects denote distinct project data types (e.g., genome sequencing, transcriptome, epigenetics, etc.). Example: NIH Human Microbiome Project (HMP) Roadmap Project.
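
    The layering described above can be modelled as a simple parent map. The following Python sketch is illustrative only. The two HMP accessions come from the text; `PRJ_PRIMARY_EXAMPLE` is a hypothetical placeholder for a data-generating primary project, and the `parent_of` mapping and `umbrella_chain` helper are assumptions, not part of any BioProject interface. For brevity it gives each project a single parent, although a primary project may be linked to several umbrellas.

```python
# Hypothetical parent links: child accession -> parent umbrella accession.
# PRJNA43021 and PRJNA28331 are the HMP examples from the text;
# PRJ_PRIMARY_EXAMPLE is a made-up placeholder for a primary project.
parent_of = {
    "PRJNA28331": "PRJNA43021",           # mid-level umbrella -> top-level umbrella
    "PRJ_PRIMARY_EXAMPLE": "PRJNA28331",  # primary project -> mid-level umbrella
}


def umbrella_chain(accession, links):
    """Return the umbrellas above a project, nearest first."""
    chain = []
    seen = {accession}  # guards against accidental cycles in the map
    current = links.get(accession)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = links.get(current)
    return chain


print(umbrella_chain("PRJ_PRIMARY_EXAMPLE", parent_of))
```

    A chain of length one corresponds to the two-layer case (A), and a chain of length two to the three-layer case (B).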

    Data release

    Triggering of data release between primary projects and data records.

    You can "immediately release" or "hold" the registered primary project.

    The submitted primary project data can be kept private until the linked DDBJ, DRA, DTA and DOR records are made public. A hold date cannot be specified for the project data. Primary project data are automatically released when a linked DDBJ record is published. On the other hand, publication of the primary project data does not cause automatic release of the linked DDBJ record(s). Thus, under a primary project, publication of one data record does not cause the indirect release of the other records. Publication of the DDBJ records is independent of the release of the linked project(s).
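
    The release rules above work in one direction only, which a small model can make explicit. The sketch below is an illustration in Python under stated assumptions: the `Record` and `PrimaryProject` classes, the function names, and the accession `PRJDB_EXAMPLE` are all hypothetical and do not correspond to any DDBJ software interface.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A data record (for example a DDBJ or DRA entry) citing a project."""
    name: str
    public: bool = False


@dataclass
class PrimaryProject:
    accession: str
    public: bool = False
    records: list = field(default_factory=list)


def publish_record(project, record):
    """Publishing a record also releases the project it cites."""
    record.public = True
    project.public = True  # automatic release of the linked project


def release_project(project):
    """Releasing a project leaves its linked records untouched."""
    project.public = True


project = PrimaryProject("PRJDB_EXAMPLE", records=[Record("rec1"), Record("rec2")])
publish_record(project, project.records[0])
print(project.public)                        # the project is now public
print([r.public for r in project.records])   # only rec1 is public
```

    Publishing `rec1` releases the project but leaves `rec2` private, which mirrors the statement that one record does not indirectly release the others.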

    FAQ: How are linked BioProject/BioSample/sequence data released?

    Visibility of relationships between a public umbrella and primary projects.

    An umbrella project cannot be kept private. An umbrella project can have both public and private primary projects. The hierarchical relationship between a public umbrella project and an unreleased primary project is not displayed.
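
    As a rough illustration of this visibility rule, the Python sketch below filters the primary projects of a public umbrella down to those that have been released. The function name `visible_children` and the accessions used are hypothetical.

```python
def visible_children(children):
    """Return accessions of primary projects shown under a public umbrella.

    `children` maps a primary project accession to True (released) or
    False (still private). Links to private projects stay hidden.
    """
    return sorted(acc for acc, released in children.items() if released)


# Hypothetical primary projects grouped under one umbrella.
linked = {"PRJDB_A": True, "PRJDB_B": False, "PRJDB_C": True}
print(visible_children(linked))
```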

    Released project data are exchanged with the BioProject databases of the other two INSDC partners, NCBI and EBI.

    Use an umbrella project

    Please group related primary projects by using an umbrella project. An umbrella project can group and present the outputs of a research project.

    NCBI visualizes data under an umbrella project with some statistics for easy navigation.

    Project presentation examples:
    Neanderthal Metagenome
    Escherichia coli O104:H4

    You can submit an umbrella project from the DDBJ submission system in the same way as a primary project. To notify the DDBJ BioProject team, enter "this is an umbrella project" in the Private comments to DDBJ staff. A registered umbrella project cannot be kept private, but some fields can be omitted.

    To group primary projects under an umbrella, please follow the steps below.
    First, submit and release an umbrella project. If necessary, please share the assigned PRJDB number with relevant researchers.
    When submitting related primary projects, please provide the PRJDB number of the parent umbrella in the Linked Project. Released primary projects are automatically linked to the specified umbrella project.

    If you want to add already registered primary projects to the umbrella, please e-mail the PRJDB numbers of the umbrella and the related primary projects to the DDBJ BioProject team.

    Private primary projects are not released by being linked to a public umbrella project.

    Metadata

    Required*
    Conditionally required*

    Submitter

    Submitter

    Contact information of submitter(s). Questions and notifications about a submission are sent to the e-mail address(es) listed here. Personal contact information is considered confidential and is collected to be used by DDBJ BioProject staff should questions arise; the general information about the research center is used for public display.

    First name
    Submitter's first name.
    Last name*
    Submitter's last name.
    E-mail*
    E-mail address. Enter an address from the organization's domain.

    Organization

    Organization to which a contact person belongs.

    Submitting organization*
    Full name of the organization.
    Submitting organization URL
    The URL of submitter's organization.

    Data Release

    Select "Hold" or "Release". You cannot specify a hold date. Please see Release of projects for the detailed release mechanism.

    Hold
    Released concurrently when the DDBJ, DRA, DTA and DOR record(s) citing this ID is released.
    Release
    Release project data immediately. Private DDBJ record(s) citing this ID is not released.

    General info

    Project Description

    An informative paragraph that describes the project and provides informative context for the displayed project record.

    Project title*
    Very short descriptive name of the project, used for captions, labels, etc. in the public display. For example: Chromosome Y sequencing, Global studies of microbial diversity on human skin.
    Description*
    Description (a paragraph) of the project goals and purposes. Provide enough information (more than 100 characters) in the description for other users to interpret the data.
    Private comments to DDBJ staff
    Use this field if you have questions for database support staff. The content is not made public. If you intend to submit an umbrella project, please tell us that "this is an umbrella project".
    Relevance
    Select the primary general relevance of the study.
    Relevance      Description
    Agricultural
    Medical
    Industrial     Could include bio-remediation, bio-fuels and other areas of research where there are areas of mass production.
    Environmental
    Evolution
    ModelOrganism
    Other          Unspecified major impact categories to be defined in the "Relevance description".
    Relevance description*
    Describe the relevance when "Other" is selected.
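
    The required and conditionally required fields of this section can be checked before submission. The Python sketch below is only an illustration: the dictionary keys (`project_title`, `description`, `relevance`, `relevance_description`) and the function name are assumptions chosen for this example, not names used by the DDBJ submission system.

```python
def check_general_info(info):
    """Return a list of problems found in a dict of General info fields."""
    problems = []
    if not info.get("project_title", "").strip():
        problems.append("Project title is required")
    # The handbook asks for more than 100 characters of description.
    if len(info.get("description", "").strip()) <= 100:
        problems.append("Description should be longer than 100 characters")
    # Relevance description is conditionally required.
    needs_relevance_text = info.get("relevance") == "Other"
    if needs_relevance_text and not info.get("relevance_description", "").strip():
        problems.append("Relevance description is required when Relevance is Other")
    return problems


draft = {
    "project_title": "Chromosome Y sequencing",
    "description": "Too short.",
    "relevance": "Other",
}
for problem in check_general_info(draft):
    print(problem)
```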

    Umbrella BioProject

    If you are registering a project that is part of an initiative which is already registered in the BioProject database, then please tell us the existing BioProject accession number and provide a general description of the larger initiative. This information is needed for project linking.

    Initiative description*
    Description of an initiative.
    Umbrella BioProject accession*
    A BioProject accession number of an initiative which is already registered in the BioProject database.

    A URL may be provided, with a label for the resource, to reference a resource that is directly relevant to the submitted project.

    Link description
    Display name of web site that is related to this study.
    URL
    URL of web site that is related to this study.

    Grants

    Funding information for a project.

    Agency
    Name of funding agency. For example: Japan Society for the Promotion of Science.
    Agency abbreviation
    Abbreviation of funding agency. For example: JSPS.
    Grant ID
    Grant number is collected to support searches (e.g., publications often cite Grant numbers). For example: JSPS KAKENHI Grant Number 12345678.
    Grant title
    Grant title may also support searches.

    Consortium

    Consortium name
    If study is carried out as part of a consortium, provide the consortium name.
    Consortium URL
    If the consortium maintains a web site, provide the URL.

    Project type

    Project data type

    Project data type*

    A general label indicating the primary study goal. Select all appropriate types; a BioProject record can have multiple project data types.

    NCBI individually assigns the Project data type based on the experimental data linked to the project. This type is not used by EBI.

    Project data type                 Description
    Genome Sequencing                 whole, or partial, genome sequencing project (with or without a genome assembly)
    Clone Ends                        clone-end sequencing project
    Epigenomics                       DNA methylation, histone modification, chromatin accessibility datasets
    Exome                             exome resequencing project
    Map                               project that results in non-sequence map data such as genetic map, radiation hybrid map, cytogenetic map, optical map, etc.
    Metagenome                        sequence analysis of environmental samples
    Phenotype and Genotype            project correlating phenotype and genotype
    Proteome                          large scale proteomics experiment including mass spec. analysis
    Random Survey                     sequence generated from a random sampling of the collected sample; not intended to be comprehensive sampling of the material.
    Targeted Locus (Loci)             project to sequence specific loci, such as a 16S rRNA sequencing
    Transcriptome or Gene Expression  large scale RNA sequencing or expression analysis. Includes cDNA, EST, RNA_seq, and microarray.
    Variation                         project with a primary goal of identifying large or small sequence variation across populations.
    Other                             a free text description is provided to indicate Other data type
    Project data type description*
    Describe the project data type when "Other" is selected.

    Sample scope/Material/Capture/Methodology

    Sample scope*
    The scope and purity of the biological sample used for the study.
    Sample scope   Description
    Monoisolate    A single animal, cultured cell-line, inbred population (or possibly a heterogeneous population when a single genome assembly is generated from the pooled sample; not preferred).
    Multiisolate   Multiple individuals, a population (representation of a species).
    Multi-species  Sample represents multiple species.
    Environment    Species content of the sample is not known.
    Synthetic      Sample is synthetically created by a machine.
    Other          Specify the sample scope that was used in the "Target description".
    Material*
    The type of material that is isolated from the sample for use in the experimental study.
    Material        Description
    Genome          A whole genome initiative. May be only the nuclear genome. Use for DNA of a metagenome sample.
    Partial Genome  One or more chromosomes or replicons were experimentally purified.
    Transcriptome   Transcript data.
    Reagent         Material studied was obtained by chemical reaction, precipitation.
    Proteome        Protein or peptide data.
    Phenotype       Phenotypic descriptive data.
    Other           Specify the material that was used in the "Target description".
    Capture*
    The scale, or type, of information that the study is designed to generate from the sample material.
    Capture              Description
    Whole                The project makes use of the whole sample material (most common case).
    Clone Ends           Capturing clone end data.
    Exome                Capturing exon-specific data.
    Targeted Locus/Loci  Capturing specific loci (gene, genomic region, barcode standard).
    Random Survey        Not using whole sample, an incomplete survey of the sample.
    Other                Specify the scale or type of the captured material in the "Target description".
    Target description*
    Describe the Sample scope, Material or Capture when "Other" is selected for any of them.
    Methodology*
    The core experimental approach used to obtain the data that is submitted to archival databases.
    Methodology        Description
    Sequencing         Sequencing using Sanger, 454, Illumina, etc wit
    Array              Data/Sequence are generated by hybridization arrays.
    Mass Spectroscopy  Data are generated by mass spectroscopy.
    Other              Please provide data description in the "Methodology description".
    Methodology description*
    Describe the methodology type when "Other" is selected.
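
    The two conditionally required free-text fields in this section depend on whether "Other" was chosen in the related menus. The following Python sketch shows that dependency under stated assumptions: the dictionary keys and the function name are hypothetical and serve only to illustrate the rule.

```python
def missing_other_descriptions(target):
    """List the free-text fields that become required because "Other" was chosen."""
    missing = []
    # "Target description" covers Sample scope, Material and Capture together.
    needs_target = any(
        target.get(key) == "Other" for key in ("sample_scope", "material", "capture")
    )
    if needs_target and not target.get("target_description"):
        missing.append("Target description")
    if target.get("methodology") == "Other" and not target.get("methodology_description"):
        missing.append("Methodology description")
    return missing


selection = {
    "sample_scope": "Monoisolate",
    "material": "Other",
    "capture": "Whole",
    "methodology": "Sequencing",
}
print(missing_other_descriptions(selection))
```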

    Objective

    Project goals with respect to the type of data that will be generated and submitted to an INSDC-associated database. Select all relevant menu options.

    Objective*
    Project goals with respect to the type of data that will be generated and submitted to an INSDC-associated database. Select all relevant menu options.
    Objective           Description
    Raw Sequence Reads  Submission of raw sequencing information as it comes out of machine.
    Sequence            Sequence which is not raw, meaning processed (clipped, matepaired, oriented).
    Analysis            Higher level interpretation of the data.
    Assembly            Experiment will result in assemblies (genome or transcriptome).
    Annotation          Experiment will result in Annotation.
    Variation           Submission of variations.
    Epigenetic Markers  Experiment will result in Epigenetic markers.
    Expression          Submission of gene expression.
    Maps                Experiment will result in cytogenetic, physical, RH, etc. maps.
    Phenotype           Experiment will deliver phenotypes.
    Other

    Locus tag prefix

    Locus tag prefix*
    Locus tag prefix generation box will appear when [Project data type="Genome Sequencing" or "Metagenome"] AND [Capture="Whole"] AND [Objective="Sequence" or "Annotation" or "Assembly"].

    Registration of a unique locus tag prefix is required for studies that result in genome assemblies. For a WGS-only submission that does not need a prefix, please leave the prefix box empty.
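
    The display condition for the prefix box combines three selections. As a minimal sketch, assuming the menu values are passed as plain strings, it can be written in Python as follows; the function name `locus_tag_box_shown` is hypothetical.

```python
def locus_tag_box_shown(data_types, capture, objectives):
    """Mirror the handbook's rule for when the prefix box is displayed."""
    type_ok = bool({"Genome Sequencing", "Metagenome"} & set(data_types))
    capture_ok = capture == "Whole"
    objective_ok = bool({"Sequence", "Annotation", "Assembly"} & set(objectives))
    return type_ok and capture_ok and objective_ok


print(locus_tag_box_shown(["Genome Sequencing"], "Whole", ["Assembly"]))
print(locus_tag_box_shown(["Genome Sequencing"], "Exome", ["Assembly"]))
```

    Lists are used for the data types and objectives because both menus allow multiple selections.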

    Locus tag prefix guideline.

    Locus tag prefix format
    The locus_tag prefix can contain only alphanumeric characters and must be at least 3 characters long. It should start with a letter, but numerals can be in the second position or later in the string (e.g., A1C). There should be no symbols, such as -_*, in the prefix. The locus_tag prefix is separated from the tag value by an underscore ‘_’, e.g., A1C_00001.

    DDBJ BioProject limits the maximum tag length to 12 characters. In the BioProject submission system, the locus tag is displayed in capital letters. However, the tag is reserved in a case-insensitive manner.
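
    The format rules above lend themselves to a quick local check before submitting. The Python sketch below is one possible reading of them: it treats the 12-character limit as applying to the prefix and treats "should start with a letter" as mandatory, both of which are assumptions. The function names are hypothetical, and uniqueness can only be confirmed by the submission system.

```python
import re

# 3 to 12 ASCII letters or digits, starting with a letter.
# Applying the 12-character limit to the prefix is an assumption.
PREFIX_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}")


def normalize_prefix(prefix):
    """Return the upper-cased prefix if it follows the rules, else None."""
    if PREFIX_PATTERN.fullmatch(prefix) is None:
        return None
    # Prefixes are reserved case-insensitively and displayed in capitals.
    return prefix.upper()


def make_locus_tag(prefix, number, width=5):
    """Join prefix and tag value with an underscore, as in A1C_00001."""
    normalized = normalize_prefix(prefix)
    if normalized is None:
        raise ValueError(f"invalid locus tag prefix: {prefix!r}")
    return f"{normalized}_{number:0{width}d}"


print(normalize_prefix("a1c"))
print(make_locus_tag("A1C", 1))
```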

    Target

    Organism information

    Taxonomy and description of target organism.

    Organism name*

    Organism name in the Taxonomy database. Unclassified sequences, including metagenome and environmental samples, may be found here.

    For a project spanning multiple species, enter a taxonomic classification common to those species (e.g., the genus name).

    If you intend to submit an unregistered novel organism, please provide the detailed organism information in the Description of novel organism and the proposed organism name in the Organism Name.

    Taxonomy ID
    NCBI Taxonomy ID
    Strain, breed, cultivar
    Microbial strain name, or eukaryotic breed or cultivar name. Please provide this or "Isolate name or label"
    Isolate name or label
    A label for an isolated sample, or name of an individual animal (e.g., Clint). Please provide this or "Strain, breed, cultivar".
    Description
    A brief description, to elaborate upon the brief label.
    Description of novel organism
    Enter necessary information to register an organism to the taxonomy database.

    Environmental sample information

    This section appears instead of the Organism information when the Sample scope="Environment" in the Target.

    Environmental sample name*
    Unclassified sequences, including metagenome and environmental samples, may be found here. If no appropriate name is found, enter the novel name you propose and give details of the sample in the Environmental sample description.
    Environmental sample description
    Describe details of sample information.

    General Properties

    General properties of target organism.

    Cellularity
    Select a cellularity.
    Cellularity
    Unicellular
    Multicellular
    Colonial
    Reproduction
    Select a Reproduction.
    Reproduction
    Sexual
    Asexual
    Haploid genome size
    Haploid genome size in Kb, Mb or cM.
    Ploidy
    Select a Ploidy.
    Ploidy
    Haploid
    Diploid
    Polyploid
    Allopolyploid

    Organism Replicons

    Describe how many replicons this organism has, how they are named (e.g., 1, 2, 3 vs. I, II, III), the replicon type (chromosome etc.), and the subcellular structure that the replicon is located in.

    Name
    The preferred standard for the replicon name.
    Type
    Select a replicon type.
    Replicon type
    Chromosome
    Plasmid
    Linkage Group
    Segment
    Other
    Location
    The subcellular location of the replicon, for instance the nucleus or a differentiated organelle. Please select "Nuclear or Prokaryote" for the chromosomes of eukaryotes, bacteria or archaea.
    Location
    Nuclear or Prokaryote
    Macronuclear
    Nucleomorph
    Mitochondrion
    Kinetoplast
    Chloroplast
    Chromoplast
    Plastid
    Virion or Phage
    Proviral or Prophage
    Viroid
    Extrachrom
    Cyanelle
    Apicoplast
    Leucoplast
    Proplastid
    Hydrogenosome
    Chromatophore
    Other
    Size
    The size and unit of measurement for the estimated genome size.
    Description
    A description of any unusual features of the replicon.

    Phenotype

    Phenotype of target organism.

    Disease
    Enter a disease name.
    Biotic Relationship
    Select a BioticRelationship.
    BioticRelationship
    FreeLiving
    Commensal
    Symbiont
    Episymbiont
    Intracellular
    Parasite
    Host
    Endosymbiont
    Trophic Level
    Select a TrophicLevel.
    TrophicLevel
    Autotroph
    Heterotroph
    Mixotroph

    Prokaryote Morphology

    When the target organism is prokaryote, please describe the general morphology if known.

    Shape
    Select all relevant menu options.
    Shape         Description
    Bacilli       rod-shaped
    Cocci         spherical-shaped
    Spirilla      spiral-shaped
    Coccobacilli  elongated coccal form
    Filamentous   filament-shaped (bacilli that occur in long threads)
    Vibrios       vibrio-shaped (short, slightly curved rods)
    Fusobacteria  fusiform or spindle-shaped (rods with tapered ends)
    SquareShaped
    CurvedShaped
    Tailed
    Gram
    Choose gram positive or negative.
    Gram
    Positive
    Negative
    Motility
    Choose a Motility.
    Motility
    Yes
    No
    Enveloped
    Choose enveloped or not.
    Enveloped
    Yes
    No
    Endospores
    Choose whether the target bacterium forms endospores.
    Endospores
    Yes
    No

    Ecological Environment

    The general habitat for any organism. Please indicate additional extremophile parameters if known.

    Habitat
    Choose a Habitat.
    Habitat
    HostAssociated
    Aquatic
    Terrestrial
    Specialized
    Multiple
    Unknown
    Salinity
    Choose a Salinity.
    Salinity
    NonHalophilic
    Mesophilic
    ModerateHalophilic
    ExtremeHalophilic
    Unknown
    Oxygen requirement
    Choose an Oxygen requirement.
    OxygenReq
    Aerobic
    Microaerophilic
    Facultative
    Anaerobic
    Unknown
    Temperature range
    Choose a temperature range.
    TemperatureRange
    Cryophilic
    Psychrophilic
    Mesophilic
    Thermophilic
    Hyperthermophilic
    Unknown
    Optimum Temperature
    Optimum temperature in Celsius.

    Publication

    PubMed ID
    The PubMed ID(s) will be used to populate the publication information.
    <Publication id="15557739">
    	<DbType>ePubmed</DbType>
    </Publication>
    <ProjectReleaseDate> ...
    
    DOI
    Provide a DOI if a PubMed ID is not available. Provide the additional reference information.
    <Publication id="10.1093/nar/gku1120">
    	<DbType>eDOI</DbType>
    </Publication>
    <ProjectReleaseDate> ...
    
    Reference title*
    A title of reference.
    Journal title*
    A title of journal.
    Year*
    Publication year.
    Volume*
    Journal volume.
    Issue*
    Journal issue.
    Pages from*
    Reference start page.
    Pages to*
    Reference end page.
    First name*
    First name of author.
    MI
    Middle initial.
    Last name*
    Last name of author.
    Suffix
    Suffix for author.
    This publication has multiple authors
    If this is checked, then "et al" is added to the author name provided above.
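
    The `<Publication>` fragments shown above for PubMed ID and DOI can be produced programmatically. The Python sketch below builds only the two elements that appear in this handbook; the function name `publication_element` is hypothetical, and the full BioProject XML schema may require further elements, so treat it as an illustration rather than a complete record.

```python
import xml.etree.ElementTree as ET


def publication_element(identifier, db_type):
    """Build a <Publication> element like the fragments shown in the handbook."""
    # Only the two DbType values that appear in the handbook are accepted here.
    if db_type not in ("ePubmed", "eDOI"):
        raise ValueError("db_type must be 'ePubmed' or 'eDOI'")
    pub = ET.Element("Publication", {"id": identifier})
    ET.SubElement(pub, "DbType").text = db_type
    return pub


print(ET.tostring(publication_element("15557739", "ePubmed"), encoding="unicode"))
print(ET.tostring(publication_element("10.1093/nar/gku1120", "eDOI"), encoding="unicode"))
```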

    XML schema

    ftp://ftp.ddbj.nig.ac.jp/ddbj_database/bioproject/schema/

    Submission to BioProject

    Data submission of human subjects research
    For all data from human subjects research submitted to DDBJ, it is the submitter's responsibility to ensure that the privacy of participants (human subjects) is protected in accordance with all applicable laws, regulations and policies of the submitter's institution.
    In principle, make sure to remove any direct personal identifiers of human subjects from your submissions.
    Before submitting data from human subjects research, read the "Data submission of human subjects research".

    Submission to BioProject

    Cases requiring project registration

    Registration for a BioProject accession is required in the following cases.

    • submit sequencing data to DRA
    • submit genome sequences to DDBJ

    Registration for a BioProject accession is encouraged in the following cases.

    • projects that result in a very large volume of data submissions
    • submissions from multiple members of a collaboration
    • submissions to multiple archival databases

    Registration for a BioProject accession is not required in the following cases. Register an accession if necessary.

    • small datasets for which the results are found in one (or a small number) of accession numbers such as a single plasmid, viral or organelle genome sequencing study

    A BioProject accession is required for submissions to the Sequence Read Archive, and for microbial and eukaryotic genome submissions to the DDBJ. If you obtain a BioProject accession from DDBJ, please submit your related biological data to the DDBJ, Sequence Read Archive and Trace Archive.
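
    The three categories above can be summarized as a small decision function. This Python sketch is only a reading aid: the parameter names are hypothetical, and borderline cases should be decided from the lists themselves or by asking DDBJ staff.

```python
def registration_requirement(to_dra=False, genome_to_ddbj=False,
                             large_volume=False, multiple_submitters=False,
                             multiple_archives=False):
    """Return 'required', 'encouraged' or 'not required' for a planned submission."""
    # Sequencing data to DRA or genome sequences to DDBJ always need a project.
    if to_dra or genome_to_ddbj:
        return "required"
    # Large, collaborative or multi-archive efforts benefit from one.
    if large_volume or multiple_submitters or multiple_archives:
        return "encouraged"
    # For example, a single plasmid, viral or organelle genome.
    return "not required"


print(registration_requirement(to_dra=True))
print(registration_requirement(multiple_archives=True))
print(registration_requirement())
```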

    The INSDC has stopped assigning strain-level taxonomy IDs to microbes whose genomes have been submitted to the INSDC.

    Submit a new BioProject submission

    Obtain a submission account as described in the Account Handbook.

    Submit a new project by clicking [Submit new Project].

    Submit a new BioProject

    To submit a BioProject, enter content in the tabs from left to right. Submitter information is copied from the account information.

    For BioProject metadata, please see the Metadata section.

    Enter project content

    To submit genome assemblies to DDBJ, a unique Locus tag prefix is necessary.

    Locus tag prefix generation box will appear when [Project data type="Genome Sequencing" or "Metagenome"] AND [Capture="Whole"] AND [Objective="Sequence" or "Annotation" or "Assembly"]. Registration of a unique locus tag prefix is required for studies that result in genome assemblies.

    The locus_tag prefix can contain only alphanumeric characters and must be at least 3 characters long. It should start with a letter, but numerals can be in the second position or later in the string (e.g., A1C). There should be no symbols, such as -_*, in the prefix. The locus_tag prefix is separated from the tag value by an underscore ‘_’, e.g., A1C_00001.

    For a WGS-only submission that does not need a prefix, please leave the prefix box empty.

    Prefixes are managed by NCBI. When a project is submitted, our system tries to reserve the prefix with NCBI. If the prefix has already been reserved, an error message is displayed. Please enter a different prefix and submit again.

    When multiple prefixes are necessary, please contact us.

    Reserve a locus tag prefix

    Check the content in "OVERVIEW" and submit a project by clicking the [Submit].

    Submit BioProject

    The "OVERVIEW" tab continues to display the content as submitted. Later updates are not reflected in the "OVERVIEW" tab.

    Accession number

    Temporary numbers with the prefix PSUB are automatically assigned to submitted projects. Projects can be referred to by their PSUB numbers until official accession numbers are assigned. DDBJ BioProject staff review submissions and issue accession numbers with the prefix PRJD to completed projects. Submitters can view accession numbers and submission status in their submission account.

    • Do NOT cite numbers with prefix PSUB in publication.
    • Do not submit projects that have already been registered at EBI or NCBI.
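
    Because temporary and official identifiers differ only by prefix, a simple prefix check can catch a PSUB number before it reaches a manuscript. The Python sketch below assumes nothing about the digits that follow the prefix; the function names and the example identifiers are hypothetical.

```python
def accession_kind(identifier):
    """Classify an identifier by the prefixes named in the handbook."""
    upper = identifier.strip().upper()
    if upper.startswith("PSUB"):
        return "temporary"
    if upper.startswith("PRJD"):
        return "official"
    return "unknown"


def citable(identifier):
    """Only official accessions should be cited in publications."""
    return accession_kind(identifier) == "official"


# Made-up identifiers, used only to exercise the prefix check.
for example in ("PSUB000001", "PRJDB0001"):
    print(example, accession_kind(example), citable(example))
```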

    Submit an umbrella project

    An umbrella project can be submitted in the same way as a primary project in the submission account. Be sure to tell the DDBJ BioProject staff that the project being submitted is an umbrella by writing so in the Private comments to DDBJ staff. An umbrella project cannot be kept private.

    Tell DDBJ staff that submitting project is umbrella

    Link primary project to umbrella

    When submitting a project, enter the abstract and accession number of the umbrella to be linked in the Linked Project. The DDBJ BioProject staff link the submitted primary project to the umbrella based on this information.

    Link to umbrella

    Release of projects

    Registered projects can be released in the following two ways:

    • Release immediately after registration.
    • Release when records citing the BioProject accession are made public.

    The projects can be kept private. If DDBJ records citing the BioProject accession are made public, cited projects are automatically released. Private DDBJ records citing this BioProject accession are not made public.

    Hold date cannot be set for BioProject.

    Public projects are exchanged with the NCBI and EBI BioProject databases.

    Update

    Registered projects can be updated. Please contact the BioProject staff to update the projects.

    Link experimental data and project

    For the SRA submission, select the BioProject accession that you registered in advance in the Study.

    For genome and TSA submissions to the DDBJ, enter the BioProject accession in the DBLINK.