Step 2 - Set Up a Data Model

You can start with one of two template data models and adjust it to meet your needs, coordinating with the AnVIL data ingest team as you go. If your dataset has been accepted by AnVIL and does not easily fit into an existing template, reach out to the AnVIL Team at help@lists.anvilproject.org.

You’ll end this step by completing an intake form that sends the AnVIL team your data model (in a data dictionary spreadsheet) and all the information they need to set up your data workspace on AnVIL.

2.1 - First Steps

Coordinate with the AnVIL Data Ingest Team

Email anvil-data@broadinstitute.org to arrange an AnVIL kickoff meeting to discuss your data, data model, and ingest timeline.

Register for a Terra Account

AnVIL data are stored and organized in Terra data-oriented workspaces. You will need a Terra Account to upload data into AnVIL. If you do not already have an account on Terra, you will find step-by-step instructions to register at Creating a Terra Account.

2.2 - Create Your Data Model

Nodes in the AnVIL Data Model (e.g. "Program" or "Subject" in the diagram below) hold different types of data (called "properties"). Each node is a table in a data workspace in AnVIL. Nodes are connected to each other by unique IDs.

Figure: A graphic representation of an AnVIL data model.

Data submitters will submit data and metadata from the Biospecimen, Clinical, and Data File nodes in spreadsheet-like files that will be displayed in the data workspace as integrated tables. Each row is a distinct "entity" and each column is a different property (type of data).

A data model consists of these components:

  • Entities: the primary object the table contains, identified by a unique key (e.g. a "subject" entity for phenotypic data or a "sample" entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
  • Attributes/properties: the columns in a database table (e.g. phenotypic data such as demographics or lab results, or genomic data metadata).
  • Associations: the unique identifiers that link data between tables (e.g. a subject_id column in the sample table that links each sample to its subject).
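
The three components can be illustrated with a minimal Python sketch. The rows and values below are hypothetical and only show how a subject_id column in the sample table acts as the association back to the subject table; this is not an AnVIL tool.

```python
# Hypothetical example rows; real tables will have many more attribute columns.
subject_table = [
    {"subject_id": "S001", "sex": "female"},
    {"subject_id": "S002", "sex": "male"},
]
sample_table = [
    {"sample_id": "SM01", "subject_id": "S001"},
    {"sample_id": "SM02", "subject_id": "S001"},
    {"sample_id": "SM03", "subject_id": "S002"},
]


def samples_by_subject(samples):
    """Group sample IDs under the subject_id they are associated with."""
    grouped = {}
    for row in samples:
        grouped.setdefault(row["subject_id"], []).append(row["sample_id"])
    return grouped


def orphan_samples(subjects, samples):
    """Return sample IDs whose subject_id has no row in the subject table."""
    known = {row["subject_id"] for row in subjects}
    return [row["sample_id"] for row in samples if row["subject_id"] not in known]


links = samples_by_subject(sample_table)
print(links)
print(orphan_samples(subject_table, sample_table))
```

Each dictionary is one entity (a row), its keys are the attributes, and the shared subject_id value is the association. A sample that points at a subject_id missing from the subject table would show up in the second list.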

Data Model Requirements

Please read the descriptions below carefully, and reach out to your AnVIL team contact with any questions. These requirements help ensure that AnVIL datasets are useful not only to the researchers who created them but also to others who want to analyze data collectively across studies.

Start by thinking of what data you have and how you have already organized it. Note that to accommodate the most data, AnVIL data models allow as many attribute columns as you need. The requirements help structure all AnVIL data similarly, and make it compatible with analysis in the AnVIL Terra platform.

Required Tables for All Studies (csv, tsv, txt, json format)

All studies must submit the following tables (scroll down for details and template tables):

  • Data Dictionary Table: Specifies the entire data model. For each table in the data model, it includes field names, field descriptions, field types, enumeration values (where applicable), and the multi-value delimiter symbol used (where applicable).
  • Subject Table: Includes required information about the subjects and (usually) associated phenotypic data. The subject_id (first column) is the key field for that table. This key is typically used in other tables to link additional data (e.g. genomic, sequencing, family) to the subject.
  • Sample Table: Links the subject_ids to the sample_ids, where the sample_id (first column) is the key field for that table.
  • Sequencing Table: Includes required information about, and links to, the sequence data associated with the sample_id, where the filename is the key field for the table.
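
As a rough pre-submission check, the key-field convention for the subject and sample tables (key in the first column, one row per key) can be tested with a few lines of Python. This is a sketch that assumes tab-separated input; the function name `check_key_column` is made up for illustration and is not part of any AnVIL tooling.

```python
import csv
import io


def check_key_column(tsv_text, key_field):
    """Return a list of problems found with a table's key column."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    first = header[0] if header else None
    if first != key_field:
        return [f"first column is {first!r}, expected {key_field!r}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        key = row[0] if row else ""
        if not key:
            problems.append(f"line {line_number}: empty {key_field}")
        elif key in seen:
            problems.append(f"line {line_number}: duplicate {key_field} {key!r}")
        seen.add(key)
    return problems


# Hypothetical sample table with a repeated key.
sample_tsv = "sample_id\tsubject_id\nSM01\tS001\nSM01\tS002\n"
print(check_key_column(sample_tsv, "sample_id"))
```

An empty list means the key column passed both checks. The same function can be pointed at a subject table by passing "subject_id" as the key field.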

Example Additional Tables (CSV format)

  • Family Table: Includes information about a particular family with the family_id (first column) as the key field for the table. Data can include pedigrees or any other family-level information.
  • Discovery Table: Includes information about variants of interest that are linked to the subject_id (must include a subject_id column)

Template Data Models

To enable cross-study analysis within AnVIL, data submitted for hosting should, when possible, be consistent with data models already in AnVIL or in the process of being ingested. To ensure this, we recommend you adopt or modify one of the Data Model templates below. Note that these are read-only copies; make your own copy to modify.

Non-standard Data Models

2.3 - Generate Your Data Dictionary

All AnVIL studies must submit a Data Dictionary table (spreadsheet file) that defines the complete data model. For each table in the data model, it includes field names, field descriptions, field types, examples, enumeration values (where applicable), and multi-value delimiter symbols used (where applicable).
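
One way to picture a data dictionary is as one row per field, with a column for each kind of information listed above. The Python sketch below writes such rows as tab-separated text. The column headers and example values are assumptions chosen for illustration; use the headers from the template you agree on with the AnVIL team.

```python
import csv
import io

# Assumed column headers: the guide lists the kinds of information required,
# not the exact header names used in the AnVIL template.
DICTIONARY_COLUMNS = [
    "table", "field_name", "description", "field_type",
    "example", "enumerations", "multi_value_delimiter",
]

# Illustrative entries only.
entries = [
    {"table": "subject", "field_name": "subject_id",
     "description": "Unique identifier for the subject",
     "field_type": "string", "example": "S001",
     "enumerations": "", "multi_value_delimiter": ""},
    {"table": "subject", "field_name": "sex",
     "description": "Sex of the subject",
     "field_type": "enumeration", "example": "female",
     "enumerations": "female, male, unknown", "multi_value_delimiter": ""},
    {"table": "sample", "field_name": "sample_id",
     "description": "Unique identifier for the sample",
     "field_type": "string", "example": "SM01",
     "enumerations": "", "multi_value_delimiter": ""},
]


def write_dictionary(rows):
    """Serialize data dictionary rows as tab-separated text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DICTIONARY_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def incomplete_entries(rows):
    """Return (table, field_name) pairs missing a description or field type."""
    return [(row["table"], row["field_name"]) for row in rows
            if not row["description"] or not row["field_type"]]


dictionary_text = write_dictionary(entries)
print(dictionary_text)
print(incomplete_entries(entries))
```

Keeping the dictionary as structured rows makes it easy to spot fields that lack a description or type before the spreadsheet is sent.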

General Formatting Requirements

To be compatible with indexing once in AnVIL, field names and file names cannot contain special characters (e.g. "%" or "*"). If your field or file names contain special characters, remove or replace all of them before ingestion.
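
A quick scan for problem names might look like the Python sketch below. The guide names "%" and "*" only as examples and does not publish a full list of allowed characters, so the allow-list here (letters, digits, underscore, hyphen, period) is an assumption to confirm with the AnVIL team.

```python
import re

# Assumption: only letters, digits, underscore, hyphen and period are safe.
# The guide does not publish an exact allow-list.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")
UNSAFE_CHAR = re.compile(r"[^A-Za-z0-9_.-]")


def find_unsafe_names(names):
    """Return the names that contain characters outside the allow-list."""
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


def sanitize(name, replacement="_"):
    """Replace every character outside the allow-list with `replacement`."""
    return UNSAFE_CHAR.sub(replacement, name)


# Hypothetical field and file names.
names = ["subject_id", "percent%change", "sample*.cram"]
print(find_unsafe_names(names))
print([sanitize(name) for name in names])
```

If you rename files this way, update any table columns that reference the old file names so the links stay valid.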

Phenotypic Data Expectations

Data Model
The subject table (screenshot above) contains a collection of basic information and phenotypic data about the study subjects (e.g., demographics, age, sex, or race).

Currently, data stored in a phenotypic ("subject") table will fall into one of four categories. Requirements for each category are below.

  • Case/Control (or Case alone) - Information around a particular disease or phenotype of interest for a selected cohort (Example: CMG, CCDG).
  • Electronic Health Record (EHR) - Data derived from EHR information (Example: eMERGE).
  • Survey - Data collected from surveying study subjects (Example: CSER).
  • Family longitudinal - Data collected for multiple families for multiple generations (Example: AMISH).

Required Phenotypic Data

To ensure cross-study functionality on AnVIL, dataset categories have the following requirements.

  • 1 - Required
  • 2 - Required if there are trios or other relationship data in the study.
Data Elements         Case or Case/Control  EHR  Survey  Family longitudinal
subject_id            1                     1    1       1
sample_id             1                     1    1       1
project_id            1                     1    1       1
dbgap_project_id      1                     1    1       1
data_use_limitation   1                     1    1       1
sex                   1                     1    1       1
condition             1                     -    -       -
affected_status       1                     -    -       -
family_id             2                     -    2       1
paternal_id           2                     -    2       1
maternal_id           2                     -    2       1
proband_relationship  2                     -    2       -
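
The requirements table above can also be encoded as a lookup so that a list of column names is checked against a dataset category. The Python sketch below is one possible encoding; the category keys (`case_control`, `ehr`, `survey`, `family_longitudinal`) and function names are made up for illustration, and the check takes column names pooled from wherever they live in your tables.

```python
REQUIRED = 1
CONDITIONAL = 2  # required if the study has trios or other relationship data

# Hypothetical category keys, in the same order as the table's columns.
CATEGORIES = ("case_control", "ehr", "survey", "family_longitudinal")

# None stands for "-" in the table.
REQUIREMENTS = {
    "subject_id": (1, 1, 1, 1),
    "sample_id": (1, 1, 1, 1),
    "project_id": (1, 1, 1, 1),
    "dbgap_project_id": (1, 1, 1, 1),
    "data_use_limitation": (1, 1, 1, 1),
    "sex": (1, 1, 1, 1),
    "condition": (1, None, None, None),
    "affected_status": (1, None, None, None),
    "family_id": (2, None, 2, 1),
    "paternal_id": (2, None, 2, 1),
    "maternal_id": (2, None, 2, 1),
    "proband_relationship": (2, None, 2, None),
}


def required_fields(category, has_relationship_data=False):
    """List the data elements required for a dataset category."""
    index = CATEGORIES.index(category)
    fields = []
    for field, levels in REQUIREMENTS.items():
        level = levels[index]
        if level == REQUIRED or (level == CONDITIONAL and has_relationship_data):
            fields.append(field)
    return fields


def missing_fields(columns, category, has_relationship_data=False):
    """Return required data elements that are absent from `columns`."""
    present = set(columns)
    return [field for field in required_fields(category, has_relationship_data)
            if field not in present]


print(required_fields("survey", has_relationship_data=True))
print(missing_fields(["subject_id", "sex"], "ehr"))
```

Passing `has_relationship_data=True` switches on the elements marked 2 in the table, which matters only for the categories that list them.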

Ensuring Uniform Terminology

AnVIL includes a diverse set of studies and a wide range of collected phenotypic data. To maximize useful information for search and synthetic cohort creation, all phenotypic data:

  • Must be clearly linked to a subject, and the subject must be clearly linked to other data (e.g., genome, exome, RNASeq, array, etc.).
  • Must be composed (where possible) of structured values. Ideally, these values are concept codes from established ontologies including, but not limited to:
    • NCIt - A vocabulary for a diverse set of biological concepts (e.g., disease, phenotype, relationship, anatomy, etc.).
    • SNOMED - A vocabulary focused on concepts related to clinical data (license required).
    • UMLS Metathesaurus - Links concepts from multiple vocabularies and ontologies (license required, free to individuals in the USA, includes access to SNOMED).
    • UBERON - A vocabulary focused on anatomical structure.
    • HPO - An ontology focused on phenotypic abnormalities.
    • OMIM - An ontology for rare Mendelian diseases.
    • Orphanet - An ontology for orphan drugs and rare diseases.
    • ICD - An ontology for US billing codes.
    • MeSH - An ontology for biomedical and health-related information.
    • RxNorm - Normalized names for clinical drugs and links to many of the drug vocabularies.

Genomic Data Expectations

Data Model
The sample table (screenshot above) organizes biospecimen information such as genomic data files associated with subjects in the subject table. Note the `subject_id` column that connects the sample data to the right subject.

Access Restrictions

Known Data Use Limitations (DUL) need to be clearly defined by the data depositor. This is the list of requirements for gaining access and using the data. You will need to submit your protocols for gaining access at the time of ingest.

Additional Resources

Reading

Videos

Hands-on tutorial

For hands-on practice with a data model and data tables in Terra, please go through parts 1 and 2 of the Terra Data Tables Quickstart tutorial (estimated time 30-40 minutes).

