Datasets

A Dataset describes one or more related data files submitted as part of an experiment. Each dataset is linked to exactly one experiment via `experiment_id`. It carries metadata about the data files and the platform used to collect the data, along with detailed variable-level metadata.

What is a dataset?

A dataset typically corresponds to a single data file (e.g., one CSV of CTD profiles), but it may include multiple files when they share the same platform, instruments, and data submitter. The key organizing principle is that a dataset represents data collected on a single platform using a consistent set of instruments, managed by a single data submitter. If data come from different platforms or instrument configurations, they should be separate datasets.
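The grouping rule above can be sketched in Python. This is only an illustration, not part of the schema: `FileRecord`, its attributes, `group_into_datasets`, and the sample file names and submitter are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class FileRecord(NamedTuple):
    """Hypothetical description of one data file (not a schema class)."""
    filename: str
    platform: str
    instruments: frozenset
    submitter: str


def group_into_datasets(files):
    """Group files that share platform, instrument set, and submitter."""
    groups = defaultdict(list)
    for f in files:
        # Files that differ in any of the three attributes land in separate groups.
        key = (f.platform, f.instruments, f.submitter)
        groups[key].append(f.filename)
    return list(groups.values())


# Illustrative inputs only; none of these values come from a real submission.
files = [
    FileRecord("ctd_leg1.csv", "R/V Wecoma", frozenset({"CTD"}), "A. Submitter"),
    FileRecord("ctd_leg2.csv", "R/V Wecoma", frozenset({"CTD"}), "A. Submitter"),
    FileRecord("buoy_ph.csv", "Buoy 7", frozenset({"pH sensor"}), "A. Submitter"),
]
datasets = group_into_datasets(files)
```

Here the two CTD files end up in one group because they share all three attributes, while the buoy file forms a group of its own.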

Dataset Types

There are two kinds of datasets:

Field Dataset

For data collected during in-situ experiments — CTD casts, bottle samples, sensor deployments, etc.

Key fields:

Field	Purpose
`dataset_type`	Classification (cast, bottle, flow-through, underway, mooring, etc.)
`data_product_type`	Raw sensor data, data compilation, etc.
`data_accessibility`	Access level: `open_access`, `conditional_access`, or `scheduled_access`
`platform_info`	Ship, buoy, or vehicle that collected the data
`filenames`	Names of the data file(s) in this dataset
`variables`	Array of variable metadata (see Variables)
`data_submitter`	Person responsible for this submission

→ Full FieldDataset schema reference
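A quick pre-submission check on the key fields might look like the sketch below. It assumes a dataset record loaded as a plain Python dict. `check_field_dataset` is a hypothetical helper, and treating all seven key fields as required is an assumption made for illustration; the full schema reference is authoritative on which fields are mandatory. The three access levels are the ones listed in the table.

```python
KEY_FIELDS = [
    "dataset_type", "data_product_type", "data_accessibility",
    "platform_info", "filenames", "variables", "data_submitter",
]
ACCESS_LEVELS = {"open_access", "conditional_access", "scheduled_access"}


def check_field_dataset(record):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing field: {name}" for name in KEY_FIELDS if name not in record]
    access = record.get("data_accessibility")
    if access is not None and access not in ACCESS_LEVELS:
        problems.append(f"unknown data_accessibility: {access}")
    return problems


# Illustrative, deliberately incomplete record.
record = {
    "dataset_type": "cast",          # example value; exact spelling is an assumption
    "data_accessibility": "public",  # not one of the three listed access levels
    "filenames": ["baseline_ctd_casts.csv"],
}
problems = check_field_dataset(record)
```

This only checks presence and one enumerated value; it does not replace validation against the actual schema.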

Model Output Dataset

For simulation output from computational models.

Key fields:

Field	Purpose
`simulation_type`	Control run or perturbation
`mcdr_forcing_description`	Description of the mCDR forcing applied
`model_output_variables`	Checklist of output variables
`output_frequency`	Temporal resolution (e.g., daily, monthly)
`hardware_configuration`	Compute details

→ Full ModelOutputDataset schema reference

Platform Information

Every field dataset includes platform metadata — the ship, buoy, autonomous vehicle, or other platform used for data collection.

{
  "platform_info": {
    "name": "R/V Wecoma",
    "platform_type": "http://vocab.nerc.ac.uk/collection/L06/current/31/",
    "platform_id": "SHIP-001",
    "owner": "Oregon State University",
    "country": "US"
  }
}

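Platform types use the NERC L06 vocabulary for standardized classification. As in the example above, `platform_type` holds the URI of the L06 term rather than a free-text label.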
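A lightweight format check for `platform_type` could be sketched as follows. The prefix is taken from the example above; `is_l06_uri` is a hypothetical helper, and it only inspects the shape of the string. It does not confirm that the code actually exists in the vocabulary.

```python
L06_PREFIX = "http://vocab.nerc.ac.uk/collection/L06/current/"


def is_l06_uri(platform_type):
    """True if the value looks like a NERC L06 term URI (shape check only)."""
    if not platform_type.startswith(L06_PREFIX):
        return False
    # What remains after the prefix should be a single non-empty code segment.
    code = platform_type[len(L06_PREFIX):].strip("/")
    return code != "" and "/" not in code
```

For example, `is_l06_uri("http://vocab.nerc.ac.uk/collection/L06/current/31/")` passes, while a free-text value such as `"research vessel"` does not.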

Linking Datasets to Experiments

Every dataset has an `experiment_id` field that links it to the experiment it belongs to. This allows multiple datasets (e.g., CTD profiles, bottle samples, sensor time series) to be associated with the same experiment.

{
  "experiment_id": "EXAMPLE-001-BASELINE",
  "name": "Baseline CTD profiles",
  "filenames": ["baseline_ctd_casts.csv"]
}
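Because the link is a plain identifier, collecting the datasets that belong to one experiment is a simple grouping step. The sketch below assumes dataset records held as Python dicts; the second record (bottle samples) is invented here for illustration and does not appear in the schema documentation.

```python
from collections import defaultdict

datasets = [
    {
        "experiment_id": "EXAMPLE-001-BASELINE",
        "name": "Baseline CTD profiles",
        "filenames": ["baseline_ctd_casts.csv"],
    },
    {
        # Hypothetical second dataset attached to the same experiment.
        "experiment_id": "EXAMPLE-001-BASELINE",
        "name": "Baseline bottle samples",
        "filenames": ["baseline_bottles.csv"],
    },
]

# Map each experiment_id to the names of the datasets that reference it.
by_experiment = defaultdict(list)
for ds in datasets:
    by_experiment[ds["experiment_id"]].append(ds["name"])
```

Each dataset still points to exactly one experiment; the one-to-many relationship only appears when the records are grouped this way.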

Variables

Each dataset contains an array of variables — detailed metadata describing each measurement or data column across the dataset's files. Variables are the most richly typed part of the schema, with specific subclasses for pH, total alkalinity, DIC, CO₂, and other common oceanographic measurements.

See Variables for the complete variable hierarchy and type selection guide.