Storage Layout#

HydroModPy V1 organises storage around three DuckDB databases plus file-based array and columnar stores. The hierarchy is workspace > project > run, and three catalogs index it at machine, workspace, and project scope:

  • a machine-wide global index (index.duckdb) federating registered workspaces;

  • a per-workspace input cache (data/cache.duckdb) tracking downloaded and custom datasets;

  • a per-project simulation catalog (catalog.duckdb) holding simulations, parameters, metrics, provenance, calibration trace, and the workflow ledger.

Field arrays are persisted as Zarr v2 stores under simulations/<basename>.zarr/ or .zarr.zip. Tabular outputs are persisted as Parquet v2.6 under simulations/<basename>.parquet/ and exposed as DuckDB views.

For background on the cache/catalog split, see The three workspace databases. For the migration policy that applies to every change in this layout, see Schema Evolution.

Three-level layout#

<workspace>/
|-- workspace.toml                metadata of the research workspace
|-- data/
|   |-- cache.duckdb              input cache (one per workspace)
|   |-- dem/
|   |   |-- raw/                  immutable downloads + sidecar .json
|   |   `-- processed/            reprojected and clipped derivatives
|   |-- climate/
|   `-- ...
`-- projects/
    `-- <project_name>/
        |-- hydromodpy.toml       project config (Pydantic root)
        |-- catalog.duckdb        simulation catalog (one per project)
        `-- simulations/
            |-- <basename>.zarr/  or .zarr.zip
            `-- <basename>.parquet/
                |-- timeseries.parquet
                |-- budgets.parquet
                `-- mass_balance.parquet

The machine-wide index.duckdb lives at $XDG_STATE_HOME/hydromodpy/ (~/.local/state/hydromodpy/ on Linux). It federates the registered workspaces by attaching their catalogs through read-only ATTACH statements.
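As a rough illustration of the location rule above, the index path can be resolved with the standard library alone. The helper name global_index_path is hypothetical and not part of HydroModPy; the fallback to ~/.local/state when XDG_STATE_HOME is unset or empty follows the XDG Base Directory convention.

```python
import os
from pathlib import Path


def global_index_path(env=None):
    """Return the assumed location of the machine-wide index.duckdb.

    Hypothetical helper for illustration, not the HydroModPy API.
    """
    env = os.environ if env is None else env
    # XDG convention: fall back to ~/.local/state when the variable is unset or empty.
    state_home = env.get("XDG_STATE_HOME") or str(Path.home() / ".local" / "state")
    return Path(state_home) / "hydromodpy" / "index.duckdb"
```

Passing the environment as a plain mapping keeps the function easy to exercise without touching the real process environment.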

Storage basename rule#

The per-simulation basename is built by StoragePathResolver:

<basename> = "<project>__<name>__<sim_id_first_chars>"

The basename is only a storage label. The authoritative identity of a simulation remains the full sim_id stored in the project's catalog.duckdb.
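The naming rule can be sketched as a small function. This is not the StoragePathResolver implementation: the function name build_basename and the prefix length of 8 characters are assumptions, since the source does not say how many leading characters of sim_id are kept.

```python
def build_basename(project, name, sim_id, n_chars=8):
    """Join project, run name and a sim_id prefix with double underscores.

    n_chars=8 is an assumption; the source does not state the prefix length.
    """
    return f"{project}__{name}__{sim_id[:n_chars]}"


# Example with a made-up sim_id; the full value stays in the catalog.
example = build_basename("demo", "baseline", "0190f3a2-7c1e-7b4a-9d2e-3f6a5b4c2d10")
```

Because the prefix is truncated, the basename alone cannot be mapped back to a unique simulation; lookups should go through the catalog's sim_id and storage_basename columns.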

Project catalog DuckDB schema#

The tables below live in <project>/catalog.duckdb. The authoritative DDL ships as Alembic-like migrations under hydromodpy/results/catalog/migrations/.

Table                   Key columns
----------------------  -----------
simulations             sim_id (UUID v7), name, project, solver, status, mesh_hash, n_cells, n_layers, n_timesteps, crs_wkt, bbox_xmin/ymin/xmax/ymax, period_start/end, zarr_path, storage_basename, duration_s, tags, description, scientific_objective
parameters              sim_id, param_name, zone_id (default "__global__"), value, unit, parameterization
metrics                 sim_id, station_id (default "__outlet__"), metric_name, value, n_samples
calibration_sessions    session_id, project, method, objective_name, n_iterations, best_sim_id, status
calibration_iterations  session_id, iteration, sim_id, params_hash, parameters (JSON), objective_value
provenance              sim_id, variable, source_type, source_ref, source_sha256, payload_sha256, fetched_at, n_records
observation_points      sim_id, station_id, x, y, cell_id, layer, crs_wkt, crs_epsg
geographic_features     sim_id, feature_name, geometry_kind, geoparquet_path
runs_environment        sim_id, python_version, hydromodpy_version, platform, git_commit, solver_binary_sha256, rng_seed
tracked_files           sim_id, role, category, original_path (workspace-relative), canonical_path, sha256, size_bytes
stations                station_id, name, latitude, longitude, variable_type, source, active
observations            station_id, variable_type, datetime, value, unit, quality
workflow_steps          workflow ledger merged into the catalog: step_id, sim_id, step_name, status, started_at, ended_at, payload
schema_migrations       one row per applied migration: version, component, slug, checksum, applied_at

Companion views (read-only):

  • v_simulation_summary;

  • v_best_per_project;

  • v_metrics_wide;

  • v_params_wide.
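To illustrate how a long-format table such as metrics can be pivoted into one row per simulation, the sketch below uses Python's built-in sqlite3 module as a stand-in for DuckDB, purely so that it runs without extra dependencies. The view name metrics_wide_demo, the metric names nse and kge, the sample values, and the pivot itself are assumptions; the actual definition of v_metrics_wide lives in the migrations and may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in engine; the real catalog is DuckDB
con.execute(
    """
    CREATE TABLE metrics (
        sim_id TEXT,
        station_id TEXT DEFAULT '__outlet__',
        metric_name TEXT,
        value REAL,
        n_samples INTEGER
    )
    """
)
# station_id is omitted on purpose so the documented default applies.
con.executemany(
    "INSERT INTO metrics (sim_id, metric_name, value, n_samples) VALUES (?, ?, ?, ?)",
    [
        ("sim-a", "nse", 0.71, 365),
        ("sim-a", "kge", 0.65, 365),
        ("sim-b", "nse", 0.80, 365),
    ],
)
# Conditional aggregation: one column per metric, one row per (sim_id, station_id).
con.execute(
    """
    CREATE VIEW metrics_wide_demo AS
    SELECT sim_id,
           station_id,
           MAX(CASE WHEN metric_name = 'nse' THEN value END) AS nse,
           MAX(CASE WHEN metric_name = 'kge' THEN value END) AS kge
    FROM metrics
    GROUP BY sim_id, station_id
    """
)
rows = con.execute(
    "SELECT sim_id, station_id, nse, kge FROM metrics_wide_demo ORDER BY sim_id"
).fetchall()
```

A metric that was never computed for a simulation comes back as NULL (None in Python) rather than as a missing row, which is the usual reason to prefer a wide view for side-by-side comparison.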

Workspace cache DuckDB schema#

<workspace>/data/cache.duckdb is keyed on the data variable, source, and SHA-256 fingerprint. Key tables:

Table               Role
------------------  ----
entries             one row per cached file. Columns: entry_id, variable, source_type, file_path (workspace-relative POSIX), payload_sha256, fetched_at, size_bytes.
artifacts           derived layers (reprojections, clipping, mosaics) tied back to a parent entries row.
provenance          upstream source ref, license, citation, query parameters.
failures            failed downloads tagged for retry.
validation_reports  sidecar JSON validation outcome per entries row.
schema_migrations   same migration ledger format as the project catalog.
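The shape of an entries row can be sketched from a file on disk. The function name describe_cache_file and the example source_type value are hypothetical, and entry_id is left out because the source does not say how it is generated; only the column names come from the table above.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def describe_cache_file(workspace, file_path, variable, source_type):
    """Build a dict shaped like one row of the entries table (entry_id omitted)."""
    workspace = Path(workspace).resolve()
    file_path = Path(file_path).resolve()
    digest = hashlib.sha256()
    with file_path.open("rb") as fh:
        # Hash in 1 MiB chunks so large rasters are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "variable": variable,
        "source_type": source_type,
        # relative_to raises ValueError if the file is outside the workspace.
        "file_path": file_path.relative_to(workspace).as_posix(),
        "payload_sha256": digest.hexdigest(),
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": file_path.stat().st_size,
    }
```

Storing the path relative to the workspace, with POSIX separators, is what lets a workspace be moved or shared across operating systems without rewriting the cache.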

Machine global index DuckDB schema#

$XDG_STATE_HOME/hydromodpy/index.duckdb keeps the list of registered workspaces and rebuilds federated views on demand:

Table or view              Role
-------------------------  ----
workspaces                 one row per registered workspace URI, plus the paths to its catalog.duckdb files.
all_simulations            federated view over every project catalog reachable from registered workspaces. Read-only.
all_metrics, all_projects  companion federated views used by hmp index search.

Per-simulation Zarr store#

Zarr v2 root layout, written atomically (tmp + rename + fsync) under a file lock:

<basename>.zarr/
|-- meta/                  ACDD root attributes + ZARR_SCHEMA_VERSION
|-- mesh/                  topology, cell types, coordinates
|-- field/                 head, watertable_*, accumulation_flux, ...
|-- topography/            DEM-derived rasters (renamed from raster/)
|-- particles/             particle trajectories (renamed from pathlines/)
`-- budget/                budget components per timestep

Compression defaults to Blosc with the zstd codec. Each variable carries CF standard_name and _FillValue attributes. Consolidated metadata is strict, and ZARR_SCHEMA_VERSION is pinned to "2".

Per-simulation Parquet directory#

Parquet v2.6 files are produced through write_table_atomic (tmp + os.replace) with ZSTD compression, a row-group size of 50 000 rows, and page index and bloom filters where available. Each file carries hmp.schema_version = "v2" in its key-value metadata (written by the KV-metadata mixin) and is exposed as a DuckDB view named after the table.

File                  Columns
--------------------  -------
timeseries.parquet    sim_id, station_id, variable, datetime, value, unit, qflag
budgets.parquet       sim_id, timestep (BIGINT), zone_id, component, flux_in, flux_out, unit
mass_balance.parquet  sim_id, timestep, total_in, total_out, storage_in, storage_out, percent_error

GeoParquet 1.1 is used for vector layers persisted alongside the simulation (catchment outline, drainage network).
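The tmp + os.replace pattern named for write_table_atomic can be sketched with the standard library. This is not the HydroModPy function: write_bytes_atomic is a hypothetical name, it writes raw bytes rather than a Parquet table to stay dependency-free, and the fsync call is an assumption borrowed from the Zarr write path described earlier.

```python
import os
import tempfile
from pathlib import Path


def write_bytes_atomic(target, payload):
    """Write payload to target so readers never observe a partial file."""
    target = Path(target)
    # The temp file sits next to the target so both are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, target)  # rename over the target; atomic on POSIX
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

A reader that opens the target at any moment sees either the previous complete file or the new complete file, never a half-written one.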

Backend abstraction#

Project-catalog SQL access goes through the CatalogBackend Protocol in normal runtime paths. The in-tree V1 adapter is DuckDBBackend. HydroModPy V1 does not promise a fully portable non-DuckDB backend: cache stores, diagnostics, migration runners and portable-package snapshots are DuckDB-specific by contract. Field readers go through hmp.read which dispatches to Zarr or Parquet stores via the field registry. See results for the Python surface.

Concurrency and retry#

DuckDB project-catalog writes use connect_with_retry and the @with_lock_retry decorator on write paths. Data-cache writes use the data-cache DuckDB adapter. Zarr writes acquire a filelock on the store root, write to a sibling tmp directory, and promote with os.replace. Short-lived cross-process lock contention resolves transparently instead of surfacing as an error.
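The retry behaviour can be illustrated with a generic decorator. The sketch below is not the @with_lock_retry implementation, whose signature is not shown in this page: the names retry_on_lock and LockBusy, the exponential backoff, and the default attempt count are all assumptions.

```python
import functools
import time


class LockBusy(Exception):
    """Hypothetical error standing in for a cross-process lock conflict."""


def retry_on_lock(max_attempts=5, base_delay=0.05, exceptions=(LockBusy,)):
    """Retry the wrapped call with exponential backoff on lock contention."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # contention outlasted the retry budget
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"n": 0}


@retry_on_lock(max_attempts=4, base_delay=0.001)
def insert_row():
    """Simulated write that finds the catalog locked on its first two attempts."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise LockBusy("catalog is locked by another process")
    return "ok"
```

Calling insert_row() succeeds on the third attempt without the caller seeing the two intermediate failures; only contention that outlasts every attempt propagates as an error.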

Lockfile and reproducibility#

hydromodpy.lock is written best-effort next to the project catalog when the run can inspect the input cache and collect fingerprints. The lockfile pins:

  • the resolved Pydantic configuration tree;

  • hydromodpy version and git commit;

  • the solver binary release tag and SHA-256;

  • ZARR_SCHEMA_VERSION, PARQUET_SCHEMA_VERSION, and the catalog migration version applied;

  • input-data fingerprints from provenance.

A frozen replay (hmp run --frozen) refuses any source whose fingerprint has changed since the lockfile was written. When no input cache is available, a normal run may complete without a lockfile; that is a reproducibility warning, not a failed run.
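The frozen-replay check amounts to comparing pinned fingerprints with freshly computed ones. A minimal sketch, assuming both sides are available as mappings from a source reference to a SHA-256 hex digest; the names check_frozen_inputs and FrozenReplayError are hypothetical, and treating a missing source as changed is an assumption.

```python
class FrozenReplayError(RuntimeError):
    """Hypothetical error raised when a pinned input no longer matches."""


def check_frozen_inputs(pinned, current):
    """Compare lockfile fingerprints with freshly computed ones.

    Both arguments map a source reference to its SHA-256 hex digest.
    Raises FrozenReplayError if any pinned source changed or is missing.
    """
    changed = sorted(
        ref for ref, digest in pinned.items() if current.get(ref) != digest
    )
    if changed:
        raise FrozenReplayError(
            f"fingerprint changed or missing for: {', '.join(changed)}"
        )
```

Listing every mismatching source in one error, rather than stopping at the first, gives the user the full set of inputs to refresh or restore before retrying.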

Direct DuckDB exceptions#

The normal application path uses catalog/cache adapters. Direct duckdb.connect is accepted only for:

  • migration runners that bootstrap schema files;

  • concrete backend constructors and adapters;

  • read-only diagnostics and doctor output;

  • portable .hmp package snapshots;

  • tests and performance benchmarks;

  • developer-only CLI inspection commands;

  • experimental validity_frame ingestion.

New direct DuckDB calls in user-facing CLI commands or hydromodpy._api should be considered a contract regression unless this list is updated with a rationale.

Portable .hmp packages#

hmp export-package <sim_id> -o run.hmp bundles:

  • the resolved TOML and the lockfile;

  • the catalog rows matching that sim_id, plus the rows they reference;

  • the per-simulation Zarr store and Parquet directory;

  • a JSON manifest with the schema version and SHA-256 checksums;

  • a RO-Crate sidecar when enough metadata is available.

hmp add run.hmp re-materialises the bundle in the target project, refusing packages whose schema version exceeds the current library.
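The import-side checks can be sketched as follows. The manifest layout used here (a schema_version integer and a sha256 mapping), the constant LIBRARY_SCHEMA_VERSION, and the function name validate_manifest are all hypothetical; the real .hmp manifest format is not described in this page.

```python
import hashlib
import json

LIBRARY_SCHEMA_VERSION = 2  # assumed integer version of the installed library


def validate_manifest(manifest_json, payloads):
    """Check the schema version and SHA-256 checksums of a bundle.

    payloads maps a member name to its raw bytes; the manifest layout is hypothetical.
    """
    manifest = json.loads(manifest_json)
    if manifest["schema_version"] > LIBRARY_SCHEMA_VERSION:
        raise ValueError("package schema version is newer than this library supports")
    for name, expected in manifest["sha256"].items():
        actual = hashlib.sha256(payloads[name]).hexdigest()
        if actual != expected:
            raise ValueError(f"checksum mismatch for {name}")
    return manifest
```

Checking the version before the checksums avoids hashing a large bundle that would be refused anyway.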

See also#