Storage Layout#
HydroModPy V1 organises storage around three SQL databases plus columnar file stores. The model is workspace > project > run, with each level holding its own catalog:
a machine-wide global index (index.duckdb) federating registered workspaces;
a per-workspace input cache (data/cache.duckdb) tracking downloaded and custom datasets;
a per-project simulation catalog (catalog.duckdb) holding simulations, parameters, metrics, provenance, calibration trace, and the workflow ledger.
Field arrays are persisted as Zarr v2 stores under
simulations/<basename>.zarr/ or .zarr.zip. Tabular outputs are
persisted as Parquet v2.6 under simulations/<basename>.parquet/
and exposed as DuckDB views.
For background on the cache/catalog split, see The three workspace databases. For the migration policy that applies to every change in this layout, see Schema Evolution.
Three-level layout#
<workspace>/
|-- workspace.toml              metadata of the research workspace
|-- data/
|   |-- cache.duckdb            input cache (one per workspace)
|   |-- dem/
|   |   |-- raw/                immutable downloads + sidecar .json
|   |   `-- processed/          reprojected and clipped derivatives
|   |-- climate/
|   `-- ...
`-- projects/
    `-- <project_name>/
        |-- hydromodpy.toml     project config (Pydantic root)
        |-- catalog.duckdb      simulation catalog (one per project)
        `-- simulations/
            |-- <basename>.zarr/ or .zarr.zip
            `-- <basename>.parquet/
                |-- timeseries.parquet
                |-- budgets.parquet
                `-- mass_balance.parquet
The machine-wide index.duckdb lives at $XDG_STATE_HOME/hydromodpy/
(~/.local/state/hydromodpy/ on Linux) and federates several workspaces
through ATTACH read-only.
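As a minimal sketch of how that location can be resolved, the helper below follows the XDG Base Directory convention of falling back to ~/.local/state when XDG_STATE_HOME is unset or empty. The function name global_index_path is hypothetical and is not part of the HydroModPy API; only the directory and file names come from this page.

```python
import os
from pathlib import Path


def global_index_path(env=None):
    """Return the assumed location of the machine-wide index.duckdb.

    `env` defaults to os.environ; passing a dict makes the helper testable.
    The helper name is illustrative, not a HydroModPy function.
    """
    env = os.environ if env is None else env
    # XDG convention: an unset or empty XDG_STATE_HOME means ~/.local/state.
    state_home = env.get("XDG_STATE_HOME") or str(Path.home() / ".local" / "state")
    return Path(state_home) / "hydromodpy" / "index.duckdb"
```

Calling global_index_path({"XDG_STATE_HOME": "/tmp/state"}) yields /tmp/state/hydromodpy/index.duckdb, while an empty mapping falls back to the per-user default.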
Storage basename rule#
The per-simulation basename is built by StoragePathResolver:
<basename> = "<project>__<name>__<sim_id_first_chars>"
The database identity stays the full sim_id stored in the project
catalog.duckdb.
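The rule can be illustrated with a short sketch. This is not the StoragePathResolver implementation: the function name is hypothetical, and the number of leading sim_id characters (8 here) is an assumption, since this page does not state it.

```python
def storage_basename(project: str, name: str, sim_id: str, id_chars: int = 8) -> str:
    """Build "<project>__<name>__<sim_id_first_chars>".

    id_chars=8 is an assumed prefix length; the real resolver may use another.
    """
    if not sim_id:
        raise ValueError("sim_id must not be empty")
    # Only a prefix of sim_id goes into the path; the full sim_id remains
    # the identity stored in the project catalog.
    return f"{project}__{name}__{sim_id[:id_chars]}"
```

Because the basename carries only a prefix, lookups by identity should still go through the catalog rather than through the directory name.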
Project catalog DuckDB schema#
Tables in <project>/catalog.duckdb (see
hydromodpy/results/catalog/migrations/ for the authoritative DDL
shipped as Alembic-like migrations).
| Table | Key columns |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | workflow ledger merged into the catalog: |
| | one row per applied migration: |
Companion views (read-only):
v_simulation_summary, v_best_per_project, v_metrics_wide, v_params_wide.
Workspace cache DuckDB schema#
<workspace>/data/cache.duckdb is keyed on the data variable, source,
and SHA-256 fingerprint. Key tables:
| Table | Role |
|---|---|
| | one row per cached file. Columns |
| | derived layers (reprojections, clipping, mosaics) tied back to a parent |
| | upstream source ref, license, citation, query parameters. |
| | failed downloads tagged for retry. |
| | sidecar JSON validation outcome per |
| | same migration ledger format as the project catalog. |
Machine global index DuckDB schema#
$XDG_STATE_HOME/hydromodpy/index.duckdb keeps the list of registered
workspaces and rebuilds federated views on demand:
| Table or view | Role |
|---|---|
| | one row per registered workspace URI plus the path to its |
| | federated view over every project catalog reachable from registered workspaces. Read-only. |
| | companion federated views used by |
Per-simulation Zarr store#
Zarr v2 root layout, written atomically (tmp + rename + fsync) under a file lock:
<basename>.zarr/
|-- meta/ ACDD root attributes + ZARR_SCHEMA_VERSION
|-- mesh/ topology, cell types, coordinates
|-- field/ head, watertable_*, accumulation_flux, ...
|-- topography/ DEM-derived rasters (renamed from raster/)
|-- particles/ particle trajectories (renamed from pathlines/)
`-- budget/ budget components per timestep
Compression defaults to Blosc-zstd. Each variable carries CF
standard_name and _FillValue attributes, consolidated metadata
is strict, and ZARR_SCHEMA_VERSION is pinned to "2".
Per-simulation Parquet directory#
Parquet v2.6 files produced through write_table_atomic (tmp +
os.replace) with ZSTD compression, row-group size 50 000, page index
and bloom filters where available. Each file carries
hmp.schema_version = "v2" in the KV-metadata mixin and is exposed as
a DuckDB view named after the table.
| File | Columns |
|---|---|
| | |
| | |
| | |
GeoParquet 1.1 is used for vector layers persisted alongside the simulation (catchment outline, drainage network).
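The tmp + os.replace pattern named above for write_table_atomic can be sketched with the standard library alone. The helper below is an illustration of the pattern on raw bytes, not the HydroModPy function: its name and signature are hypothetical, and the fsync call is an optional durability step added here as an assumption.

```python
import os
import tempfile
from pathlib import Path


def write_bytes_atomic(target: Path, payload: bytes) -> None:
    """Write payload to target so readers never observe a partial file."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory (same filesystem),
    # otherwise os.replace cannot be an atomic rename.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # assumed extra step: flush to disk first
        os.replace(tmp_name, target)  # atomic promotion over any old file
    except BaseException:
        # On failure, remove the temp file and leave the old target untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A reader opening the target path sees either the previous complete file or the new complete file, never a half-written one.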
Backend abstraction#
Project-catalog SQL access goes through the
CatalogBackend Protocol in
normal runtime paths. The in-tree V1 adapter is DuckDBBackend.
HydroModPy V1 does not promise a fully portable non-DuckDB backend:
cache stores, diagnostics, migration runners and portable-package
snapshots are DuckDB-specific by contract.
Field readers go through hmp.read which dispatches to Zarr or
Parquet stores via the field registry. See
results for the Python surface.
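To illustrate what routing SQL access through a Protocol looks like, here is a minimal sketch. The method names (execute, close), the class names, and the simulations table name are all hypothetical; the real CatalogBackend surface may differ. The stand-in adapter uses the standard-library sqlite3 module purely to keep the example dependency-free; it is not a supported HydroModPy backend.

```python
import sqlite3
from typing import Any, Protocol, Sequence, runtime_checkable


@runtime_checkable
class CatalogBackendSketch(Protocol):
    """Hypothetical minimal surface a catalog backend could expose."""

    def execute(self, sql: str, params: Sequence[Any] = ()) -> list: ...

    def close(self) -> None: ...


class SQLiteStandIn:
    """Illustrative adapter; satisfies the Protocol structurally."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)

    def execute(self, sql: str, params: Sequence[Any] = ()) -> list:
        rows = self._conn.execute(sql, tuple(params)).fetchall()
        self._conn.commit()
        return rows

    def close(self) -> None:
        self._conn.close()


def count_simulations(backend: CatalogBackendSketch) -> int:
    """Caller code depends on the Protocol only, never on a driver module."""
    return backend.execute("SELECT COUNT(*) FROM simulations")[0][0]
```

Code written against the Protocol never imports a database driver directly, which is the property the contract above protects for normal runtime paths.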
Concurrency and retry#
DuckDB project-catalog writes use connect_with_retry and the
@with_lock_retry decorator on write paths. Data-cache writes use the
data-cache DuckDB adapter. Zarr writes acquire a filelock on the
store root, write to a sibling tmp directory, and promote with
os.replace. Short-lived cross-process lock contention resolves
transparently instead of surfacing as an error.
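The retry behaviour can be pictured with a small decorator. This is a sketch under stated assumptions, not the @with_lock_retry implementation: the LockContention exception, the attempt count, and the exponential backoff schedule are all hypothetical choices made for illustration.

```python
import functools
import time


class LockContention(Exception):
    """Hypothetical stand-in for a 'database is locked' style error."""


def with_lock_retry_sketch(attempts: int = 5, base_delay: float = 0.01):
    """Retry the wrapped call on LockContention with exponential backoff."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except LockContention:
                    if attempt == attempts - 1:
                        raise  # contention outlived the retry budget
                    time.sleep(base_delay * 2 ** attempt)

        return wrapper

    return decorator
```

Only contention that persists beyond the retry budget reaches the caller; shorter waits are absorbed inside the wrapper.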
Lockfile and reproducibility#
hydromodpy.lock is written best-effort next to the project catalog
when the run can inspect the input cache and collect fingerprints. The
lockfile pins:
the resolved Pydantic configuration tree;
hydromodpy version and git commit;
the solver binary release tag and SHA-256;
ZARR_SCHEMA_VERSION, PARQUET_SCHEMA_VERSION, and the catalog migration version applied;
input-data fingerprints from provenance.
A frozen replay (hmp run --frozen) refuses any source whose
fingerprint has changed since the lockfile was written. When no input
cache is available, a normal run may complete without a lockfile; that
is a reproducibility warning, not a failed run.
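The fingerprint comparison behind a frozen replay can be sketched as follows. The helper names and the shape of the pinned mapping (path to hex SHA-256) are assumptions made for illustration; the actual hydromodpy.lock format is not described on this page.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large inputs are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_sources(pinned: dict) -> list:
    """Return the pinned paths that are missing or whose hash has changed.

    `pinned` maps a file path to the hex digest recorded at lock time
    (an assumed layout, not the real lockfile schema).
    """
    changed = []
    for path, expected in pinned.items():
        candidate = Path(path)
        if not candidate.is_file() or sha256_of(candidate) != expected:
            changed.append(path)
    return sorted(changed)
```

A frozen run in this sketch would refuse to start whenever changed_sources returns a non-empty list.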
Direct DuckDB exceptions#
The normal application path uses catalog/cache adapters. Direct
duckdb.connect is accepted only for:
migration runners that bootstrap schema files;
concrete backend constructors and adapters;
read-only diagnostics and doctor output;
portable .hmp package snapshots;
tests and performance benchmarks;
developer-only CLI inspection commands;
experimental validity_frame ingestion.
New direct DuckDB calls in user-facing CLI commands or hydromodpy._api
should be considered a contract regression unless this list is updated
with a rationale.
Portable .hmp packages#
hmp export-package <sim_id> -o run.hmp bundles:
the resolved TOML and the lockfile;
the matched DuckDB rows for that sim_id plus referenced rows;
the per-simulation Zarr store and Parquet directory;
a JSON manifest with the schema version and SHA-256 checksums;
a RO-Crate sidecar when enough metadata is available.
hmp add run.hmp re-materialises the bundle in the target project,
refusing packages whose schema version exceeds the current library.
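The import-side checks can be sketched as below. The manifest field names (schema_version, sha256), the integer version, and the exception type are hypothetical, and verifying checksums at import time is an assumption of this sketch; the page only states that the manifest carries them and that newer schema versions are refused.

```python
import hashlib

CURRENT_SCHEMA_VERSION = 2  # assumed integer form of the library's schema version


class PackageRejected(Exception):
    """Hypothetical error raised when a bundle must not be imported."""


def check_manifest(manifest: dict, payloads: dict) -> None:
    """Validate a bundle manifest against the payload bytes it describes.

    manifest: {"schema_version": int, "sha256": {member_name: hex_digest}}
    payloads: {member_name: bytes} (both layouts are assumptions)
    """
    # Refuse bundles written by a newer library than the one running.
    if manifest["schema_version"] > CURRENT_SCHEMA_VERSION:
        raise PackageRejected("package schema version exceeds the current library")
    for name, expected in manifest["sha256"].items():
        if name not in payloads:
            raise PackageRejected(f"missing member: {name}")
        if hashlib.sha256(payloads[name]).hexdigest() != expected:
            raise PackageRejected(f"checksum mismatch for {name}")
```

Checking the version before the checksums lets an older library fail fast without reading every member of the bundle.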
See also#
The three workspace databases for the cache vs catalog split.
Schema Evolution for the migration policy.
Artifact Policy for non-canonical artifacts and sidecars.
results for the Python API on top of this storage (SimulationCatalog, Run, SimulationGroup, hmp.read facade).
data for the input cache writer.