The three workspace databases#
HydroModPy splits SQL state across three scopes: machine, workspace,
project. Each scope owns a DuckDB file with a focused role and an
independent lifecycle. The hydromodpy.catalog layer exposes each
scope through its own door: hmp.open(ws) for the project catalog,
hmp.index() for the machine-wide federation, and
InputsNamespace (or the hmp data CLI)
for the input cache.
For the complete layout, see Storage Layout. For the migration policy applied to every database below, see Schema Evolution.
Machine global index – $XDG_STATE_HOME/hydromodpy/index.duckdb#
Federates every registered workspace and exposes cross-workspace
queries through ATTACH in read-only mode. It can be recreated from the
registered workspaces alone, since it carries no science output of its own.
Tables: workspaces, projects, simulations_cache,
index_metadata plus the view v_workspace_health.
Exposed through:
hydromodpy.core.state.global_index.GlobalIndex
hmp.index() (machine-wide discovery and federation)
CLI verbs hmp index search / forget / prune
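The federation pattern described above (attach each workspace database read-only, then query across them) can be sketched as follows. This is only an illustration under stated assumptions, not HydroModPy's implementation: it uses the standard-library sqlite3 module as a stand-in for DuckDB (DuckDB's ATTACH accepts a READ_ONLY option that plays the role of `mode=ro` here), and the `simulations` table with `name` and `solver` columns, the aliases, and both helper functions are hypothetical.

```python
import sqlite3
from pathlib import Path


def attach_read_only(index_con: sqlite3.Connection, alias: str, db_path: Path) -> None:
    """Attach one workspace database to the index connection in read-only mode."""
    if not alias.isidentifier():  # the alias is interpolated into SQL, so validate it
        raise ValueError(f"invalid alias: {alias!r}")
    # mode=ro makes the attached database read-only. This requires that the
    # index connection was opened with uri=True, and db_path must be absolute.
    index_con.execute(f"ATTACH DATABASE ? AS {alias}", (db_path.as_uri() + "?mode=ro",))


def find_simulations(index_con: sqlite3.Connection, aliases: list, solver: str) -> list:
    """Return (workspace alias, simulation name) pairs across attached workspaces."""
    if not aliases:
        return []
    # One SELECT per attached workspace, glued together with UNION ALL.
    union = " UNION ALL ".join(
        f"SELECT '{a}' AS workspace, name FROM {a}.simulations WHERE solver = ?"
        for a in aliases
    )
    sql = union + " ORDER BY workspace, name"
    return index_con.execute(sql, [solver] * len(aliases)).fetchall()
```

Because the workspace files are attached read-only, a stray write through the index connection fails instead of touching a workspace.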
Workspace input cache – <workspace>/data/cache.duckdb#
Tracks downloaded or custom datasets used as model inputs. One file per workspace, shared across every project of that workspace. Purgeable and reconstructible from upstream sources.
Tables: entries, api_coverage, artifacts, provenance,
stations, coverage, failures, validation_reports plus
the view v_entries_summary.
Exposed through:
hydromodpy.data.registry.DataCatalogDuckDB (low-level)
hydromodpy.catalog.InputsNamespace (or the hmp data CLI)
project.data / workspace.data accessors
Each row carries a workspace-relative POSIX file_path so caches
remain portable between machines.
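To illustrate why a workspace-relative POSIX file_path keeps a cache portable, here is a minimal sketch using only pathlib. The two helper names and the example paths are hypothetical; the sketch only shows the general technique of storing a relative POSIX string and resolving it against whatever root the workspace has on the current machine.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def to_cache_path(workspace_root: PurePath, absolute: PurePath) -> str:
    """Return the workspace-relative POSIX string to store in a file_path column."""
    # relative_to raises ValueError if the file lies outside the workspace.
    return absolute.relative_to(workspace_root).as_posix()


def from_cache_path(workspace_root: PurePath, file_path: str) -> PurePath:
    """Rebuild a machine-specific path from a stored POSIX file_path."""
    return workspace_root.joinpath(*PurePosixPath(file_path).parts)


# A value written on a Windows machine resolves under a POSIX root unchanged.
stored = to_cache_path(
    PureWindowsPath(r"C:\work\ws"),
    PureWindowsPath(r"C:\work\ws\data\raw\dem.tif"),
)
print(stored)  # data/raw/dem.tif
print(from_cache_path(PurePosixPath("/home/alice/ws"), stored))  # /home/alice/ws/data/raw/dem.tif
```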
Project simulation catalog – <project>/catalog.duckdb#
Holds the science output: simulation metadata, parameters, metrics,
per-sim provenance, calibration history, workflow ledger and audit
trail. Scoped to one project (typically one catchment) and
irreplaceable. Each simulation gets a row plus per-simulation Zarr and
Parquet artefacts under <project>/simulations/.
Tables: 26 (dim_*, solvers, statuses, flow_regimes, mesh_topologies,
simulations, parameters, metrics, metric_definitions, observations,
observation_points, provenance, runs_environment, audit_log, deletions,
tracked_files, geographic_features, geographic_metadata, parquet_files,
tags, stations, calibration_sessions, calibration_iterations,
workflow_steps) plus the views v_simulation_summary and
v_metrics_pivot.
Exposed through:
hydromodpy.results.catalog.SimulationCatalog
hydromodpy.results.run.Run
hydromodpy.results.simulation_group.SimulationGroup
hmp.open(project_path) door (returns the SimulationCatalog; cat.find(...), cat.frame, cat.latest(), cat[ref])
CLI verb hmp catalog ...
Provenance bridge#
Each simulation records, in its provenance rows, which input-cache
entries it consumed. run.input_entries() walks the bridge to list
them, and entry.used_by() returns the simulations that referenced a
given entry. Cross-workspace lookups go through the machine index.
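The two directions of the bridge can be pictured as two queries over one link table. The sketch below is an assumption-laden illustration, not the actual schema: it uses sqlite3 from the standard library in place of DuckDB, a hypothetical `provenance(simulation_id, entry_id)` table, and free functions standing in for the `run.input_entries()` and `entry.used_by()` methods.

```python
from __future__ import annotations

import sqlite3


def make_bridge() -> sqlite3.Connection:
    """Create an in-memory link table: one row per (simulation, input entry) pair."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE provenance (simulation_id TEXT, entry_id TEXT)")
    return con


def record_input(con: sqlite3.Connection, simulation_id: str, entry_id: str) -> None:
    """Record that a simulation consumed an input-cache entry."""
    con.execute("INSERT INTO provenance VALUES (?, ?)", (simulation_id, entry_id))


def input_entries(con: sqlite3.Connection, simulation_id: str) -> list[str]:
    """Forward direction: entries consumed by one simulation."""
    rows = con.execute(
        "SELECT entry_id FROM provenance WHERE simulation_id = ? ORDER BY entry_id",
        (simulation_id,),
    ).fetchall()
    return [r[0] for r in rows]


def used_by(con: sqlite3.Connection, entry_id: str) -> list[str]:
    """Reverse direction: simulations that referenced one entry."""
    rows = con.execute(
        "SELECT simulation_id FROM provenance WHERE entry_id = ? ORDER BY simulation_id",
        (entry_id,),
    ).fetchall()
    return [r[0] for r in rows]
```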
Why three scopes#
Machine index – cross-workspace discovery without copying data.
Workspace cache – input sharing between projects that cover the same geographic area.
Project catalog – irreplaceable science output that warrants its own backup policy and stays writable while other projects keep using the same workspace cache.
Three scopes match three distinct lifecycles: the index is fully recreatable from registered workspaces; the cache is purgeable and reconstructible from upstream sources; the project catalog is the only SQL store holding output that cannot be regenerated without re-running the simulations.
Unified architecture#
A single set of patterns governs the three databases:
Single migrations runner: hydromodpy.core.migrations.runner exposes apply_migrations(db_path, migrations_dir) (with a <db_path>.lock filelock to serialise concurrent callers) and is used by all three databases. Each scope owns a flat migrations/ directory containing one 0001_initial.sql.
Per-scope doors: hydromodpy.catalog exposes each database through its own door: hmp.open(ws) returns the project SimulationCatalog, hmp.index() federates the machine-wide scope, and InputsNamespace reaches the input cache. Users write hmp.open(ws).find(solver="modflow6") on the catalog itself.
Backend Protocol: hydromodpy.results.catalog.ports.CatalogBackend is a typing.Protocol with execute / query / fetch_one / fetch_all / insert / upsert / transaction / close. The catalog mixins call the protocol so swapping the adapter does not touch call sites.
Authentication Protocol: hydromodpy.core.auth.AuthBackend exposes a structural current_user / can_read / can_write surface with LocalAuthBackend as the V1 default.
URI-aware paths: every workspace / cache / state argument is typed Path | UPath. The runtime accepts local paths and file:// URIs; any other scheme raises NotImplementedError.
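The Backend Protocol item relies on structural typing: a call site depends on method names only, so any adapter with a matching surface can be swapped in. The sketch below shows the idea on a deliberately reduced, hypothetical subset of the surface (execute, fetch_all, close). The signatures, both adapter classes, and the `count_by_solver` call site are assumptions made for illustration, and sqlite3 again stands in for DuckDB.

```python
from __future__ import annotations

import sqlite3
from typing import Any, Protocol, Sequence, runtime_checkable


@runtime_checkable
class CatalogBackend(Protocol):
    """Reduced, assumed subset of the backend surface (signatures are hypothetical)."""

    def execute(self, sql: str, params: Sequence[Any] = ()) -> None: ...
    def fetch_all(self, sql: str, params: Sequence[Any] = ()) -> list[tuple]: ...
    def close(self) -> None: ...


class SqliteBackend:
    """Stand-in adapter; it satisfies the protocol without inheriting from it."""

    def __init__(self, path: str = ":memory:") -> None:
        self._con = sqlite3.connect(path)

    def execute(self, sql: str, params: Sequence[Any] = ()) -> None:
        self._con.execute(sql, params)
        self._con.commit()

    def fetch_all(self, sql: str, params: Sequence[Any] = ()) -> list[tuple]:
        return self._con.execute(sql, params).fetchall()

    def close(self) -> None:
        self._con.close()


class RecordingBackend:
    """Second adapter: wraps any backend and logs each read statement."""

    def __init__(self, inner: CatalogBackend) -> None:
        self._inner = inner
        self.log: list[str] = []

    def execute(self, sql: str, params: Sequence[Any] = ()) -> None:
        self._inner.execute(sql, params)

    def fetch_all(self, sql: str, params: Sequence[Any] = ()) -> list[tuple]:
        self.log.append(sql)
        return self._inner.fetch_all(sql, params)

    def close(self) -> None:
        self._inner.close()


def count_by_solver(backend: CatalogBackend, solver: str) -> int:
    """Call site written against the protocol only, never a concrete adapter."""
    rows = backend.fetch_all(
        "SELECT COUNT(*) FROM simulations WHERE solver = ?", (solver,)
    )
    return rows[0][0]
```

Passing either adapter to `count_by_solver` leaves the call site unchanged, which is the property the protocol is meant to guarantee.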
Refactor outcomes#
The catalog stack used to concentrate three god-classes (each above 1000 LOC). V1 split them into single-concern modules:
SimulationZarr -> simulation_zarr (facade) + zarr_schema + zarr_writer + zarr_reader + zarr_finalizer.
DataCatalogDuckDB -> catalog_duckdb (facade) + cache_store + cache_queries + cache_lifecycle.
WritesMixin -> writes (facade) + writes_duckdb + writes_parquet + writes_zarr + writes_helpers.
The original public symbols (SimulationZarr, DataCatalogDuckDB,
SimulationCatalog, WritesMixin) keep their import path and
their full API; golden tests pin bit-identical artefacts before and
after each split.
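A golden test of this kind can be as simple as comparing content digests of two artefact trees. The following is a generic sketch of that technique, not the project's test suite; the function names and the directory layout are hypothetical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def artefact_digests(root: Path) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def assert_bit_identical(golden: Path, candidate: Path) -> None:
    """Fail if any file was added, removed, or changed by a single byte."""
    assert artefact_digests(golden) == artefact_digests(candidate), "artefacts differ"
```

Comparing digests keyed by relative path catches missing and extra files as well as byte-level changes.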