A project is defined by four YAML files. load_project() deep-merges them in order (bronze → silver → gold into main) and validates the result before the pipeline runs.
pipeline:name:my_project# (required) pipeline name — used in dlt and gold output dirsincludes:bronze:bronze.yaml# (required) filename of the bronze config in this foldersilver:silver.yaml# (required) filename of the silver configgold:gold.yaml# (required) filename of the gold configpaths:bronze:./data/bronze# (required) bronze output directory — local or s3://silver:./data/silver# (required) silver output directorygold:./data/gold# (required) gold output directoryexport:./data/export# (required) export output directorybi_export:# (optional) BI export configenabled:trueprojects:-name:defaulttables:-summary.parquet
Any string value in any YAML file may contain ${VAR} or ${VAR:-default} placeholders. Expansion happens after all four files are merged.
# Raises EnvironmentError if ORACLE_USER is not setconnection_string:"oracle+oracledb://${ORACLE_USER}:${ORACLE_PASSWORD}@${ORACLE_DSN}"# Falls back to "us-east-1" if AWS_DEFAULT_REGION is not setbucket_url:"s3://${BUCKET:-my-default-bucket}/data"
openmedallion reads the YAML file, looks up the dialect key, and builds the
SQLAlchemy connection string internally. Connection is tested and available
tables are printed before ingestion starts.
# /workspace/secrets.yaml (keep outside repo — never commit real credentials)oracle:host:db.corp.comport:1521# defaultservice:XEusername:hrpassword:secretpostgres:host:pg.corp.comport:5432# defaultdatabase:analyticsusername:etlpassword:secretsqlite:path:data/mydb.db# relative to CWD where medallion runs
Set the path and dialect via env vars (supports ${VAR:-default} expansion):
Column names to ingest. Omit to ingest all columns. For SQL sources the projection is pushed to the database; for local_files it is applied after reading. Always include the cursor_column when using append mode.
incremental
object
—
Omit for full-replace each run.
incremental.mode
enum
—
replace (default), append, or merge.
incremental.cursor_column
string
append
Column used to track the high-watermark. Always include this column in select.
incremental.initial_value
string
append
Value to use on the very first run.
incremental.primary_key
string or list
merge
Column(s) that uniquely identify a row in the Parquet file (used to build the file index).
incremental.merge_key
string or list
merge
Column(s) used to match incoming rows against existing rows. Usually identical to primary_key.
explore
list
—
List of report specs to generate after this table is written. See Inline explore.
Small lookup/reference tables. Full re-pull and overwrite every run. Safe when source rows can be deleted.
—
append
Immutable event or log tables. Fetches only rows newer than the cursor and appends to the existing Parquet. No deduplication.
cursor_column, initial_value
merge
Tables where rows can be updated after their first write (e.g. a permit whose status changes over time). Upserts by merge_key — updates matching rows, inserts new ones.
primary_key, merge_key
merge and cursor_column can be combined: the cursor limits how many rows are fetched from the source database, while merge_key controls how they land in the Parquet file. This is the right pattern for large tables where only recent rows are added but existing rows may also change.
# Large table — limit source pull by cursor, upsert by primary key-name:folderincremental:mode:mergeprimary_key:folderrsnmerge_key:folderrsncursor_column:indateinitial_value:"2025-01-0100:00:00"# Lookup table — full re-pull every run, upsert to avoid duplicates across runs-name:validstatusincremental:mode:mergeprimary_key:statuscodemerge_key:statuscode# Composite key — child records keyed on two columns-name:folderprocessincremental:mode:mergeprimary_key:[folderrsn,processrsn]merge_key:[folderrsn,processrsn]# Append-only event log-name:audit_logincremental:mode:appendcursor_column:created_atinitial_value:"2024-01-01T00:00:00"
bronze_to_silver:# Optional: register all silver Parquet files as DuckDB views or tables# after the silver run completes.duckdb:enabled:true# false or omit to skip (default: false)path:data/silver/silver.duckdb# where to write the .duckdb filemode:views# views (default) | tablestables:-source_file:ORDERS.parquet# filename in bronze directoryoutput_file:orders.parquet# filename to write in silver directorytransforms:-type:renamecolumns:ORDER_ID:order_idUPDATED_AT:updated_at-type:castcolumns:order_id:Int64amount:Float64-type:dropcolumns:[INTERNAL_COL,DEBUG_COL]-type:udffile:projects/my_project/udf/silver/base.pyfunction:enrich_ordersargs:threshold:500.0derived_tables:-output_file:order_lines_enriched.parquetudf:file:projects/my_project/udf/silver/derived.pyfunction:build_enrichedselect:[order_id,product_id,line_revenue]# optional column projection
Each entry must have a type field. Valid types: rename, cast, drop, udf, fillna, clip, normalize, deduplicate, filter_rows, map_values, allowed_values, coerce_bool.
Validates string columns against an allowlist. Values not in the list become null (the row is kept). To drop out-of-range rows instead, use filter_rows.
silver_to_gold:# Optional: register all gold Parquet files as DuckDB views or tables# after the gold run completes. All projects are registered into one file.duckdb:enabled:true# false or omit to skip (default: false)path:data/gold/gold.duckdb# where to write the .duckdb filemode:tables# tables recommended for gold (shareable)projects:-name:analytics# subfolder under paths.goldaggregations:-source_file:orders.parquet# silver filename to readpre_agg_udf:# (optional) runs before group_byfile:projects/my_project/udf/gold/transforms.pyfunction:prepare_ordersargs:include_region:truegroup_by:[customer_id,region]# columns to group onmetrics:-{column:order_id, agg:count, alias:total_orders}-{column:amount, agg:sum, alias:total_revenue}-{column:amount, agg:mean, alias:avg_order_value}-{column:amount, agg:median, alias:median_order_value}-{column:amount, agg:count_distinct, alias:unique_amounts}having:"total_orders>1"# post-agg filtersort:columns:[total_revenue]descending:truelimit:100output_file:customer_summary.parquet-source_file:orders.parquet# pass-through (no group_by)select:[order_id,amount,status]output_file:orders_flat.parquet
Optional block. Same semantics as bronze_to_silver.duckdb. When enabled: true, Parquet files from all gold project subdirectories are registered into a single .duckdb file after the gold run completes.
Key
Type
Default
Description
enabled
bool
false
Set to true to activate DuckDB registration.
path
string
—
Path to write the .duckdb file.
mode
enum
views
views or tables. Recommend tables for gold (shareable).
Add an explore: list directly to any table entry in bronze.yaml, silver.yaml, or gold.yaml. Reports are generated immediately after the Parquet file for that table is written — no separate explore.yaml or paths.explore key needed.
# bronze.yaml — on a source table-name:employeesexplore:-report_type:profileoutput_file:employees_profile.htmltitle:"EmployeesBronzeQuality"# silver.yaml — on a transformed table-source_file:employees.parquetexplore:-report_type:walkeroutput_file:employees_explorer.htmltitle:"EmployeesExplorer"# gold.yaml — on an aggregationaggregations:-output_file:headcount.parquetexplore:-report_type:profileoutput_file:headcount_profile.htmltitle:"HeadcountbyDepartment"
If the optional dependency is not installed, the report is silently skipped with a notice. An unknown report_type also prints a warning and skips without aborting the run.