OpenMedallion

Declarative medallion pipelines in pure open-source Python — local first, cloud portable, fast by default.

OpenMedallion is an opinionated open-source library for building Bronze → Silver → Gold data warehouse and lakehouse pipelines using dlt, Polars, and Hamilton — without depending on expensive enterprise platforms or proprietary tooling.

Why OpenMedallion?

Modern open-source data tools are individually excellent, but combining them into a production-ready medallion architecture is still a fragmented, do-it-yourself effort.

You already have great tools for ingestion, transformation, loading, orchestration, and validation. But you still have to stitch everything together yourself — writing glue code, defining project structure, creating naming conventions, managing layer boundaries, and maintaining all of it over time.

OpenMedallion exists to reduce that friction.

Without OpenMedallion	With OpenMedallion
Glue code per project	Convention-driven project layout
Ad-hoc layer boundaries	Enforced Bronze / Silver / Gold contracts
Inline transforms	Composable Python UDFs
Manual orchestration	Hamilton DAG — wired automatically
Cloud-only dev loop	Local Parquet first, S3 with one config change

Quickstart

pip install openmedallion

medallion init my_project       # scaffold: YAML configs + UDF stubs + kestra_flow.yml
medallion run my_project        # Bronze → Silver → Gold in one command
medallion run my_project --layer silver   # re-run a single layer
medallion dag                   # print the Hamilton DAG
medallion serve                 # launch the live pipeline tracker UI

Key Features

Declarative YAML config — define pipeline layers without writing boilerplate
Incremental loads — append and merge modes via dlt cursor columns and primary keys
Composable UDFs — drop Python functions into udf/silver/ or udf/gold/; no new framework to learn
Live DAG tracker — Hamilton-powered web UI to visualise and monitor execution
Local first — run the full pipeline against Parquet files with zero cloud credentials
Cloud portable — swap filesystem for S3 in one line; logic stays unchanged
Source agnostic — any dlt source: SQL databases, REST APIs, filesystems, and more
Fast by default — Polars for all transforms; no pandas bottlenecks

How It Works

OpenMedallion wires three best-in-class open-source tools under a unified declarative config:

YAML config
    │
    ▼
Hamilton DAG           ← orchestrates which layer runs and in what order
    │
    ├── Bronze  (dlt)     ← ingests raw data from any source into Parquet
    ├── Silver  (Polars)  ← typed UDF transforms: rename, cast, filter, enrich
    └── Gold    (Polars)  ← YAML-declared group-by aggregations + window metrics

Layer	Tool	Role
🟤 Bronze	dlt	Schema-inferred raw load from any source
⚪ Silver	Polars	Typed, composable Python UDFs
🟡 Gold	Polars	YAML-declared group-by metrics
📤 Export	Polars	Parquet + CSV for BI tools
🔗 Orchestration	Hamilton	DAG wiring with live web tracker
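The layer ordering in the diagram above can be pictured as a small dependency graph. The following is only an illustrative sketch using the standard library's graphlib, not Hamilton's API or OpenMedallion's internals; the `LAYER_DEPENDENCIES` mapping and the `execution_order` helper are hypothetical names, and placing Export after Gold is an assumption.

```python
from graphlib import TopologicalSorter

# Each layer maps to the set of layers it depends on.
# Assumption: Export reads from Gold, so it runs last.
LAYER_DEPENDENCIES: dict[str, set[str]] = {
    "bronze": set(),
    "silver": {"bronze"},
    "gold": {"silver"},
    "export": {"gold"},
}


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the layers in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


print(execution_order(LAYER_DEPENDENCIES))
# → ['bronze', 'silver', 'gold', 'export']
```

Because each layer depends only on the one before it, the order is fully determined; a real DAG with several tables per layer would allow independent nodes to run in any order.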

Installation

pip install openmedallion

Optional extras:

pip install "openmedallion[s3]"    # S3 support via s3fs + boto3
pip install "openmedallion[viz]"   # DAG visualisation via graphviz

Requires Python 3.11+

Project Structure

medallion init my_project generates a complete, ready-to-run project:

my_project/
├── main.yaml                    # pipeline name + layer includes + paths
├── backend/
│   ├── bronze.yaml              # source connection + incremental config
│   ├── silver.yaml              # table transforms (rename, cast, filter, UDFs)
│   ├── gold.yaml                # aggregations (group_by + metrics + window fns)
│   └── udf/
│       ├── silver/              # Python UDFs called from silver.yaml
│       └── gold/                # Python UDFs called from gold.yaml
├── frontend/                    # dashboard files (Tableau, Power BI, etc.)
├── data/                        # gitignored pipeline outputs
├── summary/                     # analysis write-ups
├── kestra_flow.yml              # Kestra orchestration flow — mount via docker-compose.yml
└── README.md                    # pre-filled project documentation template

Configuration

main.yaml — declare your layers and data paths:

pipeline:
  name: customer_warehouse

includes:
  bronze: bronze.yaml
  silver: silver.yaml
  gold:   gold.yaml

paths:
  bronze: "./data/bronze"
  silver: "./data/silver"
  gold:   "./data/gold"
  export: "./data/export"

silver.yaml — declarative transforms with optional UDFs:

bronze_to_silver:
  tables:
    - source_file: ORDERS.parquet
      output_file: orders.parquet
      transforms:
        - type: rename
          columns:
            ORDER_ID:    order_id
            CUSTOMER_ID: customer_id
        - type: cast
          columns:
            order_id: Int64
            amount:   Float64
        - type: udf
          file: udf/silver/enrich.py
          function: flag_large_orders
          args:
            threshold: 500.0

gold.yaml — YAML-declared aggregations:

silver_to_gold:
  projects:
    - name: customer_warehouse
      aggregations:
        - source_file: orders.parquet
          group_by: [customer_id]
          metrics:
            - {column: order_id, agg: count, alias: total_orders}
            - {column: amount,   agg: sum,   alias: total_spent}
          output_file: customer_summary.parquet

Python UDFs

Business logic stays in plain Python — no custom DSL, no magic.

# udf/silver/enrich.py
import polars as pl

def flag_large_orders(df: pl.DataFrame, threshold: float = 500.0) -> pl.DataFrame:
    return df.with_columns(
        (pl.col("amount") >= threshold).alias("is_large_order")
    )

Drop the file next to your config, reference it in silver.yaml, done.

Incremental Loads

OpenMedallion supports dlt's native incremental strategies out of the box:

# bronze.yaml
source:
  type: sql_database
  dialect: sqlite
  connection_string: "sqlite:///data/mydb.db"
  tables:
    - name: orders
      incremental:
        mode: append          # cursor-based — only new rows
        cursor_column: created_at
        initial_value: "2024-01-01"
    - name: customers
      incremental:
        mode: merge           # upsert — handles updates + deletes
        primary_key: customer_id

dlt tracks cursor state automatically, so re-running the Bronze layer loads only the delta since the previous run.
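The two modes can be pictured with plain Python lists. This is a conceptual sketch only, not dlt's implementation: dlt's boundary handling, deduplication, and state storage are more involved, and `append_new_rows` and `merge_rows` are hypothetical names.

```python
from typing import Any

Row = dict[str, Any]


def append_new_rows(
    source: list[Row], cursor_column: str, last_value: Any
) -> tuple[list[Row], Any]:
    """Cursor-based append: keep rows past the stored cursor, then advance it."""
    new_rows = [row for row in source if row[cursor_column] > last_value]
    new_cursor = max((row[cursor_column] for row in new_rows), default=last_value)
    return new_rows, new_cursor


def merge_rows(target: list[Row], incoming: list[Row], primary_key: str) -> list[Row]:
    """Upsert: incoming rows replace target rows that share a primary key."""
    merged = {row[primary_key]: row for row in target}
    for row in incoming:
        merged[row[primary_key]] = row
    return list(merged.values())


orders = [
    {"id": 1, "created_at": "2023-12-31"},
    {"id": 2, "created_at": "2024-01-15"},
    {"id": 3, "created_at": "2024-02-01"},
]
# ISO date strings compare correctly as plain strings.
first_load, cursor = append_new_rows(orders, "created_at", "2024-01-01")
second_load, cursor = append_new_rows(orders, "created_at", cursor)
print(len(first_load), len(second_load))
# → 2 0
```

The second call returns nothing because the cursor has already advanced past every row, which is the behaviour the `append` mode relies on.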

Scheduling with Kestra

medallion init generates a kestra_flow.yml inside every new project — a ready-to-use Kestra flow that orchestrates bronze → silver → gold with per-task observability and retry support.

1. Start a local Kestra server

# from the repo root — requires Docker
make kestra-up
# UI available at http://localhost:8080

2. Register a project flow

Add one volume mount to the kestra service in docker-compose.yml:

- ./my_project/kestra_flow.yml:/app/flows/my_project.yml

Kestra picks up the file automatically on the next make kestra-up — no copying needed.

3. Trigger a run

From the UI at http://localhost:8080, or via the API:

curl -X POST \
  http://localhost:8080/api/v1/executions/openmedallion.projects/my_project
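If you would rather trigger the run from Python than from curl, the same request can be built with the standard library. This is a minimal sketch that assumes the local server and the namespace and flow id from the curl example; `build_trigger_request` is a hypothetical helper, and the actual send is left commented out because it needs a running Kestra instance.

```python
import urllib.request


def build_trigger_request(
    namespace: str = "openmedallion.projects",
    flow_id: str = "my_project",
    base_url: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build the POST request that starts one execution of the given flow."""
    url = f"{base_url}/api/v1/executions/{namespace}/{flow_id}"
    return urllib.request.Request(url, method="POST")


request = build_trigger_request()
print(request.get_method(), request.full_url)

# To send it against a running Kestra server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```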

4. Enable scheduled refresh

Uncomment the triggers: block in kestra_flow.yml:

triggers:
  - id: daily_refresh
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 6 * * *"   # every day at 06:00 UTC

Restart with make kestra-up; Kestra loads the updated flow on startup.
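To check when a `0 6 * * *` schedule will next fire, the arithmetic is simple enough to sketch by hand. The helper below is illustrative only and handles just the daily "minute hour * * *" shape evaluated in UTC; `next_daily_run` is a hypothetical name and expects a timezone-aware datetime.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now: datetime, hour: int = 6, minute: int = 0) -> datetime:
    """Return the next time a daily `minute hour * * *` cron fires, in UTC."""
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(
        now_utc.date(), time(hour, minute), tzinfo=timezone.utc
    )
    if candidate <= now_utc:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


before = datetime(2026, 4, 1, 5, 30, tzinfo=timezone.utc)
after = datetime(2026, 4, 1, 6, 0, tzinfo=timezone.utc)
print(next_daily_run(before).isoformat())
# → 2026-04-01T06:00:00+00:00
print(next_daily_run(after).isoformat())
# → 2026-04-02T06:00:00+00:00
```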

Kestra vs GitHub Actions

	Kestra	GitHub Actions
Best for	Recurring pipeline runs, local/on-prem data	CI tests + PyPI publish on tag push
Scheduling	Cron + backfill	Cron only, no backfill
Observability	Per-task logs, run history, retry from failed task	Flat job log
Infrastructure	Self-hosted Docker	GitHub-managed runners

Recommended split: GitHub Actions for CI + publish; Kestra for pipeline scheduling.

Examples

Three self-contained examples — no cloud credentials required. See examples/README.md for a side-by-side comparison.

Example	Tables	What it demonstrates
`local_parquet_demo/`	1	Zero-credential quickstart: full Bronze → Silver → Gold with local Parquet files
`incremental_sql_demo/`	2	Incremental append + merge from SQLite; delta load simulation
`ecommerce_analytics_demo/`	3	Multi-table joins, margin analysis, and monthly trends — most complete example

When to Use OpenMedallion

A great fit if you:

Want a standard medallion project layout without inventing one from scratch
Prefer YAML-first config with Python escape hatches for complex logic
Need local-first development that can scale to S3 with minimal changes
Want full ownership of your code and infrastructure
Are building on a tight budget without enterprise platform procurement

Not a fit if you need:

A full enterprise data platform (Databricks, Snowflake, BigQuery)
A no-code or drag-and-drop ETL tool
A universal framework for every possible pipeline architecture

Tradeoffs

You get	You accept
Lower cost — fully open-source	More engineering responsibility than a managed platform
Full control over code and infrastructure	Initial setup and config learning curve
No vendor lock-in	You own the infrastructure decisions
Transparent, inspectable pipeline	Not a drag-and-drop tool

Roadmap

Item	Status
Bronze / Silver / Gold pipeline	✅ 2026.4.1
Hamilton DAG + live tracker	✅ 2026.4.1
Local Parquet + S3 storage	✅ 2026.4.1
Incremental append + merge	✅ 2026.4.1
CLI scaffolding (`medallion init`)	✅ 2026.4.1
PyPI publish (OIDC trusted publishing)	✅ 2026.4.1
LazyFrame UDF contract	🔜 2026.5
Schema contract enforcement	🔜 2026.6
Lineage + metadata helpers	🔜 2026.6
Additional cloud destinations	🔜 2026.6

Contributing

Contributions are welcome. Good areas to contribute:

Bug fixes and edge-case handling
Documentation improvements and example additions
Tests and coverage
New pipeline templates
New source or destination adapters
CLI enhancements

If you are interested in open-source data architecture, your help is appreciated.

License

MIT — free to use, modify, and distribute.

If OpenMedallion looks useful, consider starring the repo — it helps others find it.