Lakehouse Architecture — The Best of Both Worlds

Overview

Lakehouse architecture, pioneered by Databricks, unifies the best of Data Lakes and Data Warehouses. It stores data in open formats (Parquet) on low-cost object storage while adding transactional guarantees, schema enforcement, and time travel — features previously only available in proprietary warehouses. Open table formats like Delta Lake, Apache Iceberg, and Hudi make this possible by maintaining metadata layers that track changes, partitions, and statistics without locking data into a vendor-specific format.

When to Use

You need warehouse reliability (ACID, schema enforcement) with lake economics
Vendor lock-in in proprietary warehouses is a concern
You want one storage layer for both BI and ML workloads
Time travel and auditability are required
Data is consumed by multiple engines (Spark, Presto, DuckDB, Snowflake)

Lake vs Warehouse vs Lakehouse

Feature	Data Lake	Data Warehouse	Lakehouse
Storage cost	Low (object storage)	High (proprietary)	Low (object storage)
ACID transactions	No	Yes	Yes
Schema enforcement	No	Yes	Yes
Time travel	No	Limited	Yes
Open formats	Yes	No	Yes
ML/AI support	Excellent	Limited	Excellent
BI query performance	Slow	Fast	Fast (with indexing)

Open Table Formats

Format	Key Feature	Best For
Delta Lake	ACID, time travel, schema enforcement	Spark ecosystems, streaming
Apache Iceberg	Hidden partitioning, partition evolution	Cloud-native, multi-engine
Apache Hudi	Incremental processing, record-level updates	CDC, near-real-time ingestion

Delta Lake Example

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("LakehouseDelta") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create a Delta table with schema enforcement
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.orders (
        order_id STRING,
        customer_id STRING,
        total DECIMAL(10,2),
        status STRING,
        created_at TIMESTAMP
    ) USING DELTA
    LOCATION 's3://lakehouse/bronze/orders'
""")

# Insert with ACID guarantees
spark.sql("""
    INSERT INTO bronze.orders
    VALUES ('ord-001', 'cust-123', 99.99, 'placed', current_timestamp())
""")

# Update in place (not possible in plain Parquet)
spark.sql("""
    UPDATE bronze.orders
    SET status = 'shipped'
    WHERE order_id = 'ord-001'
""")

# Time travel — query previous state
spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("s3://lakehouse/bronze/orders") \
    .show()

Apache Iceberg Example

from pyiceberg.catalog import load_catalog

# Load a REST catalog (e.g., Tabular, AWS Glue)
catalog = load_catalog("rest", **{
    "uri": "https://catalog.example.com",
    "warehouse": "s3://lakehouse/"
})

# Create a table with hidden partitioning
table = catalog.create_table(
    identifier="silver.events",
    schema=Schema(
        NestedField(1, "event_id", StringType()),
        NestedField(2, "event_type", StringType()),
        NestedField(3, "timestamp", TimestampType()),
        NestedField(4, "payload", StringType())
    ),
    partition_spec=PartitionSpec(
        PartitionField(source_id=3, field_id=1000, transform=Day(), name="day")
    )
)

# Append data
table.append(df)

# Query with partition evolution (change partitioning without rewriting data)
with table.update_spec() as update:
    update.add_field("hour", Hour(source_column="timestamp"))

Apache Hudi Example

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HudiExample") \
    .getOrCreate()

# Write with record-level upserts
df.write.format("hudi") \
    .option("hoodie.table.name", "customer_orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save("s3://lakehouse/hudi/customer_orders")

# Incremental read — only new/changed records since last sync
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
    .load("s3://lakehouse/hudi/customer_orders")

Medallion Architecture with Lakehouse

Bronze (Raw)
    ├── Delta Lake / Iceberg / Hudi tables
    ├── Minimal transformation
    └── Full history retained

Silver (Cleaned)
    ├── Deduplicated, typed, validated
    ├── Joined with reference data
    └── Quality checks enforced

Gold (Curated)
    ├── Aggregated business metrics
    ├── Optimized for BI consumption
    └── Managed as data products

Common Mistakes

Treating lakehouse as just storage — the table format layer is critical; without it, you just have a lake
Over-optimizing prematurely — start with one format (Delta is most mature), add others only if needed
No compaction strategy — small files kill performance; schedule regular compaction jobs
Ignoring metastore — Glue, Hive, or Unity Catalog are essential for table discovery and governance
Mixing table formats in one pipeline — each format has different semantics; mixing them adds complexity

FAQ

Which table format should I choose?

Delta Lake: Best for Spark-centric pipelines and streaming workloads
Iceberg: Best for multi-engine environments and partition evolution
Hudi: Best for CDC and incremental data processing

Can I query lakehouse tables from Snowflake or BigQuery? Yes. Snowflake supports Iceberg tables natively. BigQuery supports BigLake tables over Iceberg and Delta via connectors.

Is Lakehouse cheaper than a traditional warehouse? Storage is significantly cheaper (S3/GCS vs proprietary storage). Compute costs depend on engine choice. Overall TCO is usually lower, especially for large datasets.

Lakehouse Architecture — The Best of Both Worlds

Overview

When to Use

Lake vs Warehouse vs Lakehouse

Open Table Formats

Delta Lake Example

Apache Iceberg Example

Apache Hudi Example

Medallion Architecture with Lakehouse

Common Mistakes

FAQ

Data Lake vs Data Warehouse — Architecture Guide

Data Mesh Architecture — Decentralized Data Ownership

Overview

When to Use

Lake vs Warehouse vs Lakehouse

Open Table Formats

Delta Lake Example

Apache Iceberg Example

Apache Hudi Example

Medallion Architecture with Lakehouse

Common Mistakes

FAQ

Related Resources

Data Lake vs Data Warehouse — Architecture Guide

Data Mesh Architecture — Decentralized Data Ownership