Skip to content
SP StackPractices
intermediate By StackPractices

Lakehouse Architecture — The Best of Both Worlds

A practical guide to Lakehouse architecture: combining data lake storage flexibility with data warehouse reliability using open table formats like Delta Lake, Apache Iceberg, and Hudi.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Lakehouse architecture, pioneered by Databricks, unifies the best of Data Lakes and Data Warehouses. It stores data in open formats (Parquet) on low-cost object storage while adding transactional guarantees, schema enforcement, and time travel — features previously only available in proprietary warehouses. Open table formats like Delta Lake, Apache Iceberg, and Hudi make this possible by maintaining metadata layers that track changes, partitions, and statistics without locking data into a vendor-specific format.

When to Use

  • You need warehouse reliability (ACID, schema enforcement) with lake economics
  • Vendor lock-in in proprietary warehouses is a concern
  • You want one storage layer for both BI and ML workloads
  • Time travel and auditability are required
  • Data is consumed by multiple engines (Spark, Presto, DuckDB, Snowflake)

Lake vs Warehouse vs Lakehouse

FeatureData LakeData WarehouseLakehouse
Storage costLow (object storage)High (proprietary)Low (object storage)
ACID transactionsNoYesYes
Schema enforcementNoYesYes
Time travelNoLimitedYes
Open formatsYesNoYes
ML/AI supportExcellentLimitedExcellent
BI query performanceSlowFastFast (with indexing)

Open Table Formats

FormatKey FeatureBest For
Delta LakeACID, time travel, schema enforcementSpark ecosystems, streaming
Apache IcebergHidden partitioning, partition evolutionCloud-native, multi-engine
Apache HudiIncremental processing, record-level updatesCDC, near-real-time ingestion

Delta Lake Example

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("LakehouseDelta") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create a Delta table with schema enforcement
spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.orders (
        order_id STRING,
        customer_id STRING,
        total DECIMAL(10,2),
        status STRING,
        created_at TIMESTAMP
    ) USING DELTA
    LOCATION 's3://lakehouse/bronze/orders'
""")

# Insert with ACID guarantees
spark.sql("""
    INSERT INTO bronze.orders
    VALUES ('ord-001', 'cust-123', 99.99, 'placed', current_timestamp())
""")

# Update in place (not possible in plain Parquet)
spark.sql("""
    UPDATE bronze.orders
    SET status = 'shipped'
    WHERE order_id = 'ord-001'
""")

# Time travel — query previous state
spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("s3://lakehouse/bronze/orders") \
    .show()

Apache Iceberg Example

from pyiceberg.catalog import load_catalog

# Load a REST catalog (e.g., Tabular, AWS Glue)
catalog = load_catalog("rest", **{
    "uri": "https://catalog.example.com",
    "warehouse": "s3://lakehouse/"
})

# Create a table with hidden partitioning
table = catalog.create_table(
    identifier="silver.events",
    schema=Schema(
        NestedField(1, "event_id", StringType()),
        NestedField(2, "event_type", StringType()),
        NestedField(3, "timestamp", TimestampType()),
        NestedField(4, "payload", StringType())
    ),
    partition_spec=PartitionSpec(
        PartitionField(source_id=3, field_id=1000, transform=Day(), name="day")
    )
)

# Append data
table.append(df)

# Query with partition evolution (change partitioning without rewriting data)
with table.update_spec() as update:
    update.add_field("hour", Hour(source_column="timestamp"))

Apache Hudi Example

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HudiExample") \
    .getOrCreate()

# Write with record-level upserts
df.write.format("hudi") \
    .option("hoodie.table.name", "customer_orders") \
    .option("hoodie.datasource.write.recordkey.field", "order_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save("s3://lakehouse/hudi/customer_orders")

# Incremental read — only new/changed records since last sync
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
    .load("s3://lakehouse/hudi/customer_orders")

Medallion Architecture with Lakehouse

Bronze (Raw)
    ├── Delta Lake / Iceberg / Hudi tables
    ├── Minimal transformation
    └── Full history retained

Silver (Cleaned)
    ├── Deduplicated, typed, validated
    ├── Joined with reference data
    └── Quality checks enforced

Gold (Curated)
    ├── Aggregated business metrics
    ├── Optimized for BI consumption
    └── Managed as data products

Common Mistakes

  • Treating lakehouse as just storage — the table format layer is critical; without it, you just have a lake
  • Over-optimizing prematurely — start with one format (Delta is most mature), add others only if needed
  • No compaction strategy — small files kill performance; schedule regular compaction jobs
  • Ignoring metastore — Glue, Hive, or Unity Catalog are essential for table discovery and governance
  • Mixing table formats in one pipeline — each format has different semantics; mixing them adds complexity

FAQ

Which table format should I choose?

  • Delta Lake: Best for Spark-centric pipelines and streaming workloads
  • Iceberg: Best for multi-engine environments and partition evolution
  • Hudi: Best for CDC and incremental data processing

Can I query lakehouse tables from Snowflake or BigQuery? Yes. Snowflake supports Iceberg tables natively. BigQuery supports BigLake tables over Iceberg and Delta via connectors.

Is Lakehouse cheaper than a traditional warehouse? Storage is significantly cheaper (S3/GCS vs proprietary storage). Compute costs depend on engine choice. Overall TCO is usually lower, especially for large datasets.