Lakehouse Architecture — The Best of Both Worlds
A practical guide to Lakehouse architecture: combining data lake storage flexibility with data warehouse reliability using open table formats like Delta Lake, Apache Iceberg, and Hudi.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Lakehouse architecture, pioneered by Databricks, unifies the best of Data Lakes and Data Warehouses. It stores data in open formats (Parquet) on low-cost object storage while adding transactional guarantees, schema enforcement, and time travel — features previously only available in proprietary warehouses. Open table formats like Delta Lake, Apache Iceberg, and Hudi make this possible by maintaining metadata layers that track changes, partitions, and statistics without locking data into a vendor-specific format.
When to Use
- You need warehouse reliability (ACID, schema enforcement) with lake economics
- Vendor lock-in in proprietary warehouses is a concern
- You want one storage layer for both BI and ML workloads
- Time travel and auditability are required
- Data is consumed by multiple engines (Spark, Presto, DuckDB, Snowflake)
Lake vs Warehouse vs Lakehouse
| Feature | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Storage cost | Low (object storage) | High (proprietary) | Low (object storage) |
| ACID transactions | No | Yes | Yes |
| Schema enforcement | No | Yes | Yes |
| Time travel | No | Limited | Yes |
| Open formats | Yes | No | Yes |
| ML/AI support | Excellent | Limited | Excellent |
| BI query performance | Slow | Fast | Fast (with indexing) |
Open Table Formats
| Format | Key Feature | Best For |
|---|---|---|
| Delta Lake | ACID, time travel, schema enforcement | Spark ecosystems, streaming |
| Apache Iceberg | Hidden partitioning, partition evolution | Cloud-native, multi-engine |
| Apache Hudi | Incremental processing, record-level updates | CDC, near-real-time ingestion |
Delta Lake Example
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("LakehouseDelta") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# Create a Delta table with schema enforcement
spark.sql("""
CREATE TABLE IF NOT EXISTS bronze.orders (
order_id STRING,
customer_id STRING,
total DECIMAL(10,2),
status STRING,
created_at TIMESTAMP
) USING DELTA
LOCATION 's3://lakehouse/bronze/orders'
""")
# Insert with ACID guarantees
spark.sql("""
INSERT INTO bronze.orders
VALUES ('ord-001', 'cust-123', 99.99, 'placed', current_timestamp())
""")
# Update in place (not possible in plain Parquet)
spark.sql("""
UPDATE bronze.orders
SET status = 'shipped'
WHERE order_id = 'ord-001'
""")
# Time travel — query previous state
spark.read.format("delta") \
.option("versionAsOf", 0) \
.load("s3://lakehouse/bronze/orders") \
.show()
Apache Iceberg Example
from pyiceberg.catalog import load_catalog
# Load a REST catalog (e.g., Tabular, AWS Glue)
catalog = load_catalog("rest", **{
"uri": "https://catalog.example.com",
"warehouse": "s3://lakehouse/"
})
# Create a table with hidden partitioning
table = catalog.create_table(
identifier="silver.events",
schema=Schema(
NestedField(1, "event_id", StringType()),
NestedField(2, "event_type", StringType()),
NestedField(3, "timestamp", TimestampType()),
NestedField(4, "payload", StringType())
),
partition_spec=PartitionSpec(
PartitionField(source_id=3, field_id=1000, transform=Day(), name="day")
)
)
# Append data
table.append(df)
# Query with partition evolution (change partitioning without rewriting data)
with table.update_spec() as update:
update.add_field("hour", Hour(source_column="timestamp"))
Apache Hudi Example
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("HudiExample") \
.getOrCreate()
# Write with record-level upserts
df.write.format("hudi") \
.option("hoodie.table.name", "customer_orders") \
.option("hoodie.datasource.write.recordkey.field", "order_id") \
.option("hoodie.datasource.write.precombine.field", "updated_at") \
.option("hoodie.datasource.write.operation", "upsert") \
.mode("append") \
.save("s3://lakehouse/hudi/customer_orders")
# Incremental read — only new/changed records since last sync
spark.read.format("hudi") \
.option("hoodie.datasource.query.type", "incremental") \
.option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
.load("s3://lakehouse/hudi/customer_orders")
Medallion Architecture with Lakehouse
Bronze (Raw)
├── Delta Lake / Iceberg / Hudi tables
├── Minimal transformation
└── Full history retained
Silver (Cleaned)
├── Deduplicated, typed, validated
├── Joined with reference data
└── Quality checks enforced
Gold (Curated)
├── Aggregated business metrics
├── Optimized for BI consumption
└── Managed as data products
Common Mistakes
- Treating lakehouse as just storage — the table format layer is critical; without it, you just have a lake
- Over-optimizing prematurely — start with one format (Delta is most mature), add others only if needed
- No compaction strategy — small files kill performance; schedule regular compaction jobs
- Ignoring metastore — Glue, Hive, or Unity Catalog are essential for table discovery and governance
- Mixing table formats in one pipeline — each format has different semantics; mixing them adds complexity
FAQ
Which table format should I choose?
- Delta Lake: Best for Spark-centric pipelines and streaming workloads
- Iceberg: Best for multi-engine environments and partition evolution
- Hudi: Best for CDC and incremental data processing
Can I query lakehouse tables from Snowflake or BigQuery? Yes. Snowflake supports Iceberg tables natively. BigQuery supports BigLake tables over Iceberg and Delta via connectors.
Is Lakehouse cheaper than a traditional warehouse? Storage is significantly cheaper (S3/GCS vs proprietary storage). Compute costs depend on engine choice. Overall TCO is usually lower, especially for large datasets.
Related Resources
Data Lake vs Data Warehouse — Architecture Guide
A practical guide to Data Lake architecture: structured vs unstructured storage, lakehouse concepts, ETL vs ELT patterns, and when to choose a lake over a warehouse.
GuideData Mesh Architecture — Decentralized Data Ownership
A practical guide to Data Mesh: decentralizing data ownership to domain teams, treating data as a product, and enabling self-serve data infrastructure.