In the global overview, we explored how the Data Lakehouse combines the robustness of a Data Warehouse with the flexibility of a Data Lake. Now let’s go deeper to understand how it works, what patterns it supports, which tools are used, and what the limitations of this architecture are.
Quick Recap ↻
A Data Lakehouse builds on the foundation of a Data Lake 🛶 by adding core warehouse capabilities: transactional updates, schema enforcement, metadata management, performance optimizations, and governance.
Instead of maintaining two separate systems (a Data Warehouse and a Data Lake), as in a Hybrid Data Warehousing architecture, a Data Lakehouse provides a unified platform that supports both use cases with a single storage layer.
Why was this architecture created?
Data Lakes offered flexibility and scale, but without structure or governance, many became swamps 🐸. Data Warehouses provided consistency and reliability, but lacked support for modern workloads like real-time streams or machine learning.
Hybrid Data Warehousing tried to bridge the gap by running both side by side. While effective and widely used today, this approach introduced complexity: two storage layers, duplicated data, fragmented governance, and costly orchestration.
The Lakehouse emerged as a unified solution. It builds warehouse capabilities directly into the lake, letting teams run BI and AI on the same platform without copying data or managing separate tools.
Core Architectural Principles

Data Storage and Table Formats
At the core of a Lakehouse is cheap, scalable storage: HDFS or object stores such as S3, ADLS, or GCS. Data is stored in columnar formats such as Parquet or ORC, then managed through modern table formats like Delta Lake (Databricks), Apache Hudi (e.g. Onehouse), or Apache Iceberg (e.g. Snowflake, Athena, Dremio). These formats layer in key capabilities: transaction logs, versioning, schema evolution, time travel, and rich metadata management.
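Below is a minimal PySpark sketch, using Delta Lake as the table format, of what these capabilities look like in practice; the delta-spark dependency and the local path are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: a Delta table gains a transaction log, versioning,
# schema evolution, and time travel on top of plain Parquet files.
# Assumes the delta-spark package is installed; the path is illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/lakehouse/events"  # in production this would be S3 / ADLS / GCS

# Version 0: initial write (Parquet data files + a transaction log in _delta_log/)
spark.createDataFrame([(1, "click"), (2, "view")], ["user_id", "event"]) \
     .write.format("delta").mode("overwrite").save(path)

# Version 1: append a new column; mergeSchema lets the schema evolve in place
spark.createDataFrame([(3, "click", "FR")], ["user_id", "event", "country"]) \
     .write.format("delta").mode("append").option("mergeSchema", "true").save(path)

# Time travel: read the table exactly as it was at version 0
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```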
For example, the Lakehouse supports ACID-style operations (INSERT, DELETE, MERGE, etc.), making data mutable. This enables fine-grained deletes, which are essential for complying with regulations like GDPR or CCPA (e.g. a consumer requesting deletion of their data). Without it, deleting a single record would require rewriting entire files, which is costly and error-prone at scale.
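As a hedged illustration of these row-level operations, here is how a GDPR-style delete and an upsert could look on the Delta table from the previous sketch (the user_id filter and the updates DataFrame are made-up examples):

```python
# Fine-grained, transactional delete and upsert on the table created above.
# DeltaTable comes from the delta-spark package; reuses the `spark` session from the previous sketch.
from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "/tmp/lakehouse/events")

# Row-level delete: only the affected files are rewritten, and the operation
# is recorded atomically in the transaction log (a "right to be forgotten" request).
events.delete("user_id = 2")

# MERGE (upsert): match on a key, update existing rows, insert new ones.
updates = spark.createDataFrame([(1, "purchase", "FR")], ["user_id", "event", "country"])
(events.alias("t")
    .merge(updates.alias("u"), "t.user_id = u.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```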
Serving and Query Layer
Data lakehouse processing leverages distributed engines like Apache Spark or Flink to support both batch and streaming workloads with high performance. These engines operate directly on open table formats and are optimized for large-scale transformations, incremental updates, and complex analytics.
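As a rough sketch of this dual batch/streaming model, the snippet below streams synthetic data into a Delta table while a batch query reads the very same table; the rate source, the paths, and the reuse of the earlier Spark session are illustrative assumptions.

```python
# Streaming write and batch read against the same Delta table.
# Reuses the `spark` session from the earlier sketch; any Kafka or file
# source would work the same way as the synthetic "rate" source used here.
stream = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
         .selectExpr("value AS user_id", "timestamp")
)

query = (
    stream.writeStream.format("delta")
          .option("checkpointLocation", "/tmp/lakehouse/checkpoints/rate")
          .outputMode("append")
          .start("/tmp/lakehouse/rate_events")
)

query.awaitTermination(15)  # let a few micro-batches land

# A batch engine (here, the same Spark session) queries the table directly,
# seeing a consistent snapshot thanks to the transaction log.
print(spark.read.format("delta").load("/tmp/lakehouse/rate_events").count())

query.stop()
```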
Key Features
These features are what make the Lakehouse suitable for modern data workloads (a short maintenance sketch follows the list):
- Schema evolution to adapt to changing structures without rewriting full datasets
- Predicate pushdown to reduce compute by filtering at storage level
- Partitioning and Z-ordering to accelerate queries by pruning data
- Time travel for querying historical versions
- Compaction to reduce the small files problem and improve read performance
- Delta file vacuuming to remove obsolete files and reclaim storage space after updates or deletes
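To show how several of these features are exercised together, here is a short maintenance sketch, assuming a recent open-source Delta Lake release (OPTIMIZE and ZORDER BY require Delta 2.0+) and the session and table from the earlier sketches:

```python
# Routine maintenance on the Delta table created earlier.
path = "/tmp/lakehouse/events"

# Compaction + Z-ordering: rewrite many small files into fewer, well-clustered ones
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (user_id)")

# Vacuuming: physically remove data files no longer referenced by the log.
# The retention window also bounds how far back time travel can reach.
spark.sql(f"VACUUM delta.`{path}` RETAIN 168 HOURS")

# Time travel still works within the retained history
spark.read.format("delta").option("versionAsOf", 1).load(path).show()
```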
Today’s Technologies
These tools support both batch and streaming, allow schema-on-read or evolving schemas, and are optimized for ML, BI, and real-time analytics, typically running query engines over open columnar storage (a small interoperability sketch follows the list):
- Cloud-native (fully managed, lakehouse-first or evolved):
- Databricks Lakehouse Platform (built on Delta Lake, Spark-native, leveraging cloud object storage from hyperscalers)
- Snowflake (supports Iceberg and Unistore, evolving toward a lakehouse model)
- BigQuery (via BigLake & open table formats supporting Iceberg and Delta)
- Microsoft Fabric (OneLake as unified storage layer + Spark + SQL engine)
- Self-managed / open-source formats and engines:
- Delta Lake (originally created by Databricks)
- Apache Iceberg (originally created at Netflix; also used by Dremio, Snowflake, etc.)
- Apache Hudi (can run on HDFS or cloud object storage)
- Query engines (lake query engines over open formats):
- Apache Spark (widely used for processing on Delta/Iceberg/Hudi)
- Trino / Presto (interactive SQL over data lakes, supports Iceberg, Delta, Hudi)
- Dremio (self-service analytics on open formats with SQL acceleration)
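To make the "engines over open formats" idea concrete, here is a hedged PySpark sketch writing and querying an Apache Iceberg table through a local Hadoop catalog; the catalog name, warehouse path, and the iceberg-spark-runtime dependency (matching your Spark version) are assumptions, and the resulting table could equally be queried from Trino, Dremio, or Snowflake.

```python
# Spark over an Iceberg table in a local Hadoop catalog.
# Assumes the iceberg-spark-runtime jar matching your Spark/Scala versions is on the classpath.
from pyspark.sql import SparkSession

spark_iceberg = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/lakehouse/iceberg")
    .getOrCreate()
)

# The table is stored as open Parquet + Iceberg metadata; other Iceberg-aware engines can read it too.
spark_iceberg.sql(
    "CREATE TABLE IF NOT EXISTS local.db.orders (id BIGINT, amount DOUBLE) USING iceberg"
)
spark_iceberg.sql("INSERT INTO local.db.orders VALUES (1, 9.99), (2, 24.50)")
spark_iceberg.sql("SELECT * FROM local.db.orders").show()
```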
To conclude: is this architecture still relevant?
Data Lakehouse architecture is a major milestone in the evolution of data platforms. While the learning curve can be steep, it pays off by enabling powerful, unified analytics and AI workloads.
That said, there are things to consider:
- Moving from classic SQL environments with stored procedures to file-based tables processed by a Spark engine can be challenging and may require rethinking your data processing approach.
- For highly structured, low-latency, high-concurrency workloads, traditional MPP data warehouses can still offer superior performance.
- Poor management of Delta tables (e.g. without regular vacuuming or compaction) can lead to performance and cost issues.
👉 Next up: Dive into Data Fabric or Data Mesh in our next architectural deep dives.