The Rise of Apache Iceberg: Leading the Data Lake Table Format Race

Riddhi Agrahari
Jun 20, 2024
Updated: Jun 24, 2024

The world of data lakes is constantly evolving, and how data is managed within them is crucial. In this landscape, Apache Iceberg is rapidly emerging as a leading table format and, arguably, a de facto standard. Here's why:

Addressing Challenges of Traditional Formats:

File formats like Parquet and ORC, the traditional building blocks of a data lake, were designed for efficient data storage, not for managing schema evolution, data updates, and deletes. Used on their own, without a table format on top, they lead to challenges like:

Data Inconsistencies: Multiple versions of data might coexist, causing confusion and potentially impacting analysis.
Limited Schema Evolution: Changing data structures can be cumbersome and require complex workarounds.
Lack of Transactions: Updating or deleting data often necessitates rewriting entire files, impacting performance and data lineage tracking.

Apache Iceberg Fills the Gap:

Iceberg addresses these challenges head-on by introducing a metadata layer on top of data files stored in formats like Parquet or ORC. This metadata layer tracks information about the data, including:

Schema: Definition of data columns and their types.
Partitions: How data is organized within the table for efficient querying.
Data Files: Locations and details of individual data files.
Snapshots: Specific points in time representing the state of the data.
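
To make the idea of a metadata layer concrete, here is a minimal Python sketch of the four pieces listed above. It is a toy model, not Iceberg's actual metadata format or API: the class names (DataFile, Snapshot, TableMetadata), the append_files method, the column names, and the file paths are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    """One immutable data file (e.g. a Parquet file) and the partition it belongs to."""
    path: str
    partition: str
    record_count: int


@dataclass(frozen=True)
class Snapshot:
    """The complete list of data files that made up the table at one point in time."""
    snapshot_id: int
    files: Tuple[DataFile, ...]


@dataclass
class TableMetadata:
    """Toy stand-in for table metadata: schema, partition rule, and snapshot history."""
    schema: Dict[str, str]   # column name -> type (hypothetical columns)
    partition_by: str        # human-readable partition rule
    snapshots: List[Snapshot] = field(default_factory=list)

    def current_snapshot(self) -> Optional[Snapshot]:
        return self.snapshots[-1] if self.snapshots else None

    def append_files(self, new_files: List[DataFile]) -> Snapshot:
        """Commit new files by adding a snapshot; earlier snapshots are left untouched."""
        current = self.current_snapshot()
        existing = current.files if current else ()
        snap = Snapshot(len(self.snapshots) + 1, existing + tuple(new_files))
        self.snapshots.append(snap)
        return snap


table = TableMetadata(
    schema={"txn_id": "long", "amount": "decimal(18,2)", "txn_ts": "timestamp"},
    partition_by="day(txn_ts)",
)
table.append_files([DataFile("data/00000.parquet", "2024-06-19", 1000)])
table.append_files([DataFile("data/00001.parquet", "2024-06-20", 500)])

latest = table.current_snapshot()
total_rows = sum(f.record_count for f in latest.files)
print(latest.snapshot_id, total_rows)
```

The point of the sketch is that data files are never modified in place: every change produces a new snapshot that lists the files making up the table, while older snapshots keep pointing at the files they always did.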

Benefits of Using Apache Iceberg:

ACID Transactions: Supports ACID (Atomicity, Consistency, Isolation, Durability) properties, enabling reliable data updates and deletes.
Seamless Schema Evolution: Allows schema changes without impacting existing data or requiring data rewrites.
Time Travel: Facilitates querying historical versions of data based on snapshots.
Hidden Partitioning: Derives partition values from column data (for example, the day of a timestamp), so queries do not need to reference partition columns explicitly, simplifying table management.
Efficient Querying: Optimizes queries by leveraging metadata about partitions and data files.
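
Three of these benefits (hidden partitioning, metadata-based pruning, and time travel) can be illustrated with a small, self-contained Python sketch. This is a simplified model under stated assumptions, not Iceberg's implementation or API: the helper names (day_partition, commit, plan_scan) and the file paths are hypothetical, and a snapshot is reduced to a plain tuple of (partition value, file path) pairs.

```python
from datetime import datetime
from typing import List, Tuple

# In this toy model a snapshot is just a tuple of (partition_value, file_path) pairs.
Snapshot = Tuple[Tuple[str, str], ...]


def day_partition(ts: datetime) -> str:
    """Derive a day-level partition value from a timestamp column value."""
    return ts.strftime("%Y-%m-%d")


def commit(history: List[Snapshot], file_path: str, ts: datetime) -> None:
    """Append a new snapshot that adds one file; older snapshots stay readable."""
    previous = history[-1] if history else ()
    history.append(previous + ((day_partition(ts), file_path),))


def plan_scan(snapshot: Snapshot, start: datetime, end: datetime) -> List[str]:
    """Return only the files whose partition can contain rows in [start, end]."""
    lo, hi = day_partition(start), day_partition(end)
    # ISO dates (YYYY-MM-DD) sort lexicographically in chronological order.
    return [path for part, path in snapshot if lo <= part <= hi]


history: List[Snapshot] = []
commit(history, "data/a.parquet", datetime(2024, 6, 18, 9, 30))
commit(history, "data/b.parquet", datetime(2024, 6, 19, 14, 0))
commit(history, "data/c.parquet", datetime(2024, 6, 20, 8, 15))

# The query filters on the timestamp itself; the partition value is derived for it.
pruned = plan_scan(history[-1], datetime(2024, 6, 19), datetime(2024, 6, 20, 23, 59))

# "Time travel": plan the same kind of scan against the first snapshot, not the latest.
as_of_first = plan_scan(history[0], datetime(2024, 6, 1), datetime(2024, 6, 30))

print(pruned)       # ['data/b.parquet', 'data/c.parquet']
print(as_of_first)  # ['data/a.parquet']
```

Note that the caller only ever supplies timestamps. The partition value is computed from them, which is the essence of hidden partitioning, and files outside the requested range are skipped using metadata alone, without opening them.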

Gaining Traction in the Data Lake Ecosystem:

Here's evidence suggesting that Apache Iceberg is gaining significant traction within the data lake space:

Large-Scale Adoption: Leading companies like Netflix, Uber, eBay, and Databricks are already using Iceberg in production for their data lakes.
Integration with Major Vendors: Major data lake vendors like Cloudera, Databricks, and Snowflake are actively integrating Iceberg into their platforms.
Growing Community Support: The Apache Iceberg project boasts a vibrant community of developers and users actively contributing to its development and improvement.

While some may say it's not yet the undisputed de facto standard, Iceberg's strong technical advantages, growing industry adoption, and vendor support make it a compelling choice of data lake table format. Its ability to address the limitations of traditional formats and its focus on data management best practices position it as a frontrunner in the evolving data lake landscape.

Additional points to consider:

The data lake format landscape is still evolving, and other contenders like Delta Lake and Apache Hudi are also vying for dominance.
The choice of table format depends on specific data lake needs and existing infrastructure.
It's important to stay current with evolving standards and assess the suitability of different formats for your specific use case.

Conclusion:

The shift in mindset from data warehouse to data lake is complete in most banks, given the size and the range of sources of the data flowing in. However, a standardised approach to data lake implementation is still under debate. File formats like Parquet and ORC were designed for efficient data storage, not for managing schema evolution, data updates, and deletes, which has created the need for a newer table format that supports these common use cases. Apache Iceberg's support from a range of players, including Snowflake, Cloudera, Dremio, and Databricks, has driven its rise as a near de facto table format.

Drona Pay has created a modular open source data stack that helps banks modernise their data architecture, using Apache Iceberg as the storage format for data from systems including CBS, LMS, Treasury, Internet Banking, Mobile Banking, and Payments. The Drona Pay Modern Data Stack provides scalable data ingestion, processing, storage, governance, and visualization, built on leading open source components including Iceberg, Debezium, Kafka, Airflow, Spark, and Superset. By embracing the Drona Pay stack, banks can deploy an end-to-end data lake infrastructure that is supported on leading cloud vendors and on premise.

By understanding the advantages of Apache Iceberg and its growing presence within the data lake ecosystem, Drona Pay has made an informed decision about its data lake architecture to ensure efficient data management for Banks.
