Building a Scalable, Cost-Effective Data Lakehouse for Indian Banks: A Powerful Open-Source Approach

Riddhi Agrahari
Jun 20, 2024

Updated: Jun 24, 2024

In the data-driven world of Indian banking, unlocking the power of information is crucial. Banks treat data infrastructure projects with great care because they are strategic initiatives that receive close attention from senior leadership. The strategic technology imperatives when deciding on a data lake or data infrastructure project today include:

Cloud + On Premise: The ability to leverage cloud for burstable volume while using on premise for predictable load.
Minimize Vendor Lock-in: CIOs and CTOs want flexible, modular choices to prevent vendor lock-in, which often results in spiralling costs.
Scalability and Cost-Effectiveness: Build an infrastructure that handles growing data volumes efficiently while keeping costs under control.
Future-Proofing: Choose a flexible architecture that adapts to new technologies and evolving needs of the bank.
Data Governance & Compliance: Ensure data accuracy, security, and regulatory compliance with frameworks like RBI's "Master Directions - IT".
Best of Breed: Data infrastructure covers a range of areas, including data ingestion, processing, storage, governance, and visualization. CIOs therefore aim for a best-of-breed approach, while accommodating elements already in place due to historic choices.

Concerns about exorbitant costs can hinder the adoption of data lakehouses. Here's how to leverage open-source technologies to build a robust, cost-effective data lakehouse architecture, empowering Indian banks to gain valuable insights without breaking the bank.

The Open-Source Powerhouse:

This architecture takes a best-of-breed approach with open-source tools:

Debezium: This CDC (Change Data Capture) tool captures real-time data modifications from various source systems (core banking, loan management) within the bank's infrastructure (cloud or on-premise).
Apache Spark: The workhorse for data ingestion, transformation, and loading into the data lakehouse. Spark's distributed processing power efficiently handles large datasets.
Apache Iceberg: This open table format provides efficient data storage within the data lakehouse. It supports schema evolution, snapshot history (time travel), and efficient querying across diverse data sets.
Presto: An open-source SQL query engine that enables interactive querying of data stored in the Iceberg tables. Analysts can work in standard SQL without writing Spark jobs, which makes it user-friendly.
Apache Airflow: This open-source workflow orchestration platform automates data pipelines. It schedules data ingestion, transformation, and loading tasks, ensuring a reliable and repeatable data flow.
DataHub: This open-source data catalog acts as a central repository for metadata within the lakehouse. It facilitates data discovery, lineage tracking, and collaboration among data users.
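
As a rough illustration of the Debezium entry in the list above, the sketch below builds the JSON body that would register a connector with Kafka Connect. It assumes a MySQL source and Debezium 2.x property names; the connector name, hostnames, credentials, and table names are hypothetical placeholders, and other source databases (and other Debezium versions) use different connector classes and properties.

```python
import json


def build_mysql_connector_payload(name, host, port, user, password,
                                  server_id, topic_prefix, tables,
                                  kafka_bootstrap):
    """Build a connector registration body for Kafka Connect.

    Property names follow the Debezium 2.x MySQL connector; verify them
    against the Debezium version actually deployed.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": host,
            "database.port": str(port),
            "database.user": user,
            "database.password": password,
            "database.server.id": str(server_id),
            # Prefix for the Kafka topics that receive change events.
            "topic.prefix": topic_prefix,
            # Only capture the tables the lakehouse actually needs.
            "table.include.list": ",".join(tables),
            "schema.history.internal.kafka.bootstrap.servers": kafka_bootstrap,
            "schema.history.internal.kafka.topic": f"schema-history.{topic_prefix}",
        },
    }


# Hypothetical values for a core banking (CBS) source database.
payload = build_mysql_connector_payload(
    name="cbs-accounts-connector",
    host="cbs-db.example.internal",
    port=3306,
    user="debezium",
    password="change-me",
    server_id=184054,
    topic_prefix="cbs",
    tables=["cbs.accounts", "cbs.transactions"],
    kafka_bootstrap="kafka.example.internal:9092",
)
print(json.dumps(payload, indent=2))
```

In a real deployment this body would typically be sent to the Kafka Connect REST API's /connectors endpoint, with credentials supplied from a secrets store rather than hard-coded.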

Cloud-Agnostic and On-Premise Friendly:

This architecture is designed to be cloud-agnostic and can be deployed on either public cloud platforms (AWS, Azure, GCP) or on-premise infrastructure. By leveraging cloud storage solutions for the data lakehouse, banks can benefit from scalability and pay-as-you-go pricing models, reducing upfront costs.
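
One way to picture the cloud-agnostic claim: with Iceberg on Spark, moving between cloud object storage and on-premise HDFS can come down to changing the warehouse location in the catalog configuration. The sketch below generates Spark properties for an Iceberg catalog; the catalog name, bucket, and namenode address are hypothetical, and the simple Hadoop-type catalog is an assumption made for brevity (a production setup may well prefer a Hive or REST catalog).

```python
def iceberg_spark_conf(catalog_name, warehouse_uri):
    """Return Spark properties that register an Iceberg catalog.

    Uses Iceberg's Hadoop-type catalog, which keeps table metadata
    directly under the warehouse path.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse_uri,
    }


# Hypothetical warehouse locations for the two deployment styles.
cloud_conf = iceberg_spark_conf(
    "lakehouse", "s3a://example-bank-lake/warehouse")
onprem_conf = iceberg_spark_conf(
    "lakehouse", "hdfs://namenode.example.internal:8020/warehouse")

# The only property that differs between the two deployments.
changed = sorted(k for k in cloud_conf if cloud_conf[k] != onprem_conf[k])
print(changed)
```

Each key/value pair would be passed to Spark as a configuration property, for example through spark-submit's --conf option or the session builder.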

Cost-Effectiveness:

Open-Source Advantage: Eliminating vendor licensing fees for core components significantly reduces costs compared to proprietary data lakehouse solutions.
Pay-As-You-Go Cloud Storage: Cloud storage options offer scalable storage based on data volume, eliminating the need for upfront investment in hardware infrastructure.
Reduced Reliance on External Expertise: Open-source communities provide extensive documentation and support, minimizing dependence on expensive external consultants for system setup and maintenance.

Modular and Scalable Architecture:

Each component of the architecture is modular, allowing for independent scaling as data volumes or processing needs grow.

Independent Scaling: Individual components like Spark clusters or cloud storage instances can be scaled up or down based on processing requirements.
Easy Integration: Open-source tools are known for their interoperability. New data sources or tools can be easily integrated into the existing architecture as needed.

Best-of-Breed Approach:

This architecture leverages best-in-class open-source tools for each specific function:

Real-Time Data Capture: Debezium is a leader in CDC technology, ensuring real-time data updates within the data lakehouse.
Distributed Processing Power: Apache Spark is widely recognized for its ability to handle large-scale data processing efficiently.
Advanced Table Format: Apache Iceberg offers significant advantages in schema evolution, snapshot history (time travel), and query performance compared to traditional file-based table layouts.
Interactive SQL Querying: Presto allows analysts to query data directly using familiar SQL syntax, without having to write Spark jobs.
Workflow Automation: Apache Airflow is a mature and popular open-source solution for data pipeline orchestration.
Centralized Data Catalog: DataHub is gaining traction as a valuable tool for data discovery and collaboration within data-driven organizations.
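
To illustrate what the real-time data capture item above implies downstream, here is a minimal sketch of applying Debezium-style change events to a keyed table. Debezium events carry an "op" code ("c" create, "u" update, "d" delete, "r" snapshot read) along with "before" and "after" row images; the in-memory dictionary, the "id" key, and the sample account rows are assumptions for illustration only. In the architecture described here, the equivalent step would more likely be a Spark merge into an Iceberg table.

```python
def apply_change_events(table, events, key="id"):
    """Apply Debezium-style change events to a dict keyed by primary key.

    `table` is modified in place and also returned for convenience.
    """
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # Creates, snapshot reads, and updates all upsert the new image.
            row = event["after"]
            table[row[key]] = row
        elif op == "d":
            # Deletes carry the old row image in "before".
            table.pop(event["before"][key], None)
        else:
            # Other operations (for example truncate) are not handled here.
            raise ValueError(f"unsupported op: {op!r}")
    return table


# Hypothetical change events for an accounts table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "balance": 1000}},
    {"op": "c", "before": None, "after": {"id": 2, "balance": 500}},
    {"op": "u", "before": {"id": 1, "balance": 1000},
     "after": {"id": 1, "balance": 750}},
    {"op": "d", "before": {"id": 2, "balance": 500}, "after": None},
]

accounts = apply_change_events({}, events)
print(accounts)
```

Events must be applied in the order they were captured for each key; replaying them out of order would leave the table in a state the source never had.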

Putting It All Together: A Step-by-Step Approach

Data Source Identification: Identify the critical data sources from core banking systems, loan management platforms, payment systems, etc., that need to be integrated into the data lakehouse.
Debezium Deployment: Deploy Debezium connectors for each identified data source to capture real-time data changes.
Spark Processing: Develop Spark jobs to transform and prepare the captured data for loading into the data lakehouse.
Iceberg Management: Utilize Apache Iceberg as the table format within the data lakehouse (cloud storage or on-premise HDFS).
Airflow Orchestration: Set up Apache Airflow to orchestrate the data pipeline. This includes scheduling Debezium data capture, triggering Spark transformations, and loading data into the Iceberg tables.
Presto Integration: Configure Presto to query data directly from the Iceberg tables.
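
The ordering implied by the steps above can be sketched as a small dependency graph. This is plain Python using the standard library's graphlib module, not Airflow's API; it only shows the kind of dependency resolution an orchestrator performs. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "check_cdc_topics": set(),
    "spark_transform": {"check_cdc_topics"},
    "load_iceberg_tables": {"spark_transform"},
    "refresh_catalog_metadata": {"load_iceberg_tables"},
    "presto_smoke_query": {"load_iceberg_tables"},
}


def execution_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError on cycles."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(PIPELINE)
print(order)
```

The last two tasks depend only on the load step, so an orchestrator is free to run them in parallel; in Airflow the same relationships would be declared between operators inside a DAG.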

Conclusion:

Building a data lake and warehouse to integrate data for report generation in Indian banks is a complex endeavour. However, by acknowledging the challenges, adopting a strategic approach, and implementing the right solutions, banks can unlock the immense potential of their data and gain a competitive edge in the rapidly evolving financial landscape.

Drona Pay has created a modular open-source data stack that has helped banks modernise their data architecture, using Apache Iceberg as the storage format for data from systems including CBS, LMS, Treasury, Internet Banking, Mobile Banking, and Payments. The Drona Pay Modern Data Stack provides scalable data ingestion, processing, storage, governance, and visualization, built on leading open-source components including Iceberg, Debezium, Kafka, Airflow, Spark, and Superset. By embracing the Drona Pay stack, banks can build an end-to-end data lake infrastructure that is supported on leading cloud vendors and on premise.
