Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lake framework for managing large datasets on distributed and cloud storage, enabling efficient data ingestion, storage, and retrieval. One of Hudi's key features is its support for two distinct storage types: Copy-On-Write (COW) and Merge-On-Read (MOR). Each storage strategy has its own characteristics and serves different use cases. In this blog, we will explore both COW and MOR.
Prerequisites
Before you begin, ensure you have the following installed on your local machine:
Docker
Docker Compose
Local Setup
To set up Apache Hudi locally, follow these steps:
Clone the Repository:
git clone https://github.com/dnisha/hudi-on-localhost.git
cd hudi-on-localhost
Start Docker Compose:
docker-compose up -d
Access the Notebooks:
Open your browser and navigate to http://localhost:8888 for the Jupyter Notebook.
Also, open http://localhost:9001/login for MinIO.
Username: minioadmin
Password: minioadmin
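If you want to build your own notebook rather than use the one bundled with the repository, a Hudi-enabled SparkSession pointed at the local MinIO instance can be created roughly as sketched below. The Hudi bundle version, the MinIO API endpoint (port 9000), and the credentials below are assumptions; adjust them to match the repository's setup.

from pyspark.sql import SparkSession

# Minimal sketch of a Hudi-enabled SparkSession that talks to local MinIO.
# The bundle version and the MinIO endpoint/credentials are assumptions.
spark = (
    SparkSession.builder
    .appName("hudi-cow-mor-demo")
    # Hudi Spark bundle; pick the artifact that matches your Spark/Scala version.
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    # Point the s3a filesystem at MinIO (the API usually listens on port 9000).
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)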
What is Copy-On-Write (COW)?
Copy-On-Write (COW) is a storage type in Apache Hudi in which writes are applied by rewriting data files, which keeps each write operation atomic. When data is updated or inserted:
Hudi writes a new version of each data file that contains affected records.
The existing data file remains unchanged, and visible to readers, until the new file is successfully written and committed.
This ensures that the operation is atomic: it either completely succeeds or fails without leaving partial updates behind. A rough write sketch follows.
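The sketch below creates a COPY_ON_WRITE table through the Hudi Spark datasource; the table name, schema, and s3a path are illustrative assumptions rather than the exact values used in the bundled notebook (the path mirrors the warehouse/cow/transactions layout referenced later).

# Illustrative sketch: create a COPY_ON_WRITE table. The path, table name,
# and schema are assumptions, not the exact ones used in the bundled notebook.
cow_path = "s3a://warehouse/cow/transactions"

transactions = spark.createDataFrame(
    [("t1", 34, 100.0, "2024-01-01 10:00:00"),
     ("t2", 35, 250.0, "2024-01-01 11:00:00")],
    ["txn_id", "document", "amount", "ts"],
)

cow_options = {
    "hoodie.table.name": "transactions_cow",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "txn_id",
    "hoodie.datasource.write.partitionpath.field": "document",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Produces document=34 style partition folders, as seen in the bucket.
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

transactions.write.format("hudi").options(**cow_options).mode("overwrite").save(cow_path)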
Steps to Evaluate COW
1. Open the Notebook:
In your browser, navigate to hudi_cow_evaluation.ipynb.
2. Run Configuration Code:
Execute all configuration-related code in the notebook.
Ensure the write configuration specifies the COPY_ON_WRITE table type (hoodie.datasource.write.table.type set to COPY_ON_WRITE), as in the configuration sketch above.
3. Updating a Record:
a. Focus on updating a record in the document=34 partition of the COW bucket.
b. Because you are using the COPY_ON_WRITE table type, a new Parquet file will be created for this update. You can find this file in the bucket at warehouse/cow/transactions/document=34; open it (or query the table, as in the sketch after these steps) to confirm the updated record.
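Continuing the illustrative sketch from the COW section above, updating a single record in the document=34 partition and then inspecting Hudi's standard metadata columns (_hoodie_commit_time and _hoodie_file_name) shows that the record now lives in a freshly written Parquet file:

# Upsert one record in the document=34 partition of the illustrative COW table.
update = spark.createDataFrame(
    [("t1", 34, 175.0, "2024-01-02 09:00:00")],  # same record key, new amount
    ["txn_id", "document", "amount", "ts"],
)
update.write.format("hudi").options(**cow_options).mode("append").save(cow_path)

# COPY_ON_WRITE rewrites the affected Parquet file, so the metadata columns
# now point at a new commit and a new file for this record.
(spark.read.format("hudi").load(cow_path)
    .where("document = 34")
    .select("_hoodie_commit_time", "_hoodie_file_name", "txn_id", "amount")
    .show(truncate=False))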
What is Merge-On-Read (MOR)?
Merge-On-Read (MOR) is an alternative storage type in Apache Hudi that employs a different approach to data management. Here’s how it works:
Base Parquet Files and Log Files: In MOR, Hudi maintains base Parquet files alongside log files that capture incremental changes.
On-the-Fly Merging: When a read operation is executed, Hudi merges the base files and log files in real-time, providing the most up-to-date view of the data.
This approach makes updates and inserts cheap to apply, because the system does not need to rewrite entire files for every change; the trade-off is that snapshot reads pay a merge cost at query time, while the read-optimized view skips the merge by serving only the base files.
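A comparable sketch for a MERGE_ON_READ table, reusing the illustrative DataFrames and options from the COW example and an assumed path, shows the two read views Hudi exposes: the snapshot view merges base files with log files at read time, while the read-optimized view reads only the compacted base files.

# Illustrative sketch of a MERGE_ON_READ table; the path and names are assumptions.
mor_path = "s3a://warehouse/mor/transactions"

mor_options = dict(cow_options)
mor_options["hoodie.table.name"] = "transactions_mor"
mor_options["hoodie.datasource.write.table.type"] = "MERGE_ON_READ"

transactions.write.format("hudi").options(**mor_options).mode("overwrite").save(mor_path)
update.write.format("hudi").options(**mor_options).mode("append").save(mor_path)

# Snapshot query: merges base Parquet files with delta log files on the fly.
snapshot = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load(mor_path))

# Read-optimized query: reads only the base files, so it is cheaper but may
# not yet reflect updates that are still sitting in log files.
read_optimized = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(mor_path))

snapshot.where("document = 34").select("txn_id", "amount").show()
read_optimized.where("document = 34").select("txn_id", "amount").show()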
You can find more information about COW and MOR in the Apache Hudi documentation.