
Make data migration easy with Debezium and Apache Kafka

How Team Search built an ETL pipeline with a throughput of 3000 events per second

At Kleinanzeigen, Germany’s leading second-hand marketplace, we recently tackled a fascinating yet challenging task: migrating data from one system to another.

In this article, I’ll share the story of our journey: why it was challenging for our team, the technical path we followed, and the lessons we learned along the way.

Let’s dive in 🚀

Why might this guide be interesting for you?

  • You’re facing a similar task
  • Old approaches don’t work anymore
  • You already use Kafka
  • The complexity of your systems is growing

What did our task look like?

Data migration is an increasingly relevant topic as systems grow more complex. With many companies transitioning from monolithic architectures to microservices, traditional approaches often fall short. And when migrating large volumes of data in real time, you can’t afford data loss or downtime.

Let’s take a closer look at the task we faced.

Disclaimer: The data schema fields shown here are artificial. They are conceptually similar to the actual schema we worked with, but not identical.

Our main challenge was to merge three parts of data into a single “God object,” which contains all the necessary information for each user_id.

  • Two of the three parts resided in the same MySQL database, but in different tables.
  • The remaining part was hosted in an external service, which we accessed via HTTP.
De-normalisation overview: the three source JSON objects on the left are merged into one final object containing user_id, new_user_id, email and dark_mode.
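To make this concrete, here is a rough sketch in JSON using the artificial schema from the disclaimer above; all names and values are purely illustrative.

From the users table:

{
  "user_id": 1,
  "email": "user@example.com"
}

From the preferences table:

{
  "user_id": 1,
  "dark_mode": true
}

From the external service:

{
  "user_id": 1,
  "new_user_id": 501
}

Resulting “God object”:

{
  "user_id": 1,
  "new_user_id": 501,
  "email": "user@example.com",
  "dark_mode": true
}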

Our team primarily works on classic backend services, not data/ETL pipelines. While we were familiar with Apache Kafka as a message broker, tools like Debezium and Kafka Streams weren’t part of our existing toolkit.

How did we break this down into software components?

Let’s break the task into actionable steps to clarify our approach.

Three steps we need to implement

As shown in the diagram:

1. Streaming Data from MySQL to Kafka:

The first step was streaming data from the MySQL database to a Kafka topic. This ensured we could track all real-time modifications in the database.

2. Aggregating Data Across MySQL Tables:

We needed to combine data stored in different MySQL tables. Conceptually, this was akin to running a join query like:

SELECT u.user_id, u.email, p.dark_mode
FROM users u
JOIN preferences p ON u.user_id = p.user_id;

The challenge was ensuring this aggregation occurred in real time.

3. Fetching External Data:

The final step involved fetching additional data from an external service to complete the “God object.” Since we didn’t control this service, we needed a caching mechanism for the aggregated MySQL data. This way, if the database changes were reprocessed, we could avoid redundant table aggregations.

On a data level, the transformation looked like this:

The data transformation needed at each step of the de-normalisation: first the email and dark_mode fields are merged, then the new_user_id is added.
  • MySQL provided the email and dark_mode fields.
  • An external service supplied the ID mappings to complete the denormalised “God object.”

Why Apache Kafka?

Elements of the Apache Kafka ecosystem: Kafka, Kafka Streams and Debezium

You might wonder, “Why are we talking about Apache Kafka for this use case?”

While we already use Apache Kafka extensively as a message broker, its ecosystem offers much more than basic messaging. Here’s what makes Kafka such a powerful tool:

  • Debezium: A Change Data Capture (CDC) solution to stream database changes into Kafka as events.
  • MirrorMaker: Helps mirror topics from one Kafka cluster to another.
  • Kafka Streams: Enables in-flight data transformations and aggregations.
  • Compacted Topics: Built-in configuration that allows Kafka topics to behave like a cache, storing the latest snapshots of events for each unique key.

With these tools, Kafka becomes an end-to-end platform for streaming, processing, and storing event-driven data.

Steps we took

Let’s walk through the steps we followed to achieve our goal.

We realised that in addition to the usual Kafka producer/consumer functionality, we needed:

1. A compacted topic to aggregate changes from the database.

2. Debezium as a CDC solution to stream real-time database changes into Kafka.

However, we decided against using Kafka Streams to aggregate data across MySQL tables and Kafka topics, as it wasn’t necessary for our specific use case.

1. Streaming MySQL modifications with Debezium

The first challenge was streaming real-time changes from MySQL. To choose the right tool, we had to consider two key nuances of our task:

1. The data arrives continuously in real time, meaning we couldn’t take a one-time snapshot.

2. We needed to ensure every record in MySQL was processed at least once, including historical data.

Debezium proved to be the perfect fit for our requirements.

According to the official website:

Debezium is an open-source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.

If you already have existing Kafka Connect and Debezium infrastructure, setting up a new Debezium connector is straightforward.

That said, there is one drawback: Troubleshooting can be challenging because issues often arise at the infrastructure level rather than in the application code.

From an architectural perspective, the setup is simple—we reused the existing Debezium and Kafka Connect clusters.

The first part of our pipeline: streaming data from the MySQL database to Kafka using Debezium

No code required: Setting up the Debezium connector

Creating a new connector didn’t require writing any application code. Instead, we defined a configuration file, which was similar to this example from the official documentation:

{
  "name": "mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "fullfillment",
    "database.include.list": "inventory",
    "table.include.list": "inventory.orders",
    "snapshot.mode": "initial"
  }
}

This configuration allows you to:

  • Filter data: Specify the databases and tables to monitor.
  • Modify data: Apply simple transformations at the configuration level (see the sketch after this list).
  • Control snapshots: Use the snapshot.mode option to define criteria for capturing the initial state of the database.
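As an example of the “Modify data” option, Debezium ships with single message transforms (SMTs) that are enabled purely through connector configuration. The snippet below is a minimal sketch, assuming you want the commonly used ExtractNewRecordState transform to flatten Debezium’s change-event envelope (shown in the next example); it would go inside the config block of the connector definition above:

{
  "transforms": "unwrap",
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
}

With this transform in place, consumers receive just the new state of each row instead of the full envelope.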

By leveraging Debezium and Kafka Connect, we were able to stream real-time database changes efficiently without writing a single line of application code.
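Without such a transform, the value of every change event follows Debezium’s standard envelope format. Below is a simplified sketch of an insert event for a row in the users table; real events also carry a schema section and richer source metadata, and the field names here follow the artificial schema of this article:

{
  "before": null,
  "after": {
    "user_id": 1,
    "email": "user@example.com"
  },
  "source": {
    "connector": "mysql",
    "table": "users"
  },
  "op": "c",
  "ts_ms": 1700000000000
}

The op field marks the event as an insert (c), update (u), delete (d) or snapshot read (r), which is what lets downstream consumers react to every kind of modification, including historical data captured during the initial snapshot.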

2. Aggregating Data

The next step involved aggregating data for the same user_id stored across different MySQL tables. Here’s where things got interesting: one table had relatively infrequent updates to email data, while another had frequent updates to dark_mode information.

We could have used Kafka Streams for this purpose, leveraging RocksDB to store intermediate states. However, we chose a more pragmatic approach to avoid adding unnecessary architectural complexity.

The second part of our pipeline: aggregating event data from two Kafka topics (we use MySQL for this instead of Kafka Streams)

Our approach

The idea was straightforward:

1. Listen for every data modification event.

2. Extract the aggregated data directly from MySQL using a single SQL query.

3. Send the aggregated result to an output Kafka topic.

A simple SQL query like the one below handled the aggregation:

SELECT u.user_id, u.email, p.dark_mode  
FROM users u  
JOIN preferences p ON u.user_id = p.user_id  
WHERE u.user_id = 1;
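The result of that query was then published to the output topic as a single event, keyed by user_id so that the compacted topic described below can keep one snapshot per user. A sketch of such an aggregated event, again with illustrative values:

{
  "user_id": 1,
  "email": "user@example.com",
  "dark_mode": true
}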

Benefits

The simplicity of this setup was its biggest advantage. By listening for Kafka events, fetching the data directly with SQL, and aggregating it, we avoided introducing additional components like Kafka Streams.

Drawbacks

While effective, this approach could become a bottleneck under high load, as processing events would depend on fetching data from MySQL in real time.

For our use case, this wasn’t a concern because:

1. Our load wasn’t high enough to cause delays.

2. Any potential bottlenecks were mitigated downstream by aggregating the data in a Kafka topic, which was especially helpful during a full re-consumption of MySQL changes.

3. Using a compacted Kafka topic for quick re-consumption

By this stage, our pipeline was functional. However, there was one additional challenge: re-consuming all MySQL data whenever changes occurred in third-party services that we didn’t control.

To address this, we used a compacted Kafka topic.

How Compacted Kafka Topics Work

In a compacted topic, Kafka retains only the latest event for each unique key. Here’s a simple illustration:

First Event for id=1:

{
  "id": 1,
  "data": "first data change" 
}

Second Event for id=1:

{
  "id": 1,
  "data": "second data change" 
}

The second event overwrites the first for the same key (id=1). This ensures the most recent event for each key is always available and retained indefinitely.
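Compaction is enabled through standard, built-in topic configuration. The keys below are the relevant Kafka settings, shown as a JSON map purely for readability; in practice you would set them with kafka-topics.sh, kafka-configs.sh or your usual topic tooling, and the last two values are simply the Kafka defaults:

{
  "cleanup.policy": "compact",
  "min.cleanable.dirty.ratio": "0.5",
  "delete.retention.ms": "86400000"
}

cleanup.policy=compact is the switch that changes retention from “keep for a period of time” to “keep the latest value per key”; the other two settings control how eagerly the log cleaner runs and how long delete markers (tombstones) remain visible to consumers.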

Benefits

Using a compacted topic allowed us to efficiently re-consume MySQL changes whenever needed. We could:

  • Modify the offset of an existing consumer group.
  • Create a new consumer group to fetch all the latest snapshots stored in the compacted topic.

This setup ensured that all aggregated MySQL data was readily available, even when a full re-consumption was required.

The third part of our pipeline: calling an external service to fetch the last part of the data

Results

Our final migration pipeline ended up looking like the diagram below:

The final architecture, split into three sections: streaming data from a database, joining events from two Kafka topics into one, and caching all denormalised user data

We successfully achieved our goals without introducing unnecessary technologies while maintaining a high throughput for the pipeline. Here’s an example of the throughput we achieved:

The throughput we measured in each section of the architecture: up to 50k messages per second in section one, up to 10k in section two and up to 3k in section three

Below is a snapshot of the actual data flowing through part of our pipeline, as visualised in our Grafana dashboard:

The real throughput numbers from our Grafana dashboard: up to 50k messages per second from Debezium, up to 10k messages per second from our aggregator and up to 30k messages per second when we call the external service at the end of the pipeline

The numbers shown reflect a re-consumption scenario where we processed all available data, pushing our pipeline’s throughput to its limits.

Key Learnings

Through this project, we gained several insights:

  • Debezium is powerful but not without challenges.
    While Debezium is an excellent tool for change data capture (CDC), troubleshooting issues can sometimes be complex, as they often occur at the infrastructure level rather than in the application code.

  • Compacted Kafka topics simplify the architecture.
    Introducing a compacted Kafka topic provided a streamlined way to manage potential bottlenecks during re-processing, ensuring that the most recent state was always readily accessible.

And in closing:

  • Write configuration instead of code where you can.

  • Know your tools. There is a lot more to the Apache Kafka ecosystem than basic messaging.
