At Kleinanzeigen, Germany’s leading second-hand marketplace, we recently tackled a fascinating yet challenging task: migrating data from one system to another.
In this article, I’ll share the story of our journey: why it was challenging for our team, the technical path we followed, and the lessons we learned along the way.
Let’s dive in 🚀
Why might this guide be interesting for you?
- You’re facing a similar task
- Old approaches don’t work anymore
- You already use Kafka
- The complexity of your systems is growing
What did our task look like?
Data migration is an increasingly relevant topic as systems grow more complex. With many companies transitioning from monolithic architectures to microservices, traditional approaches often fall short: when migrating large volumes of data in real time, you can’t afford data loss or downtime.
Let’s take a closer look at the task we faced.
Disclaimer: The data schema fields shown here are artificial. They are conceptually similar to the actual schema we worked with, but not identical.
Our main challenge was to merge three parts of data into a single “God object,” which contains all the necessary information for each user_id.
- Two of these parts resided in the same MySQL database, but in different tables.
- The third part was hosted in an external service, which we accessed via HTTP.

Our team primarily works on classic backend services, not data/ETL pipelines. While we were familiar with Apache Kafka as a message broker, tools like Debezium and Kafka Streams weren’t part of our existing toolkit.
How did we break this down into software components?
Let’s break the task into actionable steps to clarify our approach.

As shown in the diagram:
1. Streaming Data from MySQL to Kafka:
The first step was streaming data from the MySQL database to a Kafka topic. This ensured we could track all real-time modifications in the database.
2. Aggregating Data Across MySQL Tables:
We needed to combine data stored in different MySQL tables. Conceptually, this was akin to running a join query like:
SELECT u.user_id, u.email, p.dark_mode
FROM users u
JOIN preferences p ON u.user_id = p.user_id;
The challenge was ensuring this aggregation occurred in real time.
3. Fetching External Data:
The final step involved fetching additional data from an external service to complete the “God object.” Since we didn’t control this service, we needed a caching mechanism for the aggregated MySQL data. This way, if the database changes were reprocessed, we could avoid redundant table aggregations.
On a data level, the transformation looked like this:

- MySQL provided the email and dark_mode fields.
- An external service supplied the ID mappings to complete the denormalised “God object.”
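To make the target shape concrete, here is a rough sketch of what such a denormalised record could look like as a Java type. The field names mirror the artificial schema above and are purely illustrative:

// Illustrative shape of the denormalised "God object", keyed by user_id.
// Field names follow the artificial schema used throughout this article.
public record UserSnapshot(
        long userId,        // key shared by all source systems
        String email,       // from the MySQL users table
        boolean darkMode,   // from the MySQL preferences table
        String externalId   // ID mapping supplied by the external service
) {}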
Why Apache Kafka?

You might wonder, “Why are we talking about Apache Kafka for this use case?”
While we already use Apache Kafka extensively as a message broker, its ecosystem offers much more than basic messaging. Here’s what makes Kafka such a powerful tool:
- Debezium: A Change Data Capture (CDC) solution to stream database changes into Kafka as events.
- MirrorMaker: Helps mirror topics from one Kafka cluster to another.
- Kafka Streams: Enables in-flight data transformations and aggregations.
- Compacted Topics: Built-in configuration that allows Kafka topics to behave like a cache, storing the latest snapshots of events for each unique key.
With these tools, Kafka becomes an end-to-end platform for streaming, processing, and storing event-driven data.
Steps we took
Let’s walk through the steps we followed to achieve our goal.
We realised that in addition to the usual Kafka producer/consumer functionality, we needed:
1. A compacted topic to aggregate changes from the database.
2. Debezium as a CDC solution to stream real-time database changes into Kafka.
However, we decided against using Kafka Streams to aggregate data across MySQL tables and Kafka topics, as it wasn’t necessary for our specific use case.
1. Streaming MySQL modifications with Debezium
The first challenge was streaming real-time changes from MySQL. To choose the right tool, we had to consider two key nuances of our task:
1. The data arrives continuously in real time, meaning we couldn’t take a one-time snapshot.
2. We needed to ensure every record in MySQL was processed at least once, including historical data.
Debezium proved to be the perfect fit for our requirements.
According to the official website:
Debezium is an open-source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.
If you already have existing Kafka Connect and Debezium infrastructure, setting up a new Debezium connector is straightforward.
That said, there is one drawback: Troubleshooting can be challenging because issues often arise at the infrastructure level rather than in the application code.
From an architectural perspective, the setup is simple—we reused the existing Debezium and Kafka Connect clusters.

No code required: Setting up the Debezium connector
Creating a new connector didn’t require writing any application code. Instead, we defined a configuration file, which was similar to this example from the official documentation:
{
  "name": "mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "fullfillment",
    "database.include.list": "inventory",
    "table.include.list": "inventory.orders",
    "snapshot.mode": "initial"
  }
}
This configuration allows you to:
- Filter data: Specify the databases and tables to monitor.
- Modify data: Apply simple transformations at the configuration level.
- Control snapshots: Use the snapshot.mode option to define criteria for capturing the initial state of the database.
By leveraging Debezium and Kafka Connect, we were able to stream real-time database changes efficiently without writing a single line of application code.
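We didn’t need any extra tooling ourselves, since the connector was managed through our existing Kafka Connect setup. If you are starting from scratch, though, the JSON above can be registered via Kafka Connect’s REST API; the following sketch shows that call from Java, with the host, port, and file name as placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Read the connector configuration shown above from a local file.
        String connectorJson = Files.readString(Path.of("mysql-connector.json"));

        // POST it to the Kafka Connect REST API (placeholder host and port).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}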
2. Aggregating Data
The next step involved aggregating data for the same user_id stored across different MySQL tables. Here’s where things got interesting: one table had relatively infrequent updates to email data, while another had frequent updates to dark_mode information.
We could have used Kafka Streams for this purpose, leveraging RocksDB to store intermediate states. However, we chose a more pragmatic approach to avoid adding unnecessary architectural complexity.

Our approach
The idea was straightforward:
1. Listen for every data modification event.
2. Extract the aggregated data directly from MySQL using a single SQL query.
3. Send the aggregated result to an output Kafka topic.
A simple SQL query like the one below handled the aggregation:
SELECT u.user_id, u.email, p.dark_mode
FROM users u
JOIN preferences p ON u.user_id = p.user_id
WHERE u.user_id = 1;
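Putting it together, the whole service is conceptually a short consume → query → produce loop. The sketch below is a minimal illustration of that flow, assuming Debezium events keyed by user_id and JSON string values; the topic names, JDBC URL, credentials, and output format are placeholders rather than our production setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserAggregator {
    private static final String AGGREGATE_SQL =
            "SELECT u.user_id, u.email, p.dark_mode "
          + "FROM users u JOIN preferences p ON u.user_id = p.user_id "
          + "WHERE u.user_id = ?";

    public static void main(String[] args) throws Exception {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "user-aggregator");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             Connection db = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/app", "user", "password")) {

            // Debezium writes one topic per captured table; listen to both of them.
            consumer.subscribe(List.of("mysql.app.users", "mysql.app.preferences"));

            while (true) {
                for (ConsumerRecord<String, String> event :
                        consumer.poll(Duration.ofSeconds(1))) {
                    // For the sketch, assume the record key carries the user_id.
                    long userId = Long.parseLong(event.key());

                    // Re-read the aggregated state for this user from MySQL.
                    try (PreparedStatement stmt = db.prepareStatement(AGGREGATE_SQL)) {
                        stmt.setLong(1, userId);
                        try (ResultSet rs = stmt.executeQuery()) {
                            if (rs.next()) {
                                String snapshot = String.format(
                                        "{\"user_id\":%d,\"email\":\"%s\",\"dark_mode\":%b}",
                                        rs.getLong("user_id"),
                                        rs.getString("email"),
                                        rs.getBoolean("dark_mode"));
                                // Publish the latest snapshot, keyed by user_id,
                                // to the compacted output topic.
                                producer.send(new ProducerRecord<>(
                                        "user-snapshots", event.key(), snapshot));
                            }
                        }
                    }
                }
            }
        }
    }
}

The key design choice here is that every event triggers a fresh read of the joined row, so the produced snapshot always reflects the current state of both tables, regardless of which table changed last.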
Benefits
The simplicity of this setup was its biggest advantage. By listening for Kafka events, fetching the data directly with SQL, and aggregating it, we avoided introducing additional components like Kafka Streams.
Drawbacks
While effective, this approach could become a bottleneck under high load, as processing events would depend on fetching data from MySQL in real time.
For our use case, this wasn’t a concern because:
1. Our load wasn’t high enough to cause delays.
2. Any potential bottlenecks were mitigated downstream by aggregating data in a Kafka topic, especially during full re-consumption of MySQL changes.
3. Using a compacted Kafka topic for quick re-consumption
By this stage, our pipeline was functional. However, there was one additional challenge: re-consuming all MySQL data whenever changes occurred in third-party services that we didn’t control.
To address this, we used a compacted Kafka topic.
How Compacted Kafka Topics Work
In a compacted topic, Kafka retains only the latest event for each unique key. Here’s a simple illustration:
First Event for id=1:
{
  "id": 1,
  "data": "first data change"
}
Second Event for id=1:
{
  "id": 1,
  "data": "second data change"
}
The second event overwrites the first for the same key (id=1). This ensures the most recent event is always available and stored indefinitely.
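Compaction is enabled per topic via the cleanup.policy setting. As a minimal sketch of how such a topic could be created programmatically, assuming the Kafka AdminClient (the topic name, partition count, and replication factor are illustrative):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "compact" keeps the latest record per key instead of deleting by age.
            NewTopic topic = new NewTopic("user-snapshots", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}

Note that compaction runs asynchronously in the background, so a consumer may still see a few superseded events for a key until the log cleaner catches up.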
Benefits
Using a compacted topic allowed us to efficiently re-consume MySQL changes whenever needed. We could:
- Modify the offset of an existing consumer group.
- Create a new consumer group to fetch all the latest snapshots stored in the compacted topic.
This setup ensured that all aggregated MySQL data was readily available, even when a full re-consumption was required.
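In practice, the second option is as simple as starting a consumer under a brand-new group.id with auto.offset.reset=earliest, so it reads every retained snapshot from the beginning of the compacted topic. A minimal sketch, with illustrative topic and group names:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReconsumeSnapshots {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // A fresh group id has no committed offsets, so "earliest" makes the
        // consumer start from the oldest retained snapshot of every key.
        props.put("group.id", "reprocessing-" + System.currentTimeMillis());
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-snapshots"));
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofSeconds(1))) {
                    // Each record is the latest aggregated state for one user_id.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}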

Results
Our final migration pipeline ended up looking like the diagram below:

We successfully achieved our goals without introducing unnecessary technologies while maintaining high throughput in the pipeline. Here’s an example of the throughput we achieved:

Below is a snapshot of the actual data flowing through part of our pipeline, as visualised in our Grafana dashboard:

The numbers shown reflect a re-consumption scenario where we processed all available data, pushing our pipeline’s throughput to its limits.
Key Learnings
Through this project, we gained several insights:
- Debezium is powerful but not without challenges.
While Debezium is an excellent tool for change data capture (CDC), troubleshooting issues can sometimes be complex, as they often occur at the infrastructure level rather than in the application code.
- Compacted Kafka topics simplify the architecture.
Introducing a compacted Kafka topic provided a streamlined way to manage potential bottlenecks during re-processing, ensuring that the most recent state was always readily accessible.
And in closing:
- Write configuration instead of code
- Know your tools: the ecosystem around Apache Kafka offers far more than basic messaging.