With the exponential growth of data and so many businesses moving online, it has become imperative to design systems that can act in real time, or near real time, to support business decisions. After working on backend projects for many years, I finally got to build a real-time streaming platform. While working on it, I experimented with several tech stacks, and I am sharing my learnings in a series of articles. This is the first of them.
This post is aimed at engineers who are already familiar with microservices and Java and want to build their first real-time streaming pipeline. For readability, the POC is divided into four articles:
- The first article (the current one) contains the basic concepts, the problem statement, the tech stack, and everything that we need to know about what we are building.
- Article 2 (System Setup) will cover configuring the MySQL database to write binlogs and setting up the infrastructure with Docker. We also cover how to create Kafka topics using Kafka Connect.
- Article 3 (Kafka Streams and Aggregations) will contain the Java backend code, where we listen to the Kafka topics and create/update aggregated indices in Elasticsearch.
- Article 4 (Verification and Running Your Code) will cover running the code locally, verifying the real-time streaming system with Kibana, and downloading the final aggregated report.
What to Expect
A proof of concept for building a real-time change data capture (CDC) streaming pipeline. Further optimization work would be needed to make it production-ready.
Imagine that you work for an e-commerce company selling fashion products. The business team wants to make decisions based on real-time updates happening in the backend systems, and they want dashboards and views for that purpose. Assume the backend is built on a microservice architecture: multiple systems interact with each other during any user operation, and each system has its own database.
Let’s consider three of the systems for our POC.
For this example, the business team wants to create an outbound dashboard. It contains the number of orders, the types of items sold, the cost of each item, and the cost of shipping the items. This data is constantly updated across multiple systems based on user and real-world actions, as seen below.
Once the customer has placed an order and the payment is made, the order goes to the warehouse, and from there the item is shipped to the customer. Assume that after each action, the corresponding system is updated in real time.
Let’s have a look at the kind of data required by the business team:
- Units shipped per day, per item type, per warehouse.
- Orders shipped per day with each courier company.
- Total cost of items shipped per day, per warehouse.
We know that this data is present in the Order service, the Warehouse service, and the Logistics service. Assume all of them use MySQL databases and all of them are updated in real time. Now that we have looked at the use case, let’s think about what our solution to this problem could be.
Possible Solution: We need a way to capture all the updates/inserts that happen in the different databases across the different services and put them in a single place, from which we can build reports and run analytics. This is where Kafka Connect and Debezium come in.
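To make this concrete, a Debezium MySQL source connector is registered with Kafka Connect as a JSON config. The sketch below shows roughly what that looks like; the hostnames, credentials, and database/topic names are placeholders, and the exact property names vary slightly across Debezium versions:

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "orders_db_server",
    "database.whitelist": "orders_db",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.orders_db"
  }
}
```

With a config like this, each captured table typically gets its own Kafka topic (e.g. `orders_db_server.orders_db.orders`), carrying one change event per row-level insert, update, or delete.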
Understanding CDC, MySQL Binlogs, Debezium, and Some Basic Concepts:
Before we jump into the implementation of this system, we need to understand a few concepts:
Change Data Capture:
Change Data Capture (CDC) tracks data changes (usually close to real-time). CDC can be implemented for various tasks, such as auditing, copying data to another system, or processing (and reacting to) events. In MySQL, the easiest and probably most efficient way to track data changes is to use binary logs.
The binary log is a set of log files that contain information about data modifications made to a MySQL server instance. It is enabled by starting the server with the `--log-bin` option.
The binary log was introduced in MySQL 3.23.14. It contains all statements that update data, as well as statements that potentially could have updated it (for example, a DELETE that matched no rows), unless row-based logging is used. Statements are stored in the form of “events” that describe the modifications.
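For reference, enabling row-based binlogs is a small MySQL server-config change. The fragment below is a sketch; the exact file location and values depend on your installation, and the retention setting shown is one of several options MySQL offers:

```ini
[mysqld]
server-id        = 1
log_bin          = mysql-bin
binlog_format    = ROW
binlog_row_image = FULL
```

Row-based logging (`binlog_format = ROW`) matters for CDC: it records the actual before/after row values rather than the SQL statements, which is what Debezium reads to build change events.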
Similar to MySQL binlogs, MongoDB has oplogs, which serve the same purpose and can likewise be used for CDC.
Debezium:
Debezium is a set of distributed services to capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.
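The shape of these change events is worth internalizing before we write any streams code. The toy model below (my own simplification, not Debezium's actual classes) captures the essence of Debezium's envelope: an operation code plus "before" and "after" row images. It assumes Java 16+ for records and switch expressions:

```java
import java.util.Map;

/** Simplified model of a Debezium-style change event: op code + before/after row images. */
public class ChangeEventDemo {

    // Debezium uses "c" (create), "u" (update), "d" (delete) as operation codes.
    record ChangeEvent(String op, Map<String, Object> before, Map<String, Object> after) {}

    /** Renders a human-readable summary of a change event. */
    static String describe(ChangeEvent e) {
        return switch (e.op()) {
            case "c" -> "INSERT: " + e.after();
            case "u" -> "UPDATE: " + e.before() + " -> " + e.after();
            case "d" -> "DELETE: " + e.before();
            default  -> "UNKNOWN op: " + e.op();
        };
    }

    public static void main(String[] args) {
        // An insert has no "before" image; a delete has no "after" image.
        ChangeEvent insert = new ChangeEvent("c", null, Map.of("order_id", 1, "status", "PLACED"));
        System.out.println(describe(insert));
    }
}
```

Because every event carries both row images in order, a downstream consumer can rebuild table state or compute incremental aggregates without ever querying the source database.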
Stream Processing, Real-Time Processing, and Complex Event Processing
Before we dive in, one thing I want to call out is that in this POC we are building a Complex Event Processing (CEP) system. The differences between stream processing, real-time processing, and complex event processing are subtle:
- Stream Processing: Data is processed continuously as it arrives, rather than in batches. This is useful for tasks like fraud detection and cybersecurity: if transaction data is stream-processed, fraudulent transactions can be identified and stopped before they even complete.
- Real-Time Processing: If event time is very relevant and latencies in the seconds range are completely unacceptable, it is called real-time (or near real-time) processing, e.g. the flight control system for a space program.
- Complex Event Processing (CEP): CEP utilizes event-by-event processing and aggregation (for example, on potentially out-of-order events from a variety of sources, often with large numbers of rules or business logic).
Which of these you build depends mostly on the business use case; the same tech stack can serve stream processing as well as real-time processing.
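The CEP-style aggregation described above can be sketched in plain Java before we bring Kafka Streams into the picture. The class below (a toy illustration with hypothetical field names, not the actual pipeline code) computes one of the business team's asks, units shipped per day, per item type, per warehouse, from a list of shipment events:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy per-key aggregation mirroring what the Kafka Streams job will compute. */
public class ShipmentAggregator {

    // Hypothetical shape of a shipment event flowing out of the Warehouse service.
    record Shipment(String day, String itemType, String warehouse, int units) {}

    /** Sums units shipped per (day, itemType, warehouse) composite key. */
    static Map<String, Integer> unitsPerKey(List<Shipment> shipments) {
        Map<String, Integer> totals = new HashMap<>();
        for (Shipment s : shipments) {
            String key = s.day() + "|" + s.itemType() + "|" + s.warehouse();
            totals.merge(key, s.units(), Integer::sum); // add to running total
        }
        return totals;
    }

    public static void main(String[] args) {
        var events = List.of(
            new Shipment("2020-01-01", "shoes",  "WH-1", 3),
            new Shipment("2020-01-01", "shoes",  "WH-1", 2),
            new Shipment("2020-01-01", "shirts", "WH-2", 5));
        System.out.println(unitsPerKey(events));
    }
}
```

In the real pipeline, the input list is replaced by an unbounded stream of change events, and the `HashMap` by a state store that is updated event by event, but the aggregation logic is the same idea.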
Now that we have an overall picture of what we want to achieve and which concepts are involved, the next step is to understand the overall technical tasks and the tech stack we will use to build our system.
Tech Stack
The tech stack for this POC:
- MySQL 8.0.18.
- Apache Kafka Connect 1.1.
- Apache Kafka.
- Apache Kafka Streams 2.3.1.
- Elasticsearch 7.5.1.
- Kibana 7.5.1.
- Docker 19.03.5.
- Swagger UI.
- Postman 7.18.1.
Overall Technical Tasks
So, in terms of overall tasks, we will be splitting them up as shown below:
- Setting up local infrastructure using Docker.
- Data ingestion into Kafka from the MySQL databases using Kafka Connect.
- Reading data using Kafka Streams in the Java backend.
- Creating indices on Elasticsearch for the aggregated views.
- Listening to the events in real time and updating the aggregated views.
- Setting up your local environment and running the Java code.
Summary and Next Steps
In this article, we:
- Understood the problem statement and looked at a possible solution.
- Covered the concepts required for the solution.
- Looked at the tech stack we will use.
Next steps: In the next article, we will look at how to set up the MySQL database so that binlogs are written, and how to use Docker to create the required infrastructure on your local system.
If you liked this article, please read the subsequent articles and share your feedback. Find me on LinkedIn at rohan_linkedIn.