You may wonder what the meaningless fight mentioned in the title is all about. We thought Kafka and Sqoop were rivals. You may think that Kafka and Sqoop are in different leagues and that there is no relation between them. Since we were new to all of these topics, we simply came across something on the internet and started researching. We tend to believe the first option we hear, which led us to think of these two tools as rivals, and only then did we research it for validation.
At that point we had managed to query the data, so it was time to move on to the next target: moving the data from Oracle to Hive. This would allow us to see real data in action. For this purpose, we split the team into two groups. The first group worked with Kafka to sync the data, while the second group worked on Hive SQL queries and tables.
Apache Kafka and Our Huge New Problem
Since no one in the team had any experience with Apache Kafka, we started to research what Apache Kafka is and what it is capable of. First, let Kafka describe itself.
Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
Apache Kafka has good support for syncing data between two sides with the help of third-party components; in our case we used Confluent. It reads the data from a relational database over JDBC and migrates it to Hive, either creating a new table or appending new records to an existing one. It sounded very nice. We inserted records into our database and saw them on the Hive side. The result was successful. The logic is very simple: you specify a timestamp column, and Kafka checks whether a record is newer than the last synchronized timestamp. If so, it migrates the record.
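To give an idea of what this looks like in practice, here is a minimal sketch of registering a JDBC source connector through the Kafka Connect REST API. The host, credentials, table, and column names are placeholders rather than our actual configuration, and a matching HDFS/Hive sink connector would be registered the same way.

```python
import json
import requests

# A minimal sketch, assuming a Kafka Connect worker on localhost:8083 with the
# Confluent JDBC source connector installed. Connection details, table and
# column names below are placeholders, not our real setup.
connector = {
    "name": "oracle-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1",
        "connection.user": "app_user",
        "connection.password": "app_password",
        "table.whitelist": "CUSTOMERS",
        # "timestamp" mode: on each poll, only rows whose timestamp column is
        # newer than the last synchronized value are pulled.
        "mode": "timestamp",
        "timestamp.column.name": "LAST_UPDATED",
        "topic.prefix": "oracle-",
        "poll.interval.ms": "5000",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```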
Once we tried to update a record instead of inserting a new one, a problem occurred. Kafka treats every insert/update operation as a new event and reflects it to Hive as a new insert, which means you will see a new record for each update operation. That did not match our expectations. Additionally, it cannot detect deleted records, so we would need to develop our own approach for deletions. Here we needed someone to blame. Of course, the scapegoat was Kafka; who else could it be?
Hadoop was our king; it was marvelous. Having jumped to this wrong conclusion (as we understood later), we started to investigate what the problem was and ended up with a solution named Apache Sqoop.
Apache’s Answer to the Migration Question: Apache Sqoop
Apache Sqoop is an application used to handle data migration from a source to Apache Hadoop. Here is a description directly from Sqoop’s website.
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
At this point, we were confident that Sqoop was the answer to our migration problem. A small note: our relational database holds more than 500 GB of data, so the phrase “efficiently transferring bulk data” impressed us. Finally, we transferred the data to the Hadoop side using Sqoop.
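For reference, a Sqoop import boils down to a single command; the sketch below drives it from Python only for illustration. The connection string, credentials, table name, and mapper count are placeholders rather than our actual setup.

```python
import subprocess

# A minimal sketch of the kind of Sqoop import we ran; connection details,
# table names and mapper count are placeholders, not our production values.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1",
    "--username", "app_user",
    "--password-file", "/user/app_user/.sqoop.password",  # keeps the password off the command line
    "--table", "CUSTOMERS",
    "--hive-import",              # load the result straight into a Hive table
    "--hive-table", "customers",
    "--num-mappers", "4",         # parallel map tasks doing the bulk transfer
]

subprocess.run(cmd, check=True)
```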
Let’s check whether our update is OK. Guess what! The answer is NO! In the meantime, we figured out another way to test the update problem. We connected directly to Hive using the command-line tool and tried to update a record with a standard SQL UPDATE statement. The error message made it clear that updating a record is not supported on a plain Hive table. Suddenly we had the reason for the previous failure. We could have noticed it earlier, but we simply needed the experience.
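If you want to reproduce this without the Hive CLI, the sketch below runs the same experiment through pyhive (our own test used the command-line client); the host, table, and column names are made-up placeholders, and the exact error text depends on your Hive version and table settings.

```python
from pyhive import hive

# A minimal sketch, assuming HiveServer2 on localhost:10000 and a plain
# (non-transactional) table named "customers"; all names are placeholders.
conn = hive.Connection(host="localhost", port=10000, username="hive")
cursor = conn.cursor()

try:
    # On a non-ACID Hive table this statement is rejected by Hive.
    cursor.execute("UPDATE customers SET name = 'new name' WHERE id = 42")
except Exception as exc:
    print("Update rejected:", exc)
```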
All database providers, like Oracle, MySQL, even SQLite, need to manage a file to store the data. As any introductory file-organization lesson tells you, you need to choose a structured file type, such as random access files or JSON/XML files, as your baseline so that updates will be less painful. In any case, you end up carrying that load on your own shoulders. Random access files are generally a good option in terms of simplicity and reading speed, and of course databases use much more sophisticated approaches for handling the data file. Hadoop, on the other hand, simply chunks the file and does not care about these details. If you change any record at the source, it is a new record for Hadoop/Hive. Hadoop keeps the file handling part as simple as it can. We needed a better understanding of what we could do about this.
Data Sync Approaches
We investigated what kinds of approaches people use in real life. Two approaches seem to be common.
1- Bulk Data Extraction
Companies use Hadoop as a processing environment, not as a persistence layer. This means they do not try to sync the data in real time or near real time. They extract all the data once a day, load it into Hadoop, and then run their code to do the calculations. The results are stored in another environment, although Hadoop could serve as a persistence layer for the results as well. Should we use this? We did not know yet. Let’s check the other option.
2- Sync Cumulative Data
The second approach was to load an initial table and then sync only the delta data afterwards. On top of each table you create a view (yes, you can create views in Hive) that shows the most recent version of each record based on its timestamp. It seemed a better approach than the first one in terms of network efficiency, but worse in terms of development effort.
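To make the idea concrete, here is a minimal sketch of such a view, again driven through pyhive; the table and column names are invented for illustration. Each source row may appear many times in the raw table (once per insert/update), and the view keeps only the newest version per key.

```python
from pyhive import hive

# A minimal sketch of the "latest record wins" view, assuming HiveServer2 on
# localhost:10000 and a raw table customers_raw(id, name, last_updated) that
# receives one new row per insert/update; all names are placeholders.
conn = hive.Connection(host="localhost", port=10000, username="hive")
cursor = conn.cursor()

cursor.execute("""
    CREATE VIEW IF NOT EXISTS customers_latest AS
    SELECT id, name, last_updated
    FROM (
        SELECT id, name, last_updated,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC) AS rn
        FROM customers_raw
    ) t
    WHERE rn = 1
""")
```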
After researching on the internet for a while, we realized that companies like Google face different problems when storing data and have found their own solutions and approaches. Google, for example, decided to remove Hadoop from its future plans and use a different storage method.
To be continued.