Introduction
We recently received a request from a valued customer to ingest data from multiple sources into a storage medium so that it could be accessed, utilized, and analyzed for a variety of revenue-generating use cases. Massive amounts of data must be ingested in near real time, and the data must be queryable without the learning curve required to understand the nuances of big data ecosystem technologies like Spark or Flink.
This article discusses our initial approach, its drawbacks and difficulties, and how it led us to develop a truly innovative platform capable of handling billions of messages in real time. This capability is an essential component of the DigiXT Data Platform (DDP) in situations where persisting and consuming huge volumes of incoming data is critical.
Requirement
We received the request from one of the most prominent names in the telecommunication services industry. The customer wanted to archive, for future analysis, the millions of rows of data generated by its telecom towers. Note that a single tower, on average, creates tens of millions of rows every second, and there are ten of these towers in total.
Figure 1: Problem Statement
Assume that roughly a billion rows per minute is typical. This data must be recorded so that it can be used to answer various analytical queries, and the records must be retained for at least six months. The solution must also handle concurrent queries, and query times must be kept as low as possible. As with any data-driven endeavour, SAAL's team began by delving deep into the data.
First Attempt
The usual, and seemingly most dependable, way to solve any technical problem is to Google for something that can be incorporated and, if required, add some improvements to the existing solutions.
However, at SAAL, we want the solution to be unique, scalable, performant, and secure, and to offer additional advantages over the conventional alternatives. In our search, we found a commercial product that claimed to be capable of consuming billions of messages, provided the payload contained time-series data.
Telco data is inherently time-series, so a database like InfluxDB or RocksDB would seem ideal, and many in the industry are already employing them. This article also briefly goes over how RocksDB can be used to ingest time-series data.
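For illustration, here is a minimal sketch of ingesting time-series telemetry into RocksDB, assuming the python-rocksdb bindings; the key layout, tower IDs, and payload fields are illustrative only and not part of the actual solution.

```python
# Minimal sketch: time-series ingestion into RocksDB (assumes python-rocksdb).
import struct
import time
import json
import rocksdb

opts = rocksdb.Options(create_if_missing=True)
opts.write_buffer_size = 256 * 1024 * 1024  # larger memtable for heavy writes
db = rocksdb.DB("tower_metrics.db", opts)

def make_key(tower_id: int, ts_ms: int) -> bytes:
    # Big-endian encoding keeps keys sorted by tower, then time, so
    # time-range scans per tower become simple range iterations.
    return struct.pack(">IQ", tower_id, ts_ms)

def ingest(rows):
    # Batch writes to amortize write-path overhead.
    batch = rocksdb.WriteBatch()
    for tower_id, ts_ms, payload in rows:
        batch.put(make_key(tower_id, ts_ms), json.dumps(payload).encode())
    db.write(batch)

def scan(tower_id, start_ms, end_ms):
    # Iterate the key range [start, end) for one tower.
    it = db.iteritems()
    it.seek(make_key(tower_id, start_ms))
    for key, value in it:
        if key >= make_key(tower_id, end_ms):
            break
        yield struct.unpack(">IQ", key), json.loads(value)

now = int(time.time() * 1000)
ingest([(1, now, {"rsrp": -92.5, "throughput_mbps": 110.4})])
```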
Once we had settled on the database, we proceeded with the Proof of Concept (POC), beginning the ingestion with the assistance of a highly skilled technical team from the commercial vendor. The first several days were a catastrophe, but we gradually gained momentum. We began ingesting data, and when we saw billions of messages, we knew things were improving and we were getting closer to our objective. However, as the ingestion rate rose and the data flow was sustained for longer periods, we discovered that disk I/O was extremely slow and that the CPU spent much of its time waiting for disk I/O to complete. The "dstat" tool in Linux can readily identify this. The POC ran the latest version of Ubuntu on a machine that was quite powerful in terms of CPU and RAM: 96 cores and 768 gigabytes of RAM. Here is an example of the dstat output:
Figure 2: dstat analysis on the database
We can see that the CPU is doing almost nothing, that I/O wait times are increasing, and that disk I/O is quite heavy, indicating that the system is unable to write efficiently as more data is ingested. This might be a problem with the database settings or the drive, but the disks were attached via iSCSI on SSD, which is well suited to random reads and writes. As we saw it, the connection handling, WAL-based writes, block size parameters, data compaction, and so on introduced too much delay and could not achieve the throughput the solution required. Unfortunately, the commercial product was unable to fulfill our needs, even though it worked flawlessly at smaller scales.
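The same symptom can also be checked programmatically. Below is a minimal sketch, assuming the psutil package, as a rough stand-in for dstat (not what we ran during the POC); the interval and output format are illustrative.

```python
# Rough Python stand-in for the dstat observation: sample CPU I/O-wait and
# disk write throughput to spot an I/O-bound ingestion node (assumes psutil;
# the iowait field is reported on Linux).
import psutil

def sample(interval_s: float = 5.0):
    disk_before = psutil.disk_io_counters()
    cpu = psutil.cpu_times_percent(interval=interval_s)  # blocks for interval_s
    disk_after = psutil.disk_io_counters()
    written_mb = (disk_after.write_bytes - disk_before.write_bytes) / 1e6
    return cpu.iowait, cpu.idle, written_mb / interval_s

if __name__ == "__main__":
    for _ in range(3):
        iowait, idle, write_mb_s = sample()
        # High iowait with a mostly idle CPU, as in Figure 2, suggests the
        # storage path, not the CPU, is the bottleneck.
        print(f"iowait={iowait:.1f}% idle={idle:.1f}% disk_write={write_mb_s:.1f} MB/s")
```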
Applying Innovation: The SAAL Difference
Reflecting on the problem reveals what matters most here: the ability to ingest, store, and retrieve information. Given this, the SAAL team questioned whether we should rely on a database at all. Beyond safely storing, protecting, and retrieving data, we do not require full ACID functionality. Durability and consistency are still crucial, but a file system provides both. We also do not need to join several tables, because the requirement is for all data to be stored in a single table. What we do need is effective compression and a mechanism for efficiently querying large amounts of data.
We quickly concluded that a compressed columnar format was the best option. And, because the data is massive, we examined some of the finest formats used in the big data world:
As demonstrated, the Parquet 2 format provides the most benefits. Because individual tables in the big data ecosystem can be petabytes in size, achieving quick query response times necessitates intelligent filtering of table data based on the criteria in WHERE or HAVING clauses. Large tables are often partitioned using one or more columns that can efficiently filter the range of data. Date columns, for example, are frequently used as partition keys so that data partitions can be eliminated when a date range is given in SQL queries.
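To make partition pruning concrete, here is a minimal sketch, assuming the pyarrow library; the column names, values, and paths are illustrative only.

```python
# Minimal sketch of date-partitioned Parquet writes and partition pruning
# on read (assumes pyarrow; schema and paths are illustrative).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_date": ["2022-01-01", "2022-01-01", "2022-01-02"],
    "tower_id":   [1, 2, 1],
    "rsrp":       [-90.1, -95.3, -88.7],
})

# One sub-directory per event_date value; queries filtering on event_date
# can skip whole partitions without opening their files.
pq.write_to_dataset(table, root_path="towers_parquet", partition_cols=["event_date"])

# Reading back with a filter prunes partitions outside the requested range.
subset = pq.read_table("towers_parquet", filters=[("event_date", "=", "2022-01-01")])
print(subset.num_rows)
```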
In addition to partition-level filtering, the Parquet file format enables file-level filtering based on the minimum and maximum values of each column in the file. These minimum/maximum column values are stored in the file's footer.
If the range of data in the file, bounded by those minimum and maximum values, does not overlap with the range of data requested by the query, the system skips the whole file during scans. Filtering on file-level minimum/maximum statistics was formerly coarse-grained: if a whole file could not be skipped, the full file had to be read. With the addition of Parquet Page Indexes to the Parquet format, scanners can reduce the amount of data read from disk even further, providing a substantial performance boost for SELECT queries in SAAL's DigiXT Data Platform.
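The footer statistics described above are easy to inspect directly. Here is a minimal sketch, assuming pyarrow (the write_page_index flag requires a recent pyarrow release); file and column names are illustrative.

```python
# Minimal sketch: write a Parquet file with column statistics (and, in recent
# pyarrow releases, a Page Index), then read the min/max values back from the
# footer. Assumes pyarrow; names are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts_ms": [1_000, 2_000, 3_000], "rsrp": [-90.1, -95.3, -88.7]})
pq.write_table(table, "metrics.parquet", write_statistics=True, write_page_index=True)

md = pq.ParquetFile("metrics.parquet").metadata
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        c = md.row_group(rg).column(col)
        s = c.statistics
        # A scanner can skip this row group (or the whole file) when the query
        # predicate falls outside [min, max].
        print(c.path_in_schema, s.min, s.max)
```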
The next step was to produce the Parquet files themselves. Rather than relying on the database, we used a memory-mapped format to generate Parquet files on the client side, keeping them in line with industry best practices. As a result, we eliminated the bottleneck at the database server, a tremendous improvement that allowed us to achieve ingestion at a much higher rate than before. The high-level architectural overview is presented below:
Basic benchmarking enabled us to ingest around 50 billion messages per day, with query times of less than a minute.
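For illustration, here is a minimal sketch of the client-side writing idea described above, assuming the pyarrow library; the schema, batch size, compression codec, and paths are illustrative and not the exact DigiXT implementation.

```python
# Minimal sketch of client-side Parquet generation: buffer incoming messages
# in memory and flush them as compressed Parquet files, so the heavy encoding
# work happens on the producer rather than a central database server.
# Assumes pyarrow; schema, thresholds, and paths are illustrative.
import os
import time
import pyarrow as pa
import pyarrow.parquet as pq

SCHEMA = pa.schema([("ts_ms", pa.int64()), ("tower_id", pa.int32()), ("rsrp", pa.float64())])
FLUSH_ROWS = 1_000_000  # flush threshold; tune to the target file size

class ClientSideWriter:
    def __init__(self, out_dir="staged_parquet"):
        self.out_dir = out_dir
        self.buffer = []
        os.makedirs(out_dir, exist_ok=True)

    def append(self, row: dict):
        self.buffer.append(row)
        if len(self.buffer) >= FLUSH_ROWS:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        table = pa.Table.from_pylist(self.buffer, schema=SCHEMA)
        path = f"{self.out_dir}/part-{int(time.time() * 1000)}.parquet"
        # Columnar, compressed output keeps files small and scan-friendly.
        pq.write_table(table, path, compression="zstd")
        self.buffer.clear()

writer = ClientSideWriter()
writer.append({"ts_ms": int(time.time() * 1000), "tower_id": 1, "rsrp": -91.2})
writer.flush()
```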
The end result is as follows: we achieved 254,389 parallel splits of Parquet columnar data and queried 211 billion rows in three minutes, using three nodes running the query in parallel.
Summary
Thanks to this innovative approach, the platform is well suited to any large-scale data ingestion requirement, IoT application, or telco use case. The DigiXT Data Platform offers an out-of-the-box JDBC/ODBC driver that downstream applications can use to consume the data for analytics, machine learning, and reporting, facilitating a complete data-driven ecosystem.
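As a purely hypothetical sketch of what querying through such an ODBC driver could look like, assuming the pyodbc package; the DSN name, table, and columns are placeholders and not the actual DigiXT driver configuration.

```python
# Hypothetical sketch: querying the platform via ODBC (assumes pyodbc).
# "DSN=DigiXT", the table, and the columns are placeholders only.
import pyodbc

conn = pyodbc.connect("DSN=DigiXT")  # placeholder DSN configured for the driver
cursor = conn.cursor()
cursor.execute(
    "SELECT tower_id, COUNT(*) AS rows_ingested "
    "FROM tower_metrics "
    "WHERE event_date = ? "
    "GROUP BY tower_id",
    "2022-01-01",
)
for tower_id, rows_ingested in cursor.fetchall():
    print(tower_id, rows_ingested)
conn.close()
```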
For additional information, please contact us.