Processing large datasets is a common challenge in data engineering and data science. Once the volume reaches billions of records, processing time becomes the critical constraint.
Recently I came across a fascinating challenge on the internet: the "One Billion Row Challenge". It went viral, especially on Twitter (now X). The task is straightforward: process 1 billion temperature records for cities around the world, and return the minimum, maximum, and average temperature for each city.
The official challenge required Java, but each language ended up with its own leaderboard. The fastest Java solution processed the 1 billion records in 1.535 seconds, using GraalVM and unsafe operations. I was curious to see how fast I could process the same 1 billion records in C++.
My approach and solution
My approach was inspired by someone (I forget who) who was building their own database system. They mentioned the mmap() system call, which maps a file into memory so it can be accessed as if it were an array. This is a very efficient way to read large files, since the operating system manages the paging and caching for you.
I decided to use this approach to read the input file containing the 1 billion records. I then used a hash map to store the minimum, maximum, sum, and count of the temperatures for each city. Finally, I iterated over the hash map to calculate each city's average temperature.
The key steps of my solution:
Memory Mapping the Input File
To load the 1 billion records efficiently, I used the mmap() system call, which maps the file into memory so it can be read as one large array. This minimizes I/O overhead: the OS handles the paging and caching, giving much faster access than standard file I/O.
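Here is a minimal POSIX sketch of this step. The function name map_file is illustrative and the error handling is trimmed to the essentials, so treat it as the idea rather than my exact code:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map the whole file read-only into the process's address space.
// Returns a pointer to the file contents and writes the size to `size`.
const char* map_file(const char* path, size_t& size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat sb{};
    if (fstat(fd, &sb) < 0) { close(fd); return nullptr; }
    size = static_cast<size_t>(sb.st_size);

    void* data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the fd is closed
    if (data == MAP_FAILED) return nullptr;

    // Hint that we will read sequentially so the kernel can prefetch
    // pages ahead of the readers.
    madvise(data, size, MADV_SEQUENTIAL);
    return static_cast<const char*>(data);
}
```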
Custom Float Parsing
To parse the temperature values, I implemented a custom parser that converts the string representation of a temperature into a double. It handles both positive and negative numbers, and it is considerably faster than general-purpose functions like std::stod() and std::atof(), which become a real bottleneck at this scale.
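A sketch of the idea, assuming the challenge's fixed input format: an optional minus sign, one or two integer digits, a decimal point, and exactly one fractional digit. Because the format is known, the parser can skip all the generality (and cost) of the standard library:

```cpp
// Parse a temperature like "-12.3" into a double.
// Assumes well-formed 1BRC input: [-]D[D].D
inline double parse_temp(const char* p, const char** end) {
    bool neg = (*p == '-');
    if (neg) ++p;

    int value = 0;
    // Integer part: one or two digits before the '.'
    while (*p != '.') {
        value = value * 10 + (*p - '0');
        ++p;
    }
    ++p;                              // skip '.'
    value = value * 10 + (*p - '0');  // exactly one fractional digit
    ++p;

    if (end) *end = p;                // report where parsing stopped
    double result = value / 10.0;
    return neg ? -result : result;
}
```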
Multithreading Implementation
The dataset is divided into chunks, each processed by a separate thread. This lets the program use the full power of the CPU by distributing the workload across all cores.
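A sketch of the chunking step (split_chunks is an illustrative name; data and size come from the mmap step above). The important detail is that each boundary is pushed forward to the next newline, so no record is ever split between two threads:

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Split the mapped file into roughly equal, newline-aligned chunks,
// one per thread.
std::vector<std::string_view> split_chunks(const char* data, size_t size,
                                           unsigned n_threads) {
    std::vector<std::string_view> chunks;
    size_t start = 0;
    for (unsigned i = 0; i < n_threads && start < size; ++i) {
        size_t end = (i == n_threads - 1)
                         ? size
                         : std::min(start + size / n_threads, size);
        // Advance past the next '\n' so each chunk holds whole lines.
        while (end < size && data[end] != '\n') ++end;
        if (end < size) ++end;  // include the '\n'
        chunks.emplace_back(data + start, end - start);
        start = end;
    }
    return chunks;
}
```

Each chunk is then handed to its own std::thread.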
Lock-Free Processing with Local Aggregation
To avoid contention and synchronization overhead, each thread maintains a local hash map for the temperature data of the cities in its chunk. Each thread therefore processes its chunk completely independently: no locks, no shared state, and no contention during the scan.
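A sketch of the per-thread scan, reusing parse_temp from above. The names Stats and process_chunk are illustrative, and the input is assumed to be well-formed "city;temperature\n" lines:

```cpp
#include <cstdint>
#include <limits>
#include <string_view>
#include <unordered_map>

// Per-city running statistics; each thread owns one map of these,
// so no locking is needed during the scan.
struct Stats {
    double min = std::numeric_limits<double>::max();
    double max = std::numeric_limits<double>::lowest();
    double sum = 0.0;
    uint64_t count = 0;
};

using LocalMap = std::unordered_map<std::string_view, Stats>;

// Process one chunk of "city;temperature\n" lines into a local map.
void process_chunk(std::string_view chunk, LocalMap& local) {
    const char* p = chunk.data();
    const char* end = p + chunk.size();
    while (p < end) {
        const char* sep = p;
        while (*sep != ';') ++sep;            // find end of city name
        std::string_view city(p, sep - p);    // zero-copy key

        const char* after = nullptr;
        double t = parse_temp(sep + 1, &after);

        Stats& s = local[city];
        if (t < s.min) s.min = t;
        if (t > s.max) s.max = t;
        s.sum += t;
        ++s.count;

        p = after + 1;                        // skip the trailing '\n'
    }
}
```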
Thread-Safe Global Aggregation
After each thread finishes processing, its local hash map is merged into a global hash map. This merge step is synchronized with a mutex to ensure thread safety. Finally, the minimum, maximum, and average temperature for each city is computed from the aggregated data.
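A sketch of the merge, reusing Stats and LocalMap from the previous snippet. The mutex is taken once per thread rather than once per record, so its cost is negligible:

```cpp
#include <mutex>
#include <string_view>
#include <unordered_map>

// Shared result map plus the mutex that guards it during merging.
std::unordered_map<std::string_view, Stats> global;
std::mutex global_mutex;

// Fold one thread's local results into the global map.
void merge_into_global(const LocalMap& local) {
    std::lock_guard<std::mutex> lock(global_mutex);
    for (const auto& [city, s] : local) {
        Stats& g = global[city];
        if (s.min < g.min) g.min = s.min;
        if (s.max > g.max) g.max = s.max;
        g.sum += s.sum;
        g.count += s.count;
    }
}
```

Once all threads have merged, each city's average is simply sum / count.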
Avoiding String Copying
To avoid unnecessary string copying and memory allocation, I used std::string_view to represent the city names in the hash maps. This lightweight wrapper provides a non-owning view of the underlying string data, which cuts memory usage and allocation overhead.
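One caveat worth making explicit: a std::string_view key is only safe while the memory it points at stays alive. That holds here because the file remains mapped for the whole run. A toy illustration (the "Jakarta" record is made up):

```cpp
#include <string_view>
#include <unordered_map>

int main() {
    // Stand-in for a slice of the mmap'd buffer.
    const char* record = "Jakarta;30.5\n";
    // Zero-copy: the key points into the buffer; nothing is allocated
    // for the string itself. Safe only because the buffer outlives
    // the map, as the mapping does in the real program.
    std::string_view city(record, 7);  // "Jakarta"
    std::unordered_map<std::string_view, int> counts;
    ++counts[city];
}
```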
Results
After implementing my solution in C++, I ran the program on a machine with 8 cores and 24 GB of RAM. The program processed the 1 billion records in 1.35808 seconds, comparable to the fastest Java solution's 1.535 seconds.
Note: While mmap() can offer significant performance improvements, its behavior varies across operating systems, hardware, and file systems, so results may differ from machine to machine.
1brc; < 2 sec #PamerAjaDulu pic.twitter.com/nuQdnoZoZ3 — el (@helmy_lh), August 18, 2024
Complete code is available on my GitHub repository.