Processing large datasets is a common challenge in data engineering and data science. When dealing with massive volumes of data, optimizing processing time becomes critical to avoid performance bottlenecks.

Recently, I came across a fascinating challenge on the internet: the One Billion Row Challenge. The challenge went viral, especially on Twitter (now X). The task is straightforward: process 1 billion temperature records for cities around the world, and return the minimum, maximum, and average temperature for each city.
The official challenge required Java, although ports to other languages soon appeared, each with its own leaderboard. The fastest Java solution processed the 1 billion records in 1.535 seconds, running on GraalVM and using unsafe operations. I was curious to see how fast I could process the same 1 billion records in C++.
My approach and solution
My approach was inspired by someone (I forgot who) who was building their own database system. They mentioned the mmap() system call, which maps a file into memory so it can be accessed as if it were an array. This is a very efficient way to read large files, because the operating system manages the memory mapping and caching.
I decided to use this approach to read the input file, which contains the 1 billion records. I then used a hash map to store the minimum, maximum, and sum of the temperatures for each city. Finally, I iterated over the hash map to calculate the average temperature for each city.
The key steps of my solution:
Memory Mapping the Input File
To efficiently load the 1 billion records, I used the mmap() system call, which maps the file into memory and treats it as an array. This approach minimizes I/O overhead by leveraging the OS's memory management capabilities.

```cpp
char* file_data = static_cast<char*>(
    mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
```

This lets the OS handle memory paging and caching, providing faster file access than standard file I/O.
Custom Float Parsing
To parse the temperature values from the input file, I implemented a custom float parser that converts the string representation of a number to a double. The parser is optimized for performance and handles both positive and negative numbers.

```cpp
double parse_float(const char* str, size_t len) {
    float result = 0.0f;
    bool negative = false;
    size_t i = 0;
    // Handle a leading minus sign
    if (str[i] == '-') {
        negative = true;
        i++;
    }
    // Parse the integer part
    while (i < len && str[i] != '.') {
        result = result * 10.0f + static_cast<float>(str[i] - '0');
        i++;
    }
    // Parse the fractional part
    if (i < len && str[i] == '.') {
        i++;
        float fraction = 1.0f;
        while (i < len) {
            fraction *= 0.1f;
            result += static_cast<float>(str[i] - '0') * fraction;
            i++;
        }
    }
    return negative ? -result : result;
}
```

This custom parser is more efficient than standard library functions like std::stod() and std::atof(), which can be slow for large datasets.
Multithreading Implementation
The dataset is divided into chunks, each of which is processed by a separate thread. This allows the program to utilize the full power of the CPU by distributing the workload across all cores.

```cpp
size_t num_threads = std::thread::hardware_concurrency();
size_t chunk_size = sb.st_size / num_threads;
```

Lock-Free Processing with Local Aggregation
To avoid contention and synchronization overhead, each thread maintains a local hash map to store the temperature data for the cities it processes. This lock-free approach allows each thread to process its chunk independently and efficiently.

```cpp
std::unordered_map<std::string, locationData> local_loc_data;
process_chunk(file_data, start_pos, end, local_loc_data);
```

Because the local hash maps are updated independently by each thread, there is no contention on shared state during the main pass.
Thread-Safe Global Aggregation
After each thread finishes processing, its local hash map is merged into a global hash map. This merging step is synchronized with a mutex to ensure thread safety.

```cpp
// Update global data
std::lock_guard<std::mutex> lock(loc_data_mutex);
for (const auto& [key, value] : local_loc_data) {
    // Merge data
}
```

Finally, the minimum, maximum, and average temperatures for each city are computed from the aggregated data.
Avoiding String Copying
To avoid unnecessary string copying and memory allocation, I used std::string_view to represent lines and fields during parsing. This lightweight wrapper provides a non-owning view of the underlying memory-mapped data, reducing memory usage and improving performance; a std::string is constructed only at the point where the city name is stored as a hash-map key:

```cpp
std::string location = std::string(line.substr(0, delimiter_pos)); // Extract the location
```
Results
After implementing my solution in C++, I ran the program on a machine with 8 cores and 24 GB of RAM. The program processed the 1 billion records in 1.35808 seconds, comparable to the fastest Java solution's 1.535 seconds.
Note: While mmap() can offer significant performance improvements, its behavior is unpredictable and can vary depending on the operating system, hardware, and file system.
1brc; < 2 sec #PamerAjaDulu pic.twitter.com/nuQdnoZoZ3
— el (@helmy_lh) August 18, 2024
Complete code is available on my GitHub repository.