Processing large datasets is a common challenge in data engineering and data science. When dealing with massive volumes of data, optimizing processing time becomes critical to avoid performance bottlenecks.

Recently, I came across a fascinating challenge on the internet: the "One Billion Row Challenge". It went viral, especially on Twitter (now X). The task is straightforward: process 1 billion temperature records for cities around the world, and return the minimum, maximum, and average temperature for each city.

The challenge is language-specific: each language has its own leaderboard, and the official challenge required Java. The fastest Java solution processed the 1 billion records in 1.535 seconds, using GraalVM and unsafe operations. I was curious to see how fast I could process the 1 billion records in C++.

My approach and solution

My approach was inspired by someone (I forgot who) building their own database system. They mentioned the mmap() system call, which maps a file into memory so it can be accessed as if it were an array. This is a very efficient way to read large files, as the operating system manages the memory mapping and caching.

I decided to use this approach to read the input file, which contains the 1 billion records. I then used a hash map to store the minimum, maximum, and sum of the temperatures for each city. Finally, I iterated over the hash map to calculate the average temperature for each city.

The key steps of my solution:

  1. Memory Mapping the Input File
    To efficiently load the 1 billion records, I used the mmap() system call, which maps the file into memory and treats it as an array. This approach minimizes the I/O overhead by leveraging the OS’s memory management capabilities.

    char* file_data = static_cast<char*>(mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    This approach lets the OS handle memory paging and caching, providing faster file access compared to standard file I/O.
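
    For context, here is a minimal sketch of the full mapping sequence around that call; the file name is an assumption and error handling is omitted for brevity:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int fd = open("measurements.txt", O_RDONLY);  // hypothetical input file name
    struct stat sb;
    fstat(fd, &sb);                               // sb.st_size holds the file length
    char* file_data = static_cast<char*>(
        mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    // ... scan file_data as one contiguous char array ...
    munmap(file_data, sb.st_size);
    close(fd);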

  2. Custom Float Parsing
To parse the temperature values from the input file, I implemented a custom float parser that converts the string representation of a float to a double. The parser takes a pointer to the field and its length, is optimized for performance, and handles both positive and negative numbers.

    double parse_float(const char* str, size_t len) {
        double result = 0.0;
        bool negative = false;
        size_t i = 0;
        // Handle an optional leading minus sign
        if (str[i] == '-') {
            negative = true;
            i++;
        }
        // Parse the integer part
        while (i < len && str[i] != '.') {
            result = result * 10.0 + static_cast<double>(str[i] - '0');
            i++;
        }
        // Parse the fractional part
        if (i < len && str[i] == '.') {
            i++;
            double fraction = 1.0;
            while (i < len) {
                fraction *= 0.1;
                result += static_cast<double>(str[i] - '0') * fraction;
                i++;
            }
        }
        return negative ? -result : result;
    }

    This custom parser is more efficient than standard library functions like std::stod() and std::atof(), whose locale handling and error checking add overhead that compounds over a billion calls.
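
    For example, applied to the temperature field of a record (a usage sketch):

    const char* field = "-12.3";           // the bytes after the ';' delimiter
    double temp = parse_float(field, 5);   // yields -12.3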

  3. Multithreading Implementation
    The dataset is divided into chunks, each processed by a separate thread, as sketched below. This lets the program use the full power of the CPU by distributing the workload across all cores.

    size_t num_threads = std::thread::hardware_concurrency();
    size_t chunk_size = sb.st_size / num_threads;
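
    A sketch of how those chunks can be handed to worker threads; worker stands in for the per-thread entry point, and the key detail is extending each chunk to the next newline so no record is split across threads (requires <thread> and <vector>):

    std::vector<std::thread> threads;
    const size_t file_size = static_cast<size_t>(sb.st_size);
    size_t start_pos = 0;
    for (size_t t = 0; t < num_threads; ++t) {
        size_t end = (t == num_threads - 1) ? file_size : start_pos + chunk_size;
        // Advance to the next newline so the chunk ends on a record boundary
        while (end < file_size && file_data[end] != '\n') {
            ++end;
        }
        threads.emplace_back(worker, file_data, start_pos, end);
        start_pos = end + 1;  // the next chunk begins after the newline
    }
    for (auto& th : threads) {
        th.join();
    }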
  4. Lock-Free Processing with Local Aggregation
    To avoid contention and synchronization overhead, each thread maintains a local hash map to store the temperature data for the cities it processes. This lock-free approach allows each thread to process its chunk independently and efficiently.

    std::unordered_map<std::string, locationData> local_loc_data;
    process_chunk(file_data, start_pos, end, local_loc_data);

    The local hash maps are updated independently by each thread, reducing contention and improving performance.
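
    The locationData type is never spelled out above, so here is a plausible layout and the per-record update it supports inside process_chunk; the field names are my assumptions (requires <limits> and <algorithm>):

    struct locationData {
        double min = std::numeric_limits<double>::max();
        double max = std::numeric_limits<double>::lowest();
        double sum = 0.0;
        size_t count = 0;
    };

    // Inside process_chunk, after a record is split into city and temperature:
    locationData& data = local_loc_data[location];  // location is the city name
    data.min = std::min(data.min, temperature);
    data.max = std::max(data.max, temperature);
    data.sum += temperature;
    data.count += 1;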

  5. Thread-Safe Global Aggregation
    After each thread finishes processing, the local hash maps are merged into a global hash map. This merging process is synchronized using a mutex to ensure thread safety.

    // Merge each thread's local map into the global one under a mutex
    std::lock_guard<std::mutex> lock(loc_data_mutex);
    for (const auto& [key, value] : local_loc_data) {
        // Merge data (see the sketch below)
    }

    Finally, the minimum, maximum, and average temperatures for each city are computed based on the aggregated data.
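
    The merge body elided above is straightforward once locationData is pinned down. Assuming the min/max/sum/count layout sketched in step 4 (and a global map named loc_data, which is my guess at the name), it reduces to:

    locationData& global = loc_data[key];
    global.min = std::min(global.min, value.min);
    global.max = std::max(global.max, value.max);
    global.sum += value.sum;
    global.count += value.count;
    // A city's average is then global.sum / global.count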

  6. Avoiding String Copying
    To avoid unnecessary string copying and memory allocation, I used std::string_view to represent the city names in the hash map. This lightweight wrapper provides a non-owning view of the underlying string data, reducing memory usage and improving performance.

    std::string_view location = line.substr(0, delimiter_pos); // a view into the mapped file, no copy
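
    One caveat: with a map keyed by std::string, each lookup still materializes a temporary std::string from the view. Under C++20, heterogeneous lookup with a transparent hash lets the map be probed with a std::string_view directly; a sketch of that option (not necessarily what my final code does):

    // Transparent hash so find() accepts std::string_view without a copy (C++20)
    struct StringHash {
        using is_transparent = void;
        size_t operator()(std::string_view sv) const {
            return std::hash<std::string_view>{}(sv);
        }
    };

    std::unordered_map<std::string, locationData, StringHash, std::equal_to<>> loc_data;

    std::string_view location = line.substr(0, delimiter_pos);
    auto it = loc_data.find(location);  // no temporary std::string allocated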

Results

After implementing my solution in C++, I ran the program on a machine with 8 cores and 24 GB of RAM. It processed the 1 billion records in 1.35808 seconds, comparable to the fastest Java solution's 1.535 seconds.

Note: While mmap() can offer significant performance improvements, its behavior can vary unpredictably with the operating system, hardware, and file system.

The complete code is available in my GitHub repository.