Skip to content

tieliao/1brc

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 

Repository files navigation

1️⃣🐝🏎️ The One Billion Row Challenge in C

As promised, I worked out an implementation in C for 1brc. The goal is to provide some benchmarks with C code and challenge other implementations. I am specially interested in what performance can be achieved by implementations in different languages for this particular application.

This implementation shows, by using some simple and comprehensive optimization techniques (no SIMD and no dependencies), the performance can be considerably increased.

To know more about 1BRC, here are some pointers:

Cloning this repository

git clone https://github.com/tieliao/1brc
cd 1brc

Generating raw data

You may need to install jdk 21.

git clone https://github.com/gunnarmorling/1brc
cd 1brc
./mvnw clean verify
./create_measurements.sh 100000000

This will create a file, namely measurements.txt, with 100 million rows. This file contains 413 different city names.

To create a file (measurements3.txt) with 10k different city names:

./create_measurements3.sh 100000000

To test the baseline implementation:

time ./calculate_average_baseline.sh > output.txt

Running the challenge

make

Or run it manually:

gcc -O3 -DNDEBUG -o summarize 1brc.c -lpthread
time ./summarize 1brc/measurements.txt $(nproc)

Results

I was limited by the resources of my laptop which is an Intel 4-core i7-9750H CPU @ 2.60GHz/6GB RAM. So I can only test with 100 million rows. The first version of my code includes already some basic optimizations and gives an execution time of 3.6s. I did further optimizations with multi-threads, on hashing and parsing. The final version gives an execution time of 0.4s.

For comparaison, I ran the top 3 java best-performers and another C implementation as well.

Result (m:s.ms) Implementation Language Author
00:00.390 link C Tie Liao
00:00.813 link C Danny van Kooten
00:00.880 link 21.0.1-graalce Artsiom Korzun
00:00.945 link 21.0.1-graalce Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen
00:00.961 link 21.0.1-graalce Jaromir Hamala
00:27.131 link 21.0.1-graalce Baseline Gunnar Morling

The above results are using 413 different city names. I also tested with 10k different city names and the results are:

Result (m:s.ms) Implementation Language Author
00:00.677 link C Tie Liao
00:01.406 link 21.0.1-graalce Artsiom Korzun
00:01.440 link 21.0.1-graalce Jaromir Hamala
00:01.609 link 21.0.1-graalce Thomas Wuerthinger, Quan Anh Mai, Alfonso² Peterssen
00:02.345 link C Danny van Kooten
00:33.134 link 21.0.1-graalce Baseline Gunnar Morling

The accuracy of output was successfully checked against that of baseline in both cases.

For anyone who is interested in running the code with 1 billion rows, please let me know.

About

1 Billion Row Challenge in C

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published