jrosell/1br

1br

Introduction

1 Billion Row challenge with R:

  • This repo is inspired by Gunnar Morling's 1 billion row challenge. It checks which functions and libraries are quickest at summarizing the mean, min and max of 1 billion rows of records.
  • This work is based on alejandrohagan/1br and #5, but I've only used 1e8 rows.
  • I added some duckdb options and a polars scan option. To do so, I added a file copy step and a file reading step to each benchmark method, so that the pipelines are compared without caching and with a maximum of 8 threads.
  • If you see any issues or have suggestions for improvement, please let me know.
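
The copy, read and summarize pattern described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the column names `state` and `measurement`, the helper `summarise_copy()` and the small synthetic data set are hypothetical and are not taken from run.R, and the sketch does not show the duckdb or polars variants or the 8-thread limit.

```r
# Minimal sketch in base R. Column names and the helper below are
# illustrative assumptions, not the code used in run.R.
set.seed(1)
n <- 1e4
measurements <- data.frame(
  state = sample(c("CA", "NY", "TX"), n, replace = TRUE),
  measurement = round(rnorm(n, mean = 10, sd = 5), 1)
)
source_file <- tempfile(fileext = ".csv")
write.csv(measurements, source_file, row.names = FALSE)

# Hypothetical helper: copy the file, read the copy, then summarise per group,
# so that copying and reading are part of the timed pipeline.
summarise_copy <- function(path) {
  copy_path <- tempfile(fileext = ".csv")
  file.copy(path, copy_path)
  on.exit(unlink(copy_path))
  dat <- read.csv(copy_path)
  res <- aggregate(
    measurement ~ state,
    data = dat,
    FUN = function(x) c(min = min(x), mean = mean(x), max = max(x))
  )
  # aggregate() returns a matrix column here; flatten it into plain columns.
  data.frame(state = res$state, res$measurement)
}

timing <- system.time(result <- summarise_copy(source_file))
print(result)
print(timing[["elapsed"]])
```

Timing the whole helper, rather than only the aggregation, is what makes the comparison between methods include the file handling steps.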

Instructions

  • If needed, install the required packages by executing install_required_packages(install = TRUE) from the install.R file.
  • Generate the 1e5, 1e6, 1e7, 1e8 and 1e9 data sets by running: ./generate_data.sh
  • Run the benchmark with: Rscript run.R or Rscript run.1e9.R
  • Check the generated plots and the results.

Results

2024-05-16

2024-03-27

2024-02-29

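library(dplyr)  # provides %>% and group_split()
# Read the saved results and split them into one data frame per value of n
readr::read_rds("2024-02-29_all.rds") %>%
  group_split(n)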
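
For readers without the .rds file, the effect of splitting the results by n can be sketched on a toy data frame with base R's split(). Everything except the column name `n` is a hypothetical stand-in: the real structure of the results file is not described here, and the `method` and `seconds` columns and their values are made up for illustration.

```r
# Toy stand-in for the benchmark results. Only the column `n` comes from the
# snippet above; `method`, `seconds` and all values are hypothetical.
results <- data.frame(
  n = c(1e6, 1e6, 1e7, 1e7),
  method = c("a", "b", "a", "b"),
  seconds = c(0.5, 0.7, 4.8, 6.1)
)

# One data frame per data size, similar in spirit to dplyr::group_split(n).
by_size <- split(results, results$n)

# Within each data size, order the methods from fastest to slowest.
ranked <- lapply(by_size, function(d) d[order(d$seconds), ])
print(ranked)
```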

What can I do?

If you have the time and enough memory available on your computer, you can try to get the results for 1e9 rows:

  • Uncomment 1e9 lines on ./generate_data.sh
  • Comment out run.R:25 and uncomment run.R:26
  • Generate the 1e6, 1e7, 1e8 and 1e9 data sets by running: ./generate_data.sh
  • Run the benchmark with: Rscript run.R
  • Check the generated plots.
  • Compare with other languages and solutions (look at compare.php, or onebrc for Rust)

Feedback is welcome. You can open an issue in this repo.
