Big data technology benchmarks

This project compares big data technologies for cleaning, manipulating, and generally wrangling data for analysis and machine learning.

These are the benchmarks for this article.

The analysis is done on 100 GB of taxi data from 2009 to 2015.

Technologies

General Remarks

Some notebooks require a restart of the kernel after package installation.
Different notebooks run on different kernels; check the top of each notebook to see which kernel it uses.
The notebooks for technologies that do not run out-of-core are set to work with only 1M rows.
In special cases, notebooks needed to be restarted for optimal performance. That might not be fair, but I wanted to get the most out of each technology.
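
The 1M-row cap for technologies that do not run out-of-core could look roughly like the sketch below. It uses only the Python standard library and assumes CSV input; the helper name `read_first_rows`, the constant `MAX_ROWS`, and the file format are assumptions for illustration and are not taken from the notebooks.

```python
import csv
from itertools import islice

MAX_ROWS = 1_000_000  # assumed cap for technologies that cannot run out-of-core


def read_first_rows(path, limit=MAX_ROWS):
    """Return at most `limit` data rows from a CSV file as a list of dicts."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)  # the first line is treated as the header
        return list(islice(reader, limit))  # stop reading once the cap is reached
```

With pandas, `pandas.read_csv(path, nrows=1_000_000)` applies the same kind of cap.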

Instructions

Create an S3 bucket to store your results (or remove this part from the persist function in the code).
Create an ml.c5d.4xlarge instance on AWS SageMaker with an extra 500 GB of storage.
Run the get_data.ipynb notebook to mount the SSD and download the data.
Run the notebook you want to test.

At the beginning of each notebook, make sure the instance name and the S3 bucket name are correct.
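
One way to keep the instance name and bucket in a single place, and to make the S3 upload optional, is sketched below. The names `INSTANCE_NAME` and `S3_BUCKET`, the output file layout, and the signature of `persist` are hypothetical; the persist function in the notebooks may look different. The sketch assumes `boto3` is installed whenever a bucket is set.

```python
import json
from pathlib import Path

# Hypothetical settings; the instance type from the instructions is used as a placeholder name.
INSTANCE_NAME = "ml.c5d.4xlarge"
S3_BUCKET = None  # set to your bucket name, or leave as None to skip the S3 upload


def persist(results, name, out_dir="results", bucket=S3_BUCKET):
    """Write benchmark results to a local JSON file and optionally upload it to S3."""
    out_path = Path(out_dir) / f"{name}_{INSTANCE_NAME}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2))

    if bucket:  # upload only when a bucket is configured
        import boto3  # imported lazily so the local-only path works without boto3

        boto3.client("s3").upload_file(str(out_path), bucket, out_path.name)
    return out_path
```

Leaving `S3_BUCKET` as `None` corresponds to removing the S3 part of the persist step: results are still written locally, and nothing is uploaded.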

Good luck!

xdssio/big_data_benchmarks
