
Big data technology comparisons for cleaning, manipulating, and generally wrangling data for analysis and machine learning.


xdssio/big_data_benchmarks


Big data technology benchmarks

This project compares big data technologies for cleaning, manipulating, and generally wrangling data for analysis and machine learning.

The benchmarks for this article.

The analysis is done on 100 GB of taxi data from 2009 to 2015.

Technologies

General Remarks

  • Some notebooks require a kernel restart after package installation.
  • Different notebooks run on different kernels; check the top of each notebook to see which one it uses.
  • The notebooks for technologies that don't run out of core are set to work with only 1M rows.
  • In special cases, notebooks needed to be restarted for optimal performance. That might not be fair, but I wanted to get the most out of each technology.
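
The 1M-row cap for technologies that cannot run out of core can be sketched roughly as follows. This is a library-agnostic illustration using only the Python standard library, not code from the notebooks; the helper name `read_capped` and the assumption of a CSV input are hypothetical.

```python
import csv
from itertools import islice

MAX_ROWS = 1_000_000  # row cap for technologies that cannot run out of core


def read_capped(path, max_rows=MAX_ROWS):
    """Return at most `max_rows` data rows from a CSV file as dicts.

    Hypothetical helper: the name and the CSV format are assumptions.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # islice stops pulling rows once the cap is reached, so the rest of
        # the file is never parsed or held in memory.
        return list(islice(reader, max_rows))
```

Most in-memory dataframe libraries offer an equivalent row limit on their readers, which is likely the more natural choice inside a notebook.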

Instructions

  1. Create an S3 bucket to store your results (or remove this part from the persist function in the code).
  2. Create an ml.c5d.4xlarge instance on AWS SageMaker with an extra 500 GB of storage.
  3. Run the get_data.ipynb notebook to mount the SSD and download the data.
  4. Run the notebook you want to test.
  • At the beginning of each notebook, make sure the instance name and the S3 bucket name are correct.
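
The optional S3 step in the instructions above can be sketched as below. This is an illustrative guess at what a persist helper might look like, not the repository's actual function: the signature, the JSON format, the `results` directory, and the object key are all assumptions. It calls boto3's `upload_file` only when a bucket name is given, so passing `bucket=None` skips S3 entirely.

```python
import json
from pathlib import Path


def persist(results, name, bucket=None, out_dir="results"):
    """Save benchmark results locally and, optionally, copy them to S3.

    Hypothetical helper: the signature and file layout are assumptions.
    """
    out_path = Path(out_dir) / f"{name}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2))

    if bucket is not None:
        # Imported lazily so the local-only path works without boto3.
        import boto3

        # upload_file(Filename, Bucket, Key) copies the local file to S3.
        boto3.client("s3").upload_file(str(out_path), bucket, out_path.name)
    return out_path
```

With this shape, dropping the S3 part amounts to leaving `bucket` unset, for example `persist({"read_file": 1.5}, "demo")`.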

Good luck!
