This package provides MapReduce functionality for Python users on a standard Mac. MapReduce can speed up data processing on machines that have sufficient memory and multiple cores. The tool is built for use with the Acquire Valued Shoppers Dataset and can easily be hacked for use in other applications.
This tool was built for use with the Acquire Valued Shoppers Dataset hosted by Kaggle.
- Download the Acquire Valued Shoppers Dataset and save it to the "data" directory.
- Run the main() function in dataStoreSetup.py to populate your "chunks" directory.
- Use buildFeatures.py to generate some example features.
Data is first distributed across a number of CSV files, which are stored in a directory named "chunks".
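The chunking step can be sketched roughly as follows. This is a minimal illustration, not the code in dataStoreSetup.py: the function name `split_into_chunks`, the `chunk_%d.csv` file naming, and the choice to route rows by a hash of a key column are all assumptions. It also uses only the standard library and Python 3 syntax, whereas the package itself targets Python 2.7.5.

```python
import csv
import os
import zlib


def split_into_chunks(src_path, chunk_dir, key_column, n_chunks):
    """Split one CSV into n_chunks files (hypothetical helper).

    Rows are routed by a CRC32 hash of key_column, so every row that
    shares a key value lands in the same chunk file.
    """
    if not os.path.isdir(chunk_dir):
        os.makedirs(chunk_dir)
    paths = [os.path.join(chunk_dir, "chunk_%d.csv" % i) for i in range(n_chunks)]

    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        handles = [open(p, "w", newline="") for p in paths]
        try:
            writers = [csv.DictWriter(h, fieldnames=reader.fieldnames) for h in handles]
            for writer in writers:
                writer.writeheader()  # each chunk repeats the source header
            for row in reader:
                idx = zlib.crc32(row[key_column].encode("utf-8")) % n_chunks
                writers[idx].writerow(row)
        finally:
            for h in handles:
                h.close()
    return paths
```

Hashing on a key keeps related rows together, which lets a mapper compute per-key results without seeing other chunks. A plain round-robin split would balance file sizes more evenly if that property is not needed.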
Mappers are initialized using the Python multiprocessing library. Each mapper reads a pre-assigned subset of the CSV files stored in the "chunks" directory. Results from the mapper processes are stored in another set of CSV files in a directory titled "reduce_store". A shared list of locks is used to prevent mappers from simultaneously writing to the same reduce_store CSV file.
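A minimal sketch of this map step is shown below. It is an illustration under stated assumptions rather than the package's own code: the names `mapper` and `run_mappers`, the `reduce_%d.csv` file naming, the two-column (key, value) output, and the hash-based choice of output file are hypothetical. The sketch uses Python 3 syntax and requests the "fork" start method explicitly (Unix only), which matches how Python 2.7 starts processes on a Mac.

```python
import csv
import multiprocessing
import os
import zlib


def mapper(chunk_paths, reduce_dir, locks, key_column, value_column):
    """Read the assigned chunk files and append (key, value) rows to reduce_store."""
    n_buckets = len(locks)
    for path in chunk_paths:
        buckets = {}  # bucket index -> rows collected from this chunk
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = row[key_column]
                b = zlib.crc32(key.encode("utf-8")) % n_buckets
                buckets.setdefault(b, []).append((key, row[value_column]))
        for b, pairs in buckets.items():
            out_path = os.path.join(reduce_dir, "reduce_%d.csv" % b)
            # locks[b] guards reduce_<b>.csv, so only one mapper appends at a time
            with locks[b]:
                with open(out_path, "a", newline="") as f:
                    csv.writer(f).writerows(pairs)


def run_mappers(chunk_paths, reduce_dir, n_mappers, n_buckets, key_column, value_column):
    """Start n_mappers processes, each with a pre-assigned subset of the chunks."""
    if not os.path.isdir(reduce_dir):
        os.makedirs(reduce_dir)
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    locks = [ctx.Lock() for _ in range(n_buckets)]  # one lock per output file
    assignments = [chunk_paths[i::n_mappers] for i in range(n_mappers)]
    procs = [
        ctx.Process(target=mapper,
                    args=(subset, reduce_dir, locks, key_column, value_column))
        for subset in assignments
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    if any(p.exitcode != 0 for p in procs):
        raise RuntimeError("at least one mapper process failed")
```

Using one lock per output file, rather than a single global lock, lets mappers write to different reduce_store files at the same time and only serializes writes that target the same file.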
Reducers are called to consolidate the output from the mappers. Reducers store their output in memory. The output from the reduce step is consolidated into a single dataframe and written to a CSV file located in the "features" directory.
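The reduce step might look something like the sketch below. It assumes each reduce_store file holds headerless two-column rows (a key and a numeric value) and that the reduction is a per-key sum. Both are illustrative assumptions, as are the names `reduce_file` and `build_features` and the `key`/`total` output columns. For portability the sketch uses a plain dict and the standard csv module in place of a pandas dataframe, and it runs the reducers one after another.

```python
import csv
import glob
import os


def reduce_file(path):
    """Sum the values for each key in one reduce_store file, in memory."""
    totals = {}
    with open(path, newline="") as f:
        for key, value in csv.reader(f):
            totals[key] = totals.get(key, 0.0) + float(value)
    return totals


def build_features(reduce_dir, features_path):
    """Run a reducer per file, merge the results, and write one features CSV."""
    combined = {}
    for path in sorted(glob.glob(os.path.join(reduce_dir, "*.csv"))):
        for key, total in reduce_file(path).items():
            # merge in case the same key shows up in more than one file
            combined[key] = combined.get(key, 0.0) + total

    out_dir = os.path.dirname(features_path)
    if out_dir and not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    with open(features_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["key", "total"])  # hypothetical column names
        for key in sorted(combined):
            writer.writerow([key, combined[key]])
    return combined
```

With pandas available, the merged dict could instead be turned into a dataframe and saved with `DataFrame.to_csv`, which is closer to the workflow described above.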
- pandas
- NumPy
- SciPy
Built and tested with Python 2.7.5.