The objective of this project was to develop a machine learning model capable of forecasting the efficiency of different Apache Spark MLlib operators when executed on a cluster of multiple nodes.
We gathered a large dataset of time and memory metrics from various executions of operators and utilized it to train machine learning models capable of predicting these metrics without actually executing the operators.
Machine learning operators executed in big data analytics runtimes (e.g., Apache Spark) are often complex and take a significant amount of time to complete on large volumes of data. In this project, we collected multiple measurements of how the execution of different operators progresses over time and used learning algorithms to build models that predict their performance without executing them.
We completed the following steps in the project:
- Installation and setup of Apache Spark: Using Okeanos-based resources, we installed and set up Apache Spark as our open-source, distributed analytics engine for executing the operators.
- Operator Selection: We selected three operators from Apache Spark's MLlib library: k-means, random forest regression, and Word2Vec. These operators are diverse but belong to the same family.
- Data Generation and Loading: Using an artificial data generator per operator, we created sample input data of different sizes and structures, with data points of different dimensions.
- Measurement of Performance and Modeling: We executed multiple combinations of operator and input data and monitored, for each combination, the total running time and the cluster's main memory usage. These measurements were then used to train regression models that predict both metrics with minimal error on unseen inputs.
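The modeling step can be illustrated with a minimal sketch. It assumes a single feature (input size) and a plain least-squares line; the README does not say which regression model or features the project actually used, and the measurements below are synthetic values generated from a known formula, not project results. The names `fit_line` and `predict` are hypothetical.

```python
# Illustrative only: the measurements below are synthetic, not project results.

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Estimate the metric for input size x without running the operator."""
    slope, intercept = model
    return intercept + slope * x


# Hypothetical training set: input rows (in thousands) and running time (seconds),
# generated from time = 2.0 + 0.5 * rows so the fit is easy to check.
rows_k = [10, 20, 40, 80]
seconds = [2.0 + 0.5 * r for r in rows_k]

model = fit_line(rows_k, seconds)

# Estimate the running time for an input size that was never executed.
estimate = predict(model, 60)
```

In practice the feature vector would likely include more than the input size (for example data dimensionality or operator parameters), and a separate model would be fitted for each operator and metric.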
To run experiments using the code, follow these steps:
- Run repeaterScript.sh and provide the input parameters: the operator and the number of experiments you want to run.
- Depending on your input, either random_forest_training.py, kmeans_training.py, or w2v_training.py will run. These scripts call data generators to create artificial data for training the corresponding operator's model.
- After model training finishes, the memory usage and total training time are gathered and saved in CSV files.
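The measure-and-record loop described in these steps might look roughly like the sketch below. It is not the project's code: `measure`, `dummy_training`, `run_experiments`, and the CSV column names are hypothetical, and `tracemalloc` only tracks Python allocations in the local process, so it stands in for the cluster-wide memory monitoring the project performed.

```python
import csv
import io
import time
import tracemalloc


def measure(train_fn, *args):
    """Run train_fn and return (elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        train_fn(*args)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak


def dummy_training(n):
    # Hypothetical stand-in for an operator's training call.
    return sum(i * i for i in range(n))


def run_experiments(operator, sizes, out):
    """Write one CSV row of metrics per experiment to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(["operator", "input_size", "seconds", "peak_bytes"])
    for n in sizes:
        elapsed, peak = measure(dummy_training, n)
        writer.writerow([operator, n, f"{elapsed:.6f}", peak])


# An in-memory buffer keeps the example self-contained; a real run would open a file.
buffer = io.StringIO()
run_experiments("kmeans", [1_000, 10_000], buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Each row pairs the experiment's inputs with its measured metrics, which is the shape a regression model needs for training.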
By following these steps, you can run experiments with different operators and input parameters and collect performance metrics for each run. This data can be used to train prediction models for each operator and evaluate their accuracy.
- Apostolis Garos: https://github.com/ApostolisGaros
- Nikos Vlachakis: https://github.com/NikosVlachakis
- Georgios Angelis: https://github.com/GeorgeAngelis