In the English Premier League, May to July is a lull period because no club football is played. What makes up for it is the intense transfer speculation that surrounds every major player transfer today. You may have seen "Moneyball", where Peter Brand explains to Billy Beane that "It's about getting things down to one number. Using stats the way we read them (players), we find value in players nobody else can see." An important part of transfer negotiations is therefore predicting a fair market price for a player, and exploratory data analysis (EDA) is a useful first step toward that.
- The bar graph above shows how many players each club has and how those players are distributed across positions.
- x-axis --> club names
- y-axis --> total number of players
Player positions are distinguished by color within each bar. The color code for each position is explained on the right side of the chart.
Arsenal, Everton, and Huddersfield have the most players, with 28 each. Burnley has the fewest, with 18.
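
The counts behind a stacked bar chart like this can be sketched as a per-club tally of positions. The rows below are made-up sample data, and the function and variable names are hypothetical; this is only an illustration of the grouping, not the code that produced the chart.

```python
from collections import Counter, defaultdict

# Made-up sample rows of (club, position); the real dataset is much larger.
players = [
    ("Arsenal", "Forward"),
    ("Arsenal", "Midfielder"),
    ("Arsenal", "Midfielder"),
    ("Burnley", "Defender"),
    ("Burnley", "Goalkeeper"),
]


def squad_breakdown(rows):
    """Return {club: Counter({position: count})}, one Counter per bar."""
    breakdown = defaultdict(Counter)
    for club, position in rows:
        breakdown[club][position] += 1
    return dict(breakdown)


breakdown = squad_breakdown(players)

# Total bar height per club is the sum of its position counts.
squad_sizes = {club: sum(counts.values()) for club, counts in breakdown.items()}
largest = max(squad_sizes, key=squad_sizes.get)
```

With pandas, a comparable club-by-position table could be built with `pandas.crosstab` and plotted as stacked bars.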
The histogram of player ages can be described as an "edge peak" histogram.
- An edge peak distribution looks like a normal distribution except that it has a large peak at one tail.
- As the histogram shows, the number of players increases with age until it reaches its peak between ages 25 and 30.
- One bar is significantly higher than the others, which is why the histogram above can be called an edge peak histogram.
- The bar chart above shows the 10 players with the highest market value.
- x-axis --> player names
- y-axis --> market value
Eden Hazard and Paul Pogba have the highest market value, at 75 each. They are followed by Alexis Sanchez in 3rd position.
- The bar chart above shows the top 10 most viewed players.
- x-axis --> player names
- y-axis --> number of views
Wayne Rooney has the most views, followed by Paul Pogba and Dele Alli in 2nd and 3rd position.
Each square shows the correlation between the two variables on its axes.

Values close to zero mean there is no linear trend between the two variables.

The closer the correlation is to 1, the stronger the positive linear relationship. A larger number and a darker color both indicate a higher correlation between the two variables.
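
As a rough illustration of what each square in the heatmap holds, the Pearson correlation between two columns can be computed as below. The column names and values are made up for the example and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for two hypothetical columns.
views = [100, 200, 300, 400]
value_rising = [10, 20, 30, 40]    # moves with views
value_falling = [40, 30, 20, 10]   # moves against views

r_positive = pearson(views, value_rising)
r_negative = pearson(views, value_falling)
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear trend.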
- The bar chart above shows the number of players by nationality in the FPL.
- x-axis --> countries
- y-axis --> number of players
We can conclude that England has the highest number of players.
There are three primary metrics used to evaluate regression models. These are:
- Mean absolute error (MAE),
- Mean squared error (MSE), and
- Root mean squared error (RMSE).

MAE: The easiest to understand. It is the average absolute difference between the predicted and actual values, in the same units as the target.

MSE: Similar to MAE, but the errors are squared before averaging, so larger errors are “punished” more heavily. It is harder to interpret than MAE because it is not in base units, but it is widely used.

RMSE: A very popular metric. It is the square root of MSE, which makes it more interpretable because it is back in base units. It is recommended that RMSE be used as the primary metric to interpret your model.
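
The three metrics can be sketched in a few lines. The actual and predicted values below are made up for illustration, and `regression_errors` is a hypothetical helper name, not a function from the project.

```python
import math


def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE for paired actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


# Made-up market values and predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 35.0]

errors = regression_errors(actual, predicted)
```

Here the residuals are -2, 2, -3 and 5, so the single large miss of 5 raises RMSE above MAE, which is the "punishing" effect described above.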
Algorithm Used | R2 Value | MAE | MSE | RMSE |
---|---|---|---|---|
Linear Regression | 0.7057 | 4.7294 | 44.04 | 6.636 |
Nearest Neighbour Regression | 0.4511 | 6.0111 | 82.16 | 9.064 |
Support Vector Machine | 0.3589 | 5.4067 | 95.95 | 9.795 |
Decision Tree Regression | 0.5169 | 5.4298 | 72.30 | 8.503 |
Random Forest Regression | 0.7284 | 4.3749 | 40.65 | 6.376 |
Gradient Boosted Regression | 0.7598 | 4.0861 | 35.94 | 5.995 |
Tuning a model's parameters to find an optimal configuration is called hyperparameter tuning.
- Two methods were considered for hyperparameter tuning:
- RandomizedSearchCV --> a faster way to tune a model, because it samples a fixed number of parameter combinations
- GridSearchCV --> a slower way to tune a model, because it tries every combination
RandomizedSearchCV was used for hyperparameter tuning because it is faster than GridSearchCV.
An example for RandomForestRegressor is shown below.
- Assigned the hyperparameters in the form of a dictionary for the RandomForestRegressor
{ 'max_depth': [5, 13, 21, 30], 'max_features': ['auto', 'sqrt'], 'min_samples_split': [5, 10, 15, 100], 'n_estimators': [100, 320, 540, 760, 980, 1200] }
- Fitted the model using 3-fold cross-validation
rf_random.fit(X_train,y_train) # Fitting 3 folds for each of 10 candidates, totalling 30 fits
- Checked the best parameters and best score
rf_random.best_params_ # { 'max_depth': 21, 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 1200 }
- Scores after hyperparameter tuning
Algorithm Used | R2 Value | MAE | MSE | RMSE |
---|---|---|---|---|
Random Forest Regression | 0.7685 | 4.1569 | 40.01 | 6.325 |
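
The tuning steps described above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the data is synthetic, and the search space is deliberately smaller than the one listed earlier so that it runs quickly. The names `rf_random`, `X_train` and `y_train` follow the text, and everything else is illustrative. The sketch uses `1.0` instead of `'auto'` for `max_features`, since recent scikit-learn releases no longer accept `'auto'` for this parameter.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption): the real features and target differ.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Reduced search space (assumption) so the example finishes in seconds.
param_distributions = {
    "max_depth": [5, 13, 21],
    "max_features": ["sqrt", 1.0],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [20, 50],
}

# Sample 5 random combinations and score each with 3-fold cross-validation.
rf_random = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=0,
)
rf_random.fit(X_train, y_train)

best_params = rf_random.best_params_        # best sampled combination
best_cv_score = rf_random.best_score_       # mean cross-validated R2
test_r2 = rf_random.score(X_test, y_test)   # R2 on held-out data
```

Setting `random_state` on both the estimator and the search makes the sampled candidates reproducible, which helps when comparing scores before and after tuning.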
A higher R^2 score indicates a better fit, so from the above findings we can conclude that Gradient Boosted Regression has the highest R2 score, followed by Random Forest Regression and Linear Regression.
Algorithm Used | R2 Score |
---|---|
Linear Regression | 74% |
Nearest Neighbour Regression | 57% |
Tree Regression | 61% |
Random Forest Regression | 78% |
Gradient Boosted Regression | 80% |
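
For reference, the R2 score used to rank the models compares the model's squared error with that of always predicting the mean. The sketch below uses made-up values and a hypothetical helper name, `r_squared`; it is not the project's evaluation code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Made-up market values and predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 35.0]

score = r_squared(actual, predicted)
baseline = r_squared(actual, [25.0] * 4)  # always predicting the mean
```

A score of 1 means perfect predictions, while a model that always predicts the mean scores 0, so an R2 of 80% means the model removes 80% of the squared error of that baseline.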