Concrete Compressive Strength Prediction using Machine Learning 😎😎

-Anshul Sharma, IIT Kharagpur

Cement is one of the components of concrete. Concrete is produced by mixing the required substances in the right proportions. The strength of concrete may be influenced by:

  1. Ratio of cement to water
  2. Size of aggregate
  3. Texture and stiffness of particles

In this project, we aim to predict the compressive strength of concrete from data about its ingredients. The data set for this project is acquired from the UCI Machine Learning Repository; a short loading sketch follows the attribute list below.

The data set has the following attributes:

  1. Cement
  2. Blast Furnace Slag
  3. Fly Ash
  4. Water
  5. Superplasticizer
  6. Coarse Aggregate
  7. Fine Aggregate
  8. Age
  9. Concrete compressive strength (the target variable)
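As a quick start, here is a minimal sketch of loading the data with pandas. The local filename `Concrete_Data.xls` is an assumption; download the file from the UCI page first.

```python
import pandas as pd

# Load the UCI concrete data set. The local filename is an assumption --
# download Concrete_Data.xls from the UCI ML Repository first.
data = pd.read_excel("Concrete_Data.xls")

print(data.shape)   # expected: (1030, 9) -- 8 input attributes + the target
print(data.head())
```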

Prerequisites

You should have the following software/libraries installed:

Python 3
Scikit-learn
Jupyter Notebook
SciPy
NumPy
Pandas
Matplotlib
Seaborn

Important Machine Learning Algorithms

A good understanding of the following algorithms is required.

Linear regression

We aim to predict a target variable from some given data variables. The idea is to fit a straight line through the data set, drawn so that it passes as close as possible to all the points. Once we have done that, we can predict the target value using that line as the hypothesis function.
*(Figure: a line fitted to the data points.)*
The only problem is how to determine the values of the slope and intercept of the line.
*(Figure: linear regression intuition.)*
We can use calculus to derive the slope and intercept that minimize the error between the line and the points. More commonly, gradient descent is used to update the parameters at each step. We will not go into the detailed mathematics, because this note only provides intuition for the algorithms.
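As a quick illustration, here is a minimal scikit-learn sketch of fitting a straight line; the synthetic data and variable names are just placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic 1-D data (placeholder): y is roughly 3x + 2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=1.0, size=50)

model = LinearRegression()
model.fit(X, y)  # learns the slope and intercept internally

print(model.coef_, model.intercept_)  # fitted slope and intercept
print(model.predict([[5.0]]))         # prediction for a new point
```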

Polynomial Regression

Sometimes a line may not fit the data because the data has a higher-degree polynomial nature, so a straight line won't be enough. For this we need to increase the degree of the attributes (which can be done using the scikit-learn library). Check the figure below for linear regression.
*(Figure: a straight line fitting the data well.)*
As you can see in the figure, the line fits the data almost perfectly, so simple linear regression is fine in this case. Now look at this figure.
*(Figure: a straight line failing to fit nonlinear data.)*
The line for the predicted speeds is not able to fit the ground truth. The data is too complex for simple linear regression to make good predictions; we need higher-degree attributes. Now look at this figure.
*(Figure: a polynomial curve fitting the data.)*
This hypothesis function fits the data almost perfectly. So to decide between linear and polynomial regression, you have to visualize the data; Python's seaborn library is a good aid for visualization. For the detailed mathematics behind polynomial regression, consult a dedicated reference.
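Here is a minimal sketch of how the degree of the attributes can be increased with scikit-learn; the quadratic toy data is a placeholder:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data (placeholder): y is roughly x^2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)

# Raise the attributes to degree-2 terms, then fit a plain linear model.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print(model.predict([[2.0]]))  # should be close to 2^2 = 4
```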

Random Forest Regression

First let's look at ensemble learning. Ensemble learning is a technique that combines the predictions from multiple weak machine learning models to make more accurate predictions. Since the resulting model is composed of many models, it is called an ensemble model. Look at the image below for a better understanding.
*(Figure: an ensemble combining several weak learners.)*
Random forest is a supervised machine learning algorithm that can be used for both classification and regression. It is built from decision trees that have no interaction between them: each tree is completely independent of the others, acts as a standalone model, and is combined with the rest to form the ensemble. Each tree uses a random sample of the original data set when generating its splits, and this randomness helps prevent overfitting. For many data sets, a random forest produces a highly accurate model. It also estimates which variables are important, so it can be used for feature selection, and it has an effective method for estimating missing data while maintaining accuracy.
*(Figure: a random forest of independent decision trees.)*
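A minimal sketch with scikit-learn's RandomForestRegressor, using synthetic placeholder data in place of the concrete attributes:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (placeholder) standing in for the 8-attribute concrete set.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 independent trees, each grown on a random bootstrap sample of the data.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # R^2 on held-out data
print(forest.feature_importances_)    # per-attribute importance estimates
```

The `feature_importances_` attribute is what makes the forest usable for the feature selection mentioned above.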

Support Vector Regression

SVM is one of the most powerful machine learning algorithms available today. Plain linear regression can only fit a line to the data points, while support vector regression can also fit a hyperplane. In SVM, the margin is the perpendicular distance between the hyperplane and the closest points; SVM tries to maximise this margin, and no penalty is applied to points that fall within it, even in multi-dimensional data.
Look at the image.
*(Figure: an SVM hyperplane with its margin.)*
Making a hyperplane that fits this data is very difficult, especially if it's high-dimensional, but SVM handles the task well. SVR tries to fit as many data points as possible within the margin, thus keeping the error within a threshold. A major difference between linear regression and SVR is that linear regression tries to minimize the error, while SVR tries to keep it within a threshold. For further information visit the scikit-learn SVR documentation.
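A minimal SVR sketch under placeholder data; the hyperparameter values here (`C`, `epsilon`) are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (placeholder); SVR is scale-sensitive, so standardize first.
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

# epsilon sets the width of the no-penalty tube around the fitted function.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X, y)

print(model.predict(X[:3]))  # predictions for the first three samples
```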

When to use what? 😕😕

  • Linear regression is a linear model, which means it works really well on data with linear properties, but it cannot capture non-linear features.
  • In that case, you can use decision trees or random forests, which do a better job of capturing the non-linearity in the data by dividing the space into smaller sub-spaces.
  • Random forests are ensemble models, which makes them more robust than single decision trees on noisy data, whereas standard regression methods can easily be confused by noise and end up with high error.
  • Support vector models normally perform better on sparse data than random forests. Decision trees handle non-linear data well and train faster, but they have a tendency to overfit. A quick way to compare the candidates is sketched below.
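When in doubt, a practical tie-breaker is to compare cross-validated scores of the candidate models on your data. A minimal sketch, again with placeholder data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Placeholder data; swap in the real concrete attributes and target here.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Mean 5-fold cross-validated R^2 for each candidate model.
for name, model in [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(random_state=0)),
    ("svr", SVR()),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```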

Now pat yourself on the back!! You have successfully completed this module. 🏆 🏆


Acknowledgments

  • DevIncept Mentor
  • Google for images
  • Medium blog posts
  • Wikipedia
  • UCI Machine Learning dataset repository