The purpose of this project is to illustrate how some of the models used in machine learning work. The implementations are designed to make each model easy to understand, not to maximise computational efficiency.
- [Machine-Learning-From-The-Ground-Up]
Linear regression fits a model with coefficients w = (w0, w1, ..., wn) to minimise the mean squared error cost function between the observed targets in the dataset and the targets predicted by the linear approximation. The linear model can be regularized with Ridge Regression, Lasso Regression or Elastic Net to decrease variance.
Figure 1: Training process of the Linear Regression model.
Figure 2: A visualization of how the accuracy of the model increases over training epochs.
Figure 3: A simple visualization of how regularization techniques affect the coefficients of the model.
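The training loop can be sketched in a few lines of NumPy. This is an illustrative simplification, not the repository's implementation; the function name and the `l2` parameter (which adds a ridge penalty when positive) are assumptions made for the example.

```python
import numpy as np

def fit_linear(X, y, lr=0.1, epochs=1000, l2=0.0):
    """Batch gradient descent on the MSE cost; l2 > 0 adds a ridge penalty."""
    Xb = np.c_[np.ones(len(X)), X]            # prepend a bias column
    w = np.zeros(Xb.shape[1])
    n = len(y)
    for _ in range(epochs):
        grad = (2 / n) * Xb.T @ (Xb @ w - y)  # gradient of the MSE cost
        grad[1:] += (2 * l2 / n) * w[1:]      # ridge term (bias not penalised)
        w -= lr * grad
    return w

# Recover y = 3x + 1 from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.01, 100)
w = fit_linear(X, y)                 # w[0] ≈ 1 (intercept), w[1] ≈ 3 (slope)
w_ridge = fit_linear(X, y, l2=10.0)  # the ridge penalty shrinks the slope
```

Increasing `l2` pulls the coefficients toward zero, which is the effect Figure 3 illustrates.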
Polynomial regression fits a model with coefficients w = (w0, w1, ..., wn) to minimise the mean squared error cost function between the observed targets in the dataset and the targets predicted by the polynomial approximation. Polynomial regression is a powerful model able to capture complex non-linear relationships between the features and the target.
Figure 1: Training process of the Polynomial Regression model.
Figure 2: A visualization of how the accuracy of the model increases over training epochs.
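One way to sketch the idea: expand the feature into its powers and reuse an ordinary least-squares fit. The function names are illustrative, and the closed-form `np.linalg.lstsq` solve stands in for the repository's gradient-based training.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D feature to the powers [x, x^2, ..., x^degree]."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def fit_poly(x, y, degree):
    """Closed-form least-squares fit of a degree-`degree` polynomial."""
    X = np.c_[np.ones(len(x)), poly_features(x, degree)]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

x = np.linspace(-1, 1, 50)
y = 0.5 * x ** 2 - x + 2           # a noiseless quadratic target
w = fit_poly(x, y, degree=2)       # w ≈ [2, -1, 0.5]
```

Because the expanded model is still linear in its coefficients, everything from the linear regression section carries over unchanged.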
A support vector machine is a classifier that finds the hyperplane that best separates the classes in the training data. The model predicts the class of an instance based on its position relative to the separating hyperplane (the decision boundary).
Figure 1: A visualization of the separating hyperplane of a support vector machine in three dimensions.
Figure 2: A visualization of the decision boundaries of support vector machines with various kernels.
Figure 3: An illustration of the prediction confidence of a support vector machine with a Radial Basis Function kernel on non-linearly separable data.
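A rough sketch of a linear soft-margin SVM, trained by sub-gradient descent on the hinge loss (the function name and hyperparameters are assumptions for this example; kernels as in Figure 2 are beyond its scope):

```python
import numpy as np

def fit_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Sub-gradient descent on the soft-margin hinge loss; y must be in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mask = y * (X @ w + b) < 1                         # margin violators
        grad_w = w - C * (y[mask, None] * X[mask]).sum(axis=0)
        grad_b = -C * y[mask].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two linearly separable blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = fit_linear_svm(X, y)
pred = np.sign(X @ w + b)   # class = side of the hyperplane the point falls on
```

The sign of `X @ w + b` encodes exactly the "position relative to the separating hyperplane" described above.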
A decision tree is a tree-like model that predicts the target value of an instance through a series of attribute tests. Each instance moves from the root node down the tree until it reaches a leaf node, at which point its target value is predicted; the path it follows is determined by the outcome of the attribute test at each internal node.
Figure 1: A visualization of the decision boundary of a decision tree.
Figure 2: A visualization of the predictions of decision trees with various regularization parameters.
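A CART-style classifier can be sketched as below: greedily pick the attribute test that minimises weighted Gini impurity, recurse, and predict the majority class at each leaf. Names and the `max_depth` regularization parameter are illustrative, not the repository's API.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - (p ** 2).sum()

def best_split(X, y):
    """Exhaustively test thresholds on every feature and return the
    attribute test with the lowest weighted Gini impurity."""
    best, best_score = (None, None), np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow(X, y, depth=0, max_depth=3):
    values, counts = np.unique(y, return_counts=True)
    if depth == max_depth or len(values) == 1:
        return values[counts.argmax()]          # leaf: predict majority class
    j, t = best_split(X, y)
    if j is None:
        return values[counts.argmax()]
    left = X[:, j] <= t
    return (j, t, grow(X[left], y[left], depth + 1, max_depth),
            grow(X[~left], y[~left], depth + 1, max_depth))

def predict_one(tree, x):
    """Follow the attribute tests from the root down to a leaf."""
    while isinstance(tree, tuple):
        j, t, l, r = tree
        tree = l if x[j] <= t else r
    return tree

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])
tree = grow(X, y)
preds = [predict_one(tree, x) for x in X]       # → [0, 0, 0, 1, 1, 1]
```

Lowering `max_depth` is one of the regularization parameters whose effect the figures illustrate.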
Ensemble models aggregate the predictions of multiple base estimators in order to make a final prediction. The ensemble models implemented in this repository are listed below.
Classification ensemble models use various strategies to aggregate the predictions of multiple base estimators into a final prediction. Classification ensemble models included: VotingClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier.
Figure 1: A visualization of the decision boundaries of various classification ensemble models (VotingClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier).
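The simplest aggregation strategy, hard voting, can be sketched in a few lines (an illustrative stand-in for the VotingClassifier's strategy, with assumed names):

```python
import numpy as np

def hard_vote(predictions):
    """Majority vote over the class predictions of several classifiers;
    `predictions` has shape (n_estimators, n_samples)."""
    predictions = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in predictions.T])

# Three classifiers voting on four instances.
votes = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 1, 0]]
final = hard_vote(votes)             # → [0, 1, 1, 0]
```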
Regression ensemble models use various strategies to aggregate the predictions of multiple base estimators into a final prediction. Regression ensemble models included: VotingRegressor, BaggingRegressor, RandomForestRegressor, ExtraTreesRegressor.
Figure 1: A visualization of the predictions of various regression ensemble models (VotingRegressor, BaggingRegressor, RandomForestRegressor, ExtraTreesRegressor).
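Bagging for regression can be sketched as fitting one base model per bootstrap sample and averaging. This is a simplification with assumed names; a straight-line fit via `np.polyfit` stands in for the decision-tree base estimators the repository's models would typically use.

```python
import numpy as np

def bagging_regressor(x, y, x_new, n_estimators=25, seed=0):
    """Fit one base model per bootstrap sample and average their predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(y), len(y))     # sample with replacement
        coeffs = np.polyfit(x[idx], y[idx], deg=1)
        preds.append(np.polyval(coeffs, x_new))
    return np.mean(preds, axis=0)

# Noisy samples of y = 2x + 1.
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 100)
y = 2 * x + 1 + rng.normal(0, 0.05, 100)
pred = bagging_regressor(x, y, np.array([0.0, 1.0]))   # ≈ [1, 3]
```

Averaging over resampled fits is what reduces the variance of the ensemble relative to any single base estimator.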
A stacking model is an ensemble model in which the predictions of the estimators in the ensemble are aggregated through the use of a final estimator, the blender/meta learner.
Figure 1: A visualization of the predictions of a StackingClassifier and StackingRegressor.
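A minimal sketch of blending, assuming simple least-squares linear models throughout: the base estimators are fitted on one half of the data, and the blender is fitted on their predictions for the held-out half. (Production stacking implementations often use out-of-fold predictions instead; all names here are illustrative.)

```python
import numpy as np

def lstsq_fit(X, y):
    """Least-squares linear model; returns a prediction function."""
    w, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return lambda X_new: np.c_[np.ones(len(X_new)), X_new] @ w

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (200, 2))
y = X @ np.array([1.0, -2.0]) + 0.5

# Two base estimators, each seeing only one feature; the blender is a third
# linear model fitted to their predictions on the held-out half of the data.
half = len(y) // 2
base_a = lstsq_fit(X[:half, :1], y[:half])
base_b = lstsq_fit(X[:half, 1:], y[:half])
Z = np.column_stack([base_a(X[half:, :1]), base_b(X[half:, 1:])])
blender = lstsq_fit(Z, y[half:])

# Predict: feed every instance's base predictions through the blender.
Z_all = np.column_stack([base_a(X[:, :1]), base_b(X[:, 1:])])
pred = blender(Z_all)
```

Here the blender learns how to weight each base estimator, recovering the full linear target even though neither base model sees both features.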
An AdaBoost classifier trains estimators sequentially. An initial estimator is fitted to the training set; each subsequent estimator is then fitted with more emphasis placed on the instances misclassified by its predecessor. This process of fitting estimators to correct the misclassifications of the previous estimator is repeated until the ensemble contains the desired number of estimators.
Figure 1: A visualization of the accuracies of AdaBoost classifiers with varying numbers of estimators.
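The reweighting scheme can be sketched with decision stumps as base estimators (an illustrative simplification with assumed names; labels are in {-1, +1}):

```python
import numpy as np

def fit_stump(X, y, w):
    """Best weighted decision stump (a threshold test on one feature)."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= t, -sign, sign)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, t, sign), err
    return best, best_err

def fit_adaboost(X, y, n_estimators=5):
    n = len(y)
    w = np.full(n, 1 / n)                 # start with uniform instance weights
    ensemble = []
    for _ in range(n_estimators):
        (j, t, sign), err = fit_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # estimator weight
        pred = np.where(X[:, j] <= t, -sign, sign)
        w *= np.exp(-alpha * y * pred)    # up-weight misclassified instances
        w /= w.sum()
        ensemble.append((alpha, j, t, sign))
    return ensemble

def predict_adaboost(ensemble, X):
    score = sum(a * np.where(X[:, j] <= t, -sign, sign)
                for a, j, t, sign in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
ensemble = fit_adaboost(X, y, n_estimators=3)
```

Each round, misclassified instances gain weight, so the next stump concentrates on them; the final prediction is a weighted vote of all stumps.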
A gradient boosting model is an ensemble model in which estimators are trained sequentially, each successive estimator being trained to predict the pseudo-residuals of all of the estimators trained before it. Once the model has been trained, the predictions of all of the estimators in the ensemble are aggregated to predict the target value of an instance.
Figure 1: The deviations from the target values of the predictions of gradient boosting classifiers with varying numbers of estimators.
Figure 1: A visualization of the predictions of gradient boosting regressors with varying numbers of estimators.
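A regression sketch of the residual-fitting loop, using single-feature regression stumps as base estimators (names and the learning rate are assumptions; for squared error the pseudo-residuals are just the ordinary residuals):

```python
import numpy as np

def fit_stump_reg(X, y):
    """Regression stump: the split minimising squared error, predicting leaf means."""
    best, best_sse = None, np.inf
    for t in np.unique(X[:, 0])[:-1]:       # single feature, for brevity
        left = X[:, 0] <= t
        lm, rm = y[left].mean(), y[~left].mean()
        sse = ((y[left] - lm) ** 2).sum() + ((y[~left] - rm) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (t, lm, rm), sse
    return best

def predict_stump(stump, X):
    t, lm, rm = stump
    return np.where(X[:, 0] <= t, lm, rm)

def fit_gbr(X, y, n_estimators=50, lr=0.1):
    f0 = y.mean()                           # initial constant prediction
    pred = np.full(len(y), f0)
    stumps = []
    for _ in range(n_estimators):
        stump = fit_stump_reg(X, y - pred)  # fit to current residuals
        pred = pred + lr * predict_stump(stump, X)
        stumps.append(stump)
    return f0, lr, stumps

def predict_gbr(model, X):
    """Aggregate: initial constant plus the scaled sum of all stumps."""
    f0, lr, stumps = model
    return f0 + lr * sum(predict_stump(s, X) for s in stumps)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
model = fit_gbr(X, y)
```

With each added stump the residuals shrink by a factor related to the learning rate, which is why the figures show predictions tightening as the number of estimators grows.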
Dimensionality reduction is the process of reducing the number of variables under consideration. This is done by calculating a set of principal variables.
Principal Component Analysis is a dimensionality reduction technique. First, the axis (one-dimensional subspace) that accounts for the maximum variance of the data is found; then a second axis, orthogonal to the first, that accounts for the maximum remaining variance; and so on, until an axis has been found for each dimension. These axes are the principal components of the data. The data can then be projected onto a selected number of principal components, starting with the axes that account for the most variance.
Figure 1: A visualization of the principal components of a three-dimensional dataset.
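The principal components can be computed via the SVD of the centred data; the right singular vectors are exactly the orthogonal axes described above, ordered by the variance they account for. A compact sketch (names are illustrative):

```python
import numpy as np

def pca(X, n_components):
    """Project onto the top principal components of the centred data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T          # columns: axes ordered by variance
    return Xc @ W, W

# 3-D data that varies almost entirely along the direction (1, 2, 0).
rng = np.random.default_rng(3)
t = rng.normal(0, 1, 200)
X = np.column_stack([t, 2 * t, 0.01 * rng.normal(0, 1, 200)])
Z, W = pca(X, n_components=1)    # first principal component ≈ ±(1, 2, 0)/√5
```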
The objective of clustering algorithms is to partition the given data into the selected number of groups such that the data points within each group are more similar to one another than to those in other groups.
K-means and K-medians are clustering algorithms that partition the given data into the specified number of clusters. Each instance is assigned to the cluster with the nearest mean in the case of K-means, or the nearest median in the case of K-medians.
Figure 1: An illustration of the clusters formed by the K-means and K-medians algorithms.
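K-means is typically fitted with Lloyd's algorithm, sketched below (names are assumptions; K-medians would replace the mean update with the median):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: assign each point to the nearest centroid, then move
    each centroid to the mean of its assigned points, until assignments settle."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == i].mean(axis=0) if (labels == i).any()
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated blobs.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

Note that the result depends on the random initial centroids; practical implementations usually run several restarts and keep the best clustering.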
Instance-based learning predicts the target value of new instances by comparing them to the instances seen during training, which have been stored in memory. This prediction strategy differs from that of other machine learning algorithms which make explicit generalizations based on the data seen during training.
A K-nearest neighbours classifier/regressor makes predictions based on the target values of the k nearest training instances in the feature space.
Figure 1: A visualization of the predictions of a K-nearest neighbours classifier and a K-nearest neighbours regressor.
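Both variants can be sketched with one function (an illustrative simplification with assumed names): find the k nearest stored training instances, then take the majority class for classification or the mean target for regression.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3, regression=False):
    """Predict from the k nearest stored training instances."""
    preds = []
    for x in np.atleast_2d(X_new):
        # indices of the k nearest training instances by Euclidean distance
        idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        if regression:
            preds.append(y_train[idx].mean())
        else:
            vals, counts = np.unique(y_train[idx], return_counts=True)
            preds.append(vals[counts.argmax()])
    return np.array(preds)

X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_cls = np.array([0, 0, 0, 1, 1, 1])
y_reg = 2 * X_train[:, 0]
```

Since all the work happens at prediction time, there is no training step beyond storing the data, which is the hallmark of instance-based learning described above.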