Welcome to the Neural Network from Scratch in C++ project! This repository features a straightforward implementation of a neural network built entirely from the ground up using C++. Designed to engage AI and machine learning enthusiasts, this project provides a hands-on opportunity to explore the mathematical and programming principles behind neural networks. Whether you're a learner or an experienced developer, you'll gain deeper insights into the inner workings of neural networks and their underlying algorithms.
- Concept & Intuition
- Core Components
- Main Function
- Breaking into Modules
- Implementation
- Example: XOR operation prediction
- License
- Citation
- Contributing
- Author
Have you ever wondered what enables humans to breathe, walk, make decisions, respond to stimuli, and ultimately think? The answer lies in the brain and central nervous system, which consists of billions of interconnected neurons. Similarly, artificial neural networks (ANNs) are computational models inspired by the structure of the biological brain and neurons. They consist of interconnected layers of artificial neurons that process information and learn from data, enabling the network to make decisions and predictions.
In this project, we aim to build a neural network from scratch using C++, demystifying the concepts and mathematics behind these models. By manually implementing each component, we gain a deeper understanding of how neural networks operate and how they learn from data.
The fully-connected layer is arguably one of the most vital constituents of any neural network. Its main job is to apply an affine transformation to the incoming data. But what exactly is an affine transformation? Simply put, it is a linear transformation (multiplying a vector by a matrix) followed by a translation (adding another vector to the transformed vector). Mathematically, we can describe the output of a fully-connected layer (or affine transformation) as:
$$y = Wx + b$$
Where:
- $W$ represents the weight matrix (performs a linear transformation on $x$),
- $x$ is the input vector,
- $b$ is the bias vector (translates the transformed $x$),
- $y$ is the output vector.
Although this transformation is essential to the neural network, it is still limited in power, especially for processing highly complex data. The reason is that an affine transformation can only scale, rotate, shear, and translate its input, and none of these operations introduces nonlinearity. Why is that?
Let's see an example. Suppose we have a neural network consisting solely of 2 fully-connected layers. Then we can write out the equations as follows:
$$y_1 = W_1 x + b_1 \qquad (1)$$
$$y_2 = W_2 y_1 + b_2 \qquad (2)$$
Then if we substitute (1) into (2), this is what we get:
$$y_2 = W_2 (W_1 x + b_1) + b_2$$
After that, we can group the terms that multiply $x$ and the terms that do not:
$$y_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$$
As you can see, this looks just like another affine transformation, with weight matrix $W_2 W_1$ and bias vector $W_2 b_1 + b_2$. This implies that no matter how many layers you put into your network, without a nonlinearity the network cannot perform any processing more complex than a single affine transformation (you can consult this video for more explanation). This is why the activation function needs to come into play.
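To make the collapse concrete, here is a minimal C++ sketch that is not part of the project's code. The aliases `Vec` and `Mat` and the helpers `affine`, `matmul`, and `collapse` are hypothetical names introduced only for this illustration. It builds the single layer $W = W_2 W_1$, $b = W_2 b_1 + b_2$ so that you can compare it with applying the two layers one after the other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // row-major: Mat[r][c]

// Computes y = W x + b.
Vec affine(const Mat &W, const Vec &x, const Vec &b)
{
    assert(W.size() == b.size());
    Vec y(b); // start from the bias, then add W x row by row
    for (std::size_t r = 0; r < W.size(); ++r)
    {
        assert(W[r].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c)
            y[r] += W[r][c] * x[c];
    }
    return y;
}

// Computes the matrix product A * B (A must be non-empty).
Mat matmul(const Mat &A, const Mat &B)
{
    assert(!A.empty() && A[0].size() == B.size());
    Mat C(A.size(), Vec(B[0].size(), 0.0));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t k = 0; k < B.size(); ++k)
            for (std::size_t j = 0; j < B[0].size(); ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Collapses two stacked affine layers into one: W = W2 W1, b = W2 b1 + b2.
void collapse(const Mat &W1, const Vec &b1, const Mat &W2, const Vec &b2, Mat &W, Vec &b)
{
    W = matmul(W2, W1);
    b = affine(W2, b1, b2);
}
```

Calling `affine(W2, affine(W1, x, b1), b2)` and `affine(W, x, b)` with the collapsed `W` and `b` should give the same vector for any input `x`, which is exactly why stacking affine layers alone adds no expressive power.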
An activation function in a neural network introduces non-linearity, allowing the model to learn and represent complex patterns in data. Without these non-linear functions, even a deep network would effectively behave like a shallow linear model, limiting its ability to capture intricate relationships. Here are some commonly used activation functions:
Sigmoid Function
The sigmoid function has an S-shaped curve and maps any real-valued input to a value between 0 and 1. This makes it particularly useful for applications that require probabilities, such as binary classification.
Rectified Linear Unit (ReLU)
The Rectified Linear Unit (ReLU) function is both simple and effective. It outputs the input value directly if the input is positive, and zero if the input is negative. This straightforward approach introduces non-linearity into neural networks, allowing them to model complex patterns while remaining computationally efficient.
Leaky Rectified Linear Unit (Leaky ReLU)
Leaky ReLU (Rectified Linear Unit) is a variant of the ReLU activation function used in neural networks. While standard ReLU outputs zero for any negative input, leaky ReLU allows a small, non-zero gradient for negative inputs. This small slope, usually a small constant like 0.01, helps to prevent the "dying ReLU" problem where neurons can get stuck during training with zero output.
Hyperbolic Tangent Function (Tanh)
The tanh activation function, which stands for hyperbolic tangent, is a widely used non-linear function in neural networks. Like the sigmoid function, it features an S-shaped curve, but it maps input values to an output range of -1 to 1. This output range centers the data around zero, which can improve the performance and stability of the training process by ensuring more balanced gradients and faster convergence.
Now that we have laid some of the foundations of the neural network, from the fully-connected layer performing the affine transformation to the activation functions that provide the nonlinearity, another question may arise: How can we obtain the optimal weight matrix and bias vector for the affine transformation? How can the neural network learn those values effectively?
Well, first things first, when learning anything, one of the most essential parts is the goal or objective that one wants to achieve. In the case of neural networks, it is to make a prediction that is as close to the target value as possible. As the guiding light for the learning process of neural networks, what we need is the loss function.
A loss function is a crucial component in a neural network that quantifies the difference between a model's predictions and the actual target values. It provides a numerical measure of how well the model is performing, guiding the optimization process by indicating how to adjust the model’s parameters to minimize errors. Essentially, the loss function acts as a feedback mechanism, helping to improve the accuracy of the model by minimizing the discrepancy between predicted and true outcomes. Some commonly used loss functions are:
Mean Squared Error (MSE)
Mean Squared Error (MSE) is a loss function widely used in machine learning and statistics. It measures the average squared difference between the predicted values and the actual values, so a lower MSE indicates a better fit and a higher MSE a poorer one.
$$\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
where:
- $N$ is the total number of data points.
- $y_i$ is the actual value for the i-th data point.
- $\hat{y}_i$ is the predicted value for the i-th data point.
Binary Cross Entropy (BCE)
Binary Cross Entropy (BCE) is another loss function commonly used in machine learning, especially for binary classification problems. It measures the dissimilarity between the predicted probability and the true binary label.
$$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where:
- $N$ is the total number of data points.
- $y_i$ is the actual value for the i-th data point.
- $\hat{y}_i$ is the predicted value for the i-th data point (it should be a probability, for example the output of an activation function like Sigmoid).
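As a quick worked example with made-up numbers (they do not come from the project), take $N = 2$ targets $y = (1, 0)$ and predictions $\hat{y} = (0.9, 0.2)$, and use the natural logarithm for BCE:

$$
\begin{aligned}
\mathcal{L}_{MSE} &= \tfrac{1}{2}\left[(1 - 0.9)^2 + (0 - 0.2)^2\right] = \tfrac{1}{2}(0.01 + 0.04) = 0.025 \\
\mathcal{L}_{BCE} &= -\tfrac{1}{2}\left[\log(0.9) + \log(1 - 0.2)\right] \approx \tfrac{1}{2}(0.1054 + 0.2231) \approx 0.1643
\end{aligned}
$$

Both losses are small because both predictions are close to their targets; pushing $\hat{y}$ further from $y$ increases either value.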
After we successfully define the goal, the next step is to look into the actual task of learning or optimizing the network toward the goal, and this will be the role of an important mathematical method called gradient descent.
Let's start with "The Tale of Gradient Descent".
Imagine you are in an unfamiliar terrain called "Parametric Space." This terrain is vast, stretching out in every direction, but it’s shrouded in a thick fog that obscures your vision. You can't see the landscape around you, where the valleys or peaks are, nor can you see the distant horizon. However, you possess one crucial ability: you can feel the slope of the ground beneath your feet.
Your mission in this mysterious space is to find the lowest point, a place so deep that no other point in the entire terrain is lower. This point is called the "global minimum".
In this terrain, the height of the ground represents something called the loss function. The higher the terrain, the greater the loss, which measures how far off you are from your desired goal. So, to minimize your loss, you must descend to the lowest possible point, where the terrain flattens out, indicating that your loss is at its minimum.
You start your journey at a random location in the Parametric Space. You take a small step forward, feeling the slope beneath you. If the ground tilts downward, you follow it, trusting that it will lead you closer to your goal. This process is called Gradient Descent.
Your journey in the Parametric Space illustrates the power of the Gradient Descent algorithm: a methodical, iterative approach to finding the optimal solution in a complex landscape. The path is not always clear, but the goal remains the same: reaching the deepest point, in other words the global minimum, which gives you the lowest loss.
Now we will formalize this into a more concrete mathematical concept. First, let's extract some key components from this tale.
Each location in the parametric space indicates a distinct combination of values for every weight and bias of a network.
The height of terrain represents the loss value of a chosen loss function at that specific location (combination of parameters).
The slope of the ground beneath your feet corresponds to the mathematical concept of gradient.
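The gradient is a vector of the partial derivatives of the loss function with respect to the weights and biases in the network. This vector points in the direction in which the loss function increases the fastest. Mathematically, for a network with $n$ weights and $m$ biases, we can write it as follows:
$$\nabla\mathcal{L} = \left[\frac{\partial\mathcal{L}}{\partial w_1}, \dots, \frac{\partial\mathcal{L}}{\partial w_n}, \frac{\partial\mathcal{L}}{\partial b_1}, \dots, \frac{\partial\mathcal{L}}{\partial b_m}\right]^T$$
Where:
- $\nabla\mathcal{L}$ indicates the gradient of a loss function $\mathcal{L}$,
- $\mathcal{L}$ represents the loss function,
- $w_i$ is the $i^{th}$ weight,
- $b_j$ is the $j^{th}$ bias.

Note that we normally collect all the parameters (weights and biases) into a single symbol, $\theta$.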
Guided by this gradient, we will try to traverse the parametric space in such a way as to minimize the loss. A simple way to achieve this is to always step in the direction opposite to the gradient, since the gradient points towards the steepest increase in loss. By consistently stepping against the gradient, we avoid moving towards regions where the loss increases the most. While this approach helps us reduce the loss, it doesn't guarantee the largest decrease in a single step. This is what we call "gradient descent".
This process can be summarized in the following update rule:
$$\theta \leftarrow \theta - \eta\,\nabla\mathcal{L}(\theta)$$
Where:
- $\theta$ represents a vector of all parameters,
- $\eta$ is the learning rate (size of step),
- $\nabla\mathcal{L}(\theta)$ is the gradient vector of a loss function $\mathcal{L}$ with respect to parameters $\theta$.

You might also notice the negative sign in front of $\eta\,\nabla\mathcal{L}(\theta)$: it is what makes each update move against the gradient, towards lower loss.
For each weight and bias, the equation would be:
$$w_i \leftarrow w_i - \eta\frac{\partial\mathcal{L}}{\partial w_i}, \qquad b_j \leftarrow b_j - \eta\frac{\partial\mathcal{L}}{\partial b_j}$$
This process of updating the weights and biases continues iteratively until the loss function converges to a minimum value or stops decreasing significantly. This process is typically controlled by setting a maximum number of iterations or monitoring the loss function’s improvement.
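The update rule and stopping criteria above can be sketched in a few lines of C++. This is only an illustration on a one-parameter loss, not the project's training loop. The function name `gradientDescent`, the tolerance `tol`, and the example loss $\mathcal{L}(\theta) = (\theta - 3)^2$ are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Minimizes a one-parameter loss by repeatedly stepping against its gradient.
// grad      : function returning dL/dtheta at a given theta
// theta     : starting point
// eta       : learning rate (step size)
// max_iters : hard cap on the number of updates
// tol       : stop early once the gradient magnitude falls below this value
double gradientDescent(const std::function<double(double)> &grad,
                       double theta, double eta, int max_iters, double tol = 1e-9)
{
    assert(eta > 0.0 && max_iters >= 0);
    for (int it = 0; it < max_iters; ++it)
    {
        double g = grad(theta);
        if (std::fabs(g) < tol)
            break;            // the ground is (almost) flat: treat as converged
        theta -= eta * g;     // theta <- theta - eta * dL/dtheta
    }
    return theta;
}
```

For example, `gradientDescent([](double t) { return 2.0 * (t - 3.0); }, 0.0, 0.1, 1000)` walks from 0 towards the minimum of $(\theta - 3)^2$ at $\theta = 3$. A learning rate that is too large would make the same loop overshoot and diverge, which is why $\eta$ is usually kept small.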
Now that we've covered the essential concepts and foundational components of neural networks, we can delve into how they work. There are 2 key processes in neural networks: forward propagation and backward propagation.
Forward propagation is the process by which input data is passed through the network to generate an output or prediction. The input values will be processed in each layer, and those processed values (output of the previous layer) will be passed as input for the subsequent layer. This process will continue until the data passes through all of the layers.
Let's see an example. Suppose we have the following neural network architecture:
Then, we can write out the forward propagation computation of this network as mathematical equations as follows:
Note that
Backward propagation, or backpropagation, is the process used to train neural networks by adjusting the weights of the connections between neurons. After forward propagation generates an output, backpropagation calculates the error by comparing the predicted output with the actual target. This error is then propagated back through the network, layer by layer, to update the weights and biases using a method called gradient descent. The backpropagation process is essential for optimizing neural networks and ensuring that they can work well with your task.
Here, suppose we have this network.
To update the weight
Then to update the weight, we can apply the formula described in the gradient descent section to calculate the new value of this weight
We have now outlined the concept, components, and functioning of neural networks. That is enough information to write a hard-coded neural network from scratch, but such a network can only cope with one particular structure. You cannot add more neurons or layers without rewriting the whole calculation; in other words, you cannot modify the network structure dynamically. So what is the solution? How can you build neural networks whose structure can be altered dynamically? The answer is to break neural networks down into different types of layers.
Suppose we have the following linear (dense) layer taking in $i$ inputs and producing $j$ outputs.
Then for this particular layer, we can formalize the forward propagation process into a total of $j$ equations:
$$
\begin{aligned}
y_1 &= w_{11}x_1 + w_{12}x_2 + \dots + w_{1i}x_i + b_1 \\
y_2 &= w_{21}x_1 + w_{22}x_2 + \dots + w_{2i}x_i + b_2 \\
&\;\;\vdots \\
y_j &= w_{j1}x_1 + w_{j2}x_2 + \dots + w_{ji}x_i + b_j
\end{aligned}
$$
In matrix form:
$$\mathbf{y}_{j \times 1} = W_{j \times i}\,x_{i \times 1} + b_{j \times 1}$$
Where:
- $W_{j \times i}$ represents the weight matrix,
- $x_{i \times 1}$ is the input vector,
- $b_{j \times 1}$ is the bias vector,
- $\mathbf{y}_{j \times 1}$ is the output vector.
With these equations we can implement the feed-forward process for our linear layer simply and straightforwardly. But what about the backpropagation process?
As you may have already noticed, our linear layer has only two types of learnable parameters: the weights and the bias. So this means we only need to find the partial derivatives of the loss with respect to the weights and the bias, right?
Wrong! Since we are dealing with a single layer of neurons, we also need a value that tells the previous layer about the error. This is easily achieved by calculating the partial derivative of the loss with respect to the input of the current layer, $\frac{\partial\mathcal{L}}{\partial x}$.
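Derive partial derivative of loss with respect to weights
Next, let's derive the partial derivative of the loss with respect to the weights, $\frac{\partial\mathcal{L}}{\partial W}$, which has the same $j \times i$ shape as $W$:
$$\frac{\partial\mathcal{L}}{\partial W} = \begin{bmatrix} \frac{\partial\mathcal{L}}{\partial w_{11}} & \cdots & \frac{\partial\mathcal{L}}{\partial w_{1i}} \\ \vdots & \ddots & \vdots \\ \frac{\partial\mathcal{L}}{\partial w_{j1}} & \cdots & \frac{\partial\mathcal{L}}{\partial w_{ji}} \end{bmatrix}$$
For convenience, let us examine the simple instance of deriving $\frac{\partial\mathcal{L}}{\partial w_{11}}$. Only $y_1$ depends on $w_{11}$, so the chain rule gives:
$$\frac{\partial\mathcal{L}}{\partial w_{11}} = \frac{\partial\mathcal{L}}{\partial y_1}\frac{\partial y_1}{\partial w_{11}} = \frac{\partial\mathcal{L}}{\partial y_1}x_1$$
This can be generalized as follows, for the weight in row $p$ and column $q$:
$$\frac{\partial\mathcal{L}}{\partial w_{pq}} = \frac{\partial\mathcal{L}}{\partial y_p}x_q$$
Therefore, we can represent the same matrix as:
$$\frac{\partial\mathcal{L}}{\partial W} = \begin{bmatrix} \frac{\partial\mathcal{L}}{\partial y_1}x_1 & \cdots & \frac{\partial\mathcal{L}}{\partial y_1}x_i \\ \vdots & \ddots & \vdots \\ \frac{\partial\mathcal{L}}{\partial y_j}x_1 & \cdots & \frac{\partial\mathcal{L}}{\partial y_j}x_i \end{bmatrix}$$
We can further simplify this matrix as:
$$\frac{\partial\mathcal{L}}{\partial W} = \begin{bmatrix} \frac{\partial\mathcal{L}}{\partial y_1} \\ \vdots \\ \frac{\partial\mathcal{L}}{\partial y_j} \end{bmatrix}\begin{bmatrix} x_1 & \cdots & x_i \end{bmatrix}$$
As a result, the partial derivative of the loss function with respect to the weights is the outer product $\frac{\partial\mathcal{L}}{\partial W} = \frac{\partial\mathcal{L}}{\partial y}x^T$.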
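Derive partial derivative of loss with respect to bias
The partial derivative of the loss with respect to the bias can be represented by the following vector:
$$\frac{\partial\mathcal{L}}{\partial b} = \left[\frac{\partial\mathcal{L}}{\partial b_1}, \dots, \frac{\partial\mathcal{L}}{\partial b_j}\right]^T$$
To understand how to derive each component of this vector, let's consider the example of finding $\frac{\partial\mathcal{L}}{\partial b_1}$. Only $y_1$ depends on $b_1$, and $\frac{\partial y_1}{\partial b_1} = 1$, so:
$$\frac{\partial\mathcal{L}}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial y_1}\frac{\partial y_1}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial y_1}$$
This can be generalized as:
$$\frac{\partial\mathcal{L}}{\partial b_p} = \frac{\partial\mathcal{L}}{\partial y_p}$$
Therefore, the partial derivative of the loss with respect to the bias is simply the gradient arriving from the next layer: $\frac{\partial\mathcal{L}}{\partial b} = \frac{\partial\mathcal{L}}{\partial y}$.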
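Derive partial derivative of loss with respect to input
The partial derivative of the loss with respect to the input can be represented by the following vector:
$$\frac{\partial\mathcal{L}}{\partial x} = \left[\frac{\partial\mathcal{L}}{\partial x_1}, \dots, \frac{\partial\mathcal{L}}{\partial x_i}\right]^T$$
Consider deriving $\frac{\partial\mathcal{L}}{\partial x_1}$. Every output $y_1, \dots, y_j$ depends on $x_1$, so the chain rule sums over all of them:
$$\frac{\partial\mathcal{L}}{\partial x_1} = \sum_{p=1}^{j}\frac{\partial\mathcal{L}}{\partial y_p}\frac{\partial y_p}{\partial x_1} = \sum_{p=1}^{j}\frac{\partial\mathcal{L}}{\partial y_p}w_{p1}$$
In generalized form, we can represent it like this:
$$\frac{\partial\mathcal{L}}{\partial x_q} = \sum_{p=1}^{j}\frac{\partial\mathcal{L}}{\partial y_p}w_{pq}$$
Therefore,
$$\frac{\partial\mathcal{L}}{\partial x} = \left[\sum_{p=1}^{j}\frac{\partial\mathcal{L}}{\partial y_p}w_{p1}, \dots, \sum_{p=1}^{j}\frac{\partial\mathcal{L}}{\partial y_p}w_{pi}\right]^T$$
In simple form, the partial derivative of the loss with respect to the input is $\frac{\partial\mathcal{L}}{\partial x} = W^T\frac{\partial\mathcal{L}}{\partial y}$.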
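The three gradients of a linear layer can be computed together in one pass over the weight matrix. The sketch below is a standalone illustration, not the project's `Linear::backward`. The names `Vec`, `Mat`, `LinearGrads`, and `linearBackward` are hypothetical, and `dy` stands for the incoming gradient $\frac{\partial\mathcal{L}}{\partial y}$.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // row-major: Mat[p][q], p = output index, q = input index

// Holds the three gradients of a linear layer y = W x + b.
struct LinearGrads
{
    Mat dW; // dL/dW = dy * x^T (outer product), same shape as W
    Vec db; // dL/db = dy
    Vec dx; // dL/dx = W^T * dy
};

LinearGrads linearBackward(const Mat &W, const Vec &x, const Vec &dy)
{
    assert(W.size() == dy.size());
    LinearGrads g;
    g.dW.assign(W.size(), Vec(x.size(), 0.0));
    g.db = dy;
    g.dx.assign(x.size(), 0.0);
    for (std::size_t p = 0; p < W.size(); ++p)
    {
        assert(W[p].size() == x.size());
        for (std::size_t q = 0; q < x.size(); ++q)
        {
            g.dW[p][q] = dy[p] * x[q];  // dL/dw_pq = dL/dy_p * x_q
            g.dx[q] += W[p][q] * dy[p]; // dL/dx_q = sum over p of w_pq * dL/dy_p
        }
    }
    return g;
}
```

Note that `dx` is computed from the weights as they were during the forward pass. If a layer updates its weights inside the same backward call, it should read them before applying the update.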
Then what about the activation layer? Luckily, since the activation layer has no learnable parameters, the only thing we need to work out is how it backpropagates the error through the layer. This is easily done with the derivative of each activation function.
Derivative of the Sigmoid function
$$\sigma'(x) = \frac{e^{x}}{(e^{x} + 1)^2}$$
In fact, this derivative of the Sigmoid function can be further simplified to:
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
But in our case, we decided to use the first form to implement our neural network.
Derivative of the ReLU function
$$\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
Derivative of the Leaky ReLU function
$$\text{LeakyReLU}'(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ \alpha & \text{otherwise} \end{cases}$$
Derivative of the Tanh function
$$\tanh'(x) = 1 - \tanh^2(x)$$
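For the loss function, we also need the derivative of each loss function to pass the error back to the previous layer during the backpropagation process. So, let's explore how to derive the derivative of each loss function.
Derivative of the MSE loss function
$$\frac{\partial\mathcal{L}_{MSE}}{\partial\hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$$
For each data sample, it is:
$$\frac{\partial\mathcal{L}_{MSE}}{\partial\hat{y}} = 2(\hat{y} - y)$$
Derivative of the BCE loss function
$$\frac{\partial\mathcal{L}_{BCE}}{\partial\hat{y}_i} = \frac{1}{N}\cdot\frac{\hat{y}_i - y_i}{\hat{y}_i(1 - \hat{y}_i)}$$
For each data sample, it is:
$$\frac{\partial\mathcal{L}_{BCE}}{\partial\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$$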
Now let's dive into the actual implementation of each module and function necessary for creating your own neural network.
We will start with the essential math operations and how to implement them in C++. For these functions, you will need to include the following dependencies.
#include <vector>
#include <random>
#include <functional>
#include <algorithm>
#include <iterator>
#include <chrono>
Compute dot product
This function allows you to compute the dot product between two input vectors, namely v1 and v2.
double dotProduct(const std::vector<double> &v1, const std::vector<double> &v2)
{
    /**
     * @brief Computes the dot product of two vectors
     * @param[in] v1 The first vector
     * @param[in] v2 The second vector (must be at least as long as v1)
     * @return The dot product of the two vectors
     */
    double result = 0;
    // std::size_t matches the type returned by size() and avoids a signed/unsigned comparison
    for (std::size_t i = 0; i < v1.size(); i++)
    {
        result += v1[i] * v2[i];
    }
    return result;
}
Element-wise multiplication between a vector and a scalar
This function allows you to perform an element-wise multiplication between a vector v and a scalar.
std::vector<double> scalarVectorMultiplication(const std::vector<double> &v, double scalar)
{
    /**
     * @brief Computes the element-wise multiplication of a vector and a scalar
     * @param[in] v The vector to multiply (left unchanged)
     * @param[in] scalar The scalar to multiply the vector with
     * @return A new vector with the element-wise multiplication of v and scalar
     */
    std::vector<double> out(v.size());
    // Write into a fresh vector so the caller's data is not modified in place
    std::transform(v.begin(), v.end(), out.begin(), [scalar](double value) { return value * scalar; });
    return out;
}
Vector subtraction
This function allows you to easily compute the subtraction between two vectors, v1 and v2.
std::vector<double> subtract(const std::vector<double> &v1, const std::vector<double> &v2)
{
    /**
     * @brief Computes the element-wise subtraction of two vectors
     * @param[in] v1 The first vector
     * @param[in] v2 The second vector (must be at least as long as v1)
     * @return A new vector with the elementwise subtraction of v1 and v2
     */
    std::vector<double> out(v1.size());
    std::transform(v1.begin(), v1.end(), v2.begin(), out.begin(), std::minus<double>());
    return out;
}
Matrix transpose
This function receives a matrix m and returns its transpose.
std::vector<std::vector<double>> transpose(const std::vector<std::vector<double>> &m)
{
    /**
     * @brief Computes the transpose of a matrix
     * @param[in] m The matrix to transpose (all rows must have the same length)
     * @return The transpose of the matrix
     */
    if (m.empty())
        return {}; // nothing to transpose, and m[0] would be out of range
    // Allocate the full (cols x rows) result up front
    std::vector<std::vector<double>> trans_vec(m[0].size(), std::vector<double>(m.size()));
    for (std::size_t i = 0; i < m.size(); i++)
    {
        for (std::size_t j = 0; j < m[i].size(); j++)
        {
            trans_vec[j][i] = m[i][j];
        }
    }
    return trans_vec;
}
Weights initialization
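This function allows you to generate a 2D vector of size rows × cols filled with random values ranging from -1.0 to 1.0. It is used later to generate the weights of the fully connected layer (Linear layer).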
std::vector<std::vector<double>> uniformWeightInitializer(int rows, int cols)
{
/**
* @brief Initializes a matrix with uniform random weights between -1.0 and 1.0
* @param[in] rows The number of rows in the matrix
* @param[in] cols The number of columns in the matrix
* @return A matrix with uniform random weights between -1.0 and 1.0
*/
std::random_device rd;
std::mt19937 gen(rd() ^ std::chrono::system_clock::now().time_since_epoch().count());
std::uniform_real_distribution<> dis(-1.0, 1.0);
std::vector<std::vector<double>> weights(rows, std::vector<double>(cols));
for (int i = 0; i < rows; ++i)
{
for (int j = 0; j < cols; ++j)
{
weights[i][j] = dis(gen);
}
}
return weights;
}
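Because the initializer above seeds its generator from `std::random_device` and the clock, every run starts from different weights. When you want repeatable experiments, one option is a variant that takes the seed as a parameter. The function below is a hypothetical sketch for that purpose (the name `seededUniformWeights` and the `seed` parameter are not part of the project).

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical variant of the weight initializer with an explicit seed,
// so that two calls with the same seed produce the same matrix.
std::vector<std::vector<double>> seededUniformWeights(int rows, int cols, unsigned int seed)
{
    assert(rows >= 0 && cols >= 0);
    std::mt19937 gen(seed);                                // deterministic for a given seed
    std::uniform_real_distribution<double> dis(-1.0, 1.0); // same range as the original
    std::vector<std::vector<double>> weights(rows, std::vector<double>(cols));
    for (auto &row : weights)
    {
        for (double &w : row)
        {
            w = dis(gen);
        }
    }
    return weights;
}
```

Calling `seededUniformWeights(3, 2, 42)` twice returns identical matrices on the same toolchain, which makes it easier to compare training runs while debugging.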
Bias initialization
This function generates a vector of random values ranging from -1.0 to 1.0. It is used later to generate the bias of the fully connected layer (Linear layer).
std::vector<double> biasInitailizer(int size)
{
/**
* @brief Initializes a vector of biases with uniform random weights between -1.0 and 1.0
* @param[in] size The size of the vector
* @return A vector of biases with uniform random weights between -1.0 and 1.0
*/
std::random_device rd;
std::mt19937 gen(rd() ^ std::chrono::system_clock::now().time_since_epoch().count());
std::uniform_real_distribution<> dis(-1.0, 1.0);
std::vector<double> bias(size);
for (int i = 0; i < size; ++i)
{
bias[i] = dis(gen);
}
return bias;
}
Next, we will look at how to implement each activation function, which allows our neural network to perform a non-linear transformation on the input data. For more information and the vectorized version of each activation function, you can consult the activation.cpp file. For these activation functions, you will need to include the following dependencies.
#include <cmath>
#include <vector>
Sigmoid and its derivative
double sigmoid(double x)
{
/**
* The sigmoid function maps any real-valued number to a value between 0 and 1.
* It is often used in the output layer of a neural network when the task is a
* binary classification problem.
* @param x the input value
* @return the output value of the sigmoid function
*/
return 1 / (1 + exp(-x));
}
double sigmoidDerivative(double x)
{ /**
* The derivative of the sigmoid function.
* @param x the input value
* @return the output value of the derivative of the sigmoid function
*/
return exp(x) / pow((exp(x) + 1), 2);
}
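One caveat about the form used above: for very large positive inputs, `exp(x)` overflows to infinity and `exp(x) / pow(exp(x) + 1, 2)` evaluates to infinity divided by infinity, which is NaN. If that ever matters for your inputs, a possible alternative is the simplified form $\sigma(x)(1 - \sigma(x))$ built on an overflow-safe sigmoid. The names `stableSigmoid` and `stableSigmoidDerivative` below are hypothetical and are not part of the project.

```cpp
#include <cmath>

// Sigmoid that never calls exp() on a large positive argument.
double stableSigmoid(double x)
{
    if (x >= 0)
    {
        return 1.0 / (1.0 + std::exp(-x)); // exp(-x) <= 1 here
    }
    double e = std::exp(x);                // exp(x) < 1 here
    return e / (1.0 + e);
}

// Derivative written as sigma(x) * (1 - sigma(x)); stays finite for any finite x.
double stableSigmoidDerivative(double x)
{
    double s = stableSigmoid(x);
    return s * (1.0 - s);
}
```

For moderate inputs both forms agree mathematically; the difference only shows up at the extremes, where this version returns 0 instead of NaN.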
ReLU and its derivative
double relu(double x)
{ /**
* The Rectified Linear Unit (ReLU) activation function.
* @param x the input value
* @return the output value of the ReLU function
*/
if (x > 0)
return x;
else
return 0;
}
double reluDerivative(double x)
{ /**
* The derivative of the Rectified Linear Unit (ReLU) activation function.
* @param x the input value
* @return the output value of the derivative of the ReLU function
*/
if (x >= 0)
return 1;
else
return 0;
}
Leaky ReLU and its derivative
double leakyRelu(double x, double alpha = 0.01)
{
/**
* The Leaky Rectified Linear Unit (Leaky ReLU) activation function.
* @param x the input value
* @param alpha the leak rate, defaults to 0.01
* @return the output value of the Leaky ReLU function
*/
if (x > 0)
return x;
else
return alpha * x;
}
double leakyReluDerivative(double x, double alpha = 0.01)
{ /**
* The derivative of the Leaky Rectified Linear Unit (Leaky ReLU) activation function.
* @param x the input value
* @param alpha the leak rate, defaults to 0.01
* @return the output value of the derivative of the Leaky ReLU function
*/
if (x >= 0)
return 1;
else
return alpha;
}
Tanh and its derivative
double tanhActivation(double x)
{ /**
   * The Hyperbolic Tangent (tanh) activation function.
   * Named tanhActivation here because <cmath> already declares tanh(double),
   * and redefining that name can clash with the library declaration on common toolchains.
   * @param x the input value
   * @return the output value of the tanh function
   */
    return std::tanh(x);
}
double tanhDerivative(double x)
{ /**
   * The derivative of the Hyperbolic Tangent (tanh) activation function.
   * @param x the input value
   * @return the output value of the derivative of the tanh function
   */
    double t = std::tanh(x);
    return 1 - t * t;
}
Next we move to the very core of constructing a neural network: the layers. To implement these layers, we first need to include the following dependencies.
#include <vector>
#include "activation.cpp" // Previously created activation functions
#include "utils.cpp" // Previously created utility functions
Base class 'Layer'
First, we define the base class for all layers. This class consists of two public variables, input and output, and two public pure virtual methods, forward and backward.
class Layer
{
public:
    std::vector<double> input;
    std::vector<double> output;
    // Virtual destructor so that derived layers are destroyed correctly through a Layer pointer
    virtual ~Layer() = default;
    virtual std::vector<double> forward(const std::vector<double> input_data) = 0;
    virtual std::vector<double> backward(std::vector<double> error, double learning_rate) = 0;
};
Sigmoid layer
The Sigmoid class inherits from the Layer class and overrides the forward and backward methods, which let information propagate through the feed-forward and backpropagation processes.
class Sigmoid : public Layer
{
public:
std::vector<double> forward(const std::vector<double> input_data) override
{
input = input_data;
output = vectSigmoid(input);
return output;
}
std::vector<double> backward(std::vector<double> error, double learning_rate) override
{
std::vector<double> derivative = vectSigmoidDerivative(input);
std::vector<double> grad_input;
for (int i = 0; i < derivative.size(); ++i)
{
grad_input.push_back(derivative[i] * error[i]);
}
return grad_input;
}
};
ReLU layer
The Relu class inherits from the Layer class and overrides the forward and backward methods, which let information propagate through the feed-forward and backpropagation processes.
class Relu : public Layer
{
public:
std::vector<double> forward(const std::vector<double> input_data) override
{
input = input_data;
output = vectRelu(input);
return output;
}
std::vector<double> backward(std::vector<double> error, double learning_rate) override
{
std::vector<double> derivative = vectReluDerivative(input);
std::vector<double> grad_input;
for (int i = 0; i < derivative.size(); ++i)
{
grad_input.push_back(derivative[i] * error[i]);
}
return grad_input;
}
};
Leaky ReLU layer
The LeakyRelu class inherits from the Layer class and overrides the forward and backward methods, which let information propagate through the feed-forward and backpropagation processes.
class LeakyRelu : public Layer
{
public:
double alpha = 0.01;
std::vector<double> forward(const std::vector<double> input_data) override
{
input = input_data;
output = vectLeakyRelu(input, alpha);
return output;
}
std::vector<double> backward(std::vector<double> error, double learning_rate) override
{
std::vector<double> derivative = vectLeakyReluDerivative(input, alpha);
std::vector<double> grad_input;
for (int i = 0; i < derivative.size(); ++i)
{
grad_input.push_back(derivative[i] * error[i]);
}
return grad_input;
}
};
Tanh layer
The Tanh class inherits from the Layer class and overrides the forward and backward methods, which let information propagate through the feed-forward and backpropagation processes.
class Tanh : public Layer
{
public:
std::vector<double> forward(const std::vector<double> input_data) override
{
input = input_data;
output = vectTanh(input);
return output;
}
std::vector<double> backward(std::vector<double> error, double learning_rate) override
{
std::vector<double> derivative = vectTanhDerivative(input);
std::vector<double> grad_input;
for (int i = 0; i < derivative.size(); ++i)
{
grad_input.push_back(derivative[i] * error[i]);
}
return grad_input;
}
};
Linear layer
The Linear layer or fully connected layer is also inherited from the Layer class. The Linear class constructor requires the number of input and output neurons to create its instance. Then according to these numbers, its weights and bias will be created.
class Linear : public Layer
{
public:
int input_neuron;
int output_neuron;
std::vector<std::vector<double>> weights;
std::vector<double> bias;
Linear(int num_in, int num_out)
{
input_neuron = num_in;
output_neuron = num_out;
weights = uniformWeightInitializer(num_out, num_in);
bias = biasInitailizer(num_out);
}
std::vector<double> forward(const std::vector<double> input_data) override
{
input = input_data;
output.clear();
for (int i = 0; i < output_neuron; i++)
{
output.push_back(dotProduct(weights[i], input) + bias[i]);
}
return output;
}
std::vector<double> backward(std::vector<double> error, double learning_rate) override
{
std::vector<double> input_error; // dE/dX
std::vector<std::vector<double>> weight_error; // dE/dW
std::vector<double> bias_error; // dE/dB
std::vector<std::vector<double>> weight_transpose;
weight_error.clear();
bias_error.clear();
input_error.clear();
weight_transpose.clear();
weight_transpose = transpose(weights);
bias_error = error;
for (int i = 0; i < weight_transpose.size(); i++)
{
input_error.push_back(dotProduct(weight_transpose[i], error));
}
for (int j = 0; j < error.size(); j++)
{
std::vector<double> row;
for (int i = 0; i < input.size(); i++)
{
row.push_back(error[j] * input[i]);
}
weight_error.push_back(row);
}
std::vector<double> delta_bias = scalarVectorMultiplication(bias_error, learning_rate);
bias = subtract(bias, delta_bias);
for (int i = 0; i < weight_error.size(); i++)
{
std::vector<double> delta_weight = scalarVectorMultiplication(weight_error[i], learning_rate);
weights[i] = subtract(weights[i], delta_weight);
}
return input_error;
}
};
In this project, we implement two frequently used loss functions, namely binary cross-entropy loss and mean-squared error loss. To implement these loss functions, you will need to include the following dependencies.
#include <vector>
#include <cmath>
Binary cross-entropy loss (BCE)
This loss function, commonly known as binary cross-entropy (BCE) or log loss, is specifically designed for binary classification tasks. It quantifies the difference between the predicted probabilities and the actual binary labels (0 or 1). Here we demonstrate the implementation of the BCE loss function as well as its derivative. The small clampProbability helper is added here so that log() and the division in the derivative stay finite when a prediction is exactly 0 or 1.
static double clampProbability(double p)
{
    // Keep p strictly inside (0, 1); eps is an arbitrary small constant
    const double eps = 1e-12;
    return p < eps ? eps : (p > 1.0 - eps ? 1.0 - eps : p);
}
double BCELoss(const std::vector<double> &true_label, const std::vector<double> &pred_prob)
{ /**
   * Binary Cross Entropy Loss
   * @param true_label true labels of the data
   * @param pred_prob predicted probabilities
   * @return binary cross entropy loss
   */
    double sum = 0;
    for (std::size_t i = 0; i < pred_prob.size(); i++)
    {
        double p = clampProbability(pred_prob[i]);
        sum += true_label[i] * log(p) + (1 - true_label[i]) * log(1 - p);
    }
    return -sum / true_label.size();
}
std::vector<double> BCELossDerivative(const std::vector<double> &true_label, const std::vector<double> &pred_prob)
{ /**
   * Compute derivative of binary cross entropy loss for every output
   * (per-sample form, without the 1/N factor, like MSELossDerivative)
   * @param true_label true labels of the data
   * @param pred_prob predicted probabilities
   * @return derivative of binary cross entropy loss
   */
    std::vector<double> dev;
    for (std::size_t i = 0; i < pred_prob.size(); i++)
    {
        double p = clampProbability(pred_prob[i]);
        dev.push_back((p - true_label[i]) / (p * (1 - p)));
    }
    return dev;
}
Mean-squared error loss (MSE)
The mean-squared error (MSE) loss function is a versatile metric commonly used in regression tasks to measure the average squared difference between the predicted values and the actual target values. The following code is the implementation of MSE loss function and its derivative.
double MSELoss(std::vector<double> true_label, std::vector<double> pred)
{ /**
* Mean Squared Error Loss
* @param true_label true labels of the data
* @param pred predicted values
* @return mean squared error loss
*/
double sum = 0;
for (int i = 0; i < true_label.size(); i++)
{
sum += pow(true_label[i] - pred[i], 2.0);
}
int size = true_label.size();
double loss = (1.0 / size) * sum;
return loss;
}
std::vector<double> MSELossDerivative(std::vector<double> true_label, std::vector<double> pred)
{ /**
* Compute derivative of mean squared error loss
* @param true_label true labels of the data
* @param pred predicted values
* @return derivative of mean squared error loss
*/
std::vector<double> sub = subtract(pred, true_label);
std::vector<double> dev = scalarVectorMultiplication(sub, 2);
return dev;
}
Now, let's implement the most important part of this project, the neural network itself. There are five main methods that we need to implement, as follows:
- add() - Adds a layer to our neural network.
- predict() - Performs the feed-forward operation and returns the predicted output.
- forward_propagation() - Also performs the feed-forward operation, but is used solely in the training process.
- backward_propagation() - Performs backpropagation to update the learnable parameters of our neural network, namely the weights and bias.
- fit() - Trains our neural network.
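The abstract structure of our neural network (NN class)
The class stores its layers as std::unique_ptr and prints the loss with std::cout, so it also needs the <memory> and <iostream> headers.
#include <memory>
#include <iostream>
class NN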
{
public:
std::vector<std::unique_ptr<Layer>> layers;
void add(Layer *layer)
{
// Implement add method here
}
std::vector<double> predict(std::vector<double> input)
{
// Implement predict method here
}
std::vector<double> forward_propagation(const std::vector<double> input)
{
// Implement forward propagation method here
}
void backward_propagation(const std::vector<double> &error, double learning_rate)
{
// Implement backward propagation method here
}
void fit(const std::vector<std::vector<double>> &X, const std::vector<std::vector<double>> &y, int epochs, double learning_rate)
{
// Implement fit method here
}
};
add() method implementation
void add(Layer *layer)
{
layers.emplace_back(layer);
}
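Because `layers` holds `std::unique_ptr<Layer>`, `emplace_back(layer)` wraps the raw pointer in a `unique_ptr`, so the network takes ownership and deletes each layer when it is destroyed. The sketch below illustrates this with a stand-in `DemoLayer` type that counts live objects, and shows a hypothetical overload that makes the ownership transfer explicit in the signature. `DemoLayer`, `DemoNetwork`, and `LayersLeakedAfterScope` are illustrative names, and the sketch assumes the base class has a virtual destructor, which deleting through a base pointer requires.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Minimal stand-in for a layer base class (assumed interface, not the project's Layer)
struct DemoLayer
{
    static int alive; // number of DemoLayer objects currently alive
    DemoLayer() { ++alive; }
    virtual ~DemoLayer() { --alive; }
};
int DemoLayer::alive = 0;

struct DemoNetwork
{
    std::vector<std::unique_ptr<DemoLayer>> layers;

    // Same idea as add(Layer *): the vector wraps the raw pointer in a unique_ptr
    void add(DemoLayer *layer) { layers.emplace_back(layer); }

    // Hypothetical alternative: the caller hands over a unique_ptr explicitly
    void add(std::unique_ptr<DemoLayer> layer) { layers.push_back(std::move(layer)); }
};

// Builds a network, lets it go out of scope, and reports how many layers survive
int LayersLeakedAfterScope()
{
    {
        DemoNetwork net;
        net.add(new DemoLayer());
        net.add(std::unique_ptr<DemoLayer>(new DemoLayer()));
        assert(DemoLayer::alive == 2);
    } // net is destroyed here, and each unique_ptr deletes its layer
    return DemoLayer::alive;
}
```

With either overload, the caller must not `delete` the layer afterwards, since the network already owns it.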
predict() method implementation
std::vector<double> predict(std::vector<double> input)
{
return forward_propagation(input);
}
forward_propagation() method implementation
std::vector<double> forward_propagation(const std::vector<double> input)
{
std::vector<double> output = input;
for (const auto &layer : layers)
{
output = layer->forward(output);
}
return output;
}
backward_propagation() method implementation
void backward_propagation(const std::vector<double> &error, double learning_rate)
{
std::vector<double> grad = error;
for (auto it = layers.rbegin(); it != layers.rend(); ++it)
{
grad = (*it)->backward(grad, learning_rate);
}
}
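The forward pass visits the layers from first to last, while the backward pass walks the same container in reverse with `rbegin()`/`rend()`, so each layer receives the gradient produced by the layer after it. The sketch below traces that order with a toy `ScaleLayer` that multiplies by a constant. `ScaleLayer`, `PassResult`, and `RunOnePass` are illustrative names, and the `learning_rate` argument is omitted because these toy layers have no parameters to update.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Toy layer (assumption, not the project's Layer): multiplies by a constant
// and records its name each time it is visited.
struct ScaleLayer
{
    std::string name;
    double factor;
    std::vector<std::string> *log;

    std::vector<double> forward(std::vector<double> x) const
    {
        log->push_back("forward:" + name);
        for (double &v : x) v *= factor;
        return x;
    }

    // d(factor * x)/dx = factor, so the incoming gradient is scaled the same way
    std::vector<double> backward(std::vector<double> grad) const
    {
        log->push_back("backward:" + name);
        for (double &g : grad) g *= factor;
        return grad;
    }
};

struct PassResult
{
    double output;
    double input_grad;
    std::vector<std::string> visits;
};

PassResult RunOnePass(double input)
{
    std::vector<std::string> visits;
    std::vector<std::unique_ptr<ScaleLayer>> layers;
    layers.emplace_back(new ScaleLayer{"A", 2.0, &visits});
    layers.emplace_back(new ScaleLayer{"B", 3.0, &visits});
    layers.emplace_back(new ScaleLayer{"C", 5.0, &visits});

    // Forward: A, then B, then C
    std::vector<double> out = {input};
    for (const auto &layer : layers) out = layer->forward(out);

    // Backward: C, then B, then A, starting from d(output)/d(output) = 1
    std::vector<double> grad = {1.0};
    for (auto it = layers.rbegin(); it != layers.rend(); ++it) grad = (*it)->backward(grad);

    return PassResult{out[0], grad[0], visits};
}
```

With factors 2, 3, and 5, an input of 1 yields an output of 30, and the gradient with respect to the input is also 30, the product of the three local derivatives.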
fit() method implementation
void fit(const std::vector<std::vector<double>> &X, const std::vector<std::vector<double>> &y, int epochs, double learning_rate)
{
for (int epoch = 0; epoch < epochs; ++epoch)
{
double total_loss = 0.0;
for (size_t i = 0; i < X.size(); ++i)
{
// Forward pass
std::vector<double> output = forward_propagation(X[i]);
// Compute loss
double loss = BCELoss(y[i], output);
total_loss += loss;
std::vector<double> loss_derivative = BCELossDerivative(y[i], output);
// Backward pass
backward_propagation(loss_derivative, learning_rate);
}
// Print loss for monitoring
std::cout << "Epoch " << epoch + 1 << "/" << epochs << " - Loss: " << total_loss / X.size() << std::endl;
}
}
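Note that `fit()` calls `BCELoss` and `BCELossDerivative` directly, so training with a different loss means editing the method. One possible variation is to pass the loss and its derivative in as callables. The sketch below shows that loop shape on a toy one-weight model so that it stays self-contained; `OneWeightModel`, `FitWithLoss`, `SquaredError`, and `SquaredErrorGrad` are hypothetical names, not part of this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using LossFn = std::function<double(const Vec &, const Vec &)>;
using LossGradFn = std::function<Vec(const Vec &, const Vec &)>;

// Toy stand-in for the network (assumption): a single weight, output = w * x
struct OneWeightModel
{
    double w = 0.0;
    double last_input = 0.0;

    Vec forward(const Vec &x)
    {
        last_input = x[0]; // remembered for the backward pass
        return {w * x[0]};
    }

    void backward(const Vec &grad, double learning_rate)
    {
        // dL/dw = dL/d(output) * d(output)/dw = grad * x
        w -= learning_rate * grad[0] * last_input;
    }
};

// Same loop shape as fit(), with the loss and its derivative passed in
double FitWithLoss(OneWeightModel &model, const std::vector<Vec> &X, const std::vector<Vec> &y,
                   int epochs, double learning_rate, const LossFn &loss, const LossGradFn &loss_grad)
{
    assert(X.size() == y.size() && !X.empty());
    double mean_loss = 0.0;
    for (int epoch = 0; epoch < epochs; ++epoch)
    {
        double total_loss = 0.0;
        for (std::size_t i = 0; i < X.size(); ++i)
        {
            const Vec output = model.forward(X[i]);                  // forward pass
            total_loss += loss(y[i], output);                        // accumulate loss
            model.backward(loss_grad(y[i], output), learning_rate);  // backward pass
        }
        mean_loss = total_loss / X.size();
    }
    return mean_loss; // mean loss of the final epoch
}

// Single-output squared error and its derivative, used here as the plugged-in loss
double SquaredError(const Vec &y, const Vec &p) { return (y[0] - p[0]) * (y[0] - p[0]); }
Vec SquaredErrorGrad(const Vec &y, const Vec &p) { return {2.0 * (p[0] - y[0])}; }
```

Trained on samples drawn from y = 2x, the single weight should converge towards 2. The same pattern would let the NN class switch between the BCE and MSE functions without changing the loop.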
Combine everything together
class NN
{
public:
std::vector<std::unique_ptr<Layer>> layers;
// Add layers dynamically
void add(Layer *layer)
{
layers.emplace_back(layer);
}
// Make prediction using feed forward process
std::vector<double> predict(std::vector<double> input)
{
return forward_propagation(input);
}
// Forward propagation
std::vector<double> forward_propagation(const std::vector<double> input)
{
std::vector<double> output = input;
for (const auto &layer : layers)
{
output = layer->forward(output);
}
return output;
}
// Backward propagation
void backward_propagation(const std::vector<double> &error, double learning_rate)
{
std::vector<double> grad = error;
for (auto it = layers.rbegin(); it != layers.rend(); ++it)
{
grad = (*it)->backward(grad, learning_rate);
}
}
// Training function
void fit(const std::vector<std::vector<double>> &X, const std::vector<std::vector<double>> &y, int epochs, double learning_rate)
{
for (int epoch = 0; epoch < epochs; ++epoch)
{
double total_loss = 0.0;
for (size_t i = 0; i < X.size(); ++i)
{
// Forward pass
std::vector<double> output = forward_propagation(X[i]);
// Compute loss
double loss = BCELoss(y[i], output);
total_loss += loss;
std::vector<double> loss_derivative = BCELossDerivative(y[i], output);
// Backward pass
backward_propagation(loss_derivative, learning_rate);
}
// Print loss for monitoring
std::cout << "Epoch " << epoch + 1 << "/" << epochs << " - Loss: " << total_loss / X.size() << std::endl;
}
}
};
The XOR operation prediction problem is a classic example that highlights the power of machine learning algorithms, particularly neural networks. The key characteristic of the XOR problem is that it is not linearly separable, meaning that a single straight line cannot separate the data points into the correct classes. This challenge showcases the limitations of simple models like linear classifiers and the necessity of more complex models, such as neural networks, which can learn non-linear decision boundaries to accurately solve the problem.
The following are all possible cases of the XOR problem, where $x_1$ and $x_2$ are the two binary inputs and $x_1 \oplus x_2$ is the expected output:
- Case 1
  - Input: $x_1 = 0$, $x_2 = 0$
  - Output: $x_1 \oplus x_2 = 0$
- Case 2
  - Input: $x_1 = 0$, $x_2 = 1$
  - Output: $x_1 \oplus x_2 = 1$
- Case 3
  - Input: $x_1 = 1$, $x_2 = 0$
  - Output: $x_1 \oplus x_2 = 1$
- Case 4
  - Input: $x_1 = 1$, $x_2 = 1$
  - Output: $x_1 \oplus x_2 = 0$
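To see why a hidden layer is needed, consider a single linear unit $w_1 x_1 + w_2 x_2 + b$ thresholded at 0. Case 1 requires $b \le 0$, cases 2 and 3 require $w_2 + b > 0$ and $w_1 + b > 0$, and adding those two gives $w_1 + w_2 + b > -b \ge 0$, which contradicts case 4. Two ReLU hidden units are already enough, though. The sketch below uses hand-picked weights, chosen for illustration rather than learned by this project's code, and the function names are assumptions.

```cpp
#include <algorithm>
#include <cassert>

// ReLU activation: max(0, z)
double ReluValue(double z) { return std::max(0.0, z); }

// Hand-picked weights (an illustration, not the result of training):
//   h1  = relu(x1 + x2)       counts how many inputs are on
//   h2  = relu(x1 + x2 - 1)   is non-zero only when both inputs are on
//   out = h1 - 2 * h2         cancels the count when both inputs are on
double HandWiredXor(double x1, double x2)
{
    const double h1 = ReluValue(x1 + x2);
    const double h2 = ReluValue(x1 + x2 - 1.0);
    return h1 - 2.0 * h2;
}
```

Evaluating `HandWiredXor` on the four cases gives 0, 1, 1, and 0, matching the XOR outputs listed above. A trained network is free to find a different set of weights that realizes the same function.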
#include <iostream>
#include <vector>
#include "NN.cpp"
int main()
{
// Initialize the neural network
NN neural_network;
// Add layers dynamically
neural_network.add(new Linear(2, 3));
neural_network.add(new Relu());
neural_network.add(new Linear(3, 3));
neural_network.add(new Relu());
neural_network.add(new Linear(3, 1));
neural_network.add(new Sigmoid());
// Example input data
std::vector<std::vector<double>> X = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
std::vector<std::vector<double>> y = {{0}, {1}, {1}, {0}};
// Train the network
neural_network.fit(X, y, 10000, 0.01);
// Test the network on every XOR case
for (size_t i = 0; i < X.size(); ++i)
{
std::vector<double> output_prob = neural_network.predict(X[i]);
// Threshold the probability at 0.5 to get the predicted class
int output = (output_prob[0] > 0.5) ? 1 : 0;
std::cout << "Input: " << X[i][0] << ", " << X[i][1] << std::endl;
std::cout << "Output Probability: " << output_prob[0] << std::endl;
std::cout << "Output: " << output << std::endl;
std::cout << "Expected Output: " << y[i][0] << std::endl;
std::cout << "----------------------" << std::endl;
}
return 0;
}
// Output example (100% accuracy)
/* Input: 0, 0
Output Probability: 4.68587e-5
Output: 0
Expected Output: 0
----------------------
Input: 0, 1
Output Probability: 0.998833
Output: 1
Expected Output: 1
----------------------
Input: 1, 0
Output Probability: 0.999439
Output: 1
Expected Output: 1
----------------------
Input: 1, 1
Output Probability: 0.00265182
Output: 0
Expected Output: 0
----------------------
*/
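The accuracy figure quoted in the example output can also be computed programmatically rather than read off the printout. The following is a minimal sketch of such a helper. `Accuracy`, `Sample`, and `Predictor` are hypothetical names, and the helper assumes a single-output binary classifier thresholded at 0.5, as in the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Sample = std::vector<double>;
// Any callable that maps an input vector to output probabilities
using Predictor = std::function<Sample(const Sample &)>;

// Fraction of samples whose thresholded first output matches the label
double Accuracy(const Predictor &predict, const std::vector<Sample> &X,
                const std::vector<Sample> &y, double threshold = 0.5)
{
    assert(X.size() == y.size() && !X.empty());
    std::size_t correct = 0;
    for (std::size_t i = 0; i < X.size(); ++i)
    {
        const double predicted = predict(X[i])[0] > threshold ? 1.0 : 0.0;
        if (predicted == y[i][0]) ++correct;
    }
    return static_cast<double>(correct) / X.size();
}
```

With the network from the example, a call could look like `Accuracy([&](const Sample &x) { return neural_network.predict(x); }, X, y)`, which returns 1.0 when all four XOR cases are classified correctly.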
This code is licensed under the MIT License. See the LICENSE file for more details.
If you find our work helpful or use any part of this repository in your research, please consider citing this repository:
@software{sorawit_chokphantavee_2024_13584269,
author = {Sorawit Chokphantavee and
Sirawit Chokphantavee},
title = {{SorawitChok/Neural-Network-from-scratch-in-Cpp:
Neural Network from scratch in C++}},
month = aug,
year = 2024,
publisher = {Zenodo},
version = {v1.0.0},
doi = {10.5281/zenodo.13584269},
url = {https://doi.org/10.5281/zenodo.13584269}
}
Feel free to fork this repository and submit pull requests. Any contributions are welcome!
This repository was created by Sorawit Chokphantavee and Sirawit Chokphantavee.