# Diving Deeper into Machine Learning

We've focused on neural networks, using labeled data that we
can use to learn the trends in our data. This is an example
of _supervised learning_.

Broadly speaking, there are
three main [approaches to machine learning](https://en.wikipedia.org/wiki/Machine_learning#Approaches):

* [Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning)

  This uses labeled pairs (input and output) to train the model
  to learn how to predict the outputs from the inputs.

* [Unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning)

  No labeled data is provided. Instead, the machine learning
  algorithm seeks to find the structure on its own. The goal
  is to learn patterns and features that can be used to produce
  new data.

* [Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)

  As with unsupervised learning, no labeled data is used,
  but the model is "rewarded" when it does something right,
  and the model tries to maximize rewards (think: self-driving
  cars).
## Libraries

There are a number of popular libraries that implement machine learning algorithms.
Their features and performance vary quite a bit. A comparison of their
features is provided by Wikipedia: [Comparison of deep learning software](https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software).

Some additional comparisons are provided here: https://ritza.co/articles/scikit-learn-vs-tensorflow-vs-pytorch-vs-keras/

* [TensorFlow](https://www.tensorflow.org/)

  This is an open source machine learning library released by Google. It has support
  for CPUs, GPUs, and [TPUs](https://en.wikipedia.org/wiki/Tensor_Processing_Unit),
  and provides all the features you need to build deep learning workflows:
  [TensorFlow features](https://en.wikipedia.org/wiki/TensorFlow#Features).

* [PyTorch](https://pytorch.org/)

  This is a machine learning library built on the Torch library, and it was originally
  developed by Facebook.

* [scikit-learn](https://scikit-learn.org/stable/)

  This is a Python library developed for machine learning. It has a lot of
  sample datasets that provide a nice means to learn how different methods work.
  It is designed to work with NumPy and SciPy.
General recommendations on the web seem to be to use scikit-learn to get
started with machine learning and to explore ideas, but to switch to
one of the other packages for computationally-intensive work.

scikit-learn provides some nice sample datasets:

https://scikit-learn.org/stable/datasets/toy_dataset.html

as well as generators for
datasets:

https://scikit-learn.org/stable/datasets/sample_generators.html
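For example, here is a minimal sketch (assuming scikit-learn and NumPy are installed) that loads one of the toy datasets and then builds a synthetic dataset with one of the sample generators:

```python
from sklearn import datasets

# load a built-in "toy" dataset: 150 iris flowers, 4 measured
# features each, with an integer class label for the species
iris = datasets.load_iris()
print(iris.data.shape)       # (150, 4) feature matrix
print(iris.target.shape)     # (150,) class labels
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']

# generate a synthetic labeled dataset with a sample generator
X, y = datasets.make_classification(n_samples=200, n_features=5,
                                    n_informative=3, random_state=0)
print(X.shape, y.shape)      # (200, 5) (200,)
```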
There are also tools that provide higher-level interfaces to these libraries:

* [Keras](https://keras.io/)

  Keras is built on top of TensorFlow and provides a nice Python interface that
  hides a lot of the implementation details in TensorFlow.
## Keras / TensorFlow

We'll focus on Keras and TensorFlow.

There are a large number of examples provided by Keras:

https://keras.io/examples/

You should be able to install keras via pip or conda.
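As a quick orientation, here is a minimal sketch of a Keras workflow. The layer sizes, random stand-in data, and training settings are arbitrary choices for illustration, not taken from any particular Keras example:

```python
import numpy as np
import keras

# random stand-in data: 100 samples, 3 input features, 2 outputs
rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 3))
y_train = rng.normal(size=(100, 2))

# a small fully-connected (dense) network
model = keras.Sequential([
    keras.layers.Input(shape=(3,)),
    keras.layers.Dense(8, activation="sigmoid"),
    keras.layers.Dense(2),
])

# choose a loss function and an optimizer, then train
model.compile(optimizer="sgd", loss="mean_squared_error")
model.fit(x_train, y_train, epochs=10, batch_size=16, verbose=0)

# use the trained network on new inputs
print(model.predict(x_train[:5]))
```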
# Artificial Neural Network Basics

## Neural networks

When we talk about machine learning, we often mean an
[_artificial neural network_](https://en.wikipedia.org/wiki/Artificial_neural_network). A
neural network mimics the action of neurons in your brain. We'll
follow the notation from _Computational Methods for Physics_ by
Franklin.

Basic idea:

* Create a nonlinear fitting routine with free parameters
* Train the network on data with known inputs and outputs to set the parameters
* Use the trained network on new data to predict the outcome
We can think of a neural network as a map that takes a set of
$N_\mathrm{in}$ parameters and returns a set of $N_\mathrm{out}$
parameters, which we can express as:

$${\bf z} = {\bf A} {\bf x}$$

where

$${\bf x} = (x_1, x_2, \ldots, x_{N_\mathrm{in}})$$

are the inputs,

$${\bf z} = (z_1, z_2, \ldots, z_{N_\mathrm{out}})$$

are the outputs, and
${\bf A}$ is an $N_\mathrm{out} \times N_\mathrm{in}$ matrix.

Our goal is to determine the matrix elements of ${\bf A}$.
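Concretely, this map is just a matrix-vector product. A tiny NumPy sketch (with arbitrary sizes chosen only to show the shapes):

```python
import numpy as np

N_in, N_out = 3, 2

rng = np.random.default_rng(0)
A = rng.normal(size=(N_out, N_in))   # the matrix we want to learn
x = rng.normal(size=N_in)            # input vector

z = A @ x                            # output vector
print(z.shape)                       # (2,)
```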
## Nomenclature

We can visualize a neural network as:



* Neural networks are divided into _layers_.

  * There is always an _input layer_—it doesn't do any processing.

  * There is always an _output layer_.

* Within a layer there are neurons or _nodes_.

  * For input, there will be one node for each input variable. In this figure,
    there are 3 nodes on the input layer.

  * The output layer will have as many nodes as needed to convey the answer
    we are seeking from the network. In this case, there are 2 nodes on the
    output layer.

* Every node in the first layer connects to every node in the next layer.

  * The _weight_ associated with the _connection_ can vary—these are the matrix elements.

```{note}
This is called a _dense layer_. There are alternate types of layers
we can explore where the nodes are connected differently.
```

* In this example, the processing is done in layer 2 (the output).

* When you train the neural network, you are adjusting the weights connecting to the nodes.

  * Some connections might have zero weight.

* This mimics nature—a single neuron can connect to several (or lots) of other neurons.
## Universal approximation theorem

A neural network can be designed to approximate any function, $f(x)$. For this to work, there must be a source of non-linearity in the network—this is a result of the [universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem).

We use a nonlinear [_activation function_](https://en.wikipedia.org/wiki/Activation_function) that is applied in a layer. It has
the form:

$$g({\bf v}) = \left ( \begin{array}{c} g(v_0) \\ g(v_1) \\ \vdots \\ g(v_{n-1}) \end{array} \right )$$

```{note}
The activation function, $g({\bf v})$, works element-by-element on the vector ${\bf v}$.
```

Then our neural network has the form: ${\bf z} = g({\bf A x})$

We want to choose a function $g(\xi)$ that is differentiable. A common choice is the _sigmoid function_:

$$g(\xi) = \frac{1}{1 + e^{-\xi}}$$

```{figure} sigmoid.png
---
align: center
---
The sigmoid function
```

```{note}
There are [many choices for the activation function](https://en.wikipedia.org/wiki/Activation_function), which have
different properties. Often the choice of activation function will be made empirically, by experimenting with the
performance of the network.
```
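A short NumPy sketch of this forward pass, using the sigmoid as the activation function (the sizes and data are arbitrary):

```python
import numpy as np

def sigmoid(xi):
    """sigmoid activation, applied element-by-element"""
    return 1.0 / (1.0 + np.exp(-xi))

N_in, N_out = 3, 2
rng = np.random.default_rng(0)

A = rng.normal(size=(N_out, N_in))   # weights (to be trained)
x = rng.normal(size=N_in)            # a single input vector

z = sigmoid(A @ x)                   # z = g(Ax)
print(z)                             # each entry lies in (0, 1)
```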
## Basic algorithm

* Training

  * Loop over the $T$ pairs $({\bf x}^k, {\bf y}^k)$ for $k = 1, \ldots, T$

    * Predict the output for ${\bf x}^k$ as:

      $$z_i = g([{\bf A x}^k]_i) = g \left ( \sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right )$$

    * Constrain that ${\bf z} = {\bf y}^k$.

      This is a minimization problem, where we are minimizing:

      \begin{align*}
      f(A_{ij}) &= \| g({\bf A x}^k) - {\bf y}^k \|^2 \\
                &= \sum_{i=1}^{N_\mathrm{out}} \left [ g\left (\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right ) - y^k_i \right ]^2
      \end{align*}

      We call this function the _cost function_ or _loss function_.

      ```{note}
      This is one possible choice for the cost function, $f(A_{ij})$, but [many others exist](https://en.wikipedia.org/wiki/Loss_function).
      ```

    * Update the matrix ${\bf A}$ based on the training pair $({\bf x}^k, {\bf y}^k)$.

* Using the network

  With the trained ${\bf A}$, we can now use the network on data we haven't seen before, ${\boldsymbol \chi}$:

  $$z_i = g([{\bf A} {\boldsymbol \chi}]_i) = g \left ( \sum_{j=1}^{N_\mathrm{in}} A_{ij} \chi_j \right )$$

There are a lot of details that we still need to figure out involving the training and minimization.
We'll start with minimization: a common minimization technique used with
neural networks is [_gradient descent_](https://en.wikipedia.org/wiki/Gradient_descent).
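Before diving into that, here is a small NumPy sketch that evaluates the prediction and the cost function for a single training pair (all data values are made up for illustration):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def loss(A, x, y):
    """cost function f(A) = || g(Ax) - y ||^2 for one training pair"""
    z = sigmoid(A @ x)
    return np.sum((z - y)**2)

rng = np.random.default_rng(0)
N_in, N_out = 3, 2

A = rng.normal(size=(N_out, N_in))   # current guess for the weights
x = rng.normal(size=N_in)            # one training input, x^k
y = rng.uniform(size=N_out)          # the corresponding known output, y^k

print(loss(A, x, y))
```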
# Deriving the Learning Correction

For gradient descent, we need to derive the update to the matrix
${\bf A}$ based on training on a set of our data, $({\bf x}^k, {\bf y}^k)$.

Let's start with our cost function:

$$f(A_{ij}) = \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)^2 = \sum_{i=1}^{N_\mathrm{out}}
\Biggl [ g\biggl (\underbrace{\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j}_{\equiv \alpha_i} \biggr ) - y^k_i \Biggr ]^2$$

where we'll refer to the product ${\boldsymbol \alpha} \equiv {\bf Ax}$ to help simplify the notation.

We can compute the derivative with respect to a single matrix
element, $A_{pq}$, by applying the chain rule:

$$\frac{\partial f}{\partial A_{pq}} =
2 \sum_{i=1}^{N_\mathrm{out}} (z_i - y^k_i) \left . \frac{\partial g}{\partial \xi} \right |_{\xi=\alpha_i} \frac{\partial \alpha_i}{\partial A_{pq}}$$

with

$$\frac{\partial \alpha_i}{\partial A_{pq}} = \sum_{j=1}^{N_\mathrm{in}} \frac{\partial A_{ij}}{\partial A_{pq}} x^k_j = \sum_{j=1}^{N_\mathrm{in}} \delta_{ip} \delta_{jq} x^k_j = \delta_{ip} x^k_q$$

and for $g(\xi)$, we will assume the sigmoid function, so

$$\frac{\partial g}{\partial \xi}
= \frac{\partial}{\partial \xi} \frac{1}{1 + e^{-\xi}}
= -(1 + e^{-\xi})^{-2} (- e^{-\xi})
= g(\xi) \frac{e^{-\xi}}{1+ e^{-\xi}} = g(\xi) (1 - g(\xi))$$

which gives us:

\begin{align*}
\frac{\partial f}{\partial A_{pq}} &= 2 \sum_{i=1}^{N_\mathrm{out}}
   (z_i - y^k_i) z_i (1 - z_i) \delta_{ip} x^k_q \\
&= 2 (z_p - y^k_p) z_p (1 - z_p) x^k_q
\end{align*}

where we used the fact that the $\delta_{ip}$ means that only a single term contributes to the sum.
Note that:

* $e_p^k \equiv (z_p - y_p^k)$ is the error on the output layer,
  and the correction is proportional to the error (as we would
  expect).

* The $k$ superscripts here remind us that this is the result of
  only a single pair of data from the training set.

Now ${\bf z}$ and ${\bf y}^k$ are both vectors of size $N_\mathrm{out} \times 1$ and ${\bf x}^k$ is a vector of size $N_\mathrm{in} \times 1$, so we can write this expression for the matrix as a whole as:

$$\frac{\partial f}{\partial {\bf A}} = 2 ({\bf z} - {\bf y}^k) \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$$

where the operator $\circ$ represents _element-by-element_ multiplication (the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices))).
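Here is a NumPy sketch of this gradient expression, compared against a finite-difference estimate of one matrix element (the sizes and data are made up):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def grad(A, x, y):
    """df/dA = 2 (z - y) o z o (1 - z) . x^T  for a single training pair"""
    z = sigmoid(A @ x)
    return 2.0 * np.outer((z - y) * z * (1.0 - z), x)

rng = np.random.default_rng(0)
N_in, N_out = 3, 2
A = rng.normal(size=(N_out, N_in))
x = rng.normal(size=N_in)
y = rng.uniform(size=N_out)

# central finite-difference check of the (p, q) = (0, 1) element
f = lambda A: np.sum((sigmoid(A @ x) - y)**2)
eps = 1.e-6
dA = np.zeros_like(A)
dA[0, 1] = eps
print(grad(A, x, y)[0, 1], (f(A + dA) - f(A - dA)) / (2 * eps))
```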
## Performing the update

We could do the update like we just saw with our gradient descent
example: take a single data point, $({\bf x}^k, {\bf y}^k)$, and
do the full minimization, continually estimating the correction,
$\partial f/\partial {\bf A}$, and updating ${\bf A}$ until we
reach a minimum. The problem with this is that $({\bf x}^k, {\bf y}^k)$
is only one point in our training data, and there is no
guarantee that if we minimize completely with point $k$ we will
also be at a minimum with point $k+1$.

Instead we take multiple passes through the training data (called
_epochs_) and apply only a single push in the direction that gradient
descent suggests, scaled by a _learning rate_ $\eta$.

The overall minimization appears as:

<div style="border: solid; padding: 10px; width: 80%; margin: 0 auto; background: #eeeeee">

* Loop over epochs

  * Loop over the training data, $\{ ({\bf x}^0, {\bf y}^0), ({\bf x}^1, {\bf y}^1), \ldots \}$. We'll refer to the current training
    pair as $({\bf x}^k, {\bf y}^k)$.

    * Propagate ${\bf x}^k$ through the network, getting the output
      ${\bf z} = g({\bf A x}^k)$.

    * Compute the error on the output layer, ${\bf e}^k = {\bf z} - {\bf y}^k$.

    * Update the matrix ${\bf A}$ according to:

      $${\bf A} \leftarrow {\bf A} - 2 \,\eta\, {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$$

</div>
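A minimal NumPy sketch of this loop, where the training set, number of epochs, and learning rate are made-up placeholders:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

rng = np.random.default_rng(0)
N_in, N_out = 3, 2

# made-up training set: T pairs (x^k, y^k) with outputs in (0, 1)
T = 50
xs = rng.normal(size=(T, N_in))
ys = rng.uniform(size=(T, N_out))

A = rng.normal(size=(N_out, N_in))   # initial guess for the weights
eta = 0.1                            # learning rate
n_epochs = 100

for epoch in range(n_epochs):
    for x, y in zip(xs, ys):
        z = sigmoid(A @ x)           # propagate x^k through the network
        e = z - y                    # error on the output layer
        # single gradient-descent push, scaled by the learning rate
        A -= 2.0 * eta * np.outer(e * z * (1.0 - z), x)

# total loss over the training set after training
print(sum(np.sum((sigmoid(A @ x) - y)**2) for x, y in zip(xs, ys)))
```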