# Diving Deeper into Machine Learning

We've focused on neural networks, using labeled data that we
can use to learn the trends in our data. This is an example
of _supervised learning_.

Broadly speaking, there are
three main [approaches to machine learning](https://en.wikipedia.org/wiki/Machine_learning#Approaches):

* [Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning)

  This uses labeled pairs (input and output) to train the model
  to learn how to predict the outputs from the inputs.

* [Unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning)

  No labeled data is provided. Instead, the machine learning
  algorithm seeks to find the structure on its own. The goal
  is to learn patterns and features that can be used to produce
  new data.

* [Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)

  As with unsupervised learning, no labeled data is used,
  but the model is "rewarded" when it does something right,
  and the model tries to maximize rewards (think: self-driving
  cars).
## Libraries

There are a number of popular libraries that implement machine learning algorithms.
Their features and performance vary quite a bit. A comparison of their
features is provided by Wikipedia: [Comparison of deep learning software](https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software).

Some additional comparisons are provided here: https://ritza.co/articles/scikit-learn-vs-tensorflow-vs-pytorch-vs-keras/

* [TensorFlow](https://www.tensorflow.org/)

  This is an open source machine learning library released by Google. It has support
  for CPUs, GPUs, and [TPUs](https://en.wikipedia.org/wiki/Tensor_Processing_Unit),
  and provides all the features you need to build deep learning workflows:
  [TensorFlow features](https://en.wikipedia.org/wiki/TensorFlow#Features).

* [PyTorch](https://pytorch.org/)

  This is a machine learning library built on the Torch library, and it was originally
  developed by Facebook.

* [scikit-learn](https://scikit-learn.org/stable/)

  This is a Python library developed for machine learning. It has a lot of
  sample datasets that provide a nice means to learn how different methods work.
  It is designed to work with NumPy and SciPy.
General recommendations on the web seem to be to use scikit-learn to get
started with machine learning and to explore ideas, but to switch to
one of the other packages for computationally-intensive work.

scikit-learn provides some nice sample datasets:

https://scikit-learn.org/stable/datasets/toy_dataset.html

as well as generators for
datasets:

https://scikit-learn.org/stable/datasets/sample_generators.html
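For example, here is a minimal sketch (assuming scikit-learn and NumPy are installed) that loads one of the toy datasets and then builds a synthetic dataset with one of the sample generators:

```python
from sklearn import datasets

# load a built-in "toy" dataset: 150 iris flowers, 4 measured
# features each, with an integer class label for the species
iris = datasets.load_iris()
print(iris.data.shape)       # (150, 4) feature matrix
print(iris.target.shape)     # (150,) class labels
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']

# generate a synthetic labeled dataset with a sample generator
X, y = datasets.make_classification(n_samples=200, n_features=5,
                                    n_informative=3, random_state=0)
print(X.shape, y.shape)      # (200, 5) (200,)
```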
There are also tools that provide higher-level interfaces to these libraries:

* [Keras](https://keras.io/)

  Keras is built on top of TensorFlow and provides a nice Python interface that
  hides a lot of the implementation details in TensorFlow.
## Keras / TensorFlow

We'll focus on Keras and TensorFlow.

There are a large number of examples provided by Keras:

https://keras.io/examples/

You should be able to install keras via pip or conda.
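As a quick orientation, here is a minimal sketch of a Keras workflow. The layer sizes, random stand-in data, and training settings are arbitrary choices for illustration, not taken from any particular Keras example:

```python
import numpy as np
import keras

# random stand-in data: 100 samples, 3 input features, 2 outputs
rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 3))
y_train = rng.normal(size=(100, 2))

# a small fully-connected (dense) network
model = keras.Sequential([
    keras.layers.Input(shape=(3,)),
    keras.layers.Dense(8, activation="sigmoid"),
    keras.layers.Dense(2),
])

# choose a loss function and an optimizer, then train
model.compile(optimizer="sgd", loss="mean_squared_error")
model.fit(x_train, y_train, epochs=10, batch_size=16, verbose=0)

# use the trained network on new inputs
print(model.predict(x_train[:5]))
```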
# Artificial Neural Network Basics

## Neural networks

When we talk about machine learning, we often mean an
[_artificial neural network_](https://en.wikipedia.org/wiki/Artificial_neural_network). A
neural network mimics the action of neurons in your brain. We'll
follow the notation from _Computational Methods for Physics_ by
Franklin.

Basic idea:

* Create a nonlinear fitting routine with free parameters
* Train the network on data with known inputs and outputs to set the parameters
* Use the trained network on new data to predict the outcome
We can think of a neural network as a map that takes a set of
$N_\mathrm{in}$ parameters and returns a set of $N_\mathrm{out}$
parameters, which we can express as:

$${\bf z} = {\bf A} {\bf x}$$

where

$${\bf x} = (x_1, x_2, \ldots, x_{N_\mathrm{in}})$$

are the inputs,

$${\bf z} = (z_1, z_2, \ldots, z_{N_\mathrm{out}})$$

are the outputs, and
${\bf A}$ is an $N_\mathrm{out} \times N_\mathrm{in}$ matrix.

Our goal is to determine the matrix elements of ${\bf A}$.
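Concretely, this map is just a matrix-vector product. A tiny NumPy sketch (with arbitrary sizes chosen only to show the shapes):

```python
import numpy as np

N_in, N_out = 3, 2

rng = np.random.default_rng(0)
A = rng.normal(size=(N_out, N_in))   # the matrix we want to learn
x = rng.normal(size=N_in)            # input vector

z = A @ x                            # output vector
print(z.shape)                       # (2,)
```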
## Nomenclature

We can visualize a neural network as:



* Neural networks are divided into _layers_.

  * There is always an _input layer_—it doesn't do any processing.

  * There is always an _output layer_.

* Within a layer there are neurons or _nodes_.

  * For input, there will be one node for each input variable. In this figure,
    there are 3 nodes on the input layer.

  * The output layer will have as many nodes as needed to convey the answer
    we are seeking from the network. In this case, there are 2 nodes on the
    output layer.

* Every node in the first layer connects to every node in the next layer.

  * The _weight_ associated with the _connection_ can vary—these are the matrix elements.

```{note}
This is called a _dense layer_. There are alternate types of layers
we can explore where the nodes are connected differently.
```

* In this example, the processing is done in layer 2 (the output).

* When you train the neural network, you are adjusting the weights connecting to the nodes.

  * Some connections might have zero weight.

* This mimics nature—a single neuron can connect to several (or lots) of other neurons.
## Universal approximation theorem

A neural network can be designed to approximate any function, $f(x)$. For this to work, there must be a source of non-linearity in the network—this is a result of the [universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem).

We use a nonlinear [_activation function_](https://en.wikipedia.org/wiki/Activation_function) that is applied in a layer. It has
the form:

$$g({\bf v}) = \left ( \begin{array}{c} g(v_0) \\ g(v_1) \\ \vdots \\ g(v_{n-1}) \end{array} \right )$$

```{note}
The activation function, $g({\bf v})$, works element-by-element on the vector ${\bf v}$.
```

Then our neural network has the form: ${\bf z} = g({\bf A x})$

We want to choose a function $g(\xi)$ that is differentiable. A common choice is the _sigmoid function_:

$$g(\xi) = \frac{1}{1 + e^{-\xi}}$$

```{figure} sigmoid.png
---
align: center
---
The sigmoid function
```

```{note}
There are [many choices for the activation function](https://en.wikipedia.org/wiki/Activation_function), which have
different properties. Often the choice of activation function will be made empirically, by experimenting with the
performance of the network.
```
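A short NumPy sketch of this forward pass, using the sigmoid as the activation function (the sizes and data are arbitrary):

```python
import numpy as np

def sigmoid(xi):
    """sigmoid activation, applied element-by-element"""
    return 1.0 / (1.0 + np.exp(-xi))

N_in, N_out = 3, 2
rng = np.random.default_rng(0)

A = rng.normal(size=(N_out, N_in))   # weights (to be trained)
x = rng.normal(size=N_in)            # a single input vector

z = sigmoid(A @ x)                   # z = g(Ax)
print(z)                             # each entry lies in (0, 1)
```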
## Basic algorithm

* Training

  * Loop over the $T$ pairs $({\bf x}^k, {\bf y}^k)$ for $k = 1, \ldots, T$

    * Predict the output for ${\bf x}^k$ as:

      $$z_i = g([{\bf A x}^k]_i) = g \left ( \sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right )$$

    * Constrain that ${\bf z} = {\bf y}^k$.

      This is a minimization problem, where we are minimizing:

      \begin{align*}
      f(A_{ij}) &= \| g({\bf A x}^k) - {\bf y}^k \|^2 \\
                &= \sum_{i=1}^{N_\mathrm{out}} \left [ g\left (\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j \right ) - y^k_i \right ]^2
      \end{align*}

      We call this function the _cost function_ or _loss function_.

      ```{note}
      This is one possible choice for the cost function, $f(A_{ij})$, but [many others exist](https://en.wikipedia.org/wiki/Loss_function).
      ```

    * Update the matrix ${\bf A}$ based on the training pair $({\bf x}^k, {\bf y}^k)$.

* Using the network

  With the trained ${\bf A}$, we can now use the network on data we haven't seen before, ${\boldsymbol \chi}$:

  $$z_i = g([{\bf A} {\boldsymbol \chi}]_i) = g \left ( \sum_{j=1}^{N_\mathrm{in}} A_{ij} \chi_j \right )$$

There are a lot of details that we still need to figure out involving the training and minimization.
We'll start with minimization: a common minimization technique used with
neural networks is [_gradient descent_](https://en.wikipedia.org/wiki/Gradient_descent).
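Before diving into that, here is a small NumPy sketch that evaluates the prediction and the cost function for a single training pair (all data values are made up for illustration):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def loss(A, x, y):
    """cost function f(A) = || g(Ax) - y ||^2 for one training pair"""
    z = sigmoid(A @ x)
    return np.sum((z - y)**2)

rng = np.random.default_rng(0)
N_in, N_out = 3, 2

A = rng.normal(size=(N_out, N_in))   # current guess for the weights
x = rng.normal(size=N_in)            # one training input, x^k
y = rng.uniform(size=N_out)          # the corresponding known output, y^k

print(loss(A, x, y))
```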
# Deriving the Learning Correction

For gradient descent, we need to derive the update to the matrix
${\bf A}$ based on training on a set of our data, $({\bf x}^k, {\bf y}^k)$.

Let's start with our cost function:

$$f(A_{ij}) = \sum_{i=1}^{N_\mathrm{out}} (z_i - y_i^k)^2 = \sum_{i=1}^{N_\mathrm{out}}
\Biggl [ g\biggl (\underbrace{\sum_{j=1}^{N_\mathrm{in}} A_{ij} x^k_j}_{\equiv \alpha_i} \biggr ) - y^k_i \Biggr ]^2$$

where we'll refer to the product ${\boldsymbol \alpha} \equiv {\bf Ax}$ to help simplify the notation.

We can compute the derivative with respect to a single matrix
element, $A_{pq}$, by applying the chain rule:

$$\frac{\partial f}{\partial A_{pq}} =
2 \sum_{i=1}^{N_\mathrm{out}} (z_i - y^k_i) \left . \frac{\partial g}{\partial \xi} \right |_{\xi=\alpha_i} \frac{\partial \alpha_i}{\partial A_{pq}}$$

with

$$\frac{\partial \alpha_i}{\partial A_{pq}} = \sum_{j=1}^{N_\mathrm{in}} \frac{\partial A_{ij}}{\partial A_{pq}} x^k_j = \sum_{j=1}^{N_\mathrm{in}} \delta_{ip} \delta_{jq} x^k_j = \delta_{ip} x^k_q$$

and for $g(\xi)$, we will assume the sigmoid function, so

$$\frac{\partial g}{\partial \xi}
= \frac{\partial}{\partial \xi} \frac{1}{1 + e^{-\xi}}
= -(1 + e^{-\xi})^{-2} (- e^{-\xi})
= g(\xi) \frac{e^{-\xi}}{1+ e^{-\xi}} = g(\xi) (1 - g(\xi))$$

which gives us:

\begin{align*}
\frac{\partial f}{\partial A_{pq}} &= 2 \sum_{i=1}^{N_\mathrm{out}}
   (z_i - y^k_i) z_i (1 - z_i) \delta_{ip} x^k_q \\
&= 2 (z_p - y^k_p) z_p (1 - z_p) x^k_q
\end{align*}

where we used the fact that the $\delta_{ip}$ means that only a single term contributes to the sum.
Note that:

* $e_p^k \equiv (z_p - y_p^k)$ is the error on the output layer,
  and the correction is proportional to the error (as we would
  expect).

* The $k$ superscripts here remind us that this is the result of
  only a single pair of data from the training set.

Now ${\bf z}$ and ${\bf y}^k$ are both vectors of size $N_\mathrm{out} \times 1$ and ${\bf x}^k$ is a vector of size $N_\mathrm{in} \times 1$, so we can write this expression for the matrix as a whole as:

$$\frac{\partial f}{\partial {\bf A}} = 2 ({\bf z} - {\bf y}^k) \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$$

where the operator $\circ$ represents _element-by-element_ multiplication (the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices))).
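Here is a NumPy sketch of this gradient expression, compared against a finite-difference estimate of one matrix element (the sizes and data are made up):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def grad(A, x, y):
    """df/dA = 2 (z - y) o z o (1 - z) . x^T  for a single training pair"""
    z = sigmoid(A @ x)
    return 2.0 * np.outer((z - y) * z * (1.0 - z), x)

rng = np.random.default_rng(0)
N_in, N_out = 3, 2
A = rng.normal(size=(N_out, N_in))
x = rng.normal(size=N_in)
y = rng.uniform(size=N_out)

# central finite-difference check of the (p, q) = (0, 1) element
f = lambda A: np.sum((sigmoid(A @ x) - y)**2)
eps = 1.e-6
dA = np.zeros_like(A)
dA[0, 1] = eps
print(grad(A, x, y)[0, 1], (f(A + dA) - f(A - dA)) / (2 * eps))
```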
## Performing the update

We could do the update like we just saw with our gradient descent
example: take a single data point, $({\bf x}^k, {\bf y}^k)$, and
do the full minimization, continually estimating the correction,
$\partial f/\partial {\bf A}$, and updating ${\bf A}$ until we
reach a minimum. The problem with this is that $({\bf x}^k, {\bf y}^k)$
is only one point in our training data, and there is no
guarantee that if we minimize completely with point $k$ we will
also be at a minimum with point $k+1$.

Instead we take multiple passes through the training data (called
_epochs_) and apply only a single push in the direction that gradient
descent suggests, scaled by a _learning rate_ $\eta$.

The overall minimization appears as:

<div style="border: solid; padding: 10px; width: 80%; margin: 0 auto; background: #eeeeee">

* Loop over epochs

  * Loop over the training data, $\{ ({\bf x}^0, {\bf y}^0), ({\bf x}^1, {\bf y}^1), \ldots \}$. We'll refer to the current training
    pair as $({\bf x}^k, {\bf y}^k)$.

    * Propagate ${\bf x}^k$ through the network, getting the output
      ${\bf z} = g({\bf A x}^k)$.

    * Compute the error on the output layer, ${\bf e}^k = {\bf z} - {\bf y}^k$.

    * Update the matrix ${\bf A}$ according to:

      $${\bf A} \leftarrow {\bf A} - 2 \,\eta\, {\bf e}^k \circ {\bf z} \circ (1 - {\bf z}) \cdot ({\bf x}^k)^\intercal$$

</div>
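A minimal NumPy sketch of this loop, where the training set, number of epochs, and learning rate are made-up placeholders:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

rng = np.random.default_rng(0)
N_in, N_out = 3, 2

# made-up training set: T pairs (x^k, y^k) with outputs in (0, 1)
T = 50
xs = rng.normal(size=(T, N_in))
ys = rng.uniform(size=(T, N_out))

A = rng.normal(size=(N_out, N_in))   # initial guess for the weights
eta = 0.1                            # learning rate
n_epochs = 100

for epoch in range(n_epochs):
    for x, y in zip(xs, ys):
        z = sigmoid(A @ x)           # propagate x^k through the network
        e = z - y                    # error on the output layer
        # single gradient-descent push, scaled by the learning rate
        A -= 2.0 * eta * np.outer(e * z * (1.0 - z), x)

# total loss over the training set after training
print(sum(np.sum((sigmoid(A @ x) - y)**2) for x, y in zip(xs, ys)))
```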