diff --git a/notebooks/noj_book/metamorph.clj b/notebooks/noj_book/metamorph.clj
index f7158ff..f06b40e 100644
--- a/notebooks/noj_book/metamorph.clj
+++ b/notebooks/noj_book/metamorph.clj
@@ -4,13 +4,13 @@
    [scicloj.kindly.v4.kind :as kind]))

 ;; # Machine learning pipelines

-;; ## Clojure Core Pipelines
+;; ## Clojure Core Pipelines

-;; Clojure has built-in support for data processing pipelines—a series of functions where the output
-;; of one step is the input to the next. In core Clojure, these are supported by the so-called
-;; **threading macro**.
+;; Clojure has built-in support for data processing pipelines—a series of functions where the output
+;; of one step is the input to the next. In core Clojure, these are supported by the so-called
+;; **threading macro**.

-;; ### Example: Using the Threading Macro
+;; ### Example: Using the Threading Macro

 (require '[clojure.string :as str])

 (-> "hello"
@@ -18,93 +18,93 @@
     (str/reverse)
     (first))

-;; In the example above:
+;; In the example above:

-;; 1. `"hello"` is converted to uppercase, resulting in `"HELLO"`.
-;; 2. The uppercase string is reversed, giving `"OLLEH"`.
-;; 3. The first character of the reversed string is extracted, which is `\O`.
+;; 1. `"hello"` is converted to uppercase, resulting in `"HELLO"`.
+;; 2. The uppercase string is reversed, giving `"OLLEH"`.
+;; 3. The first character of the reversed string is extracted, which is `\O`.

-;; ## Function Composition with `comp`
+;; ## Function Composition with `comp`

-;; We can achieve the same result using **function composition** with `comp`. Note that when using
-;; `comp`, the order of functions is reversed compared to the threading macro.
+;; We can achieve the same result using **function composition** with `comp`. Note that when using
+;; `comp`, the order of functions is reversed compared to the threading macro.
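To make the order reversal concrete, here is a quick equivalence check (an editorial sketch using only `clojure.string`; it mirrors the pipeline above):

```clojure
(require '[clojure.string :as str])

;; `->` threads the value left to right; `comp` composes right to left,
;; so the same steps must be listed in reverse order:
(= (-> "hello" str/upper-case str/reverse first)
   ((comp first str/reverse str/upper-case) "hello"))
;; => true (both evaluate to \O)
```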
 (def upper-reverse-first (comp first str/reverse str/upper-case))

 (upper-reverse-first "hello")

-;; This defines a function `upper-reverse-first` that:
+;; This defines a function `upper-reverse-first` that:

-;; 1. Converts the input string to uppercase.
-;; 2. Reverses the uppercase string.
-;; 3. Extracts the first character.
+;; 1. Converts the input string to uppercase.
+;; 2. Reverses the uppercase string.
+;; 3. Extracts the first character.

-;; #### Applying the Composed Function
+;; #### Applying the Composed Function

-;; We can carry the composed function around and apply it in different places:
+;; We can carry the composed function around and apply it in different places:

 (upper-reverse-first "world")

-;; Or using `apply`:
+;; Or using `apply`:

 (apply upper-reverse-first ["world"])

-;; #### Inlining the Composed Function
+;; #### Inlining the Composed Function

-;; We can also inline the composed function without assigning it to a variable:
+;; We can also inline the composed function without assigning it to a variable:

 ((comp first str/reverse str/upper-case) "hello")

-;; ## Pipelines in Machine Learning
+;; ## Pipelines in Machine Learning

-;; In machine learning, we usually have two separate concepts:
+;; In machine learning, we usually have two separate concepts:

-;; - **Pre-processing of the data**: Zero or more steps to prepare the data.
-;; - **Fitting a model**: A single step where the model learns from the data.
+;; - **Pre-processing of the data**: Zero or more steps to prepare the data.
+;; - **Fitting a model**: A single step where the model learns from the data.

-;; Considering these concepts, we aim to create a pipeline that satisfies the following goals:
+;; Considering these concepts, we aim to create a pipeline that satisfies the following goals:

-;; ### Pipeline Goals
+;; ### Pipeline Goals

-;; - **Unify Pre-processing and Fitting**: Combine all steps into a single pipeline.
-;; - **Reusability**: The same pipeline can be executed multiple times (e.g., training vs. prediction),
-;; possibly on different data.
-;; - **Conditional Behavior**: Functions within the pipeline may need to behave differently during
-;; training and prediction.
-;; - **Stateful Steps**: Some steps might need to learn from the data during training and then apply
-;; that learned state during prediction.
-;; - **Readability**: Write pipeline steps in order for easier understanding.
-;; - **Movability**: The entire pipeline should be assignable to a variable or addable to a sequence,
-;; making it modular and reusable.
-;; - **Callable**: The pipeline should be callable like a function, taking data as input and returning
-;; the transformed data.
+;; - **Unify Pre-processing and Fitting**: Combine all steps into a single pipeline.
+;; - **Reusability**: The same pipeline can be executed multiple times (e.g., training vs. prediction),
+;; possibly on different data.
+;; - **Conditional Behavior**: Functions within the pipeline may need to behave differently during
+;; training and prediction.
+;; - **Stateful Steps**: Some steps might need to learn from the data during training and then apply
+;; that learned state during prediction.
+;; - **Readability**: Write pipeline steps in order for easier understanding.
+;; - **Movability**: The entire pipeline should be assignable to a variable or addable to a sequence,
+;; making it modular and reusable.
+;; - **Callability**: The pipeline should be callable like a function, taking data as input and returning
+;; the transformed data.
-;; ### The Need for a New Approach
+;; ### The Need for a New Approach

-;; Clojure's threading macro (`->`) and function composition (`comp`) do not fully meet these requirements
-;; because:
+;; Clojure's threading macro (`->`) and function composition (`comp`) do not fully meet these requirements
+;; because:

-;; - They lack the ability to handle state between training and prediction phases.
-;; - They don't support conditional behavior based on the execution context (e.g., training vs. prediction).
-;; - They may not represent the pipeline steps in a readable, sequential order when using `comp`.
+;; - They lack the ability to handle state between training and prediction phases.
+;; - They don't support conditional behavior based on the execution context (e.g., training vs. prediction).
+;; - They may not represent the pipeline steps in a readable, sequential order when using `comp`.

-;; ## Introducing Metamorph Pipelines
+;; ## Introducing Metamorph Pipelines

-;; To address these limitations, **Metamorph pipelines** were developed. [Metamorph](https://github.com/scicloj/metamorph) provides a way to
-;; create pipelines that:
+;; To address these limitations, **Metamorph pipelines** were developed. [Metamorph](https://github.com/scicloj/metamorph) provides a way to
+;; create pipelines that:

-;; - Compose processing steps in a readable, sequential order.
-;; - Maintain state between different stages of execution.
-;; - Allow for conditional behavior within pipeline steps.
-;; - Can be easily moved, assigned, and called like functions.
+;; - Compose processing steps in a readable, sequential order.
+;; - Maintain state between different stages of execution.
+;; - Allow for conditional behavior within pipeline steps.
+;; - Can be easily moved, assigned, and called like functions.

-; ### A pipeline is a composition of functions
+;; ### A pipeline is a composition of functions

-;; A metamorph pipeline is created by the function `scicloj.metamorph.core/pipeline`.
-;; It takes functions as input and composes them in order (unlike `comp`, which composes them in reverse order).
-;; Note that it is not a macro, so it cannot take expressions such as `(str/upper-case)` directly.
+;; A metamorph pipeline is created by the function `scicloj.metamorph.core/pipeline`.
+;; It takes functions as input and composes them in order (unlike `comp`, which composes them in reverse order).
+;; Note that it is not a macro, so it cannot take expressions such as `(str/upper-case)` directly.

 (require '[scicloj.metamorph.core :as mm])

 (def metamorph-pipeline-1
@@ -113,18 +113,19 @@
    str/reverse
    first))

-;; This creates a function that can be called with data, like this:
+;; This creates a function that can be called with data, like this:
 ;; `(metamorph-pipeline-1 "hello")`
-;;
-;; However, this would fail because metamorph pipeline functions are expected to return a map,
-;; but the above functions return a string.
-;;
-; ### Pipelines steps input/output a context map
+;;
+;; However, this would fail because metamorph pipeline functions are expected to return a map,
+;; but the above functions return a string.
+;;
+
+;; ### Pipeline steps input/output a context map
 ;;
 ;; To maintain state and allow for **stateful steps**, we conventionally use a **context map** that is
-;; passed through each function.
+;; passed through each function.
 ;;
-;; So we can only add functions to a metamorph pipeline which input and output a single map,
+;; So we can only add functions to a metamorph pipeline that take and return a single map,
 ;; the so-called context map, often called `ctx`.

 (def metamorph-pipeline-2
@@ -133,15 +134,15 @@
   (mm/pipeline
    (fn [ctx] ctx)
    (fn [ctx] ctx)))

-; ### Context map key :metamorph/data
+;; ### Context map key :metamorph/data

-;; A second convention is that the map should have several "default keys", and all functions should understand them.
-;; One of these keys is `:metamorph/data`.
+;; A second convention is that the map should have several "default keys", and all functions should understand them.
+;; One of these keys is `:metamorph/data`.
+;;
+;; It exists because in a metamorph pipeline we always pass around one main data object and several states.
+;; The main data object manipulated by the pipeline needs to be stored and passed under the key `:metamorph/data`.
 ;;
-;; It exists because in a metamorph pipeline we always pass around one main data object and several states.
-;; The main data object manipulated by the pipeline needs to be stored and passed under the key `:metamorph/data`.
-;;
-;; We now change the metamorph pipeline accordingly, so that each function reads and writes from `:metamorph/data`.
+;; We now change the metamorph pipeline accordingly, so that each function reads and writes from `:metamorph/data`.

 (def metamorph-pipeline-3-a
   (mm/pipeline
@@ -152,7 +153,7 @@
    (fn [ctx] (assoc ctx :metamorph/data
                     (first (:metamorph/data ctx))))))

-;; Alternatively, using `update`:
+;; Alternatively, using `update`:

 (def metamorph-pipeline-3-b
   (mm/pipeline
@@ -160,15 +161,15 @@
    (fn [ctx] (update ctx :metamorph/data str/reverse))
    (fn [ctx] (update ctx :metamorph/data first))))

-;; Example usage:
+;; Example usage:

-(metamorph-pipeline-3-a {:metamorph/data "hello"})
+(metamorph-pipeline-3-a {:metamorph/data "hello"})

-(metamorph-pipeline-3-b {:metamorph/data "hello"})
+(metamorph-pipeline-3-b {:metamorph/data "hello"})

-; ### Pass additional state
+;; ### Pass additional state

-;; We can pass a main data object and any state through the pipeline.
+;; We can pass a main data object and any state through the pipeline.
 (def metamorph-pipeline-4
   (mm/pipeline
@@ -183,19 +184,19 @@
      (assoc ctx :metamorph/data
             (first (:metamorph/data ctx))))))

-;; Example usage:
+;; Example usage:

-(metamorph-pipeline-4 {:metamorph/data "hello"})
+(metamorph-pipeline-4 {:metamorph/data "hello"})
 ;;

-; ### Step functions can pass state to themselves (in other :mode)
-
-;; In nearly all cases, a step function wants to pass information only to itself.
-;; It learns something in mode `:fit` and wants to use it in a second run of the pipeline in mode `:transform`.
-;;
-;; To make this easier, each step receives in the context map a unique step ID under the key `:metamorph/id`.
-;; We can use this to store and retrieve state specific to that step,
-;; avoiding clashes of keys between different step functions.
+;; ### Step functions can pass state to themselves (in the other :mode)
+
+;; In nearly all cases, a step function wants to pass information only to itself.
+;; It learns something in mode `:fit` and wants to use it in a second run of the pipeline in mode `:transform`.
+;;
+;; To make this easier, each step receives in the context map a unique step ID under the key `:metamorph/id`.
+;; We can use this to store and retrieve state specific to that step,
+;; avoiding clashes of keys between different step functions.
 ;;
 ;; (to ease readability of the code, we now use destructuring of the arguments)

@@ -212,23 +213,22 @@
      (assoc ctx :metamorph/data
             (first (:metamorph/data ctx))))))

-;; Example usage:
+;; Example usage:

-(metamorph-pipeline-5 {:metamorph/data "hello"})
+(metamorph-pipeline-5 {:metamorph/data "hello"})

-;; Note: The actual UUID will vary each time the pipeline is run.
+;; Note: The actual UUID will vary each time the pipeline is run.
 ;; To implement the requirement of allowing different behavior per step, we introduce another key in the
-;; context map: `:metamorph/mode`.
-;;
+;; context map: `:metamorph/mode`.
+;;
 ;; This can take two values, `:fit` and `:transform`, representing the concept of running the pipeline to
-; learn something from the data (train or fit the pipeline/model)
-;; and apply what was learned on new data (predict or transform).
-;; The learned information can be stored in the context map, becoming available in later runs.
-
+;; learn something from the data (train or fit the pipeline/model)
+;; and apply what was learned on new data (predict or transform).
+;; The learned information can be stored in the context map, becoming available in later runs.

-;; This passing of state only makes sense if the state is written to the map in one pass
-;; and used in a different pass.
+;; This passing of state only makes sense if the state is written to the map in one pass
+;; and used in a different pass.

 (def metamorph-pipeline-6
   (mm/pipeline
@@ -249,25 +249,25 @@
      (assoc ctx :metamorph/data
             (first (:metamorph/data ctx))))))

-; ### Run first in :fit then in :transform
+;; ### Run first in :fit then in :transform

-;; This shows how the pipeline is supposed to be run twice.
-;; First in `:fit` mode and then in `:transform` mode, passing the full state context (`ctx`)
-;; while updating the standard keys.
+;; This shows how the pipeline is supposed to be run twice.
+;; First in `:fit` mode and then in `:transform` mode, passing the full state context (`ctx`)
+;; while updating the standard keys.

-;; Usage:
+;; Usage:

-(def fitted-ctx
-  (metamorph-pipeline-6 {:metamorph/data "hello"
-                         :metamorph/mode :fit}))
+(def fitted-ctx
+  (metamorph-pipeline-6 {:metamorph/data "hello"
+                         :metamorph/mode :fit}))

 ;; This will print `:state "5"` in the terminal, showing that the state from the `:fit` phase is used during the
-;; `:transform` phase.
+;; `:transform` phase.
-(metamorph-pipeline-6
- (merge fitted-ctx
-        {:metamorph/data "world"
-         :metamorph/mode :transform}))
+(metamorph-pipeline-6
+ (merge fitted-ctx
+        {:metamorph/data "world"
+         :metamorph/mode :transform}))

 ;; #### Lifting to create pipeline functions

@@ -287,7 +287,7 @@
    (mm/lift str/reverse)
    (mm/lift first)))

-(metamorph-pipeline-7 {:metamorph/data "hello"})
+(metamorph-pipeline-7 {:metamorph/data "hello"})

 ;; #### Pipelines for machine learning
diff --git a/notebooks/noj_book/ml_basic.clj b/notebooks/noj_book/ml_basic.clj
index d3e8c1c..d5bd6e8 100644
--- a/notebooks/noj_book/ml_basic.clj
+++ b/notebooks/noj_book/ml_basic.clj
@@ -17,12 +17,9 @@
 ^{:kindly/hide-code true
   :kindly/kind :kind/hiccup}
-(->> [
-      [ "Tribuo" "scicloj.ml.tribuo"]
-      [ "Smile" "scicloj.ml.smile"]
+(->> [[ "Tribuo" "scicloj.ml.tribuo"]
       [ "Xgboost4J" "scicloj.ml.xgboost"]
-      [ "scikit-learn" "sklearn-clj"]
-      ]
+      [ "scikit-learn" "sklearn-clj"]]
      (map (fn [[library wrapper]]
             [:tr
              [:td library]
@@ -32,28 +29,28 @@

 ;; These libraries do not have any functions for the models they contain.
-;; `metamorph.ml` has instead of funtcions per model the concept of each model having a
-;; unique `key`, the :model-type , which needs to be given when calling
-;;`metamorph.ml/train`
+;; Instead of functions per model, `metamorph.ml` has the concept of each model having a
+;; unique `key`, the `:model-type`, which needs to be given when calling
+;; `metamorph.ml/train`.
 ;;
-;; The model libraries register their models under these keys, when their main ns
-;; is `require`d. (and the model keys get printed on screen when getting registered)
+;; The model libraries register their models under these keys when their main `ns`
+;; is `require`d (and the model keys get printed on screen when getting registered).
 ;; So we cannot provide cljdoc for the models, as they do not have corresponding functions.
 ;;
-;; Instead we provide in the the last chapters
+;; Instead, we provide in the last chapters of the Noj book a complete list
 ;; of all models (and their keys) incl. the parameters they take with a description.
 ;; For some models this reference documentation contains as well code examples.
 ;; This can be used to browse or search for models and their parameters.

-;; The Tribuo plugin and their models are special in this.
-;; It only contains 2 model types a keys,
-;; namely :scicloj.ml.tribuo/classification and :scicloj.ml.tribuo/regression.
-;; The model as such is encoded in the same ways as the Triuo Java libraries does this,
+;; The [Tribuo](https://tribuo.org/) plugin and its models are special in this aspect.
+;; The `scicloj.ml.tribuo` library only contains 2 model types as keys,
+;; namely `:scicloj.ml.tribuo/classification` and `:scicloj.ml.tribuo/regression`.
+;; The model as such is encoded in the same way as the Tribuo Java library does,
 ;; namely as a map of all Tribuo components in place, of which one is the model,
-;; the so called "Trainer", always needed and having a certin :type, the model class.
+;; the so-called "Trainer", which is always needed and has a certain `:type`, the model class.
 ;;
-;; The reference documentation therefore lists all "Trainer" and their name incl. parameters
-;; It lists as well all other "Configurable" which could be refered to in a component map.
+;; The reference documentation therefore lists all "Trainer"s and their names, incl. parameters.
+;; It also lists all other "Configurable"s which could be referred to in a component map.

 ;; ## Setup
@@ -205,8 +202,8 @@ cat-maps

 ;; Split data into train and test set
 ;;
-;; Now we split the data into train and test. By we use
-;; a `:holdout` strategy, so will get a single split in training an test data.
+;; Now we split the data into train and test. We use
+;; a `:holdout` strategy, so we will get a single split into training and test data.
 ;;
 (def split
   (first
@@ -215,7 +212,7 @@ cat-maps
 split

 ;; ## Train a model
-;; Now its time to train a model:
+;; Now it's time to train a model:

 (require '[scicloj.metamorph.ml :as ml]
          '[scicloj.metamorph.ml.classification]
@@ -225,33 +222,31 @@ split

 ;; ### Dummy model

-;; We start with a dummy model, which simply predicts the majority class
+;; We start with a dummy model, which simply predicts the majority class.

 (def dummy-model (ml/train (:train split)
                            {:model-type :metamorph.ml/dummy-classifier}))

- ;; TODO: Is the dummy model wrong about the majority?
-
 (def dummy-prediction
   (ml/predict (:test split) dummy-model))

 ;; It always predicts a single class, as expected:
 (-> dummy-prediction :survived frequencies)

-;; we can calculate accuracy by using a metric after having converted
-;; the numerical data back to original (important !)
+;; We can calculate accuracy by using a metric after having converted
+;; the numerical data back to the original values (important!).
 ;; We should never compare mapped columns directly.

 (loss/classification-accuracy
  (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
  (:survived (ds-cat/reverse-map-categorical-xforms dummy-prediction)))

-;; It's performance is poor, even worse than coin flip.
+;; Its performance is poor, even worse than a coin flip.
 (kindly/check = 0.3973063973063973)

 ;; ## Logistic regression

-;; Next model to use is Logistic Regression
-(require '[scicloj.ml.tribuo])
+;; The next model to use is logistic regression:
+(require '[scicloj.ml.tribuo])

 (def lreg-model (ml/train (:train split)
@@ -269,10 +264,10 @@ split
  (:survived (ds-cat/reverse-map-categorical-xforms lreg-prediction)))

 (kindly/check = 0.7373737373737373)
-;; Its performance is better, 73 %
+;; Its performance is better, 73 %.
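Accuracy compresses model quality into a single number. When comparing models, it can also help to count which classes get confused with which. Here is a minimal plain-Clojure sketch (the label vectors below are hypothetical, not taken from the dataset above):

```clojure
;; Count (actual, predicted) label pairs using core Clojure only.
(defn confusion-counts [actual predicted]
  (frequencies (map vector actual predicted)))

;; Hypothetical labels, just to show the shape of the result:
(confusion-counts [:yes :yes :no :no] [:yes :no :no :no])
;; => {[:yes :yes] 1, [:yes :no] 1, [:no :no] 2} (entry order may vary)
```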
 ;; ## Random forest
-;; Next is random forest
+;; Next is a random forest:
 (def rf-model (ml/train (:train split) {:model-type :scicloj.ml.tribuo/classification
                                         :tribuo-components [{:name "random-forest"
                                                              :type "org.tribuo.classification.dtree.CARTClassificationTrainer"
@@ -287,8 +282,9 @@ split

 (kind/hidden
  (set-sameish-comparator! 1))

-;; First five prediction including the probability distributions
-;; are
+;; Let us extract the first five predictions
+;; and the probabilities provided by the model.
+
 (-> rf-prediction
     (tc/head)
     (tc/rows))
@@ -301,7 +297,6 @@ split
    [0.0 0.88 0.11]])

-
 (loss/classification-accuracy
  (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
  (:survived (ds-cat/reverse-map-categorical-xforms rf-prediction)))
@@ -309,21 +304,19 @@ split

 (kindly/check = 0.7878787878787878)

-;; best so far, 78 %
-;;
-
-;; TODO: Extract feature importance.
+;; Best so far: 78 %.

-;; # Next steps
+;; ## Next steps
 ;; We could now go further and try to improve the features / the model type
 ;; in order to find the best performing model for the data we have.
 ;; All model types have a range of configurations,
-;; so called hyper-parameters. They can have as well influence on the
+;; so-called hyper-parameters. They also have an influence on the
 ;; model accuracy.
 ;;
 ;; So far we used a single split into 'train' and 'test' data, so we only get
 ;; a point estimate of the accuracy. This should be made more robust
-;; via cross-validations and using different splits of the data.
+;; via [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) and using different splits
+;; of the data.
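The cross-validation idea mentioned above can be sketched in a few lines of plain Clojure. This is illustrative only; the libraries used in this notebook provide real split strategies (such as the `:holdout` split used earlier):

```clojure
;; Split the row indices 0..n-1 into k folds; each fold serves once as
;; the test set while the remaining indices form the training set.
(defn k-fold-indices [n k]
  (let [fold-size (long (Math/ceil (/ n k)))
        folds     (partition-all fold-size (range n))]
    (map (fn [test-fold]
           {:test  (vec test-fold)
            :train (vec (remove (set test-fold) (range n)))})
         folds)))

(count (k-fold-indices 10 5))
;; => 5 train/test splits, each with 2 test and 8 train indices
```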