Commit

Update notebooks

ArturoAmorQ committed Nov 25, 2024
1 parent 29065fa commit 95e0a48
Showing 6 changed files with 23 additions and 14 deletions.
15 changes: 9 additions & 6 deletions notebooks/01_tabular_data_exploration.ipynb
@@ -540,12 +540,15 @@
  "We made important observations (which will be discussed later in more detail):\n",
  "\n",
  "* if your target variable is imbalanced (e.g., you have more samples from one\n",
- " target category than another), you may need special techniques for training\n",
- " and evaluating your machine learning model;\n",
- "* having redundant (or highly correlated) columns can be a problem for some\n",
- " machine learning algorithms;\n",
- "* contrary to decision tree, linear models can only capture linear\n",
- " interactions, so be aware of non-linear relationships in your data."
+ " target category than another), you may need to be careful when interpreting\n",
+ " the values of performance metrics;\n",
+ "* columns can be redundant (or highly correlated), which is not necessarily a\n",
+ " problem, but may require special treatment as we will cover in future\n",
+ " notebooks;\n",
+ "* decision trees create prediction rules by comparing each feature to a\n",
+ " threshold value, resulting in decision boundaries that are always parallel\n",
+ " to the axes. In 2D, this means the boundaries are vertical or horizontal\n",
+ " line segments at the feature threshold values."
  ]
 }
],
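The rewritten bullet on decision trees is easy to verify with a few lines of code. Here is a minimal sketch (not part of this commit; the dataset and feature names are illustrative) showing that every rule a fitted tree learns is a single-feature threshold comparison, which is what makes its decision boundaries axis-parallel:

```python
# Sketch, not from the commit: a decision tree only ever splits on
# "feature <= threshold", so its 2D decision boundary consists of
# vertical and horizontal segments. Dataset and names are illustrative.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Every printed rule compares one feature to one threshold value,
# exactly as the updated notebook text describes.
print(export_text(tree, feature_names=["feature_0", "feature_1"]))
```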
12 changes: 8 additions & 4 deletions notebooks/cross_validation_learning_curve.ipynb
@@ -151,10 +151,14 @@
  "benefit to adding samples anymore or assessing the potential gain of adding\n",
  "more samples into the training set.\n",
  "\n",
- "If we achieve a plateau and adding new samples in the training set does not\n",
- "reduce the testing error, we might have reached the Bayes error rate using the\n",
- "available model. Using a more complex model might be the only possibility to\n",
- "reduce the testing error further.\n",
+ "If the testing error plateaus despite adding more training samples, it's\n",
+ "possible that the model has achieved its optimal performance. In this case,\n",
+ "using a more expressive model might help reduce the error further. Otherwise,\n",
+ "the error may have reached the Bayes error rate, the theoretical minimum error\n",
+ "due to inherent uncertainty not resolved by the available data. This minimum\n",
+ "error is non-zero whenever some of the variation of the target variable `y`\n",
+ "depends on external factors not fully observed in the features available in\n",
+ "`X`, which is almost always the case in practice.\n",
  "\n",
  "## Summary\n",
  "\n",
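As a side note, the plateau described in the new text can be inspected directly with scikit-learn's `learning_curve` helper. A rough sketch (not part of this commit; the estimator and synthetic dataset are illustrative):

```python
# Sketch, not from the commit: check whether the test score still improves
# as the training set grows. Estimator and data are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=1_000, n_features=20, noise=10.0, random_state=0)
train_sizes, _, test_scores = learning_curve(
    Ridge(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)

# If the mean test score stops improving as train_sizes grows, the learning
# curve has plateaued for this model; a more expressive model may or may not
# lower the error further, as the notebook now explains.
print(dict(zip(train_sizes, test_scores.mean(axis=1).round(3))))
```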
@@ -452,7 +452,7 @@
  "from the previous models: its decision boundary can take a diagonal\n",
  "direction. Furthermore, we can observe that predictions are very confident in\n",
  "the low density regions of the feature space, even very close to the decision\n",
- "boundary\n",
+ "boundary.\n",
  "\n",
  "We can obtain very similar results by using a kernel approximation technique\n",
  "such as the Nystr\u00f6m method with a polynomial kernel:"
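The Nyström remark in the context lines corresponds to a pipeline along the following lines. This is a hedged sketch, not the notebook's actual cell; all hyperparameter values and the toy dataset are illustrative:

```python
# Sketch, not the notebook's exact code: approximate a polynomial kernel with
# the Nystroem method, then fit a linear classifier on the expanded features.
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
model = make_pipeline(
    Nystroem(kernel="poly", degree=3, n_components=100, random_state=0),
    LogisticRegression(),
)
model.fit(X, y)

# A linear boundary in the approximated kernel space maps back to a curved
# boundary (one that can take a diagonal direction) in the original space.
print(f"train accuracy: {model.score(X, y):.2f}")
```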
2 changes: 1 addition & 1 deletion notebooks/logistic_regression.ipynb
@@ -214,7 +214,7 @@
  "by name or position. In the code above `logistic_regression[-1]` means the\n",
  "last step of the pipeline. Then you can access the attributes of that step such\n",
  "as `coef_`. Notice also that the `coef_` attribute is an array of shape (1,\n",
- "`n_features`) an then we access it via its first entry. Alternatively one\n",
+ "`n_features`) and then we access it via its first entry. Alternatively one\n",
  "could use `coef_.ravel()`.\n",
  "\n",
  "We are now ready to visualize the weight values as a barplot:"
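For readers skimming the diff, the fixed sentence describes this access pattern. A self-contained sketch (not part of the commit; the toy data and column names are made up for illustration):

```python
# Sketch, not from the commit: read the coefficients of the last pipeline step.
# The toy data and column names below are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({"age": [25, 32, 47, 51], "hours": [40, 50, 38, 60]})
target = pd.Series([0, 0, 1, 1])

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
logistic_regression.fit(data, target)

# coef_ has shape (1, n_features) for binary classification, so we take its
# first entry; coef_.ravel() would flatten it just as well.
coefs = logistic_regression[-1].coef_[0]
print(pd.Series(coefs, index=data.columns))
```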
4 changes: 3 additions & 1 deletion notebooks/metrics_classification.ipynb
@@ -556,7 +556,9 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "prevalence = target_test.value_counts()[1] / target_test.value_counts().sum()\n",
+ "prevalence = (\n",
+ "    target_test.value_counts()[\"donated\"] / target_test.value_counts().sum()\n",
+ ")\n",
  "print(f\"Prevalence of the class 'donated': {prevalence:.2f}\")"
  ]
 },
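The new cell indexes `value_counts()` by label rather than by position, which is more robust. Note that `value_counts(normalize=True)` gives the same result without the manual division; a quick sketch (not part of the commit; the stand-in Series is illustrative):

```python
import pandas as pd

# Illustrative stand-in for the notebook's target_test Series.
target_test = pd.Series(
    ["donated", "not donated", "not donated", "donated", "not donated"]
)

# value_counts(normalize=True) returns relative frequencies directly,
# so no division by the total count is needed.
prevalence = target_test.value_counts(normalize=True)["donated"]
print(f"Prevalence of the class 'donated': {prevalence:.2f}")
```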
2 changes: 1 addition & 1 deletion notebooks/parameter_tuning_sol_03.ipynb
@@ -249,7 +249,7 @@
  "holding on any axis of the parallel coordinate plot. You can then slide (move)\n",
  "the range selection and cross two selections to see the intersections.\n",
  "\n",
- "Selecting the best performing models (i.e. above an accuracy of ~0.68), we\n",
+ "Selecting the best performing models (i.e. above an R2 score of ~0.68), we\n",
  "observe that **in this case**:\n",
  "\n",
  "- scaling the data is important. All the best performing models use scaled\n",
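The parallel coordinates plot mentioned in the context lines can be produced with plotly; a minimal sketch (not the notebook's actual cell; the hyperparameter columns and values are made up for illustration):

```python
# Sketch, not from the commit: a parallel coordinates plot over hypothetical
# hyperparameter search results. Column names and values are illustrative.
import pandas as pd
import plotly.express as px

cv_results = pd.DataFrame(
    {
        "learning_rate": [0.01, 0.1, 1.0, 0.1],
        "max_leaf_nodes": [10, 30, 100, 30],
        "mean_test_score": [0.55, 0.70, 0.62, 0.68],
    }
)

# Dragging a range selection on the score axis (e.g. above ~0.68) highlights
# the hyperparameter values shared by the best performing models.
fig = px.parallel_coordinates(cv_results, color="mean_test_score")
fig.show()
```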
