Commit

Update notebooks

ArturoAmorQ committed Nov 25, 2024
1 parent 29065fa commit 95e0a48
Showing 6 changed files with 23 additions and 14 deletions.
15 changes: 9 additions & 6 deletions notebooks/01_tabular_data_exploration.ipynb
@@ -540,12 +540,15 @@
  "We made important observations (which will be discussed later in more detail):\n",
  "\n",
  "* if your target variable is imbalanced (e.g., you have more samples from one\n",
- " target category than another), you may need special techniques for training\n",
- " and evaluating your machine learning model;\n",
- "* having redundant (or highly correlated) columns can be a problem for some\n",
- " machine learning algorithms;\n",
- "* contrary to decision tree, linear models can only capture linear\n",
- " interactions, so be aware of non-linear relationships in your data."
+ " target category than another), you may need to be careful when interpreting\n",
+ " the values of performance metrics;\n",
+ "* columns can be redundant (or highly correlated), which is not necessarily a\n",
+ " problem, but may require special treatment as we will cover in future\n",
+ " notebooks;\n",
+ "* decision trees create prediction rules by comparing each feature to a\n",
+ " threshold value, resulting in decision boundaries that are always parallel\n",
+ " to the axes. In 2D, this means the boundaries are vertical or horizontal\n",
+ " line segments at the feature threshold values."
  ]
 }
],
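The rewritten bullet on decision trees is easy to verify with a few lines of code. Here is a minimal sketch (not part of this commit; the dataset and feature names are illustrative) showing that every rule a fitted tree learns is a single-feature threshold comparison, which is what makes its decision boundaries axis-parallel:

```python
# Sketch, not from the commit: a decision tree only ever splits on
# "feature <= threshold", so its 2D decision boundary consists of
# vertical and horizontal segments. Dataset and names are illustrative.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Every printed rule compares one feature to one threshold value,
# exactly as the updated notebook text describes.
print(export_text(tree, feature_names=["feature_0", "feature_1"]))
```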
12 changes: 8 additions & 4 deletions notebooks/cross_validation_learning_curve.ipynb
@@ -151,10 +151,14 @@
  "benefit to adding samples anymore or assessing the potential gain of adding\n",
  "more samples into the training set.\n",
  "\n",
- "If we achieve a plateau and adding new samples in the training set does not\n",
- "reduce the testing error, we might have reached the Bayes error rate using the\n",
- "available model. Using a more complex model might be the only possibility to\n",
- "reduce the testing error further.\n",
+ "If the testing error plateaus despite adding more training samples, it's\n",
+ "possible that the model has achieved its optimal performance. In this case,\n",
+ "using a more expressive model might help reduce the error further. Otherwise,\n",
+ "the error may have reached the Bayes error rate, the theoretical minimum error\n",
+ "due to inherent uncertainty not resolved by the available data. This minimum\n",
+ "error is non-zero whenever some of the variation of the target variable `y`\n",
+ "depends on external factors not fully observed in the features available in\n",
+ "`X`, which is almost always the case in practice.\n",
  "\n",
  "## Summary\n",
  "\n",
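As a side note, the plateau described in the new text can be inspected directly with scikit-learn's `learning_curve` helper. A rough sketch (not part of this commit; the estimator and synthetic dataset are illustrative):

```python
# Sketch, not from the commit: check whether the test score still improves
# as the training set grows. Estimator and data are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=1_000, n_features=20, noise=10.0, random_state=0)
train_sizes, _, test_scores = learning_curve(
    Ridge(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)

# If the mean test score stops improving as train_sizes grows, the learning
# curve has plateaued for this model; a more expressive model may or may not
# lower the error further, as the notebook now explains.
print(dict(zip(train_sizes, test_scores.mean(axis=1).round(3))))
```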
@@ -452,7 +452,7 @@
  "from the previous models: its decision boundary can take a diagonal\n",
  "direction. Furthermore, we can observe that predictions are very confident in\n",
  "the low density regions of the feature space, even very close to the decision\n",
- "boundary\n",
+ "boundary.\n",
  "\n",
  "We can obtain very similar results by using a kernel approximation technique\n",
  "such as the Nystr\u00f6m method with a polynomial kernel:"
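The Nyström remark in the context lines corresponds to a pipeline along the following lines. This is a hedged sketch, not the notebook's actual cell; all hyperparameter values and the toy dataset are illustrative:

```python
# Sketch, not the notebook's exact code: approximate a polynomial kernel with
# the Nystroem method, then fit a linear classifier on the expanded features.
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
model = make_pipeline(
    Nystroem(kernel="poly", degree=3, n_components=100, random_state=0),
    LogisticRegression(),
)
model.fit(X, y)

# A linear boundary in the approximated kernel space maps back to a curved
# boundary (one that can take a diagonal direction) in the original space.
print(f"train accuracy: {model.score(X, y):.2f}")
```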
2 changes: 1 addition & 1 deletion notebooks/logistic_regression.ipynb
@@ -214,7 +214,7 @@
  "by name or position. In the code above `logistic_regression[-1]` means the\n",
  "last step of the pipeline. Then you can access the attributes of that step such\n",
  "as `coef_`. Notice also that the `coef_` attribute is an array of shape (1,\n",
- "`n_features`) an then we access it via its first entry. Alternatively one\n",
+ "`n_features`) and then we access it via its first entry. Alternatively one\n",
  "could use `coef_.ravel()`.\n",
  "\n",
  "We are now ready to visualize the weight values as a barplot:"
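For readers skimming the diff, the fixed sentence describes this access pattern. A self-contained sketch (not part of the commit; the toy data and column names are made up for illustration):

```python
# Sketch, not from the commit: read the coefficients of the last pipeline step.
# The toy data and column names below are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({"age": [25, 32, 47, 51], "hours": [40, 50, 38, 60]})
target = pd.Series([0, 0, 1, 1])

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
logistic_regression.fit(data, target)

# coef_ has shape (1, n_features) for binary classification, so we take its
# first entry; coef_.ravel() would flatten it just as well.
coefs = logistic_regression[-1].coef_[0]
print(pd.Series(coefs, index=data.columns))
```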
4 changes: 3 additions & 1 deletion notebooks/metrics_classification.ipynb
@@ -556,7 +556,9 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "prevalence = target_test.value_counts()[1] / target_test.value_counts().sum()\n",
+ "prevalence = (\n",
+ "    target_test.value_counts()[\"donated\"] / target_test.value_counts().sum()\n",
+ ")\n",
  "print(f\"Prevalence of the class 'donated': {prevalence:.2f}\")"
  ]
 },
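The new cell indexes `value_counts()` by label rather than by position, which is more robust. Note that `value_counts(normalize=True)` gives the same result without the manual division; a quick sketch (not part of the commit; the stand-in Series is illustrative):

```python
import pandas as pd

# Illustrative stand-in for the notebook's target_test Series.
target_test = pd.Series(
    ["donated", "not donated", "not donated", "donated", "not donated"]
)

# value_counts(normalize=True) returns relative frequencies directly,
# so no division by the total count is needed.
prevalence = target_test.value_counts(normalize=True)["donated"]
print(f"Prevalence of the class 'donated': {prevalence:.2f}")
```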
2 changes: 1 addition & 1 deletion notebooks/parameter_tuning_sol_03.ipynb
@@ -249,7 +249,7 @@
  "holding on any axis of the parallel coordinate plot. You can then slide (move)\n",
  "the range selection and cross two selections to see the intersections.\n",
  "\n",
- "Selecting the best performing models (i.e. above an accuracy of ~0.68), we\n",
+ "Selecting the best performing models (i.e. above an R2 score of ~0.68), we\n",
  "observe that **in this case**:\n",
  "\n",
  "- scaling the data is important. All the best performing models use scaled\n",
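The parallel coordinates plot mentioned in the context lines can be produced with plotly; a minimal sketch (not the notebook's actual cell; the hyperparameter columns and values are made up for illustration):

```python
# Sketch, not from the commit: a parallel coordinates plot over hypothetical
# hyperparameter search results. Column names and values are illustrative.
import pandas as pd
import plotly.express as px

cv_results = pd.DataFrame(
    {
        "learning_rate": [0.01, 0.1, 1.0, 0.1],
        "max_leaf_nodes": [10, 30, 100, 30],
        "mean_test_score": [0.55, 0.70, 0.62, 0.68],
    }
)

# Dragging a range selection on the score axis (e.g. above ~0.68) highlights
# the hyperparameter values shared by the best performing models.
fig = px.parallel_coordinates(cv_results, color="mean_test_score")
fig.show()
```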
