FIX Several small fixes (#780)
ArturoAmorQ authored Nov 25, 2024
1 parent b2e0a03 commit 29065fa
Showing 6 changed files with 23 additions and 14 deletions.
15 changes: 9 additions & 6 deletions python_scripts/01_tabular_data_exploration.py
Original file line number Diff line number Diff line change
Expand Up @@ -360,9 +360,12 @@
# We made important observations (which will be discussed later in more detail):
#
# * if your target variable is imbalanced (e.g., you have more samples from one
# target category than another), you may need special techniques for training
# and evaluating your machine learning model;
# * having redundant (or highly correlated) columns can be a problem for some
# machine learning algorithms;
# * contrary to decision tree, linear models can only capture linear
# interactions, so be aware of non-linear relationships in your data.
# target category than another), you may need to be careful when interpreting
# the values of performance metrics;
# * columns can be redundant (or highly correlated), which is not necessarily a
# problem, but may require special treatment as we will cover in future
# notebooks;
# * decision trees create prediction rules by comparing each feature to a
# threshold value, resulting in decision boundaries that are always parallel
# to the axes. In 2D, this means the boundaries are vertical or horizontal
# line segments at the feature threshold values.
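The axis-parallel behavior described in the rewritten bullet can be sketched with a minimal, self-contained example (synthetic data, not the notebook's dataset; the feature names are illustrative assumptions):

```python
# Sketch: each decision tree rule compares one feature to a threshold,
# so 2D decision boundaries are vertical or horizontal segments.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))       # two hypothetical features in [0, 1]
y = (X[:, 0] > 0.5).astype(int)      # class depends on a single-feature threshold

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Every rule printed below has the form "feature <= threshold", which is
# why the resulting boundary is a vertical line near x0 = 0.5.
print(export_text(tree, feature_names=["feature_0", "feature_1"]))
```

Since the target here is a pure single-feature threshold, the fitted root split lands close to 0.5 and the tree classifies the training data perfectly.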
12 changes: 8 additions & 4 deletions python_scripts/cross_validation_learning_curve.py
Expand Up @@ -102,10 +102,14 @@
# benefit to adding samples anymore or assessing the potential gain of adding
# more samples into the training set.
#
# If we achieve a plateau and adding new samples in the training set does not
# reduce the testing error, we might have reached the Bayes error rate using the
# available model. Using a more complex model might be the only possibility to
# reduce the testing error further.
# If the testing error plateaus despite adding more training samples, it's
# possible that the model has achieved its optimal performance. In this case,
# using a more expressive model might help reduce the error further. Otherwise,
# the error may have reached the Bayes error rate: the theoretical minimum
# error due to inherent uncertainty that the available data cannot resolve.
# This minimum error is non-zero whenever some of the variation of the target
# variable `y` depends on external factors not fully observed in the features
# available in `X`, which is almost always the case in practice.
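A plateau like the one discussed above can be inspected with scikit-learn's `learning_curve` helper. This is a hedged sketch on a synthetic regression task (the dataset and estimator are illustrative assumptions, not the notebook's):

```python
# Sketch: watch whether the test score keeps improving as training size grows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=1_000, n_features=5, noise=10.0, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# If the mean test score stops improving as train_sizes grows, adding samples
# alone is unlikely to help: either try a more expressive model, or accept
# that the remaining error may be irreducible (Bayes error) given `X`.
print(train_sizes)
print(test_scores.mean(axis=1))
```

In practice one would plot these curves (as the notebook does) rather than print them; a flat tail in the test-score curve is the plateau the text refers to.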
#
# ## Summary
#
Expand Down
Expand Up @@ -331,7 +331,7 @@ def plot_decision_boundary(model, title=None):
# from the previous models: its decision boundary can take a diagonal
# direction. Furthermore, we can observe that predictions are very confident in
# the low density regions of the feature space, even very close to the decision
# boundary
# boundary.
#
# We can obtain very similar results by using a kernel approximation technique
# such as the Nyström method with a polynomial kernel:
Expand Down
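The notebook's own Nyström code is collapsed in this diff view; as a stand-in, here is a minimal, self-contained sketch of the technique the sentence names, on synthetic data (the dataset, hyperparameters, and downstream classifier are assumptions for illustration):

```python
# Sketch: Nystroem approximation of a polynomial kernel feeding a linear model,
# which lets the linear classifier draw curved (non axis-parallel) boundaries.
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

model = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="poly", degree=3, n_components=100, random_state=0),
    LogisticRegression(),
)
model.fit(X, y)
print(f"train accuracy: {model.score(X, y):.2f}")
```

The design point is that the kernel approximation produces an explicit feature map, so the final estimator stays linear and cheap while the overall decision boundary is non-linear.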
2 changes: 1 addition & 1 deletion python_scripts/logistic_regression.py
Expand Up @@ -151,7 +151,7 @@
# by name or position. In the code above `logistic_regression[-1]` means the
# last step of the pipeline. Then you can access the attributes of that step such
# as `coef_`. Notice also that the `coef_` attribute is an array of shape (1,
# `n_features`) an then we access it via its first entry. Alternatively one
# `n_features`) and then we access it via its first entry. Alternatively one
# could use `coef_.ravel()`.
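The indexing-and-`coef_` pattern described above can be reproduced in a self-contained sketch (synthetic data; the pipeline name `logistic_regression` mirrors the notebook's, the rest is assumed for illustration):

```python
# Sketch: access the last pipeline step by position, then read its coef_.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
logistic_regression.fit(X, y)

coefs = logistic_regression[-1].coef_   # last step; shape (1, n_features)
print(coefs.shape)                      # (1, 4) for this binary problem
print(coefs[0])                         # first (only) row of the array...
print(coefs.ravel())                    # ...or equivalently, flatten it
```

Both `coefs[0]` and `coefs.ravel()` yield the same 1D array of per-feature weights, ready to plot as a barplot.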
#
# We are now ready to visualize the weight values as a barplot:
Expand Down
4 changes: 3 additions & 1 deletion python_scripts/metrics_classification.py
Expand Up @@ -347,7 +347,9 @@
# of the positive class).

# %%
prevalence = target_test.value_counts()[1] / target_test.value_counts().sum()
prevalence = (
target_test.value_counts()["donated"] / target_test.value_counts().sum()
)
print(f"Prevalence of the class 'donated': {prevalence:.2f}")

# %% [markdown]
Expand Down
2 changes: 1 addition & 1 deletion python_scripts/parameter_tuning_sol_03.py
Expand Up @@ -153,7 +153,7 @@
# holding on any axis of the parallel coordinate plot. You can then slide (move)
# the range selection and cross two selections to see the intersections.
#
# Selecting the best performing models (i.e. above an accuracy of ~0.68), we
# Selecting the best performing models (i.e. above an R2 score of ~0.68), we
# observe that **in this case**:
#
# - scaling the data is important. All the best performing models use scaled
Expand Down
