FIX Update explanation regarding number of trees in GBDT (INRIA#799)
fritshermans authored Jan 29, 2025
1 parent 528917e commit 0ebeac1
Showing 4 changed files with 34 additions and 30 deletions.
15 changes: 8 additions & 7 deletions notebooks/ensemble_ex_03.ipynb
@@ -101,20 +101,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Both gradient boosting and random forest models improve when increasing the\n",
"number of trees in the ensemble. However, the scores reach a plateau where\n",
"adding new trees just makes fitting and scoring slower.\n",
"Random forest models improve when increasing the number of trees in the\n",
"ensemble. However, the scores reach a plateau where adding new trees just\n",
"makes fitting and scoring slower.\n",
"\n",
"To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
"Gradient boosting models overfit when the number of trees is too large. To\n",
"avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
"offers an early-stopping option. Internally, the algorithm uses an\n",
"out-of-sample set to compute the generalization performance of the model at\n",
"each addition of a tree. Thus, if the generalization performance is not\n",
"improving for several iterations, it stops adding trees.\n",
"\n",
"Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
"of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
"that the gradient boosting fitting stops after adding 5 trees that do not\n",
"improve the overall generalization performance."
"of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
"such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
"deterioration of the overall generalization performance."
]
},
{
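The early-stopping mechanism that the updated text describes can be sketched as follows. This is a minimal, illustrative example, not the exercise's own setup: the synthetic `make_regression` dataset and the variable names below are assumptions, and `n_iter_no_change=5` relies on scikit-learn's internal `validation_fraction` split (10% by default).

```python
# Minimal sketch of gradient-boosting early stopping, assuming a synthetic
# regression dataset rather than the dataset used in the exercise.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2_000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators=1_000 is deliberately too large; n_iter_no_change=5 stops the
# fit once 5 consecutive extra trees fail to improve the score measured on an
# internal out-of-sample split (validation_fraction, 10% by default).
gbdt = GradientBoostingRegressor(
    n_estimators=1_000, n_iter_no_change=5, random_state=0
)
gbdt.fit(X_train, y_train)

# The fitted attribute n_estimators_ reports how many trees were actually built.
print(f"Trees built before stopping: {gbdt.n_estimators_}")
print(f"Test R^2: {gbdt.score(X_test, y_test):.3f}")
```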
17 changes: 9 additions & 8 deletions notebooks/ensemble_sol_03.ipynb
@@ -129,20 +129,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Both gradient boosting and random forest models improve when increasing the\n",
"number of trees in the ensemble. However, the scores reach a plateau where\n",
"adding new trees just makes fitting and scoring slower.\n",
"Random forest models improve when increasing the number of trees in the\n",
"ensemble. However, the scores reach a plateau where adding new trees just\n",
"makes fitting and scoring slower.\n",
"\n",
"To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
"Gradient boosting models overfit when the number of trees is too large. To\n",
"avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
"offers an early-stopping option. Internally, the algorithm uses an\n",
"out-of-sample set to compute the generalization performance of the model at\n",
"each addition of a tree. Thus, if the generalization performance is not\n",
"improving for several iterations, it stops adding trees.\n",
"\n",
"Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
"of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
"that the gradient boosting fitting stops after adding 5 trees that do not\n",
"improve the overall generalization performance."
"of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
"such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
"deterioration of the overall generalization performance."
]
},
{
@@ -167,7 +168,7 @@
"source": [
"We see that the number of trees used is far below 1000 with the current\n",
"dataset. Training the gradient boosting model with the entire 1000 trees would\n",
"have been useless."
"have been detrimental."
]
},
{
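The "detrimental" wording in the updated solution can be checked empirically. A short sketch, continuing from the illustrative snippet above (so `X_train`, `X_test`, and `gbdt` are the assumed names from that example):

```python
# Compare early stopping against forcing all 1_000 trees, to observe the
# overfitting that the updated text warns about (exact scores depend on the
# synthetic data; this only illustrates the comparison).
full_gbdt = GradientBoostingRegressor(n_estimators=1_000, random_state=0)
full_gbdt.fit(X_train, y_train)

print(f"Test R^2 with early stopping:  {gbdt.score(X_test, y_test):.3f}")
print(f"Test R^2 with all 1_000 trees: {full_gbdt.score(X_test, y_test):.3f}")
```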
15 changes: 8 additions & 7 deletions python_scripts/ensemble_ex_03.py
@@ -64,20 +64,21 @@
 # Write your code here.
 
 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding a new unnecessary tree, unlike random-forest, gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops after adding 5 trees to avoid
+# deterioration of the overall generalization performance.
 
 # %%
 # Write your code here.
17 changes: 9 additions & 8 deletions python_scripts/ensemble_sol_03.py
@@ -86,20 +86,21 @@
 )
 
 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding a new unnecessary tree, unlike random-forest, gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops after adding 5 trees to avoid
+# deterioration of the overall generalization performance.
 
 # %%
 # solution
@@ -110,7 +111,7 @@
 # %% [markdown] tags=["solution"]
 # We see that the number of trees used is far below 1000 with the current
 # dataset. Training the gradient boosting model with the entire 1000 trees would
-# have been useless.
+# have been detrimental.
 
 # %% [markdown]
 # Estimate the generalization performance of this model again using the
