FIX Update explanation regarding number of trees in GBDT (INRIA#799)
fritshermans authored Jan 29, 2025
1 parent 528917e commit 0ebeac1
Showing 4 changed files with 34 additions and 30 deletions.
15 changes: 8 additions & 7 deletions notebooks/ensemble_ex_03.ipynb
@@ -101,20 +101,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Both gradient boosting and random forest models improve when increasing the\n",
"number of trees in the ensemble. However, the scores reach a plateau where\n",
"adding new trees just makes fitting and scoring slower.\n",
"Random forest models improve when increasing the number of trees in the\n",
"ensemble. However, the scores reach a plateau where adding new trees just\n",
"makes fitting and scoring slower.\n",
"\n",
"To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
"Gradient boosting models overfit when the number of trees is too large. To\n",
"avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
"offers an early-stopping option. Internally, the algorithm uses an\n",
"out-of-sample set to compute the generalization performance of the model at\n",
"each addition of a tree. Thus, if the generalization performance is not\n",
"improving for several iterations, it stops adding trees.\n",
"\n",
"Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
"of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
"that the gradient boosting fitting stops after adding 5 trees that do not\n",
"improve the overall generalization performance."
"of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
"such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
"deterioration of the overall generalization performance."
]
},
{
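The early-stopping mechanism that the updated text describes can be sketched as follows. This is a minimal, illustrative example, not the exercise's own setup: the synthetic `make_regression` dataset and the variable names below are assumptions, and `n_iter_no_change=5` relies on scikit-learn's internal `validation_fraction` split (10% by default).

```python
# Minimal sketch of gradient-boosting early stopping, assuming a synthetic
# regression dataset rather than the dataset used in the exercise.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2_000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators=1_000 is deliberately too large; n_iter_no_change=5 stops the
# fit once 5 consecutive extra trees fail to improve the score measured on an
# internal out-of-sample split (validation_fraction, 10% by default).
gbdt = GradientBoostingRegressor(
    n_estimators=1_000, n_iter_no_change=5, random_state=0
)
gbdt.fit(X_train, y_train)

# The fitted attribute n_estimators_ reports how many trees were actually built.
print(f"Trees built before stopping: {gbdt.n_estimators_}")
print(f"Test R^2: {gbdt.score(X_test, y_test):.3f}")
```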
17 changes: 9 additions & 8 deletions notebooks/ensemble_sol_03.ipynb
@@ -129,20 +129,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Both gradient boosting and random forest models improve when increasing the\n",
"number of trees in the ensemble. However, the scores reach a plateau where\n",
"adding new trees just makes fitting and scoring slower.\n",
"Random forest models improve when increasing the number of trees in the\n",
"ensemble. However, the scores reach a plateau where adding new trees just\n",
"makes fitting and scoring slower.\n",
"\n",
"To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
"Gradient boosting models overfit when the number of trees is too large. To\n",
"avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
"offers an early-stopping option. Internally, the algorithm uses an\n",
"out-of-sample set to compute the generalization performance of the model at\n",
"each addition of a tree. Thus, if the generalization performance is not\n",
"improving for several iterations, it stops adding trees.\n",
"\n",
"Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
"of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
"that the gradient boosting fitting stops after adding 5 trees that do not\n",
"improve the overall generalization performance."
"of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
"such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
"deterioration of the overall generalization performance."
]
},
{
@@ -167,7 +168,7 @@
"source": [
"We see that the number of trees used is far below 1000 with the current\n",
"dataset. Training the gradient boosting model with the entire 1000 trees would\n",
"have been useless."
"have been detrimental."
]
},
{
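The "detrimental" wording in the updated solution can be checked empirically. A short sketch, continuing from the illustrative snippet above (so `X_train`, `X_test`, and `gbdt` are the assumed names from that example):

```python
# Compare early stopping against forcing all 1_000 trees, to observe the
# overfitting that the updated text warns about (exact scores depend on the
# synthetic data; this only illustrates the comparison).
full_gbdt = GradientBoostingRegressor(n_estimators=1_000, random_state=0)
full_gbdt.fit(X_train, y_train)

print(f"Test R^2 with early stopping:  {gbdt.score(X_test, y_test):.3f}")
print(f"Test R^2 with all 1_000 trees: {full_gbdt.score(X_test, y_test):.3f}")
```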
15 changes: 8 additions & 7 deletions python_scripts/ensemble_ex_03.py
@@ -64,20 +64,21 @@
 # Write your code here.
 
 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding a new unnecessary tree, unlike random-forest, gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops after adding 5 trees to avoid
+# deterioration of the overall generalization performance.
 
 # %%
 # Write your code here.
17 changes: 9 additions & 8 deletions python_scripts/ensemble_sol_03.py
@@ -86,20 +86,21 @@
 )
 
 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding a new unnecessary tree, unlike random-forest, gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops after adding 5 trees to avoid
+# deterioration of the overall generalization performance.
 
 # %%
 # solution
@@ -110,7 +111,7 @@
 # %% [markdown] tags=["solution"]
 # We see that the number of trees used is far below 1000 with the current
 # dataset. Training the gradient boosting model with the entire 1000 trees would
-# have been useless.
+# have been detrimental.
 
 # %% [markdown]
 # Estimate the generalization performance of this model again using the
