Merge branch 'scikit-learn:main' into main

virchan authored May 14, 2024
2 parents 6b775a9 + 00db4df commit eaa0e41
Showing 4 changed files with 91 additions and 30 deletions.
72 changes: 48 additions & 24 deletions doc/model_persistence.rst
@@ -80,7 +80,9 @@ persist and plan to serve the model:
- :ref:`ONNX <onnx_persistence>`: You need an `ONNX` runtime and an environment
with appropriate dependencies installed to load the model and use the runtime
to get predictions. This environment can be minimal and does not necessarily
-  even require `python` to be installed.
+  even require Python to be installed to load the model and compute
+  predictions. Also note that `onnxruntime` typically requires much less RAM
+  than Python to compute predictions from small models.

- :mod:`skops.io`, :mod:`pickle`, :mod:`joblib`, `cloudpickle`_: You need a
Python environment with the appropriate dependencies installed to load the
@@ -208,13 +210,20 @@ persist and load your scikit-learn model, and they all follow the same API::

    # Here you can replace pickle with joblib or cloudpickle
    from pickle import dump
-    with open('filename.pkl', 'wb') as f: dump(clf, f)
+    with open("filename.pkl", "wb") as f:
+        dump(clf, f, protocol=5)

Using `protocol=5` is recommended to reduce memory usage and make it faster to
store and load any large NumPy array stored as a fitted attribute in the model.
You can alternatively pass `protocol=pickle.HIGHEST_PROTOCOL` which is
equivalent to `protocol=5` in Python 3.8 and later (at the time of writing).
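
For instance, a quick sanity check of this equivalence (a minimal sketch,
assuming Python 3.8 or later)::

    >>> import pickle
    >>> pickle.HIGHEST_PROTOCOL
    5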

And later when needed, you can load the same object from the persisted file::

    # Here you can replace pickle with joblib or cloudpickle
    from pickle import load
-    with open('filename.pkl', 'rb') as f: clf = load(f)
+    with open("filename.pkl", "rb") as f:
+        clf = load(f)
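
The same pattern works with :mod:`joblib`. As a minimal sketch (assuming the
same fitted `clf`; the `filename.joblib` name is illustrative), passing
`mmap_mode="r"` memory-maps the stored arrays at load time instead of copying
them into memory::

    import joblib

    joblib.dump(clf, "filename.joblib")
    clf_restored = joblib.load("filename.joblib", mmap_mode="r")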

|details-end|

@@ -224,12 +233,14 @@ Security & Maintainability Limitations
--------------------------------------

:mod:`pickle` (and :mod:`joblib` and :mod:`cloudpickle` by extension) has
-many documented security vulnerabilities and should only be used if the
-artifact, i.e. the pickle-file, is coming from a trusted and verified source.
+many documented security vulnerabilities by design and should only be used if
+the artifact, i.e. the pickle-file, is coming from a trusted and verified
+source. You should never load a pickle file from an untrusted source, similarly
+to how you should never execute code from an untrusted source.
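
To make the risk concrete, here is a minimal sketch (illustrative only) of how
a crafted pickle can execute arbitrary code at load time via the `__reduce__`
hook::

    import os
    import pickle

    class Payload:
        def __reduce__(self):
            # The callable returned here runs as soon as the bytes are
            # unpickled, before any type or content check can happen.
            return (os.system, ("echo arbitrary code executed",))

    malicious_bytes = pickle.dumps(Payload())
    # pickle.loads(malicious_bytes) would run the shell command above.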

Also note that arbitrary computations can be represented using the `ONNX`
-format, and therefore a sandbox used to serve models using `ONNX` also needs to
-safeguard against computational and memory exploits.
+format, and it is therefore recommended to serve models using `ONNX` in a
+sandboxed environment to safeguard against computational and memory exploits.

Also note that there are no supported ways to load a model trained with a
different version of scikit-learn. While using :mod:`skops.io`, :mod:`joblib`,
@@ -298,7 +309,8 @@ can be caught to obtain the original version the estimator was pickled with::
    warnings.simplefilter("error", InconsistentVersionWarning)

    try:
-        est = pickle.loads("model_from_previous_version.pickle")
+        with open("model_from_previous_version.pickle", "rb") as f:
+            est = pickle.load(f)
    except InconsistentVersionWarning as w:
        print(w.original_sklearn_version)

@@ -328,22 +340,34 @@ each approach can be summarized as follows:
* :mod:`skops.io`: Trained scikit-learn models can be easily shared and put
into production using :mod:`skops.io`. It is more secure compared to
alternate approaches based on :mod:`pickle` because it does not load
-  arbitrary code unless explicitly asked for by the user.
+  arbitrary code unless explicitly asked for by the user. Such code needs to be
+  packaged and importable in the target Python environment.
* :mod:`joblib`: Efficient memory mapping techniques make it faster when using
-  the same persisted model in multiple Python processes. It also gives easy
-  shortcuts to compress and decompress the persisted object without the need
-  for extra code. However, it may trigger the execution of malicious code while
-  loading untrusted data as any other pickle-based persistence mechanism.
-* :mod:`pickle`: It is native to Python and any Python object can be serialized
-  and deserialized using :mod:`pickle`, including custom Python classes and
-  objects. While :mod:`pickle` can be used to easily save and load scikit-learn
-  models, it may trigger the execution of malicious code while loading
-  untrusted data.
-* `cloudpickle`_: It is slower than :mod:`pickle` and :mod:`joblib`, and is
-  more insecure than :mod:`pickle` and :mod:`joblib` since it can serialize
-  arbitrary code. However, in certain cases it might be a last resort to
-  persist certain models. Note that this is discouraged by `cloudpickle`_
-  itself since there are no forward compatibility guarantees and you might need
-  the same version of `cloudpickle`_ to load the persisted model.
+  the same persisted model in multiple Python processes when using
+  `mmap_mode="r"`. It also gives easy shortcuts to compress and decompress the
+  persisted object without the need for extra code. However, it may trigger the
+  execution of malicious code when loading a model from an untrusted source, as
+  with any other pickle-based persistence mechanism.
+* :mod:`pickle`: It is native to Python, and most Python objects can be
+  serialized and deserialized with it, including custom Python classes and
+  functions, as long as they are defined in a package that can be imported in
+  the target environment. While :mod:`pickle` can be used to easily save and
+  load scikit-learn models, it may trigger the execution of malicious code
+  while loading a model from an untrusted source. :mod:`pickle` can also be
+  very efficient memory-wise if the model was persisted with `protocol=5`, but
+  it does not support memory mapping.
+* `cloudpickle`_: It has loading efficiency comparable to :mod:`pickle` and
+  :mod:`joblib` (without memory mapping), but offers additional flexibility to
+  serialize custom Python code such as lambda expressions and interactively
+  defined functions and classes. It might be a last resort to persist pipelines
+  with custom Python components, such as a
+  :class:`sklearn.preprocessing.FunctionTransformer` wrapping a function
+  defined in the training script itself or, more generally, outside of any
+  importable Python package; see the sketch after this list. Note that
+  `cloudpickle`_ offers no forward compatibility guarantees, so you might need
+  the same version of `cloudpickle`_ to load the persisted model, along with
+  the same versions of all the libraries used to define it. As with the other
+  pickle-based persistence mechanisms, it may trigger the execution of
+  malicious code while loading a model from an untrusted source.

.. _cloudpickle: https://github.com/cloudpipe/cloudpickle
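
As a minimal sketch of that last-resort use case (`log1p_abs` is a hypothetical
helper defined in the training script itself, outside of any importable
package)::

    import cloudpickle
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer

    def log1p_abs(X):
        # Defined in the script itself: plain pickle would serialize it by
        # reference and fail to load it in another process; cloudpickle
        # serializes it by value.
        return np.log1p(np.abs(X))

    pipe = make_pipeline(FunctionTransformer(log1p_abs), LogisticRegression())
    payload = cloudpickle.dumps(pipe)
    pipe_restored = cloudpickle.loads(payload)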
37 changes: 37 additions & 0 deletions doc/modules/ensemble.rst
@@ -1247,6 +1247,43 @@ estimation.
representations of feature space; these approaches also focus on
dimensionality reduction.

.. _tree_ensemble_warm_start:

Fitting additional trees
------------------------

RandomForest, Extra-Trees and :class:`RandomTreesEmbedding` estimators all support
``warm_start=True``, which allows you to add more trees to an already fitted model.

::

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.ensemble import RandomForestClassifier

    >>> X, y = make_classification(n_samples=100, random_state=1)
    >>> clf = RandomForestClassifier(n_estimators=10, random_state=1)
    >>> clf = clf.fit(X, y)  # fit with 10 trees
    >>> len(clf.estimators_)
    10
    >>> # set warm_start and increase the number of estimators
    >>> _ = clf.set_params(n_estimators=20, warm_start=True)
    >>> _ = clf.fit(X, y)  # fit 10 additional trees
    >>> len(clf.estimators_)
    20

When ``random_state`` is set, the internal random state is also preserved
between ``fit`` calls. This means that training a model once with ``n`` estimators
is the same as building the model iteratively via multiple ``fit`` calls, where
the final number of estimators is equal to ``n``.

::

    >>> clf = RandomForestClassifier(n_estimators=20, random_state=1)  # set `n_estimators` to 10 + 10
    >>> _ = clf.fit(X, y)  # the fitted `estimators_` will be the same as `clf` above

Note that this differs from the usual behavior of :term:`random_state`: because
the internal random state is carried over between ``fit`` calls rather than
re-seeded, repeated calls to ``fit`` do *not* produce identical results.

.. _bagging:

Bagging meta-estimator
2 changes: 1 addition & 1 deletion meson.build
@@ -6,7 +6,7 @@ project(
meson_version: '>= 1.1.0',
default_options: [
'buildtype=debugoptimized',
-    'c_std=c17',
+    'c_std=c11',
'cpp_std=c++14',
],
)
10 changes: 5 additions & 5 deletions sklearn/ensemble/_forest.py
@@ -1308,7 +1308,7 @@ class RandomForestClassifier(ForestClassifier):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, \
default=None
@@ -1710,7 +1710,7 @@ class RandomForestRegressor(ForestRegressor):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The
@@ -2049,7 +2049,7 @@ class ExtraTreesClassifier(ForestClassifier):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, \
default=None
@@ -2434,7 +2434,7 @@ class ExtraTreesRegressor(ForestRegressor):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The
@@ -2727,7 +2727,7 @@ class RandomTreesEmbedding(TransformerMixin, BaseForest):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
Attributes
----------
