Merge branch 'scikit-learn:main' into main

virchan authored May 14, 2024
2 parents 6b775a9 + 00db4df commit eaa0e41
Showing 4 changed files with 91 additions and 30 deletions.
72 changes: 48 additions & 24 deletions doc/model_persistence.rst
@@ -80,7 +80,9 @@ persist and plan to serve the model:
- :ref:`ONNX <onnx_persistence>`: You need an `ONNX` runtime and an environment
with appropriate dependencies installed to load the model and use the runtime
to get predictions. This environment can be minimal and does not necessarily
-  even require `python` to be installed.
+  even require Python to be installed to load the model and compute
+  predictions. Also note that `onnxruntime` typically requires much less RAM
+  than Python to compute predictions from small models.

- :mod:`skops.io`, :mod:`pickle`, :mod:`joblib`, `cloudpickle`_: You need a
Python environment with the appropriate dependencies installed to load the
@@ -208,13 +210,20 @@ persist and load your scikit-learn model, and they all follow the same API::

    # Here you can replace pickle with joblib or cloudpickle
    from pickle import dump
-    with open('filename.pkl', 'wb') as f: dump(clf, f)
+    with open("filename.pkl", "wb") as f:
+        dump(clf, f, protocol=5)

Using `protocol=5` is recommended to reduce memory usage and make it faster to
store and load any large NumPy array stored as a fitted attribute in the model.
You can alternatively pass `protocol=pickle.HIGHEST_PROTOCOL` which is
equivalent to `protocol=5` in Python 3.8 and later (at the time of writing).
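
For instance, a quick sanity check of this equivalence (a minimal sketch,
assuming Python 3.8 or later)::

    >>> import pickle
    >>> pickle.HIGHEST_PROTOCOL
    5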

And later when needed, you can load the same object from the persisted file::

    # Here you can replace pickle with joblib or cloudpickle
    from pickle import load
-    with open('filename.pkl', 'rb') as f: clf = load(f)
+    with open("filename.pkl", "rb") as f:
+        clf = load(f)
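
The same pattern works with :mod:`joblib`. As a minimal sketch (assuming the
same fitted `clf`; the `filename.joblib` name is illustrative), passing
`mmap_mode="r"` memory-maps the stored arrays at load time instead of copying
them into memory::

    import joblib

    joblib.dump(clf, "filename.joblib")
    clf_restored = joblib.load("filename.joblib", mmap_mode="r")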

|details-end|

@@ -224,12 +233,14 @@ Security & Maintainability Limitations
--------------------------------------

:mod:`pickle` (and :mod:`joblib` and :mod:`cloudpickle` by extension) has
-many documented security vulnerabilities and should only be used if the
-artifact, i.e. the pickle-file, is coming from a trusted and verified source.
+many documented security vulnerabilities by design and should only be used if
+the artifact, i.e. the pickle-file, is coming from a trusted and verified
+source. You should never load a pickle file from an untrusted source, similarly
+to how you should never execute code from an untrusted source.
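
To make the risk concrete, here is a minimal sketch (illustrative only) of how
a crafted pickle can execute arbitrary code at load time via the `__reduce__`
hook::

    import os
    import pickle

    class Payload:
        def __reduce__(self):
            # The callable returned here runs as soon as the bytes are
            # unpickled, before any type or content check can happen.
            return (os.system, ("echo arbitrary code executed",))

    malicious_bytes = pickle.dumps(Payload())
    # pickle.loads(malicious_bytes) would run the shell command above.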

Also note that arbitrary computations can be represented using the `ONNX`
-format, and therefore a sandbox used to serve models using `ONNX` also needs to
-safeguard against computational and memory exploits.
+format, and it is therefore recommended to serve models using `ONNX` in a
+sandboxed environment to safeguard against computational and memory exploits.

Also note that there are no supported ways to load a model trained with a
different version of scikit-learn. While using :mod:`skops.io`, :mod:`joblib`,
@@ -298,7 +309,8 @@ can be caught to obtain the original version the estimator was pickled with::
    warnings.simplefilter("error", InconsistentVersionWarning)

    try:
-        est = pickle.loads("model_from_previous_version.pickle")
+        with open("model_from_previous_version.pickle", "rb") as f:
+            est = pickle.load(f)
    except InconsistentVersionWarning as w:
        print(w.original_sklearn_version)

@@ -328,22 +340,34 @@ each approach can be summarized as follows:
* :mod:`skops.io`: Trained scikit-learn models can be easily shared and put
into production using :mod:`skops.io`. It is more secure compared to
alternate approaches based on :mod:`pickle` because it does not load
-  arbitrary code unless explicitly asked for by the user.
+  arbitrary code unless explicitly asked for by the user. Such code needs to be
+  packaged and importable in the target Python environment.
* :mod:`joblib`: Efficient memory mapping techniques make it faster when using
-  the same persisted model in multiple Python processes. It also gives easy
-  shortcuts to compress and decompress the persisted object without the need
-  for extra code. However, it may trigger the execution of malicious code while
-  loading untrusted data as any other pickle-based persistence mechanism.
-* :mod:`pickle`: It is native to Python and any Python object can be serialized
-  and deserialized using :mod:`pickle`, including custom Python classes and
-  objects. While :mod:`pickle` can be used to easily save and load scikit-learn
-  models, it may trigger the execution of malicious code while loading
-  untrusted data.
-* `cloudpickle`_: It is slower than :mod:`pickle` and :mod:`joblib`, and is
-  more insecure than :mod:`pickle` and :mod:`joblib` since it can serialize
-  arbitrary code. However, in certain cases it might be a last resort to
-  persist certain models. Note that this is discouraged by `cloudpickle`_
-  itself since there are no forward compatibility guarantees and you might need
-  the same version of `cloudpickle`_ to load the persisted model.
+  the same persisted model in multiple Python processes when using
+  `mmap_mode="r"`. It also gives easy shortcuts to compress and decompress the
+  persisted object without the need for extra code. However, it may trigger the
+  execution of malicious code when loading a model from an untrusted source, as
+  with any other pickle-based persistence mechanism.
+* :mod:`pickle`: It is native to Python, and most Python objects can be
+  serialized and deserialized with it, including custom Python classes and
+  functions, as long as they are defined in a package that can be imported in
+  the target environment. While :mod:`pickle` can be used to easily save and
+  load scikit-learn models, it may trigger the execution of malicious code
+  while loading a model from an untrusted source. :mod:`pickle` can also be
+  very efficient memory-wise if the model was persisted with `protocol=5`, but
+  it does not support memory mapping.
+* `cloudpickle`_: It has loading efficiency comparable to :mod:`pickle` and
+  :mod:`joblib` (without memory mapping), but offers additional flexibility to
+  serialize custom Python code such as lambda expressions and interactively
+  defined functions and classes. It might be a last resort to persist pipelines
+  with custom Python components, such as a
+  :class:`sklearn.preprocessing.FunctionTransformer` wrapping a function
+  defined in the training script itself or, more generally, outside of any
+  importable Python package; see the sketch after this list. Note that
+  `cloudpickle`_ offers no forward compatibility guarantees, so you might need
+  the same version of `cloudpickle`_ to load the persisted model, along with
+  the same versions of all the libraries used to define it. As with the other
+  pickle-based persistence mechanisms, it may trigger the execution of
+  malicious code while loading a model from an untrusted source.

.. _cloudpickle: https://github.com/cloudpipe/cloudpickle
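
As a minimal sketch of that last-resort use case (`log1p_abs` is a hypothetical
helper defined in the training script itself, outside of any importable
package)::

    import cloudpickle
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer

    def log1p_abs(X):
        # Defined in the script itself: plain pickle would serialize it by
        # reference and fail to load it in another process; cloudpickle
        # serializes it by value.
        return np.log1p(np.abs(X))

    pipe = make_pipeline(FunctionTransformer(log1p_abs), LogisticRegression())
    payload = cloudpickle.dumps(pipe)
    pipe_restored = cloudpickle.loads(payload)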
37 changes: 37 additions & 0 deletions doc/modules/ensemble.rst
@@ -1247,6 +1247,43 @@ estimation.
representations of feature space; these approaches also focus on
dimensionality reduction.

.. _tree_ensemble_warm_start:

Fitting additional trees
------------------------

RandomForest, Extra-Trees and :class:`RandomTreesEmbedding` estimators all support
``warm_start=True``, which allows you to add more trees to an already fitted model.

::

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.ensemble import RandomForestClassifier

    >>> X, y = make_classification(n_samples=100, random_state=1)
    >>> clf = RandomForestClassifier(n_estimators=10, random_state=1)
    >>> clf = clf.fit(X, y)  # fit with 10 trees
    >>> len(clf.estimators_)
    10
    >>> # set warm_start and increase the number of estimators
    >>> _ = clf.set_params(n_estimators=20, warm_start=True)
    >>> _ = clf.fit(X, y)  # fit 10 additional trees
    >>> len(clf.estimators_)
    20

When ``random_state`` is set, the internal random state is also preserved
between ``fit`` calls. This means that training a model once with ``n`` estimators
is the same as building the model iteratively via multiple ``fit`` calls, where
the final number of estimators is equal to ``n``.

::

    >>> clf = RandomForestClassifier(n_estimators=20, random_state=1)  # set `n_estimators` to 10 + 10
    >>> _ = clf.fit(X, y)  # the fitted `estimators_` will be the same as `clf` above

Note that this differs from the usual behavior of :term:`random_state`: because
the internal random state is carried over between ``fit`` calls rather than
re-seeded, repeated calls to ``fit`` do *not* produce identical results.

.. _bagging:

Bagging meta-estimator
2 changes: 1 addition & 1 deletion meson.build
@@ -6,7 +6,7 @@ project(
meson_version: '>= 1.1.0',
default_options: [
'buildtype=debugoptimized',
-    'c_std=c17',
+    'c_std=c11',
'cpp_std=c++14',
],
)
10 changes: 5 additions & 5 deletions sklearn/ensemble/_forest.py
@@ -1308,7 +1308,7 @@ class RandomForestClassifier(ForestClassifier):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, \
default=None
@@ -1710,7 +1710,7 @@ class RandomForestRegressor(ForestRegressor):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The
@@ -2049,7 +2049,7 @@ class ExtraTreesClassifier(ForestClassifier):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, \
default=None
@@ -2434,7 +2434,7 @@ class ExtraTreesRegressor(ForestRegressor):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The
@@ -2727,7 +2727,7 @@ class RandomTreesEmbedding(TransformerMixin, BaseForest):
When set to ``True``, reuse the solution of the previous call to fit
and add more estimators to the ensemble, otherwise, just fit a whole
new forest. See :term:`Glossary <warm_start>` and
-        :ref:`gradient_boosting_warm_start` for details.
+        :ref:`tree_ensemble_warm_start` for details.
Attributes
----------
