diff --git a/doc/introduction.rst b/doc/introduction.rst
index 5e9f54686..3398cd510 100644
--- a/doc/introduction.rst
+++ b/doc/introduction.rst
@@ -31,7 +31,7 @@ Imbalanced-learn samplers accept the same inputs as scikit-learn estimators:
 * `data`, 2-dimensional array-like structures, such as:
   * Python's list of lists :class:`list`,
   * Numpy arrays :class:`numpy.ndarray`,
-  * Panda dataframes :class:`pandas.DataFrame`,
+  * Pandas dataframes :class:`pandas.DataFrame`,
   * Scipy sparse matrices :class:`scipy.sparse.csr_matrix` or
     :class:`scipy.sparse.csc_matrix`;
 * `targets`, 1-dimensional array-like structures, such as:
diff --git a/doc/over_sampling.rst b/doc/over_sampling.rst
index 3bc975b89..683b289a0 100644
--- a/doc/over_sampling.rst
+++ b/doc/over_sampling.rst
@@ -6,21 +6,26 @@ Over-sampling
 
 .. currentmodule:: imblearn.over_sampling
 
-A practical guide
-=================
+As :ref:`discussed earlier <problem_statement>`, the decision function of a
+classifier trained on imbalanced data can favour the majority class, in the
+extreme behaving like a `Dummy classifier
+<https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html>`_
+that always predicts the majority class.
+
+One approach to address this issue is to generate new samples for the
+under-represented classes, a technique known as **over-sampling**.
 
-You can refer to
-:ref:`sphx_glr_auto_examples_over-sampling_plot_comparison_over_sampling.py`.
+Please refer to :ref:`sphx_glr_auto_examples_over-sampling_plot_comparison_over_sampling.py`
+for details on the visuals included in this document.
 
 .. _random_over_sampler:
 
-Naive random over-sampling
---------------------------
+Naive Random Over-Sampling
+==========================
 
-One way to fight this issue is to generate new samples in the classes which are
-under-represented. The most naive strategy is to generate new samples by
-randomly sampling with replacement the current available samples. The
-:class:`RandomOverSampler` offers such scheme::
+The most naive strategy is to generate new samples by
+**randomly sampling with replacement** from the existing samples. The
+:class:`RandomOverSampler` implements this approach::
 
   >>> from sklearn.datasets import make_classification
  >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
@@ -35,28 +40,27 @@ randomly sampling with replacement the current available samples. The
  >>> print(sorted(Counter(y_resampled).items()))
  [(0, 4674), (1, 4674), (2, 4674)]
 
-The augmented data set should be used instead of the original data set to train
-a classifier::
+The **augmented data set** `(X_resampled, y_resampled)` should be used
+instead of the original data set to train a classifier::
 
  >>> from sklearn.linear_model import LogisticRegression
  >>> clf = LogisticRegression()
 >>> clf.fit(X_resampled, y_resampled)
 LogisticRegression(...)
 
-In the figure below, we compare the decision functions of a classifier trained
-using the over-sampled data set and the original data set.
+In the figure below, we compare the decision function of a classifier
+trained on the augmented dataset with that of a classifier trained on the
+original dataset.
 
 .. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_002.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :scale: 60
    :align: center
 
-As a result, the majority class does not take over the other classes during the
-training process. Consequently, all classes are represented by the decision
-function.
+We observe that the majority class does not dominate the other classes during
+training. Consequently, the decision function represents all classes.
 
-In addition, :class:`RandomOverSampler` allows to sample heterogeneous data
-(e.g. containing some strings)::
+In addition, :class:`RandomOverSampler` supports **heterogeneous data**
+(e.g., strings, datetimes, or categorical features)::
 
  >>> import numpy as np
  >>> X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],
@@ -71,7 +75,7 @@ In addition, :class:`RandomOverSampler` allows to sample heterogeneous data
  >>> print(y_resampled)
  [0 0 1 1]
 
-It would also work with pandas dataframe::
+It also supports pandas DataFrames::
 
  >>> from sklearn.datasets import fetch_openml
  >>> df_adult, y_adult = fetch_openml(
@@ -80,13 +84,14 @@ It would also work with pandas dataframe::
  >>> df_resampled, y_resampled = ros.fit_resample(df_adult, y_adult)
  >>> df_resampled.head()  # doctest: +SKIP
 
-If repeating samples is an issue, the parameter `shrinkage` allows to create a
-smoothed bootstrap. However, the original data needs to be numerical. The
-`shrinkage` parameter controls the dispersion of the new generated samples. We
-show an example illustrate that the new samples are not overlapping anymore
-once using a smoothed bootstrap. This ways of generating smoothed bootstrap is
-also known a Random Over-Sampling Examples
-(ROSE) :cite:`torelli2014rose`.
+If exact repetition of samples is an issue, the `shrinkage` parameter enables a
+**smoothed bootstrap** (i.e., adding noise to the resampled observations).
+However, the original data must be numerical.
+
+The `shrinkage` parameter controls the dispersion of the newly generated
+samples. As illustrated below, it can be used so that the generated samples no
+longer exactly coincide with the original ones. This method of generating a
+smoothed bootstrap is also known as **Random Over-Sampling Examples (ROSE)**
+:cite:`torelli2014rose`.
 
 .. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_003.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
@@ -95,14 +100,19 @@ also known a Random Over-Sampling Examples
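+As a minimal sketch (reusing the `X` and `y` from above), a smoothed bootstrap
+is obtained by passing a small positive `shrinkage` value; the class counts are
+identical to the plain bootstrap, only the sample values are perturbed::
+
+  >>> from imblearn.over_sampling import RandomOverSampler
+  >>> # shrinkage > 0 adds a small perturbation to each bootstrapped sample
+  >>> # instead of duplicating it exactly
+  >>> ros_smoothed = RandomOverSampler(shrinkage=0.2, random_state=0)
+  >>> X_smoothed, y_smoothed = ros_smoothed.fit_resample(X, y)
+  >>> print(sorted(Counter(y_smoothed).items()))
+  [(0, 4674), (1, 4674), (2, 4674)]
+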
 .. _smote_adasyn:
 
-From random over-sampling to SMOTE and ADASYN
----------------------------------------------
+From Random Over-Sampling to SMOTE and ADASYN
+=============================================
 
-Apart from the random sampling with replacement, there are two popular methods
-to over-sample minority classes: (i) the Synthetic Minority Oversampling
-Technique (SMOTE) :cite:`chawla2002smote` and (ii) the Adaptive Synthetic
-(ADASYN) :cite:`he2008adasyn` sampling method. These algorithms can be used in
-the same manner::
+Apart from random sampling with replacement, two popular methods for
+over-sampling minority classes are:
+
+1. the **Synthetic Minority Oversampling Technique (SMOTE)** :class:`SMOTE`
+   :cite:`chawla2002smote`; and
+
+2. the **Adaptive Synthetic (ADASYN)** :class:`ADASYN` sampling method
+   :cite:`he2008adasyn`.
+
+These algorithms can be applied in the same way::
 
  >>> from imblearn.over_sampling import SMOTE, ADASYN
  >>> X_resampled, y_resampled = SMOTE().fit_resample(X, y)
@@ -114,70 +124,73 @@ the same manner::
  [(0, 4673), (1, 4662), (2, 4674)]
  >>> clf_adasyn = LogisticRegression().fit(X_resampled, y_resampled)
 
-The figure below illustrates the major difference of the different
-over-sampling methods.
+The figure below illustrates the key differences between the various
+over-sampling methods.
 
 .. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_004.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :scale: 60
    :align: center
 
-Ill-posed examples
-------------------
+Ill-Posed Examples
+==================
 
-While the :class:`RandomOverSampler` is over-sampling by duplicating some of
-the original samples of the minority class, :class:`SMOTE` and :class:`ADASYN`
-generate new samples in by interpolation. However, the samples used to
-interpolate/generate new synthetic samples differ. In fact, :class:`ADASYN`
-focuses on generating samples next to the original samples which are wrongly
-classified using a k-Nearest Neighbors classifier while the basic
-implementation of :class:`SMOTE` will not make any distinction between easy and
-hard samples to be classified using the nearest neighbors rule. Therefore, the
-decision function found during training will be different among the algorithms.
+While :class:`RandomOverSampler` over-samples by duplicating samples from the
+minority class, :class:`SMOTE` and :class:`ADASYN` generate new samples through
+interpolation. However, the samples used to interpolate and generate the new
+synthetic samples differ.
+
+Specifically, :class:`ADASYN` focuses on generating samples next to the
+original samples that are misclassified by a k-nearest neighbours classifier.
+In contrast, the basic implementation of :class:`SMOTE` does not distinguish
+between samples that are easy or hard to classify with the nearest-neighbours
+rule. Consequently, the decision functions learned during training will differ
+between these algorithms.
 
 .. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_005.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :align: center
 
-The sampling particularities of these two algorithms can lead to some peculiar
-behavior as shown below.
+The specific sampling characteristics of these two algorithms can result in
+distinctive behaviours, as demonstrated below.
 
 .. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_006.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :scale: 60
    :align: center
 
-SMOTE variants
---------------
+SMOTE Variants
+==============
 
-SMOTE might connect inliers and outliers while ADASYN might focus solely on
-outliers which, in both cases, might lead to a sub-optimal decision
-function. In this regard, SMOTE offers three additional options to generate
-samples. Those methods focus on samples near the border of the optimal
-decision function and will generate samples in the opposite direction of the
-nearest neighbors class.
+:class:`SMOTE` might connect inliers with outliers, while :class:`ADASYN`
+might focus solely on outliers. Both cases can lead to a sub-optimal decision
+function. To address this, :class:`SMOTE` provides three variants for
+generating samples:
+
+1. :class:`BorderlineSMOTE` :cite:`han2005borderline`
+2. :class:`SVMSMOTE` :cite:`nguyen2009borderline`
+3. :class:`KMeansSMOTE` :cite:`last2017oversampling`
+
+These methods focus on samples near the decision boundary and generate new
+samples in the opposite direction of the nearest neighbour class.
+In particular, :class:`BorderlineSMOTE` comes in two variants, selected with
+`kind="borderline-1"` and `kind="borderline-2"`; a short usage example and an
+illustration of all the variants follow.
+
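+A short usage sketch (reusing the `X` and `y` defined at the beginning of this
+document; `kind="borderline-1"` is the default)::
+
+  >>> from imblearn.over_sampling import BorderlineSMOTE
+  >>> X_resampled, y_resampled = BorderlineSMOTE(
+  ...     kind="borderline-1").fit_resample(X, y)
+  >>> print(sorted(Counter(y_resampled).items()))
+  [(0, 4674), (1, 4674), (2, 4674)]
+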
 .. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_007.png
    :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html
    :scale: 60
    :align: center
 
-The :class:`BorderlineSMOTE` :cite:`han2005borderline`,
-:class:`SVMSMOTE` :cite:`nguyen2009borderline`, and
-:class:`KMeansSMOTE` :cite:`last2017oversampling` offer some variant of the
-SMOTE algorithm::
-
- >>> from imblearn.over_sampling import BorderlineSMOTE
- >>> X_resampled, y_resampled = BorderlineSMOTE().fit_resample(X, y)
- >>> print(sorted(Counter(y_resampled).items()))
- [(0, 4674), (1, 4674), (2, 4674)]
-
-When dealing with mixed data type such as continuous and categorical features,
-none of the presented methods (apart of the class :class:`RandomOverSampler`)
-can deal with the categorical features. The :class:`SMOTENC`
-:cite:`chawla2002smote` is an extension of the :class:`SMOTE` algorithm for
-which categorical data are treated differently::
+However, none of these SMOTE variants (nor, in fact, any of the methods
+presented so far, except :class:`RandomOverSampler`) can handle categorical
+features. To work with mixed data types (continuous and categorical features),
+we introduce the **Synthetic Minority Over-sampling Technique for Nominal and
+Continuous** :class:`SMOTENC` :cite:`chawla2002smote`, an extension of the
+:class:`SMOTE` algorithm designed to handle categorical features.
+
+We start by creating a dataset that includes both continuous and categorical
+features::
 
  >>> # create a synthetic data set with continuous and categorical features
  >>> rng = np.random.RandomState(42)
@@ -190,12 +203,17 @@ which categorical data are treated differently::
  >>> print(sorted(Counter(y).items()))
  [(0, 20), (1, 30)]
 
-In this data set, the first and last features are considered as categorical
-features. One needs to provide this information to :class:`SMOTENC` via the
-parameters ``categorical_features`` either by passing the indices, the feature
-names when `X` is a pandas DataFrame, a boolean mask marking these features,
-or relying on `dtype` inference if the columns are using the
-:class:`pandas.CategoricalDtype`::
+Here, the first and last features are categorical. This information must be
+provided to :class:`SMOTENC` via the `categorical_features` parameter in one of
+the following ways:
+
+- by passing the indices of the categorical features;
+- by passing the feature names when `X` is a pandas DataFrame;
+- by providing a boolean mask identifying the categorical features; or
+- by relying on `dtype` inference if the columns use the
+  :class:`pandas.CategoricalDtype`.
+
+As shown below, the samples generated for the first and last columns belong to
+the same categories as the original data, without any additional
+interpolation::
 
  >>> from imblearn.over_sampling import SMOTENC
  >>> smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
@@ -209,22 +227,17 @@ or relying on `dtype` inference if the columns are using the
  ['B' 0.37... 2]
  ['B' 0.33... 2]]
 
-Therefore, it can be seen that the samples generated in the first and last
-columns are belonging to the same categories originally presented without any
-other extra interpolation.
-
-However, :class:`SMOTENC` is only working when data is a mixed of numerical and
-categorical features. If data are made of only categorical data, one can use
-the :class:`SMOTEN` variant :cite:`chawla2002smote`. The algorithm changes in
-two ways:
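+The categorical features can equivalently be specified with a boolean mask (a
+minimal sketch; the mask below marks the same first and last columns as the
+indices used above)::
+
+  >>> # boolean mask over the three features, True marking categorical columns
+  >>> smote_nc = SMOTENC(categorical_features=[True, False, True],
+  ...                    random_state=0)
+  >>> X_resampled, y_resampled = smote_nc.fit_resample(X, y)
+  >>> print(sorted(Counter(y_resampled).items()))
+  [(0, 30), (1, 30)]
+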
+However, :class:`SMOTENC` only works when the data is a mixture of continuous
+and categorical features. If the data consists only of categorical features,
+the **Synthetic Minority Over-sampling Technique for Nominal** variant,
+:class:`SMOTEN` :cite:`chawla2002smote` (without the "C"), can be used instead.
+The algorithm changes in two ways:
 
-* the nearest neighbors search does not rely on the Euclidean distance. Indeed,
-  the value difference metric (VDM) also implemented in the class
-  :class:`~imblearn.metrics.ValueDifferenceMetric` is used.
-* a new sample is generated where each feature value corresponds to the most
-  common category seen in the neighbors samples belonging to the same class.
+- The nearest neighbours search uses the **value difference metric (VDM)**,
+  implemented in :class:`~imblearn.metrics.pairwise.ValueDifferenceMetric`,
+  instead of the Euclidean distance.
+- A new sample is generated where each feature value corresponds to the most
+  common category among the neighbouring samples belonging to the same class.
 
-Let's take the following example::
+Let's consider the following example to see how :class:`SMOTEN` handles
+categorical data::
 
  >>> import numpy as np
  >>> X = np.array(["green"] * 5 + ["red"] * 10 + ["blue"] * 7,
@@ -232,10 +245,9 @@ Let's take the following example::
  >>> y = np.array(["apple"] * 5 + ["not apple"] * 3 + ["apple"] * 7 +
  ...              ["not apple"] * 5 + ["apple"] * 2, dtype=object)
 
-We generate a dataset associating a color to being an apple or not an apple.
-We strongly associated "green" and "red" to being an apple. The minority class
-being "not apple", we expect new data generated belonging to the category
-"blue"::
+We generate a dataset associating a colour with being an `apple` or
+`not apple`. The colours `green` and `red` are strongly associated with
+`apple`. Since the minority class is `not apple`, we expect the newly generated
+data to belong to the category `blue`::
 
  >>> from imblearn.over_sampling import SMOTEN
  >>> sampler = SMOTEN(random_state=0)
@@ -251,25 +263,31 @@ being "not apple", we expect new data generated belonging to the category
  array(['not apple', 'not apple', 'not apple', 'not apple', 'not apple',
         'not apple'], dtype=object)
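+
+As a quick sanity check (a minimal sketch; `X_res` and `y_res` simply name the
+resampled outputs here), the two classes end up balanced::
+
+  >>> X_res, y_res = sampler.fit_resample(X, y)
+  >>> print(sorted(Counter(y_res).items()))
+  [('apple', 14), ('not apple', 14)]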
 
-Mathematical formulation
-========================
-
-Sample generation
------------------
+Sample Generation
+=================
 
-Both :class:`SMOTE` and :class:`ADASYN` use the same algorithm to generate new
-samples. Considering a sample :math:`x_i`, a new sample :math:`x_{new}` will be
-generated considering its k neareast-neighbors (corresponding to
-``k_neighbors``). For instance, the 3 nearest-neighbors are included in the
-blue circle as illustrated in the figure below. Then, one of these
-nearest-neighbors :math:`x_{zi}` is selected and a sample is generated as
-follows:
+Both :class:`SMOTE` and :class:`ADASYN` use the same algorithm to generate new
+samples. Given a sample :math:`x_i`, a new sample :math:`x_{new}` is generated
+by considering its :math:`k` nearest neighbours (corresponding to the
+``k_neighbors`` parameter of :class:`SMOTE`, or ``n_neighbors`` of
+:class:`ADASYN`). For example, the three nearest neighbours of :math:`x_i` are
+shown within the blue circle in the figure below. One of these nearest
+neighbours, :math:`x_{zi}`, is then selected, and a new sample is generated as
+follows:
 
 .. math::
 
-   x_{new} = x_i + \lambda \times (x_{zi} - x_i)
+   x_{new} = x_i + \lambda (x_{zi} - x_i)
 
-where :math:`\lambda` is a random number in the range :math:`[0, 1]`. This
+where :math:`\lambda \in [0, 1]` is randomly picked. This
 interpolation will create a sample on the line between :math:`x_{i}` and
 :math:`x_{zi}` as illustrated in the image below:
 
@@ -278,58 +296,67 @@ interpolation will create a sample on the line between :math:`x_{i}` and
    :scale: 60
    :align: center
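 
+For instance (a made-up numeric sketch), with :math:`x_i = (1, 2)`,
+:math:`x_{zi} = (3, 4)`, and :math:`\lambda = 0.5`, the generated sample is
+
+.. math::
+
+   x_{new} = (1, 2) + 0.5 \times ((3, 4) - (1, 2)) = (2, 3),
+
+i.e., the midpoint of the segment joining :math:`x_i` and :math:`x_{zi}`.
+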
-SMOTE-NC slightly change the way a new sample is generated by performing
-something specific for the categorical features. In fact, the categories of a
-new generated sample are decided by picking the most frequent category of the
-nearest neighbors present during the generation.
+The sample generation process in :class:`SMOTENC` is slightly different because
+it applies a specific approach to the categorical features. Specifically, the
+category of a newly generated sample is determined by the most frequent
+category among its nearest neighbours during the generation process.
 
 .. warning::
-   Be aware that SMOTE-NC is not designed to work with only categorical data.
+   Note that :class:`SMOTENC` is not designed to handle datasets consisting
+   solely of categorical features.
 
-The other SMOTE variants and ADASYN differ from each other by selecting the
-samples :math:`x_i` ahead of generating the new samples.
-
-The **regular** SMOTE algorithm --- cf. to the :class:`SMOTE` object --- does not
-impose any rule and will randomly pick-up all possible :math:`x_i` available.
-
-The **borderline** SMOTE --- cf. to the :class:`BorderlineSMOTE` with the
-parameters ``kind='borderline-1'`` and ``kind='borderline-2'`` --- will
-classify each sample :math:`x_i` to be (i) noise (i.e. all nearest-neighbors
-are from a different class than the one of :math:`x_i`), (ii) in danger
-(i.e. at least half of the nearest neighbors are from the same class than
-:math:`x_i`, or (iii) safe (i.e. all nearest neighbors are from the same class
-than :math:`x_i`). **Borderline-1** and **Borderline-2** SMOTE will use the
-samples *in danger* to generate new samples. In **Borderline-1** SMOTE,
-:math:`x_{zi}` will belong to the same class than the one of the sample
-:math:`x_i`. On the contrary, **Borderline-2** SMOTE will consider
-:math:`x_{zi}` which can be from any class.
-
-**SVM** SMOTE --- cf. to :class:`SVMSMOTE` --- uses an SVM classifier to find
-support vectors and generate samples considering them. Note that the ``C``
-parameter of the SVM classifier allows to select more or less support vectors.
-
-For both borderline and SVM SMOTE, a neighborhood is defined using the
-parameter ``m_neighbors`` to decide if a sample is in danger, safe, or noise.
-
-**KMeans** SMOTE --- cf. to :class:`KMeansSMOTE` --- uses a KMeans clustering
-method before to apply SMOTE. The clustering will group samples together and
-generate new samples depending of the cluster density.
-
-ADASYN works similarly to the regular SMOTE. However, the number of
-samples generated for each :math:`x_i` is proportional to the number of samples
-which are not from the same class than :math:`x_i` in a given
-neighborhood. Therefore, more samples will be generated in the area that the
-nearest neighbor rule is not respected. The parameter ``m_neighbors`` is
-equivalent to ``k_neighbors`` in :class:`SMOTE`.
-
-Multi-class management
-----------------------
-
-All algorithms can be used with multiple classes as well as binary classes
-classification. :class:`RandomOverSampler` does not require any inter-class
-information during the sample generation. Therefore, each targeted class is
-resampled independently. In the contrary, both :class:`ADASYN` and
-:class:`SMOTE` need information regarding the neighbourhood of each sample used
-for sample generation. They are using a one-vs-rest approach by selecting each
-targeted class and computing the necessary statistics against the rest of the
-data set which are grouped in a single class.
+The other SMOTE variants and :class:`ADASYN` differ in how they select the
+samples :math:`x_i` before generating new samples:
+
+- :class:`SMOTE` imposes no specific rules and randomly selects from all
+  available :math:`x_i`.
+
+- :class:`BorderlineSMOTE` classifies each sample :math:`x_i` into one of three
+  categories:
+
+  i. **Noise**: all nearest neighbours belong to a different class than
+     :math:`x_i`.
+
+  ii. **In danger**: at least half (but not all) of the nearest neighbours
+      belong to a different class than :math:`x_i`.
+
+  iii. **Safe**: most of the nearest neighbours belong to the same class as
+       :math:`x_i`.
+
+  Both ``kind="borderline-1"`` and ``kind="borderline-2"`` use the samples
+  classified as *in danger* to generate new samples.
+
+  - With ``kind="borderline-1"``, :math:`x_{zi}` is selected from the same
+    class as :math:`x_i`.
+
+  - In contrast, ``kind="borderline-2"`` allows :math:`x_{zi}` to be from any
+    class.
+
+- :class:`SVMSMOTE` uses an `SVM classifier
+  <https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html>`_
+  to identify support vectors and generates samples based on them. Note that
+  the ``C`` parameter of the SVM classifier influences the number of support
+  vectors.
+
+- :class:`KMeansSMOTE` employs a `k-means clustering method
+  <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html>`_
+  before applying :class:`SMOTE`. The clustering groups samples together and
+  generates new samples based on the density of each cluster.
+
+- :class:`ADASYN` works similarly to :class:`SMOTE`. However, the number of
+  samples generated for each :math:`x_i` is proportional to the number of
+  neighbours that do not belong to the same class as :math:`x_i`. Thus, more
+  samples are generated in areas where the *nearest-neighbour rule* is not
+  satisfied. The parameter ``n_neighbors`` is equivalent to ``k_neighbors`` in
+  :class:`SMOTE`.
+
+For both :class:`BorderlineSMOTE` and :class:`SVMSMOTE`, the neighbourhood used
+to determine whether a sample is noise, in danger, or safe is defined by the
+parameter ``m_neighbors`` rather than ``k_neighbors``.
+
+Multi-Class Management
+======================
+
+All algorithms can be applied to both binary and multi-class classification.
+
+:class:`RandomOverSampler` does not rely on inter-class information during
+sample generation, meaning each target class is resampled independently, as
+sketched below.
+
+In contrast, both :class:`ADASYN` and :class:`SMOTE` require neighbourhood
+information for each sample to generate new ones. These algorithms use a
+*one-vs-rest* approach, where each target class is selected and the necessary
+statistics are computed against the rest of the dataset, which is treated as a
+single class.
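+
+As a minimal sketch (with a fresh three-class toy dataset, since `X` and `y`
+were redefined in the examples above), each class is resampled independently up
+to the size of the majority class::
+
+  >>> from sklearn.datasets import make_classification
+  >>> from imblearn.over_sampling import RandomOverSampler
+  >>> X_multi, y_multi = make_classification(n_samples=1000, n_classes=3,
+  ...                                        n_informative=4,
+  ...                                        weights=[0.2, 0.3, 0.5],
+  ...                                        random_state=0)
+  >>> print(sorted(Counter(y_multi).items()))
+  [(0, 200), (1, 300), (2, 500)]
+  >>> # classes 0 and 1 are each resampled independently up to 500 samples
+  >>> ros = RandomOverSampler(random_state=0)
+  >>> X_res, y_res = ros.fit_resample(X_multi, y_multi)
+  >>> print(sorted(Counter(y_res).items()))
+  [(0, 500), (1, 500), (2, 500)]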