From 18e3ec976b5bd5ff2d506dd147137d1cfa16aa07 Mon Sep 17 00:00:00 2001
From: "Documenter.jl"
Date: Mon, 5 Feb 2024 09:55:30 +0000
Subject: [PATCH] build based on f994b94
---
 dev/LICENSE/index.html                     |  2 +-
 dev/acknowledgements/index.html            |  2 +-
 dev/advanced/developer/index.html          |  2 +-
 dev/advanced/extend/index.html             |  6 +++---
 dev/index.html                             |  2 +-
 dev/indices/index.html                     |  2 +-
 dev/introduction/gettingstarted/index.html |  2 +-
 dev/introduction/motivation/index.html     |  2 +-
 dev/losses/distance/index.html             | 18 +++++++++---------
 dev/losses/margin/index.html               | 22 +++++++++++-----------
 dev/losses/other/index.html                |  2 +-
 dev/search/index.html                      |  2 +-
 dev/user/aggregate/index.html              |  2 +-
 dev/user/interface/index.html              |  2 +-
 14 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/dev/LICENSE/index.html b/dev/LICENSE/index.html

LICENSE · LossFunctions.jl

LICENSE

The LossFunctions.jl package is licensed under the MIT "Expat" License:

Copyright (c) 2015: Christof Stocker, Tom Breloff, Alex Williams, and other contributors.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Partially based on EmpiricalRisks.jl by Dahua Lin

The EmpiricalRisks.jl package is licensed under the MIT "Expat" License:

Copyright (c) 2015: Dahua Lin.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

diff --git a/dev/acknowledgements/index.html b/dev/acknowledgements/index.html

Acknowledgements · LossFunctions.jl

diff --git a/dev/advanced/developer/index.html b/dev/advanced/developer/index.html

Developer Documentation · LossFunctions.jl

Developer Documentation

In this part of the documentation we will discuss some of the internal design aspects of this library. Consequently, the target audience of this section and its sub-sections is primarily people interested in contributing to this package. As such, the information provided here should be of little to no relevance for users interested in simply applying the package.

Abstract Types

We have seen in previous sections that many families of loss functions are implemented as immutable types with free parameters. An example of such a family is the L1EpsilonInsLoss, which represents all the $\epsilon$-insensitive loss functions for each possible value of $\epsilon$.

Aside from these special families, there are a handful of more generic families that between them contain almost all of the loss functions this package implements. These families are defined as abstract types in the type tree. Their main purpose is two-fold:

  • From an end-user's perspective, they are most useful for dispatching on the particular kind of prediction problem that they are intended for (regression vs classification).

  • From an implementation perspective, these abstract types allow us to implement shared functionality and fall-back methods, or even allow for a simpler implementation.

Most of the implemented loss functions fall under the umbrella of supervised losses. As such, we barely mention other types of losses anywhere in this documentation.

There are two interesting sub-families of supervised loss functions. One of these families is called distance-based. All losses that belong to this family are implemented as subtypes of the abstract type DistanceLoss, which is itself a subtype of SupervisedLoss.

The second core sub-family of supervised losses is called margin-based. All loss functions that belong to this family are implemented as subtypes of the abstract type MarginLoss, which is itself a subtype of SupervisedLoss.

Shared Interface

Each of the three abstract types listed above serves a purpose other than dispatch. All losses that belong to the same family share functionality to some degree.

More interestingly, the abstract types DistanceLoss and MarginLoss serve an additional purpose aside from shared functionality. We have seen in the background section what it is that makes a loss margin-based or distance-based. Without repeating the definition, let us state that it boils down to the existence of a representing function $\psi$, which allows a loss to be computed using a unary function instead of a binary one. Indeed, all the subtypes of DistanceLoss and MarginLoss are implemented in the unary form of their representing function.
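
To make this concrete, here is a minimal sketch in plain Julia (independent of the package internals) of how the binary form of a distance-based loss arises from its unary representing function:

ψ(r) = r^2              # representing function of the least squares loss

L(ŷ, y) = ψ(ŷ - y)      # the binary form evaluates ψ on the difference

L(3.0, 1.0)             # 4.0, the same as ψ(2.0)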

Distance-based Losses

Supervised losses that can be expressed as a univariate function of output - target are referred to as distance-based losses. Distance-based losses are typically utilized for regression problems. That said, there are other losses useful for regression that do not fall into this category, such as the PeriodicLoss.

Margin-based Losses

Margin-based losses are supervised losses where the values of the targets are restricted to be in $\{1,-1\}$, and which can be expressed as a univariate function of output * target.
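
For example, the hinge loss fits this pattern. The following plain-Julia sketch (not the package implementation) evaluates its representing function on the agreement y * ŷ:

ψ(a) = max(0, 1 - a)    # representing function of the hinge loss

L(ŷ, y) = ψ(y * ŷ)      # the binary form evaluates ψ on the agreement

L(0.8, 1)               # ≈ 0.2, correct sign but inside the margin
L(-2.0, 1)              # 3.0, wrong sign, penalized linearly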

diff --git a/dev/advanced/extend/index.html b/dev/advanced/extend/index.html

Altering existing Losses · LossFunctions.jl

Altering existing Losses

There are situations in which one wants to work with slightly altered versions of specific loss functions. This package provides two generic ways to create such meta losses for specific families of loss functions.

  1. Scaling a supervised loss by a constant real number. This is done at compile time and can in some situations even lead to simpler code (e.g. in the case of the derivative of an L2DistLoss).

  2. Weighting the classes of a margin-based loss differently in order to better deal with unbalanced binary classification problems.

Scaling a Supervised Loss

It is quite common in machine learning courses to define the least squares loss as $\frac{1}{2} (\hat{y} - y)^2$, while this package implements that type of loss as an $L_2$ distance loss using $(\hat{y} - y)^2$, i.e. without the constant scale factor.
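
The reason the $\frac{1}{2}$ convention is popular is that the constant cancels when differentiating, as this plain-Julia sketch shows:

L(ŷ, y)  = (ŷ - y)^2 / 2    # least squares convention
dL(ŷ, y) = ŷ - y            # derivative w.r.t. ŷ, no constant factor left

dL(4.0, 0.0)                # 4.0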

For situations in which one wants a scaled version of an existing loss type, we provide the concept of a scaled loss. The difference is literally only a constant real number by which the existing implementation of the loss function (and its derivatives) is multiplied.

julia> lsloss = 1/2 * L2DistLoss()
 ScaledLoss{L2DistLoss, 0.5}(L2DistLoss())
 
julia> L2DistLoss()(4.0, 0.0)
16.0

julia> lsloss(4.0, 0.0)
8.0

[...]

# a margin loss re-weighted by class: the positive class is weighted by w,
# the negative class by 1 - w
if target == 1
    w * loss(target, output)
else
    (1-w) * loss(target, output)
end

Rather than providing special functions to compute a class-weighted loss, we expose a generic way to create new weighted versions of already existing unweighted margin losses. This way, every existing subtype of MarginLoss can be re-weighted arbitrarily. Furthermore, it allows every algorithm that expects a binary loss to work with weighted binary losses as well.

LossFunctions.WeightedMarginLoss (Type)
WeightedMarginLoss{L,W} <: MarginLoss

Can be used to represent a re-weighted version of some type of binary loss L. The weight factor W, which must be in [0, 1], denotes the relative weight of the positive class, while the relative weight of the negative class will be 1 - W.

julia> myloss = WeightedMarginLoss(HingeLoss(), 0.8)
 WeightedMarginLoss{L1HingeLoss, 0.8}(L1HingeLoss())
 
julia> myloss(-4.0, 1.0) # positive class
4.0

[...]
 
 julia> typeof(myloss) <: HingeLoss
 false

Similar to scaled losses, the constant weight factor gets promoted to a type parameter. This can be quite an overhead when done on the fly every time the loss value is computed. To avoid this, one can make use of Val to specify the weight factor in a type-stable manner.

julia> WeightedMarginLoss(HingeLoss(), Val(0.8))
WeightedMarginLoss{L1HingeLoss, 0.8}(L1HingeLoss())
diff --git a/dev/index.html b/dev/index.html

Home · LossFunctions.jl

LossFunctions.jl's documentation

This package represents a community effort to centralize the definition and implementation of loss functions in Julia. As such, it is a part of the JuliaML ecosystem.

The sole purpose of this package is to provide an efficient and extensible implementation of various loss functions used throughout Machine Learning (ML). It is thus intended to serve as a special-purpose back-end for other ML libraries that require losses to accomplish their tasks. To that end we provide a considerable amount of carefully implemented loss functions, as well as an API to query their properties (e.g. convexity). Furthermore, we expose methods to compute their values, derivatives, and second derivatives for single observations as well as arbitrarily sized arrays of observations. In the case of arrays, the user can additionally specify if and how the element-wise results are averaged or summed.
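
As a quick impression of that interface, here is a sketch that assumes the deriv and deriv2 methods exported by the package and the (output, target) argument order used elsewhere in these docs:

julia> using LossFunctions

julia> loss = L2DistLoss();

julia> loss(3.0, 1.0)           # value: (3 - 1)^2
4.0

julia> deriv(loss, 3.0, 1.0)    # first derivative w.r.t. the output
4.0

julia> deriv2(loss, 3.0, 1.0)   # second derivative
2.0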

From an end-user's perspective one normally does not need to import this package directly. That said, it should provide a decent starting point for any student that is interested in investigating the properties or behaviour of loss functions.

Introduction and Motivation

If this is the first time you consider using LossFunctions for your machine learning related experiments or packages, make sure to check out the "Getting Started" section.

If you are new to Machine Learning in Julia, or are simply interested in how and why this package works the way it works, feel free to take a look at the following sections. There we discuss the concepts involved and outline the most important terms and definitions.

User's Guide

This section gives a more detailed treatment of the exposed functions and their available methods. We will start by describing how to instantiate a loss, as well as the basic interface that all loss functions share.

Next we will consider how to average or sum the results of the loss functions more efficiently. The methods described here are implemented in such a way as to avoid allocating a temporary array.
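
For instance, element-wise results can be summed directly. This sketch assumes a sum method analogous to the mean method shown in the "Getting Started" section:

julia> using LossFunctions

julia> sum(L1DistLoss(), [2.0, 0.5], [1.0, 1.0])
1.5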

Available Losses

Aside from the interface, this package also provides a number of popular (and not so popular) loss functions out-of-the-box. Great effort has been put into ensuring a correct, efficient, and type-stable implementation for those. Most of them belong to either the family of distance-based or margin-based losses. These two categories are also indicative of whether a loss is intended for regression or classification problems.

Loss Functions for Regression

Loss functions that belong to the category "distance-based" are primarily used in regression problems. They utilize the numeric difference between the predicted output and the true target as a proxy variable to quantify the quality of individual predictions.

distance-based losses

Loss Functions for Classification

Margin-based loss functions are particularly useful for binary classification. In contrast to the distance-based losses, these do not care about the difference between true target and prediction. Instead they penalize predictions based on how well they agree with the sign of the target.

margin-based losses

Advanced Topics

In some situations it can be useful to slightly alter an existing loss function. We provide two general ways to accomplish that. The first way is to scale a loss by a constant factor. This can, for example, be useful to transform the L2DistLoss into the least squares loss one knows from statistics. The second way is to reweight the two classes of a binary classification loss. This is useful for handling imbalanced class distributions.

If you are interested in contributing to LossFunctions.jl, or simply want to understand how and why the package does what it does, then take a look at our developer documentation (although it is a bit sparse at the moment).

Index

diff --git a/dev/indices/index.html b/dev/indices/index.html

Indices · LossFunctions.jl

diff --git a/dev/introduction/gettingstarted/index.html b/dev/introduction/gettingstarted/index.html

[...]

5.5

julia> mean(L2DistLoss(), pred_outputs, true_targets, [2,1,1], normalize=false)
1.8333333333333333

Getting Help

To get help on specific functionality you can either look up the information here, or, if you prefer, you can make use of Julia's native documentation system. The following example shows how to get additional information on L1HingeLoss within Julia's REPL:

?L1HingeLoss

If you find yourself stuck or have other questions concerning the package, you can find us on Julia's Zulip chat or in the Machine Learning domain on Discourse.

If you encounter a bug or would like to participate in the further development of this package, come find us on GitHub.

diff --git a/dev/introduction/motivation/index.html b/dev/introduction/motivation/index.html

Background and Motivation · LossFunctions.jl

Background and Motivation

In this section we will discuss the concept "loss function" in more detail. We will start by introducing some terminology and definitions. However, please note that we won't attempt to give a complete treatment of loss functions and the math involved (unlike a book or a lecture could do). So this section won't be a substitute for proper literature on the topic. While we will try to cover all the basics necessary to get a decent intuition of the ideas involved, we do assume basic knowledge about Machine Learning.

Warning

This section and its sub-sections serve solely to explain the underlying theory and concepts, and further to motivate the solution provided by this package. As such, this section is not intended as a guide on how to apply this package.

Terminology

To start off, let us go over some basic terminology. In Machine Learning (ML) we are primarily interested in automatically learning meaningful patterns from data. For our purposes it suffices to say that in ML we try to teach the computer to solve a task by induction rather than by definition. This package is primarily concerned with the subset of Machine Learning that falls under the umbrella of Supervised Learning. There we are interested in teaching the computer to predict a specific output for some given input. In contrast to unsupervised learning the teaching process here involves showing the computer what the predicted output is supposed to be; i.e. the "true answer" if you will.

How is this relevant for this package? Well, it implies that we require some meaningful way to show the true answers to the computer so that it can learn from "seeing" them. More importantly, we have to somehow put the true answer into relation to what the computer currently predicts the answer should be. This would provide the basic information needed for the computer to be able to improve; that is what loss functions are for.

When we say we want our computer to learn something that is able to make predictions, we are talking about a prediction function, denoted as $h$ and sometimes called "fitted hypothesis", or "fitted model". Note that we will avoid the term hypothesis for the simple reason that it is widely used in statistics for something completely different. We don't consider a prediction function as the same thing as a prediction model, because we think of a prediction model as a family of prediction functions. What that boils down to is that the prediction model represents the set of possible prediction functions, while the final prediction function is the chosen function that best solves the prediction problem. So in a way a prediction model can be thought of as the manifestation of our assumptions about the problem, because it restricts the solution to a specific family of functions. For example a linear prediction model for two features represents all possible linear functions that have two coefficients. A prediction function would in that scenario be a concrete linear function with a particular fixed set of coefficients.

The purpose of a prediction function is to take some input and produce a corresponding output. That output should be as faithful as possible to the true answer. In the context of this package we will refer to the "true answer" as the true target, or "target" for short. During training, and only during training, inputs and targets can both be considered part of our data set. We say "only during training" because in a production setting we don't actually have the targets available to us (otherwise there would be no prediction problem to solve in the first place). In essence we can think of our data as two entities with a 1-to-1 connection in each observation: the inputs, which we call features, and the corresponding desired outputs, which we call true targets.

Let us be a little more concrete with the two terms we really care about in this package.

  • True Targets:

    A true target (singular) represents the "desired" output for the input features of a single observation. The targets are often referred to as "ground truth" and we will denote a single target as $y \in Y$. While $y$ can be a scalar or some array, the key is that it represents the target of a single observation. When we talk about an array (e.g. a vector) of multiple targets, we will print it in bold as $\mathbf{y}$. What the set $Y$ is will depend on the subdomain of supervised learning that you are working in.

    • Real-valued Regression: $Y \subseteq \mathbb{R}$.
    • Multioutput Regression: $Y \subseteq \mathbb{R}^k$.
    • Margin-based Classification: $Y = \{1,-1\}$.
    • Probabilistic Classification: $Y = \{1,0\}$.
    • Multiclass Classification: $Y = \{1,2,\dots,k\}$.

    See MLLabelUtils for more information on classification targets.

  • Predicted Outputs:

    A predicted output (singular) is the result of our prediction function given the features of some observation. We will denote a single output as $\hat{y} \in \mathbb{R}$ (pronounced as "why hat"). When we talk about an array of outputs for multiple observations, we will print it in bold as $\mathbf{\hat{y}}$. Note something unintuitive but important: The variables $y$ and $\hat{y}$ don't have to be of the same set. Even in a classification setting where $y \in \{1,-1\}$, it is typical that $\hat{y} \in \mathbb{R}$.

    The fact that in classification the predictions can be fundamentally different from the targets is important to know. The reason for restricting the targets to specific numbers when doing classification is mathematical convenience for loss functions. So loss functions have this knowledge built in.

In a classification setting, the predicted outputs and the true targets are usually of different form and type. For example, in margin-based classification it could be the case that the target $y=-1$ and the predicted output $\hat{y} = -1000$. It would seem that the prediction is not really reflecting the target properly, but in this case we would actually have a perfectly correct prediction. This is because in margin-based classification the main thing that matters about the predicted output is that the sign agrees with the true target.
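
A tiny sketch of that point:

julia> y, ŷ = -1, -1000.0;

julia> sign(ŷ) == y    # a correct prediction, despite the large magnitude
true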

Even though we talked about prediction functions and features, we will see that for computing loss functions all we really care about are the true targets and the predicted outputs, regardless of how the outputs were produced.

Definitions

We base most of our definitions on the work presented in [STEINWART2008]. Note, however, that we will adapt or simplify in places at our discretion. We do this in situations where it makes sense to us considering the scope of this package or because of implementation details.

Let us again consider the term prediction function. More formally, a prediction function $h$ is a function that maps an input from the feature space $X$ to the real numbers $\mathbb{R}$. So invoking $h$ with some features $x \in X$ will produce the prediction $\hat{y} \in \mathbb{R}$.

\[h : X \rightarrow \mathbb{R}\]

This resulting prediction $\hat{y}$ is what we want to compare to the target $y$ in order to assess how bad the prediction is. The function we use for such an assessment belongs to a family of functions we refer to as supervised losses. We think of a supervised loss as a function of two parameters, the true target $y \in Y$ and the predicted output $\hat{y} \in \mathbb{R}$. The result of computing such a loss will be a non-negative real number. The larger the value of the loss, the worse the prediction.

\[L : \mathbb{R} \times Y \rightarrow [0,\infty)\]

Note a few interesting things about supervised loss functions.

  • The absolute value of a loss is often (but not always) meaningless and doesn't lend itself to a useful interpretation. What we usually care about is that the loss is as small as it can be.

  • In general the loss function we use is not the function we are actually interested in minimizing. Instead we are minimizing what is referred to as a "surrogate". For binary classification, for example, we are really interested in minimizing the ZeroOne loss (which simply counts the number of misclassified predictions). However, that loss is difficult to minimize given that it is neither convex nor continuous. That is why we use other loss functions, such as the hinge loss or logistic loss. Those losses are "classification calibrated", which basically means they are good enough surrogates to solve the same problem. Additionally, surrogate losses tend to have other nice properties.

  • For classification it does not need to be the case that a "correct" prediction has a loss of zero. In fact some classification calibrated losses are never truly zero.

There are two sub-families of supervised loss functions that are of particular interest, namely margin-based losses and distance-based losses. These two categories of loss functions are especially useful for the two basic sub-domains of supervised learning: Classification and Regression.

Margin-based Losses for (Binary) Classification

Margin-based losses are mainly utilized for binary classification problems where the goal is to predict a categorical value. They assume that the set of targets $Y$ is restricted to $Y = \{1,-1\}$. These two possible values for the target denote the positive class in the case of $y = 1$, and the negative class in the case of $y = -1$. In contrast to other formalisms, they do not natively provide probabilities as output.

More formally, we call a supervised loss function $L : \mathbb{R} \times Y \rightarrow [0, \infty)$ margin-based if there exists a representing function $\psi : \mathbb{R} \rightarrow [0, \infty)$ such that

\[L(\hat{y}, y) = \psi (y \cdot \hat{y}), \qquad y \in Y, \hat{y} \in \mathbb{R}\]

Note

Throughout the codebase we refer to the result of $y \cdot \hat{y}$ as agreement. The discussion that led to this convention can be found in issue #9.

Distance-based Losses for Regression

Distance-based losses are usually used in regression settings where the goal is to predict some real-valued variable. The aim is for the prediction to be as close as possible to the true target. In such a scenario it is quite sensible to penalize the distance between the prediction and the target in some way.

More formally, a supervised loss function $L : \mathbb{R} \times Y \rightarrow [0, \infty)$ is said to be distance-based, if there exists a representing function $\psi : \mathbb{R} \rightarrow [0, \infty)$ satisfying $\psi (0) = 0$ and

\[L(\hat{y}, y) = \psi (\hat{y} - y), \qquad y \in Y, \hat{y} \in \mathbb{R}\]

Note

In the literature that this package is partially based on, the convention for the distance-based losses is that $r = y - \hat{y}$ (see [STEINWART2008] p. 38). We chose to diverge from this definition because it would force a difference in sign between the results for the unary and the binary version of the derivative. That difference would be introduced by the chain rule, since the inner derivative would result in $\frac{\partial}{\partial \hat{y}} (y - \hat{y}) = -1$.
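
A short plain-Julia sketch of why our convention keeps the signs aligned:

ψ′(r) = 2r                      # derivative of the representing function of L2

dL(ŷ, y) = ψ′(ŷ - y) * 1        # chain rule; the inner derivative of ŷ - y is +1

dL(3.0, 1.0) == ψ′(3.0 - 1.0)   # true, no sign flip between unary and binary form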

Alternative Viewpoints

While the term "loss function" is usually used in the same context throughout the literature, the specifics differ from one textbook to another. For that reason we would like to mention alternative definitions of what a "loss function" is. Note that we will only give a partial and thus very simplified description of these. Please refer to the listed sources for more specifics.

In [SHALEV2014] the authors consider a loss function as a higher-order function of two parameters, a prediction model and an observation tuple. So in that definition a loss function and the prediction function are tightly coupled. This way of thinking about it makes a lot of sense, considering the process of how a prediction model is usually fit to the data. For gradient descent to do its job it needs the, well, gradient of the empirical risk. This gradient is computed using the chain rule for the inner loss and the prediction model. If one views the loss and the prediction model as one entity, then the gradient can sometimes be simplified immensely. That said, we chose to not follow this school of thought, because from a software-engineering standpoint it made more sense to us to have small modular pieces. So in our implementation the loss functions don't need to know that prediction functions even exist. This makes the package easier to maintain, test, and reason about. Given Julia's support for multiple dispatch we don't even lose the ability to simplify the gradient if need be.

References

  • [STEINWART2008]: Steinwart, Ingo, and Andreas Christmann. "Support Vector Machines". Springer, 2008.

  • [SHALEV2014]: Shalev-Shwartz, Shai, and Shai Ben-David. "Understanding Machine Learning: From Theory to Algorithms". Cambridge University Press, 2014.

diff --git a/dev/losses/distance/index.html b/dev/losses/distance/index.html

Distance-based Losses · LossFunctions.jl

Distance-based Losses

Loss functions that belong to the category "distance-based" are primarily used in regression problems. They utilize the numeric difference between the predicted output and the true target as a proxy variable to quantify the quality of individual predictions.

This section lists all the subtypes of DistanceLoss that are implemented in this package.

LPDistLoss

LossFunctions.LPDistLoss (Type)
LPDistLoss{P} <: DistanceLoss

The P-th power absolute distance loss. It is Lipschitz continuous if and only if P == 1, convex if and only if P >= 1, and strictly convex if and only if P > 1.

\[L(r) = |r|^P\]
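
Evaluated directly on a difference r, the rule looks as follows (a plain-Julia sketch; P is written as an argument purely for illustration):

ψ(r, P) = abs(r)^P

ψ(-2.0, 1)   # 2.0, the L1 case
ψ(-2.0, 2)   # 4.0, the L2 case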


L1DistLoss

LossFunctions.L1DistLoss (Type)
L1DistLoss <: DistanceLoss

The absolute distance loss. Special case of the LPDistLoss with P=1. It is Lipschitz continuous and convex, but not strictly convex.

\[L(r) = |r|\]


(plot: the L1DistLoss loss function and its derivative, as functions of ŷ - y)

L2DistLoss

LossFunctions.L2DistLoss (Type)
L2DistLoss <: DistanceLoss

The least squares loss. Special case of the LPDistLoss with P=2. It is strictly convex.

\[L(r) = |r|^2\]


              Lossfunction                     Derivative
+                 ŷ - y                            ŷ - y
source

L2DistLoss

LossFunctions.L2DistLossType
L2DistLoss <: DistanceLoss

The least squares loss. Special case of the LPDistLoss with P=2. It is strictly convex.

\[L(r) = |r|^2\]


[Plot: L2DistLoss and its derivative as functions of ŷ - y]

source
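
To see the different growth behavior of the two losses side by side, the following sketch (again assuming the callable-loss interface) broadcasts both over a few residuals:

julia> using LossFunctions

julia> r = [0.5, 1.0, 2.0, 4.0];

julia> L1DistLoss().(r, 0.0)   # linear growth in |r|
4-element Vector{Float64}:
 0.5
 1.0
 2.0
 4.0

julia> L2DistLoss().(r, 0.0)   # quadratic growth in |r|
4-element Vector{Float64}:
  0.25
  1.0
  4.0
 16.0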

LogitDistLoss

LossFunctions.LogitDistLossType
LogitDistLoss <: DistanceLoss

The distance-based logistic loss for regression. It is strictly convex and Lipschitz continuous.

\[L(r) = - \ln \frac{4 e^r}{(1 + e^r)^2}\]


[Plot: LogitDistLoss and its derivative as functions of ŷ - y]

source

HuberLoss

LossFunctions.HuberLossType
HuberLoss <: DistanceLoss

Loss function commonly used for robustness to outliers. For absolute residuals larger than the cutoff α (the parameter d) it grows only linearly, like the L1DistLoss, while for smaller residuals it is quadratic, like the L2DistLoss. It is Lipschitz continuous and convex, but not strictly convex.

\[L(r) = \begin{cases} \frac{r^2}{2} & \quad \text{if } | r | \le \alpha \\ \alpha | r | - \frac{\alpha^2}{2} & \quad \text{otherwise}\\ \end{cases}\]


[Plot: HuberLoss (d=1) and its derivative as functions of ŷ - y]

source
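
A small numeric check of the two regimes (assuming the constructor HuberLoss(d) and the callable-loss interface):

julia> using LossFunctions

julia> huber = HuberLoss(1.0);

julia> huber(0.5, 0.0)    # |r| ≤ α: quadratic regime, r²/2
0.125

julia> huber(10.0, 0.0)   # |r| > α: linear regime, α|r| - α²/2
9.5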

L1EpsilonInsLoss

LossFunctions.L1EpsilonInsLossType
L1EpsilonInsLoss <: DistanceLoss

The $ϵ$-insensitive loss. Typically used in linear support vector regression. It ignores deviations smaller than $ϵ$, but penalizes larger deviations linearly. It is Lipschitz continuous and convex, but not strictly convex.

\[L(r) = \max \{ 0, | r | - \epsilon \}\]


[Plot: L1EpsilonInsLoss (ϵ=1) and its derivative as functions of ŷ - y]

source
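
The insensitivity zone is easy to verify numerically (assuming the constructor L1EpsilonInsLoss(ϵ) and the callable-loss interface):

julia> using LossFunctions

julia> loss = L1EpsilonInsLoss(1.0);

julia> loss(0.5, 0.0)   # |r| < ϵ: inside the tube, no penalty
0.0

julia> loss(2.5, 0.0)   # |r| > ϵ: linear penalty |r| - ϵ
1.5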

L2EpsilonInsLoss

LossFunctions.L2EpsilonInsLossType
L2EpsilonInsLoss <: DistanceLoss

The quadratic $ϵ$-insensitive loss. Typically used in linear support vector regression. It ignores deviations smaller than $ϵ$, but penalizes larger deviations quadratically. It is convex, but not strictly convex.

\[L(r) = \max \{ 0, | r | - \epsilon \}^2\]


[Plot: L2EpsilonInsLoss (ϵ=0.5) and its derivative as functions of ŷ - y]

source

PeriodicLoss

LossFunctions.PeriodicLossType
PeriodicLoss <: DistanceLoss

Measures distance on a circle of specified circumference c.

\[L(r) = 1 - \cos \left( \frac{2 r \pi}{c} \right)\]

source
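
The defining property is that residuals a full circumference apart incur the same loss; a quick check (assuming the constructor PeriodicLoss(c) and the callable-loss interface):

julia> using LossFunctions

julia> loss = PeriodicLoss(2π);   # circumference c = 2π

julia> loss(0.5, 0.0) ≈ loss(0.5 + 2π, 0.0)   # shifting r by c changes nothing
true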

QuantileLoss

LossFunctions.QuantileLossType
QuantileLoss <: DistanceLoss

The distance-based quantile loss, also known as pinball loss, can be used to estimate conditional τ-quantiles. It is Lipschitz continuous and convex, but not strictly convex. Furthermore it is symmetric if and only if τ = 1/2.

\[L(r) = \begin{cases} -\left( 1 - \tau \right) r & \quad \text{if } r < 0 \\ \tau r & \quad \text{if } r \ge 0 \\ \end{cases}\]


[Plot: QuantileLoss (τ=0.7) and its derivative as functions of ŷ - y]

source
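
The asymmetry for τ ≠ 1/2 is visible directly in the two branches (assuming the constructor QuantileLoss(τ) and the callable-loss interface):

julia> using LossFunctions

julia> q = QuantileLoss(0.7);

julia> q(-1.0, 0.0)   # r < 0 is weighted by (1 - τ) = 0.3
0.3

julia> q(1.0, 0.0)    # r ≥ 0 is weighted by τ = 0.7
0.7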

LogCoshLoss

LossFunctions.LogCoshLossType
LogCoshLoss <: DistanceLoss

The log cosh loss is a twice differentiable, strictly convex, and Lipschitz continuous function.

\[L(r) = \ln \left( \cosh (r) \right)\]


[Plot: LogCoshLoss and its derivative as functions of ŷ - y]

source
Note

You may note that our definition of the QuantileLoss looks different from what one usually sees in the literature. The reason is that we have to correct for the fact that in our case $r = \hat{y} - y$ instead of $r_{\textrm{usual}} = y - \hat{y}$; that is, our residual relates to the usual one by $r = -r_{\textrm{usual}}$.


Margin-based Losses · LossFunctions.jl

[Plot: loss function and its derivative as functions of y * h(x)]

source

PerceptronLoss

LossFunctions.PerceptronLossType
PerceptronLoss <: MarginLoss

The perceptron loss linearly penalizes every prediction where the resulting agreement is <= 0. It is Lipschitz continuous and convex, but not strictly convex.

\[L(a) = \max \{ 0, -a \}\]


[Plot: PerceptronLoss and its derivative as functions of y ⋅ ŷ]

source
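
For margin losses the two arguments enter only through the agreement a = y ⋅ ŷ; a quick sketch (assuming targets coded as ±1 and the callable-loss interface):

julia> using LossFunctions

julia> p = PerceptronLoss();

julia> p(0.5, 1)    # agreement a = 0.5 > 0: correct side, no loss
0.0

julia> p(-0.5, 1)   # a = -0.5: penalized linearly
0.5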

L1HingeLoss

LossFunctions.L1HingeLossType
L1HingeLoss <: MarginLoss

The hinge loss linearly penalizes every prediction where the resulting agreement is < 1. It is Lipschitz continuous and convex, but not strictly convex.

\[L(a) = \max \{ 0, 1 - a \}\]


[Plot: L1HingeLoss and its derivative as functions of y ⋅ ŷ]

source
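
Unlike the perceptron loss, the hinge loss already penalizes correct but under-confident predictions with agreement below 1 (same interface assumptions as above):

julia> using LossFunctions

julia> L1HingeLoss()(0.5, 1)      # a = 0.5 < 1: penalized despite being correct
0.5

julia> PerceptronLoss()(0.5, 1)   # a > 0: not penalized
0.0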

SmoothedL1HingeLoss

LossFunctions.SmoothedL1HingeLossType
SmoothedL1HingeLoss <: MarginLoss

As the name suggests, a smoothed version of the L1 hinge loss. It is Lipschitz continuous and convex, but not strictly convex.

\[L(a) = \begin{cases} \frac{0.5}{\gamma} \cdot \max \{ 0, 1 - a \} ^2 & \quad \text{if } a \ge 1 - \gamma \\ 1 - \frac{\gamma}{2} - a & \quad \text{otherwise}\\ \end{cases}\]


[Plot: SmoothedL1HingeLoss (γ=2) and its derivative as functions of y ⋅ ŷ]

source

ModifiedHuberLoss

LossFunctions.ModifiedHuberLossType
ModifiedHuberLoss <: MarginLoss

A special (4 times scaled) case of the SmoothedL1HingeLoss with γ=2. It is Lipschitz continuous and convex, but not strictly convex.

\[L(a) = \begin{cases} \max \{ 0, 1 - a \} ^2 & \quad \text{if } a \ge -1 \\ - 4 a & \quad \text{otherwise}\\ \end{cases}\]


[Plot: ModifiedHuberLoss and its derivative as functions of y ⋅ ŷ]

source
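
The stated relation to the SmoothedL1HingeLoss can be verified numerically (assuming the constructor SmoothedL1HingeLoss(γ) and the callable-loss interface):

julia> using LossFunctions

julia> ŷ = -1.5:0.25:1.5;   # a grid of predictions, with target y = 1

julia> mh = ModifiedHuberLoss(); sm = SmoothedL1HingeLoss(2);

julia> all(mh.(ŷ, 1) .≈ 4 .* sm.(ŷ, 1))   # 4-times-scaled special case
true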

DWDMarginLoss

LossFunctions.DWDMarginLossType
DWDMarginLoss <: MarginLoss

The distance weighted discrimination margin loss. It is a differentiable generalization of the L1HingeLoss that is different from the SmoothedL1HingeLoss. It is Lipschitz continuous and convex, but not strictly convex.

\[L(a) = \begin{cases} 1 - a & \quad \text{if } a \ge \frac{q}{q+1} \\ \frac{1}{a^q} \frac{q^q}{(q+1)^{q+1}} & \quad \text{otherwise}\\ \end{cases}\]


[Plot: DWDMarginLoss (q=1) and its derivative as functions of y ⋅ ŷ]

source

L2MarginLoss

LossFunctions.L2MarginLossType
L2MarginLoss <: MarginLoss

The margin-based least-squares loss for classification, which quadratically penalizes every prediction where the agreement is != 1. It is locally Lipschitz continuous and strongly convex.

\[L(a) = {\left( 1 - a \right)}^2\]


[Plot: L2MarginLoss and its derivative as functions of y ⋅ ŷ]

source

L2HingeLoss

LossFunctions.L2HingeLossType
L2HingeLoss <: MarginLoss

The truncated least squares loss quadratically penalizes every prediction where the resulting agreement is < 1. It is locally Lipschitz continuous and convex, but not strictly convex.

\[L(a) = \max \{ 0, 1 - a \}^2\]


[Plot: L2HingeLoss and its derivative as functions of y ⋅ ŷ]

source

LogitMarginLoss

LossFunctions.LogitMarginLossType
LogitMarginLoss <: MarginLoss

The margin version of the logistic loss. It is infinitely many times differentiable, strictly convex, and Lipschitz continuous.

\[L(a) = \ln (1 + e^{-a})\]


[Plot: LogitMarginLoss and its derivative as functions of y ⋅ ŷ]

source

ExpLoss

LossFunctions.ExpLossType
ExpLoss <: MarginLoss

The margin-based exponential loss for classification, which penalizes every prediction exponentially. It is infinitely many times differentiable, locally Lipschitz continuous and strictly convex, but not clipable.

\[L(a) = e^{-a}\]


[Plot: ExpLoss and its derivative as functions of y ⋅ ŷ]

source

SigmoidLoss

LossFunctions.SigmoidLossType
SigmoidLoss <: MarginLoss

Continuous loss which penalizes every prediction with a loss in the range (0, 2). It is infinitely many times differentiable, Lipschitz continuous, but nonconvex.

\[L(a) = 1 - \tanh(a)\]


[Plot: SigmoidLoss and its derivative as functions of y ⋅ ŷ]

source
Other Losses · LossFunctions.jl

Other Losses

There exist loss functions that are based neither on distances nor on margins. This section lists other useful losses that are implemented in the package:

MisclassLoss

LossFunctions.MisclassLossType
MisclassLoss{R<:AbstractFloat} <: SupervisedLoss

Misclassification loss that assigns 1 for misclassified examples and 0 otherwise. It is a generalization of ZeroOneLoss for more than two classes.

The type parameter R specifies the result type of the loss. Default type is double precision R = Float64.

source

PoissonLoss

LossFunctions.PoissonLossType
PoissonLoss <: SupervisedLoss

Loss under a Poisson noise distribution (KL-divergence)

$L(output, target) = exp(output) - target*output$

source
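
A direct evaluation of the formula above (assuming the callable-loss interface):

julia> using LossFunctions

julia> PoissonLoss()(0.0, 1.0)   # exp(0) - 1⋅0
1.0

julia> PoissonLoss()(log(2), 2.0) ≈ 2 - 2*log(2)   # exp(output) - target*output
true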

CrossEntropyLoss

LossFunctions.CrossEntropyLossType
CrossEntropyLoss <: SupervisedLoss

The cross-entropy loss is defined as:

$L(output, target) = - target*log(output) - (1-target)*log(1-output)$

source

      Computing the 1st Derivatives

Perhaps the most interesting aspect of loss functions is their derivatives. In fact, most popular learning algorithms in supervised learning, such as gradient descent, utilize the derivatives of the loss in one way or another during the training process.

To compute the derivative of some loss we expose the function deriv. It is worth noting explicitly that we always compute the derivative with respect to the predicted output, since we are interested in deducing in which direction the output should change.

Missing docstring for deriv. Check Documenter's build log for details.
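
Assuming deriv follows the signature deriv(loss, output, target) suggested by the surrounding text, a short example:

julia> using LossFunctions

julia> deriv(L2DistLoss(), 3.0, 1.0)   # d/dŷ (ŷ - y)² = 2(ŷ - y)
4.0

julia> deriv(L1DistLoss(), 3.0, 1.0)   # sign(ŷ - y)
1.0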

      Computing the 2nd Derivatives

In addition to the first derivative, we also provide the corresponding methods for the second derivative through the function deriv2. Note again that we always compute the derivative with respect to the predicted output.

Missing docstring for deriv2. Check Documenter's build log for details.
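
Analogously for the second derivative (same signature assumption, deriv2(loss, output, target)):

julia> using LossFunctions

julia> deriv2(L2DistLoss(), 3.0, 1.0)   # curvature of (ŷ - y)² is constant
2.0

julia> deriv2(L1DistLoss(), 3.0, 1.0)   # piecewise-linear loss: zero curvature almost everywhere
0.0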

      Properties of a Loss

In some situations it can be quite useful to assert certain properties about a loss function. One such scenario could be when implementing an algorithm that requires the loss to be strictly convex or Lipschitz continuous. Note that we will only skim over the definitions in most cases. A good treatment of all the concepts involved can be found in either [BOYD2004] or [STEINWART2008].

This package uses functions to represent individual properties of a loss. The following is the list of implemented property functions, as defined in LearnBase.jl.

Missing docstrings for isdistancebased, ismarginbased, isminimizable, isdifferentiable, istwicedifferentiable, isconvex, isstrictlyconvex, isstronglyconvex, isnemitski, isunivfishercons, isfishercons, islipschitzcont, islocallylipschitzcont, isclipable, isclasscalibrated, and issymmetric. Check Documenter's build log for details.
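
Since the docstrings above failed to build, here is a brief sketch of how such property queries typically look (assuming these predicates are exported by the package version at hand):

julia> using LossFunctions

julia> isconvex(L1HingeLoss()), isstrictlyconvex(L1HingeLoss())
(true, false)

julia> islipschitzcont(L2DistLoss())   # the slope of r² is unbounded on ℝ
false

julia> ismarginbased(LogitMarginLoss()), isdistancebased(LogitMarginLoss())
(true, false)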
