
Sinkhorn Divergence #92

Conversation

davibarreira
Member

Implemented the function sinkhorn_divergence, which is an unbiased version of sinkhorn and is also a metric on the space of probability distributions. This function is similar to ot.bregman.empirical_sinkhorn_divergence from POT.py.

The tests required the use of PyCall, because ot.bregman.empirical_sinkhorn_divergence is not supported on PythonOT.jl.
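For context, a minimal usage sketch of the proposed function, assuming the (c, μ, ν, ε) argument ordering drafted in this PR (the exact signature is still under discussion below; the distributions and ε are illustrative):

```julia
using OptimalTransport
using Distributions: DiscreteNonParametric
using Distances: sqeuclidean

N = 100
μ = DiscreteNonParametric(rand(N), ones(N) / N)
ν = DiscreteNonParametric(rand(N), ones(N) / N)

# Debiased entropic OT cost; should be ≥ 0 and (up to numerics) zero iff μ == ν.
S = sinkhorn_divergence(sqeuclidean, μ, ν, 0.1)
```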

@coveralls

coveralls commented Jun 2, 2021

Pull Request Test Coverage Report for Build 899735377

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.


  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 96.133%

Totals Coverage Status
  • Change from base Build 897834664: +0.06%
  • Covered Lines: 348
  • Relevant Lines: 362

💛 - Coveralls

@devmotion
Member

The tests required the use of PyCall, because ot.bregman.empirical_sinkhorn_divergence is not supported on PythonOT.jl

It would be good to add it to PythonOT.

@davibarreira
Member Author

The tests required the use of PyCall, because ot.bregman.empirical_sinkhorn_divergence is not supported on PythonOT.jl

It would be good to add it to PythonOT.

I'll see how this can be done.

@davibarreira
Member Author

Submitted a PR to the PythonOT.jl package. Once that PR goes through, I'll update this PR to remove the PyCall dependency from the tests.


The Sinkhorn Divergence is computed as:
```math
S_{c,ε}(μ,ν) := OT_{c,ε}(μ,ν) - \\frac{1}{2}(OT_{c,ε}(μ,μ) + OT_{c,ε}(ν,ν)),
```
Member

Could we use \\operatorname{OT} here instead?

Member

Same for \\operatorname{S}

Member Author

Yeah, better to add the \\operatorname. Will do.
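For reference, the docstring formula with both requested changes applied would read (keeping the doubled backslashes used inside Julia docstrings):

```math
\\operatorname{S}_{c,ε}(μ,ν) := \\operatorname{OT}_{c,ε}(μ,ν) - \\frac{1}{2}(\\operatorname{OT}_{c,ε}(μ,μ) + \\operatorname{OT}_{c,ε}(ν,ν)),
```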

regularization=regularization,
kwargs...,
)
return max(0, OTμν - 0.5 * (OTμμ + OTνν))
Member

Is there a reason we take max(0, ...) here?

Member Author

To guarantee that the value is greater than or equal to 0. The same is also done in POT.py.

Member

Makes sense. It doesn't come from the 'math' though; I assume any negative values would be due to numerical issues (i.e. they would be very small)?

Member Author (@davibarreira, Jun 2, 2021)

Correct. In theory this is a "proper" metric (symmetric, zero iff x = y, triangle inequality), so it would never be negative.

See also: [`sinkhorn2`](@ref)
"""
function sinkhorn_divergence(
c,
Member

Could we stick to the calling convention of sinkhorn here, i.e. (\mu, \nu, C, \varepsilon) in that order?

Member Author

I thought we were going to revert all functions to (c,\mu,\nu ...), like the implementations of ot_cost and ot_plan. Also, I think we should use lowercase c for the cost function, and uppercase C for cost matrix.

Member (@zsteve, Jun 2, 2021)

Ah I see. Makes sense, which PR is this from? Sorry I've not kept up with all of them :P

Member Author

Not your fault at all! I thought we had settled on it in issue #63, but now I see it's not there. @devmotion and I probably discussed it in one of the multivariate normal or 1D implementation PRs (which have close to a hundred comments each). Things are moving so fast in this package that I actually don't know where it is.
I'll comment in issue #63 so we can make a decision. If I remember correctly, it was a suggestion from @devmotion, based on the convention that arguments which are functions are usually declared first (I'm guessing this is a Julia standard or something).

Member

Yes, this is a standard mentioned in the Julia documentation on the ordering of function arguments, since it allows one to use the do syntax:

map(rand(20)) do x
    return x^2
end

instead of

map(x -> x^2, rand(20))

which is very convenient for longer and more complicated functions.

See also: [`sinkhorn2`](@ref)
"""
function sinkhorn_divergence(
c,
Member

Could we implement this with the cost matrices as inputs instead? I think this would have multiple advantages:

  • it would allow reusing precomputed cost matrices, here and in other functions
  • it would be guaranteed to work on the GPU (BTW GPU tests should be added if possible), whereas the default implementation of pairwise probably doesn't for most distances and custom functions (IIRC it uses containers and indexing patterns that should be avoided on GPUs and are often disabled by users)

Member

Additionally, this would still allow defining the function implemented here by just forwarding it to the one with cost matrices, computed from the DiscreteNonParametric supports and the cost function.

Member

BTW it also seems quite restrictive to only allow univariate measures here.

Member Author

Yeah, the univariate restriction was a mistake; I forgot that DiscreteNonParametric is only for the univariate case. I could implement it with cost matrices, but it would actually require 3 matrices, so it would be quite "ugly". It would be something like:
sinkhorn_divergence(mu,nu,C1,C2,C3,eps). So I think we should stick with the cost function... I don't know much about GPU programming, but don't you think we could adapt this somehow?

Member Author

I could implement another sinkhorn_divergence that would take the cost matrices as arguments. We could perhaps indicate that this version should be used if the user is interested in using the GPU.

Member

You can just define both and forward the call from the one with the cost function to the other:

Additionally, this would still allow defining the function implemented here by just forwarding it to the one with cost matrices, computed from the DiscreteNonParametric supports and the cost function.
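A minimal sketch of the forwarding pattern being suggested, assuming the (c, μ, ν, ε) ordering drafted in this PR and reusing the sinkhorn2 and pairwise calls already shown in the diff; the cost-matrix method's name and argument order are illustrative, not final:

```julia
# Cost-function front end: builds the three cost matrices and forwards.
# (Assumes the package context: `sinkhorn2` and `pairwise` as used elsewhere in this PR.)
function sinkhorn_divergence(c, μ, ν, ε; kwargs...)
    Cμν = pairwise(c, μ.support, ν.support)
    Cμμ = pairwise(c, μ.support; symmetric=true)
    Cνν = pairwise(c, ν.support; symmetric=true)
    return sinkhorn_divergence(μ, ν, Cμν, Cμμ, Cνν, ε; kwargs...)
end

# Cost-matrix back end: reusable with precomputed matrices (e.g. on the GPU).
function sinkhorn_divergence(μ, ν, Cμν, Cμμ, Cνν, ε; kwargs...)
    OTμν = sinkhorn2(μ.p, ν.p, Cμν, ε; kwargs...)
    OTμμ = sinkhorn2(μ.p, μ.p, Cμμ, ε; kwargs...)
    OTνν = sinkhorn2(ν.p, ν.p, Cνν, ε; kwargs...)
    return max(0, OTμν - (OTμμ + OTνν) / 2)
end
```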

Member

this version should be used if the user is interested in using the GPU.

It could be useful without GPUs as well, e.g., if you evaluate sinkhorn, sinkhorn2, or other regularized OT distances with the same cost matrices.

In fact, it could also be useful to allow to specify pre-computed plans for the different sinkhorn2 calls.

Member Author

Convinced.

μ::DiscreteNonParametric,
ν::DiscreteNonParametric,
ε;
regularization=false,
Member

This can be removed since it's the default in sinkhorn2:

Suggested change:
- regularization=false,


ν.p,
pairwise(c, μ.support, ν.support),
ε;
regularization=regularization,
Member

Suggested change:
- regularization=regularization,

μ.p,
pairwise(c, μ.support; symmetric=true),
ε;
regularization=regularization,
Member

Suggested change:
- regularization=regularization,

ν.p,
pairwise(c, ν.support; symmetric=true),
ε;
regularization=regularization,
Member

Suggested change:
- regularization=regularization,

regularization=regularization,
kwargs...,
)
return max(0, OTμν - 0.5 * (OTμμ + OTνν))
Member

We should avoid unwanted promotions (e.g. when working with Float32 on GPUs):

Suggested change:
- return max(0, OTμν - 0.5 * (OTμμ + OTνν))
+ return max(0, OTμν - (OTμμ + OTνν) / 2)

Intuitively I would have thought that the divergence can be negative if the regularization terms are included (BTW does one actually want to include them)?
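On the promotion point: multiplying by the Float64 literal 0.5 promotes Float32 inputs to Float64, while dividing by the integer 2 preserves the element type. A quick check:

```julia
x = 1.5f0          # Float32
typeof(0.5 * x)    # Float64 — unwanted promotion
typeof(x / 2)      # Float32 — type preserved
```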

Member Author (@davibarreira, Jun 2, 2021)

Yeah, if the regularization terms are added it could be negative, so you are correct that we should not allow the regularization keyword. For the Sinkhorn divergence one should probably never include the regularization terms, because the goal here is to turn the entropic OT cost into a "proper" metric on the space of probability distributions. Besides, Feydy (in the paper I cited) proved some neat properties of this metric, such as the fact that it also metrizes weak convergence.
TL;DR: I don't think one would ever want to include the regularization terms in the Sinkhorn divergence.

For sinkhorn2, I saw one paper where the author suggested that the cost plus regularization behaves better than the cost alone, although it can be negative, so it loses some interpretability. I usually say "Sinkhorn loss" when I add the regularization and "Sinkhorn cost" or "Sinkhorn distance" when I remove it, but this is personal and not generally adopted in the literature (it probably should be, because people use the terms interchangeably and it becomes very confusing).
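For reference, the distinction being drawn here is between the full regularized objective and the transport cost alone; conventions for the entropic penalty Ω vary between references (Feydy et al. use KL(γ ‖ μ ⊗ ν)):

```math
\operatorname{OT}_{c,ε}(μ,ν) = \min_{γ ∈ Π(μ,ν)} \langle γ, C \rangle + ε\,Ω(γ),
```

versus ⟨γ*, C⟩ evaluated at the optimal plan γ*, i.e. the value with the regularization terms dropped.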

Member

Thanks for the detailed explanation! This confirms my intuition, so let's just remove the regularization keyword argument. Maybe we should even check that kwargs... does not contain a regularization keyword argument and otherwise either throw an error or (maybe better) drop it and display a warning with @warn. Maybe something like

....; regularization=nothing, kwargs...)
    if regularization !== nothing
        @warn "`sinkhorn_divergence` does not support the `regularization` keyword argument"
    end

test/entropic.jl
@testset "example" begin
# create distributions
N = 100
μ = DiscreteNonParametric(rand(N), ones(N) / N)
Member

Can we use random histograms?

Member Author

Yeah. I'll update.
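A hedged sketch of the "random histograms" variant: keep the random support, but draw non-uniform random weights and normalize them (names are illustrative):

```julia
using Distributions: DiscreteNonParametric

N = 100
μprobs = rand(N)
μprobs ./= sum(μprobs)                      # random histogram instead of uniform weights
μ = DiscreteNonParametric(rand(N), μprobs)
```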

@davibarreira
Member Author

davibarreira commented Jun 3, 2021

So, this current push is actually too much code for a single PR. I intend to break it down into other PRs, but I thought it would be good to submit the whole thing here so you can see how these new functions tie into the rest of the package.
Besides the sinkhorn_divergence function, I also implemented the struct FiniteDiscreteMeasure and the cost_matrix function. I'll split each one into a separate PR if we decide they are appropriate for the package.

FiniteDiscreteMeasure is a struct that lets us define our empirical measures and consistently write functions such as sinkhorn2(c, mu, nu) or sinkhorn2(C, mu, nu). Right now the package sometimes refers to mu and nu as only the support and sometimes as actual measures, so the new struct would clear things up.

The cost_matrix is just a helper function to construct cost matrices easily. It lets us create versions of functions such as sinkhorn2 that take a cost function instead of a cost matrix (similar to what I have done with sinkhorn_divergence).

@devmotion and @zsteve, what do you think of these two new additions? Should I submit a PR for each one?

I already wrote the tests for all these functions, so if you prefer, we can actually just review everything in this PR.
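For readers following along, a hypothetical sketch of the FiniteDiscreteMeasure idea described above (the actual design is left to its own PR; the discretemeasure constructor mirrors the usage that appears in the test excerpts below):

```julia
# Hypothetical sketch — not the final design.
struct FiniteDiscreteMeasure{X<:AbstractVector,P<:AbstractVector{<:Real}}
    support::X   # support points (scalars or vectors, i.e. possibly multivariate)
    p::P         # probability weights
end

function discretemeasure(
    support::AbstractVector,
    p::AbstractVector{<:Real}=fill(1 / length(support), length(support)),
)
    length(support) == length(p) ||
        throw(ArgumentError("support and probabilities must have the same length"))
    sum(p) ≈ 1 || throw(ArgumentError("probabilities must sum to 1"))
    return FiniteDiscreteMeasure(support, p)
end
```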

@devmotion
Member

I think it's a good idea to move these two additions to separate PRs, then it's easier to discuss and polish them.

I also think that probably a FiniteDiscreteMeasure is useful, I have some additional suggestions but it's easier to take this in the corresponding PR.

I am a bit less convinced by cost_matrix, it seems to introduce some type instabilities but probably this could be fixed. I don't think it should be exported and become part of the official API though, but also here it's probably best to discuss this in a separate PR 🙂

@davibarreira
Member Author

The error in the check seems to be unrelated to the PR.

Comment on lines 511 to 512
μ::Union{FiniteDiscreteMeasure, DiscreteNonParametric},
ν::Union{FiniteDiscreteMeasure, DiscreteNonParametric},
Member

It's not really what we support, maybe just omit the type or add two docstrings (one basically referring to the other)?

c,
μ::Union{FiniteDiscreteMeasure, DiscreteNonParametric},
ν::Union{FiniteDiscreteMeasure, DiscreteNonParametric},
ε; regularization=false, plan=nothing, kwargs...
Member

It would be useful if we restructured sinkhorn and introduced the different algorithms, as proposed in the other PR. Then users could also choose the stabilized algorithm or epsilon scaling here.

Comment on lines 564 to 565
μ::Union{FiniteDiscreteMeasure, DiscreteNonParametric},
ν::Union{FiniteDiscreteMeasure, DiscreteNonParametric},
Member

Same here.

Comment on lines 211 to 218
for (ε, metrics) in Iterators.product(
[0.1, 1.0, 10.0],
[
(sqeuclidean, SqEuclidean()),
(euclidean, Euclidean()),
(totalvariation, TotalVariation()),
],
)
Member

This is a bit uncommon; usually one would just use two for loops here. One can even use the short notation

for x in xs, y in ys
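Applied to the excerpt above, the suggested rewrite would read roughly as follows (a sketch; the test body is unchanged and assumes the Distances names already imported by the tests):

```julia
costs = (
    (sqeuclidean, SqEuclidean()),
    (euclidean, Euclidean()),
    (totalvariation, TotalVariation()),
)
for ε in (0.1, 1.0, 10.0), (c, metric) in costs
    # ... test body unchanged ...
end
```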

μ = OptimalTransport.discretemeasure(μsupp, μprobs)
ν = OptimalTransport.discretemeasure(νsupp)

for (ε, metrics) in Iterators.product(
Member

Same here.

@@ -0,0 +1,55 @@
using Distributions: DiscreteNonParametric
Member

This file seems unrelated to the PR?

Co-authored-by: David Widmann <[email protected]>
@davibarreira
Member Author

Took a break from the computer for a couple of days. Made the requested changes.

@codecov-commenter

Codecov Report

Merging #92 (e261025) into master (f03b05c) will decrease coverage by 0.11%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #92      +/-   ##
==========================================
- Coverage   96.25%   96.14%   -0.12%     
==========================================
  Files          11       12       +1     
  Lines         561      570       +9     
==========================================
+ Hits          540      548       +8     
- Misses         21       22       +1     
Impacted Files Coverage Δ
src/entropic/sinkhorn_divergence.jl 88.88% <88.88%> (ø)


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f03b05c...e261025.

@davibarreira
Member Author

I think this one is finally ready to go.

@davibarreira
Member Author

@devmotion or @zsteve, whenever you guys can, please review this PR.
