Document what to do at nondifferentiable points #419

Merged
oxinabox merged 17 commits into main from ox/subgrad_convention
Nov 10, 2021

Conversation

Member

@oxinabox oxinabox commented Jul 27, 2021

Closes #404

This is based on a conversation @awf and I had at RSE-Conf 2018.
We have, to some extent, been following it in ChainRules.jl since some time before then.

So here it is, written down more formally.

Here is the Docs Preview.
Feedback is appreciated.

@oxinabox oxinabox added the documentation Improvements or additions to documentation label Jul 27, 2021
@JuliaDiff JuliaDiff deleted a comment from codecov-commenter Jul 27, 2021
@JuliaDiff JuliaDiff deleted a comment from codecov-commenter Jul 27, 2021
@JuliaDiff JuliaDiff deleted a comment from codecov-commenter Jul 27, 2021

codecov-commenter commented Jul 27, 2021

Codecov Report

Merging #419 (80be21e) into main (99d56b1) will increase coverage by 0.11%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #419      +/-   ##
==========================================
+ Coverage   92.91%   93.03%   +0.11%     
==========================================
  Files          15       15              
  Lines         819      833      +14     
==========================================
+ Hits          761      775      +14     
  Misses         58       58              
| Impacted Files | Coverage Δ |
|---|---|
| src/rule_definition_tools.jl | 96.27% <0.00%> (+0.02%) ⬆️ |
| src/projection.jl | 97.47% <0.00%> (+0.07%) ⬆️ |
| src/accumulation.jl | 97.22% <0.00%> (+0.07%) ⬆️ |
| src/tangent_types/thunks.jl | 95.00% <0.00%> (+0.10%) ⬆️ |
| src/tangent_types/tangent.jl | 85.50% <0.00%> (+0.32%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sethaxen sethaxen self-requested a review July 29, 2021 10:36
@oxinabox oxinabox changed the title Document sub/super-differential convention Document what to do at nondifferentiable points Nov 9, 2021
Member Author

oxinabox commented Nov 9, 2021

I have dropped almost all the sub/super stuff and just stuck to lots of practical examples with discussion.
That seems more useful to most people.
IIRC this was @MasonProtter's suggestion.
I think it is better now.

@oxinabox oxinabox force-pushed the ox/subgrad_convention branch from 74e52b0 to 35f8633 Compare November 9, 2021 18:22

This has a number of advantages.
- It follows the rule that derivatives are zero at local minima (and maxima).
- If you leave a gradient decent optimizer running it will eventually actually converge absolutely to the point -- where as with it being 1 or -1 it would never outright converge it would always flee.
Contributor

Suggested change
- If you leave a gradient decent optimizer running it will eventually actually converge absolutely to the point -- where as with it being 1 or -1 it would never outright converge it would always flee.
- If you leave a gradient decent optimizer running it will eventually actually converge absolutely to the point -- where as with it being 1 or -1 it would never outright converge it would always flee.

The word "flee" is evocative, but maybe a little confusing here. Maybe we could say "oscillate" or "wobble" instead.
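
To make the "oscillate" wording concrete, here is a minimal sketch in plain Julia of fixed-step gradient descent on `abs` under the two conventions discussed in the quoted text. The names `dabs_zero`, `dabs_one`, and `descend`, and the step size of 0.5, are assumptions made up for this illustration; none of them are part of ChainRules.jl.

```julia
# Hypothetical helpers for illustration only (not ChainRules.jl API).
# Two candidate conventions for the derivative of `abs` at zero.
dabs_zero(x) = sign(x)                # gives 0 at x == 0
dabs_one(x)  = x >= 0 ? 1.0 : -1.0    # gives 1 at x == 0

# Plain fixed-step gradient descent on `abs`, returning every iterate.
function descend(dabs, x0; step=0.5, iters=6)
    xs = [x0]
    for _ in 1:iters
        push!(xs, xs[end] - step * dabs(xs[end]))
    end
    return xs
end

xs_zero = descend(dabs_zero, 1.0)  # reaches 0.0 and then stays there
xs_one  = descend(dabs_one, 1.0)   # reaches 0.0, then bounces between -0.5 and 0.0
```

With the zero convention the iterates settle at the minimum; with a derivative of 1 at zero they keep stepping away and coming back, which is the behaviour "oscillate" describes. The starting point and step size were chosen so the iterates land exactly on zero.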

```

We do not have to worry about what to return for the side where it is not defined.
As we will never be asked for the derivative at e.g. `x=-2.5` since the primal function errors.
Contributor

Just a comment, and I'm not sure it's important, but the primal won't error if we make the argument complex. In that case there's the interesting issue of the branch cut.

- If the derivative from one side is finite and the other isn't, say it is the derivative taken from finite side.
- When derivative from each side is not equal, strongly consider reporting the average

Our goal as always, is to get a pragmatically useful result for everyone, which must by necessity also avoid a pathological result for anyone.
Contributor

Maybe worth mentioning that we can't always get the result that's best for literally everyone, but we sometimes just have to do our best.

Member

@mzgubic mzgubic left a comment

Content looks good, and is definitely a useful addition.

My only suggestion would be to split this in two: keep the short-version paragraph at the top in "writing good rules", and link to the rest of the text, which IMO belongs in the "maths" section.

@oxinabox
Member Author

The "writing good rules" section is too long.
I think we can skip having a short summary of this there (for now at least).
People who need to answer this question know they need to, and so can look it up.

But I will move this under math.

@oxinabox oxinabox merged commit 4d27d7e into main Nov 10, 2021
@oxinabox oxinabox deleted the ox/subgrad_convention branch November 10, 2021 13:36

@awf awf left a comment

Sorry, it's annoying to add comments after you've already merged. I'm happy to do a PR instead if that's easier.


This has a number of advantages.
- It follows the rule that derivatives are zero at local minima (and maxima).
- If you leave a gradient decent optimizer running it will eventually actually converge absolutely to the point -- where as with it being 1 or -1 it would never outright converge it would always flee.

descent


The other option for `x->ceil(x)` would be relax the problem into `x->x`, and thus say it is 1 everywhere
But that it too weird, if the use wanted a relaxation of the problem then they would provide one.
We can not be imposing that relaxation on to `ceil` for everyone is not reasonable.

We can not be imposing that relaxation on to ceil for everyone

or

Imposing that relaxation on to ceil for everyone is not reasonable.


We do not have to worry about what to return for the side where it is not defined.
As we will never be asked for the derivative at e.g. `x=-2.5` since the primal function errors.
But we do need to worry about at the boundary -- if that boundary point doesn't error.

Maybe replace with

But we do need to worry about at the boundary. The function is defined for x=0 (because exp is defined at -Inf), but AD will return <what will it return? Is it NaN?>

As we will never be asked for the derivative at e.g. `x=-2.5` since the primal function errors.
But we do need to worry about at the boundary -- if that boundary point doesn't error.

Since we will never be asked about the left-hand side (as the primal errors), we can use just the right-hand side derivative.

Repetition of lines 95-96.

But this is more or less the same as choosing some large value -- in this case an extremely large value that will rapidly overflow.


### Derivative on-finite and different on both sides

non-finite

```

In this example, the primal is defined and finite, so we would like a derivative to defined.
We are back in the case of a local minimal like we were for `abs`.

minimum
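
As a rough numerical illustration of why this example counts as a local minimum with non-finite one-sided derivatives, the sketch below estimates the left and right slopes of `x -> sign(x) * cbrt(x)` at zero with finite differences. The helper names `left_slope` and `right_slope` and the chosen step sizes are assumptions for this illustration, not anything from the docs under review.

```julia
# The function from the quoted example; it equals abs(x)^(1/3).
f(x) = sign(x) * cbrt(x)

# One-sided finite-difference slopes at zero (hypothetical helper names).
right_slope(h) = (f(h) - f(0.0)) / h
left_slope(h)  = (f(-h) - f(0.0)) / (-h)

# As h shrinks, the right slope grows without bound and the left slope
# falls without bound, so neither one-sided derivative is finite.
for h in (1e-3, 1e-6, 1e-9)
    println("h = ", h, "  left = ", left_slope(h), "  right = ", right_slope(h))
end
```

The two slopes have opposite signs and diverge, with the function value lowest at zero. That matches the reading in the quoted text that this is the same situation as `abs`, where the rough rule "say the derivative is 0 at local optima" would apply.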

```julia
plot(x-> sign(x) * cbrt(x))
```

In this example, the primal is defined and finite, so we would like a derivative to defined.

to be defined

From the case studies a few general rules can be seen for how to choose a value that is _useful_.
These rough rules are:
- Say the derivative is 0 at local optima
- If the derivative from one side is defined and the other isn't, say it is the derivative taken from defined side.

taken from the

These rough rules are:
- Say the derivative is 0 at local optima
- If the derivative from one side is defined and the other isn't, say it is the derivative taken from defined side.
- If the derivative from one side is finite and the other isn't, say it is the derivative taken from finite side.

taken from the

- Say the derivative is 0 at local optima
- If the derivative from one side is defined and the other isn't, say it is the derivative taken from defined side.
- If the derivative from one side is finite and the other isn't, say it is the derivative taken from finite side.
- When derivative from each side is not equal, strongly consider reporting the average

I'm kind of inclined to remove "strongly"

@oxinabox
Member Author

@awf your comments look good to me, please do make a PR.

Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document subgradient convention
8 participants