Document what to do at nondifferentiable points #419
Codecov Report
@@ Coverage Diff @@
## main #419 +/- ##
==========================================
+ Coverage 92.91% 93.03% +0.11%
==========================================
Files 15 15
Lines 819 833 +14
==========================================
+ Hits 761 775 +14
Misses 58 58
I have dropped almost all the sub/super stuff and just stuck to lots of practical examples with discussion.
Co-authored-by: Mason Protter <[email protected]>
Co-authored-by: Miha Zgubic <[email protected]>
74e52b0 to 35f8633
docs/src/nondiff_points.md
Outdated

This has a number of advantages.
- It follows the rule that derivatives are zero at local minima (and maxima).
- If you leave a gradient decent optimizer running it will eventually actually converge absolutely to the point -- where as with it being 1 or -1 it would never outright converge it would always flee.
The word "flee" is evocative, but maybe a little confusing here. Maybe instead we could say "oscillate" or "wobble".
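To make the "oscillate" wording concrete, here is a minimal sketch, assuming plain fixed-step gradient descent on `abs`. The names `dabs_zero`, `dabs_one` and `descend` are hypothetical helpers written for this illustration; they are not part of ChainRules.jl.

```julia
# Two hypothetical conventions for the derivative of abs at the kink x = 0.
dabs_zero(x) = x == 0 ? 0.0 : sign(x)   # reports 0 at the kink
dabs_one(x)  = x >= 0 ? 1.0 : -1.0      # reports the right-hand derivative at the kink

# Plain fixed-step gradient descent; returns the whole trajectory.
function descend(dfdx, x0; step=0.5, iters=6)
    xs = [x0]
    for _ in 1:iters
        push!(xs, xs[end] - step * dfdx(xs[end]))
    end
    return xs
end

traj_zero = descend(dabs_zero, 1.0)  # reaches 0.0 and then stays there
traj_one  = descend(dabs_one, 1.0)   # reaches 0.0, steps away to -0.5, returns, and repeats
```

With the derivative reported as 0 at the kink the iterate stops moving once it lands on the minimum; with 1 it keeps bouncing between `0.0` and `-0.5`. The step size and starting point were chosen so the iterate lands exactly on 0; other choices may never hit the kink at all.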
docs/src/nondiff_points.md
Outdated

We do not have to worry about what to return for the side where it is not defined.
As we will never be asked for the derivative at e.g. `x=-2.5` since the primal function errors.
Just a comment, not sure if it's important but the primal won't error if we make the argument complex. And in that case there's the interesting issue of the branch cut.
docs/src/nondiff_points.md
Outdated
- If the derivative from one side is finite and the other isn't, say it is the derivative taken from finite side.
- When derivative from each side is not equal, strongly consider reporting the average

Our goal as always, is to get a pragmatically useful result for everyone, which must by necessity also avoid a pathological result for anyone.
Maybe worth mentioning that we can't always get the result that's best for literally everyone, but we sometimes just have to do our best.
Content looks good, and is definitely a useful addition.
My only suggestion would be to separate this in two: have the short version paragraph at the top in "writing good rules" and link to the rest of the text, which IMO belongs in the "maths" section.
The writing good rules section is too long. But I will move this under math.
Sorry annoying to add comments after you've already merged. I'm happy to do a PR instead if that's easier.

This has a number of advantages.
- It follows the rule that derivatives are zero at local minima (and maxima).
- If you leave a gradient decent optimizer running it will eventually actually converge absolutely to the point -- where as with it being 1 or -1 it would never outright converge it would always flee.
descent

The other option for `x->ceil(x)` would be relax the problem into `x->x`, and thus say it is 1 everywhere
But that it too weird, if the use wanted a relaxation of the problem then they would provide one.
We can not be imposing that relaxation on to `ceil` for everyone is not reasonable.
We can not be imposing that relaxation on to `ceil` for everyone
or
Imposing that relaxation on to `ceil` for everyone is not reasonable.

We do not have to worry about what to return for the side where it is not defined.
As we will never be asked for the derivative at e.g. `x=-2.5` since the primal function errors.
But we do need to worry about at the boundary -- if that boundary point doesn't error.
Maybe replace with:
But we do need to worry about at the boundary. The function is defined for `x=0` (because `exp` is defined at `-Inf`), but AD will return <what will it return? Is it NaN?>
As we will never be asked for the derivative at e.g. `x=-2.5` since the primal function errors.
But we do need to worry about at the boundary -- if that boundary point doesn't error.

Since we will never be asked about the left-hand side (as the primal errors), we can use just the right-hand side derivative.
Repetition of line 95-96.
But this is more or less the same as choosing some large value -- in this case an extremely large value that will rapidly overflow.

### Derivative on-finite and different on both sides
non-finite

In this example, the primal is defined and finite, so we would like a derivative to defined.
We are back in the case of a local minimal like we were for `abs`.
minimum
`plot(x-> sign(x) * cbrt(x))`

In this example, the primal is defined and finite, so we would like a derivative to defined.
to be defined
From the case studies a few general rules can be seen for how to choose a value that is _useful_.
These rough rules are:
- Say the derivative is 0 at local optima
- If the derivative from one side is defined and the other isn't, say it is the derivative taken from defined side.
taken from the
These rough rules are:
- Say the derivative is 0 at local optima
- If the derivative from one side is defined and the other isn't, say it is the derivative taken from defined side.
- If the derivative from one side is finite and the other isn't, say it is the derivative taken from finite side.
taken from the
- Say the derivative is 0 at local optima
- If the derivative from one side is defined and the other isn't, say it is the derivative taken from defined side.
- If the derivative from one side is finite and the other isn't, say it is the derivative taken from finite side.
- When derivative from each side is not equal, strongly consider reporting the average
I'm kind of inclined to remove "strongly"
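For reference while reviewing the wording of these rules, here is one way they could be read as code. This is only a sketch under stated assumptions: `pick_derivative` is a hypothetical helper (not a ChainRules.jl function), `NaN` is used here to stand for "not defined on that side", and the order in which the rules are checked is my own choice rather than something the doc specifies.

```julia
# Hypothetical helper: pick a derivative to report at a point, given the
# one-sided derivatives. NaN encodes "undefined on that side" (an assumption).
function pick_derivative(left, right)
    # One side undefined: use the derivative from the defined side.
    isnan(left)  && return right
    isnan(right) && return left
    # Sign change across the point: a local optimum, so report 0.
    (left < 0 < right || left > 0 > right) && return zero(right)
    # One side non-finite: use the derivative from the finite side.
    isfinite(left)  && !isfinite(right) && return left
    isfinite(right) && !isfinite(left)  && return right
    # Otherwise report the average of the two sides.
    return (left + right) / 2
end

d_abs     = pick_derivative(-1.0, 1.0)   # kink of abs at 0
d_edge    = pick_derivative(NaN, 1.0)    # primal undefined to the left
d_cusp    = pick_derivative(-Inf, Inf)   # cusp that is a local minimum
d_average = pick_derivative(0.0, 1.0)    # unequal finite sides, no optimum
```

Under this reading the `abs` kink and the cusp both give `0.0`, the one-sided case gives `1.0`, and the last case gives the average `0.5`.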
@awf your comments look good to me, please do make a PR.
Closes #404
This is based on a conversation @awf and I had at RSE-Conf 2018.
We have, to some extent, been following it in ChainRules.jl since some time before then.
So here it is written down more formally.
Here is the Docs Preview
Feedback is appreciated.