-
-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate Flux.Optimisers and implicit parameters in favour of Optimisers.jl and explicit parameters #1986
Comments
Thoughts on copying or moving parts of https://github.com/FluxML/Flux.jl/blob/v0.13.3/docs/src/training/optimisers.md#scheduling-optimisers over to the Optimisers.jl docs? |
I think that this issue might be a blocker for explicit parameters. For example: julia> gradient(bn -> sum(bn(randn(Float32, 3,1))), BatchNorm(3))
((λ = nothing, β = Fill(1.0f0, 3), γ = Float32[0.0, 0.0, 0.0], μ = nothing, σ² = nothing, ϵ = -0.0f0, momentum = nothing, affine = nothing, track_stats = nothing, active = nothing, chs = nothing),)
julia> gradient(bn -> sum(bn(randn(Float32, 3,1))), Chain(BatchNorm(3)))
(nothing,) Not sure if related, but it seems that for some models the call hangs idefinitely (I killed the process after 8 hours) if gradients are missing from one of the layers due to the above. The only 100% reproducer I have for this so far is one generated by https://github.com/DrChainsaw/NaiveGAflux.jl for a particular random seed, but if I manage to narrow it down I'll post an issue. If anyone has any tips on could cause the stall it would be very helpful. |
I wonder if this is encountering a similar issue as TuringLang/Turing.jl#1754. Have you tried For troubleshooting, a smaller example that doesn't hang but still takes some time to compile (i.e. over a minute) would be good. Bonus if you can provide an inference/total time breakdown from SnoopCompile.@snoopi_deep. |
Thanks @ToucheSir. Sorry if this is derailing. I think the root issue with missing gradients is quite important regardless as it impacts vanilla Flux. Putting the discussion on the hangs in details in an attempt to reduce clutterFrom what I can tell there does not seem to be any poor scaling at work. When printing out the forwards and backwards pass for the call One hypothesis is that it is the evil loopy structure of the |
Thanks. This looks like it deserves its own issue, so I've opened FluxML/Zygote.jl#1243. |
While this isn't a complete solution, one way around #1986 (comment) is to simply make the normalisation layers into immutable Is this worth doing? |
Separating out the configuration as in #1509 is also a solution and gives type stability and inferrability as well |
Another possibility is to refactor into functions which don't use the mutable struct and then add trivial rrules, something like: (BN::BatchNorm)(x) = batchnorm(BN..., x) # splatting mostly because I'm too lazy to type out all the members
function ChainRulesCore.rrule(config::RuleConfig{>:HasReverseMode}, BN::BatchNorm, x)
res, back = rrule_via_ad(config, batchnorm, BN..., x)
function BatchNorm_back(Δ)
δs = back(Δ)
Tangent{BatchNorm}(δs[1:end-1]...), δs[end]
end
return res, BatchNorm_back
end Not sure if this has any advantages over the above proposals, especially considering that the refactor looks like it could be a bit messy. I suppose one far fetched argument is that users might consider it a breaking change to make exported structs immutable. |
The main sources of type stability on norm layer types are |
@DrChainsaw saw JuliaLang/julia#44185 today while catching up on JuliaCon talks. Are you willing to try it out on your side and seeing what the profiler spits out? |
@ToucheSir wow, that looks like a great QOL improvement in general. I only have access to a windows environment atm, but I will give it a shot when that changes. I'm a bit worried that it seems to require yielding. Ctrl-c does nothing so I fear that it is stuck in an infinite loop which does not yield anywhere. |
@mcabbott Can this be closed as done? |
I think the remaining issue is RNNs not working right with the new explicit style. Have not followed closely but e.g. #2258 |
And regularization e.g. FluxML/Optimisers.jl#57 |
A question: Is using the "old" implicit going to introduce bugs/unexpected behavior during training? |
Legacy code using that approach will continue to work, but new code should not be using implicit params. Any bugs found with the old approach should be reported as usual, but note that they are effectively deprecated and will be removed in a future major release. |
So all the things are in place and we can get rid of the current pattern using implicit params:
to the one using explicit parameters and Optimisers.jl
Code
train!(loss, ::Params, data, ::AbstractOptimiser)
:Flux.params(m)
... or just fix it, broken in Make params non-differentiable (Closes #2040 & #2048) #2054:Documentation
Examples
@mcabbott @ToucheSir @darsnack feel free to add to this
The text was updated successfully, but these errors were encountered: