Consider the following case in a DeepSpeed kernel: a global function template with a parameter pack.
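Roughly, the pattern in question looks like this (a paraphrased sketch, not the verbatim DeepSpeed source; names follow the upstream `multi_tensor_apply` code):

```cpp
// Device side: a __global__ function template with a parameter pack.
template <typename T, typename U, typename... ArgTypes>
__global__ void multi_tensor_apply_kernel(int chunk_size,
                                          volatile int* noop_flag,
                                          T tl,        // tensor-list metadata, passed by value
                                          U callable,
                                          ArgTypes... args)
{
  callable(chunk_size, noop_flag, tl, args...);
}

// Host side: noop_flag is a torch::Tensor, tl is a large struct.
multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
    chunk_size, noop_flag.DATA_PTR<int>(), tl, callable, args...);
```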
The current translation looks roughly like this:
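(A sketch of the lambda-capture form produced by a minimal-lexical-change migration; the exact `nd_range` setup and variable names are assumptions.)

```cpp
// Inside the host-side multi_tensor_apply template after migration: the
// whole launch is wrapped in a device lambda capturing everything by value.
stream->parallel_for(
    sycl::nd_range<3>(loc_block_info * block_size, block_size),
    [=](sycl::nd_item<3> item_ct1) {
      // noop_flag.DATA_PTR<int>() is now evaluated inside the device lambda,
      // and noop_flag (a torch::Tensor) plus tl and args... are all captured
      // by value.
      multi_tensor_apply_kernel(chunk_size, noop_flag.DATA_PTR<int>(), tl,
                                callable, args..., item_ct1);
    });
```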
If transformed with a minimal lexical change, the `noop_flag.DATA_PTR<int>()` call crosses scope from host to target, which is not allowed outside CUDA. `noop_flag` happens to be a `torch::Tensor` object, which is not device-copyable. Even if we introduce a temporary variable and move the call out of the lambda, the object `tl` (the tensor list) pushes the capture size past the 2048-byte limit. A simple case hits every limitation we have.

We suggest that translated CUDA global functions (templates) use an explicit functor (template) instead of a lambda capture. The kernel-launching code would look like this:
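(A minimal sketch of how the functor-based translation could look; the functor name, member layout, and `nd_range` setup are assumptions rather than the actual migrated output, and it assumes `<sycl/sycl.hpp>` and `<tuple>`. See the manual example linked below for the real code.)

```cpp
// The __global__ template becomes a named functor template: the kernel
// parameters turn into members initialized on the host, and operator()
// holds the migrated kernel body.
template <typename T, typename U, typename... ArgTypes>
class multi_tensor_apply_kernel {
public:
  multi_tensor_apply_kernel(int chunk_size, int* noop_flag, T tl,
                            U callable, ArgTypes... args)
      : chunk_size(chunk_size), noop_flag(noop_flag), tl(tl),
        callable(callable), args(args...) {}

  void operator()(sycl::nd_item<3> item) const {
    std::apply(
        [&](auto const&... unpacked) {
          callable(chunk_size, noop_flag, tl, unpacked..., item);
        },
        args);
  }

private:
  int chunk_size;
  int* noop_flag;                // raw device pointer, obtained on the host
  T tl;
  U callable;
  std::tuple<ArgTypes...> args;  // parameter pack stored by value
};

// Launch: DATA_PTR<int>() is evaluated on the host, and only the functor
// object is passed to the device.
stream->parallel_for(
    sycl::nd_range<3>(loc_block_info * block_size, block_size),
    multi_tensor_apply_kernel<T, U, ArgTypes...>(
        chunk_size, noop_flag.DATA_PTR<int>(), tl, callable, args...));
```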
All substitutions in the translation are localized. A manual example can be found at:
https://github.com/CaoZhongZ/sycl_compiler_test/blob/global_call_migrate/deepspeed/global_call_migrate.cpp