batch_normalization: Introduce vectorization optimization in the batch norm elementwise kernel. #933
base: main
for (int vt = 0; vt < VEC_SIZE; ++vt) {
  index_t feature = feature_vec_begin + vt;
  vec[vt] = static_cast<input_scalar_t>(
      gamma * (i[feature] - mean) * invstd + beta);
Why not use vectorized loads for the inputs as well? I see that vt is contiguous across iterations.
In the CNL case, consecutive work-items cannot access consecutive data, because the OffsetCalculator is used to compute the load stride.
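To illustrate the point about strided loads, here is a minimal sketch in plain C++. `StridedOffset` and `is_contiguous_run` are hypothetical names and are not the OffsetCalculator used in this PR; the sketch only assumes that a linear element index is mapped to a memory offset through per-dimension strides. A vectorized load is only valid when a run of consecutive linear ids maps to consecutive offsets.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an offset calculator (not the real API):
// maps a linear element index to a memory offset for a 2-D view.
struct StridedOffset {
  std::size_t inner_size;    // element count along the fast logical dimension
  std::size_t outer_stride;  // memory stride of the slow logical dimension
  std::size_t inner_stride;  // memory stride of the fast logical dimension

  std::size_t operator()(std::size_t linear) const {
    return (linear / inner_size) * outer_stride +
           (linear % inner_size) * inner_stride;
  }
};

// Returns true when `vec_size` consecutive linear ids starting at `begin`
// map to consecutive memory offsets, i.e. a single vector load would be valid.
inline bool is_contiguous_run(const StridedOffset& calc,
                              std::size_t begin,
                              std::size_t vec_size) {
  assert(vec_size > 0);
  const std::size_t base = calc(begin);
  for (std::size_t k = 1; k < vec_size; ++k) {
    if (calc(begin + k) != base + k) {
      return false;
    }
  }
  return true;
}
```

With a dense layout (`inner_stride == 1`) the check passes, while a layout whose fast logical dimension has a stride larger than one fails it, which is the situation described above.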
If we know the vt size, can we calculate the offset with linear ids 0, 3, 7, 11 when the vt size is 4?
Due to performance issues with the low-precision data type implementation of group stride loops on PVC (jira: PYTORCHDGQ-5162), this change uses a partial vectorization optimization.
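As a rough host-side illustration of what "partial vectorization" could look like, the sketch below gathers inputs with scalar, possibly strided loads and writes the results with one contiguous store. This is an assumption about the general pattern, not the kernel from this PR: `bn_transform_vec`, the `in_stride` parameter, and the use of `float` and `std::array` in place of device vector types are all hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t VEC_SIZE = 4;  // assumed vector width

// Hypothetical helper: normalizes VEC_SIZE features starting at
// feature_begin. Inputs are read one element at a time (stride may be > 1);
// the output is written as a single contiguous chunk.
inline void bn_transform_vec(const std::vector<float>& in,
                             std::size_t in_stride,
                             std::vector<float>& out,
                             std::size_t feature_begin,
                             float mean, float invstd,
                             float gamma, float beta) {
  assert(feature_begin + VEC_SIZE <= out.size());
  assert((feature_begin + VEC_SIZE - 1) * in_stride < in.size());

  std::array<float, VEC_SIZE> vec{};
  for (std::size_t vt = 0; vt < VEC_SIZE; ++vt) {
    const std::size_t feature = feature_begin + vt;
    // Scalar load: neighbouring features may not be adjacent in memory.
    vec[vt] = gamma * (in[feature * in_stride] - mean) * invstd + beta;
  }
  // Contiguous store: stands in for a single vector write on the device.
  std::copy(vec.begin(), vec.end(), out.begin() + feature_begin);
}
```

The loads stay scalar because their addresses depend on the input stride, whereas the output chunk is contiguous and can be written in one operation.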