batch_normalization: Introduce vectorization optimization in the batch norm elementwise kernel. #933

Open
wants to merge 3 commits into
base: main

Conversation

@xytintel xytintel commented Sep 25, 2024

The group-stride-loop implementation performs poorly for low-precision data types on PVC (Jira: PYTORCHDGQ-5162), so this PR applies only a partial vectorization optimization.
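For readers skimming the thread, here is a minimal host-side sketch of one reading of "partial vectorization", consistent with the loop reviewed below: inputs are read one scalar at a time, and results are staged in a `VEC_SIZE` buffer that would be written back as one wide store. The function name `batch_norm_transform`, the `float` element type, and the use of `std::vector` are assumptions for illustration; this is plain C++, not the SYCL kernel from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names; plain host C++ standing in for the device kernel.
constexpr int VEC_SIZE = 4;

// Applies out[f] = gamma * (in[f] - mean) * invstd + beta to one channel.
// Loads are scalar; each group of VEC_SIZE results is staged in `vec`
// and written back together (the "partial" vectorization assumed here).
inline void batch_norm_transform(const std::vector<float>& in,
                                 std::vector<float>& out,
                                 float gamma, float beta,
                                 float mean, float invstd) {
  assert(in.size() == out.size());
  const std::size_t n = in.size();
  const std::size_t vec_end = n - n % VEC_SIZE;
  for (std::size_t begin = 0; begin < vec_end; begin += VEC_SIZE) {
    float vec[VEC_SIZE];
    for (int vt = 0; vt < VEC_SIZE; ++vt) {
      vec[vt] = gamma * (in[begin + vt] - mean) * invstd + beta;
    }
    // In a device kernel this copy would be a single wide store.
    for (int vt = 0; vt < VEC_SIZE; ++vt) {
      out[begin + vt] = vec[vt];
    }
  }
  // Scalar tail for sizes that are not a multiple of VEC_SIZE.
  for (std::size_t f = vec_end; f < n; ++f) {
    out[f] = gamma * (in[f] - mean) * invstd + beta;
  }
}
```

The scalar tail is an addition of this sketch; the PR may handle remainders differently.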

@fengyuan14 fengyuan14 changed the title from "Perform vectorization optimization on the batch normalization forward pass" to "batch_normalization: Introduce vectorization optimization in the forward pass" Sep 26, 2024
@fengyuan14 fengyuan14 changed the title from "batch_normalization: Introduce vectorization optimization in the forward pass" to "batch_normalization: Introduce vectorization optimization in the batch norm elementwise kernel." Sep 26, 2024
for (int vt = 0; vt < VEC_SIZE; ++vt) {
  index_t feature = feature_vec_begin + vt;
  vec[vt] = static_cast<input_scalar_t>(
      gamma * (i[feature] - mean) * invstd + beta);
Contributor

Why not vectorize the input loads as well? I see that vt is contiguous across the iterations.

Contributor Author

In the CNL case, consecutive work-items cannot access consecutive data, because the OffsetCalculator is used to compute the load stride.
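To illustrate the point, here is a small sketch of how a stride-based offset calculation can send consecutive linear ids to non-consecutive memory locations. `linear_to_offset` is a hypothetical stand-in written for this note and does not reproduce the real OffsetCalculator interface; the sizes and strides are made-up values.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an offset calculator (not the real API).
// Decomposes `linear_id` over `sizes` (fastest-varying dimension first)
// and recombines the per-dimension indices with `strides` to get an
// element offset in memory.
template <std::size_t N>
std::int64_t linear_to_offset(std::int64_t linear_id,
                              const std::array<std::int64_t, N>& sizes,
                              const std::array<std::int64_t, N>& strides) {
  std::int64_t offset = 0;
  for (std::size_t d = 0; d < N; ++d) {
    assert(sizes[d] > 0);
    offset += (linear_id % sizes[d]) * strides[d];
    linear_id /= sizes[d];
  }
  return offset;
}
```

With sizes `{4, 3}` and strides `{3, 1}`, linear ids 0, 1, 2, 3 map to offsets 0, 3, 6, 9, so neighbouring work-items would touch elements three apart and a single wide load could not cover them.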

Contributor

If we know the vt size, could we calculate the offsets from the linear ids instead, e.g. 0, 3, 7, 11 if the vt size is 4?
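One way to make this suggestion concrete, as a sketch only: compute the offset once at the first linear id of each vector and check whether the remaining lanes are adjacent in memory. The sketch assumes vector starts are multiples of the vector size (0, 4, 8, 12 for a size of 4), which may differ from the ids quoted above. `Strided2D` and `lanes_contiguous` are hypothetical names, and the 2-D layout is a simplification of whatever the kernel actually uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2-D layout; all names are illustrative only.
struct Strided2D {
  std::int64_t inner_size;    // extent of the fastest-varying logical dim
  std::int64_t inner_stride;  // memory stride of that dim, in elements
  std::int64_t outer_stride;  // memory stride of the other dim

  // Maps a linear id to an element offset in memory.
  std::int64_t offset(std::int64_t linear_id) const {
    return (linear_id % inner_size) * inner_stride +
           (linear_id / inner_size) * outer_stride;
  }
};

// True when the vec_size elements starting at linear id `base` are
// adjacent in memory, so one offset computed at `base` would be enough
// to issue a vector load.
inline bool lanes_contiguous(const Strided2D& layout, std::int64_t base,
                             int vec_size) {
  assert(vec_size > 0);
  const std::int64_t first = layout.offset(base);
  for (int vt = 1; vt < vec_size; ++vt) {
    if (layout.offset(base + vt) != first + vt) {
      return false;
    }
  }
  return true;
}
```

A dense layout such as `{8, 1, 8}` passes the check at bases 0 and 4, while a layout with `inner_stride` 3 fails it, which matches the concern raised in the previous reply. A real kernel would want to establish this once per launch rather than per vector.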

2 participants