forked from openxla/xla
Update README.md #2
Open
JnProfile wants to merge 1 commit into main from test_ci
+1
−1
pemeliya pushed a commit that referenced this pull request on Oct 26, 2023
#MIGRATION_3P_TRITON__GIT_TO_THIRD_PARTY

# Commits integrated

- 726bdb984f2bcb48adfaa341ee7b0263be227b98 [FRONTEND][BACKEND] Fix constexpr assignment ; revert #24... by Zahi Moudallal <[email protected]>
- 87a223d76fe32a28ca563c94215a95f505794c6d bump triton_shared (openxla#2501) by Maksim Levental <[email protected]>
- 721897fcc4f942aa97d2e9ba3787a5e213758177 upgrade llvm to `b1115f8c` (NFC) (openxla#2403) by Mehdi Amini <[email protected]>
- 05dc28be0e72dd496300a31b99a21a5a5118f8e9 [CI] refactor workflows (openxla#2504) by Philippe Tillet <[email protected]>
- 376acb610b5888263ee61713ff0a71e1d5908d69 [BUILD] Fix macos x86 build (openxla#2505) by Thomas Raoux <[email protected]>
- 768fc1fcd98ecfc0892f8982b0bb009dd7bb11ea [FRONTEND] change hash to not require ptxas (openxla#2476) by ian Bearman <[email protected]>
- e36d1665ca2f816212fc80ee2633caa66a0066bf [BACKEND] Fix unsupported view op created during optimiza... by Thomas Raoux <[email protected]>
- a980ec50f1ed3176e2603c25f73f0ddc031cf1d8 [BACKEND] Fixing f8e5m2 to bf16 conversion on A100 (openxla#2508) by Zahi Moudallal <[email protected]>
- a4f373938c9a4ba67105c5394c168945af4c990e [RUNTIME] Filter out paths that don't exist in json group... by Horace He <[email protected]>
- be1de890e1f9bdf0910521b5a536c332a1c1aa2f [BACKEND] Replace assert(0) with llvm::report_fatal_error... by Keren Zhou <[email protected]>
- 0d57820be9ca360cf62cc3a7dc21aecc45a1c53a update triton-shared ref (openxla#2506) by ian Bearman <[email protected]>
- bdf464e4a8f80ad6bd6a7b470cb3d36efd61c8a2 Make kernel_static_print test work when called twice. (#2... by Justin Lebar <[email protected]>
- 30186f401ec52d9addac79a60f418792875f7d11 Fix segfault in assertion test. (openxla#2520) by Justin Lebar <[email protected]>
- dc9e3063d73d2410e1855e1ff258aa90a6158548 [HOPPER] Move to tl.make_block_ptr in flash_attention bac... by runseny <[email protected]>
- b0c166b9e3f2f58c0906fa41f261787ebf3fef0d [BACKEND] Fixing bug in elementwise conversion (openxla#2517) by Zahi Moudallal <[email protected]>
- 4f4c07e7d586aae3daa802ce86a9aa935f8cda17 [CI] add text file containing LLVM commit hash by Ashay Rane <[email protected]>
- 7af27fadee0fce2218a1353feea2f76ea25ad005 update hash to 76ce4736721a by Phil Tillet <[email protected]>
- f192611ff3bdacb8d1d1cad084dfe4cd277a0ec9 Bump LLVM version to https://github.com/llvm/llvm-project... by Goran Flegar <[email protected]>

PiperOrigin-RevId: 576212898
pemeliya pushed a commit that referenced this pull request on Nov 3, 2023
Imported from GitHub PR openxla#6599

FP8 cublasLt matmul uses fast accumulation when both operands' precisions are DEFAULT. Otherwise, it falls back to high-precision accumulation. Issue: openxla#6168

This PR is closely related to Flax PR [3416](google/flax#3416).

Copybara import of the project:

--
a4140da by shuw <[email protected]>:

Add FP8 fast accumulation support for cublasLt.

--
9684568 by shuw <[email protected]>:

Improve based on review #1

--
e906d76 by shuw <[email protected]>:

Improve based on review #2

Merging this change closes openxla#6599

COPYBARA_INTEGRATE_REVIEW=openxla#6599 from wenscarl:fp8_fast_accumulation e906d76
PiperOrigin-RevId: 578948593
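The selection rule stated in this commit message can be sketched as a small predicate. This is only an illustration of the rule as worded above, not the code from the PR: the `Precision` enum and the `UseFastAccumulation` function are hypothetical names introduced here, and how the result is passed on to cublasLt is not shown.

```cpp
#include <cassert>

// Hypothetical stand-in for a per-operand precision setting.
// This is not XLA's actual type; it only mirrors the idea of a
// DEFAULT level versus higher-precision levels.
enum class Precision { kDefault, kHigh, kHighest };

// Returns true only when BOTH operands request DEFAULT precision.
// Any other combination falls back to high-precision accumulation.
constexpr bool UseFastAccumulation(Precision lhs, Precision rhs) {
  return lhs == Precision::kDefault && rhs == Precision::kDefault;
}
```

Under this sketch, a single operand asking for a higher precision is enough to disable fast accumulation, which matches the "otherwise fall back" wording of the message.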
zstreet87 pushed a commit that referenced this pull request on Jan 18, 2024
…art #2 PiperOrigin-RevId: 599037622
ekuznetsov139 pushed a commit that referenced this pull request on May 13, 2024
…d phase to Initialize()

Imported from GitHub PR openxla#12228

The first time that a NormThunk is executed, it will build a cudnn execution plan. This build step can hang if a NCCL collective is running at the same time. To fix this, I've moved the build step to take place during thunk initialization.

We only observe this hang when using cudnn 9.

Here's a backtrace from the hang that will be fixed:
```
Thread 585 (Thread 0x7fb9391ff640 (LWP 41364) "main.py"):
#0 0x00007fd3d17cffd9 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fd3d17da24f in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fd070967dfe in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007fd0709c928a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f1970d76102 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#5 0x00007f1970f2c999 in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#6 0x00007f1970a7d4ab in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#7 0x00007f1970d0a9cb in ?? () from /lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
#8 0x00007fce60b2a98c in cudnn::backend::ExecutionPlan::finalize_internal() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#9 0x00007fce60aefbb1 in cudnn::backend::Descriptor::finalize() () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#10 0x00007fce60b15bec in cudnnBackendFinalize () from /lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
#11 0x00007fd2521b8f39 in cudnn_frontend::ExecutionPlanBuilder_v8::build() () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#12 0x00007fd2521734ba in stream_executor::gpu::(anonymous namespace)::GetExecPlanFromHeuristics(cudnn_frontend::OperationGraph_v8&&, stream_executor::gpu::(anonymous namespace)::CudnnHandle const&, bool) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#13 0x00007fd25216ff9b in stream_executor::gpu::CudnnSupport::NormRunnerFromDesc(stream_executor::Stream*, stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormKind, double, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, stream_executor::dnn::TensorDescriptor const&, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>, std::optional<stream_executor::dnn::TensorDescriptor>) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#14 0x00007fd24e36b88b in stream_executor::dnn::NormOp::RunnerFromAlgorithmDesc(stream_executor::dnn::AlgorithmDesc const&, stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#15 0x00007fd24e36ae37 in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}::operator()() const () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#16 0x00007fd24e36adbc in void absl::lts_20230802::base_internal::CallOnceImpl<stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}>(std::atomic<unsigned int>*, absl::lts_20230802::base_internal::SchedulingMode, stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*)::{lambda()#1}&&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#17 0x00007fd24e36a9bd in stream_executor::dnn::LazyOpRunner<stream_executor::dnn::NormOp>::GetOrCreateRunner(stream_executor::dnn::NormOp::Config, stream_executor::Stream*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#18 0x00007fd24e369d29 in xla::gpu::RunGpuNorm(xla::gpu::GpuNormConfig const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, stream_executor::DeviceMemoryBase const&, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, std::optional<stream_executor::DeviceMemoryBase>, stream_executor::DeviceMemoryBase const&, stream_executor::Stream*, xla::gpu::RunNormOptions) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#19 0x00007fd24e368be6 in xla::gpu::NormThunk::ExecuteOnStream(xla::gpu::Thunk::ExecuteParams const&) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
```

Copybara import of the project:

--
f535330 by Trevor Morris <[email protected]>:

Fix hang with cudnn layer norm by moving cudnn init to Initialize()

Merging this change closes openxla#12228

COPYBARA_INTEGRATE_REVIEW=openxla#12228 from trevor-m:tmorris-norm-init f535330
PiperOrigin-RevId: 633220207
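The change described in this commit message, building the plan during initialization rather than on first execution, can be illustrated with a minimal sketch. Everything here is hypothetical: `Plan`, `LazyRunner`, and `SketchThunk` are stand-ins invented for this example and do not reproduce XLA's `NormThunk` or `LazyOpRunner`. The sketch is also single-threaded for brevity, whereas the runner in the backtrace guards the build with a call-once primitive.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an execution plan that is expensive to build.
struct Plan {
  int id;
};

// Builds its plan on first use and caches it for later calls.
class LazyRunner {
 public:
  Plan& GetOrCreate() {
    if (!plan_) {
      ++build_count_;  // Count builds so the effect is observable.
      plan_ = std::make_unique<Plan>(Plan{build_count_});
    }
    return *plan_;
  }
  int build_count() const { return build_count_; }

 private:
  std::unique_ptr<Plan> plan_;
  int build_count_ = 0;
};

// Hypothetical thunk: Initialize() triggers the build up front, so
// Execute() only ever sees an already-built plan.
class SketchThunk {
 public:
  void Initialize() { runner_.GetOrCreate(); }
  int Execute() { return runner_.GetOrCreate().id; }
  int build_count() const { return runner_.build_count(); }

 private:
  LazyRunner runner_;
};
```

With this arrangement the expensive step runs exactly once, before any execution, so it can no longer overlap with work that is only launched at execution time.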
draganmladjenovic pushed a commit that referenced this pull request on Aug 2, 2024
Use std::aligned_storage_t trick to avoid default-initializing Node struct on a hot path.

name                                     old cpu/op   new cpu/op   delta
BM_SelectAndScatterF32/128/process_time  791µs ± 4%   720µs ± 2%   -8.93%
BM_SelectAndScatterF32/256/process_time  3.20ms ± 4%  2.96ms ± 2%  -7.46%
BM_SelectAndScatterF32/512/process_time  13.7ms ± 5%  12.8ms ± 2%  -6.80%

name                                     old time/op  new time/op  delta
BM_SelectAndScatterF32/128/process_time  790µs ± 5%   719µs ± 1%   -9.00%
BM_SelectAndScatterF32/256/process_time  3.20ms ± 3%  2.96ms ± 1%  -7.58%
BM_SelectAndScatterF32/512/process_time  13.2ms ± 4%  12.3ms ± 1%  -6.82%

PiperOrigin-RevId: 658139935
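The idea of reserving raw storage so that elements are not default-initialized can be sketched as follows. This is an assumption-laden illustration, not the code from the commit: `Node` and `NodeScratch` are hypothetical, and the sketch swaps in an `alignas`-qualified byte buffer, which plays the same role as `std::aligned_storage_t` (deprecated as of C++23). It assumes a C++17 compiler for `std::launder`.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>

// Hypothetical node type. Its default member initializers mean that a plain
// `Node nodes[N];` would write to every slot before any of them is used.
struct Node {
  int index = -1;
  float value = 0.0f;
};

// Skipping destructor calls below is only safe for trivially destructible types.
static_assert(std::is_trivially_destructible<Node>::value,
              "sketch assumes Node needs no destructor call");

// Fixed-capacity scratch buffer: reserves correctly aligned raw bytes and
// constructs a Node only in the slots that are actually used.
template <std::size_t N>
class NodeScratch {
 public:
  // Constructs a Node in the next free slot via placement new.
  Node& Push(int index, float value) {
    assert(size_ < N);
    Node* slot = new (storage_ + size_ * sizeof(Node)) Node{index, value};
    ++size_;
    return *slot;
  }

  // Accesses a previously constructed slot; i must be less than size().
  Node& operator[](std::size_t i) {
    assert(i < size_);
    return *std::launder(reinterpret_cast<Node*>(storage_ + i * sizeof(Node)));
  }

  std::size_t size() const { return size_; }

 private:
  // Raw bytes, deliberately left uninitialized.
  alignas(Node) unsigned char storage_[N * sizeof(Node)];
  std::size_t size_ = 0;
};
```

The trade-off is that the container, not the compiler, becomes responsible for tracking which slots hold live objects, which is why the sketch only hands out slots below `size()`.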
No description provided.