supported by HIP is outlined in this chapter. In general, GPUs are made up of
many so-called Compute Units that excel at executing parallelizable,
computationally intensive workloads without complex control flow.

Increase Parallelism on Multiple Levels
================================================================================

To maximize performance and keep all system components fully utilized, the
application should expose and efficiently manage as much parallelism as possible.
:ref:`Parallel execution <parallel execution>` can be achieved at the
application, device, and multiprocessor levels.

The application’s host and device operations can achieve parallel execution
through asynchronous calls, streams, or HIP graphs. At the device level,
multiple kernels can execute concurrently when resources are available, and at
the multiprocessor level, developers can overlap data transfers with
computation to further improve performance.
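
For example, a minimal sketch of overlapping data transfers with computation
using two HIP streams might look as follows. The kernel ``scale``, the buffer
sizes, and the launch configuration are illustrative placeholders, and error
checking is omitted:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Hypothetical kernel used for illustration: scales each element in place.
   __global__ void scale(float* data, float factor, size_t n) {
       size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
       if (i < n) data[i] *= factor;
   }

   int main() {
       constexpr size_t n = 1 << 20;
       constexpr size_t bytes = n * sizeof(float);
       constexpr unsigned int threads = 256;
       constexpr unsigned int blocks = n / threads;

       // Pinned host memory is required for copies to be truly asynchronous.
       float *hostA, *hostB;
       hipHostMalloc((void**)&hostA, bytes, hipHostMallocDefault);
       hipHostMalloc((void**)&hostB, bytes, hipHostMallocDefault);

       float *devA, *devB;
       hipMalloc((void**)&devA, bytes);
       hipMalloc((void**)&devB, bytes);

       hipStream_t s0, s1;
       hipStreamCreate(&s0);
       hipStreamCreate(&s1);

       // Work queued on different streams may run concurrently: while one
       // stream executes its kernel, the other can transfer its input.
       hipMemcpyAsync(devA, hostA, bytes, hipMemcpyHostToDevice, s0);
       scale<<<blocks, threads, 0, s0>>>(devA, 2.0f, n);
       hipMemcpyAsync(hostA, devA, bytes, hipMemcpyDeviceToHost, s0);

       hipMemcpyAsync(devB, hostB, bytes, hipMemcpyHostToDevice, s1);
       scale<<<blocks, threads, 0, s1>>>(devB, 2.0f, n);
       hipMemcpyAsync(hostB, devB, bytes, hipMemcpyDeviceToHost, s1);

       // The host blocks only when the results are actually needed.
       hipStreamSynchronize(s0);
       hipStreamSynchronize(s1);

       hipStreamDestroy(s0);
       hipStreamDestroy(s1);
       hipFree(devA);
       hipFree(devB);
       hipHostFree(hostA);
       hipHostFree(hostB);
   }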

Data Management and Transfer Between CPU and GPU
================================================================================
Data transfers between the CPU and the GPU can be
performance critical, so it is important to know how to use them effectively.
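
As a minimal sketch, an explicit round trip between host and device memory
might look like the following; error checking is omitted for brevity:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <vector>

   int main() {
       std::vector<float> host(1024, 1.0f);
       const size_t bytes = host.size() * sizeof(float);

       // Allocate device memory and copy the input to the GPU.
       float* device = nullptr;
       hipMalloc((void**)&device, bytes);
       hipMemcpy(device, host.data(), bytes, hipMemcpyHostToDevice);

       // ... launch kernels that operate on `device` here ...

       // Copy the result back; hipMemcpy blocks until the transfer finishes.
       hipMemcpy(host.data(), device, bytes, hipMemcpyDeviceToHost);
       hipFree(device);
   }
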
Memory Management on the GPU
================================================================================

On-device memory accesses from the threads in a kernel can be a performance
bottleneck, depending on the workload. Compared to CPUs, device memory accesses
also have some particularities that have to be taken into account. In addition,
GPUs provide several distinct memory spaces with different access scopes and
performance characteristics, each suited to specific use cases.
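
For instance, a kernel can stage data in block-local shared memory, one of
these memory spaces, before operating on it. The following hypothetical kernel
assumes a launch with 256 threads per block and an input size that is a
multiple of 256:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Hypothetical kernel: reverses each 256-element tile of `data` in place by
   // staging it in shared memory, which is fast and visible to the whole block.
   __global__ void reverseTiles(float* data, size_t n) {
       __shared__ float tile[256];

       const size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
       if (i < n) tile[threadIdx.x] = data[i];                    // global -> shared
       __syncthreads();                                           // wait for the tile
       if (i < n) data[i] = tile[blockDim.x - 1 - threadIdx.x];   // shared -> global
   }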

Synchronize CPU and GPU Workloads
================================================================================