Make atomicAdd Faster in Vulkan
Global counter, shared counter and subgroup optimization.
An atomic operation ensures that only one GPU thread can update a given memory location at a time, so concurrent writes from other threads are serialized rather than lost. If many threads hit the same address at once, a single global atomic becomes a hotspot and often a performance bottleneck.
1. Global Counter
A global counter lives in a buffer that is visible to all workgroups. Every thread issues its own atomic on the same address, so contention is high and performance degrades as more threads participate.
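A minimal sketch of this baseline, assuming a hypothetical workload that counts the nonzero elements of an input buffer. The buffer names, bindings, and workgroup size are illustrative assumptions, not taken from the original post.

```glsl
#version 450

// Assumed workgroup size; tune for your hardware.
layout(local_size_x = 256) in;

// Hypothetical bindings: an input array and a single global counter.
layout(std430, set = 0, binding = 0) readonly buffer InputData { uint values[]; };
layout(std430, set = 0, binding = 1) buffer Counter { uint globalCount; };

void main() {
    uint idx = gl_GlobalInvocationID.x;
    if (idx >= uint(values.length())) {
        return; // out-of-range invocations do nothing
    }

    // Every matching thread issues its own atomic on the same address.
    if (values[idx] != 0u) {
        atomicAdd(globalCount, 1u);
    }
}
```

In the worst case this issues one global atomic per thread, all targeting the same address.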
2. Shared Counter
Count within the workgroup using shared memory, then do one atomic to the global counter.
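The idea can be sketched as follows, again assuming the hypothetical nonzero-counting workload and illustrative buffer names. Note that `barrier()` must be reached by every invocation in the workgroup, so this version avoids an early `return` and folds the bounds check into the condition instead.

```glsl
#version 450

layout(local_size_x = 256) in;

// Hypothetical bindings, same layout as the global-counter sketch.
layout(std430, set = 0, binding = 0) readonly buffer InputData { uint values[]; };
layout(std430, set = 0, binding = 1) buffer Counter { uint globalCount; };

// One counter per workgroup, stored in shared memory.
shared uint localCount;

void main() {
    // A single invocation resets the shared counter.
    if (gl_LocalInvocationIndex == 0u) {
        localCount = 0u;
    }
    barrier(); // the reset is visible before anyone increments

    uint idx = gl_GlobalInvocationID.x;
    if (idx < uint(values.length()) && values[idx] != 0u) {
        // Contention is limited to the threads of this workgroup.
        atomicAdd(localCount, 1u);
    }
    barrier(); // all increments are done before the total is read

    // One global atomic per workgroup instead of one per thread.
    if (gl_LocalInvocationIndex == 0u && localCount > 0u) {
        atomicAdd(globalCount, localCount);
    }
}
```

With a workgroup size of 256, the number of global atomics drops by up to a factor of 256; the remaining contention happens on shared memory, which is typically much cheaper.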
3. Subgroup Optimization
We can go further and do one atomic per subgroup. Threads in a workgroup are not executed individually. They are executed in batches, and each batch is what Vulkan calls a subgroup. A subgroup usually corresponds to a warp (NVIDIA) or a wavefront/wave (AMD).
Typical sizes: 32 (NVIDIA), 64 (AMD), 8/16/32 (Intel/ARM). The size is hardware-dependent and can vary between devices, so query it (for example through VkPhysicalDeviceSubgroupProperties::subgroupSize) rather than hard-coding it.
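Because the size varies, it can be useful to read it from inside a shader. The sketch below writes the built-ins `gl_SubgroupSize` and `gl_NumSubgroups` to a hypothetical output buffer; the buffer layout and workgroup size are assumptions for illustration, and the shader needs a Vulkan 1.1 target environment to compile.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable

layout(local_size_x = 64) in;

// Hypothetical output buffer for inspection on the host.
layout(std430, set = 0, binding = 0) buffer Info {
    uint subgroupSize;          // lanes per subgroup on this device
    uint subgroupsPerWorkgroup; // subgroups in one workgroup
};

void main() {
    // A single invocation reports the values.
    if (gl_GlobalInvocationID.x == 0u) {
        subgroupSize = gl_SubgroupSize;
        subgroupsPerWorkgroup = gl_NumSubgroups;
    }
}
```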
subgroupAdd() reduces the per-lane values within the subgroup.
subgroupElect() is true for exactly one lane in the subgroup.
To use subgroup operations you need Vulkan 1.1 or later, and your shaders must be compiled for that target environment:
    glslc --target-env=vulkan1.1 -mfmt=c shader.comp -o shader.spv
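A subgroup version of the counter might look like the sketch below. It assumes the same hypothetical nonzero-counting workload and buffer names as before, and it assumes the device reports arithmetic subgroup operations as supported in compute shaders (VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in VkPhysicalDeviceSubgroupProperties::supportedOperations).

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable       // subgroupElect()
#extension GL_KHR_shader_subgroup_arithmetic : enable  // subgroupAdd()

layout(local_size_x = 256) in;

// Hypothetical bindings, same layout as the earlier sketches.
layout(std430, set = 0, binding = 0) readonly buffer InputData { uint values[]; };
layout(std430, set = 0, binding = 1) buffer Counter { uint globalCount; };

void main() {
    uint idx = gl_GlobalInvocationID.x;

    // Each lane contributes 0 or 1. No early return, so every lane
    // stays active and takes part in the reduction.
    uint flag = (idx < uint(values.length()) && values[idx] != 0u) ? 1u : 0u;

    // Sum of the flags across all active lanes of this subgroup.
    uint subgroupTotal = subgroupAdd(flag);

    // Exactly one lane per subgroup publishes the partial sum.
    if (subgroupElect() && subgroupTotal > 0u) {
        atomicAdd(globalCount, subgroupTotal);
    }
}
```

This variant needs no shared memory and no barriers: the reduction happens inside the subgroup, and the number of global atomics is at most one per subgroup. It can also be combined with the shared counter by adding `subgroupTotal` to the shared variable first, which would bring it back down to one global atomic per workgroup.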
http://chuzcjoe.github.io/CGV/cgv-make-atomic-op-faster-vulkan/