Posted 2025-08-21, Joe Chu, CGV, 2 minutes read (about 340 words)

Make atomicAdd Faster in Vulkan

Global counter, shared counter and subgroup optimization.

An atomic operation performs its read-modify-write on a memory location as one indivisible step, so updates from other GPU threads cannot interleave with it. When many threads hit the same address at once, their updates are serialized, and a single global atomic can become a hotspot and a performance bottleneck.

1. Global Counter

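A global counter lives in a storage buffer and is visible to all workgroups. Every thread that passes the condition issues its own atomic on the same address, so contention grows with the number of hits and performance degrades.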

#version 450
layout(local_size_x = 128) in;

// SSBO with a single 32-bit counter
layout(std430, binding = 0) buffer CounterBuf {
    uint counter;
};

void main() {
    bool hit = /* your condition from the computed value */;

    if (hit) {
        // Atomically add 1
        atomicAdd(counter, 1u);
    }
}

2. Shared Counter

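Count hits within the workgroup in shared memory, then issue a single atomic per workgroup to the global counter. Traffic on the global counter drops from one atomic per hit to one per workgroup.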

#version 450
layout(local_size_x = 128) in;

layout(std430, binding = 0) buffer CounterBuf {
    uint globalCounter;
};

shared uint wgCounter;

void main() {
    // One lane initializes the shared counter
    if (gl_LocalInvocationIndex == 0u) wgCounter = 0u;
    barrier(); // synchronize shared memory initialization

    bool hit = /* your condition */;
    if (hit) {
        atomicAdd(wgCounter, 1u);  // fast, within workgroup
    }

    barrier(); // ensure all increments are done

    // One lane publishes to global
    if (gl_LocalInvocationIndex == 0u) {
        atomicAdd(globalCounter, wgCounter);
    }
}

3. Subgroup Optimization

We can go further and do one atomic per subgroup. Threads in a workgroup are not executed individually. They are executed in batches, and each batch is called a subgroup. A subgroup usually corresponds to a warp (NVIDIA) or a wavefront/wave (AMD).

Typical sizes: 32 (NVIDIA), 64 (AMD), 8/16/32 (Intel/ARM). The size is hardware-dependent and can vary between devices.
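Because the size varies, it can help to read it back on the target device instead of assuming a value. The shader below is a minimal sketch of that: the buffer SubgroupInfoBuf and its two fields are hypothetical names made up for illustration, while gl_SubgroupSize and gl_NumSubgroups are built-ins provided by GL_KHR_shader_subgroup_basic.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
layout(local_size_x = 128) in;

// Hypothetical debug buffer, not part of the original shaders
layout(std430, binding = 0) buffer SubgroupInfoBuf {
    uint subgroupSize;       // lanes per subgroup on this device
    uint subgroupsPerGroup;  // subgroups in one 128-thread workgroup
};

void main() {
    // Only the very first invocation of the dispatch writes the values
    if (gl_GlobalInvocationID == uvec3(0u)) {
        subgroupSize = gl_SubgroupSize;
        subgroupsPerGroup = gl_NumSubgroups;
    }
}
```

With local_size_x = 128, a device with 32-wide subgroups would be expected to report 4 subgroups per workgroup, and a 64-wide device 2.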

subgroupAdd() returns the sum of the per-lane values across the active lanes of the subgroup.
subgroupElect() returns true for exactly one active lane in the subgroup (the one with the lowest invocation ID).

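Subgroup operations require Vulkan 1.1 or later, and the device must report arithmetic support (VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in VkPhysicalDeviceSubgroupProperties::supportedOperations). Compile your shaders against that target environment like this: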

glslc --target-env=vulkan1.1 -mfmt=c shader.comp -o shader.spv

#version 450
#extension GL_KHR_shader_subgroup_basic : require      // subgroupElect()
#extension GL_KHR_shader_subgroup_arithmetic : require // subgroupAdd()
layout(local_size_x = 128) in;

layout(std430, binding = 0) buffer CounterBuf { uint globalCounter; };

void main() {
    bool hit = /* your condition */;
    uint subgroupSum = subgroupAdd(hit ? 1u : 0u);
    if (subgroupElect()) {                // exactly one lane per subgroup
        atomicAdd(globalCounter, subgroupSum);
    }
}
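The shared counter and the subgroup reduction can also be combined: each subgroup adds its sum to the shared counter, and one lane per workgroup publishes the result. With 128 threads and 32-wide subgroups, that would mean at most 4 shared atomics and 1 global atomic per workgroup. The shader below is a sketch under assumptions, not a measured optimization: InputBuf, values and the 0.5 threshold are hypothetical stand-ins for the real hit condition, and whether this beats the plain subgroup version depends on the hardware.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require
layout(local_size_x = 128) in;

layout(std430, binding = 0) buffer CounterBuf { uint globalCounter; };

// Hypothetical input data, used only to give the sketch a concrete condition
layout(std430, binding = 1) readonly buffer InputBuf { float values[]; };

shared uint wgCounter;

void main() {
    // One lane initializes the shared counter
    if (gl_LocalInvocationIndex == 0u) wgCounter = 0u;
    barrier();

    // No early return on out-of-range threads: every invocation must reach the barriers
    uint idx = gl_GlobalInvocationID.x;
    bool hit = idx < uint(values.length()) && values[idx] > 0.5;

    // Reduce within the subgroup, then one shared atomic per subgroup
    uint subgroupSum = subgroupAdd(hit ? 1u : 0u);
    if (subgroupElect() && subgroupSum != 0u) {
        atomicAdd(wgCounter, subgroupSum);
    }

    barrier(); // ensure all subgroups have added their sums

    // One lane publishes to global, skipping the atomic when there is nothing to add
    if (gl_LocalInvocationIndex == 0u && wgCounter != 0u) {
        atomicAdd(globalCounter, wgCounter);
    }
}
```

The trade-off is that the two barrier() calls come back, which the subgroup-only version avoids, so it is worth profiling both on the target device.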

More resources: https://www.khronos.org/blog/vulkan-subgroup-tutorial

Author: Joe Chu
Posted on: 2025-08-21
Updated on: 2025-08-24

http://chuzcjoe.github.io/CGV/cgv-make-atomic-op-faster-vulkan/

#vulkan atomic
