The Shader Code

These three functions appear in the Gaussian Splatting backward pass shader. Each solves a different parallel coordination problem that arises when hundreds of threads share state.

// Binary reduction to accumulate per-thread gradients across the workgroup
void loadFloat_bwd(uint idx, uint localDispatchIdx, float dOut)
{
    if (abs(dOut) < 10.f)
        reductionBuffer[localDispatchIdx] = dOut;
    else
        reductionBuffer[localDispatchIdx] = 10.f * sign(dOut);

    GroupMemoryBarrierWithGroupSync();

    for (uint stride = (WG_X * WG_Y) / 2; stride > 0; stride /= 2)
    {
        if (localDispatchIdx < stride)
            reductionBuffer[localDispatchIdx] += reductionBuffer[localDispatchIdx + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    if (localDispatchIdx == 0)
        atomicAccumulate(reductionBuffer[0], idx);
}

// CAS loop for floating-point atomic accumulation
void atomicAccumulate(float val, uint idx)
{
    if (val == 0.f) return;
    for (;;)
    {
        uint oldInt = derivBuffer[idx].load();
        float oldFloat = asfloat(oldInt);
        float newFloat = oldFloat + val;
        uint newInt = asuint(newFloat);
        if (derivBuffer[idx].compareExchange(oldInt, newInt) == oldInt)
            break;
    }
}

// Bitonic sort (parallel workgroup sort) — the data being sorted lives in
// the group-shared blobs[] array shown below
SortedShortList bitonicSort(PaddedShortList, uint localIdx)
{
    GroupMemoryBarrierWithGroupSync();
    for (uint k = 2; k <= GAUSSIANS_PER_BLOCK; k *= 2)
    {
        for (uint j = k / 2; j > 0; j /= 2)
        {
            for (uint i = localIdx; i < GAUSSIANS_PER_BLOCK; i += WG_X * WG_Y)
            {
                uint l = i ^ j;
                if (l > i)
                {
                    if ((((i & k) == 0) && (blobs[i] > blobs[l])) ||
                        (((i & k) != 0) && (blobs[i] < blobs[l])))
                    {
                        var temp = blobs[i]; blobs[i] = blobs[l]; blobs[l] = temp;
                    }
                }
            }
            GroupMemoryBarrierWithGroupSync();
        }
    }
    return { 0 };
}
1. Workgroups and Shared Memory

Before diving into the algorithms, it helps to understand the GPU execution model they operate within.

A workgroup is a collection of threads (e.g. 16×16 = 256 threads) that execute together on a single compute unit. What makes workgroups special is that all threads within one can access a small, extremely fast region of memory called group-shared memory — also known as workgroup memory or LDS (Local Data Store).

Because threads execute in parallel and may write to shared memory at different times, GroupMemoryBarrierWithGroupSync() acts as a checkpoint: no thread is allowed to proceed past it until every thread in the workgroup has reached the same barrier. This is the mechanism that makes the following algorithms correct.

In this shader: WG_X * WG_Y threads share reductionBuffer[] (for gradients) and blobs[] (for sorting).

8×4 workgroup — 32 threads sharing group-shared memory. The barrier line ensures writes are visible to all threads before any read proceeds.

2. Binary Tree Reduction

Each thread in a workgroup has computed a gradient dOut for a shared Gaussian parameter. We need the sum of all gradients across the workgroup. A naive sequential sum would take O(N) steps in series. Binary tree reduction does it in O(log₂N) parallel rounds.

How it works: every thread first writes its clamped gradient into reductionBuffer. The loop then repeatedly folds the top half of the active region onto the bottom half: with stride = N/2, each thread i < stride adds element i + stride into element i, a barrier makes those writes visible, and the stride halves. After the final round the full sum sits in reductionBuffer[0].

With N=16 threads that is only 4 rounds (strides 8, 4, 2, 1) rather than 15 sequential additions.

Notice the gradient clamping step before reduction: each gradient is clamped to [−10, +10]. Unclamped gradients can grow very large during early training, causing the optimiser to take wildly oversized steps. Clamping prevents this exploding gradient problem without zeroing out the signal entirely.

Clamped values (first row) — each gradient is clamped to [−10, +10] before entering the reduction buffer.
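The whole sequence, clamp then log₂N rounds of pairwise addition, can be sketched in Python. This is a single-threaded stand-in for the workgroup (function and variable names are illustrative, not part of the shader): the inner loop over i models what all threads do simultaneously in one round.

```python
def tree_reduce(values):
    """Simulate the workgroup reduction: clamp, then fold by halving strides."""
    buf = [max(-10.0, min(10.0, v)) for v in values]  # clamp to [-10, +10]
    stride = len(buf) // 2
    while stride > 0:
        # On the GPU, all of these additions happen in parallel in one round,
        # separated from the next round by a workgroup barrier.
        for i in range(stride):
            buf[i] += buf[i + stride]
        stride //= 2
    return buf[0]  # thread 0's slot holds the workgroup sum

# 16 "threads": 4 rounds (stride 8, 4, 2, 1) instead of 15 serial additions
grads = [1.0] * 16
print(tree_reduce(grads))  # → 16.0
```

Note that an oversized gradient such as 20.0 enters the buffer as 10.0, which is exactly the clamping behaviour of the if/else at the top of loadFloat_bwd.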

3. Compare-and-Swap Floating-Point Atomics

After reduction, thread 0 of each workgroup calls atomicAccumulate to add its result to a global gradient buffer that is shared between workgroups. The problem: GPU shading languages do not provide a built-in floating-point atomic add. We have to build one.

The compare-and-swap (CAS) loop works in four steps: load the current bits of the slot, reinterpret them as a float, add the local contribution, and try to publish the new bits with compareExchange. The exchange succeeds only if the slot still holds the bits that were read; if another workgroup got there first, the loop retries against the fresh value.

Bitcasting: asuint(3.14f) reinterprets the same 32 bits as a uint — no numeric conversion, just a type-pun. This is valid because an IEEE 754 float occupies exactly 32 bits, like a uint32, so integer CAS can protect a slot that logically holds a float. Example: 1.2f → 0x3F99999A → 1.2f.
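Both the bitcast and the retry loop can be mimicked in Python. The asuint/asfloat pair below type-puns through struct, and AtomicU32 is a toy single-threaded stand-in for the atomic cell (the class and helper names are illustrative, not a real GPU API):

```python
import struct

def asuint(f):
    """Reinterpret a float32's bits as a uint32 (no numeric conversion)."""
    return struct.unpack("<I", struct.pack("<f", f))[0]

def asfloat(u):
    """Reinterpret a uint32's bits as a float32."""
    return struct.unpack("<f", struct.pack("<I", u))[0]

print(hex(asuint(1.2)))      # → 0x3f99999a
print(asfloat(0x3F99999A))   # the float32 nearest to 1.2

class AtomicU32:
    """Toy stand-in for an atomic uint cell (single-threaded simulation)."""
    def __init__(self):
        self.bits = asuint(0.0)
    def load(self):
        return self.bits
    def compare_exchange(self, expected, new):
        old = self.bits
        if old == expected:
            self.bits = new
        return old  # caller checks old == expected to know if it won

def atomic_accumulate(cell, val):
    """Float add built from integer CAS, mirroring the shader loop."""
    if val == 0.0:
        return
    while True:
        old_int = cell.load()
        new_int = asuint(asfloat(old_int) + val)
        if cell.compare_exchange(old_int, new_int) == old_int:
            break  # nobody raced us; the sum is published
```

On a real GPU many workgroups race on the same cell, which is exactly why the retry loop is needed; the single-threaded mock only demonstrates the bit-level mechanics.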

4. Bitonic Sort

Why sort at all? Each workgroup processes one tile of the output image. Before blending, it gathers every Gaussian whose projected ellipse overlaps that tile into a short-list in shared memory — typically 16–64 entries. Those Gaussians are translucent: they have an opacity α ∈ (0, 1). Alpha blending is order-dependent. The correct formula composites back-to-front:

C_out = α · C_gaussian + (1 − α) · C_in

A closer splat must be blended last, on top of everything behind it. Blend them in the wrong order and the final pixel colour is wrong — nearer translucent objects appear to bleed through farther ones. So the short-list must be sorted by depth (far → near) before the blending loop runs.
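A two-splat example makes the order-dependence concrete. The sketch below (illustrative names, scalar colour channel) composites a near and a far translucent splat over a black background in both orders using the formula above:

```python
def blend(c_in, c_gaussian, alpha):
    """Back-to-front 'over' compositing: C_out = a*C_g + (1-a)*C_in."""
    return alpha * c_gaussian + (1 - alpha) * c_in

background = 0.0
near = (1.0, 0.5)   # (colour, alpha): bright splat, 50% opaque
far = (0.2, 0.5)    # dark splat, 50% opaque

# Correct: far splat first, then the near one composited on top
correct = blend(blend(background, *far), *near)
# Wrong order: near splat first, far splat incorrectly drawn over it
wrong = blend(blend(background, *near), *far)

print(correct, wrong)  # the two orders give different pixel values
```

The correct order yields 0.55 while the wrong order yields 0.35: the far splat has bled through the near one, which is precisely the artefact the depth sort prevents.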

Why bitonic sort specifically? Most sorting algorithms are poor fits for GPU execution: their comparison pattern depends on the data, so threads diverge; later comparisons depend on earlier results, so threads must wait on one another; and their memory access is irregular or recursive, so the working set cannot stay resident in fast shared memory.

Bitonic sort avoids all of these problems. In every pass, every thread does exactly the same thing: compare element i with element i XOR j, then conditionally swap. No thread needs to know what any other thread found. All N/2 pairs in a pass are disjoint, so all comparisons are independent — no data hazards, no mid-pass barriers. The swap direction is decided by a single bitmask test with no branching. The entire sort runs in shared memory, which is hundreds of times faster than global memory. This is exactly the kind of uniform, data-parallel, low-divergence work that GPU hardware is built for.

What is a bitonic sequence? A sequence that first monotonically increases then monotonically decreases (or vice versa). Bitonic sort builds and merges such sequences until the whole array is sorted.

The algorithm: the outer loop over k builds sorted runs of doubling length (2, 4, …, N), alternating ascending and descending so that adjacent runs form bitonic sequences. For each k, the inner loop over j merges those sequences with butterfly passes of halving stride (k/2, k/4, …, 1); in every pass each thread compares element i with element i ^ j and conditionally swaps.

Complexity: log₂N · (log₂N + 1) / 2 = O(log²N) passes, each O(1) per thread in parallel. For N=16 that is 10 passes, and since all N/2 comparisons within a pass happen simultaneously on GPU, the whole sort takes just 10 parallel steps where a serial O(N log N) sort would need on the order of 64 comparisons one after another.
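The shader's three nested loops translate almost line-for-line into a single-threaded Python sketch (the function name is illustrative; the innermost loop over i is the part the GPU runs across all threads at once):

```python
def bitonic_sort(blobs):
    """Sort a power-of-two-length list with the shader's triple loop."""
    n = len(blobs)  # must be a power of two, like GAUSSIANS_PER_BLOCK
    k = 2
    while k <= n:           # build blocks of doubling size
        j = k // 2
        while j > 0:        # merge with halving strides
            for i in range(n):          # one "pass": parallel on the GPU
                l = i ^ j               # butterfly partner
                if l > i:               # each pair handled exactly once
                    ascending = (i & k) == 0
                    if (ascending and blobs[i] > blobs[l]) or \
                       (not ascending and blobs[i] < blobs[l]):
                        blobs[i], blobs[l] = blobs[l], blobs[i]
            j //= 2
        k *= 2
    return blobs

print(bitonic_sort([7, 2, 8, 1, 5, 3, 6, 4]))  # → [1, 2, 3, 4, 5, 6, 7, 8]
```

The only real difference from the shader is that the GPU strides the i loop by the thread count and inserts a barrier after each pass, whereas the sequential sketch needs neither.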


The diagram below shows the complete comparison network for N=8. Each horizontal line is an element position; each vertical connector is a compare-and-swap pair. Reading left-to-right, the columns form the passes of the algorithm.

Bitonic sort comparison network for N=8. Connectors show which pairs are compared in each pass; direction (up/down arrow) shows the required sort order for that block.

4a. XOR Pairing — the butterfly network

In every pass, each thread computes its partner as l = i ^ j. Because j has a single bit set, the XOR flips exactly one bit, so every thread independently finds its unique partner — no coordination needed, and all N/2 pairs in a pass are disjoint.

XOR with j flips exactly one bit → every thread gets a guaranteed unique partner, independently. All N/2 pairs are disjoint — no element appears twice — so all comparisons run simultaneously with no data hazards.

4b. Swap Direction — ascending vs. descending groups

(i & k) == 0 assigns each thread to the ascending or descending half of its k-group. This builds interlocking bitonic sequences ready for the next merge stage.

(i & k) == 0 means thread i has the k-bit clear → it is in the lower half of its k-group → sort ascending. Opposite for upper half. Each k-level doubles the group size, building bitonic sequences that are ready for the next merge stage.
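Printing the direction flag for each index makes the grouping visible. This illustrative snippet evaluates (i & k) == 0 for N=8 at each k-level:

```python
N = 8
for k in (2, 4, 8):
    # "asc" where the k-bit is clear (lower half of the k-group)
    directions = ["asc" if (i & k) == 0 else "desc" for i in range(N)]
    print(k, directions)
```

At k=2 the pattern alternates in groups of two (asc, asc, desc, desc, …); at k=4 the first four indices sort ascending and the last four descending; at k=8 the whole array sorts ascending, completing the sort.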

4c. Bitonic Merge Insight — why it converges

A bitonic sequence (rises then falls) is sorted by log₂(N) butterfly passes, each halving the disorder. Step through the three passes to see how [2,4,6,8,7,5,3,1] becomes fully sorted.


Each butterfly pass over a bitonic sequence produces two smaller bitonic sequences. After log₂(N) passes, every element has been globally compared and the result is sorted. This recursive halving is why the algorithm is correct.
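The merge on its own can be run on the example sequence from the text. This sketch (illustrative name) applies the three ascending butterfly passes with j = 4, 2, 1 to the bitonic sequence [2,4,6,8,7,5,3,1]:

```python
def bitonic_merge(seq):
    """Fully sort an already-bitonic sequence with log2(N) butterfly passes."""
    n = len(seq)
    j = n // 2
    while j > 0:
        for i in range(n):
            l = i ^ j
            # Ascending merge: larger element always moves to the higher slot
            if l > i and seq[i] > seq[l]:
                seq[i], seq[l] = seq[l], seq[i]
        j //= 2
    return seq

print(bitonic_merge([2, 4, 6, 8, 7, 5, 3, 1]))  # → [1, 2, 3, 4, 5, 6, 7, 8]
```

After the j=4 pass, each half of the array is itself a bitonic sequence whose elements all belong in that half; the j=2 and j=1 passes then repeat the argument recursively, which is the halving-of-disorder described above.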

5. Why These Algorithms Together

In the Gaussian Splatting pipeline, each of the three algorithms occupies a distinct role. Bitonic sort orders each tile's Gaussian short-list by depth in the forward pass so that alpha blending composites correctly. In the backward pass, binary tree reduction sums the per-thread gradients within each workgroup, and the CAS atomic then adds each workgroup's result into the global gradient buffer shared between workgroups.

Together they form a pipeline: sort first (forward), then accumulate gradients safely (backward). Each algorithm is chosen because it maps efficiently onto GPU parallelism — none of them would be used in CPU code.

Pipeline flow showing the three algorithms in sequence through the forward and backward passes.