1. The State-Caching Problem

Reverse-mode automatic differentiation (backpropagation) works by replaying a function's forward pass in reverse. To do this it needs the intermediate values that existed at each step — because the gradient of step i depends on the value of the input to step i, not just its output.

For a simple loop over N blobs, naive auto-diff would store the full pixel state after each blob: N copies of (r, g, b, T). For a short-list of 64 Gaussians across millions of pixels, that's an enormous amount of memory.
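A quick back-of-envelope calculation makes "enormous" concrete. The figures below are illustrative assumptions (a 1920×1080 target, a 64-entry short list), not numbers from the article:

```python
# Memory cost of caching every intermediate blend state vs. only the final one.
BYTES_PER_STATE = 4 * 4            # (r, g, b, T) as four 32-bit floats
PIXELS = 1920 * 1080               # assumed target resolution
SHORT_LIST = 64                    # short-list size from the article

naive = PIXELS * SHORT_LIST * BYTES_PER_STATE   # one state per blob per pixel
final_only = PIXELS * BYTES_PER_STATE           # just the final state

print(f"naive caching:    {naive / 2**30:.1f} GiB")       # ~2.0 GiB
print(f"final state only: {final_only / 2**20:.1f} MiB")  # ~31.6 MiB
```

Two gigabytes of transient per-pixel state is far beyond what a GPU can spare per dispatch, while a single final state per pixel is routine.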

Naive auto-diff through the blending loop: the forward pass stores N intermediate states for the backward pass to consume. Memory cost grows linearly with short-list size.

This is why fineRasterize has a [BackwardDerivative(fineRasterize_bwd)] attribute — it tells Slang's auto-diff engine to use a hand-written backward pass instead of generating one automatically.

[BackwardDerivative(fineRasterize_bwd)]
float4 fineRasterize(SortedShortList, uint localIdx, no_diff float2 uv)
{ ... }
2. The State-Undo Trick — undoPixelState()

Gaussian blending has a useful property: the operation is invertible. Given the state after applying blob i and the blob's contribution, you can recover the state before. This is undoPixelState().

The undo is straightforward algebra. If the forward step is:

T_new = T_old × (1 − α)
C_new = C_old + α × T_old × colour

Then the undo is:

T_old = T_new / (1 − α)          // recover previous transmittance
C_old = C_new − α × T_old × colour  // recover previous accumulated colour
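The round trip can be checked numerically. This is an illustrative scalar Python sketch of the two equations above, not the Slang source (which works on float4 state):

```python
# One forward blend step followed by the undo algebra should be the identity.
def transform(C, T, alpha, colour):
    """Forward step: front-to-back alpha compositing."""
    return C + alpha * T * colour, T * (1.0 - alpha)

def undo(C_new, T_new, alpha, colour):
    """Invert one blend step; requires alpha < 1."""
    T_old = T_new / (1.0 - alpha)
    C_old = C_new - alpha * T_old * colour
    return C_old, T_old

C, T = 0.25, 0.8
alpha, colour = 0.3, 0.9
C2, T2 = transform(C, T, alpha, colour)
C1, T1 = undo(C2, T2, alpha, colour)
assert abs(C1 - C) < 1e-12 and abs(T1 - T) < 1e-12
```

Note the undo is exact only in real arithmetic; in floating point it reintroduces tiny round-off, which is harmless for gradient estimates but worth knowing about.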

fineRasterize_bwd exploits this to run the backward loop in reverse without caching any forward states — it starts from the final state (stored in shared memory as finalVal) and undoes one step at a time:

void fineRasterize_bwd(SortedShortList, uint localIdx, float2 uv, float4 dOut)
{
    // Start from the *final* state cached in shared memory -- no per-step cache.
    PixelState pixelState = { finalVal[localIdx], maxCount[localIdx] };

    // (Declarations of count, blobID, dpState, dpGVal and the initial dColor,
    // derived from dOut, are elided here; see the full source.)
    for (uint _i = count; _i > 0; _i--)
    {
        uint i = _i - 1;
        var gval = eval(blobID, uv, localIdx);                   // re-evaluate blob i
        var prevState = undoPixelState(pixelState, i + 1, gval); // undo step i

        // Auto-diff handles the gradient math within the loop body
        bwd_diff(transformPixelState)(dpState, dpGVal, dColor);
        bwd_diff(eval)(blobID, uv, localIdx, dpGVal.getDifferential());

        pixelState = prevState;
        dColor = dpState.getDifferential();
    }
}

Left: naive auto-diff stores all N states. Right: state-undo approach — only the final state is stored; each backward step re-evaluates the blob and undoes the forward transformation. Step through to compare.

Why re-evaluate blobs in the backward pass? Each bwd_diff(eval) call re-runs the Gaussian evaluation (position, covariance, colour) to get the blob's contribution value. This recomputation trades compute for memory — a deliberate choice when memory is the bottleneck, which it typically is on GPU.
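The whole scheme fits in a few lines of scalar Python. This toy model (not the Slang code) runs the forward blend, keeps only the final state, then walks backwards, undoing one step at a time while applying the chain rule through the blend step by hand, and verifies the result against finite differences:

```python
def blend(alphas, colours):
    """Forward pass: front-to-back compositing; returns final (C, T) only."""
    C, T = 0.0, 1.0
    for a, c in zip(alphas, colours):
        C, T = C + a * T * c, T * (1.0 - a)
    return C, T

def blend_grad(alphas, colours):
    """d(final C)/d(alpha_i) with no cached intermediates, via state undo."""
    C, T = blend(alphas, colours)     # only the final state is kept
    gC, gT = 1.0, 0.0                 # gradients w.r.t. the current (C, T)
    grads = [0.0] * len(alphas)
    for i in reversed(range(len(alphas))):
        a, c = alphas[i], colours[i]
        T_prev = T / (1.0 - a)        # undo step i ...
        C_prev = C - a * T_prev * c   # ... to recover the pre-step state
        # Chain rule through C_new = C + a*T*c and T_new = T*(1-a):
        grads[i] = gC * T_prev * c - gT * T_prev
        gC, gT = gC, gC * a * c + gT * (1.0 - a)
        C, T = C_prev, T_prev
    return grads

alphas  = [0.3, 0.5, 0.2]
colours = [0.9, 0.1, 0.7]
g = blend_grad(alphas, colours)

# Finite-difference check against the forward pass.
eps = 1e-6
for i in range(3):
    bumped = list(alphas)
    bumped[i] += eps
    fd = (blend(bumped, colours)[0] - blend(alphas, colours)[0]) / eps
    assert abs(fd - g[i]) < 1e-4, (i, fd, g[i])
```

In the real kernel the hand-written parts are only `blend_grad`'s loop structure and the undo; the per-step chain rule (the two `grads`/`gC`/`gT` lines here) is what Slang's `bwd_diff` generates automatically.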

3. Slang's Auto-Diff Vocabulary

Several Slang-specific constructs appear in the backward pass. Understanding what they do clarifies how the manual and automatic parts interleave:

[BackwardDerivative(f_bwd)] — attribute that registers f_bwd as the hand-written backward pass for a function, replacing the one Slang would generate.

bwd_diff(f) — requests the backward-mode derivative of f; Slang synthesises it automatically unless a custom one has been registered.

no_diff — marks a parameter as non-differentiable, so no gradient is propagated through it (e.g. the pixel's uv coordinate).

DifferentialPair — pairs a primal value with its differential; dpState and dpGVal in the code above are such pairs, and .getDifferential() reads out the gradient half.

The key design insight in fineRasterize_bwd: the outer loop structure is written manually (to control state reconstruction), but the inner loop body — the Gaussian evaluation and blending math — is still differentiated by Slang via bwd_diff(). You only hand-write the parts where auto-diff would be incorrect or inefficient, and let the engine handle the rest.

4. The Training Loop — Three Kernels

Learning happens through three compute kernels dispatched in order each training iteration. Only the final kernel renders the image; the other two handle gradient computation and parameter updates.

Three kernels per training step. clearDerivatives and computeDerivatives run before imageMain. Gradient data flows right (forward); parameter updates flow left (backward).

1. clearDerivativesMain — resets the derivative buffer to zero before each iteration. Gradient accumulation is additive (multiple pixels can contribute gradients to the same Gaussian), so the buffer must be zeroed between steps.

[playground::CALL(BLOB_BUFFER_SIZE, 1, 1)]
void clearDerivativesMain(uint2 dispatchThreadID)
{
    derivBuffer[dispatchThreadID.x].store(asuint(0.f));
}

2. computeDerivativesMain — runs one forward+backward pass per pixel by calling bwd_diff(loss). Slang generates the backward code for loss(), which chains through splatBlobs() and fineRasterize() and ultimately calls fineRasterize_bwd. Gradient contributions for each Gaussian parameter are accumulated into derivBuffer using the CAS atomic pattern.

void computeDerivativesMain(uint2 dispatchThreadID)
{
    float perPixelWeight = 1.f / (imageSize.x * imageSize.y);
    bwd_diff(loss)(dispatchThreadID, targetImageSize, perPixelWeight);
}

3. updateBlobsMain — reads each parameter's gradient from derivBuffer, applies the Adam optimiser update, and writes the new parameter value back to blobsBuffer. Since all parameters are laid out sequentially in a single float buffer, one thread per buffer slot handles the update with no struct reinterpretation needed.
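The three-kernel sequence maps onto a very ordinary optimisation loop. Here it is as host-side Python (an illustrative sketch with made-up buffer contents; plain SGD stands in for the Adam update described next):

```python
params = [0.5, -0.2, 1.0]   # stand-in for blobsBuffer
grads  = [0.0, 0.0, 0.0]    # stand-in for derivBuffer
lr = 0.01

def clear_derivatives():
    """Kernel 1: zero the additive gradient accumulator."""
    for i in range(len(grads)):
        grads[i] = 0.0

def compute_derivatives():
    """Kernel 2: every 'pixel' adds its gradient contribution."""
    for _pixel in range(4):                 # pretend 4 pixels hit each parameter
        for i, p in enumerate(params):
            grads[i] += 2.0 * p             # toy per-pixel loss: p**2

def update_blobs():
    """Kernel 3: apply the accumulated gradient (SGD here, Adam in the article)."""
    for i in range(len(params)):
        params[i] -= lr * grads[i]

for _step in range(100):
    clear_derivatives()
    compute_derivatives()
    update_blobs()

assert all(abs(p) < 0.01 for p in params)   # sum-of-squares loss drives params to 0
```

Skipping `clear_derivatives` would make each iteration see the sum of all previous gradients, which is exactly the bug the first kernel exists to prevent.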

5. Adam Optimiser — Adaptive Learning Rates

Raw gradients are noisy — they vary significantly from iteration to iteration because each image pixel produces slightly different gradient estimates. Applying them directly (SGD) leads to unstable optimisation. Adam (Adaptive Moment Estimation) acts as a temporal filter on gradients before applying updates:

// First moment (exponential moving average of gradient)
m_t = β₁ × m_prev + (1 − β₁) × g_t

// Second moment (exponential moving average of gradient squared)
v_t = β₂ × v_prev + (1 − β₂) × g_t²

// Bias-corrected moments (compensate for zero-initialisation)
m̂_t = m_t / (1 − β₁ᵗ)
v̂_t = v_t / (1 − β₂ᵗ)

// Parameter update
update = (η / (√v̂_t + ε)) × m̂_t
param  = param − update

m_t is a momentum term — it smooths gradient direction by averaging over recent history (β₁≈0.9 gives a window of ~10 iterations). v_t tracks gradient magnitude; dividing by √v̂_t gives each parameter an adaptive step size — parameters with consistently large gradients get smaller steps (damping), parameters with small or erratic gradients get larger steps (exploration).
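A single-parameter Adam loop makes both effects visible. This is an illustrative sketch on a toy quadratic loss f(p) = (p − 3)², with the bias-correction exponent t written explicitly:

```python
import math

beta1, beta2, eta, eps = 0.9, 0.999, 0.1, 1e-8

def adam_step(p, m, v, t):
    g = 2.0 * (p - 3.0)                     # gradient of (p - 3)^2
    m = beta1 * m + (1.0 - beta1) * g       # first moment: smoothed direction
    v = beta2 * v + (1.0 - beta2) * g * g   # second moment: smoothed magnitude
    m_hat = m / (1.0 - beta1 ** t)          # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return p - eta * m_hat / (math.sqrt(v_hat) + eps), m, v

p, m, v = adam_step(0.0, 0.0, 0.0, 1)
# The very first step has magnitude ~eta no matter how large the raw
# gradient is (-6.0 here): the adaptive denominator normalises it away.
assert abs(p - eta) < 1e-6

for t in range(2, 501):
    p, m, v = adam_step(p, m, v, t)
assert abs(p - 3.0) < 0.5                   # settles near the optimum
```

That first-step property is why Adam tolerates poorly scaled gradients: the effective step size is bounded by η rather than by the raw gradient magnitude.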

Adam optimiser state for a single parameter across iterations. The update magnitude is dampened by v̂_t, smoothed by m̂_t, producing stable convergence even with noisy gradients.

Why store raw uint bits instead of simply doing a float atomic add? Float atomic adds aren't natively supported on most GPUs. The derivative buffer therefore stores raw uint bits and uses the CAS loop (from the GPU Parallel Algorithms page) to safely accumulate float gradients from multiple threads and workgroups. The Adam kernel then reads those accumulated gradients, applies bias-corrected moment updates, and writes the new parameter value.
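The CAS pattern can be emulated on the CPU. This sketch (illustrative; `compare_exchange` stands in for the GPU's `InterlockedCompareExchange`) shows the bit-reinterpretation dance:

```python
# Atomically adding a float into a buffer that only supports integer CAS.
import struct

def f2u(f):  # reinterpret float bits as uint32 (HLSL asuint)
    return struct.unpack("<I", struct.pack("<f", f))[0]

def u2f(u):  # reinterpret uint32 bits as float (HLSL asfloat)
    return struct.unpack("<f", struct.pack("<I", u))[0]

buffer = [f2u(0.0)]          # one derivBuffer slot, stored as raw uint bits

def compare_exchange(buf, idx, expected, desired):
    """Stand-in for the hardware CAS: swap only if the slot is unchanged."""
    old = buf[idx]
    if old == expected:
        buf[idx] = desired
    return old

def atomic_add_float(buf, idx, value):
    while True:
        old_bits = buf[idx]
        new_bits = f2u(u2f(old_bits) + value)
        if compare_exchange(buf, idx, old_bits, new_bits) == old_bits:
            return           # no other thread intervened between read and swap

for g in [0.25, 0.5, 0.125]:             # three "threads" accumulating gradients
    atomic_add_float(buffer, 0, g)
assert u2f(buffer[0]) == 0.875
```

On a real GPU the loop retries whenever another thread wins the race; here it always succeeds on the first try, but the read–modify–CAS structure is the same.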