1. One Workgroup, One Tile

A GPU compute shader is dispatched over the full output image. Rather than have every thread independently search through all Gaussians, the image is divided into rectangular tiles — each the size of one workgroup (WG_X × WG_Y threads, typically 16×16 = 256 threads).

All threads in a workgroup share the same fast on-chip shared memory. The tile-based approach exploits this: all 256 threads in a workgroup cooperate to build a short-list of Gaussians for their tile, store it in shared memory, then each thread independently reads from that list to compute its own pixel colour.

Two-phase design: building the short-list is a cooperative task (threads divide up the work), while evaluating the final pixel colour is embarrassingly parallel (each thread acts alone). The stages are explicitly separated in the code.

Hover a tile to see which Gaussians' bounding boxes intersect it. Blue tile = selected workgroup; amber outline = Gaussians in the short-list.

Shared memory is large enough for one tile's short-list, not the entire scene. Without tiling, every thread would need to iterate every Gaussian — O(N) work per thread. With tiling, the 256 threads share the O(N) scan and each thread only evaluates the Gaussians that actually touch its tile.
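The pixel-to-tile mapping this implies can be sketched in a few lines. A minimal CPU illustration in Python — the function name is mine, not from the shader, which derives the same quantities from its dispatch and group IDs:

```python
# Illustrative sketch of the pixel -> tile mapping. WG_X/WG_Y match the
# 16x16 workgroup described in the text; tile_of_pixel is a hypothetical name.
WG_X, WG_Y = 16, 16  # one workgroup = one 16x16 tile = 256 threads

def tile_of_pixel(px, py):
    """Return the tile (workgroup) coordinate and the pixel's
    flattened, row-major thread index within that workgroup."""
    tile = (px // WG_X, py // WG_Y)
    local_idx = (py % WG_Y) * WG_X + (px % WG_X)
    return tile, local_idx
```

For a 1920×1080 target this yields a 120×68 grid of tiles, with the last row of tiles only partially covered by the image.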

2. Strided Iteration — coarseRasterize()

There are typically far more Gaussians in the scene than there are threads in a workgroup. coarseRasterize() distributes them using a strided loop: thread i processes Gaussians i, i+WG_SIZE, i+2×WG_SIZE, …

for (uint i = localIdx; i < numGaussians; i += (WG_X * WG_Y))
{
    Gaussian2D gaussian = Gaussian2D.load(i, localIdx);
    OBB bounds = gaussian.bounds();
    if (bounds.intersects(tileBounds))
    {
        blobs[blobCountAT++] = i;  // atomic increment
    }
}

Each thread independently loads a Gaussian, computes its bounding box, and tests intersection with the tile's OBB tileBounds. If it intersects, the thread atomically appends the Gaussian's index to the shared blobs[] array. The atomic increment (blobCountAT++) hands each writing thread a unique slot, so no two threads can ever write to the same position.

Strided work distribution across 8 threads for 20 Gaussians. Each colour is one thread. Step through batches to see how the workload is shared.
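The same distribution can be modelled sequentially on the CPU. A Python sketch (names assumed, not from the shader) in which each simulated thread walks its strided subset and appends hits to the shared list:

```python
WG_SIZE = 8  # matches the 8-thread diagram; the shader uses WG_X * WG_Y

def coarse_rasterize(num_gaussians, intersects_tile):
    """Sequential model of the strided loop: 'thread' t tests Gaussians
    t, t + WG_SIZE, t + 2*WG_SIZE, ... and appends every hit."""
    blobs = []  # models the shared short-list; append stands in for the atomic slot claim
    for local_idx in range(WG_SIZE):
        for i in range(local_idx, num_gaussians, WG_SIZE):
            if intersects_tile(i):
                blobs.append(i)
    return blobs
```

On the GPU the short-list order depends on which thread wins each atomic increment, which is one reason a sort pass follows coarse rasterization.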

Two barriers frame the function:

GroupMemoryBarrierWithGroupSync();  // start: wait for reset
// ... strided loop ...
GroupMemoryBarrierWithGroupSync();  // end: wait for all writes
blobCount = blobCountAT.load();
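
As a rough CPU analogy (Python threading; illustrative only, not the GPU memory model), the second barrier is what lets every thread read a stable final count:

```python
# Hedged analogy: a lock models the atomic append, Barrier models
# GroupMemoryBarrierWithGroupSync. All names here are mine.
import threading

WG_SIZE = 4
NUM_ITEMS = 10
barrier = threading.Barrier(WG_SIZE)
blobs = []
lock = threading.Lock()          # stands in for the atomic increment
counts = [0] * WG_SIZE

def worker(local_idx):
    barrier.wait()               # start: everyone sees the reset state
    for i in range(local_idx, NUM_ITEMS, WG_SIZE):   # strided loop
        with lock:
            blobs.append(i)
    barrier.wait()               # end: all appends have landed
    counts[local_idx] = len(blobs)   # safe: no thread is still writing

threads = [threading.Thread(target=worker, args=(t,)) for t in range(WG_SIZE)]
for t in threads: t.start()
for t in threads: t.join()
```

After the second wait, every thread observes the same count — remove it and a fast thread could read the list length while slower threads are still appending.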

3. Stretch-Free UV Coordinates — calcUV()

Gaussians are defined in normalised UV space [0,1]². The calcUV() function converts a pixel's dispatch ID into UV without distorting the image. When the render target has a different aspect ratio than the texture, it letterboxes or pillarboxes — scaling to match one axis and centering on the other.

if (aspectRatioRT > aspectRatioTEX)
{
    // RT wider → match widths, letterbox vertically
    float xCoord = dispatchID.x / renderSize.x;
    float yCoord = (dispatchID.y * aspectRatioTEX) / renderSize.x;
    float yCoordMax = aspectRatioTEX / aspectRatioRT;
    yCoord += (1.0 - yCoordMax) / 2.0;  // re-centre
}

The same function also computes tileLow and tileHigh — the UV corners of the current workgroup's tile — which become the OBB tileBounds passed to coarse rasterization.
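Filling in the branch the snippet omits, the whole mapping might look like this. A Python sketch — the pillarbox branch is my symmetric reconstruction, not taken from the shader:

```python
def calc_uv(px, py, rt_w, rt_h, tex_w, tex_h):
    """Map a render-target pixel (px, py) to UV without stretching.
    UV steps per pixel are equal on both axes, so squares stay square."""
    aspect_rt = rt_w / rt_h
    aspect_tex = tex_w / tex_h
    if aspect_rt > aspect_tex:
        # RT wider: match widths, centre the band vertically
        x = px / rt_w
        y = py * aspect_tex / rt_w
        y += (1.0 - aspect_tex / aspect_rt) / 2.0   # re-centre
    else:
        # RT taller (or equal): match heights, centre horizontally
        y = py / rt_h
        x = px / (rt_h * aspect_tex)
        x += (1.0 - aspect_rt / aspect_tex) / 2.0   # re-centre
    return x, y
```

With a square texture on a 200×100 target, x spans [0, 1] while y spans the centred band [0.25, 0.75]; the per-pixel UV step is 1/200 on both axes, so nothing is stretched.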

Blue = Gaussian UV space [0,1]². Amber = the portion of UV space covered by the current render target. Toggle aspect ratios to see letterboxing and pillarboxing.

4. The Short-List Limit & padBuffer()

The blobs[] array lives in shared memory with a fixed capacity: NUM_GAUSSIANS_PER_BLOCK. If more Gaussians intersect a tile than this limit, the excess are skipped. This is a deliberate simplification — a production implementation would run multiple gather-sort-blend passes for overflow tiles.

After coarse rasterization, padBuffer() fills the unused tail of the short-list with a sentinel value (MAX_UINT). This is required because bitonic sort needs a power-of-2 element count. Sentinel values sort to the far end of the list and are effectively invisible to fine rasterization.

Why MAX_UINT as sentinel? The short-list stores Gaussian indices, and depth values are looked up during sorting. By using the maximum possible integer as a sentinel, those fake entries always sort to the back of the list — far from the camera — and the fine rasterizer's blending loop terminates before reaching them (or they contribute nothing once transmittance is exhausted).
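The padding step itself is tiny. A Python sketch — names mirror the text, with the capacity shrunk for illustration:

```python
MAX_UINT = 0xFFFFFFFF            # sentinel: sorts behind every real index
NUM_GAUSSIANS_PER_BLOCK = 8      # illustrative; must be a power of two

def pad_buffer(blobs, blob_count):
    """Overwrite the unused tail of the fixed-capacity short-list with
    the sentinel so the bitonic sort always sees a full power-of-two array."""
    for i in range(blob_count, NUM_GAUSSIANS_PER_BLOCK):
        blobs[i] = MAX_UINT
```

After padding, sorting the array leaves the real indices at the front and all sentinels together at the back, where blending never meaningfully reaches them.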