Before computing any pixel colour, the GPU must determine which Gaussians are even relevant to each pixel. This is done efficiently by dividing the image into tiles and having each workgroup build a short-list for its tile.
A GPU compute shader is dispatched over the full output image. Rather than have every thread independently search through all Gaussians, the image is divided into rectangular tiles — each the size of one workgroup (WG_X × WG_Y threads, typically 16×16 = 256 threads).
All threads in a workgroup share the same fast on-chip shared memory. The tile-based approach exploits this: all 256 threads in a workgroup cooperate to build a short-list of Gaussians for their tile, store it in shared memory, then each thread independently reads from that list to compute its own pixel colour.
Two-phase design: building the short-list is a cooperative task (threads divide up the work), while evaluating the final pixel colour is embarrassingly parallel (each thread acts alone). The stages are explicitly separated in the code.
Hover a tile to see which Gaussians' bounding boxes intersect it. Blue tile = selected workgroup; amber outline = Gaussians in the short-list.
Shared memory is large enough for one tile's short-list, not the entire scene. Without tiling, every thread would need to iterate every Gaussian — O(N) work per thread. With tiling, the 256 threads share the O(N) scan and each thread only evaluates the Gaussians that actually touch its tile.
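To make the tile decomposition concrete, here is a minimal Python model of the pixel-to-tile arithmetic. WG_X and WG_Y follow the text; the render size is an arbitrary assumption for illustration:

```python
WG_X, WG_Y = 16, 16          # workgroup (tile) size, as in the text

def tile_of_pixel(px, py):
    """Tile (workgroup) coordinates that own a given pixel."""
    return px // WG_X, py // WG_Y

def tile_pixel_bounds(tx, ty):
    """Half-open pixel rectangle [lo, hi) covered by tile (tx, ty)."""
    return (tx * WG_X, ty * WG_Y), ((tx + 1) * WG_X, (ty + 1) * WG_Y)

tx, ty = tile_of_pixel(500, 300)
lo, hi = tile_pixel_bounds(tx, ty)
print(tx, ty, lo, hi)  # 31 18 (496, 288) (512, 304)
```

Every pixel in that 16×16 rectangle is handled by the same workgroup, so all 256 threads can reuse one shared short-list.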
coarseRasterize()

There are typically far more Gaussians in the scene than there are threads in a workgroup. coarseRasterize() distributes them using a strided loop: thread i processes Gaussians i, i+WG_SIZE, i+2×WG_SIZE, …
for (uint i = localIdx; i < numGaussians; i += (WG_X * WG_Y))
{
    Gaussian2D gaussian = Gaussian2D.load(i, localIdx);
    OBB bounds = gaussian.bounds();
    if (bounds.intersects(tileBounds))
    {
        blobs[blobCountAT++] = i; // atomic increment
    }
}
Each thread independently loads a Gaussian, computes its bounding box, and tests intersection with the tile's OBB tileBounds. If it intersects, the thread atomically appends the Gaussian's index to shared blobs[]. The atomic increment (blobCountAT++) prevents two threads from writing to the same slot simultaneously.
Strided work distribution across 8 threads for 20 Gaussians. Each colour is one thread. Step through batches to see how the workload is shared.
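The strided distribution and atomic append can be modelled on the CPU. In this Python sketch the "threads" run sequentially, so the atomic counter reduces to a plain list append; intersects_tile is a hypothetical stand-in for the OBB intersection test:

```python
WG_SIZE = 8          # threads per workgroup (matches the 8-thread figure)
NUM_GAUSSIANS = 20

def coarse_rasterize(intersects_tile):
    """Each 'thread' local_idx scans Gaussians local_idx, local_idx+WG_SIZE, ...
    Hits are appended to the shared short-list (models blobs[blobCountAT++] = i)."""
    blobs = []                                           # shared-memory short-list
    for local_idx in range(WG_SIZE):                     # one pass per thread
        for i in range(local_idx, NUM_GAUSSIANS, WG_SIZE):  # strided loop
            if intersects_tile(i):
                blobs.append(i)
    return blobs

# the strides partition the index range: together the threads
# touch each Gaussian exactly once, with no index visited twice
seen = coarse_rasterize(lambda i: True)
assert sorted(seen) == list(range(NUM_GAUSSIANS))
```

Note that the order of indices in blobs depends on which thread appends first, which is why the short-list must be sorted by depth before blending.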
Two barriers frame the function:

GroupMemoryBarrierWithGroupSync(); // start: wait for reset
// ... strided loop ...
GroupMemoryBarrierWithGroupSync(); // end: wait for all writes
blobCount = blobCountAT.load();

The opening barrier ensures initShortList() has completed before any thread begins writing. The closing barrier ensures every thread has finished its strided loop before blobCount is read. Without it, a fast thread could read the count while slow threads are still adding Gaussians.
calcUV()

Gaussians are defined in normalised UV space [0,1]². The calcUV() function converts a pixel's dispatch ID into UV without distorting the image. When the render target has a different aspect ratio than the texture, it letterboxes or pillarboxes — scaling to match one axis and centering on the other.
if (aspectRatioRT > aspectRatioTEX)
{
// RT wider → match widths, letterbox vertically
float xCoord = dispatchID.x / renderSize.x;
float yCoord = (dispatchID.y * aspectRatioTEX) / renderSize.x;
float yCoordMax = aspectRatioTEX / aspectRatioRT;
yCoord += (1.0 - yCoordMax) / 2.0; // re-centre
}
The same function also computes tileLow and tileHigh — the UV corners of the current workgroup's tile — which become the OBB tileBounds passed to coarse rasterization.
Blue = Gaussian UV space [0,1]². Amber = the portion of UV space covered by the current render target. Toggle aspect ratios to see letterboxing and pillarboxing.
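A Python transcription of the mapping makes the math easy to check numerically. This is a sketch, not the shader source: both aspect ratios are assumed to be width/height, and the pillarbox branch is reconstructed symmetrically from the description above:

```python
def calc_uv(px, py, render_w, render_h, tex_w, tex_h):
    """Map a pixel to texture UV, letterboxing or pillarboxing
    so the texture's aspect ratio is preserved."""
    ar_rt, ar_tex = render_w / render_h, tex_w / tex_h
    if ar_rt > ar_tex:
        # RT wider: match widths, letterbox vertically
        u = px / render_w
        v = (py * ar_tex) / render_w
        v_max = ar_tex / ar_rt
        v += (1.0 - v_max) / 2.0    # re-centre the band vertically
    else:
        # RT taller (reconstructed symmetric case): match heights,
        # pillarbox horizontally
        v = py / render_h
        u = px / (ar_tex * render_h)
        u_max = ar_rt / ar_tex
        u += (1.0 - u_max) / 2.0    # re-centre the band horizontally
    return u, v

# square texture on a 2:1 render target: full width spans u in [0,1],
# full height spans a centred v band [0.25, 0.75]
print(calc_uv(0, 0, 1024, 512, 256, 256))       # (0.0, 0.25)
print(calc_uv(1024, 512, 1024, 512, 256, 256))  # (1.0, 0.75)
```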
padBuffer()

The blobs[] array lives in shared memory with a fixed capacity: NUM_GAUSSIANS_PER_BLOCK. If more Gaussians intersect a tile than this limit, the excess are skipped. This is a deliberate simplification — a production implementation would run multiple gather-sort-blend passes for overflow tiles.
After coarse rasterization, padBuffer() fills the unused tail of the short-list with a sentinel value (MAX_UINT). This is required because bitonic sort needs a power-of-2 element count. Sentinel values sort to the far end and are invisible to fine rasterization — their depth values are so large they contribute zero to blending.
Why MAX_UINT as sentinel? The short-list stores Gaussian indices, and depth values are looked up during sorting. By using the maximum possible integer as a sentinel, those fake entries always sort to the back of the list — far from the camera — and the fine rasterizer's blending loop terminates before reaching them (or they contribute nothing once transmittance is exhausted).
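The padding step can be sketched in Python. The capacity value used here is an arbitrary assumption (any power of two works); MAX_UINT is the 32-bit sentinel described above:

```python
MAX_UINT = 0xFFFFFFFF            # sentinel index: sorts behind every real entry
NUM_GAUSSIANS_PER_BLOCK = 8      # fixed power-of-2 capacity (assumed value)

def pad_buffer(short_list):
    """Fill the unused tail with sentinels so bitonic sort always sees a
    power-of-2 element count; overflow past capacity is simply skipped."""
    kept = short_list[:NUM_GAUSSIANS_PER_BLOCK]
    return kept + [MAX_UINT] * (NUM_GAUSSIANS_PER_BLOCK - len(kept))

print(pad_buffer([7, 2, 42]))
# [7, 2, 42, 4294967295, 4294967295, 4294967295, 4294967295, 4294967295]
```

After the depth sort, the sentinel entries sit at the back of the list, so the fine rasterizer's front-to-back blending loop never blends them.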