In high-resolution vision systems, image coordinates often range from 0 to 8000 pixels. When these values are plugged into a DLT matrix, the entries can span several orders of magnitude (e.g., $10^0$ for the constant terms vs. roughly $10^7$ for products of coordinates).
This disparity causes the matrix $A^T A$ to have a very high condition number, making the SVD solution extremely sensitive to even tiny amounts of noise. The result is "jittery" or wildly incorrect homographies.
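To make the disparity concrete, here is a minimal sketch (the row layout follows the standard two-rows-per-correspondence DLT formulation; the specific coordinate values are our illustrative assumptions) that builds one DLT row for a correspondence $(x, y) \to (u, v)$ at raw pixel scale and measures the spread of its nonzero entries:

```python
import numpy as np

# One DLT row for the correspondence (x, y) -> (u, v), raw pixel coordinates.
x, y, u, v = 7500.0, 6000.0, 7600.0, 5900.0
row = np.array([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])

nz = np.abs(row[row != 0])  # magnitudes of the nonzero entries
print(nz.min(), nz.max())   # 1.0 vs ~5.7e7: nearly 8 orders of magnitude apart
```

Squaring these magnitudes when forming $A^T A$ doubles the spread, which is exactly why the condition number explodes.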
The standard solution is to normalize the points before estimation. This is achieved by a transformation matrix $T$ that shifts the centroid to the origin and scales the points so their average distance from the origin is $\sqrt{2}$.
After finding the normalized homography $\tilde{H}$, it must be de-normalized as $H = T'^{-1} \tilde{H} T$, where $T$ and $T'$ are the normalizing transforms for the source and target point sets, respectively.
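The whole normalize–estimate–de-normalize pipeline can be sketched end to end as below (a minimal illustration, not a production implementation: function names are ours, the DLT rows follow the standard two-rows-per-correspondence formulation, and the test homography is an arbitrary assumed example):

```python
import numpy as np

def norm_transform(pts):
    # Similarity transform: centroid -> origin, mean distance -> sqrt(2).
    c = pts.mean(axis=0)
    s = np.sqrt(2) / np.linalg.norm(pts - c, axis=1).mean()
    return np.array([[s, 0.0, -s * c[0]],
                     [0.0, s, -s * c[1]],
                     [0.0, 0.0, 1.0]])

def homography_dlt(src, dst):
    # Normalize both point sets: T for the source, T' for the target.
    T, Tp = norm_transform(src), norm_transform(dst)
    src_n = (T @ np.column_stack([src, np.ones(len(src))]).T).T
    dst_n = (Tp @ np.column_stack([dst, np.ones(len(dst))]).T).T
    # Two DLT rows per correspondence, built from normalized coordinates.
    A = []
    for (x, y, _), (u, v, _) in zip(src_n, dst_n):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H_tilde = Vt[-1].reshape(3, 3)       # right singular vector of smallest sigma
    H = np.linalg.inv(Tp) @ H_tilde @ T  # de-normalize: H = T'^{-1} H~ T
    return H / H[2, 2]                   # fix the overall scale

# Round-trip check: recover a known homography from exact correspondences.
H_true = np.array([[1.1, 0.02, 40.0],
                   [-0.01, 0.95, -25.0],
                   [1e-5, 2e-5, 1.0]])
src = np.random.default_rng(1).uniform(0.0, 8000.0, (12, 2))
src_h = np.column_stack([src, np.ones(12)])
dst_h = (H_true @ src_h.T).T
dst = dst_h[:, :2] / dst_h[:, 2:]
H_est = homography_dlt(src, dst)
print(np.allclose(H_est, H_true, atol=1e-4))  # True
```

Note that because the correspondences here are noise-free, the raw (unnormalized) DLT may also succeed; the normalization pays off precisely when the measurements are noisy, as described above.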