To estimate a 4-point homography (eight regressed coordinates) in real time, we need a lightweight but powerful architecture. We implement a VGG-style Convolutional Neural Network (CNN) designed specifically for coordinate regression.
Our raw camera feed is high-resolution, but the network doesn't need every detail. We downsample the frames to 128x128 pixels and convert them to grayscale. Most importantly, we normalize the pixel values from range [0, 255] to [0.0, 1.0] to keep the gradients stable.
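The preprocessing step can be sketched in plain NumPy (in a production pipeline you would typically use `cv2.resize` and `cv2.cvtColor`; the nearest-neighbor sampling and luminance weights here are stand-ins to keep the sketch dependency-free):

```python
import numpy as np

def preprocess(frame):
    """Convert an HxWx3 uint8 RGB frame to a 128x128 float32 grayscale
    image with values in [0.0, 1.0]."""
    # Grayscale via standard Rec. 601 luminance weights.
    gray = frame @ np.array([0.299, 0.587, 0.114])

    # Naive nearest-neighbor downsample to 128x128
    # (cv2.resize with interpolation would be used in practice).
    h, w = gray.shape
    ys = np.linspace(0, h - 1, 128).astype(int)
    xs = np.linspace(0, w - 1, 128).astype(int)
    small = gray[np.ix_(ys, xs)]

    # Normalize [0, 255] -> [0.0, 1.0] to keep gradients stable.
    return (small / 255.0).astype(np.float32)
```

The function name and the choice of nearest-neighbor sampling are illustrative assumptions, not the exact implementation described in the text.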
We stack multiple Conv2D layers to extract high-level patterns, with Max Pooling after every block to reduce the spatial dimensions and focus on the macro-geometry of our 21x21 grid's nested squares.
At the end of the network, we "flatten" the spatial features into a single vector and pass them through a Dense (Fully Connected) layer. Unlike classification networks that end in a Softmax, our output layer has 8 units with a linear activation, one unit per predicted coordinate (the x and y values of the four corner points).
VGG-Style Regression Pipeline