Building the Regression Engine

To estimate a homography via its 4-point parameterization (eight regressed coordinates) in real time, we need a lightweight but powerful architecture. We implement a VGG-style Convolutional Neural Network (CNN) designed specifically for coordinate regression.

The Input Pipeline

Our raw camera feed is high-resolution, but the network doesn't need every detail. We downsample the frames to 128x128 pixels and convert them to grayscale. Most importantly, we normalize the pixel values from range [0, 255] to [0.0, 1.0] to keep the gradients stable.
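A minimal sketch of the grayscale conversion and normalization in plain JavaScript (the helper name is hypothetical; in a real pipeline the resize and tensor conversion would typically go through tf.browser.fromPixels and tf.image.resizeBilinear):

```javascript
// Convert an RGBA pixel buffer to grayscale values normalized to [0, 1].
// Uses the standard ITU-R BT.601 luminance weights.
function toNormalizedGray(rgba) {
  const gray = new Float32Array(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[i * 4], g = rgba[i * 4 + 1], b = rgba[i * 4 + 2];
    gray[i] = (0.299 * r + 0.587 * g + 0.114 * b) / 255;
  }
  return gray;
}

// A pure white pixel lands at (approximately) 1.0, pure black at 0.0.
const sample = toNormalizedGray(new Uint8Array([255, 255, 255, 255, 0, 0, 0, 255]));
```

Keeping values in [0.0, 1.0] rather than [0, 255] means the first layer's pre-activations start at a sane scale, which is what keeps the gradients stable during training.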

Feature Extraction Layers

We stack multiple Conv2D layers to extract increasingly high-level patterns, applying Max Pooling after each block to reduce the spatial dimensions and focus the network on the macro-geometry of our 21x21 grid's nested squares.
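The spatial shrinkage is easy to verify by hand: a 'valid' 3x3 convolution removes a 1-pixel border (size - kernelSize + 1), and non-overlapping 2x2 pooling halves the size, rounding down. A quick arithmetic check:

```javascript
// Output size of a 'valid' convolution with stride 1.
const conv = (size, kernel) => size - kernel + 1;
// Output size of non-overlapping max pooling.
const pool = (size, p) => Math.floor(size / p);

let s = 128;              // grayscale input is 128x128
s = conv(s, 3);           // Conv2D, 3x3 kernel -> 126
s = pool(s, 2);           // MaxPool 2x2        -> 63
s = conv(s, 3);           // Conv2D, 3x3 kernel -> 61
const flatLength = s * s * 64;  // 61 * 61 * 64 = 238144
```

These are exactly the shapes that appear in the pipeline summary below.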

The Regression Head

At the end of the network, we "flatten" the spatial features into a single vector and pass them through a Dense (fully connected) layer. Unlike classification networks, which end in a Softmax, we use 8 units with a linear activation so the outputs are unbounded, continuous values.

Absolute Sub-Pixel Output: These 8 units output the continuous (x, y) coordinates for our four anchor corners. This is the raw data that feeds our inverse mapping system.
const model = tf.sequential();
// The first layer needs an explicit inputShape: 128x128 grayscale frames.
model.add(tf.layers.conv2d({inputShape: [128, 128, 1], filters: 32, kernelSize: 3, activation: 'relu'}));
model.add(tf.layers.maxPooling2d({poolSize: 2}));
model.add(tf.layers.conv2d({filters: 64, kernelSize: 3, activation: 'relu'}));
model.add(tf.layers.flatten());
model.add(tf.layers.dense({units: 8, activation: 'linear'}));
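Downstream code has to regroup the flat 8-value prediction into (x, y) pairs before the inverse mapping step. A small sketch (the interleaved [x0, y0, x1, y1, ...] layout is an assumption; it must match whatever ordering the training labels used):

```javascript
// Regroup a flat [x0, y0, x1, y1, x2, y2, x3, y3] prediction into corner points.
function toCorners(pred) {
  const corners = [];
  for (let i = 0; i < pred.length; i += 2) {
    corners.push({ x: pred[i], y: pred[i + 1] });
  }
  return corners;
}

// Example: a prediction in 128x128 input coordinates.
const corners = toCorners([10.5, 12.2, 118.0, 11.9, 117.4, 119.1, 9.8, 118.6]);
```

Because the activation is linear, these coordinates are continuous and can land between pixels, which is what gives us sub-pixel precision.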
Layer     Role                 Output Shape
Input     Grayscale Tensor     [128, 128, 1]
Conv2D    Feature Map 32       [126, 126, 32]
MaxPool   Spatial Reduction    [63, 63, 32]
Conv2D    Feature Map 64       [61, 61, 64]
Flatten   Spatial to Vector    [238144]
Output    Regression Head      [8]

VGG-Style Regression Pipeline
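Flattening straight into a Dense layer is where nearly all of the network's parameters live. A quick count for the final layer (weights = inputs x units, plus one bias per unit):

```javascript
const flatLength = 61 * 61 * 64;         // 238144, from the Flatten row above
const denseParams = flatLength * 8 + 8;  // weight matrix plus 8 biases
// denseParams === 1905160, roughly 1.9M parameters in the head alone.
```

This is the usual trade-off of a VGG-style regression head: the convolutional stack stays cheap while the single Dense layer dominates the model size.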