You have the architecture and the data, but the network isn't converging. A training loss plateauing at around 0.01 for a coordinate regression task is a classic symptom of Model Collapse—the network has given up and is simply predicting the "average" shape for every input.
In a 256x256 pixel normalized space, an MSE of 0.01 corresponds to an RMSE of 0.1, i.e. 10% of the normalized range. Your network is guessing, on average, 25.6 pixels away from the true corner: a total failure for precise corner localization.
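To see why that number is so bad, convert the loss back into pixels. mseToPixelError is a hypothetical helper for illustration, not a library call:

```javascript
// Sanity-check what a given MSE means in pixels when coordinates
// are normalized to [0, 1] over a square canvas.
function mseToPixelError(mse, canvasSize) {
  const rmse = Math.sqrt(mse); // RMSE in normalized [0, 1] units
  return rmse * canvasSize;    // average error back in pixels
}

mseToPixelError(0.01, 256); // 25.6 pixels off, on average
```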
If your learning rate is too high (e.g., 0.001), the network takes steps that overshoot the target, continuously bouncing around the narrow valley containing the optimum without ever dropping in.
The Fix: Manually drop the learning rate to 0.0001 or 0.00001 when the plateau begins, e.g. optimizer: tf.train.adam(0.0001).
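In TensorFlow.js you apply the drop by rebuilding the optimizer with the lower rate (tf.train.adam(0.0001)). The drop itself can be sketched as a plain step schedule; nextLearningRate is a hypothetical helper, not a library API:

```javascript
// Step-decay sketch: cut the learning rate 10x whenever a plateau is
// detected, clamped at a floor so it never vanishes entirely.
function nextLearningRate(current, plateauDetected, floor = 1e-5) {
  if (!plateauDetected) return current;
  return Math.max(current / 10, floor);
}

let lr = 0.001;
lr = nextLearningRate(lr, true); // dropped 10x once the plateau is seen
lr = nextLearningRate(lr, true); // drops again, clamped at the floor thereafter
```

How you detect the plateau (e.g. no validation-loss improvement over N epochs) is up to your training loop.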
Early in training, random 3D perspective tilts produce massive errors; MSE squares them, creating gradient spikes that destabilize the weights.
The Fix: Use Huber loss (tf.losses.huberLoss). It acts linearly on large outlier errors (preventing spikes) but quadratically, like MSE, as errors approach zero, preserving sub-pixel precision in the final phase.
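The per-element behavior of Huber loss can be sketched in scalar form, assuming the common default delta of 1.0 (the threshold where it switches from quadratic to linear):

```javascript
// Scalar Huber loss sketch: quadratic near zero, linear beyond delta.
function huber(error, delta = 1.0) {
  const abs = Math.abs(error);
  return abs <= delta
    ? 0.5 * error * error          // MSE-like: keeps sub-pixel precision
    : delta * (abs - 0.5 * delta); // linear: no gradient spike on outliers
}

huber(0.1); // ~0.005 (quadratic region)
huber(5.0); // 4.5 (linear region; the 0.5*e^2 quadratic would give 12.5)
```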
Neural networks struggle when inputs and targets live on disparate scales. Always scale your input pixels to [0, 1] (divide by 255) and your target coordinates to [0, 1] (divide by the canvas dimensions).
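A minimal sketch of that scaling; normalizePixel and normalizeCorner are hypothetical helper names, not library calls:

```javascript
// Scale a raw pixel intensity from [0, 255] into [0, 1].
function normalizePixel(value) {
  return value / 255;
}

// Scale a corner's pixel coordinates into [0, 1] by the canvas size.
function normalizeCorner([x, y], width, height) {
  return [x / width, y / height];
}

normalizePixel(255);                  // 1
normalizeCorner([128, 64], 256, 256); // [0.5, 0.25]
```

Remember to apply the inverse (multiply by the canvas dimensions) when drawing predicted corners back onto the image.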
Coordinate regression must be able to output negative values (or values above 1.0) if a corner is pulled off-canvas. Never use relu or sigmoid on your final Dense layer; it must stay 'linear'.
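A sketch of that output head in TensorFlow.js, assuming a four-corner target flattened into 8 values (the layer size is illustrative):

```javascript
import * as tf from "@tensorflow/tfjs";

// Final regression head: 'linear' is the identity activation, so outputs
// can go negative or past 1.0 when a corner leaves the canvas.
// relu would clamp negatives to 0; sigmoid would trap outputs in (0, 1).
const head = tf.layers.dense({
  units: 8,             // 4 corners x 2 coordinates (x, y)
  activation: "linear",
});
```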
Sanity checklist:
- Pixels (0-255 vs 0-1.0): PASS (Scaled)
- Final layer activation function: Linear (True Regression)