Structured light-based depth sensors provide accurate depth information independently of the scene appearance by extracting pattern positions from the captured pixel intensities.
Spatial neighborhood encoding, in particular, is a popular structured light approach for off-the-shelf hardware. However, it suffers when the scene's geometry distorts and fragments the projected pattern in the vicinity of a pixel. This forces algorithms to strike a delicate balance between depth prediction accuracy and robustness to pattern fragmentation or appearance change. Stereo matching provides more robustness at the expense of accuracy. We show that learning to regress a pixel's position within the projected pattern, when combined with classification, is not only more accurate but can also be made equally robust. We propose to split the regression problem into smaller classification sub-problems in a coarse-to-fine manner, using a weight-adaptive layer that efficiently implements branching per-pixel Multilayer Perceptrons applied to features extracted by a Convolutional Neural Network.
As our approach requires full supervision, we train our algorithm on a rendered dataset sufficiently close to the real-world domain. On a separately captured real-world dataset, we show that our network outperforms state-of-the-art methods and is significantly more robust than other regression-based approaches.
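The coarse-to-fine split described above can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names, parameter layout, single linear layer per branch, hard argmax branch selection, and sigmoid offset are all assumptions made for illustration. It only shows the general idea of a weight-adaptive layer, where each pixel selects its own set of weights according to the class predicted in the previous stage.

```python
import numpy as np


def weight_adaptive_linear(x, idx, W, b):
    """Linear layer whose weights are selected per pixel (hypothetical helper).

    x:   (N, C_in) per-pixel feature vectors
    idx: (N,) integer branch index for every pixel
    W:   (K, C_in, C_out) one weight matrix per branch
    b:   (K, C_out) one bias vector per branch
    """
    # Gather each pixel's weights, then apply them to that pixel only.
    return np.einsum("nc,nco->no", x, W[idx]) + b[idx]


def coarse_to_fine_position(feat, params, n_coarse, n_fine):
    """Predict a normalized pattern position in [0, 1) for every pixel."""
    # Stage 1: a classifier shared by all pixels picks a coarse interval.
    coarse = (feat @ params["Wc"] + params["bc"]).argmax(axis=1)

    # Stage 2: the coarse class selects the classifier for the fine interval.
    fine_logits = weight_adaptive_linear(feat, coarse, params["Wf"], params["bf"])
    fine = fine_logits.argmax(axis=1)

    # Stage 3: the (coarse, fine) pair selects a regressor that predicts
    # the remaining offset inside the fine interval.
    branch = coarse * n_fine + fine
    raw = weight_adaptive_linear(feat, branch, params["Wr"], params["br"])[:, 0]
    offset = 1.0 / (1.0 + np.exp(-raw))  # sigmoid keeps the offset in (0, 1)

    return (branch + offset) / (n_coarse * n_fine)
```

In this sketch, `feat` stands in for the per-pixel features that a Convolutional Neural Network would produce, flattened to shape `(N, C)`. Each stage only has to resolve a small sub-problem, which is the motivation for splitting the regression into classification steps.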
dataset.zip: Training and validation data, synthetic as well as captured.
dataset_jpg.zip: The same dataset, but with lossy JPEG encoding for the IR images.
trained_models.zip: Models trained on the synthetic data.