Deep Residual Learning

Deep Residual Learning Paper

Residual networks: the weights learn residuals, not the transformed outputs themselves. Residual Neural Network

Standard: WX => Y

Residual: WX => R, X+R => Y
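
To make the two mappings concrete, here is a minimal sketch of a residual block in PyTorch (my own illustration, not the exact block from the paper): the layers compute only the residual R = F(X), and the block outputs X + R via the skip connection.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Toy residual block: the layers learn R = F(x), the output is x + R."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two conv layers with batch norm, roughly the shape of a basic ResNet block.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        r = self.residual(x)      # R = F(X), the learned residual
        return self.relu(x + r)   # Y = X + R (skip connection)

# Usage: a 64-channel block applied to a random feature map.
block = ResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))  # same shape as the input
```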

This paper, judging from the title, is about deep neural networks that are residual nets.

DNNs suffer from a degradation problem that is not caused by overfitting: past a certain depth, adding more layers leads to higher error, on the training set as well as the test set.

In theory, if you start with a shallow architecture, train it, then add more layers and train again, the new layers could simply learn to pass their inputs through unchanged (an identity mapping), so the deeper network should do no worse than the shallow one. However, this didn't happen when the authors tried it (perhaps because the parameter space is so huge that the model is unlikely to find the identity mapping).

Turns out finding the identity function is hard. BUT not for a residual network, since the identity mapping for a residual block just means the residual function is all zeroes, and the weights are initialized close to 0 anyway.

With a residual net, the default function a block learns is all zeroes, which makes the whole block an identity mapping.
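
A quick sanity check of that claim, using a toy one-layer residual mapping (my own example, not from the paper): if the residual branch's weights are zero, the block reduces exactly to the identity, whereas a plain layer with the same zero weights collapses everything to zero.

```python
import torch
import torch.nn as nn

# A one-layer residual mapping: y = x + layer(x).
layer = nn.Linear(4, 4)

# Force the residual function to zero -- the "default" that near-zero
# weight initialization already sits close to.
nn.init.zeros_(layer.weight)
nn.init.zeros_(layer.bias)

x = torch.randn(2, 4)
y = x + layer(x)  # residual formulation

# With a zero residual, the block is exactly the identity mapping.
assert torch.allclose(y, x)

# A plain (non-residual) layer with zero weights instead maps everything
# to zero -- nowhere near the identity.
assert torch.allclose(layer(x), torch.zeros_like(x))
```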

Amazing YouTube video going through this: YouTube Video