Plan:
- Guess what article is about
- Read abstract
- LLM summary (skipped)
- Notes on LLM summary (skipped)
- YouTube vid if needed (skipped)
- Notes if needed (skipped)
- Read paper over
My Guess:
Identity mappings in deep residual nets. Residual nets were created to make it easy for layers to converge to the identity function when that is what's needed. By using residuals/skips, learning the identity reduces to pushing the residual branch to zero, which is easier for the model than fitting an identity mapping directly with weight layers.
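A quick restatement of that guess in equations (my notation, not yet the paper's): if a block is supposed to represent some target mapping H(x), a residual block only has to learn the residual F(x) = H(x) - x, so the identity mapping corresponds to the easy solution F = 0.

```latex
% Sketch of the residual reformulation (my notation, assuming a target mapping H):
\begin{align*}
  \text{plain block:}    &\quad y = H(x), & \text{identity} &\iff H(x) = x \\
  \text{residual block:} &\quad y = x + F(x), \ F(x) := H(x) - x, & \text{identity} &\iff F(x) = 0
\end{align*}
```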
Abstract Notes:
They use an identity mapping as the skip connection. Why do they even have to use the construct h(x) = x? Why not just say 'x'? That is, in y = h(x) + F(x, W), why not just write x + F(x, W)? (F is a residual function, btw.) (I assume there is a reason for this, but on its face it seems unnecessary.)
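For reference, the paper's general form of a residual unit keeps h (the skip function) and f (the after-addition activation) as named functions precisely so that non-identity choices can be compared later, which I think answers my question above:

```latex
% General residual unit as the paper writes it
% (h = skip function, f = after-addition activation):
\begin{align*}
  y_l     &= h(x_l) + F(x_l, W_l) \\
  x_{l+1} &= f(y_l)
\end{align*}
% Original ResNet: h(x_l) = x_l (identity skip) and f = ReLU.
% This paper argues for making f an identity mapping too, so that x_{l+1} = y_l.
```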
Paper Notes:
- Authors: if the activation function is also an identity mapping, the network has some advantages (I did not understand at first what they meant by 'the signal could be directly propagated from one unit to any other units, in both forward and backward passes'; the recursion and backprop equations below make it clearer).
- Looking at their diagram of information propagation, they eliminate the ReLU after adding the residual to x.
- Keeping a clean information pathway is helpful for easing optimization, they say. Makes sense intuitively. Is this an example of a better inductive bias?
- Ok, so there ARE variants of h. The authors propose using an identity mapping.
- Original residual unit (post-activation):
- x_l -> weight -> BN -> ReLU -> weight -> BN -> addition (skip adds x_l) -> ReLU -> x_(l+1)   (what is BN? answered below)
- New unit (pre-activation): BN -> ReLU is done before each weight, this is repeated, then the result is added onto x_l. No ReLU after the addition; identity skip connection. (A runnable sketch of both units is at the end of these notes.)
- x_l -> BN -> ReLU -> weight -> BN -> ReLU -> weight -> addition (skip adds x_l) -> x_(l+1)
- BN is batch normalization: normalize each feature over the batch (subtract the batch mean, divide by the batch std), then scale/shift it with learnable parameters gamma and beta.
- ResNet sub-blocks are called residual units.
- In the original ResNet, f (the after-addition activation) was ReLU, but here it's just an identity map.
- x_(l+1) = x_l + F(x_l, W_l)
- Recursively:
- x_(l+2) = x_(l+1) + F(x_(l+1), W_(l+1)) = x_l + F(x_l, W_l) + F(x_(l+1), W_(l+1))
- x_L = x_l + SUM(i=l to L-1) F(x_i, W_i)
- Super simple backprop eq:
- dE/dx_l = dE/dx_L * dx_L/dx_l = dE/dx_L * (1 + d/dx_l SUM(i=l to L-1) F(x_i, W_i))
- The first dE/dx_L term propagates the gradient directly back to any shallower unit l, without passing through any weight layers. The paper also notes the '1 +' means the whole gradient is unlikely to be canceled out, since the summed term can't be exactly -1 for all samples in a mini-batch.
- A non-identity skip connection function contributes to vanishing or exploding gradients. The authors demonstrate this mathematically: a constant-scaling skip h(x_l) = lambda_l * x_l makes the gradient pick up a product of the lambda factors, which explodes or vanishes unless the coefficient is 1 (i.e. the identity function). Other, more complex skip functions (gating, 1x1 conv shortcuts) have derivatives that can block propagation the same way. (A quick numeric check of this is at the end of these notes.)
- So skip connections are the grey arrow? Got it. Residual calculation is parallel to skip connection.
- Dropout on the shortcut imposes a scale factor on each identity shortcut, meaning it is no longer an identity shortcut (it's now scaled), which hampers the flow of information. Other multiplicative manipulations have this same effect.
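To pin down the two unit diagrams above, here is a minimal PyTorch-style sketch of both variants. This is my own code, not the authors' (it assumes 3x3 convs and a constant channel count so the identity skip needs no projection):

```python
import torch
import torch.nn as nn


class OriginalResidualUnit(nn.Module):
    """Post-activation unit: weight -> BN -> ReLU -> weight -> BN -> add -> ReLU."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(x + out)   # ReLU sits after the addition (f = ReLU)


class PreActResidualUnit(nn.Module):
    """Pre-activation unit: BN -> ReLU -> weight -> BN -> ReLU -> weight -> add."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        return x + out              # nothing after the addition (f = identity)


if __name__ == "__main__":
    x = torch.randn(2, 16, 8, 8)
    print(OriginalResidualUnit(16)(x).shape, PreActResidualUnit(16)(x).shape)
```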
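And a quick numeric check of the scaling argument, using a toy 1-D chain of residual units (again my own sketch, with a made-up residual F(x, w) = w * tanh(x)): an identity skip (lam = 1) keeps dx_L/dx_l around O(1) thanks to the additive '1 + ...' factors, while a constant-scaled skip multiplies the gradient by roughly lam^L, which shrinks for lam < 1 and blows up for lam > 1.

```python
import torch


def residual_stack(x, weights, lam=1.0):
    # x_{i+1} = lam * x_i + F(x_i, w_i), with a toy residual F(x, w) = w * tanh(x)
    for w in weights:
        x = lam * x + w * torch.tanh(x)
    return x


L = 50
torch.manual_seed(0)
weights = [torch.randn(()) * 0.1 for _ in range(L)]

for lam in (1.0, 0.9, 1.1):
    x_l = torch.randn((), requires_grad=True)
    x_L = residual_stack(x_l, weights, lam=lam)
    x_L.backward()
    # lam = 1.0 keeps the gradient O(1); lam != 1 scales it by roughly lam**L
    print(f"lam={lam:>4}: dx_L/dx_l = {x_l.grad.item():.3e}")
```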
Main Takeaways:
- Use identity mappings in residual nets to avoid the vanishing/exploding gradient problem
- Identity shortcut connections and an identity after-addition activation (i.e. pre-activation residual units) are essential