Google explanation: Google Blog
GPipe explained: Papers with Code
Train deep neural networks faster by splitting the network into cells and distributing the cells across accelerators. Essentially reduces waste from accelerator downtime (accelerators sitting idle while waiting on other partitions).
Splits each mini-batch into micro-batches.
What are mini batches? Stanford CS230
X (shape (n_x, m)), Y (shape (1, m)) = the full batch.
Break X, Y up into mini-batches X^{t} (shape (n_x, 1000)), Y^{t} (shape (1, 1000)); mini-batch size = 1000 here.
Gradient descent on minibatches:
Forward prop:
Z[1] = W[1]X^{t} + b[1], and for deeper layers Z[l] = W[l]A[l-1] + b[l]
A[l] = g[l](Z[l]), where g[l] is layer l's activation function
So yes: the forward pass and backprop each run over the whole mini-batch at once, vectorized across its examples (if I am understanding this correctly).
Choosing minibatch size: 64, 128, 256, etc. Make sure it fits in CPU/GPU memory.
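A minimal sketch of the mini-batch loop above, assuming a single-layer logistic-regression model just to keep it short; the function name, learning rate, and epoch count are made up, but the shapes follow the X (n_x, m) / X^{t} convention from these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradient_descent(X, Y, batch_size=1000, lr=0.01, epochs=10):
    """Mini-batch gradient descent on a single-layer (logistic regression) model.
    X has shape (n_x, m), Y has shape (1, m), matching the notes above."""
    n_x, m = X.shape
    W = np.zeros((1, n_x))
    b = 0.0
    for epoch in range(epochs):
        # Shuffle the columns (examples) each epoch
        perm = np.random.permutation(m)
        X_shuf, Y_shuf = X[:, perm], Y[:, perm]
        for t in range(0, m, batch_size):
            X_t = X_shuf[:, t:t + batch_size]   # X^{t}, shape (n_x, <=1000)
            Y_t = Y_shuf[:, t:t + batch_size]
            mb = X_t.shape[1]
            # Forward prop, vectorized over the whole mini-batch
            Z = W @ X_t + b
            A = sigmoid(Z)
            # Backprop (cross-entropy loss), also vectorized
            dZ = A - Y_t
            dW = dZ @ X_t.T / mb
            db = dZ.sum() / mb
            # Parameter update
            W -= lr * dW
            b -= lr * db
    return W, b
```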
Exponentially weighted average: V_t = beta*V_{t-1} + (1-beta)*theta_t, i.e. keep a running average that decays the past values by beta. beta ~ 0.98 is a good value (it averages over roughly the last 1/(1-beta) = 50 values).
Can use this for gradient descent with momentum.
Momentum grad descent: physics analogy. The gradient (dW, db) plays the role of acceleration, the running average V plays the role of velocity, and beta acts like friction that keeps the velocity from growing without bound.
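Rough sketch of the momentum update, assuming dW/db are the gradients from backprop on the current mini-batch and vW/vb are the running averages; all names here are illustrative.

```python
def momentum_step(W, b, dW, db, vW, vb, lr=0.01, beta=0.9):
    """One gradient-descent-with-momentum step.
    vW/vb are the exponentially weighted averages of past gradients (the 'velocity')."""
    vW = beta * vW + (1 - beta) * dW   # decay old velocity, mix in the current gradient
    vb = beta * vb + (1 - beta) * db
    W = W - lr * vW                    # step along the smoothed gradient
    b = b - lr * vb
    return W, b, vW, vb
```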
RMSprop: modified momentum-style update that keeps an exponentially weighted average of the squared gradients, S = beta*S + (1-beta)*dW^2, and divides the update by sqrt(S) + epsilon. Choose a beta closer to 1 (e.g. 0.999) because dW/db get squared.
Adam: combination of momentum grad descent and RMSprop (momentum average in the numerator, RMSprop average in the denominator), with bias correction on both averages.
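A sketch of an Adam-style step showing how the two pieces combine: the momentum average (vW) sits in the numerator and the RMSprop average of squared gradients (sW) in the denominator. Names are illustrative; the defaults are the commonly quoted values (beta1=0.9, beta2=0.999, eps=1e-8).

```python
import numpy as np

def adam_step(W, dW, vW, sW, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter tensor W.
    vW: momentum average of dW, sW: RMSprop average of dW**2, t: step count (1-based)."""
    vW = beta1 * vW + (1 - beta1) * dW          # momentum term
    sW = beta2 * sW + (1 - beta2) * dW ** 2     # RMSprop term (squared gradients)
    v_hat = vW / (1 - beta1 ** t)               # bias correction
    s_hat = sW / (1 - beta2 ** t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, vW, sW
```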
Learning rate decay: decay the learning rate over epochs. Several different methods: alpha = alpha_0 / (1 + decay_rate * epoch), exponential decay alpha = alpha_0 * 0.95^epoch, and others.
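Two of those schedules written out, just to make the formulas concrete; alpha0 and decay_rate are whatever you pick, and these are standard textbook forms, nothing specific to GPipe.

```python
def lr_inverse_decay(alpha0, decay_rate, epoch):
    # alpha shrinks like 1 / (1 + decay_rate * epoch)
    return alpha0 / (1 + decay_rate * epoch)

def lr_exponential_decay(alpha0, base, epoch):
    # e.g. base = 0.95 gives alpha0 * 0.95 ** epoch
    return alpha0 * base ** epoch
```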
F and B in the GPipe diagrams are the forward and backward passes (took me until now to realize this).
Divide network into K cells
Place the kth cell onto the kth accelerator.
The partitioning algorithm tries to balance computation time across the cells, minimizing the variance in each cell's cost so the pipeline stays efficient.
Forward: divides each mini-batch of N examples into M micro-batches, which are pipelined through the K accelerators.
Backward: calculate gradients for each micro-batch.
End of mini-batch: accumulate all M gradients, then update the model parameters synchronously across all accelerators (see the sketch below).
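A toy sketch of the micro-batch gradient-accumulation part of the update (not the real GPipe scheduler, which also overlaps the K cells across accelerators in a pipeline); grad_fn and the axis convention here are assumptions for illustration.

```python
import numpy as np

def gpipe_style_update(params, X_mini, Y_mini, grad_fn, lr=0.01, M=4):
    """Split one mini-batch into M micro-batches, accumulate their gradients,
    and apply a single synchronous update at the end of the mini-batch.
    grad_fn(params, X, Y) is assumed to return the mean gradient for that
    micro-batch, shaped like params."""
    micro_X = np.array_split(X_mini, M, axis=1)   # examples live on axis 1, as above
    micro_Y = np.array_split(Y_mini, M, axis=1)
    accum = [np.zeros_like(p) for p in params]
    for X_mu, Y_mu in zip(micro_X, micro_Y):
        grads = grad_fn(params, X_mu, Y_mu)       # forward + backward on one micro-batch
        accum = [a + g for a, g in zip(accum, grads)]
    # Average over the M micro-batches, then apply one update (the same update
    # is applied on every accelerator, keeping the model consistent).
    return [p - lr * (a / M) for p, a in zip(params, accum)]
```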