GPipe


GPipe Paper

Google explanation: Google Blog

GPipe explained: Papers with Code

GPipe:

Train deep neural networks faster by splitting the network into cells and distributing the cells across accelerators, then pipelining work through them. Essentially, reduce waste from accelerators sitting idle.

Splits mini batches into micro batches.

Mini Batches Tangent

What are mini batches? Stanford CS230

X(n_x, m), Y(1, m) = the full batch of m examples

Break X, Y up into mini-batches X{t}(n_x, 1000), Y{t}(1, 1000) (mini-batch size = 1000 here)
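A quick sketch of that split, assuming X is (n_x, m) and Y is (1, m) as above. Shuffling the columns first is the usual extra step; the function name and seed are mine, not from the notes:

```python
import numpy as np

def make_minibatches(X, Y, batch_size=1000, seed=0):
    """Shuffle the m example columns of X/Y and slice them into mini-batches."""
    n_x, m = X.shape
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, t:t + batch_size], Y[:, t:t + batch_size])
            for t in range(0, m, batch_size)]
```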

Gradient descent on minibatches:

Forward prop:

Z[1] = W[1]X{t} + b[1], and in general Z[l] = W[l]A[l-1] + b[l]

A[l] = activation_func(Z[l])

So you can do the forward pass over the whole mini-batch at once? And backprop all at once too? Ok... if I am understanding this correctly, everything is vectorized over the mini-batch dimension.
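A minimal sketch of that vectorized forward pass, assuming a params dict with keys like W1/b1 and tanh as a stand-in activation (all names here are my own):

```python
import numpy as np

def forward_pass(X_t, params, L, activation=np.tanh):
    """Forward pass over one whole mini-batch X_t of shape (n_x, batch_size).

    Every layer is computed on all mini-batch examples at once via a single
    matrix multiplication.
    """
    A = X_t                                            # A[0] = X{t}
    for l in range(1, L + 1):
        Z = params[f"W{l}"] @ A + params[f"b{l}"]      # Z[l] = W[l] A[l-1] + b[l]
        A = activation(Z)                              # A[l] = activation_func(Z[l])
    return A
```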

Choosing mini-batch size: powers of 2 like 64, 128, 256, etc. Make sure a mini-batch fits in CPU/GPU memory.

Exponential Moving Averages Refresher

V_t = beta*V_(t-1) + (1-beta)*next_val_t: store and decay the past values by beta, so it averages over roughly 1/(1-beta) recent values. Beta ~ 0.98 as a good value.
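Tiny sketch of the running average (the bias correction V_t / (1 - beta^t) is the usual extra so the early values aren't biased toward zero):

```python
def ema(values, beta=0.98):
    """Exponential moving average: V_t = beta*V_{t-1} + (1-beta)*value_t."""
    v, out = 0.0, []
    for t, value in enumerate(values, start=1):
        v = beta * v + (1 - beta) * value
        out.append(v / (1 - beta**t))   # bias-corrected estimate at step t
    return out
```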

Can use this for gradient descent with momentum.

Momentum grad descent: V_dW = beta*V_(t-1) + (1-beta)*dW, then W := W - alpha*V_dW. Physics analogy: beta = friction, beta*V(t-1) = velocity, the new gradient term = acceleration (weird analogy, but it sticks).
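One momentum update step as a sketch; the learning rate and beta are just typical defaults I picked:

```python
def momentum_step(W, b, dW, db, vW, vb, lr=0.01, beta=0.9):
    """Gradient descent with momentum: an EMA of the gradients drives the update."""
    vW = beta * vW + (1 - beta) * dW   # velocity, decayed by "friction" beta
    vb = beta * vb + (1 - beta) * db
    W = W - lr * vW
    b = b - lr * vb
    return W, b, vW, vb
```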

RMSprop: modified momentum-style update that keeps an EMA of the squared gradients, S_dW = beta*S_dW + (1-beta)*dW^2, and divides the step by sqrt(S_dW) (choose a higher beta, e.g. 0.999, because dW/db get squared).
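Sketch of one RMSprop step, with the usual epsilon added for numerical stability (defaults are mine):

```python
import numpy as np

def rmsprop_step(W, dW, sW, lr=0.001, beta2=0.999, eps=1e-8):
    """RMSprop: an EMA of the *squared* gradients rescales the step size."""
    sW = beta2 * sW + (1 - beta2) * dW**2
    W = W - lr * dW / (np.sqrt(sW) + eps)
    return W, sW
```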

Adam: combination of momentum grad descent and RMSprop (EMA of the gradients plus EMA of the squared gradients, with bias correction).
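Putting the two together, a sketch of one Adam step (t is the 1-based step count; hyperparameter defaults are the commonly cited ones):

```python
import numpy as np

def adam_step(W, dW, vW, sW, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam = momentum (first moment) + RMSprop (second moment), bias-corrected."""
    vW = beta1 * vW + (1 - beta1) * dW        # momentum term
    sW = beta2 * sW + (1 - beta2) * dW**2     # RMSprop term
    v_hat = vW / (1 - beta1**t)               # bias correction
    s_hat = sW / (1 - beta2**t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, vW, sW
```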

Learning rate decay: decay the learning rate over epochs. Several different methods: 1/(1 + decay_rate*epoch), exponential decay, and others.
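Two of those schedules as a sketch (names and default rates are my own):

```python
def decayed_lr(lr0, epoch, decay_rate=1.0, method="inverse"):
    """Common schedules: 1/(1 + decay_rate*epoch) and exponential decay."""
    if method == "inverse":
        return lr0 / (1 + decay_rate * epoch)
    return lr0 * (0.95 ** epoch)   # exponential decay
```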

Back to GPipe

F and B in the paper's figures are the forward and backward passes; took me until now to realize this.

GPipe Algorithm

Divide network into K cells

Place the k-th cell onto the k-th accelerator

The partitioning algorithm balances computation time across the K cells (minimizing the variance in their estimated compute costs), which keeps the pipeline efficient.

Forward: divides each mini-batch of N examples into M micro-batches, which are pipelined through the K accelerators.

Backward: calculate gradients for each micro-batch

End of mini-batch: accumulate the gradients from all M micro-batches and apply a single synchronous update of the model parameters across all accelerators.
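To make the accumulate-then-update part concrete, here is a toy single-process sketch of the micro-batch idea in plain NumPy. The "cells" are tiny tanh layers and loss_grad is a placeholder I made up; the real GPipe puts each cell on its own accelerator and pipelines the micro-batches, which this sketch does not attempt to model:

```python
import numpy as np

def gpipe_like_step(minibatch_X, minibatch_Y, cells, loss_grad, lr=0.01, M=4):
    """Toy, single-device sketch of micro-batching + gradient accumulation.

    cells: list of K dicts {"W": (out, in) matrix, "b": (out, 1) vector}.
    loss_grad(output, Y): returns dLoss/dOutput for one micro-batch.
    """
    micro_X = np.array_split(minibatch_X, M, axis=1)
    micro_Y = np.array_split(minibatch_Y, M, axis=1)
    grads = [{"W": np.zeros_like(c["W"]), "b": np.zeros_like(c["b"])} for c in cells]

    for X_m, Y_m in zip(micro_X, micro_Y):            # forward + backward per micro-batch
        acts = [X_m]
        for c in cells:                               # forward through the K cells
            acts.append(np.tanh(c["W"] @ acts[-1] + c["b"]))
        dA = loss_grad(acts[-1], Y_m)                 # gradient of the loss wrt the output
        for k in reversed(range(len(cells))):         # backward through the K cells
            dZ = dA * (1 - acts[k + 1] ** 2)          # tanh derivative
            grads[k]["W"] += dZ @ acts[k].T           # accumulate over micro-batches
            grads[k]["b"] += dZ.sum(axis=1, keepdims=True)
            dA = cells[k]["W"].T @ dZ

    for c, g in zip(cells, grads):                    # single synchronous update
        c["W"] -= lr * g["W"] / minibatch_X.shape[1]
        c["b"] -= lr * g["b"] / minibatch_X.shape[1]
```

The key point is that all M micro-batch gradients are summed before a single parameter update, so the result matches training on the whole mini-batch at once.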