Neural Machine Translation Paper

Translation Method

Jointly learning to align and translate? Hmm, no idea what this means; will find out I guess.

First:

LLM summary (copy/paste paper into LLM and ask for a summary of key points)

Then YouTube if possible (this was ok: YouTube Video)

Then read the paper

Explain it out loud as if I were teaching it to an imaginary person

LLM Summary Questions:

What does 'jointly tuned' mean here? Each component can be tuned on its own?

AI: You got this backwards; it actually means the whole model gets trained at once. Traditional statistical MT systems have sub-components that are often trained separately. The NMT encoder and decoder are tuned together, which lets the model learn more effective representations and align the components better, because the encoder's output is directly optimized to be useful for the decoder's task of generating translations (whoa!! ok so training components together aligns them better than training them separately, at least in this case...)
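To make "jointly tuned" concrete, here's a minimal sketch (mine, not from the paper) of an encoder-decoder trained end to end: one loss, one optimizer over both components, so gradients from the translation loss flow back into the encoder. Assumes PyTorch; the dimensions and variable names are toy placeholders, and this simplified version skips attention entirely.

```python
import itertools
import torch
import torch.nn as nn

vocab_src, vocab_tgt, emb, hid = 1000, 1000, 32, 64   # toy sizes

src_embed = nn.Embedding(vocab_src, emb)
encoder   = nn.GRU(emb, hid, batch_first=True)    # reads the source sentence
tgt_embed = nn.Embedding(vocab_tgt, emb)
decoder   = nn.GRU(emb, hid, batch_first=True)    # generates the translation
out_proj  = nn.Linear(hid, vocab_tgt)             # scores target-vocab words

# One optimizer over *all* parameters = "jointly tuned":
# the encoder is updated only through the decoder's translation loss.
params = itertools.chain(src_embed.parameters(), encoder.parameters(),
                         tgt_embed.parameters(), decoder.parameters(),
                         out_proj.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

src = torch.randint(0, vocab_src, (8, 10))        # toy batch: 8 source sentences
tgt = torch.randint(0, vocab_tgt, (8, 12))        # toy batch: 8 target sentences

_, h = encoder(src_embed(src))                    # encode source -> final hidden state
dec_out, _ = decoder(tgt_embed(tgt[:, :-1]), h)   # decode, conditioned on the encoder
logits = out_proj(dec_out)

loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_tgt),
                                   tgt[:, 1:].reshape(-1))
loss.backward()                                   # gradients flow decoder -> encoder
opt.step()
```

Contrast with the statistical-MT style pipeline mentioned above, where the pieces are trained separately and only combined afterwards.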

The traditional approach squashed the source into a fixed-length vector, which is obviously a bottleneck: eventually sentences get too long to store enough information in a vector of fixed size

AI: Correct

Interesting, no masking for future tokens in the attention mechanism? Allows for filling text in, is this correct?

What are source annotations? Are these just embeddings?

AI: No. Annotations are more context-rich than embeddings. They are embeddings + context

Process:

  1. Get word embeddings for the source sentence
  2. Run those embeddings through a bidirectional RNN to get hidden states for each word (a forward and a backward hidden state, concatenated); these are the "annotations" (see the sketch after this list)
  3. OH THIS IS WHAT THE MLPs DO IN THE TRANSFORMER!! Oooh ok ok ok ok ok. Embeddings -> hidden states -> do attention with hidden states -> next embeddings
  4. The Transformer repeats this over and over. Whoa, so this paper is basically one layer of the transformer's feed-forward being discovered!!!!
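Here's a rough numpy sketch of the process above: one decoder step of Bahdanau-style (additive) attention over the bidirectional annotations. The shapes are toy, the random matrices stand in for learned parameters, and the names (W_a, U_a, v_a, s_prev, alpha, context) only roughly follow the paper's notation; don't read it as the exact implementation.

```python
import numpy as np

T, n = 6, 8                     # source length, hidden size per direction
rng = np.random.default_rng(0)

h_fwd = rng.normal(size=(T, n)) # forward RNN hidden state for each source word
h_bwd = rng.normal(size=(T, n)) # backward RNN hidden state for each source word
annotations = np.concatenate([h_fwd, h_bwd], axis=1)   # "annotations": (T, 2n)

s_prev = rng.normal(size=(n,))  # previous decoder hidden state s_{i-1}

# Alignment model: a small feed-forward net scoring each annotation against s_{i-1}
W_a = rng.normal(size=(n, n))
U_a = rng.normal(size=(2 * n, n))
v_a = rng.normal(size=(n,))

e = np.tanh(s_prev @ W_a + annotations @ U_a) @ v_a    # alignment scores, shape (T,)
alpha = np.exp(e) / np.exp(e).sum()                    # attention weights (softmax)
context = alpha @ annotations                          # context vector c_i, shape (2n,)

# c_i then feeds the decoder RNN to produce s_i and the next target word.
print(alpha.round(3), context.shape)
```

Because alpha is recomputed for every target word, each decoder step gets its own context vector instead of one fixed-length summary of the whole source sentence.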

Goal of Paper:

Make a better translator lol (it worked: it beat the plain RNN Encoder-Decoder by a lot, especially on longer sentences)

Limitations:

Computationally expensive, does not handle rare/unknown words well

Paper Notes:

Basically the main change is that the input isn't smooshed into a single fixed-length vector anymore; instead the decoder gets to look back at (attend over) all of the annotations at every step

Translation is about finding the sentence in the new language which maximizes the conditional probability given the original sentence.
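Written out (roughly how the paper frames it, with x = source sentence, y = candidate translation):

```latex
\hat{y} = \arg\max_{y} \, p(y \mid x),
\qquad
p(y \mid x) = \prod_{t} p(y_t \mid y_1, \ldots, y_{t-1}, x)
```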