
Transformers

  • references:
  • it turns out that #attention is an older concept than transformers, itself motivated by the encoder-decoder sequence-to-sequence architecture
    • (oops) I didn't really understand this diagram:
      • sequence-to-sequence models are essentially the third diagram, where the input sequence and output sequence are asynchronous
      • versus the normal RNN, which takes its own output as the next input, so input and output stay in lockstep
      • I really like the example of 1-to-many, image-to-caption, so your input is not a sequence, but your output is
    • the classic seq2seq example is machine translation, where importantly the input and output sequences don't have to be the same length (languages differ in how they realise the same meaning)
    • the encoder is the RNN run over the input sequence, culminating in the last hidden state
      • this is essentially the context vector: the idea is that this single (fixed-length) vector captures all the information about the sentence
        • key: this acts as an informational bottleneck, and is actually the impetus for the attention mechanism
    • key point (via this tutorial):
      • instead of representing everything with just the last hidden state (fixed-length), why not look at all the hidden states (one vector per input word)?
      • but that would be variable-length, so instead take a linear combination of those hidden states; the combination weights come from a learned scoring function, and that is basically attention (see the numpy sketch after this list)
  • Transformers #todo
    • "prior art"
      • CNN: easy to parallelise, but not recurrent (can't capture sequential dependencies)
      • RNN: the reverse; captures sequential dependencies, but can't be parallelised across time steps
    • the goal of transformers/attention is to get both: parallelisation and the benefits of recurrence
      • by using "attention" to capture the sequential dependencies that recurrence would otherwise provide (?)
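To make the "linear combination of hidden states" idea above concrete, here is a minimal numpy sketch of dot-product attention over encoder hidden states. The names (`H`, `s`, `attention_context`) and the plain dot-product score are my own assumptions; Bahdanau-style attention instead scores each state with a small learned network, but the weighted-sum structure is the same.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(H, s):
    """H: (seq_len, d) encoder hidden states, one per input token.
       s: (d,) current decoder hidden state, acting as the query."""
    scores = H @ s              # (seq_len,) alignment score per input position
    weights = softmax(scores)   # attention weights: non-negative, sum to 1
    return weights @ H          # (d,) context vector: weighted sum of all states

# toy usage: 5 input tokens, hidden size 4
H = np.random.randn(5, 4)
s = np.random.randn(4)
context = attention_context(H, s)  # replaces the single fixed-length bottleneck
```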

Transformers

  • key is multi-head self-attention
    • encoded representation of input: key-value pairs ($K, V \in \mathbb{R}^{n}$)
      • corresponding to hidden states
    • the previous output is compressed into a query $Q \in \mathbb{R}^{m}$
    • the output of the transformer is a weighted sum of the values ($V$), with the weights given by how well each key matches the query (see the sketch below)
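A rough numpy sketch of that weighted-sum-of-values computation (scaled dot-product attention). The row-wise shapes and function names here are my assumptions; multi-head attention just runs several of these in parallel on learned projections of the input and concatenates the results.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
       Returns (m, d_v): each output row is a weighted sum of the value rows."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (m, n) query-key similarities
    weights = softmax(scores, axis=-1)  # each query's weights over the n values
    return weights @ V                  # weighted sum of values

# toy usage: 2 queries attending over 6 key-value pairs
Q = np.random.randn(2, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 8)
out = scaled_dot_product_attention(Q, K, V)  # shape (2, 8)
```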

Todo

  • Visualizing and Measuring the Geometry of BERT (arXiv)
  • random blog
  • pretty intuitive description of transformers on tumblr, via the LessWrong community