Vision Transformers

Thanks to their computational efficiency (and hence scalability), self-attention-based architectures (Transformers) have become the model of choice in NLP. Their success has led people to wonder whether such architectures are universal1 or, at the very least, whether they can be applied to computer vision tasks.

Self-attention strikes a good balance between context awareness and computational efficiency: RNNs/LSTMs can also capture long-range dependencies, but their sequential nature makes training hard to parallelize, which in turn limits how large the models can get.

Compare this to convolution layers, which are highly specialized for computer vision tasks.
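
To make the comparison concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain NumPy (the function name and the toy dimensions are illustrative, not from any particular library): every token attends to every other token in one parallel step, so there is no sequential bottleneck, at the cost of a score matrix that grows quadratically with sequence length.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.
    x: (n, d) sequence of n tokens; w_q/w_k/w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # all pairwise similarities at once
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # (n, d): each token mixes in all the others

# toy example: 4 tokens of width 8 (dimensions chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (4, 8)
```

Unlike an RNN, nothing here requires processing token i before token i+1, which is what makes large-scale training practical.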

Stand-Alone Self-Attention in Vision Models

An Image Is Worth 16x16 Words

src: (Dosovitskiy et al., 2021)2
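
The title is the core idea: the image is cut into 16x16 patches, each patch is flattened and linearly projected, and the resulting sequence of patch embeddings is treated as "words" for a standard Transformer encoder. Below is a rough NumPy sketch of the patch-to-token step; the function name is hypothetical and a random matrix stands in for the learned projection (768 is the ViT-Base hidden size).

```python
import numpy as np

def image_to_patch_tokens(img, patch=16, d_model=768):
    """Cut an (H, W, C) image into non-overlapping patch x patch squares,
    flatten each square, and project it to a d_model-dimensional token."""
    h, w, c = img.shape
    patches = (img.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)           # group the two patch-grid axes
                  .reshape(-1, patch * patch * c))    # (num_patches, 16*16*C)
    rng = np.random.default_rng(0)
    w_embed = rng.normal(size=(patch * patch * c, d_model)) * 0.02  # stand-in for the learned projection
    return patches @ w_embed                          # (num_patches, d_model) "words"

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)   # (196, 768): a 224x224 image becomes 196 sixteen-by-sixteen "words"
```

In the actual model, a learnable class token is prepended and position embeddings are added before the sequence enters the Transformer encoder; the classification head reads off the class token's final representation.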

Summary


  1. For instance, it's been speculated that the self-attention mechanism is comparable to the way attention manifests in our brains – such a thesis would imply the universality of this module.
  2. Dosovitskiy, A. et al., 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
