Vision Transformers
Thanks to their computational efficiency (and hence scalability), self-attention-based architectures (Transformers) have become the model of choice in NLP. Their success has led people to wonder whether such architectures are universal, or, at the very least, whether they can be applied to computer vision tasks.
Self-attention offers a good tradeoff between context awareness and computational efficiency: RNNs/LSTMs allow for long-range dependencies, but their sequential training severely limits model size. Compare this to convolution layers, which are highly specialized for computer vision tasks but only see a local neighborhood of the input at each layer.
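To make the mechanism concrete, here is a minimal sketch of scaled dot-product self-attention, assuming PyTorch. The function name and tensor shapes are illustrative; a real layer would also apply learned query/key/value projections, which are omitted here for clarity.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (seq_len, d_model). Every token attends to every other token."""
    d = x.shape[-1]
    # Use the input directly as queries, keys, and values
    # (a real layer applies learned projections W_q, W_k, W_v first).
    scores = x @ x.transpose(-2, -1) / d ** 0.5  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)          # attention distribution per token
    return weights @ x                           # weighted sum of values

x = torch.randn(10, 64)  # 10 tokens, 64-dim embeddings
out = self_attention(x)  # (10, 64)
```

Unlike an RNN, all positions are processed in parallel, and any pair of positions interacts in a single step: this is the source of both the long-range dependencies and the training efficiency noted above.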
Stand-Alone Self-Attention in Vision Models
An Image Is Worth 16x16 Words
src: (Dosovitskiy et al., 2021)
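The title refers to how ViT turns an image into a sequence: the image is split into fixed-size patches (e.g., 16x16), and each patch is flattened and linearly projected into a token embedding. Below is a minimal sketch of this patch embedding, assuming PyTorch; the hyperparameters match the ViT-Base configuration but are shown only for illustration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Splits an image into fixed-size patches and linearly embeds each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768): a "sentence" of patch tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting sequence of patch tokens (plus position embeddings and a class token in the full model) is fed to a standard Transformer encoder, so a 224x224 image becomes a 196-"word" sentence.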
Summary