Vision Transformers

Thanks to their computational efficiency (and hence scalability), self-attention-based architectures (Transformers) have become the model of choice in NLP. Their success has led people to wonder whether such architectures are universal1 or, at the very least, whether they can be applied to computer vision tasks.

Self-attention strikes a good balance between context awareness and computational efficiency: RNNs/LSTMs can also capture long-range dependencies, but their sequential nature makes training hard to parallelize, which in turn limits how large the models can get.

Compare this to convolution layers, which are highly specialized for computer vision tasks.
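
To make the comparison concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain NumPy (the function name and the toy dimensions are illustrative, not from any particular library): every token attends to every other token in one parallel step, so there is no sequential bottleneck, at the cost of a score matrix that grows quadratically with sequence length.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.
    x: (n, d) sequence of n tokens; w_q/w_k/w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # all pairwise similarities at once
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # (n, d): each token mixes in all the others

# toy example: 4 tokens of width 8 (dimensions chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (4, 8)
```

Unlike an RNN, nothing here requires processing token i before token i+1, which is what makes large-scale training practical.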

Stand-Alone Self-Attention in Vision Models

An Image Is Worth 16x16 Words

src: (Dosovitskiy et al., 2021)2
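
The title is the core idea: the image is cut into 16x16 patches, each patch is flattened and linearly projected, and the resulting sequence of patch embeddings is treated as "words" for a standard Transformer encoder. Below is a rough NumPy sketch of the patch-to-token step; the function name is hypothetical and a random matrix stands in for the learned projection (768 is the ViT-Base hidden size).

```python
import numpy as np

def image_to_patch_tokens(img, patch=16, d_model=768):
    """Cut an (H, W, C) image into non-overlapping patch x patch squares,
    flatten each square, and project it to a d_model-dimensional token."""
    h, w, c = img.shape
    patches = (img.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)           # group the two patch-grid axes
                  .reshape(-1, patch * patch * c))    # (num_patches, 16*16*C)
    rng = np.random.default_rng(0)
    w_embed = rng.normal(size=(patch * patch * c, d_model)) * 0.02  # stand-in for the learned projection
    return patches @ w_embed                          # (num_patches, d_model) "words"

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)   # (196, 768): a 224x224 image becomes 196 sixteen-by-sixteen "words"
```

In the actual model, a learnable class token is prepended and position embeddings are added before the sequence enters the Transformer encoder; the classification head reads off the class token's final representation.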

Summary


  1. For instance, it's been speculated that the self-attention mechanism is comparable to the way attention manifests in our brains – such a thesis would imply the universality of this module.
  2. Dosovitskiy, A. et al., 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
