There's something that tickles me about this paper's title. The thought that everyone should know these three things. The idea of going to my neighbor who's a retired K-12 teacher and telling her about how adding MLP-based patch pre-processing layers improves BERT-like self-supervised training based on patch masking.
Clickbait titles are something of a tradition in this field by now. Some important paper titles include "One weird trick for parallelizing convolutional neural networks", "Attention is all you need", and "A picture is worth 16x16 words". Personally I still find it kind of irritating, but to each their own I guess.
Only the first one is clickbait in the style of blogs that incentivize you to click on the headline (i.e. the information gap), the last two are just fun puns.
Honestly I took the first one as making fun of that trope. Usually the “one weird trick to” ends in some tabloid-style thing like losing 15 pounds or finding out if your husband is loyal. So “parallelizing CNNs” is the joke, as if that's something you'd see in a checkout aisle.
In what sense is "Attention is all you need" a pun?
It's a reference to the lyric "love is all you need" from the song "All You Need Is Love" by the Beatles, and it uses a faux-synonym with a different meaning.
"Attention is all you need" is an outlier. They backed up their bold claim with breakthrough results.
For modest incremental improvements, I greatly prefer boring technical titles. Not everything needs to be a stochastic parrot. We see this dynamic with building luxury condos: on any individual project, making that choice helps juice profits, but when the whole city follows the same playbook, it leads to a less desirable outcome.
Hey, when the AI-powered T-rex is chasing you down, you'll wish you'd paid attention to the fact that the vision transformer's perception is based on movement!
Had to throw some Jurassic Park humor in here.
Yeah, I guess today was the day that I learned I am not part of "everyone". I feel so left out now.
I put this paper into 4o to check whether it is relevant. So that you don't have to do the same, here are the bullet points:
- Vision Transformers can be parallelized to reduce latency and improve optimization without sacrificing accuracy.
- Fine-tuning only the attention layers is often sufficient for adapting ViTs to new tasks or resolutions, saving compute and memory (a rough sketch of what this means in practice follows the list).
- Using MLP-based patch preprocessing improves performance in masked self-supervised learning by preserving patch independence.
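For anyone curious what the second bullet looks like in practice, here's a minimal sketch, assuming PyTorch + timm (where each transformer block exposes an `attn` submodule). It just freezes everything except the attention layers and the classifier head; it's an illustration of the idea, not the paper's exact recipe.

```python
# Rough sketch of "fine-tune only the attention layers" for a ViT.
# Assumes PyTorch + timm, where each block's attention submodule is named `attn`;
# parameter names may differ in other ViT implementations.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True)

# Freeze every parameter first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the attention submodules, plus the classification head
# so the model can adapt to the new task's label space.
for name, p in model.named_parameters():
    if ".attn." in name or name.startswith("head."):
        p.requires_grad = True

# Only the unfrozen parameters go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```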
just read the abstract
You would think. I don't know about this paper in particular, but I'm continually surprised by how much more I get out of LLM summaries of papers than out of the abstracts written by the authors themselves.
Paper abstracts are not optimized for drive-by readers like you and me. They are optimized for active researchers in the field reading their daily arXiv digest that lists all the new papers across the categories they work in, who need to make the read/don't-read decision for each entry as efficiently as possible.
If you’ve already decided you’re interested in the paper, then the Introduction and/or Conclusion sections are what you’re looking for.
This would be an interesting metric to track: how different an LLM-generated abstract (given the paper as source) is from the actual abstract, and whether that difference has any correlation with the overall quality of the paper.