Speeding up Attention Layers
Multi-Head, Multi-Query & Grouped-Query Attention layers clearly explained, and how the KV cache works in attention layers.
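As a rough sketch of what these attention variants look like in practice, here is a minimal NumPy example of grouped-query attention with a per-step KV cache. All names, shapes, and hyperparameters below are illustrative assumptions, not code from the post; with `n_kv_heads == n_heads` it reduces to multi-head attention, and with `n_kv_heads == 1` to multi-query attention.

```python
# Minimal, illustrative sketch (assumed names/shapes, not the post's code):
# grouped-query attention with a KV cache used during autoregressive decoding.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_decode_step(x, cache, Wq, Wk, Wv, n_heads, n_kv_heads):
    """One decoding step: project the new token, append its K/V to the cache,
    and let every query head attend over all cached positions.

    x:      (d_model,)           embedding of the newly generated token
    cache:  dict with "k", "v":  arrays of shape (n_kv_heads, t, d_head)
    Wq:     (d_model, n_heads    * d_head)
    Wk/Wv:  (d_model, n_kv_heads * d_head)
    """
    d_head = Wq.shape[1] // n_heads
    group = n_heads // n_kv_heads                      # query heads sharing one KV head

    q = (x @ Wq).reshape(n_heads, 1, d_head)           # (n_heads, 1, d_head)
    k_new = (x @ Wk).reshape(n_kv_heads, 1, d_head)    # only n_kv_heads K/V heads
    v_new = (x @ Wv).reshape(n_kv_heads, 1, d_head)

    # KV cache: keys/values of past tokens are stored once and reused,
    # so each step only projects the single new token.
    cache["k"] = np.concatenate([cache["k"], k_new], axis=1)
    cache["v"] = np.concatenate([cache["v"], v_new], axis=1)

    outputs = []
    for h in range(n_heads):
        kv = h // group                                        # shared KV head index
        scores = q[h] @ cache["k"][kv].T / np.sqrt(d_head)     # (1, t)
        outputs.append(softmax(scores) @ cache["v"][kv])       # (1, d_head)
    return np.concatenate(outputs, axis=-1)                    # (1, n_heads * d_head)

# Usage: 8 query heads sharing 2 KV heads, so the cache is 4x smaller than with MHA.
d_model, n_heads, n_kv_heads, d_head = 64, 8, 2, 8
rng = np.random.default_rng(0)
Wq = rng.standard_normal((d_model, n_heads * d_head))
Wk = rng.standard_normal((d_model, n_kv_heads * d_head))
Wv = rng.standard_normal((d_model, n_kv_heads * d_head))
cache = {"k": np.zeros((n_kv_heads, 0, d_head)), "v": np.zeros((n_kv_heads, 0, d_head))}
for _ in range(5):                                     # decode 5 tokens
    out = gqa_decode_step(rng.standard_normal(d_model), cache, Wq, Wk, Wv, n_heads, n_kv_heads)
print(out.shape, cache["k"].shape)                     # (1, 64) (2, 5, 8)
```

The point of sharing key/value heads is that the cache shrinks by a factor of `n_heads / n_kv_heads`, which is what reduces memory traffic and speeds up decoding.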
Posts tagged with #transformer
Unsupervised visual feature learning using knowledge distillation and transformers
Google shows how treating image patches as tokens can revolutionize computer vision
Knowledge distillation compresses BERT: smaller, faster, and retaining almost all of its performance
Unlocking the true potential of BERT through rigorous optimization and strategic training choices
Pre-training bidirectional representations by jointly conditioning on both left and right context