Speeding up Attention Layers
September 11, 2024
Multi-Head, Multi-Query & Grouped-Query Attention layers clearly explained, and how the KV cache works in attention layers