Speeding up Attention Layers
September 11, 2024 • 7 min read
Multi-Head, Multi-Query & Grouped-Query Attention layers clearly explained, and how the KV cache works in attention layers