
LoRA - Low-Rank Adaptation of Large Language Models

Mario Parreño · #paper #nlp #lora #low-rank #transformers #fine-tuning

Do we need to fine-tune all the parameters? How expressive should the matrix updates be? As we fine-tune larger models, retraining all the parameters becomes less feasible. Low-Rank Adaptation (LoRA) freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the architecture. LoRA reduces the number of trainable parameters while performing on par with full fine-tuning.

LoRA

The principal idea applies to any dense layer in a deep learning model.

Some works show that pre-trained models have a low intrinsic dimension and can still learn efficiently despite a random projection to a smaller subspace. LoRA hypothesizes that the weight updates also have a low intrinsic rank during adaptation.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA constrains its update by representing it with a low-rank decomposition, i.e., $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

During training $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ contain the trainable parameters.

Below is an example of how the matrix decomposition works when the matrices have rank 1.

Matrix decomposition example with rank 1.
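To make the figure concrete, here is a minimal numerical sketch in PyTorch (the dimensions are illustrative, not from the paper):

```python
import torch

d, k = 4, 6                               # illustrative dimensions
B = torch.randn(d, 1)                     # column factor, d x 1
A = torch.randn(1, k)                     # row factor, 1 x k

delta_W = B @ A                           # d x k update built from only d + k numbers
print(delta_W.shape)                      # torch.Size([4, 6])
print(torch.linalg.matrix_rank(delta_W))  # tensor(1): the update has rank 1

# A full d x k update would need d * k = 24 parameters;
# the rank-1 factors need only d + k = 10.
```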

If we extend it to a general rank $r$:

Extending LoRA to a general rank r.

Note that, as we increase the rank $r$ and apply LoRA to more weight matrices, LoRA becomes more and more similar to full fine-tuning.

LoRA exploration space. The x-axis represents the rank r and the y-axis the number of weight matrices adapted.
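As a rough illustration of this trade-off, the number of trainable parameters grows linearly with the rank and with the number of adapted matrices. The sketch below uses a hypothetical dimension of 4096, not a value from the paper:

```python
def lora_trainable_params(d: int, k: int, r: int, n_matrices: int) -> int:
    """Parameters contributed by the LoRA factors B (d x r) and A (r x k)."""
    return n_matrices * r * (d + k)

d = k = 4096                      # hypothetical d_model, not a value from the paper
full = d * k                      # parameters of one full weight matrix
for r in (1, 8, 64, 512):
    lora = lora_trainable_params(d, k, r, n_matrices=1)
    print(f"r={r:4d}: {lora:,} trainable vs {full:,} full ({100 * lora / full:.2f}%)")

# As r approaches min(d, k), BA can represent any update and LoRA
# converges toward full fine-tuning.
```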

Training vs. Inference

LoRA uses a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W = BA$ is zero at the beginning of training. We then scale $\Delta W x$ by $\frac{\alpha}{r}$, where $\alpha$ is a constant in $r$. When optimizing with Adam, tuning $\alpha$ is roughly the same as tuning the learning rate if we scale the initialization appropriately; typically, $\alpha$ is set equal to $r$. A higher $\alpha$ amplifies the low-rank update, which can speed up adaptation but increases the risk of instability. Conversely, a lower $\alpha$ dampens the update, which can lead to more stable but potentially slower learning.
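The initialization and scaling above can be captured in a small PyTorch module. This is a minimal sketch under the definitions in this section; the class name `LoRALinear` and the exact init scale are my own choices, not from the paper or any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # W_0 is frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d, k = base.out_features, base.in_features
        # Random Gaussian init for A, zeros for B, so BA = 0 at the start
        # (the 0.01 std is an arbitrary illustrative choice).
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scaling = alpha / r                        # the alpha / r factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W_0 x plus the scaled low-rank update, summed coordinate-wise.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Wrapping an existing `nn.Linear` this way, e.g. `LoRALinear(nn.Linear(1024, 1024), r=8, alpha=8)`, leaves the pretrained weights untouched; only `A` and `B` receive gradients.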

At training time $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise.

LoRA training time. The input is multiplied by the pretrained frozen weights $W_0$ and by the LoRA modules B and A, and the outputs are summed coordinate-wise.

During inference, when initializing the model, we can merge the two matrices into a single matrix $W = W_0 + BA$ and apply the same input to $W$.

LoRA test time. The model is initialized by merging the pretrained frozen weights and LoRA modules B and A into single matrices. At inference time, the input is multiplied by the merged matrix.
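Continuing the hypothetical `LoRALinear` sketch from above, the merge for inference could look like this:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> nn.Linear:
    """Fold the low-rank update into the base weight: W = W_0 + (alpha / r) * B A."""
    merged = nn.Linear(layer.base.in_features, layer.base.out_features,
                       bias=layer.base.bias is not None)
    merged.weight.copy_(layer.base.weight + layer.scaling * (layer.B @ layer.A))
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged
```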

Applying LoRA to Transformers

In the Transformer architecture there are four weight matrices in the self-attention module ($W_q$, $W_k$, $W_v$, $W_o$) and two in the MLP module. The authors limit their study to adapting only the attention weights for downstream tasks and freeze the MLP modules.
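In practice this is commonly done with the Hugging Face `peft` library. The sketch below is one possible setup, not the paper's code; the model checkpoint and the `q_proj`/`v_proj` module names are assumptions that depend on the architecture (here, LLaMA-style naming):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base checkpoint; any causal LM with q_proj/v_proj modules works similarly.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the decomposition
    lora_alpha=8,                         # the alpha scaling constant
    target_modules=["q_proj", "v_proj"],  # adapt only W_q and W_v, leave the MLP frozen
    lora_dropout=0.0,
    bias="none",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()        # only the injected A and B matrices train
```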

The most significant benefit comes from the reduction in memory and storage usage. For a large Transformer trained with Adam, VRAM usage is reduced by up to 2/3 if $r \ll d_{model}$, since we do not need to store the optimizer states for the frozen parameters. This allows training with significantly fewer GPUs and gives around a 25% speedup during training compared to full fine-tuning, as we do not need to calculate gradients for the vast majority of the parameters.

Benefits

  • Computing Efficiency: LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers, since LoRA does not need to calculate the gradients or maintain the optimizer states for most parameters. Instead, LoRA only optimizes the injected, much smaller low-rank matrices.
  • No Additional Latency: When deployed, the trainable matrices can be merged with the frozen weights, $W = W_0 + BA$, introducing no inference latency.
  • Task Switching: A pre-trained model can be shared and used to build many small LoRA modules for different tasks. The shared model can be frozen and tasks switched efficiently by replacing the LoRA modules. Note that both $W_0$ and $BA$ are in $\mathbb{R}^{d \times k}$, so when we need to switch to another downstream task, we can recover $W_0$ by subtracting $BA$ and adding a different $B'A'$, a quick operation with very little memory overhead (see the sketch after this list).
  • Storing Efficiency: One of the drawbacks of full fine-tuning is that for each downstream task we learn a different set of parameters. If the model is large, storing and deploying many independent instances of fine-tuned models can be challenging, if feasible at all. For example, for a 350GB base model, storing 100 fully fine-tuned models would require 35TB of storage. With a sufficiently small LoRA, the checkpoint size is reduced roughly $10{,}000\times$, ending up with 350GB + 35MB × 100 models ≈ 354GB.
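A minimal sketch of the task-switching point above, reusing the hypothetical names from the earlier snippets (`scaling` is the $\frac{\alpha}{r}$ factor):

```python
import torch

@torch.no_grad()
def switch_task(W: torch.Tensor,
                old_B: torch.Tensor, old_A: torch.Tensor,
                new_B: torch.Tensor, new_A: torch.Tensor,
                scaling: float) -> torch.Tensor:
    """Swap LoRA adapters on an already-merged weight matrix."""
    W_0 = W - scaling * (old_B @ old_A)     # recover the pretrained W_0
    return W_0 + scaling * (new_B @ new_A)  # apply the new task's update
```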

Tips

- Which weight matrices in the Transformer should we apply LoRA to?

  • The authors find that it is preferable to adapt more weight matrices with a smaller rank than to adapt a single type of weight matrix with a larger rank.

- How to choose the rank $r$?

  • LoRA performs competitively with a very small $r$, suggesting the update matrix $\Delta W$ could have a very small intrinsic rank.

- What to do if LoRA underperforms?

  • If LoRA underperforms, adapt more parameters and/or increase the rank.

Credits

Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685 (2021).
