Vishal Pandey | Applied ML Research

LoRA and QLoRA: Efficient Fine-Tuning for Large Language Models

Training GPT-style models doesn’t just break the bank; it breaks GPUs. With each new 10× scale-up in parameters, the compute bill balloons, and outright re-training or full fine-tuning becomes prohibitive. Yet practitioners still need to adapt these giant models to niche domains (medical text, legal briefs, product catalogs) without blowing their budgets. Enter LoRA (Low-Rank Adaptation) and its quantized sibling QLoRA, a pair of techniques that let you squeeze top-notch performance out of today’s LLMs by tweaking only tiny, low-rank subspaces of their massive weight matrices. In this post, we’ll see why LoRA works so well, how QLoRA extends it with 4-bit quantization, and when you should choose one over the other.


1. What is Fine-Tuning?

Standard fine-tuning takes a pre-trained model and updates all of its parameters on a task-specific dataset.

2. The Problem with Fine-Tuning

Updating every parameter of a multi-billion-parameter model means holding the full weights, gradients, and optimizer states in GPU memory, and it leaves you with a complete copy of the model for every downstream task. For today's LLMs, that quickly becomes prohibitive in both compute and memory.

3. The Emergence of Low-Rank Methods

Low-rank approximation in linear algebra: any matrix $M \in \mathbb{R}^{d \times k}$ can be approximated by the product of two smaller matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ when $r \ll \min(d, k)$:

$$M \approx BA, \qquad \operatorname{rank}(BA) \le r.$$
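To make the factorization concrete, here is a small NumPy sketch (dimensions chosen arbitrarily) that builds the best rank-$r$ approximation of a matrix via truncated SVD and compares parameter counts:

```python
import numpy as np

d, k, r = 1024, 1024, 8
M = np.random.randn(d, k)

# Truncated SVD gives the best rank-r approximation of M.
# (A random matrix compresses poorly; real fine-tuning updates are far
# closer to low-rank, which is exactly LoRA's bet.)
U, S, Vt = np.linalg.svd(M, full_matrices=False)
B = U[:, :r] * S[:r]     # d x r factor
A = Vt[:r, :]            # r x k factor
M_approx = B @ A         # rank <= r

print("full parameters:    ", d * k)          # 1,048,576
print("low-rank parameters:", d * r + r * k)  # 16,384
```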

Key idea: Many weight changes during fine-tuning lie in a low-dimensional subspace.

Before LoRA, adapters (e.g., Houlsby et al., 2019) inserted small bottleneck "adapter" layers. LoRA extends this principle by directly factorizing the update to existing weights, rather than injecting new modules.


Low-Rank Adaptation (LoRA)

Microsoft - AI

Paper Link: Low-Rank Adaptation (LoRA)


1. LoRA’s Intuition

Keep the pre-trained weights frozen and learn only a low-rank update on top of them:

$$W = W_0 + \Delta W, \qquad \Delta W = BA, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k}.$$

2. LoRA Architecture (Module-Level View)

2.1 Transformer Block Overview

A transformer block consists of:

- a multi-head self-attention sub-layer with query, key, value, and output projections ($W_q$, $W_k$, $W_v$, $W_o$), and
- a position-wise feed-forward (MLP) sub-layer,

each wrapped in a residual connection and layer normalization.

2.2 LoRA Modification: Only Certain Weights Get Low-Rank Updates

LoRA injects low-rank adapters into selected linear layers, typically the attention query and value projections $W_q$ and $W_v$.

Instead of updating $W_q$ directly, we freeze it and add a learnable low-rank update:

$$W_q' = W_q + \Delta W_q = W_q + B_q A_q.$$

Similarly for $W_v$.

2.3 Module Architecture

(Figure: LoRA module architecture)

Mathematically:

$$\text{output} = W_q x + B_q (A_q x)$$

Here, only $A_q$ and $B_q$ are updated during training; $W_q$ stays frozen.
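To make this concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer; the class name, default rank, and scaling are illustrative choices rather than the reference implementation. Following the paper, $A$ starts as small random noise and $B$ starts at zero, so training begins from the unmodified pre-trained behavior.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W (and its bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k down-projection
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r up-projection, starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scaling * B(A x); the full d x k update is never materialized
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Example (attribute name is hypothetical): wrap an attention projection
# attn.q_proj = LoRALinear(attn.q_proj, r=8)
```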

3. Mathematical Formulation (LoRA)

Consider a single weight matrix $W_0 \in \mathbb{R}^{d \times k}$ in the transformer (e.g., a projection in self-attention or the MLP).

$$W = W_0 + \Delta W, \qquad \Delta W = BA, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k).$$

With LoRA:

$$y = W_0 x + (BA)x = W_0 x + B(Ax).$$

Compute $Ax$ first (a vector of size $r$), then $B(Ax)$ (size $d$); the full $\Delta W$ is never formed.

The number of trainable parameters per adapted matrix drops from $dk$ to $dr + rk$:

$$\frac{dr + rk}{dk} = \frac{r}{k} + \frac{r}{d} \ll 1.$$
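For instance, with $d = k = 4096$ and $r = 8$, LoRA trains $dr + rk = 2 \cdot 4096 \cdot 8 = 65{,}536$ parameters for that matrix instead of $4096 \times 4096 \approx 16.8$M, i.e., roughly 0.4% of the full update.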

4. Benefits

- Only a tiny fraction of the parameters is trained, so gradient and optimizer-state memory shrink dramatically.
- The adapters can be merged back into $W_0$ after training, adding no inference latency.
- One frozen base model can serve many tasks, each with its own small adapter.
- Downstream performance often matches (or exceeds) full fine-tuning.

5. Why Does LoRA Work?

From the LoRA paper (Section 4.1):

[Section - 4.1] When adapting to a specific task, it shows that the pre-trained language models have a low "intrinsic dimension" and can still learn efficiently despite a random projection to a smaller subspace. Inspired by this, we hypothesize the updates to the weights also have a low "intrinsic rank" during adaptation.

This basically means that the weight matrix $W$ of a pre-trained model contains many parameters that convey the same information as others (they can be obtained as combinations of the other weights), so we can drop them without hurting the model's performance. Such a matrix is called rank-deficient (it does not have full rank).


Quantized Low-Rank Adaptation (QLoRA)

University of Washington

Paper Link: Quantized Low-Rank Adaptation (QLoRA)


'Fine-tune models with billions of parameters on a single GPU using quantization + LoRA'

1. Motivation

LoRA made fine-tuning efficient by reducing the number of trainable parameters. However, the frozen base model's weights are still stored in 16- or 32-bit precision (BF16 or FP32), which consumes a lot of GPU memory.

For example, LLaMA-65B stored in 16-bit precision needs 130+ GB of VRAM for the weights alone.

Even if only LoRA weights are trained, the frozen base model still consumes a huge memory footprint. Enter QLoRA, which quantizes the frozen weights (e.g., to 4-bit) and combines this with LoRA for maximum memory efficiency.
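As a back-of-the-envelope check: 65B parameters × 2 bytes ≈ 130 GB in BF16, versus 65B × 0.5 bytes ≈ 32.5 GB once the base is stored in 4-bit, before counting quantization constants, activations, and optimizer state.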

2. Core Idea

QLoRA = Quantized base model (4-bit) + Low-Rank Adapters (LoRA layers)

Key innovations:

- 4-bit NormalFloat (NF4) quantization of the frozen base weights
- Double quantization of the quantization constants themselves
- Paged optimizers (PagedAdamW) to absorb memory spikes during training

3. Architecture Overview

The base model is loaded with its weights quantized to 4-bit and frozen. LoRA adapters in 16-bit precision are attached to selected layers (e.g., $W_q$ and $W_v$), and only the adapters receive gradients. During the forward pass, the 4-bit blocks are dequantized on the fly to the compute dtype (typically BF16).

4. Mathematical View

4-Bit Quantization of Frozen Base

Given a pre-trained weight matrix $W \in \mathbb{R}^{d \times k}$, apply:

4.1 NF4 Quantization:

A non-uniform 4-bit quantization scheme whose 16 levels are placed at the quantiles of a normal distribution, matching the roughly normal distribution of pre-trained weights:

$$Q(W) = \operatorname{quantize}_{\mathrm{NF4}}(W)$$

Notes:

- Quantization is applied block-wise (e.g., blocks of 64 values), with one absmax scaling factor stored per block (group-wise NF4).
- The scaling factors are themselves quantized again ("double quantization") to save additional memory.
- Only the storage format changes; the quantized base weights remain frozen.
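To illustrate the group-wise scaling idea, here is a simplified NumPy sketch of block-wise absmax quantization to a small fixed codebook; the real NF4 data type places its 16 levels at normal-distribution quantiles, whereas the codebook below is uniform purely for illustration.

```python
import numpy as np

# 16-level (4-bit) codebook; NF4 would place these at normal-distribution quantiles.
codebook = np.linspace(-1.0, 1.0, 16)

def quantize_blockwise(w: np.ndarray, block_size: int = 64):
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one absmax scale per block
    normalized = blocks / scales                        # values now in [-1, 1]
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), scales                 # 4-bit codes + float scales

def dequantize_blockwise(idx: np.ndarray, scales: np.ndarray, shape):
    return (codebook[idx] * scales).reshape(shape)

W = np.random.randn(128, 64).astype(np.float32)
idx, scales = quantize_blockwise(W)
W_hat = dequantize_blockwise(idx, scales, W.shape)
print("max abs reconstruction error:", np.abs(W - W_hat).max())
```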

4.2 During Forward Pass:

$$y = Q(W)\,x + B(Ax)$$

Note: $Q(W)$ is stored in 4-bit but dequantized block-by-block to the 16-bit compute dtype during the forward pass; gradients flow only into the LoRA matrices $A$ and $B$.
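As a rough sketch of how this looks in practice with the Hugging Face transformers, peft, and bitsandbytes libraries (the model ID and hyperparameters below are placeholders, and argument names can shift across library versions), a base model can be loaded in 4-bit NF4 with LoRA adapters attached like this:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM works

# 4-bit NF4 storage with double quantization; compute happens in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention query/value projections.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the adapters are trainable
```

From here, training proceeds with a standard trainer; only the adapter parameters appear in the optimizer.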

5. Architecture

(Figure: QLoRA architecture)


Training and Inference: LoRA vs QLoRA

| Aspect | LoRA | QLoRA |
| --- | --- | --- |
| Base model weights | Stored in full precision (FP32 / BF16) | Quantized to 4-bit (NF4 or FP4) |
| Trainable parameters | Only the low-rank matrices $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ | Same as LoRA: only the adapters $A$, $B$ are trained |
| Frozen weights format | FP16 / FP32 | 4-bit quantized with separate group-wise scaling factors (NF4) |
| Backward pass through base | No gradients through base weights | No gradients through base weights (also quantized, so not differentiable) |
| Memory usage (GPU RAM) | Moderate (full-precision base model) | Very low (4-bit base + LoRA weights in FP16) |
| Precision of LoRA adapters | Typically FP16 or BF16 | Same: FP16 or BF16 |
| Optimizer used | Adam / AdamW | PagedAdamW: a memory-paged optimizer that spills optimizer state between GPU and CPU |
| Compute requirement | Moderate to high (if the base model is large) | Very low: QLoRA can fine-tune a 65B model on a single 48 GB GPU |
| Training speed | Faster than full fine-tuning | Slightly slower than LoRA due to dequantization overhead in the forward pass |
| Forward computation | $W_0 x + B(Ax)$ with $W_0$ in FP16 | $Q(W)x + B(Ax)$ with $Q(W)$ dequantized from 4-bit on the fly |
| Inference model size | Base model + LoRA adapters (~1% extra) | Quantized base model + small LoRA adapters (~0.5% extra) |
| Merging for deployment | Can merge $W = W_0 + BA$ into a single matrix | More complex: requires dequantizing, adding, then re-quantizing to 4-bit |
| Multi-task adaptability | Easy: one small adapter per task | Same: multiple adapters, all sharing the same 4-bit base |
| Downstream performance | Often matches or exceeds full fine-tuning | Matches LoRA and full fine-tuning on most benchmark tasks |
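To illustrate the merging row with the pure-PyTorch LoRALinear sketch from earlier in this post (not a library API), folding the adapter back into the frozen weight is a single in-place update:

```python
import torch

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> torch.nn.Linear:
    """Fold the low-rank update into the frozen weight: W <- W + scaling * (B A)."""
    layer.base.weight += layer.scaling * (layer.B @ layer.A)
    return layer.base   # a plain nn.Linear again, so inference adds no extra latency
```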

🙏 Thanks for Reading

Thanks for reading this deep dive into LoRA and QLoRA. I hope this gave you a clear mathematical and architectural understanding of how modern parameter-efficient fine-tuning works, and how it’s enabling powerful large language models to be adapted with minimal compute. Whether you're experimenting on a personal GPU or scaling multi-task deployments, these techniques are paving the way for more democratized AI.