Experimental setup. We consider three architectures (encoder, encoder-decoder, and decoder only) and compare QLoRA with 16-bit adapter-finetuning and with full-finetuning for models up to 3B. Our evaluations include GLUE [58] with RoBERTa-large [38], Super-NaturalInstructions (TKInstruct) [61] with T5 [49], and 5-shot MMLU [24] after finetuning LLaMA on Flan v2 [39] and Alpaca [55]. To additionally study the advantages of NF4 over other 4-bit data types, we use the setup of Dettmers and Zettlemoyer [13] and measure post-quantization zero-shot accuracy and perplexity across different models (OPT [72], LLaMA [57], BLOOM [52], Pythia [7]) for model sizes 125M to 13B. We provide more details in the results section for each particular setup to make the results more readable. Full details are in Appendix A.
Figure 2: RougeL for LLaMA 7B models on the Alpaca dataset (y-axis: RougeL; runs at 4 and 16 bits for QLoRA-All, QLoRA-FFN, QLoRA-Attention, Alpaca (ours), and Stanford-Alpaca). Each point represents a run with a different random seed. We improve on the Stanford Alpaca fully finetuned default hyperparameters to construct a strong 16-bit baseline for comparisons. Using LoRA on all transformer layers is critical to match 16-bit performance.
While paged optimizers are critical for 33B/65B QLoRA tuning on a single 24/48GB GPU, we do not provide hard measurements for paged optimizers since the paging only occurs when processing mini-batches with long sequence lengths, which is rare. We do, however, perform an analysis of the runtime of paged optimizers for 65B models on 48GB GPUs and find that with a batch size of 16, paged optimizers provide the same training speed as regular optimizers. Future work should measure and characterize under what circumstances slowdowns occur from the paging process.
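As a concrete illustration, the sketch below shows how a paged optimizer can be swapped in for a regular one. It assumes the bitsandbytes PagedAdamW interface and uses a toy model as a stand-in, so treat the exact class name and settings as assumptions rather than the paper's training script.

```python
# Minimal sketch: a paged optimizer lets optimizer state spill to CPU RAM
# when GPU memory spikes (e.g., on long-sequence mini-batches).
# Assumes the bitsandbytes PagedAdamW interface; the model is a stand-in.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for the finetuned model

# A regular optimizer keeps all Adam state resident on the GPU:
#   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# The paged variant allocates state in unified memory, so pages are evicted
# to CPU RAM only under memory pressure; otherwise training speed matches
# the regular optimizer.
optimizer = bnb.optim.PagedAdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```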
Default LoRA hyperparameters do not match 16-bit performance
When using the standard practice of applying LoRA to the query and value attention projection matrices [28], we are not able to replicate full finetuning performance for large base models. As shown in Figure 2 for LLaMA 7B finetuning on Alpaca, we find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total, and that LoRA on all linear transformer block layers is required to match full finetuning performance. Other LoRA hyperparameters, such as the projection dimension r, do not affect performance (see Appendix A).
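A minimal sketch of the difference is given below. It assumes the Hugging Face PEFT LoraConfig interface and the module names of the HF LLaMA implementation; the r and alpha values are illustrative rather than the paper's tuned settings. The point is only that target_modules covers every linear layer in the block, not just the attention query/value projections.

```python
# Sketch: attention-only LoRA vs. LoRA on all linear transformer-block layers.
# Assumes Hugging Face PEFT (LoraConfig, get_peft_model) and the HF LLaMA
# module names (q_proj ... down_proj); hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Standard practice from the LoRA paper: adapters only on the query/value
# projections ("QLoRA-Attention" in Figure 2).
attention_only = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Adapters on every linear layer of each transformer block
# ("QLoRA-All" in Figure 2), which is what matches full finetuning.
all_linear = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, all_linear)
model.print_trainable_parameters()
```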
Figure 3: Mean zero-shot accuracy over Winogrande, HellaSwag, PiQA, Arc-Easy, and Arc-Challenge using LLaMA models with different 4-bit data types (Float, NFloat, NFloat + DQ), plotted against total model bits. The NormalFloat data type significantly improves the bit-for-bit accuracy gains compared to regular 4-bit Floats. While Double Quantization (DQ) only leads to minor gains, it allows for a more fine-grained control over the memory footprint to fit models of certain size (33B/65B) into certain GPUs (24/48GB).
Similarly, we find that default hyperparameters for
fully finetuned baselines are undertuned. We do a
hyperparameter search over learning rates 1e-6 to
5e-5 and batch sizes 8 to 128 to find robust baselines.
Results for 7B LLaMA finetuning on Alpaca are
shown in Figure 2.
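The sketch below illustrates the kind of grid this search covers; the specific grid points and the finetune_and_eval helper are hypothetical placeholders, not the paper's exact sweep.

```python
# Sketch of the baseline hyperparameter sweep: a simple grid over learning
# rates (1e-6 to 5e-5) and batch sizes (8 to 128). finetune_and_eval is a
# hypothetical placeholder for a full-finetuning run plus evaluation.
import itertools

def finetune_and_eval(lr: float, batch_size: int) -> float:
    # Placeholder: launch a full-finetuning run with these settings and
    # return the validation metric (e.g., RougeL on Alpaca).
    raise NotImplementedError

learning_rates = [1e-6, 5e-6, 1e-5, 2e-5, 5e-5]  # illustrative grid points
batch_sizes = [8, 16, 32, 64, 128]

best = (float("-inf"), None, None)
for lr, bs in itertools.product(learning_rates, batch_sizes):
    score = finetune_and_eval(lr=lr, batch_size=bs)
    if score > best[0]:
        best = (score, lr, bs)
```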
4-bit NormalFloat yields better performance than 4-bit Floating Point
While the 4-bit NormalFloat (NF4) data type is information-theoretically optimal, it remains to be determined whether this property translates into empirical advantages. We follow the setup from Dettmers and Zettlemoyer [13], where quantized LLMs (OPT [72], BLOOM [52], Pythia [7], LLaMA) of different sizes (125M to 65B) with different data types are evaluated on language modeling and a set of zero-shot tasks. In Figure 3 and Table 2 we see that NF4 improves performance significantly over FP4 and Int4 and that double quantization reduces the memory footprint without degrading performance.
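To make the data-type comparison concrete, the following is a simplified sketch of the NormalFloat idea: quantization levels are placed at quantiles of a standard normal distribution (the typical shape of pretrained-weight distributions), rescaled to [-1, 1], and applied blockwise with one absmax scale per block. It omits the asymmetric construction that makes zero exactly representable, so treat it as an illustration rather than the exact NF4 table; the torch/scipy calls are assumptions about the environment.

```python
# Simplified sketch of NormalFloat quantization: levels at quantiles of N(0,1),
# normalized to [-1, 1], with per-block absmax scaling. This is NOT the exact
# NF4 construction (which splits the range asymmetrically so that zero is
# represented exactly); it only illustrates the core idea.
import torch
from scipy.stats import norm

def normal_float_levels(bits: int = 4) -> torch.Tensor:
    n = 2 ** bits
    # Evenly spaced probabilities, offset so the extreme quantiles stay finite.
    probs = torch.linspace(1 / (2 * n), 1 - 1 / (2 * n), n)
    levels = torch.tensor(norm.ppf(probs.numpy()), dtype=torch.float32)
    return levels / levels.abs().max()          # rescale to [-1, 1]

def quantize_blockwise(w: torch.Tensor, levels: torch.Tensor, block: int = 64):
    w = w.reshape(-1, block)
    absmax = w.abs().amax(dim=1, keepdim=True)  # one FP32 scale per block
    idx = (w / absmax).unsqueeze(-1).sub(levels).abs().argmin(dim=-1)
    return idx.to(torch.uint8), absmax          # 4-bit codes + block scales

def dequantize_blockwise(idx, absmax, levels):
    return levels[idx.long()] * absmax

levels = normal_float_levels(4)
w = torch.randn(4096, 64)                       # toy "pretrained" weight matrix
codes, scales = quantize_blockwise(w, levels)
w_hat = dequantize_blockwise(codes, scales, levels).reshape(w.shape)
```

Double quantization then quantizes the per-block absmax scales themselves using a second level of quantization constants, which is where the additional memory savings shown in Figure 3 come from.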
k-bit QLoRA matches 16-bit full finetuning and 16-bit LoRA performance
Recent findings have established that 4-bit quantization for inference is