Experimental setup. We consider three architectures (encoder, encoder-decoder, and decoder only) and compare QLoRA with 16-bit adapter-finetuning and with full-finetuning for models up to 3B. Our evaluations include GLUE [58] with RoBERTa-large [38], Super-NaturalInstructions (TKInstruct) [61] with T5 [49], and 5-shot MMLU [24] after finetuning LLaMA on Flan v2 [39] and Alpaca [55]. To additionally study the advantages of NF4 over other 4-bit data types, we use the setup of Dettmers and Zettlemoyer [13] and measure post-quantization zero-shot accuracy and perplexity across different models (OPT [72], LLaMA [57], BLOOM [52], Pythia [7]) for model sizes 125M to 13B. We provide more details in the results section for each particular setup to make the results more readable. Full details are in Appendix A.
Figure 2: RougeL for LLaMA 7B models on the Alpaca dataset (y-axis: RougeL; runs at 4 and 16 bits for QLoRA-All, QLoRA-FFN, QLoRA-Attention, Alpaca (ours), and Stanford-Alpaca). Each point represents a run with a different random seed. We improve on the Stanford Alpaca fully finetuned default hyperparameters to construct a strong 16-bit baseline for comparisons. Using LoRA on all transformer layers is critical to match 16-bit performance.
While paged optimizers are critical for 33B/65B QLoRA tuning on a single 24/48GB GPU, we do not provide hard measurements for paged optimizers since the paging only occurs when processing mini-batches with long sequence lengths, which is rare. We do, however, perform an analysis of the runtime of paged optimizers for 65B models on 48GB GPUs and find that with a batch size of 16, paged optimizers provide the same training speed as regular optimizers. Future work should measure and characterize under what circumstances slowdowns occur from the paging process.
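As a concrete illustration, the sketch below shows how a paged optimizer can be swapped in for a regular one. It assumes the bitsandbytes PagedAdamW interface and uses a toy model as a stand-in, so treat the exact class name and settings as assumptions rather than the paper's training script.

```python
# Minimal sketch: a paged optimizer lets optimizer state spill to CPU RAM
# when GPU memory spikes (e.g., on long-sequence mini-batches).
# Assumes the bitsandbytes PagedAdamW interface; the model is a stand-in.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for the finetuned model

# A regular optimizer keeps all Adam state resident on the GPU:
#   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# The paged variant allocates state in unified memory, so pages are evicted
# to CPU RAM only under memory pressure; otherwise training speed matches
# the regular optimizer.
optimizer = bnb.optim.PagedAdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```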
Default LoRA hyperparameters do not match 16-bit performance
When using the standard practice of applying LoRA to the query and value attention projection matrices [28], we are not able to replicate full finetuning performance for large base models. As shown in Figure 2 for LLaMA 7B finetuning on Alpaca, we find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total, and that LoRA on all linear transformer block layers is required to match full finetuning performance. Other LoRA hyperparameters, such as the projection dimension r, do not affect performance (see Appendix A).
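A minimal sketch of the difference is given below. It assumes the Hugging Face PEFT LoraConfig interface and the module names of the HF LLaMA implementation; the r and alpha values are illustrative rather than the paper's tuned settings. The point is only that target_modules covers every linear layer in the block, not just the attention query/value projections.

```python
# Sketch: attention-only LoRA vs. LoRA on all linear transformer-block layers.
# Assumes Hugging Face PEFT (LoraConfig, get_peft_model) and the HF LLaMA
# module names (q_proj ... down_proj); hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Standard practice from the LoRA paper: adapters only on the query/value
# projections ("QLoRA-Attention" in Figure 2).
attention_only = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Adapters on every linear layer of each transformer block
# ("QLoRA-All" in Figure 2), which is what matches full finetuning.
all_linear = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, all_linear)
model.print_trainable_parameters()
```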
Figure 3: Mean zero-shot accuracy over Winogrande, HellaSwag, PiQA, Arc-Easy, and Arc-Challenge using LLaMA models with different 4-bit data types (Float, NFloat, NFloat + DQ), plotted against total model bits. The NormalFloat data type significantly improves the bit-for-bit accuracy gains compared to regular 4-bit Floats. While Double Quantization (DQ) only leads to minor gains, it allows for a more fine-grained control over the memory footprint to fit models of certain size (33B/65B) into certain GPUs (24/48GB).
Similarly, we find that default hyperparameters for
fully finetuned baselines are undertuned. We do a
hyperparameter search over learning rates 1e-6 to
5e-5 and batch sizes 8 to 128 to find robust baselines.
Results for 7B LLaMA finetuning on Alpaca are
shown in Figure 2.
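The sketch below illustrates the kind of grid this search covers; the specific grid points and the finetune_and_eval helper are hypothetical placeholders, not the paper's exact sweep.

```python
# Sketch of the baseline hyperparameter sweep: a simple grid over learning
# rates (1e-6 to 5e-5) and batch sizes (8 to 128). finetune_and_eval is a
# hypothetical placeholder for a full-finetuning run plus evaluation.
import itertools

def finetune_and_eval(lr: float, batch_size: int) -> float:
    # Placeholder: launch a full-finetuning run with these settings and
    # return the validation metric (e.g., RougeL on Alpaca).
    raise NotImplementedError

learning_rates = [1e-6, 5e-6, 1e-5, 2e-5, 5e-5]  # illustrative grid points
batch_sizes = [8, 16, 32, 64, 128]

best = (float("-inf"), None, None)
for lr, bs in itertools.product(learning_rates, batch_sizes):
    score = finetune_and_eval(lr=lr, batch_size=bs)
    if score > best[0]:
        best = (score, lr, bs)
```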
4-bit NormalFloat yields better performance than 4-bit Floating Point
While the 4-bit NormalFloat (NF4) data type is information-theoretically optimal, it remains to be determined whether this property translates into empirical advantages. We follow the setup from Dettmers and Zettlemoyer [13], where quantized LLMs (OPT [72], BLOOM [52], Pythia [7], LLaMA) of different sizes (125M to 65B) with different data types are evaluated on language modeling and a set of zero-shot tasks. In Figure 3 and Table 2 we see that NF4 improves performance significantly over FP4 and Int4 and that double quantization reduces the memory footprint without degrading performance.
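To make the data-type comparison concrete, the following is a simplified sketch of the NormalFloat idea: quantization levels are placed at quantiles of a standard normal distribution (the typical shape of pretrained-weight distributions), rescaled to [-1, 1], and applied blockwise with one absmax scale per block. It omits the asymmetric construction that makes zero exactly representable, so treat it as an illustration rather than the exact NF4 table; the torch/scipy calls are assumptions about the environment.

```python
# Simplified sketch of NormalFloat quantization: levels at quantiles of N(0,1),
# normalized to [-1, 1], with per-block absmax scaling. This is NOT the exact
# NF4 construction (which splits the range asymmetrically so that zero is
# represented exactly); it only illustrates the core idea.
import torch
from scipy.stats import norm

def normal_float_levels(bits: int = 4) -> torch.Tensor:
    n = 2 ** bits
    # Evenly spaced probabilities, offset so the extreme quantiles stay finite.
    probs = torch.linspace(1 / (2 * n), 1 - 1 / (2 * n), n)
    levels = torch.tensor(norm.ppf(probs.numpy()), dtype=torch.float32)
    return levels / levels.abs().max()          # rescale to [-1, 1]

def quantize_blockwise(w: torch.Tensor, levels: torch.Tensor, block: int = 64):
    w = w.reshape(-1, block)
    absmax = w.abs().amax(dim=1, keepdim=True)  # one FP32 scale per block
    idx = (w / absmax).unsqueeze(-1).sub(levels).abs().argmin(dim=-1)
    return idx.to(torch.uint8), absmax          # 4-bit codes + block scales

def dequantize_blockwise(idx, absmax, levels):
    return levels[idx.long()] * absmax

levels = normal_float_levels(4)
w = torch.randn(4096, 64)                       # toy "pretrained" weight matrix
codes, scales = quantize_blockwise(w, levels)
w_hat = dequantize_blockwise(codes, scales, levels).reshape(w.shape)
```

Double quantization then quantizes the per-block absmax scales themselves using a second level of quantization constants, which is where the additional memory savings shown in Figure 3 come from.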
k-bit QLoRA matches 16-bit full finetuning and 16-bit LoRA performance
Recent findings have established that 4-bit quantization for inference is