Table 4: Test accuracy of QWEN PMP and reward model (RM) on diverse human preference benchmark datasets.

Model   QWEN          QWEN            Anthropic     Anthropic       OpenAI  Stanford  OpenAI
        Helpful-base  Helpful-online  Helpful-base  Helpful-online  Summ.   SHP       PRM800K
PMP     62.68         61.62           76.52         65.43           69.60   60.05     70.59
RM      74.78         69.71           73.98         64.57           69.99   60.10     70.52
3.2.2 REINFORCEMENT LEARNING
Our Proximal Policy Optimization (PPO) process involves four models: the policy model, value
model, reference model, and reward model. Before starting the PPO procedure, we pause the policy
model’s updates and focus solely on updating the value model for 50 steps. This approach ensures
that the value model can adapt to different reward models effectively.
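To make this warm-up step concrete, the sketch below updates only the value model for a fixed number of steps while the policy stays frozen. It is a minimal sketch, not the actual training code: the `policy`, `value_model`, `reward_model`, and `dataloader` objects, their methods, and the mean-squared-error regression target are all hypothetical placeholders.

```python
import torch

def warmup_value_model(value_model, reward_model, policy, dataloader,
                       num_steps=50, lr=5e-6):
    """Update only the value model for `num_steps` steps before PPO begins."""
    optimizer = torch.optim.Adam(value_model.parameters(), lr=lr)
    data_iter = iter(dataloader)
    for _ in range(num_steps):
        queries = next(data_iter)                  # batch of prompts
        with torch.no_grad():
            responses = policy.generate(queries)   # policy parameters stay frozen
            rewards = reward_model.score(queries, responses)
        values = value_model(queries, responses)   # predicted scalar values
        # Regress the value head toward the observed rewards so it adapts
        # to this particular reward model before policy updates resume.
        loss = torch.nn.functional.mse_loss(values, rewards)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```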
During PPO, we sample two responses for each query simultaneously, a strategy that our internal benchmarking evaluations show to be more effective. We set the KL divergence coefficient to 0.04 and normalize the reward using its running mean.
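The sketch below illustrates one way the reward shaping described above could be implemented: the raw reward is centered by a running mean, and a per-token KL penalty against the reference model is applied with a coefficient of 0.04. The class and function names, the momentum value, and the simple log-probability-difference KL estimate are our own illustrative assumptions.

```python
import torch

class RunningMean:
    """Tracks a running mean of rewards for normalization (illustrative)."""
    def __init__(self, momentum=0.99):
        self.mean = 0.0
        self.momentum = momentum
        self.initialized = False

    def update(self, rewards: torch.Tensor) -> float:
        batch_mean = rewards.mean().item()
        if not self.initialized:
            self.mean, self.initialized = batch_mean, True
        else:
            self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean
        return self.mean

def shaped_reward(raw_reward, logprob_policy, logprob_ref, running_mean,
                  kl_coef=0.04):
    """Normalize the reward by its running mean and subtract a KL penalty."""
    normalized = raw_reward - running_mean.update(raw_reward)
    kl = logprob_policy - logprob_ref   # simple per-token KL estimate
    return normalized - kl_coef * kl
```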
The policy and value models have learning rates of 1 × 10⁻⁶ and 5 × 10⁻⁶, respectively. To enhance training stability, we utilize value loss clipping with a clip value of 0.15. For inference, the policy top-p is set to 0.9. Our findings indicate that although the entropy is slightly lower than when top-p is set to 1.0, there is a faster increase in reward, ultimately resulting in consistently higher evaluation rewards under similar conditions.
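As an illustration of value loss clipping with a clip value of 0.15, the minimal sketch below follows the common PPO formulation of clipping the new value prediction around the rollout-time estimate; the function name and tensor layout are assumptions rather than the paper's implementation.

```python
import torch

def clipped_value_loss(values, old_values, returns, clip_range=0.15):
    """PPO-style value loss clipping with a clip value of 0.15 (sketch)."""
    # Keep the new value prediction within `clip_range` of the estimate
    # recorded at rollout time, then take the worse of the two errors.
    values_clipped = old_values + torch.clamp(values - old_values,
                                              -clip_range, clip_range)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```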
Additionally, we apply a pretrained gradient to mitigate the alignment tax. Empirical findings indicate that, with this specific reward model, the KL penalty is sufficiently strong to counteract the alignment tax on benchmarks that are not strictly code or math in nature, such as those testing common-sense knowledge and reading comprehension. It is essential to use a significantly larger volume of pretraining data than PPO data for the pretrained gradient to be effective. Our empirical study also suggests that an overly large value for this coefficient considerably impedes alignment to the reward model, ultimately compromising the final alignment, while an overly small value yields only a marginal reduction of the alignment tax.
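A minimal sketch of what such a pretrained gradient could look like is given below: a next-token language-modeling loss computed on a batch from the pretraining corpus is added to the PPO loss with a mixing coefficient. The names `combined_policy_loss` and `ptx_coef`, the tensor layout, and the default coefficient value are illustrative assumptions, not the values used in the paper.

```python
import torch.nn.functional as F

def combined_policy_loss(ppo_loss, policy_logits, pretrain_tokens, ptx_coef=1.0):
    """Mix a pretraining language-modeling loss into the PPO objective (sketch)."""
    # Next-token cross-entropy on a batch drawn from the pretraining corpus,
    # which should be much larger than the PPO batch.
    lm_loss = F.cross_entropy(
        policy_logits[:, :-1].reshape(-1, policy_logits.size(-1)),
        pretrain_tokens[:, 1:].reshape(-1),
    )
    # Too large a ptx_coef impedes alignment to the reward model;
    # too small a value barely reduces the alignment tax.
    return ppo_loss + ptx_coef * lm_loss
```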
3.3 AUTOMATIC AND HUMAN EVALUATION OF ALIGNED MODELS
To showcase the effectiveness of our aligned models, we conduct a comparison with other aligned
models on well-established benchmarks, including MMLU (Hendrycks et al., 2020), C-Eval (Huang
et al., 2023), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021b), and BBH (Suzgun et al.,
2022). Besides the widely used few-shot setting, we test our aligned models in the zero-shot setting
to demonstrate how well the models follow instructions. The prompt in a zero-shot setting consists
of an instruction and a question without any previous examples in the context. The results of the
baselines are collected from their official reports and OpenCompass (OpenCompass Team, 2023).
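For concreteness, a zero-shot evaluation prompt of the kind described above might be assembled as in the sketch below; the exact wording and formatting are illustrative assumptions rather than the templates used in our evaluation.

```python
def build_zero_shot_prompt(instruction: str, question: str) -> str:
    """Assemble a zero-shot prompt: an instruction followed by the question,
    with no in-context examples (wording is illustrative)."""
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"
```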
The results in Table 5 demonstrate the effectiveness of our aligned models in understanding human
instructions and generating appropriate responses. QWEN-14B-Chat outperforms all other models
except ChatGPT (OpenAI, 2022) and LLAMA 2-CHAT-70B (Touvron et al., 2023b) in all datasets,
including MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), GSM8K (Cobbe et al., 2021),
HumanEval (Chen et al., 2021b), and BBH (Suzgun et al., 2022). In particular, QWEN’s performance
on HumanEval, which measures the quality of generated code, is significantly higher than that of
other open-source models.
Moreover, QWEN’s performance is consistently better than that of open-source models of similar size,
such as LLaMA2 (Touvron et al., 2023b), ChatGLM2 (ChatGLM2 Team, 2023), InternLM (InternLM
Team, 2023), and Baichuan2 (Yang et al., 2023). This suggests that our alignment approach, which
involves fine-tuning the model on a large dataset of human conversations, has been effective in
improving the model’s ability to understand and generate human-like language.