selecting such a varied dataset mix is to develop a language model well suited to perform a wide range
of software engineering tasks, not only those directly related to programming like code completion.
Furthermore, our training data also contains general text datasets in order to provide the model with
broader linguistic knowledge and context. We hope this enables the model to address a wider range
of queries and tasks in a conversational manner. In Table 1, we provide the data sources, epochs,
categories, and sampling weights of the datasets used to set up the pretraining corpus. We use an
80:20 split between code and natural language data, with the contributions from
individual components detailed in the table (see Table 2 for references).
Table 2: References for the main training datasets.
Dataset             Reference
StarCoder Data      [26]
GitHub Issues       [25]
GitHub Diffs        [28]
StackExchange       [16]
arXiv               [16]
Synthetic Dataset   Section 2.1.1
Proof Pile Math     [6]
MetaMathQA          [42]
RefinedWeb          [30]
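To illustrate how such sampling weights translate into a training mix, the sketch below draws each document's source dataset in proportion to its weight. The weights shown are placeholders rather than the values from Table 1; they are chosen here only so that code-heavy sources sum to 0.8 and general-text sources to 0.2.

```python
import random

# Placeholder sampling weights (not the values reported in Table 1),
# scaled so code sources sum to 0.8 and general-text sources to 0.2.
DATASET_WEIGHTS = {
    "starcoder_data": 0.55,
    "github_issues": 0.10,
    "github_diffs": 0.05,
    "synthetic": 0.10,
    "stackexchange": 0.04,
    "arxiv": 0.03,
    "refined_web": 0.13,
}

def sample_source(weights: dict, rng: random.Random) -> str:
    """Pick the source dataset for the next training document
    in proportion to its sampling weight."""
    names = list(weights)
    probs = [weights[n] for n in names]
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)
batch_sources = [sample_source(DATASET_WEIGHTS, rng) for _ in range(8)]
print(batch_sources)
```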
2.1.1 Synthetic Dataset
We also introduce a small synthetic dataset into our pre-training corpus. The data is synthetically
generated from the seed prompts of the CodeAlpaca§ dataset, which comprises 174,000 prompts.
To augment the diversity and difficulty of the CodeAlpaca prompts, we employed the
"Evol-Instruct" method introduced by Xu et al. [40], wherein we ask a language model (in this case,
WizardLM [40]) to increase the complexity of the given seed prompt in a step-by-step fashion.
By applying strategies focused on breadth, reasoning, deepening, and complexity, we were able to
enrich our collection with an additional 100,000 prompts. We leverage the DeepSeek Coder 34B
model [21] to generate synthetic outputs for the newly developed "Evol-Instruct" prompts. Based on
an ablation experiment we conducted, we believe that introducing this synthetic data early in the
pretraining phase helped the model respond better to natural language text.
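A minimal sketch of this evolve-then-generate pipeline is given below, assuming hypothetical `evolver` and `coder` callables that wrap the WizardLM and DeepSeek Coder models; the strategy wordings are illustrative, not the prompts used in this work.

```python
from typing import Callable

# Illustrative evolution strategies reflecting the breadth, reasoning, deepening,
# and complexity directions described above; the exact prompts are assumptions.
EVOL_STRATEGIES = [
    "Broaden the task so it covers an additional use case.",
    "Require multi-step reasoning to reach the solution.",
    "Deepen the task by asking for edge-case handling.",
    "Increase complexity by combining it with a related subtask.",
]

def evolve_prompt(seed: str, evolver: Callable[[str], str]) -> str:
    """Increase the complexity of a seed prompt step by step.

    `evolver` wraps whatever instruction-evolution model is used
    (WizardLM in this work) and maps a prompt string to generated text.
    """
    evolved = seed
    for strategy in EVOL_STRATEGIES:
        evolved = evolver(
            "Rewrite the following programming task to make it more challenging.\n"
            f"Strategy: {strategy}\nTask: {evolved}"
        )
    return evolved

def build_synthetic_example(seed: str,
                            evolver: Callable[[str], str],
                            coder: Callable[[str], str]) -> dict:
    """Pair an evolved prompt with a synthetic completion from a code model
    (DeepSeek Coder 34B in this work)."""
    prompt = evolve_prompt(seed, evolver)
    return {"prompt": prompt, "response": coder(prompt)}
```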
2.2 Long Context Dataset
Building upon the initial pre-training phase, we composed an additional stage of training that
specifically targets the model’s capability to process and comprehend long sequences. Having a
longer context length is useful for coding models due to the inter-dependence of multiple
files within a repository. We specifically chose 16,384 as the context length for our long context
dataset after determining the median and mean number of tokens in a software repository to be
≈12k and ≈18k tokens, respectively. This continued training stage focused on a curated selection of
programming languages, all sourced from the StarCoder dataset [26], a filtered version of The Stack [25],
a large collection of high-quality and permissively licensed coding data. The languages
selected for this phase were based on the Stack Overflow Developer Survey 2022 [35]. In particular,
we selected Python, C, C++, Go, Java, and JavaScript.
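The context-length choice above can be reproduced from per-repository token counts as in the short sketch below; the example counts are made up so that they yield the reported median of ≈12k and mean of ≈18k tokens.

```python
import statistics

def summarize_repo_lengths(repo_token_counts: list) -> dict:
    """Median and mean token counts across repositories.

    With a median of roughly 12k and a mean of roughly 18k tokens per
    repository, a 16,384-token context covers most repositories in full.
    """
    return {
        "median_tokens": statistics.median(repo_token_counts),
        "mean_tokens": statistics.mean(repo_token_counts),
    }

# Made-up per-repository token counts chosen to match the reported statistics.
counts = [4_000, 9_500, 12_000, 14_500, 50_000]
print(summarize_repo_lengths(counts))  # median 12000, mean 18000
```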
To create this long context dataset, we took files written in these languages within a repository
and combined them, inserting a special <repo_continuation> token between each file to maintain
separation while preserving the flow of content. To avoid potential biases from a fixed ordering of
files, which could inadvertently teach the model an unintended sequence or hierarchy, we employed a
randomized strategy: for each repository, we generated two distinct orderings of the concatenated files.
The statistics are outlined in Table 3.
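A possible implementation of this concatenation and shuffling step is sketched below; the file contents and the exact handling of the separator token are illustrative assumptions.

```python
import random

SEPARATOR = "<repo_continuation>"  # special token inserted between files

def build_long_context_docs(repo_files: list,
                            num_orderings: int = 2,
                            seed: int = 0) -> list:
    """Concatenate one repository's files into long-context documents.

    Two independently shuffled orderings are produced per repository so the
    model does not learn an unintended fixed file order or hierarchy.
    """
    rng = random.Random(seed)
    docs = []
    for _ in range(num_orderings):
        ordering = list(repo_files)  # copy so each shuffle is independent
        rng.shuffle(ordering)
        docs.append(SEPARATOR.join(ordering))
    return docs

# Example with two hypothetical files from one repository.
files = ["# main.py\nprint('hello')", "# utils.py\ndef add(a, b):\n    return a + b"]
for doc in build_long_context_docs(files):
    print(doc[:60])
```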
§ A subset is available at https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K