selecting such a varied dataset mix is to develop a language model well suited to perform a wide range
of software engineering tasks, not only those directly related to programming like code completion.
Furthermore, our training data also contains general text datasets in order to provide the model with
broader linguistic knowledge and context. We hope this enables the model to address a wider range
of queries and tasks in a conversational manner. In Table 1, we provide the data sources, epochs,
categories, and sampling weights of the datasets used to set up the pretraining corpus. We use an
80:20 split between code and natural language data, with the contributions from
individual components detailed in the table (see Table 2 for references).
Table 2: References for the main training datasets.
Dataset             Reference
StarCoder Data      [26]
GitHub Issues       [25]
GitHub Diffs        [28]
StackExchange       [16]
arXiv               [16]
Synthetic Dataset   Section 2.1.1
Proof Pile Math     [6]
MetaMathQA          [42]
RefinedWeb          [30]
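To illustrate how such sampling weights translate into a training mix, the sketch below draws each document's source dataset in proportion to its weight. The weights shown are placeholders rather than the values from Table 1; they are chosen here only so that code-heavy sources sum to 0.8 and general-text sources to 0.2.

```python
import random

# Placeholder sampling weights (not the values reported in Table 1),
# scaled so code sources sum to 0.8 and general-text sources to 0.2.
DATASET_WEIGHTS = {
    "starcoder_data": 0.55,
    "github_issues": 0.10,
    "github_diffs": 0.05,
    "synthetic": 0.10,
    "stackexchange": 0.04,
    "arxiv": 0.03,
    "refined_web": 0.13,
}

def sample_source(weights: dict, rng: random.Random) -> str:
    """Pick the source dataset for the next training document
    in proportion to its sampling weight."""
    names = list(weights)
    probs = [weights[n] for n in names]
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)
batch_sources = [sample_source(DATASET_WEIGHTS, rng) for _ in range(8)]
print(batch_sources)
```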
2.1.1 Synthetic Dataset
We also introduce a small synthetic dataset into our pre-training corpus. The data is synthetically
generated from the seed prompts of the CodeAlpaca§ dataset, which comprises 174,000 prompts.
To augment the diversity and difficulty of the CodeAlpaca prompts, we employed the
"Evol-Instruct" method introduced by Xu et al. [40], wherein we ask a language model (in this case,
WizardLM [40]) to increase the complexity of the given seed prompt in a step-by-step fashion.
By applying strategies focused on breadth, reasoning, deepening, and complexity, we were able to
enrich our collection with an additional 100,000 prompts. We leverage the DeepSeek Coder 34B
model [21] to generate synthetic outputs for the newly developed "Evol-Instruct" prompts. Based on
an ablation experiment we conducted, we believe that introducing this synthetic data early in the
pretraining phase helped the model respond better to natural language text.
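A minimal sketch of this evolve-then-generate pipeline is given below, assuming hypothetical `evolver` and `coder` callables that wrap the WizardLM and DeepSeek Coder models; the strategy wordings are illustrative, not the prompts used in this work.

```python
from typing import Callable

# Illustrative evolution strategies reflecting the breadth, reasoning, deepening,
# and complexity directions described above; the exact prompts are assumptions.
EVOL_STRATEGIES = [
    "Broaden the task so it covers an additional use case.",
    "Require multi-step reasoning to reach the solution.",
    "Deepen the task by asking for edge-case handling.",
    "Increase complexity by combining it with a related subtask.",
]

def evolve_prompt(seed: str, evolver: Callable[[str], str]) -> str:
    """Increase the complexity of a seed prompt step by step.

    `evolver` wraps whatever instruction-evolution model is used
    (WizardLM in this work) and maps a prompt string to generated text.
    """
    evolved = seed
    for strategy in EVOL_STRATEGIES:
        evolved = evolver(
            "Rewrite the following programming task to make it more challenging.\n"
            f"Strategy: {strategy}\nTask: {evolved}"
        )
    return evolved

def build_synthetic_example(seed: str,
                            evolver: Callable[[str], str],
                            coder: Callable[[str], str]) -> dict:
    """Pair an evolved prompt with a synthetic completion from a code model
    (DeepSeek Coder 34B in this work)."""
    prompt = evolve_prompt(seed, evolver)
    return {"prompt": prompt, "response": coder(prompt)}
```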
2.2 Long Context Dataset
Building upon the initial pre-training phase, we composed an additional stage of training that
specifically targets the model’s capability to process and comprehend long sequences. Having a
longer context length is useful for coding models due to the inter-dependence of multiple
files within a repository. We specifically chose 16,384 as the context length for our long context
dataset after determining the median and mean number of tokens in a software repository to be
≈12k and ≈18k tokens, respectively. This continued training stage focused on a curated selection of
programming languages, all sourced from the StarCoder dataset [26], a filtered version of The Stack [25],
a large collection of high-quality and permissively licensed coding data. The languages
selected for this phase were based on the Stack Overflow Developer Survey 2022 [35]. In particular,
we selected Python, C, C++, Go, Java, and JavaScript.
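The context-length choice above can be reproduced from per-repository token counts as in the short sketch below; the example counts are made up so that they yield the reported median of ≈12k and mean of ≈18k tokens.

```python
import statistics

def summarize_repo_lengths(repo_token_counts: list) -> dict:
    """Median and mean token counts across repositories.

    With a median of roughly 12k and a mean of roughly 18k tokens per
    repository, a 16,384-token context covers most repositories in full.
    """
    return {
        "median_tokens": statistics.median(repo_token_counts),
        "mean_tokens": statistics.mean(repo_token_counts),
    }

# Made-up per-repository token counts chosen to match the reported statistics.
counts = [4_000, 9_500, 12_000, 14_500, 50_000]
print(summarize_repo_lengths(counts))  # median 12000, mean 18000
```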
To create this long context dataset, we took files written in these languages within a repository
and combined them, inserting a special <repo_continuation> token between each file to maintain
separation while preserving the flow of content. To avoid potential biases from a fixed ordering of
files, which could inadvertently teach the model an unintended sequence or hierarchy, we employed a
randomized strategy: for each repository, we generated two distinct orderings of the concatenated files.
The statistics are outlined in Table 3.
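A possible implementation of this concatenation and shuffling step is sketched below; the file contents and the exact handling of the separator token are illustrative assumptions.

```python
import random

SEPARATOR = "<repo_continuation>"  # special token inserted between files

def build_long_context_docs(repo_files: list,
                            num_orderings: int = 2,
                            seed: int = 0) -> list:
    """Concatenate one repository's files into long-context documents.

    Two independently shuffled orderings are produced per repository so the
    model does not learn an unintended fixed file order or hierarchy.
    """
    rng = random.Random(seed)
    docs = []
    for _ in range(num_orderings):
        ordering = list(repo_files)  # copy so each shuffle is independent
        rng.shuffle(ordering)
        docs.append(SEPARATOR.join(ordering))
    return docs

# Example with two hypothetical files from one repository.
files = ["# main.py\nprint('hello')", "# utils.py\ndef add(a, b):\n    return a + b"]
for doc in build_long_context_docs(files):
    print(doc[:60])
```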
§ A subset is available at https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K