A Dataset of Feature Additions and Feature Removals from
the Linux Kernel
Leonardo Passos ∗
University of Waterloo
Canada
lpassos@gsd.uwaterloo.ca
Krzysztof Czarnecki
University of Waterloo
Canada
kczarnec@gsd.uwaterloo.ca
ABSTRACT
This paper describes a dataset of feature additions and re-
movals in the Linux kernel evolution history, spanning over
seven years of kernel development. Features, in this context,
denote configurable system options that users select when
creating customized kernel images. The provided dataset is
the largest corpus we are aware of capturing feature additions
and removals, allowing researchers to assess the kernel's
evolution from a feature-oriented point of view. Furthermore, the
dataset can be used to better understand how features evolve
over time, and how different artifacts change as a result.
One particular use of the dataset is to provide a real-world
case to assess existing support for feature traceability and
evolution. In this paper, we detail the dataset extraction pro-
cess, the underlying database schema, and example queries.
The dataset is directly available at our Bitbucket repository:
https://bitbucket.org/lpassos/kconfigdb
Categories and Subject Descriptors
D.2.7 [Distribution, Maintenance, and Enhancement]: Version control
General Terms
Management
Keywords
Linux, Version Control History, Evolution, Traceability
1. INTRODUCTION
Highly-configurable software systems allow users to config-
ure the target software according to their own preferences
and needs. Configurability is achieved by having variable
software artifacts, meaning that they can be restructured
to suit a particular configuration [7]. Many software systems
fit this description (e.g., database management systems
[3, 8, 11], SOA-based applications [2], and operating
systems [1, 4, 5]), and the Linux kernel is probably the
best-known case.
∗ Funded by CAPES, grant BEX 0459-10-0.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
MSR ’14, May 31 – June 1, 2014, Hyderabad, India
Copyright 2014 ACM 978-1-4503-2863-0/14/05 ...$15.00.
In the Linux kernel, variability is captured in system fea-
tures (configurable system options), which are explicitly de-
clared in variability models written in the Kconfig language
[9]. Features in variability models are then referenced in
build files and C code. Figure 1 conceptually illustrates how
features appear in these three artifact types and how they
bind such artifacts. The variability model contains feature
declarations, including drivers, file systems, scheduling
policies, network protocols, etc. In the figure, feature FB is a
driver providing support for framebuffer devices.¹ Through
a configuration process over the declared features (step 1),²
users state which features should be part of the final kernel
and which should be excluded. Selecting a feature triggers
specific build rules that compile the corresponding
C files. For example, selecting FB triggers the compilation
of fb.c (step 2). Compilation, however, first requires
pre-processing the target source files to resolve any compile-time
variability introduced by C pre-processor directives (step 3),
such as ifdefs. In our example, the pre-processing of sti_core.c,
another framebuffer-related feature, shows that in the
presence of FB, the post-processed sti_select_fbfont function
has a non-NULL return. After pre-processing the target files, the
resulting directive-free files are compiled (step 4).
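The four steps above can be mimicked with a toy model. The following Python sketch is illustrative only: the build-rule table, the feature name STI, the C snippets, and the helper names preprocess and build are all assumptions made for this example, not the kernel's actual build system or source code. It also handles only flat (non-nested) ifdef blocks.

```python
import re

# Hypothetical build rules: feature name -> C files compiled when selected.
# "STI" is an invented feature name used only for this illustration.
BUILD_RULES = {
    "FB": ["fb.c"],
    "STI": ["sti_core.c"],
}

# Invented file contents; these are NOT the kernel's real sources.
SOURCES = {
    "fb.c": "int fb_init(void) { return 0; }\n",
    "sti_core.c": (
        "void *sti_select_fbfont(void) {\n"
        "#ifdef CONFIG_FB\n"
        "    return &font;\n"
        "#else\n"
        "    return NULL;\n"
        "#endif\n"
        "}\n"
    ),
}


def preprocess(text, selected):
    """Step 3: resolve flat #ifdef CONFIG_X / #else / #endif blocks."""
    out = []
    keep = True    # whether lines in the current region are emitted
    active = None  # truth value of the enclosing #ifdef, None outside one
    for line in text.splitlines():
        stripped = line.strip()
        match = re.fullmatch(r"#ifdef\s+CONFIG_(\w+)", stripped)
        if match:
            active = match.group(1) in selected
            keep = active
        elif stripped == "#else":
            keep = not active
        elif stripped == "#endif":
            active, keep = None, True
        elif keep:
            out.append(line)
    return "\n".join(out) + "\n"


def build(selected):
    """Steps 2-4: choose files via build rules, pre-process them, and
    return the directive-free text that would go to the compiler."""
    units = {}
    for feature in sorted(selected):
        for path in BUILD_RULES.get(feature, []):
            units[path] = preprocess(SOURCES[path], selected)
    return units


# Step 1: the user's feature selection.
units = build({"FB", "STI"})
```

In this sketch, selecting FB together with the hypothetical STI feature keeps the non-NULL branch of sti_select_fbfont, while selecting STI alone keeps the NULL branch and leaves fb.c out of the build entirely.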
As the kernel evolves, new features are introduced, retired,
split, merged and renamed. These evolution changes are
expressed by two basic operations in the variability model:
feature additions and feature removals (e.g., the split of a
feature $f$ into $f_i$ and $f_j$ is given by the removal of $f$
followed by the addition of $f_i$ and $f_j$). By processing the Linux kernel
version control history, we extract a dataset of feature addi-
tions and removals to the variability model, linking them to
their specific commits, changed files (Kconfig, build system,
and C files), contributors and associated releases. Commits
that do not change the variability model are also stored, but
with less detail. The dataset, kept as a relational database,
allows different types of queries, including the retrieval of the
commit that adds/removes a particular feature, the release
in which an addition/removal occurs, which contributors
¹ A framebuffer device is an abstraction for the graphics
hardware. It represents the framebuffer of some video hardware,
and allows application software to access the graphics
hardware through a well-defined interface [6].
² Different tools exist to configure the kernel, including
xconfig, menuconfig, and gconfig. These tools render the Kconfig
model as a hierarchy of features, from which users select
those of interest and set their values accordingly.