A large subset of Scheme’s core forms (lambda, quote, set!,
etc) and extended forms (cond, case, letrec, internal define
etc.) must be supported by the compiler. Although most of these
forms are not essential, their presence allows us to write our pro-
grams in a more natural way. In implementing the extended forms,
we show how a large number of syntactic forms can be added with-
out changing the core language that the compiler supports.
A large collection of primitives (cons, car, vector?, etc.)
and library procedures (map, apply, list->vector, etc.) need
to be implemented. Some of these library procedures can be im-
plemented directly, while others require some added support from
the compiler. For example, some of the primitives cannot be im-
plemented without supporting variable-arity procedures, and others
require the presence of apply. Implementing a writer and a reader
requires adding a way to communicate with an external run-time
system.
3. Writing a Compiler in 24 Small Steps
Now that we described the development methodology, we turn our
attention to the actual steps taken in constructing a compiler. This
section is a brief description of 24 incremental stages: the first is a
small language composed only of small integers, and the last covers
most of the requirements of R
5
RS. A more detailed presentation of
these stages is in the accompanying extended tutorial.
3.1 Integers
The simplest language that we can compile and test is composed
of the fixed-size integers, or fixnums. Let’s write a small compiler
that takes a fixnum as input and produces a program in assembly
that returns that fixnum. Since we don’t know yet how to do that,
we ask for some help from another compiler that does know: gcc.
Let’s write a small C function that returns an integer:
int scheme_entry(){
return 42;
}
Let’s compile it using gcc -O3 --omit-frame-pointer -S
test.c and see the output. The most relevant lines of the output
file are the following:
1. .text
2. .p2align 4,,15
3. .globl scheme_entry
4. .type scheme_entry, @function
5. scheme_entry:
6. movl $42, %eax
7. ret
Line 1 starts a text segment, where code is located. Line 2 aligns
the beginning of the procedure at 4-byte boundaries (not important
at this point). Line 3 informs the assembler that the scheme entry
label is global so that it becomes visible to the linker. Line 4
says that scheme entry is a function. Line 5 denotes the start of
the scheme entry procedure. Line 6 sets the value of the %eax
register to 42. Line 7 returns control to the caller, which expects
the received value to be in the %eax register.
Generating this file from Scheme is straightforward. Our com-
piler takes an integer as input and prints the given assembly with
the input substituted in for the value to be returned.
(define (compile-program x)
(emit "movl $~a, %eax" x)
(emit "ret"))
To test our implementation, we write a small C run-time system
that calls our scheme entry and prints the value it returns:
/* a simple driver for scheme_entry */
#include <stdio.h>
int main(int argc, char** argv){
printf("%d\n", scheme_entry());
return 0;
}
3.2 Immediate Constants
Values in Scheme are not limited to the fixnum integers. Booleans,
characters, and the empty list form a collection of immediate val-
ues. Immediate values are those that can be stored directly in
a machine word and therefore do not require additional storage.
The types of the immediate objects in Scheme are disjoint, conse-
quently, the implementation cannot use fixnums to denote booleans
or characters. The types must also be available a t run time to al-
low the driver to print the values appropriately and to allow us to
provide the type predicates (discussed in the next step).
One way of encoding the type information is by dedicating some
of the lower bits of the machine word for type information and
using the rest of the machine word for storing the value. Every type
of value is defined by a mask and a tag. The mask defines which bits
of the integer are used for the type information and the tag defines
the value of these bits.
For fixnums, the lower two bits (mask = 11
b
) must be 0
(tag = 00
b
). This leaves 30 bits to hold the value of a fixnum.
Characters are tagged with 8 bits ( tag = 00001111
b
) leaving 24
bits for the value (7 of which are actually used to encode the ASCII
characters). Booleans are given a 7-bit tag (tag = 0011111
b
), and
1-bit value. The empty list is given the value 00101111
b
.
We extend our compiler to handle the immediate types appro-
priately. The code generator must convert the different immediate
values to the corresponding machine integer values.
(define (compile-program x)
(define (immediate-rep x)
(cond
((integer? x) (shift x fixnum-shift))
...))
(emit "movl $~a, %eax" (immediate-rep x))
(emit "ret"))
The driver must also be extended to handle the newly-added
values. The following code illustrates the concept:
#include <stdio.h>
#define fixnum_mask 3
#define fixnum_tag 0
#define fixnum_shift 2
...
int main(int argc, char** argv){
int val = scheme_entry();
if((val & fixnum_mask) == fixnum_tag){
printf("%d\n", val >> fixnum_shift);
} else if(val == empty_list){
printf("()\n");
} ...
return 0;
}
3.3 Unary Primitives
We extend the language now to include calls to primitives that ac-
cept one argument. We start with the simplest of these primitives:
add1 and sub1. To compile an expression in the form (add1 e),
we first emit the code for e. That code would evaluate e placing its
value in the %eax register. What remains to be done is incrementing
Scheme and Functional Programming, 2006 29