J1: a small Forth CPU Core for FPGAs
James Bowman
Willow Garage
Menlo Park, CA
jamesb@willowgarage.com
Abstract— This paper describes a 16-bit Forth CPU core,
intended for FPGAs. The instruction set closely matches the
Forth programming language, simplifying cross-compilation.
Because it has higher throughput than comparable CPU cores,
it can stream uncompressed video over Ethernet using a simple
software loop. The entire system (source Verilog, cross compiler,
and TCP/IP networking code) is published under the BSD
license. The core is less than 200 lines of Verilog, and operates
reliably at 80 MHz in a Xilinx Spartan®-3E FPGA, delivering
approximately 100 ANS Forth MIPS.
I. INTRODUCTION
The J1 is a small CPU core for use in FPGAs. It is a 16-
bit von Neumann architecture with three basic instruction
formats. The instruction set of the J1 maps very closely to
ANS Forth. The J1 does not have:
• condition registers or a carry flag
• pipelined instruction execution
• 8-bit memory operations
• interrupts or exceptions
• relative branches
• multiply or divide support.
Despite these limitations it has good performance and code
density, and reliably runs a complex program.
II. RELATED WORK
While there have been many CPUs for Forth, three current
designs stand out as options for embedded FPGA cores:
MicroCore [1] is a popular configurable processor core
targeted at FPGAs. It is a dual-stack Harvard architecture,
encodes instructions in 8 bits, and executes one instruction
in two system clock cycles. A call requires two of these
instructions: a push literal followed by a branch to Top-
of-Stack (TOS). A 32-bit implementation with all options
enabled runs at 25 MHz (12.5 MIPS) in a Xilinx
Spartan-2S FPGA.
b16-small [2], [3] is a 16-bit RISC processor. In addition
to dual stacks, it has an address register A, and a carry flag C.
Instructions are 5 bits each, and are packed 1-3 in each word.
Byte memory access is supported. Instructions execute at a
rate of one per cycle, except memory accesses and literals
which take one extra cycle. The b16 assembly language
resembles Chuck Moore’s ColorForth. FPGA implementations
of b16 run at 30 MHz.
eP32 [4] is a 32-bit RISC processor with deep return and
data stacks. It has an address register (X) and status register
(T). Instructions are encoded in six bits, hence each 32-bit
word contains five instructions. Implemented in TSMC’s
0.18µm CMOS standard library the CPU runs at 100 MHz,
providing 100 MIPS if all instructions are short. However a
jump or call instruction causes a stall as the target instruction
is fetched, so these instructions operate at 20 MIPS.
III. THE J1 CPU
A. Architecture
This description follows the convention that the top of
stack is T, the second item on the stack is N, and the top
of the return stack is R.
J1’s internal state consists of:
• a 33 deep × 16-bit data stack
• a 32 deep × 16-bit return stack
• a 13-bit program counter
There is no other internal state: the CPU has no condition
flags, modes or extra registers.
Memory is 16 bits wide and addressed in bytes. Only
aligned 16-bit memory accesses are supported: byte memory
access is implemented in software. Addresses 0-16383 are
RAM, used for code and data. Locations 16384-32767 are
used for memory-mapped I/O.
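As an illustration of how byte access can be layered on aligned 16-bit accesses, the following Python sketch models RAM as a list of 16-bit words. It is not the J1 code: the helper names (fetch16, store16, fetch8, store8) are hypothetical, and the little-endian byte order within a word is an assumption, since the text above does not state one.

```python
RAM_BYTES = 16384  # addresses 0-16383 are RAM, per the text above


def fetch16(ram, addr):
    """Aligned 16-bit read; ram is a list of 16-bit words."""
    if addr % 2 != 0 or not 0 <= addr < RAM_BYTES:
        raise ValueError("address must be aligned and inside RAM")
    return ram[addr // 2]


def store16(ram, addr, value):
    """Aligned 16-bit write."""
    if addr % 2 != 0 or not 0 <= addr < RAM_BYTES:
        raise ValueError("address must be aligned and inside RAM")
    ram[addr // 2] = value & 0xFFFF


def fetch8(ram, addr):
    """Byte read built from one aligned 16-bit read."""
    word = fetch16(ram, addr & ~1)
    # ASSUMPTION: little-endian, so the odd address is the high byte.
    return (word >> 8) & 0xFF if addr & 1 else word & 0xFF


def store8(ram, addr, value):
    """Byte write as a read-modify-write of the containing word."""
    aligned = addr & ~1
    word = fetch16(ram, aligned)
    if addr & 1:
        word = (word & 0x00FF) | ((value & 0xFF) << 8)
    else:
        word = (word & 0xFF00) | (value & 0xFF)
    store16(ram, aligned, word)
```

In this model a byte store costs one word read plus one word write, which is the price of omitting 8-bit memory operations from the hardware.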
The 16-bit instruction format (table I) uses an unencoded
hardwired layout, as seen in the Novix NC4016 [5]. Like
many other stack machines, there are five categories of
instructions: literal, jump, conditional jump, call, and ALU.
Literals are 15-bit, zero-extended to 16-bit, and hence use
a single instruction when the number is in the range 0-32767.
To handle numbers in the range 32768-65535, the compiler
follows the immediate instruction with invert. Hence the
majority of immediate loads take one instruction.
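The literal rule above can be sketched in Python as follows. This is a minimal model, not the J1 cross compiler: instructions are represented as symbolic tuples rather than encoded bit patterns, and the function names compile_literal and run are hypothetical.

```python
def compile_literal(n):
    """Return the symbolic instruction sequence that pushes n (0-65535)."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("literal must fit in 16 bits")
    if n <= 0x7FFF:
        # Fits in the 15-bit literal field: one instruction.
        return [("literal", n)]
    # Otherwise push the 16-bit complement, which always fits in
    # 15 bits, and follow it with invert.
    return [("literal", ~n & 0xFFFF), ("invert",)]


def run(ops):
    """Tiny evaluator used only to check the sequences above."""
    stack = []
    for op in ops:
        if op[0] == "literal":
            stack.append(op[1])
        elif op[0] == "invert":
            stack.append(~stack.pop() & 0xFFFF)
    return stack
```

For example, compile_literal(0xFFFF) yields a literal 0 followed by invert, while any value up to 32767 compiles to a single literal instruction.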
All target addresses - for call, jump and conditional branch
- are 13-bit. This limits code size to 8K words, or 16K bytes.
The advantages are twofold. Firstly, instruction decode is
simpler because all three kinds of instructions have the same
format. Secondly, because there are no relative branches,
the cross compiler avoids the problem of range overflow in
resolve.
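The code-size limit follows directly from the width of the target field. The short Python check below spells out the arithmetic; the constant names are illustrative only.

```python
TARGET_BITS = 13  # width of the call/jump/branch target field
WORD_BYTES = 2    # instructions are 16 bits wide

max_code_words = 1 << TARGET_BITS            # 8192 words (8K)
max_code_bytes = max_code_words * WORD_BYTES  # 16384 bytes (16K)
```

The 16384-byte result equals the size of the RAM region at addresses 0-16383, so in this model an absolute 13-bit target can name any instruction word in RAM.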
Conditional branches are often a source of complexity in
CPUs and their associated compiler. J1 has a single instruction
that tests and pops T, and if T = 0 replaces the current
PC with the 13-bit target value. This instruction is the same
as the 0branch word found in many Forth implementations,
and is of course sufficient to implement the full set of control
structures.
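The behavior of the conditional branch can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the Verilog implementation: the function name zero_branch is hypothetical, the data stack is a Python list whose last element is T, and the fall-through case is assumed to advance the PC by one instruction word.

```python
PC_MASK = (1 << 13) - 1  # the PC is 13 bits wide


def zero_branch(pc, data_stack, target):
    """Pop T; return target if T was zero, else the next PC."""
    if not 0 <= target <= PC_MASK:
        raise ValueError("target must fit in 13 bits")
    t = data_stack.pop()  # T is tested and removed in both cases
    if t == 0:
        return target
    # ASSUMPTION: fall-through advances to the next instruction word.
    return (pc + 1) & PC_MASK
```

A Forth IF ... THEN compiles to one such branch whose target is the instruction after THEN: a zero flag skips the body, and any nonzero flag falls through into it.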
ALU instructions have multiple fields: