1. Single-processor Computing
• The size of a floating point number. If the arithmetic unit of a CPU is designed to multiply 8-byte
numbers efficiently (‘double precision’; see section 3.2.2) then numbers half that size (‘single
precision’) can sometimes be processed at higher efficiency, and for larger numbers (‘quadruple
precision’) some complicated scheme is needed. For instance, a quad precision number could
be emulated by two double precision numbers with a fixed difference between the exponents.
These measurements are not necessarily identical. For instance, the original Pentium processor had 64-bit data buses, but was a 32-bit processor. On the other hand, the Motorola 68000 processor (of the original Apple Macintosh) had a 32-bit CPU, but 16-bit data buses.
The first Intel microprocessor, the 4004, was a 4-bit processor in the sense that it processed 4-bit chunks. These days, 64-bit processors are becoming the norm.
1.2.3 Caches: on-chip memory
The bulk of computer memory is in chips that are separate from the processor. However, there is usually a
small amount (typically a few megabytes) of on-chip memory, called the cache. This will be explained in
detail in section 1.3.4.
1.2.4 Graphics, controllers, special purpose hardware
One difference between ‘consumer’ and ‘server’ type processors is that the consumer chips devote considerable real estate on the processor chip to graphics. Processors for cell phones and tablets can even have dedicated circuitry for security or mp3 playback. Other parts of the processor are dedicated to communicating with memory or the I/O subsystem. We will not discuss those aspects in this book.
1.2.5 Superscalar processing and instruction-level parallelism
In the von Neumann model processors operate through control flow: instructions follow each other linearly
or with branches without regard for what data they involve. As processors became more powerful and
capable of executing more than one instruction at a time, it became necessary to switch to the data flow
model. Such superscalar processors analyze several instructions to find data dependencies, and execute
instructions in parallel that do not depend on each other.
This concept is also known as Instruction Level Parallelism (ILP), and it is facilitated by various mechanisms:
• multiple-issue: instructions that are independent can be started at the same time;
• pipelining: as already mentioned, arithmetic units can deal with multiple operations in various stages of completion;
• branch prediction and speculative execution: a processor can ‘guess’ whether a conditional instruction will evaluate to true, and execute those instructions accordingly;
• out-of-order execution: instructions can be rearranged if they are not dependent on each other,
and if the resulting execution will be more efficient;
• prefetching: data can be speculatively requested before any instruction needing it is actually
encountered (this is discussed further in section 1.3.5).
20 Introduction to High Performance Scientific Computing