3 Design Tradeoffs
There are three phases of a disk imaging system: image creation, image distribution, and image installation. Each phase has aspects which must be balanced to fulfill a desired goal. We consider each phase in turn.
3.1 Image Creation
In image creation, the goal is to create a consistent snapshot of a disk or partition in the most efficient way possible.
Source availability: While it is possible for the source of the snapshot to be active during the image creation process, it is more common that it be quiescent to ensure consistency. Quiescence may be achieved either by using a separate partition or disk for the image source or by running the image creation tool in a standalone environment which doesn’t use the source partition. Whatever the technique, the time that the image source is “offline” may be a consideration. For example, an image creation tool which compresses the data as it reads it from the disk may take much longer than one that just reads the raw data and compresses later. However, the former will require much less space to store the initial image.
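The tradeoff can be made concrete with a minimal sketch, not any particular tool’s implementation: it reads a quiescent source device in fixed-size blocks and either compresses inline (longer offline time, small initial image) or dumps raw data for later, offline compression. The function and variable names are illustrative, and zlib stands in for whatever compressor a real tool would use.

    import zlib

    BLOCK = 1 << 20  # read the source in 1 MB blocks

    def capture(source, image, compress_inline=True):
        # 'source' is assumed to be a quiescent disk or partition device.
        comp = zlib.compressobj() if compress_inline else None
        with open(source, "rb") as src, open(image, "wb") as img:
            while True:
                block = src.read(BLOCK)
                if not block:
                    break
                # Inline compression keeps the source offline longer, but
                # the initial image needs far less space than a raw dump.
                img.write(comp.compress(block) if comp else block)
            if comp:
                img.write(comp.flush())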
Degree of compression and data segmentation: Another factor is how much (if any) and what kind of compression is used when creating the image. While compression would seem to be an obvious optimization, there are trade-offs. As mentioned, the time and CPU resources required to create an image are greater when compressing. Compression also impacts the distribution and decompression process. If a disk image is compressed as a single unit and even a single byte is lost during distribution, the decompression process will stall until the byte is acquired successfully. Thus, depending on the distribution medium, images may need to be broken into smaller pieces, each of which is compressed independently. This can make image distribution more robust and image installation more efficient at the expense of sub-optimal compression.
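A minimal sketch of such segmentation, under assumed conventions (a fixed 1 MB chunk size and a simple length prefix per chunk; a real image format would also carry sequence numbers and checksums):

    import struct
    import zlib

    CHUNK = 1 << 20  # segment the image into 1 MB units

    def write_chunked(source, image):
        # Each chunk is compressed independently and prefixed with its
        # compressed length, so a receiver can decompress any chunk it
        # holds without waiting for bytes lost from other chunks.
        with open(source, "rb") as src, open(image, "wb") as img:
            while True:
                data = src.read(CHUNK)
                if not data:
                    break
                cdata = zlib.compress(data)
                img.write(struct.pack("<I", len(cdata)))
                img.write(cdata)

Compressing each unit on its own forfeits redundancy that spans chunk boundaries, which is the sub-optimal compression referred to above.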
Filesystem-aware compression: A stated advantage of disk imaging over techniques that operate at the file level is that imaging requires no knowledge of the contents or semantics of the data being imaged. This matches well with typical file compression tools and algorithms which are likewise ignorant of the data being compressed. However, most disk images contain filesystems and most filesystems have a large amount of available (free) space in them, space that will dutifully be compressed even though the contents are irrelevant. Thus, the trade-off for being able to handle any content is wasted time and space creating the image and wasted time decompressing the image. One common workaround is to zero all the free space in filesystems on the disk prior to imaging, for example, by creating and then deleting a large file full of zeros. This at least ensures maximum compressibility of the free space.
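The workaround can be sketched as follows (a hypothetical helper, not a specific tool: it writes zeros until the filesystem reports it is full, then deletes the file):

    import os

    def zero_free_space(mountpoint):
        # Fill the filesystem with zeros, then delete the file, so that
        # free blocks compress to almost nothing when the disk is imaged.
        path = os.path.join(mountpoint, "zerofill.tmp")
        zeros = b"\0" * (1 << 20)
        try:
            with open(path, "wb") as f:
                while True:
                    f.write(zeros)  # raises OSError (ENOSPC) when full
        except OSError:
            pass
        finally:
            if os.path.exists(path):
                os.remove(path)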
A better solution is to perform filesystem-aware compression. A filesystem-aware compression tool understands the layout of a disk, identifying filesystems and differentiating the important, allocated blocks from the unimportant, free blocks. The allocated blocks are compressed while the free blocks are skipped. Of course, a disk imaging tool using filesystem-aware compression requires even more intimate knowledge of a filesystem than a file-level tool, but the imaging tool need not understand all filesystems it may encounter; it can always fall back on naive compression.
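A minimal sketch of the idea, assuming the tool has already parsed the filesystem’s own metadata into a bit-per-block allocation bitmap (ext2-style); a real image format would also record the disk offset of each saved range so installation can place the data correctly:

    import zlib

    def compress_allocated(disk_path, bitmap, block_size, image_path):
        # 'bitmap' is bytes taken from the filesystem's block allocation
        # map: bit i set means disk block i is allocated (important).
        comp = zlib.compressobj()
        with open(disk_path, "rb") as disk, open(image_path, "wb") as img:
            for i in range(len(bitmap) * 8):
                if bitmap[i // 8] & (1 << (i % 8)):
                    disk.seek(i * block_size)
                    img.write(comp.compress(disk.read(block_size)))
                # free blocks are never read, compressed, or stored
            img.write(comp.flush())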
3.2 Image Distribution
Image distribution is concerned with getting a disk image from a “server” to one or more “clients.” In our context it is assumed that the server and clients are different machines and not just different disks on the same machine. Furthermore, we restrict the discussion to distribution over a network.
Network bandwidth and latency: Perhaps the most important aspect of network distribution is bandwidth utilization. The availability of bandwidth affects how images are created (the degree of compression) as well as how many clients can be supported by a server (scaling). Bandwidth requirements are reduced significantly by using compression. Increased compression not only reduces the amount of data that needs to be transferred, it also slows the consumption rate of the client due to the need to decompress the data before writing it to disk. If image distribution is serialized, with only one client at a time, then compression alone may be sufficient to achieve a target bandwidth. However, if the goal is to distribute an image to multiple clients simultaneously, then typical unicast protocols will need to be replaced with broadcast or multicast. Broadcast works well in environments where all clients in the broadcast domain are involved in the image distribution. If the network is shared, then multicast is more appropriate, ensuring that unaffiliated machines are not affected. Just as in all data transfer protocols, the delay-bandwidth product affects how much data needs to be en route in order to keep clients busy (for example, a 100 Mb/s link with a 10 ms round-trip time requires roughly 125 KB in flight), and the bandwidth and latency influence the granularity of the error recovery protocol.
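As a concrete illustration of the multicast case, the following is a minimal sketch of a client joining a multicast group using standard socket options; the group address and port are assumptions, and a real tool would layer its own error recovery protocol on top.

    import socket
    import struct

    GROUP, PORT = "239.0.0.1", 5000  # hypothetical group and port

    def receive_image():
        # Joining the group means one transmitted stream reaches every
        # participating client; unaffiliated machines never see it.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                           socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        while True:
            packet, _ = sock.recvfrom(65536)
            # ... hand 'packet' to chunk reassembly and decompression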
Network reliability: As alluded to earlier, the error rate of the network may affect how compression is performed. Smaller compression units may limit the effectiveness of the compression, but increase the ability of clients to remain busy in the face of lost packets. More generally, in lossy networks it is desirable to subdivide an image into “chunks” and include with each chunk additional information to make that chunk self-