Programming the Perl DBI
Another common way of storing data within flat files is to use fixed-length records. That is, each
piece of data fits into an exactly sized space in the data file. In this form of
database, no delimiting character is needed between the fields. There's also no need to delimit each
record, but we'll continue to use ASCII line termination as a record delimiter in our examples because
Perl makes it very easy to work with files line by line.
Using fixed-width fields is similar to the way in which data is organized in more powerful database
systems such as an RDBMS. The pre-allocation of space for record data allows the storage manager to
make assumptions about the layout of the data on disk and to optimize accordingly. For our
megalithic data purposes, we could settle on the data sizes of:[8]

Field           Required Bytes
-----           --------------
Name            64
Location        64
Map Reference   16
Type            32
Description     256

[8] The fact that these data sizes are all powers of two has no significance other than to indicate that the authors
are old enough to remember when powers of two were significant and useful sometimes. They generally aren't
anymore.
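
Because every record occupies the same number of bytes, the position of any record can be calculated rather than searched for, which is one of the layout assumptions a storage manager can exploit. The following is a minimal sketch of that idea, assuming the field widths listed above plus a single newline as the record terminator. The helper name read_record, the in-memory file, and the sample values are our own inventions for illustration only.

```perl
use strict;
use warnings;

### Field widths from the table above; the template and helper
### names are illustrative assumptions, not part of any library
my $template    = "A64 A64 A16 A32 A256";
my $record_size = 64 + 64 + 16 + 32 + 256 + 1;   # 432 bytes of fields + "\n"

### Fetch record number $n (counting from zero) without reading
### any of the records that precede it
sub read_record {
    my ( $fh, $n ) = @_;
    seek( $fh, $n * $record_size, 0 ) or return;
    my $got = read( $fh, my $buffer, $record_size );
    return unless defined $got && $got == $record_size;
    return unpack( $template, $buffer );
}

### Build two sample records in memory purely for demonstration;
### the field values are invented placeholders
my $data = '';
$data .= pack( $template, "Avebury", "Wiltshire", "XX 000 000",
               "Stone Circle", "First sample record" ) . "\n";
$data .= pack( $template, "Stonehenge", "Wiltshire", "XX 000 001",
               "Stone Circle", "Second sample record" ) . "\n";

open( my $fh, '<', \$data ) or die "Cannot open in-memory file: $!";
my @fields = read_record( $fh, 1 );
print "Record 1 is $fields[0] in $fields[1]\n";
close( $fh );
```

With delimited fields, by contrast, finding record N means reading and counting every record before it, because the records vary in length.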
Storing the data in this format requires slightly different storage manager logic, although
the standard Perl file I/O functions are still applicable. To search this data for the correct record, we
need to implement a different way of extracting the fields from within each record. For a fixed-length
data file, the Perl function unpack() is perfect. The following code shows how the unpack() function
replaces the split() used above:
### Break up the record data into separate fields
### using the data sizes listed above
( $name, $location, $mapref, $type, $description ) =
unpack( "A64 A64 A16 A32 A256", $_ );
Although fixed-length fields are always the same length, the data that is being put into a particular
field may not be as long as the field. In this case, the extra space will be filled with a character not
normally encountered in the data or one that can be ignored. Usually, this is a space character (ASCII
32) or a nul (ASCII 0).
In the code above, we know that the data is space-packed, and so we remove any trailing spaces from
each field, including the name field, so as not to confuse the search. This is simply done by using the
uppercase A format with unpack(), which strips trailing spaces and nuls from each extracted value.
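
To make the effect of the uppercase A format concrete, here is a small sketch comparing it with the lowercase a format, which returns the field exactly as stored. The sample value and variable names are our own, chosen only for illustration.

```perl
use strict;
use warnings;

### A 16-byte field holding a short value, padded with spaces
my $field = pack( "A16", "Avebury" );

### Uppercase "A" strips the trailing padding on the way out
my ($stripped) = unpack( "A16", $field );

### Lowercase "a" returns all 16 bytes, padding included
my ($raw) = unpack( "a16", $field );

print "A16 gives [$stripped] (", length($stripped), " characters)\n";
print "a16 gives [$raw] (", length($raw), " characters)\n";

### Only the stripped value matches a simple string comparison
print "Match found\n" if $stripped eq "Avebury";
```

Had the fields been extracted with the lowercase format, a test such as `$name eq "Avebury"` would fail because of the nine trailing spaces.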
If you need to choose between delimited fields and fixed-length fields, here are a few guidelines:
The main limitations
The main limitation with delimited fields is the need to add special handling to ensure that
neither the field delimiter nor the record delimiter characters get added into a field value.
The main limitation with fixed-length fields is simply the fixed length. You need to check for
field values being too long to fit (or just let them be silently truncated). If you need to increase
a field width, then you'll have to write a special utility to rewrite your file in the new format
and remember to track down and update every script that manipulates the file directly.
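
The length check mentioned above might look something like the following sketch. It assumes the field widths used earlier in this section and single-byte characters, so that length() counts bytes. The helper name pack_record and the layout table are hypothetical choices of ours. Note that pack() with the A format would otherwise truncate an over-long value without any warning.

```perl
use strict;
use warnings;

### Field names and widths from the table earlier in this section;
### the helper name pack_record is a hypothetical choice
my @layout = (
    [ name        => 64  ],
    [ location    => 64  ],
    [ mapref      => 16  ],
    [ type        => 32  ],
    [ description => 256 ],
);

### Pack a hash of field values into one fixed-length record,
### dying rather than letting pack() truncate an over-long value
sub pack_record {
    my (%value) = @_;
    my $record = '';
    foreach my $field (@layout) {
        my ( $name, $width ) = @$field;
        my $data = defined $value{$name} ? $value{$name} : '';
        die "Value for '$name' is longer than $width bytes\n"
            if length($data) > $width;
        $record .= pack( "A$width", $data );
    }
    return $record;
}

my $record = pack_record( name => "Avebury", location => "Wiltshire" );
print "Record is ", length($record), " bytes long\n";

### An over-long value is reported instead of being cut short
my $ok = eval { pack_record( mapref => "x" x 17 ); 1 };
print "Rejected: $@" unless $ok;
```

Keeping the widths in one table like this also means that a change of field width has a single place to be updated, although existing data files would still need to be rewritten.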
Space
A delimited-field file often uses less space than a fixed-length record file to store the same
data, sometimes very much less space. It depends on the number and size of any empty or
partially filled fields. For example, some field values, like web URLs, are potentially very long
but typically very short. Storing them in a long fixed-length field would waste a lot of space.
While delimited-field files often use less space, they do "waste" space due to all the field
delimiter characters. If you're storing a large number of very small fields then that might tip
the balance in favor of fixed-length records.
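
As a rough illustration of the space trade-off, the sketch below stores the same five values both ways and compares the sizes. It assumes a colon as the field delimiter, a newline as the record delimiter, and the field widths used earlier in this section. The sample values are invented.

```perl
use strict;
use warnings;

### Invented sample values, used only to compare storage sizes
my @fields = ( "Avebury", "Wiltshire", "XX 000 000",
               "Stone Circle", "A large henge" );

### Delimited: the data itself, one delimiter between each pair
### of fields, and one newline to end the record
my $delimited = join( ":", @fields ) . "\n";

### Fixed-length: always 432 bytes of fields plus the newline,
### however short the values happen to be
my $fixed = pack( "A64 A64 A16 A32 A256", @fields ) . "\n";

printf "Delimited record:    %3d bytes\n", length($delimited);
printf "Fixed-length record: %3d bytes\n", length($fixed);
```

The gap narrows as the values approach their field widths, and for records made of many one- or two-byte fields the delimiters themselves can account for a large share of a delimited file.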