1 Introduction
Reading data into a statistical system for analysis and exporting the results to some other system
for report writing can be frustrating tasks that can take far more time than the statistical analysis
itself, even though most readers will find the latter far more appealing.
This manual describes the import and export facilities available either in R itself or via
packages which are available from CRAN or elsewhere.
Unless otherwise stated, everything described in this manual is (at least in principle) available
on all platforms running R.
In general, statistical systems like R are not particularly well suited to manipulations of
large-scale data. Some other systems are better than R at this, and part of the thrust of
this manual is to suggest that rather than duplicating functionality in R we can make another
system do the work! (For example, Therneau & Grambsch (2000) commented that they preferred
to do data manipulation in SAS and then use package survival
(https://CRAN.R-project.org/package=survival) in S for the analysis.) Database manipulation systems are often very
suitable for manipulating and extracting data: several packages to interact with DBMSs are
discussed here.
There are packages to allow functionality developed in languages such as Java, perl and
python to be directly integrated with R code, making the use of facilities in these languages even
more appropriate. (See the rJava (https://CRAN.R-project.org/package=rJava) package
from CRAN and the SJava, RSPerl and RSPython packages from the Omegahat project,
http://www.omegahat.net.)
It is also worth remembering that R, like S, comes from the Unix tradition of small re-usable
tools, and it can be rewarding to use tools such as awk and perl to manipulate data before
import or after export. The case study in Becker, Chambers & Wilks (1988, Chapter 9) is an
example of this, where Unix tools were used to check and manipulate the data before input to
S. The traditional Unix tools are now much more widely available, including for Windows.
This manual was first written in 2000, and the number and scope of R packages have increased
a hundredfold since. For specialist data formats it is worth searching to see if a suitable package
already exists.
1.1 Imports
The easiest form of data to import into R is a simple text file, and this will often be acceptable for
problems of small or medium scale. The primary function to import from a text file is scan, and
this underlies most of the more convenient functions discussed in Chapter 2 [Spreadsheet-like
data], page 8.
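As a minimal sketch of the point above, the following reads the same small text file once with scan and once with read.table, one of the more convenient functions built on it. The file contents and the column names (height, weight) are invented purely for illustration.

```r
# Write a small whitespace-separated file to a temporary location.
# The data and column names are hypothetical.
tmp <- tempfile(fileext = ".txt")
writeLines(c("height weight",
             "1.70 65.2",
             "1.82 80.1",
             "1.65 58.7"), tmp)

# scan() reads a flat stream of fields; 'what' gives a template for the
# type of each field in a record, and 'skip = 1' passes over the header line.
vals <- scan(tmp, what = list(height = 0, weight = 0), skip = 1, quiet = TRUE)

# read.table() returns a data frame and can take the column names
# from the header line.
df <- read.table(tmp, header = TRUE)

unlink(tmp)
```

Here vals is a list of two numeric vectors, while df is a data frame holding the same values; for rectangular data the latter is usually the more convenient form.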
However, all statistical consultants are familiar with being presented by a client with a
memory stick (formerly, a floppy disc or CD-R) of data in some proprietary binary format,
for example ‘an Excel spreadsheet’ or ‘an SPSS file’. Often the simplest thing to do is to use
the originating application to export the data as a text file (and statistical consultants will
have copies of the most common applications on their computers for that purpose). However,
this is not always possible, and Chapter 3 [Importing from other statistical systems], page 14,
discusses what facilities are available to access such files directly from R. For Excel spreadsheets,
the available methods are summarized in Chapter 9 [Reading Excel spreadsheets], page 29.
In a few cases, data have been stored in a binary form for compactness and speed of access.
One application of this that we have seen several times is imaging data, which are normally stored
as a stream of bytes as represented in memory, possibly preceded by a header. Such data formats
are discussed in Chapter 5 [Binary files], page 22, and Section 7.5 [Binary connections], page 26.
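To make the idea of a header followed by a stream of bytes concrete, here is a minimal sketch using readBin and writeBin. The layout (two 4-byte little-endian integers giving rows and columns, then one unsigned byte per pixel) is an assumption made up for this example and does not correspond to any particular imaging format.

```r
# Hypothetical layout: a header of two 4-byte integers (rows, columns),
# followed by one unsigned byte per pixel.
tmp <- tempfile(fileext = ".bin")
con <- file(tmp, "wb")
writeBin(c(2L, 3L), con, size = 4, endian = "little")  # header
writeBin(as.raw(c(0, 50, 100, 150, 200, 250)), con)    # pixel bytes
close(con)

# Read the header first, then use it to decide how many bytes follow.
con <- file(tmp, "rb")
dims <- readBin(con, what = "integer", n = 2, size = 4, endian = "little")
pix  <- readBin(con, what = "integer", n = prod(dims), size = 1, signed = FALSE)
close(con)

# Arrange the pixel values as a matrix (filled column by column).
img <- matrix(pix, nrow = dims[1], ncol = dims[2])
unlink(tmp)
```

For a real file, the byte order, the size of each item and whether the pixels are stored by row or by column would all have to be taken from the format's documentation.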
For much larger databases it is common to handle the data using a database management
system (DBMS). There is once again the option of using the DBMS to extract a plain file, but