PREFACE xvii
to set this book aside in disappointment. In the next section we’ll suggest some well-written
R introductions.
You should also not expect the definitive guide to web scraping or text mining. First, we
focus on a software environment that was not specifically tailored to these purposes. There
might be applications where R is not the ideal solution for your task and other software
solutions might be more suited. We will not bother you with alternative environments such
as PHP, Python, Ruby, or Perl. To find out if this book is helpful for you, you should ask
yourself whether you are already using or planning to use R for your daily work. If the answer
to both questions is no, you should probably consider your alternatives. But if you already
use R or intend to use it, you can spare yourself the effort to learn yet another language and
stay within a familiar environment.
This book is not, strictly speaking, about data science either. There are excellent introductions
to the topic, such as the recently published books by O’Neil and Schutt (2013), Torgo
(2010), Zhao (2012), and Zumel and Mount (2014). What is occasionally missing in these
introductions is how data for data science applications are actually acquired. In this sense,
our book serves as a preparatory step for data analyses but also provides guidance on how to
manage available information and keep it up to date.
Finally, what you most certainly will not get is the perfect solution to your specific
problem. It is almost inherent in the data collection process that the fields where the data are
harvested are never exactly alike, and sometimes rapidly change shape. Our goal is to enable
you to adapt the pieces of code provided in the examples and case studies to create new pieces
of code to help you succeed in collecting the data you need.
Why R?
There are many reasons why we think that R is a good solution for the problems that are
covered in this book. To us, the most important points are:
1. R is freely and easily accessible. You can download, install, and use it wherever and
whenever you want. There are huge benefits to not being a specialist in expensive
proprietary programs, as you do not depend on the willingness of employers to pay
licensing fees.
2. For a software environment with a primarily statistical focus, R has a large community
that continues to flourish. R is used in various disciplines, by social scientists,
medical scientists, psychologists, biologists, geographers, and linguists, and also in
business. This range allows you to share code with many developers and profit from
well-documented applications in diverse settings.
3. R is open source. This means that you can easily retrace how functions work and
modify them with little effort. It also means that program modifications are not controlled
by an exclusive team of programmers that takes care of the product. Even if you are
not interested in contributing to the development of R, you will still reap the benefits
from having access to a wide variety of optional extensions (packages). The number
of packages is continuously growing and many existing packages are frequently
updated. You can find nice overviews of popular themes in R usage on
http://cran.r-project.org/web/views/.