values, etc.). This data can then be visualized by passing the results to the plot function,
which is designed to visualize the results of this analysis.
In other languages with large scientific computing communities, such as Python, du-
plicating the functionality of lm requires the use of several third-party libraries to rep-
resent the data (NumPy), perform the analysis (SciPy), and visualize the results (mat-
plotlib). As we will see in the following chapters, such sophisticated analyses can be
performed with a single line of code in R.
In addition, as in other scientific computing environments, the fundamental data type
in R is a vector. Vectors can be aggregated and organized in various ways, but at the
core, all data is represented this way. This relatively rigid perspective on data structures
can be limiting, but is also logical given the application of the language. The most
frequently used data structure in R is the data frame, which can be thought of as a
matrix with attributes, an internally defined “spreadsheet” structure, or relational
database-like structure in the core of the language. Fundamentally, a data frame is
simply a column-wise aggregation of vectors that R affords specific functionality to,
which makes it ideal for working with any manner of data.
For all of its power, R also has its disadvantages. R does not scale well
with large data, and although there have been many efforts to address
this problem, it remains a serious issue. For the purposes of the case
studies we will review, however, this will not be an issue. The data sets
we will use are relatively small, and all of the systems we will build are
prototypes or proof-of-concept models. This distinction is important
because if your intention is to build enterprise-level machine learning
systems at the Google or Facebook scale, then R is not the right solution.
In fact, companies like Google and Facebook often use R as their “data
sandbox” to play with data and experiment with new machine learning
methods. If one of those experiments bears fruit, then the engineers will
attempt to replicate the functionality designed in R in a more appropri-
ate language, such as C.
This ethos of experimentation has also engendered a great sense of community around
the language. The social advantages of R hinge on this large and growing community
of experts using and contributing to the language. As Bo Cowgill alludes to, R was
borne out of statisticians’ desire to have a computing environment that met their spe-
cific needs. Many R users, therefore, are experts in their various fields. This includes
an extremely diverse set of disciplines, including mathematics, statistics, biology,
chemistry, physics, psychology, economics, and political science, to name a few. This
community of experts has built a massive collection of packages on top of the extensive
base functions in R. At the time of writing, CRAN, the R repository for packages,
contained over 2,800 packages. In the case studies that follow, we will use many of the
most popular packages, but this will only scratch the surface of what is possible with R.
R for Machine Learning | 3