data, what purpose those changes serve, and how they help us gain knowledge about the data. It is as much
about deciding what to do with the data as it is about how to do it efficiently.
Statistics is of course also closely related to data science. So closely linked, in fact, that many consider
data science just a fancy word for statistics that looks slightly more modern and sexy. I can’t say that I
strongly disagree with this—data science does sound sexier than statistics—but just as data science is
slightly different from computer science, data science is also slightly different from statistics. Just, perhaps,
somewhat less different than computer science is.
A large part of doing statistics is building mathematical models for your data and fitting the models to
the data in order to learn about it. That is also what we do in data science. As long as the focus is on
the data, I am happy to call statistics data science. If the focus changes to the models and the mathematics,
then we are drifting away from data science into something else—just as if the focus changes from the data
to computations we are drifting from data science to computer science.
Data science is also related to machine learning and artificial intelligence, and again there are huge
overlaps. This is perhaps not surprising, since machine learning has its home both in computer
science and in statistics; when it focuses on data analysis, it is also at home in data science. To be honest, it
has never been clear to me when a mathematical model changes from being a plain old statistical model to
becoming machine learning anyway.
For this book, we are just going to go with my definition and, as long as we are focusing on analyzing
data, we are going to call it data science.
Prerequisites for Reading this Book
In the first seven chapters of this book, the focus is on data analysis and not programming. For those
seven chapters, I do not assume a detailed familiarity with topics such as software design, algorithms, data
structures, and such. I do not expect you to have any experience with the R programming language either.
I do, however, expect that you have had some experience with programming, mathematical modeling, and
statistics.
Programming in R can be quite tricky at times if you are used to scripting languages or object-oriented
languages. R is, at its core, a functional language in which functions generally work on copies of their
data rather than modifying it in place, and while it does
have systems for object-oriented programming, it handles this programming paradigm very differently from
languages you are likely to have seen, such as Java or Python.
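To give a feel for what it means that R does not let you modify data, here is a minimal sketch. The function name set_first_to_zero and the vector x are made up for this illustration. The function appears to change its argument, but it only changes a local copy, so the caller's data is left untouched.

```r
# Hypothetical helper, invented for this illustration:
# it assigns to its argument, but only a local copy is changed.
set_first_to_zero <- function(v) {
  v[1] <- 0   # modifies the copy inside the function
  v           # the modified copy is the return value
}

x <- c(1, 2, 3)
y <- set_first_to_zero(x)

print(x)  # [1] 1 2 3  (the original vector is unchanged)
print(y)  # [1] 0 2 3  (the result is a new vector)
```

If you come from Java or Python, where a method can change an object it is handed, this is the behavior that tends to surprise you first. In R, you get the changed data back as a return value and assign it to a variable yourself.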
For the data analysis part of this book, the first seven chapters, we will only use R for very
straightforward programming tasks, so none of this should pose a problem. We will have to write simple
scripts for manipulating and summarizing data, so you should be familiar with how to write basic
expressions like function calls, if statements, loops, and so on. These are things you will have to be comfortable
with. I will introduce each such construct in the book when we need it so you will see how it is
expressed in R, but I will not spend much time explaining it. Mostly, I will just expect you to be able to
pick it up from examples.
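As a rough indication of the level I have in mind, the sketch below shows a function call, an if statement, and a loop in R. The numbers and the variable names (values, avg, label, n_above) are invented for this illustration and do not come from any dataset used in the book.

```r
# Made-up measurements, only here for illustration
values <- c(4.2, 5.1, 3.8, 6.0)

# A function call: mean() is part of base R
avg <- mean(values)

# An if statement
if (avg > 4.5) {
  label <- "high"
} else {
  label <- "low"
}

# A for loop counting the values above the average
n_above <- 0
for (v in values) {
  if (v > avg) {
    n_above <- n_above + 1
  }
}

print(label)    # [1] "high"
print(n_above)  # [1] 2
```

If you can read this and roughly guess what each line does, you know enough programming for the data analysis chapters. In practice, the loop would more often be written as the vectorized expression sum(values > avg), which gives the same count.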
Similarly, I do not expect you to know already how to fit data and compare models in R. I do expect that
you have had enough introduction to statistics to be comfortable with basic terms like parameter estimation,
model fitting, explanatory and response variables, and model comparison. If not, I expect you to be at least
able to pick up what we are talking about when you need to.
I won’t expect you to know a lot about statistics and programming, but this isn’t Data Science for
Dummies, so I do expect you to be able to figure out examples without me explaining everything in detail.
After the first seven chapters comes a short description of a data analysis project that one of my students did
in an earlier class. It shows how such a project could look, but I suggest that you do not wait until you have
finished the first seven chapters to start doing such analysis yourself. To get the most benefit out of reading
this book, you should be applying what you learn continuously. As soon as you begin reading, I suggest
that you find a dataset that you would be interested in finding out more about and then apply what you learn
in each chapter to that data.