O’Reilly-5980006 master October 28, 2010 22:0
Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it
has a name: curiosity. There is always something else to find out and something more to
learn. This book is not the last word on the matter; it is merely a snapshot in time: things I
knew about and found useful today.
“Works are of value only if they give rise to better ones.”
(Alexander von Humboldt, writing to Charles Darwin, 18 September 1839)
Before We Begin
More data analysis efforts seem to go bad because of an excess of sophistication rather
than a lack of it.
This may come as a surprise, but it has been my experience again and again. As a
consultant, I am often called in when the initial project team has already gotten stuck.
Rarely (if ever) does the problem turn out to be that the team did not have the required
skills. On the contrary, I usually find that they tried to do something unnecessarily
complicated and are now struggling with the consequences of their own invention!
Based on what I have seen, two particular risk areas stand out:
• The use of “statistical” concepts that are only partially understood (and given the
relative obscurity of most of statistics, this includes virtually all statistical concepts)
• Complicated (and expensive) black-box solutions when a simple and transparent
approach would have worked at least as well or better
I strongly recommend that you make it a habit to avoid all statistical language. Keep it
simple and stick to what you know for sure. There is absolutely nothing wrong with
speaking of the “range over which points spread,” because this phrase means exactly what
it says: the range over which points spread, and only that! Once we start talking about
“standard deviations,” this clarity is gone. Are we still talking about the observed width of
the distribution? Or are we talking about one specific measure for this width? (The
standard deviation is only one of several that are available.) Are we already making an
implicit assumption about the nature of the distribution? (The standard deviation is only
suitable under certain conditions, which are often not fulfilled in practice.) Or are we even
confusing the predictions we could make if these assumptions were true with the actual
data? (The moment someone talks about “95 percent anything” we know it’s the latter!)
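To make the point about competing width measures concrete, here is a minimal sketch in Python, using only the standard library. It computes four numbers that could each be called the "width" of a data set: the full range, the standard deviation, the interquartile range, and the median absolute deviation. The function name width_measures and the sample values are assumptions invented for illustration; they do not come from any real data set.

```python
import statistics

def width_measures(data):
    """Return several different answers to 'how widely do the points spread?'"""
    xs = sorted(data)
    # Quartile cut points (default 'exclusive' method of statistics.quantiles)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    med = statistics.median(xs)
    return {
        "range": xs[-1] - xs[0],            # largest minus smallest value
        "stdev": statistics.stdev(xs),      # sample standard deviation
        "iqr": q3 - q1,                     # interquartile range
        "mad": statistics.median([abs(x - med) for x in xs]),  # median absolute deviation
    }

# Hypothetical sample: mostly small values plus one far-out point
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]
widths = width_measures(sample)
for name, value in widths.items():
    print(f"{name:>6}: {value:.2f}")
```

For this sample, the single outlying value dominates the range and the standard deviation, whereas the interquartile range and the median absolute deviation barely notice it. All four are legitimate, yet they disagree substantially, which is why it pays to say plainly which one is meant.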
I’d also like to remind you not to discard simple methods until they have been proven
insufficient. Simple solutions are frequently rather effective: the marginal benefit that
more complicated methods can deliver is often quite small (and may be in no reasonable
relation to the increased cost). More importantly, simple methods have fewer
opportunities to go wrong or to obscure the obvious.