the open source MR implementation. These tools (especially Mahout) are maturing fast and are open source. Mahout has a set of algorithms for clustering and classification, as well as a
very good recommendation algorithm (Konstan and Riedl 2012). Mahout can thus be said to work on big data, with a number of production use cases, mainly for its recommendation system. I have also used Mahout in a production system for realizing recommendation algorithms in the financial domain and found it to be scalable, though not without issues. (I had to
tweak the source significantly.) One observation about Mahout is that it implements only a small subset of ML algorithms over Hadoop: only 25 algorithms are of production quality, and only 8 or 9 of those are usable over Hadoop, meaning scalable over large data sets. These include linear regression, the linear SVM, K-means clustering, and so forth. It does provide a fast sequential implementation of logistic regression with parallelized training (see the sketch below). However, as several others have also noted (see Quora.com, for instance), it does not have implementations of nonlinear SVMs or multinomial logistic regression (otherwise known as the discrete choice model).
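As a concrete illustration of the sequential trainer just mentioned, the following is a minimal sketch of training a binary classifier with Mahout's SGD-based logistic regression (the org.apache.mahout.classifier.sgd package). The toy data, class name, and hyperparameter values here are made up for illustration, and the exact fluent setters can vary across Mahout versions.

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;

    public class MahoutSgdSketch {
        public static void main(String[] args) {
            // Binary classifier (2 categories) over 2 features with an L1 prior;
            // the learning rate and regularization values are illustrative only.
            OnlineLogisticRegression lr =
                new OnlineLogisticRegression(2, 2, new L1())
                    .learningRate(0.1)
                    .lambda(1e-4);

            // Toy training set: label is 1 when the two features sum to more than 1.
            double[][] xs = {{0.1, 0.2}, {0.9, 0.8}, {0.3, 0.1}, {0.7, 0.9}};
            int[] ys = {0, 1, 0, 1};
            for (int epoch = 0; epoch < 100; epoch++) {
                for (int i = 0; i < xs.length; i++) {
                    lr.train(ys[i], new DenseVector(xs[i]));  // one SGD step per example
                }
            }

            // classifyScalar returns the estimated P(label == 1) in the binary case.
            System.out.println(lr.classifyScalar(new DenseVector(new double[]{0.8, 0.9})));
        }
    }

Note that this training loop runs on a single machine; it is sequential code, not a Hadoop job, which is precisely the distinction drawn above.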
Overall, this book is not intended to bash Mahout. However, my point is that it is quite hard to implement certain ML algorithms, including the kernel SVM and conjugate gradient descent (CGD; note that Mahout has an implementation of stochastic gradient descent), over Hadoop. This has been pointed out by several others as well; for instance, see the paper by Professor Srirama (Srirama et al. 2012). This paper makes detailed comparisons between Hadoop and Twister MR (Ekanayake et al. 2010) with regard to iterative algorithms such as CGD and shows that the overheads can be significant for Hadoop. What do I mean by iterative? A computation in which a set of entities perform some work, wait for results from neighbors or other entities, and then start the next iteration. The
CGD is a perfect example of an iterative ML algorithm: each CGD iteration can be broken down into three primitives, daxpy, ddot, and matmul. I will explain these three primitives: daxpy is an operation that takes a vector x, multiplies it by a constant a, and adds another vector y to it; ddot computes the dot product of two vectors x and y; matmul multiplies a matrix by a vector and produces a vector output. Each primitive invocation translates to 1 MR, and because a single CG iteration involves one matmul, two ddots, and three daxpys, this leads to 6 MRs per iteration and eventually hundreds of MRs per CG computation, as well as a few gigabytes (GB) of communication even for small matrices (see the sketch following this paragraph). In essence, the setup cost per iteration (which includes reading from HDFS into memory) overwhelms the computation for that iteration, leading to performance degradation in Hadoop MR. In contrast, Twister distinguishes between static and variable data, allowing static data to remain in memory across MR iterations, and it adds a combine phase for collecting all reduce-phase outputs; hence, it performs significantly better.
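To make the per-iteration cost concrete, here is a minimal, self-contained Java sketch (my own illustration, not Mahout or Twister code) of one CG iteration for solving Ax = b, expressed in terms of the three primitives. Counting the calls shows where the 6 MRs per iteration come from: in a naive Hadoop port, each primitive invocation would become at least one MR job over data re-read from HDFS.

    public final class CgIterationSketch {

        // daxpy: y <- a*x + y (scale x by the constant a and add it to y)
        static void daxpy(double a, double[] x, double[] y) {
            for (int i = 0; i < x.length; i++) y[i] += a * x[i];
        }

        // ddot: the dot product of x and y
        static double ddot(double[] x, double[] y) {
            double s = 0.0;
            for (int i = 0; i < x.length; i++) s += x[i] * y[i];
            return s;
        }

        // matmul: matrix-vector product m * v, producing a vector
        static double[] matmul(double[][] m, double[] v) {
            double[] out = new double[m.length];
            for (int i = 0; i < m.length; i++) out[i] = ddot(m[i], v);
            return out;
        }

        // One CG iteration for Ax = b: 1 matmul + 2 ddots + 3 daxpys = 6 primitive
        // calls. x is the current solution, r the residual, p the search direction;
        // rr = ddot(r, r) is carried over from the previous iteration. Returns the
        // updated rr so the caller can loop until convergence.
        static double iterate(double[][] a, double[] x, double[] r, double[] p, double rr) {
            double[] ap = matmul(a, p);           // primitive 1: matmul
            double alpha = rr / ddot(p, ap);      // primitive 2: ddot
            daxpy(alpha, p, x);                   // primitive 3: x <- x + alpha*p
            daxpy(-alpha, ap, r);                 // primitive 4: r <- r - alpha*A*p
            double rrNew = ddot(r, r);            // primitive 5: ddot
            double[] pNew = r.clone();
            daxpy(rrNew / rr, p, pNew);           // primitive 6: p <- r + beta*p
            System.arraycopy(pNew, 0, p, 0, p.length);
            return rrNew;
        }
    }

In memory, each of these primitives is a trivial loop; the degradation in Hadoop MR comes from the job setup and HDFS I/O wrapped around every such call, which is exactly the overhead Twister avoids by keeping static data (the matrix A) resident across iterations.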
The other second-generation tools are traditional tools that have been scaled to work over Hadoop. The choices in this space include the work done by Revolution Analytics, among others, to scale R over Hadoop, as well as the work to implement a scalable runtime over Hadoop for R programs (Venkataraman et al. 2012). SAS in-memory analytics, part of the High-Performance Analytics toolkit from SAS, is another attempt at scaling a traditional tool by using a Hadoop cluster. However, the recently released version works over Greenplum/Teradata in