
The Mahout team is building the environment dialect in the image of R. The new
Mahout is a Scala-based beast, and all algebraic expressions are now in Scala with an
R-like Scala DSL layered on top.
Initially, Samsara also had a DSL (enabled via a separate import) for a MATLAB-like
dialect, but unfortunately Scala's operator support made it difficult to implement the
entire MATLAB operator set verbatim. As a result, this work received much less attention.
Instead, we focused on the R side of things.
The goal is for the Mahout DSL to be easily readable by R programmers. For example,
A %*% B is matrix multiplication, A * B is the element-wise Hadamard product, and methods
like colMeans and colSums follow R naming.
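As a minimal sketch of how this reads in practice, the snippet below applies these
operators to two small in-core matrices. It assumes the Mahout math Scala bindings
(0.10 or later) are on the classpath; the object name RLikeDslSketch, the names a and b,
and the matrix values are made up for illustration.

```scala
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

object RLikeDslSketch {
  // Two small in-core matrices; the values are illustrative only.
  val a = dense((1, 2), (3, 4))
  val b = dense((5, 6), (7, 8))

  val product = a %*% b    // matrix multiplication, as in R
  val hadamard = a * b     // element-wise (Hadamard) product
  val sums = a.colSums     // per-column sums, named as in R
  val means = a.colMeans   // per-column means, named as in R

  def main(args: Array[String]): Unit = {
    println(product)
    println(hadamard)
    println(sums)
    println(means)
  }
}
```

An R programmer would write the same four expressions as a %*% b, a * b, colSums(a) and
colMeans(a), which is about as close as the Scala syntax allows.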
Arguably, math written in an R-like fashion is also easier to understand and maintain
than the same math written in other basic procedural or functional environments.
Mahout Samsara is backend-agnostic.
Indeed, Mahout is not positioning itself as Spark-specific. You can think of it that way if
you use Spark, but if you use H2O, you could think of it as H2O-specific (or, hopefully,
"Apache Flink-specific" in the future) just as easily.
Neither of the above examples contains a single imported Spark (or H2O) dependency.
They are written once but run on any of the supported backends.
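To make that claim concrete, here is a hedged sketch of a backend-independent routine.
It assumes the distributed row matrix (DRM) API of Mahout 0.10+, in particular the
org.apache.mahout.math.drm package; the names GramSketch and gram, and the input drmX,
are hypothetical. Nothing from Spark or H2O is imported: the engine is determined by
whoever creates drmX.

```scala
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

object GramSketch {
  // Computes X'X for a distributed matrix and collects the result in-core.
  // Assumes X has few enough columns for the square X'X to fit in memory.
  def gram(drmX: DrmLike[Int]): Matrix = (drmX.t %*% drmX).collect
}
```

A caller would construct drmX through the bindings of a particular engine and pass it
in; the body of gram itself should not need to change from one supported backend to
another.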
Not every algorithm can be written with this set of backend-independent techniques
of course – there is more on that below. But quite a few can – and the majority can
leverage at least some of these techniques as the backbone. For example, imagine that the
dataset X above is the result of an embarrassingly parallel statistical Monte Carlo
technique (which is also backend-independent), and just like that we perhaps get a
backend-agnostic Gibbs sampler.
Mahout is an add-on to backend functionality.
Mahout is not taking away any capabilities of the backend. Instead, one can think of it as
an "add-on" over, e.g., Spark and all its technologies. The same is true for H2O.
In truth, algebra and statistics alone are not enough to make ends meet. Access to
the Spark RDD API, streaming, functional programming, external libraries, and many
other wonderful things is desirable. In the case of Apache Spark, one can embed algebraic
pipelines by importing Spark-specific capabilities. Import MLlib or GraphX and all
the goodies are available. Import DataFrame (or SchemaRDD) and use the
language-integrated QL, and so on.
But if we want to draw any parallels, MLlib is “off-the-shelf code.” Mahout 0.10+ is
about that, too; but we hope that it is more about “off-the-shelf math” rather than code.
In other words, Mahout 0.10+ is for people who like to experiment and research at scale
using known mathematical constructs, exercise more control over an algorithm, pay
much less attention to the specifics of distributed engines, and potentially
share the outcomes across different operational backends.