models. He concluded the lecture by calling for statistics to be renamed “data
science.”
In 2001, William S. Cleveland published an action plan for creating a
university department in the field of data science (Cleveland 2001). The plan
emphasizes the need for data science to be a partnership between mathematics
and computer science. It also emphasizes the need for data science to be
understood as a multidisciplinary endeavor and for data scientists to learn how to
work and engage with subject-matter experts. In the same year, Leo Breiman
published “Statistical Modeling: The Two Cultures” (2001). In this paper,
Breiman characterizes the traditional approach to statistics as a data-modeling
culture that views the primary goal of data analysis as identifying the (hidden)
stochastic data model (e.g., linear regression) that explains how the data were
generated. He contrasts this culture with the algorithmic-modeling culture that
focuses on using computer algorithms to create prediction models that are
accurate (rather than explanatory in terms of how the data were generated).
Breiman’s distinction between a statistical focus on models that explain the data
versus an algorithmic focus on models that can accurately predict the data
highlights a core difference between statisticians and ML researchers. The
debate between these approaches is still ongoing within statistics (see, for
example, Shmueli 2010). In general, today most data science projects are more
aligned with the ML approach of building accurate prediction models and less
concerned with the statistical focus on explaining the data. So although data
science became prominent in discussions relating to statistics and still borrows
methods and models from statistics, it has over time developed its own distinct
approach to data analysis.
Since 2001, the concept of data science has broadened well beyond that of a
redefinition of statistics. For example, over the past 10 years there has been a
tremendous growth in the amount of data generated by online activity (online
retail, social media, and online entertainment). Gathering and preparing these
data for use in data science projects has resulted in the need for data scientists to
develop the programming and hacking skills to scrape, merge, and clean data
(sometimes unstructured data) from external web sources. Also, the emergence
of big data has meant that data scientists need to be able to work with big-data
technologies, such as Hadoop. In fact, today the role of a data scientist has
become so broad that there is an ongoing debate regarding how to define the
expertise and skills required to carry out this role.
It is, however, possible to list
the expertise and skills that most people would agree are relevant to the role,
which are shown in figure 1. It is difficult for an individual to master all of these
areas, and, indeed, most data scientists usually have in-depth knowledge and real