Applied Text Analysis with Python
Chapter 1. Language and Computation
Applications that leverage natural language processing techniques to understand
human-generated text and audio data are becoming fixtures of our lives. These
are the applications that curate the myriad of human-generated information on
the web specifically on our behalf, offering new and personalized mechanisms of
human-computer interaction.
We have grown used to these applications serving in a variety of roles, from
spam filters that groom our email traffic to search engines that tailor results
with personally relevant terms. Ironically, while opportunities to inject
language-based features into applications continue to multiply, the emergence of
developers skilled in their construction does not seem to have kept pace. This is
at least in part because as these features become increasingly prevalent, they
also become increasingly invisible. But it is also because the rising tide of data
science has not yet permeated the prevailing culture of software development.
The Data Science Paradigm
Thanks to innovations in machine learning and scalable data processing, the past
decade has seen “data science” and “data product” rapidly become household
terms. The resulting new role of “data scientist” (one part statistician, one part
computer scientist, and one part domain expert) has become one of the most
significant jobs of the 21st century.
The work paradigm that is growing up around the practice of data science is one
of research and experimentation, in part because many data scientists have
previously spent time in postgraduate studies, and in part because the process
of data science development is necessarily experimental. As a result, data
scientists and data science departments often operate autonomously from the
development team, producing business analytics for senior management, which
may then inform changes to the technology or product strategy and eventually be
passed on to the development team for implementation.
While this organizational structure may be effective in some cases, it is not
particularly efficient. As we can see in Figure 1-1, if data scientists were
integrated in the development team from the start, improvements to the product
would be much more immediate, and the company much more competitive. The
challenge is that the data science workflow is not always compatible with
software development practices. Data can be unpredictable and signal is not
guaranteed. As Hilary Mason says of data product development, data science isn’t
always particularly agile.
Figure 1-1. Towards a Better Paradigm for Data Science
One of the consequences of the current data science paradigm is that most of the
published resources on machine learning and natural language processing are
written in ways that support research, but do not scale well to applications
development. For instance, while there are a number of excellent tools for
machine learning on text, the available resources, documentation, tutorials, and
blog posts tend to lean heavily on toy datasets, data exploration tools, and
research code. Few resources exist to explain, for example, how to build a
sufficiently large corpus to support an application, how to manage its size and
structure as it grows over time, or how to transform raw documents into usable
data. In practice, this is unquestionably the majority of the work of building
language-based data products.
This book is intended to bridge this gap by empowering a development-oriented
approach to text analytics. In it we will demonstrate how to leverage the
available open source technologies to create data products that are modular,
testable, tunable, and scalable. Together with these tools, we hope the applied
techniques presented in this book will enable data scientists to build the next
generation of data products.
In this chapter, we will begin by framing what we mean by language-aware data
products and talk about how to begin spotting them in the wild. Next, we’ll
discuss architectural design patterns that are well suited to text analytics
applications. Finally, we’ll consider some of the unique challenges of working
with text data, and provide a short overview of some of the primary open source
libraries we’ll be using throughout the book.
Language-Aware Data Products
Data products are those that derive their value from data and generate new data
in return. We view applied text analytics as the creation of “language-aware
data products”: user-facing applications that are responsive to human
input and can adapt to change, and that are not only impressively accurate but
also relatively simple to design. At their core, these applications take in text
data as input, parse it into composite parts, compute upon those composites, and
recombine them in a way that delivers a meaningful and tailored end result.
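The parse, compute, and recombine steps described above can be sketched in a few lines of standard-library Python. This is only an illustration under simple assumptions: the function names, the regular-expression tokenizer, and the sample review are hypothetical and are not taken from any particular library or application.

```python
import re
from collections import Counter


def parse(text):
    """Split raw text into lowercase word tokens (the composite parts)."""
    return re.findall(r"[a-z']+", text.lower())


def compute(tokens):
    """Compute upon the parts: count how often each token occurs."""
    return Counter(tokens)


def recombine(counts, n=3):
    """Recombine the results: return the n most frequent tokens."""
    return [word for word, _ in counts.most_common(n)]


# Hypothetical input text, used only to exercise the three steps.
review = "The tofu was great. The service was slow, but the tofu made up for it."
summary = recombine(compute(parse(review)), n=2)
print(summary)
```

A real application would swap each step for something richer (a proper tokenizer, a trained model, a ranked result page), but the overall shape of the pipeline stays the same.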
One of our favorite examples of this is “Yelpy Insights”, a review filtering
technique that leverages a combination of sentiment analysis, significant
collocations, and search techniques to determine if a restaurant is suitable for
vegetarians or particular age groups. This application uses a rich,
application-specific corpus for multiple language processing components and
reveals itself to users in natural and unexpected ways. For example, automatic
identification of significant sentences in reviews and term highlighting allow
restaurant goers to digest a large amount of text easily and make clearer
decisions. Although language analysis is not Yelp’s core business, the impact
that it has on their bottom line is undeniable.
Another simple example of bolt-on language analysis with oversized effects is
the “suggested tag” feature added to sites like Stack Overflow, Netflix, Amazon,
YouTube, and others. Tags are meta information about a question or a post that
are essential for search and recommendations and play a large role in
determining what content is viewed by specific users. They also identify
properties of the content they describe, and can be used to group
content with similar, existing content and to propose named topics for the user
to select during editing. Such features highlight the basic methodology of NLP
applications: clustering similar text into meaningful groups or classifying text
with specific labels, or, said another way, unsupervised and supervised
machine learning.
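As a rough illustration of the supervised side of this idea, the sketch below suggests a tag for a new post by finding the most similar post that already has a tag. It assumes a tiny hypothetical set of tagged posts and uses Jaccard similarity over word sets. This is one simple technique chosen for the example, not a description of how any of the sites above implement the feature.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical labeled examples: (post text, tag).
TAGGED_POSTS = [
    ("How do I read a csv file with pandas", "python"),
    ("Sorting a list of dictionaries in python", "python"),
    ("How to center a div with css flexbox", "css"),
    ("Change the font color of a link in css", "css"),
]


def suggest_tag(text):
    """Suggest the tag of the most similar previously tagged post."""
    query = tokens(text)
    _, best_tag = max(
        TAGGED_POSTS, key=lambda post: jaccard(query, tokens(post[0]))
    )
    return best_tag


print(suggest_tag("Why is my css flexbox div not centered"))
```

The unsupervised counterpart would group untagged posts by the same similarity measure and leave naming the groups to a person or a later step.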
The goal of machine learning is to fit existing data to some model (often called
training), creating a representation of the real world that is able to make
decisions or generate predictions on new data based on discovered patterns. In
practice, this is done by selecting a model family that determines the
relationship between the target data and the input, specifying a form that
includes parameters and features, then using some optimization procedure to
minimize the error of the model on the training data. The fitted model can now
be introduced to new data on which it will make a prediction, returning labels,
probabilities, membership, or values based on the model form. The challenge is
to strike a balance between being able to precisely learn the patterns in the
known data and being able to generalize so the model performs well on examples
it has never seen before.
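To make these steps concrete, here is a minimal sketch that uses a deliberately simple stand-in: the model family is a straight line, the parameters are a slope and an intercept, the optimization procedure is gradient descent on mean squared error, and generalization is checked on held-out points. The data values, learning rate, and function names are assumptions chosen for illustration. Text models follow the same pattern with different features and model forms.

```python
def fit_line(xs, ys, lr=0.01, epochs=5000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data, roughly following y = 2x + 1 with noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.1, 2.9, 5.2, 6.8, 9.1]
# Held-out points the model never sees during fitting.
test_x = [5.0, 6.0]
test_y = [11.0, 13.1]

w, b = fit_line(train_x, train_y)
print("training error:", mse(w, b, train_x, train_y))
print("held-out error:", mse(w, b, test_x, test_y))
```

Comparing the two error values is the simplest way to see the balance described above: a model that scores well on the training points but poorly on the held-out ones has memorized rather than generalized.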
Machine learning is central to applied data science. While applications that
perform natural language processing have been around for several decades
(remember Clippy?), the addition of machine learning enables a degree of
flexibility, and thus responsiveness, that would not otherwise be possible. By
training a machine learning model on an application-specific corpus like
restaurant reviews, the model is able to better highlight terms and meanings that
are associated with that domain rather than the ambiguity implied by general
language. Models can be retrained on new data, target new decision spaces, and
even be fit on a per-user basis without much added programming effort, such that
they can continue to be used and developed as the application changes.
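The idea of training on a domain-specific corpus and then retraining as new data arrives can be sketched with a small multinomial Naive Bayes classifier written in plain Python. The class name, the add-one smoothing choice, and the restaurant-review snippets are all hypothetical and serve only to illustrate the workflow. In practice a library implementation would normally be used instead.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesText:
    """A tiny multinomial Naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def fit(self, texts, labels):
        """Update counts with labeled examples; may be called repeatedly."""
        for text, label in zip(texts, labels):
            words = self.tokenize(text)
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)
        return self

    def predict(self, text):
        """Return the label with the highest log-posterior score."""
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical domain-specific corpus of restaurant reviews.
model = NaiveBayesText().fit(
    [
        "great vegetarian options and friendly staff",
        "the tofu curry was delicious",
        "slow service and cold food",
        "overpriced and the food was bland",
    ],
    ["positive", "positive", "negative", "negative"],
)

# Retraining on new data is just another call to fit().
model.fit(["tiny portions for the price"], ["negative"])
print(model.predict("the portions were tiny"))
```

Because the model is nothing more than a set of counts, the same class could be fit separately per user or per category without changes to the surrounding application code.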
To adapt and respond to new and different contexts and users, applications that
implement predictive models must be generalizable and adaptable. Data products