Spark NLP: Natural Language Understanding at Scale
Veysel Kocaman, David Talby
John Snow Labs Inc.
16192 Coastal Highway
Lewes, DE 19958, USA
{veysel, david}@johnsnowlabs.com
Abstract
Spark NLP is a Natural Language Processing (NLP) library built on top of
Apache Spark ML. It provides simple, performant & accurate NLP annotations
for machine learning pipelines that can scale easily in a distributed environment.
Spark NLP comes with 1,100+ pretrained pipelines and models in more than 192
languages. It supports nearly all common NLP tasks, and its modules can be used
seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing
9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations
and is the world’s most widely used NLP library in the enterprise.
Keywords: spark, natural language processing, deep learning, tensorflow, cluster
1. Spark NLP Library
Natural language processing (NLP) is a key component in many data science
systems that must understand or reason about text. Common use cases include
question answering, paraphrasing or summarising, sentiment analysis, natural
language BI, language modelling, and disambiguation. Nevertheless, NLP is
almost always just one part of a bigger data processing pipeline, and because of
the nontrivial steps involved, there is a growing need for an all-in-one solution
that eases the burden of text preprocessing at large scale and connects the dots
between the various steps of solving a data science problem with NLP. A good
NLP library should correctly transform free text into structured features and let
users train their own NLP models that can be fed into downstream machine
learning (ML) or deep learning (DL) pipelines with no hassle.
Spark NLP is developed to be a single unified solution for all NLP tasks
and is the only library that can scale up for training and inference in any Spark
cluster, take advantage of transfer learning, and implement the latest and greatest
Preprint submitted to Software Impacts January 27, 2021
arXiv:2101.10848v1 [cs.CL] 26 Jan 2021