CHAPTER 1 ■ INTRODUCTION TO BUSINESS ANALYTICS AND DATA ANALYSIS TOOLS
Big Data Is Not Just About Size
Gartner defines the three v’s of big data as volume, velocity, and variety. So far, only the volume aspect of big data has been discussed. The second v, velocity, refers to the speed at which the data is created, and it matters just as much as sheer size.
Consider the familiar example of the CERN Large Hadron Collider experiments: they annually generate 150 million petabytes of data, which works out to roughly 400EB (1EB = 1073741824GB) per day. Walmart, by comparison, processes more than 1 million customer transactions every hour.
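As a quick sanity check on such figures, the stated annual volume can be converted to a per-day figure using the binary definition given above (1EB = 2^30 GB, so 1EB = 2^10 PB). This short sketch simply restates the arithmetic:

```python
# Unit-conversion sanity check for the volumes quoted above.
# The text defines 1 EB = 1073741824 GB (2**30 GB), so 1 EB = 2**10 PB.
PB_PER_EB = 2**10

annual_volume_pb = 150_000_000   # 150 million petabytes per year (figure from the text)
daily_volume_pb = annual_volume_pb / 365
daily_volume_eb = daily_volume_pb / PB_PER_EB

print(f"Daily volume: {daily_volume_pb:,.0f} PB = {daily_volume_eb:,.0f} EB")
# → Daily volume: 410,959 PB = 401 EB
```

The same two constants can be reused to move between any of the byte-unit scales discussed in this chapter.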
The third v is variety. This dimension refers to the range of formats in which the data gets generated. It can be structured or unstructured: numeric data, text, e-mail, customer transactions, audio, and video, to name just a few.
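To make the variety dimension concrete, the following toy sketch (all names and field values are invented for illustration) contrasts a structured transaction record, which can be queried directly, with an unstructured text snippet, which must be processed before it yields information:

```python
# Toy illustration of the "variety" dimension (fields and values are invented).

# Structured data: a fixed schema, directly queryable.
transaction = {
    "customer_id": 10482,
    "item": "laptop",
    "amount_usd": 799.99,
}

# Unstructured data: free text with no fixed schema.
review = "Great laptop, but the battery barely lasts three hours."

# Structured fields can be used as-is...
total = transaction["amount_usd"]

# ...whereas text needs some processing first (here, a trivial keyword check).
mentions_battery = "battery" in review.lower()

print(total, mentions_battery)   # → 799.99 True
```

Real systems would replace the keyword check with proper text-mining or natural-language processing, but the asymmetry between the two data forms is the same.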
In addition to these three v’s, some like to include veracity when defining big data. Veracity covers the biases, noise, and deviations that are inherent in most big data sets; these are especially common in data generated from social media web sites. The SAS web site also counts data complexity among the factors that define big data.
Gartner’s definition of the three v’s has almost become an industry standard for defining big data.
Sources of Big Data
Some of the big data sources have already been discussed in the earlier sections. Advanced science studies
in environmental sciences, genomics, microbiology, quantum physics, and so on, are the sources of data sets
that may be classified in the category of big data. Scientists are often struck by the sheer volume of data sets
they need to analyze for their research work, and they must continuously devise new ways to store, process, and analyze such data.
Daily customer transactions with retailers such as Amazon, Walmart, and eBay also generate large
volumes of data at amazing rates. This kind of data mainly falls under the category of structured data.
Unstructured text data such as product descriptions, book reviews, and so on, is also involved. Healthcare
systems also add hundreds of terabytes of data to data centers annually in the form of patient records and
case documentations. Global consumer transactions processed daily by credit card companies such as Visa,
American Express, and MasterCard may also be classified as sources of big data.
The United States and other governments are also major sources of data generation, and they need some of the world’s most powerful supercomputers to process that data meaningfully in reasonable time frames. Research projects in fields such as economics and population studies, conducted by the World Bank, the UN, and the IMF, also consume large amounts of data.
More recently, social media sites such as Facebook, Twitter, and LinkedIn have been presenting great opportunities in the field of big data analysis. These sites are now among the biggest data generation sources in the world, and they mainly produce unstructured data: text such as customer responses, conversations, and messages, along with audio clips, videos, and images. Their databases run to hundreds of petabytes.
This data, although difficult to analyze, presents immense opportunities to generate useful insights in areas such as product promotion, trend and sentiment analysis, brand management, and online reputation management for political organizations and individuals, to name a few. Social media analytics is a rapidly growing field, and several startups and established companies are devoting considerable time and energy to this practice. Table 1-1 compares big data to conventional data.
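To give a flavor of the sentiment analysis mentioned above, here is a deliberately simplistic sketch. The word lists and messages are invented; production systems use trained statistical or machine-learning models rather than fixed lexicons:

```python
# Naive lexicon-based sentiment scoring (illustrative only; word lists are invented).
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def sentiment(message: str) -> int:
    """Return the count of positive words minus the count of negative words."""
    words = message.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("Love the new phone, great camera"))   # → 2
print(sentiment("terrible battery, bad support"))      # → -2
```

Even this crude scoring hints at how a stream of customer messages could be aggregated into a brand-level sentiment trend, which is the kind of insight the social media analytics practice pursues at scale.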