(see Fig. 4). However, many Big Data analytic platforms, such as SQLstream and Cloudera Impala, still use SQL in their
database systems, because SQL is a more reliable and simpler query language with high performance in real-time analytics
of streaming Big Data.
To store and manage unstructured or non-relational data, NoSQL employs a number of specific approaches. First,
data storage and data management are separated into two independent parts. This is in contrast to relational databases, which
try to address both concerns simultaneously. This design gives NoSQL database systems several advantages. In the
storage part, also called key-value storage, NoSQL focuses on scalable, high-performance data storage.
In the management part, NoSQL provides a low-level access mechanism, so that data management tasks can be implemented
in the application layer rather than having data management logic spread across SQL or DB-specific stored procedure
languages [37]. Therefore, NoSQL systems are very flexible for data modeling, and applications built on them are easy to
update and deploy [60].
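The separation described above can be illustrated with a minimal sketch in Python. It is not taken from any particular NoSQL product: the class names `KeyValueStore` and `UserRepository`, the `user:` key prefix, and the validation rule are all hypothetical and chosen only for illustration. The store knows nothing about record structure, while the management logic lives in application code.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Storage part: an opaque key-value map with no knowledge of record structure."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)


class UserRepository:
    """Management part: rules are written in application code, not in the store."""

    def __init__(self, store: KeyValueStore) -> None:
        self._store = store

    def save(self, user_id: str, profile: Dict[str, Any]) -> None:
        # Hypothetical integrity rule, enforced by the application
        # instead of a stored procedure inside the database.
        if not profile.get("name"):
            raise ValueError("profile must contain a non-empty 'name'")
        self._store.put(f"user:{user_id}", dict(profile))

    def load(self, user_id: str) -> Optional[Dict[str, Any]]:
        return self._store.get(f"user:{user_id}")
```

In this sketch, changing a management rule means changing only `UserRepository`; the storage part is untouched, which is the flexibility the paragraph above refers to.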
Most NoSQL databases share an important property: they are commonly schema-free. The biggest advantage
of a schema-free database is that it enables applications to modify the structure of data quickly, without
rewriting tables. Additionally, it offers greater flexibility when structured data is stored heterogeneously. In the data
management layer, the integrity and validity of the data are enforced. One of the most popular NoSQL databases is Apache Cassandra.
Cassandra, which was once Facebook's proprietary database, was released as open source in 2008. Other NoSQL
implementations include SimpleDB, Google BigTable, Apache Hadoop, MapReduce, MemcacheDB, and Voldemort. Companies that use
NoSQL include Twitter, LinkedIn, and Netflix.
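As a rough illustration of the schema-free property, the following Python sketch stores heterogeneous records side by side. The record fields, the example values, and the helper `emails` are assumptions made up for this example; no specific NoSQL API is implied.

```python
from typing import Any, Dict, List

# A "collection" of schema-free records: each record is just a dict,
# and records need not share the same set of fields.
records: List[Dict[str, Any]] = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace", "email": "grace@example.org"},
]

# A new attribute appears in a later record; older records are left
# untouched and no table has to be rewritten.
records.append({"id": 3, "name": "Linus", "tags": ["kernel"]})


def emails(rows: List[Dict[str, Any]]) -> List[str]:
    """Read a field that only some records carry, skipping those without it."""
    return [row["email"] for row in rows if "email" in row]
```

The cost of this flexibility is that readers such as `emails` must tolerate missing fields, which is why validity checks move into the data management layer.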
3.2.4. Data analysis
The first impression of Big Data is its volume, so the biggest and most important challenge in Big Data analysis
tasks is scalability. In the last few decades, researchers have paid more attention to accelerating analysis algorithms
to cope with increasing volumes of data, and to speeding up processors following Moore's Law. For the former, it is necessary to
develop sampling, on-line, and multiresolution analysis methods [59]. Among Big Data analytical techniques,
incremental algorithms scale well, but this property does not hold for all machine learning algorithms. Some researchers have
devoted themselves to this area [180,72,62]. As data size is scaling much faster than CPU speeds, there is a natural and dramatic
shift [8] in processor technology: although the clock frequency of processors has been doubling in line with Moore's Law,
clock speeds still lag far behind the growth in data size. Instead, processors are being built with increasing numbers of cores.
This shift in processors has led to the development of parallel computing [130,168,52].
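To make the idea of an incremental algorithm concrete, here is a minimal sketch of a running mean and variance that processes one value at a time in constant memory. It uses Welford's update and the pairwise merge formula of Chan et al., which are standard techniques swapped in for illustration and are not taken from the cited works; the names `RunningStats` and `summarize` are hypothetical. The `merge` step shows how partial results computed on separate cores or machines could be combined.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    n: int = 0          # number of values seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        """Incorporate one new value without revisiting earlier data."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two partial summaries, e.g. from two data partitions."""
        n = self.n + other.n
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        """Population variance of the values seen so far (0.0 if none)."""
        return self.m2 / self.n if self.n else 0.0


def summarize(values: Iterable[float]) -> RunningStats:
    """Build a summary for one partition by streaming over its values."""
    stats = RunningStats()
    for v in values:
        stats.update(v)
    return stats
```

For example, `summarize(part_a).merge(summarize(part_b))` yields the same mean and variance (up to floating-point rounding) as a single pass over both partitions, so each partition can be handled by a different core.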
For real-time Big Data applications, such as navigation, social networks, finance, biomedicine, astronomy, intelligent
transport systems, and the Internet of Things, timeliness is the top priority. How can we guarantee a timely response
when the volume of data to be processed is very large? This is still a big challenge for stream processing in Big Data.
It is fair to say that Big Data has not only produced many challenges and changed the direction of hardware development,
but has also done so for software architectures. One result is the shift to cloud computing [50,186,7,48], which aggregates multiple
disparate workloads into a large cluster of processors. In this direction, distributed computing has been developing
rapidly in recent years. We give a more detailed discussion of it in the next section.
Fig. 4. HBase NoSQL database system architecture. Source: Apache Hadoop.
320 C.L. Philip Chen, C.-Y. Zhang / Information Sciences 275 (2014) 314–347