Siddiqa et al. / Front Inform Technol Electron Eng 2017 18(8):1040-1070
1040
Big data storage technologies: a survey
Aisha SIDDIQA†‡1, Ahmad KARIM2, Abdullah GANI1
(1Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia)
(2Department of Information Technology, Bahauddin Zakariya University, Multan 60000, Pakistan)
†E-mail: aasiddiqa@gmail.com
Received Dec. 8, 2015; Revision accepted Mar. 28, 2016; Crosschecked Aug. 8, 2017
Abstract: There is a great thrust in industry toward the development of more feasible and viable tools for storing the fast-growing volume, velocity, and diversity of data, termed 'big data'. The structural shift of the storage mechanism from traditional data management systems to NoSQL technology is due to the intention of fulfilling big data storage requirements. However, the available big data storage technologies fail to provide consistent, scalable, and available solutions for continuously growing heterogeneous data. Storage is the preliminary process of big data analytics for real-world applications such as scientific experiments, healthcare, social networks, and e-business. So far, Amazon, Google, and Apache are some of the industry standards in providing big data storage solutions, yet the literature does not report an in-depth survey of the storage technologies available for big data that investigates the performance and magnitude gains of these technologies. The primary objective of this paper is to conduct a comprehensive investigation of state-of-the-art storage technologies available for big data. A well-defined taxonomy of big data storage technologies is presented to assist data analysts and researchers in understanding and selecting a storage mechanism that better fits their needs. To evaluate the performance of different storage architectures, we compare and analyze the existing approaches using Brewer's CAP theorem. The significance and applications of storage technologies and their support to other categories are discussed. Several future research challenges are highlighted with the intention to expedite the deployment of a reliable and scalable storage system.
Key words: Big data; Big data storage; NoSQL databases; Distributed databases; CAP theorem; Scalability; Consistency-partition resilience; Availability-partition resilience
http://dx.doi.org/10.1631/FITEE.1500441    CLC number: TP311.13
1 Introduction
Nowadays, big data is the frontier topic for researchers, as it refers to rapidly increasing amounts of data gathered from heterogeneous devices (Chen and Zhang, 2014). Sensor networks, scientific experiments, websites, and many other applications produce data in various formats (Abouzeid et al., 2009). The tendency to shift from structured to unstructured data (Subramaniyaswamy et al., 2015) makes traditional relational databases unsuitable for storage. This inadequacy of relational databases motivates the development of efficient distributed storage mechanisms. Provision of highly scalable, reliable, and efficient storage for dynamically growing data is the main objective in deploying a tool for big data storage (Oliveira et al., 2012). Thus, innovative development of storage systems with improved access performance and fault tolerance is required.
Frontiers of Information Technology & Electronic Engineering
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online); E-mail: jzus@zju.edu.cn
‡ Corresponding author
ORCID: Aisha SIDDIQA, http://orcid.org/0000-0002-1016-758X
© Zhejiang University and Springer-Verlag GmbH Germany 2017

Big data has influenced research, management, and business perspectives and has drawn the attention of data solution providers toward the deployment of satisfactory technologies for big data storage (Sakr et al., 2011). Relational databases have been very efficient for intensive amounts of data, in terms of storage and retrieval processes, for many decades (Vicknair et al., 2010). However, the advent and accessibility of Internet technology to the public has turned the structure of data towards being schema-less, interconnected, and rapidly growing. Apart from that,
the complexity of data generated by web resources does not allow the use of relational database technologies for analyzing image data (Xiao and Liu, 2011). The exponential growth, lack of structure, and variety in data types bring storage and analysis challenges for traditional data management systems (Deka, 2014). Transformation of big data structures to relational data models, strictly defined relational schemas, and complex procedures for simple tasks are the rigid features of relational databases (Hecht and Jablonski, 2011), which are not acceptable for big data.
NoSQL technologies introduce flexible, schema-less data models and horizontal scalability (Gorton and Klein, 2015). These databases aim to ease the scalability and management of large-volume data (Padhye and Tripathi, 2015). NoSQL databases offer a certain level of transaction handling, which makes them adequate for social networking, e-mail, and other web-based applications. To improve the accessibility of data to its users, data are distributed and replicated at more than one site. Replication at the same site supports data recovery in case of damage, and it also contributes to high availability when replicas are created at different geographic locations (Tanenbaum and van Steen, 2007; Turk et al., 2014). Consistency is another aspect of distributed storage systems: when data have multiple copies, keeping the data up to date at each site becomes more challenging. Brewer (2012) pointed out that preferring either availability or consistency is a common design objective for distributed databases, whereas network partitions are rare.
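The tradeoff Brewer describes can be made concrete with a toy sketch (all class and mode names here are invented for illustration, not drawn from the paper): when a replica loses contact with its peers, it must either reject writes to remain consistent or accept them and risk divergence.

```python
class Replica:
    """Toy replica illustrating the CAP tradeoff under a network partition."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode          # "CP" prefers consistency, "AP" availability
        self.data = {}
        self.partitioned = False  # True while the network partition lasts

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # Consistent system: refuse the write rather than diverge
            # from the unreachable peers (availability is sacrificed).
            return False
        # Available system: accept the write; replicas may now disagree
        # until the partition heals and changes are reconciled.
        self.data[key] = value
        return True

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
print(cp.write("k", 1))  # False: write rejected to preserve consistency
print(ap.write("k", 1))  # True: write accepted at the cost of consistency
```

During normal operation (no partition) both modes behave identically, which matches Brewer's observation that partitions are rare and the tradeoff only bites when one occurs.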
To date, NoSQL technologies have been widely deployed and surveyed in the literature, yet the state of the art does not provide an in-depth investigation into the features and performance of NoSQL technologies. For instance, Sakr et al. (2011) presented a survey highlighting the features and challenges of a few NoSQL databases to deploy on the cloud. Deka (2014) surveyed 15 cloud-based NoSQL databases to analyze read/write optimization, durability, and reliability. Han et al. (2011) described seven NoSQL databases under the key-value, column-oriented, and document categories at an abstract level and classified them with the CAP theorem. Similarly, Chen et al. (2014) surveyed nine databases under the same three categories as described by Han et al. (2011) in the storage section of their survey. Another significant contribution in reviewing big data storage systems was made by Chen et al. (2014), who explained the issues related to massive storage, distributed storage, and big data storage. Their review also covers some well-known database technologies under the key-value, column-oriented, and graph data models and categorizes them with Brewer's CAP theorem. However, these three studies did not cover a large number of NoSQL databases in the key-value, column-oriented, and document categories. Moreover, graph databases are not considered in these studies. In contrast, Vicknair et al. (2010) and Batra and Tyagi (2012) studied Neo4j, a graph database, in comparison with a relational database to observe full-text character searches, security, data scalability, and other data provenance operations. Zhang and Xu (2013) highlighted the challenges (i.e., volume, variety, velocity, value, and complexity) related to the storage of big data over distributed file systems from a different perspective. However, their survey did not aim to report the performance of existing NoSQL databases for the described challenges. Many other comparative studies in the literature analyze the performance of some specific category or limited features of NoSQL databases. However, the state of the art does not report any detailed investigation of a vast set of performance metrics covering a large number of storage technologies for big data.
Therefore, in this paper we highlight the features of distributed database technologies available for big data and present a more comprehensive review. We thoroughly study 26 NoSQL databases in this survey and investigate their performance. Moreover, we describe a number of recent, widely used storage technologies for big data under each data model, i.e., key-value, column-oriented, document, and graph, along with their licensing. In addition, we strengthen our analysis with a discussion of the existing and recent trends around Brewer's theorem. In accordance with Brewer's recent explanation of distributed system characterization, we highlight each NoSQL database as either consistent or highly available. The remarkable contribution of our work is therefore to help big data analysts choose, from a vast variety of databases, a storage option with a better tradeoff between consistency and availability. Furthermore, this study helps researchers understand and leverage an optimal storage solution for their future research work.
The key objectives of this paper are: (1) to investigate the storage structures of a wide range of technologies in a big data environment; (2) to highlight the distinctive properties of each storage technology; (3) to develop a taxonomy and evaluate big data storage technologies according to the well-known Brewer theorem presented for distributed systems; and (4) to identify the challenges and research directions for coping with big data storage in the future.
The rest of the paper is organized as follows: Section 2 describes the evolution of big data storage technologies and their distinctive features over relational databases; contemporary storage technologies for big data are also detailed there. Section 3 presents the taxonomy and a categorization based on the adopted data model and licensing. Section 4 describes Brewer's CAP theorem for distributed systems along with its new explanation; storage technologies are investigated and analyzed to suggest a type based on Brewer's categorization. Section 5 summarizes the discussion and highlights future research challenges. Section 6 concludes the discussion.
2 Evolution of big data storage technologies
In this section we discuss the technological shift from relational, well-structured databases to non-relational, schema-less storage technologies. The drivers and challenges due to big data are summarized. Moreover, the prominent features of storage technologies designed for big data are highlighted, and the most commonly used big data storage technologies are elaborated upon.
Over the past few decades, relational databases have served as well-structured data management technologies. They have been recommended for performing data management operations on structured data (Deagustini et al., 2013). Datasets such as the Internet Movie Database (IMDB, 2015) and MovieLens (MovieLens, 2015) are available for manipulation using relational databases. Big data and emerging technologies like cloud computing allow data to be captured from interactive and portable devices in various formats (Kaisler et al., 2013; Chen and Zhang, 2014). These data come with the new challenges of fast retrieval, real-time processing, and interpretation over a large volume (Kumar, 2014). Unfortunately, relational databases could not evolve as fast as big data. Moreover, their support for fault tolerance and complex data structures is not satisfactory for heterogeneous data (Skoulis et al., 2015). Furthermore, the schema of relational databases does not support frequent changes (Neo4j, 2015). Google, Amazon, and Facebook are some of the well-known web data repositories, and their processing and dynamic scalability requirements are beyond the capabilities of relational databases (Pokorny, 2013). Thus, continuously growing data that come with heterogeneous structures need a better solution. We summarize the comparison between relational databases and big data storage systems in Table 1 using a SWOT analysis.
Big data is essential for enterprises to predict valuable business outcomes. To meet the challenges of big data, NoSQL (not only SQL) databases have emerged as enterprise solutions. NoSQL databases overcome the problems of relational databases and offer horizontally scalable, flexible, highly available, accessible, and relatively inexpensive storage solutions (MacFadden, 2013). Thus, NoSQL databases have become the most widely adopted technologies for storing big data. Unlike relational databases, these technologies support a large number of users interacting with big data simultaneously. NoSQL databases do well in achieving consistency, fault tolerance, availability, and query support (Cattell, 2010). They also guarantee some distinctive features over relational databases: scalability, availability, fault tolerance, consistency, and secondary indexing.
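The schema flexibility mentioned above can be pictured with a minimal document-style store (a hand-rolled sketch, not the API of any particular NoSQL product): records in the same collection need not share fields, so no schema migration is ever required.

```python
# Minimal document-style store: no predefined schema, so records in the
# same collection may carry different fields (illustrative sketch only).
collection = []

def insert(doc: dict) -> None:
    """Add a document; no schema is declared or enforced."""
    collection.append(doc)

def find(**criteria):
    """Return documents whose fields match all the given criteria."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

insert({"user": "alice", "email": "a@example.com"})
insert({"user": "bob", "followers": 120, "tags": ["ml", "db"]})  # extra fields are fine

print(find(user="bob"))  # queried without ever declaring a schema
```

A relational table would have required every row to fit one declared set of columns; here the second record simply carries fields the first one lacks.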
2.1 Distinctions of big data storage technologies
It is very common to consider scalability, reliability, and availability as the design goals for a big data storage technology. However, it is also observed that consistency and availability influence each other in a distributed system, and one of them is compromised (Diack et al., 2013). Given the nature of big data, a single server is not a wise choice for storage; it is better to configure a cluster of multiple hardware elements as a distributed storage system. To discuss storage technologies for big data, the description of features for distributed NoSQL systems provided in the literature is also significant. For this reason, we explain these features individually. Among them, consistency, availability, and partition resilience are further considered in Section 4 to examine the applicability of Brewer's theorem to current big data storage technologies.
Scalability refers to support for growing volumes of data in such a manner that a significant increase or optimization in storage resources is possible (Putnik et al., 2013). The paradigm shift from batch processing to streaming data processing shows that the data volume is continuously increasing. Referring to our previous work (Gani et al., 2015), the volume of future big data will be at the zettabyte scale, and storage requirements will increase with such volume. As far as the availability of a system is concerned, it suggests quick access to data storage resources (Bohlouli et al., 2013). For this purpose, data are replicated among different servers, which may be placed at the same location or at distant locations, to make the data highly available to users at their nearby sites, thus increasing big data retrieval efficiency (Azeem and Khan, 2012; Wang et al., 2015). In other words, minimum downtime and the promptness of a system for ad hoc access requests define its availability (Oracle, 2015a).
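The nearby-replica idea can be sketched in a few lines (the site names and latency figures below are invented for illustration): a client simply routes reads to the replica with the lowest measured latency.

```python
# Hypothetical replica sites with measured round-trip latencies in ms.
replica_latency = {"kuala-lumpur": 12.0, "frankfurt": 180.0, "oregon": 240.0}

def nearest_replica(latencies: dict) -> str:
    """Route a read to the replica with the lowest observed latency."""
    return min(latencies, key=latencies.get)

print(nearest_replica(replica_latency))  # -> kuala-lumpur
```

Real systems refine this with health checks and load-aware routing, but lowest-observed-latency selection captures why geographic replication improves retrieval efficiency for nearby users.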
Node failure is very common in distributed storage systems. To make a system fault-tolerant, multiple copies of data are created and placed on the same node and/or different nodes of the storage cluster. Replication not only makes the system highly available but is also useful for fault tolerance (Hilker, 2012). Furthermore, NoSQL databases offer very flexible schemas and data relationships and are not as complex as relational databases (Kumar, 2014). Nowadays, data do not comprise only tuples; documents and objects are also part of big data. Therefore, a predefined schema cannot deal with such varying data structures (Cattell, 2010).
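The replica-placement idea above can be sketched as follows (node names and the replication factor are illustrative, not from any specific system): each block is assigned to k distinct nodes, so a single node failure never destroys all copies of a block.

```python
import itertools

def place_replicas(blocks, nodes, k=3):
    """Assign each block to k distinct nodes, round-robin (illustrative)."""
    if k > len(nodes):
        raise ValueError("need at least k nodes for k distinct replicas")
    ring = itertools.cycle(range(len(nodes)))
    placement = {}
    for block in blocks:
        start = next(ring)  # stagger start points to spread load
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(k)]
    return placement

p = place_replicas(["blk-1", "blk-2"], ["n1", "n2", "n3", "n4"], k=3)
print(p["blk-1"])  # three distinct nodes, e.g. ['n1', 'n2', 'n3']
```

Production systems add rack and data-center awareness on top of this, but the invariant is the same: no block's replica set collapses onto one failure domain.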
Regardless of the distribution, storage systems for big data ensure that the data are complete and correct. Changes made by users are committed under defined rules (Oracle, 2015a). Eventual consistency has become a widely adopted mechanism for implementing consistency in NoSQL databases: changes are propagated eventually, and the system becomes consistent after the changes have been propagated. Instantaneous propagation, in contrast, leads to a strongly consistent system, yet it results in frequent access locks. In addition to fault tolerance and consistency, indexing is worthwhile for big data storage. Indexing is a method that improves the performance of data retrieval. In relational databases, primary keys are sufficient to perform search operations (Oracle Secondary, 2015). However, with the advent of big data, which brings new challenges of heterogeneity in data structures, primary-key indexing is not the solution. Secondary indexes are mostly created automatically by the system, using keys other than the primary key.

Table 1 SWOT analysis of relational databases and big data storage systems

Strengths
- Traditional database systems: support highly structured data stored and processed over an auxiliary server; vertical scalability with extendible processing on a server; specialized data manipulation languages; specialized schemas
- Big data storage systems: support heterogeneous structured data; horizontal scalability with extendible commodity servers; support data-intensive applications; simultaneous accessibility; reliability and high availability; high fault tolerance; eventual consistency

Weaknesses
- Traditional database systems: performance bottlenecks; processing delays; increased deadlocks with growth of data; limited storage and processing capacity; co-relations which hinder scalability; expensive join operations for multidimensional data
- Big data storage systems: no compliance with ACID due to scalability and performance

Opportunities
- Traditional database systems: support complex queries; atomicity in complex transactions; built-in deployment support
- Big data storage systems: improved query response times; simplicity in storage structures; data-intensive

Threats
- Traditional database systems: extensive volume of data for storage with dynamic growth; frequently changing schemas; complex data structures; more concurrent access needs; frequent I/O needs; real-time processing needs
- Big data storage systems: consistency of a large number of storage servers; large number of small files; deployment may need community support
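A secondary index of the kind just described can be sketched as a toy in-memory structure (this is an illustration, not any specific database's implementation): alongside the primary-key lookup, the system maintains an extra map from a non-key field's values to the primary keys holding them.

```python
from collections import defaultdict

# Primary store: primary key -> record (toy in-memory sketch).
store = {
    1: {"name": "sensor-a", "city": "Multan"},
    2: {"name": "sensor-b", "city": "Kuala Lumpur"},
    3: {"name": "sensor-c", "city": "Multan"},
}

# Secondary index on the non-key field "city": value -> set of primary keys.
city_index = defaultdict(set)
for pk, record in store.items():
    city_index[record["city"]].add(pk)

def find_by_city(city):
    """Answer a query on a non-key field without scanning the whole store."""
    return [store[pk] for pk in sorted(city_index[city])]

print(find_by_city("Multan"))  # both Multan records, found via the index
```

The cost is that every write must also update the index, which is why such indexes are usually maintained automatically by the system rather than by the application.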
2.2 Contemporary big data storage technologies
Research outcomes of exploring storage technologies for big data advocate different aspects of designing storage mechanisms. These reliable and highly available mechanisms contribute to improving data access performance, and improved data access performance drives better quality of data analysis. These technologies offer scalable storage solutions for growing big data with enhanced data structures and support for fault tolerance. This section provides a brief explanation of storage systems for big data in each category. Current storage technologies are described to review their feasibility for big data in accomplishing the design goals. Their storage structures and outstanding features supporting scalable resource provisioning for big data are also described here.
The Google File System (GFS) is a proprietary system developed by Google Inc. (Ghemawat et al., 2003) to manage data-intensive applications in a distributed manner. It is designed to satisfy the storage needs of steadily growing data as a significant objective, along with the other features provided by contemporary techniques. Current and estimated future workloads were analyzed to develop such a distributed file system. To deal with the commodity component failure problem, GFS facilitates continuous monitoring, ensures error detection, tolerates component faults, and recovers from them automatically. GFS adopts a clustered approach in which files are stored as large (64-MB) chunks; each chunk is further divided into 64-KB blocks, and a 32-bit checksum is stored for each block. As shown in Fig. 1, these checksums are stored on servers as part of the metadata to ensure integrity. Moreover, the chunks are replicated to avoid chunk server faults and to provide availability and reliability (Dean and Ghemawat, 2008). Such systems are intended to handle large-volume data, where many kilobyte-sized files become a challenge for them. However, GFS guarantees support for managing these small files, along with appending new data concurrently to large files even when they are read/write-intensive.
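The per-block checksumming can be illustrated with a short sketch (CRC-32 is used here as a stand-in 32-bit checksum; the exact checksum function GFS uses is not specified above): a chunk is split into 64-KB blocks and a 32-bit value is kept for each, so corruption can be localized to a single block.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB, as in the GFS description above

def block_checksums(chunk: bytes) -> list:
    """Split a chunk into 64-KB blocks and compute a 32-bit checksum per block.

    CRC-32 serves as an illustrative 32-bit checksum function here.
    """
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

chunk = bytes(200 * 1024)            # a 200-KB chunk of zeros
sums = block_checksums(chunk)
print(len(sums))                     # 4 blocks: 64 + 64 + 64 + 8 KB
assert all(s < 2**32 for s in sums)  # each checksum fits in 32 bits
```

On a read, a server recomputes the checksum of just the blocks touched and compares them to the stored values, so a corrupted block is detected without rescanning the whole chunk.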
The Hadoop Distributed File System (HDFS) was developed under the inspiration of GFS. HDFS is a distributed, scalable storage system designed as the core of Apache Hadoop to run on inexpensive commodity hardware; it was initially designed as the infrastructure of Apache Nutch. HDFS is a suitable solution for data-intensive applications, typically at gigabyte to terabyte scales, which require high throughput. Because HDFS comprises a large number of components, there is a non-negligible probability of block failure or nonfunctioning (Borthakur, 2008), so HDFS provides quick fault detection and automatic recovery. Block replication is offered to avoid node failure and the unavailability or loss of data (Shvachko, 2010). Replication ensures not only the availability but also the reliability of the system, and it is automatically handled by the HDFS NameNode. Rather than being just a storage layer of Hadoop, HDFS is a standalone distributed file system that helps improve the throughput of the system. HDFS has a namespace-separated architecture: metadata are stored on the master node, called the NameNode, whereas block-split files are stored on a number of DataNodes. The NameNode performs the mapping of data onto DataNodes and namespace operations such as opening, closing, and renaming files. DataNodes fulfill read-write requests and create blocks and replicas. The architecture of HDFS is shown in Fig. 2.
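The namespace separation can be sketched as a toy NameNode that holds only metadata, mapping files to blocks and blocks to DataNode locations, while the bytes themselves would live on the DataNodes (class, method, and node names here are an illustration, not Hadoop's actual API).

```python
class ToyNameNode:
    """Metadata-only master: maps files to blocks and blocks to DataNodes."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = min(replication, len(datanodes))
        self.file_blocks = {}   # filename -> [block ids]
        self.block_locs = {}    # block id -> [datanode names]
        self._next_block = 0

    def create(self, filename, n_blocks):
        """Allocate block ids and replica locations; stores no file data."""
        blocks = []
        for i in range(n_blocks):
            bid = f"blk_{self._next_block}"
            self._next_block += 1
            start = i % len(self.datanodes)
            self.block_locs[bid] = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            blocks.append(bid)
        self.file_blocks[filename] = blocks
        return blocks

    def locate(self, filename):
        """Tell a client where each block's replicas live."""
        return {b: self.block_locs[b] for b in self.file_blocks[filename]}

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.create("/logs/app.log", n_blocks=2)
print(nn.locate("/logs/app.log"))
```

Clients then contact the listed DataNodes directly for the block bytes, which is why the NameNode stays off the data path and the system scales its read/write throughput with the number of DataNodes.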
Fig. 1 Google File System (GFS) architecture (Ghemawat et al., 2003)