Siddiqa et al. / Front Inform Technol Electron Eng 2017 18(8):1040-1070
1040
Big data storage technologies: a survey
Aisha SIDDIQA†‡1, Ahmad KARIM2, Abdullah GANI1
(1Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia)
(2Department of Information Technology, Bahauddin Zakariya University, Multan 60000, Pakistan)
†E-mail: aasiddiqa@gmail.com
Received Dec. 8, 2015; Revision accepted Mar. 28, 2016; Crosschecked Aug. 8, 2017
Abstract: There is a great thrust in industry toward the development of more feasible and viable tools for storing the fast-growing volume, velocity, and diversity of data, termed 'big data'. The structural shift of the storage mechanism from traditional data management systems to NoSQL technology is due to the intention of fulfilling big data storage requirements. However, the available big data storage technologies fail to provide consistent, scalable, and available solutions for continuously growing heterogeneous data. Storage is the preliminary process of big data analytics for real-world applications such as scientific experiments, healthcare, social networks, and e-business. So far, Amazon, Google, and Apache are some of the industry standards in providing big data storage solutions, yet the literature does not report an in-depth survey of the storage technologies available for big data that investigates the performance and magnitude gains of these technologies. The primary objective of this paper is to conduct a comprehensive investigation of state-of-the-art storage technologies available for big data. A well-defined taxonomy of big data storage technologies is presented to assist data analysts and researchers in understanding and selecting a storage mechanism that better fits their needs. To evaluate the performance of different storage architectures, we compare and analyze the existing approaches using Brewer's CAP theorem. The significance and applications of storage technologies and their support to other categories are discussed. Several future research challenges are highlighted with the intention to expedite the deployment of a reliable and scalable storage system.
Key words: Big data; Big data storage; NoSQL databases; Distributed databases; CAP theorem; Scalability; Consistency-partition resilience; Availability-partition resilience
http://dx.doi.org/10.1631/FITEE.1500441    CLC number: TP311.13
1 Introduction
Nowadays, big data is the frontier topic for researchers, as it refers to rapidly increasing amounts of data gathered from heterogeneous devices (Chen and Zhang, 2014). Sensor networks, scientific experiments, websites, and many other applications produce data in various formats (Abouzeid et al., 2009). The tendency to shift from structured to unstructured data (Subramaniyaswamy et al., 2015) makes traditional relational databases unsuitable for storage. This inadequacy of relational databases motivates the development of efficient distributed storage mechanisms. Provision of highly scalable, reliable, and efficient storage for dynamically growing data is the main objective in deploying a tool for big data storage (Oliveira et al., 2012). Thus, innovative development of storage systems with improved access performance and fault tolerance is required.
Frontiers of Information Technology & Electronic Engineering
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online); E-mail: jzus@zju.edu.cn
‡ Corresponding author
ORCID: Aisha SIDDIQA, http://orcid.org/0000-0002-1016-758X
© Zhejiang University and Springer-Verlag GmbH Germany 2017

Big data has influenced research, management, and business perspectives and has drawn the attention of data solution providers toward the deployment of satisfactory technologies for big data storage (Sakr et al., 2011). Relational databases have been very efficient for intensive amounts of data, in terms of storage and retrieval processes, for many decades (Vicknair et al., 2010). However, the advent and accessibility of Internet technology to the public has turned the structure of data towards being schema-less, interconnected, and rapidly growing. Apart from that,
the complexity of data generated by web resources does not allow the use of relational database technologies for analyzing image data (Xiao and Liu, 2011). The exponential growth, lack of structure, and variety in data types bring storage and analysis challenges for traditional data management systems (Deka, 2014). Transformation of big data structures to relational data models, strictly defined relational schemas, and complex procedures for simple tasks are the rigid features of relational databases (Hecht and Jablonski, 2011), which are not acceptable for big data.
NoSQL technologies introduce flexible, schema-less data models and horizontal scalability (Gorton and Klein, 2015). These databases aim to ease the scalability and management of large-volume data (Padhye and Tripathi, 2015). NoSQL databases offer a certain level of transaction handling, which makes them adequate for social networking, e-mail, and other web-based applications. To improve the accessibility of data to its users, data are distributed and replicated at more than one site. Replication at the same site supports data recovery in case of damage, and it also contributes to high availability when replicas are created at different geographic locations (Tanenbaum and van Steen, 2007; Turk et al., 2014). Consistency is another aspect of distributed storage systems: when data have multiple copies, keeping the data up to date at each site becomes more challenging. Brewer (2012) pointed out that preferring either availability or consistency is a common design objective for distributed databases, whereas network partitions are rare.
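The tradeoff Brewer describes can be made concrete with a toy sketch (all class and mode names here are invented for illustration, not drawn from the paper): when a replica loses contact with its peers, it must either reject writes to remain consistent or accept them and risk divergence.

```python
class Replica:
    """Toy replica illustrating the CAP tradeoff under a network partition."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode          # "CP" prefers consistency, "AP" availability
        self.data = {}
        self.partitioned = False  # True while the network partition lasts

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # Consistent system: refuse the write rather than diverge
            # from the unreachable peers (availability is sacrificed).
            return False
        # Available system: accept the write; replicas may now disagree
        # until the partition heals and changes are reconciled.
        self.data[key] = value
        return True

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
print(cp.write("k", 1))  # False: write rejected to preserve consistency
print(ap.write("k", 1))  # True: write accepted at the cost of consistency
```

During normal operation (no partition) both modes behave identically, which matches Brewer's observation that partitions are rare and the tradeoff only bites when one occurs.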
To date, NoSQL technologies have been widely deployed and surveyed in the literature, yet the state of the art does not provide an in-depth investigation into the features and performance of NoSQL technologies. For instance, Sakr et al. (2011) presented a survey highlighting the features and challenges of a few NoSQL databases to deploy on the cloud. Deka (2014) surveyed 15 cloud-based NoSQL databases to analyze read/write optimization, durability, and reliability. Han et al. (2011) described seven NoSQL databases under the key-value, column-oriented, and document categories at an abstract level and classified them with the CAP theorem. Similarly, Chen et al. (2014) surveyed nine databases under the same three categories as described by Han et al. (2011) in the storage section of their survey. Another significant contribution in reviewing big data storage systems was made by Chen et al. (2014), who explained the issues related to massive storage, distributed storage, and big data storage. Their review also covers some well-known database technologies under the key-value, column-oriented, and graph data models and categorizes them with Brewer's CAP theorem. However, these three studies did not cover a large number of NoSQL databases in the key-value, column-oriented, and document categories. Moreover, graph databases are not considered in these studies. In contrast, Vicknair et al. (2010) and Batra and Tyagi (2012) studied Neo4j, a graph database, in comparison with a relational database to observe full-text character searches, security, data scalability, and other data provenance operations. Zhang and Xu (2013) highlighted the challenges (i.e., volume, variety, velocity, value, and complexity) related to the storage of big data over distributed file systems from a different perspective. However, their survey did not aim to report the performance of existing NoSQL databases for the described challenges. Many other comparative studies in the literature analyze the performance of some specific category or limited features of NoSQL databases. However, the state of the art does not report any detailed investigation of a vast set of performance metrics covering a large number of storage technologies for big data.
Therefore, in this paper we highlight the features of distributed database technologies available for big data and present a more comprehensive review. We thoroughly study 26 NoSQL databases in this survey and investigate their performance. Moreover, we describe a number of recent, widely used storage technologies for big data under each data model, i.e., key-value, column-oriented, document, and graph, along with their licensing. In addition, we strengthen our analysis with a discussion of the existing and recent trends around Brewer's theorem. In accordance with Brewer's recent explanation of distributed system characterization, we highlight each NoSQL database as either consistent or highly available. The remarkable contribution of our work is therefore to help big data analysts choose, from a vast variety of databases, a storage option with a better tradeoff between consistency and availability. Furthermore, this study helps researchers understand and leverage an optimal storage solution for their future research work.
The key objectives of this paper are: (1) to investigate the storage structures of a wide range of technologies in a big data environment; (2) to highlight the distinctive properties of each storage technology; (3) to develop a taxonomy and evaluate big data storage technologies according to the well-known Brewer theorem presented for distributed systems; and (4) to identify the challenges and research directions for coping with big data storage in the future.
The rest of the paper is organized as follows: Section 2 describes the evolution of big data storage technologies and their distinctive features over relational databases; contemporary storage technologies for big data are also detailed there. Section 3 presents the taxonomy and a categorization based on the adopted data model and licensing. Section 4 describes Brewer's CAP theorem for distributed systems along with its new explanation; storage technologies are investigated and analyzed to suggest a type based on Brewer's categorization. Section 5 summarizes the discussion and highlights future research challenges. Section 6 concludes the discussion.
2 Evolution of big data storage technologies
In this section we discuss the technological shift from relational, well-structured databases to non-relational, schema-less storage technologies. The drivers and challenges due to big data are summarized. Moreover, the prominent features of storage technologies designed for big data are highlighted, and the most commonly used big data storage technologies are elaborated upon.
Over the past few decades, relational databases have served as well-structured data management technologies. They have been recommended for performing data management operations on structured data (Deagustini et al., 2013). Datasets such as the Internet Movie Database (IMDB, 2015) and MovieLens (MovieLens, 2015) are available for manipulation using relational databases. Big data and emerging technologies like cloud computing allow data to be captured from interactive and portable devices in various formats (Kaisler et al., 2013; Chen and Zhang, 2014). These data come with the new challenges of fast retrieval, real-time processing, and interpretation over a large volume (Kumar, 2014). Unfortunately, relational databases could not evolve as fast as big data. Moreover, their support for fault tolerance and complex data structures is not satisfactory for heterogeneous data (Skoulis et al., 2015). Furthermore, the schema of relational databases does not support frequent changes (Neo4j, 2015). Google, Amazon, and Facebook are some of the well-known web data repositories, and their processing and dynamic scalability requirements are beyond the capabilities of relational databases (Pokorny, 2013). Thus, continuously growing data that come with heterogeneous structures need a better solution. We summarize the comparison between relational databases and big data storage systems in Table 1 using a SWOT analysis.
Big data is essential for enterprises to predict valuable business outcomes. To meet the challenges of big data, NoSQL (not only SQL) databases have emerged as enterprise solutions. NoSQL databases overcome the problems of relational databases and offer horizontally scalable, flexible, highly available, accessible, and relatively inexpensive storage solutions (MacFadden, 2013). Thus, NoSQL databases have become the most widely adopted technologies for storing big data. Unlike relational databases, these technologies support a large number of users interacting with big data simultaneously. NoSQL databases do well in achieving consistency, fault tolerance, availability, and query support (Cattell, 2010). They also guarantee some distinctive features over relational databases: scalability, availability, fault tolerance, consistency, and secondary indexing.
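The schema flexibility mentioned above can be pictured with a minimal document-style store (a hand-rolled sketch, not the API of any particular NoSQL product): records in the same collection need not share fields, so no schema migration is ever required.

```python
# Minimal document-style store: no predefined schema, so records in the
# same collection may carry different fields (illustrative sketch only).
collection = []

def insert(doc: dict) -> None:
    """Add a document; no schema is declared or enforced."""
    collection.append(doc)

def find(**criteria):
    """Return documents whose fields match all the given criteria."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

insert({"user": "alice", "email": "a@example.com"})
insert({"user": "bob", "followers": 120, "tags": ["ml", "db"]})  # extra fields are fine

print(find(user="bob"))  # queried without ever declaring a schema
```

A relational table would have required every row to fit one declared set of columns; here the second record simply carries fields the first one lacks.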
2.1 Distinctions of big data storage technologies
It is very common to consider scalability, reliability, and availability as the design goals for a big data storage technology. However, it is also observed that consistency and availability influence each other in a distributed system, and one of them is compromised (Diack et al., 2013). Given the nature of big data, a single server is not a wise choice for storage; it is better to configure a cluster of multiple hardware elements as a distributed storage system. To discuss storage technologies for big data, the description of features for distributed NoSQL systems provided in the literature is also significant. For this reason, we explain these features individually. Among them, consistency, availability, and partition resilience are further considered in Section 4 to examine the applicability of Brewer's theorem to current big data storage technologies.
Scalability refers to support for growing volumes of data in such a manner that a significant increase or optimization in storage resources is possible (Putnik et al., 2013). The paradigm shift from batch processing to streaming data processing shows that the data volume is continuously increasing. Referring to our previous work (Gani et al., 2015), the volume of future big data will be at the zettabyte scale, and storage requirements will increase with such volume. As far as the availability of a system is concerned, it suggests quick access to data storage resources (Bohlouli et al., 2013). For this purpose, data are replicated among different servers, which may be placed at the same location or at distant locations, to make the data highly available to users at their nearby sites, thus increasing big data retrieval efficiency (Azeem and Khan, 2012; Wang et al., 2015). In other words, minimum downtime and the promptness of a system for ad hoc access requests define its availability (Oracle, 2015a).
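The nearby-replica idea can be sketched in a few lines (the site names and latency figures below are invented for illustration): a client simply routes reads to the replica with the lowest measured latency.

```python
# Hypothetical replica sites with measured round-trip latencies in ms.
replica_latency = {"kuala-lumpur": 12.0, "frankfurt": 180.0, "oregon": 240.0}

def nearest_replica(latencies: dict) -> str:
    """Route a read to the replica with the lowest observed latency."""
    return min(latencies, key=latencies.get)

print(nearest_replica(replica_latency))  # -> kuala-lumpur
```

Real systems refine this with health checks and load-aware routing, but lowest-observed-latency selection captures why geographic replication improves retrieval efficiency for nearby users.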
Node failure is very common in distributed storage systems. To make a system fault-tolerant, multiple copies of data are created and placed on the same node and/or different nodes of the storage cluster. Replication not only makes the system highly available but is also useful for fault tolerance (Hilker, 2012). Furthermore, NoSQL databases offer very flexible schemas and data relationships and are not as complex as relational databases (Kumar, 2014). Nowadays, data do not comprise only tuples; documents and objects are also part of big data. Therefore, a predefined schema cannot deal with such varying data structures (Cattell, 2010).
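The replica-placement idea above can be sketched as follows (node names and the replication factor are illustrative, not from any specific system): each block is assigned to k distinct nodes, so a single node failure never destroys all copies of a block.

```python
import itertools

def place_replicas(blocks, nodes, k=3):
    """Assign each block to k distinct nodes, round-robin (illustrative)."""
    if k > len(nodes):
        raise ValueError("need at least k nodes for k distinct replicas")
    ring = itertools.cycle(range(len(nodes)))
    placement = {}
    for block in blocks:
        start = next(ring)  # stagger start points to spread load
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(k)]
    return placement

p = place_replicas(["blk-1", "blk-2"], ["n1", "n2", "n3", "n4"], k=3)
print(p["blk-1"])  # three distinct nodes, e.g. ['n1', 'n2', 'n3']
```

Production systems add rack and data-center awareness on top of this, but the invariant is the same: no block's replica set collapses onto one failure domain.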
Regardless of the distribution, storage systems for big data ensure that the data are complete and correct. Changes made by users are committed under defined rules (Oracle, 2015a). Eventual consistency has become a widely adopted mechanism for implementing consistency in NoSQL databases: changes are propagated eventually, and the system becomes consistent after the changes have been propagated. Instantaneous propagation, in contrast, leads to a strongly consistent system, yet it results in frequent access locks. In addition to fault tolerance and consistency, indexing is worthwhile for big data storage. Indexing is a method that improves the performance of data retrieval. In relational databases, primary keys are sufficient to perform search operations (Oracle Secondary, 2015). However, with the advent of big data, which brings new challenges of heterogeneity in data structures, primary-key indexing is not the solution. Secondary indexes are mostly created automatically by the system, using keys other than the primary key.

Table 1 SWOT analysis of relational databases and big data storage systems

Strengths
- Traditional database systems: support highly structured data stored and processed over an auxiliary server; vertical scalability with extendible processing on a server; specialized data manipulation languages; specialized schemas
- Big data storage systems: support heterogeneous structured data; horizontal scalability with extendible commodity servers; support data-intensive applications; simultaneous accessibility; reliability and high availability; high fault tolerance; eventual consistency

Weaknesses
- Traditional database systems: performance bottlenecks; processing delays; increased deadlocks with growth of data; limited storage and processing capacity; co-relations which hinder scalability; expensive join operations for multidimensional data
- Big data storage systems: no compliance with ACID due to scalability and performance

Opportunities
- Traditional database systems: support complex queries; atomicity in complex transactions; built-in deployment support
- Big data storage systems: improved query response times; simplicity in storage structures; data-intensive

Threats
- Traditional database systems: extensive volume of data for storage with dynamic growth; frequently changing schemas; complex data structures; more concurrent access needs; frequent I/O needs; real-time processing needs
- Big data storage systems: consistency of a large number of storage servers; large number of small files; deployment may need community support
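A secondary index of the kind just described can be sketched as a toy in-memory structure (this is an illustration, not any specific database's implementation): alongside the primary-key lookup, the system maintains an extra map from a non-key field's values to the primary keys holding them.

```python
from collections import defaultdict

# Primary store: primary key -> record (toy in-memory sketch).
store = {
    1: {"name": "sensor-a", "city": "Multan"},
    2: {"name": "sensor-b", "city": "Kuala Lumpur"},
    3: {"name": "sensor-c", "city": "Multan"},
}

# Secondary index on the non-key field "city": value -> set of primary keys.
city_index = defaultdict(set)
for pk, record in store.items():
    city_index[record["city"]].add(pk)

def find_by_city(city):
    """Answer a query on a non-key field without scanning the whole store."""
    return [store[pk] for pk in sorted(city_index[city])]

print(find_by_city("Multan"))  # both Multan records, found via the index
```

The cost is that every write must also update the index, which is why such indexes are usually maintained automatically by the system rather than by the application.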
2.2 Contemporary big data storage technologies
Research outcomes of exploring storage technologies for big data advocate different aspects of designing storage mechanisms. These reliable and highly available mechanisms contribute to improving data access performance, and improved data access performance drives better quality of data analysis. These technologies offer scalable storage solutions for growing big data with enhanced data structures and support for fault tolerance. This section provides a brief explanation of storage systems for big data in each category. Current storage technologies are described to review their feasibility for big data in accomplishing the design goals. Their storage structures and outstanding features supporting scalable resource provisioning for big data are also described here.
The Google File System (GFS) is a proprietary system developed by Google Inc. (Ghemawat et al., 2003) to manage data-intensive applications in a distributed manner. It is designed to satisfy the storage needs of steadily growing data as a significant objective, along with the other features provided by contemporary techniques. Current and estimated future workloads were analyzed to develop such a distributed file system. To deal with the commodity component failure problem, GFS facilitates continuous monitoring, ensures error detection, tolerates component faults, and recovers from them automatically. GFS adopts a clustered approach in which files are stored as large (64-MB) chunks; each chunk is further divided into 64-KB blocks, and a 32-bit checksum is stored for each block. As shown in Fig. 1, these checksums are stored on servers as part of the metadata to ensure integrity. Moreover, the chunks are replicated to avoid chunk server faults and to provide availability and reliability (Dean and Ghemawat, 2008). Such systems are intended to handle large-volume data, where many kilobyte-sized files become a challenge for them. However, GFS guarantees support for managing these small files, along with appending new data concurrently to large files even when they are read/write-intensive.
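The per-block checksumming can be illustrated with a short sketch (CRC-32 is used here as a stand-in 32-bit checksum; the exact checksum function GFS uses is not specified above): a chunk is split into 64-KB blocks and a 32-bit value is kept for each, so corruption can be localized to a single block.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB, as in the GFS description above

def block_checksums(chunk: bytes) -> list:
    """Split a chunk into 64-KB blocks and compute a 32-bit checksum per block.

    CRC-32 serves as an illustrative 32-bit checksum function here.
    """
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

chunk = bytes(200 * 1024)            # a 200-KB chunk of zeros
sums = block_checksums(chunk)
print(len(sums))                     # 4 blocks: 64 + 64 + 64 + 8 KB
assert all(s < 2**32 for s in sums)  # each checksum fits in 32 bits
```

On a read, a server recomputes the checksum of just the blocks touched and compares them to the stored values, so a corrupted block is detected without rescanning the whole chunk.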
The Hadoop Distributed File System (HDFS) was developed under the inspiration of GFS. HDFS is a distributed, scalable storage system designed as the core of Apache Hadoop to run on inexpensive commodity hardware; it was initially designed as the infrastructure of Apache Nutch. HDFS is a suitable solution for data-intensive applications, typically at gigabyte to terabyte scales, which require high throughput. Because HDFS comprises a large number of components, there is a non-negligible probability of block failure or nonfunctioning (Borthakur, 2008), so HDFS provides quick fault detection and automatic recovery. Block replication is offered to avoid node failure and the unavailability or loss of data (Shvachko, 2010). Replication ensures not only the availability but also the reliability of the system, and it is automatically handled by the HDFS NameNode. Rather than being just a storage layer of Hadoop, HDFS is a standalone distributed file system that helps improve the throughput of the system. HDFS has a namespace-separated architecture: metadata are stored on the master node, called the NameNode, whereas block-split files are stored on a number of DataNodes. The NameNode performs the mapping of data onto DataNodes and namespace operations such as opening, closing, and renaming files. DataNodes fulfill read-write requests and create blocks and replicas. The architecture of HDFS is shown in Fig. 2.
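The namespace separation can be sketched as a toy NameNode that holds only metadata, mapping files to blocks and blocks to DataNode locations, while the bytes themselves would live on the DataNodes (class, method, and node names here are an illustration, not Hadoop's actual API).

```python
class ToyNameNode:
    """Metadata-only master: maps files to blocks and blocks to DataNodes."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = min(replication, len(datanodes))
        self.file_blocks = {}   # filename -> [block ids]
        self.block_locs = {}    # block id -> [datanode names]
        self._next_block = 0

    def create(self, filename, n_blocks):
        """Allocate block ids and replica locations; stores no file data."""
        blocks = []
        for i in range(n_blocks):
            bid = f"blk_{self._next_block}"
            self._next_block += 1
            start = i % len(self.datanodes)
            self.block_locs[bid] = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            blocks.append(bid)
        self.file_blocks[filename] = blocks
        return blocks

    def locate(self, filename):
        """Tell a client where each block's replicas live."""
        return {b: self.block_locs[b] for b in self.file_blocks[filename]}

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.create("/logs/app.log", n_blocks=2)
print(nn.locate("/logs/app.log"))
```

Clients then contact the listed DataNodes directly for the block bytes, which is why the NameNode stays off the data path and the system scales its read/write throughput with the number of DataNodes.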
Fig. 1 Google File System (GFS) architecture (Ghemawat et al., 2003)