Neurocomputing 261 (2017) 184–192
HB-File: An efficient and effective high-dimensional big data storage
structure based on US-ELM
Linlin Ding, Yu Liu, Baishuo Han, Shiwen Zhang, Baoyan Song ∗
School of Information, Liaoning University, Shenyang 110036, China
Article info
Article history:
Received 30 September 2015
Revised 11 June 2016
Accepted 16 June 2016
Available online 16 February 2017
Keywords:
US-ELM
HDFS
Big data
High-dimensional data
Abstract
With the rapid development of computer and Internet technologies, the amount of data in all walks of
life has increased sharply, and large volumes of high-dimensional big data have accumulated, such as
network transaction data, user review data and multimedia data. High-dimensional big data combines the
typical features of both high-dimensional data and big data, which brings new problems and great
challenges for processing and optimizing such data. In this setting, the storage structure of
high-dimensional big data is a critical factor that fundamentally affects processing performance.
However, because of the very large number of dimensions, existing data storage techniques, such as
row-store and column-store, are not well suited to high-dimensional, large-scale data. Therefore, in
this paper, we present an efficient high-dimensional big data storage structure based on US-ELM, the
High-dimensional Big Data File, named HB-File. We then propose a fuzzy clustering algorithm based on
US-ELM that differentiates the key dimensions from the non-key dimensions of high-dimensional big data
and also yields the clusters of the key dimensions. After that, we describe the execution and API of
HB-File on Hadoop, the open-source implementation of MapReduce. Extensive experiments show the
effectiveness of HB-File for storing high-dimensional big data.
© 2017 Published by Elsevier B.V.
1. Introduction
With the rapid development of computers and the improvement
of human cognitive abilities, people's understanding of things
continues to broaden and deepen. Many attributes are derived to
describe things and entities, so high-dimensional data is
generated, such as network transaction data, mine microseism
data, user review data and multimedia data. In particular, with
the arrival of the era of data explosion, many of the data sets
to be processed and analyzed have become “big data”, so more and
more high-dimensional data forms high-dimensional big data. For
example, the number of user comments on Facebook is close to
3.2 billion every day. High-dimensional big data contains
valuable knowledge and information, and is of theoretical
significance with a wide range of applications. In addition to
the four typical characteristics of big data, Volume, Variety,
Value and Velocity, high-dimensional big data also has its own
complex structure and numerous dimensions. That is,
high-dimensional big data combines the typical features of both
high-dimensional data and big data, which brings new problems
and challenges for query processing and optimization. In this
setting, the storage structure of high-dimensional big data is a
critical factor that fundamentally affects processing
performance.
∗ Corresponding author.
E-mail address: bysong@lnu.edu.cn (B. Song).
However, existing storage structures for big data are not
suitable for high-dimensional big data because of its numerous
dimensions. For example, the column-store structure, typified by
HBase [1], is well suited to storing data with sparse columns.
However, because high-dimensional big data has a large number of
dimensions with high coherence among them, using pure
column-store technology would require numerous join operations
among the dimensions to reconstruct the records. If we instead
use a row-store structure, typified by HDFS [2], each record
would be very long because of the many dimensions. Each data
block would then hold only a few records, which reduces storage
efficiency. In short, there is an urgent need for a storage model
that stores high-dimensional big data efficiently.
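The trade-off described above can be illustrated with a rough back-of-the-envelope sketch. It is not taken from the paper: the 128 MiB block size, the fixed 8-byte value width, and the helper names records_per_block and column_joins_to_rebuild are all assumptions chosen for illustration. The join count also assumes a naive layout in which every dimension is stored as a separate column and a record is rebuilt by pairwise joins on a row key, which is a simplification of how a system such as HBase actually groups columns.

```python
def records_per_block(block_bytes: int, num_dims: int, bytes_per_value: int) -> int:
    """Whole row-store records that fit in one block (toy model, no headers)."""
    record_bytes = num_dims * bytes_per_value
    return block_bytes // record_bytes


def column_joins_to_rebuild(num_dims_requested: int) -> int:
    """Pairwise joins needed to stitch k separately stored columns into rows."""
    return max(num_dims_requested - 1, 0)


# Assumed parameters, not values reported in the paper.
BLOCK_BYTES = 128 * 1024 * 1024  # hypothetical block size: 128 MiB
BYTES_PER_VALUE = 8              # hypothetical fixed-width value

for dims in (10, 1_000, 100_000):
    rows = records_per_block(BLOCK_BYTES, dims, BYTES_PER_VALUE)
    joins = column_joins_to_rebuild(dims)
    print(f"{dims:>7} dims: {rows:>9} records per block (row-store), "
          f"{joins:>6} joins per full record (column-store)")
```

Under these assumptions, the number of records per block falls in inverse proportion to the number of dimensions, while the number of joins for a full record grows linearly with it, which is the tension that motivates a storage model between the two extremes.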
Therefore, in this paper, we present an efficient high-
dimensional big data storage model, High-dimensional Big Data
File, named HB-File. First, a table stored high-dimensional big data
http://dx.doi.org/10.1016/j.neucom.2016.06.080
0925-2312/© 2017 Published by Elsevier B.V.