Research on Attribute Dimension Partition based on SVM Classifying and MapReduce
Zhao Wenbin 1, Fan Tongrang 1*, Nie Yongchuan 2, Wu Feng 2, Wen Hou 1
1. School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, China
2. Institute of Scientific and Technical Information of Hebei Province, Shijiazhuang, Hebei, China
Abstract: Data analysis is closely tied to the attribute dimensions of the data. Traditional extraction and
partition of attribute dimensions is largely manual and too inefficient to meet the needs of big data analysis.
This paper proposes an attribute dimension partition scheme based on SVM classification and MapReduce for
big data analysis. The scheme improves the traditional SVM classification method by combining it with
Euclidean distance theory to overcome its disadvantages, and adopts a penalty coefficient to reduce the
imbalance of the data distribution. With the improved SVM classification method, the implementation of
attribute dimension partition takes the MapReduce model of Hadoop as its processing engine, uses TF-IDF
vectors to store the extracted attribute dimensions, and uses the K-Means algorithm for clustering partition.
The experimental results show that the execution efficiency of the proposed method is improved and that,
while the rationality of the partition is guaranteed, an increase in the number of data attributes does not
significantly increase the execution time.
Keywords: Attribute Dimension Partition, SVM Classifying, MapReduce, Euclidean Distance, K-Means
1. Introduction
A large number of management information systems (MIS) that manage large amounts of data exist in current
society, but their data operations are limited to insert, delete, update, search and fixed statistics, so they do
not make the most of those data [1,2]. With the development of big data technology, those data are gradually
taking on a new role. Data analysis, however, is closely tied to data attribute dimensions; for example, OLAP
technology, which comprises slice, dice and drill operations [3], is based on the extraction and partition of
data attribute dimensions. Traditional extraction and partition of attribute dimensions is largely manual and
too inefficient to meet the needs of big data analysis. Existing attribute dimension partition methods, such as
vertical partition, horizontal partition and mixed partition, operate on the rows or columns of the data
attributes without considering the importance of each attribute to the user [4]. This paper proposes an
attribute dimension partition scheme based on SVM classification and MapReduce. The SVM classification
method is combined with Euclidean distance theory to overcome the disadvantages of the traditional SVM
algorithm, and adopts a penalty coefficient to reduce the imbalance of the data distribution. With this SVM
classification method, the implementation of attribute dimension partition takes the MapReduce model of
Hadoop as its processing engine, uses TF-IDF vectors to store the extracted attribute dimensions, and uses the
K-Means algorithm for clustering partition. The experimental results show that the execution efficiency of the
proposed method is improved and that, while the rationality of the partition is guaranteed, an increase in the
number of data attributes does not significantly increase the execution time.
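The storage and clustering steps named in the introduction (TF-IDF vectors followed by K-Means) can be illustrated with a minimal single-process sketch. It is not the implementation used in this paper: it omits the SVM classification step and the Hadoop MapReduce engine, and the attribute names, their descriptive tokens, the function names and the choice of initial centroids are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one TF-IDF vector per tokenised document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / doc_freq[t])
            for t in vocab
        ])
    return vocab, vectors


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, init_centroids, max_iter=20):
    """Plain K-Means with caller-supplied initial centroids; returns cluster labels."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: euclidean(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical attributes, each described by a few tokens (illustrative only).
attributes = {
    "order_date": ["time", "date", "order"],
    "ship_date": ["time", "date", "ship"],
    "city": ["region", "place", "city"],
    "province": ["region", "place", "province"],
}
names = list(attributes)
_, vectors = tfidf_vectors([attributes[n] for n in names])
# Assumption: seed one centroid from a time-like and one from a region-like attribute.
labels = kmeans(vectors, [vectors[0], vectors[2]])
partition = dict(zip(names, labels))
```

In this toy input the two date attributes fall into one cluster and the two location attributes into the other, which is the kind of grouping an attribute dimension partition aims for. In a MapReduce setting the assignment step would typically run in the mappers and the centroid update in the reducers.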
2. Related Work
2.1 Data Analysis Method
OLAP (On-Line Analytical Processing) is a powerful data analysis engine that provides report display,
multidimensional analysis and the creation of execution plans based on data scenarios. It is widely used in
production environments such as business behaviour management, budget forecasting, financial reporting,
knowledge discovery and data mining [14]. Data from sources such as OS, ERP and CRM systems are
extracted, transformed and loaded in a unified way, as required, to form a data warehouse, which provides
query analysis, report display, data mining and other functions to the upper layers.
In traditional OLAP analysis, a relational database organises the data mainly into two kinds of table,
dimension tables and fact tables [15], and analyses the data by constructing a data cube, a multidimensional
data structure that allows the facts to be queried along several viewing axes [16]. The fact table mainly stores
numeric information under certain dimensions and keeps growing as time goes on; for example, a fact table of
input and output values from 2001 to 2016 stores the input-output facts of a particular region (Uygur, in this
example) and industry dimension between 2001 and 2016 [17]. A dimension table usually holds descriptive
information with a hierarchy that changes little over time, such as the division of regions, the division of time
and the division into the three major industries; once established, these divisions basically do not change. The
data they describe can be encoded to form the region dimension table, the time dimension table and the
industry dimension table of the data analysis model [18]. OLAP organises a single fact table and its associated
dimension tables in a star or snowflake schema, linked through foreign keys. When an OLAP analysis is
performed, the query first locates the matching rows in the fact table, then follows their foreign keys to the
dimension tables, splices in the full descriptive values, and finally returns the result [19].
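The star-schema lookup described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from any OLAP engine: the table contents, key values, column names and the helpers `resolve` and `query` are hypothetical.

```python
# Hypothetical dimension tables: surrogate key -> descriptive value.
region_dim = {1: "Region A", 2: "Region B"}
year_dim = {1: 2001, 2: 2016}
industry_dim = {1: "primary", 2: "secondary", 3: "tertiary"}

# Hypothetical fact table: foreign keys into each dimension plus a numeric measure.
fact_table = [
    {"region_id": 1, "year_id": 1, "industry_id": 1, "output": 10.0},
    {"region_id": 1, "year_id": 2, "industry_id": 1, "output": 25.0},
    {"region_id": 2, "year_id": 2, "industry_id": 2, "output": 40.0},
]


def resolve(row):
    """Replace the foreign keys of one fact row with their dimension descriptions."""
    return {
        "region": region_dim[row["region_id"]],
        "year": year_dim[row["year_id"]],
        "industry": industry_dim[row["industry_id"]],
        "output": row["output"],
    }


def query(year):
    """Slice the fact table on one year and return fully described rows."""
    # Look up which surrogate keys of the time dimension match the requested year.
    year_keys = {key for key, value in year_dim.items() if value == year}
    return [resolve(row) for row in fact_table if row["year_id"] in year_keys]


rows_2016 = query(2016)
total_2016 = sum(row["output"] for row in rows_2016)
```

Here `query` plays the role of a slice on the time axis: it filters the fact table through the dimension key, then splices the descriptive values back in before returning the result.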