Research on Big Data Attribute Dimension Partition Based on SVM Classification and MapReduce


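This paper addresses the importance of data attribute dimensions in big data analysis and proposes an attribute dimension partition scheme based on support vector machine (SVM) classification and the MapReduce model. Traditionally, extracting and partitioning data attribute dimensions relies on manual work and is inefficient, so it cannot meet the demand for efficient analysis in the big data era.

First, the authors improve the traditional SVM classification method. By combining it with Euclidean distance theory, they address limitations that SVM may show on large-scale data, such as sensitivity to outliers and high computational complexity. The aim is to improve classification accuracy and make the model more robust.

Next, the study introduces a penalty coefficient to deal with unbalanced data distribution. This helps samples of different classes receive balanced treatment during partitioning and avoids the analysis bias caused by some classes having far more data than others.

For the implementation, the paper combines the improved SVM classification method with the MapReduce model and uses the Hadoop platform as the processing engine. MapReduce allows massive data to be processed in a distributed way by breaking a complex computation into a series of independent subtasks, which improves execution efficiency. The approach uses Hadoop's parallel computing capability to process data efficiently on large clusters.

The paper then uses TF-IDF (Term Frequency-Inverse Document Frequency) vectors to store the extracted attribute dimension information. TF-IDF is a common text mining technique that quantifies how important a term is within a document collection, which makes it effective for representing and comparing data features.

Finally, the K-Means clustering algorithm groups the processed attribute dimensions. K-Means is an unsupervised learning method that divides data into several closely related clusters according to its internal structure. It plays a key role at this stage because it discovers patterns in the data automatically and assigns data points to groups by similarity.

By combining SVM, MapReduce and K-Means, the paper targets the efficiency of attribute dimension processing in big data environments and offers a new approach to efficient data analysis. The work has practical and theoretical value for modern information technology, particularly in application scenarios such as wireless personal communications.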
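As a rough illustration of the TF-IDF weighting described above, the following minimal sketch computes TF-IDF vectors in plain Python. It is not the authors' implementation: the function name `tf_idf`, the sample records, and the choice of the unsmoothed formula (term frequency times the natural logarithm of total documents over document frequency) are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        if not tokens:
            vectors.append({})  # an empty document has no weights
            continue
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical attribute descriptions, one token list per record.
records = [
    ["region", "year", "output"],
    ["region", "industry", "output"],
    ["region", "year", "input"],
]
vectors = tf_idf(records)
```

A term that occurs in every record, such as `region` here, gets weight zero under this formula, while a term that occurs in only one record gets the largest weight. Production libraries often use smoothed variants, so their numbers will differ.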

Research on Attribute Dimension Partition based on SVM Classifying and MapReduce

Zhao Wenbin, Fan Tongrang, Nie Yongchuan, Wu Feng, Wen Hou

1. School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, China

2. Institute of Scientific and Technical Information of Hebei Province, Shijiazhuang, Hebei, China

Abstract: Data analysis is closely related to data attribute dimensions. The traditional extraction and partition of data attribute dimensions is manual and inefficient, so it cannot meet the needs of big data analysis. This paper proposes an attribute dimension partition scheme based on SVM classifying and MapReduce for analysing big data. The scheme improves the traditional SVM classifying method by combining it with Euclidean distance theory to overcome its disadvantages, and adopts a penalty coefficient to reduce the imbalance of the data distribution. With the improved SVM classifying method, the implementation of attribute dimension partition takes the MapReduce model of Hadoop as the processing engine, uses TF-IDF vectors to store the extracted attribute dimensions, and uses the K-Means algorithm for clustering partition. The experimental results show that the execution efficiency of the proposed method is improved and that, while the rationality of the partition is guaranteed, increasing the number of data attributes does not significantly increase the execution time.

Keywords: Attribute Dimension Partition, SVM Classifying, MapReduce, Euclidean Distance, K-Means
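The clustering step named in the abstract can be pictured with a minimal K-Means sketch in plain Python. This is the standard Lloyd iteration, not the authors' code; the function name `k_means`, the sample points, and the choice of the first `k` points as initial centroids (made so the run is deterministic) are assumptions for illustration.

```python
import math

def k_means(points, k, iterations=20):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical two-dimensional feature vectors forming two visible groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (5.0, 6.0), (1.0, 0.0), (6.0, 5.0)]
labels, centroids = k_means(points, k=2)
```

In practice the result depends on the initial centroids, so real implementations usually run several random or K-Means++ initializations and keep the best one.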

1. Introduction

Many management information systems (MIS) that manage large amounts of data exist in current society, but their data operations are limited to insert, delete, update, search and fixed statistics, so they do not make the most of those data [1,2]. With the development of big data technology, those data are gradually taking on a new role. Data analysis, however, is closely related to data attribute dimensions. For example, OLAP technology, which includes the slice, dice and drill operations [3], is based on the extraction and partition of data attribute dimensions. The traditional extraction and partition of data attribute dimensions is manual and inefficient, so it cannot meet the needs of big data analysis. Existing attribute dimension partition methods, such as vertical partition, horizontal partition and mixed partition, operate on the rows or columns of data attributes without considering the importance of a data attribute to the user [4]. This paper proposes an attribute dimension partition scheme based on SVM classifying and MapReduce. The SVM classifying method is combined with Euclidean distance theory to overcome the disadvantages of the traditional SVM algorithm, and adopts a penalty coefficient to reduce the imbalance of the data distribution. With this SVM classifying method, the implementation of attribute dimension partition takes the MapReduce model of Hadoop as the processing engine, uses TF-IDF vectors to store the extracted attribute dimensions, and uses the K-Means algorithm for clustering partition. The experimental results show that the execution efficiency of the proposed method is improved and that, while the rationality of the partition is guaranteed, increasing the number of data attributes does not significantly increase the execution time.
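To illustrate the MapReduce model referred to above, the following minimal sketch simulates the map, shuffle and reduce phases in plain Python on a single machine. It does not use the Hadoop API and is not the authors' implementation; the function names `map_phase`, `shuffle` and `reduce_phase` and the sample records are assumptions made for illustration. The job counts how many records contain each attribute term, which is the document frequency that a TF-IDF weighting needs.

```python
from collections import defaultdict

def map_phase(record_id, tokens):
    """Emit (term, 1) once for each distinct term in a record."""
    for term in sorted(set(tokens)):
        yield term, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, values):
    """Sum the counts collected for one term."""
    return term, sum(values)

# Hypothetical input: record id mapped to the attribute terms found in that record.
records = {
    1: ["region", "year", "output"],
    2: ["region", "industry", "output"],
    3: ["region", "year", "input"],
}

# Each map call is independent of the others, which is what allows a cluster to run them in parallel.
emitted = [
    pair
    for record_id, tokens in records.items()
    for pair in map_phase(record_id, tokens)
]
doc_freq = dict(reduce_phase(term, values) for term, values in shuffle(emitted).items())
```

Because every map call and every reduce call touches only its own input, the same logic can be spread across many nodes without coordination beyond the shuffle.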

2. Related Work

2.1 Data Analysis Method

OLAP (On-Line Analytical Processing) is a powerful data analysis engine that provides report display, multidimensional analysis and the creation of execution plans based on data scenarios. It is widely used in production environments for business behaviour management, budget forecasting, financial reporting, knowledge discovery and data mining [14]. Depending on requirements, data from source systems such as the OS, ERP and CRM are unified through extraction, transformation and loading to form a data warehouse, which provides query analysis, report display, data mining and other functions to the upper layers.

In traditional OLAP analysis, the data in a relational database are mainly divided into dimension tables and fact tables [15], and the data are analysed by constructing a data cube, a multidimensional data structure that allows the facts to be queried along several viewing axes [16]. Numeric information is stored in the fact table under given dimensions and grows over time. For example, a fact table of input and output values from 2001 to 2016 stores, under the region dimension and the industry dimension, the input-output facts for the years 2001 to 2016 [17]. A dimension table usually holds descriptive information that changes little over time, such as the division of regions, the division of time and the division of the three major industries; once established, these divisions basically do not change. This descriptive data can be encoded to form the region dimension table, the time dimension table and the industry dimension table of the data analysis model [18]. OLAP organizes a single fact table and its associated dimension tables in a star or snowflake schema, linked through foreign keys. When an OLAP analysis is performed, the data described by the query are first located in the fact table, the foreign keys are then used to look up the dimension tables, and the values are spliced into a full description before the final result is returned [19].
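The query flow just described (locate rows in the fact table, follow the foreign keys into the dimension tables, splice the descriptive values back in) can be sketched with plain Python data structures. The table contents, key values and function names below are hypothetical and exist only for illustration; for brevity the year is kept as a plain column rather than a separate time dimension table.

```python
# Hypothetical dimension tables: surrogate key mapped to descriptive value.
region_dim = {1: "North", 2: "South"}
industry_dim = {10: "Primary", 20: "Secondary"}

# Hypothetical fact rows: (year, region_id, industry_id, output_value).
facts = [
    (2001, 1, 10, 120.0),
    (2001, 2, 20, 300.0),
    (2016, 1, 10, 180.0),
    (2016, 1, 20, 450.0),
]

def slice_facts(fact_rows, year):
    """OLAP slice: keep only the fact rows for one value of the time axis."""
    return [row for row in fact_rows if row[0] == year]

def resolve(fact_rows):
    """Follow the foreign keys and splice in the descriptive dimension values."""
    return [
        {
            "year": year,
            "region": region_dim[region_id],
            "industry": industry_dim[industry_id],
            "output": output,
        }
        for year, region_id, industry_id, output in fact_rows
    ]

# Locate the rows in the fact table first, then resolve them to full descriptions.
rows_2016 = resolve(slice_facts(facts, 2016))
total_output_2016 = sum(row["output"] for row in rows_2016)
```

Filtering on the compact fact rows before resolving keys keeps the dimension lookups limited to the rows the query actually returns.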
