Abstract—To identify users' sensitive data on the Internet, this
paper proposes a sensitive data identification method based on a
weight-constrained Gaussian Mixture-Probability Hypothesis Density
(GM-PHD) filter and Restricted Boltzmann Machines (RBM). First, the
data are normalized under a weight constraint, and a random network
is formed by defining an RBM energy function over the collected
features. Then, the sensitive feature weights of the sensitive data
are generated by the GM-PHD filter. Finally, simulation experiments
in MATLAB compare the method with the GM-PHD filter and the Gaussian
filter in terms of filtering and tracking performance, relevancy
degree, sensitive word weight, cluster mapping and high-frequency
approximation. The results show that the proposed method performs
better than the compared methods.
Keywords—Sensitive data, weight constraint, Gaussian
mixture-probability hypothesis density, restricted Boltzmann machine.
I. INTRODUCTION
WITH the development of science and technology, people's
dependence on the Internet is intensifying. As applications
become more popular in modern life, greater attention has been
paid to sensitive data [1-4]. Sensitive data are commonly
collected in Oracle databases, Android devices and cloud
environments [5-7]. Because the data flow generated by big data
is dynamic, it is influenced by actual usage and by the network
environment. As a result, some sensitive data cannot be
identified effectively when data are integrated and exchanged
across large-scale networks.
This work was supported in part by the Foundation of Zhejiang Educational
Committee (contract Y201738610) and the National Natural Science
Foundation of China (41275116, 61202464, 61472136 and 61772196).
Zhengqiu Lu is with the Department of Information & Media, Zhejiang
Fashion Institute of Technology, Ningbo 315175, Zhejiang, China
(corresponding author; e-mail: 459246322@qq.com).
Shengjun Xue is with the Department of Computer & Software, Nanjing
University of Information Science & Technology, Nanjing 210044, Jiangsu,
China.
Chunliang Zhou is with the Department of Information & Engineering,
Dahongying University, Ningbo 315175, Zhejiang, China.
Quanping Hua is with the Department of Information & Media, Zhejiang
Fashion Institute of Technology, Ningbo 315175, Zhejiang, China.
Defa Hu is with Computer and Information Engineering, Hunan
University of Commerce, Changsha 410205, Hunan, China.
Weijin Jiang is with Computer and Information Engineering, Hunan
University of Commerce, Changsha 410205, Hunan, China.
Generated data flows can be clustered during data transmission,
and some applications notify others of clustered data flows on
their own initiative. Some research indicates that sensitive data
are disclosed in the course of this active notification of
cluster information, so identifying sensitive data effectively is
of great significance. At present, there are two major approaches
to sensitive data identification: data dictionary matching and
manual identification. To prevent the economic and reputational
losses caused by the disclosure of sensitive data, some sensitive
data can be secured by secret key encryption, and protective
barriers can be set up by taking advantage of the popularity of
cloud computing [8]. Among these measures, the main protective
method is to use labels to identify sensitive data within large
volumes of data. Smartphones are now important collection points
for sensitive data, and some Android malware can associate with
other malware automatically [9]. Reference [10] proposes an
Android malware detection method based on a permission sequential
pattern mining algorithm: it mines permission sequences to detect
malware and warns about sensitive information that could be
produced when the malware is used. However, this method lacks
accuracy because the same permission patterns can also appear in
normal applications.
Sensitive data also play a significant role in other aspects of
information, and databases normally protect sensitive data with
encryption algorithms, for example by applying transparent data
encryption [11] to the sensitive data in an Oracle database.
However, access control then depends on the authorization of
external functions and lacks targeted identification.
Therefore, this paper proposes a method for identifying sensitive
data based on a weight-constrained GM-PHD filter [12-16] and an
RBM [17-19]. The method is built primarily on a probabilistic
random neural network model that is normalized under a weight
constraint, so that it can effectively extract the features of
sensitive data and the structure of the belief network. In
addition, the detection rate for sensitive data in the simulated
neural network can be improved by calculating the probability of
frequently occurring sensitive words and maximizing the
probability of the eligible samples.
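As a point of orientation for the RBM component named above, the following minimal sketch evaluates the energy function of a standard binary RBM, E(v, h) = -a·v - b·h - vᵀWh, and the resulting hidden-unit activation probabilities. The standard binary form, the function names and the toy weights are all assumptions made for illustration; the weight-constrained normalization and the GM-PHD filter of the proposed method are not reproduced here.

```python
import math

def rbm_energy(v, h, W, a, b):
    """Energy of a standard binary RBM (assumed form, not the paper's exact model).

    v: visible units, h: hidden units, W[i][j]: weight between v[i] and h[j],
    a: visible biases, b: hidden biases.
    """
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(
        v[i] * W[i][j] * h[j]
        for i in range(len(v))
        for j in range(len(h))
    )
    return -visible_term - hidden_term - interaction

def hidden_activation_prob(v, W, b):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W[i][j]) for each hidden unit."""
    probs = []
    for j in range(len(b)):
        pre_activation = b[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        probs.append(1.0 / (1.0 + math.exp(-pre_activation)))
    return probs

# Toy values (hypothetical): three visible feature units, two hidden units.
v = [1.0, 0.0, 1.0]
h = [1.0, 0.0]
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]

energy = rbm_energy(v, h, W, a, b)
probs = hidden_activation_prob(v, W, b)
```

Lower energy corresponds to a more probable joint configuration, which is why training an RBM amounts to raising the probability of the observed samples.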
II. SENSITIVE DATA FEATURE MODEL
Sensitive data occur frequently in online applications. In
general, malware that involves sensitive data gives rise to
cooperation among several frequent permission itemsets. In
addition, association rules and cluster mapping are formed
User Sensitive Data Identification Method Based
on Constraint Gaussian Mixture-Probability
Hypothesis Density Filter
Zhengqiu Lu, Shengjun Xue, Chunliang Zhou, Quanping Hua, Defa Hu and Weijin Jiang
INTERNATIONAL JOURNAL OF CIRCUITS, SYSTEMS AND SIGNAL PROCESSING