Differentially Private Reinforcement Learning

Pingchuan Ma [1], Zhiqiang Wang [1,2,3,*], Le Zhang [1], Ruming Wang [4], Xiaoxiang Zou [5,*], and Tao Yang [3]

[1] Beijing Electronic Science and Technology Institute
[2] State Information Center
[3] Key Lab of Information Network Security, Ministry of Public Security
[4] Hainan University
[5] National Computer Network Emergency Response Technical Team / Coordination Center of China
[*] Corresponding author. wangzq@besti.edu.cn
Abstract. With remarkable performance and extensive applications, reinforcement learning is becoming one of the most popular learning techniques. Often, the policy π* released by a reinforcement learning model may contain sensitive information, and an adversary can infer demographic information by observing the output of the environment. In this paper, we formulate differential privacy in reinforcement learning contexts and design mechanisms for ε-greedy and Softmax in the K-armed bandit problem that achieve differential privacy guarantees. Our implementation and experiments illustrate that the output policies come with good privacy guarantees at a tolerable utility cost.

Keywords: differential privacy · reinforcement learning · privacy preserving.
1 Introduction
Recent years have witnessed a boom in artificial intelligence, which has contributed to a wide range of applications such as face recognition, self-driving, and medical diagnosis [17, 25]. Notably, the success of AlphaGo⁶ has sped up the development of reinforcement learning. In the meantime, the security and privacy issues that come with artificial intelligence have also drawn full attention from researchers. In real-world scenarios, trained reinforcement learning policies are released to the client side and often contain sensitive information, from which adversaries may infer demographic information. As AI models have millions of parameters, some sensitive information is contained in the model parameters implicitly. Recently, model inversion and membership inference attacks have shown their effectiveness on some AI models [10, 28, 23, 22]. Although, to the best of our knowledge, no well-known attack has been demonstrated against the data privacy of reinforcement learning models, we still think the environment should be protected since it contains sensitive information.

⁶ https://deepmind.com/research/alphago/
As a promising privacy-preserving technique, differential privacy introduces a strong privacy model that provides formal privacy guarantees independent of the background knowledge or computational power of an adversary [11]. Apple Inc. integrated differential privacy into its operating systems, iOS and macOS, for collecting sensitive data. Google released a framework for locally differentially private data aggregation and deployed it in Chrome [9]. While differential privacy for reinforcement learning has been studied in specific cases [2, 19, 27], our definitions and methods differ from theirs in that they are adapted to the features of generic reinforcement learning models.
Typically, the question of how to combine differential privacy with reinforcement learning can be divided into two parts: (1) how to formulate the privacy issues in reinforcement learning, and (2) how to design mechanisms that achieve differential privacy in reinforcement learning contexts.
Fig. 1. Structure of Reinforcement Learning (the agent takes an action; the environment returns a reward and the next state).
In this paper, the first objective is to define a formal privacy model in reinforcement learning contexts. However, differing from learning models with initial datasets, as is shown in Fig. 1, reinforcement learning does not have the notion of a dataset or data tuples and only learns from the feedback of environments. Therefore, the traditional definition of differential privacy is not applicable to reinforcement learning. Fortunately, we notice that the roles of states, actions, and rewards in reinforcement learning are, to some extent, similar to those of samples and labels in supervised learning. We thereby define reinforcement learning differential privacy based on this observation.
The second objective is to design a mechanism that achieves differential privacy guarantees. However, today's reinforcement learning models usually adopt deep neural networks to approximate action values or policies, and it is a common belief that such neural networks are hard to analyze theoretically. In this paper, we therefore illustrate the ideas of the mechanisms achieving differential privacy in a simplified setting, i.e., the K-armed bandit problem, which still preserves the important features distinguishing reinforcement learning from other types of learning.
Contribution. In summary, our main contributions are as follows:
- (ε, δ)-Differentially Private Reinforcement Learning. We extend the definition of differential privacy to reinforcement learning contexts. To be precise, our definition no longer adopts the notion of databases from traditional differential privacy models and utilizes environments instead.
- Exponential Mechanism for ε-greedy. We analyze the sensitivity of the utility function, adopt the exponential mechanism to obtain a differentially private ε-greedy algorithm in the K-armed bandit problem (the construction transfers directly to similar algorithms such as Q-Learning and Sarsa), and prove its privacy guarantee. A sketch of this style of action selection is given after this list.
- Laplace Mechanism for Softmax with Fine-grained Sensitivity. Different from ε-greedy, Softmax does not output an optimal action but a probability distribution over actions. Additionally, the high global sensitivity of the Softmax algorithm damages data utility. We analyze the smooth sensitivity of Softmax, utilize more fine-grained noise to perturb the output, and achieve (ε, δ)-differentially private reinforcement learning.
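To make the exponential-mechanism idea concrete, the following is a minimal sketch of exponential-mechanism action selection over K arms. It is not the construction of Section 4: the utility function (here taken to be the estimated action value Q[k]) and the sensitivity bound delta_u are placeholder assumptions, and the paper derives its own.

    import numpy as np

    def exponential_mechanism_select(Q, eps, delta_u, rng=None):
        # Pick an arm with probability proportional to exp(eps * u(k) / (2 * delta_u)),
        # where the utility u(k) is assumed here to be the estimated value Q[k]
        # and delta_u is an assumed sensitivity bound on u.
        rng = np.random.default_rng() if rng is None else rng
        scores = eps * np.asarray(Q, dtype=float) / (2.0 * delta_u)
        scores -= scores.max()          # shift for numerical stability
        probs = np.exp(scores)
        probs /= probs.sum()
        return int(rng.choice(len(Q), p=probs))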
Future Directions. We present two prospective future directions.
- Differential Privacy for Multi-step Reinforcement Learning. We ad-
dress differential privacy in the K-armed bandit problem in this paper. It is
of great significance to study how to achieve differential privacy in multi-step
reinforcement learning models, which are much more popular in real-world
scenarios.
- Differential Privacy for Continuous-Action Reinforcement Learning. While the problem of privacy-preserving discrete-action models is addressed in this paper, how to perturb a continuous action in reinforcement learning models is an important topic as well.
The next section reviews preliminaries on reinforcement learning and differential privacy, respectively. Section 3 demonstrates our formal definition. Section 4 presents the mechanism designs for ε-greedy and Softmax, respectively. Section 5 describes our experimental results. Section 6 discusses related work, and Section 7 concludes.
2 Preliminaries
In this section, we briefly introduce notions of reinforcement learning and differ-
ential privacy.
2.1 Reinforcement Learning
Reinforcement learning is a set of machine learning methods concerned with how agents take actions in an environment to maximise cumulative reward [25]. Typically, the reinforcement learning problem can be cast as a Markov Decision Process (MDP) [13]. An agent in the environment E observes a state x ∈ X, where the state space X describes the possible situations of the agent in E; when an action a ∈ A is taken, a reward is given by the environment E according to the reward function R. To summarise, a quadruple E = ⟨X, A, P, R⟩ denotes the reinforcement learning model, where P : X × A × X → R denotes the state-transition probability and R : X × A × X → R denotes the reward function of the environment E.

Unlike supervised learning techniques, the output of reinforcement learning can only be evaluated after multiple steps. In this article, we use the K-armed bandit problem with the ε-greedy and Softmax algorithms to convey our research [16].
K-armed bandit. To be precise, the K-armed bandit is a problem in which a limited set of resources must be allocated among competing choices so as to maximise the reward, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes [12, 3]. Occasionally, bandit algorithms are trapped in the exploration-exploitation dilemma, where the agent strives to balance sufficiently exploring the action space against exploiting the currently optimal action.
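For concreteness, the following is a minimal sketch of such an environment, under the illustrative assumption of Gaussian rewards whose true means are drawn once at construction time:

    import numpy as np

    class KArmedBandit:
        # A K-armed bandit with Gaussian rewards; the true mean of each arm
        # is hidden from the agent and only revealed through noisy pulls.
        def __init__(self, k=10, reward_std=1.0, seed=None):
            self.rng = np.random.default_rng(seed)
            self.true_means = self.rng.normal(0.0, 1.0, size=k)
            self.reward_std = reward_std
            self.k = k

        def pull(self, arm):
            # Pull one arm and observe a noisy reward.
            return self.rng.normal(self.true_means[arm], self.reward_std)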
ε-greedy. Intuitively, a common policy is to take the currently optimal action with probability 1 − ε and to choose an action uniformly at random with probability ε. After an initial period, the agent can identify the optimal action π*, under which the most reward is given, but will still try a random action with probability ε.
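As a minimal, self-contained sketch (the Gaussian rewards and step count are illustrative assumptions), an ε-greedy agent with incremental value estimates might look like:

    import numpy as np

    def epsilon_greedy_bandit(true_means, eps_rl=0.1, steps=10_000, seed=0):
        # Run epsilon-greedy on a Gaussian K-armed bandit and return the
        # learned value estimates; true_means is hidden from the agent.
        rng = np.random.default_rng(seed)
        k = len(true_means)
        Q = np.zeros(k)            # estimated action values
        counts = np.zeros(k)       # number of pulls per arm
        for _ in range(steps):
            if rng.random() < eps_rl:
                a = int(rng.integers(k))        # explore: uniform random arm
            else:
                a = int(np.argmax(Q))           # exploit: current best estimate
            r = rng.normal(true_means[a], 1.0)  # noisy reward from the environment
            counts[a] += 1
            Q[a] += (r - Q[a]) / counts[a]      # incremental mean update
        return Q

    # e.g. Q = epsilon_greedy_bandit([0.2, 0.5, 0.9])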
Softmax (Boltzmann Exploration). Softmax is based on Luce's choice axiom and picks an arm with the probability given by the Boltzmann distribution according to its average reward [16]. The probability of selecting each action k is

\[ P(k) = \frac{\exp\big(Q(k)/\tau\big)}{\sum_{i=1}^{K} \exp\big(Q(i)/\tau\big)} \qquad (1) \]

where τ is the temperature parameter, which controls the randomness. As τ → 0, the algorithm becomes purely greedy; by contrast, as τ → +∞, the algorithm selects actions uniformly at random.
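A minimal sketch of Eq. (1), computed in a numerically stable way (the default temperature is an illustrative assumption):

    import numpy as np

    def boltzmann_probabilities(Q, tau=0.5):
        # Probability of each arm under Boltzmann (Softmax) exploration, Eq. (1):
        # small tau approaches greedy selection, large tau approaches uniform.
        z = np.asarray(Q, dtype=float) / tau
        z -= z.max()                # subtract the max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    # e.g. an arm can then be drawn with
    # np.random.default_rng().choice(len(Q), p=boltzmann_probabilities(Q))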
2.2 Differential Privacy
With the growth of data aggregation and mining, the threats to data privacy also increase. Roughly speaking, differential privacy is a mathematical model of data privacy guarantees over a statistical dataset [7, 4-6]. The objective of differential privacy is to perturb the output of queries so as to prevent adversaries from inferring demographic information. The magnitude of the noise is controlled by the privacy budget ε.⁷
We let a vector D = [D_1, D_2, · · · , D_n] denote a statistical database, where D_i for each i ∈ {1, 2, · · · , n} is a tuple. The notion of (ε, δ)-differential privacy can be defined as follows.
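In the standard formulation (which the definition here presumably follows, before being adapted in Section 3 to environments instead of databases), a randomized mechanism M satisfies (ε, δ)-differential privacy if, for any two neighboring databases D and D′ that differ in a single tuple and any set of outputs S ⊆ Range(M),

\[ \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta . \]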
⁷ To distinguish the ε used in differential privacy from the ε in ε-greedy, the latter will be written as ε_rl in the remainder of the article, i.e., ε_rl-greedy.