1:6 Li et al.
[Figure 2 diagram: each client splits a file into chunks (chunking) and applies MLE protection; the chunks from multiple clients, which can be tapped by the adversary, pass through deduplication and are kept on the disks of the deduplicated storage.]
Fig. 2. Architecture of encrypted deduplication.
file, a client divides file data into plaintext chunks that will be encrypted via MLE. It then uploads
the ciphertext chunks to deduplicated storage. An adversary can eavesdrop on the ciphertext chunks
before deduplication and launch frequency analysis. We assume that the adversary is
honest-but-curious, meaning that it does not change the prescribed storage system protocols or modify any
stored data.
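The upload path described above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from this paper: fixed-size chunking with a tiny hypothetical `CHUNK_SIZE`, a toy MLE instantiation whose key is the SHA-256 hash of the chunk content, and a hash-based keystream standing in for a real cipher. The function names `chunk`, `mle_encrypt`, and `upload` are illustrative only. The sketch shows why identical plaintext chunks yield identical ciphertext chunks, which is what makes both deduplication and frequency analysis possible.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the example is easy to follow


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split file data into fixed-size plaintext chunks (an assumption)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def mle_encrypt(plain: bytes) -> bytes:
    """Toy deterministic encryption: the key is derived from the content.

    NOT secure; a SHA-256 keystream stands in for a real block cipher.
    """
    key = hashlib.sha256(plain).digest()
    stream = b""
    counter = 0
    while len(stream) < len(plain):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(plain, stream))


def upload(store: dict, data: bytes) -> list:
    """Encrypt each chunk and deduplicate by ciphertext fingerprint.

    Returns the ciphertext chunks in upload order, i.e., what an
    eavesdropping adversary would see before deduplication.
    """
    tapped = []
    for plain in chunk(data):
        cipher = mle_encrypt(plain)
        tapped.append(cipher)
        # Only the first copy of each distinct ciphertext chunk is stored.
        store.setdefault(hashlib.sha256(cipher).digest(), cipher)
    return tapped


store = {}
first = upload(store, b"AAAABBBBAAAA")   # client 1: three chunks, two distinct
second = upload(store, b"BBBBCCCC")      # client 2: shares one chunk with client 1
```

In this sketch the adversary observes the repeated ciphertext chunk in `first` even though the storage keeps only one copy, so the frequency of each ciphertext chunk mirrors the frequency of its plaintext chunk.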
3.2 Auxiliary Information
To launch frequency analysis, the adversary needs access to auxiliary information that
provides ground truth about the backups being stored. Prior studies have proposed different
approaches to obtaining auxiliary information for inference attacks. We briefly discuss
several representative ones that inspire our work.
• Naveed et al. [50] examine inference attacks against the encrypted databases for electronic
medical records, some of which are protected by deterministic encryption. To evaluate the
feasibility of launching inference attacks, the authors obtain the auxiliary information from a
public user dataset released by the government health services.
• Grubbs et al. [28] infer the plaintexts of the attributes of customer records (e.g., first names,
last names, ZIP codes, and birth dates) stored in an encrypted database. They obtain the auxiliary
information regarding the plaintext distribution from public census and survey datasets.
• Bindschaedler et al. [15] also infer the plaintexts of the attributes of encrypted customer records
like Grubbs et al. [28], but use public and purchased U.S. voter registration lists as the auxiliary
information. The authors also use older versions of purchased hospital-discharge data and
public census data to infer the newer versions of the respective data.
• Grubbs et al. [27] focus on Ubuntu Internet Relay Chat (IRC) logs [2, 61] and extract the log
keywords. They generate the keyword query distribution from one year’s Ubuntu IRC logs as
the auxiliary information to infer the encrypted keywords in the logs of a later year.
• Pouliot et al. [52] consider the inference attacks against the keywords in the Enron email dataset
[37]. They first partition each user’s emails into two non-overlapping sets (i.e., training and
testing sets). They then generate the necessary auxiliary information from the training set, and
infer the content of the testing set that is encrypted.
• Other studies [19, 33, 66] also leverage the Enron email dataset. They create a Zipfian synthetic
keyword query distribution from the keyword list of the whole Enron dataset as the auxiliary
information, and use it to infer the keywords in the original dataset that is encrypted.
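The studies listed above share one pattern: a frequency distribution built from auxiliary data is matched against the frequencies observed in the encrypted target. The following is a minimal sketch of classical rank-based frequency analysis under that pattern; it is not the specific attack of any cited study. The helper names `rank_by_frequency` and `frequency_analysis`, the toy data, and the toy substitution table `enc` (which stands in for deterministic encryption) are all hypothetical.

```python
from collections import Counter


def rank_by_frequency(items):
    """Return distinct items sorted by descending count.

    Ties are broken by value so that the ordering is deterministic.
    """
    counts = Counter(items)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [value for value, _ in ordered]


def frequency_analysis(ciphertexts, auxiliary):
    """Map the i-th most frequent ciphertext to the i-th most frequent
    auxiliary plaintext. Returns a dict {ciphertext: guessed plaintext}.
    """
    return dict(zip(rank_by_frequency(ciphertexts),
                    rank_by_frequency(auxiliary)))


# Auxiliary information, e.g., an older or public dataset (toy data).
auxiliary = ["a", "a", "a", "b", "b", "c"]

# Target data, seen by the adversary only in deterministically encrypted form.
enc = {"a": "X", "b": "Y", "c": "Z"}  # toy stand-in for deterministic encryption
target = ["a", "b", "a", "c", "a", "b", "a"]
ciphertexts = [enc[p] for p in target]

guess = frequency_analysis(ciphertexts, auxiliary)
```

The inference in this sketch succeeds only because the frequency ranks of the auxiliary data and the target data agree; the closer the auxiliary information is to the target, the more ciphertexts are mapped correctly.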
We observe that previous studies mainly obtain the auxiliary information from private [15, 19, 27, 33,
52, 66] or public [15, 28, 50] sources. By private, we mean that the auxiliary information is
originally protected but is obtained through unintended data releases [10], data breaches [29], or
ACM Trans. Storage, Vol. 1, No. 1, Article 1. Publication date: January 2019.