However, when a classifier is deployed on the client-side
computer, the situation may become worse. As shown in Figure
1(b), a client-side classifier operates in a white-box setting.
The adversary can leverage almost all kinds of
analysis techniques, such as debugging, disassembling, code
analysis, dynamic taint tracking, etc., to thoroughly analyze the
target classifier. As a result, the adversary has an opportunity to
get more comprehensive knowledge about the classifier to
develop more sophisticated evasion attacks. A malformed
instance crafted this way is effective against every user of the
classifier. Besides, if the adversary gains perfect knowledge of the
classifier, she can even re-engineer a new classifier for commercial
purposes. In this study, it is assumed that the entire implementation
and configuration of the client-side classifier are available to the
adversary. The adversary can figure out the type of classification
model, the classification algorithm and the feature extraction
method by leveraging various techniques. Considering the
advancement of modern analysis techniques, this assumption is
reasonable.
Some client-side classifiers have introduced defense
techniques to prevent the adversary from learning crucial
information. For example, GPPF employs cryptographic
techniques to protect the classification model. Unfortunately, this
protection proves ineffective against classifier cracking (discussed
in Sections 3 and 4).
2.2 Phishing and GPPF
According to the latest report [3] of the Anti-Phishing Working
Group (APWG), phishing attacks remain widespread: the number
of unique phishing reports submitted to APWG during Q4 of 2014
was 197,252, an increase of 18 percent from the
163,333 received in Q3. To minimize the impact of phishing
attacks, a variety of methods have been proposed to detect
phishing pages, involving machine learning [39][52][56] or other
techniques [24] [25] [27] [31] [33] [46] [57] [58].
Modern web browsers also provide detection tools to protect end
users against phishing attacks. Safe Browsing, a service offered by
Chrome, provides not only blacklists of malicious
URLs but also a trained classifier (GPPF) that automatically
detects phishing pages [4]. In Chrome, Safe Browsing acts as a
guard: when a request is issued, the requested URL is checked
before the content is allowed to begin loading. The URL is checked
against two blacklists: malware and phishing. If the URL matches
either blacklist, Chrome blocks the request and redirects to a
warning page as shown in Figure 2. More importantly, for a
URL that is not present in the blacklists, Chrome further
invokes GPPF to determine whether the page is legitimate or
phishing. In practice, the phishing blacklist needs to be updated
constantly, and until an update arrives users remain vulnerable to
newly created phishing websites. GPPF therefore plays an
indispensable role in protecting end users from
unknown phishing pages.
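The two-stage check described above can be summarized in a short sketch. This is only an illustration of the control flow, not Chrome's implementation: the function name check_url, the return labels, and the classify_page callback are hypothetical, and exact string matching against in-memory sets is an assumption made for brevity.

```python
from typing import Callable, Set


def check_url(
    url: str,
    malware_blacklist: Set[str],
    phishing_blacklist: Set[str],
    classify_page: Callable[[str], bool],
) -> str:
    """Illustrative decision flow: blacklists first, classifier as fallback.

    All names here are hypothetical. classify_page stands in for the
    client-side classifier and returns True when it flags a page as phishing.
    """
    # Stage 1: the URL is checked against both blacklists before loading.
    if url in malware_blacklist:
        return "blocked:malware"
    if url in phishing_blacklist:
        return "blocked:phishing"
    # Stage 2: a URL absent from the blacklists is passed to the classifier.
    if classify_page(url):
        return "warned:classifier"
    return "allowed"
```

The sketch makes the dependency explicit: a newly created phishing page that is not yet blacklisted is stopped only if the second stage flags it.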
In fact, GPPF is the local version of an internal Google classifier.
Google developed and trained a scalable machine learning
classifier on its servers to detect phishing websites and uses it to
maintain Google’s phishing blacklist automatically [56]. Training
the classifier is a continuous offline process. The training process
uses a sample of roughly ten million URLs analyzed over the past
three months as the training dataset. The number of URLs from a
single domain is also limited to 150 per week to prevent a single
domain from contributing too much to the classification
model. Consequently, adversaries do not have an opportunity
to alter the training dataset enough to make the trained classifier
misclassify phishing pages as legitimate. However, to provide
real-time detection of unknown phishing pages, the trained
classifier is also implemented as a part of Safe Browsing, i.e.,
GPPF. As an internal component of the Chrome browser, GPPF is
deployed and runs entirely in the user environment. This
allows the adversary to freely analyze its implementation
and configuration to construct more sophisticated phishing
attacks.
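The per-domain cap on training URLs mentioned above can be illustrated with a minimal sketch. It assumes that "domain" means the URL's hostname and that the first URLs seen in a week's batch are the ones kept; the source states only the limit of 150 per week, so both choices, along with the name cap_per_domain, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def cap_per_domain(urls: Iterable[str], limit: int = 150) -> List[str]:
    """Keep at most `limit` URLs per hostname from one week's batch.

    Assumption: the hostname is used as the domain key and earlier URLs
    win; the actual selection policy is not described in the source.
    """
    counts: Dict[str, int] = defaultdict(int)
    kept: List[str] = []
    for url in urls:
        domain = urlparse(url).hostname or ""
        if counts[domain] < limit:
            counts[domain] += 1
            kept.append(url)
    return kept
```

Under such a cap, a single domain that submits thousands of crafted URLs contributes no more than any other domain that reaches the limit, which is why poisoning the training set is considered difficult.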
According to StatCounter [5], from Aug 2014 to Aug
2015, Chrome held an average market share of 48.6% and was the most
popular web browser. In May 2015, Google announced that
Chrome had over one billion active users [1]. This means that the
web surfing of over one billion users is protected by GPPF. Note
that if a phishing page can fool GPPF, it has a better chance of
staying off Google’s phishing blacklist. Furthermore, the
phishing blacklist provided by Google is also employed in the Firefox
and Safari browsers, as well as by Internet Service Providers
(ISPs) [6]. We have reason to believe that a security breach of
GPPF will potentially impact many more people than just the
users of Chrome.
3. CRACKING GPPF
There is very limited public information about the design and
implementation of GPPF. We therefore directly analyze Chromium,
the development version of the Chrome browser, to crack
GPPF. The cracking includes two main steps: (1) extracting the
classification model of GPPF from Chromium; and (2) decrypting
the hashed features of the model. Note that
some sensitive details of the cracking are intentionally omitted
to prevent them from being used for malicious purposes.
3.1 Extracting the Classification Model
3.1.1 Classification Algorithm
The multi-process architecture that Chrome/Chromium adopts
makes it more robust. According to the very brief description in
[4], the Browser process periodically fetches an
updated model from Google’s server and sends it to every Render
process via an IPC channel. This allows the classification to be
done in the Render process, which scores the requested page to
determine whether it is phishing or not.
Figure 2. Phishing warning page.