Journal of Intelligent Learning Systems and Applications, 2019, 11, 33-63
ISSN Online: 2150-8410
ISSN Print: 2150-8402
10.4236/jilsa.2019.113003 Aug. 14, 2019 33 Journal of Intelligent Learning Systems and Applications
Predicting Credit Card Transaction Fraud Using Machine Learning Algorithms
J. X. Gao, Zirui Zhou, Jiangshan Ai, Bingxin Xia, Stephen Coggeshall
Hebei University of Economics and Business, Shijiazhuang, China
China University of Political Science and Law, Beijing, China
Wuhan Maple Leaf International School (High School), Wuhan, China
Wuhan Jinde Education Consulting, Co., Ltd., Wuhan, China
University of Southern California, Los Angeles, USA
Abstract
Credit card fraud is a wide-ranging issue for financial institutions, involving theft and fraud committed using a payment card. In this paper, we explore the application of linear and nonlinear statistical modeling and machine learning models on real credit card transaction data. The models built are supervised fraud models that attempt to identify which transactions are most likely fraudulent. We discuss the processes of data exploration, data cleaning, variable creation, feature selection, model algorithms, and results. Five different supervised models are explored and compared: logistic regression, neural networks, random forest, boosted tree, and support vector machines. The boosted tree model shows the best fraud detection result (FDR = 49.83%) for this particular data set. The resulting model can be utilized in a credit card fraud detection system. A similar model development process can be performed in related business domains, such as insurance and telecommunications, to avoid or detect fraudulent activity.
Keywords: Credit Card Fraud, Machine Learning Algorithms, Logistic Regression, Neural Networks, Random Forest, Boosted Tree, Support Vector Machines
How to cite this paper: Gao, J.X., Zhou, Z.R., Ai, J.S., Xia, B.X. and Coggeshall, S. (2019) Predicting Credit Card Transaction Fraud Using Machine Learning Algorithms. Journal of Intelligent Learning Systems and Applications, 11, 33-63.
Received: April 6, 2019
Accepted: August 11, 2019
Published: August 14, 2019
Copyright © 2019 by author(s) and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution-NonCommercial International License (CC BY-NC 4.0).
1. Introduction
Credit card fraud, that is, theft and fraud committed using a payment card such as a credit card or debit card, remains an important issue. To combat it, many fraud detection algorithms are widely used in industry. Card fraud can happen through theft of the physical card as well as through compromise of the card, including skimming, breach, and account takeover, in ways that would otherwise look like a legitimate transaction. According to the Global Payments Report 2015, the credit card was the most used payment method globally in 2014, ahead of other methods such as e-wallets and bank transfers. Along with the rise of credit card usage, the number of fraud cases has also been steadily increasing. The rise in credit card fraud has a large impact on the financial industry: global credit card fraud in 2015 reached a staggering USD 21.84 billion.
Financial institutions today typically develop custom fraud detection systems targeted to their own portfolios. Data mining and machine learning techniques are widely embraced to analyze patterns of normal and unusual behavior, as well as individual transactions, in order to flag likely fraud. Given this reality, the most cost-effective option is to tease out possible evidence of fraud from the available data using statistical algorithms. Supervised models trained on labeled data examine all previously labeled transactions to determine mathematically what a typical fraudulent transaction looks like, and assign a fraud probability score to each transaction. Among the supervised algorithms typically used, the neural network is popular, and support vector machines (SVMs) have been applied, as well as decision trees and other models. However, little attention has been devoted in the literature to a comparison of all the common algorithms, particularly using real data sets.
In this paper, we explore the application of various linear and nonlinear statistical modeling and machine learning models on credit card transaction data. The models built are supervised fraud models that attempt to identify which transactions are most likely fraudulent.
2. Description of Data
The data available for this research project are a collection of credit card transactions from a government agency located in Tennessee, U.S.A. The particular agency is not known.
The data consist of 96,753 credit card transactions during the year 2010, with 1,059 labeled as fraud. The file contains the fields:
• Record: A unique identifier for each data record. This field also represents
the time order;
• Cardnum: The account number for the transaction (we note that these are Mastercard transactions, since the account numbers begin with the digits 54);
• Date: The date of the transaction. Month, day and year only (no time of day);
• Merchnum: A typically 12-digit merchant identification number;
• Merch Description: A brief text description field of the merchant, typically
around 20 characters;
• Merch State: The state of the address for the merchant;
• Merch Zip: The zip code of the merchant;
• Transtype: A code denoting the type of transaction;
• Amount: The dollar amount of the transaction;
• Fraud: A label for the transaction to indicate whether or not it was fraudulent.
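The field list above can be read as a record schema. The following is a minimal sketch of how one such record might be parsed in Python, not the authors' code. The CSV column names, the `Transaction` class, the date format (inferred from the "2/28/10" style of date that appears in Table 1), the coding of the Fraud label as 0/1, and both sample rows are assumptions made for illustration only.

```python
import csv
import datetime as dt
import io
from dataclasses import dataclass
from typing import Optional

# Hypothetical CSV layout: the paper lists the fields but not the exact
# column names or file format. Both rows are made-up illustrations.
SAMPLE_CSV = """Record,Cardnum,Date,Merchnum,Merch Description,Merch State,Merch Zip,Transtype,Amount,Fraud
1,5142148452,2/28/10,930090121224,GSA-FSS-ADV,TN,38118,P,3.62,0
2,5142148452,3/1/10,,EXAMPLE MERCHANT,,,P,125.00,1
"""


@dataclass(frozen=True)
class Transaction:
    record: int                  # unique identifier, also the time order
    cardnum: str                 # kept as text: it is an identifier, not a quantity
    date: dt.date                # month, day and year only
    merchnum: Optional[str]      # may be missing
    merch_description: str
    merch_state: Optional[str]   # may be missing
    merch_zip: Optional[str]     # may be missing
    transtype: str
    amount: float
    fraud: bool


def blank_to_none(value: str) -> Optional[str]:
    """Treat empty or whitespace-only cells as missing values."""
    value = value.strip()
    return value if value else None


def parse_row(row: dict) -> Transaction:
    """Convert one CSV row (a dict of strings) into a typed Transaction."""
    return Transaction(
        record=int(row["Record"]),
        cardnum=row["Cardnum"].strip(),
        date=dt.datetime.strptime(row["Date"].strip(), "%m/%d/%y").date(),
        merchnum=blank_to_none(row["Merchnum"]),
        merch_description=row["Merch Description"].strip(),
        merch_state=blank_to_none(row["Merch State"]),
        merch_zip=blank_to_none(row["Merch Zip"]),
        transtype=row["Transtype"].strip(),
        amount=float(row["Amount"]),
        fraud=row["Fraud"].strip() == "1",
    )


transactions = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for t in transactions:
    print(t.record, t.date.isoformat(), t.merch_state, t.amount, t.fraud)
```

Identifier-like fields (Cardnum, Merchnum, Merch Zip) are kept as strings in this sketch so that leading zeros survive and so that they are not mistaken for numeric quantities.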
Table 1 shows summary information about all the fields. Only the Amount field is numeric; the other fields are all categorical or text. The statistical magnitudes in the table were calculated with the outliers eliminated. Three fields have some missing values: Merchnum, Merch state, and Merch zip. The number of unique values of the Merch state field is 227, which is unexpected because the U.S. has only 50 states. Some of the values in this field might be from other countries, such as Canada and/or Mexico.
Below we show some further information about the data.
Figure 1 shows the number of transactions each month. We notice a general upward trend through September, followed by a sharp drop in October. There are fewer monthly transactions in the last quarter of the year than in the other quarters. This is due to the government fiscal year, which starts on October 1: people tend to be more cautious with their money in the first few months of the new fiscal year.
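The monthly view described above amounts to a simple count of transactions by calendar month. Here is a small sketch under that assumption; the function name `monthly_counts` and the sample dates are made up for illustration and are not taken from the data set.

```python
import datetime as dt
from collections import Counter


def monthly_counts(dates):
    """Count transactions per calendar month (1-12), including empty months."""
    counts = Counter(d.month for d in dates)
    return {month: counts.get(month, 0) for month in range(1, 13)}


# Made-up dates: three in September, one in October.
sample_dates = [
    dt.date(2010, 9, 3),
    dt.date(2010, 9, 17),
    dt.date(2010, 9, 30),
    dt.date(2010, 10, 4),
]

counts = monthly_counts(sample_dates)
for month in (9, 10):
    print(f"2010-{month:02d}: {counts[month]} transactions")
```

Returning all twelve months, even those with zero transactions, keeps gaps visible when the counts are plotted.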
Figure 2 shows the top 10 most frequently occurring merchant descriptions. The total transaction frequency of the top 15 categories is 13,256, which is about 13.7% of the records. The top 200 categories account for 41% of the total records. From Table 1, we see that there are 13,126 distinct merchants according to the Merch description field, and 48.6% of the merchant descriptions occurred only once.
Figure 3 depicts the top 10 most frequently observed merchant states.
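The "top k categories cover x% of the records" figures quoted above are frequency shares. The sketch below shows one way such a share could be computed and then checks the reported 13,256 / 96,753 arithmetic. The helper name `top_k_share` and the toy category values are illustrative assumptions.

```python
from collections import Counter


def top_k_share(values, k):
    """Return the k most common values and the fraction of records they cover."""
    if not values:
        raise ValueError("values must not be empty")
    top = Counter(values).most_common(k)
    covered = sum(count for _, count in top)
    return top, covered / len(values)


# Toy data: ten records spread over four made-up merchant descriptions.
descriptions = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
top, share = top_k_share(descriptions, k=2)
print(top, share)

# Arithmetic check of the figure quoted in the text: 13,256 of 96,753 records.
reported_share = 13_256 / 96_753
print(f"{reported_share:.1%}")
```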
Table 1. Summary description of the data set.
| Field name | Type | # Records with a value | % Populated | Mode | # Unique values |
|---|---|---|---|---|---|
| Record | Index | 96,753 | 100% | | 96,753 |
| Cardnum | Categorical | 96,753 | 100% | 5142148452 | 1,645 |
| Date | Time | 96,753 | 100% | 2/28/10 | 365 |
| Merchnum | | 93,378 | 96.5% | 930090121224 | 13,091 |
| Merch description | | 96,753 | 100% | GSA-FSS-ADV | 13,126 |
| Merch state | | 95,558 | 98.8% | TN | 227 |
| Merch zip | | 92,097 | 95.2% | 38118 | 4,567 |
| Transtype | | 96,753 | 100% | P | 4 |
| Amount | Numeric | 96,753 | 100% | 3.62 | |
| Fraud | Categorical | 96,753 | 100% | 0 | 2 |
*Statistical magnitude without outliers.
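The per-field quantities in Table 1 (records with a value, percentage populated, mode, and number of unique values) can be derived with a few lines of code. This is a sketch rather than the procedure the authors used; `summarize_field`, the convention that `None` marks a missing value, and the toy state list are assumptions.

```python
from collections import Counter


def summarize_field(values):
    """Summarize one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    mode = counts.most_common(1)[0][0] if counts else None
    pct = round(100 * len(present) / len(values), 1) if values else 0.0
    return {
        "records_with_value": len(present),
        "pct_populated": pct,
        "mode": mode,
        "unique_values": len(counts),
    }


# Toy column with two missing entries (made-up values).
sample_states = ["TN", "TN", None, "VA", "TN", "CA", None, "VA"]
summary = summarize_field(sample_states)
print(summary)

# Arithmetic check against the Merchnum row: 93,378 of 96,753 records.
merchnum_pct = round(100 * 93_378 / 96_753, 1)
print(merchnum_pct)
```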
The most frequent state is TN, which accounts for about 12.6% of all transactions; this is not surprising because that is where the facility is located. The total number of transactions in the top 15 states is 71,647, which is about 75% of all records. From Table 1 we see that there are 227 different states in the data, and we note that 168 of the state field values are numeric rather than alphabetic.
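One way to surface the numeric state values mentioned above is to partition the distinct values of the Merch state field by their form. The sketch below does this under the assumption that a well-formed code is two letters; the function name `split_state_values` and the sample values are made up for illustration.

```python
def split_state_values(values):
    """Partition distinct state values into two-letter codes, digit strings, and the rest."""
    alphabetic, numeric, other = set(), set(), set()
    for value in set(values):
        if value.isdigit():
            numeric.add(value)
        elif len(value) == 2 and value.isalpha():
            alphabetic.add(value.upper())
        else:
            other.add(value)
    return alphabetic, numeric, other


# Made-up sample: three two-letter codes and two numeric values.
sample = ["TN", "VA", "ON", "123", "456", "TN"]
alphabetic, numeric, other = split_state_values(sample)
print(sorted(alphabetic), sorted(numeric), sorted(other))
```

With the counts reported in the text, 227 distinct values minus 168 numeric ones leaves 59 non-numeric values.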
3. Data Cleaning
When we examine the Amount field, shown in the box plot in Figure 4, we see one large outlier with an Amount value recorded as over 3 million dollars. After thoroughly checking that particular record, which is not labeled as fraud, we find that it is an unusual but known transaction in Mexican pesos from a particular Mexican organization, and we thus exclude it from further analysis.
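A box plot conventionally flags points that lie beyond 1.5 times the interquartile range from the quartiles. The paper does not state which rule, if any, was applied beyond visual inspection, so the sketch below is only an illustration of that common convention. The function name `tukey_fences` and the sample amounts are assumptions.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) box-plot fences at k times the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up amounts with one extreme value, loosely mirroring the case in the text.
amounts = [3.62, 10.0, 25.5, 40.0, 58.9, 120.0, 250.0, 3_000_000.0]
low, high = tukey_fences(amounts)
outliers = [a for a in amounts if a < low or a > high]
cleaned = [a for a in amounts if low <= a <= high]
print(outliers, len(cleaned))
```

A flagged point is only a candidate for removal. As in the text, the record should be inspected before it is excluded.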
Table 2 shows information about the fields with missing values. All three of these fields have a strong relationship with the Merch description field, but the same Merch description may also correspond to different values in the three fields. The three fields also have a strong relationship with each other, so the mode of each field is used to fill in its missing values.
Figure 4. Box plot distribution of the Amount field.
Table 2. Fields with missing values.
| Field | Records with missing values | Percentage (%) | Mode |
|---|---|---|---|
| Merchnum | 3,375 | 3.49 | 930090121224 |
| Merch state | 1,195 | 1.24 | TN |
| Merch zip | 4,656 | 4.81 | 38118 |
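The mode-based fill described in this section can be sketched as follows. This is a minimal illustration that uses the overall mode of a column. The helper name `fill_with_mode`, the use of `None` for missing values, and the sample list are assumptions and not the authors' code.

```python
from collections import Counter


def fill_with_mode(values):
    """Replace missing entries (None) with the most frequent observed value."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing observed, so nothing to fill with
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Made-up Merch state column with two missing entries.
merch_state = ["TN", None, "TN", "VA", None]
filled = fill_with_mode(merch_state)
print(filled)
```

The function returns a new list and leaves the input unchanged, so the original missing-value pattern stays available for later checks.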