A Novel Classification Approach for C2C E-Commerce Fraud Detection
*1 Haitao Xiong, 2 Yufeng Ren, 2 Pan Jia
*1 School of Computer and Information Engineering, Beijing Technology and Business
University, Beijing 100084, China, [email protected]
2 School of Computer and Information Engineering, Beijing Technology and Business
University, Beijing 100084, China
Abstract
Fraud in consumer-to-consumer (C2C) e-commerce is becoming increasingly serious. The
purpose of this study is to develop an effective fraud detection model to assist customers in identifying
potentially fraudulent transactions. We use Naive Bayes (NB), the C4.5 decision tree and AdaBoost to
construct the model for classifying imbalanced transaction data, and majority voting is used to combine
the sub-classifiers. Several experiments are conducted on a Taobao data set to verify the classification
performance of the proposed model using four popular performance metrics. The experimental results
demonstrate that the model based on NB and AdaBoost&C4.5 can significantly increase the ability to
locate potentially fraudulent transactions in C2C e-commerce.
Keywords: Fraud Detection, Decision Tree, AdaBoost, Imbalanced Data, Classification
1. Introduction
The rapid and widespread development of the Internet has made C2C e-commerce increasingly
popular because of its low cost and high efficiency. This rapid growth has also exposed hidden
problems. In a virtual Internet transaction it is hard to verify the identity of either party, and
customers have difficulty buying products because information about product quality is asymmetric.
Therefore, the “lemon effect” occurs [1], and it is hard to find a feasible solution to this problem
[2-5]. Sellers have an incentive to overstate quality; buyers take this incentive into consideration and
deem the quality of goods to be uncertain. Only goods of average quality are then considered, which
in turn drives goods of above-average quality out of the market and can ultimately lead to the
destruction of the market.
Currently, the reputation systems chosen by most C2C e-commerce sites to prevent fraud mainly
use a simple summation or average of ratings. Summation of ratings sums the number of positive
ratings and negative ratings separately and keeps a total score equal to the positive score minus the
negative score. The reputation systems adopted by eBay and Taobao use summation of ratings [6].
Average of ratings, which is used by Amazon, computes the reputation score as the average of all
ratings. Such reputation systems can hardly reflect a trader’s true reputation and are often attacked by
fraudsters. As C2C e-commerce develops, more and more buyers and sellers participate in it.
Meanwhile, the number of fraud transactions has also risen remarkably. Non-fraud transactions are
represented by a large number of records while fraud transactions are represented by only a few. This
class imbalance makes it extremely difficult to extract fraud patterns in C2C e-commerce.
In this study, we propose a C2C e-commerce fraud detection model based on NB and
AdaBoost&C4.5 to classify imbalanced transaction data. To build the model, we examine its
components and revise its architecture. We then evaluate its capability to discriminate abnormal
transactions through experiments on a Taobao data set. This paper is organized as follows: Section 2
describes the methodology used in the model. Section 3 describes the classification mechanism of the
C2C e-commerce fraud detection model. Section 4 details our data and performance metrics, followed
by the experiments and results. Finally, the conclusions of this study and future work are provided in
Section 5.
International Journal of Digital Content Technology and its Applications(JDCTA)
Volume7,Number1,January 2013
doi:10.4156/jdcta.vol7.issue1.58
2. Research methodology
This section discusses the research methodology used in this study and the main components
involved.
2.1 Learning algorithms
In fraud detection research, several classification algorithms are widely used, including Naive
Bayes, the C4.5 decision tree and AdaBoost [7-9].
1) Naive Bayes
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with naive
independence assumptions. It assumes that, given the class, the presence or absence of a particular
feature is unrelated to the presence or absence of any other feature [7].
In spite of their naive design and apparently over-simplified assumptions, Naive Bayes classifiers
often work much better in complex real-world situations than one might expect. NB can give
better predictive accuracy than other algorithms such as C4.5 and BP neural networks when attributes
are normally distributed and not redundant. When attributes are not normally distributed or are
redundant, it shows lower predictive accuracy.
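To make the independence assumption concrete, the following minimal sketch implements a categorical Naive Bayes classifier with Laplace smoothing. It is an illustration only, not the implementation used in this study; the function names, the two features and the toy transactions are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict_nb(model, row):
    """Return the class with the highest log-posterior under the naive assumption."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Each feature contributes independently; Laplace smoothing
            # keeps an unseen value from zeroing the whole product.
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (seller_rating, price_level)
rows = [("low", "cheap"), ("low", "cheap"), ("high", "normal"),
        ("high", "normal"), ("high", "cheap"), ("low", "normal")]
labels = ["fraud", "fraud", "ok", "ok", "ok", "ok"]
model = train_nb(rows, labels)
print(predict_nb(model, ("low", "cheap")))
```

Because every feature adds its own log-likelihood term, redundant attributes are effectively counted twice, which is one way to see why redundancy hurts NB.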
2) Decision tree C4.5
A decision tree is a decision support technique that uses a tree-like graph or model of
decisions and their possible consequences. In machine learning, a decision tree is a predictive model
that maps observations about an item to conclusions about its target value. Leaves represent
classifications and branches represent conjunctions of features that lead to those classifications [8].
Most decision tree inducers assume that the overall prediction decision can be made by dividing the
decision into a sequence of small decisions. Different decision tree inducers mainly differ in the
goodness measure used to select the splitting attribute at each intermediate tree node [10].
C4.5 is a decision tree algorithm that can not only make accurate predictions but also explain
hidden patterns in data. It can deal with numeric attributes and missing values, estimate error rates and
generate rules. C4.5 is one of the most commonly used algorithms in the data mining and machine
learning communities, and C4.5 combined with under-sampling or over-sampling is quickly becoming
the accepted baseline to beat in research on class imbalance. In prediction accuracy, C4.5 performs
better than CART and ID3. However, C4.5 may suffer from scalability and over-fitting problems when
it is applied to large data sets.
3) AdaBoost
Boosting is one of the most powerful machine learning approaches to emerge in the past decade.
AdaBoost, a boosting method, is a meta-algorithm. It can be used to linearly combine many other
learning algorithms to correct the misclassifications made by weak classifiers. It is sensitive to noisy
data and outliers, but it is less susceptible to over-fitting than most learning algorithms are [11].
If simple weak classifiers are used, the AdaBoost algorithm is very fast [12].
AdaBoost is an ensemble construction technique that has become very popular for its
simplicity and adaptability. Two approaches are implemented in AdaBoost: one is re-weighting,
and the other is re-sampling. To fit with the imbalanced data processing, AdaBoost with
re-weighting is used in our study. With the re-weighting approach, all weighted training samples are
used in each round to train the classifier. AdaBoost calls a weak classifier repeatedly in a series of
rounds t = 1, ..., T. In each round, a distribution of weights D_t is updated to indicate the importance
of each example in the data set for the classification. The weights of incorrectly classified examples
are increased, so that the next classifier focuses more on those examples.
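The re-weighting rounds can be sketched as follows. This is a minimal discrete AdaBoost with one-dimensional threshold stumps as the weak classifier, chosen only to keep the example short; it is not the AdaBoost&C4.5 configuration used in this study, and the function names and toy data are hypothetical.

```python
import math


def best_stump(xs, ys, weights):
    """Pick the threshold/polarity stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Discrete AdaBoost with re-weighting; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n  # D_1 is the uniform distribution
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified examples, lower the rest.
        for i, x in enumerate(xs):
            pred = polarity if x >= threshold else -polarity
            weights[i] *= math.exp(-alpha * ys[i] * pred)
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalise to get D_{t+1}
    return ensemble, weights


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble, weights = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

After each update the misclassified examples carry half of the total weight, which is what forces the next weak classifier to concentrate on them.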
All three classification algorithms have advantages and disadvantages. Based on the preceding
introduction, Table 1 summarizes the performance of the three algorithms. By using the three
algorithms together on the same data, their strengths can be combined and their weaknesses
reduced.
Table 1. Comparison of three algorithms
Algorithm   Accuracy    Scalability   Speed       Over-fitting
NB          Good        Excellent     Excellent   Low
C4.5        Excellent   Poor          Good        High
AdaBoost    Excellent   Poor          Poor        Medium
2.2. Imbalanced data processing: Under-sampling
Classification algorithms usually perform poorly on imbalanced data, and their results
are biased toward the majority class. When learning from imbalanced data, traditional classification
algorithms tend to produce high prediction accuracy for the majority class but low accuracy for the
minority class [9,13,14]. Data therefore tend to be classified into the majority class, although the
minority class is usually the meaningful class we want to identify. Hence, imbalanced data should be
handled before classification algorithms are applied. Under-sampling is used in our model to handle
imbalanced data, since majority under-sampling is preferable to minority over-sampling [13]. In
under-sampling, the minority class remains intact while the majority class is under-sampled. When
cost curves are used to explore the interaction of over-sampling and under-sampling with the decision
tree learner C4.5, under-sampling is found to produce a reasonable sensitivity to changes in
misclassification costs and class distribution. Over-sampling, on the other hand, is surprisingly
ineffective, often producing little or no change in performance.
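A minimal sketch of random majority under-sampling, assuming a target ratio of 1:1 and a fixed random seed for repeatability; the function name and the toy records are hypothetical.

```python
import random


def undersample(records, labels, minority_label, seed=0):
    """Keep every minority record and an equal-sized random subset of the rest."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(records, labels) if y == minority_label]
    majority = [r for r, y in zip(records, labels) if y != minority_label]
    # Sample without replacement so no majority record is duplicated.
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return minority, kept


# Hypothetical data: 5 fraud records among 100 transactions
records = list(range(100))
labels = ["fraud" if i < 5 else "ok" for i in records]
fraud, sampled_ok = undersample(records, labels, "fraud")
print(len(fraud), len(sampled_ok))
```

A single draw like this discards most of the majority class, which is why several sub-samples are usually drawn and combined.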
2.3. Combination technology: Majority voting
Majority voting is a combination technology. Among all the combination
technologies, it is by far the simplest to implement. It does not assume prior knowledge of the
behavior of the individual classifiers, and it does not require training on large quantities of
representative recognition results from the classifiers [15]. In a comparison of five combination
technologies (majority vote, Bayesian, logistic regression, fuzzy integral and neural network) applied
to seven classifiers, majority vote was just as effective as the other, more complicated technologies in
improving the recognition rate. When the decisions of n classifiers are combined by majority voting,
a sample is assigned a class when there is a consensus or when more than half of the classifiers agree
on that class. Otherwise, the sample is rejected.
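The voting rule, including the rejection case, can be sketched in a few lines. The function name is hypothetical, and `None` is used here to mark a rejected sample.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by more than half of the classifiers, else None."""
    if not predictions:
        return None
    label, votes = Counter(predictions).most_common(1)[0]
    # A strict majority is required; ties and split votes are rejected.
    return label if votes > len(predictions) / 2 else None


print(majority_vote(["fraud", "fraud", "ok"]))
print(majority_vote(["fraud", "ok"]))
```

With an odd number of classifiers and two classes, a strict majority always exists, so no sample is rejected.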
3. Fraud detection mechanism design
Fraud in C2C e-commerce has some recognizable characteristics. First, changes in the trading
characteristics of a trader, such as the types of commodity, turnover and trading frequency, can be
detected as evidence of abnormal trading. Second, fraud can be detected by finding similar
traits in the account information of sellers and learning rules from them, because sellers with similar
backgrounds are likely to behave in the same way. To detect fraud through these characteristics, it is
necessary to gather transaction information, account information and reputation information of
sellers. Since fraud transactions have some special traits, a functional relationship between transaction
attributes and transaction types (fraud or non-fraud) will help us detect fraud in a large amount of
trading data.
The purpose of this research is to design an effective and efficient fraud detection model for
C2C e-commerce. In fraud detection, there is always an imbalance between the positive sample,
representing fraud transactions, and the negative sample, representing non-fraud transactions.
Classification algorithms tend to ignore the minority class while classifying the majority class
accurately. Conventional algorithms are therefore limited in the classification of imbalanced data,
whereas an efficient fraud detection model should focus on the fraud sample (the minority class).
Thus, the data in the two types of sample should first be balanced and then used to train the fraud
detection model. For a complicated detection problem, a single method can hardly meet the detection
requirements. Because classifiers are complementary, different classifiers need to be
combined to reduce detection errors and improve detection robustness. In this research, we take
full account of combined classifiers in fraud detection to avoid the imprudent decisions that can result
from using a single classifier.
3.1 Mechanism of C2C e-commerce fraud detection model
Figure 1. Mechanism of C2C e-commerce fraud detection model
The C2C e-commerce fraud detection model is designed to detect abnormal transactions in
transaction data and to help users make decisions when selecting commodities. The model consists of
several steps, beginning with the preprocessing of raw data, after which the data is ready to be used
for classification. The second step is the NB classifier, which is the first classifier. The NB classifier
captures the distribution of the training set and refines the input of the next classifier. The
third step is the second classifier, a combined classifier based on AdaBoost&C4.5, which is generated
from sub-training sets whose sampling approach is described later. It uses majority voting
to combine all of its sub-classifiers. The results of the NB classifier are the inputs of the second
classifier, and the classification predictions are obtained from the second classifier. Finally, several
performance metrics are computed on the prediction results. The complete mechanism of the C2C
e-commerce fraud detection model is shown in Fig. 1.
3.2 Sampling process in C2C e-commerce fraud detection model
Figure 2. Generation of samples for classification
Figure 2 depicts the sampling model used in C2C e-commerce fraud detection to tackle the class
imbalance problem. The model is a combination of random sampling and under-sampling. As
introduced qualitatively in Section 3.1, the NB classifier is generated to capture the distribution of the
sample data and the C4.5 classifiers are generated to obtain more accurate decision tree rules. The
training set for the NB classifier is therefore a part of the transaction data drawn by random sampling,
and the sub-training sets for the combined C4.5 classifier are obtained by under-sampling that training
set. In under-sampling, the majority sample is randomly under-sampled and the union of the
sub-samples accounts for the whole majority sample.
Initially, the data set is separated into two sub data sets through random sampling: a training
set and a testing set. The testing set is used to test the classification performance of the model, and the
training set is input to the NB classifier to generate the NB classification model. Then the training set
is split into two parts: the fraud sample and the non-fraud sample. After that, non-fraud
sub-samples are randomly under-sampled from the majority (non-fraud) sample in such
a way that the ratio of fraud to non-fraud is approximately 1. Finally, each non-fraud sub-sample is
combined with the fraud sample to generate a sub-training set for the combined classifier based on
AdaBoost&C4.5.
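The generation of sub-training sets can be sketched as below. This is one reading of the procedure, assuming the non-fraud sample is shuffled and split into disjoint sub-samples so that their union covers the whole majority sample; the function name and toy record ids are hypothetical.

```python
import random


def build_sub_training_sets(fraud, non_fraud, seed=0):
    """Pair the full fraud sample with each disjoint non-fraud sub-sample."""
    rng = random.Random(seed)
    shuffled = non_fraud[:]
    rng.shuffle(shuffled)
    # Number of sub-samples chosen so each is about as large as the fraud sample.
    n_subsets = max(1, len(shuffled) // len(fraud))
    # Dealing records round-robin keeps sub-sample sizes within one of each other.
    sub_samples = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [(fraud, sub) for sub in sub_samples]


# Hypothetical record ids: 10 fraud and 195 non-fraud transactions
fraud_ids = list(range(10))
non_fraud_ids = list(range(100, 295))
sub_training_sets = build_sub_training_sets(fraud_ids, non_fraud_ids)
print(len(sub_training_sets))
```

Each pair would then train one sub-classifier, and the sub-classifier outputs would be combined by majority voting.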
4. Experiment and analysis
4.1 Experimental Data Sets and Metrics
First, we choose cell phones as the study object because fraud in cell phone sales is more serious
than for other commodities in China, and we collect cell phone data from Taobao from December 2011
to March 2012. The fraud behaviors considered in this study are mainly misrepresentation of items, fee
stacking and non-delivery of items. During data filtering and cleaning, blank and incorrect data were
deleted. The final data set is composed of 212784 transaction records, of which 10443 are fraud
transactions, so the number of non-fraud transactions is much larger than that of fraud cases. The data
set is transformed into a simpler data set that is more suitable for learning, so that the data are more
meaningful and more easily handled. The under-sampling approach is then applied to the data. In this
research the ratio of non-fraud to fraud is approximately 19. We keep the ratio of non-fraud to fraud
similar across sub-samples and ensure that it is approximately 1. Through under-sampling, 19
sub-samples are generated. After under-sampling the majority sample, each sub-sample together with
the minority sample forms one of the 19 sub-training sets of our study.
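The figures above can be checked with a few lines of arithmetic. The record counts are those reported in this section; the variable names are illustrative.

```python
total_records = 212784
fraud_records = 10443
non_fraud_records = total_records - fraud_records

# Ratio of non-fraud to fraud, and the number of 1:1 sub-samples it implies
ratio = non_fraud_records / fraud_records
n_sub_samples = round(ratio)
records_per_sub_sample = non_fraud_records / n_sub_samples

print(non_fraud_records, round(ratio, 2), n_sub_samples)
print(round(records_per_sub_sample / fraud_records, 2))
```

Splitting the 202341 non-fraud records into 19 sub-samples gives roughly 10650 records each, so every sub-training set has a non-fraud to fraud ratio close to 1.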
The experiments in this paper adopt a ten-fold cross-validation method. Each data set is
divided into ten equal parts, using nine folds as the training set and the remaining fold as an
independent test set. Following [9,16], Positive Accuracy, Negative
Accuracy, F-measure and G-mean are used to evaluate the performance of the C2C e-commerce fraud
detection model in this paper. Performance metrics are commonly calculated from the confusion
matrix. In this study, True Positives denote fraud transactions correctly identified as fraud, and True
Negatives denote non-fraud transactions correctly identified as non-fraud. Similarly, False Positives
denote non-fraud transactions incorrectly identified as fraud, and False Negatives denote fraud
transactions incorrectly identified as non-fraud [10].
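The formulas for the four metrics are not spelled out here. The following is a minimal sketch, assuming the definitions commonly used in imbalanced-learning work with fraud as the positive class: Positive Accuracy is the recall on fraud, Negative Accuracy is the recall on non-fraud, F-measure is the harmonic mean of precision and Positive Accuracy, and G-mean is the geometric mean of the two accuracies. The function name and the counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Compute the four metrics from confusion-matrix counts (fraud = positive)."""
    positive_accuracy = tp / (tp + fn)  # share of fraud correctly identified
    negative_accuracy = tn / (tn + fp)  # share of non-fraud correctly identified
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + positive_accuracy:
        f_measure = (2 * precision * positive_accuracy
                     / (precision + positive_accuracy))
    else:
        f_measure = 0.0
    g_mean = math.sqrt(positive_accuracy * negative_accuracy)
    return positive_accuracy, negative_accuracy, f_measure, g_mean


# Hypothetical counts, not taken from the experiments
pa, na, f1, g = imbalance_metrics(tp=80, fn=20, tn=900, fp=100)
print(pa, na, round(f1, 4), round(g, 4))
```

The G-mean values reported in Table 2 are consistent with this definition; for NB, the square root of 0.0756 × 0.9388 is about 0.2664.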
4.2 Experimental Results
The objective of the C2C e-commerce fraud detection model is to identify whether a transaction is
fraud or non-fraud. In our study, the model is trained in five different ways: the NB
classifier (NB), the combined classifier based on C4.5 (cC45), the combined classifier based on
AdaBoost&C4.5 (cAdaC45), the combined classifier based on NB and C4.5 (cNBC45) and the
combined classifier based on NB and AdaBoost&C4.5 (cNBAdaC45).
Table 2. Classification performances of five classification algorithms
Performance metrics (%)
Classifier   Positive Accuracy   Negative Accuracy   F-measure   G-mean
NB           7.56                93.88               11.72       26.64
cC45         39.24               89.34               59.25       59.21
cAdaC45      69.28               79.17               73.31       74.06
cNBC45       68.89               83.12               78.14       75.67
cNBAdaC45    79.48               89.57               85.49       84.37
By passing the testing set to the classifiers trained earlier, we obtain the classification
performance of each experiment, which is shown in Table 2. First, the table shows that all
five classifiers can indeed detect fraud transactions, and it lets us compare the accuracy of the
different classifiers in the C2C e-commerce fraud detection model. The C2C e-commerce fraud
detection model can indeed greatly improve the accuracies and enhance the classification
performance: cNBAdaC45 achieves the best value on every metric except Negative Accuracy. Second,
the imbalance problem in C2C e-commerce fraud detection must be resolved. As one can see, the NB
classifier has the lowest Positive Accuracy, F-measure and G-mean, whereas the four C4.5-based
classifiers trained on balanced data score much higher than NB on these metrics. Third, we want to
establish whether adding AdaBoost to a classifier affects the classification performance. After adding
AdaBoost to cNBC45, cNBAdaC45 performs much better than cNBC45. Finally, the relation between
NB and classification accuracy needs to be examined. NB alone has the worst performance. However,
the combined classifiers based on NB and other algorithms show higher accuracies than NB. If NB is
added to cC45, all accuracies fall significantly. When we choose cNBAdaC45 for the model, it has the
best performance and clearly improves all the accuracies compared to cAdaC45, but it does not
clearly improve the accuracies compared to cC45.
Out of the five classifiers examined, all except NB yield an efficient classifier for detecting fraud
transactions. The classification results of NB show that NB is good at identifying the
majority class (non-fraud transactions) but poor at identifying the minority class (fraud
transactions).
To eliminate the bias of the C4.5 classifier toward the majority class on imbalanced data, the
sampling approach described in Section 3 (Figure 2) is used to build the training sets. The training
sets for these C4.5 classifiers are balanced data sets generated by under-sampling. All four classifiers
based on C4.5 have high classification accuracies. This sampling technique not only resolves
the imbalance problem but also generates several sub-classifiers whose results are combined into the
final classification through majority voting. Given a balanced training set, the performance of a
classifier based on C4.5 is relatively good. We also find that adding the AdaBoost algorithm to a
classifier can either improve or worsen the classification performance. In our research, cNBAdaC45
has better overall performance than cNBC45, while cAdaC45 has worse performance than cC45,
showing that AdaBoost can help a classifier improve its accuracies when it is used in the right place.
Comparing all classifiers, cNBAdaC45 exhibits the best classification performance.
However, removing NB and AdaBoost from the classifier, which yields cC45, gives only slightly
worse performance; the difference between cNBAdaC45 and cC45 is very small. Thus cC45 and
cNBAdaC45 are both good at fraud detection in C2C e-commerce, with cNBAdaC45 performing best.
The capability of detecting potential fraud transactions is the most important capability in C2C
e-commerce fraud detection. Another important finding of the research is that NB can detect the most
undetected fraud transactions because of its capability to reveal potential undetected
abnormalities in large-scale data. After combining NB and cAdaC45, we obtain the best overall
classification performance in identifying detected and undetected fraud transactions in C2C
e-commerce among the five classifiers studied in this research.
5. Conclusion and future work
This study proposes a C2C e-commerce fraud detection model for classifying imbalanced data. In
C2C e-commerce, the advantages of the model using cNBAdaC45 include good accuracies and the best
performance in identifying undetected fraud transactions among the five classifiers examined. It not
only improves the classification performance but also helps to identify potential fraud transactions.
After the C2C e-commerce fraud detection model is combined with C2C e-commerce systems, fraud
transactions can be detected when their patterns do not match normal patterns. Through this
prevention, fraud transactions can be stopped before more losses occur, and customer satisfaction
will improve. In the end, more and more legitimate customers will take part in e-commerce because of
the safe and steady market, and fraudsters will be excluded from the market. Along with the
development of e-commerce, more and more new customers and transactions enter the C2C market.
Previously identified patterns will become ineffective, so C2C e-commerce systems should carry out
model self-learning to adjust the patterns for continuous fraud detection.
Our study also opens up several directions for future research in fraud detection. This study
conducts experiments only for the selected algorithms in the C2C e-commerce fraud detection model.
In the future, more studies are needed that use different classification algorithms to detect fraud in
C2C e-commerce. For instance, future studies could consider genetic algorithms, BP neural networks
and so on. They could also improve the architecture of the C2C e-commerce fraud detection model
and evaluate the resulting performance. In this study we collected data only on cell phones; the model
could be applied to other commodities to examine its feasibility and effectiveness. Furthermore, we
focus only on binary classification in C2C e-commerce, but several types of fraud exist, and
future studies could examine whether the results of this study still hold for them.
6. Acknowledgement
The research is supported by the National Natural Science Foundation of China under Grant No.
71201004, the College Students Scientific Research and Undertaking Starting Action Project under
Grant No. PXM2012_014213_000067, Research Foundation for Youth Scholars of Beijing
Technology and Business University No. QNJ2011-39 and Scientific Research Common Program of
Beijing Municipal Commission of Education No. KM201310011009.
7. References
[1] D. Teeni, D.R. Young, “The changing role of nonprofits in the network economy”, Nonprofit and
Voluntary Sector Quarterly, vol. 32, no. 3, pp.397-414, 2003.
[2] Z.H. Zhou, X.Y. Liu, “Training cost-sensitive neural networks with methods addressing the class
imbalance problem”, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no.1,
pp.63-77, 2006.
[3] M. Weatherford, “Mining for Fraud”, IEEE Intelligent Systems, vol. 17, no. 4, pp.4-6, 2002.
[4] J.T.S. Quah, M. Sriganesh, “Real-time credit card fraud detection using computational
intelligence”, Expert Systems with Applications, vol. 35, no. 4, pp.1721-1732, 2008.
[5] Seokjoo Andrew Chang, “Forensic Data Pattern Analysis using Information Entropy”,
International Journal on Data Mining and Intelligent Information Technology Applications, vol. 2,
no. 2, pp.12-20, 2012.
[6] A. Lin, J. Foster, S. Wang, “Understanding the factors that influence acceptance of online auction
platforms: a comparative study of Taobao and eBayEachnet”, International Journal of Business
and Systems Research, vol. 3, no. 2, pp.148-169, 2009.
[7] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, USA, 2005.
[8] J.J. Wu, H. Xiong, J. Chen, “COG: Local Decomposition for Rare Class Analysis”, Data Mining
and Knowledge Discovery, vol. 20, no. 2, pp.191-220, 2010.
[9] H. He, E. A. Garcia, “Learning from Imbalanced Data”, IEEE Transactions on Knowledge and
Data Engineering, vol. 21, no. 9, pp.1263-1284, 2009.
[10] A.R. Sinha, H.M. Zhao, “Incorporating domain knowledge into data mining classifiers: An
application in indirect lending”, Decision Support Systems, vol. 46, no. 1, pp.287-299, 2008.
[11] X. Luo, Q. Zhu, “Cost-sensitive ensemble via Adaptive Weighted Cost Proportionate Sampling”,
International Journal of Digital Content Technology and its Applications, vol. 5, no. 7, pp.257-265, 2011.
[12] H.T. Xiong, Y. Yang, S.X. Zhao, “Local Clustering Ensemble Learning Method based on
Improved AdaBoost for Rare Class Analysis”, Journal of Computational Information Systems, vol.
8, no. 4, pp.1783-1790, 2012.
[13] G. Batista, R.C. Prati, M.C. Monard, “A study of the behavior of several methods for balancing
machine learning training data”, SIGKDD Explorations, vol. 6, no. 1, pp.20-29, 2004.
[14] M.C. Chen, L.S. Chen, C.C. Hsu, W.R. Zeng, “An information granulation based data mining
approach for classifying imbalanced data”, Information Sciences, vol. 178, no. 16, pp. 3214-3227,
2008.
[15] X.Y. Liu, J. Wu, Z.H. Zhou, “Exploratory Undersampling for Class-Imbalance Learning”, IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 2, pp.539-550,
2009.
[16] X. Li , “Research on online trading customer classification based on customer characteristics and
behaviors”, International Journal of Digital Content Technology and its Applications, vol. 6, no.
10, pp.430-437, 2012.