Machine Learning: SPSS
Neural Networks, KNN and Bayesian Methods
Application Area: Insurance, Fraud Detection
Data Mining Task: Classification
Number of Instances: 15,900
Number of Attributes: 31
Abstract:
A large number of problems in data mining are related to fraud detection. Fraud is a common problem in
auto insurance claims, health insurance claims, credit card transactions, financial transactions, and so on.
The data in this case comes from an actual auto insurance company. Each record represents an insurance
claim. The last column in the table tells whether the claim was fraudulent. A number of people
have used this dataset; here are some of their observations:
• “This is an interesting data because the rules that most tools are coming up with do not make any intuitive sense. I think a lot of the tools are overfitting the data set.”
• “The other systems are producing low error rates but the rules generated make no sense.”
• “It is OK to have a higher overall error rate with simple human understandable rules for a business use case like this.”
There are two datasets (Excel files):
1. Insurance Fraud – TRAIN-3000, and
2. Insurance Fraud – TEST-12900.
Train with the two neural network methods (multilayer perceptron and radial basis function) and
with the KNN and Bayesian network methods, varying parameters such as the number of hidden nodes in each layer,
with Fraud Prediction – Yes/No (Flag) as the target variable.
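The exercise above is done in SPSS; for readers without SPSS, a minimal scikit-learn sketch of the same experiment might look like the following. The synthetic data stands in for the Excel files, and the hidden-layer sizes and k value are illustrative assumptions, not values from the original study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the insurance claims data (31 attributes, ~6% fraud).
X, y = make_classification(n_samples=3000, n_features=31,
                           weights=[0.94, 0.06], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Multilayer perceptron: vary the hidden-layer sizes, as the exercise asks.
for hidden in [(10,), (20,), (20, 10)]:
    mlp = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500, random_state=0)
    mlp.fit(X_train, y_train)
    print(hidden, round(mlp.score(X_test, y_test), 3))

# KNN for comparison; k is a tunable parameter like the hidden-node counts.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("knn", round(knn.score(X_test, y_test), 3))
```

Note that plain accuracy is a weak yardstick here: with ~94% non-fraud cases, a model that predicts "No" for everything already scores about 0.94, which is why the cost-sensitive discussion below matters.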
This is an imbalanced dataset: fraud cases are rare and hard to separate from non-fraud cases. So
I also built one or more decision trees with modified misclassification costs to obtain better accuracy. Accuracy is
calculated as per the formula below.
                            PREDICTED CLASS
                            Class=Yes      Class=No
  ACTUAL     Class=Yes      a (TP)         b (FN)
  CLASS      Class=No       c (FP)         d (TN)

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
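In code, the accuracy formula reduces to a one-liner over the four confusion-matrix cells; the counts used below are hypothetical, for illustration only:

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts, for illustration only.
print(accuracy(tp=120, fn=80, fp=40, tn=760))  # → 0.88
```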
Recommendations:
After comparing all the models, I got the highest accuracy rate of 96.14% for the radial basis function network.
[Tree diagram not reproduced in this transcript]
The top predictors were Accident Area and Policy Report Filed Year, which had the highest predictor importance.
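The cost-modified decision trees mentioned above can be sketched in scikit-learn via class weights. The synthetic data and the 10:1 cost ratio are assumptions for illustration; SPSS exposes the same idea through its misclassification-cost settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Synthetic imbalanced data standing in for the insurance claims.
X, y = make_classification(n_samples=3000, n_features=31,
                           weights=[0.94, 0.06], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Penalize a missed fraud (class 1) ten times more than a false alarm.
# A shallow tree keeps the rules human-readable, per the quoted comments.
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10},
                              max_depth=4, random_state=0)
tree.fit(X_train, y_train)

tn, fp, fn, tp = confusion_matrix(y_test, tree.predict(X_test)).ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

Raising the cost on the fraud class trades some overall accuracy for more fraud cases caught, which matches the observation that a higher error rate with understandable rules can be acceptable for this business use case.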