Detection of Financial Statement Fraud and Feature Selection Using Data Mining Techniques
Advisor: Prof. 徐立群
Students: 吳泰霖 R16031145, 顏伶安 R16034193
Jan 9, 2015
1. Introduction
Briefly speaking, data mining supports the computer audit department (電腦審計部門).
Fraud costs US business more than $400 billion annually (Kirkos et al.).
Which technique has the strongest ability to predict financial statement fraud?
2. Literature Review
Techniques applied to financial statement fraud detection in prior work:
• decision trees
• neural networks (NN)
• Bayesian belief networks
• case-based reasoning
• fuzzy rule-based reasoning
• support vector machines
• hybrid methods
Representative studies: Aamodt and Plaza [1], Kotsiantis et al., Cecchini et al., Huang et al., Sohl and Venkatachalam, Cerullo and Cerullo, Calderon and Cheh, Koskivaara, Feroz et al., Deshmukh and Talluru, and Pacheco et al.
3. Methodology
• Dataset: 35 financial items for 202 companies from China (101 fraudulent, 101 non-fraudulent).
• Dataset processing: log transformation and ten-fold cross validation to improve the reliability of the results (a sketch of this setup follows below).
• Classifiers are evaluated both without and with feature selection.
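A minimal sketch of this processing pipeline (log transformation plus ten-fold cross validation), assuming scikit-learn and NumPy; the random data, the SVC classifier chosen here, and the use of log1p are placeholders for illustration, not the paper's exact setup:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Placeholder data standing in for the paper's dataset: 35 financial items,
# 101 fraudulent (y = 1) and 101 non-fraudulent (y = 0) companies.
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(202, 35)))
y = np.repeat([1, 0], 101)

X_log = np.log1p(X)   # log transformation to reduce the skew of financial items

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = []
for train_idx, test_idx in cv.split(X_log, y):
    clf = SVC(probability=True).fit(X_log[train_idx], y[train_idx])
    scores = clf.predict_proba(X_log[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print("mean AUC over the ten folds:", round(float(np.mean(aucs)), 3))
```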
Support Vector Machines (SVM)
• The training examples that are closest to the maximum-margin hyperplane are called support vectors.
• All other training examples are irrelevant for defining the binary class boundaries.
• Financial applications: credit rating, time series prediction, and insurance claim fraud detection.
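A minimal illustration of the support-vector idea with scikit-learn; the linear kernel, the C value, and the random placeholder data are assumptions for the sketch:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the 35 (log-transformed) financial items.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(202, 35)), np.repeat([1, 0], 101)

# Only the support vectors, the training points closest to the maximum-margin
# hyperplane, determine the decision boundary; the rest could be dropped.
svm = SVC(kernel="linear", C=1.0).fit(StandardScaler().fit_transform(X), y)
print("support vectors per class:", svm.n_support_)
```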
Genetic Programming (GP)
• An extension of genetic algorithms (GA).
• Evolutionary computation (reproduction, crossover, mutation) searches for the best solution to the problem; a toy sketch of this loop follows below.
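The paper uses genetic programming; the toy sketch below only illustrates the reproduction/crossover/mutation loop of evolutionary computation (closer to a plain GA evolving the weights of a linear rule than to tree-based GP), on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(202, 35)), np.repeat([1, 0], 101)   # placeholder data

def fitness(w):
    # Training accuracy of a simple linear threshold rule parameterised by w.
    return np.mean((X @ w > 0).astype(int) == y)

pop = rng.normal(size=(50, 35))                       # initial population of candidates
for generation in range(100):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-25:]]           # reproduction: fittest half survives
    a = parents[rng.integers(0, 25, size=50)]
    b = parents[rng.integers(0, 25, size=50)]
    pop = np.where(rng.random((50, 35)) < 0.5, a, b)  # crossover: mix genes of two parents
    mutate = rng.random(pop.shape) < 0.1
    pop = pop + mutate * rng.normal(scale=0.1, size=pop.shape)  # mutation: small random changes

best = max(pop, key=fitness)
print("best training accuracy found:", round(float(fitness(best)), 3))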
Multi-Layer Feedforward Neural Network (MLFF)
• MLFF is one of the most common NN structures; it is simple and effective and has found a home in a wide assortment of machine learning applications.
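A minimal MLFF sketch with scikit-learn; the single hidden layer of 10 units and the placeholder data are assumptions, not the paper's architecture:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(202, 35)), np.repeat([1, 0], 101)   # placeholder data

# One hidden layer of 10 units; the architecture here is illustrative only.
mlff = make_pipeline(StandardScaler(),
                     MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0))
mlff.fit(X, y)
print("training accuracy:", round(mlff.score(X, y), 3))
```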
Group Method of Data Handling (GMDH)
• An inductive learning algorithm for modeling complex systems.
• A self-organizing approach that tests increasingly complicated models and evaluates them using an external criterion on separate parts of the data sample.
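GMDH builds up layers of small polynomial models; the sketch below shows only the first layer of a much-simplified version: every pair of inputs gets an Ivakhnenko-style quadratic model fitted on one part of the data and scored (the external criterion) on a separate part. The placeholder data and the mean-squared-error criterion are assumptions; a full GMDH would keep several surviving models and stack further layers.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(202, 35))                              # placeholder data
y = np.repeat([1.0, 0.0], 101)

idx = rng.permutation(202)
fit_part, eval_part = idx[:101], idx[101:]                  # external criterion: separate data parts

def quad_terms(xi, xj):
    # Quadratic polynomial terms of one pair of inputs.
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

best = None
for i, j in combinations(range(X.shape[1]), 2):
    coef, *_ = np.linalg.lstsq(quad_terms(X[fit_part, i], X[fit_part, j]), y[fit_part], rcond=None)
    pred = quad_terms(X[eval_part, i], X[eval_part, j]) @ coef
    mse = float(np.mean((pred - y[eval_part]) ** 2))        # score on the held-out part
    if best is None or mse < best[0]:
        best = (mse, (i, j))

print("best input pair:", best[1], "external-criterion MSE:", round(best[0], 4))
```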
Logistic Regression (LR)
• The response variable is dichotomous (binary), i.e., yi = 0 or 1 for all i = 1, ..., n.
• LR is a commonly used approach for performing binary classification.
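A minimal LR sketch with scikit-learn on placeholder data (the standardization step and solver settings are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(202, 35)), np.repeat([1, 0], 101)   # placeholder data, yi in {0, 1}

lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
print("estimated P(fraud) for the first company:", round(lr.predict_proba(X[:1])[0, 1], 3))
```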
Probabilistic Neural Network (PNN)
• A pattern classification network based on the classical Bayes classifier, which is statistically an optimal classifier that seeks to minimize the risk of misclassification.
• Layers: input layer → pattern layer → summation layer → output layer.
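A minimal PNN sketch: the pattern layer holds one Gaussian kernel per stored training example, the summation layer averages the kernel activations per class, and the output layer picks the class with the larger score. The smoothing parameter sigma and the placeholder data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(202, 35)), np.repeat([1, 0], 101)   # placeholder training data
sigma = 1.0                                                  # smoothing parameter (assumed)

def pnn_predict(x_new):
    scores = []
    for cls in (0, 1):
        # Pattern layer: Gaussian kernel between x_new and each stored example of this class.
        d2 = np.sum((X[y == cls] - x_new) ** 2, axis=1)
        # Summation layer: average activation for the class (a kernel density estimate).
        scores.append(np.mean(np.exp(-d2 / (2.0 * sigma**2))))
    return int(np.argmax(scores))                            # output layer: Bayes-style decision

print("predicted class of the first company:", pnn_predict(X[0]))
```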
4. Feature Selection
• In this study, a feature selection phase is applied using the simple t-statistic technique.
• For a given feature, μ1 and μ2 are the sample means for the fraudulent and non-fraudulent companies respectively, σ1 and σ2 are the corresponding sample standard deviations, and n1 and n2 are the numbers of fraudulent and non-fraudulent samples. The per-feature t-statistic is t = (μ1 − μ2) / √(σ1²/n1 + σ2²/n2).
• The t-statistic is computed for each feature, and the 18 features with the highest values are selected.
• A high t-statistic value indicates a feature that discriminates strongly between fraudulent and non-fraudulent companies.
Top 18 items selected by t-statistic-based feature selection.
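A minimal sketch of the t-statistic ranking on placeholder data; taking the absolute value of t, so that features discriminating in either direction are kept, is an assumption rather than something stated in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(202, 35)), np.repeat([1, 0], 101)   # placeholder data

X1, X2 = X[y == 1], X[y == 0]                               # fraudulent / non-fraudulent samples
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
s1, s2 = X1.std(axis=0, ddof=1), X2.std(axis=0, ddof=1)
n1, n2 = len(X1), len(X2)

t = np.abs(mu1 - mu2) / np.sqrt(s1**2 / n1 + s2**2 / n2)    # per-feature t-statistic
top18 = np.argsort(t)[::-1][:18]                            # 18 most discriminative items
print("indices of the selected features:", top18)
```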
5. Results and Discussion
The 35 financial items for 202 companies are evaluated in three settings: (1) without feature selection, (2) with feature selection using the top 18 items, and (3) with feature selection using the top 10 items. Which classifier is best?
Result 1: without feature selection
Explanations
• Sensitivity is the proportion of actual fraudulent companies that a model correctly predicts as fraudulent (true positive rate).
• Specificity is the proportion of actual non-fraudulent companies that a model correctly predicts as non-fraudulent (true negative rate).
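The two measures, computed on a small made-up set of predictions:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # 1 = fraudulent, 0 = non-fraudulent (made-up labels)
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])   # made-up model predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # fraudulent predicted as fraudulent
fn = np.sum((y_pred == 0) & (y_true == 1))    # fraudulent missed by the model
tn = np.sum((y_pred == 0) & (y_true == 0))    # non-fraudulent predicted as non-fraudulent
fp = np.sum((y_pred == 1) & (y_true == 0))    # non-fraudulent flagged as fraudulent

sensitivity = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)                  # true negative rate
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```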
Result 2: Top 18 by feature selection
Results based on AUC indicate that GP yielded the highest accuracy, followed by PNN, which yielded marginally less accuracy. This makes us infer that the selected feature subsets have high discriminatory power and that the 'left-over' features contribute very little to the success of financial fraud detection.
Result 3: Top 10 by feature selection
Summary of top performers: without feature selection → PNN; with feature selection (top 18) → GP; with feature selection (top 10) → PNN.
T-test: are the differences between the top performer and the remaining classifiers significant?
Result 4: PNN vs. other classifiers (without feature selection)
To find out whether the differences in average AUC are statistically significant, a t-test was conducted between the top performer and each of the remaining classifiers (critical value: 1.73).
PNN is the best of all.
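One plausible reading of this test, assuming the per-fold AUCs of two classifiers are compared with a two-sample t-test (a critical value of 1.73 matches a one-tailed test at roughly 18 degrees of freedom); the AUC numbers below are made up, not the paper's:

```python
import numpy as np
from scipy import stats

# Made-up per-fold AUCs over the ten cross-validation folds.
auc_pnn   = np.array([0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.94, 0.87, 0.90, 0.91])
auc_other = np.array([0.85, 0.84, 0.88, 0.86, 0.83, 0.87, 0.89, 0.82, 0.85, 0.86])

t_stat, p_two_sided = stats.ttest_ind(auc_pnn, auc_other)   # two-sample t-test, 18 d.o.f.
print(f"t = {t_stat:.2f} (compare against the critical value 1.73), two-sided p = {p_two_sided:.4f}")
```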
Result 5: GP vs. other classifiers (Top 18 by feature selection)
PNN is almost as good as GP.
Result 6: PNN vs. other classifiers (Top 10 by feature selection)
PNN, GP, and GMDH are good.
Conclusion and future research directions
The results obtained in this study are better than those obtained in an earlier study on the same dataset.
Results based on AUC indicate that PNN was the top performer, followed by GP, which yielded marginally less accuracy in most cases.
The End