Applications of Unsupervised Learning
in Property and Casualty Insurance
with emphasis on fraud analysis
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial
Data Mining, Inc.
www.data-mines.com
[email protected]
Objectives
- Review classic unsupervised learning techniques
- Introduce 2 new unsupervised learning techniques
  - RandomForest
  - PRIDIT
- Apply the techniques to insurance data
  - Automobile Fraud data set
  - A publicly available automobile insurance database
Motivation for Topic
- New book: Predictive Modeling in Actuarial Science
  - An introduction to predictive modeling for actuaries and other insurance professionals
- Publisher: Cambridge University Press
- Hope to Publish: Fall 2012
- Chapter on Unsupervised Learning
- Li Yang and Louise Francis
  - Li Yang: variable grouping (PCA)
  - Louise Francis: record grouping (clustering)
Book Project
- Predictive Modeling 2 Volume Book Project
- A joint project leading to a two volume pair of books on Predictive Modeling in Actuarial Science.
- Volume 1 would be on Theory and Methods and Volume 2 would be on Property and Casualty Applications.
- The first volume will be introductory with basic concepts and a wide range of techniques designed to acquaint actuaries with this sector of problem solving techniques.
- The second volume would be a collection of applications to P&C problems, written by authors who are well aware of the advantages and disadvantages of the first volume techniques but who can explore relevant applications in detail with positive results.
The Fraud Study Data
- 1993 AIB closed PIP claims
- Dependent Variables
  - Suspicion Score
  - Expert assessment of likelihood of fraud or abuse
- Predictor Variables
  - Red flag indicators
  - Claim file variables
7/5/2012
Francis Analytics and Actuarial
Data Mining, Inc.
The Fraud Problem
from: www.agentinsure.com
The Fraud Problem (2)
from Coalition Against Insurance Fraud
Fraud and Abuse
- Planned fraud
  - Staged accidents
- Abuse
  - Opportunistic
  - Exaggerate claim
The Fraud Red Flags
- Binary variables that capture characteristics of claims associated with fraud and abuse
- Accident variables (acc01 - acc19)
- Injury variables (inj01 - inj12)
- Claimant variables (ch01 - ch11)
- Insured variables (ins01 - ins06)
- Treatment variables (trt01 - trt09)
- Lost wages variables (lw01 - lw07)
The Red Flag Variables

Subject      Indicator Variable   Description
Accident     ACC01                No report by police officer at scene
             ACC04                Single vehicle accident
             ACC09                No plausible explanation for accident
             ACC10                Claimant in old, low valued vehicle
             ACC11                Rental vehicle involved in accident
             ACC14                Property Damage was inconsistent with accident
             ACC15                Very minor impact collision
             ACC16                Claimant vehicle stopped short
             ACC19                Insured felt set up, denied fault
Claimant     CLT02                Had a history of previous claims
             CLT04                Was an out of state accident
             CLT07                Was one of three or more claimants in vehicle
Injury       INJ01                Injury consisted of strain or sprain only
             INJ02                No objective evidence of injury
             INJ03                Police report showed no injury or pain
             INJ05                No emergency treatment was given
             INJ06                Non-emergency treatment was delayed
             INJ11                Unusual injury for auto accident
Insured      INS01                Had history of previous claims
             INS03                Readily accepted fault for accident
             INS06                Was difficult to contact/uncooperative
             INS07                Accident occurred soon after effective date
Lost Wages   LW01                 Claimant worked for self or a family member
             LW03                 Claimant recently started employment
Dependent Variable Problem
- Insurance companies frequently do not collect information as to whether a claim is suspected of fraud or abuse
- Even when claims are referred for special investigation
- Solution: unsupervised learning
Dimension Reduction
The CAARP Data
- This assigned risk automobile data was made available to researchers in 2005 for the purpose of studying the effect of a change in regulation on territorial variables
- The data contain exposure information (car counts, premium) and claim and loss information (Bodily Injury (BI) counts, BI ultimate losses, Property Damage (PD) claim counts, PD ultimate losses)
- Each record is a zip code
- Good example of using unsupervised learning for territory construction
R Cluster Library
- The "cluster" library from R is used
- Many of the functions in the library are described in Kaufman and Rousseeuw's (1990) classic book on clustering:
  - Finding Groups in Data.
Grouping Records
RF Similarity
- Varies between 0 and 1
- Proximity matrix is an output of RF
  - After a tree is fit, all records are run through the model
  - If 2 records are in the same terminal node, their proximity is increased by 1
  - 1-proximity forms distance
- Can be used as an input to clustering and other unsupervised learning procedures
- See "Unsupervised Learning with Random Forest Predictors" by Shi and Horvath
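The proximity calculation on the RF Similarity slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software used in the analysis: the terminal-node ids in `leaf_ids` are hypothetical, the function names are made up for this sketch, and the counts are divided by the number of trees, which is what keeps the result between 0 and 1.

```python
from itertools import combinations


def rf_proximity(leaf_ids):
    """Return the n x n proximity matrix from per-tree terminal-node ids.

    leaf_ids[t][i] is the terminal node that record i lands in for tree t.
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for tree in leaf_ids:
        for i in range(n):
            counts[i][i] += 1  # a record always shares a node with itself
        for i, j in combinations(range(n), 2):
            if tree[i] == tree[j]:
                # Same terminal node: proximity of the pair goes up by 1.
                counts[i][j] += 1
                counts[j][i] += 1
    # Dividing by the number of trees keeps every entry between 0 and 1.
    return [[c / n_trees for c in row] for row in counts]


def proximity_to_distance(prox):
    """Turn proximities into dissimilarities: distance = 1 - proximity."""
    return [[1.0 - p for p in row] for row in prox]


# Hypothetical terminal-node ids: 4 trees (rows) by 4 records (columns).
leaf_ids = [
    [1, 1, 2, 2],
    [3, 3, 3, 4],
    [5, 6, 6, 6],
    [7, 7, 8, 8],
]
prox = rf_proximity(leaf_ids)
dist = proximity_to_distance(prox)
```

The resulting `dist` matrix is the kind of dissimilarity input that a clustering procedure can take in place of Euclidean distance.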
Clustering
- Hierarchical clustering
- K-Means clustering
- This analysis uses k-means
K-means Clustering
- An iterative procedure is used to assign each record in the data to one of the k clusters.
- The iteration begins with the initial centers or medoids for k groups.
- The procedure uses a dissimilarity measure to assign records to a group and to iterate to a final grouping.
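The iteration described on the K-means Clustering slide can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the dissimilarity measure and cluster means as the centers. It is not the medoid-based routine from the R cluster library, and the records and starting centers are hypothetical.

```python
import math


def kmeans(records, centers, max_iter=100):
    """Lloyd's k-means: returns (labels, centers) for the given starting centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(r, centers[k]))
            for r in records
        ]
        if new_labels == labels:
            break  # no record changed cluster, so the grouping is final
        labels = new_labels
        # Update step: move each center to the mean of its assigned records.
        for k in range(len(centers)):
            members = [r for r, lab in zip(records, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical two-variable records forming two obvious groups.
records = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(records, centers=[records[0], records[3]])
```

Starting centers are passed in explicitly here so the run is repeatable. In practice the result can depend on that starting choice.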
R Cluster Output
Cluster Plot
Silhouette Plot
Silhouette Plot RF Proximity
Silhouette Plot - Euclidean Distance Clustering
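For readers unfamiliar with the quantity a silhouette plot displays, here is a rough Python sketch of the standard silhouette width, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean dissimilarity of record i to the rest of its own cluster and b(i) is the smallest mean dissimilarity to any other cluster. It takes a precomputed dissimilarity matrix, so the same function could in principle be fed either 1 - RF proximity or Euclidean distances. The matrix and labels below are hypothetical, and this is not the code that produced the plots in the presentation.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of each record from a dissimilarity matrix and labels."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette widths need at least two clusters")
    n = len(labels)
    widths = []
    for i in range(n):
        # Mean dissimilarity from record i to each cluster, excluding i itself.
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        if labels[i] not in mean_to:
            widths.append(0.0)  # convention: a singleton cluster gets width 0
            continue
        a = mean_to[labels[i]]
        b = min(v for c, v in mean_to.items() if c != labels[i])
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical dissimilarities for 4 records in 2 clusters.
dist = [
    [0.00, 0.25, 0.75, 1.00],
    [0.25, 0.00, 0.50, 0.75],
    [0.75, 0.50, 0.00, 0.25],
    [1.00, 0.75, 0.25, 0.00],
]
labels = [0, 0, 1, 1]
widths = silhouette_widths(dist, labels)
average_width = sum(widths) / len(widths)
```

Widths near 1 indicate records that sit well inside their cluster. Widths near 0 or below suggest records that lie between clusters or may be misassigned.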
Testing using Expert
Scores: Fit a Tree to Suspicion
Score for Importance Ranking
Importance Ranking of
the Clusters
Fit Tree to Binary Fraud
Indicator
Importance Ranking (2)
RF Ranking of the "Predictors": Top 10 of 44
PRIDITS of Accident
Flags
Fit Tree with PRIDITS for
Each Type of Flag
Importance Ranking of
Pridits
Importance Ranking of
Factors
Add RF and Euclid
Clusters to PRIDIT
Factors
Use Salford RF MDS
- Top variable in importance (acc10) used as binary dependent
- Run forest with 1,000 trees
- Output proximities and MDS
- Use MDS scales as input to clustering (k=3)
- Run tree to get importance ranking
MDS Graph
Rank of cluster
procedures to Tree
Prediction
Labeling Clusters
Relation Between
PRIDIT Factor and
Suspicion
Next Steps
- Add claim file variables
  - Rerun clusters
  - Rerun PRIDITS
- Do Random Forest proximities on the RIDITS
- Apply the procedures to other fraud databases
PRIDIT REFERENCES
Ai, J., Brockett, Patrick L., and Golden, Linda L., (2009), "Assessing Consumer Fraud Risk in Insurance Claims with Discrete and Continuous Data," North American Actuarial Journal, 13: 438-458.
Brockett, Patrick L., Derrig, Richard A., Golden, Linda L., Levine, Albert and Alpert, Mark, (2002), Fraud Classification Using Principal Component Analysis of RIDITs, Journal of Risk and Insurance, 69:3, 341-373.
Brockett, Patrick L., Xiaohua, Xia and Derrig, Richard A., (1998), Using Kohonen's Self-Organizing Feature Map to Uncover Automobile Bodily Injury Claims Fraud, Journal of Risk and Insurance, 65:245-274.
Bross, Irwin D.J., (1958), How To Use RIDIT Analysis, Biometrics, 14:18-38.
Chipman, H., E.I. George and R.E. McCulloch, (2006), Bayesian Ensemble Learning, Neural Information Processing Systems.
Lieberthal, Robert D., (2008), Hospital Quality: A PRIDIT Approach, Health Services Research, 43:3, 988-1005.
Questions?