Knowledge Discovery from Data as a
framework to decision support in medical
domains
K. Gibert (1) and L. Salvador-Carulla (2)
(1) Department of Statistics and Operations Research
Knowledge Engineering and Machine Learning group
Universitat Politècnica de Catalunya, Barcelona
(2) PSICOST Scientific Research Association
Bridging Knowledge in long term care and support , Barcelona, March 5th, 2009
K. Gibert
Outline
1.- Introduction
2.- KDD
3.- Our Research
    Ill-structured domains
    Artificial Intelligence and Statistics
    Development platform KLASS
4.- Methodological overview (in parallel)
    Clustering based on rules (by States)
    Class panel graph
    Traffic lights panel
5.- Some real applications
    DEFDEP project
    Response to neurorehabilitation
    Long-term quality of life perception
6.- Conclusions
Introduction
• XXIst Century: Knowledge Society
  – Great need of getting knowledge from
    Data
    Organizations
    Natural, industrial or artificial phenomena
  – Support complex decision-making processes
• Enormous quantities of data to analyze
  – Internet boom
  – New technologies
• Classical data analysis falls short
  – Too much data
  – Phenomena too complex
• New approaches required
Data Mining and Knowledge Discovery
• Interdisciplinary problem
  “Nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad 96]
• Starting:
  – 1989: First Int’l Workshop on KDD at IJCAI
  – 1994: First proceedings
  – August 1995: First Int’l Conference on KDD
  – 1996: First state of the art (Fayyad et al.)
Data Mining and Knowledge Discovery
• Knowledge Discovery System [Fayy96]:
  – Problem definition
  – Data collection
  – Data cleaning and preprocessing
  – Dimensionality reduction
  – DM technique choice
  – Data mining
  – Interpretation and discovered knowledge production
Terminological ambiguity: Data Mining vs KDD
Data Mining and Knowledge Discovery
• Knowledge Discovery System [Fayy96]:
  – Very ambitious goals
Data Mining and Knowledge Discovery
• Banca d’Italia [1995]: built a KDD system for
  – Daily update of the whole set of movements
  – Deciding what and how to analyze
  – Selecting relevant results
  – Producing a daily 2-page synthesis (natural language)
Daily support to the main boss's decision making
Data Mining and Knowledge Discovery
• Banca d’Italia: built a KDD system for daily support to the main boss's decision making
• Technological problems
  – Millions of movements per day
  – Time to transmit to the central server?
  – Time to update the database?
  – How to select and retrieve proper data to analyze from the DB?
  – How to validate results and verify technical assumptions?
• Methodological problems
  – What is important to analyze today?
  – Which is the proper Data Mining technique?
  – Which results are relevant?
  – How to express results for the main boss?
Data Mining and Knowledge Discovery
• Big supermarket chains
  – Daily update of the data warehouse with customers' bill contents
  – Decide what and how to analyze
  – Select relevant results
    What is bought most
    Main associations between products
    “Buying nappies and beer in supermarket on Saturday evening”
  – Support decision making of the buying department and the marketing department
Important economic implications
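The "main associations between products" item can be illustrated with a minimal sketch of pairwise association rules, scored by support and confidence. The baskets below are invented toy data, and the function name `pair_rules` and the `min_support` threshold are assumptions made for illustration; the slides do not say which algorithm the supermarket systems used.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4):
    """Return (lhs, rhs, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore repeated items in one bill
        item_count.update(items)
        pair_count.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of bills containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence: share of bills with lhs that also contain rhs
            rules.append((lhs, rhs, support, count / item_count[lhs]))
    return rules


# Toy bills (hypothetical data, not from the talk)
baskets = [
    ["nappies", "beer", "crisps"],
    ["nappies", "beer"],
    ["nappies", "milk"],
    ["beer", "crisps"],
    ["milk", "bread"],
]
rules = pair_rules(baskets, min_support=0.4)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Only pairs are considered here to keep the sketch short; a production system would normally use a full frequent-itemset algorithm.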
Data Mining and Knowledge Discovery
• Knowledge Discovery System [Fayy96]:
  – Very ambitious goals
  – No complete system yet
    Connection to data warehouses
    Tools to assist preprocessing
    Collection of data mining techniques (AMD, NN, IR, AssR, Reg…)
    Some help on reporting phase
    Manual process management and knowledge production
Data Mining and Knowledge Discovery
• New paradigm proposed by Fayyad
  “Most previous work on KDD has focused on [...] data mining step. However, the other steps are of considerable importance for the successful application of KDD in practice” [Fayyad 96]
• Include prior and posterior analysis in KDD
  – Requires great efforts in real applications
    Especially in medical systems (uncertainty, imprecision, multi-scaled, ...)
  – Time consuming, difficult (no standard methodology established)
  – Expert interaction required
  – Domain-dependent?
  – After good prior analysis, proper data mining is easy
Include interaction with the expert as part of the methodology itself
Data Mining and Knowledge Discovery
• Knowledge Discovery System [Fayy96]:
  – Wide scope approach
  – Also interesting to better know very complex small datasets
Multidisciplinarity
Combination or hybridization of techniques
Data Mining and Knowledge Discovery
[Diagram: a complex domain is studied through KDD, which combines Data Analysis, Inductive Learning, Data Bases and Visualisation to produce a discovered knowledge base (domain model)]
•Understandability
•Prediction
•Description
•Validity
•Summary
•Overview
•Utility
•Simplicity/complexity
•Novelty
Our Research
• Applied approach (real domains)
  Ill-structured domains (ISD) [AIComm 94]
Ill-structured domains
[Diagram: an ill-structured domain D [AIComm94] is known through partial knowledge, heterogeneous data and additional knowledge on the domain structure. Example record: John, Weight 85, Height 1.85, Sex M, Eyes blue, mixing numerical and categorical variables]
Heterogeneous data
Our Research
• Applied approach (real domains)
  Ill-structured domains (ISD) [AIComm 94]
• Solving problems of knowledge discovery on ISD to support complex DECISION-MAKING
Multidisciplinary approach
Artificial Intelligence and Statistics
Interdisciplinary research field
• Starting:
  – 1985: Douglas Fisher and Bill Gale (AI&Stats Society)
  – 1986: First Int’l Conference on AI & Stats
• Main goals:
  – Promote communication between AI and Statistics communities
    “We feel that there is great potential for development at the intersection of Artificial Intelligence, Computational Science and Statistics” Cheeseman and Oldford 94.
  – Improve research in problems common to both (Data Mining and Knowledge Discovery, ...)
Our Research
• Applied approach (real domains)
  Ill-structured domains (ISD) [Gibert 94]
• Solving problems of knowledge discovery on ISD
• Design of hybrid methodologies in the AI & Stats field
Our Research
• Applied approach (real domains)
  Ill-structured domains (ISD) [Gibert 94]
• Solving problems of knowledge discovery on ISD
• Design of hybrid methodologies in the AI & Stats field
• Building of hybrid systems mainly oriented to
  – KDD using Clustering as main Data Mining tool
    “a number of real applications in KDD either require a clustering process or can be reduced to it” [Nakhaeizadeh 98]
    Distinguishable homogeneous groups of objects (patients)
  – Focus on prior knowledge exploitation
  – Support for implicit knowledge elicitation
  – Focus on interpretation support tools
  – Post-processing discovered knowledge
Our Research
• Applied approach (real domains)
  Ill-structured domains (ISD) [Gibert 94]
• Solving problems of knowledge discovery on ISD
• Design of mixed methodologies in the AI & Stats field
• Building hybrid systems mainly oriented to
  – KDD using Clustering as main Data Mining tool
• Development platform
  – KLASS [Gib91]
    Integrates AI and Statistics methods and pre- and post-processing tools for KDD on ISD
Methodological overview
– Data cleaning
– Relevant variable selection
– Prior knowledge acquisition
– Clustering based on rules (by States)
– Interpretation
  • Select number of classes
  • Class panel graph
  • Traffic lights panel
– Experts' conceptualization
– [More frequent trajectories diagram]
– Identification of profiles and labelling
– Description of profile characteristics
– Validation
Involve experts, with interaction
Some real applications
• DEFDEP project
• Response to neurorehabilitation
• Long-term quality of life perception
DEFDEP project
• New Spanish “Dependency Law” (LPAD 39/2006, 14th Dec)
  Law for the promotion of personal autonomy and care for persons with dependency
• Spain: first Mediterranean country adopting dependency policies which include severe mental illness
• PRODEP (specially created in Catalunya to develop the model and supporting system for dependency)
  Conselleria de Salut + ICATSS (Institut Català de Serveis Socials)
• Particularities of the psychically disabled population regarding dependency
• DEFDEP project (led by Dr. L. Salvador Carulla)
  Goal: propose a dependency model and an assistance system suited to the psychically disabled population (Severe Mental Disorders and Intellectual Disability)
  Groups of experts, relevant institutions, family associations, and a knowledge engineer + data from 306 patients with SMI
Prior knowledge acquisition
R4 = {
r1: If the patient is institutionalized (INSTITUC = {EVOINST or INSTI})
    then mark him as an institutionalized patient (i)
r2: If the patient has poor levels of functioning ((GAFCLA2 < 40) or (GAFSOA2 < 40))
    and high need of family support in daily activities (MAXIMOA > 15)
    and recurrent behavioral problems (MAXIMOB = Every_Day)
    then the patient is in ill condition (m)
r3: If ECFOS was not evaluated because the patient is autonomous
    then mark him as autonomous (a)
r4: If the patient was not evaluated under ECFOS for lack of a carer
    and is not institutionalized
    then mark him as living alone (s)
r5: If the patient is able to work (INGRESE2 = TRABAJO)
    and is not functionally impaired ((GAFCLA2 > 70) or (GAFSOA2 > 70))
    then mark him as a patient in good condition (b)
}
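As a rough illustration of how such a rule base could be applied to patient records, here is a minimal sketch. The variable names INSTITUC, GAFCLA2, GAFSOA2, MAXIMOA, MAXIMOB and INGRESE2 come from the rules; the field `ECFOS_MISSING`, its values, the value "NO" for INSTITUC, the first-match evaluation order and the sample patients are all assumptions made for the example, not details given in the talk.

```python
def label_patient(p):
    """Return the label of the first rule that fires for record p, else None."""
    # r1: institutionalized patient
    if p.get("INSTITUC") in {"EVOINST", "INSTI"}:
        return "i"
    # r2: poor functioning, high family support need, daily behavioral problems
    if ((p.get("GAFCLA2", 100) < 40 or p.get("GAFSOA2", 100) < 40)
            and p.get("MAXIMOA", 0) > 15
            and p.get("MAXIMOB") == "Every_Day"):
        return "m"
    # r3: ECFOS not evaluated because the patient is autonomous
    # (ECFOS_MISSING is a hypothetical field holding the reason)
    if p.get("ECFOS_MISSING") == "autonomous":
        return "a"
    # r4: ECFOS not evaluated for lack of a carer, and not institutionalized
    if (p.get("ECFOS_MISSING") == "no_carer"
            and p.get("INSTITUC") not in {"EVOINST", "INSTI"}):
        return "s"
    # r5: able to work and not functionally impaired
    if (p.get("INGRESE2") == "TRABAJO"
            and (p.get("GAFCLA2", 0) > 70 or p.get("GAFSOA2", 0) > 70)):
        return "b"
    return None  # no rule fires: the patient goes to the residual class


# Hypothetical patients, for illustration only
patients = [
    {"id": 1, "INSTITUC": "INSTI", "GAFCLA2": 35, "GAFSOA2": 30},
    {"id": 2, "INSTITUC": "NO", "GAFCLA2": 35, "GAFSOA2": 50,
     "MAXIMOA": 18, "MAXIMOB": "Every_Day"},
    {"id": 3, "INSTITUC": "NO", "GAFCLA2": 80, "GAFSOA2": 75,
     "INGRESE2": "TRABAJO"},
    {"id": 4, "INSTITUC": "NO", "GAFCLA2": 55, "GAFSOA2": 60,
     "ECFOS_MISSING": "no_carer"},
    {"id": 5, "INSTITUC": "NO", "GAFCLA2": 55, "GAFSOA2": 60},
]

# Rules-induced partition: label -> patient ids
partition = {}
for p in patients:
    partition.setdefault(label_patient(p) or "residual", []).append(p["id"])
print(partition)
```

Patients matched by no rule end up in a residual class, which is the class the clustering step later has to place.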
Clustering based on rules [AIComm 96]
1.- Use the KB to find the rules-induced partition of the initial data set (Class P, Class S, Class T, residual class)
2.- Hierarchically cluster every rules-induced class and find the rules-induced prototypes
3.- Build the new data set
4.- Hierarchically cluster the new data set
5.- Retrieve the hierarchical structures of the rules-induced prototypes into the new hierarchical tree
6.- Determine the final classification from the new hierarchical tree
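The clustering-based-on-rules flow shown on these slides can be sketched in a few lines under strong simplifying assumptions: numeric data only, Euclidean distance, centroid prototypes, weighted-centroid agglomeration, and a new data set made of the class prototypes plus the residual objects. None of these choices is confirmed by the slides, the internal trees of each rules-induced class are not built, and all function names are hypothetical.

```python
from math import dist


def centroid(points, weights=None):
    """Weighted mean of a list of equal-length numeric tuples."""
    weights = weights or [1.0] * len(points)
    total = sum(weights)
    return tuple(sum(w * p[d] for p, w in zip(points, weights)) / total
                 for d in range(len(points[0])))


def agglomerate(items, k):
    """Merge the two closest clusters until k remain.

    Each item is (centre, mass, member_indices)."""
    clusters = list(items)
    while len(clusters) > k:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: dist(clusters[ab[0]][0], clusters[ab[1]][0]))
        (c1, m1, mem1), (c2, m2, mem2) = clusters[i], clusters[j]
        merged = (centroid([c1, c2], [m1, m2]), m1 + m2, mem1 + mem2)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


def clustering_based_on_rules(data, labels, k):
    """labels[i] is the rule-induced class of data[i], or None (residual)."""
    # Step 1: rules-induced partition
    classes = {}
    for idx, lab in enumerate(labels):
        classes.setdefault(lab, []).append(idx)
    residual = classes.pop(None, [])
    # Step 2 (simplified): one centroid prototype per rules-induced class
    new_dataset = [(centroid([data[i] for i in idxs]), float(len(idxs)), idxs)
                   for idxs in classes.values()]
    # Step 3: new data set = prototypes + residual objects (assumption)
    new_dataset += [(tuple(data[i]), 1.0, [i]) for i in residual]
    # Steps 4-6: cluster the new data set and cut it into k classes
    return [sorted(members) for _, _, members in agglomerate(new_dataset, k)]


# Toy data (hypothetical): two rule-induced classes and two residual objects
data = [(0, 0), (0, 1), (10, 10), (10, 11), (0.5, 0.5), (9, 10)]
labels = ["P", "P", "S", "S", None, None]
final_classes = clustering_based_on_rules(data, labels, k=2)
print(final_classes)
```

Objects covered by a rule never leave their class in this sketch, so expert knowledge constrains the result while the residual objects are placed by the data alone.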
Dendrogram (ClBR)
[Figure: dendrogram of 306 patients (HSJD) with classes Autonomous (u), Good (b), Single (o), Bad (m), Residential (i)]
Dendrogram (ClBR)
[Figure: dendrogram of 306 patients (HSJD), TMG-K: Autonomous-K (93), Alone-K (87), Dependents-K (105), Residential-K (9)]
Profiles of Dependency in Schizophrenia
Autonomous (Auto-K) (93 patients) (Cr292):
– They work
– Secondary school
– Better scores in all scales
– Do not need family help
– Hardly use health services
Alone (Alone-K) (87 patients) (C300):
– Do not have a caregiver
– Intermediate scores, they are not well
– Autonomous, but require help in domestic activities
– Exaggerated and chaotic use of health services
– Require supervision (do not attend doctor appointments…)
Dependents (Dep-K) (105 patients) (C297):
– Cannot work
– Without primary school
– Worse scores in all scales
– Require the highest amount of help from caregivers
– Use a lot of health services: long hospital stays
Institutionalized (Resid-K) (9 patients) (Ci7):
– Longest disease (23 years on average)
– Suicide attempts
– Important negative symptoms
– No help from family or health services, but from the institution
Final results [COMPSTAT2008, Physica-Verlag]
• Five types of dependency in Severe Mental Illness
  Autonomous, Living alone, Dependent in the community, Institutionalized, Incomplete
• Elicitation of implicitly known profiles
  (the living-alone profile requires special attention)
• In SMI, the impaired dimensions of DLA differ from those of elderly persons
  Motivation and volition, rather than execution
• Correct assessment of dependency in SMI cannot be restricted to
  – mobility
  – self-care
  – domestic tasks
  Only between 4 and 6% of the population with SMI have problems there
  49.28% of dependent patients with SMI would not be identified
  27% of the persons getting the best scores are in fact dependent
• FICE is not reliable to assess dependency in SMI
• Proposal of an operational definition of dependency for SMI
  (including specific items for SMI, not originally in FICE)
  – 39.3% of persons with schizophrenia are dependent
Response to neurorehabilitation in TBI
[Diagram: brain damage leads to a cognitive deficit; neuropsychological rehabilitation acts on the cognitive deficit]
No type I scientific evidence yet
Collaboration with Institut Guttmann, Spain
Goal: better know which rehabilitation programmes are more effective according to the deficit characteristics
Data on neuropsychological functions (Attention; Memory; Executive functions; Language) of 47 patients before and after treatment + prior expert knowledge
Results [MedArch 62(3)] [FAIA184]
[Figure: dendrogram with class annotations]
– Language impairment: severe speech impairment
– Resistant: severe impairment, no response
– Older, more damage
– DisExecutive: severe impairment, impaired executive functions
– Global Improvement: severe impairment, up to normality
– Valuable: starting better, up to normality
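Interpretation tools: CPGs, Memory tests [NNW05]
[Diagram: class panel graph. Rows are the classes C1, C2, C3, C4, …; columns are the variables V1, V2, …, V10, V11, …; each cell shows the histogram of one variable conditioned on one class, e.g. the histogram of V1|C1]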
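The idea behind a class panel graph, one conditional distribution per class and variable, can be sketched as below. This is only an illustration: it assumes already-discretized values, and the function name `class_panel`, the text rendering and the toy rows are invented for the example rather than taken from KLASS.

```python
from collections import Counter, defaultdict


def class_panel(rows, classes, variables):
    """Return {class: {variable: {value: relative frequency within class}}}."""
    counts = defaultdict(lambda: defaultdict(Counter))
    sizes = Counter(classes)
    for row, c in zip(rows, classes):
        for v in variables:
            counts[c][v][row[v]] += 1
    return {c: {v: {val: n / sizes[c] for val, n in counts[c][v].items()}
                for v in variables}
            for c in counts}


# Toy data (hypothetical): three objects, two variables, two classes
rows = [
    {"V1": "low", "V2": "yes"},
    {"V1": "high", "V2": "yes"},
    {"V1": "high", "V2": "no"},
]
classes = ["C1", "C1", "C2"]
panel = class_panel(rows, classes, ["V1", "V2"])

# Text rendering of the panel: one line per cell, bars scaled to 10 characters
for c in sorted(panel):
    for v, hist in panel[c].items():
        bars = ", ".join(f"{val} {'#' * round(10 * f)}"
                         for val, f in sorted(hist.items()))
        print(f"{v}|{c}: {bars}")
```

Numerical variables would first need binning before they fit this sketch.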
CPGs Memory: show class particularities
[Figure: class panel graph of the memory tests, annotated with: Non-assessable at the beginning; The best improvement; Still non-assessable after rehabilitation]
From CPG to Traffic Lights Panel
[Figure: CPG cells labelled with qualitative levels: Low, Medium, High]
Traffic lights panel for Memory assessment [AIM08]
[Figure: traffic lights panel; legend: Good, Normal, Bad]
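One possible way to go from class-conditional distributions to traffic-light colours is sketched below. The slides do not specify the colouring rule, so the criterion used here (class median compared with the pooled terciles of the variable, plus a `higher_is_better` flag giving the direction) is purely an assumption, and all names and scores are hypothetical.

```python
from statistics import median, quantiles

GOOD_WHEN_HIGH = {"low": "red", "medium": "yellow", "high": "green"}
GOOD_WHEN_LOW = {"low": "green", "medium": "yellow", "high": "red"}


def traffic_lights(values_by_class, higher_is_better=True):
    """Colour each class for one variable: red (bad), yellow (normal), green (good)."""
    pooled = [v for vals in values_by_class.values() for v in vals]
    low_cut, high_cut = quantiles(pooled, n=3)  # tercile cut points
    colours = GOOD_WHEN_HIGH if higher_is_better else GOOD_WHEN_LOW
    panel = {}
    for name, vals in values_by_class.items():
        m = median(vals)
        if m <= low_cut:
            level = "low"
        elif m >= high_cut:
            level = "high"
        else:
            level = "medium"
        panel[name] = colours[level]
    return panel


# Toy scores of one memory test per class (hypothetical values)
scores = {"C1": [1, 2, 3], "C2": [4, 5, 6], "C3": [7, 8, 9]}
print(traffic_lights(scores))
print(traffic_lights(scores, higher_is_better=False))
```

The direction flag matters because a high value is good on some scales and bad on others, which is domain knowledge the expert has to supply.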
Global TLP for the whole neuropsychological assessment
TLP supports expert conceptualization
[Figure: traffic lights panel with the classes labelled AssessableImprove, GlobalImprovement, +MemoDisExe, Resistant]
KNOWLEDGE DISCOVERY: New Domain Model
Currently in progress
– Identify useful predictors of the patient's profile (including lesion characteristics)
– Cross profiles with the rehabilitation programme done
– Proposal of a successful standard rehabilitation programme for every profile
Profiles:
• Assessable: mild to moderate neuropsychological impairment, tend to improve after treatment (up to normality). They could be assessed at the beginning of treatment
• Global Improvement: initial severe impairment, global satisfactory improvement (up to normality). The group with the greatest improvement relative to the initial conditions
• Disexecutive: initial severe impairment, generally satisfactory improvement but persisting executive functions disorder; they remain unable to develop complex routines and planning; older and with more damage than the previous group
• Resistant: initial severe impairment, mild improvement in tasks requiring minimal attention, good improvement in memory and learning skills; attention deficit remains for complex and executive tasks; cannot perform daily life activities alone. Low response to treatment
• Language: language problems and very severe global cognitive impairment. They can only improve as they recover language. Speech therapy
Long term Quality of Life perception. Spinal Cord Injury
• Maintain and improve quality of life in chronic patients
  Patient-centered approach
• QoL: multidimensional construct
  Emotional wellness, functional autonomy, social inclusion…
Collaboration with Institut Guttmann, Spain
Goal: how does a patient with SCI perceive his/her quality of life over time
Data on IBP, CIF, ESIG of 109 patients at 3 consecutive annual follow-ups after clinical discharge (2002-2008) + prior expert knowledge
Clustering based on rules by States
[Diagram: the data matrix of individuals i1, i2, i3, i4, …, in is split by state into blocks e1, e2, …, eE, each block holding the variables measured at that state (Xe1_1 … Xe1_K1, Xe2_1 … Xe2_K2, …, XeE_1 … XeE_KE). KLASS, using the knowledge base, clusters each block, giving one hierarchical tree per state (τe1, τe2, …, τeE) and one partition per state (Pe1, Pe2, …, PeE)]
Results: More typical patterns (γ ≥ 0.05)
[Figure: trajectories diagram across the 1st, 2nd and 3rd assessments]
Class codes: C59, C63, C55, C62, C49, C57, C54, C56, C46, C64, C52
Class labels: IndepPos (IndepPositius), IndepModerat, IndepModAnt, IndepMod, SemidepHetero, SemiDepNeg, Dependents, DepEstoics
Trajectories shown: T4, T6, T7, T12
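A simple reading of the "more typical patterns" slide is that a trajectory is the sequence of classes a patient visits over the three assessments, and that γ is a minimum relative frequency. That reading is an assumption, as are the function name and the toy sequences below, which only borrow class labels from the slide.

```python
from collections import Counter


def typical_trajectories(class_sequences, gamma=0.05):
    """Return {trajectory: relative frequency} for trajectories with frequency >= gamma."""
    n = len(class_sequences)
    freq = Counter(tuple(seq) for seq in class_sequences)
    return {traj: count / n
            for traj, count in freq.most_common()
            if count / n >= gamma}


# Toy sequences (hypothetical): one class label per assessment and patient
sequences = (
    [("IndepPos", "IndepPos", "IndepPos")] * 10
    + [("Dependents", "DepEstoics", "DepEstoics")] * 9
    + [("IndepMod", "SemiDepNeg", "Dependents")]
)
typical = typical_trajectories(sequences, gamma=0.1)
for traj, share in typical.items():
    print(" -> ".join(traj), f"({share:.0%})")
```

Raising the threshold keeps the diagram readable by hiding rare paths, at the cost of ignoring atypical patients.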
Interpretation of patterns [Stud. Health Tech Inform 09 (in press)]
[Figure: the same trajectories diagram across the 1st, 2nd and 3rd assessments, annotated per pattern]
– Younger. Recent lesion. Some distress. Physical autonomy. They keep stable.
– VIP2, VIP3: physical autonomy and psychological wellness, maintained over time.
– Beginning: functional autonomy, some distress. Health problems appear with time and they lose functionality. Different coping strategies. Old people, old lesion.
– High impairment. Starting with different coping strategies. Long-term adaptation to moderate distress, no anxiety.
Methodological review
– Data cleaning
– Relevant variable selection
– Prior knowledge acquisition
– Clustering based on rules (by States)
– Interpretation
  • Select number of classes
  • Class panel graph
  • Traffic lights panel
– Experts' conceptualization
– [More frequent trajectories diagram]
– Identification of profiles and labelling
– Description of profile characteristics
– Validation
Involve experts, with interaction
Conclusions
• KDD is a useful complement to partial prior expert knowledge
• Hybrid AI&Stats methodologies allow KDD in complex medical domains
• ClBR and ClBRxE proved to be useful tools for KDD
• Interpretation-oriented tools are crucial for understandable results
  – (CPG and TLP are good interpretation-support tools)
• The expert should be integrated as part of the methodology itself
• KDD helps the elicitation of implicit expert knowledge
Knowledge Discovery from Data as a
framework to decision support in medical
domains
Karina Gibert, Luis Salvador Carulla
Dpt. Statistics and Operations Research
Knowledge Engineering and Machine Learning Research group
Universitat Politècnica de Catalunya, Barcelona (Spain).
PSICOST Scientific Association
Are there any questions?...
Bridging 2009
La Pedrera, Barcelona, 4-7th March 2009