Medical Informatics: University of Ulster

Paul McCullagh
University of Ulster
[email protected]
14 June 2005
Ulster Institute of eHealth
www.uieh.n-i.nhs.uk
Stroke Web Interface

Features:
• Animation feedback to patient: 3D rendering of patient movement during rehabilitation
• Communication tools for patients and professionals
• Decision support

A home-based system is currently under development.
Application of Multimedia to Nursing Education: A Case Study Based on the Diagnosis of Alcohol Abuse

• Culture of Binge Drinking
• Multimedia as an Education Tool
• Interactive Learning
• Self Assessment
• Exemplars from Other Areas Including: Diabetes, Testicular Cancer, Anorexia
Diabetes Education

• Type I & Type II: innovative multimedia patient-centred education materials
Intelligent Consultant: Interface

• Natural Language Processing
Texture Analysis and Classification Methods for the Characterisation of Ultrasonic Images of the Placenta

Image or Region in an Image → Feature Extraction → Classification → Labeled Image (see the sketch below)
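Texture features for ultrasound characterisation are commonly derived from grey-level co-occurrence matrices. The following is a minimal scikit-image sketch of the feature-extraction stage (spelling per scikit-image ≥ 0.19); it illustrates the general technique, not necessarily the study's actual method.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_features(region):
    # region: 2-D uint8 patch, e.g. a placental ROI from the ultrasound
    glcm = graycomatrix(region, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    # Summary statistics of the co-occurrence matrix become the
    # feature vector handed to the classifier
    return {prop: graycoprops(glcm, prop).mean()
            for prop in ('contrast', 'homogeneity', 'energy', 'correlation')}
```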
The Grannum Classification
Age Related Macular Disease

• Edge detection
• Line thinning
• Threshold segmentation

(see the sketch below)
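A minimal scikit-image sketch of the three processing steps, assuming a greyscale fundus image; the particular operators (Canny, skeletonize, Otsu) and the file name are plausible stand-ins, not necessarily those used in the project.

```python
from skimage import feature, filters, io, morphology

img = io.imread('fundus.png', as_gray=True)   # hypothetical image file

edges = feature.canny(img)                    # edge detection
thinned = morphology.skeletonize(edges)       # line thinning
mask = img > filters.threshold_otsu(img)      # threshold segmentation
```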
Full-screen image of the application displaying an occult lesion with edge detection, a pixel count and a histogram of the region inside the box.
Body Surface Potential Mapping
Observations Driving Lead System Development

• Electrophysiology associated with disease is most often localized in the heart: infarction, ischemia, accessory A-V pathways, ectopic foci, "late potentials", conduction or repolarization abnormalities
• ECG manifestations of localized disease are localized on the body surface
• Clinical lead systems are not optimized for diagnostic information capture; they often do not sample regions where diagnostic information occurs
Mining for Information in Body Surface Potential Maps

• Lead selection
  – Best leads for estimating all surface potentials
  – Best leads for diagnostic information
  – Are these the same?
• Data mining techniques (see the sketch below)
  – Wrappers
  – Sequential selection
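A minimal sketch of a wrapper with sequential forward selection, assuming scikit-learn, a numeric matrix X (leads as columns) and labels y; the greedy loop, the choice of classifier and the cv=10 setting are illustrative assumptions, not the project's actual pipeline.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_select(X, y, n_keep, model=None):
    # Wrapper approach: greedily add the lead whose inclusion yields
    # the best cross-validated accuracy of the target classifier
    model = model or DecisionTreeClassifier()
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_keep and remaining:
        scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=10).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```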
Hearing Screening

• Normal hearing: a five-peak response using a high stimulus level (70 dB)
• Peak amplitudes reduce as the stimulus level is reduced
• Only wave V remains at threshold, at about a 30 dB stimulus level
Clementine data mining software was used to generate neural network and decision tree models for classification of ABR waveforms.

The individual models will make use of:
• time domain data
• frequency domain data
• correlation of subaverages
[Figure: recorded ABR waveform with the pre-stimulus and post-stimulus segments marked, plotted over samples 0–400]
Both the pre-stimulus and post-stimulus windows are processed as follows (see the sketch below):

Wavelet Decomposition (256 coefficients) → Analysis of 16 D4 Coefficients → Ratio of the sum of absolute values

[Figure: D4 wavelet coefficients for the pre-stimulus and post-stimulus windows]

The closer the ratio is to 0, the higher the probability of a response.
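A minimal sketch of the coefficient-ratio computation, assuming PyWavelets and 256-sample pre- and post-stimulus windows. Reading "D4" as the level-4 detail band of a Daubechies-4 decomposition is an assumption, as is orienting the ratio as pre over post.

```python
import numpy as np
import pywt

def d4_response_ratio(pre, post):
    # Decompose each 256-sample window with the Daubechies-4 wavelet;
    # the level-4 detail band corresponds to the slide's 16 D4
    # coefficients (boundary padding can make the count slightly larger)
    pre_d4 = pywt.wavedec(pre, 'db4', level=4)[1]
    post_d4 = pywt.wavedec(post, 'db4', level=4)[1]
    # Ratio of the sums of absolute values: values near 0 mean the
    # post-stimulus energy dominates, i.e. a likely response
    return np.abs(pre_d4).sum() / np.abs(post_d4).sum()
```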
CBR for wound healing progress

• Objective of research: automated wound healing monitoring and assessment
  – Determine size of whole wound
  – Determine tissue types present
  – Coverage of different types of tissue
  – Automatically monitor healing over time
  – Remove subjectivity
  – Improve the decision making process and care
• Technologies used
  – Case-based reasoning
  – Feature extraction/transformation
• Work to date: classification of tissue types
• Method
  – Take an image and overlay it with a grid
  – Make a prediction for each type of tissue
  – Predictions are based on the system's knowledge of previous tissue types (cases) that have been identified by professionals
  – Overall accuracy: 86%
• Publications
  – Zheng, Bradley, Patterson, Galushka, Winder, "New protocol for leg ulcer tissue classification from colour images", Proc. 26th Int. Conf. of the Engineering in Medicine and Biology Society (EMBS 04).
  – Galushka, Zheng, Patterson, Bradley, "Case-Based Tissue Classification for Monitoring Leg Ulcer Healing", 18th IEEE Int. Symposium on Computer-Based Medical Systems (CBMS 2005).
Analysis → Prediction → Comparison
Feature Selection and Classification on Type 2 Diabetic Patients’ Data

Paul McCullagh
University of Ulster
[email protected]
14 June 2005
Diabetes

• World's situation
  – Around 194 million people with diabetes, according to a WHO study
  – 50% of patients are undiagnosed
• Northern Ireland
  – 49,000 diagnosed patients in NI
  – Another 25,000 are unaware of their condition
• Type 2 diabetes (NIDDM)
• Diabetic complications
• Blood glucose control
  – HbA1c test
Data Mining

• Large amounts of information gathered in medical databases
  – Traditional manual analysis has become inadequate
  – Efficient computer-based analysis methods are indispensable
  – Noisy, incomplete and inconsistent data
• Can we determine factors which influence how well the patients progress?
• Are these factors under our control?
Relative Risk for the Development of Diabetic Complications

Source: Rahman Y, Nolan J and Grimson J. E-Clinic: Re-engineering Clinical Care Process in Diabetes Management. Healthcare Informatics Society of Ireland, 2002.
North Down Primary Care Organisation

• Quality of data in Primary Care Data Sets
• HbA1c
• Fundoscopy
[Bar chart: percentage recording of HbA1c for diabetic patients across practices 1–11; x-axis Percentage (%), 0–100]
Data Set

• Ulster Community & Hospitals Trust
• 2064 type 2 patients, 20876 records
• 1148 males, 916 females
• 410 features reduced to 47 relevant features: 23 categorical, 24 numerical
• Average 7.8% missing values
Distribution of Patients' Age

Age band | Patients
20-60    | 563
60-70    | 637
70-80    | 579
>80      | 238
Research Goals

• Identify significant factors that influence Type 2 diabetes control
  – Weight, smoking status or alcohol?
  – Height, age or gender?
  – Time interval between two tests?
  – Cholesterol level?
• Classify individuals with poor disease control in the population
  – Distinguish patients with poor blood glucose control from those with good control based on physiological and examination factors
• Predict individuals in the population with poor diabetes control status based on physiological and examination factors
• Investigate the potential of data mining techniques in a 'real world' medical domain and evaluate different data mining approaches
Data Mining Procedure

Raw Data → Pre-Processing → Clean Data → Feature Selection → Target Data → Data Mining Schemes → Model/Patterns → Post-Processing → Knowledge
Data Preprocessing

• Data Integration
  – Combine data from multiple sources into a single data table
• Data Transformation (see the sketch after this list)
  – Divide the patients into two categories (Better Control and Worse Control) by comparing the laboratory HbA1c value against the target value
  – Better: 34.33%; Worse: 65.67%
• Data Reduction
  – Remove the attributes with more than 50% missing data
  – Keep the features recommended by the diabetic expert and the international diabetes guidelines
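A minimal pandas sketch of the transformation and reduction steps; the file name, column name and HbA1c target value are hypothetical placeholders, not the actual Trust data set.

```python
import pandas as pd

TARGET_HBA1C = 7.0                         # hypothetical guideline target
df = pd.read_csv('diabetes_records.csv')   # hypothetical merged table

# Data transformation: two control categories from the HbA1c lab test
df['control'] = df['hba1c'].le(TARGET_HBA1C).map({True: 'Better', False: 'Worse'})

# Data reduction: drop attributes with more than 50% missing values
df = df.loc[:, df.isna().mean() <= 0.5]
```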
Feature Selection

• Identify and remove irrelevant and redundant information
  – Not all attributes are actually useful: noisy, irrelevant and redundant attributes
• Minimize the associated measurement costs
• Improve prediction accuracy
• Reduce the complexity
• Easier interpretation of the classification results
Feature Selection

• Information Gain: delete attributes carrying less information; also adopted in ID3 and C4.5 as the splitting criterion during the tree-growing procedure
  – A measure based on entropy (a minimal sketch follows this list)
• Relief: estimate attributes according to how well their values distinguish among instances that are near each other
  – An instance-based attribute ranking scheme
  – Randomly sample an instance I from the data
  – Locate I's nearest neighbour from the same and the opposite class
  – Compare them and update relevance scores for each attribute
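A minimal sketch of the entropy-based information gain measure, assuming integer-coded class labels y and a categorical attribute x as NumPy arrays.

```python
import numpy as np

def entropy(y):
    # Shannon entropy of the class distribution, in bits
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_gain(x, y):
    # Reduction in class entropy obtained by splitting on attribute x,
    # the same criterion ID3/C4.5 use to grow their trees
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain
```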
Top 15 predictors

1. Age
2. Diagnosis Duration
3. Insulin Treatment
4. Family History
5. Smoking
6. LabRBG
7. Diet Treatment
8. BMI
9. Glycosuria
10. Complication Type
11. BP Diastolic
12. Tablet Treatment
13. LabTriglycerides
14. General Proteinuria
15. BP Systolic
Design of Experiments

• Classification algorithms (see the sketch below)
  – Naïve Bayes: a statistical method for classification
  – IB1: instance-based nearest neighbour algorithm
  – C4.5: inductive learning algorithm using decision trees
• Sampling strategy: 10-fold cross-validation
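A minimal scikit-learn sketch of the experimental setup, assuming a prepared feature matrix X and labels y; note that scikit-learn ships CART rather than C4.5, so DecisionTreeClassifier stands in as an analogue.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    'Naive Bayes': GaussianNB(),
    'IB1 (1-NN)': KNeighborsClassifier(n_neighbors=1),
    'Decision tree (C4.5 analogue)': DecisionTreeClassifier(),
}
for name, model in models.items():
    # 10-fold cross-validation, as in the sampling strategy above
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f'{name}: {acc:.2%}')
```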
Classification Results (initial)

Attribute Number | Naïve Bayes | IB1   | C4.5  | Discretized C4.5 | Average
5                | 69.36       | 69.14 | 76.36 | 75.23            | 72.52
8                | 74.60       | 70.49 | 76.12 | 75.76            | 74.24
10               | 72.47       | 71.54 | 77.21 | 77.46            | 74.67
15               | 72.92       | 70.37 | 78.73 | 78.12            | 75.04
20               | 71.48       | 69.30 | 76.42 | 76.73            | 73.48
25               | 69.24       | 67.88 | 77.52 | 77.75            | 73.10
30               | 70.53       | 67.78 | 77.43 | 77.52            | 73.32
47               | 62.35       | 63.44 | 75.38 | 76.37            | 69.39
Average          | 70.37       | 68.74 | 76.90 | 76.87            | —

Table 1: Classification accuracy (%) for different sizes of feature subset (10-CV / training and testing)
[Line chart: classification accuracy based on 10-CV (62–82%) vs. number of features (5–47) for Naïve Bayes, IB1, C4.5, Discretized C4.5 and the average]
Sensitivity and Specificity (initial)

Entries are sensitivity/specificity per feature-subset size:

Attribute Number | Naïve Bayes | IB1         | C4.5        | Discretized C4.5
5                | 0.912/0.276 | 0.892/0.306 | 0.947/0.413 | 0.938/0.397
8                | 0.921/0.411 | 0.883/0.365 | 0.951/0.398 | 0.942/0.405
10               | 0.782/0.615 | 0.907/0.349 | 0.962/0.409 | 0.957/0.426
15               | 0.631/0.781 | 0.912/0.306 | 0.973/0.432 | 0.987/0.387
20               | 0.685/0.772 | 0.838/0.416 | 0.940/0.428 | 0.963/0.393
25               | 0.656/0.762 | 0.821/0.407 | 0.932/0.475 | 0.972/0.405
30               | 0.708/0.700 | 0.835/0.377 | 0.935/0.467 | 0.955/0.431
47               | 0.587/0.693 | 0.810/0.298 | 0.928/0.421 | 0.964/0.381
Average          | 0.735/0.625 | 0.862/0.353 | 0.946/0.430 | 0.960/0.403
Discussion

• The C4.5 decision tree algorithm had the best classification performance
• Discretization did not significantly improve the performance of C4.5 on our data set
• On average, the best results were achieved when the top 15 attributes were selected for prediction
• IB1 and Naïve Bayes did benefit from the reduction of the input parameters; C4.5 less so
• Naïve Bayes can classify both patient groups with reasonable accuracy
• Most classifiers tend to perform better at detecting the poor-control cases in the population
Relief Algorithms

• A feature weight-based method inspired by instance-based learning algorithms
• Key idea of the original Relief (see the sketch below)
  – Estimate the quality of attributes according to how well their values distinguish among instances that are near to each other
  – Does not assume that the attributes are conditionally independent
• ReliefF (Kononenko, 1994): the extension of Relief
  – Applicable to multi-class data sets
  – Tolerant of noisy and incomplete data
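A minimal sketch of the original two-class Relief procedure described above, assuming features rescaled to [0, 1] and integer class labels; m sampled instances update the per-attribute relevance scores.

```python
import numpy as np

def relief(X, y, m=100, seed=0):
    # Original Relief: sample an instance, find its nearest hit
    # (same class) and nearest miss (opposite class), then update
    # each attribute's relevance score
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                       # never pick the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        # Attributes that differ on the miss gain weight,
        # attributes that differ on the hit lose weight
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / m
```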
Optimization of ReliefF

• Data transformation
  – Frequency-based encoding scheme: represent the categorical code of a particular variable with a numerical value derived from its relative frequency among outcomes (a sketch follows this list)
• Supervised Model Construction for starter selection
  – Generates the number of instances (m) automatically, eliminating the dependency on the selection of a "good value" for m and improving the efficiency of the algorithm
  – Basic idea: group the "near" cases with the same class label
  – Similarity measurement: Euclidean distance function
  – Repeated until an instance with a different class label is encountered
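A minimal pandas sketch of one plausible reading of the frequency-based encoding scheme; encoding each category by the mean of the outcome among records sharing it is an assumption about the exact formula, not the authors' stated definition.

```python
import pandas as pd

def frequency_encode(df, column, outcome):
    # Map each categorical code to the relative frequency of the
    # positive outcome among records carrying that code, so ReliefF's
    # Euclidean distance can treat the variable numerically
    rates = df.groupby(column)[outcome].mean()
    return df[column].map(rates)
```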
Feature Selection via Supervised Model Construction

• Improve efficiency
• Retain accuracy
• The centre is a 'good' representation of the cluster
• Scope of the local region?
Experiment Design

• C4.5 as the classification algorithm
• Nine benchmark UCI data sets
  – Number of cases varies from 57 to 8,124
  – Contain a mixture of nominal and numerical attributes
• 10-fold cross-validation
• InfoGain and ReliefF were used for comparison
Number of Selected Instances

Data Set | Cases | After FSSMC | Reduction Rate (%)
Breast   | 699   | 45          | 93.6
Credit   | 690   | 159         | 77.0
Diabetes | 768   | 240         | 68.8
Glass    | 214   | 80          | 62.6
Heart    | 294   | 39          | 86.7
Iris     | 150   | 13          | 91.3
Labour   | 57    | 10          | 82.5
Mushroom | 8124  | 89          | 98.9
Soybean  | 683   | 109         | 84.0
Processing Time (in sec.) / Classification Accuracy (%)

Data Sets | C4.5 Before FS | InfoGain  | ReliefF   | FSSMC
Breast    | 0/94.6         | 0.16/94.8 | 2.05/95.3 | 0.15/95.3
Credit    | 0/86.4         | 1.15/86.7 | 3.58/86.4 | 1.20/86.8
Diabetes  | 0/74.5         | 1.26/74.1 | 2.63/75.8 | 1.34/75.8
Glass     | 0/65.4         | 0.44/69.2 | 0.56/69.6 | 0.67/69.6
Heart     | 0/76.2         | 0.22/79.9 | 0.32/80.6 | 0.41/81.2
Iris      | 0/95.3         | 0.05/95.3 | 0.10/95.3 | 0.23/95.3
Labour    | 0/73.7         | 0.22/75.4 | 0.41/75.4 | 0.51/75.4
Mushroom  | 0/100          | 0.65/100  | 446/100   | 5.86/100
Soybean   | 0/92.4         | 0.34/90.2 | 5.92/92.4 | 1.73/93.2
Average   | 0/85.1         | 0.47/86.0 | 46.2/86.6 | 1.26/86.7
Discussion

1. InfoGain: the fastest approach
2. ReliefF: takes a long time to handle large data sets
3. FSSMC:
   – Takes longer on small data sets than InfoGain and ReliefF
   – No significant classification accuracy improvement
   – Achieves the best combined results (classification accuracy and efficiency) on average
   – Overcomes the computational problem of ReliefF and preserves classification accuracy
KNN imputation: Ulster Hospital and PIMA data

[Two line charts (20 random simulations each, Ulster Hospital and PIMA): imputation error vs. fraction of missing values (5%–35%) for 5-NN, 10-NN, NORM, mean imputation, EMImpute_Columns and LSImpute_Rows]

Comparison of different methods using different fractions of missing values in the imputation process and different datasets.
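A minimal scikit-learn sketch of kNN imputation on a toy matrix; the slides compare 5-NN and 10-NN, while k=2 here only suits the tiny example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: rows are patients, columns are features, NaN = missing
X = np.array([[5.1, np.nan, 1.4],
              [4.9, 3.0, 1.4],
              [4.7, 3.2, np.nan],
              [5.0, 3.1, 1.5],
              [5.2, 3.0, 1.6]])

# Each gap is filled with the mean of that feature over the k rows
# nearest under a missing-aware Euclidean distance
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```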
Classification System Based on a Supervised Model

• The main objective is assessment of the risk of coronary heart disease (CHD) in patients with type 2 diabetes
• The k-nearest neighbour (kNN) classification algorithm will provide the basis for new decision support tools; it classifies patients according to their similarity with previous cases
• A knowledge-driven, weighted kNN (WkNN) method has been proposed to distinguish significant diagnostic markers (see the sketch below)
• A genetic algorithm (GA) that incorporates background knowledge will be developed to support this feature relevance analysis task
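A minimal sketch of a weighted kNN classifier of the kind described, assuming NumPy arrays and non-negative integer class labels; the weight vector w is what the GA described below is meant to tune.

```python
import numpy as np

def wknn_predict(X_train, y_train, x, w, k=3):
    # Per-feature weights w rescale each dimension, so markers judged
    # diagnostically significant dominate the similarity computation
    d = np.sqrt((((X_train - x) ** 2) * w).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return np.bincount(y_train[nearest]).argmax()   # majority vote
```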
Background Knowledge

[Diagram: patient records (Patient 1, Patient 2, Patient 3, …, Patient N) together with background knowledge (user feedback, constraints, ontology, annotation text) feed the WkNN classifier, which is coupled with a GA to produce the results]
Each training case carries features 1…n with associated weights W1, W2, …, Wn, plus a classification column (n+1); WkNN classifies patients 1…N (e.g. patient 1 → class 1, patient 2 → class 0, …).

Another problem: how can you choose the right weights?

The GA evolves an initial population of weight vectors into a new population, e.g. (see the sketch below):

Initial population: (0.5, 0.2, 0.9, …, 0.32), (0.7, 0.1, 0.8, …, 0.6), (0.4, 0.5, 0.6, …, 0.3), (0.38, 0.2, 0.7, …, 0.1), …, (0.83, 0.34, 0.98, …, 0.61)
New population: (0.3, 0.4, 0.7, …, 0.3), (0.7, 0.17, 0.5, …, 0.69), (0.4, 0.1, 0.8, …, 0.36), (0.44, 0.2, 0.1, …, 0.89), …, (0.61, 0.98, 0.34, …, 0.83)

Fitness is driven by the number of misclassifications:

W1   | W2   | W3   | Wn   | No. of misclassifications | Fitness
0.3  | 0.4  | 0.9  | 0.3  | 9                         | lower
0.7  | 0.17 | 0.5  | 0.69 | 15                        | lowest
0.4  | 0.1  | 0.8  | 0.36 | 5                         | lower
0.44 | 0.2  | 0.1  | 0.89 | 2                         | higher
0.61 | 0.98 | 0.34 | 0.83 | 1                         | highest
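A minimal sketch of the GA weight search outlined above, assuming NumPy arrays X, y; fitness here is the leave-one-out misclassification count of a weighted kNN (fewer errors, higher fitness), mirroring the table, while the population size, crossover and mutation settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def misclassifications(w, X, y, k=3):
    # Leave-one-out error of weighted kNN under weight vector w
    errors = 0
    for i in range(len(X)):
        d = (((X - X[i]) ** 2) * w).sum(axis=1)
        d[i] = np.inf
        nearest = np.argsort(d)[:k]
        errors += np.bincount(y[nearest]).argmax() != y[i]
    return errors

def evolve_weights(X, y, pop_size=20, generations=30):
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat))                # initial population
    for _ in range(generations):
        errs = np.array([misclassifications(w, X, y) for w in pop])
        parents = pop[np.argsort(errs)[:pop_size // 2]]  # keep the fittest half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child += rng.normal(0, 0.05, n_feat)         # small mutation
            children.append(np.clip(child, 0, 1))
        pop = np.vstack([parents, children])
    errs = [misclassifications(w, X, y) for w in pop]
    return pop[int(np.argmin(errs))]                     # best weight vector
```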
Semi-supervised Clustering

• Combines the benefits of supervised and unsupervised classification methods
• Makes use of class labels or pairwise constraints on the data to guide the clustering process
• Allows users to guide and interact with the clustering process by providing feedback during the learning and post-processing stages

Goals
• To make clustering both more effective and meaningful
• To support the selection of relevant, optimized partitions for decision support
Data → Data Preprocessing → Clustering Model → Clustering Output, with Background Information feeding the clustering model

• Data: Diabetic Patients' Records
• Data Preprocessing: Normalization, Filtration, Missing Value Estimation
• Background Information: Experts' Constraints and Feedback
• Clustering Model: Detection of Relevant Groups of Similar Data (patients) Using Different Statistical and Knowledge-Driven Optimization Criteria
• Clustering Output: Similar Groups of Data (patients) Associated with Common Characteristics (significant medical outcomes, conditions or coronary heart disease risk levels)
Initial Test

• Simple model of the proposed algorithm (a constrained-clustering sketch follows this list)
  – Original class distribution from the PIMA dataset:
    Class 1: 2, 4, 6, 8, 11, 13
    Class 2: 1, 3, 5, 7, 9, 10, 12, 14, 15
  – M set (must-link pairs): (2,4) (6,8) (8,11) (1,5) (7,9)
  – C set (cannot-link pairs): (1,2) (4,7) (6,9)
• Preliminary results
  – Class A: 2, 4, 6, 8
  – Class B: 1, 5, 7, 9, 11, 12, 13, 14, 15
  – Outliers: 3, 10
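A minimal sketch of constrained k-means in the COP-KMeans style, using must-link and cannot-link pairs like the M and C sets above; this is one standard way to exploit pairwise constraints, not necessarily the authors' proposed algorithm, and it has no outlier handling.

```python
import numpy as np

def violates(i, c, labels, must, cannot):
    # Joining cluster c must not split a must-link pair already placed
    # elsewhere, nor merge a cannot-link pair into the same cluster
    for a, b in must:
        j = b if i == a else a if i == b else None
        if j is not None and labels[j] not in (-1, c):
            return True
    for a, b in cannot:
        j = b if i == a else a if i == b else None
        if j is not None and labels[j] == c:
            return True
    return False

def cop_kmeans(X, k, must, cannot, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(iters):
        labels[:] = -1
        for i in range(len(X)):
            # Try clusters nearest-first, skipping constraint violations;
            # a point stays unassigned (-1) if every cluster violates
            for c in np.argsort(((centres - X[i]) ** 2).sum(axis=1)):
                if not violates(i, c, labels, must, cannot):
                    labels[i] = c
                    break
        for c in range(k):
            if (labels == c).any():
                centres[c] = X[labels == c].mean(axis=0)
    return labels
```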
Case Based Reasoning

• Memory-based lazy problem-solver
  – The system stores training data and waits until a new problem is received before constructing a solution
  – Differs from kNN in that case attributes can be of any type (i.e. not just numeric)
• How do CBR systems solve problems? (see the sketch below)
  – CBR systems store a set of past problem cases together with their solutions in a Case Base; e.g. a case could be a set of patient symptoms plus a diagnosis based on those symptoms
  – When a new problem case is received, the system retrieves one or more similar past cases, and re-uses or adapts their solutions to solve the new case
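A minimal retrieve-and-reuse sketch of the cycle just described; the similarity function, which mixes numeric and categorical attributes, is illustrative rather than any particular published measure.

```python
class CaseBase:
    # Lazy problem-solver: store cases, build a solution only when a
    # new problem arrives, by reusing the most similar past case
    def __init__(self):
        self.cases = []                      # (symptoms dict, diagnosis)

    def retain(self, symptoms, diagnosis):
        self.cases.append((symptoms, diagnosis))

    def _similarity(self, a, b):
        score = 0.0
        for key, value in a.items():
            if isinstance(value, (int, float)):
                # Numeric attribute: closer values score higher
                score += 1.0 / (1.0 + abs(value - b.get(key, 0.0)))
            else:
                # Categorical attribute: exact match or nothing
                score += 1.0 if value == b.get(key) else 0.0
        return score

    def solve(self, symptoms):
        # Retrieve the most similar past case and reuse its diagnosis
        best = max(self.cases, key=lambda c: self._similarity(symptoms, c[0]))
        return best[1]

cb = CaseBase()
cb.retain({'temp': 39.2, 'cough': 'dry'}, 'flu')
cb.retain({'temp': 36.8, 'cough': 'none'}, 'healthy')
print(cb.solve({'temp': 38.9, 'cough': 'dry'}))   # -> 'flu'
```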
Acknowledgements

• Medical Informatics Recognised Research Group
• NIKEL
• North-South Collaboration Team
• Roy Harper, Consultant at the Ulster Hospital
Thank You for Your Attention