Medical Informatics: University of Ulster
Paul McCullagh, University of Ulster
[email protected]
14 June 2005

Ulster Institute of eHealth
www.uieh.n-i.nhs.uk

Stroke Web Interface
Features:
• Animation feedback to the patient: 3D rendering of patient movement during rehabilitation
• Communication tools for patients and professionals
• Decision support
• A home-based system is currently under development

Application of Multimedia to Nursing Education: A Case Study Based on the Diagnosis of Alcohol Abuse
• Culture of binge drinking
• Multimedia as an education tool
• Interactive learning
• Self-assessment
• Exemplars from other areas, including diabetes, testicular cancer and anorexia

Diabetes Education
• Innovative multimedia patient-centred education materials for Type I and Type II diabetes

Intelligent Consultant: Interface
• Natural language processing

Texture Analysis and Classification Methods for the Characterisation of Ultrasonic Images of the Placenta
• Pipeline: image (or a region of an image) -> feature extraction -> classification -> labelled image

The Grannum Classification

Age Related Macular Disease
• Edge detection
• Line thinning
• Threshold segmentation
[Figure: full-screen image of the application displaying an occult lesion with edge detection, a pixel count and a histogram of the region inside the box.]

Body Surface Potential Mapping

Observations Driving Lead System Development
• Electrophysiology associated with disease is most often localized in the heart: infarction, ischemia, accessory A-V pathways, ectopic foci, "late potentials", conduction or repolarization abnormalities
• ECG manifestations of localized disease are localized on the body surface
• Clinical lead systems are not optimized for diagnostic information capture; they often do not sample the regions where diagnostic information occurs

Mining for Information in Body Surface Potential Maps
Lead selection:
• Best leads for estimating all surface potentials
• Best leads for diagnostic information
• Are these the same?
Data mining techniques (see the selection sketch below):
• Wrappers
• Sequential selection
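The slides name wrappers and sequential selection but give no algorithm. The sketch below is a minimal illustration of a wrapper-style sequential forward selection over candidate leads, using scikit-learn cross-validation as the wrapped evaluator; the logistic-regression classifier, accuracy score and data layout are assumptions for illustration only, not the group's actual lead system.

```python
# Hedged sketch: wrapper-based sequential forward selection of ECG leads.
# Assumptions: `X` is an (n_samples, n_leads) matrix of lead measurements,
# `y` holds diagnostic labels, and logistic regression + accuracy stand in
# for whatever model and criterion the original study used.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select_leads(X, y, n_leads_to_keep=8, cv=10):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_leads_to_keep:
        scores = {}
        for lead in remaining:
            trial = selected + [lead]
            model = LogisticRegression(max_iter=1000)
            scores[lead] = cross_val_score(model, X[:, trial], y, cv=cv).mean()
        best = max(scores, key=scores.get)   # lead giving the best wrapped CV score
        selected.append(best)
        remaining.remove(best)
    return selected
```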
Hearing Screening
• Normal hearing: a five-peak response using a high stimulus level (70 dB)
• Peak amplitudes reduce as the stimulus level is reduced
• Only wave V remains at threshold, at about a 30 dB stimulus level

Clementine data mining software is used to generate neural network and decision tree models for classification of ABR waveforms. The individual models will make use of:
• time domain data
• frequency domain data
• correlation of subaverages

[Figure: pre-stimulus and post-stimulus ABR data plotted over 0-400 samples.]

Wavelet Decomposition
• 256 coefficients
• Analysis of the 16 D4 coefficients
• Ratio of the sum of absolute values of the pre-stimulus and post-stimulus D4 coefficients: the closer the ratio is to 0, the higher the probability of a response (a code sketch of this ratio test is given below, after the diabetes overview slides)
[Figure: the 16 pre-stimulus and post-stimulus D4 coefficients.]

CBR for Wound Healing Progress
Objective of research: automated wound-healing monitoring and assessment
• Determine the size of the whole wound
• Determine the tissue types present
• Coverage of the different tissue types
• Automatically monitor healing over time
• Remove subjectivity
• Improve the decision-making process and care
Technologies used:
• Case-based reasoning
• Feature extraction/transformation

Work to Date: Classification of Tissue Types
Method:
• Take an image and overlay it with a grid
• Make a prediction for each type of tissue
• Predictions are made based on the system's knowledge of previous tissue types (cases) that have been identified by professionals
• Overall accuracy: 86%
Publications:
• Zheng, Bradley, Patterson, Galushka, Winder, "New Protocol for Leg Ulcer Tissue Classification from Colour Images", Proc. 26th Int. Conf. of the Engineering in Medicine and Biology Society (EMBS 04).
• Galushka, Zheng, Patterson, Bradley, "Case-Based Tissue Classification for Monitoring Leg Ulcer Healing", 18th IEEE Int. Symposium on Computer-Based Medical Systems (CBMS 2005).

[Diagram: analysis, prediction and comparison stages.]

Feature Selection and Classification on Type 2 Diabetic Patients' Data
Paul McCullagh, University of Ulster
[email protected]
14 June 2005

Diabetes
World situation:
• Around 194 million people with diabetes, according to a WHO study
• 50% of patients are undiagnosed
Northern Ireland:
• 49,000 diagnosed patients in NI
• Another 25,000 are unaware of their condition
• Type 2 diabetes (NIDDM)
• Diabetic complications
• Blood glucose control: the HbA1c test

Data Mining
• Large amounts of information are gathered in medical databases
• Traditional manual analysis has become inadequate
• Efficient computer-based analysis is indispensable
• The data are noisy, incomplete and inconsistent
• Can we determine the factors which influence how well patients progress?
• Are these factors under our control?

Relative Risk for the Development of Diabetic Complications
Source: Rahman Y, Nolan J and Grimson J, "E-Clinic: Re-engineering Clinical Care Process in Diabetes Management", Healthcare Informatics Society of Ireland, 2002.
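As referenced on the wavelet decomposition slide, here is a minimal sketch of the D4 ratio test. The slide does not specify the mother wavelet, the decomposition mode or which epoch is the numerator; the sketch assumes a Daubechies ('db4') wavelet, 256-sample epochs with periodization (so the level-4 detail band has exactly 16 coefficients) and a pre-over-post ratio, consistent with "closer to 0 means a response".

```python
# Hedged sketch: ABR response check via the ratio of summed absolute D4
# (level-4 detail) wavelet coefficients of the pre- and post-stimulus epochs.
# Wavelet choice, mode and ratio direction are assumptions, not from the slides.
import numpy as np
import pywt

def d4_ratio(pre_stimulus, post_stimulus, wavelet="db4"):
    """Return sum|D4(pre)| / sum|D4(post)| for two 256-sample epochs."""
    d4_pre = pywt.wavedec(pre_stimulus, wavelet, mode="periodization", level=4)[1]
    d4_post = pywt.wavedec(post_stimulus, wavelet, mode="periodization", level=4)[1]
    return float(np.sum(np.abs(d4_pre)) / np.sum(np.abs(d4_post)))

# The closer the returned ratio is to 0, the higher the probability of a response.
```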
North Down Primary Care Organisation
Quality of data in primary care data sets:
• HbA1c
• Fundoscopy
[Figure: bar chart showing the percentage recording of HbA1c for diabetic patients in practices 1-11.]

Data Set
• Ulster Community & Hospitals Trust
• 2064 type 2 patients, 20876 records
• 1148 males, 916 females
• 410 features reduced to 47 relevant features: 23 categorical, 24 numerical
• Average of 7.8% missing values
Distribution of patients' age: 20-60: 563, 60-70: 637, 70-80: 579, >80: 238

Research Goals
• Identify significant factors that influence Type 2 diabetes control: weight, smoking status or alcohol? Height, age or gender? The time interval between two tests? Cholesterol level?
• Classify individuals with poor disease control in the population: distinguish patients with poor blood glucose control from patients with good blood glucose control based on physiological and examination factors
• Predict individuals in the population with poor diabetes control status based on physiological and examination factors
• Investigate the potential of data mining techniques in a 'real world' medical domain and evaluate different data mining approaches

Data Mining Procedure
Raw Data -> Pre-Processing -> Clean Data -> Feature Selection -> Target Data -> Data Mining Schemes -> Model/Patterns -> Post-Processing -> Knowledge

Data Preprocessing
Data integration:
• Combine data from multiple sources into a single data table
Data transformation:
• Divide the patients into 2 categories (Better Control and Worse Control) by comparing the HbA1c laboratory test value with the target value
• Better: 34.33%; Worse: 65.67%
Data reduction:
• Remove the attributes with more than 50% missing data
• Keep the features recommended by the diabetic expert and the international diabetes guidelines

Feature Selection
Identify and remove irrelevant and redundant information: not all attributes are actually useful (noisy, irrelevant and redundant attributes). Benefits:
• Minimize the associated measurement costs
• Improve prediction accuracy
• Reduce the complexity
• Easier interpretation of the classification results

Feature Selection
• Information Gain: deletes less informative attributes; an entropy-based measure, also adopted in ID3 and C4.5 as the splitting criterion during tree growing (see the sketch below)
• Relief: estimates attributes according to how well their values distinguish among instances that are near each other; an instance-based attribute ranking scheme:
  - Randomly sample an instance I from the data
  - Locate I's nearest neighbour from the same class and from the opposite class
  - Compare them and update the relevance score for each attribute
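The slides name information gain but do not show the computation. Below is a minimal sketch using the standard entropy-based definition, IG(S, A) = H(S) - sum_v (|S_v|/|S|) H(S_v), for discrete attribute values; the function names and the tiny example are illustrative, not taken from the study's data.

```python
# Hedged sketch: information gain of a discrete attribute with respect to the
# class label, as used for feature ranking (and as the ID3/C4.5 split criterion).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    n = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lab for val, lab in zip(attribute_values, labels) if val == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Example: a perfectly predictive attribute has gain equal to the class entropy.
print(information_gain(["a", "a", "b", "b"], ["worse", "worse", "better", "better"]))  # 1.0
```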
Top 15 Predictors
1. Age
2. Diagnosis duration
3. Insulin treatment
4. Family history
5. Smoking
6. LabRBG
7. Diet treatment
8. BMI
9. Glycosuria
10. Complication type
11. BP diastolic
12. Tablet treatment
13. LabTriglycerides
14. General proteinuria
15. BP systolic

Design of Experiments
Classification algorithms:
• Naïve Bayes: a statistical method for classification
• IB1: an instance-based nearest neighbour algorithm
• C4.5: an inductive learning algorithm using decision trees
Sampling strategy: 10-fold cross validation

Classification Results (initial)
Table 1: Classification accuracy (%) for different sizes of feature subsets (10-CV / training and testing)

Attributes   Naïve Bayes   IB1     C4.5    Discretized C4.5   Average
5            69.36         69.14   76.36   75.23              72.52
8            74.60         70.49   76.12   75.76              74.24
10           72.47         71.54   77.21   77.46              74.67
15           72.92         70.37   78.73   78.12              75.04
20           71.48         69.30   76.42   76.73              73.48
25           69.24         67.88   77.52   77.75              73.10
30           70.53         67.78   77.43   77.52              73.32
47           62.35         63.44   75.38   76.37              69.39
Average      70.37         68.74   76.90   76.87              -

[Figure: classification accuracy based on 10-CV (62-82%) against the number of features (5-47) for Naïve Bayes, IB1, C4.5, discretized C4.5 and the average.]

Sensitivity and Specificity (initial)
Values are sensitivity/specificity.

Attributes   Naïve Bayes    IB1            C4.5           Discretized C4.5
5            0.912/0.276    0.892/0.306    0.947/0.413    0.938/0.397
8            0.921/0.411    0.883/0.365    0.951/0.398    0.942/0.405
10           0.782/0.615    0.907/0.349    0.962/0.409    0.957/0.426
15           0.631/0.781    0.912/0.306    0.973/0.432    0.987/0.387
20           0.685/0.772    0.838/0.416    0.940/0.428    0.963/0.393
25           0.656/0.762    0.821/0.407    0.932/0.475    0.972/0.405
30           0.708/0.700    0.835/0.377    0.935/0.467    0.955/0.431
47           0.587/0.693    0.810/0.298    0.928/0.421    0.964/0.381
Average      0.735/0.625    0.862/0.353    0.946/0.430    0.960/0.403

Discussion
• The C4.5 decision tree algorithm had the best classification performance
• Discretization did not significantly improve the performance of C4.5 on our data set
• On average, the best results were achieved when the top 15 attributes were selected for prediction
• IB1 and Naïve Bayes benefited from the reduction of the input parameters; C4.5 less so
• Naïve Bayes can classify both patient groups with reasonable accuracy
• Most classifiers tend to perform better at detecting the poor-control cases in the population

Relief Algorithms
• A feature weight-based method inspired by instance-based learning algorithms
• Key idea of the original Relief: estimate the quality of attributes according to how well their values distinguish among instances that are near to each other
• Does not assume that the attributes are conditionally independent
• ReliefF (Kononenko, 1994), the extension of Relief: applicable to multi-class data sets and tolerant of noisy and incomplete data

Optimization of ReliefF
Data transformation:
• Frequency-based encoding scheme: represent the categorical code of a particular variable with a numerical value derived from its relative frequency among the outcomes (see the sketch below)
Supervised model construction for starter selection:
• Generate the number of instances (m) automatically, eliminating the dependency on selecting a "good value" for m and improving the efficiency of the algorithm
• Basic idea: group the "near" cases that share the same class label
• Similarity measurement: Euclidean distance function
• Repeated until an instance with a different class label is encountered
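The frequency-based encoding above is described only in one bullet. The sketch below is one plausible reading of it, replacing each category by its relative frequency so that ReliefF's Euclidean distance can treat the attribute numerically; whether the original work computed frequencies per class outcome or over the whole column is not stated on the slide, so the column-wide version here is an assumption.

```python
# Hedged sketch: frequency-based encoding of a categorical attribute.
# Each category is replaced by its relative frequency in the column; the
# original scheme may instead have conditioned on the class outcome.
from collections import Counter

def frequency_encode(values):
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

smoking = ["never", "current", "never", "ex", "never", "current"]
print(frequency_encode(smoking))  # [0.5, 0.333, 0.5, 0.167, 0.5, 0.333] (approx.)
```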
Feature Selection via Supervised Model Construction (FSSMC)
• Improves efficiency
• Retains accuracy
• The centre is a 'good' representation of a cluster
• Scope of the local region?

Experiment Design
• C4.5 as the classification algorithm
• Nine benchmark UCI data sets
• The number of cases varies from 57 to 8,124
• A mixture of nominal and numerical attributes
• 10-fold cross validation
• InfoGain and ReliefF were used for comparison

Number of Selected Instances after FSSMC

Data Set    Cases   After FSSMC   Reduction Rate (%)
Breast      699     45            93.6
Credit      690     159           77.0
Diabetes    768     240           68.8
Glass       214     80            62.6
Heart       294     39            86.7
Iris        150     13            91.3
Labour      57      10            82.5
Mushroom    8124    89            98.9
Soybean     683     109           84.0

Processing Time (s) / Classification Accuracy (%)

Data Set    C4.5 before FS   InfoGain    ReliefF     FSSMC
Breast      0/94.6           0.16/94.8   2.05/95.3   0.15/95.3
Credit      0/86.4           1.15/86.7   3.58/86.4   1.20/86.8
Diabetes    0/74.5           1.26/74.1   2.63/75.8   1.34/75.8
Glass       0/65.4           0.44/69.2   0.56/69.6   0.67/69.6
Heart       0/76.2           0.22/79.9   0.32/80.6   0.41/81.2
Iris        0/95.3           0.05/95.3   0.10/95.3   0.23/95.3
Labour      0/73.7           0.22/75.4   0.41/75.4   0.51/75.4
Mushroom    0/100            0.65/100    446/100     5.86/100
Soybean     0/92.4           0.34/90.2   5.92/92.4   1.73/93.2
Average     0/85.1           0.47/86.0   46.2/86.6   1.26/86.7

Discussion
1. InfoGain: the fastest approach
2. ReliefF: takes a long time to handle large data sets
3. FSSMC:
   • Takes longer on small data sets than InfoGain and ReliefF
   • No significant improvement in classification accuracy
   • Achieves the best combined results (classification accuracy and efficiency) on average
   • Overcomes the computational problem of ReliefF while preserving classification accuracy

KNN Imputation: Ulster Hospital and PIMA Data
[Figure: imputation error over 20 random simulations on the Ulster Hospital and PIMA datasets for 5-NN, 10-NN, NORM, mean imputation, EMImpute_Columns and LSImpute_Rows at 5%-35% missing values.]
Comparison of different methods using different fractions of missing values in the imputation process and different datasets.

Classification System Based on a Supervised Model
• The main objective is assessment of the risk of cardiovascular heart disease (CHD) in patients with type 2 diabetes
• The k-nearest neighbour (kNN) classification algorithm will provide the basis for new decision support tools; it classifies patients according to their similarity with previous cases
• A knowledge-driven, weighted kNN (WkNN) method has been proposed to distinguish significant diagnostic markers
• A genetic algorithm (GA) that incorporates background knowledge will be developed to support this feature relevance analysis task

Background Knowledge
[Diagram: patients 1 to N, together with background knowledge (user feedback, constraints, ontology, annotation text), feed the WkNN classifier and the GA, which produce the results.]

[Diagram: each patient is described by features 1 to n with associated weights W1 to Wn, which feed the WkNN classification. Another problem: how can you choose the right weights? The GA evolves an initial population of candidate weight vectors (W1, W2, W3, ..., Wn) into a new population; each vector is scored by the number of misclassifications it produces, and fewer misclassifications means higher fitness. A GA sketch follows below.]
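The GA/WkNN illustration above shows example weight vectors but no algorithm. Below is a minimal, self-contained sketch of the idea under stated assumptions: fitness is the leave-one-out 1-NN accuracy of the weighted distance (so fewer misclassifications means higher fitness), with elitism, tournament selection, uniform crossover and Gaussian mutation; these operator choices are illustrative and not taken from the slides.

```python
# Hedged sketch: evolving WkNN feature weights with a simple genetic algorithm.
# Fitness = leave-one-out 1-NN accuracy under a weighted Euclidean distance.
import numpy as np

rng = np.random.default_rng(0)

def loo_weighted_1nn_accuracy(weights, X, y):
    d = np.sqrt((((X[:, None, :] - X[None, :, :]) ** 2) * weights).sum(axis=2))
    np.fill_diagonal(d, np.inf)                   # a case cannot be its own neighbour
    return float((y[d.argmin(axis=1)] == y).mean())

def evolve_weights(X, y, pop_size=30, generations=40, mutation_sd=0.1):
    n_features = X.shape[1]
    pop = rng.random((pop_size, n_features))      # initial population of weight vectors
    for _ in range(generations):
        fitness = np.array([loo_weighted_1nn_accuracy(w, X, y) for w in pop])
        new_pop = [pop[fitness.argmax()].copy()]  # elitism: keep the fittest vector
        while len(new_pop) < pop_size:
            a, b = rng.integers(0, pop_size, 2), rng.integers(0, pop_size, 2)
            p1 = pop[a[np.argmax(fitness[a])]]    # tournament selection
            p2 = pop[b[np.argmax(fitness[b])]]
            mask = rng.random(n_features) < 0.5   # uniform crossover
            child = np.where(mask, p1, p2)
            child = np.clip(child + rng.normal(0, mutation_sd, n_features), 0, 1)
            new_pop.append(child)
        pop = np.array(new_pop)
    fitness = np.array([loo_weighted_1nn_accuracy(w, X, y) for w in pop])
    return pop[fitness.argmax()]
```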
Semi-supervised Clustering
• Combines the benefits of supervised and unsupervised classification methods
• Makes use of class labels or pairwise constraints on the data to guide the clustering process
• Allows users to guide and interact with the clustering process by providing feedback during the learning and post-processing stages
Goals:
• To make clustering both more effective and more meaningful
• To support the selection of relevant, optimized partitions for decision support

[Diagram: data and background information feed the data preprocessing and clustering model stages, which produce the clustering output.]
• Data: diabetic patients' records
• Data preprocessing: normalization, filtration, missing value estimation
• Background information: experts' constraints and feedback
• Clustering model: detection of relevant groups of similar data (patients) using different statistical and knowledge-driven optimization criteria
• Clustering output: similar groups of data (patients) associated with common characteristics (significant medical outcomes, conditions or coronary heart disease risk levels)

Initial Test of a Simple Model of the Proposed Algorithm
Original class distribution from the PIMA dataset:
• Class 1: 2, 4, 6, 8, 11, 13
• Class 2: 1, 3, 5, 7, 9, 10, 12, 14, 15
Constraints:
• M (must-link) set: (2,4) (6,8) (8,11) (1,5) (7,9)
• C (cannot-link) set: (1,2) (4,7) (6,9)
Preliminary results:
• Class A: 2, 4, 6, 8
• Class B: 1, 5, 7, 9, 11, 12, 13, 14, 15
• Outliers: 3, 10

Case-Based Reasoning
• A memory-based, lazy problem-solver: the system stores the training data and waits until a new problem is received before constructing a solution
• Differs from kNN in that case attributes can be of any type (i.e. not just numeric)
How do CBR systems solve problems?
• CBR systems store a set of past problem cases together with their solutions in a case base; for example, a case could be a set of patient symptoms plus a diagnosis based on those symptoms
• When a new problem case is received, the system retrieves one or more similar past cases and re-uses or adapts their solutions to solve the new case (see the retrieval sketch at the end of the deck)

Acknowledgements
• Medical Informatics Recognised Research Group
• NIKEL North-South Collaboration Team
• Roy Harper, Consultant at the Ulster Hospital

Thank You for Your Attention
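As referenced in the case-based reasoning slide, here is a minimal retrieve-and-reuse sketch. It assumes a flat case base of problem/solution pairs with mixed numeric and categorical attributes and a simple matching-based similarity; the attribute names, value ranges and labels are illustrative only, not the group's actual case base.

```python
# Hedged sketch: retrieve-and-reuse over a tiny case base with mixed attribute
# types.  Numeric attributes contribute 1 minus a range-normalised distance,
# categorical attributes an exact-match score; the most similar past case
# supplies the proposed solution (diagnosis).
def similarity(new_case, past_case, numeric_ranges):
    score, n = 0.0, 0
    for attr, value in new_case.items():
        if attr not in past_case:
            continue
        n += 1
        if attr in numeric_ranges:               # numeric attribute
            lo, hi = numeric_ranges[attr]
            score += 1 - abs(value - past_case[attr]) / (hi - lo)
        else:                                    # categorical attribute
            score += 1.0 if value == past_case[attr] else 0.0
    return score / n if n else 0.0

def retrieve_and_reuse(new_case, case_base, numeric_ranges):
    best = max(case_base, key=lambda c: similarity(new_case, c["problem"], numeric_ranges))
    return best["solution"]                      # reuse the retrieved case's solution

case_base = [
    {"problem": {"hba1c": 9.1, "smoker": "yes"}, "solution": "poor control"},
    {"problem": {"hba1c": 6.4, "smoker": "no"},  "solution": "good control"},
]
ranges = {"hba1c": (4.0, 14.0)}
print(retrieve_and_reuse({"hba1c": 8.7, "smoker": "yes"}, case_base, ranges))  # poor control
```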