Data Mining Theory and Practice
Dr. Azuraliza Abu Bakar
http://www.ftsm.ukm.my/jabatan/ts/aab/index.htm

What is Pattern Recognition
- Pattern recognition by humans
  - perceptual
  - specialized
  - decision making
- Pattern recognition by computers
  - benefit of automated pattern recognition
  - advantage in complex calculations
- Pattern recognition from data (data mining)

Pattern Recognition from Data
Pattern recognition from data is a process of learning from or observing past data by studying the dependencies and extracting knowledge from the data.

What is Data?

No.  Studies    Education  Works  Income (D)
1    Poor       SPM        Poor   None
2    Poor       SPM        Good   Low
3    Moderate   SPM        Poor   Low
4    Moderate   Diploma    Poor   Low
5    Poor       SPM        Poor   None
6    Moderate   Diploma    Poor   Low
7    Good       MSc        Good   Medium
:
99   Poor       SPM        Good   Low
100  Moderate   Diploma    Poor   Low

What is Knowledge?
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)

What is Data Mining?
- Extraction of knowledge from data
- Exploration and analysis of large quantities of data to discover meaningful patterns in the data

How Does Data Mining Look into Data?
[Figure labels: Data; Discover Knowledge]

Data Mining: Motivation
- Huge amounts of data
- Important need for turning data into useful information
- The fast-growing amount of data, collected and stored in large and numerous databases, has exceeded the human ability to comprehend it without powerful tools

Questions
- What goods should be promoted to this customer?
- What is the probability that a certain customer will respond to a planned promotion?
- Can one predict the most profitable securities to buy or sell during the next trading session?
- Will this customer default on a loan or pay it back on schedule?
- What medical diagnosis should be assigned to this patient?
- What kinds of cars should be sold this year?
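
As a minimal sketch of how the rules under "What is Knowledge?" relate to the table under "What is Data?", the Python below encodes the seven numbered rows and the seven rules as read from the slides, then checks that every rule agrees with the rows it covers. The dictionary layout and the helper names `matches` and `rule_holds` are assumptions made for illustration; they do not come from the slides.

```python
# Rows 1-7 of the sample table, as read from the slide.
FIELDS = ("studies", "education", "works", "income")
ROWS = [dict(zip(FIELDS, r)) for r in [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSc", "Good", "Medium"),
]]

# Each rule is (conditions, allowed income values).
# "studies(Mod)" on the slide is written out as "Moderate" here.
RULES = [
    ({"studies": "Poor", "works": "Poor"}, {"None"}),
    ({"studies": "Poor", "works": "Good"}, {"Low"}),
    ({"education": "Diploma"}, {"Low"}),
    ({"education": "MSc"}, {"Medium", "High"}),
    ({"studies": "Moderate"}, {"Low"}),
    ({"studies": "Good"}, {"Medium", "High"}),
    ({"education": "SPM", "works": "Good"}, {"Low"}),
]


def matches(record, conditions):
    """True when the record satisfies every attribute=value condition."""
    return all(record[attr] == value for attr, value in conditions.items())


def rule_holds(rows, conditions, outcomes):
    """True when the rule covers at least one row and is right on all of them."""
    covered = [row for row in rows if matches(row, conditions)]
    return bool(covered) and all(row["income"] in outcomes for row in covered)


if __name__ == "__main__":
    for conditions, outcomes in RULES:
        print(conditions, "=>", sorted(outcomes), rule_holds(ROWS, conditions, outcomes))
```

Each rule is a conjunction of attribute tests, so a record either satisfies all of them or is not covered by the rule at all; a rule with "OR" on the right-hand side simply allows more than one outcome.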
Data Mining is simply...
- Finding relationships
- Making predictions

Data Mining: One Step of KDD
[Figure labels: KDD; Data mining; Task; Techniques]

Data Mining as a Step of KDD
[Figure labels, from the final result down to the data sources: Knowledge; Evaluation & Presentation; Patterns; Data Mining; Selection and Transformation; Cleaning and Integration; Databases, Data Warehouse, Flat files]

Early Steps of Data Mining
- Data preprocessing: transforms data into suitable values for the mining algorithm to find patterns
  - data discretization/representation
  - handling incomplete data, noisy data, uncertain data
- Data selection: selects the suitable data for mining purposes

Data Mining Techniques
- Decision Trees
- Neural Networks
- Genetic Algorithms
- Fuzzy Set Theory
- Rough Set Theory
- Statistical Methods (Regression Analysis)

Classification of Data Mining Systems
- Kinds of DB
  - Relational
  - Data warehouse
  - Transactional DB
  - Advanced DB system
  - Flat files
  - WWW
- Kinds of knowledge
  - Classification
  - Association
  - Clustering
  - Prediction
  - ...

Classification of Data Mining Systems
- Techniques used
  - DB-oriented techniques
  - Statistics
  - Machine learning
  - Pattern recognition
  - Neural networks
  - Rough sets
  - etc.
- Application adapted
  - Finance
  - Marketing
  - Medical
  - Stock
  - Telecommunication
  - etc.

Data Mining: Confluence of Multiple Disciplines
- Database technology
- Statistics
- High-performance computing
- Visualization
- Pattern recognition
- Machine learning
- Spatial data analysis
- Information retrieval
- Information science
- Neural networks

Data Mining
- What are we looking at?
- What are we looking for?
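
The discretization step listed under "Early Steps of Data Mining" can be illustrated with equal-width binning. The slides do not say which discretization method is used, so this is only a sketch under that assumption; the function name `equal_width_bins`, the three labels, and the choice of the first eight AGE values from the CLEV sample later in these slides are all illustrative.

```python
def equal_width_bins(values, labels):
    """Map each numeric value to a label using bins of equal width.

    The range [min, max] is split into len(labels) equal intervals;
    the maximum value is placed in the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in one bin
        else:
            index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


if __name__ == "__main__":
    # First eight AGE values of the CLEV sample shown later in these slides.
    ages = [63, 67, 67, 37, 41, 56, 62, 57]
    print(equal_width_bins(ages, ["Low", "Medium", "High"]))
```

After a step like this, a numeric attribute takes a small set of symbolic values, which is the form the rule examples in these slides use (for example studies(Poor) or income(Low)).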
Data Mining Tasks
- Prediction
- Classification
- Clustering
- Association rules
- Sequential analysis
- Deviation analysis
- Similarity analysis
- Trend analysis

Classification
Training data -> Classification algorithm -> Classification rules

Training data:

No.  Studies    Education  Works  Income (D)
1    Poor       SPM        Poor   None
2    Poor       SPM        Good   Low
3    Moderate   SPM        Poor   Low
4    Moderate   Diploma    Poor   Low
5    Poor       SPM        Poor   None
6    Moderate   Diploma    Poor   Low
7    Good       MSc        Good   Medium
:
99   Poor       SPM        Good   Low
100  Moderate   Diploma    Poor   Low

Classification rules:
If studies="poor" and work="poor" then income="none"

Classification
Classification rules applied to test data:

Studies    Education  Works  Income (D)
Moderate   Diploma    Poor   ?
Poor       SPM        Poor   ?
Moderate   Diploma    Poor   ?
Good       MSc        Good   ?
:

New data: studies="poor" and work="poor" is classified as income="none"

Types of Classifiers
- Statistical classifiers
  - Bayesian approach
  - Multiple regression
  - K-nearest neighbour
  - Naïve Bayes
  - Causal network
  - Discriminant analysis
- Neural classifiers
  - Hopfield network
  - Multilayer perceptron
  - Radial basis function
  - Kohonen networks
- Rough classifier

DATASET

No.  Studies    Education  Works  Income (D)
1    Poor       SPM        Poor   None
2    Poor       SPM        Good   Low
3    Moderate   SPM        Poor   Low
4    Moderate   Diploma    Poor   Low
5    Poor       SPM        Poor   None
6    Moderate   Diploma    Poor   Low
7    Good       MSc        Good   Medium
:
99   Poor       SPM        Good   Low
100  Moderate   Diploma    Poor   Low

RULES
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)

Comparing Classifiers
- Predictive accuracy
- Speed
- Robustness
- Scalability
- Interpretability

Data Mining: Problems and Challenges
- Noisy data
- Large databases
- Dynamic databases
- Incomplete data
- Performance issues
  - Time and memory constraints
  - Predictive ability

Performance Issues
- number of examples necessary for training
- cost of assuring good accuracy

Performance Issues: Time and Memory Constraints
- time complexity of the learning phase
- time taken for evaluation
- time it takes to reach a certain level of accuracy

Performance Issues: Predictive Ability
- being able to predict the correct decision for test or unseen data
- involves the generation of rules
- measuring the quality or accuracy of rules

Samples of the CLEV Dataset (before scaling)

No. AGE  SEX     CP              TRESTBPS  CHOL  FBS  RESTECG   THALACH  EXANG  OLDPEAK  SLOPE      CA  THAL          DISEASE
1   63   Male    Typical angina  145       233   T    LV hyper  150      No     2.3      Downslope  0   Fixed defect  No
2   67   Male    Asymp           160       286   F    LV hyper  108      Yes    1.5      Flat       3   Normal        Yes
3   67   Male    Asymp           120       229   F    LV hyper  129      Yes    2.6      Flat       2   Reversable    Yes
4   37   Male    Non-anginal     130       250   F    Normal    187      No     3.5      Downslope  0   Normal        No
5   41   Female  Atypical        130       204   F    LV hyper  172      No     1.4      Upsloping  0   Normal        No
6   56   Male    Atypical        120       236   F    Normal    178      No     0.8      Upsloping  0   Normal        No
7   62   Female  Asymp           140       268   F    LV hyper  160      No     3.6      Downslope  2   Normal        Yes
8   57   Female  Asymp           120       354   F    Normal    163      Yes    0.6      Upsloping  0   Normal        No
9   63   Male    Asymp           130       254   F    LV hyper  147      No     1.4      Flat       1   Reversable    Yes
10  53   Male    Asymp           140       203   T    LV hyper  155      Yes    3.1      Downslope  0   Reversable    Yes
11  57   Male    Asymp           140       192   F    Normal    148      No     0.4      Flat       0   Fixed defect  No
12  56   Female  Atypical        140       294   F    LV hyper  153      No     1.3      Flat       0   Normal        No
13  56   Male    Non-anginal     130       256   T    LV hyper  142      Yes    0.6      Flat       1   Fixed defect  Yes
14  44   Male    Atypical        120       263   F    Normal    173      No     0        Upsloping  0   Reversable    No
15  52   Male    Non-anginal     172       199   T    Normal    162      No     0.5      Upsloping  0   Reversable    No
16  57   Male    Non-anginal     150       168   F    Normal    174      No     1.6      Upsloping  0   Normal        No
17  48   Male    Atypical        110       229   F    Normal    168      No     1        Downslope  0   Reversable    Yes
18  54   Male    Asymp           140       239   F    Normal    160      No     1.2      Upsloping  0   Normal        No
19  48   Female  Non-anginal     130       275   F    Normal    139      No     0.2      Upsloping  0   Normal        No
20  49   Male    Atypical        130       266   F    Normal    171      No     0.6      Upsloping  0   Normal        No

Rules Generated from the Data Mining Process
oldpeak(0.7) => disease(No)
oldpeak(4.4) => disease(Yes)
chol(233) AND restecg(LV hypertrophy) => disease(No)
chol(204) AND restecg(LV hypertrophy) => disease(No)
chol(236) AND restecg(Normal) => disease(No)
chol(203) AND restecg(LV hypertrophy) => disease(Yes)
chol(294) AND restecg(LV hypertrophy) => disease(No)
chol(275) AND restecg(Normal) => disease(No)
chol(266) AND restecg(Normal) => disease(No)
chol(247) AND restecg(Normal) => disease(No)
chol(219) AND restecg(LV hypertrophy) => disease(No)
chol(266) AND restecg(LV hypertrophy) => disease(Yes)
chol(304) AND restecg(Normal) => disease(No)
chol(254) AND restecg(Normal) => disease(Yes)
chol(267) AND restecg(Normal) => disease(Yes)
chol(264) AND restecg(LV hypertrophy) => disease(No)
chol(234) AND restecg(LV hypertrophy) => disease(No)
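
The slides name "measuring the quality or accuracy of rules" as part of predictive ability. A rough sketch of one way to do that is to count, for a rule, how many rows it covers and on how many of those it predicts the right class. The CHOL, RESTECG and DISEASE values below are the 20 CLEV sample rows as read from the table above; the function name `evaluate` and the broader rule restecg(Normal) => disease(No) are assumptions added for comparison and are not among the generated rules.

```python
# (chol, restecg, disease) for the 20 CLEV sample rows, as read from the table.
# The table abbreviates "LV hypertrophy" to "LV hyper"; that spelling is kept.
SAMPLE = [
    (233, "LV hyper", "No"), (286, "LV hyper", "Yes"), (229, "LV hyper", "Yes"),
    (250, "Normal", "No"), (204, "LV hyper", "No"), (236, "Normal", "No"),
    (268, "LV hyper", "Yes"), (354, "Normal", "No"), (254, "LV hyper", "Yes"),
    (203, "LV hyper", "Yes"), (192, "Normal", "No"), (294, "LV hyper", "No"),
    (256, "LV hyper", "Yes"), (263, "Normal", "No"), (199, "Normal", "No"),
    (168, "Normal", "No"), (229, "Normal", "Yes"), (239, "Normal", "No"),
    (275, "Normal", "No"), (266, "Normal", "No"),
]


def evaluate(conditions, outcome, rows=SAMPLE):
    """Return (covered, correct) for the rule 'conditions => disease(outcome)'."""
    covered = 0
    correct = 0
    for chol, restecg, disease in rows:
        record = {"chol": chol, "restecg": restecg}
        if all(record[attr] == value for attr, value in conditions.items()):
            covered += 1
            if disease == outcome:
                correct += 1
    return covered, correct


if __name__ == "__main__":
    # A generated rule from the slides: chol(233) AND restecg(LV hypertrophy) => disease(No)
    print(evaluate({"chol": 233, "restecg": "LV hyper"}, "No"))
    # Hypothetical broader rule, for comparison only: restecg(Normal) => disease(No)
    print(evaluate({"restecg": "Normal"}, "No"))
```

On these 20 rows the exact-value rule covers a single row and is correct on it, while the hypothetical broader rule covers 11 rows and is correct on 10 of them. Rules tied to one exact cholesterol value fit the rows they were generated from but say little about unseen data, which is one reason the slide title notes that this sample is shown before scaling.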