Data Mining for the NHS Information Authority
Brief review by Evandro Leite Jr

Is it maths, management or computer science?
Data mining definition:
- Analysis of large volumes of data to extract important trends and higher-level information.
"We are drowning in data, but starving for knowledge!" (J. Naisbett)
Data mining became a Computer Science subject in the last 10 years, but it will always use mathematics as its base.

Some quick definitions
• Variables:
  - Continuous: its measured values are real numbers (ex. 73.827, 23).
  - Categorical: takes values in a finite set without any natural ordering (ex. black, red, green).
  - Ordered: takes values in a finite set with some way of sorting its elements (ex. age in years, an interval of integers, 01/09/2004).
  - Dependent variable, or set of classes: the aspect of the data to be studied.
  - Independent variables, or set of attributes: the variables that are manipulated to explain the dependent variable.
• Types of problems:
  - Regression-type -> dependent variable = Continuous. Ex: house selling price (£ value); the price is a real number.
  - Classification-type -> dependent variable = Categorical. Ex: who will graduate (yes, no); yes and no are categories.
DECISION TREES SOLVE BOTH CLASSIFICATION AND REGRESSION PROBLEMS

The focus of the project
There are many mathematical and computing tools that can be applied to data mining: association rules, regression, classification and clustering. For now the focus has been given to:
• Classification using Classification Trees.
• Regression using Regression Trees.

Classification Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS CATEGORICAL.
Ex: Explain the reasons patients die after going to the ICU. Outcome (Survived/Died).
Aim: To understand complex datasets by splitting them into subsets with less entropy. The key is how to choose the best attribute to split the data.

Classification Trees: how to choose the best attribute to split?
Gini impurity
Used by the CART algorithm (Classification and Regression Trees). Suppose y takes on values in {1, 2, …, m}, and let f(i, j) = frequency of value j in node i; that is, f(i, j) is the proportion of records assigned to node i for which y = j.
I_G(i) = 1 - Σ_{j=1..m} f(i, j)^2
Entropy
Used by the C4.5 and C5.0 algorithms. This measure is based on the concept of entropy used in information theory.
I_E(i) = - Σ_{j=1..m} f(i, j) log f(i, j)
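To make the two splitting criteria above concrete, here is a minimal sketch in Python (the language, the function names, and the use of base-2 logarithms are illustrative assumptions, not something stated on the slides). It computes both impurity measures from the class proportions f(i, j) of a node.

```python
from collections import Counter
import math

def class_frequencies(labels):
    """f(i, j): proportion of records in the node whose class is j."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini_impurity(labels):
    """I_G(i) = 1 - sum_j f(i, j)^2  (the criterion used by CART)."""
    return 1.0 - sum(f * f for f in class_frequencies(labels))

def entropy(labels):
    """I_E(i) = -sum_j f(i, j) * log2 f(i, j)  (the criterion used by C4.5 / C5.0)."""
    return -sum(f * math.log2(f) for f in class_frequencies(labels) if f > 0)

# Toy node: the 14 records of the play-golf dataset shown on a later slide,
# of which 9 are "Play" and 5 are "Don't Play".
node = ["Play"] * 9 + ["Don't Play"] * 5
print(round(gini_impurity(node), 3))  # 0.459
print(round(entropy(node), 3))        # 0.940 bits
```

A candidate split on an attribute is then scored by the impurity of the child nodes weighted by their sizes, and the attribute giving the largest impurity reduction is chosen.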
Gaps in knowledge
• Are there other algorithms and functions to be found?
• What are the best functions and algorithms for each dataset?
• The way to measure the goodness of a tree is known. However, finding the best-sized tree is an NP-complete problem. How can that be improved?
• How to combine the best from neural networks, support vector machines, association rules, decision trees etc. to create a meta-learner and a meta-meta-learner?

A software which can implement multiple algorithms
• The software will be able to run the different algorithms on the same dataset.
  - Trees generated by the different algorithms will be created and compared. The user will be able to compare them visually, or to pick the one with the lowest misclassification rate or model complexity.
  - Depending on the nature of the problem (classification or regression), a specific algorithm can be much more efficient.

Last presentation's play golf dataset
Independent variables: OUTLOOK, TEMPERATURE, HUMIDITY, WINDY. Dependent variable: PLAY.
OUTLOOK   TEMPERATURE  HUMIDITY  WINDY  PLAY
sunny     85           85        FALSE  Don't Play
sunny     80           90        TRUE   Don't Play
overcast  83           78        FALSE  Play
rain      70           96        FALSE  Play
rain      68           80        FALSE  Play
rain      65           70        TRUE   Don't Play
overcast  64           65        TRUE   Play
sunny     72           95        FALSE  Don't Play
sunny     69           70        FALSE  Play
rain      75           80        FALSE  Play
sunny     75           70        TRUE   Play
overcast  72           90        TRUE   Play
overcast  81           75        FALSE  Play
rain      71           80        TRUE   Don't Play

Comparison between decision tree algorithms
AnswerTree solution using the famous CART algorithm (SPSS Analytical Software).
Spartacus Data Mining tools using the C4.5 algorithm (Southampton University).

End of the introductory part

Part 1: The meta and meta-meta-learners
• The meta-learner
  - The user will choose the dataset and the variables.
  - A trial of different runs, using combinations of different methods, will be the input of a neural network (the meta-learner).
[Diagram: the dataset is run through classifiers such as C1 (C&RT) and C2 (QUEST); each run's set of rules, data quality, and CPU time & memory utilisation feed a neural network (the meta-learner), which identifies the optimal data quality, the simpler rules, and the total CPU time and memory utilisation, e.g. Σ_c memory(c)/CPU(c) over the total time.]

The meta-meta-learners
[Diagram: the dataset is passed to several meta-learners (Meta-Learner 1, Meta-Learner 2, Meta-Learner 3) built from combinations of methods such as C&RT, neural networks, linear discriminant, association rules, C4.5 and STR-Tree.]
The meta-meta-learner could be user defined, e.g. a function like:
Best meta-learner = DataQuality * A + SimplerRules * B - Memory * C - Time * D
or a neural network (probably not necessary).

The meta-meta-learners: user input and output
Input:
  Dataset name? NHS
  Dependent variables? LOS, OUTCOME, STROKE
  How much do you care about: Data quality (0-99), Parsimonious models (0-99), Time to process (0-99), Memory utilisation (0-99)
Output:
  The best meta-learner for you is a combination of: C4.5, ANN and association rules. These are the best rules:
  1- IF HEART ATTACK and AGE > 90 then OUTCOME = DEATH (error 3%)
  2- Everybody that has STROKE also has HIGH BLOOD PRESSURE
  3- AGE * 2.3 + APACHE II * 0.4 = LOS (error 25%)

A software which can implement multiple algorithms
Once the best meta-learner is found for a given situation, dataset and dependent variable, the user can define this meta-learner as the one to be executed in similar situations.
  - Ex: To find out the patients' LOS in the ICU datasets, ML3 (C&RT) will be used. However, to find out the outcome of the patient (died or survived), ML103 (C4.5, association rules) will be used.

No more slides

Nice things about decision trees

Regression Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS CONTINUOUS.
Ex: Time a patient stays in the hospital (LOS in days).
Aim: To reduce the entropy of a dataset by splitting it into subsets with less entropy. The key is how to choose the best attribute to split the data.
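The regression-tree case can be sketched in the same way. The slide above speaks of reducing entropy, but for a continuous dependent variable such as LOS the usual CART-style impurity measure is the variance of the node (mean squared deviation from the node mean); the minimal sketch below assumes that criterion, and the LOS values are hypothetical, purely for illustration.

```python
def variance(values):
    """Impurity of a regression-tree node: mean squared deviation from the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_score(left, right):
    """Weighted impurity of the two child nodes produced by a candidate split.
    The attribute/threshold with the lowest score gives the biggest impurity reduction."""
    n = len(left) + len(right)
    return (len(left) * variance(left) + len(right) * variance(right)) / n

# Hypothetical LOS values (days) for the patients sent left/right by a candidate split.
los_left, los_right = [2, 3, 3, 4], [10, 12, 15]
print(round(variance(los_left + los_right), 2))   # 23.43, impurity before the split
print(round(split_score(los_left, los_right), 2))  # 2.1, impurity after the split
```

As with the classification criteria, the attribute whose split lowers this weighted impurity the most is chosen at each node.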