Data Mining: Knowledge Discovery in Databases
Peter van der Putten
ALP Group, LIACS
Pre-University College LAPP-Top Computer Science
February 2005

Topics
• Lecture
• Demo Data Mining tool
• Exercises Data Mining tool
• Breaks TBD

Data mining case study: DARPA’s Bio-surveillance Agent

Data mining case study: Credit Scoring for Loan Acceptance
• Expert knowledge: 29.8% accepted, 12.7% infection
• Prediction model plus rules: 34.5% accepted, 9.1% infection
(Figure: all applications versus accepted volume. © Chordiant Software)

Data mining case study: Classifying Leukemia
• Problem:
  – Leukemia (different types of Leukemia cells look very similar)
  – Given data for a number of samples (patients), can we
    • Accurately diagnose the disease?
    • Predict outcome for given treatment?
    • Recommend best treatment?
• Solution
  – Data mining on micro-array data
• 38 training patients, 34 test patients, ~7,000 patient attributes (microarray gene data)
• 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
• Use training data to build a diagnostic model
• Results on test data: 33/34 correct; the 1 error may be a mislabeled sample

5 million terabytes created in 2002
• UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
• Twice as much information was created in 2002 as in 1999 (~30% growth rate)
• Other growth rate estimates are even higher
• Very little data will ever be looked at by a human
• Knowledge Discovery is NEEDED to make sense and use of data.

Dilbert puts data mining in perspective

Sources of (artificial) intelligence
• Reasoning versus learning
• Learning from data
  – Patient data
  – Customer records
  – Stock prices
  – Piano music
  – Criminal mugshots
  – Websites
  – Robot perceptions
  – Etc.

Some working definitions…
• ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably
• Data mining =
  – The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data
• Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, …

Some working definitions…
• Concepts: kinds of things that can be learned
  – Aim: intelligible and operational concept description
  – Example: the relation between patient characteristics and the probability of being diabetic
• Instances: the individual, independent examples of a concept
  – Example: a patient, candidate drug etc.
• Attributes: measuring aspects of an instance
  – Example: age, weight, lab tests, microarray data etc.

Pattern or attribute space

Data mining tasks
• Predictive data mining
  – Classification: classify an instance into a category
  – Regression: estimate some continuous value
• Descriptive data mining
  – Matching & search: finding instances similar to x
  – Clustering: discovering groups of similar instances
  – Association rule extraction: if a & b then c
  – Summarization: summarizing group descriptions
  – Link detection: finding relationships
  – …

Data Mining Tasks: Search
• Finding best matching instances
• Every instance is a point in pattern space.
• Attributes are the dimensions of an instance, e.g. age, weight, gender etc.
• Pattern spaces may be high dimensional (10 to thousands of dimensions)
(Figure: instances as points in pattern space; axes e.g. age and weight)

Data Mining Tasks: Classification
• The goal of a classifier is to separate classes on the basis of known attributes
• The classifier can be applied to an instance with unknown class
• For instance, classes are healthy (circle) and sick (square); attributes are age and weight
(Figure: two classes in pattern space; axes age and weight)

Data Mining Tasks: Clustering
• Clustering is the discovery of groups in a set of instances
• Groups are different, instances in a group are similar
• In 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user
• In >3 dimensions this is not possible
(Figure: groups of instances in pattern space; axes e.g. age and weight)

Examples of Classification Techniques
• Majority class vote
• Machine learning & AI
• Decision trees
• Nearest neighbor
• Neural networks
• Genetic algorithms / evolutionary computing
• Artificial Immune Systems
• Good old statistics
• …

Example Classification Algorithm 1: Decision Trees
20000 patients: age > 67?
  yes: 1200 patients: weight > 85kg?
    yes: 400 patients, diabetic (50%)
    no: 800 patients, diabetic (10%)
  no: 18800 patients: gender = male?
    no: etc.

Decision Trees in Pattern Space
• The goal of a classifier is to separate classes (circle, square) on the basis of attribute age and income
• Each line corresponds to a split in the tree
• Decision areas are ‘tiles’ in pattern space
(Figure: axis-parallel splits in pattern space; axes age and weight)

Special Cases of Decision Trees
• Depth = 0
  – Majority class classifier (ZeroR)
• Depth = 1
  – One question only
  – Also known as decision stump
• Depth = n
  – Any number of branches
• Various algorithms exist to learn the tree from data
  – The major difference is the criterion that determines on what attribute value to split

Example classification algorithm 2: Nearest Neighbour
• Data itself is the classification model, so no abstraction like a tree etc.
• For a given instance x, search the k instances that are most similar to x
• Classify x as the most occurring class among the k most similar instances

Nearest Neighbor in Pattern Space: Classification
• Any decision area possible
• Condition: enough data available
(Figure: a new instance among labeled instances in pattern space; axes e.g. age and weight)

Example classification algorithm 3: Neural Networks
• Inspired by neuronal computation in the brain (McCulloch & Pitts 1943 (!))
• Input (attributes) is coded as activation on the input layer neurons; activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons (for instance diabetic yes/no)
• Algorithm learns to find optimal weights using the training instances and a general learning rule.
(Figure: network with input, e.g. customer characteristics, and output, e.g. response)

Neural Networks
• Example simple network (2 layers)
  – Inputs: age, body_mass_index
  – Weights: weight_age, weight_body_mass_index
  – Output: probability of being diabetic
• Probability of being diabetic = f(age * weight_age + body_mass_index * weight_body_mass_index)

Neural Networks in Pattern Space: Classification
• Simple network: only a line available (why?) to separate classes
• Multilayer network: any classification boundary possible
(Figure: classification boundaries in pattern space; axes e.g. age and weight)

Decision Tree Demo in WEKA, an open source mining tool

Descriptive data mining: association rules
• Discovery of interesting patterns
• Rule format: if A (and B and C etc.) then Z
• Example:
  – If customer buys potatoes (A) and sauerkraut (B) then customer buys sausage (Z)
• Quality measures for a rule
  – Support of the condition: how often do potatoes and sauerkraut occur together (A, B)
  – Confidence of the rule: how often do sausages then occur, divided by the support of the condition (is A, B → Z always true?)
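
The two quality measures can be made concrete with a small sketch. The shopping baskets below and the function names `support` and `confidence` are invented for illustration; they do not come from the lecture or from WEKA, and confidence is computed here directly from counts.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical shopping baskets, made up for this example only.
BASKETS: List[FrozenSet[str]] = [
    frozenset({"potatoes", "sauerkraut", "sausage"}),
    frozenset({"potatoes", "sauerkraut", "sausage", "beer"}),
    frozenset({"potatoes", "sauerkraut"}),
    frozenset({"potatoes", "bread"}),
    frozenset({"bread", "beer"}),
]


def count_containing(items: Iterable[str], baskets: List[FrozenSet[str]]) -> int:
    """Number of baskets that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for basket in baskets if wanted <= basket)


def support(items: Iterable[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of all baskets in which the items occur together."""
    return count_containing(items, baskets) / len(baskets)


def confidence(condition: Iterable[str], consequence: Iterable[str],
               baskets: List[FrozenSet[str]]) -> float:
    """Of the baskets matching the condition, the fraction that also
    contain the consequence. Returns 0.0 if the condition never occurs."""
    cond = frozenset(condition)
    n_cond = count_containing(cond, baskets)
    if n_cond == 0:
        return 0.0
    n_both = count_containing(cond | frozenset(consequence), baskets)
    return n_both / n_cond


if __name__ == "__main__":
    rule_if = {"potatoes", "sauerkraut"}
    rule_then = {"sausage"}
    # Potatoes and sauerkraut occur together in 3 of the 5 baskets.
    print("support:", support(rule_if, BASKETS))
    # Sausage is present in 2 of those 3 baskets.
    print("confidence:", confidence(rule_if, rule_then, BASKETS))
```

With this toy data the rule holds in two of the three baskets that match its condition, so it is frequent but not always true.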
Association rule demo in WEKA

Some examples of my research areas (jointly with students)
• Mix between applications and new algorithms
  – Video mining: recognize settings, porn filtering
  – Artificial Immune Systems: copying learning ability of immune systems
  – Predicting Survival Rate for Throat Cancer Patients
  – Crime Data Mining
  – Fusing Data from Multiple Sources
  – Decisioning: offering the right product to the right customer using predictions
  – Bias variance evaluation: distinguish between different sources of error for a classifier

What have we learned so far?
• Case Studies
• Learning versus reasoning
• Data mining definitions
• Data mining tasks
• Example data mining techniques for classification
• Example data mining techniques for association rules
• WEKA Demos
• And now: lab sessions
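
As a warm-up for the lab sessions, the nearest neighbour procedure from the lecture (search the k most similar instances, then take the most occurring class) can be sketched in a few lines. The patient data, the use of Euclidean distance as the similarity measure, and the function name `knn_classify` are assumptions made for illustration; this is not how WEKA implements it.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Instance = Tuple[Tuple[float, float], str]  # ((age, weight), class label)

# Hypothetical training instances, invented for this example only.
TRAIN: List[Instance] = [
    ((25, 60), "healthy"),
    ((30, 65), "healthy"),
    ((35, 70), "healthy"),
    ((68, 90), "sick"),
    ((72, 95), "sick"),
    ((75, 88), "sick"),
]


def knn_classify(x: Sequence[float], train: List[Instance], k: int = 3) -> str:
    """Classify x as the most occurring class among its k nearest instances.

    Similarity is assumed to be Euclidean distance in pattern space.
    """
    # The data itself is the model: rank all stored instances by distance to x.
    nearest = sorted(train, key=lambda inst: math.dist(x, inst[0]))[:k]
    # Majority vote over the class labels of the k nearest instances.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_classify((70, 92), TRAIN))
    print(knn_classify((28, 62), TRAIN))
```

In practice attributes measured on different scales are usually normalised first, since otherwise the attribute with the largest range dominates the distance.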