Introduction to data mining
Peter van der Putten, Leiden Institute of Advanced Computer Science, Leiden University, [email protected]
Transcriptomics and Proteomics in Zebrafish workshop, Leiden University, March 9, 2006

Presentation Outline
• Objective
– Present the basics of data mining
– Gain understanding of the potential for applying it in the bioinformatics domain

Agenda Today
• Data mining definitions
• Before starting to mine….
• Descriptive data mining
– Dimension reduction & projection
– Clustering
– Association rules
• Predictive data mining concepts
– Classification and regression
– Bioinformatics applications
• Predictive data mining techniques
– Logistic regression
– Nearest neighbor
– Decision trees
– Naive Bayes
– Neural networks
• Evaluating predictive models
• Demonstration (optional)

The Promise…. (image slides)

The Solution….
• NCBI tools for data mining:
– Nucleotide sequence analysis
– Protein sequence analysis
– Structures
– Genome analysis
– Gene expression
• Data mining or not?

What is data mining? Sources of (artificial) intelligence
• Reasoning versus learning
• Learning from data
– Patient data
– Genomics, proteomics
– Customer records
– Stock prices
– Piano music
– Criminal mug shots
– Websites
– Robot perceptions
– Etc.

Some working definitions….
• 'Data Mining' and 'Knowledge Discovery in Databases' (KDD) are used interchangeably
• Data mining =
– The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data
• Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, ….

Some working definitions….
• Bioinformatics =
– Bioinformatics is the research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data [http://www.bisti.nih.gov/]
– Or more pragmatic: bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems [Wikipedia, Nov 2005]

Bioinformatics and data mining
• From sequence to structure to function
• Genomics (DNA), Transcriptomics (RNA), Proteomics (proteins), Metabolomics (metabolites)
• Pattern matching and search
• Sequence matching and alignment
• Structure prediction
– Predicting structure from sequence
– Protein secondary structure prediction
• Function prediction
– Predicting function from structure
– Protein localization
• Expression analysis
– Genes: micro array data analysis etc.
– Proteins
• Regulation analysis

Bioinformatics and data mining
• Classical medical and clinical studies
• Medical decision support tools
• Text mining on medical research literature (MEDLINE)
• Spectrometry, imaging
• Systems biology and modeling biological systems
• Population biology & simulation
• Spin off: biologically inspired computational learning
– Evolutionary algorithms, neural networks, artificial immune systems

Genomic Microarrays – Case Study
• Problem:
– Leukemia (different types of leukemia cells look very similar)
– Given data for a number of samples (patients), can we
• Accurately diagnose the disease?
• Predict outcome for a given treatment?
• Recommend the best treatment?
• Solution
– Data mining on micro-array data

Microarray data
• 50 most important genes
• Rows: genes
• Columns: samples / patients

Example: ALL/AML data
• 38 training patients, 34 test patients, ~7,000 patient attributes (micro array gene data)
• 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
• Use the training data to build a diagnostic model (ALL vs AML)
• Results on test data: 33/34 correct; the 1 error may be mislabeled

The Knowledge Discovery Process
• Cycle: problem objectives; data mining objectives & design; data understanding; data preparation; modeling; evaluation; deployment, application & monitoring

Some working definitions….
• Concepts: kinds of things that can be learned
– Aim: an intelligible and operational concept description
– Example: the relation between patient characteristics and the probability to be diabetic
• Instances: the individual, independent examples of a concept
– Example: a patient, candidate drug etc.
• Attributes: measuring aspects of an instance
– Example: age, weight, lab tests, microarray data etc.
• Pattern or attribute space

Data mining tasks
• Descriptive data mining
– Matching & search: finding instances similar to x
– Clustering: discovering groups of similar instances
– Association rule extraction: if a & b then c
– Summarization: summarizing group descriptions
– Link detection: finding relationships
– …
• Predictive data mining
– Classification: classify an instance into a category
– Regression: estimate some continuous value

Before starting to mine….
• Pima Indians Diabetes Data
– Scatter plots: X = body mass index, Y = age

Before starting to mine….
• Attribute Selection
– This example: InfoGain by attribute
– Keep the most important ones
– (Chart: attributes ranked by InfoGain on a 0.00-0.20 scale, from low to high: diastolic blood pressure (mm Hg), diabetes pedigree function, number of times pregnant, triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), age (years), body mass index (weight in kg/(height in m)^2), plasma glucose concentration at 2 hours in an oral glucose tolerance test)

Before starting to mine….
• Types of Attribute Selection
– Univariate versus multivariate (subset selection)
• The fact that attribute x is a strong univariate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model
– Filter versus wrapper
• Wrapper methods involve the subsequent learner (classifier or other)

Dimension Reduction
• Projecting high-dimensional data into a lower dimension
– Principal Component Analysis
– Independent Component Analysis
– Fisher Mapping, Sammon's Mapping etc.
– Multi Dimensional Scaling
– ….
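To make the dimension-reduction idea above concrete, here is a minimal sketch of projecting high-dimensional data onto its first two principal components with scikit-learn. This is an illustration, not part of the original slides; the random stand-in data, the choice of two components and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for a high-dimensional dataset, e.g. patients x gene expression values
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))           # 100 instances, 50 attributes

# Project onto the first two principal components (directions of maximal variance)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (100, 2): now suitable for a 2-D scatter plot
print(pca.explained_variance_ratio_)     # fraction of variance captured by each component
```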
Data Mining Tasks: Clustering
• Clustering is the discovery of groups in a set of instances
• Groups are different, instances in a group are similar
• In 2 to 3 dimensional pattern space (f.e. weight vs f.e. age) you could just visualise the data and leave the recognition to a human end user; in >3 dimensions this is not possible

Clustering Techniques
• Hierarchical algorithms
– Agglomerative
– Divisive
• Partition based clustering
– K-Means
– Self Organizing Maps / Kohonen Networks
• Probabilistic model based
– Expectation Maximization / Mixture Models

Hierarchical clustering
• Agglomerative / Bottom up
– Start with single-instance clusters
– At each step, join the two closest clusters
– Method to compute the distance between clusters x and y: single linkage (distance between the closest points in clusters x and y), average linkage (average distance between all points), complete linkage (distance between the furthest points), centroid
– Distance measure: Euclidean, correlation etc.
• Divisive / Top down
– Start with all data in one cluster
– Split into two clusters based on a distance measure / split utility
– Proceed recursively on each subset
• Both methods produce a dendrogram

Levels of Clustering
• Dendrogram levels: agglomerative (bottom up) versus divisive (top down) (figure; Dunham, 2003)

Hierarchical Clustering Example
• Clustering microarray gene expression data
– Gene expression measured using microarrays, studied under a variety of conditions
– On budding yeast Saccharomyces cerevisiae
– Efficiently groups together genes of known similar function
• Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; picture generated with J-Express Pro

Hierarchical Clustering Example
• Method
– Genes are the instances, samples the attributes!
– Agglomerative
– Distance measure = correlation
• Data taken from: Eisen et al. (1998), PNAS, 95:14863-14868; picture generated with J-Express Pro

Simple Clustering: K-means
• Pick a number (k) of cluster centers (at random)
– Cluster centers are sometimes called codes, and the k codes a codebook
• Assign every item to its nearest cluster center
– F.i. Euclidean distance
• Move each cluster center to the mean of its assigned items
• Repeat until convergence
– Change in cluster assignments less than a threshold
(KDnuggets)

K-means example
• Step 1: initially distribute the codes randomly in pattern space (k1, k2, k3)
• Step 2: assign each point to the closest code
• Step 3: move each code to the mean of all its assigned points
• Repeat the process: reassign the data points to the codes (Q: which points are reassigned?), re-compute the cluster means, and move the cluster centers to the cluster means
(KDnuggets)

K-means clustering summary
• Advantages
– Simple, understandable
– Items automatically assigned to clusters
• Disadvantages
– Must pick the number of clusters beforehand
– All items forced into a cluster
– Sensitive to outliers
• Extensions
– Adaptive k-means
– K-medoids (based on the median instead of the mean): 1, 2, 3, 4, 100 has average 22, median 3
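The k-means procedure summarised above (pick k codes at random, assign points to the nearest code, move each code to the mean of its points, repeat until assignments stop changing) fits in a few lines. A minimal NumPy sketch; the synthetic 2-D data, k=3 and the function names are assumptions for the example, not part of the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k cluster centers ("codes") at random from the data
    codes = X[rng.choice(len(X), size=k, replace=False)]
    assignment = None
    for _ in range(n_iter):
        # Step 2: assign every item to its nearest code (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - codes[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Convergence: change in cluster assignments is zero
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Step 3: move each code to the mean of its assigned items
        for j in range(k):
            if np.any(assignment == j):
                codes[j] = X[assignment == j].mean(axis=0)
    return codes, assignment

# Synthetic 2-D data (think weight vs age) with three blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
codes, labels = kmeans(X, k=3)
print(codes)          # one code per discovered cluster
```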
Biological Example
• Clustering of yeast cell images
– Two clusters are found
– Left cluster primarily cells with a thick capsule, right cluster a thin capsule
• Caused by media, proxy for sick vs healthy

Self Organizing Maps (Kohonen Maps)
• Claim to fame
– Simplified models of cortical maps in the brain
– Things that are near in the outside world link to areas near in the cortex
– For a variety of modalities: touch, motor, …. up to echolocation
– Nice visualization
• From a data mining perspective:
– SOMs are simple extensions of k-means clustering
– Codes are connected in a lattice
– In each iteration, codes neighboring the winning code in the lattice are also allowed to move

SOM
• (Figure slides: 10x10 SOM, Gaussian distribution)

SOM example
• Famous example: Phonetic Typewriter
– The SOM lattice is trained on spoken letters; after convergence the codes are labeled
– Creates a 'phonotopic' map
– A spoken word creates a sequence of labels

Famous example: Phonetic Typewriter
• Criticism
– The topology preserving property is not used, so why use SOMs and not, for instance, adaptive k-means?
• K-means could also create a sequence
• This is true for most SOM applications!
– Is using clustering for classification optimal?

Bioinformatics Example: Clustering GPCRs
• Clustering G Protein Coupled Receptors (GPCRs) [Samsanova et al, 2003, 2004]
• Important drug target, function often unknown
(second slide: figure)

Association Rules Outline
• What are frequent item sets & association rules?
• Quality measures: support, confidence, lift
• How to find item sets efficiently? APRIORI
• How to generate association rules from an item set?
• Biological examples
(KDnuggets)

Market Basket Example / Gene Expression Example
• Market basket transactions:
TID | Produce
1 | MILK, BREAD, EGGS
2 | BREAD, SUGAR
3 | BREAD, CEREAL
4 | MILK, BREAD, SUGAR
5 | MILK, CEREAL
6 | BREAD, CEREAL
7 | MILK, CEREAL
8 | MILK, BREAD, CEREAL, EGGS
9 | MILK, BREAD, CEREAL
• Frequent item set: {MILK, BREAD} occurs 4 times
• Association rule: {MILK, BREAD} => {EGGS}
– Frequency / importance = 2 ('support')
– Quality = 50% ('confidence')
• Gene expression analogue: what genes are expressed ('active') together? Interaction / regulation, similar function
ID | Expressed genes in sample
1 | GENE1, GENE2, GENE5
2 | GENE1, GENE3, GENE5
3 | GENE2
4 | GENE8, GENE9
5 | GENE8, GENE9, GENE10
6 | GENE2, GENE8
7 | GENE9, GENE10
8 | GENE2
9 | GENE11

Association Rule Definitions
• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: percentage of transactions which contain that itemset
• Large (frequent) itemset: itemset whose number of occurrences is above a threshold
(Dunham, 2003)
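The support and confidence numbers in the market basket example above can be reproduced directly from the nine transactions. A small sketch; the transaction list is copied from the slide, while the helper function names are mine.

```python
# The nine market-basket transactions from the slide
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs => rhs: support(lhs union rhs) / support(lhs)."""
    return support_count(lhs | rhs) / support_count(lhs)

print(support_count({"MILK", "BREAD"}))            # 4, as on the slide
print(support_count({"MILK", "BREAD", "EGGS"}))    # 2 = support of the rule
print(confidence({"MILK", "BREAD"}, {"EGGS"}))     # 0.5 = 50% confidence
```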
Frequent Item Set Example
• I = {Beer, Bread, Jelly, Milk, PeanutButter}
• Support of {Bread, PeanutButter} is 60%
(Dunham, 2003)

Association Rule Definitions
• Association Rule (AR): implication X => Y where X, Y ⊆ I and X, Y are disjoint
• Support of AR X => Y (s): percentage of transactions that contain X ∪ Y
• Confidence of AR X => Y (α): ratio of the number of transactions that contain X ∪ Y to the number that contain X
(Dunham, 2003)

Association Rules Example (cont'd)
(table figure; Dunham, 2003)

Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X => Y with a minimum support and confidence
• NOTE: the support of X => Y is the same as the support of X ∪ Y
(Dunham, 2003)

Association Rules Example
TID | List of items
1 | A, B, E
2 | B, D
3 | B, C
4 | A, B, D
5 | A, C
6 | B, C
7 | A, C
8 | A, B, C, E
9 | A, B, C
• Q: Given the frequent set {A, B, E}, what association rules have minsup = 2 and minconf = 50%?
– A, B => E : conf = 2/4 = 50%
– A, E => B : conf = 2/2 = 100%
– B, E => A : conf = 2/2 = 100%
– E => A, B : conf = 2/2 = 100%
• Don't qualify:
– A => B, E : conf = 2/6 = 33% < 50%
– B => A, E : conf = 2/7 = 28% < 50%
– __ => A, B, E : conf = 2/9 = 22% < 50%
(KDnuggets)

Solution: Association Rule Problem
• First, find all frequent itemsets with sup >= minsup
– Exhaustive search won't work: with a set of m items there are 2^m subsets!
– Exploit the subset property (APRIORI algorithm)
• For every frequent item set, derive rules with confidence >= minconf
(KDnuggets)

Finding itemsets: next level
• Apriori algorithm (Agrawal & Srikant)
• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …
– Subset property: if (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
– In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
– Compute k-item sets by merging (k-1)-item sets
(KDnuggets)

An example
• Given: five three-item sets (A B C), (A B D), (A C D), (A C E), (B C D)
• Candidate four-item sets:
– (A B C D) Q: OK? A: yes, because all 3-item subsets are frequent
– (A C D E) Q: OK? A: no, because (C D E) is not frequent
(KDnuggets)

From Frequent Itemsets to Association Rules
• Q: Given the frequent set {A, B, E}, what are the possible association rules?
– A => B, E
– A, B => E
– A, E => B
– B => A, E
– B, E => A
– E => A, B
– __ => A, B, E (empty rule), or true => A, B, E
(KDnuggets)

Example: Generating Rules from an Itemset
• Frequent itemset from golf data: Humidity = Normal, Windy = False, Play = Yes (4)
• Seven potential rules (confidence):
– If Humidity = Normal and Windy = False then Play = Yes (4/4)
– If Humidity = Normal and Play = Yes then Windy = False (4/6)
– If Windy = False and Play = Yes then Humidity = Normal (4/6)
– If Humidity = Normal then Windy = False and Play = Yes (4/7)
– If Windy = False then Humidity = Normal and Play = Yes (4/8)
– If Play = Yes then Humidity = Normal and Windy = False (4/9)
– If True then Humidity = Normal and Windy = False and Play = Yes (4/12)
(KDnuggets)

Example: Generating Rules
• Rules with support > 1 and confidence = 100%:
# | Association rule | Sup. | Conf.
1 | Humidity=Normal, Windy=False => Play=Yes | 4 | 100%
2 | Temperature=Cool => Humidity=Normal | 4 | 100%
3 | Outlook=Overcast => Play=Yes | 4 | 100%
4 | Temperature=Cool, Play=Yes => Humidity=Normal | 3 | 100%
… | … | … | …
58 | Outlook=Sunny, Temperature=Hot => Humidity=High | 2 | 100%
• In total: 3 rules with support four, 5 with support three, and 50 with support two
(KDnuggets)
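A compact sketch of the APRIORI idea just described: grow candidate k-item sets from frequent (k-1)-item sets and keep only those whose support meets the threshold, using the subset property to prune. It reuses the A-E transactions from the example above; the function and variable names are mine, and this is a simplified illustration rather than the full Agrawal & Srikant algorithm.

```python
from itertools import combinations

# The nine transactions from the {A, B, E} example above
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def apriori(transactions, minsup=2):
    def support(itemset):
        # Number of transactions that contain the whole itemset
        return sum(itemset <= t for t in transactions)

    items = sorted(set().union(*transactions))
    frequent = {}                                   # frozenset -> support count
    # Level 1: frequent single items
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        # Generate (k+1)-item candidates by merging frequent k-item sets ...
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # ... prune with the subset property, then check support
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k))
                 and support(c) >= minsup]
        k += 1
    return frequent

for itemset, sup in sorted(apriori(transactions).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), "support:", sup)         # includes ['A', 'B', 'E'] with support 2
```

With minsup = 2 this recovers {A, B, E} with support 2, matching the worked example; generating rules from each frequent itemset and filtering on confidence would follow as in the golf-data slide.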
Biomedical Application: Head and Neck Cancer Example
1. ace27=0, fiveyr=alive (381) => tumorbefore=0 (372) conf:(0.98)
2. gender=M, ace27=0 (467) => tumorbefore=0 (455) conf:(0.97)
3. ace27=0 (588) => tumorbefore=0 (572) conf:(0.97)
4. tnm=T0N0M0, ace27=0 (405) => tumorbefore=0 (391) conf:(0.97)
5. loc=LOC7, tumorbefore=0 (409) => tnm=T0N0M0 (391) conf:(0.96)
6. loc=LOC7 (442) => tnm=T0N0M0 (422) conf:(0.95)
7. loc=LOC7, gender=M, tumorbefore=0 (374) => tnm=T0N0M0 (357) conf:(0.95)
8. loc=LOC7, gender=M (406) => tnm=T0N0M0 (387) conf:(0.95)
9. gender=M, fiveyr=alive (633) => tumorbefore=0 (595) conf:(0.94)
10. fiveyr=alive (778) => tumorbefore=0 (726) conf:(0.93)

Bioinformatics Application
• The idea of association rules has been customized for bioinformatics applications
• In biology it is often interesting to find frequent structures rather than items
– For instance protein or other chemical structures
• Solution: mining frequent patterns
– FSG (Kuramochi and Karypis, ICDM 2001)
– gSpan (Yan and Han, ICDM 2002)
– CloseGraph (Yan and Han, KDD 2002)

FSG: Mining Frequent Patterns
• FSG: algorithm for finding frequent subgraphs (figure slides)

Frequent Subgraph Examples
• AIDS Data: compounds are active, inactive or moderately active (CA, CI, CM)

Predictive Subgraphs
• The three most discriminating sub-structures for the PTC, AIDS, and Anthrax datasets

FSG References
• Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. ICDM 2003.
• An Efficient Algorithm for Discovering Frequent Subgraphs. Michihiro Kuramochi and George Karypis. IEEE TKDE.
• Automated Approaches for Classifying Structures. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. BIOKDD 2002.
• Discovering Frequent Geometric Subgraphs. Michihiro Kuramochi and George Karypis. ICDM 2002.
• Frequent Subgraph Discovery. Michihiro Kuramochi and George Karypis. 1st IEEE Conference on Data Mining, 2001.

Recap
• Before starting to mine….
• Descriptive data mining
– Dimension reduction & projection
– Clustering: hierarchical clustering, k-means, self organizing maps
– Association rules: frequent item sets, association rules, APRIORI, bioinformatics case: FSG for frequent subgraph discovery
• Next: predictive data mining

Data Mining Tasks: Classification
• The goal of a classifier is to separate classes on the basis of known attributes
• The classifier can be applied to an instance with unknown class
• For instance, classes are healthy (circle) and sick (square); attributes are age and weight

Data Preparation for Classification
• On attributes
– Attribute selection
– Attribute construction
• On attribute values
– Outlier removal / clipping
– Normalization
– Creating dummies
– Missing values imputation
– ….

Examples of Classification Techniques
• Majority class vote
• Logistic Regression
• Nearest Neighbor
• Decision Trees, Decision Stumps
• Naive Bayes
• Neural Networks
• Genetic algorithms
• Artificial Immune Systems

Example classification algorithm: Logistic Regression
• Linear regression
– For regression, not classification (outcome numeric, not a symbolic class)
– Predicted value is a linear combination of the inputs: y = a*x1 + b*x2 + c
• Logistic regression
– Apply the logistic function to the linear regression formula
– Scales the output between 0 and 1: y = 1 / (1 + e^-(a*x1 + b*x2 + c))
– For binary classification use thresholding: predict class c1 if y >= t, class c2 if y < t

Logistic Regression in attribute space
• Linear decision boundaries can be represented well with linear classifiers like logistic regression (figure: f.e. weight vs f.e. age)
• Non-linear decision boundaries cannot be represented well with linear classifiers like logistic regression
– Well known example: the XOR problem
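A minimal sketch of the logistic-regression step above: a linear combination of the inputs is squashed by the logistic function to a value between 0 and 1, which is then thresholded for a binary class decision. The coefficients, threshold and input values below are made-up numbers purely for illustration, not fitted to any data.

```python
import math

def logistic_regression(x1, x2, a, b, c):
    """Logistic function applied to a linear combination of the inputs."""
    return 1.0 / (1.0 + math.exp(-(a * x1 + b * x2 + c)))

# Made-up coefficients, e.g. x1 = age, x2 = body mass index
a, b, c = 0.04, 0.09, -6.0
t = 0.5                                    # decision threshold

y = logistic_regression(x1=55, x2=31, a=a, b=b, c=c)
label = "c1 (diabetic)" if y >= t else "c2 (not diabetic)"
print(round(y, 3), label)
```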
Example classification algorithm: Nearest Neighbour
• The data itself is the classification model, so there is no model abstraction like a tree etc.
• For a given instance x, search the k instances that are most similar to x
• Classify x as the most occurring class for the k most similar instances

Nearest Neighbor in attribute space
• Classification / prediction of a new instance
• Any decision area possible
• Condition: enough data available
(figure: f.e. weight vs f.e. age)
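The nearest-neighbour rule described above needs no training step: to classify a new instance, look up the k most similar stored instances and take a majority vote. A small sketch with a toy two-attribute dataset; the data values and k=3 are assumptions for the example.

```python
import math
from collections import Counter

# Toy training data: (age, weight) -> class
training = [
    ((25, 60), "healthy"), ((30, 65), "healthy"), ((35, 70), "healthy"),
    ((60, 95), "sick"),    ((65, 100), "sick"),   ((70, 105), "sick"),
]

def knn_classify(x, k=3):
    # Sort stored instances by Euclidean distance to x
    nearest = sorted(training, key=lambda item: math.dist(x, item[0]))
    # Majority class among the k nearest neighbours
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((62, 98)))   # 'sick'
print(knn_classify((28, 63)))   # 'healthy'
```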
Example Classification Algorithm: Decision Trees
• 20000 patients, split on age > 67
– yes: 1200 patients, split on weight > 85 kg
• yes: 400 patients, Diabetic (50%)
• no: 800 patients, Diabetic (10%)
– no: 18800 patients, split on gender = male? etc.

Building Trees: Weather Data example
Outlook | Temperature | Humidity | Windy | Play?
sunny | hot | high | false | No
sunny | hot | high | true | No
overcast | hot | high | false | Yes
rain | mild | high | false | Yes
rain | cool | normal | false | Yes
rain | cool | normal | true | No
overcast | cool | normal | true | Yes
sunny | mild | high | false | No
sunny | cool | normal | false | Yes
rain | mild | normal | false | Yes
sunny | mild | normal | true | Yes
overcast | mild | high | true | Yes
overcast | hot | normal | false | Yes
rain | mild | high | true | No
(KDnuggets / Witten & Frank, 2000)

Building Trees
• An internal node is a test on an attribute
• A branch represents an outcome of the test, e.g., Color = red
• A leaf node represents a class label or class label distribution
• At each node, one attribute is chosen to split the training examples into distinct classes as much as possible
• A new case is classified by following a matching path to a leaf node
• Example tree: Outlook = sunny -> Humidity (high: No, normal: Yes); Outlook = overcast -> Yes; Outlook = rain -> Windy (true: No, false: Yes)
(KDnuggets / Witten & Frank, 2000)

Split on what attribute?
• Which is the best attribute to split on?
– The one which will result in the smallest tree
– Heuristic: choose the attribute that produces the best separation of classes (the "purest" nodes)
• Popular impurity measure: information
– Measured in bits
– At a given node, how much more information do you need to classify an instance correctly?
– What if at a given node all instances belong to one class?
• Strategy: choose the attribute that results in the greatest information gain
(KDnuggets / Witten & Frank, 2000)

Which attribute to select?
• Candidate: the outlook attribute
• What is the info for the leafs?
– info([2,3]) = 0.971 bits
– info([4,0]) = 0 bits
– info([3,2]) = 0.971 bits
• Total: take the average weighted by the number of instances
– info([2,3], [4,0], [3,2]) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.693 bits
• What was the info before the split?
– info([9,5]) = 0.940 bits
• What is the gain for a split on outlook?
– gain(outlook) = 0.940 - 0.693 = 0.247 bits
(Witten & Frank, 2000)

Which attribute to select?
• gain(outlook) = 0.247, gain(humidity) = 0.152, gain(windy) = 0.048, gain(temperature) = 0.029
(Witten & Frank, 2000)

Continuing to split (within the outlook = sunny branch)
• gain(humidity) = 0.971 bits
• gain(temperature) = 0.571 bits
• gain(windy) = 0.020 bits
(KDnuggets / Witten & Frank, 2000)

The final decision tree
• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when the data can't be split any further
(KDnuggets / Witten & Frank, 2000)

Computing information
• Information is measured in bits
– When a leaf contains one class only, the information is 0 (pure)
– When the number of instances is the same for all classes, the information reaches a maximum (impure)
• Measure: information value or entropy
– entropy(p1, p2, …, pn) = -p1*log(p1) - p2*log(p2) - … - pn*log(pn)
• Example (log base 2)
– info([2,3,4]) = -2/9*log(2/9) - 3/9*log(3/9) - 4/9*log(4/9)
(KDnuggets / Witten & Frank, 2000)

Decision Trees in attribute space
• The goal of the classifier is to separate the classes (circle, square) on the basis of the attributes age and weight
• Each line corresponds to a split in the tree
• Decision areas are 'tiles' in attribute space (figure: weight vs age)
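The information-gain numbers quoted above for the outlook split can be checked with a few lines of code. A sketch of the entropy formula (log base 2) and the weighted-average calculation; the function names are mine.

```python
import math

def entropy(counts):
    """Information value in bits of a class distribution, e.g. [2, 3] for 2 yes / 3 no."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_after_split(leaf_counts):
    """Average leaf entropy, weighted by the number of instances per leaf."""
    total = sum(sum(leaf) for leaf in leaf_counts)
    return sum(sum(leaf) / total * entropy(leaf) for leaf in leaf_counts)

before = entropy([9, 5])                             # info([9,5]) = 0.940 bits
after = info_after_split([[2, 3], [4, 0], [3, 2]])   # sunny, overcast, rainy leaves = 0.693 bits
print(round(before, 3), round(after, 3), round(before - after, 3))   # 0.94 0.693 0.247
```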
Example classification algorithm: Naive Bayes
• Naive Bayes = probabilistic classifier based on Bayes' rule
• Will produce a probability for each target / outcome class
• 'Naive' because it assumes independence between attributes (uncorrelated)

Bayes's rule
• Probability of event H given evidence E: Pr[H | E] = Pr[E | H] * Pr[H] / Pr[E]
• A priori probability of H: Pr[H]
– Probability of the event before the evidence is seen
• A posteriori probability of H: Pr[H | E]
– Probability of the event after the evidence is seen
• From Bayes' "Essay towards solving a problem in the doctrine of chances" (1763); Thomas Bayes, born 1702 in London, England, died 1761 in Tunbridge Wells, Kent, England
(KDnuggets / Witten & Frank, 2000)

Naïve Bayes for classification
• Classification learning: what's the probability of the class given an instance?
– Evidence E = instance
– Event H = class value for the instance
• Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent:
– Pr[H | E] = Pr[E1 | H] * Pr[E2 | H] * … * Pr[En | H] * Pr[H] / Pr[E]
(KDnuggets / Witten & Frank, 2000)

Weather data example
• Evidence E: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True, Play = ?
• Probability of class "yes":
– Pr[yes | E] = Pr[Outlook=Sunny | yes] * Pr[Temperature=Cool | yes] * Pr[Humidity=High | yes] * Pr[Windy=True | yes] * Pr[yes] / Pr[E]
– = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / Pr[E]
(KDnuggets / Witten & Frank, 2000)

Probabilities for weather data (counts and relative frequencies over the 14 training days)
Attribute | Value | Yes | No | P(value|Yes) | P(value|No)
Outlook | Sunny | 2 | 3 | 2/9 | 3/5
Outlook | Overcast | 4 | 0 | 4/9 | 0/5
Outlook | Rainy | 3 | 2 | 3/9 | 2/5
Temperature | Hot | 2 | 2 | 2/9 | 2/5
Temperature | Mild | 4 | 2 | 4/9 | 2/5
Temperature | Cool | 3 | 1 | 3/9 | 1/5
Humidity | High | 3 | 4 | 3/9 | 4/5
Humidity | Normal | 6 | 1 | 6/9 | 1/5
Windy | False | 6 | 2 | 6/9 | 2/5
Windy | True | 3 | 3 | 3/9 | 3/5
Play | (any) | 9 | 5 | 9/14 | 5/14
(KDnuggets / Witten & Frank, 2000)

• A new day: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True, Play = ?
• Likelihood of the two classes
– For "yes" = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
– For "no" = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206
• Conversion into a probability by normalization:
– P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
– P("no") = 0.0206 / (0.0053 + 0.0206) = 0.795

Naïve Bayes: discussion
• Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
• Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g. identical attributes)
(Witten & Eibe)

Naive Bayes in attribute space
• NB can model non-linear decision boundaries (figure: f.e. weight vs f.e. age)
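The naive Bayes likelihoods for the new day can be reproduced directly from the frequency table above. A short sketch; the dictionaries simply transcribe the conditional probabilities from the slide, and the function name is mine.

```python
# Conditional probabilities from the weather-data frequency table
p_yes = {"Outlook=Sunny": 2/9, "Temp=Cool": 3/9, "Humidity=High": 3/9, "Windy=True": 3/9}
p_no  = {"Outlook=Sunny": 3/5, "Temp=Cool": 1/5, "Humidity=High": 4/5, "Windy=True": 3/5}
prior_yes, prior_no = 9/14, 5/14

def likelihood(cond_probs, prior):
    score = prior
    for p in cond_probs.values():
        score *= p                       # naive assumption: attributes are independent
    return score

l_yes = likelihood(p_yes, prior_yes)     # about 0.0053
l_no = likelihood(p_no, prior_no)        # about 0.0206

# Normalise so the two scores sum to one
print(round(l_yes / (l_yes + l_no), 3))  # 0.205 -> P(play = yes | new day)
print(round(l_no / (l_yes + l_no), 3))   # 0.795 -> P(play = no  | new day)
```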
Example classification algorithm: Neural Networks
• Inspired by neuronal computation in the brain (McCulloch & Pitts 1943 (!))
• Input (attributes) is coded as activation on the input layer neurons; activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons (for instance diabetic yes/no)
– (figure labels: input = f.e. customer characteristics, output = f.e. response)
• The algorithm learns to find the optimal weights using the training instances and a general learning rule

Neural Networks
• Example simple network (2 layers): inputs age and body_mass_index, weights weight_age and weight_body_mass_index, output = probability of being diabetic
• Probability of being diabetic = f(age * weight_age + body_mass_index * weight_body_mass_index)

Neural Networks in Pattern Space
• Simple network: only a line is available (why?) to separate the classes
• Multilayer network: any classification boundary is possible (figure: f.e. weight vs f.e. age)

Evaluating Classifiers
• Root mean squared error (RMSE), Area Under the ROC Curve (AUC), confusion matrices, classification accuracy
– Accuracy = 78% on the test set means 78% of classifications were correct
• Hold out validation, n-fold cross validation, leave one out validation
– Build a model on a training set, evaluate it on a test set
– Hold out: a single test set (f.e. one third of the data)
– n-fold cross validation
• Divide the data into n groups
• Perform n cycles, each cycle with a different fold as test set
– Leave one out
• Test set of one instance, cycle through all instances

Evaluating Classifiers
• Investigating the sources of error: bias-variance decomposition
• Informal definition
– Bias: error due to limitations of the model representation (e.g. a linear classifier on a non-linear problem); even with infinite data there will be bias
– Variance: error due to instability of the classifier over different samples; error due to sample sizes, overfitting

Example Results: Predicting Survival for Head & Neck Cancer
• Average and standard deviation (SD) of the classification accuracy for all classifiers, for TNM symbolic and TNM numeric representations (results table)

Example Results: Head and Neck Cancer: Bias-Variance Decomposition
• Quiz: what could be a strategy to improve these models?

Agenda Today
• Data mining definitions
• Before starting to mine….
• Descriptive data mining: dimension reduction & projection, clustering, association rules
• Predictive data mining concepts: classification and regression, bioinformatics applications
• Predictive data mining techniques: logistic regression, nearest neighbor, decision trees, naive Bayes, neural networks
• Evaluating predictive models
• Demonstration (optional)

General Data Mining Resources
• Best KDD website & mailing lists: KDnuggets website (www.kdnuggets.com - check the bioinformatics sections)
• Most popular open source data mining tool: WEKA
• Most important conference: KDD (see www.kdd2006.com for instance; check BIOKDD, the bioinformatics data mining workshop)
• You can contact me at [email protected]

Demo: WEKA

Questions and Answers