Databases and Data Mining
Lecture 3: Descriptive Data Mining
Peter van der Putten (putten_at_liacs.nl)

Course Outline
• Objective
– Understand the basics of data mining
– Gain understanding of the potential for applying it in the bioinformatics domain
– Hands-on experience
• Schedule
– 4-Nov-05, 13.45–15.30, room 174: Lecture
– 18-Nov-05, 13.45–15.30, room 413: Lecture; 15.45–17.30, room 306/308: Practical Assignments
– 25-Nov-05, 13.45–15.30, room 413: Lecture
– 2-Dec-05, 13.45–15.30, room 413: Lecture; 15.45–17.30, room 306/308: Practical Assignments
• Evaluation
– Practical assignment (2nd) plus take-home exercise
• Website
– http://www.liacs.nl/~putten/edu/dbdm05/

Agenda Today: Descriptive Data Mining
• Before starting to mine...
• Descriptive data mining
– Dimension reduction & projection
– Clustering
• Hierarchical clustering
• K-means
• Self-organizing maps
– Association rules
• Frequent item sets
• Association rules
• APRIORI
• Bioinformatics case: FSG for frequent subgraph discovery

Before starting to mine...
• Pima Indians Diabetes Data
– x = body mass index, y = age
– (Scatter-plot slides: the data visualised in this two-dimensional pattern space)

Before starting to mine...
• Attribute selection
– This example: information gain (InfoGain) per attribute
– Keep the most important ones
– Attributes ranked by InfoGain (bar chart on slide): diastolic blood pressure (mm Hg), diabetes pedigree function, number of times pregnant, triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), age (years), body mass index (weight in kg/(height in m)^2), plasma glucose concentration at 2 hours in an oral glucose tolerance test

Before starting to mine...
• Types of attribute selection
– Univariate versus multivariate (subset selection)
• The fact that attribute x is a strong univariate predictor does not necessarily mean it will add predictive power to a set of predictors already used by a model
– Filter versus wrapper
• Wrapper methods involve the subsequent learner (classifier or other)

Dimension Reduction
• Projecting high-dimensional data into a lower dimension
– Principal Component Analysis
– Independent Component Analysis
– Fisher Mapping, Sammon's Mapping, etc.
– Multi-Dimensional Scaling
• See the Pattern Recognition course (Duin)

Data Mining Tasks: Clustering
• Clustering is the discovery of groups in a set of instances
• Groups are different, instances in a group are similar
• In 2- to 3-dimensional pattern space (e.g. weight versus age) you could just visualise the data and leave the recognition to a human end user
• In more than 3 dimensions this is not possible

Clustering Techniques
• Hierarchical algorithms
– Agglomerative
– Divisive
• Partition-based clustering
– K-means
– Self-Organizing Maps / Kohonen networks
• Probabilistic model based
– Expectation Maximization / mixture models

Hierarchical Clustering
• Agglomerative / bottom-up
– Start with single-instance clusters
– At each step, join the two closest clusters
– Methods to compute the distance between clusters x and y: single linkage (distance between the closest points in x and y), average linkage (average distance between all points), complete linkage (distance between the furthest points), centroid
– Distance measure: Euclidean, correlation, etc.
• Divisive / top-down
– Start with all data in one cluster
– Split into two clusters, e.g. based on category utility
– Proceed recursively on each subset
• Both methods produce a dendrogram
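A minimal sketch of agglomerative clustering, assuming Python with SciPy (the lecture does not prescribe a tool; the data, linkage method and number of clusters are illustrative choices):

```python
# Minimal sketch: agglomerative clustering with SciPy (assumed available).
# The random data, 'average' linkage and correlation distance are
# illustrative choices, not prescribed by the lecture.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 instances, 5 attributes

# Average linkage with correlation distance; 'single' or 'complete'
# linkage and Euclidean distance are the other options mentioned above.
Z = linkage(X, method="average", metric="correlation")

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
# dendrogram(Z) draws the tree when a plotting backend is available
```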
Levels of Clustering
• (Figure slide: dendrogram levels, agglomerative working bottom-up, divisive top-down) [Dunham, 2003]

Hierarchical Clustering Example
• Clustering microarray gene expression data
– Gene expression measured using microarrays, studied under a variety of conditions
– On budding yeast Saccharomyces cerevisiae
– Efficiently groups together genes of known similar function
• Method
– Genes are the instances, samples the attributes!
– Agglomerative
– Distance measure = correlation
• Data taken from: Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS, 95:14863-14868; picture generated with J-Express Pro

Simple Clustering: K-means
• Pick a number (k) of cluster centers (at random)
• Assign every item to its nearest cluster center
– E.g. using Euclidean distance
• Move each cluster center to the mean of its assigned items
• Repeat until convergence
– E.g. until the change in cluster assignments is less than a threshold
• Cluster centers are sometimes called codes, and the k codes a codebook
[KDnuggets]

K-means example
• Step 1: initially distribute the codes k1, k2, k3 randomly in pattern space
• Step 2: assign each point to the closest code
• Step 3: move each code to the mean of all its assigned points
• Repeat the process: reassign the data points to the codes (Q: which points are reassigned?), re-compute the cluster means, and move the cluster centers to the cluster means
[KDnuggets]

K-means clustering summary
• Advantages
– Simple, understandable
– Items are automatically assigned to clusters
• Disadvantages
– Must pick the number of clusters beforehand
– All items are forced into a cluster
– Sensitive to outliers
• Extensions
– Adaptive k-means
– K-medoids (based on the median instead of the mean)
• E.g. for 1, 2, 3, 4, 100 the average is 22, the median 3

Biological Example
• Clustering of yeast cell images
– Two clusters are found
– Left cluster primarily cells with a thick capsule, right cluster a thin capsule
• Caused by media; proxy for sick versus healthy

Self-Organizing Maps (Kohonen Maps)
• Claim to fame
– Simplified models of cortical maps in the brain
– Things that are near in the outside world link to areas that are near in the cortex
– For a variety of modalities: touch, motor, ... up to echolocation
– Nice visualization
• From a data mining perspective:
– SOMs are simple extensions of k-means clustering
– Codes are connected in a lattice
– In each iteration, codes neighbouring the winning code in the lattice are also allowed to move (see the sketch after the examples below)

SOM examples
• (Figure slides: e.g. a 10x10 SOM trained on data from a Gaussian distribution)

Famous example: Phonetic Typewriter
• The SOM lattice is trained on spoken letters; after convergence the codes are labeled
• This creates a 'phonotopic' map
• A spoken word creates a sequence of labels
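To make the relation between SOMs and k-means concrete, here is a minimal sketch assuming NumPy (the one-dimensional lattice, learning rate and neighbourhood schedule are illustrative assumptions, not from the lecture). The winner selection is exactly the k-means assignment step; the difference is that lattice neighbours of the winning code also move.

```python
# Minimal sketch of a 1-D SOM as an extension of k-means (NumPy assumed).
# Lattice size, learning rate and neighbourhood width are illustrative.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))          # points in a 2-D pattern space
codes = rng.normal(size=(10, 2))          # 10 codes on a 1-D lattice
lattice = np.arange(len(codes))           # lattice coordinates 0..9

alpha, sigma = 0.5, 2.0                   # learning rate, neighbourhood radius
for epoch in range(50):
    for x in rng.permutation(data):
        # k-means-like step: find the winning (closest) code
        winner = np.argmin(np.linalg.norm(codes - x, axis=1))
        # SOM step: codes near the winner in the lattice also move
        h = np.exp(-((lattice - winner) ** 2) / (2 * sigma ** 2))
        codes += alpha * h[:, None] * (x - codes)
    alpha *= 0.95
    sigma = max(0.5, sigma * 0.95)        # shrink the neighbourhood over time

print(codes)   # after training, the codes approximate the data distribution
```

With the neighbourhood radius shrunk to (near) zero, only the winner moves and the update reduces to an online version of k-means.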
Famous example: Phonetic Typewriter
• Criticism
– The topology-preserving property is not used, so why use SOMs and not, for instance, adaptive k-means?
• K-means could also create a sequence
• This is true for most SOM applications!
– Is using clustering for classification optimal?

Bioinformatics Example: Clustering GPCRs
• Clustering G Protein Coupled Receptors (GPCRs) [Samsanova et al., 2003, 2004]
• Important drug target, function often unknown
• (Figure slide with clustering results)

Association Rules: Outline
• What are frequent item sets & association rules?
• Quality measures: support, confidence, lift
• How to find item sets efficiently? (APRIORI)
• How to generate association rules from an item set?
• Biological examples
[KDnuggets]

Market Basket Example
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
• Frequent item set
– {MILK, BREAD}: occurs in 4 transactions
• Association rule
– {MILK, BREAD} => {EGGS}
– Frequency / importance = 2 ('support')
– Quality = 50% ('confidence')

Gene Expression Example
• What genes are expressed ('active') together?
– Interaction / regulation
– Similar function
ID  Expressed genes in sample
1   GENE1, GENE2, GENE5
2   GENE1, GENE3, GENE5
3   GENE2
4   GENE8, GENE9
5   GENE8, GENE9, GENE10
6   GENE2, GENE8
7   GENE9, GENE10
8   GENE2
9   GENE11

Association Rule Definitions
• Set of items: I = {I1, I2, ..., Im}
• Transactions: D = {t1, t2, ..., tn}, with tj ⊆ I
• Itemset: {Ii1, Ii2, ..., Iik} ⊆ I
• Support of an itemset: percentage of transactions which contain that itemset
• Large (frequent) itemset: an itemset whose number of occurrences is above a threshold
[Dunham, 2003]

Frequent Item Set Example
• I = {Beer, Bread, Jelly, Milk, PeanutButter}
• Support of {Bread, PeanutButter} is 60%
• (Transaction table on slide) [Dunham, 2003]

Association Rule Definitions
• Association rule (AR): an implication X => Y where X, Y ⊆ I and X and Y are disjoint
• Support of an AR X => Y (s): percentage of transactions that contain X ∪ Y
• Confidence of an AR X => Y (α): ratio of the number of transactions that contain X ∪ Y to the number that contain X
[Dunham, 2003]

Association Rules Example (cont'd)
• (Worked example table on slide) [Dunham, 2003]

Association Rule Problem
• Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the association rule problem is to identify all association rules X => Y with a minimum support and confidence
• NOTE: the support of X => Y is the same as the support of X ∪ Y
[Dunham, 2003]

Association Rules Example
TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C
• Q: Given the frequent set {A, B, E}, which association rules have minsup = 2 and minconf = 50%?
– A, B => E: conf = 2/4 = 50%
– A, E => B: conf = 2/2 = 100%
– B, E => A: conf = 2/2 = 100%
– E => A, B: conf = 2/2 = 100%
• Don't qualify:
– A => B, E: conf = 2/6 = 33% < 50%
– B => A, E: conf = 2/7 ≈ 29% < 50%
– __ => A, B, E: conf = 2/9 ≈ 22% < 50%
[KDnuggets]

Solution: Association Rule Problem
• First, find all frequent itemsets with support >= minsup
– Exhaustive search won't work: a set of m items has 2^m subsets!
– Exploit the subset property (APRIORI algorithm)
• For every frequent item set, derive rules with confidence >= minconf
[KDnuggets]
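The support and confidence definitions above can be computed by simple counting. A minimal sketch in Python over the market-basket transactions from the example (the set-based encoding is an illustrative choice):

```python
# Minimal sketch: computing support and confidence by counting
# (pure Python; the transactions are the market-basket example above).
transactions = [
    {"MILK", "BREAD", "EGGS"}, {"BREAD", "SUGAR"}, {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"}, {"MILK", "CEREAL"}, {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"}, {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """Fraction of transactions with lhs that also contain rhs."""
    return support_count(lhs | rhs) / support_count(lhs)

print(support_count({"MILK", "BREAD"}))            # 4
print(support_count({"MILK", "BREAD", "EGGS"}))    # 2
print(confidence({"MILK", "BREAD"}, {"EGGS"}))     # 0.5
```

For the rule {MILK, BREAD} => {EGGS} this reproduces the support of 2 and confidence of 50% quoted in the example.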
Finding itemsets: next level
• Apriori algorithm (Agrawal & Srikant)
• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
– Subset property: if (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
– In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
– Compute candidate k-item sets by merging (k-1)-item sets
[KDnuggets]

An example
• Given: five frequent three-item sets
– (A B C), (A B D), (A C D), (A C E), (B C D)
• Candidate four-item sets:
– (A B C D) — Q: OK? A: yes, because all of its 3-item subsets are frequent
– (A C D E) — Q: OK? A: no, because (C D E) is not frequent
[KDnuggets]
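The candidate generation and pruning just illustrated can be sketched in Python (an illustrative sketch; itemsets are encoded as sorted tuples): two frequent (k-1)-item sets that agree on all but their last item are merged, and a candidate is kept only if all of its (k-1)-item subsets are frequent.

```python
# Minimal sketch of APRIORI candidate generation with subset pruning
# (pure Python; sorted tuples are an illustrative itemset encoding).
from itertools import combinations

def apriori_gen(frequent_k_minus_1):
    """Generate candidate k-item sets from frequent (k-1)-item sets."""
    frequent = set(frequent_k_minus_1)
    candidates = set()
    for a in frequent:
        for b in frequent:
            # merge two (k-1)-item sets that agree on all but the last item
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # prune: every (k-1)-item subset must itself be frequent
                if all(sub in frequent
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return sorted(candidates)

# The five frequent three-item sets from the example above:
f3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
      ("A", "C", "E"), ("B", "C", "D")]
print(apriori_gen(f3))   # [('A', 'B', 'C', 'D')] -- (A C D E) is pruned
```

Applied to the five three-item sets above, only (A B C D) survives; (A C D E) is pruned because (C D E) is not frequent.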
From Frequent Itemsets to Association Rules
• Q: Given the frequent set {A, B, E}, what are the possible association rules?
– A => B, E
– A, B => E
– A, E => B
– B => A, E
– B, E => A
– E => A, B
– __ => A, B, E (empty rule), or true => A, B, E
[KDnuggets]

Example: Generating Rules from an Itemset
• Frequent itemset from the golf data: Humidity = Normal, Windy = False, Play = Yes (4)
• Seven potential rules (with confidence):
– If Humidity = Normal and Windy = False then Play = Yes (4/4)
– If Humidity = Normal and Play = Yes then Windy = False (4/6)
– If Windy = False and Play = Yes then Humidity = Normal (4/6)
– If Humidity = Normal then Windy = False and Play = Yes (4/7)
– If Windy = False then Humidity = Normal and Play = Yes (4/8)
– If Play = Yes then Humidity = Normal and Windy = False (4/9)
– If True then Humidity = Normal and Windy = False and Play = Yes (4/12)
[KDnuggets]

Example: Generating Rules
• Rules with support > 1 and confidence = 100%:
– 1. Humidity=Normal, Windy=False => Play=Yes (sup. 4, conf. 100%)
– 2. Temperature=Cool => Humidity=Normal (sup. 4, conf. 100%)
– 3. Outlook=Overcast => Play=Yes (sup. 4, conf. 100%)
– 4. Temperature=Cool, Play=Yes => Humidity=Normal (sup. 3, conf. 100%)
– ...
– 58. Outlook=Sunny, Temperature=Hot => Humidity=High (sup. 2, conf. 100%)
• In total: 3 rules with support four, 5 with support three, and 50 with support two
[KDnuggets]

Weka associations: output
• (Screenshot of Weka's association rule output) [KDnuggets]

Extensions and Challenges
• Extra quality measure: lift
– The lift of an association rule I => J is defined as lift = P(J|I) / P(J)
– Note: P(I) = (support of I) / (number of transactions)
– Lift is the ratio of the confidence to the expected confidence
– Interpretation:
• if lift > 1, then I and J are positively correlated
• if lift < 1, then I and J are negatively correlated
• if lift = 1, then I and J are independent
• Other measures of interestingness
– E.g. A => B and B => C, but not A => C
• Efficient algorithms
• Known problem
– What to do with all these rules? How to exploit them / make them useful and actionable?
[KDnuggets]

Biomedical Application: Head and Neck Cancer Example
1. ace27=0 fiveyr=alive 381 => tumorbefore=0 372, conf:(0.98)
2. gender=M ace27=0 467 => tumorbefore=0 455, conf:(0.97)
3. ace27=0 588 => tumorbefore=0 572, conf:(0.97)
4. tnm=T0N0M0 ace27=0 405 => tumorbefore=0 391, conf:(0.97)
5. loc=LOC7 tumorbefore=0 409 => tnm=T0N0M0 391, conf:(0.96)
6. loc=LOC7 442 => tnm=T0N0M0 422, conf:(0.95)
7. loc=LOC7 gender=M tumorbefore=0 374 => tnm=T0N0M0 357, conf:(0.95)
8. loc=LOC7 gender=M 406 => tnm=T0N0M0 387, conf:(0.95)
9. gender=M fiveyr=alive 633 => tumorbefore=0 595, conf:(0.94)
10. fiveyr=alive 778 => tumorbefore=0 726, conf:(0.93)

Bioinformatics Application
• The idea of association rules has been customized for bioinformatics applications
• In biology it is often interesting to find frequent structures rather than items
– For instance protein or other chemical structures
• Solution: mining frequent patterns (subgraphs)
– FSG (Kuramochi and Karypis, ICDM 2001)
– gSpan (Yan and Han, ICDM 2002)
– CloseGraph (Yan and Han, KDD 2002)

FSG: Mining Frequent Patterns
• (Figure slides: the FSG algorithm for finding frequent subgraphs, and examples of frequent subgraphs)

AIDS Data
• Compounds are active, inactive or moderately active (CA, CI, CM)

Predictive Subgraphs
• The three most discriminating sub-structures for the PTC, AIDS, and Anthrax datasets

FSG References
• Mukund Deshpande, Michihiro Kuramochi, and George Karypis. Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. ICDM 2003
• Michihiro Kuramochi and George Karypis. An Efficient Algorithm for Discovering Frequent Subgraphs. IEEE TKDE
• Mukund Deshpande, Michihiro Kuramochi, and George Karypis. Automated Approaches for Classifying Structures. BIOKDD 2002
• Michihiro Kuramochi and George Karypis. Discovering Frequent Geometric Subgraphs. ICDM 2002
• Michihiro Kuramochi and George Karypis. Frequent Subgraph Discovery. 1st IEEE Conference on Data Mining (ICDM), 2001

Recap
• Before starting to mine...
• Descriptive data mining
– Dimension reduction & projection
– Clustering
• Hierarchical clustering
• K-means
• Self-organizing maps
– Association rules
• Frequent item sets
• Association rules
• APRIORI
• Bioinformatics case: FSG for frequent subgraph discovery
• Next week
– Bioinformatics data mining cases / lab session / take-home exercise