The news: INTELLIGENCE Prof. Dr. Herwig Unger 1 Striving for Intelligence.. “Data is raw and unadorned. Information is data endowed with some degree of business context and meaning. Intelligence elevates information to a higher level within an organization.” -- Bernard Liautaud, e-Business Intelligence Prof. Dr. Herwig Unger 2 The data pyramid Wisdom Knowledge + experience Knowledge Information + rules Information Data + context Data Prof. Dr. Herwig Unger 3 What is „data mining“??? Data mining (knowledge discovery in databases - KDD, business intelligence): Extraction of interesting ( non-trivial, implicit, previously unknown and potentially useful) information from data in large databases • “Tell me something interesting about the data.” • “Describe the data.” Prof. Dr. Herwig Unger 4 What is „data mining“ (2)? What is not Data Mining? – Look up phone number in phone directory – Query a Web search engine for information about “Amazon” What is Data Mining? – Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area) – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,) Prof. Dr. Herwig Unger 5 Tasks of Data Mining Prediction Methods Use some variables to predict unknown or future values of other variables. Description Methods Find human-interpretable patterns that describe the data. Prof. Dr. Herwig Unger 6 The data gap There is often information “hidden” in the data that is not readily evident Human analysts may take weeks to discover useful information Much of the data is never analyzed at all 4,000,000 3,500,000 The Data Gap 3,000,000 2,500,000 2,000,000 1,500,000 Total new disk (TB) since 1995 1,000,000 Number of analysts 500,000 0 1995 1996 1997 Prof. Dr. Herwig Unger 1998 1999 7 Application in Business Database analysis and decision support Market analysis and management Risk analysis and management Fraud detection and management Text analysis - Text Mining Web analysis - Web Mining Intelligent query answering Prof. Dr. Herwig Unger 8 Market analysis and management data Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies. Target marketing: Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. Determine customer purchasing patterns over time: Conversion of single to a joint bank account: marriage, etc. Prof. Dr. Herwig Unger 9 Analysis and risk management Finance planning and asset evaluation: cash flow analysis and prediction time series analysis (trend analysis, etc.) Resource planning: summarize and compare the resources and spending Competition: Monitor competitors and market directions Set pricing strategy in a highly competitive market Prof. Dr. Herwig Unger 10 Fraud detection Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances Examples application: Auto Insurance: detect a group of people who stage accidents to collect on insurance Money Laundering: detect suspicious money transactions Detecting telephone fraud: detecting suspicious patterns (generate call model - destination, time, duration) Prof. Dr. 
Herwig Unger 11 Others Sports Analysis of game in NBA (eg., detect the opponent’s strategy) Astronomy discovery and classification of new objects Internet analysis of Web access logs, discovery of user behavior patterns, analyzing effectiveness of Web marketing, improving Web site organization Text news analysis, medical record analysis, automatic email sorting and filtering, automatic document categorization Prof. Dr. Herwig Unger 12 Datamining is interdisciplinary !!! Database systems, data warehouse and OLAP Statistics Machine learning Visualization Information science High performance computing Other disciplines: Neural networks, mathematical modeling, information retrieval, pattern recognition, ... Prof. Dr. Herwig Unger 13 From data to knowledge…. Data mining: the core of knowledge discovery process. Pattern Evaluation Data Mining Task-relevant Data Data Warehouse Selection Data Cleaning Data Integration Databases Prof. Dr. Herwig Unger Main steps of KDD Learning the application domain: relevant prior knowledge and goals of application Data cleaning and preprocessing: (may take 60% of effort!): creating a target data set: data selection find useful features, generate new features, map feature values, discretization of values Choosing data mining tools/algorithms summarization, classification, regression, association, clustering. Data mining: search for patterns of interest Interpretation: analysis of results. visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge. Prof. Dr. Herwig Unger 15 Finally: Data mining and BI Increasing potential to support business decisions Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources Paper, Files, Information Providers, Database Systems, OLTP Prof. Dr. Herwig Unger DBA Decision support systems (DSS) Data Model Knowledge Base Base Base User Interface Prof. Dr. Herwig Unger 17 The Contents of Analytic Applications Analytic applications typically have no limits; analysts can see everything Analytic applications can view and analyze all of an organization’s data in a number of ways Analytic applications are powerful, but not as easy to use as other mechanisms Prof. Dr. Herwig Unger 18 Analytic Applications Prof. Dr. Herwig Unger 19 The Purpose of Analytic Applications Analytic applications free analysts from building complex models and writing complex queries Analysts are free to focus on the data and discover relationships and drivers behind numbers Rich visualizations allow much easier understanding of trends and relationships Prof. Dr. Herwig Unger 20 Benefits of Analytic Applications Data is significantly easier to analyze Analysts can focus on analyzing the data and not writing complex queries Reports created with analytic applications can be pushed out to the organization Graphical tools provide users throughout the organization with powerful reports and analytic capabilities Prof. Dr. Herwig Unger 21 The decision environment UNCERTAINTY COMPLEXITY EQUIVOCALITY • Facts not known • Too many facts • Facts not Clear • Gather Information • Generate Information • Interpret Information • Fact Finding /.Analysis • Simulation/Synthesis • Application of Expertise DATA BASED MODEL BASED Prof. Dr. 
Herwig Unger KNOWLEDGE BASED 22 Examples Standard query: List all customers whose peak-hour usage revenues have decreased by 20 percent or more Multidimensional analysis Slice these customers above by the southwest, west and northwest regions. Drill down to the largest city in the southwest region. Modeling and segmentation What are the demographic characteristics of these customers, and how can we use that to predict the revenue patterns of new customers ? Knowledge discovery A large fraction of these customers have recently responded to the ”Pampers” ads on our web-site. Prof. Dr. Herwig Unger 23 Some DM application Finance Industry (credit cards, insurance, mortgages, etc.) Telecommunications Utilities Medicine Search Engines (text data mining) Law Enforcement Prof. Dr. Herwig Unger 24 Overview: Data Mining Methods Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive] Prof. Dr. Herwig Unger 25 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. Prof. Dr. Herwig Unger 26 Examples of Classification Task Predicting tumor cells as benign or malignant Classifying credit card transactions as legitimate or fraudulent Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil Categorizing news stories as finance, weather, entertainment, sports, etc Prof. Dr. Herwig Unger 27 Classification Techniques Decision Tree based Methods Rule-based Methods Memory based reasoning Neural Networks Naïve Bayes and Bayesian Belief Networks Support Vector Machines Prof. Dr. Herwig Unger 28 Classification Example al al us c c i i o or or nu i g g t n te te ss a a o a l c c c c Tid Refund Marital Status Taxable Income Cheat Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No No Single 75K ? 2 No Married 100K No Yes Married 50K ? 3 No Single 70K No No Married 150K ? 4 Yes Married 120K No Yes Divorced 90K ? 5 No Divorced 95K Yes No Single 40K ? 6 No Married No No Married 80K ? 60K 10 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 10 No Single 90K Yes Training Set Prof. Dr. Herwig Unger Learn Classifier Test Set Model 29 Classification: Application 1 Direct Marketing Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach: • Use the data for a similar product introduced before. • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. • Collect various demographic, lifestyle, and company-interaction related information about all such customers. – Type of business, where they stay, how much they earn, etc. • Use this information as input attributes to learn a classifier model. From [Berry & Linoff] Data Mining Techniques, 1997 Prof. Dr. Herwig Unger 30 Classification: Application 2 Fraud Detection Goal: Predict fraudulent cases in credit card transactions. 
Approach: • Use credit card transactions and the information on its account-holder as attributes. – When does a customer buy, what does he buy, how often he pays on time, etc • Label past transactions as fraud or fair transactions. This forms the class attribute. • Learn a model for the class of the transactions. • Use this model to detect fraud by observing credit card transactions on an account. Prof. Dr. Herwig Unger 31 Classification: Application 3 Customer Attrition/Churn: Goal: To predict whether a customer is likely to be lost to a competitor. Approach: • Use detailed record of transactions with each of the past and present customers, to find attributes. – How often the customer calls, where he calls, what time-ofthe day he calls most, his financial status, marital status, etc. • Label the customers as loyal or disloyal. • Find a model for loyalty. From [Berry & Linoff] Data Mining Techniques, 1997 Prof. Dr. Herwig Unger 32 Classification: Application 4 Sky Survey Cataloging Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory). – 3000 images with 23,040 x 23,040 pixels per image. Approach: • Segment the image. • Measure image attributes (features) - 40 of them per object. • Model the class based on these features. • Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find! From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996 Prof. Dr. Herwig Unger 33 Classifying Galaxies Courtesy: http://aps.umn.edu Early Class: • Stages of Formation Attributes: • Image features, • Characteristics of light waves received, etc. Intermediate Late Data Size: • 72 million stars, 20 million galaxies • Object Catalog: 9 GB • Image Database: 150 GB Prof. Dr. Herwig Unger 34 Another Example of Decision Tree al al us c c i i o or or nu i g g t ss e e n t t a l c ca ca co Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 60K Married MarSt NO Single, Divorced Refund Yes NO No TaxInc < 80K NO > 80K YES There could be more than one tree that fits the same data! 10 Prof. Dr. Herwig Unger 35 Decision Tree Classification Task Tid Attrib1 1 Yes Large Attrib2 125K Attrib3 No Class 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Learn Model 10 Tid Attrib1 11 No Small Attrib2 55K Attrib3 ? 12 Yes Medium 80K ? 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K ? Apply Model Class Decision Tree 10 Prof. Dr. Herwig Unger 36 Apply Model to Test Data Test Data Start from the root of tree. Refund No NO MarSt Single, Divorced TaxInc NO Taxable Income Cheat No 80K Married ? 10 Yes < 80K Refund Marital Status Married NO > 80K YES Prof. Dr. Herwig Unger 37 Apply Model to Test Data Test Data Refund No NO MarSt Single, Divorced TaxInc NO Taxable Income Cheat No 80K Married ? 10 Yes < 80K Refund Marital Status Married NO > 80K YES Prof. Dr. Herwig Unger 38 Apply Model to Test Data Test Data Refund No NO MarSt Single, Divorced TaxInc NO Taxable Income Cheat No 80K Married ? 10 Yes < 80K Refund Marital Status Married NO > 80K YES Prof. Dr. 
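A minimal Python sketch of the decision tree from the preceding slides (Refund, then Marital Status, then Taxable Income with the 80K split), applied to the test record that the "Apply Model to Test Data" slides walk through. The function and field names are my own, income is given in thousands, and the printed label matches the one the slides arrive at.

# Hand-coded version of the example tree: Refund -> Marital Status -> Taxable Income.
# Function and record names are illustrative, not from the lecture.
def classify_cheat(record):
    """Return the predicted 'Cheat' label for one record."""
    if record["Refund"] == "Yes":
        return "No"                       # refunders never cheat in the training data
    if record["MaritalStatus"] == "Married":
        return "No"                       # married customers never cheat in the training data
    # Single or Divorced: decide on taxable income (in thousands)
    return "No" if record["TaxableIncome"] < 80 else "Yes"

# The test record used on the "Apply Model to Test Data" slides:
test = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}
print(classify_cheat(test))   # -> "No", the label the slides assign to Cheat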
Herwig Unger 39 Apply Model to Test Data Test Data Refund No NO MarSt Single, Divorced TaxInc NO Taxable Income Cheat No 80K Married ? 10 Yes < 80K Refund Marital Status Married NO > 80K YES Prof. Dr. Herwig Unger 40 Apply Model to Test Data Test Data Refund No NO MarSt Single, Divorced TaxInc NO Taxable Income Cheat No 80K Married ? 10 Yes < 80K Refund Marital Status Married NO > 80K YES Prof. Dr. Herwig Unger 41 Apply Model to Test Data Test Data Refund No NO MarSt Single, Divorced TaxInc NO Taxable Income Cheat No 80K Married ? 10 Yes < 80K Refund Marital Status Married Assign Cheat to “No” NO > 80K YES Prof. Dr. Herwig Unger 42 Decision Tree Classification Task Tid Attrib1 1 Yes Large Attrib2 125K Attrib3 No Class 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Learn Model 10 Tid Attrib1 11 No Small Attrib2 55K Attrib3 ? 12 Yes Medium 80K ? 13 Yes Large 110K ? 14 No Small 95K ? 15 No Large 67K ? Apply Model Class Decision Tree 10 Prof. Dr. Herwig Unger 43 Decision Tree Induction Many Algorithms: Hunt’s Algorithm (one of the earliest) CART ID3, C4.5 SLIQ,SPRINT Prof. Dr. Herwig Unger 44 General Structure of Hunt’s Algorithm Let Dt be the set of training records that reach a node t General Procedure: If Dt contains records that belong the same class yt, then t is a leaf node labeled as yt If Dt is an empty set, then t is a leaf node labeled by the default class, yd If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset. Prof. Dr. Herwig Unger Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 60K 10 Dt ? 45 Hunt’s Algorithm Refund Don’t Cheat Yes No Don’t Cheat Don’t Cheat Refund Refund Yes Yes No No Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 60K 10 Don’t Cheat Don’t Cheat Marital Status Single, Divorced Cheat Married Marital Status Single, Divorced Married Don’t Cheat Taxable Income Don’t Cheat < 80K >= 80K Don’t Cheat Cheat Prof. Dr. Herwig Unger 46 Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records •How to specify the attribute test condition? •How to determine the best split? Determine when to stop splitting Prof. Dr. Herwig Unger 47 How to Specify Test Condition? Depends on attribute types Nominal Ordinal Continuous Depends on number of ways to split 2-way split Multi-way split Prof. Dr. Herwig Unger 48 Splitting Based on Nominal Attributes Multi-way split: Use as many partitions as distinct values. CarType Family Luxury Sports Binary split: Divides values into two subsets. Need to find optimal partitioning. {Sports, Luxury} CarType {Family} OR Prof. Dr. Herwig Unger {Family, Luxury} CarType {Sports} 49 Splitting Based on Ordinal Attributes Multi-way split: Use as many partitions as distinct values. Size Small Medium Large Binary split: Divides values into two subsets. Need to find optimal partitioning. 
{Small, Medium} Size {Large} What about this split? OR {Small, Large} Prof. Dr. Herwig Unger {Medium, Large} Size {Small} Size {Medium} 50 Splitting Based on Continuous Attributes Different ways of handling Discretization to form an ordinal categorical attribute • Static – discretize once at the beginning • Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering. Binary Decision: (A < v) or (A ≥ v) • consider all possible splits and finds the best cut • can be more compute intensive Prof. Dr. Herwig Unger 51 Splitting Based on Continuous Attributes Prof. Dr. Herwig Unger 52 Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records •How to specify the attribute test condition? •How to determine the best split? Determine when to stop splitting Prof. Dr. Herwig Unger 53 How to determine the Best Split Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best? Prof. Dr. Herwig Unger 54 How to determine the Best Split Greedy approach: Nodes with homogeneous class distribution are preferred Need a measure of node impurity: Non-homogeneous, Homogeneous, High degree of impurity Low degree of impurity Prof. Dr. Herwig Unger 55 Clustering Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem-specific Measures. Prof. Dr. Herwig Unger 56 Illustrating Clustering ⌧Euclidean Distance Based Clustering in 3-D space. Intracluster Intraclusterdistances distances are areminimized minimized Prof. Dr. Herwig Unger Intercluster Interclusterdistances distances are aremaximized maximized 57 Clustering: Application 1 Market Segmentation: Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach: • Collect different attributes of customers based on their geographical and lifestyle related information. • Find clusters of similar customers. • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters. Prof. Dr. Herwig Unger 58 Clustering: Application 2 Document Clustering: Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents. Prof. Dr. Herwig Unger 59 Illustrating Document Clustering Clustering Points: 3204 Articles of Los Angeles Times. Similarity Measure: How many words are common in these documents (after some word filtering). Category Total Articles Correctly Placed 555 364 Foreign 341 260 National 273 36 Metro 943 746 Sports 738 573 Entertainment 354 278 Financial Prof. Dr. Herwig Unger 60 Clustering of S&P Stock Data Observe Stock Movements every day. Clustering points: Stock-{UP/DOWN} Similarity Measure: Two points are more similar if the events described by them frequently happen together on the same day. 
We used association rules to quantify a similarity measure. Discovered Clusters 1 2 3 4 Applied-Matl-DOW N,Bay-Net work-Down,3-COM-DOWN, Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN, DSC-Co mm-DOW N,INTEL-DOWN,LSI-Logic-DOWN, Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down, Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOW N, Sun-DOW N Apple-Co mp-DOW N,Autodesk-DOWN,DEC-DOWN, ADV-M icro-Device-DOWN,Andrew-Corp-DOWN, Co mputer-Assoc-DOWN,Circuit-City-DOWN, Co mpaq-DOWN, EM C-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOW N,Microsoft-DOWN,Scientific-Atl-DOWN Fannie-Mae-DOWN,Fed-Ho me-Loan-DOW N, MBNA-Corp -DOWN,Morgan-Stanley-DOWN Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP, Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP, Schlu mberger-UP Prof. Dr. Herwig Unger Industry Group Technology1-DOWN Technology2-DOWN Financial-DOWN Oil-UP 61 What is NOT clustering analysis? Supervised classification Have class label information Simple segmentation Dividing students into different registration groups alphabetically, by last name Results of a query Groupings are a result of an external specification Graph partitioning Some mutual relevance and synergy, but areas are not identical Prof. Dr. Herwig Unger 62 The Notion of a Cluster can be Ambiguous How many clusters? Six Clusters Two Clusters Four Clusters Prof. Dr. Herwig Unger 63 Types of clustering A clustering is a set of clusters Important distinction between hierarchical and partitional sets of clusters Partitional Clustering A division data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset Hierarchical clustering A set of nested clusters organized as a hierarchical tree Prof. Dr. Herwig Unger 64 Partitional clustering Original Points A Partitional Clustering Prof. Dr. Herwig Unger 65 Hierarchical clustering p1 p3 p4 p2 p1 p2 Traditional Hierarchical Clustering p3 p4 Traditional Dendrogram p1 p3 p4 p2 p1 p2 Non-traditional Hierarchical Clustering p3 p4 Non-traditional Dendrogram Prof. Dr. Herwig Unger 66 Other Distinctions Between Sets of Clusters Exclusive versus non-exclusive In non-exclusive clusterings, points may belong to multiple clusters. Can represent multiple classes or ‘border’ points Fuzzy versus non-fuzzy In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 Weights must sum to 1 Probabilistic clustering has similar characteristics Partial versus complete In some cases, we only want to cluster some of the data Heterogeneous versus homogeneous Cluster of widely different sizes, shapes, and densities Prof. Dr. Herwig Unger 67 Types of Clusters Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function Prof. Dr. Herwig Unger 68 Types of Clusters: Well-Separated Well-Separated Clusters: A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters Prof. Dr. Herwig Unger 69 Types of Clusters: Center-Based Center-based A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster 4 center-based clusters Prof. Dr. 
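The difference between a centroid and a medoid can be made concrete with a small NumPy sketch. The sample coordinates below are made up; the point is only that the centroid (the mean) need not coincide with any cluster member, while the medoid is always an actual data point.

# Sketch: centroid (mean of the points) vs. medoid (the member point that
# minimizes total distance to all other points in the cluster).
import numpy as np

cluster = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0]])  # one outlying point

centroid = cluster.mean(axis=0)     # (3.125, 3.0), pulled toward the outlier

# medoid: the point with the smallest sum of Euclidean distances to the others
dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[dists.sum(axis=1).argmin()]

print("centroid:", centroid)
print("medoid:  ", medoid)          # [1.5, 2.0], a "representative" member of the cluster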
Herwig Unger 70 Types of Clusters: Contiguity-Based Contiguous Cluster (Nearest neighbor or Transitive) A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. 8 contiguous clusters Prof. Dr. Herwig Unger 71 Types of Clusters: Density-Based Density-based A cluster is a dense region of points, which is separated by lowdensity regions, from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present. 6 density-based clusters Prof. Dr. Herwig Unger 72 Types of Clusters: Conceptual Clusters Shared Property or Conceptual Clusters Finds clusters that share some common property or represent a particular concept. . 2 Overlapping Circles Prof. Dr. Herwig Unger 73 Types of Clusters: Objective Function Clusters Defined by an Objective Function Finds clusters that minimize or maximize an objective function. Enumerate all possible ways of dividing the points into clusters and evaluate the `goodness' of each potential set of clusters by using the given objective function. (NP Hard) Can have global or local objectives. • Hierarchical clustering algorithms typically have local objectives • Partitional algorithms typically have global objectives A variation of the global objective function approach is to fit the data to a parameterized model. • Parameters for the model are determined from the data. • Mixture models assume that the data is a ‘mixture' of a number of statistical distributions. Prof. Dr. Herwig Unger 74 Types of Clusters: Objective Function … Map the clustering problem to a different domain and solve a related problem in that domain Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points Clustering is equivalent to breaking the graph into connected components, one for each cluster. Want to minimize the edge weight between clusters and maximize the edge weight within clusters Prof. Dr. Herwig Unger 75 Characteristics of the Input Data Are Important Type of proximity or density measure This is a derived measure, but central to clustering Sparseness Dictates type of similarity Adds to efficiency Attribute type Dictates type of similarity Type of Data Dictates type of similarity Other characteristics, e.g., autocorrelation Dimensionality Noise and Outliers Type of Distribution Prof. Dr. Herwig Unger 76 Clustering Algorithms K-means and its variants Hierarchical clustering Density-based clustering Prof. Dr. Herwig Unger 77 K-means clusterung Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple Prof. Dr. Herwig Unger 78 k-means clustering: details Initial centroids are often chosen randomly. Clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for common similarity measures mentioned above. Most of the convergence happens in the first few iterations. Often the stopping condition is changed to ‘Until relatively few points change clusters’ Complexity is O( n * K * I * d ) n = number of points, K = number of clusters, I = number of iterations, d = number of attributes Prof. Dr. 
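A short NumPy sketch of the basic algorithm just described: pick K random initial centroids, assign each point to the closest centroid, recompute each centroid as the mean of its cluster, and stop when assignments no longer change. The function name, stopping rule, and toy data are illustrative, and a full implementation would also handle empty clusters; each iteration costs O(n * K * d), matching the O(n * K * I * d) complexity on the slide.

# Minimal K-means sketch (assignment step + update step), for illustration only.
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # random init
    labels = None
    for _ in range(max_iter):
        # assignment step: closest centroid by Euclidean distance
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                         # no point changed clusters
        labels = new_labels
        # update step: centroid = mean of the points assigned to it
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# toy data: two obvious groups
pts = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centers = kmeans(pts, k=2)
print(labels)
print(centers)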
Herwig Unger 79 Two different K-means Clusterings [scatter plots of the same data set: Original Points, an Optimal Clustering, and a Sub-optimal Clustering] Prof. Dr. Herwig Unger 80 Importance of Choosing Initial Centroids [scatter plot of the cluster assignments over Iterations 1 to 6] Prof. Dr. Herwig Unger 81 Importance of Choosing Initial Centroids [scatter plots of the cluster assignments, one panel per Iteration 1 to 6] Prof. Dr. Herwig Unger 82 Association Rule Discovery: Definition Given a set of records, each of which contains some number of items from a given collection, produce dependency rules which will predict the occurrence of an item based on occurrences of other items. TID Items: 1 Bread, Coke, Milk; 2 Beer, Bread; 3 Beer, Coke, Diaper, Milk; 4 Beer, Bread, Diaper, Milk; 5 Coke, Diaper, Milk. Rules Discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer} Prof. Dr. Herwig Unger 83 Association Rule Discovery: Application 1 Marketing and Sales Promotion: Let the rule discovered be {Bagels, … } --> {Potato Chips}. Potato Chips as consequent => can be used to determine what should be done to boost its sales. Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels. Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips! Prof. Dr. Herwig Unger 84 Association Rule Discovery: Application 2 Supermarket shelf management. Goal: To identify items that are bought together by sufficiently many customers. Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. A classic rule: • If a customer buys diapers and milk, then he is very likely to buy beer. • So, don't be surprised if you find six-packs stacked next to diapers! Prof. Dr. Herwig Unger 85 Association Rule Discovery: Application 3 Inventory Management: Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households. Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns. Prof. Dr. Herwig Unger 86 Definition: Frequent Itemset Itemset: a collection of one or more items • Example: {Milk, Bread, Diaper} k-itemset • An itemset that contains k items Support count (σ): frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2 TID Items: 1 Bread, Milk; 2 Bread, Diaper, Beer, Eggs; 3 Milk, Diaper, Beer, Coke; 4 Bread, Milk, Diaper, Beer; 5 Bread, Milk, Diaper, Coke. Support: fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset: an itemset whose support is greater than or equal to a minsup threshold Prof. Dr.
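The support count σ and the support s from the "Definition: Frequent Itemset" slide can be computed directly over the slide's five transactions. This is a minimal sketch; the helper and variable names are mine.

# Support count (sigma) and support (s) of an itemset over the slide's transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

itemset = {"Milk", "Bread", "Diaper"}
sigma = support_count(itemset, transactions)    # 2
s = sigma / len(transactions)                   # 2/5 = 0.4
print(sigma, s)
# With minsup = 0.4 this 3-itemset counts as frequent (s >= minsup).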
Herwig Unger 87 Definition: Association Rule Association Rule An implication expression of the form X → Y, where X and Y are itemsets Example: {Milk, Diaper} → {Beer} TID Items 1 Bread, Milk 2 3 4 5 Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke Rule Evaluation Metrics Example: Support (s) {Milk , Diaper } ⇒ Beer • Fraction of transactions that contain both X and Y s= Confidence (c) • Measures how often items in Y appear in transactions that contain X c= Prof. Dr. Herwig Unger σ ( Milk , Diaper, Beer ) |T| = 2 = 0 .4 5 σ (Milk, Diaper, Beer) 2 = = 0.67 3 σ (Milk, Diaper) 88 Association Rule Mining Task Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold confidence ≥ minconf threshold Brute-force approach: List all possible association rules Compute the support and confidence for each rule Prune rules that fail the minsup and minconf thresholds ⇒ Computationally prohibitive! Prof. Dr. Herwig Unger 89 Mining Association Rules TID Items 1 Bread, Milk 2 3 4 5 Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke Example of Rules: {Milk,Diaper} → {Beer} (s=0.4, c=0.67) {Milk,Beer} → {Diaper} (s=0.4, c=1.0) {Diaper,Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk,Diaper} (s=0.4, c=0.67) {Diaper} → {Milk,Beer} (s=0.4, c=0.5) {Milk} → {Diaper,Beer} (s=0.4, c=0.5) Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements Prof. Dr. Herwig Unger 90 Mining Association Rules Two-step approach: 1. Frequent Itemset Generation – 2. Generate all itemsets whose support ≥ minsup Rule Generation – Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset Frequent itemset generation is still computationally expensive Prof. Dr. Herwig Unger 91 Frequent Itemset Generation null A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE ABCDE Prof. Dr. Herwig Unger BCDE Given d items, there are 2d possible candidate itemsets 92 Frequent Itemset Generation Brute-force approach: Each itemset in the lattice is a candidate frequent itemset Count the support of each candidate by scanning the database Transactions TID 1 2 3 4 5 Items Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke Match each transaction against every candidate Complexity ~ O(NMw) => Expensive since M = 2d !!! Prof. Dr. Herwig Unger 93 Computational Complexity Given d unique items: Total number of itemsets = 2d Total number of possible association rules: ⎡⎛ d ⎞ ⎛ d − k ⎞⎤ R = ∑ ⎢⎜ ⎟ × ∑ ⎜ ⎟⎥ ⎣⎝ k ⎠ ⎝ j ⎠⎦ = 3 − 2 +1 d −1 d −k k =1 j =1 d d +1 If d=6, R = 602 rules Prof. Dr. Herwig Unger 94 Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2d Use pruning techniques to reduce M Reduce the number of transactions (N) Reduce size of N as the size of itemset increases Used by DHP and vertical-based mining algorithms Reduce the number of comparisons (NM) Use efficient data structures to store the candidates or transactions No need to match every candidate against every transaction Prof. Dr. 
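A small sketch that reproduces two numbers from these slides: the support and confidence of {Milk, Diaper} --> {Beer} (0.4 and 0.67, as in the "Definition: Association Rule" slide) and the total rule count R = 3^d - 2^(d+1) + 1 = 602 for d = 6 from the "Computational Complexity" slide. Helper names are illustrative.

# Support and confidence of a rule X -> Y over the slide's five transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)    # 2/5 = 0.4
c = sigma(X | Y) / sigma(X)             # 2/3 ~ 0.67
print(s, c)

d = 6
print(3**d - 2**(d + 1) + 1)            # 602 possible association rules for d = 6 items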
Herwig Unger 95 Reducing Number of Candidates Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due to the following property of the support measure: ∀X , Y : ( X ⊆ Y ) ⇒ s( X ) ≥ s(Y ) Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support Prof. Dr. Herwig Unger 96 Illustrating Apriori Principle null A Found to be Infrequent B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE Pruned supersets ABDE ACDE BCDE ABCDE Prof. Dr. Herwig Unger 97 Illustrating Apriori Principle Item Bread Coke Milk Beer Diaper Eggs Count 4 2 4 3 4 1 Items (1-itemsets) Minimum Support = 3 If every subset is considered, 6C + 6C + 6C = 41 1 2 3 With support-based pruning, 6 + 6 + 1 = 13 Itemset {Bread,Milk} {Bread,Beer} {Bread,Diaper} {Milk,Beer} {Milk,Diaper} {Beer,Diaper} Count 3 2 3 2 3 3 Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs) Triplets (3-itemsets) Ite m s e t { B r e a d ,M ilk ,D ia p e r } Prof. Dr. Herwig Unger C ount 3 98 Apriori Algorithm Method: Let k=1 Generate frequent itemsets of length 1 Repeat until no new frequent itemsets are identified • Generate length (k+1) candidate itemsets from length k frequent itemsets • Prune candidate itemsets containing subsets of length k that are infrequent • Count the support of each candidate by scanning the DB • Eliminate candidates that are infrequent, leaving only those that are frequent Prof. Dr. Herwig Unger 99 Reducing Number of Comparisons Candidate counting: Scan the database of transactions to determine the support of each candidate itemset To reduce the number of comparisons, store the candidates in a hash structure • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets Transactions TID 1 2 3 4 5 Items Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke Prof. Dr. Herwig Unger 100 Generate Hash Tree Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: • Hash function • Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node) Hash function 3,6,9 1,4,7 234 567 345 136 145 2,5,8 124 457 125 458 Prof. Dr. Herwig Unger 356 357 689 367 368 159 101 Association Rule Discovery: Hash tree Hash Function 1,4,7 Candidate Hash Tree 3,6,9 2,5,8 234 567 145 136 345 Hash on 1, 4 or 7 124 125 457 458 159 Prof. Dr. Herwig Unger 356 367 357 368 689 102 Association Rule Discovery: Hash tree Hash Function 1,4,7 Candidate Hash Tree 3,6,9 2,5,8 234 567 145 136 345 Hash on 2, 5 or 8 124 125 457 458 159 Prof. Dr. Herwig Unger 356 367 357 368 689 103 Association Rule Discovery: Hash tree Hash Function 1,4,7 Candidate Hash Tree 3,6,9 2,5,8 234 567 145 136 345 Hash on 3, 6 or 9 124 125 457 458 159 Prof. Dr. Herwig Unger 356 367 357 368 689 104 Subset Operation Given a transaction t, what are the possible subsets of size 3? Prof. Dr. Herwig Unger 105 Subset Operation Using Hash Tree Hash Function 1 2 3 5 6 transaction 1+ 2356 2+ 356 1,4,7 3+ 56 3,6,9 2,5,8 234 567 145 136 345 124 457 125 458 159 356 357 689 Prof. Dr. 
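A compact sketch of the Apriori method just outlined: generate length-(k+1) candidates from the frequent k-itemsets, prune candidates that contain an infrequent k-subset (the Apriori principle), count supports in a pass over the transactions, and keep the frequent candidates. The join step and data structures are simplified (no hash tree), and the function name is mine; on the slide's five transactions with minimum support count 3 it recovers the itemsets from the "Illustrating Apriori Principle" slide.

# Simplified Apriori frequent-itemset generation, for illustration only.
from itertools import combinations

def apriori(transactions, minsup_count):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # frequent 1-itemsets
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= minsup_count}
    frequent = set(current)
    k = 1
    while current:
        # join step: merge frequent k-itemsets that differ in exactly one item
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # prune step (Apriori principle): every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # count supports and keep the frequent candidates
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= minsup_count}
        frequent |= current
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for itemset in sorted(apriori(transactions, minsup_count=3), key=len):
    print(set(itemset))
# 4 frequent items, 4 frequent pairs, and the triplet {Bread, Milk, Diaper}.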
Herwig Unger 367 368 106 Subset Operation Using Hash Tree Hash Function 1 2 3 5 6 transaction 1+ 2356 2+ 356 1,4,7 12+ 356 3+ 56 3,6,9 2,5,8 13+ 56 234 567 15+ 6 145 136 345 124 457 125 458 159 Prof. Dr. Herwig Unger 356 357 689 367 368 107 Subset Operation Using Hash Tree Hash Function 1 2 3 5 6 transaction 1+ 2356 2+ 356 1,4,7 12+ 356 3+ 56 3,6,9 2,5,8 13+ 56 234 567 15+ 6 145 136 345 124 457 125 458 159 356 357 689 367 368 Match transaction against 11 out of 15 candidates Prof. Dr. Herwig Unger 108 Factors Affecting Complexity Choice of minimum support threshold lowering support threshold results in more frequent itemsets this may increase number of candidates and max length of frequent itemsets Dimensionality (number of items) of the data set more space is needed to store support count of each item if number of frequent items also increases, both computation and I/O costs may also increase Size of database since Apriori makes multiple passes, run time of algorithm may increase with number of transactions Average transaction width transaction width increases with denser data sets This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width) Prof. Dr. Herwig Unger 109 Rule Generation Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement If {A,B,C,D} is a frequent itemset, candidate rules: ABC →D, A →BCD, AB →CD, BD →AC, ABD →C, B →ACD, AC → BD, CD →AB, ACD →B, C →ABD, AD → BC, BCD →A, D →ABC BC →AD, If |L| = k, then there are 2k – 2 candidate association rules (ignoring L → ∅ and ∅ → L) Prof. Dr. Herwig Unger 110 Rule Generation How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property c(ABC →D) can be larger or smaller than c(AB →D) But confidence of rules generated from the same itemset has an anti-monotone property e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD) • Confidence is anti-monotone w.r.t. number of items on the RHS of the rule Prof. Dr. Herwig Unger 111 Rule Generation for Apriori Algorithm Lattice of rules Low Confidence Rule Pruned Rules Prof. Dr. Herwig Unger 112 Rule Generation for Apriori Algorithm Candidate rule is generated by merging two rules that share the same prefix CD=>AB BD=>AC in the rule consequent join(CD=>AB,BD=>AC) would produce the candidate rule D => ABC D=>ABC Prune rule D=>ABC if its subset AD=>BC does not have high confidence Prof. Dr. Herwig Unger 113 Effect of minsup How to set the appropriate minsup threshold? If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products) If minsup is set too low, it is computationally expensive and the number of itemsets is very large Using a single minimum support threshold may not be effective Prof. Dr. Herwig Unger 114 Multiple Minimum Support How to apply multiple minimum supports? MS(i): minimum support for item i e.g.: MS(Milk)=5%, MS(Coke) = 3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5% MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli)) = 0.1% Challenge: Support is no longer anti-monotone • Suppose: Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5% • {Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent Prof. Dr. 
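A brute-force sketch of the rule-generation step: for a frequent itemset L, emit every rule f --> L - f whose confidence meets minconf. It does not use the anti-monotone confidence pruning described above (it simply illustrates the definition), and the names are mine; on the slide's transactions it keeps four of the six candidate rules for {Milk, Diaper, Beer}.

# Rule generation from a single frequent itemset, filtered by minimum confidence.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(L, minconf):
    L = frozenset(L)
    for r in range(1, len(L)):                       # size of the antecedent
        for f in map(frozenset, combinations(L, r)):
            conf = sigma(L) / sigma(f)
            if conf >= minconf:
                yield set(f), set(L - f), conf

for lhs, rhs, conf in rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(f"{lhs} -> {rhs}  (conf={conf:.2f})")
# Keeps 4 of the 6 candidate rules, e.g. {Milk, Beer} -> {Diaper} with conf 1.00.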
Herwig Unger 115 Multiple Minimum Support Item MS (I) S u p (I) A 0.10% 0.25% B 0.20% 0.26% C 0.30% 0.29% D 0.50% 0.05% E 3% 4.20% AB ABC AC ABD AD ABE AE AC D BC AC E BD AD E BE BC D CD BC E CE BD E DE CDE A B C D E Prof. Dr. Herwig Unger 116 Multiple Minimum Support Item MS (I) S u p (I) AB ABC AC ABD AD ABE AE ACD BC ACE BD ADE BE BCD CD BCE CE BDE DE CDE A A B 0.10% 0.25% 0.20% 0.26% B C C 0.30% 0.29% D D 0.50% 0.05% E E 3% 4.20% Prof. Dr. Herwig Unger 117 Multiple Minimum Support (Liu 1999) Order the items according to their minimum support (in ascending order) e.g.: MS(Milk)=5%, MS(Coke) = 3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5% Ordering: Broccoli, Salmon, Coke, Milk Need to modify Apriori such that: L1 : set of frequent items F1 : set of items whose support is ≥ MS(1) where MS(1) is mini( MS(i) ) C2 : candidate itemsets of size 2 is generated from F1 instead of L1 Prof. Dr. Herwig Unger 118 Multiple Minimum Support (Liu 1999) Modifications to Apriori: In traditional Apriori, • A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k • The candidate is pruned if it contains any infrequent subsets of size k Pruning step has to be modified: • Prune only if subset contains the first item • e.g.: Candidate={Broccoli, Coke, Milk} (ordered according to minimum support) • {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent – Candidate is not pruned because {Coke,Milk} does not contain the first item, i.e., Broccoli. Prof. Dr. Herwig Unger 119 Pattern Evaluation Association rule algorithms tend to produce too many rules many of them are uninteresting or redundant Redundant if {A,B,C} → {D} and {A,B} → {D} have same support & confidenc Interestingness measures can be used to prune/rank the derived patterns In the original formulation of association rules, support & confidence are the only measures used Prof. Dr. Herwig Unger 120 Subjective Interestingness Measure Objective measure: Rank patterns based on statistics computed from data e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc). Subjective measure: Rank patterns according to user’s interpretation • A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin) • A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin) Prof. Dr. Herwig Unger 121 Interestingness via Unexpectedness Need to model expectation of users (domain knowledge) + - Pattern expected to be frequent Pattern expected to be infrequent Pattern found to be frequent Pattern found to be infrequent + - + Expected Patterns Unexpected Patterns Need to combine expectation of users with evidence from data (i.e., extracted patterns) Prof. Dr. Herwig Unger 122 Interestingness via Unexpectedness Web Data (Cooley et al 2001) Domain knowledge in the form of site structure Given an itemset F = {X1, X2, …, Xk} (Xi : Web pages) • L: number of links connecting the pages • lfactor = L / (k × k-1) • cfactor = 1 (if graph is connected), 0 (disconnected graph) Structure evidence = cfactor × lfactor Usage evidence P ( X Ι X Ι ... Ι X ) = P( X ∪ X ∪ ... ∪ X ) 1 1 2 2 k k Use Dempster-Shafer theory to combine domain knowledge and evidence from data Prof. Dr. Herwig Unger 123 Sequential Pattern Discovery: Definition Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. 
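As a sketch of objective interestingness measures beyond support and confidence, the Jaccard coefficient (one of the measures named on the "Subjective Interestingness Measure" slide) and lift, a common objective measure not listed there, can be computed for {Milk, Diaper} --> {Beer} over the same five transactions. The formulas follow the standard definitions and the names are mine.

# Two objective interestingness measures for the rule X -> Y.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
N = len(transactions)

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
jaccard = sigma(X | Y) / (sigma(X) + sigma(Y) - sigma(X | Y))   # 2 / (3 + 3 - 2) = 0.5
lift = (sigma(X | Y) / sigma(X)) / (sigma(Y) / N)               # 0.67 / 0.6 ~ 1.11
print(jaccard, lift)   # lift > 1: Milk+Diaper buyers buy Beer slightly more than average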
(A B) (C) (D E) Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints. (A B) (C) <= xg (D E) >ng <= ws <= ms Prof. Dr. Herwig Unger 124 Sequential Pattern Discovery: Examples In telecommunications alarm logs, (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm) In point-of-sale transaction sequences, Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk) Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket) Prof. Dr. Herwig Unger 125 Regression Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Greatly studied in statistics, neural network fields. Examples: Predicting sales amounts of new product based on advetising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices. Prof. Dr. Herwig Unger 126 Deviation/Anomaly detection Detect significant deviations from normal behavior Applications: Credit Card Fraud Detection Network Intrusion Detection Typical network traffic at University level may reach over 100 million connections per day Prof. Dr. Herwig Unger 127 Anomaly/Outlier Detection cont. What are anomalies/outliers? The set of data points that are considerably different than the remainder of the data Variants of Anomaly/Outlier Detection Problems Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x) Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D Applications: Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection Prof. Dr. Herwig Unger 128 Importance of Anomaly Detection Ozone Depletion History In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded! Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html Prof. Dr. Herwig Unger 129 Anomaly Detection Challenges How many outliers are there in the data? Method is unsupervised • Validation can be quite challenging (just like for clustering) Finding needle in a haystack Working assumption: There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data Prof. Dr. Herwig Unger 130 Anomaly Detection Schemes General Steps Build a profile of the “normal” behavior • Profile can be patterns or summary statistics for the overall population Use the “normal” profile to detect anomalies • Anomalies are observations whose characteristics differ significantly from the normal profile Types of anomaly detection schemes Graphical & Statistical-based Distance-based Model-based Prof. Dr. Herwig Unger 131 Graphical Approaches Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D) Limitations Time consuming Subjective Prof. Dr. 
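The boxplot idea from the "Graphical Approaches" slide can be turned into a simple numeric rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. The 1.5 factor is the usual whisker convention rather than something stated on the slide, and the sample data are made up.

# 1-D outlier flagging with the conventional boxplot whisker rule.
import numpy as np

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7, 25.0])   # one obvious outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)   # -> [25.]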
Herwig Unger 132 Convex Hull Method Extreme points are assumed to be outliers Use convex hull method to detect extreme values What if the outlier occurs in the middle of the data? Prof. Dr. Herwig Unger 133 Statistical Approaches Assume a parametric model describing the distribution of the data (e.g., normal distribution) Apply a statistical test that depends on Data distribution Parameter of distribution (e.g., mean, variance) Number of expected outliers (confidence limit) Prof. Dr. Herwig Unger 134 Statistical-based – Likelihood Approach Data distribution, D = (1 – λ) M + λ A M is a probability distribution estimated from data Can be based on any modeling method (naïve Bayes, maximum entropy, etc) A is initially assumed to be uniform distribution Likelihood at time t: ⎞ ⎞⎛ |At | ⎛ |M t | Lt ( D ) = ∏ PD ( xi ) = ⎜⎜ (1 − λ ) ∏ PM t ( xi ) ⎟⎟⎜⎜ λ ∏ PAt ( xi ) ⎟⎟ i =1 xi ∈M t ⎠ ⎠⎝ xi ∈At ⎝ LLt ( D ) = M t log(1 − λ ) + ∑ log PM t ( xi ) + At log λ + ∑ log PAt ( xi ) N xi ∈M t Prof. Dr. Herwig Unger xi ∈At 135 Statistical-based – Likelihood Approach Assume the data set D contains samples from a mixture of two probability distributions: M (majority distribution) A (anomalous distribution) General Approach: Initially, assume all the data points belong to M Let Lt(D) be the log likelihood of D at time t For each point xt that belongs to M, move it to A • Let Lt+1 (D) be the new log likelihood. • Compute the difference, Δ = Lt(D) – Lt+1 (D) • If Δ > c (some threshold), then xt is declared as an anomaly and moved permanently from M to A Prof. Dr. Herwig Unger 136 Limitations of Statistical Approaches Most of the tests are for a single attribute In many cases, data distribution may not be known For high dimensional data, it may be difficult to estimate the true distribution Prof. Dr. Herwig Unger 137 Distance-based Approaches Data is represented as a vector of features Three major approaches Nearest-neighbor based Density based Clustering based Prof. Dr. Herwig Unger 138 Nearest-Neighbor Based Approach Approach: Compute the distance between every pair of data points There are various ways to define outliers: • Data points for which there are fewer than p neighboring points within a distance D • The top n data points whose distance to the kth nearest neighbor is greatest • The top n data points whose average distance to the k nearest neighbors is greatest Prof. Dr. Herwig Unger 139 Outliers in Lower Dimensional Projection Divide each attribute into φ equal-depth intervals Each interval contains a fraction f = 1/φ of the records Consider a k-dimensional cube created by picking grid ranges from k different dimensions If attributes are independent, we expect region to contain a fraction fk of the records If there are N points, we can measure sparsity of a cube D as: Negative sparsity indicates cube contains smaller number of points than expected Prof. Dr. Herwig Unger 141 Example N=100, φ = 5, f = 1/5 = 0.2, N × f2 = 4 Prof. Dr. Herwig Unger 142 Density-based: LOF approach For each point, compute the density of its local neighborhood Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors Outliers are points with largest LOF value In the NN approach, p2 is not considered as outlier, while LOF approach find both p1 and p2 as outliers × p2 × p1 Prof. Dr. 
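A sketch of one of the distance-based definitions listed above: score each point by its distance to its k-th nearest neighbor and report the points with the largest scores. Data and parameter values are made up for illustration.

# Nearest-neighbor based outlier scoring: distance to the k-th nearest neighbor.
import numpy as np

def knn_distance_scores(points, k):
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dists_sorted = np.sort(dists, axis=1)    # column 0 is the distance to the point itself (0)
    return dists_sorted[:, k]                # distance to the k-th nearest neighbor

points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [5.0, 5.0]])
scores = knn_distance_scores(points, k=2)
top_n = np.argsort(scores)[::-1][:1]         # indices of the top-1 outlier(s)
print(top_n, scores[top_n])                  # the point [5, 5] gets by far the largest score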
Herwig Unger 143 Clustering-Based Basic idea: Cluster the data into groups of different density. Choose points in small clusters as candidate outliers. Compute the distance between candidate points and non-candidate clusters. • If candidate points are far from all other non-candidate points, they are outliers. Prof. Dr. Herwig Unger 144
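A sketch of the clustering-based idea above, assuming cluster labels are already available (for example from the K-means sketch shown earlier): points in small clusters become candidates, and a candidate that is far from every large cluster is reported as an outlier. The labels, thresholds, and data below are made up for illustration.

# Clustering-based outlier detection: small clusters yield candidates,
# candidates far from all large clusters are reported as outliers.
import numpy as np

points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.5]])
labels = np.array([0, 0, 0, 1, 1, 2])    # e.g. the output of a clustering run
min_cluster_size = 2                     # clusters smaller than this yield candidates
max_distance = 3.0                       # "far from all other clusters" threshold

sizes = np.bincount(labels)
large_clusters = np.flatnonzero(sizes >= min_cluster_size)
centroids = np.array([points[labels == c].mean(axis=0) for c in large_clusters])

outliers = []
for i in np.flatnonzero(sizes[labels] < min_cluster_size):      # candidate points
    d = np.linalg.norm(centroids - points[i], axis=1).min()     # nearest large-cluster centroid
    if d > max_distance:
        outliers.append(i)
print(outliers)   # -> [5]: the lone point [9.0, 0.5]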