Section 5 Data Mining
CA306 Data Mining

Section Content
• 5.1 Introduction
• 5.2 Knowledge Discovery
• 5.3 Association Rules
• 5.4 Sequential Patterns
• 5.5 Classification and Regression
• 5.6 Other Forms of Data Mining
• 5.7 Applications of Data Mining

5.1 Data Mining Introduction
• Data mining:
  + the discovery of new information in terms of patterns or rules from huge amounts of data
  + mining tools should identify these patterns, rules and trends with minimal user input
  + data mining is related to
    • statistics: exploratory data analysis
    • artificial intelligence: knowledge discovery and machine learning
  + techniques from machine learning, statistics, neural networks and genetic algorithms are used
  + because of the sheer volume of data, the efficiency and scalability of data mining algorithms are key issues

Data Mining and Data Warehousing
• The goal of data warehousing is to support decision making with data.
• In conjunction with a data warehouse, data mining can help with certain types of decisions.
• Data mining helps to extract new patterns/rules that cannot be found by merely querying or processing data.
• Aggregated or summarised collections of data in warehouses improve the efficiency of data mining in these cases.
• The potential use of data mining needs to be considered early in the design of a data warehouse.

5.2 Knowledge Discovery
• Data mining is part of the knowledge discovery process:
  + data selection
  + data cleansing
  + enrichment
  + data transformation / encoding
  + data mining
  + reporting and display
• Example:
  + Database: transaction database for a goods retailer
  + Client data: name, zip code, phone, date of purchase, item code, price, quantity, total amount

Knowledge Discovery - Example
• New knowledge can be discovered from the client data
  + data selection:
    • data about specific items or categories of items
    • items from stores in specific regions
  + data cleansing:
    • correct incorrect zip codes
    • eliminate records with incorrect phone numbers
  + enrichment: add additional information
    • age, income, credit rating of client
  + data transformation: reduce the amount of data
    • group items into product categories
    • group zip codes into regions

Data Mining - Knowledge Discovery
• Data mining might discover
  + co-occurrences - items that are typically bought together
  + association rules - when a customer buys video equipment, he/she also buys another electronic gadget
  + sequential patterns - when a customer buys a camera, then within 3 months he/she buys photographic supplies
  + classification trees - customers can be classified by frequency of visits, types of finance used, etc., combined with statistics about the classes
• This information can then be used, for example, to
  + optimise store locations
  + run promotions
  + plan seasonal marketing strategies

Goals of Data Mining
• Prediction
  + show how certain attributes within the data will behave in the future
  + example: predict what customers will buy under certain discounts
  + example: predict sales volume for some period
• Identification
  + data patterns can be used to identify the existence of an item, an event, or an activity
  + example: detecting intruders by the commands they execute

Goals of Data Mining
• Classification
  + partition data such that different classes or categories can be identified
  + example: customers can be categorised into regular and infrequent shoppers, into discount-seeking customers, etc.
  + categorisation - e.g. into food categories - can reduce the complexity of data mining
• Optimisation
  + optimise the use of limited resources (time, space, money, etc.)
  + example: what are the best products to spend our money on over the next three months?

Types of Knowledge Discovered
• Co-occurrences
  + collection of items/actions/events that occur together
  + example: items that are bought together by a consumer in a shop
• Association rules
  + correlate the presence of a set of items with a range of values for another set of variables
  + example: when someone buys bread, he/she is likely to buy cheese
• Classification hierarchies
  + create a hierarchy of classes from an existing set of events or transactions
  + example: customers might be divided into a creditworthiness hierarchy based on their previous credit transactions

Types of Knowledge Discovered
• Sequential patterns
  + search for a sequence of events or actions
  + example: a patient who underwent cardiac surgery and later developed high blood urea is likely to suffer from kidney problems
• Patterns within time series
  + detection of similarities within positions of the time series
  + example: a pattern in a time series of stock market prices may be used to predict employment rates
• Categorisation and segmentation
  + partition a set of events or items into segments/categories/classes
  + example: treatment data on a disease can be partitioned into groups based on the side effects that are caused

Counting Co-occurrences
• The problem is to count co-occurring itemsets - motivated by market basket analysis.
• A database of consumer transactions forms the basis
  + transaction: a single visit to a store, an order at a virtual store (Web site), or a single order through a mail-order catalog
  + a transaction consists of a transaction ID, customer ID, date, item and quantity
• The goal is to identify items that are typically purchased together.
• This can be used to improve the layout of shops or catalogs.
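
As a minimal sketch of counting co-occurrences, the following Python fragment counts how often each pair of items appears in the same basket. The four baskets and the names `baskets` and `pair_counts` are illustrative assumptions, not part of the course material.

```python
from collections import Counter
from itertools import combinations

# Illustrative baskets: one set of items per transaction (assumed data).
baskets = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

# Count every unordered pair of items that occurs in the same basket.
pair_counts = Counter()
for basket in baskets:
    # Sorting gives each pair one canonical form, e.g. ("juice", "milk").
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The pair bought together most often, with its count.
print(pair_counts.most_common(1))
```

Counting pairs is only the simplest case; the same loop with `combinations(..., k)` would count larger itemsets, at a cost that grows quickly with `k`.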

Frequent Itemsets (1)
• Consider the following transaction table:

  Transaction  Customer  Date   Items bought
  101          12        11/09  milk, bread, juice
  792          13        12/09  milk, juice
  1130         14        14/09  milk, eggs
  1735         13        14/09  bread, coffee, biscuits

  Items bought in one visit are already grouped together into itemsets.
• Support of an itemset: the fraction of transactions that contain all items in the itemset
• Examples
  + {milk, juice} has a support of 50%
  + {bread, coffee} has a support of 25%

Frequent Itemsets (2)
• Large itemsets are itemsets that have a certain minimum support, i.e. itemsets that occur frequently.
• Example:
  + for a minimum support of 40%, the large itemsets are {milk, juice}, {milk}, {juice}, {bread}
• Proposition:
  + every subset of a large itemset is also a large itemset
• Algorithm:
  + large itemsets can be computed incrementally
  + start with itemsets of cardinality 1 that have the required support

5.3 Association Rules
• A database can be regarded as a collection of transactions.
• Each transaction involves a set of items.
• Example: the items in a basket that a shopper uses in a supermarket

  Transaction  Time  Items bought
  101          6:35  milk, bread, juice
  792          7:38  milk, juice
  1130         8:05  milk, eggs
  1735         8:40  bread, coffee, biscuits

Association Rules
• An association rule is of the form X => Y, where X and Y are two disjoint sets of items
• Example:
  + for sets of goods as itemsets X and Y, the expression X => Y means that if a customer buys X, he/she is also likely to buy Y.
  + if the customer buys milk, he/she is also likely to buy juice.
• The support for a rule X => Y is the percentage of transactions that contain all of the items in the union $X \cup Y$.
• Examples:
  + Milk => Juice has 50% support
  + Bread => Juice has 25% support

Association Rules
• The confidence of a rule X => Y is the percentage (fraction) of all transactions including X that also include Y.
• Example:
  + the rule Milk => Juice has confidence 66.7%
  + that means that 2/3 of all transactions with milk also include juice
• Note that the support and the confidence of a rule generally differ.
• The goal is to discover rules with a certain minimum support and confidence.
• These rules can be used for prediction: for a rule Pen => Ink, offer discounts on pens and you might increase ink sales.

Association Rules
• How to compute these rules?
  + Generate large itemsets (itemsets with a certain minimum support)
  + For each large itemset X, generate all rules with a certain minimum confidence (mconf):
      for X and $Y \subset X$, let Z = X - Y (divide X into Y and Z)
      if support(X) / support(Y) > mconf then Y => Z is a valid rule
      the confidence of rule Y => Z is defined as support(X) / support(Y)
  + Example:
      for X = {milk, juice} and Y = {milk} $\subset$ {milk, juice}, let Z = {juice}
      X, Y, Z have support 50%, 75% and 50%, resp. (for the support of itemsets, see the Frequent Itemsets slides)
      for mconf = 40%, {milk} => {juice} is a valid rule with confidence 66.7% (50/75)

Generating Association Rules
• In principle, generating rules based on large itemsets and their support is straightforward.
• Computing all large itemsets and their support creates an efficiency problem if the number of items is very high.
• If m is the number of items, then $2^m$ is the number of different itemsets.
• Example: a typical supermarket might have several thousand items.
  + Computing the support of all itemsets might take a long time.
  + Reducing the combinatorial search space is therefore important - the following properties can be used:
    • subsets of large itemsets are large
    • extensions of small itemsets are small

Association Rules - Algorithms
• Outline of an algorithm that finds large itemsets:
• Step 1:
  + test the support for itemsets of length 1 - called 1-itemsets - by scanning the database;
  + discard those that do not meet the minimum requirement.
• Step 2:
  + extend large 1-itemsets into 2-itemsets by appending one item each time (this generates all candidate itemsets of length two);
  + test the support and eliminate all 2-itemsets that do not meet the minimum support.
• Step 3:
  + repeat the above steps: extend (k-1)-itemsets into k-itemsets.

Association Rules among Hierarchies
• Items might be divided among disjoint hierarchies based on some classification, e.g. Beverage can be divided into Juice and Milk.
• Associations might occur among the hierarchies of items.
• Example: healthy frozen yoghurt => bottled water
• Particularly interesting are associations across hierarchies.
  + this kind of information can be used to arrange different kinds of items in a supermarket

Negative Associations
• Negative associations are more difficult to detect than positive associations.
• Example: 60% of customers who buy crisps do not buy bottled water.
• There are usually more negative associations than positive ones.
• The majority of itemset combinations do not occur in databases.
• Finding interesting negative associations can be difficult.

Association Rules - Additional Considerations
• Sampling:
  + For very large databases, sampling improves efficiency.
  + Truly representative samples can help to find most of the rules.
  + The danger is that
    • false positives might be discovered (large itemsets that are not truly large);
    • true positives might be missed.
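
The level-wise search for large itemsets and the rule-generation step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not an optimised implementation: the function names `support`, `large_itemsets` and `generate_rules` are hypothetical, the data is the four-transaction example table, and the confidence test uses the strict comparison `>` as written on the rule-generation slide.

```python
from itertools import combinations

# The four example transactions, one set of items each.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def large_itemsets(transactions, min_support):
    """Level-wise search: extend large (k-1)-itemsets into k-itemsets."""
    items = sorted(set().union(*transactions))
    # Step 1: keep the 1-itemsets that meet the minimum support.
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s, transactions)
        # Steps 2 and 3: append one item to each large itemset, then
        # discard candidates below the minimum support.
        candidates = {s | {i} for s in level for i in items if i not in s}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return result

def generate_rules(large, min_conf):
    """For each large X and non-empty proper subset Y, test Y => X - Y."""
    rules = []
    for x, support_x in large.items():
        for size in range(1, len(x)):
            for y in map(frozenset, combinations(sorted(x), size)):
                # Y is in `large` because subsets of large itemsets are large.
                confidence = support_x / large[y]
                if confidence > min_conf:
                    rules.append((set(y), set(x - y), confidence))
    return rules

large = large_itemsets(transactions, min_support=0.4)
for y, z, confidence in generate_rules(large, min_conf=0.4):
    print(sorted(y), "=>", sorted(z), f"confidence {confidence:.1%}")
```

With a minimum support of 40% the search returns {milk}, {juice}, {bread} and {milk, juice}, and the rule {milk} => {juice} comes out with confidence 66.7%, matching the slide examples. The sketch rescans all transactions for every candidate, which is exactly the cost that the two pruning properties are meant to limit.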

Association Rules - Additional Considerations (2)
• Other problems:
  + Cardinality of itemsets and volume of transactions can be very high.
  + Variability of transactions (geographical, seasonal) makes sampling difficult.
  + Multiple classifications along different dimensions.

5.4 Sequential Patterns
• Sequential patterns are based on sequences of itemsets.
• Assume transactions to be ordered by time.
• Example:
  + transactions in a supermarket
  + {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} may be based on three visits of a customer
• A subsequence of a sequence is obtained by deleting one or more itemsets.
• Example:
  + let {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} be the original sequence
  + {milk, bread, juice} ; {bread, eggs} is a subsequence
  + {milk, bread, juice} ; {milk, coffee, biscuits} is a subsequence

Support for Sequences
• A sequence $a_1 ; \ldots ; a_n$ is contained in another sequence S if S has a subsequence $b_1 ; \ldots ; b_n$ such that $a_i \subseteq b_i$ for $1 \le i \le n$
• Example:
  + {milk, bread} ; {coffee, biscuits} is contained in {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits}
• The support of a sequence S is the percentage of a set of given sequences that contain S.

Discovery of Patterns in Time Series
• Time series are sequences of events.
• An event might be a fixed type of transaction.
• Example:
  + closing price of a stock or fund each day.
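
Returning to the Support for Sequences slide, the containment test and the support of a sequence can be sketched as below. The names `contains` and `sequence_support` are hypothetical, and the second customer history (`other`) is made-up data added only so that the support is not trivially 100%.

```python
def contains(sequence, pattern):
    """True if `pattern` is contained in `sequence`.

    Each itemset of the pattern must be a subset of some itemset of the
    sequence, with the matched itemsets appearing in the same order.
    Matching each pattern itemset as early as possible is sufficient.
    """
    position = 0
    for itemset in pattern:
        # Skip itemsets of the sequence that do not cover this itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern itemset must match strictly later
    return True

def sequence_support(pattern, sequences):
    """Fraction of the given sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# The customer history from the slides, as a list of itemsets ordered by time.
history = [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"milk", "coffee", "biscuits"}]
# A second, hypothetical customer history.
other = [{"milk"}, {"bread"}]
pattern = [{"milk", "bread"}, {"coffee", "biscuits"}]

print(contains(history, pattern))
print(sequence_support(pattern, [history, other]))
```

The order of the itemsets matters: the same two itemsets in reverse order are not contained in `history`, because {coffee, biscuits} only occurs in the last visit.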

Discovery of Patterns in Time Series (2)
• Analysis of time series:
  + find periods of time in which the stock did not fluctuate by more than 1%
  + find the period (week/month/quarter) with the greatest loss
  + identify stocks with similar behaviour

5.5 Classification and Regression
• Classification Rules
• Regression
• Tree-structured Rules

Discovery of Classification Rules
• Classification means defining/identifying a function that maps an object into one of many possible classes.
• Example: a bank wants to classify loan applicants into “loanworthy” and “not loanworthy”
  + a classification rule could define the classification
    • not loanworthy: current monthly debt obligation exceeds 25% of monthly net income
    • loanworthy: otherwise
  + loanworthiness is a dependent, categorical attribute
• In general there is one rule (set) per class
    (var1 in range1) and ... and (varn in rangen) => object O in class C1
  var1, ..., varn are the predictor attributes

Support and Confidence
• Again we can define support and confidence for these rules.
• The support for a classification condition C is the percentage of tuples that satisfy C.
• The support for a rule C1 => C2 is the support for the condition $C_1 \wedge C_2$. (C1 AND C2 is the set of objects in both C1 and C2.)
• Consider those tuples that satisfy condition C1. The confidence for a rule C1 => C2 is the percentage of such tuples that also satisfy condition C2.

Regression
• Regression is similar to classification, except that the dependent variable is numerical (and not categorical).
• Rules (such as classification rules) can be regarded as functions.
• A regression rule is a function that maps variables into a numerical target variable.
• Example: LabTest(patientID, test1, ... , testn)
  + the values in that relation result from a series of lab tests
  + the target variable P is the probability of survival - a numerical variable
  + the regression rule: (test1 in range1) and ... and (testn in rangen) => P = x
  + the regression function is P = f(test1, ... , testn)

Regression (2)
• If P appears as a function $y = f(x_1, \ldots, x_n)$ and f is linear in the domain variables, then the process of deriving f from a given set of tuples $\langle x_1, \ldots, x_n, y \rangle$ is called linear regression.
• Linear regression is a common statistical technique.

Tree-Structured Rules
• Specific classification and regression rules shall now be examined.
• These are rules that can be represented as trees - called classification trees or decision trees.
• These trees are typically the output of the data mining activity.
• Each path from the root to a leaf node represents one classification rule.
• Example: insurance risk determination for motor insurance

    Age
      <= 25: Car Type
               sports: YES
               family: NO
      > 25:  NO

Decision Trees
• A decision tree is a graphical representation of a collection of classification rules.
• Each internal node in the tree is labelled with a predictor or splitting attribute.
• Each outgoing edge of an internal node is labelled with a predicate that involves the splitting attribute.
• Each leaf node is labelled with a value of the dependent attribute.
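
As an illustration, the motor-insurance example tree can be written down as a nested structure, and every root-to-leaf path read off as one classification rule. The tuple encoding and the name `path_rules` are assumptions made for this sketch; the tree itself is the one from the Tree-Structured Rules slide.

```python
# An internal node is (splitting attribute, [(edge predicate, child), ...]);
# a leaf is simply the value of the dependent attribute.
tree = ("Age", [
    ("<= 25", ("Car Type", [
        ("= sports", "YES"),
        ("= family", "NO"),
    ])),
    ("> 25", "NO"),
])

def path_rules(node, path=()):
    """Yield (condition, leaf value) for every root-to-leaf path."""
    if isinstance(node, str):  # leaf reached: emit the collected predicates
        yield " and ".join(path), node
        return
    attribute, branches = node
    for predicate, child in branches:
        yield from path_rules(child, path + (f"{attribute} {predicate}",))

for condition, label in path_rules(tree):
    print(f"{condition} => {label}")
```

The three paths give the rules "Age <= 25 and Car Type = sports => YES", "Age <= 25 and Car Type = family => NO" and "Age > 25 => NO".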

Decision Trees (2)
• A classification rule can be associated with each leaf node, constructed as the conjunction of the predicates on the path from the root to that leaf:
  + Age <= 25 and Car Type = sports for the YES-leaf
• Decision trees are constructed in two phases:
  + growth phase: create the tree based on specialised rules from an input database (relation)
  + pruning phase: reduce the tree size by generalising rules

5.6 Other Types of Data Mining
• Neural Networks
• Genetic Algorithms
• Clustering and Segmentation

Neural Networks
• Techniques from artificial intelligence can be used to generalise regression.
• Neural networks provide an iterative method to carry out this generalised regression.
• Neural networks use a curve-fitting approach to infer a function from a set of samples.
• This process is based on learning: a test sample is the initial input; the system then incrementally infers functions based on more samples.
• Neural networks can be applied to classification problems.
• Modelling time series with neural networks is difficult.

Genetic Algorithms (1)
• Genetic algorithms (GA) are a class of randomised search procedures for adaptive and robust search over a wide range of search topologies.
• Principle:
  + Genetic algorithms extend the idea of characterising human DNA by a four-letter alphabet (A, C, T, G).
• Construction:
  + Devise an alphabet that allows the encoding of a solution to the decision problem in terms of strings of that alphabet.
• Usage:
  + Study the cutting and combination of strings (compare natural reproduction and evolution).
  + New generations of individuals (solutions) are generated and assessed - survival of the fittest.
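
The following toy example sketches the cut-and-combine idea in Python. Everything specific in it is an assumption made for illustration: the target string, the fitness function (number of positions matching the target), the population size, the mutation rate, and the selection scheme that keeps the fitter half of each generation unchanged.

```python
import random

ALPHABET = "ACGT"           # four-letter alphabet, as in the DNA analogy
TARGET = "GATTACAGATTACA"   # hypothetical optimum, used only to define fitness

def fitness(candidate):
    """Number of positions at which the candidate matches the target."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def crossover(parent1, parent2, rng):
    """Cut both parents at one random point and combine the pieces."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:]

def mutate(candidate, rng, rate=0.05):
    """Replace each letter by a random one with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else c
                   for c in candidate)

def evolve(pop_size=40, generations=200, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in TARGET)
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == len(TARGET):
            break
        # Survival of the fittest: the better half survives unchanged ...
        survivors = population[: pop_size // 2]
        # ... and the rest is replaced by mutated offspring of survivors.
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best), "of", len(TARGET))
```

Because the surviving half is carried over unchanged, the best fitness in the population never decreases from one generation to the next; whether the target is actually reached depends on the random seed and the number of generations.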

Genetic Algorithms (2)
• Generation of solutions - comparison with other techniques.
  + GA search uses a set of solutions during each generation rather than a single solution.
  + The search in the string-space represents a much larger parallel search in the space of encoded solutions.
  + The memory of the search completed is represented solely by the set of solutions available for generation.
  + A GA is a randomised algorithm since search mechanisms use probabilistic operators.
  + While progressing from one generation to the next, a GA finds near-optimal balance between knowledge acquisition and exploitation by manipulating encoded solutions.

Clustering and Segmentation
• Clustering is about identification and classification.
• Clustering tries to identify categories (or clusters) to which a data object can be mapped.
• The categories can be disjoint or might overlap; they might be organised into trees.
• A related problem: multivariate probability density functions.

5.7 Applications of Data Mining
• Decision-making contexts:
  + Marketing:
    • analysis of customer behaviour based on buying patterns;
    • determination of marketing strategies (store locations, advertising campaigns, etc.);
    • segmentation of customers, stores, products.
  + Finance:
    • analysis of creditworthiness of clients;
    • performance analysis of finance investments;
    • evaluation of financing options;
    • fraud detection.

Applications
  + Manufacturing:
    • optimisation of resources (machines, manpower, material);
    • optimal design of manufacturing process, shop-floor layout, etc.
  + Health care:
    • analysis of effectiveness of certain treatments;
    • optimisation of processes in a hospital;
    • analysing side effects of drugs;
    • relating patient wellness and doctor qualifications.