CONCEPT DESCRIPTION

Data mining can be classified into two categories:

Descriptive data mining: describes the data set in a concise and summative manner and presents interesting general properties of the data.

Predictive data mining: analyzes the data in order to construct one or a set of models, and attempts to predict the behavior of new data sets.

What is concept description?

The simplest kind of descriptive data mining is concept description. As a data mining task, concept description is not a simple enumeration of the data. Instead, concept description generates descriptions for the characterization and comparison of the data. It is sometimes called class description. Characterization provides a concise and succinct summarization of a given collection of data, while concept or class comparison (also known as discrimination) provides descriptions comparing two or more collections of data.

DATA GENERALIZATION

Data generalization is the process of creating successive layers of summary data in an evaluational database. It is a process of zooming out to get a broader view of a problem, trend, or situation, and is also known as rolling up data. Equivalently, data generalization is a form of descriptive data mining that abstracts a large set of task-relevant data in a database from a low conceptual level to a higher one. For example, mobile number and landline number can both be replaced by the higher-level concept telephone number. Generalization gives the user ease and flexibility in examining the general behavior of the data. For example, if a manager wants to view data generalized to a higher level rather than examine individual employees, generalization makes this straightforward. This leads to the notion of concept description.

Data generalization can be of great help in Online Analytical Processing (OLAP) technology. OLAP is used to provide quick answers to analytical queries that are multidimensional by nature, and it is commonly used as part of the broader category of business intelligence. Since OLAP is used mostly for business reporting, such as reports for sales, marketing, management, and business process management, a better view of trends and patterns greatly speeds up these reports. Data generalization is also beneficial alongside Online Transaction Processing (OLTP) systems; OLAP was created later than OLTP and differs from it in several respects.

Figure 1 Data generalization

What are the differences between concept description in large databases and on-line analytical processing? The fundamental differences between the two include the following:

1. Many current OLAP systems confine dimensions to nonnumeric data, and measures (such as count(), sum(), average()) in current OLAP systems apply only to numeric data. In contrast, for concept description, the database attributes can be of various data types, including numeric, nonnumeric, spatial, text, or image.

2. Concept description in databases can handle complex data types of the attributes and their aggregations as necessary, whereas OLAP cannot.
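Before moving on, here is a tiny illustration of the attribute generalization idea described above. This is only a sketch in the spirit of attribute-oriented induction: the concept hierarchy, the tuples, and the function name are all hypothetical, chosen for illustration. Low-level values are rolled up to higher-level concepts and identical generalized tuples are merged, accumulating a count.

from collections import Counter

# Hypothetical concept hierarchy: low-level value -> higher-level concept.
hierarchy = {"mobile number": "telephone number",
             "landline number": "telephone number",
             "B.Sc.": "undergraduate", "B.A.": "undergraduate",
             "M.Sc.": "graduate", "Ph.D.": "graduate"}

def generalize(tuples):
    """Replace each value by its high-level concept and merge duplicates."""
    rolled = [tuple(hierarchy.get(v, v) for v in t) for t in tuples]
    return Counter(rolled)

employees = [("mobile number", "B.Sc."), ("landline number", "B.A."),
             ("mobile number", "M.Sc."), ("landline number", "Ph.D.")]
for gen_tuple, count in generalize(employees).items():
    print(gen_tuple, "count =", count)
# ('telephone number', 'undergraduate') count = 2
# ('telephone number', 'graduate') count = 2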
ANALYTICAL CHARACTERIZATION: ANALYSIS OF ATTRIBUTE RELEVANCE

Measures of attribute relevance analysis can be used to help identify irrelevant or weakly relevant attributes that can be excluded from the concept description process. The incorporation of this preprocessing step into class characterization or comparison is referred to as analytical characterization or analytical comparison, respectively. An attribute or dimension is considered highly relevant with respect to a given class if it is likely that the values of the attribute or dimension can be used to distinguish the class from other classes. For example, it is unlikely that the color of an automobile can be used to distinguish expensive cars from cheap ones, but the model, make, style, and number of cylinders are likely to be much more relevant attributes.

Methods of Attribute Relevance Analysis

The general idea behind attribute relevance analysis is to compute some measure that quantifies the relevance of an attribute with respect to a given class or concept. Such measures include information gain, the Gini index, uncertainty, and correlation coefficients. We first examine the information-theoretic approach applied to the analysis of attribute relevance, taking ID3 as an example. ID3 constructs a decision tree from a given set of data tuples, or training objects, where the class label of each tuple is known. The decision tree can then be used to classify objects for which the class label is not known. To build the tree, ID3 uses a measure known as information gain to rank each attribute.

How does the information gain calculation work? Let S be a set of training objects where the class label of each object is known. Suppose that there are m classes, and let S contain si objects of class Ci, for i = 1, ..., m. An arbitrary object belongs to class Ci with probability si/s, where s is the total number of objects in set S. The expected information needed to classify a given object is

I(s1, s2, ..., sm) = - Σ (si/s) * log2(si/s), summed over i = 1, ..., m.

An attribute A with values {a1, ..., av} partitions S into subsets {S1, ..., Sv}. The expected information based on this partitioning, E(A), is the weighted average of the I values of the subsets, and the information gain of A is Gain(A) = I(s1, s2, ..., sm) - E(A). Attributes with larger information gain are considered more relevant.
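As a concrete illustration, the following sketch computes the expected information I and the information gain of a categorical attribute over a small in-memory data set. The toy tuples and attribute names are hypothetical, chosen only for illustration; the computation follows the definitions above.

import math
from collections import Counter

def expected_info(class_counts):
    """I(s1, ..., sm) = -sum (si/s) * log2(si/s) over the class counts."""
    s = sum(class_counts)
    return -sum((si / s) * math.log2(si / s) for si in class_counts if si > 0)

def information_gain(tuples, attribute, label):
    """Gain(A) = I(S) - E(A), where E(A) is the weighted expected
    information of the partitions of S induced by attribute A."""
    total = len(tuples)
    # Expected information of the whole set S.
    overall = expected_info(list(Counter(t[label] for t in tuples).values()))
    # E(A): weight each partition's expected information by its relative size.
    entropy = 0.0
    for value in {t[attribute] for t in tuples}:
        subset = [t for t in tuples if t[attribute] == value]
        counts = list(Counter(t[label] for t in subset).values())
        entropy += (len(subset) / total) * expected_info(counts)
    return overall - entropy

# Hypothetical training objects: is "cylinders" more relevant than "color"?
cars = [
    {"color": "red", "cylinders": 8, "class": "expensive"},
    {"color": "red", "cylinders": 4, "class": "cheap"},
    {"color": "blue", "cylinders": 8, "class": "expensive"},
    {"color": "blue", "cylinders": 4, "class": "cheap"},
]
print(information_gain(cars, "cylinders", "class"))  # 1.0 (fully discriminating)
print(information_gain(cars, "color", "class"))      # 0.0 (irrelevant)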
Attribute relevance analysis for concept description is performed as follows:

1. Data collection: Collect data for both the target class and the contrasting class by query processing. For class comparison, both the target class and the contrasting class are provided by the user in the data mining query. For class characterization, the target class is the class to be characterized, whereas the contrasting class is the set of comparable data that are not in the target class.

2. Preliminary relevance analysis using conservative AOI: This step identifies a set of dimensions and attributes on which the selected relevance measure is to be applied. Since different levels of a dimension may have dramatically different relevance with respect to a given class, each attribute defining the conceptual levels of the dimension should, in principle, be included in the relevance analysis. Attribute-oriented induction (AOI) can be used to perform some preliminary relevance analysis on the data by removing or generalizing attributes having a very large number of distinct values (such as name and phone#).

3. Evaluate each remaining attribute using the selected relevance analysis measure: Evaluate each attribute in the candidate relation using the selected relevance analysis measure. The relevance measure used in this step may be built into the data mining system or provided by the user, depending on whether the system is flexible enough to allow users to define their own relevance measurements. For example, the information gain measure described above may be used.

4. Remove irrelevant and weakly relevant attributes: Remove from the candidate relation the attributes that are not relevant or are only weakly relevant to the concept description task, based on the relevance measure used above. A threshold may be set to define "weakly relevant". This step results in an initial target class working relation and an initial contrasting class working relation.

5. Generate the concept description using AOI: Perform AOI using a less conservative set of attribute generalization thresholds. If the descriptive mining task is class characterization, only the initial target class working relation is included here. If the descriptive mining task is class comparison, both the initial target class working relation and the initial contrasting class working relation are included.

MINING CLASS COMPARISONS: DISCRIMINATING BETWEEN DIFFERENT CLASSES

Class discrimination or comparison (hereafter referred to as class comparison) mines descriptions that distinguish a target class from its contrasting classes. Note that the target and contrasting classes must be comparable in the sense that they share similar dimensions and attributes. For example, the three classes person, address, and item are not comparable, whereas the sales of the last three years are comparable classes.

"How is class comparison performed?" In general, the procedure is as follows:

1. Data collection: The set of relevant data in the database is collected by query processing and is partitioned respectively into a target class and one or a set of contrasting classes.

2. Dimension relevance analysis: If there are many dimensions, then dimension relevance analysis should be performed on these classes to select only the highly relevant dimensions for further analysis. Correlation or entropy-based measures can be used for this step.

3. Synchronous generalization: Generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation. The concepts in the contrasting class are generalized to the same level as those in the prime target class relation, forming the prime contrasting class relation.

4. Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, graphs, and rules. This presentation usually includes a "contrasting" measure such as count% (percentage count) that reflects the comparison between the target and contrasting classes (a minimal sketch of this measure follows below). The user can adjust the comparison description by applying drill-down, roll-up, and other OLAP operations to the target and contrasting classes, as desired.

For a fully worked class comparison, see Example 4.27 in Han and Kamber, third edition.
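The count% measure mentioned in step 4 is simply the fraction of tuples that each generalized tuple covers within its own class. A minimal sketch, assuming two lists of already-generalized tuples (the data and the generalized dimensions are hypothetical):

from collections import Counter

def count_percent(tuples):
    """count% of each generalized tuple within its own class."""
    freq = Counter(tuples)
    total = len(tuples)
    return {t: 100.0 * c / total for t, c in freq.items()}

# Hypothetical prime relations, generalized to (major, gpa_range).
target      = [("science", "good")] * 16 + [("science", "excellent")] * 4
contrasting = [("science", "good")] * 5 + [("arts", "good")] * 15

print(count_percent(target))       # {('science', 'good'): 80.0, ('science', 'excellent'): 20.0}
print(count_percent(contrasting))  # {('science', 'good'): 25.0, ('arts', 'good'): 75.0}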
MINING DESCRIPTIVE STATISTICAL MEASURES IN LARGE DATABASES

Descriptive data summarization techniques can be used to identify the typical properties of your data and to highlight which data values should be treated as noise or outliers. We therefore first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of data preprocessing techniques. For many data preprocessing tasks, users would like to learn about data characteristics regarding both the central tendency and the dispersion of the data. Measures of central tendency include the mean, median, mode, and midrange, while measures of data dispersion include quartiles, the interquartile range (IQR), and variance. These descriptive statistics are of great help in understanding the distribution of the data. From the data mining point of view, we need to examine how they can be computed efficiently in large databases; knowing what kind of measure we are dealing with can help us choose an efficient implementation for it.

MEASURING THE CENTRAL TENDENCY

1. Mean: The most common and most effective numerical measure of the "center" of a set of data is the (arithmetic) mean. Let x1, x2, ..., xN be a set of N values or observations of some attribute, such as salary. The mean of this set of values is

x' = (x1 + x2 + ... + xN) / N = (1/N) * Σ xi.

A distributive measure is a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the entire data set. Both sum() and count() are distributive measures because they can be computed in this manner. An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. Hence, average() (or mean()) is an algebraic measure because it can be computed as sum()/count().

2. Weighted arithmetic mean (weighted average): Sometimes, each value xi in a set is associated with a weight wi, for i = 1, ..., N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we compute

x' = (w1*x1 + w2*x2 + ... + wN*xN) / (w1 + w2 + ... + wN) = Σ wi*xi / Σ wi.

3. Median: For skewed (asymmetric) data, a better measure of the center of the data is the median. Suppose that a given data set of N distinct values is sorted in numerical order. If N is odd, the median is the middle value of the ordered set; if N is even, the median is the average of the middle two values. A holistic measure is a measure that must be computed on the entire data set as a whole; it cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset. The median is an example of a holistic measure.

The median can, however, be approximated for grouped data. Assume that the data are grouped in intervals according to their xi values and that the frequency (i.e., number of data values) of each interval is known. For example, people may be grouped according to their annual salary in intervals such as 10-20K, 20-30K, and so on. Let the interval that contains the median frequency be the median interval. Then

median = L1 + ((N/2 - (Σ freq)l) / freq_median) * width,

where L1 is the lower boundary of the median interval, N is the number of values in the entire data set, (Σ freq)l is the sum of the frequencies of all intervals lower than the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.

Figure 2 Mean, median, and mode of symmetric versus positively and negatively skewed data.

4. Mode: The mode for a set of data is the value that occurs most frequently in the set. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal; in general, a data set with two or more modes is multimodal. At the other extreme, if each data value occurs only once, there is no mode. For moderately skewed unimodal data we have the empirical relation mean - mode ≈ 3 * (mean - median).

5. Midrange: The midrange can also be used to assess the central tendency of a data set. It is the average of the largest and smallest values in the set.
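The following sketch computes these measures of central tendency, including the interpolated median for grouped data, over a small hypothetical salary sample; the data values and function names are made up for illustration.

from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def median(xs):
    ys = sorted(xs)
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 == 1 else (ys[mid - 1] + ys[mid]) / 2

def modes(xs):
    """All values sharing the greatest frequency; [] if every value is unique."""
    freq = Counter(xs)
    best = max(freq.values())
    return [v for v, c in freq.items() if c == best] if best > 1 else []

def midrange(xs):
    return (min(xs) + max(xs)) / 2

def grouped_median(intervals, freqs):
    """Median by interpolation: intervals is a list of (lower, upper) bounds,
    freqs the count in each interval, in increasing order of interval."""
    n = sum(freqs)
    cum = 0
    for (lo, hi), f in zip(intervals, freqs):
        if cum + f >= n / 2:            # this is the median interval
            return lo + ((n / 2 - cum) / f) * (hi - lo)
        cum += f

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries), median(salaries), modes(salaries), midrange(salaries))
# 58.0 54.0 [52, 70] 70.0  (a bimodal sample)
print(grouped_median([(10, 20), (20, 30), (30, 40)], [200, 450, 300]))
# 26.11...  (interpolated within the 20-30K median interval)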
MEASURING THE DISPERSION OF DATA

The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are:

the range,
the five-number summary (based on quartiles),
the interquartile range (IQR), and
the standard deviation.

Boxplots can be plotted from the five-number summary and are a useful tool for identifying outliers.

Range, Quartiles, Outliers, and Boxplots

Range: The range of the set is the difference between the largest (max()) and smallest (min()) values.

kth percentile: Let x1, x2, ..., xN be a set of observations of some attribute. The kth percentile is the value xi having the property that k percent of the data entries lie at or below xi. The median (discussed in the previous subsection) is the 50th percentile. The most commonly used percentiles other than the median are the quartiles: the first quartile, denoted Q1, is the 25th percentile, and the third quartile, denoted Q3, is the 75th percentile.

Interquartile range: The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as IQR = Q3 - Q1.

Outliers: A common rule of thumb singles out as suspected outliers the values falling at least 1.5 * IQR above the third quartile or below the first quartile.

Five-number summary: Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well. This is known as the five-number summary. It consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order: Minimum, Q1, Median, Q3, Maximum.

A boxplot incorporates the five-number summary as follows: Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range, IQR. The median is marked by a line within the box. Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.

Figure 3 Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time period. Here we see that, for branch 1, the median price of items sold is $80, Q1 is $60, and Q3 is $100.

Variance and Standard Deviation

The variance of N observations x1, x2, ..., xN is

σ² = (1/N) * Σ (xi - x')² = (1/N) * [Σ xi² - (1/N) * (Σ xi)²],

where x' is the mean value of the observations. The standard deviation, σ, is the square root of the variance, σ². The basic properties of the standard deviation σ as a measure of spread are:

σ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
σ = 0 only when there is no spread, that is, when all observations have the same value; otherwise σ > 0.
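A sketch of these dispersion measures, reusing the hypothetical salary sample from above. Note that the quartile convention used here (linear interpolation between order statistics) is one of several in common use.

def quantile(xs, f):
    """f-quantile of the data by linear interpolation between order statistics."""
    ys = sorted(xs)
    pos = f * (len(ys) - 1)
    lo = int(pos)
    frac = pos - lo
    return ys[lo] if frac == 0 else ys[lo] * (1 - frac) + ys[lo + 1] * frac

def five_number_summary(xs):
    return (min(xs), quantile(xs, 0.25), quantile(xs, 0.50),
            quantile(xs, 0.75), max(xs))

def outliers(xs):
    """Values more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = quantile(xs, 0.25), quantile(xs, 0.75)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(salaries))   # (30, 49.25, 54.0, 64.75, 110)
print(outliers(salaries))              # [110] flagged by the 1.5 * IQR rule
print(variance(salaries) ** 0.5)       # the standard deviation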
GRAPHIC DISPLAYS OF BASIC STATISTICAL CLASS DESCRIPTIONS

Aside from the bar charts, pie charts, and line graphs used in most statistical or graphical data presentation software packages, there are other popular types of graphs for the display of data summaries and distributions. These include histograms, quantile plots, q-q plots, scatter plots, and loess curves.

Histogram: A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. Typically, the width of each bucket is uniform. Each bucket is represented by a rectangle whose height is equal to the count or relative frequency of the values in the bucket. If A is categoric, such as automobile model or item type, then one rectangle is drawn for each known value of A, and the resulting graph is more commonly referred to as a bar chart. If A is numeric, the term histogram is preferred. A typical histogram of unit price data, for example, uses buckets defined by equal-width ranges representing $20 increments, with frequency equal to the count of items sold.

Quantile plot: A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it displays all of the data for the given attribute. Second, it plots quantile information. Let xi, for i = 1 to N, be the data sorted in increasing order, so that x1 is the smallest observation and xN is the largest. Each observation xi is paired with a percentage fi, which indicates that approximately 100*fi percent of the data are below or equal to the value xi. We say "approximately" because there may not be a value with exactly a fraction fi of the data below or equal to xi. Note that the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median, and the 0.75 quantile is Q3. Let

fi = (i - 0.5) / N.

These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is slightly above zero) to 1 - 1/(2N) (which is slightly below one). On a quantile plot, xi is graphed against fi. This allows us to compare different distributions based on their quantiles: for example, given the quantile plots of sales data for two different time periods, we can compare their Q1, median, Q3, and other fi values at a glance.

Figure 4 A quantile plot for the unit price data

Quantile-quantile plot (q-q plot): A q-q plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another. Suppose that we have two sets of observations for the variable unit price, taken from two different branch locations. Let x1, ..., xN be the data from the first branch and y1, ..., yM the data from the second, where each data set is sorted in increasing order. If M = N (i.e., the number of points in each set is the same), then we simply plot yi against xi, where yi and xi are both the (i - 0.5)/N quantiles of their respective data sets. If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot. Here, yi is the (i - 0.5)/M quantile of the y data, which is plotted against the (i - 0.5)/M quantile of the x data.

Figure 5 A quantile-quantile plot for unit price data from two different branches.

Scatter plot: A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numerical attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as a point in the plane.

Figure 6 Scatter plots can be used to find (a) positive or (b) negative correlations between attributes.
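A minimal sketch of both the quantile plot and the q-q plot, assuming matplotlib is available; the two branch samples are hypothetical. Each point of the quantile plot is (fi, xi) with fi = (i - 0.5)/N, and since M < N here, the q-q plot pairs the (i - 0.5)/M quantiles of the smaller sample with the same quantiles of the larger one.

import matplotlib.pyplot as plt

branch1 = sorted([40, 43, 47, 55, 60, 62, 70, 75, 80, 95, 100, 120])  # N = 12
branch2 = sorted([42, 50, 58, 65, 72, 85, 98, 115])                   # M = 8

def f_values(n):
    """f_i = (i - 0.5)/n for i = 1..n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]

def quantile_at(xs, f):
    """The f-quantile of sorted data, by linear interpolation."""
    pos = f * (len(xs) - 1)
    lo, frac = int(pos), f * (len(xs) - 1) - int(pos)
    return xs[lo] if frac == 0 else xs[lo] * (1 - frac) + xs[lo + 1] * frac

# Quantile plot: x_i graphed against f_i for branch 1.
plt.figure()
plt.plot(f_values(len(branch1)), branch1, "o-")
plt.xlabel("f-value")
plt.ylabel("unit price ($)")

# q-q plot: branch 2's own order statistics against the matching quantiles of branch 1.
fs = f_values(len(branch2))
plt.figure()
plt.plot([quantile_at(branch1, f) for f in fs], branch2, "o")
plt.xlabel("branch 1 quantiles")
plt.ylabel("branch 2 quantiles")
plt.show()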
MINING VARIOUS KINDS OF ASSOCIATION RULES

Association rule mining finds interesting association or correlation relationships among a large set of data items. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining association rules from their data: discovering interesting associations among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and loss-leader analysis.

A typical example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets". The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.

Several questions arise: How can we find association rules from large amounts of data, where the data are either transactional or relational? Which association rules are the most interesting? How can we help or guide the mining procedure to discover interesting associations? What language constructs are useful in defining a data mining query language for association rule mining?

ASSOCIATION RULE MINING

Association rule mining searches for interesting relationships among items in a given data set. This section provides an introduction to association rule mining. We begin by presenting an example of market basket analysis, the earliest form of association rule mining. The basic concepts of mining associations are given, and we present a road map to the different kinds of association rules that can be mined.

MARKET BASKET ANALYSIS: A MOTIVATING EXAMPLE FOR ASSOCIATION RULE MINING

Suppose, as manager of an ABCompany branch, you would like to learn more about the buying habits of your customers. Specifically, you wonder, "Which groups or sets of items are customers likely to purchase on a given trip to the store?"

To answer this question, market basket analysis may be performed on the retail data of customer transactions at your store. The results may be used to plan marketing or advertising strategies, as well as catalog design. For instance, market basket analysis may help managers design different store layouts. In one strategy, items that are frequently purchased together can be placed in close proximity in order to further encourage the sale of such items together. If customers who purchase computers also tend to buy financial management software at the same time, then placing the hardware display close to the software display may help increase the sales of both of these items. In an alternative strategy, placing hardware and software at opposite ends of the store may entice customers who purchase such items to pick up other items along the way. Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.

Figure 7 Market basket analysis

Such a discovered pattern can be represented in the form of an association rule:

computer => financial_management_software [support = 2%, confidence = 60%]

Rule support and confidence are two measures of rule interestingness that were described earlier. They respectively reflect the usefulness and certainty of discovered rules.
A support of 2% for the association rule above means that 2% of all the transactions under analysis show that computer and financial management software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Users or domain experts can set such thresholds.

"How are association rules mined from large databases?" Association rule mining is a two-step process:

1. Find all frequent item sets: By definition, each of these item sets will occur at least as frequently as a predetermined minimum support count.

2. Generate strong association rules from the frequent item sets: By definition, these rules must satisfy minimum support and minimum confidence. Additional interestingness measures can be applied, if desired.

The second step is the easier of the two; the overall performance of mining association rules is determined by the first step.
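As a small illustration of the support and confidence measures defined above, this sketch computes both for a candidate rule over a handful of hypothetical transactions (the item names and baskets are invented):

def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) = support(A union C) / support(A)."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

# Hypothetical baskets.
baskets = [
    {"computer", "financial_software"},
    {"computer", "printer"},
    {"computer", "financial_software", "printer"},
    {"printer", "scanner"},
    {"computer"},
]
rule_a, rule_c = {"computer"}, {"financial_software"}
print(support(baskets, rule_a | rule_c))    # 0.4 -> support = 40%
print(confidence(baskets, rule_a, rule_c))  # 0.5 -> confidence = 50%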
ASSOCIATION RULE MINING: A ROAD MAP

Market basket analysis is just one form of association rule mining; in fact, there are many kinds of association rules. Association rules can be classified in various ways, based on the following criteria:

(i) Based on the types of values handled in the rule: If a rule concerns associations between the presence or absence of items, it is a Boolean association rule. If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals. The following rule is an example of a quantitative association rule, where X is a variable representing a customer:

age(X, "30...39") ^ income(X, "42K...48K") => buys(X, "high resolution TV")

(ii) Based on the dimensions in the data: If the items or attributes in an association rule reference only one dimension, it is a single-dimensional association rule. For example, the rule

buys(X, "computer") => buys(X, "financial_management_software")

is a single-dimensional association rule, since it refers to only one dimension, buys. If a rule references two or more dimensions, such as the dimensions buys, time_of_transaction, and customer_category, then it is a multidimensional association rule.

(iii) Based on the levels of abstraction in the rule set: Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of mined association rules includes the following:

age(X, "30...39") => buys(X, "laptop computer")
age(X, "30...39") => buys(X, "computer")

In these rules, the items bought are referenced at different levels of abstraction (e.g., "computer" is a higher-level abstraction of "laptop computer"). We refer to the rule set mined as consisting of multilevel association rules. If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.

(iv) Based on various extensions to association mining: Association mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified. It can also be extended to mining max patterns (i.e., maximal frequent patterns) and frequent closed item sets. A max pattern is a frequent pattern p such that any proper superpattern of p is not frequent. An item set c is closed if there exists no proper superset c' of c such that every transaction containing c also contains c'; a frequent closed item set is an item set that is both closed and frequent.

MINING SINGLE-DIMENSIONAL ASSOCIATION RULES FROM TRANSACTIONAL DATABASES

The Apriori Algorithm: Finding Frequent Item Sets Using Candidate Generation

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant. The name of the algorithm is based on the fact that it uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.

To improve the efficiency of the level-wise generation of frequent item sets, an important property called the Apriori property is used to reduce the search space.

Apriori property: All nonempty subsets of a frequent item set must also be frequent. Equivalently, if an item set I does not satisfy the minimum support threshold, min_sup, then I is not frequent, and if an item A is added to the item set I, the resulting item set (i.e., I ∪ A) cannot occur more frequently than I.

"How is the Apriori property used in the algorithm?" To understand this, let us look at how Lk−1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join and prune actions.

The join step: To find Lk, a set of candidate k-item sets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be item sets in Lk−1. The notation li[j] refers to the jth item in li (e.g., l1[k − 2] refers to the second-to-last item in l1). For efficient implementation, Apriori assumes that items within a transaction or item set are sorted in lexicographic order. For the (k − 1)-item set li, this means that the items are sorted such that li[1] < li[2] < ··· < li[k − 1]. The join, Lk−1 ✶ Lk−1, is performed, where members of Lk−1 are joinable if their first (k − 2) items are in common. That is, members l1 and l2 of Lk−1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ··· ∧ (l1[k − 2] = l2[k − 2]) ∧ (l1[k − 1] < l2[k − 1]). The condition l1[k − 1] < l2[k − 1] simply ensures that no duplicates are generated. The resulting item set formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k − 2], l1[k − 1], l2[k − 1]}.

The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A database scan to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows: any (k − 1)-item set that is not frequent cannot be a subset of a frequent k-item set.
Hence, if any (k − 1)-subset of a candidate k-item set is not in Lk−1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent item sets.

Example (Apriori, following Example 6.3 of Han and Kamber, third edition): Let's look at a concrete example, based on the AllElectronics transaction database D of Table 1 (Table 6.1 in the text). There are nine transactions in this database, that is, |D| = 9.

Table 1 Transactional Data for an AllElectronics Branch

TID    List of item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions to count the number of occurrences of each item.

2. Suppose that the minimum support count required is 2, that is, min_sup = 2. (Here we are referring to absolute support because we are using a support count; the corresponding relative support is 2/9 = 22%.) The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.

3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ✶ L1 to generate a candidate set of 2-itemsets, C2. Note that no candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.

4. Next, the transactions in D are scanned and the support count of each candidate item set in C2 is accumulated.

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 ✶ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent item set must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3. The resulting pruned version is C3 = {{I1, I2, I3}, {I1, I2, I5}}. Note that when given a candidate k-itemset, we only need to check whether its (k − 1)-subsets are frequent, since the Apriori algorithm uses a level-wise search strategy.

7. The transactions in D are scanned to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

8. The algorithm uses L3 ✶ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, item set {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent. Thus C4 = φ, and the algorithm terminates, having found all of the frequent item sets.
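The following is a compact sketch of Apriori over the transactions of Table 1. It is a straightforward in-memory rendering of the join and prune steps, not an optimized implementation (no hash tree, and the join is done by set union rather than the lexicographic join described above); the function names are our own.

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {k: set of frequent k-itemsets (as frozensets)}."""
    items = {frozenset([i]) for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    L = {1: {c for c in items
             if sum(c <= t for t in transactions) >= min_sup}}
    k = 2
    while L[k - 1]:
        # Join step: unite pairs of (k-1)-itemsets that differ by one item.
        candidates = {a | b for a in L[k - 1] for b in L[k - 1]
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must already be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L[k - 1]
                             for s in combinations(c, k - 1))}
        # Scan D to keep the candidates meeting the minimum support count.
        L[k] = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_sup}
        k += 1
    del L[k - 1]          # drop the final, empty level
    return L

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
for k, sets in apriori(D, min_sup=2).items():
    print(k, sorted(sorted(s) for s in sets))
# Prints L1 (all five items), L2 (six pairs), and L3 = {I1,I2,I3}, {I1,I2,I5},
# matching the worked example above.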
MINING MULTILEVEL ASSOCIATION RULES FROM TRANSACTION DATABASES

Multilevel Association Rules

For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data in multidimensional space. Strong associations discovered at high concept levels may represent common-sense knowledge; however, what is common sense to one user may seem novel to another. Therefore, data mining systems should provide capabilities to mine association rules at multiple levels of abstraction and to traverse easily among different abstraction spaces.

Approaches to Mining Multilevel Association Rules

"How can we mine multilevel association rules efficiently using concept hierarchies?" Let's look at some approaches based on a support-confidence framework. In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working down towards the lower, more specific concept levels, until no more frequent itemsets can be found. That is, once all frequent itemsets at concept level 1 are found, the frequent itemsets at level 2 are found, and so on. For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations.

Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each level of abstraction. The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss meaningful associations occurring at low abstraction levels; if it is set too low, it may generate many uninteresting associations occurring at high abstraction levels. This provides the motivation for the following approach.

Using reduced minimum support at lower levels (referred to as reduced support): Each level of abstraction has its own minimum support threshold. The lower the abstraction level, the smaller the corresponding threshold.

For mining multilevel associations with reduced support, there are a number of alternative search strategies:

Level-by-level independent: This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether or not its parent node is found to be frequent.

Level-cross filtering by single item: An item at the ith level is examined if and only if its parent node at the (i - 1)th level is frequent. In other words, we investigate a more specific association from a more general one: if a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search.
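A minimal sketch of level-cross filtering by single item under reduced support. The toy concept hierarchy and thresholds are hypothetical, and frequent-itemset mining at each level is reduced here to simple item counting for brevity.

# Toy concept hierarchy: level-1 concept -> its level-2 children.
hierarchy = {"computer": ["laptop computer", "desktop computer"],
             "printer":  ["inkjet printer", "laser printer"]}

# Reduced minimum support counts: a smaller threshold at the lower level.
min_sup = {1: 5, 2: 3}

# Hypothetical support counts per item at each level.
counts = {"computer": 8, "printer": 4,
          "laptop computer": 5, "desktop computer": 3,
          "inkjet printer": 2, "laser printer": 2}

# Level 1: keep the concepts meeting the level-1 threshold.
frequent_l1 = [c for c in hierarchy if counts[c] >= min_sup[1]]

# Level 2: examine a child only if its parent was frequent at level 1.
frequent_l2 = [child
               for parent in frequent_l1
               for child in hierarchy[parent]
               if counts[child] >= min_sup[2]]

print(frequent_l1)  # ['computer']  ('printer' pruned: 4 < 5, children skipped)
print(frequent_l2)  # ['laptop computer', 'desktop computer']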
MINING MULTIDIMENSIONAL ASSOCIATION RULES FROM RELATIONAL DATABASES AND DATA WAREHOUSES

In this section, you will learn methods for mining multidimensional association rules, that is, rules involving more than one dimension or predicate (e.g., rules relating what a customer buys as well as the customer's age). These methods can be organized according to their treatment of quantitative attributes.

Multidimensional Association Rules

So far we have studied association rules that imply a single predicate, namely the predicate buys. For instance, in mining our ABCompany database, we may discover the Boolean association rule

buys(X, "IBM desktop computer") => buys(X, "Sony b/w printer"),

where X is a variable representing customers who purchased items in ABCompany transactions. Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension. Hence, we can refer to the above rule as a single-dimensional or intradimensional association rule, since it contains a single distinct predicate (buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule). Such rules are commonly mined from transactional data.

Suppose, however, that rather than using a transactional database, sales and related information are stored in a relational database or data warehouse. Such data stores are multidimensional by definition. For instance, in addition to keeping track of the items purchased in sales transactions, a relational database may record other attributes associated with the items, such as the quantity purchased, the price, or the branch location of the sale. Considering each database attribute or warehouse dimension as a predicate, it can therefore be interesting to mine association rules containing multiple predicates, such as

age(X, "20...29") ^ occupation(X, "student") => buys(X, "laptop").

Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules. The above rule contains three predicates (age, occupation, and buys), each of which occurs only once in the rule.

Note that database attributes can be categorical or quantitative. Categorical attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, brand, color); categorical attributes are also called nominal attributes, since their values are "names of things". Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, and price). Techniques for mining multidimensional association rules can be categorized according to three basic approaches regarding the treatment of quantitative (continuous-valued) attributes.

1. In the first approach, quantitative attributes are discretized using predefined concept hierarchies. This discretization occurs prior to mining. For instance, a concept hierarchy for income may be used to replace the original numeric values of this attribute by ranges, such as "0-20K", "21-30K", "31-40K", and so on. Here, discretization is static and predetermined. The discretized numeric attributes, with their range values, can then be treated as categorical attributes (where each range is considered a category). We refer to this as mining multidimensional association rules using static discretization of quantitative attributes.

2. In the second approach, quantitative attributes are discretized into "bins" based on the distribution of the data. These bins may be further combined during the mining process. The discretization process is dynamic and established so as to satisfy some mining criteria, such as maximizing the confidence of the rules mined.
Because this strategy treats the numeric attribute values as quantities rather than as predefined ranges or categories, association rules mined from this approach are also referred to as quantitative association rules.

3. In the third approach, quantitative attributes are discretized so as to capture the semantic meaning of such interval data. This dynamic discretization procedure considers the distance between data points. Hence, such quantitative association rules are also referred to as distance-based association rules.

Mining Quantitative Association Rules

Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically discretized during the mining process so as to satisfy some mining criteria, such as maximizing the confidence or compactness of the rules mined. In this section, we focus specifically on how to mine quantitative association rules having two quantitative attributes on the left-hand side of the rule and one categorical attribute on the right-hand side, e.g.,

Aquan1 ^ Aquan2 => Acat,

where Aquan1 and Aquan2 are tests on quantitative attribute ranges (where the ranges are dynamically determined), and Acat tests a categorical attribute from the task-relevant data. Such rules have been referred to as two-dimensional quantitative association rules, since they contain two quantitative dimensions. For instance, suppose you are curious about the association relationship between pairs of quantitative attributes, like customer age and income, and the type of television that customers like to buy. An example of such a 2-D quantitative association rule is

age(X, "30-34") ^ income(X, "42K-48K") => buys(X, "high resolution TV")

Binning: Quantitative attributes can have a very wide range of values defining their domain. Just think about how big a 2-D grid would be if we plotted age and income as axes, where each possible value of age was assigned a unique position on one axis and, similarly, each possible value of income was assigned a unique position on the other axis! To keep grids down to a manageable size, we instead partition the ranges of quantitative attributes into intervals. These intervals are dynamic in that they may later be further combined during the mining process. The partitioning process is referred to as binning, i.e., the intervals are considered "bins". Three common binning strategies are:

1. Equi-width binning, where the interval size of each bin is the same;

2. Equi-depth binning, where each bin has approximately the same number of tuples assigned to it; and

3. Homogeneity-based binning, where bin size is determined so that the tuples in each bin are uniformly distributed.

Figure 8 A 2-D grid for tuples representing customers who purchase high resolution TVs
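A sketch of the first two binning strategies (equi-width and equi-depth) over a hypothetical list of incomes; homogeneity-based binning requires a test of how uniformly the tuples are distributed and is omitted here.

def equi_width_bins(xs, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for x in xs:
        idx = min(int((x - lo) / width), k - 1)  # clamp the max value into the last bin
        bins[idx].append(x)
    return bins

def equi_depth_bins(xs, k):
    """Split the sorted data into k bins of (nearly) equal size."""
    ys = sorted(xs)
    n = len(ys)
    return [ys[i * n // k:(i + 1) * n // k] for i in range(k)]

incomes = [22, 25, 31, 34, 38, 41, 45, 47, 52, 58, 63, 70]
print(equi_width_bins(incomes, 3))  # equal-width ranges: 22-38, 38-54, 54-70
print(equi_depth_bins(incomes, 3))  # four tuples per bin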