Lecture 6

Themes in this session
• Data mining

Reading Directions
[Komp, article 3] Elmasri and Navathe, Fundamentals of Database Systems, Chapter 26.2

Data Mining

What is data mining?
“Data Mining is data analysis in order to discover hidden correlations (patterns, rules) in huge data sets”
“Data Mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.”

Data Mining versus KDD
• Knowledge Discovery in Databases (KDD) involves the extraction of implicit, previously unknown and potentially useful information from data.
• Data Mining is the use of algorithms to extract the information and patterns derived by the KDD process.

The KDD Process
[Figure: Data → Selection → Target data → Preprocessing → Preprocessed data → Transformation → Transformed data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
• Selection: This first step obtains the data from various databases, files, and nonelectronic sources.
• Preprocessing: Incorrect data is corrected or removed; missing data must be supplied or predicted.
• Transformation: Data from different sources is converted into a common format for processing. Some data is encoded or transformed into more usable formats. Data reduction might be applied to shrink the data to be analysed.
• Data Mining: Algorithms are applied to the transformed data to generate the desired results.
• Interpretation/Evaluation: The results are visualised using different GUI strategies and then interpreted.
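The five KDD steps above can be sketched as a small chain of functions. This is only an illustration under stated assumptions: the records, field names, the age threshold of 40, and all function names are hypothetical, and the "data mining" step is a simple per-group average standing in for a real mining algorithm.

```python
from statistics import mean

# Hypothetical raw records gathered from two sources (all values are made up).
raw_records = [
    {"source": "shop", "customer": "a", "age": 34, "spend": "120.50"},
    {"source": "shop", "customer": "b", "age": None, "spend": "80.00"},
    {"source": "web", "customer": "c", "age": 51, "spend": "-1"},  # incorrect value
    {"source": "web", "customer": "d", "age": 58, "spend": "200.00"},
]

def select(records):
    """Selection: keep only the attributes relevant to the analysis."""
    return [{"age": r["age"], "spend": r["spend"]} for r in records]

def preprocess(records):
    """Preprocessing: remove incorrect rows, supply missing ages with the mean."""
    valid = [r for r in records if float(r["spend"]) >= 0]
    fill = mean(r["age"] for r in valid if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in valid]

def transform(records):
    """Transformation: common numeric format, ages encoded as two groups."""
    return [
        {"age_group": "under_40" if r["age"] < 40 else "40_plus",
         "spend": float(r["spend"])}
        for r in records
    ]

def mine(records):
    """Data mining (stand-in): average spend per age group."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_group"], []).append(r["spend"])
    return {group: mean(values) for group, values in groups.items()}

# Interpretation/Evaluation: inspect the resulting pattern.
patterns = mine(transform(preprocess(select(raw_records))))
print(patterns)
```

Keeping each step as a separate function mirrors the diagram: the output of one stage (target data, preprocessed data, transformed data, patterns) is the input of the next.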
Enabling factors for data mining
Data availability
• Increased amount of electronically stored data
• Increased processing power
• Increased data storage ability
• Increased data gathering ability (networks, extraction tools)
• Increased number of data warehouses
Business conditions
• Increased need to compete effectively
• Increased awareness of need to know customers

Data mining uses in enterprises
• Predict customer patterns of behaviour, e.g. buying patterns
• Discover market developments driven by demographic changes
• Discover shifts in consumption
• Identification of new customers
• Anticipation of demands on inventory

Data Mining Models and Tasks
Data Mining
• Predictive: Classification, Regression, Time Series Analysis, Prediction
• Descriptive: Clustering, Summarisation, Association Rules, Sequence Discovery

Classification
• Classification maps data into predefined groups or classes. Classification algorithms require the classes to be defined based on data attribute values. They often describe these classes by looking at the characteristics of data already known to belong to the classes.
• Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to these predefined classes.
• Example: Determining whether to approve a bank loan application.

Regression
• Regression is used to map a data item to a real-valued prediction variable. In actuality, regression involves the learning of the function that does this mapping.
• Regression assumes that the target data fit into some known type of function (e.g., linear) and then determines the best function of this type that models the given data.
• Example: Eva wishes to reach a certain level of savings before her retirement. Periodically, she predicts what her retirement savings will be based on its current value and several past values.
  She uses a simple linear regression formula to predict this value by fitting past behaviour to a linear function and then using this function to predict the values at points in the future. Based on these values, she then alters her investment portfolio.

Time Series Analysis
• With time series analysis, the value of an attribute is examined as it varies over time. The values are obtained at evenly spaced time points (daily, weekly, hourly, etc.).
• A time series plot is used to visualise the time series.
• Example: Eva is trying to determine whether to purchase stocks from Companies X, Y or Z. For a period of one month she charts the daily stock price for these companies. Using this information she decides to purchase stocks from X, because it is less volatile while overall showing a slightly larger relative amount of growth than either of the other stocks.

Prediction
• Many real-world data mining applications can be seen as predicting future data states based on past and current data. Prediction can be viewed as a type of classification, with the difference that it classifies a future state rather than a current state.
• Although future values may be predicted using time series analysis or regression techniques, other approaches may be used as well.
• Example: Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river. These monitors collect data relevant to flood prediction: water level, rain amount, time, humidity, and so on. Then the water level at a potential flooding point in the river can be predicted based on data collected by the sensors upriver from this point. The prediction must be made with respect to the time the data were collected.

Clustering
• Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. It can be thought of as partitioning the data into groups that might or might not be disjoint.
• The clustering is usually accomplished by determining the similarity among the data on predefined attributes.
• Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created clusters.
• Example: [Figure: scatter plot of customers with axes labelled "Dept" and "Income"; one cluster of points is marked "Profitable customers!"]

Summarisation
• Summarisation maps data into subsets with associated simple descriptions.
• Summarisation is also called characterisation or generalisation. It extracts or derives representative information about the data set.
• This may be accomplished by actually retrieving portions of the data. Alternatively, summary-type information (such as the mean of some numeric attribute) can be derived from the data.
• Example: One of the many criteria used to compare universities by the U.S. News and World Report is the average score.

Association Rules
• Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of uncovering relationships among data.
• An association rule is a model that identifies specific types of data associations. These associations are often used in the retail sales community to identify items that are frequently purchased together.
• Example: A grocery store is trying to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules that show what other products are frequently purchased with bread. He finds that 70% of the time that bread is sold, jelly is also sold. Based on this, he decides to place some jelly at the end of the aisle where the bread is placed and decides not to have the jelly on sale at the same time.

Sequence Discovery
• Sequential analysis or sequence discovery is used to determine sequential patterns in data.
• These patterns are similar to associations that are found in the data, but they are based on time.
• Unlike a market basket analysis, which requires the items to be purchased at the same time, in sequence discovery the items are purchased over time in some order.
• Example: The webmaster at XYZ Corp. periodically analyses the Web log data to determine how users of XYZ’s Web pages access them. He is interested in determining which pages are most frequently accessed and in what sequence they are accessed. He determines that 70% of the users of page A follow one of the following patterns of behaviour: <A,B,C> or <A,D,B,C> or <A,E,B,C>. He then decides to add a link directly from page A to page C.

Association Rules
Ex. If a customer buys X, (s)he is also likely to buy Y

Transaction-id   Time   Items-Bought
101              6:35   milk, bread, juice
792              7:38   milk, juice
1130             8:05   milk, eggs
1730             8:40   bread, cookies, coffee

X ⇒ Y, where X = {x1, x2, …, xn} and Y = {y1, y2, …, ym} are sets of items, with xi ≠ yj for each i and j

Support (prevalence) = (nr. of trans. containing X ∪ Y) / (nr. of trans.)
  {Milk, Juice} = 2/4 = 50%
  {Bread, Juice} = 1/4 = 25%
Confidence (strength) = (nr. of trans. containing X ∪ Y) / (nr. of trans. containing X)
  Milk ⇒ Juice = 2/3 = 66.7%
  Bread ⇒ Juice = 1/2 = 50%

Mining Association Rules
1. Generate all item sets that have a support that exceeds a threshold defined by the user
2. For each such item set, generate all the rules that have confidence above a threshold defined by the user

Example: nr. of trans. = 10, support ≥ 30%, conf ≥ 70%

Transaction   Items
1             milk, bread, eggs, juice
2             milk, juice
3             milk, eggs
4             bread, cookies, coffee
5             milk, bread, eggs, fruits
6             milk, bread, eggs, coffee
7             cookies, coffee
8             coffee, milk
9             fruits, milk
10            eggs, milk

1. support {milk, bread, eggs} = 30%
   support {cookies, juice} = 0%
   support {cookies, coffee} = 20%
   support {milk, eggs} = 50%
   …
   nr. of sets to be checked is 2^7 (in general 2^(nr. of items))
2. conf (milk, bread ⇒ eggs) = 3/3 = 100%
   conf (milk, eggs ⇒ bread) = 3/5 = 60%
   conf (eggs, bread ⇒ milk) = 3/3 = 100%
   conf (milk ⇒ bread, eggs) = 3/8 = 38%
   conf (bread ⇒ milk, eggs) = 3/4 = 75%
   conf (eggs ⇒ bread, milk) = 3/5 = 60%
   conf (milk ⇒ eggs) = 5/8 = 63%
   conf (eggs ⇒ milk) = 5/5 = 100%

Association Rules - Basic Algorithm
• Test the support for item sets of length 1 (1-itemsets) by scanning the database. Discard those that do not meet the minimum required support.
• Extend the large 1-itemsets into 2-itemsets by appending one item each time, to generate all candidate item sets of length two. Test the support for all candidate item sets and eliminate those that do not meet the minimum support.
• Repeat the above steps; at step k, the previously found (k-1)-itemsets are extended into k-itemsets.

Association Rules among Hierarchies
[Figure: two item hierarchies]
Beverages
• Carbonated: Colas, Clear drinks, Mixed drinks
• Non-Carbonated: Bottled juices (Orange, Apple), Bottled water, Wine coolers
Desserts
• Ice cream
• Baked
• Frozen yoghurt: Regular, Low fat
Example rules at different levels of the hierarchies:
• Beverage ⇒ Desserts
• Desserts ⇒ Beverage
• Ice cream ⇒ Wine coolers
• Low fat frozen yoghurt ⇒ Bottled water

Association Rules - Negative Associations
“60% of customers who buy potato chips do not buy bottled water”
The problem: In a DB with 10000 items there are 2^10000 possible combinations of items, a majority of which do not appear even once in the DB. How to find only the interesting negative associations?
[Figure: item hierarchy. Soft Drinks: Joke, Wakeup, Topsy. Chips: Days, Nightos, Partyos.]

Data visualisation - A picture tells more than a thousand words
• Five hundred people, all from the same section of London, England, died of cholera within a 10-day period in September 1854. Dr. John Snow, a local physician, had been studying this spread of cholera for some time. One of the earliest known examples of data visualisation is Dr. Snow’s use of maps to prove his long-held theory that cholera was a waterborne infection.
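Returning to association rules: the support and confidence figures of the ten-transaction example, and the step-by-step extension of item sets in the basic algorithm, can be checked with a short Python sketch. It is a simplified illustration, not an optimised implementation; the function names (`count`, `support`, `confidence`, `frequent_itemsets`, `rules`) are hypothetical, and every candidate is tested by rescanning the whole transaction list.

```python
from itertools import combinations

# The ten transactions of the "Mining Association Rules" example.
transactions = [
    {"milk", "bread", "eggs", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
    {"milk", "bread", "eggs", "fruits"},
    {"milk", "bread", "eggs", "coffee"},
    {"cookies", "coffee"},
    {"coffee", "milk"},
    {"fruits", "milk"},
    {"eggs", "milk"},
]

def count(itemset):
    """Number of transactions containing every item of the item set."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions)

def support(itemset):
    """nr. of trans. containing the item set / nr. of trans."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """nr. of trans. containing X ∪ Y / nr. of trans. containing X."""
    return count(set(lhs) | set(rhs)) / count(lhs)

def frequent_itemsets(min_support):
    """Level-wise search: extend frequent (k-1)-itemsets into k-itemsets."""
    all_items = sorted(set().union(*transactions))
    singles = [i for i in all_items if support([i]) >= min_support]
    level = [frozenset([i]) for i in singles]
    found = []
    while level:
        found.extend(level)
        # Append one frequent item at a time to build the next candidates.
        candidates = {s | {i} for s in level for i in singles if i not in s}
        level = [c for c in candidates if support(c) >= min_support]
    return found

def rules(min_support, min_confidence):
    """All rules X ⇒ Y from frequent item sets that meet min_confidence."""
    result = []
    for itemset in frequent_itemsets(min_support):
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                rhs = itemset - set(lhs)
                if confidence(lhs, rhs) >= min_confidence:
                    result.append((frozenset(lhs), rhs))
    return result

for lhs, rhs in rules(0.30, 0.70):
    print(sorted(lhs), "=>", sorted(rhs), f"conf = {confidence(lhs, rhs):.0%}")
```

Discarding infrequent item sets at each level is what keeps the search far below the 2^7 item sets that would otherwise have to be checked: an item set can only reach the support threshold if all of its subsets do.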
Applications of Data Mining
Marketing
• analysis of customer behaviour based on buying patterns
• determination of marketing strategies including advertising, store location, and targeted mailing
• segmentation of customers, stores, or products
• design of catalogs, store layouts, and advertising campaigns
Finance
• analysis of creditworthiness of clients
• segmentation of accounts receivable
• performance analysis of financial investments like stocks, bonds and mutual funds
• evaluation of financing options
• fraud detection

Applications of Data Mining 2
Manufacturing
• optimisation of resources like machines, manpower and materials
• optimal design of manufacturing processes, shop-floor layouts and product design, such as for products tailored according to customer requirements
Health Care
• analysis of effectiveness of certain treatments
• optimisation of processes within a hospital, relating patient wellness data with doctor qualifications
• analysis of side effects of drugs