Lecture 6: Data Mining
DT786, Semester 2, 2011-12
Pat Browne

Data Mining: Outline
Spatial DM compared to spatial statistics.
Background to SD & spatial data mining (SDM).
The DM process.
Spatial autocorrelation, i.e. the non-independence of phenomena in a contiguous geographic area.
Spatial independence.
Classical data mining concepts: classification, clustering, association rules.
Spatial data mining using co-location rules.
Summary.

Statistics versus Data Mining
Do we know the statistical properties of the data? Is the data spatially clustered, dispersed, or random?
Data mining is strongly related to statistical analysis. It can be seen as a filter (exploratory data analysis) applied before a rigorous statistical tool.
Data mining generates hypotheses that are then verified (sometimes too many!).
The filtering process does not guarantee completeness (wrong elimination or missing data).

Data Mining
Data mining is the process of discovering interesting and potentially useful patterns of information embedded in large databases.
Spatial data mining has the same goals as conventional data mining but requires additional techniques that are tailored to the spatial domain.
A key goal of spatial data mining is to partially automate knowledge discovery, i.e. to search for “nuggets” of information embedded in very large quantities of spatial data.

Data Mining
Data mining lies at the intersection of database management, statistics, machine learning and artificial intelligence.
DM provides semi-automatic techniques for discovering unexpected patterns in very large data sets.
We must distinguish between operational systems (e.g. bank account transactions) and decision support systems (e.g. data mining). DM can support decision making.

Spatial Data Mining
SDM can be characterised by Tobler’s first law of geography (near things tend to be more related than far things).
This means that the standard DM assumption that values are independently and identically distributed (iid) does not hold for spatially dependent data (SDD). The term spatial autocorrelation captures this property and augments standard DM techniques for SDM.

Spatial Data Mining
The important techniques in conventional DM are association rules, clustering, classification, and regression. These techniques need to be modified for spatial DM.
Two approaches are used when adapting DM techniques to the spatial domain:
1) Adjust the underlying (iid) statistical model.
2) Include an objective function (some f(x) that we wish to maximize or minimize, and which drives the search) that is modified to include a spatial term.

Spatial Data Mining
Size of spatial data sets: NASA’s Earth Orbiting Satellites capture about a terabyte (10^12 bytes) a day; YouTube 2008 = 6 terabytes.
Environmental agencies, utilities (e.g. ESB), the Central Statistics Office, government departments such as health and agriculture, and local authorities all have large spatial data sets.
It is very difficult to analyse such large data sets manually or using only SQL.
For examples see Chapter 7 from SDT.

Data Mining: Sub-processes
Data mining involves many sub-processes.
Data collection: usually the data was collected as part of the operational activities of an organization, rather than specifically for the data mining task. It is unlikely that the data mining requirements were considered during data collection.
Data extraction/cleaning: data must be extracted and cleaned for the specific data mining task.

Data Mining: Sub-processes
Feature selection.
Algorithm design.
Analysis of output.
The level of aggregation at which the data is analysed must be decided. Identical experiments at different levels of scale can sometimes lead to contradictory results (e.g. the choice of basic spatial unit can influence the results of a social survey).
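The effect of the level of aggregation can be illustrated with a small sketch in Python. The zones and observations below are hypothetical values chosen for illustration, not data from the lecture. The same observations are correlated at three levels: all individual points, one mean per zone, and within each zone separately.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical survey: (x, y) observations grouped into three spatial zones.
zones = {
    "A": [(1, 3), (2, 2), (3, 1)],
    "B": [(4, 6), (5, 5), (6, 4)],
    "C": [(7, 9), (8, 8), (9, 7)],
}

# Level 1: every individual observation, ignoring the zones.
points = [p for obs in zones.values() for p in obs]
r_points = pearson([x for x, _ in points], [y for _, y in points])

# Level 2: one aggregated record (the mean) per zone.
means = [
    (sum(x for x, _ in obs) / len(obs), sum(y for _, y in obs) / len(obs))
    for obs in zones.values()
]
r_zones = pearson([mx for mx, _ in means], [my for _, my in means])

# Level 3: each zone analysed on its own.
r_within = {
    name: pearson([x for x, _ in obs], [y for _, y in obs])
    for name, obs in zones.items()
}

print(r_points, r_zones, r_within)
```

With these values the zone means are perfectly positively correlated, while inside every zone the relationship is perfectly negative, so the conclusion depends on the spatial unit chosen.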
Geographic Data Mining Process
Close interaction between the domain expert and the data-mining analyst.
The output consists of hypotheses (data patterns) which can be verified with statistical tools and visualised using a GIS.
The analyst can interpret the patterns and recommend appropriate actions.

Unique features of spatial data mining
The difference between classical and spatial data mining parallels the difference between classical and spatial statistics.
Statistics assumes the samples are independently generated, which is generally not the case with SDD.
Like things tend to cluster together.
Change tends to be gradual over space.

Non-Spatial Descriptive Data Mining
Descriptive analysis is an analysis that results in some description or summarization of data. It characterizes the properties of the data by discovering patterns in the data which would be difficult for the human analyst to identify by eye or by using standard statistical techniques.
Description involves identifying rules or models that describe data. Both clustering and association rules are descriptive techniques employed by supermarket chains.

Non-Spatial Descriptive Data Mining
Clustering (unsupervised learning) is a descriptive data mining technique. Clustering is the task of assigning cases into groups of cases (clusters) so that the cases within a group are similar to each other and are as different as possible from the cases in other groups.
Clustering can identify groups of customers with similar buying patterns, and this knowledge can be used to help promote certain products. Clustering can help locate the crime ‘hot spots’ in a city.

Clustering using Similarity graphs
Problem: grouping objects into similarity classes based on various properties of the objects.
For example, suppose computer programs that implement the same algorithm have three properties (k = 1, 2, 3):
1. Number of lines in the program
2. Number of “GOTO” statements
3. Number of function calls

Clustering using Similarity graphs
Suppose five programs are compared using three attributes:

    Program   # lines   # GOTOs   # functions
    1         66        20         1
    2         41        10         2
    3         68         5         8
    4         90        34         5
    5         75        12        14

Clustering using Similarity graphs
A graph G is constructed as follows: V(G) is the set of programs {v1, v2, v3, v4, v5}. Each vertex vi is assigned a triple (p1, p2, p3), where pk is the value of property k = 1, 2, or 3.
v1 = (66, 20, 1)
v2 = (41, 10, 2)
v3 = (68, 5, 8)
v4 = (90, 34, 5)
v5 = (75, 12, 14)
(Figure: the five vertices; vertices not accurately positioned.)

Clustering using Similarity graphs
Define a dissimilarity function as follows. For each pair of vertices v = (p1, p2, p3), w = (q1, q2, q3):
$s(v,w) = \sum_{k=1}^{3} |p_k - q_k|$
s(v,w) is a measure of dissimilarity between any two programs v and w.
Fix a number N. Insert an edge between v and w if s(v,w) < N. Then we say that v and w are in the same class if v = w or if there is a path between v and w.

Clustering using Similarity graphs
If we let vi correspond to program i:
s(v1,v2) = 36    s(v2,v3) = 38    s(v3,v4) = 54
s(v1,v3) = 24    s(v2,v4) = 76    s(v3,v5) = 20
s(v1,v4) = 42    s(v2,v5) = 48    s(v4,v5) = 46
s(v1,v5) = 30
For example, s(v1,v2) = |66-41| + |20-10| + |1-2| = 36.

Clustering using Similarity graphs
Let N = 25. s(v1,v3) = 24, s(v3,v5) = 20 and all other s(vi,vj) > 25.
There are three classes: {v1, v3, v5}, {v2} and {v4}.
(Figure: the similarity graph.)

Dissimilarity matrix in R

    library('cluster')
    data <- matrix(c(66,20,1,41,10,2,68,5,8,90,34,5,75,12,14), ncol=3, byrow=TRUE)
    diss <- daisy(data, metric = "manhattan")

    Dissimilarities :
       1  2  3  4
    2 36
    3 24 38
    4 42 76 54
    5 30 48 20 46

    Metric : manhattan
    Number of objects : 5

Non-Spatial Descriptive Data Mining
Association Rules. Association rule discovery (ARD) identifies the relationships within data. The rule can be expressed as a predicate in the form (IF x THEN y).
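Returning to the similarity-graph clustering of the five programs, the procedure can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names are made up for this example, and the classes are found as connected components of the graph with a simple depth-first search.

```python
from itertools import combinations

# The five programs from the slides: (lines, GOTOs, function calls).
programs = {
    1: (66, 20, 1),
    2: (41, 10, 2),
    3: (68, 5, 8),
    4: (90, 34, 5),
    5: (75, 12, 14),
}


def dissimilarity(v, w):
    """Manhattan distance: sum over k of |p_k - q_k|."""
    return sum(abs(p - q) for p, q in zip(v, w))


def similarity_classes(items, threshold):
    """Join v and w by an edge when s(v, w) < threshold, then return the
    connected components of the resulting graph as sorted lists."""
    neighbours = {key: set() for key in items}
    for a, b in combinations(items, 2):
        if dissimilarity(items[a], items[b]) < threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    classes, seen = [], set()
    for start in items:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first search from an unvisited vertex
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        classes.append(sorted(component))
    return classes


classes = similarity_classes(programs, threshold=25)
print(classes)  # [[1, 3, 5], [2], [4]]
```

Raising the threshold N merges classes (for example, N = 37 would also connect v1 and v2), so the choice of N controls how coarse the clustering is.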
ARD can identify product lines that are bought together in a single shopping trip by many customers, and this knowledge can be used to help decide on the layout of the product lines. We will look at ARD in detail later.

Non-Spatial Predictive Data Mining
Predictive DM results in some description or summarization of a sample of data which predicts the form of unobserved data.
Prediction involves building a set of rules or a model that will enable unknown or future values of a variable to be predicted from known values of another variable.

Classification: Non-Spatial Predictive Data Mining
Classification is a predictive data mining technique. Classification is the task of finding a model that maps (classifies) each case into one of several predefined classes.
The goal of classification is to estimate the value of an attribute of a relation based on the values of the relation’s other attributes.
Uses:
Classification is used in risk assessment in the insurance industry.
Determining the location of nests based on the values of vegetation durability and water depth is a location prediction problem (classification: nest or no nest).
Classifying the pixels of a satellite image into various thematic classes such as water, forest, or agricultural is a thematic classification problem.

Classification: Non-Spatial Predictive Data Mining
A classifier can choose a hyperplane that best classifies the data.

Classification techniques
A classification function, f : D -> L, maps a domain D consisting of one or more variables (e.g. vegetation durability, water depth, distance to open water) to a set of labels L (e.g. nest or not-nest).
The goal of classification is to determine the appropriate f from a finite subset Train ⊆ D × L.
The accuracy of f is determined on Test, which is disjoint from Train.
The classification problem is known as predictive modelling because it is used to predict the labels L from D.
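The idea of learning f : D -> L from Train and measuring accuracy on a disjoint Test set can be sketched in Python. Everything concrete here is an assumption made for illustration: the feature values are invented, and a nearest-centroid classifier is swapped in as a simple stand-in (for two classes its decision boundary is a hyperplane). It is not the model used in the wetland case study.

```python
def fit_centroids(train):
    """Learn one centroid (mean feature vector) per label from Train."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(centroids, features):
    """f : D -> L, assigning the label of the nearest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Hypothetical pixels: (vegetation durability, water depth, distance to open water).
train = [
    ((80, 20, 5), "nest"), ((75, 25, 8), "nest"), ((70, 30, 10), "nest"),
    ((20, 70, 40), "no-nest"), ((30, 60, 35), "no-nest"), ((25, 80, 50), "no-nest"),
]
test = [
    ((78, 22, 6), "nest"), ((28, 65, 45), "no-nest"), ((60, 40, 15), "nest"),
    ((35, 75, 30), "no-nest"), ((45, 50, 30), "nest"),
]

# Test must be disjoint from Train, otherwise accuracy is over-optimistic.
assert set(train).isdisjoint(test)

centroids = fit_centroids(train)
predictions = [classify(centroids, features) for features, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Note that this classifier treats every pixel independently, which is exactly the iid assumption that spatially dependent data violates.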
Non-Spatial Predictive Data Mining
Regression analysis is a predictive data mining technique that uses a model to predict a value.
Regression can be used to predict sales of new product lines based on advertising expenditure.

Case Study
Data from 1995 and 1996 concerning two wetlands on the shores of Lake Erie, USA.
Using this information we want to predict the spatial distribution of a marsh-breeding bird, the red-winged blackbird. Where will they build nests? What conditions do they favour?
A uniform grid (pixel = 5 square metres) was superimposed on the wetland. Seven attributes were recorded.
See link1 to Spatial Databases: A Tour for details.

Case Study
The significance of three key variables was established with statistical analysis:
Vegetation durability
Distance to open water
Water depth

Case Study
(Figure panels: nest locations, water depth, distance to open water, vegetation durability.)
Example showing different predictions: (a) the actual locations of nests; (b) pixels with actual nests; (c) locations predicted by one model; and (d) locations predicted by another model. Prediction (d) is spatially more accurate than (c).
Classical statistical assumptions do not hold for spatially dependent data.

Case Study
The previous maps illustrate two important features of spatial data:
Spatial autocorrelation (observations are not independent).
Spatial data is not identically distributed. Two random variables are identically distributed if and only if they have the same probability distribution.
Spatial DBs need to augment classical DM techniques because of:
Rich data types (e.g. extended spatial objects).
Implicit spatial relationships among the variables.
Observations that are not independent.
Spatial autocorrelation among the values of the attributes of physical locations or features.

Classical Data Mining
Association rules: determination of interaction between attributes. For example: X -> Y.
Classification: estimation of the attribute of an entity in terms of attribute values of another entity.
Some applications are:
Predicting locations (shopping centers, habitat, crime zones).
Thematic classification (satellite images).
Clustering: unsupervised learning, where the classes and the number of classes are unknown. Uses a similarity criterion. Applications: clustering pixels from a satellite image on the basis of their spectral signature, identifying hot spots in crime analysis and disease tracking.
Regression: takes a numerical dataset and develops a mathematical formula that fits the data. The results can be used to predict future behavior. Works well with continuous quantitative data like weight, speed or age. Not good for categorical data where order is not significant, like colour, name, gender, nest/no nest.

Determining the Interaction among Attributes
We wish to discover relationships between attributes of a relation. Examples:
is_close(house, beach) -> is_expensive(house)
low(vegetationDurability) -> high(stem density)
Associations and association rules are often used to select subsets of features for more rigorous statistical correlation analysis.
In probabilistic terms an association rule X -> Y is an expression in conditional probability P(Y|X).
In general, $P(X \mid Y) = P(X \cap Y) / P(Y)$ (the probability of X, given Y), so for the rule X -> Y, $P(Y \mid X) = P(X \cap Y) / P(X)$.

Spatial Association Rules
is_a(x, big_town) /\ intersect(x, highway) -> adjacent_to(x, river) [support=7%, confidence=85%]
The part before "->" (implies) is the antecedent, also known as the hypotheses, assumptions, or premises. The part after it is the conclusion or consequence.
The relative frequency with which an antecedent appears in a database is called its support (other definitions are possible).
The confidence of a rule A -> B is the conditional probability of B given A. Using probability notation: confidence(A implies B) = P(B | A).

How does data mining differ from conventional methods of data analysis?
Using conventional data analysis the analyst formulates and refines the hypothesis.
This is known as hypothesis verification: an approach to identifying patterns in data where a human analyst formulates and refines the hypothesis. For example, "Did the sales of cream increase when strawberries were available?"
Using data mining the hypothesis is formulated and refined without human input. This approach is known as hypothesis generation: identifying patterns in the data where the hypotheses are automatically formulated and refined.
Knowledge discovery is where the data mining tool formulates and refines the hypothesis by identifying patterns in the data. For example, "What are the factors that determine the sales of cream?"

Association rules
An association rule is a pattern that can be expressed as a predicate in the form (IF x THEN y), where x and y are conditions (about cases). It states that if x (the antecedent) occurs then, in most cases, so will y (the consequence).
The antecedent may contain several conditions but the consequence usually contains only one term.

Association rules
Association rules need to be discovered. Rule discovery is a data mining technique that identifies relationships within data.
In the non-spatial case rule discovery is usually employed to discover relationships within transactions or between transactions in operational data.
The relative frequency with which an antecedent appears in a database is called its support (alternatively, the fraction of transactions satisfying the rule).
The level at which this relative frequency is considered significant is called the support threshold (say 70%). Support at or above the threshold is high support.

Association rules
Example: market basket analysis is a form of association rule discovery that discovers relationships in the purchases made by a customer during a single shopping trip.

Item Set
An itemset in the context of market basket analysis is the set of items found in a customer’s shopping basket (or order).
A general form of association rule is (IF x1 and x2 and .. xn THEN y1 and y2 and .. ym). In market basket analysis the set of items (x1 and x2 and .. xn and y1 and y2 and .. ym) is called the itemset.
We are only interested in itemsets with high support (i.e. they appear together in many baskets).

Frequent Item Set
We then find association rules involving itemsets that appear in at least a certain percentage of the shopping baskets, called the support threshold (i.e. the frequency at which the appearance of an itemset in a shopping basket is considered significant).
An itemset that appears in a percentage of baskets at or above the support threshold is called a frequent itemset.
A candidate itemset is potentially a frequent itemset.

A-Priori algorithm
A-Priori uses an iterative level-wise search where k-itemsets are used to explore (k+1)-itemsets.
First the set of frequent 1-itemsets is found. This is used to find the set of frequent 2-itemsets, and so on until no more frequent k-itemsets can be found.
An itemset of k items is called a k-itemset.

A-Priori algorithm
The algorithm follows a two stage process.
1) Find the k-itemsets that are at or above the support threshold, giving the frequent k-itemsets. If none is found, stop; otherwise continue.
2) Generate the (k+1)-itemsets from the frequent k-itemsets. Go to 1.

A-Priori algorithm
A) The first iteration generates candidate 1-itemsets.
B) The frequent 1-itemsets are selected from the candidate 1-itemsets that satisfy the minimum support.
C) The second iteration generates candidate 2-itemsets from the frequent 1-itemsets. All possible pairs are checked to determine the frequency of each pair.
D) The frequent 2-itemsets are determined by selecting those candidate 2-itemsets that satisfy the minimum support.
E) The third iteration generates candidate 3-itemsets from the frequent 2-itemsets. All possible triples are checked to determine the frequency of each triple.
F) The frequent 3-itemsets are determined by selecting those candidate 3-itemsets that satisfy the minimum support. There are none, so terminate.

A-Priori algorithm: Example
A retail chain wishes to determine whether the five product lines, identified by the product codes I1, I2, I3, I4 and I5, are often purchased together by a customer on the same shopping trip. The next slide shows a summary of the transactions.
The support threshold is the frequency at which the appearance of an itemset in a shopping basket is considered significant; in this case it is 2000.
Find the frequent itemsets and generate the association rules using the A-Priori algorithm.

A-Priori algorithm: Example
In R: itemFrequencyPlot(trans, type="absolute")

Association Rules: A-Priori
Principle: if an itemset has high support, then so do all its subsets.
The steps of the algorithm are as follows:
First, discover all 1-itemsets that are frequent.
Combine them to form 2-itemsets and analyze for frequent sets.
Go on until no more itemsets exceed the threshold.
Search for rules.

Association rules & Spatial Domain
Differences with respect to the spatial domain:
1. The notion of transaction or case does not exist, since data are embedded in a continuous space. The partition of the space may introduce errors through over-estimation or under-estimation of confidences. The notion of transaction is replaced by neighborhood.
2. The size of itemsets is smaller in the spatial domain. Thus, the cost of generating candidates is not a dominant factor. The enumeration of neighbours dominates the final computational cost.
3. In most cases, the spatial items are discrete versions of continuous variables.

Spatial Association Rules
Table 7.5 shows examples of association rules, support, and confidence that were discovered in Darr 1995 wetland data.

Co-Location rules
Co-location rules attempt to generalise association rules to point collection data sets that are indexed by space.
The co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations; see Figure 7.12 in SDAT.

Co-location Examples
(a) Illustration of point spatial co-location patterns. Shapes represent different spatial feature types. Spatial features in sets {+, x} and {o, *} tend to be located together.
(b) Illustration of line string co-location patterns. Highways and frontage roads are co-located, e.g. Hwy100 is near frontage road Normandale Road.
(Figure: two co-location patterns.)

Spatial Association Rules
A spatial association rule is a rule indicating a certain association relationship among a set of spatial and possibly some non-spatial predicates.
Spatial association rules (SPAR) are defined in terms of spatial predicates rather than items:
P1 /\ P2 /\ .. /\ Pn -> Q1 /\ .. /\ Qm
where at least one of the terms (P or Q) is a spatial predicate. For example:
is(x, country) /\ touches(x, Mediterranean) -> is(x, wine-exporter)

Co-location V Association Rules
Transactions are disjoint while spatial co-location is not. Something must be done. Three main options:
1. Divide the space into areas and treat them as transactions.
2. Choose a reference point pattern and treat the neighbourhood of each of its points as a transaction.
3. Treat all point patterns as equal.

Co-location V Association Rules
The participation ratio (support) is the number of row instances of co-location C divided by the number of instances of feature Fi.
The participation index (confidence) measures the implication strength of a pattern from the spatial features in the pattern.

Co-location V Association Rules
Spatial Association Rules Mining (SARM) is similar to the raster view in the sense that it tessellates a study region S into discrete groups based on spatial or aspatial predicates derived from concept hierarchies. For instance, a spatial predicate close_to(α, β) divides S into two groups: locations close to β and those that are not.

Co-location V Association Rules
So close_to(α, β) can be either true or false depending on α’s closeness to β.
A spatial association rule is a rule that consists of a set of predicates in which at least one spatial predicate is involved. For instance, is_a(α, house) and close_to(α, beach) -> expensive(α).
This approach efficiently mines large datasets using a progressive deepening approach.

DM Summary
Data mining is the process of finding significant, previously unknown, and potentially valuable knowledge hidden in data.
DM seeks to reveal useful and often novel patterns and relationships in the raw and summarized data in the warehouse in order to solve business problems.
The answers are not pre-determined but often discovered through exploratory methods.
DM is not usually part of operational (day-to-day) systems but rather of a decision support system (sometimes once off).
The variety of data mining methods includes intelligent agents, expert systems, fuzzy logic, neural networks, exploratory data analysis, descriptive DM, predictive DM and data visualization.
Closely related to Spatial Statistics (e.g. Moran's I).
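Since Moran's I is named as the link to spatial statistics, a short Python sketch may help show how it quantifies spatial autocorrelation. The formula used is the usual one, I = (n / W) * Σ_ij w_ij (x_i - x̄)(x_j - x̄) / Σ_i (x_i - x̄)², where W is the sum of all weights. The two transects and the simple "adjacent cells are neighbours" weight matrix are assumptions chosen for illustration.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n * cross) / (w_total * sum(d * d for d in dev))


def chain_weights(n):
    """Binary weights for cells along a line: w_ij = 1 when i and j are adjacent."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


# Two hypothetical transects of six cells with the same set of values.
clustered = [1, 1, 1, 5, 5, 5]    # like values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]  # every neighbour is unlike

w = chain_weights(6)
i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
print(i_clustered, i_alternating)
```

Both transects have identical non-spatial statistics (same mean, same variance), yet the clustered one gives a positive I and the alternating one a negative I. Under no spatial autocorrelation the expected value is -1/(n-1), here -0.2.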
Summary
The methods are able to intensively explore large amounts of data for patterns and relationships, and to identify potential answers to complex business problems.
Some of the areas of application are risk analysis, quality control, and fraud detection.
There are several ways GIS and spatial techniques can be incorporated in data mining.
Pre-DM, a data warehouse can be spatially partitioned, so the data mining is selectively applied to certain geographies (e.g. location or theme).
During the data mining process, algorithms can be modified to incorporate spatial methods. For instance, correlations can be adjusted for spatial autocorrelation (or correlation across space and time), cluster analysis can add spatial indices, and association rules can be adapted to generate co-location inferences.
After data mining, patterns and relationships identified in the data can be mapped with GIS software.

Summary
DM examples: co-location, location prediction.
Application of SDM: the generation of co-location rules. Determining the location of nests based on the values of vegetation durability and water depth is a location prediction problem.

AR-Summary
Association rules. An association rule can be expressed as a predicate in the form (IF x1, x2 .. THEN y1, y2 ..) where {xi, yi} are called itemsets (e.g. items in a shopping basket).
The AR algorithm takes a list of itemsets as input and produces a set of rules, each with a confidence measure.
Association rule discovery (ARD) identifies the relationships within data. ARD can identify product lines that are bought together in a single shopping trip by many customers, and this knowledge can be used by a supermarket chain to help decide on the layout of the product lines.

AR-Summary
Association rules are characterized by confidence and support.
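As a recap of how support, confidence and the A-Priori level-wise search fit together, here is a compact Python sketch. The nine baskets are hypothetical (the transaction table from the slides is not reproduced here), the support threshold is an absolute count of 2 rather than the 2000 used in the lecture example, and the function names are illustrative only.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: count} for every itemset found in >= min_support baskets."""
    baskets = [frozenset(b) for b in baskets]
    candidates = [frozenset([i]) for i in {item for b in baskets for item in b}]
    frequent, k = {}, 1
    while candidates:
        # Count each candidate k-itemset and keep those meeting the threshold.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join: unions of frequent (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def rules(frequent, min_confidence):
    """Generate (antecedent, consequent, confidence) from frequent itemsets."""
    found = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                confidence = count / frequent[a]  # P(consequent | antecedent)
                if confidence >= min_confidence:
                    found.append((set(a), set(itemset - a), confidence))
    return found


# Hypothetical baskets over the five product lines I1..I5.
baskets = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

frequent = apriori(baskets, min_support=2)
strong_rules = rules(frequent, min_confidence=1.0)
for antecedent, consequent, confidence in strong_rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

The prune step relies on the a-priori principle: a candidate is discarded without counting as soon as one of its subsets is known to be infrequent.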
AR and co-location
Co-location is the presence of two or more spatial objects at the same location or at significantly close distances from each other.
Co-location patterns can indicate interesting associations among spatial data objects with respect to their non-spatial attributes. For example, a data mining application could discover that sales at franchises of a specific pizza restaurant chain were higher at restaurants co-located with video stores than at restaurants not co-located with video stores.
In probabilistic terms an association rule X -> Y is an expression in conditional probability P(Y|X).

Association rules for spatial data
Co-location rules attempt to generalise association rules to point collection data sets that are indexed by space.
The co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations.
Examples of co-location patterns: predator-prey species, symbiosis, dental health and fluoride.

Association rules for spatial data
Co-location extends traditional ARM to the case where the set of transactions is a continuum in a space, but we need additional definitions of both neighbour (say, a radius) and the statistical weight of a neighbour.
Use a spatial statistic, the K function, to measure correlation within one point pattern (same variable) or between two point patterns (different variables).
K can indicate whether there is no spatial correlation, attraction, or repulsion between variables (e.g. predator-prey).

Association rules for spatial data
Either the antecedent or the consequent of the rule will generally contain a spatial predicate (e.g. within X). These could be arranged as follows:
Non-spatial antecedent and spatial consequent. All primary schools are located close to new suburban housing estates.
Spatial antecedent and non-spatial consequent. Houses located close to the bay are expensive.
Spatial antecedent and spatial consequent. Residential properties located in the city are south of the river. Here the antecedent also has a non-spatial filter, 'residential'.
The participation ratio and participation index are two measures which replace support and confidence here.
The participation ratio is the number of row instances of co-location C divided by the number of instances of feature Fi.
Example of a spatial association rule:
is_a(x, big_town) /\ intersect(x, highway) -> adjacent_to(x, river)
[support=7%, confidence=85%]
[participation ratio=7%, participation index=85%]
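The participation measures can be made concrete with a small Python sketch for a two-feature co-location {A, B}. Several things here are assumptions made for illustration: the point coordinates are invented, two instances count as neighbours when they lie within a chosen radius, the participation ratio is computed in the commonly used form (the fraction of instances of a feature that take part in at least one row instance), and the participation index is taken as the minimum ratio over the features.

```python
from math import dist

# Hypothetical instances of two spatial feature types, keyed by instance id.
features = {
    "A": {"a1": (0, 0), "a2": (5, 5), "a3": (10, 0)},
    "B": {"b1": (1, 0), "b2": (5, 6), "b3": (20, 20), "b4": (0, 1)},
}
RADIUS = 1.5  # assumed neighbourhood radius

# Row instances of the co-location {A, B}: neighbouring (A, B) pairs.
row_instances = [
    (a_id, b_id)
    for a_id, a_xy in features["A"].items()
    for b_id, b_xy in features["B"].items()
    if dist(a_xy, b_xy) <= RADIUS
]

# Participation ratio per feature: distinct participating instances
# divided by the total number of instances of that feature.
participating = {
    "A": {a_id for a_id, _ in row_instances},
    "B": {b_id for _, b_id in row_instances},
}
participation_ratio = {
    name: len(participating[name]) / len(instances)
    for name, instances in features.items()
}

# Participation index: the weakest participation among the features.
participation_index = min(participation_ratio.values())

print(row_instances)
print(participation_ratio, participation_index)
```

Unlike transaction-based support, no partition of space into baskets is needed: the neighbourhood radius plays the role that the transaction plays in market basket analysis.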