Lecture 6
Data Mining
DT786 Semester 2 2011-12
Pat Browne
Data Mining: Outline
• Spatial DM compared to spatial statistics
• Background to SD, & spatial data mining (SDM).
• The DM process
• Spatial autocorrelation, i.e. the non-independence of
phenomena in a contiguous geographic area.
• Spatial independence
• Classical data mining concepts:
• Classification
• Clustering
• Association rules
• Spatial data mining using co-location rules
• Summary
Statistics versus Data Mining
 Do we know the statistical properties of data? Is data
spatially clustered, dispersed, or random?
 Data mining is strongly related to statistical analysis.
 Data mining can be seen as a filter (exploratory data
analysis) before applying a rigorous statistical tool.
 Data mining generates hypotheses that are then
verified (sometimes too many!).
 The filtering process does not guarantee
completeness (wrong elimination or missing data).
Data Mining
 Data mining is the process of discovering
interesting and potentially useful patterns of
information embedded in large databases.
 Spatial data mining has the same goals as
conventional data mining but requires additional
techniques that are tailored to the spatial
domain.
 A key goal of spatial data mining is to partially
automate knowledge discovery, i.e., search for
“nuggets” of information embedded in very large
quantities of spatial data.
Data Mining
 Data mining lies at the intersection of database
management, statistics, machine learning and
artificial intelligence. DM provides semiautomatic techniques for discovering
unexpected patterns in very large data sets.
 We must distinguish between operational
systems (e.g. bank account transactions) and
decision support systems (e.g. data mining). DM
can support decision making.
Spatial Data Mining
• SDM can be characterised by Tobler’s first
law of geography (near things tend to be
more related than far things). This
means that the standard DM assumption
that values are independently and
identically distributed (iid) does not hold for
spatially dependent data (SDD). The term
spatial autocorrelation captures this
property, which is used to augment standard DM
techniques for SDM.
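Spatial autocorrelation can be quantified with Moran's I, a spatial statistic. The following is a minimal Python sketch, not a definitive implementation. It assumes a hypothetical strip of six cells with binary contiguity weights (each cell neighbours the cells immediately left and right of it), and the attribute values are invented for illustration.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between every pair of cells.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def chain_weights(n):
    """Assumed binary contiguity: cell i neighbours cells i-1 and i+1."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)]
            for i in range(n)]


w = chain_weights(6)
clustered = [1, 1, 1, 9, 9, 9]      # like values sit next to each other
alternating = [1, 9, 1, 9, 1, 9]    # neighbours are always unlike

i_clustered = morans_i(clustered, w)      # positive: near things are alike
i_alternating = morans_i(alternating, w)  # negative: near things differ
print(i_clustered, i_alternating)
```

A positive value indicates that neighbouring cells tend to carry similar values, which is the situation Tobler's law describes; a value near zero would be consistent with spatial independence.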
Spatial Data Mining
• The important techniques in conventional DM
are association rules, clustering, classification,
and regression. These techniques need to be
modified for spatial DM. Two approaches are used
when adapting DM techniques to the spatial
domain:
• 1) Adjust the underlying (iid) statistical model.
• 2) Modify the objective function (some f(x) that we
wish to maximize or minimize, and which drives the
search) so that it includes a spatial term.
Spatial Data Mining
• Size of spatial data sets:
• NASA’s Earth Orbiting Satellites capture about a
terabyte (10^12 bytes) a day, YouTube 2008 = 6 terabytes.
 Environmental agencies, utilities (e.g. ESB), Central
Statistics Office, government departments such as
health/agriculture, and local authorities all have large
spatial data sets.
 It is very difficult to analyse such large data sets
manually or using only SQL.
 For examples see Chapter 7 from SDT
Data Mining: Sub-processes
• Data mining involves many sub-processes:
 Data collection: usually data was collected as
part of the operational activities of an
organization, rather than specifically for the data
mining task. It is unlikely that the data mining
requirements were considered during data
collection.
 Data extraction/cleaning: data must be extracted
& cleaned for the specific data mining task.
Data Mining: Sub-processes
 Feature selection.
 Algorithm design.
 Analysis of output
 Level of aggregation at which the data is
being analysed must be decided. Identical
experiments at different levels of scale can
sometimes lead to contradictory results
(e.g. the choice of basic spatial unit can
influence the results of a social survey).
Geographic Data mining process
Close interaction between Domain Expert & Data-Mining Analyst
The output consists of hypotheses (data patterns) which can be verified with
statistical tools and visualised using a GIS.
The analyst can interpret the patterns and recommend appropriate actions.
Unique features of spatial data
mining
 The difference between classical & spatial
data mining parallels the difference
between classical & spatial statistics.
 Statistics assumes the samples are
independently generated, which is
generally not the case with SDD.
 Like things tend to cluster together.
 Change tends to be gradual over space.
Non-Spatial Descriptive Data
Mining
 Descriptive analysis is an analysis that results in
some description or summarization of data. It
characterizes the properties of the data by
discovering patterns in the data, which would be
difficult for the human analyst to identify by eye
or by using standard statistical techniques.
Description involves identifying rules or models
that describe data. Both clustering and
association rules are descriptive techniques
employed by supermarket chains.
Non-Spatial Data Mining
Non-Spatial Descriptive Data
Mining
 Clustering (unsupervised learning) is a
descriptive data mining technique. Clustering is
the task of assigning cases into groups of cases
(clusters) so that the cases within a group are
similar to each other and are as different as
possible from the cases in other groups.
Clustering can identify groups of customers with
similar buying patterns and this knowledge can
be used to help promote certain products.
Clustering can also help locate crime
‘hot spots’ in a city.
Clustering using Similarity graphs
Problem: grouping objects into similarity
classes based on various properties of the
objects. For example, consider computer
programs that implement the same
algorithm, each described by three properties (k = 1, 2, 3):
 1. Number of lines in the program
 2. Number of “GOTO” statements
 3. Number of function calls
Clustering using Similarity graphs
Suppose five programs are compared using
three attributes:
Program   # lines   # GOTOs   # functions
1         66        20        1
2         41        10        2
3         68        5         8
4         90        34        5
5         75        12        14
Clustering using Similarity graphs.
• A graph G is constructed as follows:
• V(G) is the set of programs {v1, v2, v3, v4, v5}.
• Each vertex vi is assigned a triple (p1, p2, p3),
where pk is the value of property k = 1, 2, or 3:
v1 = (66, 20, 1)
v2 = (41, 10, 2)
v3 = (68, 5, 8)
v4 = (90, 34, 5)
v5 = (75, 12, 14)
(Figure: the vertices of G; vertices not accurately positioned.)
Clustering using Similarity graphs.
• Define a dissimilarity function as follows:
• For each pair of vertices v = (p1, p2, p3), w = (q1, q2, q3):
s(v,w) = \sum_{k=1}^{3} |p_k - q_k|
s(v,w) is a measure of dissimilarity between any
two programs v and w.
Fix a number N. Insert an edge between v and w if
s(v,w) < N. Then
we say that v and w are in the same class if v = w
or if there is a path between v and w.
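The construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming the five programs and the threshold N = 25 from the slides; the function names are hypothetical. Classes are found as the connected components of the similarity graph.

```python
from itertools import combinations

# The five programs: (# lines, # GOTOs, # functions).
programs = {
    1: (66, 20, 1),
    2: (41, 10, 2),
    3: (68, 5, 8),
    4: (90, 34, 5),
    5: (75, 12, 14),
}


def s(v, w):
    """Dissimilarity: sum of absolute differences of the properties."""
    return sum(abs(p - q) for p, q in zip(v, w))


def similarity_classes(points, n):
    """Join v and w when s(v, w) < n, then return connected components."""
    adjacency = {key: set() for key in points}
    for a, b in combinations(points, 2):
        if s(points[a], points[b]) < n:
            adjacency[a].add(b)
            adjacency[b].add(a)
    classes, seen = [], set()
    for start in points:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                      # depth-first walk along edges
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        classes.append(component)
    return classes


classes = similarity_classes(programs, 25)
print(classes)
```

Raising N adds edges and merges classes; lowering it splits them, so the choice of N controls how coarse the clustering is.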
Clustering using Similarity
graphs.
• If we let vi correspond to program i:
s(v1,v2) = 36    s(v2,v3) = 38    s(v3,v4) = 54
s(v1,v3) = 24    s(v2,v4) = 76    s(v3,v5) = 20
s(v1,v4) = 42    s(v2,v5) = 48    s(v4,v5) = 46
s(v1,v5) = 30
For example, s(v1,v2) = |66-41| + |20-10| + |1-2| = 36.
Clustering using Similarity graphs.
 Let N = 25.
 s(v1,v3) = 24, s(v3,v5) = 20 and all other
s(vi,vj) > 25
 There are three classes:
 {v1,v3, v5}, {v2} and {v4}
 The similarity graph =
Dissimilarity matrix in R
library('cluster')
# one row per program: lines, GOTOs, functions
data <- matrix(c(66,20,1,41,10,2,68,5,8,90,34,5,75,12,14), ncol=3, byrow=TRUE)
# pairwise Manhattan (sum of absolute differences) dissimilarities
diss <- daisy(data, metric = "manhattan")
diss
Dissimilarities :
   1  2  3  4
2 36
3 24 38
4 42 76 54
5 30 48 20 46

Metric : manhattan
Number of objects : 5
Non-Spatial Descriptive Data
Mining
 Association Rules. Association rule
discovery (ARD) identifies the
relationships within data. The rule can be
expressed as a predicate in the form (IF x
THEN y ). ARD can identify product lines
that are bought together in a single
shopping trip by many customers and this
knowledge can be used to help decide on
the layout of the product lines. We will look
at ARD in detail later.
Non-Spatial Predictive Data Mining
 Predictive DM results in some description
or summarization of a sample of data
which predicts the form of unobserved
data. Prediction involves building a set of
rules or a model that will enable unknown
or future values of a variable to be
predicted from known values of another
variable.
Classification Non-Spatial
Predictive Data Mining
 Classification is a predictive data mining technique.
Classification is the task of finding a model that maps
(classifies) each case into one of several predefined
classes. The goal of classification is to estimate the
value of an attribute of a relation based on the value of
the relation’s other attributes.
 Uses:
 Classification is used in risk assessment in the insurance
industry.
 Determining the location of nests based on the values of
vegetation durability & water depth is a location prediction
problem (classification nest or no nest).
 Classifying the pixels of a satellite image into various thematic
classes such as water, forest, or agricultural is a thematic
classification problem.
Classification Non-Spatial
Predictive Data Mining
 A classifier can choose a hyperplane that
best classifies the data.
Classification techniques
 A classification function, f : D -> L, maps a
domain D consisting of one or more variables (e.g.
vegetation durability, water depth,
distance to open water) to a set of labels L (e.g.
nest or not-nest).
 The goal of the classification is to determine the
appropriate f, from a finite subset Train ⊆ D × L.
• The accuracy of f is determined on a set Test, which is disjoint
from Train.
 The classification problem is known as predictive
modelling because it is used to predict the labels L
from D.
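As an illustration of learning f : D -> L from Train and measuring accuracy on a disjoint Test set, here is a minimal Python sketch. It uses a nearest-centroid classifier, which is a stand-in technique chosen for brevity and is not taken from the slides; the attribute values (vegetation durability, water depth, distance to open water) are invented.

```python
def fit_centroids(train):
    """Learn one mean vector (centroid) per label from (x, label) pairs."""
    sums, counts = {}, {}
    for x, label in train:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[label] = counts.get(label, 0) + 1
    return {lab: [a / counts[lab] for a in acc] for lab, acc in sums.items()}


def classify(centroids, x):
    """f(x): the label whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def accuracy(centroids, test):
    """Fraction of Test cases whose predicted label is correct."""
    hits = sum(classify(centroids, x) == label for x, label in test)
    return hits / len(test)


# Invented data: (vegetation durability, water depth, distance to open water).
train = [
    ((80, 20, 5), "nest"),
    ((75, 25, 8), "nest"),
    ((20, 60, 40), "no-nest"),
    ((30, 70, 35), "no-nest"),
]
test = [((78, 22, 6), "nest"), ((25, 65, 38), "no-nest")]  # disjoint from train

centroids = fit_centroids(train)
test_accuracy = accuracy(centroids, test)
print(test_accuracy)
```

With two classes, the nearest-centroid decision boundary is a hyperplane (the perpendicular bisector between the two centroids). Note that this sketch treats pixels as independent, which is the assumption that fails for spatially dependent data.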
Non-Spatial Predictive Data Mining
 Regression analysis is a predictive data
mining technique that uses a model to
predict a value. Regression can be used
to predict sales of new product lines based
on advertising expenditure.
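A minimal Python sketch of this idea, assuming ordinary least squares with a single predictor. The advertising and sales figures are invented and chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented figures: advertising expenditure and resulting sales.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(advertising, sales)
predicted = intercept + slope * 5.0   # predicted sales for expenditure 5.0
print(slope, intercept, predicted)
```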
Case Study
• Data from 1995 & 1996 concerning two wetlands
on the shores of Lake Erie, USA.
• Using this information we want to predict the
spatial distribution of a marsh-breeding bird, the
red-winged blackbird. Where will they build
nests? What conditions do they favour?
• A uniform grid (pixel = 5 square metres) was
superimposed on the wetland.
• Seven attributes were recorded.
• See the link to Spatial Databases: A Tour for details.
Case Study
 Significance of three key variables
established with statistical analysis.
 Vegetation durability
 Distance to open water
 Water depth
Case Study
(Figure panels: nest locations, water depth, distance to open water, vegetation durability.)
Example showing different predictions: (a) the actual locations of nests; (b) pixels with actual nests;
(c) locations predicted by one model; and (d) locations predicted by another model. Prediction (d) is
spatially more accurate than (c).
Classical statistical assumptions do
not hold for spatially dependent
data
Case Study
 The previous maps illustrate two important
features of spatial data:
 Spatial Autocorrelation (not independent)
 Spatial data is not identically distributed.
 Two random variables are identically
distributed if and only if they have the
same probability distribution.
Spatial DBs need to augment
classical DM techniques because:
 Rich data types (e.g., extended spatial
objects)
 Implicit spatial relationships among the
variables,
 Observations that are not independent,
 Spatial autocorrelation exists among the
values of the attributes of physical
locations or features.
Classical Data Mining
Association rules: Determination of interaction between attributes. For
example: X -> Y.
 Classification: Estimation of the attribute of an entity in terms of
attribute values of another entity. Some applications are:
 Predicting locations (shopping centers, habitat, crime zones)
 Thematic classification (satellite images)
 Clustering: Unsupervised learning, where classes and the number
of classes are unknown. Uses similarity criterion. Applications:
Clustering pixels from a satellite image on the basis of their spectral
signature, identifying hot spots in crime analysis and disease
tracking.
 Regression: takes a numerical dataset and develops a
mathematical formula that fits the data. The results can be used to
predict future behavior. Works well with continuous quantitative data
like weight, speed or age. Not good for categorical data where order
is not significant, like colour, name, gender, nest/no nest.
Determining the Interaction among
Attributes
• We wish to discover relationships between
attributes of a relation. Examples:
is_close(house, beach) -> is_expensive(house)
low(vegetationDurability) -> high(stem density)
 Associations & association rules are often used
to select subsets of features for more rigorous
statistical correlation analysis.
 In probabilistic terms an association rule X->Y is
an expression in conditional probability P(Y|X).
• P(X|Y) = P(X ∩ Y)/P(Y) (probability of X, given Y)
Spatial Association rules
• is_a(x, big_town) /\ intersect(x, highway) -> adjacent_to(x, river)
• [support=7%, confidence=85%]
• The part before "->" is the antecedent (AKA hypotheses, assumptions, premises); "->" reads "implies"; the part after it is the conclusion or consequence.
 The relative frequency with which an antecedent
appears in a database is called its support (other
definitions possible).
 The confidence of a rule A->B is the conditional
probability of B given A. Using probability
notation: confidence(A implies B) = P (B | A).
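The two measures can be computed directly from a table of records. The sketch below is a minimal Python illustration in which each record is the set of predicates that hold for one town; the five towns are invented, so the resulting figures are not the 7% and 85% quoted above.

```python
def support(records, items):
    """Fraction of records in which every predicate in `items` holds."""
    items = set(items)
    return sum(items <= record for record in records) / len(records)


def confidence(records, antecedent, consequent):
    """P(consequent | antecedent) estimated from the records."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)


# Invented records: the predicates that are true for each of five towns.
towns = [
    {"big_town", "highway", "river"},
    {"big_town", "highway", "river"},
    {"big_town", "highway"},
    {"big_town", "river"},
    {"highway"},
]

antecedent = {"big_town", "highway"}
consequent = {"river"}

antecedent_support = support(towns, antecedent)         # 3 of 5 towns
rule_support = support(towns, antecedent | consequent)  # 2 of 5 towns
rule_confidence = confidence(towns, antecedent, consequent)
print(antecedent_support, rule_support, rule_confidence)
```

Both definitions of support are shown: the relative frequency of the antecedent alone, and the fraction of records that satisfy the whole rule. The confidence is the second divided by the first.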
How does data mining differ from
conventional methods of data analysis?
 Using conventional data analysis the analyst formulates
and refines the hypothesis. This is known as hypothesis
verification, which is an approach to identifying patterns
in data where a human analyst formulates and refines
the hypothesis. For example "Did the sales of cream
increase when strawberries were available?"
 Using data mining the hypothesis is formulated and
refined without human input. This approach is known as
hypothesis generation, identifying patterns in that data
where the hypotheses are automatically formulated and
refined. Knowledge discovery is where the data mining
tool formulates and refines the hypothesis by identifying
patterns in the data. For example, "What are the factors
that determine the sales of cream?"
Association rules
 An association rule is a pattern that can
be expressed as a predicate in the form
(IF x THEN y ), where x and y are
conditions (about cases), which state if x
(the antecedent) occurs then, in most
cases, so will y (the consequence). The
antecedent may contain several conditions
but the consequence usually contains only
one term.
Association rules
• Example: Market basket analysis is a form
of association rule discovery that
discovers relationships in the purchases
made by a customer during a single
shopping trip. An itemset in the context of
market basket analysis is the set of items
found in a customer’s shopping basket.
Association rules
• Association rules need to be discovered. Rule
discovery is a data mining technique that identifies
relationships within data. In the non-spatial case
rule discovery is usually employed to discover
relationships within transactions or between
transactions in operational data. The relative
frequency with which an antecedent appears in
a database is called its support (alternatively, the
fraction of transactions satisfying the rule). The
level at or above which this relative
frequency is considered significant is called
the support threshold (say 70%).
Item Set
• An itemset in the context of market basket
analysis is the set of items found in a customer’s
shopping basket (or order). A general form of
association rule is (IF x1 and x2 and .. xn THEN
y1 and y2 and .. ym). In market basket analysis
the set of items (x1 and x2 and .. xn and y1 and
y2 and .. ym) is called the itemset. We are only
interested in itemsets with high support (i.e. they
appear together in many baskets).
Frequent Item Set
 We then find association rules involving itemsets
that appear in at least a certain percentage of
the shopping baskets called the support
threshold (i.e. frequency at which the
appearance of an itemset in a shopping basket
is considered significant). An itemset that
appears in a percentage of baskets at or above
the support threshold is called the frequent
itemset.
 A candidate itemset is potentially a frequent
itemset
A-Priori algorithm
• A-Priori uses an iterative level-wise search
where k-itemsets are used to explore (k+1)-itemsets.
• First the set of frequent 1-itemsets is found.
This is used to find the set of frequent 2-itemsets,
and so on until no more frequent k-itemsets
can be found. An itemset of k items is
called a k-itemset.
A-Priori algorithm
• The algorithm follows a two stage
process.
• 1) Find the k-itemsets that are at or above
the support threshold, giving the frequent
k-itemsets. If none is found, stop; otherwise
continue with step 2.
• 2) Generate the candidate (k+1)-itemsets from the
frequent k-itemsets. Go to 1.
A-Priori algorithm
• A) The first iteration generates candidate
1-itemsets.
• B) The frequent 1-itemsets are selected
from the candidate 1-itemsets that satisfy
the minimum support.
• C) The second iteration generates
candidate 2-itemsets from the frequent
1-itemsets. All possible pairs are checked to
determine the frequency of each pair.
A-Priori algorithm
• D) The frequent 2-itemsets are determined by
selecting those candidate 2-itemsets that satisfy
the minimum support.
• E) The third iteration generates candidate
3-itemsets from the frequent 2-itemsets. All
possible triples are checked to determine the
frequency of each triple.
• F) The frequent 3-itemsets are determined by
selecting those candidate 3-itemsets that satisfy
the minimum support. There are none, so
terminate.
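The level-wise search in steps A to F can be sketched as follows. This is a minimal Python illustration, not the lecture's own implementation; the five baskets and the minimum support count of 3 are invented, and the product codes only mimic the I1, I2, ... naming used in the slides.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in at least
    `min_count` transactions, using a level-wise (k -> k+1) search."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those meeting the support threshold.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        if not level:
            break
        frequent.update(level)
        k += 1
        # Join: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Invented baskets of product codes.
baskets = [
    {"I1", "I2"},
    {"I1", "I2", "I3"},
    {"I1", "I3"},
    {"I2", "I3"},
    {"I1", "I2", "I3"},
]

frequent = apriori(baskets, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this invented data all single items and all pairs are frequent, while the one candidate 3-itemset appears in only two baskets, so the search stops at the third iteration. The prune step relies on the principle that every subset of a frequent itemset is also frequent.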
A-Priori algorithm : Example
 A retail chain wishes to determine whether the
five product lines, identified by the product code
I1, I2, I3, I4 and I5 are often purchased together
by a customer on the same shopping trip. The
next slide shows a summary of the transactions.
The support threshold is the frequency at which
the appearance of an itemset in a shopping
basket is considered significant, in this case it is
2000.
 Find the frequent itemsets and generate the
association rules using the A-Priori algorithm.
A-Priori algorithm : Example
R: itemFrequencyPlot(trans,type="absolute")
Association Rules: A priori
• Principle: If an item set has a high support, then so do all its
subsets.
• The steps of the algorithm are as follows:
• first, discover all 1-itemsets that are frequent
• combine them to form 2-itemsets and analyze for frequent sets
• go on until no more itemsets exceed the threshold
• search for rules
Association rules
Association rules & Spatial
Domain
Differences with respect to spatial domain:
1. The notion of transaction or case does not exist, since data
are immersed in a continuous space. The partition of the
space may introduce errors through over- or underestimation
of confidences. The notion of transaction is
replaced by neighborhood.
2. The size of itemsets is smaller in the spatial domain. Thus, the
cost of generating candidates is not a dominant factor. The
enumeration of neighbours dominates the final
computational cost.
3. In most cases, the spatial items are discrete versions of
continuous variables.
Spatial Association Rules
 Table 7.5 shows examples of association
rules, support, and confidence that were
discovered in Darr 1995 wetland data.
Co-Location rules

Co-location rules attempt to generalise association rules to
point collection data sets that are indexed by space. The
co-location pattern discovery process finds frequently co-located
subsets of spatial event types given a map of their
locations, see Figure 7.12 in SDAT.
Co-location Examples
(a) Illustration of Point Spatial Co-location Patterns. Shapes represent different
spatial feature types. Spatial features in sets {+, x} and {o, *} tend to be
located together.
(b) Illustration of Line String Co-location Patterns. Highways and frontage
roads are co-located, e.g., Hwy100 is near frontage road Normandale
Road.
Spatial Association Rules
 A spatial association rule is a rule indicating certain
association relationship among a set of spatial and possibly
some non-spatial predicates.
 Spatial association rules (SPAR) are defined in terms of
spatial predicates rather than items.
• P1 /\ P2 /\ .. /\ Pn -> Q1 /\ .. /\ Qm
• where at least one of the terms (P or Q) is a spatial
predicate. For example:
is(x, country) /\ touches(x, Mediterranean) -> is(x, wine-exporter)
Co-location V Association Rules
• Transactions are disjoint while spatial co-locations are not, so something must be done to obtain transactions.
Three main options:
 1. Divide the space into areas and treat them
as transactions
 2. Choose a reference point pattern and treat
the neighbourhood of each of its points as a
transaction
 3. Treat all point patterns as equal
Co-location
The participation ratio (support) is the number of row instances of co-location C divided
by number of instances of Fi. The participation index (confidence) measures the
implication strength of a pattern from spatial features in the pattern.
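A minimal Python sketch of these two measures. It assumes the definitions commonly used in the co-location literature: the participation ratio of feature Fi counts the distinct instances of Fi that take part in at least one row instance of C, and the participation index is the smallest ratio over the features of C. The row instances and totals are invented.

```python
def participation_ratios(row_instances, instance_counts):
    """row_instances: one dict per row instance of co-location C,
    mapping feature name -> id of the participating instance.
    instance_counts: feature name -> total number of its instances."""
    features = row_instances[0].keys()
    return {
        f: len({row[f] for row in row_instances}) / instance_counts[f]
        for f in features
    }


def participation_index(ratios):
    """Smallest participation ratio over the features of the co-location."""
    return min(ratios.values())


# Invented example: co-location C = {A, B} with three row instances.
rows = [
    {"A": "a1", "B": "b1"},
    {"A": "a2", "B": "b1"},
    {"A": "a2", "B": "b2"},
]
totals = {"A": 4, "B": 2}   # 4 instances of A and 2 of B on the map

ratios = participation_ratios(rows, totals)
index = participation_index(ratios)
print(ratios, index)
```

Here two of the four A instances and both B instances take part in the co-location, so the ratios are 0.5 and 1.0 and the index is 0.5.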
Co-location V Association Rules
 Spatial Association Rules Mining (SARM)
is similar to the raster view in the sense
that it tessellates a study region S into
discrete groups based on spatial or
aspatial predicates derived from concept
hierarchies. For instance, a spatial
predicate close_to(α, β) divides S into two
groups, locations close to β and those not.
Co-location V Association Rules
• So, close_to(α, β) can be either true or false
depending on α’s closeness to β. A spatial
association rule is a rule that consists of a set of
predicates in which at least one spatial predicate
is involved. For instance,
 is_a(α, house) and close_to(α, beach) ->
expensive(α).
 This approach efficiently mines large datasets
using a progressive deepening approach.
DM Summary
 Data mining is the process of finding significant
previously unknown, and potentially valuable knowledge
hidden in data. DM seeks to reveal useful and often
novel patterns and relationships in the raw and
summarized data in the warehouse in order to solve
business problems. The answers are not pre-determined
but often discovered through exploratory methods. Not
usually part of operational systems (day-to-day) but
rather a decision support system (sometimes once off).
The variety of data mining methods include intelligent
agents, expert systems, fuzzy logic, neural networks,
exploratory data analysis, descriptive DM, predictive DM
and data visualization. Closely related to Spatial
Statistics (e.g. Moran's I).
Summary
• The methods are able to intensively
explore large amounts of data for patterns and relationships, and to
identify potential answers to complex business problems. Some of
the areas of application are risk analysis, quality control, and fraud
detection. There are several ways GIS and spatial techniques can
be incorporated in data mining. Pre-DM, a data warehouse can be
spatially partitioned, so the data mining is selectively applied to
certain geographies (e.g. location or theme). During the data mining
process, algorithms can be modified to incorporate spatial methods.
For instance, correlations can be adjusted for spatial autocorrelation
(or correlation across space and time), cluster analysis can add
spatial indices, and association rules can be adapted to generate
co-location inferences. After data mining, patterns and relationships
identified in the data can be mapped with GIS software.
Summary
• DM examples: co-location, location
prediction.
• Application of SDM: the generation of co-location
rules. Determining the location of
nests based on the values of vegetation
durability & water depth is a location
prediction problem.
AR-Summary
• Association Rules. An association rule can be expressed
as a predicate in the form (IF x1,x2.. THEN y1,y2.. )
where {xi,yi} are called itemsets (e.g. items in a shopping
basket). The AR algorithm takes a list of itemsets as
input and produces a set of rules, each with a confidence
measure. Association rule discovery (ARD) identifies the
relationships within data. ARD can identify product lines
that are bought together in a single shopping trip by
many customers, and this knowledge can be used by a
supermarket chain to help decide on the layout of the
product lines.
AR-Summary
 Association rules characterized by confidence
and support.
AR and co-location
 Co-location is the presence of two or more spatial objects at the
same location or at significantly close distances from each other.
Co-location patterns can indicate interesting associations among
spatial data objects with respect to their non-spatial attributes. For
example, a data mining application could discover that sales at
franchises of a specific pizza restaurant chain were higher at
restaurants co-located with video stores than at restaurants not
co-located with video stores.
 In probabilistic terms an association rule X->Y is an expression in
conditional probability P(Y|X).
Association rules for spatial
data.
 Co-location rules attempt to generalise
association rules to point collection data
sets that are indexed by space. The co-location
pattern discovery process finds
frequently co-located subsets of spatial
event types given a map of their locations.
• Examples of co-location patterns:
predator-prey species, symbiosis, dental
health and fluoride.
Association rules for spatial
data.
• Co-location extends traditional ARM to where
the set of transactions is a continuum in a space,
but we need additional definitions of both
neighbour (say, within a radius) and the statistical weight
of a neighbour. A spatial statistic, the K
function, is used to measure the correlation within
one point pattern (same variable) or between two point
patterns (different variables). K can indicate no spatial correlation,
attraction, or repulsion between variables
(e.g. predator-prey).
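A minimal Python sketch of the K function for a single point pattern. It assumes the simple estimator K(r) = (A / n^2) x (number of ordered pairs of distinct points within distance r), with no edge correction; the study area and the five points are invented. Under complete spatial randomness K(r) is expected to be about pi r^2, so a larger value suggests attraction and a smaller one repulsion.

```python
import math


def ripley_k(points, r, area):
    """Naive K estimate: area * (ordered pairs within r) / n^2."""
    n = len(points)
    pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= r
    )
    return area * pairs / (n * n)


# Invented pattern in a 10 x 10 study area: four clustered points, one isolated.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8)]
radius = 1.5

k_observed = ripley_k(points, radius, area=100.0)
k_random = math.pi * radius ** 2   # expectation with no spatial correlation
print(k_observed, k_random)
```

The observed value is far above the random expectation, which is consistent with attraction (clustering) at this radius. A two-pattern version would count pairs made of one point from each pattern instead.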
Association rules for spatial
data.
Either the antecedent or the consequent of the rule will generally contain a spatial
predicate (e.g. within X). These could be arranged as follows:
• Non-spatial antecedent and spatial consequent. All primary schools are located close
to new suburban housing estates.
• Spatial antecedent and non-spatial consequent. Houses located close to the bay are
expensive.
• Spatial antecedent and spatial consequent. Residential properties located in the city
are south of the river. Here the antecedent also has a non-spatial filter, 'residential'.
The participation ratio and participation index are two measures which replace
support and confidence here. The participation ratio is the number of row instances of
co-location C divided by number of instances of Fi.
Example of spatial association rule
is_a(x, big_town) /\
intersect(x, highway) ->
adjacent_to (x, river)
[support=7%, confidence=85%]
[participation =7%, participation =85%]