Spatial Statistics and Spatial Knowledge Discovery
First law of geography [Tobler]: Everything is related to everything else,
but near things are more related than distant things.
Drowning in data yet starving for knowledge [Naisbitt -Rogers]
Lecture 0 : Introduction
Pat Browne
Introduction
• There are vast amounts of spatially related data
available from government departments (e.g. CSO,
agriculture, environment), local authorities, health
boards, and private industry. This module studies
techniques to analyze these large and diverse data sets
with a view to gleaning new and useful information. We
use two closely related approaches.
Munge = to imperfectly transform information.
• Firstly, we study how basic statistical techniques, such as
correlation and regression, can be adapted to handle
spatial data.
• Secondly, we study how basic knowledge discovery
techniques, such as association rules, can be used in
location-based analysis.
AIMS
• The aim is to equip the student with the
skills necessary to extract decision-support
information from large datasets
using statistical and knowledge discovery
techniques. We will study the techniques
and software needed to
analyze large spatial data sets.
OUTCOMES
• On successful completion of the module the
students will be able to:
1. use basic descriptive statistics to describe spatial
data.
2. use inferential statistics and probability to help make
inferences, judgments, and decisions.
3. use statistical packages to analyze spatial data.
4. use data mining software to assist in knowledge
discovery.
5. use data mining and statistical software for decision
support.
MODULE CONTENT
• Basic statistics and probability e.g. mean, variance,
standard deviation, sampling, correlation and regression
• Spatial autocorrelation and spatial regression.
• Association rules, and other techniques for data mining
and spatial data mining.
• A variety of spatial statistical techniques. For example,
spatial point patterns, spatial interpolation, analysis of
grids and surfaces.
• The use of statistical packages for spatial analysis.
• The use of data mining software in a spatial context.
Early Spatial Analysis
http://en.wikipedia.org/wiki/John_Snow_%28physician%29
http://en.wikipedia.org/wiki/Spatial_analysis
Knowledge Discovery (or Data Mining)
• What is data mining? It is the non-trivial extraction of
implicit, previously unknown, and potentially useful
information from data. Data mining finds valuable
information hidden in large volumes of data.
• Data mining is the analysis of data and the use of
software techniques for finding patterns and regularities
in sets of data.
• The computer is responsible for finding the patterns by
identifying the underlying rules and features in the data.
• It is possible to "strike gold" in unexpected places as the
data mining software extracts patterns not previously
discernible or so obvious that no-one has noticed them
before.
• Data mining lies at the intersection of
database management, statistics, machine
learning and artificial intelligence. DM
provides semi-automatic techniques for
discovering unexpected patterns in very
large data sets.
Descriptive and Predictive DM
Descriptive Data Mining
• Descriptive analysis results in a description
or summarization of data. It
characterizes the properties of the data by
discovering patterns that would be
difficult for a human analyst to identify by eye
or by using standard statistical techniques.
Description involves identifying rules or models
that describe the data (e.g. 15% of those who buy
ice cream also buy wafers).
• Clustering (unsupervised learning) is a
descriptive data mining technique. Clustering is
the task of assigning cases into groups of cases
(clusters) so that the cases within a group are
similar to each other and are as different as
possible from the cases in other groups.
Clustering can identify groups of customers with
similar buying patterns and this knowledge can
be used to help promote certain products.
Clustering can also help locate crime
‘hot spots’ in a city.
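The slides do not name a particular clustering algorithm. As one possible illustration, the sketch below uses plain k-means on a handful of made-up incident coordinates; the function name `kmeans`, the `incidents` data and the choice of k = 2 are all assumptions for this example, not part of the module material.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct cases picked at random from the data.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each case joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels


# Hypothetical incident locations (x, y): two visibly separate groups.
incidents = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.1),
             (9.0, 9.1), (8.7, 9.4), (9.2, 8.8)]

centroids, labels = kmeans(incidents, k=2)
print(centroids)
print(labels)
```

Each centroid can be read as the centre of a candidate hot spot. In practice k is not known in advance and has to be chosen by the analyst.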
• Association Rules. Association rule
discovery (ARD) identifies the logical
relationships within data. A rule can be
expressed as a predicate of the form (IF x
THEN y). ARD can identify product lines
that are bought together in a single
shopping trip by many customers, and this
knowledge can be used by a
supermarket chain to help decide on the
layout of the product lines.
Association Rule Example
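A rule of the form IF x THEN y is usually scored by its support (how often x and y occur together) and its confidence (how often y occurs when x does). The sketch below computes both for a rule like the ice cream and wafers example; the `baskets` data and the function names are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for IF antecedent THEN consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets, one set of product lines per trip.
baskets = [
    {"ice cream", "wafers", "milk"},
    {"ice cream", "milk"},
    {"ice cream", "wafers"},
    {"bread", "milk"},
    {"bread", "wafers"},
]

rule_support = support(baskets, {"ice cream", "wafers"})
rule_confidence = confidence(baskets, {"ice cream"}, {"wafers"})
print(f"IF ice cream THEN wafers: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```

Here two of the three baskets containing ice cream also contain wafers, so the confidence is 2/3, and the rule is supported by 2 of the 5 baskets.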
Predictive Data Mining
• Predictive DM results in some description
or summarization of a sample of data
which predicts the form of unobserved
data. Prediction involves building a set of
rules or a model that will enable unknown
or future values of a variable to be
predicted from known values of another
variable.
• Classification is a predictive data mining
technique. Classification is the task of
finding a model that maps (classifies) each
case into one of several predefined
classes. Classification is used in risk
assessment in the insurance industry.
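As a rough illustration of mapping cases to predefined classes, the sketch below uses a k-nearest-neighbour vote. This is only one of many classification methods and is not named in the slides; the `policyholders` records, the two features (age, number of past claims) and the "high"/"low" risk classes are all hypothetical.

```python
from collections import Counter


def knn_classify(training, query, k=3):
    """Assign `query` the majority class among its k nearest training cases."""
    nearest = sorted(
        training,
        key=lambda row: (row[0][0] - query[0]) ** 2
        + (row[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training cases: ((age, past claims), risk class).
policyholders = [
    ((22, 3), "high"), ((25, 2), "high"), ((30, 3), "high"),
    ((45, 0), "low"), ((52, 1), "low"), ((60, 0), "low"),
]

print(knn_classify(policyholders, (24, 2)))
print(knn_classify(policyholders, (55, 0)))
```

The features are left unscaled to keep the sketch short, so age dominates the distance; a real analysis would normally standardize the variables first.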
• Regression analysis is a predictive data
mining technique that uses a model to
predict a value. Regression can be used
to predict sales of new product lines based
on advertising expenditure.
Linear regression : Example
• A linear regression model fits the amount customers
spend in a supermarket as a linear function of their
income: AnnualSpending = a + b × Income, where a (the
intercept) and b (the slope) are found by the data
mining algorithm. If the model is reasonably accurate,
values of AnnualSpending (Y) can be predicted (or
calculated) from values of Income (X).
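The intercept a and slope b can be estimated by ordinary least squares. The sketch below does this by hand for a small, made-up set of income and spending figures; the data are hypothetical and deliberately lie exactly on a line so that the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: annual income and annual supermarket spending.
income = [20000, 30000, 40000, 50000, 60000]
spending = [3000, 4000, 5000, 6000, 7000]

a, b = fit_line(income, spending)


def predict_spending(x):
    """Predict AnnualSpending (Y) from Income (X) with the fitted line."""
    return a + b * x


print(f"AnnualSpending = {a:.1f} + {b:.3f} * Income")
print(predict_spending(70000))
```

With real data the points scatter around the line, and the quality of the fit should be checked before the model is used for prediction.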
Statistical Techniques
• Data mining uses statistical concepts and
techniques, e.g. mean, standard
deviation, population distribution,
probability, sampling.
• There are differences between DM and
statistics: DM is a process requiring many
steps such as data cleaning. Data mining
can be used as a prelude to a more formal
statistical study (hypothesis discovery).
Statistical techniques for spatial
data.
• There are special spatial statistical
techniques, e.g. interpolation: what is the
likely value at an unsampled point?
• Also many standard statistical techniques
can be adapted for spatial applications
(e.g. using Moran’s I). These usually
involve including a weight matrix
representing location in the basic formula.
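The interpolation question in the first bullet can be sketched with inverse distance weighting (IDW), where nearer samples count for more in the estimate. IDW is used here only as a simple example; the slides do not say which interpolation method the module uses, and the `samples` values and the `power` parameter are assumptions.

```python
import math


def idw(samples, x, y, power=2):
    """Inverse-distance-weighted estimate at (x, y).

    `samples` is a list of (sx, sy, value) tuples.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0:
            # The query coincides with a sample: return its value directly.
            return value
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Hypothetical measurements at the four corners of a 10 x 10 square.
samples = [(0, 0, 10.0), (10, 0, 20.0), (0, 10, 30.0), (10, 10, 40.0)]

print(idw(samples, 5, 5))   # centre: all four samples are equally distant
print(idw(samples, 1, 1))   # near (0, 0): pulled towards that sample's value
```

The weights play a role similar to the weight matrix in the second bullet: both encode how strongly one location influences another.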
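Spatial autocorrelation
[Figure: grids of blue and white cells, one per panel: Dispersed (negative spatial autocorrelation), Spatial independence, Spatial clustering (positive spatial autocorrelation).]
BB = Blue beside Blue
BW = Blue beside White
WW = White beside White
32 white cells and 32 blue cells = 64 cells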
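The BB, BW and WW joins can be counted directly. The sketch below assumes the 64 cells form an 8 × 8 grid and that "beside" means sharing an edge (rook adjacency); neither detail is stated in the slides. The two example patterns, a checkerboard and a left/right split, are illustrative stand-ins for the dispersed and clustered panels.

```python
def join_counts(grid):
    """Count BB, BW and WW joins between edge-adjacent cells of a grid."""
    counts = {"BB": 0, "BW": 0, "WW": 0}
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so that each join is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    pair = grid[r][c] + grid[rr][cc]
                    if pair == "WB":
                        pair = "BW"
                    counts[pair] += 1
    return counts


# Dispersed pattern: colours alternate like a chessboard.
checkerboard = [["B" if (r + c) % 2 == 0 else "W" for c in range(8)]
                for r in range(8)]
# Clustered pattern: left half blue, right half white.
clustered = [["B" if c < 4 else "W" for c in range(8)] for r in range(8)]

print(join_counts(checkerboard))
print(join_counts(clustered))
```

Both grids contain 32 blue and 32 white cells, yet the checkerboard has only BW joins while the split grid has very few. An excess of BW joins suggests negative autocorrelation, and an excess of BB and WW joins suggests positive autocorrelation.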
Moran’s I – Same Mean & SD, but
different spatial configurations.
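The point of this slide can be reproduced numerically. The sketch below computes Moran's I as I = (n / S0) × Σᵢ Σⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², using binary weights that are 1 for edge-adjacent cells and 0 otherwise. The 8 × 8 grids, the rook adjacency rule and the coding blue = 1, white = 0 are assumptions made for illustration.

```python
def morans_i(grid):
    """Moran's I for a numeric grid with binary rook-adjacency weights."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    cross_products = 0.0
    s0 = 0  # sum of all weights w_ij (each neighbour pair counted both ways)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross_products += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    s0 += 1
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return (n / s0) * (cross_products / squared_deviations)


# Two grids with identical values (32 ones, 32 zeros), arranged differently.
dispersed = [[1 if (r + c) % 2 == 0 else 0 for c in range(8)] for r in range(8)]
clustered = [[1 if c < 4 else 0 for c in range(8)] for r in range(8)]

print(morans_i(dispersed))
print(morans_i(clustered))
```

Both grids have the same mean and standard deviation, so ordinary descriptive statistics cannot tell them apart. Moran's I is strongly negative for the alternating pattern and strongly positive for the split pattern.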
References
Lloyd: Spatial Data Analysis
Bivand, Pebesma, Gómez-Rubio: Applied Spatial Data Analysis with R
http://www.manning.com/obe/
http://www.spatial.cs.umn.edu/Book/