IV. MODELS FROM DATA:
Data mining
Outline
A) THEORETICAL BACKGROUND
1. Knowledge discovery in data bases (KDD)
2. Data mining
• Data
• Patterns
• Data mining algorithms
B) PRACTICAL IMPLEMENTATIONS
3. Applications:
• Equations
• Decision trees
• Rules
Methods - Data
Classic predictive modelling: Hypothesis -> Model
Knowledge discovery in data: Data -> New Knowledge
Knowledge discovery in data bases (KDD)
KDD-data mining
Frawley et al., 1991: “KDD is the non-trivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns in data.”
How to find patterns in data?
Data mining (DM) – the central step in the KDD process, concerned with
applying computational techniques to actually find patterns in the data
(15-25% of the effort of the overall KDD process).
- step 1: preparing data for DM (data preprocessing)
- step 2: applying DM to find patterns
- step 3: evaluating the discovered patterns (results of DM)
Knowledge discovery in data bases (KDD)
When can discovered patterns be treated as knowledge?
Frawley et al. (1991): “A pattern that is interesting (according to
a user-imposed interest measure) and certain enough (again
according to the user’s criteria) is called knowledge.”
Condition 1: Discovered patterns should be valid on new
data with some degree of certainty (typically prescribed by the
user).
Condition 2: The patterns should potentially lead to some
useful actions (according to user defined utility criteria).
Knowledge discovery in data bases (KDD)
What can KDD and data mining contribute to
environmental sciences (ES) (e.g. agronomy, forestry,
ecology, …)?
Environmental sciences deal with complex, unpredictable natural systems
(e.g. arable, forest and water ecosystems) in order to answer
complex questions.
The amount of collected environmental data is increasing exponentially.
KDD was purposefully designed to cope with such complex questions
about complex systems, like:
- Understanding the domain/system studied (e.g., gene flow, seed
bank, life cycle, …)
- Predicting future values of system variables of interest (e.g., the rate
of out-crossing with GM plants at location x at time y, seedbank
dynamics, …)
What is data mining?
Data mining focuses on the discovery of previously unknown
knowledge and integrates machine learning.
Machine learning focuses on description and
prediction, based on known properties learned from
empirical training data (examples) using computer
algorithms.
Learning from examples is called inductive learning.
If the goal of inductive learning is to obtain a
model that predicts the value of the target
variable from learning examples, then it is called
predictive or supervised learning.
Data mining (DM)
What is data mining from a technical perspective?
Data mining is the process of automatically searching large
volumes of data for patterns using algorithms.
Data Mining – Machine learning
Data Mining is the application of Machine Learning techniques
to data analysis problems.
The most relevant notions of data mining:
1. Data
2. Patterns
3. Data mining algorithms
Data mining (DM) - data
Data stored in one flat table.
Each example represented by a fixed number
of attributes.
Data stored in original tables or relations.
No loss of information, due to aggregation.
RELATIONAL data mining
PROPOSITIONAL data mining
Objects
Loss of information due to aggregation
Distance (m)
10
12
14
18
20
22
…
Properties of objects
Wind direction (0) Wind speed (m/s)
123
3
88
4
121
6
147
2
93
1
115
3
…
…
Out-crossing rate (%)
8
7
3
4
5
1
..
9
Marko Debeljak
This
image
cannot
currentl
y be
display
ed.
WP2 contributions to SIGMEA Final Meeting, Cambridge, UK, 16-19 April, 2007
DATA STREAM mining:
- Data are not stored at all; they continuously flow through an
algorithm.
- Each example can be propositional or relational.
Data mining (DM) - pattern
2. What is a pattern?
A pattern is defined as: ”A statement (expression) in a given
language, that describes (relationships among) the facts in
a subset of the given data and is (in some sense) simpler
than the enumeration of all facts in the subset” (Frawley et al.
1991, Fayyad et al. 1996).
Classes of patterns considered in data mining:
A. equations,
B. decision trees, relational decision trees
C. association, classification, and regression rules.
Data mining (DM) - pattern
A. Equations
To predict the value of a target (dependent) variable as a
linear or nonlinear combination of the input (independent)
variables:
- Algebraic equations
To predict the behavior of dynamic systems, which change
their rates over time:
- Difference equations
- Differential equations
Data mining (DM) - pattern
B. Decision trees
To predict the value of one or several target (dependent) variables from
the values of the other (independent) variables.
A decision tree has a hierarchical structure, where:
- each internal node contains a test on an independent variable(s),
- each branch corresponds to an outcome of the test (critical values of
the independent variable(s)),
- each leaf gives a prediction for the value of the dependent (predicted)
variable(s).
Data mining (DM) - pattern
The decision tree is called:
- A classification tree: the class value in the
leaf is discrete (a finite set of nominal values):
e.g., (yes, no), (spec. A, spec. B, …)
- A regression tree: the class value in the leaf is a
constant (from an infinite set of values): e.g., 120, 220,
312, …
- A model tree: the leaf contains a linear model
predicting the class value (piecewise linear
function), e.g.:
out-crossing rate = 12.3 * distance - 0.123 * wind speed + 0.00123 * wind direction
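The contrast between these tree types can be illustrated with a short, hedged sketch in Python using scikit-learn (an assumption made purely for illustration; the case studies later in this lecture use J48 and CLUS instead), fitted on small hypothetical data:

# Sketch: a classification tree predicts a discrete class in each leaf,
# a regression tree predicts a numeric constant in each leaf.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical attributes: distance (m) and wind speed (m/s)
X = np.array([[10, 3], [12, 4], [14, 6], [18, 2], [20, 1], [22, 3]], dtype=float)

y_class = np.array(["high", "high", "low", "low", "low", "low"])    # nominal class
classification_tree = DecisionTreeClassifier(min_samples_leaf=2).fit(X, y_class)

y_numeric = np.array([8, 7, 3, 4, 5, 1], dtype=float)               # numeric class
regression_tree = DecisionTreeRegressor(min_samples_leaf=2).fit(X, y_numeric)

print(classification_tree.predict([[11, 3]]), regression_tree.predict([[11, 3]]))
# A model tree would instead fit a linear model in each leaf (e.g. M5P in WEKA);
# scikit-learn has no built-in model tree, so none is shown here.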
Data mining (DM) - pattern
C. Rules
To capture associations between attributes, discovered in the
form of association rules.
The rule denotes patterns of the form:
“IF Conjunction of conditions THEN Conclusion.”
- For classification rules, the conclusion assigns one of the
possible discrete values to the class (a finite set of nominal
values): e.g., (yes, no), (spec. A, spec. B, spec. D)
- For predictive rules, the conclusion gives a prediction for the
value of the target (class) variable (infinite set of values): e.g.,
120, 220, 312, …
Data mining (DM) - algorithm
3. What is a data mining algorithm?
An algorithm in general:
- A procedure (a finite set of well-defined instructions) for
accomplishing some task, which terminates in a defined end state.
A data mining algorithm:
- A computational process, defined by a Turing machine
(Gurevich et al. 2000), for finding patterns in data.
Data mining (DM) - algorithm
Which algorithms do we use for
discovering patterns?
The selection of the algorithm depends on the problem at
hand:
1. Equations = Linear and multiple regressions, equation
discovery
2. Decision trees = Top-down induction of decision trees
3. Rules = Rule induction
Data mining (DM) - algorithm
1. Linear and multiple regression
• Bivariate linear regression:
The predicted variable (the class C, which may be continuous or discrete)
is expressed as a linear function of one attribute (A):
C = α + β × A
• Multiple regression:
The predicted variable (the class C) is expressed as a linear function of a
multi-dimensional attribute vector (Ai):
C = Σ (i = 1..n) βi × Ai
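A minimal sketch of both regression forms, assuming Python with scikit-learn (an illustration only, not the equation-discovery tools used later in the lecture), reusing the small out-crossing table shown earlier:

# Sketch: bivariate and multiple linear regression on the out-crossing example.
import numpy as np
from sklearn.linear_model import LinearRegression

# Attributes: distance (m), wind speed (m/s); target: out-crossing rate (%)
X = np.array([[10, 3], [12, 4], [14, 6], [18, 2], [20, 1], [22, 3]], dtype=float)
y = np.array([8, 7, 3, 4, 5, 1], dtype=float)

# Bivariate regression: C = alpha + beta * A, with A = distance only
bivariate = LinearRegression().fit(X[:, [0]], y)
print("alpha =", bivariate.intercept_, "beta =", bivariate.coef_[0])

# Multiple regression: C = intercept + sum_i beta_i * A_i over all attributes
multiple = LinearRegression().fit(X, y)
print("intercept =", multiple.intercept_, "coefficients =", multiple.coef_)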
Data mining (DM) - algorithm
2. Top-down induction of decision trees
The decision tree is induced by Top-Down Induction of
Decision Trees (TDIDT) algorithm (Quinlan, 1986)
Tree construction proceeds recursively starting with the entire
set of training examples (entire table). At each step, an
attribute is selected as the root of the (sub) tree and the
current training set is split into subsets according to the
values of the selected attribute.
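A hedged sketch of this recursive procedure, assuming Python and scikit-learn's CART-style learner as a stand-in for TDIDT (an illustration only; the lecture's case studies use J48 and CLUS):

# Sketch: top-down induction of a classification tree and a printout of the
# resulting hierarchy of attribute tests (internal nodes) and predictions (leaves).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)   # a standard example dataset

# At each node an attribute test is selected and the current training subset is
# split according to its outcomes; recursion stops here at a minimum leaf size.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))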
Data mining (DM) - algorithm
3. Rule induction
A rule that correctly classifies some examples is constructed
first.
The positive examples covered by the rule from the training set
are removed and the process is repeated until no more
examples remain.
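The covering loop described above can be spelled out in a small, hedged Python sketch on a toy, hypothetical dataset (single-condition rules only; this illustrates the idea and is not the rule learners used in the case studies):

# Sketch of sequential covering: repeatedly pick the most precise single
# condition that still covers an uncovered positive example, then remove the
# positive examples it covers.
def learn_rules(examples, target="yes"):
    """examples: list of (attribute_dict, class_label) pairs."""
    rules = []
    remaining = [e for e in examples if e[1] == target]     # uncovered positives
    while remaining:
        best, best_precision = None, -1.0
        for attrs, _ in examples:                            # candidate conditions
            for a, v in attrs.items():
                covered = [e for e in examples if e[0].get(a) == v]
                precision = sum(1 for e in covered if e[1] == target) / len(covered)
                if precision > best_precision and any(e in remaining for e in covered):
                    best, best_precision = (a, v), precision
        rules.append(best)                                   # IF a = v THEN class = target
        remaining = [e for e in remaining if e[0].get(best[0]) != best[1]]
    return rules

data = [({"outlook": "sunny", "windy": "no"}, "yes"),
        ({"outlook": "sunny", "windy": "yes"}, "no"),
        ({"outlook": "overcast", "windy": "yes"}, "yes")]
print(learn_rules(data))   # -> [('windy', 'no'), ('outlook', 'overcast')]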
Data mining (DM) - Statistics
Data mining vs. Statistics
Common to both approaches:
Reasoning FROM properties of a data sample TO properties
of a population.
Data mining (DM) – Machine learning vs. Statistics
Statistics:
Hypothesis testing, when certain theoretical expectations about
the data distribution, independence, random sampling, sample
size, etc. are satisfied.
Main approach: best fit to all the available data.
Data mining:
Automated construction of understandable patterns, and
structured models.
Main approach: structuring the data space, heuristic search for
decision trees, rules, … covering (parts of) the data space.
DATA MINING – CASE STUDIES
DATA MINING – Hands-on exercises
1. Data preprocessing
Applications – ecological modeling
Propositional and relational supervised data mining:
- Simple data mining
- Data mining of time series
- Spatial data mining
1. Equations:
• Algebraic equations
• Differential equations
2. Single and multi target decision trees:
• Classification trees
• Regression trees
• Model trees (single target only)
Applications – ecological modeling
- POPULATION DYNAMICS
- HABITAT MODELLING
- GENE FLOW MODELLING
- RISK MODELLING
Algebraic equations: SOIL WATER FLOW
Problem: Prediction of drainage water
Type of pattern: Algebraic equation
Algorithm: CIPER
PCQE Database
• Experimental site La Jaillière
• Western France
• Owned by ARVALIS
• Shallow silt clay soils
• 11 fields are observed
• Field size about 0.3 – 1 ha
PCQE Database (continued)
• Agricultural practices
- Fertilization
- Irrigation
- Phytochemical protection
- Harvesting
- Tillage
• Slope
• Water flow
- Drainage
- Runoff
• 25 campaigns (1987 - 2011)
• A campaign is defined as the period starting on
01 SEP and finishing on 31 AUG of the following year
DRAINAGE predictive model - CIPER
• Polynomial equations induced on data for a whole
campaign – CIPER algorithm
• Evaluation
- “Leave one out” approach
Fields | Test field | Std. Dev. | RMSE   | RRSE    | Corr. coeff. (r)
All    | T3         | 3.187     | 2.1119 | 66.26 % | 0.7855
All    | T4         | 3.188     | 1.7220 | 54.01 % | 0.8273
All    | T5         | 3.163     | 2.2478 | 71.06 % | 0.7467
All    | T6         | 3.096     | 2.1784 | 70.36 % | 0.8096
All    | T7         | 3.229     | 1.3286 | 41.15 % | 0.7812
All    | T8         | 3.210     | 1.3813 | 43.03 % | 0.7839
All    | T9         | 3.210     | 1.5927 | 49.62 % | 0.7434
All    | T10        | 3.130     | 1.6672 | 53.26 % | 0.7766
All    | T11        | 3.108     | 1.6274 | 52.36 % | 0.7841
T3     | T6         | 3.056     | 2.1251 | 69.54 % | 0.8125
T6     | T3         | 3.600     | 2.0961 | 58.22 % | 0.7745
Predictive models
Model (All/T4)
Drainage = 0.0196445 * RainfallA1 * Temp
+ 0.33246 * DrainageN1
+ 0.000662861 * CDCoef^2 * RainfallA1^2 * Slope
+ 0.00000107253 * Runoff * DrainageN1 * Temp^2 * RainfallA1^2
- 0.00115983 * Runoff^2 * DrainageN1 * Slope^3
- 0.00114057 * Temp^2 * RainfallA1
+ 0.00153725 * RainfallA1^2
+ 1.63563 * Runoff * Slope
- 1.90622 * Runoff
- 0.0231748 * Slope^2 * RainfallA1
+ 0.0654042 * RainfallA1 * Slope
- 0.00755737 * Slope^3 * CDCoef^3 * Runoff^2
+ 0.0675951 * Slope
- 0.146702
Constraint algebraic equations: GENE FLOW
Problem: Prediction of gene flow
Type of pattern: Constraint algebraic equation
Algorithm: Lagramge
Constraint algebraic equations: GENE FLOW
Experimental design: Federal Biological Research Centre (BBA), Germany
[Field design 2000 (diagram): transgenic donor field (GM maize) and adjacent
non-transgenic recipient field with access paths; 96 sampling points on a
coordinate system; sampling of 60 cobs – 2500 kernels; % of outcrossing measured]
Differential equations: COMMUNITY STRUCTURE
Problem: Time dependent ecosystem processes
Type of pattern: Differential equations
Algorithm: Lagramge
Data: 1995 to 2002
[Diagrams of the modelled processes:]
- Phosphorus: water inflow, out-flow, respiration, growth
- Phytoplankton: growth, respiration, sedimentation, grazing
- Zooplankton: feeds on phytoplankton, respiration, mortality
Differential equations: COMMUNITY STRUCTURE
EQUATIONS
- Algebraic: CIPER
- Constraint algebraic: Lagramge
- Differential: Lagramge
SUITABLE for predictions,
UNSUITABLE for interpretation
Decision trees: HABITAT MODELS
Problem: Classification of habitats (suitable, unsuitable)
Type of pattern: Classification decision trees
Algorithm: J4.8
Decision trees: HABITAT MODELS
Observed locations of brown bears (BBs) [point map]
Decision trees: HABITAT MODELS
The training dataset
• Positive examples:
- Locations of bear sightings
(Hunting Association; telemetry)
- Females only
- Using home-range (HR) areas instead of “raw”
locations
- Narrower HR for optimal habitat, wider for maximal
• Negative examples:
- Sampled from the unsuitable part of the study area
- Stratified random sampling
- Different land cover types equally accounted for
Decision trees: HABITAT MODELS
Propositional dataset
1,73,26,0,0,1,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,0,0,0,4123,0,0,0,0,63,211,11,11,11,83,213,213,0,0,4155.
1,62,37,0,0,2,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,1,53,0,3640,0,0,0,-1347,63,211,11,11,11,83,213,213,11,89,3858.
2,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,6,82,0,10404,0,2074,-309,48,0,0,11,11,11,83,83,83,0,20,3862.
2,0,100,0,0,1,76,0,16,71,0,12,0,0,0,0,0,0,0,0,0,0,0,1,6,82,0,7500,0,1661,-319,-942,0,0,11,11,11,0,0,0,0,20,4088.
1,8,91,0,0,1,52,0,59,41,0,0,0,0,0,0,0,4,0,0,0,0,5,1,6,82,0,6500,0,1505,-166,879,9,57,11,11,11,281,281,281,0,20,3199.
4,3,0,86,9,0,75,0,33,67,0,0,0,0,0,0,1,2,0,0,0,0,0,1,2,54,0,0,0,465,-66,-191,4,225,11,31,31,41,72,272,60,619,4013.
1,34,65,0,0,2,51,9,76,9,5,1,4,1,0,1,0,29,0,0,0,0,0,1,2,54,0,3000,0,841,-111,-264,34,220,11,41,41,151,141,112,60,619,3897.
1,100,0,0,0,3,52,0,86,6,3,5,9,6,7,38,40,0,0,0,0,0,0,1,17,64,0,8062,0,932,-603,-71,100,337,11,41,41,171,232,202,4,24,3732.
…
Present: 1
Absent: 0
The model for maximal habitat
The model for optimal habitat
Decision trees: HABITAT MODELS
Map of maximal habitat
(39% SLO territory)
Decision trees: HABITAT MODELS
Map of optimal habitat
(13% SLO territory)
Multi target predictions: COMMUNITY STRUCTURE
Decision trees: COMMUNITY STRUCTURE
Problem: Temporal prediction of height and cover of
crop and weeds
Type of pattern:
a) Multi target regression trees / time series
b) Constraint multi target regression clustering trees
Algorithm: CLUS
Data
• 130 sites, monitored every 7 to 14 days for 5
months (2665 samples: 1322 conventional,
1333 HT OSR observations)
• Each sample (observation) described by 65
attributes
• Original data collected by the Centre for
Ecology and Hydrology, Rothamsted Research
and SCRI within the Farm Scale Evaluation
Program (2000, 2001, 2002)
Results scenario A: Multiple target
regression tree
Target: Avg Crop Covers,
Avg Weed Covers
Excluded attributes: /
Constraints: MinimalInstances = 64.0; MaxSize = 15
Predictive power (Avg Crop Covers, Avg Weed Covers):
Corr. Coef.: 0.8513, 0.3746
RMSE: 16.504, 12.6038
RRMSE: 0.5248, 0.9301
Results scenario B: Constraint predictive clustering trees
for time series
Syntactic constraint
Results scenario B: Constraint predictive clustering trees
for time series
Target: Avg Weed Covers (Time Series)
Scenario 3.9
Constraints: Syntactic, MinInstances = 32
Predictive power:
TSRMSExval:
4.98
TSRMSEtrain:
4.86
ICVtrain:
30.44
Relational data mining: GENE FLOW MODELLING
Relational data mining: GENE FLOW MODELLING
Problem: Classification of fields as above or below
0.9% outcrossing
Type of pattern: Relational classification decision tree
Algorithm: TILDE
Spatial temporal relations
2004:
40 GM fields
7 non-GM fields
181 sampling points
2005:
17 GM fields
4 non-GM fields
127 sampling points
2006:
43 GM fields
4 non-GM fields
112 sampling points
Relational data mining: GENE FLOW MODELLING
Data scattered over several tables or relations:
• A table storing general information on each field
(e.g., area)
• A table storing the cultivation techniques for
each field and each year
• A table storing the relations (e.g., distance)
between fields
Relational data mining: GENE FLOW MODELLING
Relational database system PostGIS
Relational data mining – building model
Relational data analysis:
• Algorithm TILDE (Blockeel and De Raedt, 1998; De
Raedt et al., 2001) => an upgrade of the C4.5 algorithm (Quinlan, 1993) for
classification decision trees
• The algorithm is included in the ACE-ilProlog data mining system
Relational data mining – results
Threshold 0.01%
Threshold 0.45%
Threshold 0.9%
Multi-target regression model: RISK MODELLING
Decision trees: RISK MODELLING
Problem: Prediction of soil exposed to disturbances
Type of pattern: Multi target regression trees (resistance,
resilience)
Algorithm: CLUS
Multi-target regression model: RISK MODELLING
The dataset: soil samples taken at 26 locations throughout Scotland (SCO)
The dataset: the flat data table: 26 rows of 18 data entries
Multi-target regression model: RISK MODELLING
The dataset:
Physical properties: soil texture: sand, silt, clay
Chemical properties: pH, C, N, SOM (soil organic matter)
FAO soil classification: Order and Suborder
Physical resilience: resistance to compression: 1/Cc,
recovery from compression: Ce/Cc, overburden stress: e.g.,
recovery from overburden stress after two day cycles: eg2dc
Biological resilience: heat, copper
Multi-target regression model: RISK MODELLING
Multi-target regression model: RISK MODELLING
Macaulay Institute (Aberdeen): soils data – attributes and
maps:
Approximately 13 000 soil profiles held in database
Descriptions of over 40 000 soil horizons
Decision trees
Single or multiple decision trees
Classification, regression, model trees
Propositional and relational data
Temporal and spatial data
SUITABLE for predictions,
SUITABLE for interpretation
Conclusions
What can data mining
do for you?
Knowledge discovered by analyzing data with DM
techniques can help:
Understand the domain studied
Make predictions/classifications
Support decision processes in environmental
management
Conclusions
What can data mining not do for you?
• The law of information conservation (garbage in, garbage out)
• The knowledge we are seeking to discover has to
come from the combination of data and background
knowledge
• If we have very little data of very low quality and no
background knowledge, no form of data analysis will
help
Conclusions
Side-effects?
• Discovering problems with the data during analysis
– Missing values
– Erroneous values
– Inappropriately measured variables
• Identifying new opportunities
– New problems to be addressed
– Recommendations on what data to collect and how
DATA MINING – data preprocessing
DATA FORMAT
• File extension .arff
• This is a plain text format; files should be edited with
editors such as Notepad, TextPad or WordPad (that
do not add extra formatting information)
• The file consists of
– Title: @relation NameOfDataset
– List of attributes: @attribute AttName AttType
• AttType can be ‘numeric’ or nominal list of categorical
values, e.g., ‘{red, green, blue}’
– Data: @data (in a separate line), followed by the
actual data in comma separated value (.csv) format
DATA MINING – data preprocessing
DATA FORMAT
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
…
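For reference, a hedged sketch of reading such an .arff file outside of WEKA, assuming Python with scipy and pandas (WEKA itself loads .arff files directly; the file name is hypothetical):

# Sketch: load the weather ARFF file into a pandas DataFrame.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("weather.arff")    # hypothetical path to the file above
df = pd.DataFrame(data)

# Nominal values come back as bytes; decode them to plain strings.
for col in df.select_dtypes([object]).columns:
    df[col] = df[col].str.decode("utf-8")

print(meta)         # attribute names and types from the @attribute lines
print(df.head())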
DATA MINING – data preprocessing
• Excel
• Attributes (variables) in columns and cases (examples) in
rows
• Use a decimal POINT and not a decimal COMMA
for numbers
• Save the Excel sheet as a CSV file
DATA MINING – data preprocessing
• TextPad, Notepad
• Open the CSV file
• Delete “ ” at the beginning of lines and save (just
save, don’t change the format)
• Change all ; to ,
• The numbers must have a decimal dot (.) and not a
decimal comma (,)
• Save the file as a CSV file (don't change the format)
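As a hedged alternative to these manual editor steps, the same clean-up can be scripted, assuming Python with pandas (file names are hypothetical; pandas is not part of the WEKA workflow described here):

# Sketch: read a semicolon-separated export with decimal commas and rewrite it
# as a clean, comma-separated CSV with decimal points.
import pandas as pd

df = pd.read_csv("export_from_excel.csv", sep=";", decimal=",")
df.to_csv("clean_for_weka.csv", index=False)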
DATA MINING – Hands-on exercises
2. Data mining
DATA MINING – data preprocessing
• WEKA
http://www.cs.waikato.ac.nz/ml/weka/index.html
• Open the CSV file in WEKA
• Select algorithm and attributes
• Perform data mining
How to select the “best” classification tree?
Performance of the classification tree:
Classification accuracy =
(correctly classified examples) / (all examples)
[ROC curve plot: true positive rate vs. false positive rate]
A confusion matrix is a matrix showing actual
and predicted classifications.
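A short, hedged sketch of both measures, assuming Python with scikit-learn and hypothetical class labels (WEKA reports the same quantities in its classifier output):

# Sketch: classification accuracy and a confusion matrix from actual vs. predicted classes.
from sklearn.metrics import accuracy_score, confusion_matrix

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "no", "yes", "yes"]

print(accuracy_score(actual, predicted))                          # correctly classified / all = 4/6
print(confusion_matrix(actual, predicted, labels=["yes", "no"]))  # rows: actual, columns: predicted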
How to select the “best” classification tree?
Classification trees: J48
Interpretable size:
- Pruned or unpruned
- Minimal number of objects per leaf
The numbers in parentheses at a leaf show:
- the number of instances classified in this leaf
- the number of instances INCORRECTLY classified in this leaf
A leaf annotation can appear as:
(13) – no incorrectly classified instances
or
(3.5/0.5) – due to missing values (?), instances are split into fractional counts
or
(0.0) – a split on a nominal attribute where one or more of the values do not occur in the subset of
instances at the node in question
How to select the “best” regression /
model tree?
The performance of the regression / model tree:
How to select the “best” regression /
model tree?
The interpretable size:
- Pruned or unpruned
- Minimal number of objects per leaf
The number of instances that REACH this leaf
Root mean squared error (RMSE) of the
predictions from the leaf's linear model for the instances
that reach the leaf, expressed as a percentage of the
global standard deviation of the class attribute (i.e. the
standard deviation of the class attribute computed from
all the training data). The percentages do not sum to 100%.
The smaller this value, the better.
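A small, hedged numeric sketch of this error measure, assuming Python with numpy and made-up values (not actual WEKA output):

# Sketch: RMSE of a leaf's predictions expressed as a percentage of the global
# standard deviation of the class attribute.
import numpy as np

all_training_targets = np.array([12.0, 15.0, 9.0, 22.0, 18.0, 11.0, 25.0, 14.0])
leaf_actual    = np.array([12.0, 15.0, 11.0])   # instances that reach one leaf
leaf_predicted = np.array([13.0, 14.0, 12.0])   # the leaf's linear-model predictions

rmse = np.sqrt(np.mean((leaf_actual - leaf_predicted) ** 2))
relative = 100.0 * rmse / np.std(all_training_targets)
print(f"RMSE = {rmse:.3f}, {relative:.1f}% of the global standard deviation")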
Accuracy and error
Avoid overfitting the data by tree pruning.
Pruned trees are:
- Less accurate (percentage of correct classifications) on the
training data
- More accurate when classifying unseen data
How to prune optimally?
Pre-pruning: stop growing the tree early, e.g. when a data split is not
statistically significant or too few examples are in a split
(minimum number of objects in a leaf)
Post-pruning: grow the full tree, then post-prune it (confidence
factor for classification trees)
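A hedged sketch of the two pruning styles using scikit-learn analogues (an assumption for illustration: min_samples_leaf as pre-pruning, cost-complexity pruning as post-pruning; WEKA's J48 instead exposes a confidence factor and a minimum number of objects per leaf):

# Sketch: pre-pruning vs. post-pruning controls on the same dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing early by requiring a minimum number of objects per leaf.
pre_pruned = DecisionTreeClassifier(min_samples_leaf=10).fit(X, y)

# Post-pruning: grow the full tree, then prune it back (cost-complexity pruning).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())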
Optimal accuracy
10-fold cross-validation is a standard classifier evaluation
method used in machine learning:
- Break the data into 10 subsets of size n/10.
- Train on 9 subsets and test on the remaining 1.
- Repeat 10 times and take the mean accuracy.
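A minimal sketch of this procedure, assuming Python with scikit-learn (WEKA performs the equivalent when cross-validation with 10 folds is selected as the test option):

# Sketch: 10-fold cross-validation of a classification tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split into 10 folds, train on 9 and test on 1, repeat 10 times,
# then report the mean accuracy over the 10 test folds.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores.mean())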