Download Lecture 2 - IDA.LiU.se

Outline

An example of data mining

SAS Enterprise Miner
Data mining and statistical learning,
lecture 2
Daily electricity consumption in Sweden
[Figure: time series of daily electricity consumption (MWh); y-axis from 0 to 700000, x-axis dates from 2002 to 2006]
ln daily electricity consumption in Sweden
[Figure: time series of ln daily electricity consumption (MWh); y-axis from 12.0 to 14.0, x-axis dates from 2002 to 2006]
Available data

Daily levels of the total electricity consumption in Sweden
2002-2006

Daily levels of temperature, wind speed, and precipitation at a
large number of weather stations in Sweden

Population in all municipalities in Sweden

Calendar data (Julian day, weekdays, holidays)
Selecting, exploring, and modifying data
Too much weather data!

We assigned a weather station to each municipality, and
computed population-weighted mean values for the
temperature, wind speed and precipitation in the whole of
Sweden
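The population-weighted mean described above can be illustrated with a small example. This is a minimal sketch in Python with NumPy, assuming three hypothetical municipalities; the temperatures and populations are made-up numbers chosen only to show the arithmetic.

```python
import numpy as np

# Hypothetical municipalities: temperature at the assigned weather station
# (degrees C) and number of inhabitants.
station_temp = np.array([-5.0, 0.0, 10.0])
population = np.array([100.0, 300.0, 600.0])

# Population-weighted mean: sum(pop_i * temp_i) / sum(pop_i)
weighted_mean_temp = np.average(station_temp, weights=population)

# Unweighted mean for comparison; it ignores where people actually live.
plain_mean_temp = station_temp.mean()
```

Here the weighted mean is (-500 + 0 + 6000) / 1000 = 5.5, pulled towards the most populous municipality, whereas the unweighted mean is 5/3.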

Then we examined the relationship between the electricity
consumption and the population-weighted weather data
ln daily electricity consumption vs population-weighted mean temperature in Sweden
[Figure: scatter plot of ln daily electricity consumption (MWh), y-axis from 12.3 to 13.4, against population-weighted temperature, x-axis from -30 to 30]
Cubic spline with one knot (at x=1)
[Figure: cubic spline with one knot; x-axis from 0 to 2, y-axis from 0 to 7]
Between knots, the spline function is identical to a third-order polynomial
At the knots, the function and its first two derivatives are continuous
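The two properties above can be checked numerically. The following is a minimal sketch in Python with NumPy (not the SAS used in these slides), assuming a truncated power basis with the knot at x=1. The coefficients are arbitrary illustrative values, not those of the plotted curve.

```python
import numpy as np

KNOT = 1.0  # knot location taken from the slide title


def cubic_spline(x, coef, knot=KNOT):
    """Evaluate b0 + b1*x + b2*x^2 + b3*x^3 + b4*max(x - knot, 0)^3."""
    x = np.asarray(x, dtype=float)
    b0, b1, b2, b3, b4 = coef
    return b0 + b1 * x + b2 * x**2 + b3 * x**3 + b4 * np.maximum(x - knot, 0.0) ** 3


def second_derivative(x, coef, knot=KNOT):
    """Second derivative of the spline above; continuous at the knot."""
    x = np.asarray(x, dtype=float)
    _, _, b2, b3, b4 = coef
    return 2 * b2 + 6 * b3 * x + 6 * b4 * np.maximum(x - knot, 0.0)


def third_derivative(x, coef, knot=KNOT):
    """Third derivative; piecewise constant with a jump of 6*b4 at the knot."""
    x = np.asarray(x, dtype=float)
    _, _, _, b3, b4 = coef
    return 6 * b3 + 6 * b4 * (x > knot)


coef = (1.0, 0.5, -1.0, 0.5, 2.0)  # arbitrary illustrative coefficients
eps = 1e-6
left, right = KNOT - eps, KNOT + eps

# Gaps across the knot shrink with eps for the value and the curvature ...
value_gap = abs(cubic_spline(right, coef) - cubic_spline(left, coef))
curvature_gap = abs(second_derivative(right, coef) - second_derivative(left, coef))
# ... but the third derivative jumps, which is what makes x=1 a knot.
third_jump = float(third_derivative(right, coef) - third_derivative(left, coef))
```

On each side of the knot the function is an ordinary cubic polynomial; only the third derivative differs between the two pieces.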
Some examples of additive models
A nonlinear, additive model
$$Y = \alpha + s_1(X_1) + \dots + s_p(X_p) + \varepsilon$$
A mixed linear and nonlinear, additive model
$$Y = \alpha + s_1(X_1) + \dots + s_p(X_p) + \sum_{j=1}^{q} \beta_j X_{p+j} + \varepsilon$$
Modelling ln daily electricity consumption as a spline function
of the population-weighted mean temperature in Sweden
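proc gam data=mining.electricity;
   /* smoothing spline of the mean temperature with 20 degrees of freedom */
   model lnConsumption = spline(Mean_temp, df=20);
   ID Time(day);
   /* save predicted values and residuals for the residual analysis */
   output out=smhiouttemp pred resid;
run;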
[Figure: observed and fitted ln daily electricity consumption (MWh), y-axis from 12.2 to 13.4, against population-weighted temperature, x-axis from -30 to 30]
Modelling ln daily electricity consumption as a spline function
of the population-weighted mean temperature in Sweden:
residual analysis
[Figure: residuals, y-axis from -0.25 to 0.20, against Julian day, x-axis from 0 to 400]
Modelling ln daily electricity consumption in Sweden
- residual analysis
[Figure: two panels of residuals, y-axis from -0.25 to 0.20, against Julian day, x-axis from 0 to 400]
Panel: Spline of temperature
Panel: Spline of temperature, spline of Julian day, weekday dummies
Modelling ln daily electricity consumption in Sweden
- residual analysis
[Figure: two panels of residuals, y-axis from -0.25 to 0.20, against Julian day, x-axis from 0 to 400]
Panel: Spline of temperature, spline of Julian day, weekday dummies
Panel: Splines of contemporaneous and time-lagged weather data, splines of Julian day and time, weekday and holiday dummies
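The time-lagged covariates and weekday dummies named in the panels can be built along the following lines. This is a minimal sketch in Python with NumPy, assuming a daily series stored in date order. The helper names `lagged` and `weekday_dummies` are hypothetical, and which weekday corresponds to index 0 is an arbitrary choice.

```python
import numpy as np


def lagged(series, lag):
    """Shift a daily series back by `lag` days; the first `lag` values are NaN."""
    series = np.asarray(series, dtype=float)
    if lag == 0:
        return series.copy()
    out = np.full(series.shape, np.nan)
    out[lag:] = series[:-lag]
    return out


def weekday_dummies(day_index):
    """One 0/1 column per weekday 1..6; weekday 0 is the reference level."""
    weekday = np.asarray(day_index) % 7
    return (weekday[:, None] == np.arange(1, 7)).astype(float)


temp = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])  # toy daily values
temp_lag1 = lagged(temp, 1)                # yesterday's temperature
dummies = weekday_dummies(np.arange(8))    # eight consecutive days
```

Rows with NaN in a lagged column would have to be dropped, or the lag filled in, before fitting a model.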
Deviance analysis of the investigated models of
ln daily electricity consumption in Sweden
[Figure: bar chart of deviance by model; y-axis from 0 to 12]
Model                        Deviance
Temp only                    10.233
Temp, Julian day, weekday    3.822
Final model                  0.742

The residual deviance of a fitted model is minus twice its log-likelihood
If the error terms are normally distributed, the deviance is equal to the sum of squared residuals
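The link between the two statements above can be sketched as follows. The derivation assumes independent normal errors with known variance $\sigma^2$, fitted values $\hat\mu_i$, and a deviance measured relative to the saturated model (the one with $\hat\mu_i = y_i$) and multiplied by $\sigma^2$. This notation is introduced here and does not appear in the slides.

$$\ell(\hat\mu) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat\mu_i)^2$$

$$\ell_{\mathrm{sat}} = -\frac{n}{2}\ln(2\pi\sigma^2)$$

$$D = 2\sigma^2\left(\ell_{\mathrm{sat}} - \ell(\hat\mu)\right) = \sum_{i=1}^{n}(y_i - \hat\mu_i)^2$$

Under these assumptions, comparing the deviances of the three models amounts to comparing their sums of squared residuals.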
Modelling ln daily electricity consumption in Sweden:
time series plot of residuals
[Figure: residuals, y-axis from -0.15 to 0.15, against time, x-axis from 0 to 2000]
Model selection in data-rich environments
Divide the given data set into two parts
Training
Test
Use the training set to fit all potential models
Use the test set to validate the fitted models
Model selection and unbiased estimation of the
predictive power of the selected model
Divide the given data set into three parts
Training
Validation
Test
Use the training set to fit all potential models
Use the validation set to select a model
Use the test set to compute an unbiased estimate of the
predictive power of the selected model
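The three-way procedure above can be sketched as follows in Python with NumPy. The data are synthetic, the 60/20/20 split is an assumed proportion, and polynomials of degree 1 to 8 stand in for the "potential models"; none of these choices comes from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.1, n)   # synthetic data

# Random split into training (60%), validation (20%) and test (20%) rows.
idx = rng.permutation(n)
train, valid, test = idx[:360], idx[360:480], idx[480:]


def mse(coefs, rows):
    """Mean squared prediction error of a fitted polynomial on the given rows."""
    return float(np.mean((y[rows] - np.polyval(coefs, x[rows])) ** 2))


# 1. Fit every candidate model on the training set only.
candidates = {deg: np.polyfit(x[train], y[train], deg) for deg in range(1, 9)}
# 2. Select the model with the smallest error on the validation set.
best_degree = min(candidates, key=lambda d: mse(candidates[d], valid))
# 3. Estimate predictive power on the test set, which played no part in 1 or 2.
test_mse = mse(candidates[best_degree], test)
```

The validation error of the selected model tends to be optimistic because it was used for the selection, which is why the test set is kept aside for the final estimate.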
SAS Enterprise Miner
A toolbox for the five elements of data mining, offering:

 Convenient handling of large and complex datasets
 Convenient comparison and assessment of many models
 Widely used procedures for prediction, classification and association analysis
SAS Enterprise Miner
Run the miner
 Import data
 Create a project
 Create a dataflow diagram
 Edit the nodes of the diagram
 Run a diagram
 Assess the results
Write and run SAS code