Regression Analysis


The contents of this chapter are from Chapters 20-23 of the textbook.
The cntry15.sav data set will be used. It contains information on 15 countries, including:

lifeexpf: female life expectancy
Birthrat: births per 1000 population
Both are scale variables.

Linear regression model

Clearly, the points are not randomly scattered over the grid. Instead, there appears to be a pattern.

As birthrate increases, life expectancy decreases.

How should the “best” line be chosen?

The least squares principle is recommended.

Least squares principle



Dependent variable: the variable you wish to predict.
Independent variable: the variable used to make the prediction.
Simple linear regression: a single numerical independent variable X is used to predict the numerical dependent variable Y:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\beta_0$ and $\beta_1$ are regression coefficients and $\varepsilon$ is a random error with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. The parameters $\beta_0$, $\beta_1$ and $\sigma^2$ are unknown.
To fit a data set $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$ to the above model we have

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where
$Y_i$ = dependent variable (sometimes referred to as the response variable)
$X_i$ = independent variable (sometimes referred to as the explanatory variable)
$\beta_0$ = intercept for the population
$\beta_1$ = slope for the population
$\varepsilon_i$ = random error in $Y_i$ for observation $i$; the $\varepsilon_i$ are iid with mean 0 and variance $\sigma^2$.
The least squares method can help us to estimate the regression coefficients $\beta_0$ and $\beta_1$ and the variance of the random error.
The idea of the least squares estimation for $\beta_0$ and $\beta_1$ is to minimize

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \bigl(Y_i - (b_0 + b_1 X_i)\bigr)^2, \qquad \hat{Y}_i = b_0 + b_1 X_i$$

where
$\hat{Y}_i$ = predicted value of Y for observation $i$
$X_i$ = value of X for observation $i$
$b_0$ = sample Y intercept
$b_1$ = sample slope
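The least squares estimates are computed from the following quantities:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)\Bigl(\sum_{i=1}^{n} y_i\Bigr)$$

$$SS_{x} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2$$

$$b_1 = \frac{SS_{xy}}{SS_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

$$s^2 = \hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $\hat{y}_i = b_0 + b_1 x_i$, $i = 1, \ldots, n$.
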
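The least squares formulas can be checked numerically. The following is a minimal sketch in plain Python, assuming the birthrate and life expectancy values as they appear in the Case Summaries table of this chapter; the function name `least_squares` is only illustrative and is not part of SPSS.

```python
# Data transcribed from the Case Summaries table (15 countries, same order).
births = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]


def least_squares(x, y):
    """Return (b0, b1, s2) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the means.
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_x          # sample slope
    b0 = y_bar - b1 * x_bar    # sample intercept
    # Residual variance estimate with n - 2 degrees of freedom.
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return b0, b1, s2


b0, b1, s2 = least_squares(births, life)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, s = {s2 ** 0.5:.3f}")
```

The printed values should agree, up to rounding, with the intercept, slope and standard error of the estimate reported in the SPSS output.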
Linear regression model


The regression model becomes
life expectancy = 90 - (0.70 x birthrate)
This tells us that an increase of 1 in birthrate (one more birth per 1000 population) is associated with a decrease of 0.70 years in predicted female life expectancy.

Coefficients(a)

| Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
| --- | --- | --- | --- | --- | --- |
| (Constant) | 89.985 | 1.765 | | 50.995 | .000 |
| Births per 1000 population | -.697 | .050 | -.968 | -13.988 | .000 |

a. Dependent Variable: Female life expectancy

Prediction and residuals
Case Summaries

| Country | Births per 1000 population | Female life expectancy | Unstandardized Predicted Value | Unstandardized Residual |
| --- | --- | --- | --- | --- |
| Algeria | 31 | 68 | 68.36833 | -.36833 |
| Burkina Faso | 50 | 53 | 55.11929 | -2.11929 |
| Cuba | 18 | 79 | 77.43346 | 1.56654 |
| Equador | 28 | 72 | 70.46028 | 1.53972 |
| France | 13 | 82 | 80.92004 | 1.07996 |
| Mongolia | 34 | 68 | 66.27637 | 1.72363 |
| Namibia | 45 | 63 | 58.60588 | 4.39412 |
| Netherlands | 13 | 81 | 80.92004 | .07996 |
| North Korea | 24 | 72 | 73.24955 | -1.24955 |
| Somalia | 46 | 55 | 57.90856 | -2.90856 |
| Tanzania | 50 | 55 | 55.11929 | -.11929 |
| Thailand | 20 | 71 | 76.03882 | -5.03882 |
| Turkey | 28 | 72 | 70.46028 | 1.53972 |
| Zaire | 45 | 56 | 58.60588 | -2.60588 |
| Zambia | 48 | 59 | 56.51393 | 2.48607 |

Coefficient of Correlation

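It measures the strength of the linear relationship between two numerical variables.

$$r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$$

where

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}, \qquad S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}, \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$$
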
The coefficient of correlation always lies between -1 and 1:

$$-1 \le r \le 1$$

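As a hedged illustration of the correlation formula, the sketch below computes r for birthrate and female life expectancy, assuming the 15 data values listed in the Case Summaries table. The helper name `correlation` is illustrative only.

```python
from math import sqrt

# Data transcribed from the Case Summaries table (15 countries, same order).
births = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]


def correlation(x, y):
    """Sample coefficient of correlation r = cov(X, Y) / (S_X * S_Y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)


r = correlation(births, life)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```

The value of r is negative because life expectancy falls as birthrate rises, and its magnitude should match the R reported in the SPSS Model Summary.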
Prediction and residuals

The coefficient of determination:

$$R^2 = \frac{\text{regression sum of squares}}{\text{total sum of squares}} = \frac{SSR}{SST}$$

ANOVA
Model Summary(b)

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .968(a) | .938 | .933 | 2.537 |

a. Predictors: (Constant), Births per 1000 population
b. Dependent Variable: Female life expectancy

ANOVA(b)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- |
| Regression | 1259.263 | 1 | 1259.263 | 195.653 | .000(a) |
| Residual | 83.671 | 13 | 6.436 | | |
| Total | 1342.933 | 14 | | | |

a. Predictors: (Constant), Births per 1000 population
b. Dependent Variable: Female life expectancy

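The entries of the ANOVA table can be reproduced from the raw data. The sketch below is one way to do it in plain Python, assuming the 15 observations from the Case Summaries table; the adjusted R Square formula used here, 1 - (1 - R^2)(n - 1)/(n - 2), is the standard one for a single predictor and is not stated on the slides.

```python
# Data transcribed from the Case Summaries table (15 countries, same order).
births = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]

n = len(births)
x_bar = sum(births) / n
y_bar = sum(life) / n

# Least squares fit.
ss_x = sum((x - x_bar) ** 2 for x in births)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(births, life))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in births]

# Decomposition of the total sum of squares: SST = SSR + SSE.
sst = sum((y - y_bar) ** 2 for y in life)
ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(life, fitted))

r2 = ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)   # one predictor
f_stat = (ssr / 1) / (sse / (n - 2))        # mean square ratio

print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}")
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, F = {f_stat:.3f}")
```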
Testing hypotheses
about the assumptions



Independence: all of the observations are independent.
Homogeneity of variance: the variance of the distribution of the dependent variable must be the same for all values of the independent variable.
Normality: for each value of the independent variable, the distribution of the dependent variable is normal.

Testing hypotheses
Coefficients(a)

| Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95% CI for B: Lower Bound | 95% CI for B: Upper Bound |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (Constant) | 89.985 | 1.765 | | 50.995 | .000 | 86.173 | 93.797 |
| Births per 1000 population | -.697 | .050 | -.968 | -13.988 | .000 | -.805 | -.590 |

a. Dependent Variable: Female life expectancy

Model Summary(b)

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .968(a) | .938 | .933 | 2.537 |

a. Predictors: (Constant), Births per 1000 population
b. Dependent Variable: Female life expectancy

Testing that the slope is zero

In this example, the sample slope is about -0.70 and its standard error is 0.05, so the value of the t statistic is -0.70/0.05 = -14, and the related p-value is less than 0.0005. We reject the hypothesis that the population slope is zero. There appears to be a linear relationship between 1992 female life expectancy and birthrate.
The 95% confidence interval for the population slope is (-0.805, -0.590).

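The t statistic and the confidence interval for the slope can be reproduced as follows. This is a sketch under stated assumptions: the data are the 15 observations from the Case Summaries table, the standard error of the slope is taken as s / sqrt(SS_x), and the critical value 2.160 (two-sided 95%, 13 degrees of freedom) is hard-coded from a t table rather than computed.

```python
from math import sqrt

# Data transcribed from the Case Summaries table (15 countries, same order).
births = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]

n = len(births)
x_bar = sum(births) / n
y_bar = sum(life) / n
ss_x = sum((x - x_bar) ** 2 for x in births)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(births, life))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

# Standard error of the estimate, then standard error of the slope.
s = sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(births, life)) / (n - 2))
se_b1 = s / sqrt(ss_x)

# t statistic for H0: population slope = 0.
t_stat = b1 / se_b1

# Assumption: tabulated two-sided 95% critical value of t with n - 2 = 13 df.
T_CRIT = 2.160
ci = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

print(f"se(b1) = {se_b1:.3f}, t = {t_stat:.3f}")
print(f"95% CI for the slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the interval does not contain 0, it leads to the same conclusion as the t test at the 5% level.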
Prediction



The regression equation obtained can be used to predict life expectancy from birthrate.
For a country with a birthrate of 30 per 1000 population:
Predicted life expectancy = 89.99 - 0.697 x 30 = 69.08 years

Predicting means and individual observations

The plot on the next slide gives the standard error of the predicted mean life expectancy for different values of birthrate.
The vertical line at 32.9 is the average birthrate for all cases.
The farther a birthrate is from the sample mean, the larger the standard error of the predicted mean.

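The behaviour of the standard error of the predicted mean can be illustrated numerically. The sketch below assumes the usual textbook formulas, which are not written out on the slides: s * sqrt(1/n + (x0 - x_bar)^2 / SS_x) for a predicted mean and s * sqrt(1 + 1/n + (x0 - x_bar)^2 / SS_x) for an individual observation. The data are the 15 observations from the Case Summaries table, and the function names are illustrative.

```python
from math import sqrt

# Data transcribed from the Case Summaries table (15 countries, same order).
births = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]

n = len(births)
x_bar = sum(births) / n
y_bar = sum(life) / n
ss_x = sum((x - x_bar) ** 2 for x in births)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(births, life))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
s = sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(births, life)) / (n - 2))


def predict(x0):
    """Predicted female life expectancy at birthrate x0."""
    return b0 + b1 * x0


def se_predicted_mean(x0):
    """Standard error of the predicted mean at birthrate x0."""
    return s * sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)


def se_individual(x0):
    """Standard error for predicting a single new observation at x0."""
    return s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_x)


print(f"mean birthrate = {x_bar:.1f}")
for x0 in (13, 20, 30, 40, 50):
    print(f"x0 = {x0:2d}: prediction = {predict(x0):6.2f}, "
          f"se(mean) = {se_predicted_mean(x0):.3f}, "
          f"se(individual) = {se_individual(x0):.3f}")
```

The standard error of the predicted mean is smallest at the mean birthrate and grows as x0 moves away from it, while the standard error for an individual observation is always larger because it also includes the error variance of the new observation.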
[Figure: plot of the standard error of the predicted mean]
[Figure: the 95% confidence region for the fitted line]

Statistical diagnostics

Is the model correct?
Are there any outliers?
Is the variance constant?
Is the error normally distributed?

Residuals provide much useful information about the four questions above.
You can't judge the relative size of a residual by looking at its value alone, because the value depends on the unit of the dependent variable. Raw residuals are therefore not convenient to use.
Standardized residuals: divide each residual by the estimated standard deviation of the residuals.
If the distribution of residuals is approximately normal, about 95% of the standardized residuals should be between -2 and 2, and about 99% should be between -2.58 and 2.58. This makes it easy to see whether there are any outliers.
When you compute standardized residuals, all of the observed residuals are divided by the same number.
However, the variability of a residual is not constant for all points; it depends on the value of the independent variable.
The studentized residual takes into account the differences in variability from point to point.
We calculate it by dividing the residual by an estimate of the standard deviation of the residual at that point.
A residual divided by an estimate of the standard deviation of the residual at that point is called its studentized residual.
Studentized residuals make it easier to see violations of the regression assumptions.

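The two kinds of scaled residuals can be computed side by side. The sketch below assumes the 15 observations from the Case Summaries table and the usual simple-regression leverage h_i = 1/n + (x_i - x_bar)^2 / SS_x, so that the estimated standard deviation of residual i is s * sqrt(1 - h_i); this formula is an assumption added here, as the slides do not state it.

```python
from math import sqrt

# Data transcribed from the Case Summaries table (15 countries, same order).
countries = ["Algeria", "Burkina Faso", "Cuba", "Equador", "France",
             "Mongolia", "Namibia", "Netherlands", "North Korea", "Somalia",
             "Tanzania", "Thailand", "Turkey", "Zaire", "Zambia"]
births = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]

n = len(births)
x_bar = sum(births) / n
y_bar = sum(life) / n
ss_x = sum((x - x_bar) ** 2 for x in births)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(births, life))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(births, life)]
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standardized residual: every residual is divided by the same number s.
standardized = [e / s for e in residuals]

# Studentized residual: each residual is divided by its own estimated
# standard deviation, which depends on the leverage of the point.
leverage = [1 / n + (x - x_bar) ** 2 / ss_x for x in births]
studentized = [e / (s * sqrt(1 - h)) for e, h in zip(residuals, leverage)]

for name, z, t in zip(countries, standardized, studentized):
    flag = "  <-- beyond 2 in absolute value" if abs(t) > 2 else ""
    print(f"{name:<13}{z:>10.5f}{t:>10.5f}{flag}")
```

Points far from the mean birthrate have higher leverage, so their studentized residuals are inflated more relative to the standardized ones.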
Case Summaries

| Country | Births per 1000 population | Female life expectancy | Unstandardized Predicted Value | Unstandardized Residual | Standardized Residual | Studentized Residual |
| --- | --- | --- | --- | --- | --- | --- |
| Algeria | 31 | 68 | 68.36833 | -.36833 | -.14518 | -.15039 |
| Burkina Faso | 50 | 53 | 55.11929 | -2.11929 | -.83536 | -.92252 |
| Cuba | 18 | 79 | 77.43346 | 1.56654 | .61749 | .67055 |
| Equador | 28 | 72 | 70.46028 | 1.53972 | .60691 | .63132 |
| France | 13 | 82 | 80.92004 | 1.07996 | .42569 | .48171 |
| Mongolia | 34 | 68 | 66.27637 | 1.72363 | .67940 | .70344 |
| Namibia | 45 | 63 | 58.60588 | 4.39412 | 1.73204 | 1.85005 |
| Netherlands | 13 | 81 | 80.92004 | .07996 | .03152 | .03566 |
| North Korea | 24 | 72 | 73.24955 | -1.24955 | -.49254 | -.51832 |
| Somalia | 46 | 55 | 57.90856 | -2.90856 | -1.14647 | -1.23146 |
| Tanzania | 50 | 55 | 55.11929 | -.11929 | -.04702 | -.05193 |
| Thailand | 20 | 71 | 76.03882 | -5.03882 | -1.98616 | -2.13011 |
| Turkey | 28 | 72 | 70.46028 | 1.53972 | .60691 | .63132 |
| Zaire | 45 | 56 | 58.60588 | -2.60588 | -1.02716 | -1.09715 |
| Zambia | 48 | 59 | 56.51393 | 2.48607 | .97994 | 1.06610 |

Standardized Residuals
Standardized Residual Stem-and-Leaf Plot

| Frequency | Stem | Leaf |
| --- | --- | --- |
| 3.00 | -1 | 019 |
| 4.00 | -0 | 0148 |
| 7.00 | 0 | 0466669 |
| 1.00 | 1 | 7 |

Stem width: 1.00000
Each leaf: 1 case(s)

Checking for normality
[Figure: normal probability plot of the residuals]

If the data are a sample from a normal distribution, you expect the points to fall more or less on a straight line.
You can see that the two largest residuals in absolute value (Thailand and Namibia) are stragglers from the line.
The next slide is a detrended normal plot. If the data are from a normal distribution, the points in the detrended normal plot should fall randomly in a band around 0.

[Figure: detrended normal plot of the residuals]

Testing for normality

Many statistical tests for normality have been proposed; one of them is the Kolmogorov-Smirnov test.

Tests of Normality

| | Kolmogorov-Smirnov(a) Statistic | df | Sig. | Shapiro-Wilk Statistic | df | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| Standardized Residual | .137 | 15 | .200* | .971 | 15 | .866 |

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Checking for constant variance

Residual plot: a plot of the studentized residuals against the estimated (predicted) values.
From the residual plot you can see whether there is any pattern.
When the assumptions are met, the residuals appear to be randomly scattered around a horizontal line through 0.

[Figure: residual plot of studentized residuals against predicted values]

Checking linearity

When the relationship between two variables is not linear, you can sometimes transform the variables to make the relationship linear, for example by taking the logarithm, sine, or exponential.

[Figure: plot of female life expectancy against the natural log of phones per 100]

Multiple Regression Models

Using the country.sav data, you are interested in predicting female life expectancy from:

Urban: percentage of the population living in urban areas
Docs: number of doctors per 10,000 people
Beds: number of hospital beds per 10,000 people
Gdp: per capita gross domestic product in dollars
Radios: radios per 100 people

A linear regression model is

$$\text{Predicted life expectancy} = \text{Constant} + B_1\,\text{urban} + B_2\,\text{docs} + B_3\,\text{beds} + B_4\,\text{gdp} + B_5\,\text{radios}$$

A scatterplot matrix is useful for examining the relationships between the variables.

Scatterplot matrix
[Figure: scatterplot matrix of female life expectancy and the five independent variables]

The relationship between female life expectancy and the percentage of the population living in urban areas appears to be more or less linear.
The other four independent variables appear to be related to female life expectancy, but the relationship is not linear.
We therefore take the natural log of the values of these four independent variables.

Correlation matrix

Correlations (Pearson Correlation)

| | Female life expectancy | Natural log hospital beds/10,000 | Natural log of doctors per 10000 | Natural log of GDP | Natural log of radios per 100 people | Percent urban |
| --- | --- | --- | --- | --- | --- | --- |
| Female life expectancy | 1.000 | .730 | .880 | .836 | .693 | .697 |
| Natural log hospital beds/10,000 | .730 | 1.000 | .711 | .741 | .616 | .576 |
| Natural log of doctors per 10000 | .880 | .711 | 1.000 | .824 | .633 | .763 |
| Natural log of GDP | .836 | .741 | .824 | 1.000 | .716 | .748 |
| Natural log of radios per 100 people | .693 | .616 | .633 | .716 | 1.000 | .579 |
| Percent urban | .697 | .576 | .763 | .748 | .579 | 1.000 |

Sig. (1-tailed): .000 for every pair of distinct variables.
N: 116 for every variable and every pair.

Regression coefficients

The estimated regression model, with coefficients taken from the SPSS output below, is
Y = 40.77 - 0.020 urban + 4.07 lndocs + 1.15 lnbeds + 1.71 lngdp + 1.54 lnradio

Coefficients(a)

| Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
| --- | --- | --- | --- | --- | --- |
| (Constant) | 40.767 | 3.174 | | 12.845 | .000 |
| Natural log hospital beds/10,000 | 1.147 | .749 | .095 | 1.532 | .128 |
| Natural log of doctors per 10000 | 4.069 | .563 | .569 | 7.228 | .000 |
| Natural log of GDP | 1.709 | .616 | .236 | 2.776 | .006 |
| Natural log of radios per 100 people | 1.542 | .686 | .130 | 2.247 | .027 |
| Percent urban | -.020 | .029 | -.045 | -.686 | .494 |

a. Dependent Variable: Female life expectancy

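To show how the fitted multiple regression model would be applied, the sketch below plugs the unstandardized coefficients from the Coefficients table into a small prediction function. The input values in the example call are hypothetical and chosen only for illustration; they do not describe any country in the data. Note that four of the predictors enter the model through their natural logs.

```python
from math import log

# Unstandardized coefficients (B) copied from the Coefficients table.
COEFFICIENTS = {
    "constant": 40.767,
    "ln_beds": 1.147,
    "ln_docs": 4.069,
    "ln_gdp": 1.709,
    "ln_radios": 1.542,
    "urban": -0.020,
}


def predict_life_expectancy(urban, docs, beds, gdp, radios):
    """Predicted female life expectancy from untransformed predictor values.

    urban  : percent of the population living in urban areas
    docs   : doctors per 10,000 people
    beds   : hospital beds per 10,000 people
    gdp    : per capita GDP in dollars
    radios : radios per 100 people
    """
    c = COEFFICIENTS
    return (c["constant"]
            + c["urban"] * urban
            + c["ln_docs"] * log(docs)
            + c["ln_beds"] * log(beds)
            + c["ln_gdp"] * log(gdp)
            + c["ln_radios"] * log(radios))


# Hypothetical inputs, for illustration only.
example = predict_life_expectancy(urban=50, docs=10, beds=30, gdp=2000, radios=20)
print(f"predicted female life expectancy: {example:.2f} years")
```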
SPSS output: model summary statistics

Variables Entered/Removed(b)

| Model | Variables Entered | Variables Removed | Method |
| --- | --- | --- | --- |
| 1 | Percent urban, Natural log hospital beds/10,000, Natural log of radios per 100 people, Natural log of doctors per 10000, Natural log of GDP(a) | . | Enter |

a. All requested variables entered.
b. Dependent Variable: Female life expectancy

Model Summary(b)

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .910(a) | .827 | .819 | 4.742 |

a. Predictors: (Constant), Percent urban, Natural log hospital beds/10,000, Natural log of radios per 100 people, Natural log of doctors per 10000, Natural log of GDP
b. Dependent Variable: Female life expectancy

SPSS output: ANOVA

The regression is statistically significant: the observed significance level is less than 0.0005.
The estimated residual variance (the residual mean square) is 22.489.

ANOVA(b)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- |
| Regression | 11844.633 | 5 | 2368.927 | 105.336 | .000(a) |
| Residual | 2473.807 | 110 | 22.489 | | |
| Total | 14318.440 | 115 | | | |

a. Predictors: (Constant), Percent urban, Natural log hospital beds/10,000, Natural log of radios per 100 people, Natural log of doctors per 10000, Natural log of GDP
b. Dependent Variable: Female life expectancy
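The summary statistics of the multiple regression follow directly from the ANOVA sums of squares. The sketch below recomputes them from the values printed in the table, assuming n = 116 cases and p = 5 predictors as shown in the output; the adjusted R Square formula 1 - (1 - R^2)(n - 1)/(n - p - 1) is the standard one and is not stated on the slides.

```python
from math import sqrt

# Values copied from the ANOVA table of the multiple regression.
SS_REGRESSION = 11844.633
SS_RESIDUAL = 2473.807
N_CASES = 116       # number of countries with complete data
N_PREDICTORS = 5    # urban, lndocs, lnbeds, lngdp, lnradio

ss_total = SS_REGRESSION + SS_RESIDUAL
df_regression = N_PREDICTORS
df_residual = N_CASES - N_PREDICTORS - 1

ms_regression = SS_REGRESSION / df_regression
ms_residual = SS_RESIDUAL / df_residual       # estimated residual variance

f_stat = ms_regression / ms_residual
r2 = SS_REGRESSION / ss_total
adj_r2 = 1 - (1 - r2) * (N_CASES - 1) / df_residual
std_error = sqrt(ms_residual)                 # std. error of the estimate

print(f"F = {f_stat:.3f} on ({df_regression}, {df_residual}) df")
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, s = {std_error:.3f}")
```

Adjusted R Square is slightly smaller than R Square because it penalizes the five predictors used to fit the model.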