USE OF GENERALIZED LINEAR MODEL IN FORECASTING OF AIR PASSENGERS CONVEYANCES FROM EU COUNTRIES
Catherine Zhukovskaya
Faculty of Transport and Mechanical Engineering
Riga Technical University
The 8th Tartu Conference on Multivariate Statistics
Outline
1. Introduction
2. Informative base
3. Used models for analyzing and forecasting of the air passengers’ conveyances
4. Elaboration of linear models
5. Elaboration of generalized linear models
6. Conclusion
7. References
1. Introduction
Most of the literature devoted to forecasting transport flows contains only simple forecasting models based on time series methods [Hünt (2003)] or on linear regression with a small number of explanatory variables [Butkevičius, Vyskupaitis (2005), Šliupas (2006)].
Two different approaches to forecasting air passenger conveyances from EU countries were considered in this investigation:
the classical method of linear regression;
the generalized linear model (GLM).
The aim of this investigation is to illustrate the advantage of the GLM over simple linear regression models.
Verification of the models and estimation of the unknown parameters are included as well.
All calculations were done with Statistica 6.0 and with software developed in MathCad 12.
2. Informative base
The forecasted variable was the number of air passengers carried, expressed in millions of passengers.
Factors
t1 - total population of the country (TP), millions of inhabitants;
t2 - area of the country (AREA), thousands of km2;
t3 - density of the country population (PD), number of inhabitants per km2;
t4 - monthly labour costs (MLC), thousands of euros;
t5 - gross domestic product (GDP) “per capita” in Purchasing Power Standards (PPS) (GDP_PPS);
t6 - gross domestic product (GDP), billions of euro;
t7 - comparative price level (CPL);
t8 - inflation rate (IR);
t9 - unemployment rate (UR);
t10 - labour productivity per hour worked (LPHW).
The following 25 EU countries were selected: Belgium, Czech Republic, Denmark, Germany, Estonia, Greece, Spain, France, Ireland, Italy, Cyprus, Latvia, Lithuania, Luxembourg, Hungary, Malta, Netherlands, Austria, Poland, Portugal, Slovenia, Slovakia, Finland, Sweden and United Kingdom.
The considered period was from 1996 to 2005.
All data for this investigation were obtained from the electronic database of the Statistical Office of the European Communities (EUROSTAT), http://epp.eurostat.ec.europa.eu
The final number of observations was 161:
data for the period from 1996 to 2004 were used for estimation and forecasting - 140 observations;
data for 2005 were used to check the quality of forecasting, the so-called cross-validation (CV) - 21 observations.
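The split described above can be sketched in a few lines. This is only an illustration: the `Observation` record, the function name `split_by_year` and all numeric values are invented for the example and are not taken from the EUROSTAT data.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """One observation: a country in a given year (hypothetical record layout)."""
    country: str
    year: int
    passengers_mln: float  # forecasted variable, millions of passengers


def split_by_year(observations, cv_year=2005):
    """Return (estimation set, cross-validation set) split on the CV year."""
    estimation = [o for o in observations if o.year < cv_year]
    cv = [o for o in observations if o.year == cv_year]
    return estimation, cv


# Toy records with made-up values, only to show the mechanics of the split.
sample = [
    Observation("Latvia", 2003, 0.6),
    Observation("Latvia", 2004, 1.0),
    Observation("Latvia", 2005, 1.9),
    Observation("Estonia", 2004, 1.0),
    Observation("Estonia", 2005, 1.4),
]
estimation, cv = split_by_year(sample)
print(len(estimation), len(cv))
```

Applied to the full data set, the same rule would give the 140 estimation observations and the 21 cross-validation observations mentioned above.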
3. Used models for analyzing and forecasting of the air passengers’ conveyances
Main notions
The data for a particular country in a particular year were taken as one observation.
The main object of consideration was air passenger conveyances from EU countries.
All the considered models were group models [Andronov (1983)].
Classification of regression models according to their mathematical form:
Linear regression models;
Generalized linear regression models (GLM).
The linear regression model [Hardle (2004)]:
$$E\bigl(Y^{(k)}(x)\bigr) = x^{T}\beta, \qquad (1)$$
where:
$Y^{(k)}$ is the dependent variable for the k-th considered model;
$x = (x_1, x_2, \dots, x_d)^{T}$ is the d-dimensional vector of explanatory variables;
$\beta = (\beta_0, \beta_1, \beta_2, \dots, \beta_d)^{T}$ is the coefficient vector that has to be estimated from observations of $Y^{(k)}$ and $x$.
The generalized linear regression model:
$$E\bigl(Y^{(k)}(x)\bigr) = G\{x^{T}\beta\}, \qquad (2)$$
where $G(\cdot)$ is a known function of a one-dimensional variable.
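The difference between (1) and (2) can be illustrated with a minimal sketch. The logistic function is used as one possible choice of G, and the coefficients and the point x are invented for the example; the intercept is handled by treating beta[0] separately, which is an assumption about how the constant term enters.

```python
import math


def linear_mean(x, beta):
    """Model (1): E(Y) = x^T beta, with beta[0] taken as the intercept."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


def glm_mean(x, beta, G):
    """Model (2): E(Y) = G(x^T beta) for a known one-dimensional function G."""
    return G(linear_mean(x, beta))


def logistic(u):
    """One possible G; chosen here only for illustration."""
    return 1.0 / (1.0 + math.exp(-u))


beta = [0.5, 1.0, -2.0]  # illustrative coefficients, not estimates from the study
x = [1.5, 1.0]           # illustrative values of two explanatory variables

eta = linear_mean(x, beta)        # linear predictor: 0.5 + 1.5 - 2.0
mu = glm_mean(x, beta, logistic)  # the same predictor passed through G
print(eta, mu)
```

With G equal to the identity function, `glm_mean` reduces to `linear_mean`, which is why model (1) is a special case of model (2).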
4. Elaboration of linear models
The basic criteria for choosing the best model:
1. Multiple coefficient of determination (R2);
2. Fisher criterion (F);
3. Sum of the squares of the residuals (SSRes);
4. Sum of the squares of the residuals for the cross-validation (CV SSRes).
For checking the statistical hypotheses we always used the significance level α = 0.05.
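MODEL #1
$$Y^{(1)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10},$$
where $Y^{(1)}$ is the total number of air passengers carried;
$x_1 = t_1$, $x_2 = t_2$, $x_3 = t_3$, $x_4 = t_4$, $x_5 = t_5$, $x_6 = t_6$, $x_7 = t_7$, $x_8 = t_8$, $x_9 = t_9$, $x_{10} = t_{10}$.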
Results for the MODEL #1
$$\hat E\bigl(Y^{(1)}(x)\bigr) = 14 - 0.77x_1 + 0.16x_2 + 185.8x_3 - 2.44x_4 + 0.53x_5 + 0.07x_6 + 0.05x_7 + 0.32x_8 - 1.2x_9 - 1.03x_{10}$$
Table 1
Variable    Factor     b        t(129)   p-level
Intercept              14.00    0.84     0.405
x1          TP         -0.77    -1.56    0.121
x2          AREA       0.16     5.60     0.000
x3          PD         185.80   4.67     0.000
x4          MLC        -2.44    -0.44    0.660
x5          GDP_PPS    0.53     1.68     0.096
x6          GDP        0.07     3.81     0.000
x7          CPL        0.05     0.37     0.710
x8          IR         0.32     0.29     0.771
x9          UR         -1.20    -1.59    0.114
x10         LPHW       -1.03    -3.75    0.000
R2 = 0.831
Fisher criterion F = 63.49
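The criteria can be computed directly from observed and fitted values, and the reported F can be cross-checked against the reported R2. The sketch below assumes the usual overall F statistic for a regression with an intercept; the function names and the toy vectors are invented for the example.

```python
def ss_res(y, y_hat):
    """Sum of the squares of the residuals (SSRes)."""
    return sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))


def r_squared(y, y_hat):
    """Multiple coefficient of determination: 1 - SSRes / SSTot."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - ss_res(y, y_hat) / ss_tot


def f_statistic(r2, n, p):
    """Overall F for a model with an intercept, p regressors and n observations."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# Toy check with invented numbers.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
toy_r2 = r_squared(y, y_hat)

# Model #1: n = 140 observations, p = 10 regressors, so n - p - 1 = 129,
# which agrees with the t(129) column of Table 1.
f_model1 = f_statistic(0.831, n=140, p=10)
print(round(toy_r2, 3), round(f_model1, 2))
```

For Model #1 the formula gives a value of about 63.4, which agrees with the reported F = 63.49 up to the rounding of R2. CV SSRes is the same `ss_res` function applied to the 2005 observations instead of the estimation set.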
New factor
t11 (ON) =
0, if the considered country is an old member of the EU;
1, if the considered country is a new one.
MODEL #2
$$Y^{(2)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5,$$
where $Y^{(2)} = Y^{(1)}$;
$x_1 = t_2$, $x_2 = t_3$, $x_3 = t_6$, $x_4 = t_{10}$, $x_5 = t_{11}$.
Results for the MODEL #2
$$\hat E\bigl(Y^{(2)}(x)\bigr) = 13.56 + 0.09x_1 + 134.01x_2 + 0.05x_3 - 0.68x_4 + 29.36x_5.$$
Table 2
Variable    Factor     b        t(134)   p-level
Intercept              13.56    2.45     0.016
x1          AREA       0.09     4.45     0.000
x2          PD         134.01   4.32     0.000
x3          GDP        0.05     10.34    0.000
x4          LPHW       -0.68    -5.12    0.000
x5          ON         29.36    4.21     0.000
R2 = 0.829
Fisher criterion F = 129.85
Modifications of factors
$\sqrt{t_1}$, $t_1^2$, $\sqrt{t_2}$, $t_2/t_1$, $\sqrt{t_2}/t_1$, $t_6/t_1$, and two further combinations of $t_6$, $t_1$ and $t_2$.
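The modified factors that are named in Tables 3-5 (sqrt(TP), sq(TP), sqrt(AREA), AREA/TP, sqrt(AREA)/TP, GDP/TP) are simple transformations of t1, t2 and t6. A minimal sketch, with a hypothetical function name and invented input values:

```python
import math


def derived_factors(tp, area, gdp):
    """Build the modified factors from t1 (TP), t2 (AREA) and t6 (GDP).

    tp   - total population, millions of inhabitants
    area - area of the country, thousands of km2
    gdp  - gross domestic product, billions of euro
    """
    return {
        "sqrt(TP)": math.sqrt(tp),
        "sq(TP)": tp ** 2,
        "sqrt(AREA)": math.sqrt(area),
        "AREA/TP": area / tp,
        "sqrt(AREA)/TP": math.sqrt(area) / tp,
        "GDP/TP": gdp / tp,
    }


# Invented values chosen so that the results are easy to verify by hand.
factors = derived_factors(tp=4.0, area=64.0, gdp=20.0)
for name, value in factors.items():
    print(name, value)
```

Such transformations let a model that is linear in its coefficients capture nonlinear dependence on population and area.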
MODEL #3
$$Y^{(3)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5,$$
where $Y^{(3)} = Y^{(1)}$;
$x_1 = t_3$, $x_2 = t_6$, $x_3 = t_{10}$, $x_4 = t_1^2$, $x_5 = \sqrt{t_2}$.
Results for the MODEL #3
$$\hat E\bigl(Y^{(3)}(x)\bigr) = -6.34 + 113.26x_1 + 0.14x_2 - 0.52x_3 - 0.03x_4 + 3.03x_5$$
Table 3
Variable    Factor       b        t(134)   p-level
Intercept                -6.34    -1.05    0.296
x1          PD           113.26   4.00     0.000
x2          GDP          0.14     10.66    0.000
x3          LPHW         -0.52    -5.80    0.000
x4          sq(TP)       -0.03    -7.56    0.000
x5          sqrt(AREA)   3.03     5.74     0.000
R2 = 0.867
Fisher criterion F = 174.08
Analysis of observed and predicted values for the MODEL #3
Figure 1. Plot of observed and predicted values (series: Observed, Predicted).
Figure 2. Plot of observed and predicted values for the CV (series: CVObserved, CVPredicted).
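MODEL #4
$$Y^{(4)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9,$$
where $Y^{(4)} = Y^{(1)}/t_1$ is the ratio between the total number of air passengers carried and the number of inhabitants of the country;
$x_1 = t_2$, $x_2 = t_3$, $x_3 = t_4$, $x_4 = t_6$, $x_5 = t_{11}$, $x_6 = \sqrt{t_1}$, $x_7 = \sqrt{t_2}$, $x_8 = t_2/t_1$, $x_9 = t_6/t_1$.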
Results for the MODEL #4
$$\hat E\bigl(Y^{(4)}(x)\bigr) = 0.56 + 2.33x_1 - 1.04x_2 - 0.02x_3 + 0.001x_4 + 1.76x_5 - 0.0004x_6 + 0.04x_7 + 0.17x_8.$$
Table 4
Variable    Factor          b        t(131)   p-level
Intercept                   -5.67    -6.25    0.000
x1          AREA            -0.02    -6.73    0.000
x2          PD              10.37    6.19     0.000
x3          MLC             -0.73    -4.19    0.000
x4          ON              0.83     8.30     0.000
x5          sqrt(TP)        -1.02    -7.32    0.000
x6          sqrt(AREA)      1.06     7.10     0.000
x7          AREA/TP         -0.12    -6.98    0.000
x8          sqrt(AREA)/TP   0.94     5.84     0.000
x9          GDP/TP          0.15     6.28     0.000
R2 = 0.760
Fisher criterion F = 45.81
New factor
t12 (HL) =
0, if the value $y/t_1$ for the considered country is small (less than 2);
1, if the value $y/t_1$ is larger than 2.
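The indicator HL can be coded as a simple threshold rule. The sketch below is illustrative: the function name and the sample values are invented, and the case where the ratio is exactly 2 is not specified above, so assigning it to 0 is an assumption.

```python
def hl_indicator(passengers_mln, population_mln, threshold=2.0):
    """Return t12 (HL): 1 if passengers per inhabitant exceed the threshold, else 0.

    A ratio exactly equal to the threshold is mapped to 0 (an assumption).
    """
    ratio = passengers_mln / population_mln
    return 1 if ratio > threshold else 0


low = hl_indicator(9.0, 10.0)   # ratio 0.9, below the threshold
high = hl_indicator(25.0, 5.0)  # ratio 5.0, above the threshold
print(low, high)
```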
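MODEL #5
$$Y^{(5)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8,$$
where $Y^{(5)} = Y^{(4)}$;
$x_1 = t_4$, $x_2 = t_5$, $x_3 = t_8$, $x_4 = t_9$, $x_5 = t_{10}$, $x_6 = t_{11}$, $x_7 = t_{12}$, $x_8 = t_6/t_1$.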
Results for the MODEL #5
$$\hat E\bigl(Y^{(5)}(x)\bigr) = 0.99 - 0.46x_1 - 0.02x_2 - 0.02x_3 - 0.02x_4 + 0.01x_5 + 1.27x_6 + 1.15x_7 + 0.07x_8$$
Table 5
Variable    Factor     b        t(131)   p-level
Intercept              0.99     3.93     0.000
x1          MLC        -0.46    -3.41    0.001
x2          GDP_PPS    -0.02    -3.81    0.000
x3          IR         -0.02    -1.33    0.187
x4          UR         -0.02    -1.90    0.056
x5          LPHW       0.01     3.72     0.000
x6          ON         1.27     9.21     0.000
x7          HL         1.15     15.30    0.000
x8          GDP/TP     0.07     3.41     0.001
R2 = 0.864
Fisher criterion F = 104.174
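Pivot results for the linear regression models
Table 6
Model   R2      R1   F        R2   SSRes    R3   CV SSRes   R4   Sum R   Total R
#1      0.831   3    63.49    4    52 651   5    114 885    5    17      5
#2      0.829   4    129.85   2    53 344   5    109 723    4    15      3
#3      0.867   1    174.10   1    41 599   2    49 450     1    5       1
#4      0.760   5    45.81    5    35 064   3    57 310     3    16      4
#5      0.864   2    104.20   3    12 775   1    51 448     2    8       2
The columns R1, R2, R3 and R4 give the rank of each model by the criterion to their left (the coefficient of determination, F, SSRes and CV SSRes respectively); Sum R is the sum of the four ranks and Total R is the resulting overall rank.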
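The last two columns of Table 6 follow from the four rank columns by simple arithmetic. The sketch below takes the ranks R1-R4 exactly as printed in Table 6 (it does not recompute them from the criteria) and derives the rank sum and the overall rank; the variable names are invented.

```python
# Ranks R1..R4 of each model as printed in Table 6
# (by coefficient of determination, F, SSRes and CV SSRes).
ranks = {
    "#1": (3, 4, 5, 5),
    "#2": (4, 2, 5, 4),
    "#3": (1, 1, 2, 1),
    "#4": (5, 5, 3, 3),
    "#5": (2, 3, 1, 2),
}

# Sum R: the sum of the four ranks for each model.
rank_sum = {model: sum(r) for model, r in ranks.items()}

# Total R: position of the model when ordered by increasing rank sum.
ordered = sorted(rank_sum, key=rank_sum.get)
total_rank = {model: position + 1 for position, model in enumerate(ordered)}

for model in ranks:
    print(model, rank_sum[model], total_rank[model])
```

There are no ties among the rank sums here, so the ordering is unambiguous.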
Analysis of observed and predicted values for the MODEL #5
Figure 3. Plot of recalculated observed and predicted values (series: RObserved, RPredicted).
Figure 4. Plot of recalculated observed and predicted values for the CV (series: RObserved, RCVPredicted).
5. Elaboration of generalized linear models
For further investigation the best linear regression model (Model #5) was chosen.
Two different GLMs were considered. In both of them the regressand Y(GLM) = Y(5) / t1 and the set of regressors are the same as for Model #5.
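GLM1
$$E\bigl(Y^{GLM1}(x_i)\bigr) = h_i\,\frac{\exp\bigl(\sum_j \beta_j x_{i,j}\bigr)}{1 + \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)}, \qquad (3)$$
where $h_i$ is the total population, $x_i$ is the column vector of the independent variables, and $i$ is the observation number, $i = 1, 2, \dots, n$.
GLM2
$$E\bigl(Y^{GLM2}(x_i)\bigr) = h_i\,\frac{1}{a + \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)}, \qquad (4)$$
where $a$ is an additional parameter (constant).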
To estimate the unknown parameter vector we used the least squares criterion
$$R_0(\beta) = \sum_{i=1}^{n}\bigl(Y_i - \hat Y_i\bigr)^2 \to \min_{\beta}, \qquad (5)$$
where $Y_i$ and $\hat Y_i$ are the observed and calculated values of Y.
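The mean functions (3) and (4) and the criterion (5) translate into code almost literally. In this minimal sketch the function names are invented, and the vector x is assumed to carry a leading 1.0 so that beta[0] acts as the intercept; the numbers are chosen only so that the results can be checked by hand.

```python
import math


def linear_predictor(x, beta):
    """Sum over j of beta_j * x_j (x is assumed to start with 1.0 for the intercept)."""
    return sum(b * xj for b, xj in zip(beta, x))


def glm1_mean(x, beta, h):
    """Formula (3): h * exp(eta) / (1 + exp(eta))."""
    e = math.exp(linear_predictor(x, beta))
    return h * e / (1.0 + e)


def glm2_mean(x, beta, h, a):
    """Formula (4): h / (a + exp(eta))."""
    return h / (a + math.exp(linear_predictor(x, beta)))


def r0(y, y_hat):
    """Criterion (5): sum of squared differences between observed and calculated values."""
    return sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))


x = [1.0, 3.0]     # leading 1.0 is the assumed intercept column
beta = [0.0, 0.0]  # zero coefficients give exp(eta) = 1
h = 10.0

m1 = glm1_mean(x, beta, h)         # 10 * 1 / (1 + 1)
m2 = glm2_mean(x, beta, h, a=1.0)  # 10 / (1 + 1)
criterion = r0([5.0, 7.0], [m1, m2])
print(m1, m2, criterion)
```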
1. Linearization
LM1
$$\ln\frac{Y^{*}}{1 - Y^{*}} = \sum_j \beta_j x_{i,j} \qquad (6)$$
LM2
$$\ln\left(\frac{1}{Y^{*}} - a\right) = \sum_j \beta_j x_{i,j} \qquad (7)$$
where $Y^{*} = Y/h$.
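After the transformations (6) and (7) the coefficients can be estimated by ordinary least squares on the transformed response. The sketch below uses NumPy and synthetic, noise-free data generated for the example (not the study data), so that the recovery of the coefficients can be checked exactly. The function names are invented. Note that (6) requires 0 < Y* < 1 and (7) requires 1/Y* > a.

```python
import numpy as np


def fit_lm1(X, y, h):
    """OLS fit of the linearized GLM1: ln(Y*/(1 - Y*)) = X beta, with Y* = y / h."""
    y_star = y / h
    z = np.log(y_star / (1.0 - y_star))
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta


def fit_lm2(X, y, h, a):
    """OLS fit of the linearized GLM2: ln(1/Y* - a) = X beta, with Y* = y / h."""
    y_star = y / h
    z = np.log(1.0 / y_star - a)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta


# Synthetic, noise-free data (invented for the illustration).
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept column + one regressor
h = rng.uniform(1.0, 80.0, size=n)                      # stand-in for population
beta_true = np.array([-1.0, 0.5])
eta = X @ beta_true

y_glm1 = h * np.exp(eta) / (1.0 + np.exp(eta))
y_glm2 = h / (2.0 + np.exp(eta))

beta_lm1 = fit_lm1(X, y_glm1, h)
beta_lm2 = fit_lm2(X, y_glm2, h, a=2.0)
print(beta_lm1, beta_lm2)
```

On noise-free data both fits reproduce the generating coefficients. With real data the least squares fit on the transformed scale does not minimize criterion (5) on the original scale, which is one reason a linearized estimate may serve better as a starting point than as a final answer.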
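The models LM1 and LM2 give the following estimates for E(Y):
$$\hat E\bigl(Y^{LM1}(x)\bigr) = h\,\frac{e^{\eta_1(x)}}{1 + e^{\eta_1(x)}}, \qquad \hat E\bigl(Y^{LM2}(x)\bigr) = h\,\frac{1}{0.3 + e^{\eta_2(x)}},$$
where the signs of the terms in the exponents are not legible, so only the magnitudes of the coefficients are given:
$$\eta_1(x) = \pm 13.78 \pm 0.001x_1 \pm 6.68x_2 \pm 0.02x_3 \pm 0.7x_4 \pm 48.8x_5 \pm 0.44x_6 \pm 0.29x_7 \pm 7.81x_8 \pm 0.64x_9,$$
$$\eta_2(x) = \pm 11.65 \pm 1.63x_1 \pm 1.7x_2 \pm 0.04x_3 \pm 0.81x_4 \pm 17.96x_5 \pm 1.67x_6 \pm 0.2x_7 \pm 0.41x_8 \pm 0.11x_9.$$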
The values of SSRes and CV SSRes for the Model #5 and LM
Table 7
          SSRes                            CV SSRes
          Model #5   LM1      LM2          Model #5   LM1       LM2
R0/n      12 775     27 447   21 834       51 448     676 576   229 554
We can see that linearization gives poor results. To improve them, a two-stage estimation procedure was developed.
The first stage is the linearization considered above. The second stage is a calibration procedure, in which the estimates obtained at the first stage are refined by the well-known gradient method.
2. Calibration
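Gradients for the least squares criterion
GLM1
$$\nabla R_0(\beta) = -2\sum_{i=1}^{n}\left(Y_i - h_i\,\frac{\exp\bigl(\sum_j \beta_j x_{i,j}\bigr)}{1 + \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)}\right)\frac{h_i \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)\,x_i}{\Bigl(1 + \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)\Bigr)^{2}} \qquad (8)$$
GLM2
$$\nabla R_0(\beta) = 2\sum_{i=1}^{n}\left(Y_i - h_i\,\frac{1}{a + \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)}\right)\frac{h_i \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)\,x_i}{\Bigl(a + \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)\Bigr)^{2}} \qquad (9)$$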
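The calibration stage can be sketched for GLM1 as plain gradient descent on the least squares criterion. This is an illustration under several assumptions: the data are synthetic, the starting point is an arbitrary perturbed vector rather than a linearized estimate, and the step size is chosen by simple halving (backtracking), which the source does not specify. The gradient is the one obtained by differentiating criterion (5) for model (3) and is verified against finite differences.

```python
import numpy as np


def logistic(eta):
    """exp(eta) / (1 + exp(eta)), written via tanh to avoid overflow for large |eta|."""
    return 0.5 * (1.0 + np.tanh(0.5 * eta))


def r0(X, y, h, beta):
    """Least squares criterion for GLM1."""
    return float(np.sum((y - h * logistic(X @ beta)) ** 2))


def grad_r0(X, y, h, beta):
    """Gradient of r0; uses exp(eta) / (1 + exp(eta))^2 = p * (1 - p)."""
    p = logistic(X @ beta)
    weights = (y - h * p) * h * p * (1.0 - p)
    return -2.0 * (X.T @ weights)


def calibrate(X, y, h, beta_start, n_iter=500):
    """Gradient descent with step halving; every accepted step lowers r0."""
    beta = np.array(beta_start, dtype=float)
    for _ in range(n_iter):
        g = grad_r0(X, y, h, beta)
        current = r0(X, y, h, beta)
        step = 1.0
        while step > 1e-12 and r0(X, y, h, beta - step * g) >= current:
            step *= 0.5
        if step <= 1e-12:  # no improving step found, stop
            break
        beta = beta - step * g
    return beta


# Synthetic data (invented for the illustration).
rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
h = rng.uniform(1.0, 80.0, size=n)
beta_true = np.array([-1.0, 0.5])
y = h * logistic(X @ beta_true) + rng.normal(scale=0.5, size=n)

beta_start = np.array([-0.5, 0.2])  # stand-in for a first-stage estimate
beta_cal = calibrate(X, y, h, beta_start)
r0_start = r0(X, y, h, beta_start)
r0_cal = r0(X, y, h, beta_cal)

# Finite-difference check of the analytic gradient at the starting point.
eps = 1e-6
fd = np.array([
    (r0(X, y, h, beta_start + eps * e) - r0(X, y, h, beta_start - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
grad_ok = np.allclose(fd, grad_r0(X, y, h, beta_start), rtol=1e-4, atol=1e-4)
print(r0_start, r0_cal, grad_ok)
```

The same scheme applies to GLM2 with the mean h / (a + exp(eta)) and the corresponding gradient; the parameter a can either be fixed during the descent or scanned over a grid.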
For GLM2 we searched for the optimum value of R0 not only over the values of β but over the parameter a as well.
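The GLM1 and GLM2 have the following estimates for E(Y):
$$\hat E\bigl(Y^{GLM1}(x)\bigr) = h\,\frac{e^{\eta_1(x)}}{1 + e^{\eta_1(x)}}, \qquad \hat E\bigl(Y^{GLM2}(x)\bigr) = h\,\frac{1}{6.3 + e^{\eta_2(x)}},$$
where the signs of the terms in the exponents are not legible, so only the magnitudes of the coefficients are given:
$$\eta_1(x) = \pm 7.05 \pm 1.05x_1 \pm 1.22x_2 \pm 0.02x_3 \pm 0.76x_4 \pm 5.77x_5 \pm 1.26x_6 \pm 0.11x_7 \pm 0.68x_8 \pm 0.15x_9,$$
$$\eta_2(x) = \pm 7.26 \pm 1.09x_1 \pm 0.78x_2 \pm 0.02x_3 \pm 0.82x_4 \pm 7.81x_5 \pm 1.12x_6 \pm 0.1x_7 \pm 0.13x_8 \pm 0.06x_9.$$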
Table 8
          CV SSRes
          Model #5   GLM1     GLM2
R0/n      51 447     47 807   34 567
Analysis of observed and predicted values for the GLM
Figure 5. Plot of observed and predicted values (series: Robserved, GLM1, GLM2).
Figure 6. Plot of observed and predicted values for the CV (series: CV Observed, CV GLM1, CV GLM2).
Dependence of the values of SSRes and CV SSRes on the value of the parameter a for GLM2
Figure 7. The values of SSRes and CV SSRes as a function of the parameter a for GLM2 (a from 1 to 10).
The optimal value of SSRes was obtained at a = 2.
The best result for CV SSRes was obtained at a = 6.
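A scan over the parameter a of this kind can be sketched as follows. This is a simplified illustration: for each candidate a the coefficients are fitted by the linearization stage only (without calibration), the data are synthetic and noise-free, and the true value a = 6 is set by construction for the example. The function names are invented.

```python
import numpy as np


def fit_lm2(X, y, h, a):
    """Linearized GLM2 fit: OLS on ln(h / y - a), i.e. ln(1/Y* - a) with Y* = y / h."""
    z = np.log(h / y - a)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta


def ss_res_glm2(X, y, h, beta, a):
    """Sum of squared residuals of GLM2 on the original scale."""
    return float(np.sum((y - h / (a + np.exp(X @ beta))) ** 2))


# Synthetic data generated from GLM2 with a = 6 (invented for the illustration).
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, size=n)])
h = rng.uniform(1.0, 80.0, size=n)
a_true = 6.0
y = h / (a_true + np.exp(X @ np.array([2.5, 1.0])))

# Scan a = 1, ..., 10 and record SSRes for each value.
profile = {}
for a in range(1, 11):
    beta_a = fit_lm2(X, y, h, float(a))
    profile[a] = ss_res_glm2(X, y, h, beta_a, float(a))

best_a = min(profile, key=profile.get)
print(best_a)
```

Evaluating `ss_res_glm2` on held-out observations instead of the estimation set gives the corresponding CV SSRes curve, and the two curves need not reach their minimum at the same value of a.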
6. Conclusion
Linear and generalized linear regression models for forecasting air passenger conveyances from EU countries were considered. These models contain a large number of explanatory factors and their combinations.
Standard procedures were used to estimate the unknown parameters of the linear regression models. For estimating the unknown parameters of the GLM, a special two-stage procedure was developed.
The cross-validation approach was taken as the main procedure for checking the adequacy of all considered models and for choosing the best model for forecasting.
The advantage of applying the GLM has been shown.
7. References
1. Andronov A.M. et al. Forecasting of air passengers conveyances on the transport. // Transport, Moscow, 1983. (In Russian).
2. Butkevičius J., Vyskupaitis A. Development of passenger transportation by Lithuanian sea transport. // In Proceedings of International Conference RelStat’04, Transport and Telecommunication, Vol. 6, No 2, 2005.
3. Hardle W., Muller M., Sperlich S., Werwatz A. Nonparametric and Semiparametric Models. Springer, Berlin, 2004.
4. Hünt U. Forecasting of railway freight volume: approach of Estonian railway to arise efficiency. // In TRANSPORT – 2003, Vol. XXVIII, No 6, pp. 255-258.
5. Šliupas T. Annual average daily traffic forecasting using different techniques. // In TRANSPORT – 2006, Vol. XXI, No 1, pp. 38-43.
6. EUROSTAT YEARBOOK 2005. The statistical guide to Europe. Data 1993–2004. EU, EuroSTAT, 2005. URL: http://epp.eurostat.ec.europa.eu
THANK YOU FOR YOUR ATTENTION