xxxxxxxx xxxxxxxx
Regression Analysis – Student Project
January 2013
TOTAL MEDALS WON AT THE LONDON 2012 SUMMER OLYMPICS
Objective:
The London 2012 Summer Olympics featured an array of athletes with varying abilities from 204
countries around the globe. After 3 weeks of competition, all participating athletes returned to their
bases – some with medals to show for their participation and others satisfied with the thrill of having
been a part of the games. Unquestionably, a highly captivating part of this global event is the medal
count. So which factors are statistically significant predictors of the number of medals won? Numerous
factors could be examined, but for this project I examined the significance of the following factors:
1. The gender of the participating athlete
2. The total number of athletes representing each country
3. The population of each country
4. The number of athletes per country in proportion to the population of that country
5. The gross domestic product (GDP) of each country
Data:
The data used in this analysis was obtained from the archives of the U.K. newspaper The Guardian
(http://www.guardian.co.uk/sport/datablog/2012/jul/30/olympics-2012-alternative-medal-table#data).
The data contains the total number of medals won by each country, along with each of the five variables
listed above per country.
Analysis:
The approach I took in this analysis is to run a regression using the Total Medals as the response variable
and the five variables listed above as the explanatory variables. The classical regression model was
employed.
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_5 X_5$$
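The model above is fitted by least squares: the coefficients are chosen to minimise the residual sum of squares (RSS), the quantity that the F-tests in this report compare. A minimal pure-Python sketch of such a fit follows, assuming a tiny made-up data set rather than the Olympic data. `fit_ols` and `residual_ss` are hypothetical helper names, and this is not the tool used to produce the tables in this report.

```python
def fit_ols(rows, y):
    """Return [alpha, beta_1, ..., beta_k] that minimise the residual sum of squares."""
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k+1) matrix
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def residual_ss(rows, y, coef):
    """Sum of squared differences between observed and fitted values."""
    fitted = [coef[0] + sum(b * x for b, x in zip(coef[1:], r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


# Made-up illustration data (NOT the Olympic data): y = 1 + 2*x1 + 3*x2 exactly
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coef = fit_ols(rows, y)
print([round(c, 6) for c in coef], round(residual_ss(rows, y, coef), 6))
```

Because the illustration data lie exactly on a plane, the fit recovers the generating coefficients and the RSS is zero up to rounding; with real data the RSS is positive and is what the ANOVA tables list in the Residual row.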
Once the regression for the full model was complete, I excluded certain variables and tested the
significance of those variables using the F-Test:
𝐹=
𝑅𝑆𝑆 π‘Ÿπ‘’π‘‘π‘’π‘π‘’π‘‘ π‘šπ‘œπ‘‘π‘’π‘™βˆ’π‘…π‘†π‘† 𝑓𝑒𝑙𝑙 π‘šπ‘œπ‘‘π‘’π‘™
𝑑𝑓 𝑓𝑒𝑙𝑙 π‘šπ‘œπ‘‘π‘’π‘™βˆ’π‘‘π‘“ π‘Ÿπ‘’π‘‘π‘’π‘π‘’π‘‘ π‘šπ‘œπ‘‘π‘’π‘™
𝑅𝑆𝑆 𝑓𝑒𝑙𝑙 π‘šπ‘œπ‘‘π‘’π‘™
𝑑𝑓 𝑓𝑒𝑙𝑙 π‘šπ‘œπ‘‘π‘’π‘™
The result from the F-test was then compared to a calculated critical value at the 5% significance level
(95% confidence) to determine whether a stated null hypothesis can or cannot be rejected.
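The F statistic depends on which degrees of freedom are plugged in, so the sketch below takes both as explicit arguments. It uses the RSS values that the results tables in this report give for the full model and for the model without Male Athletes. The first call uses the df entries listed in the report's comparison tables (5 and 4); the second uses the textbook form of the partial F-test, in which the numerator df is the number of excluded terms and the denominator df is the residual df of the full model (198). `f_statistic` is a hypothetical helper name.

```python
def f_statistic(rss_reduced, rss_full, df_num, df_den):
    """((RSS_reduced - RSS_full) / df_num) / (RSS_full / df_den)."""
    return ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)


RSS_FULL = 3546.052951     # Residual SS, full model
RSS_NO_MALE = 3584.099239  # Residual SS, model without Male Athletes

# Degrees of freedom as listed in the report's comparison table (5 and 4)
f_table_df = f_statistic(RSS_NO_MALE, RSS_FULL, df_num=5 - 4, df_den=5)

# Textbook partial F-test: residual df of the reduced (199) and full (198) models
f_residual_df = f_statistic(RSS_NO_MALE, RSS_FULL, df_num=199 - 198, df_den=198)

print(f"F with table df:    {f_table_df:.6f}")
print(f"F with residual df: {f_residual_df:.4f}")
```

The first value reproduces the F-Test figure reported for the Male Athletes exclusion; the second is the statistic under the residual-df convention and should be compared with a critical value from an F distribution with 1 and 198 degrees of freedom.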
Results:
Full Model

Regression Statistics
Multiple R           0.950327626
R Square             0.903122598
Adjusted R Square    0.900676198
Standard Error       4.231944983
Observations         204

ANOVA
              df     SS             MS            F            Significance F
Regression    5      33057.45685    6611.49137    369.16406    2.80758E-98
Residual      198    3546.052951    17.9093583
Total         203    36603.5098

                                  Coefficients   Standard Error   t Stat       P-value     Lower 95%
Intercept                         -1.092093      0.385325         -2.834214    0.005070    -1.851960
Male Athletes                     -0.030457      0.020896         -1.457525    0.146556    -0.071664
Female Athletes                    0.224253      0.024654          9.095911    0.000000     0.175635
GDP.2011 (Billions)                0.003079      0.000366          8.417454    0.000000     0.002358
Population.2010 (Millions)         0.008482      0.002629          3.226761    0.001465     0.003298
Athletes Per Population (100K)     0.034970      0.063815          0.547980    0.584323    -0.090875
The Adjusted R Square value of 0.9007 indicates that, after adjusting for degrees of freedom, the
explanatory variables explain about 90% of the variation in total medals, so the regression fits the
data well.
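The summary statistics of the full model can be recomputed from the ANOVA sums of squares, which makes explicit how the adjustment for degrees of freedom enters. The sketch below assumes only the Regression and Residual SS, the 204 observations and the 5 explanatory variables reported for the full model; the variable names are illustrative.

```python
import math

SS_REGRESSION = 33057.45685  # Regression SS, full model
SS_RESIDUAL = 3546.052951    # Residual SS, full model
N_OBS = 204
N_PREDICTORS = 5

ss_total = SS_REGRESSION + SS_RESIDUAL
df_residual = N_OBS - N_PREDICTORS - 1  # 198
df_total = N_OBS - 1                    # 203

# Share of total variation explained by the regression
r_square = 1 - SS_RESIDUAL / ss_total
# Same ratio, but using mean squares so that extra predictors are penalised
adj_r_square = 1 - (SS_RESIDUAL / df_residual) / (ss_total / df_total)
# Standard error of the regression: square root of the residual mean square
standard_error = math.sqrt(SS_RESIDUAL / df_residual)
# Overall F: regression mean square over residual mean square
f_value = (SS_REGRESSION / N_PREDICTORS) / (SS_RESIDUAL / df_residual)

print(f"{r_square:.6f} {adj_r_square:.6f} {standard_error:.6f} {f_value:.3f}")
```

These four values agree with the R Square, Adjusted R Square, Standard Error and F entries of the full-model output to the precision shown there.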
The P-value of 0.584 for 'Athletes Per Population (100K)' means that, if this variable truly had no
effect, a t statistic at least as extreme as the one observed would occur about 58.4% of the time, which
is fairly high. The implication is that this particular variable may not be statistically significant.
This will be examined, and possibly confirmed, later in my analysis.
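Each t Stat behind those P-values is the coefficient divided by its standard error. The sketch below recomputes it for the six rows of the full-model table; small differences from the reported values are expected because the table rounds coefficients and standard errors to six decimals. The dictionary layout is illustrative.

```python
# (coefficient, standard error, reported t Stat) from the full-model table
FULL_MODEL = {
    "Intercept": (-1.092093, 0.385325, -2.834214),
    "Male Athletes": (-0.030457, 0.020896, -1.457525),
    "Female Athletes": (0.224253, 0.024654, 9.095911),
    "GDP.2011 (Billions)": (0.003079, 0.000366, 8.417454),
    "Population.2010 (Millions)": (0.008482, 0.002629, 3.226761),
    "Athletes Per Population (100K)": (0.034970, 0.063815, 0.547980),
}

# t statistic for H0: coefficient = 0
t_stats = {name: coef / se for name, (coef, se, _) in FULL_MODEL.items()}

for name, (_, _, reported) in FULL_MODEL.items():
    print(f"{name:<32} computed {t_stats[name]:9.4f}   reported {reported:9.4f}")
```

A coefficient whose t Stat is small in absolute value, as for 'Athletes Per Population (100K)', is less than one standard error away from zero, which is why its P-value is large.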
Reduced Models:
The next step in this analysis was to exclude certain variables to determine their significance.
Exclude Male Athletes:
Null Hypothesis, H0: Total number of male athletes representing a country is not significant
Regression Statistics
Multiple R           0.949780596
R Square             0.902083181
Adjusted R Square    0.900115004
Standard Error       4.24388371
Observations         204

ANOVA
              df     SS             MS            F            Significance F
Regression    4      33019.41056    8254.85264    458.33432    3.5333E-99
Residual      199    3584.099239    18.0105489
Total         203    36603.5098

                                  Coefficients   Standard Error   t Stat      P-value    Lower 95%
Intercept                         -1.25912       0.36893          -3.41291    0.00078    -1.98663
Female Athletes                    0.19109       0.00952          20.07385    0.00000     0.17232
GDP.2011 (Billions)                0.00317       0.00036           8.76276    0.00000     0.00246
Population.2010 (Millions)         0.00860       0.00263           3.26353    0.00130     0.00340
Athletes Per Population (100K)     0.04091       0.06386           0.64060    0.52252    -0.08503

                 RSS            df
Full Model       3546.052951    5
Reduced Model    3584.099239    4

F-Test            0.053645967
Critical Value    12.70620473

Since the F-Test produced a value that is less than the Critical value, we cannot reject the null
hypothesis.
Exclude Female Athletes:
Null Hypothesis, H0: Total number of female athletes representing a country is not significant
Regression Statistics
Multiple R           0.928785089
R Square             0.862641742
Adjusted R Square    0.859880772
Standard Error       5.026459812
Observations         204

ANOVA
              df     SS             MS            F            Significance F
Regression    4      31575.71545    7893.92886    312.44155    1.42966E-84
Residual      199    5027.794351    25.2652982
Total         203    36603.5098

                                  Coefficients   Standard Error   t Stat      P-value    Lower 95%
Intercept                         -1.52789       0.45411          -3.36454    0.00092    -2.42338
Male Athletes                      0.14496       0.00956          15.16941    0.00000     0.12612
GDP.2011 (Billions)                0.00447       0.00039          11.31013    0.00000     0.00369
Population.2010 (Millions)         0.00883       0.00312           2.83003    0.00513     0.00268
Athletes Per Population (100K)     0.05353       0.07576           0.70665    0.48061    -0.09586

                 RSS            df
Full Model       3546.052951    5
Reduced Model    5027.794351    4

F-Test            2.089282676
Critical Value    12.70620473
Again, based on the comparison of the F-value and the Critical value, we cannot reject the null
hypothesis. However, an observation worth noting is that the F-Test produced a higher value in this
instance than it did when male athletes were excluded. Could it be that, even though not statistically
significant, a country may be able to slightly increase the total number of medals won at the Olympics
by investing in female athletes and sending a larger contingent of female athletes?
Exclude All Athletes:
Null Hypothesis, H0: Total number of athletes representing a country is not significant
Regression Statistics
Multiple R           0.838933441
R Square             0.703809319
Adjusted R Square    0.699366459
Standard Error       7.362614516
Observations         204

ANOVA
              df     SS            MS            F            Significance F
Regression    3      25761.8913    8587.2971     158.41356    1.36725E-52
Residual      200    10841.6185    54.2080925
Total         203    36603.5098

                                  Coefficients   Standard Error   t Stat      P-value    Lower 95%
Intercept                          1.76542       0.58423           3.02177    0.00284     0.61337
GDP.2011 (Billions)                0.00818       0.00045          18.00235    0.00000     0.00728
Population.2010 (Millions)         0.00676       0.00457           1.47941    0.14060    -0.00225
Athletes Per Population (100K)    -0.04910       0.11052          -0.44426    0.65734    -0.26704

                 RSS            df
Full Model       3546.052951    5
Reduced Model    10841.6185     3

F-Test            5.143440926
Critical Value    4.30265273
In this case, the F-Test produces a value higher than the critical value. This implies the null hypothesis
has to be rejected.
Exclude GDP:
Null Hypothesis, H0: The GDP of a participating country is not significant
Regression Statistics
Multiple R           0.931909542
R Square             0.868455394
Adjusted R Square    0.865811282
Standard Error       4.918938001
Observations         204

ANOVA
              df     SS             MS            F            Significance F
Regression    4      31788.51554    7947.12889    328.44871    1.94686E-86
Residual      199    4814.994261    24.1959511
Total         203    36603.5098

                                  Coefficients   Standard Error   t Stat      P-value    Lower 95%
Intercept                         -1.48554       0.44457          -3.34153    0.00100    -2.36221
Female Athletes                    0.31078       0.02605          11.93178    0.00000     0.25942
Male Athletes                     -0.06002       0.02394          -2.50691    0.01298    -0.10724
Population.2010 (Millions)         0.01676       0.00283           5.91438    0.00000     0.01117
Athletes Per Population (100K)     0.03904       0.07417           0.52638    0.59921    -0.10722

                 RSS            df
Full Model       3546.052951    5
Reduced Model    4814.994261    4

F-Test            1.789230628
Critical Value    12.70620473

Based on the result above, the null hypothesis cannot be rejected.
Exclude Population:
Null Hypothesis, H0: The population of a participating country is not significant
Regression Statistics
Multiple R           0.9476435
R Square             0.8980282
Adjusted R Square    0.8959785
Standard Error       4.3308667
Observations         204

ANOVA
              df     SS             MS          F          Significance F
Regression    4      32870.98489    8217.746    438.130    1.99391E-97
Residual      199    3732.524911    18.756
Total         203    36603.5098

                                  Coefficients   Standard Error   t Stat     P-value    Lower 95%
Intercept                         -0.8980        0.3895           -2.3055    0.0222     -1.6660
Female Athletes                    0.2254        0.0252            8.9357    0.0000      0.1757
Male Athletes                     -0.0325        0.0214           -1.5211    0.1298     -0.0747
GDP.2011 (Billions)                0.0035        0.0003           10.1408    0.0000      0.0028
Athletes Per Population (100K)     0.0212        0.0652            0.3249    0.7456     -0.1073

                 RSS            df
Full Model       3546.052951    5
Reduced Model    3732.524911    4

F-Test            0.262928899
Critical Value    12.70620473

The null hypothesis cannot be rejected in this case.
Exclude Athletes Per Population:
Null Hypothesis, H0: The number of athletes representing a country as a percentage of the country's
population is not significant

Regression Statistics
Multiple R           0.950250322
R Square             0.902975675
Adjusted R Square    0.901025438
Standard Error       4.224498317
Observations         204

ANOVA
              df     SS             MS            F            Significance F
Regression    4      33052.07898    8263.01975    463.00801    1.4221E-99
Residual      199    3551.43082     17.846386
Total         203    36603.5098

                              Coefficients   Standard Error   t Stat      P-value    Lower 95%
Intercept                     -1.01104       0.35518          -2.84656    0.00488    -1.71144
Female Athletes                0.22469       0.02460           9.13417    0.00000     0.17618
Male Athletes                 -0.03119       0.02082          -1.49822    0.13566    -0.07224
GDP.2011 (Billions)            0.00308       0.00037           8.43670    0.00000     0.00236
Population.2010 (Millions)     0.00839       0.00262           3.20286    0.00158     0.00322

                 RSS            df
Full Model       3546.052951    5
Reduced Model    3551.43082     4

F-Test            0.007582894
Critical Value    12.70620473

Again, the null hypothesis cannot be rejected in this case.
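A known identity offers a cross-check on the five single-variable exclusions: when exactly one term is dropped, the partial F statistic computed with the full model's residual mean square equals the square of that term's t Stat in the full model. The sketch below applies that textbook form, with the full model's residual degrees of freedom (198) rather than the df value of 5 listed in the comparison tables, so its values differ from the F-Test figures reported in those tables. The names are illustrative.

```python
RSS_FULL = 3546.052951
DF_RESIDUAL_FULL = 198
MSE_FULL = RSS_FULL / DF_RESIDUAL_FULL  # residual mean square of the full model

# excluded variable -> (RSS of the reduced model, t Stat of that variable in the full model)
CASES = {
    "Male Athletes": (3584.099239, -1.457525),
    "Female Athletes": (5027.794351, 9.095911),
    "GDP.2011 (Billions)": (4814.994261, 8.417454),
    "Population.2010 (Millions)": (3732.524911, 3.226761),
    "Athletes Per Population (100K)": (3551.43082, 0.547980),
}

# One excluded term, so the numerator degrees of freedom equal 1
partial_f = {name: (rss_reduced - RSS_FULL) / MSE_FULL
             for name, (rss_reduced, _) in CASES.items()}

for name, (_, t_stat) in CASES.items():
    print(f"{name:<32} partial F {partial_f[name]:8.4f}   t^2 {t_stat ** 2:8.4f}")
```

Under this convention the matching 5% critical value is the square of the two-sided t critical value for 198 degrees of freedom, roughly 1.972 squared, or about 3.89.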
Conclusion:
My analysis confirms what many people might have suspected all along: The size of a country's
contingent to the Olympic Games has a statistical significance to the total number of medals won by the
country at the Games. This is evident in the outcome of the London Olympic Games as the countries
with the 8 largest contingents finished in the top 8 on the medals table. Another result that may not be
surprising is that countries with higher GDPs may perform better at the Games, even though some
countries with significantly lower GDPs (such as Jamaica and Belarus) had a greater medal haul than
other countries with higher GDPs (such as India and Nigeria).
A finding that may not have been so apparent, however, is that, while statistically insignificant, the total
number of medals won by a country at the Olympic Games may be slightly increased by having a larger
contingent of female athletes than male athletes to the Olympic Games. This may be due to the fact that
the average number of female athletes per country currently trails the average number of male athletes
per country.
Lastly, it is worth pointing out that the impacts of the variables examined in this analysis are not
necessarily independent of one another. Countries such as the United States, China, Japan, and
Germany all finished with a large number of medals. These countries also happen to have higher GDPs.
They also sent a lot of athletes (male or female) to the Olympic Games. For each of these countries,
there is an abundance of world-class athletes competing in numerous sports. In addition to the need to
examine interactions between variables included in this analysis, there is also the need to include
several other variables that could potentially contribute to medal winnings. Therefore, for any analysis
on this topic to be complete, several additional factors per country (socio-economic, cultural, political,
religious, along with their interaction) need to be thoroughly examined.