Download Ex5Trends

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
CE 397 Statistics in Water Resources
Exercise 5
Analyzing Trends
by:
Cody Hudson, Ashlynn Stillwell, Patrick Frasier and David Maidment
University of Texas at Austin
February 2009
Contents
Introduction .................................................................................................................................................. 1
Simple Linear Regression .............................................................................................................................. 2
Characterizing Outliers.................................................................................................................................. 8
Multiple Linear Regression ......................................................................................................................... 10
To be turned in............................................................................................................................................ 14
Introduction
In this exercise we will explore how to deal with trends. How can we tell if Y is dependent on X? How
do we tell if a data point is an outlier? How can we tell if Y is dependent on X1, X2, X3 or more
variables?
Goals of this Exercise
To answer these questions, we will evaluate two different data sets. In the first we will evaluate data to
see if there is a trend between two variables using simple linear regression methods. We will use the
same data set to determine if a data point is an outlier. Finally, we will evaluate the case where there
may be multiple explanatory variables.
Computer Requirements
This exercise is to be performed in Microsoft Excel (2007 version used here) using the Data Analysis
Toolpack from Exercise #2. The data for this exercise is at:
http://www.ce.utexas.edu/prof/maidment/StatWR2009/Ex5/Ex5Data.xls
The data used in this exercise were derived from the following web sites:
Census (links to each decade): http://www.census.gov/popest/archives/index.html
1
TWDB Historical Water Use: http://www.twdb.state.tx.us/wushistorical/DesktopDefault.aspx?PageID=1
Fresh Water Withdrawal by Country: http://www.worldwater.org/data.html
Population by Country: http://en.wikipedia.org/wiki/List_of_countries_by_population
GDP by Country: https://www.cia.gov/library/publications/the-world-factbook/fields/2195.html
Average annual rainfall in largest city: http://www.worldclimate.com/
Simple Linear Regression
Simple Trend Analysis
We all love Texas. We all want Texas to continue to have enough water to serve its residents. This part
of the exercise evaluates water use in Texas over the past 35 years. We will look at water use in Texas
and plot it versus census population data starting in 1974. Water use data was taken from the Texas
Water Development Board (TWDB). Population census data was taken from the census.gov webpage.
Lets make a simple study of the trends in these data through time.
Analysis 1: Population Growth vs Time. First, create a graph for population growth versus time. Add a
trendline to this graph by right-clicking on the data points, selecting ‘Add Trendline’, and selecting linear.
Check the boxes to add the equation and R2 value to the chart as shown below.
2
With this graph we see a strong trend of increasing population growth with time.
Analysis 2: Water Use vs Time. Next, graph water use versus time with a trendline. What did you
expect to see?
Analysis 3: Water Use vs Population. Finally, graph total water use versus population growth with a
trendline. Does there appear to be a trend in the data? The total water use includes all sources,
including manufacturing, steam electric, irrigation, mining, and livestock. These water uses are most
likely not affected by increased population.
Analysis 4: Municipal Water Use vs Time. Now let’s consider municipal water use only.
Graph municipal water use versus time with a trendline. What did you expect to see?
Analysis 5: Municipal Water Use vs Population. Finally, graph municipal water use versus population
with a trendline. Does there appear to be a trend in the data?
3
To be turned in: A graph of the data with the trend lines and equations for Analyses 1-5, plus a table that
summarizes the five trend equations and their R2 values. Discuss the results. Which relationships do you
think are statistically significant? What is the population growth rate per year in Texas?
Interpreting Regression Statistics
Now, anybody can fit trend lines but the differences among the equations you have just fitted need to
be examined more closely and the significance of the individual coefficients studied. We can do this
with the Regression Statistics. Let’s take a closer look at the trendline for municipal water use versus
population using the Excel Regression function from the Data Analysis Toolpack. Under Data, click on
Data Analysis, Regression.
4
The Input Y Range will be the municipal water use and Input X will be population. Check labels as
appropriate. We’d also like to have the 95% confidence level and residuals for later, as shown below.
5
Excel gives us lots of fun output regarding the relationship between municipal water use and population.
I’ll explain more about how to interpret this information in class. Being able to do this properly is what
separates naïve statistics from properly informed statistical analysis.
6
The right way to describe this equation is:
R2 = 0.903 Se = 159033 F = 195.83
Water Use = -7407.22 + 0.18348 * Population
(-0.03053) (13.994)
The numbers in parentheses under the coefficients are the t-values that tell you how significant the
coefficients are. The R2 value tells you what % of the variability in Water Use is explained by the
equation. The Standard Error (Se) value defines the vertical uncertainty around the line for estimates
made by this equation. The F-ratio is a more formal measure of the statistical significance of the
equation as a whole than R2.
We also get p-values for the parameter estimates for the intercept (0.976) and population slope
coefficient (4.07E-12). Since the p-value for the intercept is not less than a significance level () of 0.05,
we fail to reject the null hypothesis that the intercept is zero, thus the intercept estimate is not
statistically different from zero. Another way we can determine whether a parameter estimate is
statistically significant is to look at the upper and lower 95% confidence intervals. For the intercept,
zero is contained within the interval, thus the estimate is not statistically significant. On the other hand,
the slope coefficient estimate for population is highly significant.
Repeat the Regression analysis but this time check the box that sets “Constant is Zero”. This eliminates
the intercept of the equation.
7
TO BE TURNED IN: The table of interpretive statistics from the simple regression of municipal water use
vs population for the regressions with and without an intercept. Write both equations in the format
illustrated above. Which equation has the greater statistical significance? What is the average municipal
water demand in gallons per capita day in Texas? Is it changing through time? Note: 1 acre-ft = 325,851
US gallons.
Characterizing Outliers
We’re going to use the same data set to evaluate if can consider a data point to be an outlier. Use the
information from the regression function in the Data Analysis Toolpack. The summary will contain the
value of the standard error, s.
According to Helsel and Hirsch p.246, “leverage is a measure of an ‘outlier’ in the x direction.” A high
leverage point is one where
, where p is the number of coefficients in the model and n is the
number of data use. In our case, p = 2 and n = 23, so 3p/n = 3*2/23 = 0.261. The idea is to check the
degree of deviation of an individual point from the regression line in the x and y directions with this
value.
For deviation in the x-direction, the statistics hi is computed as:
Where SSx is the sum of the squares x.
For deviations in the y direction we use the standardized residual esi. It is the actual residual
divided by its standard error, Se. The estimated y can be calculated using the trendline
equation. Alternatively, the residual can be found in the residuals output of the regression analysis.
Then
Where the s in this equation is the standard error of estimate of the regression equation.
An extreme outlier is one for which |esi|>3.
8
Outlier?
Using the information in the Excel regression summary, determine if the 1974 data point is an outlier in
the x and y.
To be turned in: Without making any calculations, does the point for municipal water use in 1974 appear
to be an outlier? Why or why not? Using the procedure outlined above from Helsel and Hirsch, does the
point for municipal water use in 1974 have high leverage? Is this point an outlier?
9
Multiple Linear Regression
Mulitple Linear Regression is used when it is thought that the (y) variable is dependent on
multiple (x) variables. For example, stormwater runoff is not just dependent on rainfall but on
soil types and land slopes as well. Expanding the model for the (y) variable will help us
determine which factors are statistically important and which do not affect the (y) variable.
Data was compiled concerning multiple countries. We are interested in which factors would
affect water use such as Gross Domestic Product (GDP), population, and annual rainfall. Annual
rainfall was approximated as the annual rainfall in the most populated city of the country. In
some cases (small countries) an approximate national average was obtained. GDP and
population data were taken from the most recent data available.
US
Canada
Mexico
Australia
China
Brazil
India
Sudan
Cameroon
Ukraine
UK
France
Japan
Vietnam
Saudi
Arabia
Honduras
Annual Fresh
Water
Withdrawals
(km3/year)
477
44.72
78.22
24.06
549.76
59.3
645.84
37.32
0.99
37.53
11.75
33.16
88.43
71.39
Population
305,862,000
33,563,000
106,682,500
21,597,121
1,335,962,132
190,769,000
1,144,810,000
38,560,000
18,549,000
46,191,022
61,612,300
65,073,482
127,704,000
87,375,000
GDP
($ billion)
14330
1564
1143
1069
4222
1665
1237
62.19
25
198
2787
2978
4844
90.88
Average
Annual
Rainfall of
Largest
City
(mm)
960
817.5
749
1200
970
1350
2200
250
4060
615
584
642
1565
1976
17.32
0.86
24,735,000
7,106,000
528.3
13.78
81
907
Source: The World's Water, Pacific Institute; Wikipedia; CIA
10
Here we have three independent variables and one dependent variable, thus we cannot graph
our data. To determine which factors were important we use multiple linear regression. Use
the Data Analysis Toolpack to do a regression analysis on the data. Highlight all the
independent variables as the input X range and the fresh water withdrawals and the dependent
Y variable. We will not use residuals in this step.
11
Note the p-values for each coefficient estimate in the output of the regression analysis. Which
do you think is the most significant factor? What other issues could be affecting this model that
were not accounted for?
12
To be turned in: Present the output table of regression statistics and write the estimated model for a
country’s fresh water withdrawals as a function of population, GDP, and average annual rainfall in the
standard form with t-statistics, R2, Se and F ratio. Are there any coefficient estimates that are not
statistically significant? If so, which ones? How can you tell? What is the average fresh water
withdrawal per person (gpcd) derived from the equation? Compare this to the values derived directly
from the data for each country. How is the average value of gpcd derived from the equation different
from the average of the gpcd values derived directly from the data? What is the significance of the
positive coefficient on GDP? [1 cubic meter = 264.17 US gallons].
13
To be turned in
1. A graph of the data with the trend lines and equations for Analyses 1-5, plus a table that
summarizes the five trend equations and their R2 values. Discuss the results. Which
relationships do you think are statistically significant? What is the population growth rate per
year
2. Using the procedure outlined above from Helsel and Hirsch, does the point for municipal water
use in 1974 have high leverage? Is this point an outlier?
3. The table of interpretive statistics from the simple regression of municipal water use vs
population for the regressions with and without an intercept. Write both equations in the
format illustrated above. Which equation has the greater statistical significance? What is the
average municipal water demand in gallons per capita day in Texas? Is it changing through
time? Note: 1 acre-ft = 325,851 US gallons.
4. Present the output table of regression statistics and write the estimated model for a country’s
fresh water withdrawals as a function of population, GDP, and average annual rainfall in the
standard form with t-statistics, R2, Se and F ratio. Are there any coefficient estimates that are not
statistically significant? If so, which ones? How can you tell? What is the average fresh water
withdrawal per person (gpcd) derived from the equation? Compare this to the values derived
directly from the data for each country. How is the average value of gpcd derived from the
equation different from the average of the gpcd values derived directly from the data? What is
the significance of the positive coefficient on GDP? [1 cubic meter = 264.17 US gallons].
14