BUS 211 Notes
Chapter 1 Introduction and Data Collection
Categorical Variables – responses are a selection from categories, e.g. Gender (male or female), Class (freshman,
sophomore, junior, senior), Smoke (yes or no), etc.
Numerical Variables – responses are numbers, e.g. Income ($30,000), Age (25), etc.
Can be Discrete (integer) or Continuous (fractional parts).
Chapter 2 Presenting Data in Tables and Charts
Sort Data – Data | Sort
Stem-and-Leaf Graph – PHStat | Descriptive Statistics | Stem-and-Leaf Display
Frequency Distribution - PHStat | Descriptive Statistics | Frequency Distribution
Set up classes then array (bin) the upper limit of the desired frequency distribution
Be sure to include a label for the array (use Upper Limit)
Relative Frequency distribution – Divide the frequency distribution by the total
Percentage Distribution - Divide the frequency distribution by the total and multiply by 100
Or use Format | Cells… | Percentage
Cumulative Distribution – Sum the frequencies from top to bottom listing each total as you go.
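The frequency, relative, percentage and cumulative distributions can also be computed outside Excel. A minimal Python sketch, assuming each class is identified by its upper limit (as in the bins array); the function name and the sample ages are hypothetical:

```python
from bisect import bisect_left

def frequency_table(data, upper_limits):
    """Build frequency, relative, percentage and cumulative columns.

    upper_limits must be sorted; a value belongs to the first class
    whose upper limit is greater than or equal to the value.
    """
    counts = [0] * len(upper_limits)
    for value in data:
        idx = bisect_left(upper_limits, value)
        if idx == len(upper_limits):
            raise ValueError(f"{value} exceeds the largest upper limit")
        counts[idx] += 1

    total = sum(counts)
    rows, running = [], 0
    for limit, freq in zip(upper_limits, counts):
        running += freq  # sum the frequencies from top to bottom
        rows.append({
            "upper_limit": limit,
            "frequency": freq,
            "relative": freq / total,
            "percentage": 100 * freq / total,
            "cumulative": running,
        })
    return rows

# Hypothetical ages, with classes ending at 29, 39 and 49
ages = [22, 25, 31, 35, 38, 41, 29, 47]
for row in frequency_table(ages, [29, 39, 49]):
    print(row)
```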
Graphs - PHStat does not work well for most graphs; use the chart wizard in Excel.
Histogram (also known as a Vertical Bar Chart or Column Chart)
Set up the frequency distribution, then use the midpoints for labels
Double click the chart icon and select a column graph type
Select the frequency without labels as the data
Select the Series tab, mouse into the X-axis label box then select the midpoints
Select Next to insert the title and axis labels and make any other changes
Select Next to pick a location for the chart then Finish
Double click a bar and select Options, set gap width to 0
Polygon (also known as a line graph)
Set up the frequency distribution, then use the midpoints for labels.
Insert a class with 0 frequency and an appropriate label at the top and the bottom.
Double click the chart icon and select a line graph type
Select the frequency without labels as the data
Select the Series tab, mouse into the X-axis label box then select the midpoints
Select Next to insert the title and axis labels and make any other changes
Select Next to pick a location for the chart then Finish
Ogive also known as a cumulative line graph or cumulative polygon
Set up the cumulative frequency distribution and use the upper class limits for labels.
Insert a class with 0 frequency and an appropriate label at the top but not the bottom.
Double click the chart icon and select a line graph type and complete the steps
XY Scatter
Set up the data in columns with the X values in the first column and the Y values in the second column
Double click the chart icon and select XY Scatter graph
Select both columns as the data, do not select the labels, and complete the steps
Bar Chart
Same as Histogram but for categorical data.
Use the category labels: if not numerical values they can be selected with the data.
Pie Chart
Same as above. Be sure to remove legend, select Data Labels, check Category name
Pareto Chart
Raw Data: use a line chart on 2 axes, or
Select Descriptive Statistics | One-Way Tables & Charts…
Be sure to select labels as the model will not work otherwise
Check table of frequencies and Pareto Diagram
Bivariate Categorical Tables and Charts
Use PHStat (also available in Excel - Data | Pivot Wizard)
In PHStat select Descriptive Statistics | Two-Way Tables & Charts
Chapter 3 Numerical Descriptive Measures
Use Tools | Data Analysis | Descriptive Statistics, check the Summary statistics box to get the following:
sample mean, median, mode, standard deviation, variance, range
population mean, median, mode, range
Use fx (the individual functions) for the following measures:
geometric mean (GEOMEAN), population variance (VARP) and standard deviation (STDEVP)
approximate quartiles (QUARTILE), approximate percentiles (PERCENTILE)
Coefficient of variation: Divide the standard deviation by the mean and multiply by 100%
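For checking a spreadsheet result, the same measures are available in Python's standard `statistics` module. A small sketch; the helper name `describe` and the data are hypothetical:

```python
import statistics

def describe(sample):
    """Return the Chapter 3 summary measures for a list of numbers."""
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return {
        "mean": mean,
        "median": statistics.median(sample),
        "sample_variance": statistics.variance(sample),
        "sample_sd": sd,
        "population_variance": statistics.pvariance(sample),  # counterpart of VARP
        "population_sd": statistics.pstdev(sample),           # counterpart of STDEVP
        "range": max(sample) - min(sample),
        "geometric_mean": statistics.geometric_mean(sample),  # counterpart of GEOMEAN
        "cv_percent": sd / mean * 100,  # coefficient of variation
    }

# Hypothetical data
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(f"{name}: {value}")
```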
Box-and-Whisker Plot and Five-Number Summary
PHStat | Descriptive Statistics | Box-and-Whisker Plot then check Five-Number Summary
Gives the exact quartiles not approximations
Coefficient of Correlation: fx (CORREL), or Tools | Data Analysis | Correlation
Chapter 4 Basic Probability
Probability of A or B:
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
If A and B are Mutually Exclusive:
$P(A \text{ or } B) = P(A) + P(B)$
Conditional probability of A given B:
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
If A and B are Independent:
$P(A \mid B) = P(A)$
Joint Probability of A and B:
$P(A \text{ and } B) = P(A \mid B)\,P(B)$
If A and B are Independent:
$P(A \text{ and } B) = P(A)\,P(B)$
Bayes' Theorem
$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{P(A \mid B_1)\,P(B_1) + P(A \mid B_2)\,P(B_2) + \dots + P(A \mid B_k)\,P(B_k)}$
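Bayes' Theorem is easy to get wrong by hand, so a short numerical check can help. A sketch in Python, assuming the events B1..Bk are mutually exclusive and exhaustive; the function name and the probabilities are hypothetical:

```python
def bayes_posteriors(priors, likelihoods):
    """Return P(B_i | A) for every event B_i.

    priors[i]      = P(B_i)
    likelihoods[i] = P(A | B_i)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A | B_i) P(B_i)
    total = sum(joint)                                    # the denominator, P(A)
    return [j / total for j in joint]

# Hypothetical example with two events
posteriors = bayes_posteriors(priors=[0.4, 0.6], likelihoods=[0.8, 0.3])
print(posteriors)
```

The posteriors always sum to 1, which is a quick sanity check on the inputs.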
Chapter 5 Some Important Discrete Probability Distributions
Expected value:
$\mu = E(X) = \sum_{i=1}^{N} X_i P(X_i)$
Variance:
$\sigma^2 = \sum_{i=1}^{N} [X_i - E(X)]^2 P(X_i)$
Combinations:
$\frac{n!}{X!\,(n - X)!}$
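The expected value, variance and combinations formulas translate directly into code. A minimal Python sketch; the function names and the small distribution are hypothetical:

```python
from math import comb

def expected_value(values, probs):
    """E(X): sum of each outcome times its probability."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Sum of squared deviations from E(X), weighted by probability."""
    mu = expected_value(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

# Hypothetical distribution: number of heads in two fair coin flips
values = [0, 1, 2]
probs = [0.25, 0.5, 0.25]
print(expected_value(values, probs))
print(variance(values, probs))

# Combinations: n! / (X! (n - X)!) with n = 5, X = 2
print(comb(5, 2))
```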
Binomial distribution: (for an infinite population)
PHStat | Probability & Prob. Distributions | Binomial then check Cumulative Probabilities
Hypergeometric distribution: (for a finite population)
PHStat | Probability & Prob. Distributions | Hypergeometric no cumulative probabilities available
Poisson distribution:
PHStat | Probability & Prob. Distributions | Poisson then check Cumulative Probabilities
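The notes rely on PHStat for these three distributions and do not list the formulas. As a cross-check, the sketch below uses the standard textbook probability functions (an assumption, since they are not stated here); the function and parameter names are hypothetical:

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X = x) for n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """P(X <= x), the cumulative probability."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

def hypergeometric_pmf(x, n, N, A):
    """P(X = x) successes in a sample of n drawn without replacement
    from a population of N that contains A successes."""
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

def poisson_pmf(x, lam):
    """P(X = x) when the expected number of events is lam."""
    return exp(-lam) * lam**x / factorial(x)

print(binomial_pmf(2, 4, 0.5))
print(binomial_cdf(2, 4, 0.5))
print(hypergeometric_pmf(1, n=2, N=10, A=3))
print(poisson_pmf(0, 2))
```

Since PHStat gives no cumulative probabilities for the hypergeometric, summing `hypergeometric_pmf` over the needed values of x is one way to get them.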
Chapter 6 The Normal Distribution and Other Continuous Distributions
Normal Distribution
PHStat | Probability & Prob. Distributions | Normal then check the desired calculation
To check the normality assumption construct a stem-and-leaf, box-and-whisker, histogram or a
Normal probability plot PHStat | Probability & Prob. Distributions | Normal Probability Plot
Uniform Distribution

$\mu = \frac{a + b}{2}$
$\sigma^2 = \frac{(b - a)^2}{12}$
where a and b are the endpoints of the uniform distribution.
Exponential distribution
PHStat | Probability & Prob. Distributions | Exponential
Only returns results for $\le X$; for $> X$ use 1 - probability; for results between two values find the
probability for each and subtract the smaller from the larger
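The three cases for the exponential distribution can be sketched as follows. This assumes the standard cumulative form P(X ≤ x) = 1 - e^(-λx), which the notes do not state; `rate` is λ and all names are hypothetical:

```python
from math import exp

def exp_at_most(x, rate):
    """P(X <= x): the result the calculator returns."""
    return 1 - exp(-rate * x)

def exp_greater(x, rate):
    """P(X > x): use 1 - probability."""
    return 1 - exp_at_most(x, rate)

def exp_between(low, high, rate):
    """P(low < X <= high): subtract the smaller probability from the larger."""
    return exp_at_most(high, rate) - exp_at_most(low, rate)

# Hypothetical arrival rate of 2 per unit of time
print(exp_at_most(1.0, 2))
print(exp_greater(1.0, 2))
print(exp_between(0.5, 1.0, 2))
```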
Sampling distribution of the mean
Calculate the standard deviation of the sampling distribution (also called the standard error of the
mean), then use the Normal Distribution calculator if the population is normally distributed, or
the sample size is > 30, or the population distribution is symmetrical and the sample size is > 15
Infinite population: $\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$
Finite population: $\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
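A short sketch of the standard error of the mean, with the finite population correction applied only when a population size N is supplied. The function name and the numbers are hypothetical; `statistics.NormalDist` stands in for the Normal Distribution calculator:

```python
from math import sqrt
from statistics import NormalDist

def standard_error_mean(sigma, n, N=None):
    """Standard error of the mean; pass N for a finite population."""
    se = sigma / sqrt(n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))  # finite population correction
    return se

# Hypothetical values: sigma = 10, sample of 25
se_infinite = standard_error_mean(10, 25)
se_finite = standard_error_mean(10, 25, N=101)
print(se_infinite, se_finite)

# Probability that the sample mean is at most 52 when the population mean is 50
print(NormalDist(mu=50, sigma=se_infinite).cdf(52))
```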
Sampling distribution of the proportion:
Calculate the standard deviation of the sampling distribution (standard error of the proportion), then
if np > 5 and n(1-p) > 5 use the Normal Distribution calculator PHStat | Probability & Prob.
Distributions | Normal
$p_s = \frac{X}{n} = \frac{\text{number of successes}}{\text{sample size}}$
$p_s$ = sample proportion
p = population proportion
Infinite population: $\sigma_{p_s} = \sqrt{\frac{p(1 - p)}{n}}$
Finite population: $\sigma_{p_s} = \sqrt{\frac{p(1 - p)}{n}} \sqrt{\frac{N - n}{N - 1}}$
Chapter 7 Confidence Interval Estimation
Interval estimate of the population mean ($\mu_x$) with $\sigma_x$ unknown:
PHStat | Confidence Intervals | Estimate for the Mean, sigma unknown
be sure to check the finite box for finite populations
Interval estimate of the population proportion:
PHStat | Confidence Intervals | Estimate for the Proportion
be sure to check the finite box for finite populations
Interval estimate of the population total:
PHStat | Confidence Intervals | Estimate for the Population Total
Sample size (n) for estimating a mean:
PHStat | Sample Size | Determination for the Mean
be sure to check the finite box for finite populations
Estimate of parameters would be from a preliminary sample
Sample size for estimating a proportion:
PHStat | Sample Size | Determination for the Proportion
be sure to check the finite box for finite populations
Estimate of True Proportion would be the proportion from a preliminary sample
If a preliminary sample is not available use .5
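The notes leave the sample size calculation to PHStat. A sketch of what such a calculation typically does, assuming the standard infinite-population formulas n = Z²σ²/e² and n = Z²p(1-p)/e² (not stated in the notes) and rounding up; all names are hypothetical:

```python
from math import ceil
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-tail Z value, e.g. about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_mean(sigma, error, confidence=0.95):
    """Sample size needed to estimate a mean within +/- error."""
    z = z_for_confidence(confidence)
    return ceil(z**2 * sigma**2 / error**2)

def sample_size_proportion(error, confidence=0.95, p=0.5):
    """Sample size needed to estimate a proportion within +/- error.

    p defaults to .5, the value to use when no preliminary sample exists.
    """
    z = z_for_confidence(confidence)
    return ceil(z**2 * p * (1 - p) / error**2)

print(sample_size_mean(sigma=10, error=2))
print(sample_size_proportion(error=0.05))
```

Using p = .5 gives the largest possible sample size, which is why it is the safe default.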
Chapter 8 Fundamentals of Hypothesis Testing: One-Sample Tests
One Sample numerical data
$\sigma$ unknown
Hypothesis
Ho: $\mu_x =$ value     Ha: $\mu_x \ne$ value    a two tail test
Ho: $\mu_x \le$ value   Ha: $\mu_x >$ value      upper tail test
Ho: $\mu_x \ge$ value   Ha: $\mu_x <$ value      lower tail test
Test Statistic
t
Procedure
Summary Data: PHStat | One-Sample Tests | t Test for the Mean, sigma unknown
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked).
Parentheses indicate information to be taken from the problem
One Sample Categorical Data
Hypothesis
Ho: $p =$ value     Ha: $p \ne$ value    a two tail test
Ho: $p \le$ value   Ha: $p >$ value      upper tail test
Ho: $p \ge$ value   Ha: $p <$ value      lower tail test
Test Statistic
Z
Procedure
Summary Data: PHStat | One-Sample Tests | Z Test for the Proportion
Raw Data: No Tests available, calculate p and use PHStat
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Parentheses indicate information to be taken from the problem
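For raw categorical data, where no ready-made test exists, the Z test for the proportion can be sketched by hand. This assumes the usual statistic Z = (ps - p) / sqrt(p(1-p)/n), which the notes do not spell out; the function name and data are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(successes, n, p0, tail="two"):
    """Return (Z, p-value) for Ho: p = p0.

    tail is "two", "upper" or "lower", matching the alternative hypothesis.
    """
    ps = successes / n                # sample proportion
    se = sqrt(p0 * (1 - p0) / n)      # standard error under Ho
    z = (ps - p0) / se
    cdf = NormalDist().cdf(z)
    if tail == "upper":
        p_value = 1 - cdf
    elif tail == "lower":
        p_value = cdf
    elif tail == "two":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, testing Ho: p = .5
z, p_value = z_test_proportion(60, 100, 0.5)
alpha = 0.05
print(z, p_value)
print("Reject Ho" if p_value < alpha else "Fail to reject Ho")
```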
Chapter 9 Two-Sample Tests
Procedure to determine the proper two sample mean test for numerical data:
Are the data paired?
  Yes: Use Paired Data Model
  No: Run the F Test. Are the $\sigma^2$'s equal?
    Yes: Use $\sigma^2$ Equal Model
    No: Use $\sigma^2$ Unequal Model
Two Sample test of Means with Paired numerical data
Hypothesis
Ho: $\mu_1 = \mu_2$     Ha: $\mu_1 \ne \mu_2$    a two tail test
Ho: $\mu_1 \le \mu_2$   Ha: $\mu_1 > \mu_2$      upper tail test
Ho: $\mu_1 \ge \mu_2$   Ha: $\mu_1 < \mu_2$      lower tail test
Procedure
Summary Data: no PHStat calculation available
Raw Data: Data Analysis | t Test: Paired Two Sample for Means
Test Statistic
t
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Interval estimate of the difference
$\bar{D} \pm t_{n-1} \frac{S_D}{\sqrt{n}}$
To get t use function TINV(1-Confidence, df)
Use Descriptive Statistics to get $\bar{D}$ and $S_D$
Or PHStat | Confidence Intervals | Estimate for the Mean, sigma unknown - Select the differences as the data
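The paired interval estimate can be sketched in Python as below. The critical t value is passed in rather than computed, since it would come from TINV(1-Confidence, df) with df = n - 1; the function name and the data are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

def paired_interval(sample1, sample2, t_crit):
    """Interval estimate of the mean difference for paired data.

    Returns (lower, upper, t_statistic); differences are sample1 - sample2.
    """
    diffs = [a - b for a, b in zip(sample1, sample2)]
    n = len(diffs)
    d_bar = mean(diffs)            # mean of the differences
    s_d = stdev(diffs)             # sample standard deviation of the differences
    se = s_d / sqrt(n)
    half_width = t_crit * se
    return d_bar - half_width, d_bar + half_width, d_bar / se

# Hypothetical paired measurements; 3.182446 is the 95% two-tail t value for df = 3
lower, upper, t_stat = paired_interval([10, 12, 9, 11], [8, 11, 9, 8], 3.182446)
print(lower, upper, t_stat)
```

If the interval contains 0, the two-tail test at the matching alpha fails to reject Ho.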
Two Sample test of Variances with numerical data
Hypothesis
Ho: $\sigma_1^2 = \sigma_2^2$     Ha: $\sigma_1^2 \ne \sigma_2^2$    a two tail test
Ho: $\sigma_1^2 \le \sigma_2^2$   Ha: $\sigma_1^2 > \sigma_2^2$      upper tail test
Ho: $\sigma_1^2 \ge \sigma_2^2$   Ha: $\sigma_1^2 < \sigma_2^2$      lower tail test
Procedure
Summary Data: PHStat | Two-Sample Tests | F Test for the Difference in Two Variances
Raw data: Data Analysis | F Test Two Sample for Variances (do not use; it only gives the lower
tail value)
Test Statistic
F
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
$\sigma^2$'s not proven unequal with the F test
Two Sample test of Means with numerical data
Hypothesis
Ho: $\mu_1 = \mu_2$     Ha: $\mu_1 \ne \mu_2$    a two tail test
Ho: $\mu_1 \le \mu_2$   Ha: $\mu_1 > \mu_2$      upper tail test
Ho: $\mu_1 \ge \mu_2$   Ha: $\mu_1 < \mu_2$      lower tail test
Procedure
Summary Data: PHStat | Two-Sample Tests | t Test for Differences in Two Means
Raw Data: Data Analysis | t Test: Two Sample Assuming Equal Variances
Test Statistic
t
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Interval estimate of the difference
$(\bar{X}_1 - \bar{X}_2) \pm t_{n_1 + n_2 - 2} \sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$
To get t use function TINV(1-Confidence, df)
2‘s proven unequal with the F test
Two Sample test of Means with numerical data
Hypothesis
Ho: 1 = 2
Ha: 1  2
Procedure
Summary Data: Use spreadsheet downloaded from the Homework web page
a two tail test
Ho: 1  2
Ho: 1  2
Ha: 1  2
Ha: 1  2
upper tail test
lower tail test
Raw Data: Data Analysis | t Test: Two Sample Assuming Unequal Variances
Test Statistic
t
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Interval estimate of the difference


s12 s22

n1 n2
CI  X 1  X 2  t
To get t use function TINV(1-Confidence, df)
Two Sample test of a Proportion with categorical data
$p_i = \frac{X_i}{n_i} = \frac{\text{number in the sample with the desired characteristic}}{\text{total number of items in the sample}}$
Hypothesis
Ho: $p_1 = p_2$     Ha: $p_1 \ne p_2$    a two tail test
Ho: $p_1 \le p_2$   Ha: $p_1 > p_2$      upper tail test
Ho: $p_1 \ge p_2$   Ha: $p_1 < p_2$      lower tail test
Procedure
PHStat | Two-Sample Tests | Z Test for the Differences in Two Proportions
Test Statistic
Z
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Interval estimate of the difference
$(p_{s1} - p_{s2}) \pm Z \sqrt{\frac{p_{s1}(1 - p_{s1})}{n_1} + \frac{p_{s2}(1 - p_{s2})}{n_2}}$
To get Z use function NORMSINV(two tail)
where two tail=Confidence+(1-Confidence)/2
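The interval estimate of the difference in two proportions can be sketched in Python, with `NormalDist().inv_cdf` playing the role of NORMSINV. The function name and the counts are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_interval(x1, n1, x2, n2, confidence=0.95):
    """Interval estimate of p1 - p2 from two independent samples."""
    ps1, ps2 = x1 / n1, x2 / n2
    # Same argument the notes pass to NORMSINV: Confidence + (1 - Confidence) / 2
    two_tail = confidence + (1 - confidence) / 2
    z = NormalDist().inv_cdf(two_tail)
    half_width = z * sqrt(ps1 * (1 - ps1) / n1 + ps2 * (1 - ps2) / n2)
    diff = ps1 - ps2
    return diff - half_width, diff + half_width

# Hypothetical counts: 60 of 100 in sample 1, 45 of 100 in sample 2
lower, upper = two_proportion_interval(60, 100, 45, 100)
print(lower, upper)
```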
Chapter 10 Analysis of Variance
(Multi (c) Sample tests with numerical data)
Equality of Variances
Hypothesis
Ho: $\sigma_1^2 = \sigma_2^2 = \sigma_3^2$
a two tail test
Ha: not all $\sigma^2$'s are equal
Procedure
Raw data: PHStat | Multiple-Sample Tests | Levene’s Test
Test Statistic
F
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected–There is not sufficient evidence that (Question asked)
One Factor ANOVA
Hypothesis
Ho: $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_c$
Ha: not all $\mu$'s are equal
Procedure
Tools | Data Analysis |Anova: Single Factor
Test Statistic
F from the computer printout
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
c = the number of populations
P-value = The Probability of F
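To see where the F on the printout comes from, the one factor ANOVA statistic can be computed by hand. The sketch below assumes the standard definitions F = MSA / MSW with c - 1 and n - c degrees of freedom (not spelled out in the notes); the names and data are hypothetical, and the p-value is still read from Excel:

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_among, df_within) for a list of samples."""
    c = len(groups)                            # number of populations
    n = sum(len(g) for g in groups)            # total number of data points
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [mean(g) for g in groups]

    # Variation among the group means
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation within the groups
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    msa = ssa / (c - 1)
    msw = ssw / (n - c)
    return msa / msw, c - 1, n - c

# Hypothetical data for three groups
f_stat, df_among, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_among, df_within)
```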
Tukey's multiple comparison method: (determines which of the c means are different from each other).
Procedure
PHStat | Multiple-Sample Tests | Tukey-Kramer Procedure
Test Statistic
Critical Range
Input
Q found in the Studentized Range Table where column = c and row = n-c
c = number of groups n = total number of data points in all groups
Decision Rule
If the absolute difference between any pair of means is greater than the critical range, the
means in that pair are different.
Two Factor With Replication
Hypothesis
Ho1: $\mu_{A1} = \mu_{A2} = \mu_{A3} = \dots = \mu_{Ar}$
Ha1: not all $\mu_A$'s are equal
r = the number of levels in Factor A
Ho2: $\mu_{B1} = \mu_{B2} = \mu_{B3} = \dots = \mu_{Bc}$
Ha2: not all $\mu_B$'s are equal
c = the number of levels in Factor B
Ho3: No Interaction
Ha3: Interaction
Procedure
Tools | Data Analysis |Anova: Two Factor With Replication
Test Statistic
F from the computer printout. p-value = The Probability of F
For differences in rows see p-value for the Sample row of the ANOVA
For differences in columns see p-value for the Columns row of the ANOVA
For interaction between factors see p-value for the Interaction row of the ANOVA
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
H1 If rejected – There is sufficient evidence of a difference in (factor A)
H2 If rejected – There is sufficient evidence of a difference in (factor B)
H3 If rejected – There is sufficient evidence of an interaction term
If not rejected – There is not sufficient evidence to make a conclusion about …
Tukey's multiple comparison method for Two Factor ANOVA with replication:
No spreadsheet, hand calculate with the following formulas:
critical range A = $Q \sqrt{\frac{MSW}{c\,n'}}$
MSW from ANOVA MS Within
Q table column is r the number of levels in Factor A
Q table row is rc(n’-1) where c is the levels in Factor B, and n’ is the number of replications
critical range B = $Q \sqrt{\frac{MSW}{r\,n'}}$
MSW from ANOVA MS Within
Q table column is c the number of levels in Factor B
Q table row is rc(n’-1) where r is the levels in Factor A, and n’ is the number of replications
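Since there is no spreadsheet for this step, a small Python sketch of the hand calculation may help. Q still has to be looked up in the Studentized Range Table; the function names, the Q value and the means below are hypothetical:

```python
from itertools import combinations
from math import sqrt

def critical_range(q, msw, other_factor_levels, replications):
    """Critical range for one factor in a two factor ANOVA with replication.

    For Factor A pass c (the levels in Factor B) as other_factor_levels;
    for Factor B pass r (the levels in Factor A).
    """
    return q * sqrt(msw / (other_factor_levels * replications))

def different_pairs(level_means, crit):
    """Pairs of levels whose absolute difference in means exceeds crit."""
    return [
        (a, b)
        for (a, mean_a), (b, mean_b) in combinations(level_means.items(), 2)
        if abs(mean_a - mean_b) > crit
    ]

# Hypothetical inputs: Q = 3.5, MSW = 8, c = 2 levels in Factor B, n' = 4
crit_a = critical_range(q=3.5, msw=8.0, other_factor_levels=2, replications=4)
print(crit_a)
print(different_pairs({"A1": 10.0, "A2": 12.0, "A3": 15.0}, crit_a))
```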
Chapter 11 Chi-Square Tests and Nonparametric Tests
Two Sample test of a Proportion with categorical data (Alternate Procedure)
Hypothesis
Ho: $p_1 = p_2$
Ha: $p_1 \ne p_2$
(No < or > Hypothesis)
Procedure
PHStat | Two-Sample Tests | Chi-Square Test for the Differences in Two Proportions
Test Statistic
$\chi^2$
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Multi (c) Sample test of Proportions with categorical data
Hypothesis
Ho: $p_1 = p_2 = p_3 = \dots = p_c$
c = the number of samples
Ha: not all p’s are equal
Procedure
PHStat | Multiple-Sample Tests | Chi-Square Test
Test Statistic
$\chi^2$
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Be sure to check the box for the Marascuilo Procedure to determine which proportions are different.
$\chi^2$ Test of Independence
Hypothesis
Ho: Two categorical variables are independent
Ha: Two categorical variables are related
Procedure
PHStat | Multiple-Sample Tests | Chi-Square Test
Test Statistic
$\chi^2$
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that the variables are related
If not rejected – There is not sufficient evidence that the variables are related.
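The chi-square statistic behind the test of independence can be reproduced from a table of observed counts. The sketch assumes the standard definition, the sum of (observed - expected)² / expected with expected = row total × column total / n, which the notes do not state; the names and counts are hypothetical:

```python
def chi_square_statistic(observed):
    """Return (chi-square, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected count under Ho
            chi2 += (f_o - f_e) ** 2 / f_e

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Hypothetical 2 x 2 table of observed counts
chi2, df = chi_square_statistic([[10, 20], [20, 10]])
print(chi2, df)
```

The p-value is still taken from the PHStat output or a chi-square table for the returned degrees of freedom.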
Two Sample test of Medians with numerical data
Hypothesis
Ho: $M_1 = M_2$     Ha: $M_1 \ne M_2$    a two tail test
Ho: $M_1 \le M_2$   Ha: $M_1 > M_2$      upper tail test
Ho: $M_1 \ge M_2$   Ha: $M_1 < M_2$      lower tail test
Procedure
Raw Data: PHStat | Two-Sample Tests | Wilcoxon Rank Sum Test
Summary Data: No tests available.
Test Statistic
Z
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Kruskal-Wallis Rank Test for Differences Between c Medians
Hypothesis
Ho: $M_1 = M_2 = M_3 = \dots = M_c$
Ha: Not all $M_j$ are equal (j = 1, 2, …, c)
Procedure
Raw Data
PHStat | Multiple-Sample Tests | Kruskal-Wallis Rank Test
Summary Data No PHStat or Excel calculation available
Test Statistic
H
Decision Rule
If the p-value is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (Question asked)
If not rejected – There is not sufficient evidence that (Question asked)
Chapter 12 Simple Linear Regression
Linear Regression Model: relationship represented as $\hat{Y}_i = b_0 + b_1 X_i$
Determining if the linear model is significant
Hypothesis
Ho: $\beta_1 = 0$
Ha: $\beta_1 \ne 0$
Procedure
PHStat | Regression | Simple Linear Regression or
Tools | Data Analysis | Regression
Test Statistic
F
Decision Rule
If the Significance F (a p-value) is less than alpha Reject the Hypothesis
If the Significance F is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence to accept the linear regression model
If not rejected – There is not sufficient evidence of a linear model end the analysis
Confidence Interval estimate of $\beta_1$ is found on the ANOVA output.
See the independent variable line under Lower 95% and Upper 95%.
Confidence interval estimates for the dependent variable: be sure to check the input box and insert a value.
Durbin Watson statistic for autocorrelation: be sure to check the input box.
Additional measures from the regression
Standard error of the estimate: a measure of variability of the data around the regression line
Coefficient of determination ($r^2$): measures the percent of the variation in the dependent variable Y that is
explained by the independent variable X in the regression model. Shows the strength of the relationship.
Adjusted $r^2$: modifies the $r^2$ for the number of explanatory variables in the model and the sample size
Sample coefficient of correlation (r): estimator of $\rho$
Checking the Assumptions of regression:
1. Normality - to check normality analyze the normal probability plot of the sample values.
2. Homoscedasticity - variation around the regression line must be constant for all values of X
to check analyze the residual plot for horn shape.
3. Independent residuals - to check analyze residual plot for randomness.
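The main numbers on the regression output can be reproduced with a few lines of Python. This is a least squares sketch for one independent variable; the function name and the data are hypothetical:

```python
from math import sqrt
from statistics import mean

def simple_linear_regression(x, y):
    """Return (b0, b1, r_squared, standard_error_of_estimate)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / ssx               # slope
    b0 = y_bar - b1 * x_bar      # Y intercept

    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
    r_squared = 1 - sse / sst
    s_yx = sqrt(sse / (n - 2))   # standard error of the estimate
    return b0, b1, r_squared, s_yx

# Hypothetical data
b0, b1, r2, s_yx = simple_linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(b0, b1, r2, s_yx)

# The residuals are what the residual plot in the assumption checks displays
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])]
print(residuals)
```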
Chapter 13 Introduction to Multiple Regression
Multiple Regression Model: represented as
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + \dots + b_k X_{ki}$
Determining if the multiple linear model is significant
Hypothesis
Ho: $\beta_1 = \beta_2 = \dots = \beta_k = 0$
Ha: Not all $\beta$'s = 0
Procedure
PHStat | Regression | Multiple Regression or
Tools | Data Analysis | Regression
Test Statistic
F
Decision Rule
If the significant F (a p-value) is less than alpha Reject the Hypothesis
If the significant F is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that all or part of the model is significant
(proceed with the analysis)
If not rejected – There is not sufficient evidence of a linear model (end the analysis)
where k equals the number of variables
Determining which variables are significant
Hypothesis
Ho: $\beta_1 = 0$    Ha: $\beta_1 \ne 0$
Ho: $\beta_2 = 0$    Ha: $\beta_2 \ne 0$
…
Ho: $\beta_k = 0$    Ha: $\beta_k \ne 0$
Procedure
PHStat | Regression | Multiple Regression or
Tools | Data Analysis | Regression
Test Statistic
t
Decision Rule
If the p-value of the t statistic is less than alpha Reject the Hypothesis
If the p-value is greater than or equal to alpha Fail to Reject the Hypothesis
Conclusion
If rejected – There is sufficient evidence that (variable) is significant
If not rejected – There is not sufficient evidence to prove (variable) is significant
Confidence Interval estimates of the $\beta$'s are found on the ANOVA output.
See the variable lines under Lower 95% and Upper 95%.
Confidence interval estimates for the dependent variable: be sure to check the input box and insert
a value.
Durbin Watson statistic for autocorrelation: be sure to check the input box.
Additional measures from the regression
Standard error of the estimate: a measure of variability of the data around the regression line
Coefficient of multiple determination ($r^2$): measures the percent of the variation in the dependent variable Y
that is explained by the independent variables in the regression model. Shows the strength of the relationship.
Adjusted $r^2$: modifies the $r^2$ for the number of explanatory variables in the model and the sample size
Sample coefficient of correlation (r): estimator of $\rho$
Coefficient of partial determination ($r^2_Y$): contribution of each variable holding the others constant
Dummy Variables Model
Used to include categorical variables.
Prepare a data matrix with Y, X1..Xn and dummy variables with 1 representing the characteristic and 0 its
absence
Y      X1     … Xn    XD1 … XDn
3.8    3              1
4.2    2              0
.      .              0
.      .              1
Follow the usual multiple regression procedures. Then test for an interaction term between numerical and
categorical variables. If the interaction term is significant you cannot use the dummy variable.
Dummy Variables Interactions Model
Used to check on the interaction between the numerical and categorical variables.
To test for an interaction term prepare a data matrix with Y, X1..n , the dummy variables and the product of the
dummy variable and the numerical variables
Y      X1..n    XD1..n    X1 * XD
3.8    3        1         3
4.2    2        0         0
.      .        0
.      .        1
Include all possible combinations
Follow the usual multiple regression procedures.
Independent Variables Interactions Model used to check on interaction between numerical variables.
To test for an interaction term prepare a data matrix with Y, X1..Xn and the product of all pairs of numerical
variables
Prepare a data matrix with Y, X1..Xn and Xa*Xb
Y      X1     X2     Xa * Xb
3.8    3      11     33
4.2    2      15     30
.      .      12
.      .      18
Include all possible combinations
Follow the usual multiple regression procedures.
Chapter 14 Multiple Regression Model Building
The Quadratic model
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_{11} X_{1i}^2$
b0 = estimated Y intercept
b1 = estimated linear effect on Y
b11 = estimated curvilinear effect on Y
Prepare a data matrix with the dependent variable Y and the independent variables X and $X^2$
Y      X      $X^2$
3.8    3      9
4.2    2      4
…
Do a multiple regression with X and $X^2$ as the independent variables.
The Square-Root Transformation Model
$\hat{Y}_i = b_0 + b_1 \sqrt{X_{1i}}$
Prepare a data matrix with the dependent variable Y and the independent variable square root of X
Y      X      $\sqrt{X}$
3.8    9      3
4.2    4      2
…
Do a simple linear regression with $\sqrt{X}$ as the independent variable.
The Log Multiplicative Model
$\hat{Y}_i = b_0 X_{1i}^{b_1} X_{2i}^{b_2}$
$\log \hat{Y}_i = \log b_0 + b_1 \log X_{1i} + b_2 \log X_{2i}$
Prepare a data matrix with Y, X1, X2 and their logs: use =Log(..)
Y      X1     X2     Log Y      Log X1     Log X2
3.8    3      9      .579784    .477121    .954243
4.2    5      8      .623249    .69897     .90309
…
Do a multiple regression with Log Y as the dependent variable and Log X1 and Log X2 as the independent variables.
To convert your predictions to the original data range take 10 to the power Log Y
The Natural Log Exponential Model
$\hat{Y}_i = e^{b_0 + b_1 X_{1i} + b_2 X_{2i}}$
$\ln \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$
Prepare a data matrix with Y, X1, X2 and their logs: use =ln(..)
Y      X1     X2     Ln Y        Ln X1       Ln X2
3.8    3      9      1.335001    1.098612    2.197225
4.2    5      8      1.435085    1.609438    2.079442
…
Do a multiple regression with Ln Y as the dependent variable and X1 and X2 as the independent variables.
To convert your predictions to the original data range take e to the power Ln Y or use =Exp(Ln Y)
Model Building
Stepwise Regression – limited evaluation of alternative models
Procedure:
PHStat | Regression | Stepwise Regression
Best-Subsets – all possible subsets of the independent variables.
Procedure:
PHStat | Regression | Best Subsets
1. Fit a model with all the independent variables and check the VIF box.
2. If all VIFs are $\le 10$ proceed to the next step;
otherwise eliminate the variable with the highest VIF and go back to step 1.
3. Sort the results by the adjusted $r^2$ and select the model with the fewest variables if the $r^2$'s are close. Or
sort the results by Cp, keep the models with Cp $\le$ k+1 (k = total number of variables), and pick the best.
Chapter 15 Time-Series Forecasting and Index Numbers
Time-Series models use the same least squares technique as regression models. Only the data is different.
The Linear model
$\hat{Y}_i = b_0 + b_1 X_i$
b0 = estimated Y intercept
b1 = estimated linear effect on Y
Prepare a data matrix with the dependent variable Y and the independent variable X
Y      X
3.8    1
4.2    2
…
Do a simple linear regression with X as the independent variable.
Forecast by plugging the next X value into the linear equation
The Quadratic model
$\hat{Y}_i = b_0 + b_1 X_i + b_{11} X_i^2$
b0 = estimated Y intercept
b1 = estimated linear effect on Y
b11 = estimated curvilinear effect on Y
Prepare a data matrix with the dependent variable Y and the independent variables X and $X^2$
Y      X      $X^2$
3.8    1      1
4.2    2      4
…
Do a multiple regression with X and $X^2$ as the independent variables.
Forecast by plugging the next X value into the quadratic equation
The Exponential model
$\hat{Y}_i = b_0 b_1^{X_i}$
b0 = estimated Y intercept
b1 = the compound growth factor, where (b1 - 1)*100% is the compound growth rate
Prepare a data matrix with the independent variable X and the common log of the dependent variable Y
Y      log Y    X
3.8    .5798    1
4.2    .6232    2
…
The data for the independent variable is often time series data where X is the year or month.
Do a linear regression with X as the independent variable, and the log of Y as the dependent variable.
This provides the following transformation
$\log \hat{Y}_i = \log b_0 + X_i \log b_1$
Forecast by plugging the X value into this linear equation yielding the log of Y.
Take 10 to the power (log of Y) to get the antilog which is the actual Y forecast: $\hat{Y} = 10^{\log \hat{Y}}$
or
Take 10 to the power (log of b0) to get the antilog which is the actual b0 then
Take 10 to the power (log of b1) to get the antilog which is the actual b1 and use the exponential equation
$\hat{Y}_i = b_0 b_1^{X_i}$
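The exponential model steps (take the common log of Y, regress on X, then take antilogs) can be sketched as follows. The function name is hypothetical, and the data are made up so that Y grows by exactly 50% per period:

```python
from math import log10

def fit_exponential_trend(x, y):
    """Fit the exponential model by regressing log10(Y) on X.

    Returns (b0, b1), the antilogs of the fitted intercept and slope.
    """
    log_y = [log10(v) for v in y]
    n = len(x)
    x_bar = sum(x) / n
    log_y_bar = sum(log_y) / n
    slope = (
        sum((xi - x_bar) * (li - log_y_bar) for xi, li in zip(x, log_y))
        / sum((xi - x_bar) ** 2 for xi in x)
    )
    intercept = log_y_bar - slope * x_bar
    return 10**intercept, 10**slope  # antilogs give the actual b0 and b1

# Hypothetical series: Y = 2 * 1.5**X for X = 1..4
x = [1, 2, 3, 4]
y = [3.0, 4.5, 6.75, 10.125]
b0, b1 = fit_exponential_trend(x, y)

growth_rate_percent = (b1 - 1) * 100   # compound growth rate
forecast_next = b0 * b1**5             # forecast for the next period, X = 5
print(b0, b1, growth_rate_percent, forecast_next)
```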
Autoregressive Models
First-Order autoregressive model: $Y_i = A_0 + A_1 Y_{i-1} + \delta_i$
Second-Order autoregressive model: $Y_i = A_0 + A_1 Y_{i-1} + A_2 Y_{i-2} + \delta_i$
Third-Order autoregressive model: $Y_i = A_0 + A_1 Y_{i-1} + A_2 Y_{i-2} + A_3 Y_{i-3} + \delta_i$
pth-Order autoregressive model: $Y_i = A_0 + A_1 Y_{i-1} + A_2 Y_{i-2} + A_3 Y_{i-3} + \dots + A_p Y_{i-p} + \delta_i$
Autoregressive models lag the dependent variable data by one or more periods to provide a weighted moving
average of the previous values of the variable Y.
Third-Order autoregressive model
Prepare a data matrix with the dependent variable Y and lagged versions of Y as the independent variables
X      Y      Y lag 1    Y lag 2    Y lag 3
1      3.8
2      4.2    3.8
3      3.0    4.2        3.8
4      4.6    3.0        4.2        3.8
.      5.0    4.6        3.0        4.2
.      .      .          .          .
For this type of autoregressive analysis the X variable is not needed. The dependent variable is Y and the
independent variables are the lagged versions of Y
For first-order autoregressive, do a simple linear regression with Y lag 1 as the independent variable and Y as the
dependent. Forecast by plugging the last Y value into the equation. Forecast additional periods into the future
by using the most recently forecast value as the independent variable.
For second-order autoregressive, do a multiple regression with Y lag 1, and Y lag 2 as the independent
variables. Forecast by plugging the last two Y values into the equation. Forecast additional periods into the
future by using the most recently forecast values and previous values of Y as needed for the independent
variables.
For third-order autoregressive, do a multiple regression with Y lag 1,Y lag 2, and Y lag 3 as the independent
variables. Forecast by plugging the last three Y values into the equation. Forecast additional periods into the
future by using the most recently forecast values and previous values of Y as needed for the independent
variables.
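Building the lagged data matrix and rolling the forecast forward can be sketched in Python. The Y values are the ones in the table above; the function names and the coefficients A0, A1, A2 are hypothetical stand-ins for the regression output:

```python
def lag_matrix(y, order):
    """Return rows of (Y, [Y lag 1, ..., Y lag order]).

    The first `order` observations are dropped because their lags are missing.
    """
    rows = []
    for i in range(order, len(y)):
        lags = [y[i - k] for k in range(1, order + 1)]
        rows.append((y[i], lags))
    return rows

def autoregressive_forecast(y, coefficients, periods):
    """Forecast `periods` steps ahead; coefficients = [A0, A1, ..., Ap]."""
    history = list(y)
    p = len(coefficients) - 1
    forecasts = []
    for _ in range(periods):
        value = coefficients[0] + sum(
            coefficients[k] * history[-k] for k in range(1, p + 1)
        )
        history.append(value)  # later periods reuse the most recent forecast
        forecasts.append(value)
    return forecasts

y = [3.8, 4.2, 3.0, 4.6, 5.0]
for row in lag_matrix(y, order=3):
    print(row)

# Hypothetical second-order coefficients A0 = 1.0, A1 = 0.5, A2 = 0.25
print(autoregressive_forecast(y, [1.0, 0.5, 0.25], periods=2))
```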
Choosing the Best Model
Choose the model with the best adjusted $r^2$; where the $r^2$'s are close, choose the simplest model.