Survey
Analysis
• Preliminary examination and classification of open-ended responses
• Coding and input
• Transformations
• Description, checking of normality
• Testing the hypotheses
• Discussion and conclusions
• Software: Excel, SPSS, SAS, Statgraphics, DataFit, E-Views, Stata, etc.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Coding
• Numerical variables if at all possible
• Record exact values first; you can classify later
• Define informative variable and value labels
• Decide what to do with missing data (truly missing or NA, cases with many missing values, variables with many missing values, imputation?)
• Decide what counts as a missing value (checklists)
• Identification variables (ID number, dates, interviewer, etc.)

Transformations
• Classification of open-ended responses
• Classifying continuous variables
• Reversing items (1=5, 2=4, 4=2, 5=1)
• Computing multi-item scales
• Computing lags, logs, ratios, or other new variables
• Checking for inconsistent responses
• Removing outliers?

ANALYSIS, "THEORY"
• Data description: graphics and statistics
• Statistical inference: estimation, standard errors, and confidence intervals
• Statistical inference: hypothesis testing
• Testing for a single mean
• Contingency tables and chi-square tests
• Testing for equality of two means: t-test
• Testing for equality of three or more means: ANOVA
• Correlation

1.4 Types of Data
Data are either categorical or numerical; numerical data are either discrete or continuous.
• Categorical examples: marital status, political party, eye color (defined categories)
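The item-reversal and multi-item-scale transformations listed above can be sketched in Python; the helper names and the 1–5 response scale are illustrative assumptions, not part of any particular coding scheme:

```python
# Hypothetical illustration of two transformations from the list above:
# reverse-coding a 5-point item (1=5, 2=4, 3=3, 4=2, 5=1) and
# computing a multi-item scale as the mean of its items.

def reverse_item(value, scale_max=5):
    """Reverse-code a response on a 1..scale_max scale."""
    return scale_max + 1 - value

def scale_score(items):
    """Multi-item scale score: the mean of the (reverse-coded) items."""
    return sum(items) / len(items)

responses = [1, 2, 3, 4, 5]
print([reverse_item(v) for v in responses])  # [5, 4, 3, 2, 1]
print(scale_score([4, 5, 3]))                # 4.0
```

In practice a statistical package (SPSS RECODE, Stata `recode`, etc.) does the same arithmetic; the point is that reversal is just `scale_max + 1 - value`.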
• Numerical, discrete examples: number of children, defects per hour (counted items)
• Numerical, continuous examples: weight, voltage (measured characteristics)
Business Statistics: A First Course, 5e © 2009 Prentice-Hall, Inc.

1.5 Nominal Data [qualitative or categorical]
• The values of nominal data are categories, e.g. responses to questions about marital status, coded as: Single = 1, Married = 2, Divorced = 3, Widowed = 4.
• Because the numbers are arbitrary, arithmetic operations don't make any sense (e.g. does Widowed ÷ 2 = Married?!).
• Nominal data have no natural order to their values.

1.6 Ordinal Data [qualitative or categorical]
• Ordinal data appear to be categorical in nature, but their values have an order, a ranking, e.g. a college course rating system: poor = 1, fair = 2, good = 3, very good = 4, excellent = 5.
• While it is still not meaningful to do arithmetic on these data (e.g. does 2 × fair = very good?!), we can say things like: excellent > poor, or fair < very good.
• That is, order is maintained no matter what numeric values are assigned to each category.

1.7 Interval and Ratio Data
• Real numbers, e.g. heights, weights, prices. Also referred to as quantitative or numerical.
• Arithmetic operations can be performed on interval data, so it is meaningful to talk about 2 × Height, or Price + $1, and so on.
• Interval data have no natural "0", e.g. temperature: 100 degrees is 50 degrees hotter than 50 degrees, BUT not twice as hot.
• Ratio data have a natural "0", e.g. weight: 100 pounds is 50 pounds heavier than 50 pounds, AND is twice as heavy.

1.8 Graphical & Tabular Techniques for Nominal Data
The only allowable calculation on nominal data is to count the frequency of each value of the variable.
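Counting frequencies, the one allowable calculation on nominal data, can be sketched as follows (the data vector is made up for illustration):

```python
# Frequency and relative frequency distributions for nominal data,
# using the marital-status coding above (data values are hypothetical).
from collections import Counter

labels = {1: "Single", 2: "Married", 3: "Divorced", 4: "Widowed"}
data = [1, 2, 2, 3, 1, 2, 4, 2]

freq = Counter(labels[x] for x in data)            # frequency distribution
rel = {k: v / len(data) for k, v in freq.items()}  # relative frequencies
print(freq["Married"])  # 4
print(rel["Married"])   # 0.5
```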
We can summarize the data in a table that presents the categories and their counts, called a frequency distribution. A relative frequency distribution lists the categories and the proportion with which each occurs. Refer to Example 2.1.

1.9 Nominal Data (Tabular Summary)

1.10 Nominal Data (Frequency)
Bar charts are often used to display frequencies. Is there a better way to order these categories?

1.11 Nominal Data (Relative Frequency)
Pie charts show relative frequencies.

1.12 Ordinal Data: Cumulative Relative Frequencies
first class: .355
next class: .355 + .185 = .540
⋮
last class: .930 + .070 = 1.00

1.13 Graphical Techniques for Interval Data
There are several graphical methods that are used when the data are interval (i.e. numeric, non-categorical). The most important of these is the histogram. The histogram is not only a powerful graphical technique for summarizing interval data; it is also used to help explain probabilities.

1.14 Building a Histogram
1) Collect the data.
2) Create a frequency distribution for the data.
3) Draw the histogram.

1.15 Interpreting the Histogram
(18 + 28 + 14 = 60) ÷ 200 = 30%, i.e. nearly a third of the phone bills are greater than $75.
About half (71 + 37 = 108) of the bills are "small", i.e. less than $30.
There are only a few telephone bills in the middle range.
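The three histogram-building steps above can be sketched as follows; the bill amounts and class boundaries here are illustrative, not the textbook's data set:

```python
# Step 2 of building a histogram: a frequency distribution that counts
# observations in each [lo, hi) class (hypothetical phone-bill data).

def frequency_distribution(data, bin_edges):
    """Count how many observations fall in each [lo, hi) class."""
    counts = [0] * (len(bin_edges) - 1)
    for x in data:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= x < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts

bills = [12, 25, 27, 45, 80, 95, 110, 18, 22, 76]
print(frequency_distribution(bills, [0, 30, 75, 120]))  # [5, 1, 4]
```

Step 3 is then just drawing one bar per class with height equal to its count.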
1.16 Shapes of Histograms
Symmetry: a histogram is said to be symmetric if, when we draw a vertical line down its center, the two sides are identical in shape and size.
Skewness: a skewed histogram is one with a long tail extending to either the right (positively skewed) or the left (negatively skewed).
Bell shape: a special type of symmetric unimodal histogram is one that is bell shaped. Many statistical techniques require that the population be bell shaped; drawing the histogram helps verify the shape of the population in question.

1.19 Measures of Central Tendency
• Arithmetic mean: X̄ = (Σ_{i=1..n} X_i) / n
• Median: the middle value in the ordered array
• Mode: the most frequently observed value

1.20 Measures of Variation
Measures of variation give information on the spread, variability, or dispersion of the data values: range, variance, standard deviation, coefficient of variation. Two data sets can have the same center but different variation.

1.21 Measures of Variation: Variance
The variance is (approximately) the average of the squared deviations of the values from the mean. Sample variance:
S² = Σ_{i=1..n} (X_i − X̄)² / (n − 1)
where X̄ = arithmetic mean, n = sample size, and X_i = the i-th value of the variable X.
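The sample mean and sample variance formulas above translate directly to code (the data values are illustrative):

```python
# Sample mean X̄ = ΣX_i / n and sample variance S² = Σ(X_i − X̄)² / (n − 1).

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
print(sample_mean(data))                # 5.0
print(round(sample_variance(data), 4))  # 4.5714
```

Note the n − 1 divisor: this is the sample variance, not the population variance.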
1.22 Measures of Variation: Standard Deviation
• The most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data
Sample standard deviation: S = √[ Σ_{i=1..n} (X_i − X̄)² / (n − 1) ]

1.23 Locating Extreme Outliers: Z-Score
Z = (X − X̄) / S
where X is the data value, X̄ is the sample mean, and S is the sample standard deviation.

1.24 Locating Extreme Outliers: Z-Score Example
Suppose the mean math SAT score is 490, with a standard deviation of 100. Compute the Z-score for a test score of 620:
Z = (X − X̄) / S = (620 − 490) / 100 = 130 / 100 = 1.3
A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.

1.25 Key Statistical Concepts
A sample is a subset of a population. Populations have parameters; samples have statistics.

1.26 Statistical Inference
Statistical inference is the process of making an estimate, prediction, or decision about a population (parameter) based on a sample (statistic). What can we infer about a population's parameters based on a sample's statistics?

1.27 Confidence & Significance Levels
The confidence level 1 − α is the proportion of times that an estimating procedure will be correct. E.g. a confidence level of 95% means that estimates based on this form of statistical inference will be correct 95% of the time.
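The SAT Z-score example above, as a minimal sketch:

```python
# The SAT example above: Z = (X − X̄) / S with X̄ = 490, S = 100, X = 620.

def z_score(x, mean, sd):
    return (x - mean) / sd

z = z_score(620, 490, 100)
print(z)  # 1.3
# A common rule of thumb flags |Z| > 3 as an extreme outlier:
print(abs(z) > 3)  # False
```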
When the purpose of the statistical inference is to draw a conclusion about a population, the significance level α measures how frequently the conclusion will be wrong in the long run. E.g. a 5% significance level means that, in the long run, this type of conclusion will be wrong 5% of the time.

Sampling Distributions
A sampling distribution is the distribution of all possible values of a sample statistic for a given sample size selected from a population. For example, suppose you sample 50 students from your college regarding their mean GPA. If you obtained many different samples of 50, you would compute a different mean for each sample. We are interested in the distribution of all potential mean GPAs we might calculate for any given sample of 50 students.
Basic Business Statistics, 11e © 2009

Developing a Sampling Distribution
Assume there is a population of size N = 4. The random variable, X, is the age of individuals A, B, C, and D; the values of X are 18, 20, 22, and 24 (years).

Summary measures for the population distribution:
μ = ΣX_i / N = (18 + 20 + 22 + 24) / 4 = 21
σ = √[ Σ(X_i − μ)² / N ] = 2.236
The population distribution is uniform: each of the four ages occurs with probability .25.
Now consider all possible samples of size n = 2 drawn with replacement: 16 possible samples. The sample means are:

1st Obs \ 2nd Obs   18   20   22   24
18                  18   19   20   21
20                  19   20   21   22
22                  20   21   22   23
24                  21   22   23   24

Sampling Distribution of All Sample Means
The distribution of these 16 sample means is no longer uniform; it is peaked at 21.

Summary measures of this sampling distribution:
μ_X̄ = ΣX̄_i / 16 = (18 + 19 + 19 + … + 24) / 16 = 21
σ_X̄ = √[ Σ(X̄_i − μ_X̄)² / 16 ] = √[ ((18 − 21)² + (19 − 21)² + … + (24 − 21)²) / 16 ] = 1.58

Comparing the Population Distribution to the Sample Means Distribution
Population (N = 4): μ = 21, σ = 2.236
Sample means (n = 2): μ_X̄ = 21, σ_X̄ = 1.58
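The N = 4 population above can be enumerated exhaustively to verify these summary measures:

```python
# Ages 18, 20, 22, 24: enumerate all 16 samples of size n = 2 drawn
# with replacement and summarize the distribution of the sample mean.
from itertools import product
from math import sqrt

population = [18, 20, 22, 24]
means = [(a + b) / 2 for a, b in product(population, repeat=2)]

mu_xbar = sum(means) / len(means)
sigma_xbar = sqrt(sum((m - mu_xbar) ** 2 for m in means) / len(means))
print(len(means))            # 16
print(mu_xbar)               # 21.0 (equals the population mean)
print(round(sigma_xbar, 2))  # 1.58
```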
Sample Mean Sampling Distribution: Standard Error of the Mean
Different samples of the same size from the same population will yield different sample means. A measure of the variability in the mean from sample to sample is given by the standard error of the mean:
σ_X̄ = σ / √n
(This assumes that sampling is with replacement, or without replacement from an infinite population.) Note that the standard error of the mean decreases as the sample size increases.

Sample Mean Sampling Distribution: If the Population is Normal
If a population is normally distributed with mean μ and standard deviation σ, the sampling distribution of X̄ is also normally distributed, with μ_X̄ = μ and σ_X̄ = σ / √n.

Z-value for the Sampling Distribution of the Mean
Z = (X̄ − μ_X̄) / σ_X̄ = (X̄ − μ) / (σ / √n)
where X̄ = sample mean, μ = population mean, σ = population standard deviation, and n = sample size.

Sampling Distribution Properties
• μ_X̄ = μ (i.e., X̄ is unbiased): for a normal population distribution, the sampling distribution is normal with the same mean.
• As n increases, σ_X̄ decreases: a larger sample size gives a narrower sampling distribution than a smaller one.

Determining an Interval Including a Fixed Proportion of the Sample Means
Find a symmetrically distributed interval around μ that will include 95% of the sample means when μ = 368, σ = 15, and n = 25.
Since the interval contains 95% of the sample means, 5% of the sample means will be outside the interval. Since the interval is symmetric, 2.5% will be above the upper limit and 2.5% will be below the lower limit. From the standardized normal table, the Z score with 2.5% (0.0250) below it is −1.96 and the Z score with 2.5% (0.0250) above it is +1.96.

Calculating the lower limit of the interval:
X̄_L = μ − Z · σ/√n = 368 − 1.96 · 15/√25 = 362.12
Calculating the upper limit of the interval:
X̄_U = μ + Z · σ/√n = 368 + 1.96 · 15/√25 = 373.88
So 95% of all sample means of sample size 25 are between 362.12 and 373.88. This is the confidence interval for the mean.

Confidence Intervals
The general formula for all confidence intervals is:
Point Estimate ± (Critical Value)(Standard Error)
where:
• Point estimate is the sample statistic estimating the population parameter of interest
• Critical value is a table value based on the sampling distribution of the point estimate and the desired confidence level
• Standard error is the standard deviation of the point estimate

Hypothesis Testing
In addition to estimation, hypothesis testing is a procedure for making inferences about a population. Hypothesis testing allows us to determine whether enough statistical evidence exists to conclude that a belief (i.e. hypothesis) about a parameter is supported by the data.

Nonstatistical Hypothesis Testing
A criminal trial is an example of hypothesis testing without the statistics.
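The 95% interval computed above, μ ± Z·σ/√n, checks out numerically:

```python
# μ ± Z·σ/√n with μ = 368, σ = 15, n = 25, Z = 1.96 (the example above).
from math import sqrt

mu, sigma, n, z = 368, 15, 25, 1.96
se = sigma / sqrt(n)  # standard error of the mean = 3.0
lower, upper = mu - z * se, mu + z * se
print(round(lower, 2), round(upper, 2))  # 362.12 373.88
```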
In a trial, a jury must decide between two hypotheses. The null hypothesis is
H0: The defendant is innocent.
The alternative (research) hypothesis is
H1: The defendant is guilty.
The jury does not know which hypothesis is true; they must make a decision on the basis of the evidence presented.

In the language of statistics, convicting the defendant is called rejecting the null hypothesis in favor of the alternative hypothesis. That is, the jury is saying that there is enough evidence to conclude that the defendant is guilty (i.e., there is enough evidence to support the alternative hypothesis). If the jury acquits, it is stating that there is not enough evidence to support the alternative hypothesis. Notice that the jury is not saying that the defendant is innocent, only that there is not enough evidence to support the alternative hypothesis. That is why we never say that we accept the null hypothesis.

There are two possible errors. A Type I error occurs when we reject a true null hypothesis; that is, a Type I error occurs when the jury convicts an innocent person. A Type II error occurs when we don't reject a false null hypothesis; that occurs when a guilty defendant is acquitted. The probability of a Type I error is denoted α (the Greek letter alpha); the probability of a Type II error is β (beta). The two probabilities are inversely related: decreasing one increases the other.

In our judicial system, Type I errors are regarded as more serious: we try to avoid convicting innocent people, and we are more willing to acquit guilty people.
We arrange to make α small by requiring the prosecution to prove its case and instructing the jury to find the defendant guilty only if there is "evidence beyond a reasonable doubt."

The critical concepts are these:
1. There are two hypotheses: the null and the alternative.
2. The procedure begins with the assumption that the null hypothesis is true.
3. The goal is to determine whether there is enough evidence to infer that the alternative hypothesis is true.
4. There are two possible decisions: conclude that there is enough evidence to support the alternative hypothesis, or conclude that there is not enough evidence to support it.
5. Two possible errors can be made. Type I error: reject a true null hypothesis. Type II error: do not reject a false null hypothesis. P(Type I error) = α and P(Type II error) = β.

Testing for a Single Mean: Example 11.1
A department store manager determines that a new billing system will be cost-effective only if the mean monthly account is more than $170. A random sample of 400 monthly accounts is drawn, for which the sample mean is $178. The accounts are approximately normally distributed with a standard deviation of $65. Can we conclude that the new system will be cost-effective?

The system will be cost-effective if the mean account balance for all customers is greater than $170.
We express this belief as our research hypothesis:
H1: μ > 170 (this is what we want to determine)
Thus, our null hypothesis becomes:
H0: μ = 170 (this specifies a single value for the parameter of interest)

What we want to show: H1: μ > 170, assuming H0: μ = 170 is true. We know: n = 400, X̄ = 178, and σ = 65. What to do next?

To test our hypotheses, we can use two different approaches: the rejection-region approach (typically used when computing statistics manually), and the p-value approach (generally used with a computer and statistical software). We will explore only the latter.

Standardized Test Statistic
An easier method is to use the standardized test statistic
z = (X̄ − μ) / (σ / √n)
and compare its result to the critical value z_α (rejection region: z > z_α). Since z = (178 − 170) / (65/√400) = 2.46 > 1.645 (= z.05), we reject H0 in favor of H1.

p-Value
The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed, given that the null hypothesis is true. In the case of our department store example: what is the probability of observing a sample mean at least as extreme as the one already observed (i.e. X̄ = 178), given that the null hypothesis (H0: μ = 170) is true? That probability is the p-value.

Interpreting the p-value
Overwhelming evidence (highly significant): p < .01
Strong evidence (significant): .01 ≤ p < .05
Weak evidence (not significant): .05 ≤ p < .10
No evidence (not significant): p ≥ .10
Here p = .0069.
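The Example 11.1 calculation above, including the p-value, can be reproduced with the standard-normal tail area:

```python
# z = (X̄ − μ) / (σ/√n) with X̄ = 178, μ = 170, σ = 65, n = 400,
# and the upper-tail p-value P(Z > z).
from math import sqrt, erf

xbar, mu, sigma, n = 178, 170, 65, 400
z = (xbar - mu) / (sigma / sqrt(n))
p_value = 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail area of the standard normal
print(round(z, 2))        # 2.46
print(round(p_value, 4))  # 0.0069
print(z > 1.645)          # True: reject H0 at the 5% level
```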
Compare the p-value with the selected value of the significance level α: if the p-value is less than α, we judge it small enough to reject the null hypothesis; if the p-value is greater than α, we do not reject the null hypothesis. Since p-value = .0069 < α = .05, we reject H0 in favor of H1.

Example 11.2
AT&T argues that its rates are such that customers won't see a difference in their phone bills between it and its competitors. AT&T calculates the mean and standard deviation of the bills for all its customers as $17.09 and $3.87, respectively. It then samples 100 customers at random and recalculates their monthly phone bills based on the competitor's rates. What we want to show is whether or not H1: μ ≠ 17.09. We do this by assuming that H0: μ = 17.09.

The rejection region is set up so we can reject the null hypothesis when the test statistic is large or when it is small; that is, we set up a two-tail rejection region. The total area in the rejection region must sum to α, so we divide this probability by 2.

At a 5% significance level (i.e. α = .05), we have α/2 = .025. Thus z.025 = 1.96 and our rejection region is: z < −1.96 or z > +1.96.

From the data, we calculate X̄ = 17.55. Using our standardized test statistic, we find that z = 1.19. Since z = 1.19 is not greater than 1.96, nor less than −1.96, we cannot reject the null hypothesis in favor of H1. That is, "there is insufficient evidence to infer that there is a difference between the bills of AT&T and the competitor."
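The two-tail test of Example 11.2 in the same style:

```python
# Two-tail test with μ = 17.09, σ = 3.87, n = 100 and observed X̄ = 17.55.
from math import sqrt

xbar, mu, sigma, n = 17.55, 17.09, 3.87, 100
z = (xbar - mu) / (sigma / sqrt(n))
print(round(z, 2))    # 1.19
print(abs(z) > 1.96)  # False: cannot reject H0
```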
Summary of One- and Two-Tail Tests
• One-tail test (left tail): reject H0 only for test statistics below the (negative) critical value.
• Two-tail test: reject H0 for test statistics outside ±z_{α/2}.
• One-tail test (right tail): reject H0 only for test statistics above the (positive) critical value.

Chi-Square Goodness of Fit

13.1 Goodness-of-Fit Tests
Given the following…
1) Counts of items in each of several categories
2) A model that predicts the distribution of the relative frequencies
…this question naturally arises: "Does the actual distribution differ from the model because of random error, or do the differences mean that the model does not fit the data?" In other words, "How good is the fit?"
© 2011 Pearson Education, Inc.

Example: Stock Market "Up" Days
A sample of 1000 "up" days is drawn from the population of stock market days. "Up" days appear to be more common than expected on certain days, especially on Fridays.
Null hypothesis: the distribution of "up" days is no different from the population distribution. Test the hypothesis with a chi-square goodness-of-fit test.

The Chi-Square Distribution
Note that χ² accumulates the relative squared deviation of each cell from its expected value. So χ² gets "big" when (i) the data set is large and/or (ii) the model is a poor fit.

The Chi-Square Calculation: Stock Market "Up" Days
χ² = (192 − 193.4)²/193.4 + … + (218 − 199.7)²/199.7 = 2.62
Using a chi-square table at a significance level of 0.05 and with 4 degrees of freedom: χ²₄ = 9.488 > 2.62. Do not reject the null hypothesis. (The fit is "good.")

13.2 Interpreting Chi-Square Values
The chi-square distribution is right-skewed and becomes broader with increasing degrees of freedom. The χ² test is a one-sided test.

13.3 Examining the Residuals
When we reject a null hypothesis, we can examine the residuals in each cell to discover which values are extraordinary. Because we might compare residuals for cells with very different counts, we should examine standardized residuals. Note that standardized residuals from goodness-of-fit tests are actually z-scores (which we already know how to interpret and analyze).

For the trading-days data, none of the standardized residuals is remarkable. The largest, Friday, at 1.292, is not impressive when viewed as a z-score. The deviations are in the direction of a "weekend effect", but they aren't quite large enough for us to conclude they are real.

Chi-Square Test of Independence

Contingency Table
In Example 2.8, a sample of newspaper readers was asked to report which newspaper they read: Globe and Mail (1), Post (2), Star (3), or Sun (4), and to indicate whether they were a blue-collar worker (1), a white-collar worker (2), or a professional (3).
This reader's response is captured as part of the total count in the contingency table.

Interpretation: the relative frequencies in columns 2 and 3 are similar, but there are large differences between columns 1 and 2 and between columns 1 and 3. This tells us that blue-collar workers tend to read different newspapers from both white-collar workers and professionals, and that white-collar workers and professionals are quite similar in their newspaper choices.

Graphing the Relationship Between Two Nominal Variables
Use the data from the contingency table to create bar charts. Professionals tend to read the Globe & Mail more than twice as often as the Star or Sun.

χ² Test of Independence
For contingency tables with r rows and c columns:
H0: the two categorical variables are independent (i.e., there is no relationship between them)
H1: the two categorical variables are dependent (i.e., there is a relationship between them)

The chi-square test statistic is
χ²_STAT = Σ over all cells of (f_o − f_e)² / f_e
where f_o is the observed frequency in a particular cell of the r × c table and f_e is the expected frequency in that cell if H0 is true. χ²_STAT for the r × c case has (r − 1)(c − 1) degrees of freedom. (Assumption: each cell of the contingency table has an expected frequency of at least 1.)
Expected Cell Frequencies
f_e = (row total × column total) / n
where row total = sum of all frequencies in the row, column total = sum of all frequencies in the column, and n = overall sample size.

Decision Rule
If χ²_STAT > χ²_α, reject H0; otherwise, do not reject H0, where χ²_α is from the chi-squared distribution with (r − 1)(c − 1) degrees of freedom.

Example
The meal plan selected by 200 students is shown below (observed frequencies, by number of meals per week):

Class Standing   20/week   10/week   none   Total
Fresh.           24        32        14     70
Soph.            22        26        12     60
Junior           10        14        6      30
Senior           14        16        10     40
Total            70        88        42     200

The hypotheses to be tested are:
H0: meal plan and class standing are independent (i.e., there is no relationship between them)
H1: meal plan and class standing are dependent (i.e., there is a relationship between them)

Expected cell frequencies if H0 is true, e.g. for the Junior / 20-per-week cell: f_e = (row total × column total) / n = (30 × 70) / 200 = 10.5. The full table of expected frequencies:

Class Standing   20/week   10/week   none   Total
Fresh.           24.5      30.8      14.7   70
Soph.            21.0      26.4      12.6   60
Junior           10.5      13.2      6.3    30
Senior           14.0      17.6      8.4    40
Total            70        88        42     200

The Test Statistic
χ²_STAT = Σ (f_o − f_e)²/f_e = (24 − 24.5)²/24.5 + (32 − 30.8)²/30.8 + … + (10 − 8.4)²/8.4 = 0.709
χ²_0.05 = 12.592 from the chi-squared distribution with (4 − 1)(3 − 1) = 6 degrees of freedom.

Decision and Interpretation
Decision rule: if χ²_STAT > 12.592, reject H0; otherwise do not reject H0. Here χ²_STAT = 0.709 < 12.592, so do not reject H0. Conclusion: there is not sufficient evidence that meal plan and class standing are related at α = 0.05.

Independent Samples t-Test

Difference Between Two Means
Goal: test a hypothesis or form a confidence interval for the difference between two population means, μ1 − μ2. The point estimate for the difference is X̄1 − X̄2. Two cases arise: σ1 and σ2 unknown but assumed equal, and σ1 and σ2 unknown and not assumed equal.

Difference Between Two Means: Independent Samples
The data sources are unrelated and independent: the sample selected from one population has no effect on the sample selected from the other population.
• If σ1 and σ2 are unknown but assumed equal, use S_p to estimate the unknown common σ.
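Before moving on, the meal-plan chi-square test above can be verified in full; the computation below follows the f_e = (row total × column total)/n recipe exactly:

```python
# The meal-plan example: expected frequencies f_e = (row total ×
# column total) / n, then χ²_STAT = Σ (f_o − f_e)² / f_e.

observed = [
    [24, 32, 14],  # Freshman:  20/wk, 10/wk, none
    [22, 26, 12],  # Sophomore
    [10, 14, 6],   # Junior
    [14, 16, 10],  # Senior
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

stat = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n
        stat += (fo - fe) ** 2 / fe

print(round(stat, 3))  # 0.709
print(stat > 12.592)   # False: do not reject H0 (6 d.f., α = 0.05)
```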
• σ1 and σ2 unknown, assumed equal: use a pooled-variance t test.
• σ1 and σ2 unknown, not assumed equal: use S1 and S2 to estimate σ1 and σ2 separately in a separate-variance t test.

Hypothesis Tests for Two Population Means (Independent Samples)
Lower-tail test: H0: μ1 – μ2 ≥ 0, H1: μ1 – μ2 < 0 (i.e., H0: μ1 ≥ μ2, H1: μ1 < μ2); reject H0 if tSTAT < –tα
Upper-tail test: H0: μ1 – μ2 ≤ 0, H1: μ1 – μ2 > 0 (i.e., H0: μ1 ≤ μ2, H1: μ1 > μ2); reject H0 if tSTAT > tα
Two-tail test: H0: μ1 – μ2 = 0, H1: μ1 – μ2 ≠ 0 (i.e., H0: μ1 = μ2, H1: μ1 ≠ μ2); reject H0 if tSTAT < –tα/2 or tSTAT > tα/2

Pooled-Variance t Test: Assumptions
• Samples are randomly and independently drawn
• Populations are normally distributed, or both sample sizes are at least 30
• Population variances are unknown but assumed equal

• The pooled variance is:
S²p = [(n1 – 1)S²1 + (n2 – 1)S²2] / [(n1 – 1) + (n2 – 1)]
• The test statistic is:
tSTAT = [(X̄1 – X̄2) – (μ1 – μ2)] / √( S²p (1/n1 + 1/n2) )
where tSTAT has d.f. = n1 + n2 – 2.

Confidence Interval for μ1 – μ2 (σ1, σ2 unknown, assumed equal)
(X̄1 – X̄2) ± tα/2 √( S²p (1/n1 + 1/n2) )
where tα/2 has d.f. = n1 + n2 – 2.

Pooled-Variance t Test Example
You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks listed on the NYSE and NASDAQ? You collect the following data:

                  NYSE    NASDAQ
Number              21        25
Sample mean       3.27      2.53
Sample std dev    1.30      1.16

Assuming both populations are approximately normal with equal variances, is there a difference in mean yield (α = 0.05)?

H0: μ1 – μ2 = 0 (i.e., μ1 = μ2); H1: μ1 – μ2 ≠ 0 (i.e., μ1 ≠ μ2)

S²p = [(21 – 1)(1.30²) + (25 – 1)(1.16²)] / [(21 – 1) + (25 – 1)] = 1.5021

tSTAT = (3.27 – 2.53 – 0) / √( 1.5021 (1/21 + 1/25) ) = 2.040

With α = 0.05 and d.f. = 21 + 25 – 2 = 44, the critical values are t = ±2.0154.
Since tSTAT = 2.040 > 2.0154, reject H0 at α = 0.05.
Conclusion: there is evidence of a difference in mean dividend yield.
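The NYSE/NASDAQ pooled-variance example above can be reproduced directly from the summary statistics (a sketch assuming SciPy is available):

```python
# Pooled-variance t test from summary statistics (NYSE vs. NASDAQ dividend yields).
from scipy.stats import ttest_ind_from_stats, t

stat, p = ttest_ind_from_stats(mean1=3.27, std1=1.30, nobs1=21,
                               mean2=2.53, std2=1.16, nobs2=25,
                               equal_var=True)  # equal_var=True -> pooled variance
crit = t.ppf(0.975, df=21 + 25 - 2)             # two-tail critical value at alpha = 0.05

print(round(stat, 3), round(crit, 4))  # -> 2.04 2.0154
print(abs(stat) > crit)                # -> True, so reject H0
```

Setting `equal_var=False` instead would give the separate-variance (Welch) test mentioned above for the case where σ1 and σ2 are not assumed equal.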
One-Way Analysis of Variance
• Evaluates the difference among the means of three or more groups.
Examples: accident rates for 1st, 2nd, and 3rd shift; expected mileage for five brands of tires.
• Assumptions: populations are normally distributed; populations have equal variances; samples are randomly and independently drawn.

Hypotheses of One-Way ANOVA
H0: μ1 = μ2 = … = μc — all population means are equal (i.e., no factor effect, no variation in means among groups).
H1: not all of the population means are the same — at least one population mean is different (i.e., there is a factor effect). This does not mean that all population means are different; some pairs may be the same.

Partitioning the Variation
Total variation can be split into two parts: SST = SSA + SSW
• SST = Total Sum of Squares: the aggregate variation of the individual data values across the various factor levels.
• SSA = Sum of Squares Among Groups: variation among the factor sample means.
• SSW = Sum of Squares Within Groups: variation among the data values within a particular factor level (variation due to random error).

Total Sum of Squares
SST = Σj Σi (Xij – X̄)², summed over j = 1…c groups and i = 1…nj observations,
where c = number of groups or levels, nj = number of observations in group j, Xij = ith observation from group j, and X̄ = grand mean (mean of all data values).
Written out: SST = (X11 – X̄)² + (X12 – X̄)² + … + (Xcnc – X̄)²

Among-Group Variation
SSA = Σj nj (X̄j – X̄)², where X̄j = sample mean from group j.
Written out: SSA = n1(X̄1 – X̄)² + n2(X̄2 – X̄)² + … + nc(X̄c – X̄)²
Mean Square Among: MSA = SSA / (c – 1)

Within-Group Variation
SSW = Σj Σi (Xij – X̄j)² — the variation within each group, summed over all groups.
Written out: SSW = (X11 – X̄1)² + (X12 – X̄1)² + … + (Xcnc – X̄c)²
Mean Square Within: MSW = SSW / (n – c)

Obtaining the Mean Squares
The mean squares are obtained by dividing the sums of squares by their associated degrees of freedom:
MSA = SSA / (c – 1)   (d.f. = c – 1)
MSW = SSW / (n – c)   (d.f. = n – c)
MST = SST / (n – 1)   (d.f. = n – 1)

One-Way ANOVA Table
Source of Variation   d.f.    Sum of Squares   Mean Square         F
Among groups          c – 1   SSA              MSA = SSA/(c – 1)   FSTAT = MSA/MSW
Within groups         n – c   SSW              MSW = SSW/(n – c)
Total                 n – 1   SST
(c = number of groups; n = sum of the sample sizes from all groups)

One-Way ANOVA F Test Statistic
H0: μ1 = μ2 = … = μc; H1: at least two population means are different.
FSTAT = MSA / MSW, with df1 = c – 1 and df2 = n – c.
• The F statistic is the ratio of the among-groups estimate of variance to the within-groups estimate of variance, so it is always positive; df1 = c – 1 will typically be small and df2 = n – c will typically be large.
• Decision rule: reject H0 if FSTAT > Fα; otherwise, do not reject H0.

One-Way ANOVA F Test Example
You want to see if three different golf clubs yield different distances. You randomly select five measurements from trials on an automated driving machine for each club. At the 0.05 significance level, is there a difference in mean distance?

Club 1: 254, 263, 241, 237, 251   (X̄1 = 249.2, n1 = 5)
Club 2: 234, 218, 235, 227, 216   (X̄2 = 226.0, n2 = 5)
Club 3: 200, 222, 197, 206, 204   (X̄3 = 205.8, n3 = 5)
Grand mean X̄ = 227.0; n = 15, c = 3.

(A scatter plot of the three samples shows the Club 1 distances clustered highest and the Club 3 distances lowest, with the grand mean at 227.)

SSA = 5(249.2 – 227)² + 5(226 – 227)² + 5(205.8 – 227)² = 4716.4
SSW = (254 – 249.2)² + (263 – 249.2)² + … + (204 – 205.8)² = 1119.6
MSA = 4716.4 / (3 – 1) = 2358.2
MSW = 1119.6 / (15 – 3) = 93.3
FSTAT = 2358.2 / 93.3 = 25.275

H0: μ1 = μ2 = μ3; H1: the μj are not all equal; α = 0.05, df1 = 2, df2 = 12.
Critical value: Fα = 3.89. Since FSTAT = 25.275 > 3.89, reject H0 at α = 0.05.
Conclusion: there is evidence that at least one μj differs from the rest.

ANOVA Assumptions
• Randomness and independence: select random samples from the c groups (or randomly assign the levels).
• Normality: the sample values for each group are from a normal population.
• Homogeneity of variance: all populations sampled from have the same variance — can be tested with Levene's test.

ANOVA Assumptions: Levene's Test
Tests the assumption that the variances of each population are equal.
First, define the null and alternative hypotheses: H0: σ²1 = σ²2 = … = σ²c; H1: not all σ²j are equal.
Second, compute the absolute value of the difference between each value and the median of its group.
Third, perform a one-way ANOVA on these absolute differences.

Levene Homogeneity of Variance Test Example
H0: σ²1 = σ²2 = σ²3; H1: not all σ²j are equal.
Group medians: Club 1 = 251, Club 2 = 227, Club 3 = 204.

Absolute differences from the group median:
Club 1   Club 2   Club 3
    14       11        7
    10        9        4
     0        0        0
     3        7        2
    12        8       18
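Both the golf-club F test and the median-based Levene procedure above can be checked with SciPy (a sketch assuming SciPy is available; `center='median'` reproduces the median-based version used on the slides):

```python
# One-way ANOVA F test and Levene's test for the golf-club data.
from scipy.stats import f_oneway, levene

club1 = [254, 263, 241, 237, 251]
club2 = [234, 218, 235, 227, 216]
club3 = [200, 222, 197, 206, 204]

f_stat, p = f_oneway(club1, club2, club3)
print(round(f_stat, 3), p < 0.05)       # -> 25.275 True, so reject H0

# Levene's test with median centering (ANOVA on |value - group median|)
w_stat, p_lev = levene(club1, club2, club3, center='median')
print(round(w_stat, 3), round(p_lev, 3))  # -> 0.092 0.912
```

The Levene statistic 0.092 and p-value 0.912 match the ANOVA-on-absolute-differences computation that follows.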
Levene Homogeneity of Variance Test Example (continued)
One-way ANOVA on the absolute differences (Excel "Anova: Single Factor" output):

SUMMARY
Groups   Count   Sum   Average   Variance
Club 1       5    39       7.8       36.2
Club 2       5    35       7.0       17.5
Club 3       5    31       6.2       50.2

Source of Variation      SS    df     MS       F   P-value   F crit
Between Groups          6.4     2    3.2   0.092     0.912    3.885
Within Groups         415.6    12   34.6
Total                 422.0    14

Since the p-value (0.912) is greater than 0.05, we fail to reject H0: there is no evidence that the variances differ, so the equal-variance assumption is reasonable.

Correlation

Scatter Diagram
A scatter diagram of house size against selling price suggests a relationship: the greater the house size, the greater the selling price.

Types of Relationships
• Linear vs. curvilinear relationships
• Strong vs. weak relationships
• No relationship
(Scatter plots illustrate each pattern.)

The Covariance
The covariance measures the strength of the linear relationship between two numerical variables (X and Y).
The sample covariance:
cov(X, Y) = Σi (Xi – X̄)(Yi – Ȳ) / (n – 1)
• Only concerned with the strength of the relationship; no causal effect is implied.

Interpreting Covariance
• cov(X, Y) > 0: X and Y tend to move in the same direction
• cov(X, Y) < 0: X and Y tend to move in opposite directions
• cov(X, Y) = 0: X and Y have no linear relationship (they are uncorrelated; this alone does not imply independence)
• The covariance has a major flaw: it is not possible to determine the relative strength of the relationship from the size of the covariance.

Coefficient of Correlation
Measures the relative strength of the linear relationship between two numerical variables.
Sample coefficient of correlation:
r = cov(X, Y) / (SX SY)
where cov(X, Y) = Σi (Xi – X̄)(Yi – Ȳ) / (n – 1), SX = √[ Σi (Xi – X̄)² / (n – 1) ], and SY = √[ Σi (Yi – Ȳ)² / (n – 1) ].

t Test for a Correlation Coefficient
Hypotheses: H0: ρ = 0 (no correlation between X and Y); H1: ρ ≠ 0 (correlation exists).
Test statistic:
tSTAT = (r – ρ) / √[ (1 – r²) / (n – 2) ]
with n – 2 degrees of freedom, where r = +√r² if b1 > 0 and r = –√r² if b1 < 0.

Example: is there evidence of a linear relationship between square feet and house price at the 0.05 level of significance?
H0: ρ = 0 (no correlation); H1: ρ ≠ 0 (correlation exists); α = 0.05, d.f. = 10 – 2 = 8.
tSTAT = (0.762 – 0) / √[ (1 – 0.762²) / (10 – 2) ] = 3.329
Critical values: ±tα/2 = ±2.3060. Since tSTAT = 3.329 > 2.3060, reject H0.
Conclusion: there is evidence of a linear association at the 5% level of significance.

Features of the Coefficient of Correlation
• The population coefficient of correlation is referred to as ρ; the sample coefficient of correlation is referred to as r.
• Both ρ and r are unit free and range between –1 and +1.
• The closer to –1, the stronger the negative linear relationship; the closer to +1, the stronger the positive linear relationship; the closer to 0, the weaker the linear relationship.

Scatter Plots of Sample Data with Various Coefficients of Correlation
Sample scatter plots illustrate r = –1, r = –.6, r = 0, r = +.3, and r = +1.
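The correlation t test above can be computed directly from r and n (a sketch assuming SciPy; with the rounded r = 0.762 the statistic comes out ≈ 3.328, versus the slide's 3.329, which reflects an unrounded r):

```python
# t test for a correlation coefficient, from r = 0.762 and n = 10 (house price example).
from math import sqrt
from scipy.stats import t

r, n = 0.762, 10
t_stat = (r - 0) / sqrt((1 - r**2) / (n - 2))  # tSTAT with n - 2 = 8 d.f.
crit = t.ppf(0.975, df=n - 2)                  # two-tail critical value at alpha = 0.05

print(round(t_stat, 2), round(crit, 4))  # -> 3.33 2.306
print(t_stat > crit)                     # -> True, so reject H0: correlation exists
```

Given the raw (X, Y) data rather than a precomputed r, `scipy.stats.pearsonr` returns both r and the two-tailed p-value of this same test.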