Statistical Significance of a Correlation
Running Head: SIGNIFICANCE TESTING AND CORRELATIONS

Testing the Statistical Significance of a Correlation

R. Michael Furr
Wake Forest University

Address correspondence to:
Mike Furr
Department of Psychology
Wake Forest University
Winston-Salem, NC 2706
[email protected]
336-758-5024

Researchers from Psychology, Education, and other social and behavioral sciences are very concerned with statistical significance. If a researcher conducts a study and finds that the results are “statistically significant,” then he or she has greater confidence in the effects revealed by the study. When results are statistically significant, researchers are more likely to believe that the effects are “real” and not likely to have occurred by chance. The goal of science is to understand our physical or social world, and this occurs in part by being able to judge which research findings are real and which are flukes and red herrings.

This paper describes the procedures through which researchers determine whether the results of a study are statistically significant. It presents the logic, technical steps, and interpretation of a test of statistical significance, specifically for researchers examining a correlation between two variables.

Many textbooks provide in-depth introductions to statistical significance, but there appear to be no sources that provide such an introduction in the context of correlations. Most introductory statistics textbooks in Psychology provide concepts and procedures for significance testing in the context of means, but the extension of significance testing to correlations is usually very slight. Typically, the coverage of significance testing for correlations, if it is discussed at all, focuses on computational procedures, bypassing the conceptual foundations and interpretations.
In fact, the organization of some introductory statistics textbooks implies that correlations and significance testing are completely separate issues. For example, a chapter on correlation might be in a section labeled “Descriptive Statistics” and the chapters related to significance testing might be included in a section labeled “Inferential Statistics.”

Although general statistics textbooks omit deep coverage of the conceptual and practical foundations of significance testing for correlations, one might suspect that such coverage could be found in sources that focus on correlational procedures specifically (e.g., Archdeacon, 1994; Bobko, 2001; Chen & Popovich, 2002; Cohen & Cohen, 1983; Edwards, 1984; Ezekiel, 1941; Miles & Shevlin, 2001; Pedhazur, 1997). Unfortunately, these sources also omit in-depth discussions of basic concepts in significance testing. The more advanced sources naturally assume that readers already have a solid grasp of basic concepts in significance testing. Unfortunately, even the more introductory sources provide little background in basic concepts in statistical significance as related to correlations.

A number of potential problems arise from the fact that no sources provide in-depth discussions of significance testing as related to correlations. First, some budding researchers might be left with the mistaken and potentially confusing belief that correlations and significance tests are unrelated issues. Although the computation and interpretation of a correlation can proceed without reference to a significance test, correlations are rarely reported without an accompanying significance test. Second, even if researchers are aware that correlations can be tested for statistical significance, they might have difficulty connecting fundamental concepts in “significance testing” (e.g., parameters, confidence intervals, distributions of inferential statistics) to correlations.
The existing sources make little effort to generalize concepts articulated in the context of means or frequencies to correlational analyses. Third, the existing sources create difficulty for course instructors who cover correlational analyses before other kinds of analyses. For example, some Psychology Departments divide their “Research Methods and Statistics” courses into a “correlational” semester and an “experimental” semester. If the correlational course is taken before the experimental course, then instructors who teach the correlational course face a dilemma. They can ignore significance testing of correlations, they can provide a cursory coverage of significance testing of correlations, or they can assign readings that present significance testing in the context of means or frequencies.

A solid understanding of significance testing as related to correlations may be particularly important as the field evolves in two ways. First, researchers are increasingly aware of the importance of effect sizes, such as correlations (American Psychological Association, 2001; Capraro & Capraro, 2003; Furr, 2004; Heldref Foundation, 1997; Kendall, 1997; Murphy, 1997; Rosenthal, Rosnow, & Rubin, 2000; Thompson, 1994, 1999; Wilkinson & APA Task Force on Statistical Inference, 1999). Second, many in the field recognize that regression, based on a correlational foundation, is a general approach to data analysis that can incorporate much that is typically conceptualized as Analysis of Variance. As the awareness and use of effect sizes and correlational analytic procedures continue to grow, and as advanced correlational procedures continue to emerge, researchers should have a solid understanding of the connections between correlations and significance testing.

The current paper is intended to partially fill this hole in the basic statistical literature.
It describes what statistical significance is about, presents fundamental concepts in evaluating statistical significance, and details the procedures for testing the statistical significance of a correlation.

Samples and Populations: Inferential Statistics

Imagine that Dr. Cartman wants to know whether the Scholastic Aptitude Test (SAT) is a valid predictor of college freshman performance at the local university. To address this issue, he recruits a sample of 200 freshmen from the university. The students give their consent for Dr. Cartman to have access to their academic records, from which he records their SAT scores and their first-year college Grade Point Average (GPA). Based on these data, Dr. Cartman finds that the correlation between SAT scores and GPA is .40, which is a positive correlation of moderate size. This correlation tells him that, within the sample, the students with relatively high SAT scores tend to have relatively high GPAs (and that students with relatively low SAT scores tend to have relatively low GPAs). Based on this finding in his sample, Dr. Cartman is tempted to conclude that the SAT is indeed a useful predictor of freshman GPA at the University.

But how much confidence should Dr. Cartman have in this conclusion? He might be hesitant to use the results found in a sample of 200 students to make an inference about whether SAT scores are correlated with GPAs in the entire freshman student body.

The question of statistical significance arises from the fact that scientists would like to make conclusions about psychological phenomena, effects, differences, or relationships between variables as they exist in a large population (or populations) of people (or rats, monkeys, etc., depending on the scientist’s area of expertise). For example, Dr. Cartman would like to make conclusions about whether SAT scores are correlated with GPAs in the entire freshman student body.
Similarly, a clinical psychologist might be interested in whether a new drug generally helps to alleviate depression, within the “population” of all people who might take the drug. Or a social psychologist hypothesizes that romantic couples in which the partners have similar profiles of personality traits tend to be happier than couples in which the partners have dissimilar personalities. This researcher would be interested in concluding whether similarity and romantic happiness are generally correlated with each other within the “population” of all couples in romantic relationships.

Despite their desire to make conclusions about large populations, researchers generally study only samples of people recruited from the larger population of interest. In our example, Dr. Cartman would like to make conclusions about the entire freshman class at the university, but he is able to recruit a sample of only 200 students from the student body. Similarly, a clinical psychologist cannot study all people who might ever take a drug, and a social psychologist cannot study all romantic couples. Researchers such as Dr. Cartman gather data from relatively small samples of people, and they use the sample data to make inferences about the existence and size of psychological phenomena in the larger population from which the sample was drawn.

Researchers must be concerned about the accuracy with which they can use data from samples to make inferences about the population from which the sample was recruited. Dr. Cartman recognizes that the 200 students who happened to be in his study might not be perfectly representative of the entire freshman class. It is possible that, in the student body as a whole, there is no association between SAT scores and GPA. That is, in the population from which the sample of 200 students was drawn, the correlation between SAT and GPA could be zero. Even if the correlation in the population is exactly zero, Dr.
Cartman could potentially obtain a random sample of students in which the correlation between SAT and GPA is not zero. His particular sample of 200 students might be unusual in some subtle way. Just by chance, Dr. Cartman might have recruited a sample in which people who scored relatively high on the SAT also tend to have relatively high GPAs. Thus researchers must be concerned about sampling error – the fact that a particular sample might not be perfectly representative of the population from which it was randomly drawn. The potential presence of sampling error means that researchers can never be totally confident that results found in a sample are perfectly representative of what is really “going on” in the population.

The procedures for evaluating statistical significance help researchers determine how well their sample’s results represent the population from which the sample was drawn. Roughly speaking, when we find that results are statistically significant, we have confidence in inferring that the effects observed in the sample represent the effects in the population as a whole. In our example, Dr. Cartman would want to know whether the correlation he found in the sample’s data is statistically significant. If it is, then he will feel fairly confident in concluding that SAT scores are correlated with GPA in the student body as a whole. If his correlation is not found to be statistically significant, then he would not feel confident concluding that SAT scores are correlated with GPA in the student body as a whole. Because we use this process to help us make inferences about populations, the statistical terms and procedures involved in this process are called inferential statistics.

There are standard terminologies describing inferential statistics and the connections between samples and populations. We use the sample data to calculate a correlation between two variables, such as SAT and GPA.
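To make the idea of sampling error concrete, the following Python sketch computes a Pearson correlation from sample data and then draws a few samples of 200 from a simulated population in which the two variables are unrelated. Everything in it is an illustrative assumption and not part of Dr. Cartman's study: the function names `pearson_r` and `sample_r_under_null` are hypothetical, the two variables are independent standard normal scores standing in for SAT and GPA, and the seed is arbitrary.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_r_under_null(n, rng):
    """Draw one sample of size n from a population where the correlation is zero
    (two independent standard normal variables, an assumption of this sketch)
    and return the correlation observed in that sample."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    return pearson_r(x, y)


rng = random.Random(2024)  # arbitrary seed so the run is repeatable
sample_rs = [sample_r_under_null(200, rng) for _ in range(5)]
print([round(r, 3) for r in sample_rs])
```

Each printed value is a sample correlation from a population whose true correlation is zero. The values should scatter around zero without being exactly zero, which is the sampling error described above.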
The correlation computed from the sample is labeled with an r, and it is called a descriptive statistic because it describes some aspect of the sample that is actually observed in our study. We then might use the correlation observed in the sample to estimate the correlation in the population from which the sample is drawn. Because we cannot study the entire population, we can only make an informed guess about the correlation as it exists in the population. The correlation in the population is labeled with the Greek letter rho (ρ), and it is called a parameter.

Statistical Hypotheses: Null and Alternative

Consider again Dr. Cartman, who wishes to know if the SAT is correlated with GPA in the entire freshman class. If Dr. Cartman is conducting a traditional significance test of a correlation, then he will consider two possibilities. One possibility, called the null hypothesis, is that the SAT is not correlated with GPA within the freshman student body. More technically, the null hypothesis states that the population correlation parameter (ρ) is zero. The null hypothesis is often labeled as H0, written as:

H0: ρ = 0

This expresses exactly what Dr. Cartman is doing – he will be testing the null hypothesis that the correlation in the population is equal to zero.

The second possibility, called the alternative hypothesis, is that the SAT is correlated with GPA within the freshman student body. More technically, the alternative hypothesis states that the population correlation parameter (ρ) is not zero. The alternative hypothesis is often labeled as H1 or HA, written as:

H1: ρ ≠ 0

The Decision About the Statistical Hypotheses

For traditional significance testing of a correlation, Dr. Cartman faces a decision between two competing hypotheses. In making this decision, Dr. Cartman has two options. First, he might reject the null hypothesis, thereby concluding that, in the population, the two variables are correlated with each other.
In other words, the results of the analysis of his sample’s data make him feel confident enough to conclude that the population correlation is some value other than zero. Second, he might fail to reject the null hypothesis, thereby concluding that, in the population, the two variables are not correlated with each other. In other words, the results of the analysis of his sample’s data do not make him feel confident enough to conclude that the population correlation is a value other than zero.

Note that, strictly speaking, both options are phrased in terms of rejecting the null hypothesis – Dr. Cartman can either reject the null or he can fail to reject the null. Researchers generally do not phrase the decisions in terms of “accepting the null” or in terms of the alternative hypothesis. For these reasons, the traditional procedures are called “null hypothesis significance testing.”

The decision regarding the null hypothesis is tied to the notion of statistical significance. If a researcher rejects the null hypothesis, then the result is said to be “statistically significant.” If a researcher fails to reject the null hypothesis, then the result is said to be not statistically significant.

Practically speaking, the default decision is to fail to reject the null hypothesis – to conclude that the variables are uncorrelated in the population. Researchers reject the null hypothesis only when their sample data make them confident enough to override the default decision and guess that the null hypothesis is incorrect. In his sample of 200 freshmen, Dr. Cartman found a correlation of .40 between SAT and GPA. The question that Dr. Cartman faces is, do his sample findings make him confident enough to reject the null hypothesis that the correlation in the entire freshman student body is zero?

Two issues arise when determining whether the sample findings make Dr. Cartman confident enough to reject the null hypothesis.
First, how confident is Dr. Cartman that the null hypothesis is false? In other words, how confident should he be that, in the entire freshman class, SAT truly is correlated with GPA? Second, how confident does he need to be in order to actually reject the null hypothesis? Psychology and related sciences have reached a consensus regarding the degree of confidence that a researcher should have before rejecting a null hypothesis. These two issues are considered in turn, as part of a process called a t-test.

Testing the Null Hypothesis: What Affects Our Confidence?

Two main factors make Dr. Cartman more or less willing to conclude that there is a non-zero correlation between SAT and GPA in the entire freshman student body. One factor affecting his confidence is the size of the correlation in his sample. In his sample’s data, Dr. Cartman found a correlation of r = .40, which represents a positive correlation of moderate size. But what if he had found that the correlation in his sample was much weaker, say only r = .12? Dr. Cartman recognizes that a correlation of only r = .12 is not very different from a correlation of zero. Therefore, he probably would not be very confident in concluding that the population correlation was anything but zero. In other words, if the correlation in the population is indeed zero (i.e., if ρ = 0), then it would not be very surprising to randomly draw a sample in which the observed correlation is small – only slightly different from zero. But what if Dr. Cartman had found that the correlation in his sample was very strong, say r = .80? A correlation of r = .80 is very far from zero – it expresses a very strong association between two variables. Therefore, he probably would be much more confident in concluding that the population correlation was not zero.
In other words, if the correlation in the population is indeed zero (i.e., if ρ = 0), then it would be very surprising to randomly draw a sample in which the correlation is so far away from zero. In sum, the size of the correlation in the sample will affect Dr. Cartman’s confidence in concluding that the population correlation is anything but zero – larger sample correlations will increase his confidence that the population correlation is not zero.

The second factor affecting his confidence in rejecting the null is the size of the sample itself. In his study, Dr. Cartman was able to recruit 200 participants. But what if he had been able to recruit a small sample of only 15 participants? Dr. Cartman probably would not be very confident in making inferences about the entire freshman student body based on a study of only 15 participants. On the other hand, if he had been able to recruit a sample of 500 students (a much larger proportion of the population), then Dr. Cartman would be more comfortable in making inferences about the entire student body. Therefore, larger samples increase his confidence in making inferences about the population, and smaller samples decrease his confidence.

We can quantify the amount of confidence that a researcher should have in rejecting the null hypothesis that the correlation in the population is zero (i.e., H0: ρ = 0). We compute a t value, which is an inferential statistic that can be conceptualized roughly as an index of “degree of confidence in rejecting the null hypothesis.” The formula for computing the t value reflects the two factors discussed above – the size of the correlation and the size of the sample:

$$t_{\text{OBSERVED}} = \frac{r}{\sqrt{1 - r^{2}}} \times \sqrt{N - 2} \qquad \text{(Equation 1)}$$

The tOBSERVED is the t value derived for the data that was observed in the actual sample of participants in the study, r is the correlation in the sample, and N is the number of participants in the sample. In Dr.
Cartman’s data:

$$t_{\text{OBSERVED}} = \frac{.40}{\sqrt{1 - .40^{2}}} \times \sqrt{200 - 2}$$

$$t_{\text{OBSERVED}} = .436 \times 14.071$$

$$t_{\text{OBSERVED}} = 6.135$$

Large t values reflect more confidence in rejecting the null hypothesis. Consider the t value that would be obtained for a sample in which the correlation is only .12 and the sample size is only 15:

$$t_{\text{OBSERVED}} = \frac{.12}{\sqrt{1 - .12^{2}}} \times \sqrt{15 - 2}$$

$$t_{\text{OBSERVED}} = .121 \times 3.606$$

$$t_{\text{OBSERVED}} = .436$$

This t value is noticeably lower than the t value found in the larger sample with the larger correlation, and the lower t value reflects the lower confidence that we would have in rejecting the null hypothesis.

In sum, effect size (i.e., the size of the correlation) and sample size are the two key factors affecting a researcher’s confidence in rejecting the null hypothesis. These two factors are part of what researchers call the “power” of a significance test (Cohen, 1988). Larger effect sizes (correlations farther away from zero) and larger sample sizes increase our confidence in rejecting a null hypothesis – reflecting a “powerful” significance test.

Testing the Null Hypothesis: How Confident Do We Need to Be?

Once we know the factors influencing Dr. Cartman’s confidence in rejecting the null hypothesis, we can consider the question of how confident he needs to be in order to reject the null. We have seen that larger correlations and larger samples produce greater confidence, as reflected in larger t values. But how large does a t value need to be in order for Dr. Cartman to decide to reject the null hypothesis that the population correlation between SAT and GPA is zero?

Confidence can be framed in terms of the probability that we would be making an error if we rejected the null hypothesis. Recall that a researcher never really knows if the null hypothesis is true or if it is false (because researchers typically cannot include entire populations in their studies).
Researchers collect data on a sample that is drawn from the population of interest, and then they use the sample’s data to make educated guesses about the population. But even the most well-educated guess could be incorrect.

Significance testing is most directly concerned with what is called a “Type I Error.” A Type I error is made when a researcher rejects the null hypothesis when in fact the null hypothesis is true. That is, a researcher makes a Type I error when he or she concludes that two variables are correlated with each other in the population, when in reality the two variables are not correlated with each other in the population.

If Dr. Cartman rejects the null hypothesis in his study, then he is saying that there is a very low probability that he will be making an incorrect rejection. The probability of an event occurring (i.e., the probability that a mistake will be made) ranges from 0 to 1.0, with probabilities near zero meaning that the event is very unlikely to occur. Thus, a probability of 0 means that there is absolutely no chance that a mistake will be made, and a probability of 1.0 means that a mistake will definitely be made. Values between these two extremes reflect differing likelihoods of the event. A probability of .50 means that there is a 50% chance that a mistake will be made, and a probability of .05 means that there is only a 5% chance (a pretty remote chance) that a mistake will be made.

By convention, psychologists have adopted the probability of .05 as the criterion for determining how confident a researcher needs to be before rejecting the null hypothesis. Put another way, if Dr. Cartman finds that his study gives him a level of confidence associated with less than a 5% chance of an incorrect rejection of the null hypothesis, then he is “allowed” to reject the null hypothesis.
Traditionally, psychologists have assumed that, if researchers are so confident in their results that they have such a small chance of making a Type I Error, then they are allowed to reject the null hypothesis. Researchers often use the term “alpha level” when referring to the degree of confidence required to reject the null hypothesis. By convention, most significance tests in psychology are conducted with an alpha level of .05.

Statisticians have made connections between the observed t values computed earlier and the p value (alpha level) associated with incorrectly rejecting the null hypothesis. How can Dr. Cartman determine if his observed t value allows him to be confident enough to assert that he has less than a 5% chance of making a Type I Error? To do this, Dr. Cartman must identify the appropriate “critical” t value, which will tell him how large his observed t value must be in order for him to reject the null hypothesis in his study.

The critical t value that Dr. Cartman will use reflects a .05 probability of incorrectly rejecting the null hypothesis. It is the t value that is exactly associated with a 5% chance of making a Type I Error. To identify the appropriate critical t value, Dr. Cartman can refer to a Table of critical t values. Many basic statistics textbooks and research methods textbooks include tables of critical t values, such as that presented in Table 1 (see the end of this paper). Dr. Cartman must consider only two issues when identifying the critical t value for his study. These two issues are reflected in the columns and rows of the Table.

Table 1 presents several columns of t values. These columns represent different degrees of confidence required, in terms of the probability of making a Type I Error. Because psychology and similar sciences have traditionally adopted a probability level of .05 as the criterion for rejecting a null hypothesis, Dr.
Cartman will typically only be concerned about the values in the column labeled “.05.”

Table 1 also presents several rows, and each row represents a different sized study. The rows are labeled “df,” which stands for degrees of freedom. “Degrees of freedom” is linked to the number of participants in the sample. Specifically, df = N – 2. Dr. Cartman determines that the degrees of freedom for his study is df = 198 (200 – 2 = 198).

Referring to a Table of critical t values, Dr. Cartman pinpoints the intersection of the .05 column and the appropriate row. The entry at this point in the Table is 1.972. He will use this critical t value to help decide whether to reject the null hypothesis.

Testing the Null Hypothesis: Making the Decision

The decision about the null hypothesis is made by comparing an observed t value to the appropriate critical t value. If Dr. Cartman finds that the absolute value of his observed t value is larger than the critical t value, then he will decide to reject the null hypothesis. If Dr. Cartman finds that the absolute value of his observed t value is not larger than the critical t value, then he fails to reject the null hypothesis. In shorthand terms,

If |tOBSERVED| > tCRITICAL then reject H0
If |tOBSERVED| ≤ tCRITICAL then fail to reject H0

In his case, Dr. Cartman rejects the null hypothesis, because the absolute value of his observed t value is larger than the critical t value (|6.135| > 1.972). These “statistically significant” results tell Dr. Cartman that it is highly unlikely to find a correlation of .40 (a moderate effect size) in a sample of 200 participants (a fairly large sample), if the correlation in the population is zero. Therefore, he rejects the null hypothesis and concludes with confidence that the correlation in the population is probably not zero. That is, he concludes that in the entire freshman student body at his University, SAT scores are indeed correlated with GPA.
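The comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`t_observed`, `reject_null`, `type_one_error_rate`) are hypothetical, the critical value 1.972 is copied from the table lookup described above rather than computed, and the simulated population consists of two independent standard normal variables standing in for a population in which ρ = 0. Computed at full precision, the observed t for r = .40 and N = 200 comes out near 6.14; the 6.135 reported above reflects rounding of the intermediate values.

```python
import math
import random


def t_observed(r, n):
    """Equation 1: t = r / sqrt(1 - r^2) * sqrt(N - 2). Undefined when |r| = 1."""
    return r / math.sqrt(1 - r ** 2) * math.sqrt(n - 2)


def reject_null(r, n, t_critical):
    """Two-tailed rule: reject H0 when |t observed| exceeds the critical t."""
    return abs(t_observed(r, n)) > t_critical


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def type_one_error_rate(n, t_critical, n_studies, seed):
    """Proportion of simulated studies that reject H0 although rho = 0.
    Each study draws n pairs of independent standard normal scores
    (an assumption of this sketch, not something taken from the paper)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        if reject_null(pearson_r(x, y), n, t_critical):
            rejections += 1
    return rejections / n_studies


T_CRITICAL_DF_198 = 1.972  # .05 column, df = 198, as read from the table in the text

print(round(t_observed(0.40, 200), 2))             # observed t for r = .40, N = 200
print(reject_null(0.40, 200, T_CRITICAL_DF_198))   # positive correlation
print(reject_null(-0.40, 200, T_CRITICAL_DF_198))  # same size, negative sign
```

Calling `type_one_error_rate(200, T_CRITICAL_DF_198, 2000, seed=1)` simulates 2,000 studies in which the null hypothesis is true. The proportion of false rejections it returns should land near .05, which is the Type I error rate that the .05 criterion is meant to allow.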
For a more general perspective, it might be worth considering other patterns of results. As a second example, imagine that a second researcher, Dr. Marsh, had obtained a correlation of .12 from a sample of 15 participants. In this case, the observed t value would be tOBSERVED = .436. Looking at Table 1, in the .05 column and the row for df = 13 (df = 15 - 2), he finds that the critical t value is tCRITICAL = 2.160. Here, the absolute value of tOBSERVED is less than tCRITICAL, so Dr. Marsh would fail to reject the null hypothesis. The effect size is small (i.e., it is not very different from zero), and the sample is small (only 15 people). With such weak results and such a small sample, Dr. Marsh is not confident enough to reject the idea that the correlation in the population is zero. Thus, his correlation is not statistically significant.

As a third example, imagine that Dr. Broflovski had obtained a negative correlation (say, r = -.40) from a sample of 200 participants. In this case, the observed t value would be tOBSERVED = -6.135 (note that this is a negative observed t value). Looking at Table 1, in the .05 column and the row for df = 198, he finds that the critical t value is tCRITICAL = 1.972. Here, the absolute value of tOBSERVED is greater than tCRITICAL (|-6.135| > 1.972), so Dr. Broflovski would reject the null hypothesis in this case. Dr. Broflovski has found a moderately sized correlation (i.e., it is fairly different from zero) in a fairly large sample of participants. Dr. Broflovski feels that he is highly unlikely to find a moderate effect size in a fairly large sample, if the correlation in the population is zero.

Note that the direction of the correlation (positive or negative) does not make a difference in this example. Dr. Broflovski has conducted what is known as a two-tailed test or a non-directional test.
This means that he is testing the null hypothesis that the correlation in the population is zero. This hypothesis can be rejected if the correlation in the sample is positive or if it is negative – either way could convince him that the correlation in the population is not likely to be zero. Table 2 presents a summary of the steps in conducting a typical null hypothesis significance test of a correlation.

Interpreting the Decision

A significance test comes down to a decision between two choices. Usually this decision concerns whether or not the correlation is zero in the population from which the sample has been drawn. Inferential statistics help us determine the likelihood that a given sample’s results might have occurred either:
a) because the sample is drawn from a population in which the correlation is not zero, or
b) purely by chance, with the sample being drawn from a population in which the correlation is zero.

We reject the null hypothesis when the probability level associated with our results suggests that our results are unlikely to have occurred if the null hypothesis were true. Again, the primary example of Dr. Cartman shows this situation – he obtained a moderate correlation in a large sample. His significance test tells him that this result is unlikely to have occurred in this sample, if indeed the correlation in the population is zero. He therefore concludes that the null hypothesis is false (i.e., he concludes that the population correlation is not zero), and decides to reject it.

We fail to reject the null hypothesis when the probability level associated with our results suggests that our results are not unlikely to have occurred if the null hypothesis were true. In the second example, Dr. Marsh found a weak correlation in a small sample. The significance test indicates that the results might very well occur even if the correlation in the population is zero. Dr.
Marsh therefore concludes that the null hypothesis might not be false (i.e., the population correlation might well be zero) and so he decides not to reject it.

You are likely to hear a variety of different interpretations of a correlation that is statistically significant. For example, Dr. Cartman’s results (r = .40, p < .05) from his sample of N = 200 might lead him to make statements such as:
• The correlation is “significantly different from zero.”
• It’s unlikely that the sample came from a population in which the correlation is zero.
• In the population from which the sample was drawn, the two variables are probably associated with each other.
• The observed data are unlikely to have occurred by random chance.
• There is less than a .05 probability (i.e., a very small chance) that the results could have been obtained if the null hypothesis is true.
• He is 95% confident that the population correlation is not zero.
• If this study were done 100 times (each with a random sample of N = 200, drawn from a population in which the correlation is zero), we would get a correlation of magnitude .40 or stronger (i.e., |r| ≥ .40) fewer than 5 times.
• Given that the results are unlikely to have occurred if the null were true, then the null is probably not true.

You are also likely to hear a variety of different interpretations of a correlation that is not statistically significant. For example, the second example (r = .12, p > .05, N = 15) might lead to statements such as:
• The sample’s correlation is “not significantly different from zero.”
• It’s not unlikely that the sample came from a population in which the correlation is zero.
• In the population from which the sample was drawn, the variables are likely to be uncorrelated with each other.
• The observed data might very well have occurred by random chance.
• There is more than a .05 probability (i.e., not a small chance) that the results could have been obtained even if the null hypothesis is true.
• He cannot be 95% confident that the population correlation is not zero.
• If this study were done 100 times (each with a random sample of N = 15, drawn from a population in which the correlation is zero), we would get a correlation of magnitude .12 or stronger (i.e., |r| ≥ .12) more than 5 times.
• Given that the results are not unlikely to have occurred if the null were true, the null might very well be true.

Experts in probability might take issue with some of the above interpretations, depending on their perspective on probability and logic. Nevertheless, many of the interpretations above, or close variations, are often used.

While considering the appropriate interpretations of significance tests, we should also consider at least two potential confusions. One point of confusion might concern a rejection of the null hypothesis. Dr. Cartman’s sample correlation was r = .40, which was statistically significant. By rejecting the null hypothesis that ρ = 0, Dr. Cartman can conclude that the sample is probably not drawn from a population with a correlation of 0. But he should not conclude that the sample was drawn from a population with a correlation of ρ = .40. The sample might come from a population with a correlation of ρ = .40, but it also might come from a population with a correlation of ρ = .35 or ρ = .50 and so on. So, rejecting the null hypothesis means that the correlation in the population is probably not zero, but it does not indicate what the correlation in the population is likely to be.

A second potential point of confusion concerns the failure to reject the null hypothesis. In the second example, Dr. Marsh’s sample correlation was r = .12, which was not statistically significant. Recall that the failure to reject the null hypothesis tells Dr.
Marsh that the sample’s results might very well have occurred if ρ = 0. So, Dr. Marsh can assume that the population correlation might be zero. Even so, Dr. Marsh should not conclude that the correlation in the population is zero. The sample’s results (r = .12) might also have occurred if ρ = .02, ρ = -.07, or ρ = .20. So, just because the population correlation might be zero, that does not mean that it is zero or that all other possibilities are less likely.

Confidence Intervals

As outlined above, a null hypothesis test is a very specific test. The results of the typical test allow us to make one inference about the population, specifically that the population correlation is either unlikely to be zero or it might well be zero. That is, are two variables likely to be associated with each other in the population or not?

Although it is useful to evaluate the likelihood that the population correlation is zero, we can ask many other questions about the correlation in the population from which a sample was drawn. For example, what is our best guess about the actual correlation in the population? If the correlation in Dr. Cartman’s sample is r = .40, then what is Dr. Cartman’s best guess about the size of the correlation among the entire freshman student body? All that Dr. Cartman knows is that the sample correlation is .40; therefore, his most reasonable guess about the student body correlation is that it is ρ = .40. This guess about the specific value of the population correlation is called a point estimate because he is estimating a single, specific point at which the population correlation lies.

Although the point estimate of the population correlation is an “educated guess,” Dr. Cartman is not sure that the population correlation is .40.
He recognizes that his particular random sample of students might be different from the entire freshman student body in some ways, and these differences might mean that the correlation he finds in his sample is different from the correlation in the entire student body. Dr. Cartman might say that, although he is not sure that the population correlation is .40, he is fairly confident that the population correlation lies somewhere between .28 and .51.

A confidence interval (CI) for a correlation is the range in which the population correlation is likely to lie, and it is estimated with a particular degree of confidence. For example, Dr. Cartman’s range (.28 ≤ ρ ≤ .51) is a 95% CI. That is, he is 95% confident that the population correlation (ρ) is between .28 and .51.

Although a discussion of the calculation of a CI is beyond the scope of this paper, three important issues must be considered in interpreting a CI. First, the “width” of the CI reflects the precision of the estimate. Dr. Cartman’s 95% CI ranges from .28 to .51, which is a span of 23 “points” on the correlational metric. But consider the second example, in which Dr. Marsh found a correlation of r = .12 in a sample of 15 participants. Dr. Marsh’s 95% CI ranges from -.42 to .60 (i.e., -.42 ≤ ρ ≤ .60), which is a span of 102 “points” on the correlational metric. Note the difference between the two examples, illustrated in Figure 1. Dr. Cartman’s estimate of the population correlation is a narrower range than is Dr. Marsh’s estimate. A narrower range reflects a more precise and informative estimate. For a more familiar example, consider two weather predictions. One meteorologist predicts that the high temperature tomorrow will be somewhere between 60 and 70 degrees, a range of 10 degrees. Another meteorologist predicts that the high temperature tomorrow will be between 30 and 100 degrees, a much wider range of 70 degrees.
Obviously, the first meteorologist’s narrower range is a much more precise and useful prediction. Narrow CIs are more precise and informative than are wide CIs.

A second important point regarding CIs is the effect of sample size on a CI. In computing a CI based on a sample’s data, the size of the sample is directly related to the precision (i.e., the narrowness) of the CI. Large samples allow researchers to make relatively precise CI estimates. Consider again the fact that Dr. Cartman’s CI is much more precise than Dr. Marsh’s CI. The difference in precision is primarily due to the difference in the sample sizes from the two studies. Dr. Cartman’s CI is based on 200 participants, but Dr. Marsh’s CI is based on only 15 participants. The link between sample size and the width of a CI is conceptually related to the link between sample size and our confidence in rejecting a null hypothesis, discussed earlier. A relatively large sample includes a relatively large proportion of the population. Therefore, an estimate about the population is more precise when based on large samples than when based on smaller samples.

A third point regarding CIs is the link between a CI and the typical null hypothesis test. Recall that the typical hypothesis test of a correlation is the test of the null hypothesis that the correlation in the population is zero (H0: ρ = 0). We reject the null hypothesis when we are confident that there is less than a 5% chance that the correlation in the population is indeed zero. A CI might be seen as the flip side of the significance test. Dr. Cartman’s CI tells us to be 95% confident that the population correlation is within the range of .28 to .51. Put another way, Dr. Cartman’s CI tells us that there is only a 5% chance that the population correlation is outside of the range of .28 to .51. Note that Dr. Cartman’s CI does not include zero.
The interval includes only positive values (i.e., it is entirely above zero), as illustrated in Figure 1. Therefore, the CI tells us to be 95% confident that the correlation in the population is not zero. In other words, it tells us that there is less than a 5% chance that the population correlation is zero. This parallels the outcome of Dr. Cartman’s null hypothesis test, in which he rejected the hypothesis that the correlation in the population is zero.

In contrast, consider Dr. Marsh’s CI, also illustrated in Figure 1. Dr. Marsh’s CI does include zero: the CI ranges from a negative value at one end to a positive value at the other end. The fact that zero is within Dr. Marsh’s 95% CI indicates that the correlation in the population from which his sample was drawn might very well be zero. Although Dr. Marsh is 95% confident that the population correlation is not -.50, -.85, .62 and so on (because these values are outside of his CI), he cannot be confident that the population correlation is not zero. In sum, the traditional null hypothesis significance test is directly related to CIs, and this relationship hinges on whether a CI includes zero.

Advanced Issues

The concepts, procedures, and examples in this paper reflect the most typical kind of significance test of a correlation, in which a researcher tests the null hypothesis that the population correlation is zero, at an alpha level of .05. Although this is the most typical kind of significance test, options exist for conducting other kinds of statistical tests. Details of such options and advanced issues are beyond the scope of this paper, but an overview might be useful.

Using Statistical Software

Statistical packages such as SPSS usually provide exact probability values associated with each significance test. Figure 2 presents SPSS output for Example 2 (Dr. Marsh’s results), with labels provided to aid interpretation. Note that SPSS labels the p value as “Sig.
(2-tailed)”. As shown in Figure 2, the exact p value is .67. The p values reported by the statistical software are used to make decisions about the null hypothesis. If the p value is larger than .05 (as in Figure 2), then we would fail to reject the null hypothesis. If the p value is smaller than .05, then we would reject the null hypothesis. Therefore, if you use statistical software for correlational analysis, then you will not need to refer to a table of t values. Instead, you simply examine the exact p value and gauge whether it is greater than or less than .05.

Additional Significance Tests for a Correlation

By far, the most typical significance test of a correlation is a test of the null hypothesis that the population correlation is zero (H0: ρ = 0). This is the test most commonly reported in the psychological literature, and it is the “default” test, as reflected in the p values reported by statistical software packages such as SPSS or SAS. Despite this, we could test other null hypotheses involving correlations. We could test a null hypothesis that the population correlation is a specific value other than zero. For example, previous research might indicate that the correlation between Conscientiousness and Work Performance is .30 in the population, but we might hypothesize that some professions might have an even stronger correlation between Conscientiousness and Work Performance. We could recruit a sample of accountants, measure their Conscientiousness and their Work Performance, and test the null hypothesis that the correlation in the population of accountants (i.e., the population from which our sample is drawn) is .30 (i.e., H0: ρ = .30). In this case, we believe that the correlation among accountants is not .30, which is reflected in the alternative hypothesis (H1: ρ ≠ .30).
The significance test for this example would be conducted somewhat differently than the much more typical test outlined earlier, and many statistics textbooks describe the procedures.

Other p Values Besides .05

As described above, researchers have traditionally allowed themselves to reject null hypotheses when their analyses suggest that there is less than a 5% chance of making a Type I Error (i.e., incorrectly rejecting a null hypothesis). Although the p value (alpha level) of .05 is the conventional point at which researchers reject a null hypothesis, researchers could consider using different alpha levels. Researchers sometimes use an even more strict criterion, such as an alpha level of .01. Researchers who decide to use an alpha level of .01 would reject the null hypothesis only when their analyses suggest that there is less than a 1% chance of making a Type I Error.

Using a different alpha level changes only Step 3 in the process of statistical significance tests, as illustrated in Table 2. In Step 3, the researcher would select a critical t value associated with a .01 alpha level. To identify the appropriate critical value, the researcher would refer to a table such as Table 1, and examine the column labeled “.01” instead of the column labeled “.05.” The researcher would then proceed to Step 4 and Step 5, comparing the observed t value to the critical t value associated with the .01 alpha level.

As shown in Table 1, the critical t value for a study conducted with an alpha of .01 is larger than the critical t value for a study conducted with an alpha of .05. In terms of the significance test, this difference means that researchers must be even more confident that the null hypothesis is incorrect. That is, a larger observed t value is required in order to reject the null hypothesis when using an alpha of .01.

Two-tailed vs One-tailed Tests

The examples described in this paper are based on “two-tailed” significance tests.
The tests are designed to evaluate the null hypothesis that the population correlation is zero (H0: ρ = 0), in comparison to the alternative hypothesis that the population correlation is not zero (H1: ρ ≠ 0). For such two-tailed tests, the null hypothesis could be rejected if the sample correlation is positive or negative, that is, if the correlation is on either of the two sides of zero. These hypotheses are non-directional: they do not reflect any kind of expectation that the population correlation is positive, for example. But researchers might have strong reasons to suspect that the correlation is in a specific direction. For example, Dr. Cartman might suspect that the correlation between SAT and GPA is positive. In such cases, researchers could consider using a “one-tailed” significance test.

For one-tailed tests, the hypotheses are framed differently. If Dr. Cartman hypothesized that the population correlation is positive, then he might conduct a one-tailed test in which he tests the null hypothesis that the population correlation is less than or equal to zero (H0: ρ ≤ 0), in comparison to the alternative hypothesis that the population correlation is greater than zero (H1: ρ > 0). These are known as directional hypotheses.

Conducting a one-tailed test changes Step 3 in the process of statistical significance tests, as illustrated in Table 2. In Step 3, the researcher would select a critical t value associated with a one-tailed test (at the alpha level that he or she has chosen, usually .05). To identify the appropriate critical value, the researcher would refer to a table of critical t values. For the sake of simplifying the earlier discussion of critical values, Table 1 does not include information for one-tailed tests; however, many textbooks include tables with columns that guide researchers to the appropriate critical t values for one-tailed tests.
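As a rough illustration of Step 3 for both kinds of test, the sketch below computes critical t values numerically instead of looking them up in a table. It is not part of the original procedure: the function names (t_pdf, t_cdf, critical_t) are hypothetical, and the t distribution is approximated with Simpson's rule and bisection using only the Python standard library, so the results should be treated as approximate.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x), via Simpson's rule on [0, |x|] plus symmetry around zero."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def critical_t(alpha: float, df: int, tails: int = 2) -> float:
    """Critical t value for a one-tailed (tails=1) or two-tailed (tails=2) test.

    Bisection on the approximate CDF; the search bracket [0, 100] is an
    assumption that is adequate for moderate df, not for very small df
    combined with very small alpha.
    """
    target = 1 - alpha / tails
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Dr. Cartman's study: N = 200, so df = N - 2 = 198.
print("two-tailed, alpha .05:", round(critical_t(0.05, 198, tails=2), 3))
print("one-tailed, alpha .05:", round(critical_t(0.05, 198, tails=1), 3))
print("two-tailed, alpha .01:", round(critical_t(0.01, 198, tails=2), 3))
```

For df = 198, the two-tailed values should land close to the entries in the “.05” and “.01” columns of Table 1. The one-tailed value at .05 should land close to the entry in the “.10” column, because a one-tailed test at .05 uses the same cutoff as a two-tailed test at .10.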
The researcher would then proceed to Step 4 and Step 5, comparing the observed t value to the critical t value associated with the one-tailed test.

Although researchers might use one-tailed tests, two-tailed tests are probably more common. One-tailed tests are often perceived as more liberal than two-tailed tests (e.g., Gravetter & Wallnau, 2004), allowing researchers to reject the null hypothesis more easily (although, in fact, the two approaches have equal probabilities of producing a Type I error). This perception arises from the fact that the critical t values used in one-tailed tests are smaller than are the critical t values used in two-tailed tests. Consequently, researchers must meet a lower degree of confidence before rejecting the null hypothesis in a one-tailed test. Researchers tend to shy away from procedures that make it easier to reject a null hypothesis, preferring to take a more conservative approach. Put another way, researchers are reluctant to adopt procedures that might increase the probability of making a Type I error, and the use of one-tailed tests is often perceived as potentially increasing such errors. Therefore, despite the logic of one-tailed tests and directional hypotheses, two-tailed tests and non-directional hypotheses are used more frequently.

An Alternative Conceptualization of an Inferential Statistic

The conceptual approach to significance testing that is adopted in the current paper emphasizes the importance of effect size and sample size in determining statistical significance (see Equation 1, above). Textbooks usually present an alternative approach to significance testing. The alternative approach is very similar to the one outlined in the current paper, in that it proceeds through the same Steps listed in Table 2 and produces the same result. However, the alternative approach uses a slightly different conceptual framework.
Again, a full description of the alternative approach is beyond the scope of this paper, but a general familiarity could be useful. The difference between the two approaches lies in the conceptualization of Step 2 (computing the observed t value).

The alternative approach includes two components. First, we have greater confidence that the null hypothesis is incorrect when our sample’s statistic (i.e., our observed correlation) is far away from what is predicted by the null hypothesis. Second, we have greater confidence in making inferences about the population from which the sample was drawn when our sample’s statistic is a precise estimate of the population parameter. As outlined in many textbooks, the alternative approach conceptualizes an inferential statistic in the following way:

\[
\text{Observed } t \text{ value} = \frac{\text{Observed value of the correlation (in the sample)} - \text{Expected value of the correlation under the null hypothesis}}{\text{Standard error of the correlation}}
\]

or

\[
t_{\text{OBSERVED}} = \frac{r - \rho}{s_r} \qquad \text{(Equation 2)}
\]

In this approach, the observed t value again reflects our confidence that the null hypothesis is incorrect: larger observed t values make us more likely to reject the null hypothesis. We will assume that we are conducting a test of the typical null hypothesis (H0: ρ = 0), that the correlation is zero in the population from which the sample was drawn. As in the approach described earlier, we are more likely to reject the null hypothesis when the observed correlation is far away from zero than when the observed correlation is close to zero. This is reflected in the numerator of the equation above, which is the difference between the observed correlation and the correlation that is proposed by the null hypothesis.

Equation 2 differs from Equation 1 in the concept of the “standard error” of the correlation.
Although there are very technical and highly abstract ways of defining the standard error of the correlation, it can generally be interpreted as indicating how imprecise the sample correlation is as an estimate of the correlation in the population from which it was drawn (Gravetter & Wallnau, 2004). A large standard error indicates that the correlation found in the sample is a poor estimate of the correlation in the population. A small standard error indicates that the correlation found in the sample is a good estimate of the correlation in the population. As discussed earlier, sample size is a key factor affecting one’s confidence in using the sample results to make inferences about the population. Therefore, it is not surprising to find that the sample size is a component of the standard error:

\[
s_r = \sqrt{\frac{1 - r^2}{N - 2}}
\]

With this equation representing the standard error of the correlation, and with the understanding that the null hypothesis specifies a correlation of zero (H0: ρ = 0), then the equation for the observed t value is:

\[
t = \frac{r - \rho}{s_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{N - 2}}} \qquad \text{(Equation 3)}
\]

Equation 3 is often found in textbooks that explain the computations for the significance test of a correlation coefficient. Notice that this equation includes exactly the same components as does Equation 1. The two approaches are mathematically identical, leading to the same observed t value. In addition, they are conceptually similar in that they define the inferential statistic (the observed t value) as a product of the size of the effect (how large is the correlation? how far away from zero is it?) and the degree to which the sample’s data is a good approximation of the population’s properties (how large is the sample?).

A Broader Perspective on Significance Testing

The significance test of the correlation is an example of significance testing more generally.
As Rosenthal and Rosnow (1991) point out, most significance tests can be conceptualized as:

Inferential Statistic = Effect Size × Size of Study

The t value is one kind of inferential statistic, and different inferential statistics are used for significance tests of different descriptive statistics. For example, the test of the difference among three or more group means uses an F value (i.e., ANOVA), and the test of differences in frequencies uses a Chi Square value. Roughly speaking though, inferential statistics indicate the confidence that a researcher should have in rejecting the null hypothesis. Larger values for an inferential statistic reflect stronger confidence.

Similarly, the correlation coefficient is but one kind of effect size. Other statistics that represent various effect sizes include the degree of difference between two groups (e.g., Cohen’s d) or the proportion of variance accounted for (e.g., R Squared or Eta Squared). Effect sizes represent the strength of the findings in the sample: how strong is the correlation between variables, or how different are two groups? The stronger the findings in the sample, the more confident we are in rejecting the null hypothesis, where the null hypothesis states that there is no correlation between variables in the population or there is no difference between two populations of people.

Finally, sample size is but one facet of the size of study. Although the number of people in the sample is generally the most important component of the “size of the study,” some inferential statistics also consider the number of variables involved in the analysis. A procedure called multiple regression is used to examine the degree to which two or more predictor variables (e.g., SAT, IQ, and Academic Motivation) are related to an outcome variable (e.g., Freshman GPA). The inferential statistics associated with multiple regression take into account the number of predictor variables being examined.
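For the correlation case, the “Effect Size × Size of Study” idea can be checked numerically. The sketch below assumes that Equation 1 (which is not reproduced in this part of the paper) takes the form t = [r / √(1 − r²)] × √(N − 2), and compares it with the standard-error form of Equation 3. The function names are hypothetical and chosen only for this illustration.

```python
import math


def t_effect_times_size(r: float, n: int) -> float:
    """Observed t as (effect size) x (size of study); assumed form of Equation 1."""
    effect_size = r / math.sqrt(1 - r ** 2)   # grows as |r| grows
    size_of_study = math.sqrt(n - 2)          # grows as N grows
    return effect_size * size_of_study


def t_standard_error(r: float, n: int, rho0: float = 0.0) -> float:
    """Observed t as (r - rho0) / s_r, following Equation 3."""
    s_r = math.sqrt((1 - r ** 2) / (n - 2))   # standard error of the correlation
    return (r - rho0) / s_r


# The two examples used in the paper: Dr. Cartman and Dr. Marsh.
for label, r, n in [("Cartman", 0.40, 200), ("Marsh", 0.12, 15)]:
    t1 = t_effect_times_size(r, n)
    t3 = t_standard_error(r, n)
    print(f"{label}: product form t = {t1:.3f}, standard-error form t = {t3:.3f}")
```

Under these assumptions the two forms give the same observed t value for each example, and the much larger t for Dr. Cartman reflects both a larger effect size and a larger study.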
In sum, the equation above expresses an important point. The degree to which any significance test will lead to rejection of a null hypothesis is a function of some kind of effect size and of the size of the study that was conducted. A large effect size and a large study give the study greater power. Power is a concept that essentially reflects the likelihood of correctly rejecting a null hypothesis that is false.

Conclusion

Individuals who are being introduced to the concept of correlational analysis and to significance testing might have difficulty making clear connections between the two. In textbooks and other sources that might be used for teaching basic statistics, the conceptual foundation of significance testing of correlations has traditionally been neglected. Hopefully, the current paper begins to remedy this neglect and helps foster a deeper understanding of this fundamental statistical procedure.

References

Aberson, C. (2002). Interpreting null results: Improving presentation and conclusions with confidence intervals. Journal of Articles in Support of the Null Hypothesis, 1, 36–42.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Archdeacon, T. J. (1994). Correlation and regression analysis: A historian's guide. Madison, WI: University of Wisconsin Press.
Bobko, P. (2001). Correlation and regression (2nd ed.). Thousand Oaks, CA: Sage Publications.
Capraro, M. M., & Capraro, R. M. (2003). Exploring the APA fifth edition publication manual’s impact on the analytic preferences of journal editorial board members. Educational and Psychological Measurement, 63, 554–565.
Chen, P. Y., & Popovich, P. M. (2002). Correlation: Parametric and nonparametric measures. Thousand Oaks, CA: Sage Publications.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New Jersey: Lawrence Erlbaum.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Edwards, A. L. (1984). An introduction to linear regression and correlation (2nd ed.). New York: W. H. Freeman.
Ezekiel, M. (1941). Methods of correlation analysis. New York: Wiley.
Furr, R. M. (2004). Interpreting effect sizes in contrast analysis. Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences, 3, 1–25.
Gravetter, F. J., & Wallnau, L. B. (2004). Statistics for the behavioral sciences (6th ed.). Belmont, CA: Wadsworth.
Heldref Foundation. (1997). Guidelines for contributors. Journal of Experimental Education, 65, 95–96.
Kendall, P. C. (1997). Editorial. Journal of Consulting and Clinical Psychology, 65, 3–5.
Miles, J. N. V., & Shevlin, M. E. (2001). Applying regression and correlation: A guide for students and researchers. London: Sage Publications.
Murphy, K. R. (1997). Editorial. Journal of Applied Psychology, 82, 3–5.
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York: Harcourt Brace College Publishers.
Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). New York: McGraw Hill.
Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. New York: Cambridge University Press.
Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837–847.
Thompson, B. (1999). Improving research clarity and usefulness with effect size indices as supplements to statistical significance tests. Exceptional Children, 65, 329–338.
Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Table 1

                    Alpha Level
DF       .10       .05       .01       .001
1        6.314     12.706    63.657    636.619
2        2.920     4.303     9.925     31.599
3        2.353     3.182     5.841     12.924
4        2.132     2.776     4.604     8.610
5        2.015     2.571     4.032     6.869
6        1.943     2.447     3.707     5.959
7        1.895     2.365     3.499     5.408
8        1.860     2.306     3.355     5.041
9        1.833     2.262     3.250     4.781
10       1.812     2.228     3.169     4.587
11       1.796     2.201     3.106     4.437
12       1.782     2.179     3.055     4.318
13       1.771     2.160     3.012     4.221
14       1.761     2.145     2.977     4.140
15       1.753     2.131     2.947     4.073
16       1.746     2.120     2.921     4.015
17       1.740     2.110     2.898     3.965
18       1.734     2.101     2.878     3.922
19       1.729     2.093     2.861     3.883
20       1.725     2.086     2.845     3.850
21       1.721     2.080     2.831     3.819
22       1.717     2.074     2.819     3.792
23       1.714     2.069     2.807     3.768
24       1.711     2.064     2.797     3.745
25       1.708     2.060     2.787     3.725
26       1.706     2.056     2.779     3.707
27       1.703     2.052     2.771     3.690
28       1.701     2.048     2.763     3.674
29       1.699     2.045     2.756     3.659
30       1.697     2.042     2.750     3.646
40       1.684     2.021     2.704     3.551
60       1.671     2.000     2.660     3.460
70       1.667     1.994     2.648     3.435
80       1.664     1.990     2.639     3.416
90       1.662     1.987     2.632     3.402
100      1.660     1.984     2.626     3.390
120      1.658     1.980     2.617     3.373
140      1.656     1.977     2.611     3.361
160      1.654     1.975     2.607     3.352
180      1.653     1.973     2.603     3.345
198      1.653     1.972     2.601     3.340
∞        1.645     1.960     2.576     3.291

Table 2
Steps in conducting a typical significance test of a correlation

Step   Description
1.     Compute the observed statistic (i.e., compute the correlation), based on the sample’s data
2.     Compute the observed t value, based on the sample correlation and the sample size
3.     Obtain the critical t value by referring to a table of the t distribution, based on a two-tailed significance level of .05 and df = N-2
4.     Compare t observed to t critical
5.     Make a decision about the null hypothesis.

Figure 1
Illustrating Confidence Intervals
[Figure: a number line running from -1.0 to 1.0, marked at -.50, 0, and .50, showing Dr. Cartman’s CI (.28 to .51) and Dr. Marsh’s CI (-.42 to .60).]

Figure 2
SPSS output of correlational analysis

Correlations
                                SAT      GPA
SAT    Pearson Correlation      1        .120
       Sig. (2-tailed)                   .670
       N                        15       15
GPA    Pearson Correlation      .120     1
       Sig. (2-tailed)          .670
       N                        15       15

Labels: Sample’s Correlation (r = .12); P value (p = .67); Sample Size (n = 15)
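The values shown in Figure 1 and Figure 2 can be approximately reproduced from r and N alone. The paper does not say how the confidence intervals were computed; the sketch below assumes Fisher's r-to-z transformation, a common method, and that choice is an assumption rather than the author's stated procedure. The p value is obtained from the t distribution with df = N - 2, integrated numerically with Simpson's rule. The function names (fisher_ci, t_pdf, two_tailed_p) are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_ci(r: float, n: int, confidence: float = 0.95) -> tuple:
    """Approximate CI for a correlation via Fisher's r-to-z transformation."""
    z = math.atanh(r)                         # move r onto the z scale
    se = 1.0 / math.sqrt(n - 3)               # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(r: float, n: int, steps: int = 2000) -> float:
    """Two-tailed p value for H0: rho = 0, from the observed t with df = n - 2."""
    df = n - 2
    t_obs = abs(r) / math.sqrt((1 - r * r) / df)
    h = t_obs / steps                          # steps must be even
    total = t_pdf(0.0, df) + t_pdf(t_obs, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3           # P(-t_obs < T < t_obs)
    return 1 - central_area


for label, r, n in [("Dr. Cartman", 0.40, 200), ("Dr. Marsh", 0.12, 15)]:
    low, high = fisher_ci(r, n)
    print(f"{label}: 95% CI = [{low:.2f}, {high:.2f}], p = {two_tailed_p(r, n):.3f}")
```

Under these assumptions, the intervals should round to the endpoints drawn in Figure 1, and Dr. Marsh's p value should be close to the “Sig. (2-tailed)” entry in Figure 2. Small discrepancies would be expected if the original values were computed with a different method.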