Chapter 1 Statistics Review (Reminding you of what you, hopefully, already know)

1 What are we doing here?
Statistics! A system of managing, understanding, and analyzing data. To understand the theories we develop, we need a way to objectively categorize and summarize information (data). Then we need methods for judging and interpreting data. Statistics gives us all of that (kinda handy, huh?)

2 Some terms to start with…
Observations – the “individuals” we are studying. In psychology, these are usually people or animals, but not always. Some synonyms – “experimental units” (that’s from the book), subjects, participants, individuals, people, rats, guinea pigs…
Variable – the characteristic that we are measuring.

3 Types of data
Quantitative data – information about observations that is numerical, both in value and in nature.
Qualitative data – information that is not numerical in nature (although it can be in name). There is a field of analysis called Qualitative Data Analysis that goes far beyond simple categories.

4 Classification of Data
We can make finer distinctions than the above:
1. Nominal – qualitative: simply names the individual or a quality.
2. Ordinal – qualitative: values of the variable are ordered, but there are not necessarily equal spaces between the values.
3. Interval – quantitative: scores are ordered and equally spaced, but there is no meaningful zero.
4. Ratio – quantitative: interval, but with a meaningful zero.

5 Two More Classifications…
Discrete – there is a set number of possible scores, and there can be nothing between them. Nominal and ordinal data are both discrete.
Continuous – there is a theoretically infinite number of possible scores, even if there is a restricted range. Ratio data is always continuous; interval data may or may not be, but typically is.
Assignment 3 – p4, items 1.1, 1.3

6 Populations and Samples
Population – the set of all possible observations, existing and/or conceptual. Described by parameters.
1. People (everybody)
2. NCCU students
3. All college students
4. People with depression
5. Countries
6. Businesses
Sample – a subset of data from a population. Described by statistics.
1. People in a telephone survey, dorm residents at NCCU
2. Students in this class, Psi Chi members, football players
3. NCCU students, seniors at all NC universities
4. Students getting treated at student health, patients of Dr. Jones
5. North, South, and Central American countries
6. Fortune 500 companies

7 Random sampling
Observations are selected in such a way that each member of the population has an equal chance of selection.
1. Student ID numbers are written on Ping-Pong balls and selected Lotto-style
2. Computer draws from a listing of all college students
3. Random draws from a hat containing the names of everybody diagnosed with depression
4. Cut out the names of all the countries listed in the NY Times Almanac, random draws
5. File a request under the Freedom of Information Act to get the names of all corporations from the IRS, random draw
Assignment 4 – p8, items 1.9, 1.13, 1.15

8 Intro to Graphical Stats
Drawing graphs of our data enhances our understanding immeasurably.
[Figure: a relative frequency distribution – a histogram with frequency (0–250) on the y-axis and scores (60–150, plus “More”) on the x-axis]

9 The x-axis shows the range of scores; the y-axis shows the relative frequency of the occurrence of those scores. This type of graph describes the frequencies of random variables. That is, some variable in a population that can take on the values described in the graph.

10 Area under the curve
The space enclosed by the graph represents all the possible scores and how often they occur. If we say that the total area is equal to 1, then the area associated with a specific value gives the percentage (or proportion) of the population with that score.
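The random-sampling idea on slide 7 can be sketched with Python's standard library. This is a minimal sketch: the population of student ID numbers here is made up for illustration.

```python
import random

# Hypothetical population: 100 made-up student ID numbers
# (the electronic version of writing IDs on Ping-Pong balls).
population = list(range(1000, 1100))

random.seed(42)  # fixed seed so the draw is reproducible

# random.sample draws without replacement; every member of the
# population has an equal chance of selection.
sample = random.sample(population, k=10)
print(sorted(sample))
```

The same call scales to any listing you can get into a Python list, e.g. a computer draw from a roster of all college students.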
11 Consider this distribution:
[Figure: a histogram with frequency (0–120) on the y-axis and scores from 70 to 150 on the x-axis]

12 Two New Terms
Random variable: whatever variable we are measuring. “Random” because it can take on some range of values. The likelihood that it takes on a specific value is determined by the probability distribution.
The probability distribution is the same thing as its relative frequency. See all that stuff about the area under the curve above.

13 The problem with populations…
We can’t measure most populations, so their probability distributions are unknown. Taking a sample allows us to infer information about the population. We assume that population parameters are about equal to sample statistics.

14 First steps with samples
Before we start going crazy with data analysis, we need to just have a look at our data.
76, 80, 84, 85, 86, 87, 87, 88, 89, 93, 94, 95, 95, 96, 97, 98, 98, 99, 100, 101, 104, 106, 106, 106, 109, 112, 115, 126, 133, 139

15 Stem and Leaf Plot
 7 | 6
 8 | 04567789
 9 | 345567889
10 | 0146669
11 | 25
12 | 6
13 | 39

16 Histogram
[Figure: a histogram of the sample above, with frequency (0–12) on the y-axis and scores from 80 to 140 (plus “More”) on the x-axis]
Assignment 5 – p14, items 1.17, 1.19, 1.21

17 Numerical Descriptions of Data
(Note: “descriptions” is not an accident; this branch of stats is called descriptive statistics.)
Finding the center of a data set: the mean – the arithmetic average, the sum of all of the scores divided by the total number of scores. Writing this with statistical notation:
\bar{x} = \frac{\sum x_i}{n}

18 Understanding the breadth or spread of a dataset
Range – the distance between the lowest and highest scores. Simply subtract the low score from the high score.
Variance – the average of the squares of the deviations:
\sigma^2 = \frac{SS}{N} \quad \text{where} \quad SS = \sum (x_i - \mu)^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}

19 The standard deviation – the square root of the variance. One of the very nice things is that a deviation from the mean, expressed in standard deviation units, can be negative. Thus, it not only tells how far a score is from the mean, but whether it’s above or below the mean.
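As a quick numerical check of the variance formula above, both forms of SS (the definitional sum of squared deviations and the computational shortcut) give the same answer on the 30 scores listed on slide 14. This sketch uses only Python's standard library and, for the moment, treats the 30 scores as a whole population:

```python
import statistics

scores = [76, 80, 84, 85, 86, 87, 87, 88, 89, 93, 94, 95, 95, 96, 97,
          98, 98, 99, 100, 101, 104, 106, 106, 106, 109, 112, 115,
          126, 133, 139]

N = len(scores)
mean = sum(scores) / N  # the arithmetic average, sum of scores over N

# SS by the definition: the sum of squared deviations from the mean.
ss_definition = sum((x - mean) ** 2 for x in scores)

# SS by the computational shortcut: sum(x_i^2) - (sum(x_i))^2 / N.
ss_shortcut = sum(x ** 2 for x in scores) - sum(scores) ** 2 / N

variance = ss_definition / N  # population variance, SS / N
print(round(mean, 2), round(variance, 2))
```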
Remember that the above formula is for populations. In the real world we have to estimate the parameters. The sample statistic is calculated the same way, except we use n − 1 instead of N:
s^2 = \frac{SS}{n-1} \quad \text{where} \quad SS = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}

20 Interpreting the Standard Deviation
1. For any dataset, regardless of its shape, at least 75% of the observations will fall within 2 standard deviations of the mean.
2. For most datasets of moderate size (about 25 or more observations) with a mound-shaped distribution, about 95% of the observations will fall within 2 standard deviations of the mean.
So, when you hear that someone’s score was –2.5 standard deviations from the mean, you know a lot.

21 More on parameters and statistics
Although we perform our analyses on samples, we are interested in understanding populations. In other words, we are more interested in parameters than statistics. However, we can’t measure parameters, so we estimate them with statistics. The sample mean is an estimate of the population mean. The sample variance is an estimate of the population variance.
Assignment 6 – p24, items 1.23, 1.25, 1.29

22 The Normal Probability Distribution
The normal distribution is symmetrical around the mean, with its spread determined by its standard deviation. The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. This is also known as a z distribution. You can use the z table to find the proportion of data between two scores. Look at the portion of the chart on page 30. Now, why is this useful?
Assignment 7 – p32, items 1.35, 1.37

23 Sampling Distributions (not sample distributions)
Imagine this: you gather a sample of students and give them a statistics test. Now you can calculate a mean and standard deviation. We use these to estimate the population mean and standard deviation.
Question: If you do this just once, how likely are you to hit the actual mean and standard deviation of the population?
So, what is the solution?
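The z-table lookups described above can be reproduced with `statistics.NormalDist` from Python's standard library (available since Python 3.8), which computes the area under the standard normal curve directly:

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # the standard normal (z) distribution

# Proportion of observations between z = -1.96 and z = +1.96:
# the same number a z table gives, without interpolating by hand.
middle_95 = z.cdf(1.96) - z.cdf(-1.96)

# Proportion between z = -1 and z = +1 (the familiar "about 68%").
one_sd = z.cdf(1.0) - z.cdf(-1.0)

print(round(middle_95, 4), round(one_sd, 4))
```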
Take a bunch of samples. But now we have a bunch of means and standard deviations. So what do we do with them? They are scores, just like anything else. In fact, they now make up a new kind of distribution. Sampling distributions are probability distributions of sample statistics.

24 More detail
Take a sample of a specific n, say 20 observations, on a specific variable, say income. Now you can calculate the mean; say it’s $37,455.08. OK, great. Put the sample back in the population (i.e., sample with replacement). Do the same thing again, also with n = 20, again on income. Find the mean again. Do this an infinite number of times. This is the sampling distribution. Now, we can’t actually take an infinite number of samples. We can’t because, well, infinite is kind of a lot. It’s even more than a lot. So, we mathematically estimate the parameters of the distribution.

25 Standard Error
The standard deviation of the sampling distribution of the mean is called the standard error of the mean, usually referred to as just the standard error. This is something we are going to use a lot. Why is it called the standard error?

26 The Central Limit Theorem
Given sufficiently large sample sizes, the shape of the sampling distribution is normal, regardless of the shape of the population distribution.

27 Making & assessing parameter estimates
If the mean of the sampling distribution is equal to the parameter, the statistic is considered an unbiased estimate. If not, it is considered biased. The sample mean and the sample variance (computed with n − 1) are unbiased. However, this does not mean that they are accurate estimates. We need a way of assessing accuracy.

28 Confidence Intervals
Recall two things: sampling distributions are normal, and given a normal distribution, 95% of the observations fall within about 2 standard deviations of the mean. It’s actually 1.96 standard deviations rather than 2. So, we add and subtract 1.96 standard errors from the sample mean.
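That arithmetic is short enough to sketch directly. The sample of 25 scores below is made up for illustration; everything else follows the recipe above (estimate the standard error as s/√n, then add and subtract 1.96 of them):

```python
import math
import statistics

# A hypothetical sample of 25 test scores (made up for illustration).
sample = [88, 92, 79, 101, 95, 84, 110, 97, 90, 86, 103, 99, 91,
          77, 94, 105, 89, 96, 82, 100, 93, 87, 98, 85, 104]

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation (n - 1)
se = s / math.sqrt(n)          # estimated standard error of the mean

# Add and subtract 1.96 standard errors from the sample mean.
lower = mean - 1.96 * se
upper = mean + 1.96 * se
print(round(lower, 2), round(upper, 2))
```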
This range of numbers is called the “95% Confidence Interval,” or 95% CI.

29 What Does the CI Mean?
Assuming that we are using .95 as the confidence coefficient, we are 95% sure that we have captured the true mean inside that range. That is, 95% of the time this procedure will give us a range of data that includes the true mean. Expanding the width of the interval (e.g., to 99% or 100%) will increase our confidence but decrease our precision. Shrinking the confidence interval (e.g., to 75% or 50%) will increase our precision but decrease our confidence.
Assignment 8 – p44, items 1.45, 1.47, 1.51

30 Review of Hypothesis Testing
First, more definitions:
1. Theory – theories are big, very big. They are broad, powerful statements that explain a set of observations. They are based on evidence and are discarded when they no longer work.
2. Hypothesis – hypotheses are smaller. They are derived from theories and are very specific, narrow statements about what we are testing with a given experiment or set of analyses.

31 The test of a hypothesis makes use of the following concepts:
1. The null hypothesis, H0, is a statement, usually indicating no effect, which we assume is true.
2. An alternative or research hypothesis, Ha or H1, makes a statement counter to the null. Usually, this is what we “want” to find evidence for.
3. A test statistic – a number calculated from the data that we use to make a decision about H0 and H1.
4. The rejection region is a set of values. If the test statistic falls inside this region, then we reject H0.

32 Testing a hypothesis about a population mean
The z-test – compare the sample data to a hypothesized population mean:
z = \frac{\bar{y} - \mu_0}{\sigma_{\bar{y}}} \quad \text{where} \quad \sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}

33 The problem with the z-test – we rarely know the population standard deviation. Enter the one-sample t-test:
t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}}
Assignment 9 – p57, items 1.56, 1.57, 1.61, 1.77

34 Testing the difference between two means
Say we are comparing the effects of talk therapy on panic attacks.
One group of subjects gets talk therapy while the other gets no therapy. We set α = 0.05. What does this mean? Next, we get our means and standard deviations and estimate the standard error. We use these to calculate t. We look this value up in the t table. If our value is bigger than the critical value, then we reject the null.

35 Calculation
t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}

36 There are two minor differences when doing this with a computer package. First, the package does all the calculating for you. Second, it reports the exact p, or the observed significance level. So, if we set α to 0.05, then we are actually looking for a p less than 0.05, usually written as p < 0.05. Thus, when we get the exact p, we are looking for it to be less than 0.05.

37 Assumptions
When doing a statistical test, there are certain assumptions about the data that must be met first:
1. The sampled population (or populations) is normal.
2. The sample (or samples) is randomly and independently drawn.
3. The variances of however many groups are being tested are equal. This is called homogeneity of variance.
Assignment 10 – p63, items 1.71, 1.73, 1.77

38 Checking for heteroscedasticity (unequal variances)
If the first two assumptions are true, then we can test the third with an F-test. The F distribution is a non-normal distribution whose shape depends on its degrees of freedom. Unlike the t distribution, it has two degrees of freedom. It is used when we are sampling from two populations (or two parameters from the same population). In this case we are comparing variances from our two populations. The degrees of freedom are n1 − 1 and n2 − 1. Turn to page 73.

39 Calculation
F = \frac{s_1^2}{s_2^2}
where the numerator is the larger sample variance and the denominator is the smaller.
Assignment – p79, item 1.83
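The variance-ratio F and the pooled-variance t above can be sketched together. This is a minimal sketch: the two groups of panic-attack counts below are hypothetical, invented to illustrate the formulas.

```python
import math
import statistics

# Hypothetical panic-attack counts for two groups (made up for illustration).
therapy    = [4, 2, 5, 3, 1, 4, 2, 3]   # talk-therapy group
no_therapy = [6, 5, 8, 4, 7, 6, 9, 5]   # no-therapy (control) group

n1, n2 = len(therapy), len(no_therapy)
m1, m2 = statistics.mean(therapy), statistics.mean(no_therapy)
v1, v2 = statistics.variance(therapy), statistics.variance(no_therapy)

# F-test for homogeneity of variance: larger sample variance over the
# smaller, with n1 - 1 and n2 - 1 degrees of freedom.
F = max(v1, v2) / min(v1, v2)

# Pooled variance: a weighted average of the two sample variances.
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)

# Two-sample t statistic, with n1 + n2 - 2 degrees of freedom.
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

print(round(F, 3), round(t, 3))
```

A statistics package reports these same numbers along with the exact p; here you would still compare t against the critical value from the t table.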