Basic Statistical Concepts
Psych 231: Research Methods in Psychology

Statistics
Mistrust of statistics?
• It is all in how you use them
• They are a critical tool in research

Statistics
Why do we use them?
Descriptive statistics
• Used to describe, simplify, & organize data sets
• Describing distributions of scores
Inferential statistics
• Used to test claims about the population, based on data gathered from samples
• Take sampling error into account: are the results above and beyond what you’d expect by random chance?

Distribution
Recall that a variable is a characteristic that can take different values.
The distribution of a variable is a summary of all the different values of a variable
• Both type (each value) and token (each instance)
Example: How much do you like psy231? (1 = Hate it, 5 = Love it)
• 5 values (1, 2, 3, 4, 5)
• 7 tokens (1, 1, 2, 3, 4, 5, 5)

Distribution
Many important distributions
Population
• All the scores of interest
Sample
• All of the scores observed (your data)
• Used to estimate population characteristics
Distribution of sample distributions
• Used to estimate sampling error
[Figure: scores in a population, with a sample of scores drawn from it]
How do we describe these distributions?
• Use descriptive statistics, focus on 3 properties

Properties of a distribution
Shape
• Symmetric v. asymmetric (skew)
• Unimodal v. multimodal
Center
• Where most of the data in the distribution are
• Mean, Median, Mode
Spread (variability)
• How similar/dissimilar are the scores in the distribution?
• Standard deviation (variance), Range

Distribution
Visual descriptions: a picture of the distribution is usually helpful
• Gives a good sense of the properties of the distribution
Many different ways to display a distribution
• Graphs
  • Continuous variable: histogram, line graph (frequency polygon)
  • Categorical variable: pie chart, bar chart
• Table
  • Frequency distribution table
Numerical descriptions of distributions

Graph for continuous variables
A frequency histogram
Example: Distribution of scores on an exam
[Figure: histogram of frequency (0 to 20) by exam score, in bins 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90-94, 95-100]

Graph for continuous variables
A line graph
Example: Distribution of scores on an exam
[Figure: line graph of frequency (0 to 20) against exam score (50 to 95)]

Graphs for categorical variables
Bar chart
Pie chart
[Figure: bar chart and pie chart with the categories Cutting, Doe, Missing, Smith]

Caution
Be careful using a line graph for categorical variables
• The line implies that there are responses between Smith and Doe, but there are not

Frequency distribution table
Columns: values (types), counts, percentages

VAR00003
Value   Frequency   Percent   Valid Percent   Cumulative Percent
1.00        2          7.7         7.7               7.7
2.00        3         11.5        11.5              19.2
3.00        3         11.5        11.5              30.8
4.00        5         19.2        19.2              50.0
5.00        4         15.4        15.4              65.4
6.00        2          7.7         7.7              73.1
7.00        4         15.4        15.4              88.5
8.00        2          7.7         7.7              96.2
9.00        1          3.8         3.8             100.0
Total      26        100.0       100.0

Properties of distributions: Shape
Symmetric
• The two sides line up
Asymmetric (skewed)
• The two sides do not line up

Properties of distributions: Shape
Symmetric
Asymmetric (skewed)
• Negative skew
• Positive skew
[Figure: a symmetric distribution and two skewed distributions, with the tail of each skewed distribution labeled]

Properties of distributions: Shape
Unimodal (one mode)
Multimodal
• Bimodal examples
[Figure: bimodal distributions with a major mode and a minor mode]

Properties of distributions: Center
There are three main measures of center
Mean (M): the arithmetic average
• Add up all of the scores and divide by the total number
• Most used measure of center
Median (Mdn): the middle score in terms of location
• The score that cuts off
the top 50% of the scores from the bottom 50%
• Good for skewed distributions (e.g. net worth)
Mode: the most frequent score
• Good for nominal scales (e.g. eye color)
• A must for multi-modal distributions

The Mean
The most commonly used measure of center
• The arithmetic average
Computing the mean
• The formula for the population mean (a parameter) is:
  $\mu = \frac{\sum X}{N}$
  Add up all of the X’s, then divide by the total number in the population
• The formula for the sample mean (a statistic) is:
  $\bar{X} = \frac{\sum X}{n}$
  Add up all of the X’s, then divide by the total number in the sample

Spread (Variability)
How similar are the scores?
Range: the maximum value - minimum value
• Only takes two scores from the distribution into account
• Influenced by extreme values (outliers)
Standard deviation (SD): (essentially) the average amount that the scores in the distribution deviate from the mean
• Takes all of the scores into account
• Also influenced by extreme values (but not as much as the range)
Variance: standard deviation squared

Variability
Low variability
• The scores are fairly similar
High variability
• The scores are fairly dissimilar
[Figure: a narrow and a wide distribution, each centered on its mean]

Standard deviation
The standard deviation is the most popular and most important measure of variability.
The standard deviation measures how far off all of the individuals in the distribution are from a standard, where that standard is the mean of the distribution.
• Essentially, the average of the deviations.

An Example: Computing the Mean
Our population: 2, 4, 6, 8
$\mu = \frac{\sum X}{N} = \frac{2 + 4 + 6 + 8}{4} = \frac{20}{4} = 5.0$
[Figure: number line from 1 to 10]

An Example: Computing Standard Deviation (population)
Our population: 2, 4, 6, 8
Step 1: To get a measure of the deviation we need to subtract the population mean from every individual in our distribution.
$\mu = \frac{\sum X}{N} = \frac{2 + 4 + 6 + 8}{4} = \frac{20}{4} = 5.0$
$X - \mu$ = deviation scores
2 - 5 = -3
4 - 5 = -1
6 - 5 = +1
8 - 5 = +3
[Figure: number line from 1 to 10 with each deviation from the mean marked]
Notice that if you add up all of the deviations they must equal 0.

An Example: Computing Standard Deviation (population)
Step 2: So what we have to do is get rid of the negative signs. We do this by squaring the deviations and then adding them up, which gives the sum of the squared deviations (SS). The square root is taken later, in Step 4.
$X - \mu$ = deviation scores
2 - 5 = -3
4 - 5 = -1
6 - 5 = +1
8 - 5 = +3
$SS = \sum (X - \mu)^2 = (-3)^2 + (-1)^2 + (+1)^2 + (+3)^2 = 9 + 1 + 1 + 9 = 20$

An Example: Computing Standard Deviation (population)
Step 3: Compute Variance (which is simply the average of the squared deviations). So to get the mean of the squared deviations, we need to divide the SS by the number of individuals in the population.
variance = $\sigma^2 = SS/N$

An Example: Computing Standard Deviation (population)
Step 4: Compute Standard Deviation
To get this we need to take the square root of the population variance.
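Steps 1 through 4 can be sketched in a few lines of Python for the example population 2, 4, 6, 8. This is only an illustration of the arithmetic on the slides. The variable names are made up for the sketch, and the final variance and standard deviation values are computed here rather than quoted from the slides.

```python
import math

# The example population from the slides.
population = [2, 4, 6, 8]

# Population mean: add up all of the scores and divide by N.
N = len(population)
mu = sum(population) / N

# Step 1: deviation scores (X - mu). They always add up to 0.
deviations = [x - mu for x in population]

# Step 2: square the deviations and add them up (SS).
ss = sum(d ** 2 for d in deviations)

# Step 3: population variance is the average squared deviation, SS / N.
variance = ss / N

# Step 4: population standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print("mean:", mu)
print("deviations:", deviations)
print("SS:", ss)
print("variance:", variance)
print("SD:", round(sd, 3))
```

For this population the sketch gives a mean of 5.0 and an SS of 20, matching the slides, and from those a variance of 5.0 and a standard deviation of about 2.236.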
standard deviation = $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$

An Example: Computing Standard Deviation (population)
To review:
Step 1: Compute deviation scores
Step 2: Compute the SS
Step 3: Determine the variance
• Take the average of the squared deviations
• Divide the SS by N
Step 4: Determine the standard deviation
• Take the square root of the variance

An Example: Computing Standard Deviation (sample)
To review:
Step 1: Compute deviation scores
Step 2: Compute the SS
Step 3: Determine the variance
• Take the average of the squared deviations
• Divide the SS by n - 1
Step 4: Determine the standard deviation
• Take the square root of the variance
Dividing by n - 1 is done because samples are biased to be less variable than the population. This “correction factor” will increase the sample’s SD (making it a better estimate of the population’s SD).

Relationships between variables
Example: Suppose that you notice that the more you study for an exam, the better your score typically is.
• This suggests that there is a relationship between study time and test performance.
• We call this relationship a correlation.

Relationships between variables
Properties of a correlation
• Form (linear or non-linear)
• Direction (positive or negative)
• Strength (none, weak, strong, perfect)
To examine this relationship you should:
• Make a scatterplot
• Compute the Correlation Coefficient

Scatterplot
Plots one variable against the other
• Useful for “seeing” the relationship
  • Form, Direction, and Strength
• Each point corresponds to a different individual
• Imagine a line through the data points

Scatterplot
Hours study   Exam perf.
X             Y
6             6
1             2
5             6
3             4
3             2
[Figure: scatterplot of Y against X, both axes running from 1 to 6]

Correlation Coefficient
A numerical description of the relationship between two variables
• For the relationship between two continuous variables we use Pearson’s r
• It basically tells us how much our two variables vary together
  • As X goes up, what does Y typically do?

Form
• Linear
• Non-linear
[Figure: scatterplots of a linear and a non-linear relationship]

Direction
Positive
• As X goes up, Y goes up
• X & Y vary in the same direction
• Positive Pearson’s r
Negative
• As X goes up, Y goes down
• X & Y vary in opposite directions
• Negative Pearson’s r

Strength
The strength of the relationship
• Spread around the line (note the axis scales)
• Zero means “no relationship”.
• The farther the r is from zero, the stronger the relationship

Strength
• r = -1.0: “perfect negative corr.”
• r = 0.0: “no relationship”
• r = +1.0: “perfect positive corr.”
The farther from zero, the stronger the relationship

Strength
Which relationship is stronger?
• Rel A: r = -0.8
• Rel B: r = 0.5
Rel A: -0.8 is stronger than +0.5, because strength depends on the distance from zero and not on the sign.

Regression
Compute the equation for the line that best fits the data points
Y = (X)(slope) + (intercept)
• slope = Change in Y / Change in X = 0.5
• intercept = 2.0
[Figure: line fitted through a scatterplot of Y against X, both axes running from 1 to 6]

Regression
Can make specific predictions about Y based on X
X = 5, Y = ?
Y = (X)(.5) + (2.0)
Y = (5)(.5) + (2.0)
Y = 2.5 + 2 = 4.5

Regression
Also need a measure of error
Y = X(.5) + (2.0) + error
• Same line, but different relationships (strength difference)
[Figure: two scatterplots sharing the same line, with points close to the line in one and widely spread in the other]

Cautions with correlation & regression
• Don’t make causal claims
• Don’t extrapolate
• Extreme scores (outliers) can strongly influence the calculated relationship
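A minimal Python sketch of the two calculations discussed above may help: Pearson's r for the scatterplot data, and a prediction from the regression line. It assumes the scatterplot table pairs up as (6, 6), (1, 2), (5, 6), (3, 4), (3, 2). The function names `pearson_r` and `predict` are invented for this illustration, and the slides do not show the computation of r themselves. Note also that the line Y = (X)(.5) + (2.0) is the illustrative line from the regression slides, not a line fitted to these five points.

```python
import math

# (X, Y) pairs read from the scatterplot table; the pairing is an assumption.
hours = [6, 1, 5, 3, 3]   # X: hours of study
score = [6, 2, 6, 4, 2]   # Y: exam performance


def pearson_r(xs, ys):
    """Pearson's r: how much X and Y vary together, scaled to lie between -1 and +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of the paired deviation scores.
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


def predict(x, slope=0.5, intercept=2.0):
    """Predicted Y from the example line on the slides: Y = (X)(slope) + (intercept)."""
    return x * slope + intercept


r = pearson_r(hours, score)
print("Pearson's r:", round(r, 3))
print("Predicted Y at X = 5:", predict(5))
```

Under the assumed pairing, r comes out at roughly 0.90, a strong positive relationship, and the prediction at X = 5 reproduces the 4.5 worked out on the slide.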