Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Statistics for Everyone Workshop Summer 2011 Part 1 Statistics as a Tool in Scientific Research: Review of Summarizing and Graphically Representing Data; Introduction to SPSS Workshop presented by Linda Henkel and Laura McSweeney of Fairfield University Funded by the Core Integration Initiative and the Center for Academic Excellence at Fairfield University and NSF CCLI Grant Statistics as a Tool in Scientific Research Types of Research Questions • Descriptive (What does X look like?) • Correlational (Is there an association between X and Y? As X increases, what does Y do?) • Experimental (Do changes in X cause changes in Y?) Different statistical procedures allow us to answer the different kinds of research questions Statistics as a Tool in Scientific Research Start with the science and use statistics as a tool to answer the research question Get your students to formulate a research question first: • How often does this happen? • Did all plants/people/chemicals act the same? • What happens when I add more sunlight, give more praise, pour in more water? Statistics as a Tool in Scientific Research Can collect data in class Can use already collected data (yours or database) Helping students to formulate research question: Ask them to think about what would be interesting to know. What do they want to find out? What do they expect? Types of Data: Measurement Scales Categorical: male/female blood type (A, B, AB, O) Stage 1, Stage 2, Stage 3 melanoma Numerical: weight # of white blood cells mpg Categorical Types of Data: Measurement Scales Nominal: numbers are arbitrary; 1= male, 2 = female Ordinal: numbers have order (i.e., more or less) but you do not know how much more or less; 1st place runner was faster but you do not know how much faster than 2nd place runner Numerical Interval: numbers have order and equal intervals so you know how much more or less; A temperature of 102 is 2 points higher than one of 100 Ratio: same as interval but because there is an absolute zero you can talk meaningfully about twice as much and half as much; Weighing 200 pounds is twice as heavy as 100 pounds Entering Data into SPSS You will need to specify the following for each variable: • Name of the variable • Type of data: Numerical or String • Type of measure: Nominal, Ordinal, Scale (Interval or Ratio) • Labels or units Types of Statistical Procedures Descriptive: Organize and summarize data Inferential: Draw inferences about the relations between variables; use samples to generalize to population Descriptive Statistics The first step is ALWAYS getting to know your data Summarize and visualize your data It is a big mistake to just throw numbers into the computer and look at the output of a statistical test without any idea what those numbers are trying to tell you or without checking if the assumptions for the test are met. Descriptive Statistics for Numerical Data For each group/treatment: Numerical Summaries: • Measures of central tendency • Measures of variability • Representing numerical summaries in tables Graphical Summaries: • Histograms • Bar Graph of Means/Mean Plots • Boxplots Choosing the Appropriate Type of Graph One numerical variable (e.g., Height): Histogram, Boxplot One numerical variable and one categorical variable (e.g., Height vs. Gender): Side-by-side Histograms, Boxplots, Bar Graph of Means, Mean Plots Side by Side Histograms Box Plots Bar Graph of Means and Error Bars Mean Plot with Error Bars Shapes of Distributions • Normal (approximately symmetric) • Skewed • Unimodal/Bimodal/Uniform/Other • Outliers The Normal Curve “Bell-shaped” Most scores in center, tapering off symmetrically in both tails Amount of peakedness (kurtosis) can vary Variations to Normal Distribution Skew = Asymmetrical distribution • Positive/right skew = greater frequency of low scores than high scores (longer tail on high end/right) • Negative/left skew = greater frequency of high scores than low scores (longer tail on low end/left) Histogram Showing Positive (Right) Skew Variations to Normal Distribution Bimodal distribution: two peaks Rectangular/Uniform: all scores occur with equal frequency Potential Outlier: An observation that is well above or below the overall bulk of the data Variations to Normal Distribution Bimodal Distribution (two peaks) Number of People 25 20 15 10 Rectangular/Uniform Distribution (equal # highs and lows) 5 0 16 Weight 14 12 10 8 Number of People 100-119 120-139 140-159 160-179 180-199 200-219 220-239 6 4 2 0 100-119 120-139 140-159 160-179 180-199 200-219 220-239 Potential Outlier in Distribution Number of People Weight 16 14 12 10 8 6 4 2 0 100119 120139 140159 160179 180199 200219 Weight 220239 240259 260279 280299 Measuring Skewness Skewness measures the extent to which a distribution deviates from symmetry around the mean; SPSS computes value A value of 0 represents a symmetric or evenly balanced distribution (i.e., a normal distribution). A positive skewness indicates a greater number of smaller values (peak is to the left). A negative skewness indicates a greater number of larger values (peak is to the right). Kurtosis Kurtosis is a measure of the “peakedness” or “flatness” of a distribution; SPSS computes value A kurtosis value near 0 indicates a distribution shape close to normal. A negative kurtosis indicates a shape flatter than normal. A positive value indicates more peaked than normal. An extreme kurtosis (e.g., |k| > 5.0) indicates a distribution where more of the values are in the tails of the distribution than around the mean. Caution!! It is important to determine normality so you can 1. Choose appropriate measures of central tendency and variability 2. Use hypothesis tests appropriately Assessing Normality Method 1: Make a histogram of numerical data and compare with normal curve, Check if the histogram is unimodal and symmetric, bell-shaped Method 2: Kurtosis and skewness values are between +1 is considered excellent, but a value between +2 is acceptable in many analyses in the life sciences. Assessing Normality Method 3: Conduct a hypothesis test for normality • Shapiro-Wilk (n<2000) • Kolmogorov-Smirnov (n>2000) Ho: Data come from a population with a normal distribution Ha: Data do not come from a population with a normal distribution So if p-value < .05, conclude the distribution is not normal Descriptive Statistics Once you know what type of measurement scale the data were measured on, you can choose the most appropriate statistics to summarize them: Measures of central tendency: Most representative score Measures of dispersion: How far spread out scores are Measures of Central Tendency Central tendency = Typical or representative value of a group of scores • Mean: Average score • Median: middlemost score; score at 50th percentile; half the scores are at or above, half are at or below • Mode: Most frequently occurring score(s) Measures of Central Tendency Measure Definition Takes Every Value Into Account? When to Use Mean M=X/n Yes Numerical data BUT… Can be heavily influenced by outliers so can give inaccurate view if distribution is not (approximately) symmetric Median Middle value No Ordinal data or for numerical data that are skewed Mode Most frequent data value No Nominal data Three Different Distributions That Have the Same Mean Mean Sample A Sample B Sample C 0 8 6 2 7 6 6 6 6 10 5 6 12 4 6 6 6 6 Measures of Variability Knowing what the center of a set of scores is is useful but…. How far spread out are all the scores? Were all scores the same or did they have some variability? Range, Standard deviation, Interquartile range Variability = extent to which scores in a distribution differ from each other; are spread out The Range as a Measure of Variability Difference between lowest score in the set and highest score • Ages ranged from 27 to 56 years of age • There was a 29-year age range • The number of calories ranged from 256 to 781 Sample Standard Deviation Standard deviation = How far on “average” do the scores deviate around the mean? s = SD = 2 X M N 1 • In a normal distribution, 68% of the scores fall within 1 standard deviation of the mean (M SD) • The bigger the SD is, the more spread out the scores are around the mean Variations of the Normal Curve (larger SD = wider spread) Interquartile Range Quartile 1: 25th percentile Quartile 2: 50th percentile (median) Quartile 3: 75th percentile Quartile 4: 100th percentile Interquartile range = IQR = Score at 75th percentile – Score at 25th percentile So this is the midmost 50% of the scores Interquartile Range on Positively (Right) Skewed Distribution IQR is often used for ordinal data or for interval or ratio data that are skewed (does not consider ALL the scores) Relationship between Quartiles and the Boxplot Box is formed by Q1, Median and Q3 Whiskers extend to the smallest and largest observations that are not outliers Extreme outliers lie outside the interval Q1 – 3*IQR and Q3+ 3*IQR (denoted by *) Mild outliers lie outside of the interval Q1 – 1.5*IQR and Q3 + 1.5*IQR (denoted by o) Measures of Variability Measure Definition Takes Every Value Into Account? Range Highest lowest score No, only based on two most extreme values To give crude measure of spread Standard Deviation 68% of the data fall within 1 SD of the mean Yes, but describes majority For numerical data that are approximately symmetric or normal Interquartile Range Middle 50% of the data fall within the IQR No, but describes most For ordinal data or for numerical data that are skewed When to Use Presenting Measures of Central Tendency and Variability in Text The number of fruit flies observed each day ranged from 0 to 57 (M = 25.32, SD = 5.08). Plants exposed to moderate amounts of sunlight were taller (M = 6.75 cm, SD = 1.32) than plants exposed to minimal sunlight (M = 3.45 cm, SD = 0.95). The response time to a patient’s call ranged from 0 to 8 minutes (M = 2.1, SD = 0.8) Sentences should always be grammatical and sensible. Do not just list a bunch of numbers. Use the statistical information to supplement what you are saying Presenting Measures of Central Tendency and Variability in Tables (Symmetric Data) Range 0 to 57 M 25.32 SD 5.08 Weight (lbs) 118 to 208 160.31 10.97 Response time to Patient’s Call (mins) 0 to 8 2.1 0.8 Number of Fruit Flies Be sure to include the units of measurement! You can include an additional column to put the sample size (N) Presenting Measures of Central Tendency and Variability in Tables (Skewed Data) Range 0 to 57 Median 27 IQR 9 Weight (lbs) 118 to 208 155.6 12 Response time to Patient’s Call (mins) 0 to 8 1.5 1 Number of Fruit Flies Be sure to include the units of measurement! You can include an additional column to put the sample size (N) What’s the Difference Between SD and SE? Sometimes instead of standard deviation, people report the standard error of the mean (SE or SEM) in text, tables, and figures Standard deviation (SD) = “Average” deviation of individual scores around mean of scores • Used to describe the spread of your (one) sample Standard error (SE = SD/N) = How much on average sample means would vary if you sampled more than once from the same population (we do not expect the particular mean we got to be an exact reflection of the population mean) • Used to describe the spread of all possible sample means and used to make inferences about the population mean Use the one that makes sense for your research question. Are you describing one data set only (SD) or generalizing to the population (SE)? Teaching Tips • Dangers of low N: Be sure to emphasize to students that with a small sample size, data may not be representative of the population at large and they should take care in drawing conclusions • Dangers of Outliers: Be sure your students look for outliers (extreme values) in their data and discuss appropriate strategies for dealing with them (e.g., eliminating data because the researcher assumes it is a mistake instead of part of the natural variability in the population = subjective science) Time to Practice • • • • Working with SPSS Getting descriptive statistics Making appropriate graphs Looking at shapes of distributions Teaching tips: • Hands-on practice is important for your students • Sometimes working with a partner helps