Statistics as a Tool in Scientific Research

Statistics for Everyone Workshop
Summer 2011
Part 1
Statistics as a Tool in Scientific Research:
Review of Summarizing and Graphically
Representing Data; Introduction to SPSS
Workshop presented by Linda Henkel and Laura McSweeney of Fairfield University
Funded by the Core Integration Initiative and the Center for Academic Excellence at
Fairfield University and NSF CCLI Grant
Statistics as a Tool in Scientific Research
Types of Research Questions
• Descriptive (What does X look like?)
• Correlational (Is there an association between
X and Y? As X increases, what does Y do?)
• Experimental (Do changes in X cause changes
in Y?)
Different statistical procedures allow us to
answer the different kinds of research
questions
Statistics as a Tool in Scientific Research
 Start with the science and use statistics as
a tool to answer the research question
Get your students to formulate a research
question first:
• How often does this happen?
• Did all plants/people/chemicals act the
same?
• What happens when I add more sunlight,
give more praise, pour in more water?
Statistics as a Tool in Scientific Research
Can collect data in class
Can use already collected data (yours or
database)
Helping students to formulate research
question: Ask them to think about what
would be interesting to know. What do they
want to find out? What do they expect?
Types of Data: Measurement Scales
Categorical: male/female
blood type (A, B, AB, O)
Stage 1, Stage 2, Stage 3
melanoma
Numerical:
weight
# of white blood cells
mpg
Types of Data: Measurement Scales
Categorical
Nominal: numbers are arbitrary; 1= male, 2 = female
Ordinal: numbers have order (i.e., more or less) but
you do not know how much more or less; 1st place
runner was faster but you do not know how much
faster than 2nd place runner
Numerical
Interval: numbers have order and equal intervals so
you know how much more or less; A temperature
of 102 is 2 points higher than one of 100
Ratio: same as interval but because there is an
absolute zero you can talk meaningfully about
twice as much and half as much; Weighing 200
pounds is twice as heavy as 100 pounds
Entering Data into SPSS
You will need to specify the following for each
variable:
• Name of the variable
• Type of data: Numerical or String
• Type of measure: Nominal, Ordinal, Scale
(Interval or Ratio)
• Labels or units
Types of Statistical Procedures
Descriptive: Organize and summarize
data
Inferential: Draw inferences about the
relations between variables; use
samples to generalize to population
Descriptive Statistics
The first step is ALWAYS getting to know
your data
 Summarize and visualize your data
It is a big mistake to just throw numbers into
the computer and look at the output of a
statistical test without any idea what those
numbers are trying to tell you or without
checking if the assumptions for the test are
met.
Descriptive Statistics for Numerical Data
For each group/treatment:
Numerical Summaries:
• Measures of central tendency
• Measures of variability
• Representing numerical summaries in tables
Graphical Summaries:
• Histograms
• Bar Graph of Means/Mean Plots
• Boxplots
Choosing the Appropriate Type of Graph
One numerical variable (e.g., Height):
Histogram, Boxplot
One numerical variable and one categorical
variable (e.g., Height vs. Gender):
Side-by-side Histograms, Boxplots, Bar Graph
of Means, Mean Plots
Side by Side Histograms
Box Plots
Bar Graph of Means and Error Bars
Mean Plot with Error Bars
Shapes of Distributions
• Normal (approximately symmetric)
• Skewed
• Unimodal/Bimodal/Uniform/Other
• Outliers
The Normal Curve
“Bell-shaped”
Most scores in center, tapering off
symmetrically in both tails
Amount of peakedness (kurtosis) can vary
Variations to Normal Distribution
Skew = Asymmetrical distribution
• Positive/right skew = greater frequency of low
scores than high scores (longer tail on high
end/right)
• Negative/left skew = greater frequency of high
scores than low scores (longer tail on low end/left)
Histogram Showing Positive (Right) Skew
Variations to Normal Distribution
Bimodal distribution: two peaks
Rectangular/Uniform: all scores occur with
equal frequency
Potential Outlier: An observation that is well
above or below the overall bulk of the data
Variations to Normal Distribution
Bimodal Distribution (two peaks)
[Figure: histogram of Number of People by Weight, bins 100-119 through 220-239]
Rectangular/Uniform Distribution (equal # highs and lows)
[Figure: histogram of Number of People by Weight, bins 100-119 through 220-239]
Potential Outlier in Distribution
[Figure: histogram of Number of People by Weight, bins 100-119 through 280-299]
Measuring Skewness
Skewness measures the extent to which a
distribution deviates from symmetry around the
mean; SPSS computes this value
A value of 0 represents a symmetric or evenly
balanced distribution (e.g., a normal distribution).
A positive skewness indicates a greater number of
smaller values (peak is to the left).
A negative skewness indicates a greater number of
larger values (peak is to the right).
Kurtosis
Kurtosis is a measure of the “peakedness” or
“flatness” of a distribution; SPSS computes value
A kurtosis value near 0 indicates a distribution
shape close to normal.
A negative kurtosis indicates a shape flatter than
normal.
A positive value indicates more peaked than normal.
An extreme kurtosis (e.g., |k| > 5.0) indicates a
distribution that departs strongly from normal; a large
positive value means heavier tails (more extreme scores)
than a normal distribution.
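As a rough illustration of what these two values capture, the Python sketch below computes simple moment-based skewness and excess kurtosis. It is only a sketch under assumptions: SPSS applies small-sample corrections, so its reported values will differ somewhat, and the function names and example data here are hypothetical.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    m = fmean(values)
    return fmean([(x - m) ** k for x in values])

def skewness(values):
    """Moment-based skewness; 0 for a perfectly symmetric data set."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal curve scores near 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]              # made-up, evenly balanced scores
right_skewed = [1, 1, 1, 2, 2, 3, 10]    # made-up, long tail on the high end

print(skewness(symmetric))         # essentially 0: symmetric
print(skewness(right_skewed))      # positive: right skew
print(excess_kurtosis(symmetric))  # negative: flatter than normal
```

The sign of each result is what matters for reading the shape; the exact magnitudes depend on which formula the software uses.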
Caution!!
It is important to determine normality so you
can
1. Choose appropriate measures of central
tendency and variability
2. Use hypothesis tests appropriately
Assessing Normality
Method 1: Make a histogram of numerical data and
compare with normal curve, Check if the
histogram is unimodal and symmetric, bell-shaped
Method 2: Check the kurtosis and skewness values. A value
between -1 and +1 is considered excellent, but a value
between -2 and +2 is acceptable in many analyses in the
life sciences.
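The rule of thumb in Method 2 can be sketched as a small Python helper. The function name and the "questionable" label for values beyond 2 are assumptions added for illustration; the slide itself only names the "excellent" and "acceptable" bands.

```python
def normality_rating(skew, kurt):
    """Rate skewness and kurtosis against the 1 and 2 rule-of-thumb cutoffs."""
    worst = max(abs(skew), abs(kurt))  # judge by the larger departure from 0
    if worst <= 1:
        return "excellent"
    if worst <= 2:
        return "acceptable"
    return "questionable"  # hypothetical label, not from the workshop

print(normality_rating(0.3, -0.8))  # both within 1
print(normality_rating(1.4, 0.2))   # skewness between 1 and 2
print(normality_rating(0.5, 3.1))   # kurtosis beyond 2
```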
Assessing Normality
Method 3: Conduct a hypothesis test for normality
• Shapiro-Wilk (n<2000)
• Kolmogorov-Smirnov (n>2000)
Ho: Data come from a population with a normal
distribution
Ha: Data do not come from a population with a
normal distribution
So if p-value < .05, conclude the distribution is not
normal
Descriptive Statistics
Once you know what type of measurement
scale the data were measured on, you can
choose the most appropriate statistics to
summarize them:
Measures of central tendency: Most
representative score
Measures of dispersion: How far spread out
scores are
Measures of Central Tendency
Central tendency = Typical or
representative value of a group of
scores
• Mean: Average score
• Median: middlemost score; score at 50th
percentile; half the scores are at or above,
half are at or below
• Mode: Most frequently occurring score(s)
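For instructors who want a quick check outside SPSS, the three measures can be sketched with Python's standard `statistics` module. The scores below are made up, with one deliberately extreme value to show how it pulls the mean but not the median.

```python
from statistics import mean, median, multimode

scores = [2, 3, 3, 4, 5, 5, 5, 6, 30]  # made-up data; 30 is an outlier

print(mean(scores))       # 7, pulled upward by the outlier
print(median(scores))     # 5, the middle score
print(multimode(scores))  # [5], the most frequent score(s)
```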
Measures of Central Tendency
Measure | Definition | Takes Every Value Into Account? | When to Use
Mean | $M = \sum X / n$ | Yes | Numerical data, BUT… can be heavily influenced by outliers, so it can give an inaccurate view if the distribution is not (approximately) symmetric
Median | Middle value | No | Ordinal data or numerical data that are skewed
Mode | Most frequent data value | No | Nominal data
Three Different Distributions That
Have the Same Mean
Sample A | Sample B | Sample C
0 | 8 | 6
2 | 7 | 6
6 | 6 | 6
10 | 5 | 6
12 | 4 | 6
Mean = 6 | Mean = 6 | Mean = 6
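The point of this slide can be checked with a short Python sketch. It assumes the table is read column by column (Sample A = 0, 2, 6, 10, 12; Sample B = 8, 7, 6, 5, 4; Sample C = 6, 6, 6, 6, 6); the dictionary layout is just one convenient way to hold the data.

```python
from statistics import mean, stdev

# Values as read from the slide's table, one list per sample (an assumption
# about how the flattened table should be read).
samples = {
    "A": [0, 2, 6, 10, 12],
    "B": [8, 7, 6, 5, 4],
    "C": [6, 6, 6, 6, 6],
}

for name, values in samples.items():
    spread = max(values) - min(values)  # range: highest minus lowest score
    print(name, mean(values), spread, round(stdev(values), 2))
```

All three samples share a mean of 6, yet their range and standard deviation differ, which is why a measure of center alone is not enough.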
Measures of Variability
Knowing what the center of a set of scores is
is useful but….
How far spread out are all the scores?
Were all scores the same or did they have
some variability?
Range, Standard deviation, Interquartile
range
Variability = extent to which scores in a
distribution differ from each other; are spread
out
The Range as a Measure of Variability
Difference between lowest score in the set
and highest score
• Ages ranged from 27 to 56 years of age
• There was a 29-year age range
• The number of calories ranged from 256
to 781
Sample Standard Deviation
Standard deviation = How far on “average” do
the scores deviate around the mean?
$s = SD = \sqrt{\frac{\sum (X - M)^2}{N - 1}}$
• In a normal distribution, 68% of the scores
fall within 1 standard deviation of the mean
(M ± SD)
• The bigger the SD is, the more spread out
the scores are around the mean
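To make the formula concrete, here is a minimal Python sketch that follows it step by step (deviations from the mean, squared, summed, divided by N - 1, then square-rooted) and compares the result with the standard library's `stdev`. The function name and the plant heights are hypothetical.

```python
from math import sqrt
from statistics import stdev

def sample_sd(scores):
    """Sample SD: square root of the summed squared deviations over N - 1."""
    n = len(scores)
    m = sum(scores) / n                               # the mean, M
    squared_deviations = [(x - m) ** 2 for x in scores]
    return sqrt(sum(squared_deviations) / (n - 1))

heights_cm = [4, 6, 7, 8, 10]  # made-up plant heights

print(sample_sd(heights_cm))   # hand-rolled formula
print(stdev(heights_cm))       # library result, should agree
```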
Variations of the Normal Curve
(larger SD = wider spread)
Interquartile Range
Quartile 1: 25th percentile
Quartile 2: 50th percentile (median)
Quartile 3: 75th percentile
Quartile 4: 100th percentile
Interquartile range = IQR =
Score at 75th percentile – Score at 25th percentile
So this is the midmost 50% of the scores
Interquartile Range on Positively (Right)
Skewed Distribution
IQR is often used for ordinal data or for interval or ratio
data that are skewed (does not consider ALL the
scores)
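A minimal Python sketch of the quartile and IQR calculation follows. Software packages use slightly different quartile conventions, so SPSS may report somewhat different values; the `method="inclusive"` choice and the data are assumptions made for illustration.

```python
from statistics import median, quantiles

wait_minutes = [1, 1, 2, 2, 3, 3, 4, 5, 8, 15, 20]  # made-up, right-skewed

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(wait_minutes, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)            # 25th, 50th, and 75th percentiles
print(iqr)                   # width of the middle 50% of the scores
print(median(wait_minutes))  # matches the second quartile
```

Note that the two largest values (15 and 20) have no effect on the IQR, which is why it suits skewed data.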
Relationship between Quartiles and the Boxplot
Box is formed by Q1, Median and Q3
Whiskers extend to the smallest and
largest observations that are not
outliers
Extreme outliers lie outside the
interval from Q1 - 3*IQR to
Q3 + 3*IQR (denoted by *)
Mild outliers lie outside the interval from
Q1 - 1.5*IQR to Q3 + 1.5*IQR but are not extreme
(denoted by o)
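The outlier rules above can be sketched in Python as follows. The function name, the quartile convention (`method="inclusive"`), and the weights are assumptions; SPSS may compute quartiles slightly differently and so may flag borderline points differently.

```python
from statistics import quantiles

def classify_outliers(values):
    """Split points into mild (beyond 1.5*IQR) and extreme (beyond 3*IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            extreme.append(x)   # would be drawn as * on the boxplot
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            mild.append(x)      # would be drawn as o on the boxplot
    return mild, extreme

weights = [150, 152, 155, 158, 160, 162, 165, 168, 170, 195, 240]  # made-up
mild, extreme = classify_outliers(weights)
print(mild)     # [195]
print(extreme)  # [240]
```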
Measures of Variability
Measure | Definition | Takes Every Value Into Account? | When to Use
Range | Highest - lowest score | No, only based on two most extreme values | To give crude measure of spread
Standard Deviation | 68% of the data fall within 1 SD of the mean | Yes, but describes majority | For numerical data that are approximately symmetric or normal
Interquartile Range | Middle 50% of the data fall within the IQR | No, but describes most | For ordinal data or for numerical data that are skewed
Presenting Measures of Central Tendency and
Variability in Text
The number of fruit flies observed each day ranged
from 0 to 57 (M = 25.32, SD = 5.08).
Plants exposed to moderate amounts of sunlight were
taller (M = 6.75 cm, SD = 1.32) than plants exposed
to minimal sunlight (M = 3.45 cm, SD = 0.95).
The response time to a patient’s call ranged from 0 to 8
minutes (M = 2.1, SD = 0.8).
 Sentences should always be grammatical and
sensible. Do not just list a bunch of numbers. Use
the statistical information to supplement what you are
saying
Presenting Measures of Central Tendency
and Variability in Tables (Symmetric Data)
Variable | Range | M | SD
Number of Fruit Flies | 0 to 57 | 25.32 | 5.08
Weight (lbs) | 118 to 208 | 160.31 | 10.97
Response time to Patient’s Call (mins) | 0 to 8 | 2.1 | 0.8
Be sure to include the units of measurement! You
can include an additional column to put the sample
size (N)
Presenting Measures of Central Tendency and
Variability in Tables (Skewed Data)
Variable | Range | Median | IQR
Number of Fruit Flies | 0 to 57 | 27 | 9
Weight (lbs) | 118 to 208 | 155.6 | 12
Response time to Patient’s Call (mins) | 0 to 8 | 1.5 | 1
Be sure to include the units of measurement! You
can include an additional column to put the sample
size (N)
What’s the Difference Between SD and SE?
Sometimes instead of standard deviation, people report the
standard error of the mean (SE or SEM) in text, tables, and
figures
Standard deviation (SD) = “Average” deviation of individual
scores around mean of scores
• Used to describe the spread of your (one) sample
Standard error ($SE = SD/\sqrt{N}$) = How much on average sample
means would vary if you sampled more than once from the
same population (we do not expect the particular mean we got
to be an exact reflection of the population mean)
• Used to describe the spread of all possible sample means and
used to make inferences about the population mean
 Use the one that makes sense for your research question.
Are you describing one data set only (SD) or generalizing to
the population (SE)?
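The difference between the two can be sketched in a few lines of Python. The function name and the response times are made up for illustration; the calculation is simply the sample SD divided by the square root of N.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(scores):
    """Standard error of the mean: sample SD divided by sqrt(N)."""
    return stdev(scores) / sqrt(len(scores))

response_minutes = [1.5, 2.0, 2.5, 1.0, 3.0, 2.0, 2.5, 1.5, 2.0]  # made-up

print(mean(response_minutes))            # sample mean
print(stdev(response_minutes))           # SD: spread of individual scores
print(standard_error(response_minutes))  # SE: expected spread of sample means
```

Because SE divides by the square root of N, it is always smaller than SD for N > 1 and shrinks as the sample grows, while SD does not systematically change with sample size.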
Teaching Tips
• Dangers of low N: Be sure to emphasize to
students that with a small sample size, data
may not be representative of the population at
large and they should take care in drawing
conclusions
• Dangers of Outliers: Be sure your students look
for outliers (extreme values) in their data and
discuss appropriate strategies for dealing with
them (e.g., eliminating data because the researcher
assumes it is a mistake instead of part of the natural
variability in the population = subjective science)
Time to Practice
• Working with SPSS
• Getting descriptive statistics
• Making appropriate graphs
• Looking at shapes of distributions
Teaching tips:
• Hands-on practice is important for your
students
• Sometimes working with a partner helps