Hss2381a – stats and stuff
The Normal Curve, part 1
No class on Thursday!
Interdisciplinary Journal
of Health Sciences
• WANTED:
Seeking applicants for the 2011-2012 editorial team
• Students in both the English and French HSS streams
are encouraged to apply.
• Send an email expressing your interest in the position
to [email protected], with your resume attached.
• Successful candidates will be invited to a panel
interview.
• Deadline to apply: Wednesday, September 28th, 2011
Last time….
• We covered measures of central tendency:
– Mode
– Median
– Mean
• And two measures of variability:
– Range
– Interquartile Range
Two More Measures of Variability
• Standard deviation
• Variance
The Standard Deviation
• Standard deviation (SD or σ): An index that
conveys how much, on average, scores in a
distribution vary
• SDs are based on deviation scores (x),
calculated by subtracting the mean from each
person’s original score
x=X-M
Standard Deviation Interpretation
• In a normal distribution, a fixed percentage
of cases lie within certain distances from the
mean:
Example
• We weigh 10 students and collect their weight
in pounds:
– 110 120 130 140 150 150 160 170 180 190
• What is the mean? (M)
150
For the lightest person, their weight is the mean – 40
For the heaviest person, their weight is the mean +40
What’s a deviation?
• A “deviation” is how much each data point
deviates from the mean
– So for X1 the deviation is -40
– And for X10 the deviation is +40
• So what’s a “standard deviation”?
• It’s some sort of measure of how much the
“typical” data point deviates from the mean
Let’s go back to our data…
• Mean = 150
Data (weights in pounds)    Deviation from mean
110                         -40
120                         -30
130                         -20
140                         -10
150                           0
150                           0
160                          10
170                          20
180                          30
190                          40
TOTAL                         0
Defining Standard Deviation
• The sum of all deviation scores in a
distribution always = 0
• To compute SDs, deviation scores must be
squared (x²) before being summed
• SD equation:
SD = √( Σx² ÷ (N − 1) )
Standard Deviation (cont’d)
Weights (pounds):
110 120 130 140 150 150 160 170 180 190
Deviation scores (x) for M = 150:
-40 -30 -20 -10 0 0 10 20 30 40
Squared deviation scores (x²):
1600 900 400 100 0 0 100 400 900 1600
Sum of squared deviation scores:
1600+900+400+100+0+0+100+400+900+1600 = 6000
SD = √(6000 ÷ (N − 1)) = √(6000 ÷ 9) = 25.82
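The worked calculation above can be retraced step by step. This is a minimal sketch using only the Python standard library; the variable names are illustrative and not from the textbook.

```python
import math
import statistics

weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]  # pounds

mean = sum(weights) / len(weights)           # M = 150
deviations = [x - mean for x in weights]     # deviation scores: X - M
squared = [d ** 2 for d in deviations]       # squared deviation scores
sum_of_squares = sum(squared)                # 6000

# Sample SD: divide the sum of squares by N - 1, then take the square root
sd = math.sqrt(sum_of_squares / (len(weights) - 1))

print(mean, sum(deviations), sum_of_squares, round(sd, 2))
# statistics.stdev uses the same N - 1 denominator
print(round(statistics.stdev(weights), 2))
```

Both lines should report an SD of 25.82, matching the hand calculation.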
A little bit about notation
σ: “sigma” = standard deviation in the reference population
s: lower case “s” = standard deviation in the sample
The textbook uses “SD” for both
Standard Deviation Interpretation
• Provides a “standard”—the SD indicates the
average amount of deviation of scores from
the mean
• Tells you how wrong, on average, the mean
is as a summary of the overall distribution
• An SD provides valuable information when
the distribution is normal:
– Nearly all scores (about 99.7%) fall within three SDs
above and below the mean in a normal distribution
Standard Deviation Interpretation (cont’d)
• In a normal distribution, a fixed percentage
of cases lie within certain distances from the
mean:
SDs and Individual Scores
• A person who scores one SD below the mean
has a higher score than 16% of the cases
(2.3% + 13.6%)
• A person who scores one SD above the mean
has a higher score than 84% of the cases
(50.0% + 34.1%)
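The two percentages above can be checked against the standard normal curve. A small sketch, assuming Python's standard-library `statistics.NormalDist`; the slide's 16% and 84% are rounded versions of these values.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, SD 1

# Proportion of cases scoring lower than someone 1 SD below / above the mean
below_minus_1 = z.cdf(-1.0)
below_plus_1 = z.cdf(1.0)

print(f"{below_minus_1:.1%}")  # 15.9%
print(f"{below_plus_1:.1%}")   # 84.1%
```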
Standard Deviation: Advantages
• Takes all data into account in describing
variability
• Is more stable as a measure of variability than
the range or IQR
• Lends itself to computation of other measures
often used in inferential statistics
• Is helpful in interpreting individual scores
when data are distributed approximately
normally
Standard Deviation: Disadvantages
• Can be influenced by extreme scores
• Not as “intuitive” or as easy to interpret as
the range
Variance
• An important variability concept in inferential
statistics, but not used descriptively
• The variance = SD²
• In earlier example, SD² = 25.82² = 666.67
• Not easily interpreted because it is not in
units of original data—it is in units squared
(here, pounds squared)
More about notation
σ: “sigma” = standard deviation in the reference population
s: lower case “s” = standard deviation in the sample
σ²: “sigma squared” = variance in the reference population
s²: variance in the sample
Formulae for Variance
Population variance: σ² = Σ(X − μ)² ÷ N
Sample variance: s² = Σ(X − M)² ÷ (N − 1)
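The difference between the two variances is only the denominator (N for a population, N − 1 for a sample). A brief sketch on the weights example, assuming the standard-library `statistics` module:

```python
import statistics

weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]  # pounds

# Treating the 10 students as a sample: sum of squares / (N - 1)
sample_var = statistics.variance(weights)

# Treating the 10 students as the whole population: sum of squares / N
population_var = statistics.pvariance(weights)

print(round(sample_var, 2))   # 666.67 (pounds squared)
print(population_var)
```

The sample variance is 6000 ÷ 9 and the population variance is 6000 ÷ 10, so the sample value is always the slightly larger of the two.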
Measurement Scales and Descriptive
Statistics
Scale                 Central Tendency Index    Variability Index
Nominal               Mode                      --
Ordinal               Median                    Range, IQR
Interval and ratio    Mean                      Standard deviation, Variance
Relative Standing
• Central tendency and variability indexes
describe a distribution
• There are also descriptive statistics to
describe individual scores—i.e., their relative
standing or position in a distribution:
– Percentile ranks
– Standard scores
Percentiles
• A percentile is one one-hundredth of a
distribution
• Quartiles divide a distribution into quarters
• Deciles divide a distribution into tenths
• Each percentile, quartile, etc. can be
determined in relation to a score in a
distribution
Percentile Rank
• A percentile rank is the location of a given
score in the distribution—it communicates
what percentage of cases fall at or below
that value
– Score → What percentile rank?
– Percentile → What score?
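The “at or below” definition can be written out directly. This is a sketch only: `percentile_rank` is a hypothetical helper, and statistical packages may handle ties or interpolation differently.

```python
def percentile_rank(scores, value):
    """Percentage of cases that fall at or below `value`."""
    at_or_below = sum(1 for s in scores if s <= value)
    return 100 * at_or_below / len(scores)

weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]  # pounds

rank_150 = percentile_rank(weights, 150)  # 6 of 10 cases are <= 150
rank_110 = percentile_rank(weights, 110)  # 1 of 10 cases is <= 110

print(rank_150)  # 60.0
print(rank_110)  # 10.0
```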
Percentiles and Outliers
• Outliers are often defined in relation to
percentiles
• There are:
– Mild outliers
– Extreme outliers
(NOT what we’re talking about)
An outlying observation, or
outlier, is one that appears to
deviate markedly from other
members of the sample in which
it occurs.
-Grubbs (Wikipedia)
In this course (as per the textbook), an
outlier is a value that lies more than 1.5 times the IQR
below Q1 or above Q3
Outliers:
Formal Definition
• A mild outlier is a score that is between 1.5
and 3.0 times the value of the IQR, below Q1
or above Q3
• An extreme outlier is a score that is greater
than 3.0 times the value of the IQR, below
Q1 or above Q3
Box Plots
• A box plot (or box-and-whiskers plot) is a
graphic depiction of a distribution that
shows the median, the IQR, and the outer
limits of values not considered outliers
– Outlying cases can be shown on the box plot,
with identifying information (e.g., an ID
number)
Traditionally…
But for the purposes of this course (due to
the textbook’s insistence)…
The extent of the boxplot is NOT the range, but
rather those data points that are NOT outliers
Box Plots (cont’d)
• Bottom of “box” shows Q1
• Top of “box” shows Q3
• Horizontal line in box shows median
• “Whiskers” show outer limits of what is NOT
an outlier
– In SPSS, a circle O indicates value and ID of a
mild outlier
– An asterisk * is for an extreme outlier
Box Plot Illustration – p52
Textbook Heart Rate Data:
Q1 = 62
Q2 = 66 = Median
Q3 = 68
“Whiskers” limits: 53, 77
Mild outliers:
50 (#106), 45 (#105)
Extreme outliers:
40 (#104), 90 (#103),
95 (#102), 100 (#101)
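The mild/extreme rule can be applied to the heart rate values listed above. A minimal sketch: `classify` is a hypothetical helper, and treating a score that sits exactly on a limit (such as 53 or 77) as not an outlier is an assumption.

```python
def classify(score, q1, q3):
    """Label a score using the 1.5 x IQR and 3.0 x IQR rules."""
    iqr = q3 - q1
    # Distance beyond the nearer quartile (0 if the score is inside the box)
    if score < q1:
        distance = q1 - score
    elif score > q3:
        distance = score - q3
    else:
        distance = 0
    if distance > 3.0 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

q1, q3 = 62, 68  # textbook heart rate data, so IQR = 6

for score in (53, 50, 45, 40, 77, 90, 95, 100):
    print(score, classify(score, q1, q3))
```

With an IQR of 6, the mild limits are 62 − 9 = 53 and 68 + 9 = 77, and the extreme limits are 62 − 18 = 44 and 68 + 18 = 86, which reproduces the grouping on the slide.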
Box Plots Versus Histograms
• Outliers can be seen in histograms, but box
plots give more useful information about
degree of extremity and ID numbers
(Stolen from wikipedia)
Standard Scores
• Also called z-score or z-statistic or z-value
or normal score
• Is a measure of how far an observation is
from the mean of its distribution
• The z-score only has meaning if you know
the parameters of the reference population
• i.e.: μ and σ
Standard Scores
• Standard scores—another index of “relative
standing” helpful in interpreting raw scores
• A standard score (also called a z score) is a
score expressed in standard deviation units,
in relative distance from the mean
Standard Scores (cont’d)
• Standard score equation:
z = (X – M) ÷ SD
• That is, the mean is subtracted from an
individual score, then divided by the SD
• For example:
M = 100, SD = 25, X = 125, z = 1.0
M = 100, SD = 25, X = 50, z = -2.0
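The same two examples as a short sketch; `z_score` is an illustrative name, not a library function.

```python
def z_score(x, mean, sd):
    """Express a raw score in SD units relative to the mean."""
    return (x - mean) / sd

z_high = z_score(125, mean=100, sd=25)
z_low = z_score(50, mean=100, sd=25)

print(z_high)  # 1.0
print(z_low)   # -2.0
```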
How is this useful?
• Very useful in standardized testing (like MCAT,
GRE, SAT, etc)
• Allows us to:
– Calculate the probability of a score occurring
within a normal distribution
– Compare two scores that are from different
normal distributions
Calculating a Probability Using a z-score
For a normally distributed variable
(such as MCAT scores in Canada), 95% of
observations fall within 1.96 SDs of the mean,
that is, between z = -1.96 and z = +1.96.
Example
• We know that the LSAT score in Canada is
normally distributed. The mean mark is 60%
and the SD is 15. So….
– What is the lowest mark among those who were
in the top 10% of performers?
– (Why? Because law schools will only take the top
10% and need to know what mark to make their
cut-off)
Example (cont’d)
• The top 10% of a normal distribution lies above z = 1.282
• Cut-off mark = M + (z × SD) = 60 + (1.282 × 15) = 79.2%
We get the “1.282” by looking it up in a table, or using a z-score calculator:
http://www.fourmilab.ch/rpkp/experiments/analysis/zCalc.html
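Instead of a printed table, the z value for the top 10% and the resulting cut-off mark can be looked up in code. A sketch assuming Python's standard-library `statistics.NormalDist` (available from Python 3.8) and the slide's figures of mean 60 and SD 15:

```python
from statistics import NormalDist

lsat = NormalDist(mu=60, sigma=15)  # mean mark 60%, SD 15

# The top 10% starts at the 90th percentile
z_cutoff = NormalDist().inv_cdf(0.90)  # z value on the standard normal curve
cutoff = lsat.inv_cdf(0.90)            # same point on the LSAT scale

print(round(z_cutoff, 3))  # 1.282
print(round(cutoff, 1))    # 79.2
```

The second value is just 60 + (1.282 × 15), so the lowest mark among the top 10% of performers is about 79.2%.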
Using z-scores to compare tests
• A student is in two classes, English and Math.
• She got 70% in English and 70% in Math and
wants to know which class she’s doing better
in
– Why isn’t the answer automatically “English”?
Using z-scores to compare tests
Since these scores are from two different distributions, we
need to standardise them into z-scores so that they can be
directly compared. This gives us:
Using z-scores to compare tests
How do we interpret this?
z = 0.67 means that the
student scored 0.67 SDs
above the mean in both
classes. She is above
average in both classes,
and doing equally well
in each.
(If we use a z-score
calculator, we’d find out
that z=0.67 means that
she’s in the top 25.1% of
the class.)
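The “top 25.1%” figure is the area of the normal curve above z = 0.67. A short check, assuming the standard-library `statistics.NormalDist` stands in for the z-score calculator:

```python
from statistics import NormalDist

z = 0.67

# Proportion of the class scoring above this z value (upper tail)
proportion_above = 1 - NormalDist().cdf(z)

print(f"{proportion_above:.1%}")  # 25.1%
```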
Standard Scores (cont’d)
• Standard scores have a mean of 0.0 and an SD of
1.0:
• But z scores can be transformed mathematically to
have any mean and SD
• Most typical:
– Mean = 500, SD = 100 (e.g., GRE, SAT)
– Mean = 100, SD = 15 (e.g., IQ tests)
– Mean = 50, SD = 10 (called T scores)
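Moving a z score onto another scale is a single multiply-and-add: new score = new mean + (z × new SD). A sketch with a hypothetical `rescale` helper, using the first two scales listed above:

```python
def rescale(z, new_mean, new_sd):
    """Transform a z score to a scale with the given mean and SD."""
    return new_mean + z * new_sd

gre_style = rescale(1.0, new_mean=500, new_sd=100)  # 1 SD above the mean
iq_style = rescale(-2.0, new_mean=100, new_sd=15)   # 2 SDs below the mean

print(gre_style)  # 600.0
print(iq_style)   # 70.0
```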
The Normal Distribution
• Central Limit Theorem:
– Under “mild” conditions, the sum (or mean) of a large
number of independent random variables will be
distributed approximately “normally”
• For fun, go to:
– http://www.math.csusb.edu/faculty/stanton/probstat/clt.html
– This is an “applet” that you keep clicking on. It
produces a graph of a random variable. You will
see that it always ends up being a Normal curve
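A similar experiment can be run without the applet. This is a rough simulation sketch, not the applet's own method: the sample size of 30, the 2000 repetitions and the fixed seed are all arbitrary choices made for illustration. Individual draws are uniform (flat, not bell-shaped), yet their means pile up around 0.5 in a roughly normal shape.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n independent draws from a uniform distribution on [0, 1)."""
    return sum(rng.random() for _ in range(n)) / n

means = [sample_mean(30) for _ in range(2000)]

centre = statistics.mean(means)
spread = statistics.stdev(means)

# If the means are roughly normal, about 68% lie within one SD of their centre
within_one_sd = sum(1 for m in means if abs(m - centre) <= spread) / len(means)

print(round(centre, 3), round(spread, 3), round(within_one_sd, 3))
```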
Properties of the Normal
Distribution
• About 68% of values drawn from a normal
distribution are within one standard deviation
(σ) of the mean
• About 95% of the values lie within two
standard deviations of the mean
• About 99.7% are within three standard
deviations
• This fact is known as the 68-95-99.7 rule or
the empirical rule or the 3-sigma rule
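The three percentages can be recovered from the standard normal curve. A small sketch, assuming the standard-library `statistics.NormalDist`:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, SD 1

# Area between -k and +k standard deviations, for k = 1, 2, 3
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(k, f"{proportion:.1%}")
# 1 68.3%
# 2 95.4%
# 3 99.7%
```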
Homework
• P.57, A4, A5