CENTRAL TENDENCY, VARIABILITY, NORMAL CURVE,
STANDARD SCORES!
LECTURE#2 !
PSYC218 ANALYSIS OF BEHAV. DATA !
DR. OLLIE HULME, 2011, UBC!
Housekeeping!
Earn 10 bucks in experiment at 7.30pm
Feedback
Assignment released onto Vista tonight or early tomorrow, due by 5pm next Thursday, in-class hard copy only
Install SPSS today; no excuses later in the week if you have problems with your disc. A patch for Mac will be on the website tonight.
Coglab ‘Memory span’ due next Thursday noon
Assignment 1 due next Thursday 5pm
Survey monkey due next Tuesday noon
www.surveymonkey.com/s/RSZPDPV.
You need your coglab ID # (for tracking).
Roadmap!
Syllabus test
Central tendency (Ch4)
Variability (Ch4)
Normal curve (Ch5)
Standard scores (Ch5)
SPSS Demo
Experiment for money
Syllabus quiz!
Content
a) Only material in the lectures can be tested on the exam
b) Material from lectures and textbook can be tested on the
exam
Exams
a) 1 x MT 1 x Final, all in class, all multiple choice
b) 2 MT in class, 1 final 30th july, mixture of multiple choice
and computational questions
c) 1 MT in class, 1 final 30th july, computational questions only
Grades
Combined total for HSP and clicker participation is
a) 5%
b) 6%
c) 8%
d) 2%
Central Tendency & Variability!
Central Tendency
Allows us to describe a group of scores in terms of an
average, representative or typical value
Mean, Median, Mode
How the data cluster toward the centre
Central Tendency: Mean!
Same as the average (ex. average exam grade)
Symbolized as X̄ (X bar) when the calculation is made on sample data (most common)
Symbolized as µ (mu) when the calculation is made on population data
Calculating the Mean!
= The sum of the scores divided by the number of scores
X̄ = ∑X / n (sample data)
µ = ∑X / N (population data)
Both formulas are identical; they just use different symbols so that you can differentiate sample from population
Calculate the mean on the following sample data:
X = [110, 103, 121]
X̄ = ∑X / n = (110 + 103 + 121) / 3 = 111.33
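The same calculation can be sketched in a few lines of Python. The variable names are illustrative only and are not part of the lecture.

```python
# Sample mean: the sum of the scores divided by the number of scores.
scores = [110, 103, 121]
mean = sum(scores) / len(scores)
print(round(mean, 2))  # 111.33
```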
Properties of Mean!
Sensitive to exact value of all the scores in
the distribution
If a score is changed the mean will always
change since every score is used in its
calculation
Sample Data: 110, 103, 121 (X̄ = 111.33)
Change one score: 110 changed to 109 (X̄ = 111.00)
If my IQ goes up by 1
point, then the class
average IQ will also
increase
Properties of the Mean!
It is very sensitive to extreme
scores because every score is
used in its calculation
Sample Data: 109, 103, 121 (X̄ = 111.00)
Add an extreme score: 1001 (X̄ = 333.50)
Properties of the Mean!
The sum of the deviations about the mean equals zero
Sample Data: 109, 103, 121
X̄ = 111
∑(X − X̄) = (109 − 111) + (103 − 111) + (121 − 111)
= (−2) + (−8) + (10)
= 0
Properties of the Mean!
Sum of the squared deviations of all the
scores about their mean is a minimum
∑(X − X̄)² is a minimum
While the sum of the squared deviations about the mean does not equal 0 it is smaller than if the squared deviations were taken for any value other than the mean
X      (X − 110)²            (X − 111)²            (X − 112)²
109    (109 − 110)² = 1      (109 − 111)² = 4      (109 − 112)² = 9
103    (103 − 110)² = 49     (103 − 111)² = 64     (103 − 112)² = 81
121    (121 − 110)² = 121    (121 − 111)² = 100    (121 − 112)² = 81
∑      171                   168                   171
The smallest value, 168, is for the mean (X̄ = 111)
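Both properties (deviations summing to zero, and the squared deviations being smallest about the mean) can be checked with a short Python sketch. The helper name `sum_sq_dev` is made up for this illustration.

```python
scores = [109, 103, 121]
mean = sum(scores) / len(scores)  # 111.0

def sum_sq_dev(values, centre):
    """Sum of squared deviations of the values about a chosen centre."""
    return sum((x - centre) ** 2 for x in values)

# The deviations about the mean cancel out.
print(sum(x - mean for x in scores))  # 0.0

# Squared deviations about 110, 111 (the mean) and 112.
for centre in (110, 111, 112):
    print(centre, sum_sq_dev(scores, centre))
# 110 171
# 111 168
# 112 171
```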
Properties of the Mean!
Under most circumstances it is less subject to
sampling variation than the other measures of central
tendency (median or mode)
Remember a sample is a subset of the population. So
we could take many different samples from the same
population and we could calculate the mean, median
and mode for each of these samples.
This would cause the means, modes & medians for
these different samples to vary.
The means will vary (differ) least across the samples
This is main reason the mean is
used in inferential statistics
rather than the median or mode.
The Overall Mean!
Sometimes you need to calculate
the overall mean from a collection of
means of smaller samples
Let’s say we have 3 exams:
Number of students in exam 1 = n1
Mean for exam 1 = X̄1
Mean for 1st exam is 72 and 97 students write that exam.
Mean for 2nd exam is 68 and 90 students write that exam.
Mean for 3rd exam is 65 and 88 students write that exam.
What is the overall mean for all exams?
X̄overall = (n1·X̄1 + n2·X̄2 + n3·X̄3) / (n1 + n2 + n3)
= (97 × 72 + 90 × 68 + 88 × 65) / (97 + 90 + 88)
= 18824 / 275
= 68.45
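The overall mean weights each exam mean by the number of students who wrote that exam. A minimal Python sketch using the three exams above (the list layout is just one convenient way to hold the numbers):

```python
# (number of students, exam mean) for each of the three exams.
exams = [(97, 72), (90, 68), (88, 65)]

total_points = sum(n * m for n, m in exams)  # sum of n_i * mean_i
total_n = sum(n for n, _ in exams)           # total number of scores
overall_mean = total_points / total_n
print(round(overall_mean, 2))  # 68.45
```

Simply averaging 72, 68 and 65 would give 68.33, which ignores the different group sizes.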
The Median!
The scale value below which 50% of the scores fall
If the scores are in a grouped frequency distribution the
median equals the 50th Percentile Point (P50 )
Median is the score that is
slap bang in the middle of
the rank
How to find the Median!
If the scores are raw (ungrouped) then rank order the
scores
If there are an odd number of scores median is the
centermost score.
If there are an even number of scores median is the
average of the two centermost scores.
Odd Number of Scores
Scores: 9,4,2,9,7,5,2,6,3
Rank Ordered: 2,2,3,4,5,6,7,9,9
Median = 5
Even Number of Scores
Scores: 9,4,2,9,7,5,2,6
Ranked: 2,2,4,5,6,7,9,9
Median = 5.5
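The rule for odd and even numbers of scores can be written as a small Python function. This is a sketch for raw (ungrouped) scores only; the function name is illustrative.

```python
def median(scores):
    """Median of raw scores: rank order them, then take the middle."""
    ranked = sorted(scores)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of scores: the centermost score.
        return ranked[mid]
    # Even number of scores: average of the two centermost scores.
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median([9, 4, 2, 9, 7, 5, 2, 6, 3]))  # 5
print(median([9, 4, 2, 9, 7, 5, 2, 6]))     # 5.5
```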
Properties of the Median!
Less sensitive to extreme scores than mean
Sample Data: 101, 109, 103, 105, 121
X̄ = 107.80
Mdn = 105
Change last score to an extreme score: 101, 109, 103, 105, 1001
X̄ = 283.80
Mdn = 105
Extreme scores will be very high or very low scores so they will be at the ends of the rank ordered scores and therefore won’t be included in the calculation of the median.
Holy crap that was a
big one!
I can hardly contain
my indifference
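The contrast between the two measures is easy to reproduce with Python's standard `statistics` module; the list names are illustrative.

```python
from statistics import mean, median

original = [101, 109, 103, 105, 121]
extreme = [101, 109, 103, 105, 1001]  # last score replaced by an extreme one

print(mean(original), median(original))  # 107.8 105
print(mean(extreme), median(extreme))    # 283.8 105
```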
Properties of the Median!
Median is more subject to sampling variability than
the mean but less so than the mode
If we took many different samples from the same
population and calculated the mean, median and mode
for each of these samples, the medians for these
samples would vary.
The medians for the different samples would vary more
than the means but less than the modes.
The Mode!
Mode is the most frequent score in the
distribution
For scores in a grouped frequency
distribution the mode is the midpoint of
the interval with the highest frequency
For raw (ungrouped) scores no calculations are
necessary; just inspect the
data to find the most frequent score.
Central Tendency & Symmetry!
Negative skew: Mean < Median < Mode
Bell-shaped: Mean = Median = Mode
Positive skew: Mean > Median > Mode
Skew!
Negative skew (-ve) is like the bell-shape, but with extra stuff at the end toward the negative end of the scale. Most values are higher
Positive skew (+ve) is like the bell-shape, but with extra stuff at the end toward the positive end of the scale. Most values are lower
Variability!
Variability
Allows us to describe how spread
out or dispersed the scores are
Range, Standard Deviation,
Variance
Measures of Variability: Range!
The difference between the highest and lowest scores
in the distribution
Range = Highest Score – Lowest Score
Scores: 2, 3, 7, 18, 6
Range = 18 – 2
= 16
Very crude measure of variability as
it only considers the two most
extreme scores
Standard deviation!
The standard deviation is a commonly used measure
of the variability in the data.
We will calculate it in a number of steps: calculating the
deviation scores, then the sum of squares, then
putting it all together in a simple equation
Standard Deviation: Deviation scores!
Let’s first consider deviation scores which tell us how
far away a raw score is from the mean
Deviation score is simply the difference between X and
the mean
Calculate deviation scores for the following sample
data: 109, 103, 121
X      X − X̄
109    109 − 111 = −2
103    103 − 111 = −8
121    121 − 111 = 10
Remember!
Standard Deviation: Sum of Squares!
The sum of squares is the sum of the squared
deviation scores
SS = ∑(X − X̄)² (sum of squares, sample scores)
SS = ∑(X − µ)² (sum of squares, population scores)
X      X − X̄    (X − X̄)²
109    −2       (−2)² = 4
103    −8       (−8)² = 64
121    10       10² = 100
SS = ∑(X − X̄)² = 168
The Standard Deviation!
We are interested in some measure of the average deviation about the mean, so we need to divide SS by n − 1 (sample data)
Average squared deviation = SS / (n − 1) = 168 / 2 = 84
We are still in squared units so now we need to take the square root of the average squared deviation
The standard deviation: s = √84 = 9.17
Why n – 1
instead of just n?
This is a trick
used to prevent
the sample
underestimating
the standard
deviation of the
population
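The three steps (deviation scores, sum of squares, square root of the average squared deviation) can be sketched in Python for the sample 109, 103, 121. Variable names are illustrative.

```python
import math

scores = [109, 103, 121]
n = len(scores)
mean = sum(scores) / n                      # 111.0
ss = sum((x - mean) ** 2 for x in scores)   # sum of squares: 168.0
avg_sq_dev = ss / (n - 1)                   # divide by n - 1 for a sample: 84.0
s = math.sqrt(avg_sq_dev)                   # back to the original units
print(round(s, 2))  # 9.17 (the square root of 84 is 9.165...)
```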
The Standard Deviation!
The standard deviation is symbolized as s for sample data
s = √(SS / (n − 1)), where SS = ∑(X − X̄)² is the sum of squares
The standard deviation is symbolized as σ for population data
σ = √(SS / N), where SS = ∑(X − µ)² is the sum of squares
Note: the only difference in the formula for the
sample standard deviation (s) and the population
standard deviation (σ) is the denominator (n − 1 vs. N).
The Deviation Method!
Calculate the standard deviation using the following
sample data: 1, 2, 3, 6, 8
Step 1: Calculate the mean
Step 2: Calculate the deviation scores and the squared
deviation scores
Continued…!
Step 3: Calculate the sum of squares
Step 4: Calculate s
If we had data from the whole population we would divide by N instead of n − 1 and calculate σ
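The worked values for these four steps are not preserved in the transcript, so the following Python sketch simply carries out the deviation method on the sample 1, 2, 3, 6, 8. The variable names are illustrative, and the population version is included only for comparison.

```python
import math

scores = [1, 2, 3, 6, 8]
n = len(scores)

mean = sum(scores) / n                    # Step 1: 4.0
deviations = [x - mean for x in scores]   # Step 2: -3, -2, -1, 2, 4
squared = [d ** 2 for d in deviations]    #         9, 4, 1, 4, 16
ss = sum(squared)                         # Step 3: 34.0
s = math.sqrt(ss / (n - 1))               # Step 4: sample standard deviation
sigma = math.sqrt(ss / n)                 # if these were population data

print(round(s, 2), round(sigma, 2))  # 2.92 2.61
```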
Standard Deviation Properties!
1. It is a measure of dispersion relative to the mean
2. It is sensitive to each score in the distribution
3. If only one score is shifted closer to the mean the
standard deviation will decrease
4. If only one score is shifted further from the mean the
standard deviation will increase
5. It is stable with regard to sampling fluctuations
And that sir, is
why it is so
widely used
It is not the average
deviation as some
people on the internet
might say!
Variance!
Variance = Square of the standard deviation
Another way of expressing
the spread of the data
s² = ∑(X − X̄)² / (n − 1) (sample variance)
σ² = ∑(X − µ)² / N (population variance)
Illustration!
You recently completed a memory test where you were only able to remember 8 of the words the experimenter read to you.
How well did I do relative to everyone else?
You don’t have access to the data or to a grouped frequency distribution so you can’t determine your percentile rank.
So how can you figure out how your memory
compares to others’ memory?
Well what if I gave you the standard
deviation and the mean, would that help
your fragile little mind?
Z-score!
= a transformed score that expresses
how many standard deviations a raw
score is above or below the mean
Positive z-score: raw score is above mean
z = (X − µ) / σ (population data)
z = (X − X̄) / s (sample data)
Negative z-score: raw score is below
mean
Value of the z-score indicates how
many standard deviations the raw
score is from the mean
Essentially, how far is
the score away from
the mean, in units of
the standard
deviation
Transform Raw Score to Z-Score!
You recalled 8 words. The mean of the sample is 7 and
the standard deviation is 1.88. What is your z score?
X = 8
X̄ = 7
s = 1.88
z = (X − X̄) / s = (8 − 7) / 1.88 = 0.53
Your score is 0.53 standard
deviations above the mean
Comparing Apples and Oranges!
You participated in another experiment. This time the
researcher was assessing your verbal abilities. You got
a score of 28. The mean of the sample is 32.33 and the
standard deviation is 9.52. What is your z-score?
X = 28
X̄ = 32.33
s = 9.52
z = (X − X̄) / s = (28 − 32.33) / 9.52 = -0.45
Your verbal ability score is .45 standard
deviations below the mean.
Relative to the rest of the group, is your memory or your verbal ability better?
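Because z-scores put both tests on the same scale, the comparison can be made directly. A minimal Python sketch, with a hypothetical helper `z_score` that is not part of the lecture:

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

memory_z = z_score(8, 7, 1.88)        # memory test
verbal_z = z_score(28, 32.33, 9.52)   # verbal abilities test
print(round(memory_z, 2), round(verbal_z, 2))  # 0.53 -0.45
```

The memory z-score is above the mean and the verbal z-score is below it, so relative to the group the memory performance is the better of the two.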
Clicker Question!
Calculate the z score for a score of 25.
Assume X̄ = 15 and s = 5.
a) 1
b) 2
c) 3
d) 4
e) 5
Z Scores to Raw Scores!
To convert a z-score to a raw score you need to
multiply the z-score by the standard deviation and then
add the mean
X = (z)(s) + X̄ (sample data)
X = (z)(σ) + µ (population data)
Pretty easy to go back and forth between z-score and raw score if you know the mean and standard deviation
This is simply a re-arrangement of the equation z = (X − X̄) / s (or the
equivalent one for population data – not shown)
If you don’t know how to do this, you need to brush up on basic
algebra, re-arranging equations.
Again Khanacademy.com highly recommended
Practice Transforming Z to Raw!
Your friend Edgar does the same 2 experiments and
determines that he has a z score of -1.60 on the memory
test and a z score of 1.02 on the verbal abilities test.
What were his raw scores?
X = (z)(s) + X̄
Memory Test
X = (-1.60)(1.88) + 7
= 3.99
Verbal Test
X = (1.02)(9.52) + 32.33
= 42.04
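Edgar's raw scores can be recovered with the rearranged equation. In this Python sketch the function name `raw_score` is an illustrative choice, not something from the lecture.

```python
def raw_score(z, mean, sd):
    """Convert a z-score back to a raw score: X = z * s + mean."""
    return z * sd + mean

memory_x = raw_score(-1.60, 7, 1.88)
verbal_x = raw_score(1.02, 32.33, 9.52)
print(round(memory_x, 2))  # 3.99
print(round(verbal_x, 2))  # 42.04
```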
Characteristics of Z-scores!
The mean of a distribution of z-scores always equals 0
Z score transformations involve
subtracting the mean from each raw
score. So the overall mean of the z
scores will be 0
Z-score is just a
deviation score in units
of standard deviation
Since we know that
the mean of the
deviations = 0
We know that the
mean of the z-scores
will also be zero
The sum of the
deviations about the
mean always equals 0.
So the average or mean
deviation will also equal
0 (0/n=0).
Characteristics of Z Scores!
The standard deviation of a distribution of z scores
always equals 1
Z scores are deviation scores in the metric of the
standard deviation.
The formula involves dividing the deviation score by
the standard deviation. This puts the deviation score in
the metric of standard deviation units.
So a z-score of 1 means your score is 1 standard
deviation above the mean. A z-score of -1 means your
score is 1 standard deviation below the mean.
So it follows that the standard deviation of a distribution
of z-scores will always be 1 (of course the distribution
will speak in its own language).
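Both characteristics (mean of 0, standard deviation of 1) can be checked numerically. This Python sketch uses an arbitrary small sample, 1, 2, 3, 6, 8, and the sample formula with n − 1; any other data set would behave the same way.

```python
import math

scores = [1, 2, 3, 6, 8]
n = len(scores)
mean = sum(scores) / n
s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

# Transform every raw score to a z-score.
z = [(x - mean) / s for x in scores]

z_mean = sum(z) / n
z_sd = math.sqrt(sum((v - z_mean) ** 2 for v in z) / (n - 1))

# Up to floating-point rounding, the mean is 0 and the standard deviation is 1.
print(round(z_mean, 10), round(z_sd, 10))
```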
Characteristics of Z Scores!
Z-scores have the same shape as the set of raw
scores
Z-score transformations only change the values of the
scores in a simple way. It takes each raw score,
subtracts the mean, and divides by the standard
deviation.
The shape of the distribution scores and the relative
positions of the scores remain intact
In technical
jargon this is a
linear
transformation
Characteristics of Z Scores!
Z-scores have the same shape as the set of raw
scores
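If the distribution of raw scores is negatively skewed the distribution of z scores will
also be negatively skewed (the scaling of the x-axis will change though)
[Figure: negatively skewed distribution of raw scores beside the matching distribution of z-scores]
If the distribution of raw scores is a normal bell shaped curve then so is the
distribution of z-scores
[Figure: bell-shaped distribution of raw scores beside the matching distribution of z-scores]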
Clicker Question!
A z-score of 1.75 means…
a) the raw score is below the mean
b) the raw score is 1.75 units above the mean
c) the raw score is 1.75 standard deviations above the mean
d) the average raw score is 1.75 units from the mean
The Normal Curve!
Many variables in nature fall on a normal curve
Important for many inferential statistics (tests of
significance)
Considered the most prominent distribution in statistics
Many variables are normally distributed: height, weight, IQ
[Figure: normal curve; x-axis: variable, y-axis: frequency]
The Normal Curve!
Normal curve is often used as a first approximation to describe
random variables that tend to cluster around a single mean value.
Commonly used throughout psychology, natural sciences, social
sciences as a simple model for complex phenomena
Its prevalence is explained by the central limit theorem, which shows
that under many conditions the sum of a large number of random
variables is distributed approximately normally.
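A rough simulation can make the central limit theorem concrete. The Python sketch below is an illustration under assumed settings that are not from the lecture (30 uniform random draws per sum, 20,000 sums, a fixed seed): each individual draw is flat rather than bell-shaped, yet about 68% of the sums land within one standard deviation of their mean, as a normal curve predicts.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Each observation is the sum of 30 independent uniform(0, 1) draws.
sums = [sum(random.random() for _ in range(30)) for _ in range(20000)]

m = statistics.mean(sums)
sd = statistics.pstdev(sums)

# Proportion of sums within one standard deviation of the mean.
within_1sd = sum(abs(x - m) <= sd for x in sums) / len(sums)
print(round(within_1sd, 3))  # close to 0.68 for a normal distribution
```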
Non-normality!
Not everything is normally distributed
Quantities that grow exponentially, such as prices, incomes or
populations, are often skewed to the right, and hence may be
better described by other distributions, such as the log-normal
distribution or Pareto distribution.
e.g. reaction times are often not normally distributed
Areas Under the Normal Curve !
Normal curve has a precise equation which describes it.
For all normal distributions there are special relationships
between mean, standard deviation + areas under the
curve
(Pagano p96)
[Figure: normal curve with the x-axis marked at 1, 2 and 3 standard deviations below and above the mean; 68.26% of scores fall within 1 standard deviation of the mean]
Areas Under the z-distribution!
These percentages are just something we know to be true of all normal
distributions – known as the ‘68-95-99.7’ rule, or the empirical rule.
[Figure: normal curve with z scores from -3 to 3 on the x-axis; 68.26% of scores fall between z = -1 and z = 1]
The same relationship holds for normal data that is transformed into z-scores, since z-scores are in units of standard deviations
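These areas can be computed directly rather than read from a figure. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later). The exact value for one standard deviation is 68.27%; the 68.26% quoted here appears to come from doubling the rounded table value 34.13%.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

for k in (1, 2, 3):
    # Area between z = -k and z = +k.
    area = std_normal.cdf(k) - std_normal.cdf(-k)
    print(k, round(100 * area, 2))
# 1 68.27
# 2 95.45
# 3 99.73
```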
IQ scores !
IQ is normally distributed, the average is 100 and
standard deviation is 16
IQ: µ = 100, σ = 16
For any normal distribution this relationship always holds
[Figure: normal curve of IQ scores with z scores from -3 to 3 on the x-axis; 68.26% of scores fall between z = -1 and z = 1]
*note these are population parameters
IQ scores interpretation !
% of people with IQ between 100 and 132?
34.13% of people have an IQ between 100 and 116
13.59% of people have an IQ between 116 and 132
34.13% + 13.59% = 47.72%
[Figure: normal curve with z scores from -3 to 3 on the x-axis]
Clicker Question!
Based on the previous figure what percentage of scores have z-score values of -3 or lower?
a) .13%
b) .26%
c) 2.15%
d) 2.28%
e) 4.56%
[Figure: normal curve with z scores from -3 to 3 on the x-axis]
Clicker Question!
Based on the previous figure what percentage of scores
have z-score values greater than 2?
a) Less than .13%
b) .13%
c) .26%
d) 2.28%
e) 4.56%
2.15% + 0.13% = 2.28%
[Figure: normal curve with z scores from -3 to 3 on the x-axis]
Example!
I have an IQ of 107
Calculate percentile rank from z!
Assume we have population data
What is Ronald’s percentile rank (what percent of population has a lower IQ than him)?
Remember for IQ mean = 100 and standard deviation = 16
Step 1: Calculate his z-score
z = (X − µ) / σ = (107 − 100) / 16 = .44
Step 2: Draw a normal curve and place the z-score on the curve (to aid understanding)
Step 3: Look up percentage below this z-score
Percentile rank can be calculated from the area of the curve
[Figure: normal curve with z scores from -3 to 3 on the x-axis, frequency on the y-axis, and z = .44 marked]
Z-score table concept !
In the same way someone has
calculated the areas under the curve
for the z-scores (-3,-2,-1,1,2,3) in this
graph…
They have calculated them for the full range of z-scores in between and put them in a z-score table, found in Table A of Appendix D of Pagano
Using this table allows you to calculate the percentage of scores with a z-score above or below any z-score of interest.
Z-score table – Appendix D !
Column A lists all of the various possible z scores
Note it only lists positive scores, because the normal curve is symmetrical so the corresponding values for –ve scores are identical.
Column B lists the proportion of scores that fall between the z score (listed in column A) and the mean
Column C lists the proportion of scores that fall between the z-score (listed in column A) and the closest tail of the distribution
For positive z scores it gives the proportion of scores that are higher than the z score. For negative z scores it is the proportion lower.
Clicker Question!
You look up a z score of -.50 in Table A. Column
C shows that .3085 corresponds to that z score.
This means that:
a) 30.85% of the scores in the distribution are
lower than the z score
b) 30.85% of the scores in the distribution are
higher than the z score
c) 30.85% of the scores lie between the mean
and the z score
d) 80.85% of the scores in the distribution are
higher than the z score
e) 80.85% of the scores in the distribution are
lower than the z score
Back to Ronaldʼs percentile rank!
Step 3: Find the corresponding area under the curve by referring to the z-score table
Column B method:
Find z = .44 in column A.
Column B shows proportion of scores that fall between the mean and the z-score = 0.1700, and .1700 x 100 = 17%
Total area below this z-score then is 50% (always 50% below mean) + 17% = 67%
Percentile rank for IQ of 107 = 67%
67% of scores fall below Ronald’s IQ score
[Figure: normal curve with z = .44 marked; 17% of the area lies between the mean and z = .44]
Back to Ronaldʼs percentile rank!
Step 3: Find the corresponding area under the curve by referring to the z-score table
Column C method:
Find z = .44 in column A
Column C shows that the proportion of scores that fall above our z score (since our score is +) is 0.3300, and .3300 x 100 = 33%
If 33% fall above then 67% fall below
Percentile rank for score of 107 = 67%
[Figure: normal curve with z = .44 marked; 33% of the area lies above it]
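The table lookup can be cross-checked in software. This sketch uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) in place of Table A; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 100, 16                # IQ population parameters
z = round((107 - mu) / sigma, 2)   # 0.44, rounded as in the table lookup

# cdf(z) is the proportion of the curve below z (column B + 50%).
below = NormalDist().cdf(z)
above = 1 - below                  # the column C value

print(round(100 * below), round(100 * above))  # 67 33
```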
Further Illustration!
Determine what percentage of people received a score between the score you received (z = .53) and the score your friend received (z = -1.60).
There are no negative z-scores in the table, so you have to look up 1.60 positive, which gives the same answer
Column B method
Z-score of -1.60: column B = 0.4452, so 44.52% of people fall between this score and the mean
Z-score of .53: column B = 0.2019, so 20.19% of people fall between this score and the mean
Therefore 64.71% (44.52% + 20.19%) of the scores fall between your score and your friend’s score
[Figure: normal curve with z = -1.60 and z = .53 marked; 44.52% of the area lies between -1.60 and the mean and 20.19% between the mean and .53]
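The area between two z-scores is the difference between their cumulative areas. A sketch with `NormalDist` from Python's standard `statistics` module, standing in for the column B lookups:

```python
from statistics import NormalDist

std_normal = NormalDist()

# Area below z = .53 minus area below z = -1.60.
between = std_normal.cdf(0.53) - std_normal.cdf(-1.60)
print(round(100 * between, 1))  # about 64.7, matching 44.52% + 20.19%
```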
Percentile Points!
If the memory test was given to an entire population and µ = 7 and σ = 1.88
What is the score below which 75% of the scores fall? What is P75?
Step 1: Using Table A locate the area in Column C closest to .2500 (25%) and find its z-score
Area closest = 0.2514
z value = 0.67
Step 2: Transform z = .67 to a raw score.
X = (z)(σ) + µ
X = (.67)(1.88) + 7
X = 8.26
So 75% of the population received a memory test score lower than 8.26
[Figure: normal curve with the percentile point for 75% marked at z = .67; 25% of the area lies above it]
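Going from a percentage to a score is the inverse of the table lookup. In this sketch `NormalDist.inv_cdf` from Python's standard `statistics` module plays the role of Table A. It works with an unrounded z of about 0.6745, so it returns 8.27 rather than the 8.26 obtained from the table's z = .67.

```python
from statistics import NormalDist

memory = NormalDist(mu=7, sigma=1.88)  # population parameters for the memory test

# Score below which 75% of the population falls.
p75 = memory.inv_cdf(0.75)
print(round(p75, 2))  # 8.27
```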
Further Illustration!
If the memory test was given to an entire population and µ = 7 and σ = 1.88
What are the scores that bound (that define the boundary of) the middle 90% of the distribution?
[Figure: normal curve with the middle 90% marked and 5% in each tail]
Further Illustration!
Step 1:
Using Table A locate the area in column C closest to .0500 (5%), because you want to know the score for which 5% of scores are higher and the score for which 5% of scores are lower, as these scores will bound the middle 90%.
(100 − 90) / 2 = 5%
Find the corresponding z-score
z = 1.65
The other z-score will be -1.65 because both boundaries are the same distance from the mean
Step 2:
Transform z = 1.65 and z = -1.65 to raw scores via X = (z)(σ) + µ
z = 1.65
x = (1.65)(1.88) + 7
x = 10.10
z = -1.65
x = (-1.65)(1.88) + 7
x = 3.90
[Figure: normal curve with z = -1.65 (x = 3.90) and z = 1.65 (x = 10.10) marked; 90% of the area lies between them and 5% in each tail]
The scores 3.90 and 10.10 bound
the middle 90% of the distribution
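The same bounds can be found without the table. The sketch below again uses `NormalDist.inv_cdf` from Python's standard `statistics` module. Because it uses an unrounded z of about 1.645 instead of the table's 1.65, its bounds differ from 3.90 and 10.10 in the second decimal place.

```python
from statistics import NormalDist

memory = NormalDist(mu=7, sigma=1.88)

# The middle 90% leaves (100 - 90) / 2 = 5% in each tail.
lower = memory.inv_cdf(0.05)
upper = memory.inv_cdf(0.95)
print(round(lower, 2), round(upper, 2))  # 3.91 10.09
```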