Download class notes - rivier.instructure.com.

CLASS NOTES: Measures of Central Tendency; Variability
CONCEPT
CALCULATION/EXAMPLES
APPLICATION
Remember your symbols representing a population versus a sample. µ is the mean symbol & N is the number of scores / subjects for a population. M is the mean symbol & n is the number of scores / subjects for a sample.
Central tendency: A statistical
measure to determine a single score
that defines the center of a
distribution. The most common
method for summarizing and
describing a distribution is to find a
single value that defines the
average score & can serve as a
representative of the entire
distribution.
Mean: The sum of the scores
divided by the number of scores




* mean characteristics:
- changing the value of any score changes the mean
- adding or removing a score changes the mean, unless the score added or removed is equal to the mean
- if you multiply or divide each score by a constant, the mean is multiplied or divided by the same constant

For a population:
µ = ΣX / N
For a sample:
M = ΣX / n
Example data set: 3, 7, 4, 6 (N = 4)
µ = ΣX / N = 20 / 4 = 5
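The mean formula and the mean characteristics listed earlier can be checked with a short Python sketch. The function name `mean` and the variable names are illustrative assumptions, not part of the class notes.

```python
def mean(scores):
    """Return the sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)

scores = [3, 7, 4, 6]            # data set from the notes, N = 4
mu = mean(scores)                # 20 / 4

# Adding a score equal to the mean leaves the mean unchanged.
with_extra = mean(scores + [5])

# Multiplying every score by a constant multiplies the mean by that constant.
doubled = mean([2 * x for x in scores])

print(mu, with_extra, doubled)   # 5.0 5.0 10.0
```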
Simple Weighted mean:
Combining 2 or more sets of scores
& then finding the overall mean for
the combined group. There is a
more complex version of obtaining
the correct mean for 2 data sets that
have unequal ‘n’ values. We will
review this formula later on.
X-values that represent a
population are often represented as
a capital X. x-values that are a part
of the sample standard deviation
formula often are represented by a
line over a lower case x.
Out of a population number of 4,
the data set includes 3, 7, 4 & 6.
The sum of these is 20 divided by
N (4). The population mean is 5.
M = ΣX (overall sum for the combined group) / n (total number in the combined group)
  = (ΣX1 + ΣX2) / (n1 + n2)
Basically, you are calculating the mean of two or more groups of data to find the overall mean for the combined groups for those data sets. A more complex formula is used for when the data sets are unequal.
Example:
Group 1 n = 6
Group 2 n = 6
Scores represent minutes of
intervention time
In this example, Group #1 has 6
clients with a sum total of 26
minutes. Group #2 has 6 clients
with a sum total of 14 minutes.
Using the calculations, the
weighted mean is 3.33.
Group 1    Group 2
6          4
3          1
5          2
3          3
4          1
5          3
∑ = 26     ∑ = 14
Remember that to calculate the
weighted mean, you are drawing
from 2 or more sets of scores.
These sets of scores may or may
not have the same “n”.
(26 + 14) / (6 + 6) = 40 / 12 = 3.33
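As a minimal sketch of the combined (weighted) mean, the calculation above pools the sums and the counts of both groups. The helper name `combined_mean` is a hypothetical choice for illustration.

```python
def combined_mean(*groups):
    """Overall mean of several groups: total of all sums / total of all n."""
    total_sum = sum(sum(group) for group in groups)
    total_n = sum(len(group) for group in groups)
    return total_sum / total_n

group_1 = [6, 3, 5, 3, 4, 5]   # minutes of intervention time, sum = 26
group_2 = [4, 1, 2, 3, 1, 3]   # minutes of intervention time, sum = 14

overall = combined_mean(group_1, group_2)   # (26 + 14) / (6 + 6)
print(round(overall, 2))                    # 3.33
```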
Median: The score that divides a distribution exactly in half. Exactly 50% of the individuals in a distribution have scores at or below the median. The median is equivalent to the 50th percentile.
The goal of the median is to determine the precise midpoint of a distribution.
When N is an odd number:
Data set: 3, 5, 8, 10, 11
List scores in order from lowest to highest & the middle score is the median. In this case, the median is 8.
When N is an even number:
Data set: 3, 3, 4, 5, 7, 8
List scores in order from lowest to highest. In this case, 2 numbers are in the middle: 4 & 5. Locate the mid-point between the 2 middle scores. In this case, the mid-point is 4.5.
When there are several scores with the same value in the middle of the distribution:
List scores in order from lowest to highest. In this case, the median is 4.
Data set: 1, 2, 2, 3, 4, 4, 4, 4, 4, 5
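The three median cases can be sketched in Python as follows. This is a simple "middle score or midpoint of the two middle scores" version; the function name `median` is an illustrative assumption, and it does not attempt any interpolation within tied scores.

```python
def median(scores):
    """Middle score of the ordered list, or midpoint of the two middle scores."""
    ordered = sorted(scores)          # list scores from lowest to highest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd N: a single middle score
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even N: midpoint

odd_case = median([3, 5, 8, 10, 11])
even_case = median([3, 3, 4, 5, 7, 8])
tied_case = median([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])

print(odd_case, even_case, tied_case)   # 8 4.5 4.0
```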
Mode: The score or category that
has the greatest frequency. The
mode can be used to determine the
typical or average value for any
scale of measurement & also is the
only measure of central tendency
that can be used w/ data from a
nominal scale of measurement.
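A minimal sketch of finding the mode from a frequency table is shown here. The score frequencies and the color frequencies are taken from the example tables in these notes; the function name `modes` and the small bimodal table are hypothetical and included only for illustration.

```python
def modes(frequency_table):
    """Return every category that shares the highest frequency."""
    highest = max(frequency_table.values())
    return [value for value, f in frequency_table.items() if f == highest]

# Score -> frequency (f), as in the example frequency table
score_frequencies = {7: 1, 6: 0, 5: 3, 4: 2, 3: 3, 2: 5, 1: 4, 0: 2}
score_mode = modes(score_frequencies)

# The mode also works for a nominal scale such as color
color_frequencies = {"Blue": 9, "Green": 6, "Yellow": 2, "Orange": 3, "Purple": 1}
color_mode = modes(color_frequencies)

# Hypothetical bimodal table: two scores share the highest frequency
bimodal = modes({1: 3, 2: 1, 3: 3})

print(score_mode, color_mode, bimodal)   # [2] ['Blue'] [1, 3]
```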
Example frequency distribution:
Score    f
7        1
6        0
5        3
4        2
3        3
2        5
1        4
0        2
Using this data, the score with the highest frequency is 2 (with a frequency of 5). Keep in mind that it is possible to have more than one mode (scores that share the same highest frequency). A distribution with two modes is called bimodal.
Selecting a Measure of Central Tendency:
The best scenario is where you have enough data so that you can calculate all measures of central tendency, but that may not always be the case. The mean is usually the preferred measure of central tendency, but there may be times when the mean is not or cannot be calculated & used as the best representative of the data.
When to use the median:
- When a data set contains extreme scores or skewed distributions
Data set: 5, 6, 4, 4, 3, 28, 6, 1, 33
With extreme scores such as 28 & 33 in the set, the median (which is 5) may be a better representation.
- When all the data is not available, such as a missing value(s), or you are presented with an open-ended distribution
Person    Time (min)
1         8
2         11
3         12
4         13
5         17
6         Never finished
The mean cannot be accurately calculated with missing information, so the median would be the better representation (the median time in this case would be 12.5, with 3 scores below the median & 3 scores (including the undetermined score) above the median).
- When there is no upper or lower limit listed for one of the categories
Person       f
5 or more    3
4            2
3            2
2            3
1            6
0            4
Again, the full information is not available to compute the mean.
When to use the Mode:
- With nominal scales
Color     f
Blue      9
Green     6
Yellow    2
Orange    3
Purple    1
The mean or median cannot be calculated with nominal scales, only the mode. In this case, the mode is the color “blue” b/c it is the most frequent color chosen.
- When discrete variables are used
Examples: numbers of children, room numbers, etc.
Variables that exist only in whole, indivisible categories that cannot be split or fractioned are best represented by the mode.
- Describing shape
The mode describes the peak of a distribution, so it can be helpful in the description of shape.
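The effect of extreme scores and of an undetermined score on the mean and median can be illustrated with Python's standard `statistics` module. This is only a sketch: representing "never finished" as positive infinity is an assumption made here for illustration, not something the notes prescribe.

```python
import math
import statistics

# Data set with extreme scores (28 and 33)
scores = [5, 6, 4, 4, 3, 28, 6, 1, 33]
mean_score = statistics.mean(scores)       # pulled upward by the extreme scores
median_score = statistics.median(scores)   # middle of the ordered scores

# Assumption: the person who never finished is modeled as an infinite time,
# so the score still counts as one value above the median.
times = [8, 11, 12, 13, 17, math.inf]
median_time = statistics.median(times)     # midpoint of 12 and 13

print(mean_score, median_score, median_time)
```

Here the mean of the first data set works out to 10, twice the median of 5, which is why the median may represent the bulk of the scores better.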
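Central Tendency & the Shape of a Distribution:
[Figure: distribution with one marked point] A = mean, median & mode
[Figure: distribution with three marked points] A = mode, B = median, C = mean
[Figure: distribution with three marked points] A = mean, B = median, C = mode
[Figure: bimodal distribution] The mean & median would be in the center or the “valley” of the distribution, while the modes (bimodal) are represented by the tops of both “hills” of the distribution.
[Figure: distribution with no mode] No mode. This distribution has a mean & median (the center), but no one score has a greater frequency.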
Variability
Variability: Provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.
A good measure of variability serves 2 purposes:
- It describes the distribution. It tells whether the scores are clustered close together or spread out over a large distance. Variability is usually defined in terms of distance: how much distance to expect between one score & another, or b/t an individual score & the mean.
- It measures how well an individual score represents the entire distribution. This is very important in inferential statistics, where small samples are used to answer questions about a population. Variability provides information about how much error to expect if you are using a sample to represent a population.

- Standard deviation is primarily a descriptive measure. It describes how variable or spread out the scores are in a distribution.
- It allows us to interpret individual scores: where one particular score may lie in relation to the mean & in relation to the average distance of all scores from the mean.
- The mean & the SD are the most common values used to describe a set of data.
- Adding a constant to each score will not change the SD.
- Multiplying each score by a constant causes the SD to be multiplied by the same constant.
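The last two SD properties can be checked numerically. The sketch below uses `statistics.pstdev` (the population standard deviation) from Python's standard library; the data set and the constants 10 and 3 are arbitrary choices for illustration.

```python
import math
import statistics

scores = [1, 3, 6, 11]
sd = statistics.pstdev(scores)

# Adding a constant shifts every score but not the spread.
sd_shifted = statistics.pstdev([x + 10 for x in scores])

# Multiplying by a constant stretches the spread by the same constant.
sd_scaled = statistics.pstdev([x * 3 for x in scores])

print(math.isclose(sd_shifted, sd))       # True
print(math.isclose(sd_scaled, 3 * sd))    # True
```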
Range: The difference b/t the
upper real limit of the largest
(maximum) X value & the lower
real limit of the smallest
(minimum) X value
The range lets you know how
spread out your scores are.
Although the range gives you some
general information about your
data, it does not give an accurate
description of the variability for the
entire distribution b/c it does not
consider all the scores. Therefore,
the range is considered to be a
crude & unreliable measure of
variability.
SD is particularly important to
inferential statistics. The goal of
inferential statistics is to detect
meaningful & significant patterns
in research results.
Variable - Weight:    170            180            190
Real Limits:          169.5–170.5    179.5–180.5    189.5–190.5
Range = URL Xmax – LRL Xmin
Data set: 3, 7, 12, 8, 5, 10
Range = 12.5 – 2.5 = 10
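A minimal sketch of the range with real limits follows. The parameter `unit` (the measurement unit, here whole numbers) and the function name are assumptions introduced for illustration; each real limit sits half a unit beyond the score.

```python
def range_with_real_limits(scores, unit=1):
    """Upper real limit of the largest score minus lower real limit of the smallest."""
    upper_real_limit = max(scores) + unit / 2   # URL of Xmax
    lower_real_limit = min(scores) - unit / 2   # LRL of Xmin
    return upper_real_limit - lower_real_limit

data_range = range_with_real_limits([3, 7, 12, 8, 5, 10])
print(data_range)   # 12.5 - 2.5 = 10.0
```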
Interquartile Range: The range covered by the middle 50% of the distribution.
Interquartile range = Q3 – Q1
Since the interquartile range is covered by the middle 50% of the distribution, it occupies the space b/t the score that falls at the 25th percentile (Q1) & the score that falls at the 75th percentile (Q3). You can draw a frequency histogram to find these scores or form a frequency distribution table to locate the 25th & 75th percentiles to see what scores fall between these 2 boundaries. The interquartile range may be used as a measure of variability b/c it shows where the middle 50% of the scores lie & ignores those scores that may be outliers. Its limitation, of course, is that not all scores are represented in its calculation.
Semi-Interquartile Range: Half of the interquartile range
Semi-interquartile range = (Q3 – Q1) / 2
The semi-interquartile range is half of the interquartile range. So you would follow the instructions as listed above & then divide by 2. The limitation here is again that not all scores are represented.
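The interquartile and semi-interquartile range can be sketched in Python. Quartile conventions vary; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered scores, which may give slightly different values than reading the 25th and 75th percentiles off a frequency table or histogram. The function name `quartiles` is hypothetical, and the sketch expects at least 2 scores.

```python
import statistics

def quartiles(scores):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(scores)
    half = len(ordered) // 2          # the middle score is left out when N is odd
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return statistics.median(lower_half), statistics.median(upper_half)

q1, q3 = quartiles([3, 7, 12, 8, 5, 10])   # halves: [3, 5, 7] and [8, 10, 12]
iqr = q3 - q1                               # interquartile range
semi_iqr = iqr / 2                          # semi-interquartile range

print(q1, q3, iqr, semi_iqr)   # 5 10 5 2.5
```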
Deviation: The distance of one particular score from the mean.
Deviation score = X – µ
Deviation here simply explains how one particular value deviates from the mean, or the center of the distribution.
The standard deviation is the most commonly used & most important measure of variability. The SD measures the variability by considering the distance b/t each score & the mean & determines whether the scores are generally near or far from the mean. Basically, it approximates the average distance of all scores from the mean.
2 steps to finding the deviation:
- Determine the deviation or distance from the mean for each individual score
Example: X = 53; µ = 50
X – µ = 53 – 50 = 3
- Calculate the mean of the deviation scores
Σ(X – µ)
Example (µ = 3):
X    X – µ
8    +5
1    -2
3    0
0    -3
Σ(X – µ) = 0
Remember your notation. X = a particular score while µ is the notation for the population mean. If the mean (µ) of a set of data is 50 & a particular score (X) is 53, then the deviation of the score of 53 from the mean of 50 is 3.
Remember your notation of Σ indicating the “sum.” Since the mean is the average of a set of scores, each score lies either above or below the mean (or exactly at it). If the score lies above the mean, this direction is identified with a “+” sign in front of the deviation score. If the score lies below the mean, it is indicated with a “-” in front of the deviation score. To find the average deviation score for a data set, you add together all the deviation scores & divide by N. The sum, & therefore the average, should always be “0”.
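The claim that the deviation scores always sum to zero can be checked with a short sketch using the data set 8, 1, 3, 0 from the deviation table. The variable names are illustrative assumptions.

```python
scores = [8, 1, 3, 0]
mu = sum(scores) / len(scores)              # population mean: 12 / 4

# Deviation score for each X: X - mu (sign shows above/below the mean)
deviations = [x - mu for x in scores]

print(deviations)        # [5.0, -2.0, 0.0, -3.0]
print(sum(deviations))   # 0.0
```

Because the positive and negative deviations cancel, the plain average deviation is not a useful measure of spread, which is why the deviations are squared in the sum of squares.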
Sum of squares: The sum of the squared deviation scores. This represents the numerator part of the standard deviation formula. The sum of squares is obtained thru these steps:
1) Find each deviation score
2) Square each deviation score
3) Sum the squared deviations
Sum of squares = Σ(X – µ)²
The superscript “2” (as in X²) means that you multiply the value by itself. Ex: 5² = 5 x 5 = 25. This formula may be used in future formulas to establish “deviation.” This is also called the sum of squares.
standard deviation formula for a population:
σ = √( Σ(X – µ)² / N )
The statistical notation indicating standard deviation for a population is the symbol σ.
standard deviation formula for a sample:
s = √( Σ(X – M)² / (n – 1) )
(These formulas above were designed to obtain the standard deviation, or the average distance of all values from the mean, or center of the distribution. However, see Population Variance below to help direct you to a more usable, operational formula for obtaining the standard deviation.)
Population Variance: The mean squared deviation. Variance is the mean (or average) of the squared deviation scores. The measure of variability is based upon squared distances. This helps with inferential statistical methods, but may not be the best descriptive measure for variability.
To correct for the issue mentioned above, the calculation is then “square-rooted”:
standard deviation = √variance
The statistical notation indicating standard deviation for a sample is s. The statistical notation for a sample mean is M.
The variability of a population is usually greater than the variability of a sample. That is why the sample formula uses “n – 1”, called the degrees of freedom.
Standard Deviation: The square root of the variance (the average squared deviation). This tells you the average distance of all the scores in a data set from the mean by combining deviation, sum of squares & variance.
computational formula for standard deviation for a population:
σ = √( [ ΣX² – (ΣX)² / N ] / N )
The more workable formula for obtaining the standard deviation involves modification of the individual steps to design the computational formula.
computational formula for standard deviation for a sample:
s = √( [ Σx² – (Σx)² / n ] / (n – 1) )
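As a rough check that the computational formula gives the same sum of squares as the definitional one, the sketch below computes both for the data set 8, 1, 3, 0. The function names are hypothetical and chosen only for this illustration.

```python
import math

def ss_definitional(scores):
    """Sum of squared deviations from the mean."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores)

def ss_computational(scores):
    """Sum of X squared minus (sum of X) squared over N."""
    n = len(scores)
    return sum(x * x for x in scores) - sum(scores) ** 2 / n

scores = [8, 1, 3, 0]
ss_a = ss_definitional(scores)    # 25 + 4 + 0 + 9
ss_b = ss_computational(scores)   # 74 - 144 / 4

sigma = math.sqrt(ss_b / len(scores))     # population: divide by N
s = math.sqrt(ss_b / (len(scores) - 1))   # sample: divide by n - 1

print(ss_a, ss_b)                         # 38.0 38.0
print(round(sigma, 4), round(s, 4))
```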
Example:
Data set: 1, 3, 6, 11
X            X²
1            1
3            9
6            36
11           121
ΣX = 21      ΣX² = 167
(ΣX)² = 441
Step 1: Place each one of your “X” values, or each number in the data set, under the “X” column.
Step 2: The second column is where you “square” each individual “X” value (X²). Here is where you multiply each X value by itself (ex: X² = 3² = 3 x 3 = 9).
Step 3: Sum up all of your X values at the bottom of your X column. Also, sum up all of your squared values (X²) at the bottom under the X² column, represented as ΣX².
Step 4: At the bottom of your X column beneath where you summed your X values, multiply your ΣX (sum of X) by itself, which is represented as (ΣX)² (21² = 21 x 21 = 441).
Remember that Σ is the symbol for “sum” or adding values together. It sometimes helps w/ placement to draw “cross arrows” from beneath your table to the formula so that you know you are placing the right values in the correct places on the formula.
In “Step 3,” there are no parentheses, indicating that you are adding up values that are already squared: ΣX²
In “Step 4,” the ΣX is in parentheses (ΣX), indicating that you must sum the X values first before squaring that value: (ΣX)². This follows “order of operation” rules, where you compute values w/in parentheses first before computing outside parentheses.
Step 5: Identify your “N” value (the number of scores in your data set) (“N” for population data, “n” for sample data).
n = 4
(ΣX)² = 441
ΣX² = 167
At this point, you have all of the values you need. Match up the values that correspond to the symbols in the formula.
s = √( [ 167 – 441 / 4 ] / (4 – 1) )
Step 6: Working w/ the numerator first: Order of operation indicates that we divide before subtracting. So we divide (ΣX)²/n first (441 / 4 = 110.25).
s = √( [ 167 – 110.25 ] / (4 – 1) )
Step 7: Subtract: ΣX² – 110.25 (167 – 110.25 = 56.75)
s = √( 56.75 / (4 – 1) )
Step 8: Since we are using a “sample” for this example, our denominator is “n – 1” to control for variations b/t population & sample groups. Had we been working w/ population data, we would only have “N” as our denominator. For this example, we compute n – 1 (4 – 1 = 3).
s = √( 56.75 / 3 )
Step 9: Now we have 2 values left: one in the numerator & one in the denominator. So we divide: (56.75 / 3 = 18.916666)
s = √18.916666
Step 10: DO NOT FORGET TO SQUARE ROOT YOUR FINAL VALUE. You do this by hitting the square root button on your calculator.
s ≈ 4.3493
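The ten steps above can be sketched in Python, one variable per intermediate value. The variable names are illustrative assumptions; `statistics.stdev` (the sample standard deviation in Python's standard library) is used only as a cross-check.

```python
import math
import statistics

scores = [1, 3, 6, 11]                        # Step 1: the X column
squares = [x * x for x in scores]             # Step 2: the X² column
sum_x = sum(scores)                           # Step 3: ΣX = 21
sum_x_squared = sum(squares)                  # Step 3: ΣX² = 167
square_of_sum = sum_x ** 2                    # Step 4: (ΣX)² = 441
n = len(scores)                               # Step 5: n = 4

numerator = sum_x_squared - square_of_sum / n # Steps 6-7: 167 - 110.25
df = n - 1                                    # Step 8: 4 - 1 = 3
variance = numerator / df                     # Step 9: 56.75 / 3
s = math.sqrt(variance)                       # Step 10: take the square root

print(numerator, df, round(s, 4))             # 56.75 3 4.3493
print(math.isclose(s, statistics.stdev(scores)))   # True
```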
When working a formula that
has a numerator & a
denominator, complete all
calculations w/in the numerator
& all calculations w/in the
denominator separately until you
come up w/ one value on top &
one value on the bottom. Then
complete the calculations.
One of the most frequent errors
made when computing the standard
deviation formula is forgetting to
square root your final value. When
writing or typing out your formula,
you may draw the square root over
the formula, insert the “square
root” symbol from Word, or write
“sq. rt.” in front of the formula so
that you do not forget to complete
this final step.
It is always recommended that students write out both population & sample standard deviation formulas on sticky notes & lay them next to every problem they are working, so that they can double check that each value is in the right spot on the formula & that no step is missed!
Remember that the more decimal places you keep when rounding, the more accurate your outcomes will be.
Degrees of freedom: The df for the sample variance are defined as df = n – 1. The df determine the number of scores in the sample that are independent and free to vary. This is why “n – 1” is used in the sample formula for standard deviation, as it corrects for bias in sample variability (since sample variability typically is smaller than population variability). So, df is extremely important for inferential statistics.
df = n – 1
See the example above of working w/ a sample formula, which shows the df, or n – 1, in the denominator of the formula underneath the square root.
Degrees of freedom is not a calculated error, but instead controls for differences b/t sample outcomes & population outcomes, as there is always some variance b/t the two.
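To see what the n – 1 denominator does, the sketch below computes the standard deviation of the same scores twice: once dividing by N (population formula) and once by df = n – 1 (sample formula). The function name and the `sample` flag are hypothetical, introduced only for this illustration.

```python
import math

def standard_deviation(scores, sample=True):
    """Square root of SS / (n - 1) for a sample, or SS / N for a population."""
    n = len(scores)
    m = sum(scores) / n
    ss = sum((x - m) ** 2 for x in scores)   # sum of squares
    denominator = n - 1 if sample else n     # df = n - 1 for a sample
    return math.sqrt(ss / denominator)

scores = [1, 3, 6, 11]
sd_population = standard_deviation(scores, sample=False)
sd_sample = standard_deviation(scores, sample=True)

print(round(sd_population, 4), round(sd_sample, 4))   # 3.7666 4.3493
```

Dividing by the smaller value n – 1 makes the sample result slightly larger, which compensates for samples tending to be less variable than the populations they come from.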