Lecture Notes-RM Chapters 9 & 5: Measures of Central Tendency

Research Methods
Winter 2008
Chapter 9 – Measures of Central Tendency and
Dispersion
Instructor: Dr. Harry Webster
Chapter 9
Describing Distributions with Numbers
A) Finding the median involves ordering the data and
locating positions.
B) The mean, mode, and standard deviation do not
involve position.
Measures of central tendency are: median, mode, and
mean.
Measures of spread are: Variance and standard
deviation.
A) Median, quartiles, five number summary and
boxplots.
Median: the midpoint of an arranged (ordered from
smallest to largest) distribution of data. The 50th
percentile.
(Percentile: ranking out of 100)
Calculating the median:
1. Arrange scores from smallest to largest.
2. Use formula: (n + 1)/2 to find the location of the
median.
3a. If you have an odd number of scores, the formula
will lead you to the median score.
Ex. 1:
2 3 4 5 6 7 8 9 10
Formula: (n + 1)/2 = (9 + 1)/2 = 5 (location of median)
Count 5 scores and we get 6. 6 is the median score.
Ex. 2:
2 3 4 5 5 5 6 7 8 8 8
Formula: (n + 1)/2 = (11 + 1)/2 = 6 (location of median).
Count 6 scores and we get 5. As there are many 5’s
we must indicate (underline) which 5 is the median
score.
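The location rule used in these two examples can be sketched in Python. This is only an illustration and not part of the original notes: the function name `median_by_location` is hypothetical, and the even-count branch (averaging the two middle scores) is the usual convention, which the notes do not spell out.

```python
def median_by_location(scores):
    """Return (location, median) using the (n + 1)/2 location rule."""
    ordered = sorted(scores)      # Step 1: arrange from smallest to largest
    n = len(ordered)
    location = (n + 1) / 2        # Step 2: location of the median
    if n % 2 == 1:
        # Odd n: the location is a whole number; count that many scores in.
        median = ordered[int(location) - 1]
    else:
        # Even n (assumed convention): average the two middle scores.
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return location, median


ex1 = median_by_location([2, 3, 4, 5, 6, 7, 8, 9, 10])
ex2 = median_by_location([2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8])
print(ex1)  # (5.0, 6)
print(ex2)  # (6.0, 5)
```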
B) Mean and Standard Deviation
Mean: An average of scores.
Pronounced ‘x-bar’; symbol = $\bar{x}$
Sum of scores divided by number of cases.
Ex., 1 2 3 4: mean = 10/4 = 2.5
Sensitive to outliers.
Ex., 1 2 3 40: mean = 46/4 = 11.5
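A short Python check of these two examples, using the standard library `statistics` module. The variable names are illustrative only. The median is printed alongside the mean to show that the single outlier (40) moves the mean but leaves the median where it was.

```python
from statistics import mean, median

without_outlier = [1, 2, 3, 4]
with_outlier = [1, 2, 3, 40]

for scores in (without_outlier, with_outlier):
    # The mean is the sum of scores divided by the number of cases.
    print(scores, "mean =", mean(scores), "median =", median(scores))
```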
Standard Deviation
Most frequently used expression of
spread/variability
Is a measure of the average spread of
scores from the mean.
Small standard deviations involve a set of
scores that are close to the mean.
Large standard deviations involve a set of
scores that are further away from the
mean.
Is influenced by outliers (the mean is
used to calculate the standard deviation).
The Standard Deviation Formula
$$\text{St. Dev.} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
$\sum$ = sum of everything in parentheses (after squaring)
$x$ = each score
$\bar{x}$ = mean
$n$ = number of scores
To calculate the standard deviation (S.D. or St. Dev.)
Step 1. Find the mean.
Step 2. Find the distance of each score from the
mean.
Step 3. Square each result to get rid of negatives.
Step 4. Add up the squared deviations (from the
mean).
Step 5. Divide by n-1. This gives the variance.
Step 6. Find the square root. This gives the St. Dev.
Example: Data set: 1 2 4 6 7
1. Find the mean: 20/5 = 4
2. Deviation        3. Squared Deviation
1 – 4 = -3          (-3) x (-3) = 9
2 – 4 = -2          (-2) x (-2) = 4
4 – 4 = 0           0 x 0 = 0
6 – 4 = 2           2 x 2 = 4
7 – 4 = 3           3 x 3 = 9
4. Total            26
5. Divide the sum of the squared deviations by n-1
26/(5-1) = 6.5 This is the variance.
6. Square root the variance
Square root of 6.5 = 2.55
This is the standard deviation.
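The six steps can be followed line by line in a small Python sketch. The function name `sample_standard_deviation` is hypothetical; it divides by n-1, as in the notes.

```python
import math


def sample_standard_deviation(scores):
    """Return (variance, standard deviation) using the n - 1 divisor."""
    n = len(scores)
    mean = sum(scores) / n                     # Step 1: find the mean
    deviations = [x - mean for x in scores]    # Step 2: distance from the mean
    squared = [d ** 2 for d in deviations]     # Step 3: square each deviation
    total = sum(squared)                       # Step 4: add the squared deviations
    variance = total / (n - 1)                 # Step 5: divide by n - 1
    return variance, math.sqrt(variance)       # Step 6: take the square root


variance, st_dev = sample_standard_deviation([1, 2, 4, 6, 7])
print(variance)          # 6.5
print(round(st_dev, 2))  # 2.55
```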
Use medians when there are outliers. Ex. income.
Use means and standard deviations when the
distribution appears symmetrical. Ex. Test grades,
performance on athletic variables that are measured in
time.
Use the mode with Nominal, Ordinal, Interval, and
Ratio levels of measurement.
The mode is the only measure of central tendency that
can be used with Nominal data such as gender of
respondents, preferred type of music, marital status,
etc.
6. We have used 2 sets of data (7 2 2 1 3 4 5 6 and 7
2 2 1 3 4 50 6) to determine five number summaries
and standard deviations.
Using the numbers, show the effects of outliers.
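One way to put the numbers side by side is sketched below. The helper `five_number_summary` is hypothetical, and it rests on an assumption: each quartile is taken as the median of the lower or upper half of the ordered scores. Other quartile conventions give slightly different values.

```python
from statistics import mean, median, stdev


def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the overall median when the number of scores is odd.
    """
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


set_a = [7, 2, 2, 1, 3, 4, 5, 6]
set_b = [7, 2, 2, 1, 3, 4, 50, 6]   # same data with 5 replaced by the outlier 50

for scores in (set_a, set_b):
    print(five_number_summary(scores),
          "mean =", round(mean(scores), 2),
          "st. dev. =", round(stdev(scores), 2))
```

In the second set the median and first quartile stay put, while the maximum, the mean, and the standard deviation are pulled toward the outlier.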
CHAPTER 9
NORMAL DISTRIBUTIONS
When a graph depicts proportion of scores instead of
frequency of scores it is called a density graph.
The proportions add up to 1 (100%).
When the density graph is smoothed into a line, it is
called a density curve.
• The mean is further towards the tail of the distribution
as it takes into account the size of those scores (ex.,
outliers).
• The median depicts position in a distribution of data
only; it is not affected by the more extreme scores.
[Figure: a normal curve and a curve skewed to the right]
• Normal Curves/Normal Distributions:
• The most important curve in Social Science and
Commerce statistics.
• Many biological variables fall on a normal curve. Ex.,
height.
• Many psychological variables are ‘forced’ into a
normal curve. Ex., I.Q., some psychological
inventories.
• Many sociological/economic variables don’t fall into a
normal curve. Ex., income, education.
Features of Normal Curves (Normal Distributions):
1. Given the mean and standard deviation, we can
draw the normal curve.
2. Mean is center of the distribution; cuts it in half. This
is also the median or 50th percentile.
3. The curve is symmetrical; one side of the mean
mirrors the other.
4. The standard deviation determines the shape of the
curve. The smaller the standard deviation, the closer
the scores are to one another, the ‘taller’ the curve.
The standard deviation breaks the normal curve into
segments that reflect the percent of scores in the set
of scores. The 50th percentile is at st. dev. zero.
[Figure: normal curve marked off in standard deviations from the mean]
• The 68-95-99.7 Rule
• 68% of all scores fall between -1 and +1 standard
deviation.
• 95% of all scores fall between -2 and +2 standard
deviations.
• 99.7% of all scores fall between -3 and +3 standard
deviations.
• As the tails of the normal curve do not touch the
horizontal axis, we cannot determine the number of
standard deviations for 100% of the scores.
• This is to leave room for extreme outliers.
Ex., Women’s height. Mean = 65" St. Dev. = 2.5"
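Applying the 68-95-99.7 rule to this example can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sketch assumes women's heights follow the normal curve exactly; the variable names are illustrative.

```python
from statistics import NormalDist

# Women's height in inches: mean 65, standard deviation 2.5 (from the example).
heights = NormalDist(mu=65, sigma=2.5)

shares = {}
for k in (1, 2, 3):
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    # Proportion of the curve lying between low and high.
    shares[k] = heights.cdf(high) - heights.cdf(low)
    print(f"{low} in. to {high} in.: {shares[k]:.1%}")
```

Under that assumption, about 68% of women are between 62.5 and 67.5 inches, about 95% between 60 and 70 inches, and about 99.7% between 57.5 and 72.5 inches.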
Chapter 5
What is a Confidence Interval?
INTRODUCTION TO STATISTICAL INFERENCE
Statistical inference is a technique to make decisions
regarding the probability that the population would
behave in the same way as the sample.
As it is based on probability, then the rules of
probability must be followed. Therefore, the
assumptions which must be met are:
1) Randomness: the predictable pattern of outcomes
after very many trials.
1a) If samples are chosen randomly, then the pattern
of outcomes is a normal distribution. This is called a
sampling distribution.
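The idea of a sampling distribution can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the population proportion (`true_p`), the sample size, the number of samples, and the fixed random seed. It draws many random samples, records each sample's proportion, and shows that the proportions cluster around the population value.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
true_p = 0.57            # hypothetical population proportion
sample_size = 500
num_samples = 1000

sample_proportions = []
for _ in range(num_samples):
    # One random sample: count how many of the sampled people say "yes".
    hits = sum(rng.random() < true_p for _ in range(sample_size))
    sample_proportions.append(hits / sample_size)

# The centre of the sampling distribution sits near the population value,
# and its spread is near sqrt(p(1 - p)/n), roughly 0.022 here.
print(round(mean(sample_proportions), 3))
print(round(stdev(sample_proportions), 4))
```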
2) We assume the mean of the normal distribution
reflects the mean of the population parameter.
Statistical inference helps us determine how confident
we are about where a result falls on the sampling
distribution in two ways:
1. Confidence Intervals: How confident we are that
our sample’s result captured the population parameter
within a certain range (margin of error).
2. Tests of significance: We make a claim about the
population and use the sample’s results to test that
claim. We want to determine the probability of our claim
being right.
CHAPTER 5
WHAT IS A CONFIDENCE INTERVAL?
A confidence interval estimates a population
parameter from a sample statistic at a certain level of
confidence. Here confidence means the probability of
being right.
We also refer to it as a Confidence Statement.
We take the sample’s statistic (data) and estimate
what the population’s answer would be. This involves how
sure we are (the confidence level) and the margin of error
(the margin within which we believe the population’s
answer falls).
We can develop Confidence Statements for:
A) Data given in percents/proportions.
B) Data given in means.
(The only difference is a change of formula)
p = population parameter
p̂ = statistic (result) from the sample
Take any statistic and estimate the probability (confidence level)
of it capturing the population parameter within a certain margin
of scores (margin of error).
A) When the statistic is given in percents or
proportions.
The formula to find a confidence interval for any level
of confidence is:
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
p̂ = sample statistic (proportion or percent)
z* = z score (standard score) for the chosen level of confidence
n = number of subjects in the sample (sample size)
Example:
Mayor Tremblay is two weeks from election
day. He wants to know his chances of winning
the election. A polling company asks 1000
people who they would vote for if the election
were held today and 57% say Mayor Tremblay.
Tremblay wants to be 90% confident that he will
win.
$$.57 \pm 1.64 \sqrt{\frac{.57(1 - .57)}{1000}}$$
.57 ± .0257
or 57% plus or minus 2.6%
The margin of error is 2.6%. By subtracting it from and
adding it to the percent of people who said they would
vote for Mayor Tremblay (57%), we find the range of
scores within which we are 90% confident the population
parameter lies.
Confidence Statement
Mayor Tremblay can be 90% confident that between
54.4% and 59.6% of all voters would vote for him if the
election were held today. (The "all" reflects the
population parameter.)
The confidence statement is the whole sentence; the
margin of error is plus or minus 2.6%, which gives the
range from 54.4% to 59.6%; the confidence level is 90%.
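The Mayor Tremblay calculation can be reproduced with a short Python sketch. The function name `proportion_confidence_interval` is hypothetical, and z* = 1.64 is the value used in the notes for 90% confidence (many tables give 1.645). Carried to four decimal places, the margin of error comes out at about .0257.

```python
import math


def proportion_confidence_interval(p_hat, n, z_star):
    """Return (low, high, margin) for p_hat +/- z* sqrt(p_hat(1 - p_hat)/n)."""
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


# 57% of 1000 people polled; z* = 1.64 for 90% confidence, as in the notes.
low, high, margin = proportion_confidence_interval(p_hat=0.57, n=1000, z_star=1.64)
print(f"margin of error: {margin:.4f}")
print(f"90% interval: {low:.3f} to {high:.3f}")
```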
CHAPTER 9
DESCRIBING RELATIONSHIPS:
SCATTERPLOTS AND CORRELATIONS
Scatterplots:
Involves the relationship between two or more
quantitative (ordinal, interval or ratio: NB – not
nominal) variables measured on the same
individuals/objects.
(For our course, we will deal with two variables.)
The graph that depicts this relationship is called a
scatterplot.
Sometimes, scatterplots have an explanatory variable
(on the horizontal axis) and a response variable (on
the vertical axis).
The explanatory variable is the independent variable.
The response variable is the dependent variable.
Each dot in a scatterplot reflects two pieces of
information (variables) about an individual.
In this example, the individuals are countries. The
graph depicts the relationship between gross domestic
product per person and longevity. (p. 271)
Some scatterplots have no explanatory and response
variables; only the relationship between two variables.
Ex., The Archaeopteryx: the femur (leg bone) and
humerus (arm bone); the size of one does not ‘explain’
or ‘contribute’ to the size of the other. (p. 274)
This scatterplot has a definite shape: as one variable
increases, the other tends to increase.
This is called a positive association.
[Figure: scatterplot, "Association between Ice Cream Sales and Temperature"; horizontal axis: Temperature; vertical axis: Ice Cream Sales]
When one variable decreases and the other
increases, it is called a negative association.
[Figure: scatterplot, "Ice Cream Price and Sales"; horizontal axis: Price; vertical axis: Sales]
When there is no relationship between the change in
one variable and the change in another variable, there
is no association.
[Figure: scatterplot, "Scatterplot of Ice Cream Sales and TV Violence"; horizontal axis: TV Violence Ratings; vertical axis: Ice Cream Sales]
To examine a scatterplot:
1. Look at the overall pattern and any important
deviations.
2. Describe the scatterplot using the form, direction
and strength of the relationship.
3. Look for outliers
4. The closer the data are to forming a straight line, the
stronger the association (either negative or positive).
Ex., The Archaeopteryx: There is a strong positive
association between the size of the femur and the
humerus with no outliers.
When the association between two variables is
expressed mathematically, it is called a correlation.
Features of Correlations
1. It is expressed as r.
2. The range is from -1.00 to +1.00.
3. -1.00 is a perfect negative correlation; +1.00 is a
perfect positive correlation. These are never seen
with real data. Zero is no correlation - there is no
relationship between the variables.
4. Correlations use standard scores so we can
compute them for any two variables (doesn’t have to
be the same unit of measurement).
5. Correlation measures the strength of straight-line
(linear) associations between variables.
6. Correlations are affected by outliers. The more
data there is, the less an outlier will influence the
correlation.
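Feature 4 can be sketched in Python: r is computed from standard scores, so changing the unit of measurement leaves it unchanged. The data below are hypothetical (made up for illustration), and the function name `correlation` is not from the notes.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = sum of the products of paired standard scores, divided by n - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    # Convert each score to a standard score, then multiply the pairs.
    z_products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Hypothetical data: temperature in Celsius and ice cream sales.
temps_c = [10, 12, 14, 16, 18, 20]
sales = [2, 3, 5, 6, 8, 9]
temps_f = [c * 9 / 5 + 32 for c in temps_c]   # same temperatures in Fahrenheit

r_celsius = correlation(temps_c, sales)
r_fahrenheit = correlation(temps_f, sales)
print(round(r_celsius, 3), round(r_fahrenheit, 3))
```

Both calls give the same r, because converting Celsius to Fahrenheit changes the scores but not their standard scores.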
Correlations between:    Are considered:
.8 - 1.00                Very Strong
.6 - .79                 Strong
.4 - .59                 Moderate
.2 - .39                 Weak
0.0 - .19                Very Weak
2. What is wrong with each of the following statements?
a) The correlation between the first snow storm of any
given year and the number of car accidents that day is
r = - 1.3
b) The correlation between gender and income is
about r = .66
3. Give an example for each of the following:
a) A strong positive correlation
b) A strong negative correlation
c) No correlation
EXERCISES
1. Professor Lively runs every day for at least 30
minutes and checks her pulse rate.
Time     Pulse
34.12    152
35.52    124
34.52    140
34.05    152
34.13    146
35.52    128
36.17    136
1a) Draw a scatterplot for these data.
1b) The correlation is r = -.815. Briefly describe
what this means.
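As a check on the quoted value, r can be recomputed from Professor Lively's seven runs. This is a sketch, not the course's official solution: the function names `pearson_r` and `strength` are hypothetical, the times are treated as plain decimal numbers, and the strength labels follow the table of correlation ranges given earlier in these notes.

```python
import math

times = [34.12, 35.52, 34.52, 34.05, 34.13, 35.52, 36.17]
pulses = [152, 124, 140, 152, 146, 128, 136]


def pearson_r(xs, ys):
    """Correlation from sums of deviation products and squared deviations."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def strength(r):
    """Label the size of r using the ranges from the notes."""
    size = abs(r)
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "very weak"


r = pearson_r(times, pulses)
direction = "negative" if r < 0 else "positive"
print(f"r = {r:.3f}: {strength(r)} {direction} association")
```

A negative r here means that longer running times tend to go with lower pulse rates.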
2. The following are the closing quotes for the Nasdaq
and Microsoft for ten trading days.
Nasdaq:    Microsoft:
1742       54
1785       57
1770       55
1789       56
1784       56
1804       57
1862       60
1845       60
1826       59
1824       59
2a) The correlation is r = .974.
Describe what this means:
a. Does NASDAQ performance cause Microsoft sales
to rise?
b. Do Microsoft sales cause NASDAQ performance
to rise?
c. Neither a. nor b.
Justify your answer.
Causation:
The reason something occurs; what makes it happen.
Requires experimental research designs where there
is a great deal of control of all variables.
Philosophically, causation requires a ‘leap of faith’
from excluding all other possible explanations to
granting the independent variable the power to have
caused the behavior.
a) Simple Causation:
Very rare in real life.
A causes B to happen.
Ex., paying students $250 to get 80%+ in a course.
This would increase the number of students who get
80%+.
If everything else is kept constant, we could say that
the $250 had an effect on students’ behavior; it
caused an increase in grades.
A → B
b) Common Response
A causes B and C
When changes in two variables are caused by a third,
common, variable.
Ex., July is season for highest ice cream sales;
July is also the month where the most people drown.
Ice cream does not cause drowning; the warm
weather increases both sales and drownings.
A → B and A → C
c) Confounding Response:
We know two variables cause a change in a third but
we don’t know the ‘weight’ of each variable.
Ex., person smokes and drinks too much.
The heart is affected; we know that both contribute but do
not know how much each contributes.
Need to do experimental research to ‘sort out’ the
influences.
Helps to isolate each variable’s effect on heart.
When experimentation is not possible, we can
approach causation if the following conditions are met:
1. The association between two variables is strong.
2. The association between two variables is
consistent.
3. The alleged cause precedes the effect.
4. The alleged cause is plausible.
3. People who drink diet soft drinks tend to gain more
weight over a one year period than people who do not.
Does drinking diet drinks make people gain weight?
Give a more plausible explanation.