Download Test 2 RM - Chaps 9&5 Organizing Data

Research Methods
Winter 2008
Organizing & Analyzing Data
Chapter 9
Graphs & Charts
Instructor: Dr. Harry Webster
Where Do Data Come From?
Anything that we can put into a number is data.
Good data require valid measurements and an appropriate
way to investigate the topic.
Types of investigations in Social Science and
Commerce that usually have data are:
1. Observational Study: Uses predetermined
categories (target behaviors/events) and observes
frequency. Seeks to describe behavior or event.
Ex., observe children for helpful behavior. Have target
behaviors identified before beginning observation.
Define precisely helpful behavior.
Ex., p. 7 textbook. Compared proximity of residence
to power lines for children with leukemia to those
without leukemia.
2. Survey Studies:
Offer a series of questions usually to a large number
of people.
Ex., ability and attitude/opinion questionnaires.
Looks at how many people gave a certain answer
(frequency & percents).
2a) Sample surveys: A survey where much attention is
given to securing the sample in a random manner
where, as much as possible, everyone has an equal
chance of being selected.
Ex., public opinion polls, marketing research.
3. Census
Tries to gather data from everyone in a country.
Underestimates the homeless and some minority
groups.
Governments do this to establish voting districts,
economic and social trends.
For Canadian Census
4. Experiments
Researcher changes one thing (independent variable,
IV; called treatment in textbook); keeps everything
else the same, and determines if behavior (dependent
variable, DV) is affected.
Basic design is two groups where one group gets one
level of the IV and the other gets another level of the
IV (this includes its absence).
Since everything else is kept constant, any change
in behavior (DV) is due to the change in the IV.
Can make cause and effect (what causes what)
statements with experiments.
Cause and effect statements can only be made about
groups; not individuals. They are probable causes not
certain ones (remember: statistics is about probability
not certainty).
Ex., Does a tutor impact academic performance?
Group A
No tutor in QM
Group B
A tutor in QM
Both groups are otherwise the same.
Compare grades in QM.
Vocabulary
(From Ch 4 Test 1)
Individual: anything that is measured; includes
people, objects, animals. The ‘thing’ that provides the
measurement.
Ex., person
Variable: any characteristic of an individual; anything
that we actually measure.
Ex., height, weight, schooling, singing ability, opinion.
A measurement can be in a count (frequency) or in a
rate (proportion or percent).
Usually, a rate is better. Offers the advantage of
comparing two different measurements.
Ex., Metro is 2 minutes late 10 times = count.
Bus is 2 minutes late 25 times = count.
Metro is 2 minutes late 10% of the time = rate.
Bus is 2 minutes late 15% of the time = rate.
Predictive Validity: A measurement has predictive
validity if it can be used to predict success or
performance on tasks that are related to what we
measured.
Ex., Do college grades predict university grades?
To some extent yes, but motivation and drive are
missing from the prediction. (ref: p. 141)
SAT (Scholastic Aptitude Tests) and high school
grades predict about 34% of the variation in college
grades. This means that about 1/3rd of the grade
differences in college are predicted by the grades in
high school and on the SAT tests.
Scales or Levels of Measurement:
Nominal Scale: categorizes objects/people. No
indication of size.
Ex., male (1) and female (2).
Ordinal Scale: reflects larger/smaller but not by how
much.
Ex., professional tennis rankings.
Interval Scale: reflects difference in equal units. Get
an idea of size of difference.
Ex., Temperature: 10 & 20 degrees Celsius. They
are measured in equal units but one is not twice the
other. There is no absolute zero. Zero degrees does
not mean an absence of temperature.
Ex., IQ (intelligence quotient): a person’s IQ is
determined in relation to the average in the
population. There is no absolute zero IQ.
Ratio Scale: reflects equal units and an absolute zero:
Weight in Kgs or lbs; Distance in Metres, … Few in
Social Science and Commerce disciplines:
Ex., An item costs $.50, then $.25, then zero (it is given
away).
Exercises Chapter 8
1. Give a valid way to measure determination.
2. Give an example of a biased measurement.
3. Give an example of a reliable measurement.
4. Identify the scales of measurement for the
following:
a) Ice skating scores
b) Points in a hockey game.
c) Times for speed skating.
d) Gold, silver and bronze medals.
e) Penalties in hockey match.
5. If you had a Canadian Savings Bond that
gave you $100 interest on $2000 per year and
a Hydro Quebec Bond that gave you $60
interest on $1000 per year:
a) What is the rate of return on the CSB?
b) What is the rate of return on the Hydro Qc
Bond?
c) What is your total rate of return?
Part Two
Organizing Data
1. Graphs: Good and Bad. Chapter 10
Nominal data: Bar Graph & Pie Chart
Ordinal, Interval, Ratio data: Line Graph
2. Displaying Distributions with Graphs.
Chapter 11
3. Describing Distributions with Numbers. Chapter 12
Measures of central tendency: median, means.
Measures of spread of distribution: quartiles,
minimum scores, maximum scores, range,
standard deviation
4. Normal distributions. Chapter 13
Chapter 10
Graphs: Good and Bad
Good graphs represent the data without distorting it.
Categorical variable: Nominal data that reflects
categories. Ex., gender.
Quantitative (continuous) variable: Ordinal, interval or
ratio. Data that can be averaged.
For Nominal (categorical) data use either Pie Charts
or Bar Graphs.
Ref: p. 176
• Pie Chart of education level of people aged 25-34 in
U.S.A. in 1998. (ref: p. 177)
Bar Graph of education level of people aged 25-34 in
U.S.A. in 1998 (ref: p 177)
[Figure: bar graph, "Education Level 25-34 Year Olds in 1998"; vertical axis: Percent (0 to 35); categories: Less H.S., H.S., Some College, Bachelor's, Advanced]
For ordinal, interval or ratio (quantitative) data use a
line graph or histogram.
Line Graph: Generally plotting data over time.
Ex., Average cost of regular unleaded gas: 1990-2000. (Ref: p. 181)
To evaluate data in line graphs:
1. Look for overall pattern (trend).
2. Look for deviations from the pattern.
3. Look for seasonal variation that repeats itself each
year.
Verify to see if the seasonal variation has been
removed by using seasonal adjustment.
Ex., unemployment rates in January are adjusted to
remove the expected increase due to fewer jobs after
holiday sales jobs end.
Making good graphs:
1. Keep it simple.
2. Put titles, labels and legends.
Titles go above the graph, labels indicate the axes and
legends explain the data if there is more than one
variable.
3. Only put the graphic of the data in the plotting area.
Don’t clutter with gridlines or fancy graphics.
3. Make a line graph for the following data on the
Annual Canadian Consumer Price Index (1986 = 100)
Year    Clothing/Footwear    Transportation
1993    130.8                125.7
1994    131.8                131.3
1995    131.8                138.1
1996    131.3                143.5
1997    133                  147.9
Chapter 11
Displaying Distributions with Graphs
Use histograms for ordinal, interval or ratio scale data.
(quantitative/continuous data)
Making a histogram:
Step 1: Divide the range of the data into classes of
equal width.
Rule of thumb: Use the square root of the number of
scores to determine the number of classes. Usually,
no less than 4 and no more than 9.
Don’t use this rule of thumb when there is an
established way to divide the data into classes: ex.,
grades.
Step 2: Count the number of individuals in each class.
Step 3: Draw the histogram. Horizontal axis is for the
variable under study; vertical axis is for frequency.
Each bar represents a class. Use bars of equal width.
Ex., Some of the grades in one of my Intro to Psych
sections many semesters ago: (Total = 30)
72 81 63 64 76 71 85 60 62 81
60 43 84 65 71 84 77 60 70 63
83 81 41 46 75 72 63 90 23 63
Step 1: Divide the range of the data into equal
classes.
Range = 23 - 90; can divide the data into 8 classes.
20-29; 30-39; 40-49; 50-59; 60-69; 70-79; 80-89; 90-99.
Step 2. Count the number of individuals in each class.
Classes    Individuals
20-29      1
30-39      0
40-49      3
50-59      0
60-69      10
70-79      8
80-89      7
90-99      1
Total      30
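Steps 1 and 2 can be sketched in Python as follows. This is only an illustration: the function name class_counts is made up for this example, the class boundaries (20-29 up to 90-99) are the ones chosen above, and the rows of asterisks are a rough text stand-in for the histogram bars.

```python
# Grades from the example above (30 scores).
grades = [72, 81, 63, 64, 76, 71, 85, 60, 62, 81,
          60, 43, 84, 65, 71, 84, 77, 60, 70, 63,
          83, 81, 41, 46, 75, 72, 63, 90, 23, 63]


def class_counts(scores, start=20, stop=100, width=10):
    """Count how many scores fall in each class of equal width.

    Hypothetical helper: a class labelled "20-29" holds scores from 20
    up to (but not including) 30, and so on.
    """
    counts = {}
    for low in range(start, stop, width):
        label = f"{low}-{low + width - 1}"
        counts[label] = sum(1 for s in scores if low <= s < low + width)
    return counts


counts = class_counts(grades)
for label, count in counts.items():
    # One asterisk per individual in the class.
    print(f"{label:>5} | {'*' * count} ({count})")
print("Total:", sum(counts.values()))
```

The counts should match the table above and add up to the 30 grades.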
Step 3: Draw the histogram:
[Figure: histogram, "Introduction to Psychology Grades"; horizontal axis: Grades (20 to 100); vertical axis: Frequency (0 to 10)]
Interpreting Histograms
1. Must use judgment to determine number of
classes.
2. Outliers are extreme deviations; they fall outside of the
pattern.
3. Look for overall pattern and deviations.
4. To find overall pattern ignore outliers; look for
center of distribution and spread.
5. Describe the shape as simply as possible.
a) Symmetrical: right and left sides are nearly mirror
images.
b) Skewed: Can be skewed to the right (trails off to the right)
or skewed to the left (trails off to the left).
Histogram skewed to the left
[Figure: histogram; horizontal axis: Points in Football Games (0 to 21); vertical axis: Frequency (0 to 7)]
Comparing Bar Graph and Histogram
[Figure: bar graph, "Education Level 25-34 Year Olds in 1998"; vertical axis: Percent; categories: Less H.S., H.S., Some College, Bachelor's, Advanced]
[Figure: histogram, "Highway Gas Mileage for Midsize Cars 2000"; horizontal axis: Mileage (21 to 33); vertical axis: Frequency]
Chapter 12
Describing Distributions with Numbers
We will see:
A) median, quartiles, five number summary, boxplots.
These involve ordering data and positions.
Boxplot is a graphic (picture) of the five number
summary.
B) means, standard deviation. Do not involve position.
Measures of central tendency are: median, means.
Measures of spread are: quartiles, standard deviation.
A) Median, quartiles, five number summary and
boxplots.
Median: the midpoint of an arranged (ordered from
smallest to largest) distribution of data. The 50th
percentile.
(Percentile: ranking out of 100)
Calculating the median:
1. Arrange scores from smallest to largest.
2. Use formula: (n + 1)/2 to find the location of the
median.
3a. If you have an odd number of scores, the formula
will lead you to the median score.
Ex1.,
2 3 4 5 6 7 8 9 10
Formula: (n + 1)/2 (9 + 1 )/2 = 5 (location of median)
Count 5 scores and we get 6. 6 is the median score.
Ex2.,
2 3 4 5 5 5 6 7 8 8 8
Formula: (n + 1)/2 (11 + 1)/2 = 6 (location of median)
Count 6 scores and we get 5. As there are many 5’s
we must indicate (underline) which 5 is the median
score.
3b. If we have an even number of scores then the
formula brings us in between two numbers.
Ex., 12 13 14 15 16 17
Formula: (n + 1)/2 (6 + 1)/2 = 3.5 (location of median)
Count 3.5 scores and we are between 14 and 15.
Then find the average of these two scores.
(14 + 15)/2 = 29/2 = 14.5 median score. This is indicated
by putting a mark between the 14 and the 15.
12 13 14 ~ 15 16 17
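The three steps for finding the median can be sketched in Python as below. The function name median is chosen for this illustration, and the three data sets are the examples worked above.

```python
def median(scores):
    """Find the median using the (n + 1)/2 location rule."""
    ordered = sorted(scores)          # Step 1: arrange smallest to largest
    n = len(ordered)
    location = (n + 1) / 2            # Step 2: 1-based location of the median
    if n % 2 == 1:
        # Step 3a: odd number of scores, the location is a whole number.
        return ordered[int(location) - 1]
    # Step 3b: even number of scores, the location falls between two
    # scores, so average the score just below and the score just above.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2


print(median([2, 3, 4, 5, 6, 7, 8, 9, 10]))        # 6
print(median([2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8]))   # 5
print(median([12, 13, 14, 15, 16, 17]))            # 14.5
```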
B) Mean and Standard Deviation
Mean: An average of scores.
Pronounced ‘x-bar’; symbol = x̄
Sum of scores divided by number of cases.
Ex., 1 2 3 4: 10/4 = 2.5
Sensitive to outliers.
Ex., 1 2 3 40: 46/4 = 11.5
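A short Python sketch of the two examples above shows how one outlier pulls the mean but leaves the median alone. The function name mean is chosen for this illustration; the median comes from Python's standard statistics module.

```python
import statistics


def mean(scores):
    """Sum of scores divided by number of cases."""
    return sum(scores) / len(scores)


without_outlier = [1, 2, 3, 4]
with_outlier = [1, 2, 3, 40]

print(mean(without_outlier), statistics.median(without_outlier))  # 2.5 2.5
print(mean(with_outlier), statistics.median(with_outlier))        # 11.5 2.5
```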
Standard Deviation
Most frequently used expression of
spread/variability
Is a measure of the average spread of
scores from the mean.
Small standard deviations involve a set of
scores that are close to the mean.
Large standard deviations involve a set of
scores that are further away from the
mean.
Is influenced by outliers (the mean is
used to calculate the standard deviation).
The Standard Deviation Formula
$$\text{St. Dev.} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
$\sum$ = sum of everything in parentheses
$x$ = each score
$\bar{x}$ = mean
$n$ = number of scores
To calculate the standard deviation (S.D. or St. Dev.)
Step 1. Find the mean.
Step 2. Find the distance of each score from the
mean.
Step 3. Square each result to get rid of negatives.
Step 4. Add up the squared deviations (from the
mean).
Step 5. Divide by n-1. This gives the variance.
Step 6. Find the square root. This gives the St. Dev.
Example: Data set: 1 2 4 6 7
1. Find the mean: 20/5 = 4
2. Deviation      3. Squared Deviation
1 - 4 = -3        (-3) x (-3) = 9
2 - 4 = -2        (-2) x (-2) = 4
4 - 4 = 0         0 x 0 = 0
6 - 4 = 2         2 x 2 = 4
7 - 4 = 3         3 x 3 = 9
4. Total                        26
5. Divide the sum of the squared deviations by n-1:
26/(5 - 1) = 6.5 This is the variance.
6. Take the square root of the variance:
Square root of 6.5 = 2.55
This is the standard deviation.
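The six steps can be followed line by line in a short Python sketch. The function name standard_deviation is chosen for this illustration; the data set is the one from the example.

```python
import math


def standard_deviation(scores):
    """Sample standard deviation, following the six steps above."""
    n = len(scores)
    mean = sum(scores) / n                       # Step 1: find the mean
    deviations = [x - mean for x in scores]      # Step 2: distance from the mean
    squared = [d ** 2 for d in deviations]       # Step 3: square each deviation
    total = sum(squared)                         # Step 4: add them up
    variance = total / (n - 1)                   # Step 5: divide by n - 1
    return math.sqrt(variance)                   # Step 6: square root


data = [1, 2, 4, 6, 7]
sd = standard_deviation(data)
print(round(sd, 2))  # 2.55
```

Python's built-in statistics.stdev uses the same n - 1 formula and can serve as a check.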
Use medians when there are outliers. Ex. income.
Use means and standard deviations when the
distribution appears symmetrical. Ex. Test grades,
performance on athletic variables that are measured in
time.
6. We have used 2 sets of data (7 2 2 1 3 4 5 6 and 7
2 2 1 3 4 50 6) to determine five number summaries
and standard deviations.
Using the numbers, show the effects of outliers.
CHAPTER 13
NORMAL DISTRIBUTIONS
When a graph depicts proportion of scores instead of
frequency of scores it is called a density graph.
The proportions add up to 1 (100%).
When the density graph is smoothed into a line, it is
called a density curve.
• The mean is further towards the tail of the distribution
as it takes into account the size of those scores (ex.,
outliers).
• The median depicts position in a distribution of data
only; it is not affected by the more extreme scores.
[Figure: a normal curve and a curve skewed to the right]
• Normal Curves/Normal Distributions:
• The most important curve in Social Science and
Commerce statistics.
• Many biological variables fall on a normal curve. Ex.,
height.
• Many psychological variables are ‘forced’ into a
normal curve. Ex., I.Q., some psychological
inventories.
• Many sociological/economic variables don’t fall into a
normal curve. Ex., income, education.
Features of Normal Curves (Normal Distributions):
1. Given the mean and standard deviation, we can
draw the normal curve.
2. Mean is center of the distribution; cuts it in half. This
is also the median or 50th percentile.
3. The curve is symmetrical; one side of the mean
mirrors the other.
4. The standard deviation determines the shape of the
curve. The smaller the standard deviation, the closer
the scores are to one another, the ‘taller’ the curve.
The standard deviation breaks the normal curve into
segments that reflect the percent of scores in the set
of scores. The 50th percentile is at st. dev. zero.
Standard deviations for the mean
• The 68-95-99.7 Rule
• 68% of all scores fall between -1 and +1 standard
deviation.
• 95% of all scores fall between -2 and +2 standard
deviations.
• 99.7% of all scores fall between -3 and +3 standard
deviations.
• As the tails of the normal curve do not touch the
horizontal axis, we cannot determine the number of
standard deviations for 100% of the scores.
• This is to leave room for extreme outliers.
Ex., Women’s height. Mean = 65" St. Dev. = 2.5"
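The 68-95-99.7 rule can be checked with a short Python sketch applied to the women's height example. The function name share_within is made up for this illustration. It relies on a standard fact about the normal curve, that the share of scores within k standard deviations of the mean equals erf(k / √2), and uses math.erf from Python's standard library.

```python
import math


def share_within(k):
    """Proportion of a normal distribution within k standard deviations
    of the mean (hypothetical helper for this example)."""
    return math.erf(k / math.sqrt(2))


mean_height = 65.0   # inches, from the example above
sd_height = 2.5      # inches, from the example above

for k in (1, 2, 3):
    low = mean_height - k * sd_height
    high = mean_height + k * sd_height
    percent = 100 * share_within(k)
    print(f"{percent:.1f}% of heights fall between {low} and {high} inches")
```

The exact shares are slightly different from 68, 95 and 99.7, which are rounded values that are easy to remember.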
The number of standard deviations an individual score
lies from the mean is called the standard score.
Standard score is also known as a z- score.
The standard score allows us to determine the
percentile rank of that score.
What is the percentile rank of a woman who is 68”
tall?
For this we need the Standard Score formula:
Standard Score (St. Sc.) = (Score - mean) / St. Dev.
St. Sc. = (68 - 65) / 2.5 = 1.2
A standard score of 1.2 means this woman is 1.2
standard deviations above the mean.
It translates into a percentile ranking of 88.49 using
Table B (p. 552).
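The same calculation can be sketched in Python. Table B is not reproduced here, so the sketch swaps in the normal curve's cumulative proportion computed with math.erf, which should agree with a standard normal table. The function names standard_score and percentile_rank are chosen for this illustration.

```python
import math


def standard_score(score, mean, st_dev):
    """Standard score (z-score): (score - mean) / standard deviation."""
    return (score - mean) / st_dev


def percentile_rank(z):
    """Percent of a normal distribution that falls below standard score z.

    Stand-in for looking z up in a standard normal table.
    """
    return 50 * (1 + math.erf(z / math.sqrt(2)))


z = standard_score(68, 65, 2.5)
print(round(z, 2))                    # 1.2
print(round(percentile_rank(z), 2))   # 88.49
```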
INTRODUCTION TO STATISTICAL INFERENCE
Statistical inference is a technique to make decisions
regarding the probability that the population would
behave in the same way as the sample.
As it is based on probability, then the rules of
probability must be followed. Therefore, the
assumptions which must be met are:
1) Randomness: the predictable pattern of outcomes
after very many trials.
1a) If samples are chosen randomly, then the pattern
of outcomes is a normal distribution. This is called a
sampling distribution.
2) We assume the mean of the normal distribution
reflects the mean of the population parameter.
Statistical inference helps us determine how confident
we are about where a result falls on the sampling
distribution in two ways:
1. Confidence Intervals: How confident we are that
our sample’s result captured the population parameter
within a certain range (margin of error).
2. Tests of significance: We make a claim about the
population and use the sample’s results to test that
claim. We want to determine the probability of our claim
being right.
CHAPTER 21
WHAT IS A CONFIDENCE INTERVAL?
A confidence interval estimates a population
parameter from a sample statistic at a certain level of
confidence. Here confidence means the probability of
being right.
We also referred to it as a Confidence Statement.
We take the sample’s statistic (data) and estimate
what the population’s answer would be. This involves how
sure we are (confidence level) and the margin of error (the
margin where we believe the population’s answer falls).
We will cover how to develop Confidence
Statements for:
A) Data given in percents/proportions.
B) Data given in means.
(The only difference is a change of formula)
p = population parameter
p̂ = statistic (result) from a sample
Take any statistic and estimate the probability (conf. level)
of it capturing the population parameter within a certain
margin of scores (margin of error).
A) When the statistic is given in percents or
proportions.
The formula to find a confidence interval for any level
of confidence is:
$$\hat{p} \pm z^* \sqrt{\hat{p}(1 - \hat{p})/n}$$
$\hat{p}$ = sample statistic (proportion or percent)
$z^*$ = z score (standard score) for the chosen confidence level
$n$ = number of subjects in the sample (sample size)
Example:
Mayor Villeneuve is two weeks from election
day. He wants to know his chances of winning
the election. A polling company asks 1000
people who they would vote for if the election
were held today and 57% say Villeneuve.
Villeneuve wants to be 90% confident that he
will win.
$$0.57 \pm 1.64 \sqrt{0.57(1 - 0.57)/1000}$$
.57 ± .0257
or 57% plus or minus about 2.6%
The margin of error is about 2.6%. By subtracting it from and
adding it to the percent of people who said they would
vote for Mayor Villeneuve (57%) we find the range of
scores within which we are 90% confident the population
parameter lies.
Confidence Statement
Mayor Villeneuve can be 90% confident that between
54.4% and 59.6% of all voters would vote for him if the
election were held today. (The "all" reflects the
population parameter.)
The confidence statement is the whole sentence; the
interval runs from 54.4% to 59.6% (57% plus or minus the
2.6% margin of error); the confidence level is 90%.
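The Mayor Villeneuve example can be reproduced with a short Python sketch of the proportion formula. The function name proportion_confidence_interval is made up for this illustration, and z* = 1.64 for 90% confidence is the value used in the example.

```python
import math


def proportion_confidence_interval(p_hat, n, z_star):
    """Return (low, high, margin) for a sample proportion.

    margin = z* times the square root of p_hat(1 - p_hat)/n
    """
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


# 57% of 1000 people polled, 90% confidence (z* = 1.64 as in the example).
low, high, margin = proportion_confidence_interval(0.57, 1000, 1.64)
print(f"margin of error: {100 * margin:.2f}%")
print(f"interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

Changing z_star gives the interval for another confidence level, and a larger n shrinks the margin of error.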
CHAPTER 9 - RELATIONSHIPS:
SCATTERPLOTS AND CORRELATIONS
Scatterplots:
Involves the relationship between two or more
quantitative (ordinal, interval, ratio) variables
measured on the same individuals/objects.
(For our course, we will deal with two variables.)
The graph that depicts this relationship is called a
scatterplot.
Sometimes, scatterplots have an explanatory variable
(on the horizontal axis) and a response variable (on
the vertical axis).
The explanatory variable is the independent variable.
The response variable is the dependent variable.
Each dot in a scatterplot reflects two pieces of
information (variables) about an individual.
In this example, the individuals are countries. The
graph depicts the relationship between gross domestic
product per person and longevity. (p. 271)
Some scatterplots have no explanatory and response
variables; only the relationship between two variables.
Ex., The Archaeopteryx: the femur (leg bone) and
humerus (arm bone); the size of one does not ‘explain’
or ‘contribute’ to the size of the other. (p. 274)
This scatterplot has a definite shape: as one variable
increases, the other tends to increase.
This is called a positive association.
[Figure: scatterplot, "Association between Ice Cream Sales and Temperature"; horizontal axis: Temperature (10 to 20); vertical axis: Ice Cream Sales (0 to 10)]
When one variable decreases and the other
increases, it is called a negative association.
[Figure: scatterplot, "Ice Cream Price and Sales"; horizontal axis: Price (0 to 2.5); vertical axis: Sales (0 to 10)]
When there is no relationship between the change in
one variable and the change in another variable, there
is no association.
[Figure: scatterplot, "Scatterplot of Ice Cream Sales and TV Violence"; horizontal axis: TV Violence Ratings (0 to 8); vertical axis: Ice Cream Sales (0 to 9)]
To examine a scatterplot:
1. Look at the overall pattern and any important
deviations.
2. Describe the scatterplot using the form, direction
and strength of the relationship.
3. Look for outliers
4. The closer the data are to forming a straight line, the
stronger the association.
Ex., The Archaeopteryx: There is a strong positive
association between the size of the femur and the
humerus with no outliers.
When the association between two variables is
expressed mathematically, it is called a correlation.
Features of Correlations
1. It is expressed as r.
2. The range is from -1.00 to +1.00.
3. -1.00 is a perfect negative correlation; +1.00 is a
perfect positive correlation. These are never seen
with real data. Zero is no correlation - there is no
relationship between the variables.
4. Correlations use standard scores so we can
compute them for any two variables (doesn’t have to
be the same unit of measurement).
5. Correlation measures the strength of straight-line
(linear) associations between variables.
6. Correlations are affected by outliers. The more
data there is, the less an outlier will influence the
correlation.
Correlations between:    Are considered:
.8 - 1.00                Very Strong
.6 - .79                 Strong
.4 - .59                 Moderate
.2 - .39                 Weak
0.0 - .19                Very Weak
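A correlation can be computed from standard scores as in the Python sketch below. The temperature and sales figures are invented for illustration only and do not come from the course. The function names are also made up, and applying the strength table to the size of r regardless of its sign is an assumption.

```python
import math


def standard_scores(values):
    """Convert each value to a standard score: (value - mean) / st. dev."""
    n = len(values)
    mean = sum(values) / n
    st_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / st_dev for v in values]


def correlation(xs, ys):
    """r = sum of the products of paired standard scores, divided by n - 1."""
    zx = standard_scores(xs)
    zy = standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


def strength(r):
    """Label from the table above, applied to the size of r (assumption)."""
    size = abs(r)
    if size >= 0.8:
        return "Very Strong"
    if size >= 0.6:
        return "Strong"
    if size >= 0.4:
        return "Moderate"
    if size >= 0.2:
        return "Weak"
    return "Very Weak"


# Made-up data: temperature (Celsius) and ice cream sales.
temperature = [10, 12, 14, 16, 18, 20]
sales = [2, 3, 5, 6, 8, 9]

r = correlation(temperature, sales)
print(round(r, 3), strength(r))

# Changing the unit of measurement (Celsius to Fahrenheit) leaves r the same,
# because standard scores do not depend on the unit.
fahrenheit = [c * 9 / 5 + 32 for c in temperature]
print(round(correlation(fahrenheit, sales), 3))
```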
Causation:
The reason something occurs; what makes it happen.
Requires experimental research designs where there
is a great deal of control of all variables.
Philosophically, causation requires a ‘leap of faith’
from excluding all other possible explanations to
granting the independent variable the power to have
caused the behavior.
a) Simple Causation:
Very rare in real life.
A causes B to happen.
Ex., paying students $250 to get 80%+ in a course.
This would increase the number of students who get
80%+.
If everything else is kept constant, we could say that
the $250 had an effect on students’ behavior; it
caused an increase in grades.
[Diagram: A → B]
b) Common Response
A causes B and C
When changes in two variables are caused by a third,
common, variable.
Ex., July is season for highest ice cream sales;
July is also the month where the most people drown.
Ice cream does not cause drowning; the warm
weather increases both sales and drownings.
[Diagram: A → B and A → C]
c) Confounding Response:
We know two variables cause a change in a third but
we don’t know the ‘weight’ of each variable.
Ex., person smokes and drinks too much.
Heart is affected; we know that both contribute but do
not know how much each contributes.
Need to do experimental research to ‘sort out’ the
influences.
Helps to isolate each variable’s effect on heart.
When experimentation is not possible, we can
approach causation if the following conditions are met:
1. The association between two variables is strong.
2. The association between two variables is
consistent.
3. The alleged cause precedes the effect.
4. The alleged cause is plausible.