Types of data and how to present them
47:269: Research Methods I
Dr. Leonard
March 31, 2010
Scientific Theory
1. Formulate theories 
2. Develop testable hypotheses (operational definitions) 
3. Conduct research, gather data 
4. Evaluate hypotheses based on data 
5. Cautiously draw conclusions
Scales of Measurement
Nominal
  Categories
Ordinal
  Categories that can be ranked
Interval
  Scores with equidistant intervals between them
Ratio
  Scores with equidistant intervals and absolute zero
Scale       Responses are distinct   Responses can be ranked   Equal intervals   Absolute zero
Nominal     YES                      NO                        NO                NO
Ordinal     YES                      YES                       NO                NO
Interval    YES                      YES                       YES               NO
Ratio       YES                      YES                       YES               YES
Two major approaches to using data
• Descriptive statistics
  • Describe or summarize data to characterize sample
  • Organize responses to show trends in data
• Inferential statistics
  • Draw inferences about population from sample (is population distinct from sample?)
  • Significance tests
  • Capture impact of random error on responses
  • Margin of error
• Note: Statistics describe responses from a sample; parameters describe responses from a population (e.g., a census)
Descriptive Statistics
• N, total number of cases (responses) in a sample
  • Our class would be N = 33
• f, or frequency, is the number of participants who gave a particular response, x
  • Can also be given as percentages or proportions
• Can be univariate or bivariate
  • How participants vary on one variable (uni-)
  • How participants vary on two variables (bi-)
• Descriptive statistics are a good first step for analyzing any data!
  • They are the only statistics appropriate for nominal data
Frequency distribution (nominal data)
x (response)    f (frequency)    %
Democrat        479              47.9
Republican      411              41.1
Independent     101              10.1
Green party     9                0.9
Total           n = 1,000        100%
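As a rough illustration of how a table like this could be built, the sketch below tallies responses and converts each frequency to a percentage. The function name `frequency_table` is made up for this example; the response list is simply rebuilt from the counts in the table.

```python
from collections import Counter

def frequency_table(responses):
    """Return (response, frequency, percentage) rows, most frequent first."""
    counts = Counter(responses)
    n = len(responses)
    return [(x, f, round(100 * f / n, 1)) for x, f in counts.most_common()]

# Rebuild the raw responses from the counts shown in the table.
responses = (["Democrat"] * 479 + ["Republican"] * 411
             + ["Independent"] * 101 + ["Green party"] * 9)

table = frequency_table(responses)
for x, f, pct in table:
    print(f"{x:<12} {f:>4} {pct:>5.1f}")
```

Because the rows are sorted by frequency, the first row is also the mode of the distribution.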
Frequency distribution (interval or ratio data)

When you need to present a wide range of scores, show responses grouped in intervals to make it easier to grasp the “big picture” of the data
2.7 1.9 3.1 1.0 3.3 1.3 2.2 3.0
3.4 3.1 1.8 2.6 3.7 2.2 1.9 3.1
3.4 3.0 3.5 3.0 2.4 3.0 3.4 2.4
2.4 3.2 3.3 2.7 3.5 3.2 3.1 2.1
1.5 1.4 2.6 2.9 2.1 2.3 3.1 3.3
2.7 2.4 3.4 3.3 3.0 3.8 1.6 2.8
3.8 1.4 2.6 1.5 2.8 2.3 2.8 2.3
2.8 3.2 2.8 1.9 3.3 2.9 2.0 3.2
Interval     f
0.9 - 1.1    1
1.2 - 1.4    3
1.5 - 1.7    3
1.8 - 2.0    5
2.1 - 2.3    6
2.4 - 2.6    7
2.7 - 2.9    10
3.0 - 3.2    14
3.3 - 3.5    12
3.6 - 3.8    3
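One way to sketch the grouping step is shown below. It assumes scores recorded to one decimal place and the same interval layout as the table (starting at 0.9, three tenths wide), and it works in whole tenths to avoid floating-point edge cases. The function name `grouped_frequencies` and the short sample of scores are chosen only for illustration.

```python
from collections import Counter

def grouped_frequencies(scores, start_tenths=9, width_tenths=3):
    """Count scores per interval; returns (low, high, f) rows in order."""
    counts = Counter()
    for score in scores:
        tenths = round(score * 10)  # 2.7 -> 27, avoids float comparisons
        counts[(tenths - start_tenths) // width_tenths] += 1
    rows = []
    for b in range(min(counts), max(counts) + 1):
        low = (start_tenths + b * width_tenths) / 10
        high = (start_tenths + (b + 1) * width_tenths - 1) / 10
        rows.append((low, high, counts.get(b, 0)))  # empty intervals get f = 0
    return rows

# A handful of scores taken from the data above.
sample = [2.7, 1.9, 3.1, 1.0, 3.3, 1.3, 2.2, 3.0]
rows = grouped_frequencies(sample)
for low, high, f in rows:
    print(f"{low:.1f} - {high:.1f}   {f}")
```

Intervals with no scores are kept with a frequency of 0 so that the table still shows the full range.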
• Frequency distributions can be depicted graphically in…
Bar graphs
  • Bars not touching because of discrete data
  • Nominal and ordinal data
Histograms
  • Bars touching because of continuous data
  • Interval and ratio data
Frequency polygons (single line)
  • Interval and ratio data
Shapes of Distributions
[Figure: three distributions labeled normal, positive skew, and negative skew, each marked with the mean, X̄]
Shapes of Distributions
[Figure: three distributions labeled normal, platykurtic, and leptokurtic, each marked with the mean, X̄]
What else can we do besides frequencies?
• Measures of central tendency show the central or “typical” scores in a distribution
  • Mean - the average score
  • Median - the middle score
  • Mode - the most frequent score
• The mean, median, and mode are related to the horizontal shape (skew) of the distribution.
  • In a normal distribution: Mean = Median = Mode
  • In a positively skewed distribution: Mode < Median < Mean
  • In a negatively skewed distribution: Mean < Median < Mode
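The ordering of the three measures under skew can be checked on small data sets. The two lists below are hypothetical scores invented for this sketch (they are not from the lecture); the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical scores: a long tail toward high values (positive skew).
positive_skew = [1, 2, 2, 2, 3, 3, 4, 5, 9]
# Hypothetical scores: a long tail toward low values (negative skew).
negative_skew = [1, 5, 6, 7, 7, 8, 8, 8, 9]

for label, scores in (("positive", positive_skew), ("negative", negative_skew)):
    print(label, "mode:", mode(scores), "median:", median(scores),
          "mean:", round(mean(scores), 2))
```

In the first list the extreme score of 9 pulls the mean above the median and mode; in the second the extreme score of 1 pulls it below them.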
Which measure of central tendency???
Different measures of central tendency are appropriate depending upon the level of measurement used:
Nominal: Mode
Ordinal: Mode, Median
Interval/Ratio: Mode, Median, Mean
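The lookup above could be expressed as a small table in code. This is only a sketch: the names `APPROPRIATE` and `summarize` are invented here, ordinal responses are assumed to be coded as ranks, and `multimode` (Python 3.8+) is used so that ties for the mode are all reported.

```python
from statistics import mean, median, multimode

# Measures of central tendency appropriate for each scale of measurement.
APPROPRIATE = {
    "nominal": ("mode",),
    "ordinal": ("mode", "median"),
    "interval": ("mode", "median", "mean"),
    "ratio": ("mode", "median", "mean"),
}
FUNCTIONS = {"mode": multimode, "median": median, "mean": mean}

def summarize(values, scale):
    """Compute only the measures that suit the given scale."""
    return {name: FUNCTIONS[name](values) for name in APPROPRIATE[scale]}

print(summarize(["Democrat", "Republican", "Democrat"], "nominal"))
print(summarize([1, 2, 2, 3, 5], "ordinal"))
print(summarize([2, 4, 6, 8, 10], "ratio"))
```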
The Mean
• The most informative and elegant measure of central tendency.
  • The average
  • The fulcrum point of the distribution
2 4 6 8 10
2 4 6 8 15
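The "fulcrum" idea can be made concrete: distances of scores below the mean exactly balance distances above it, so the deviations from the mean sum to zero. A minimal sketch using the two sets of scores from this slide (the helper name `arithmetic_mean` is chosen for this example):

```python
def arithmetic_mean(scores):
    """Sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)

for scores in ([2, 4, 6, 8, 10], [2, 4, 6, 8, 15]):
    m = arithmetic_mean(scores)
    deviations = [x - m for x in scores]
    print(scores, "mean =", m, "sum of deviations =", sum(deviations))
```

Changing the highest score from 10 to 15 moves the mean from 6 to 7, which is why the mean is described as sensitive to extreme scores.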
The Median
• The middlemost score in a distribution.
  • The scale value below which and above which 50% of the distribution falls
• Not the fulcrum: The halfway point
2 4 6 8 10
2 4 6 8 15
The Median
• If N is odd, then the median is the center score
2 4 6 8 10
2 4 6 8 15
• If N is even, then the median is the average of the two centermost scores
2 4 6 8 10 12
2 4 6 8 10 15
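The odd/even rule can be written out directly. The function below is a hand-rolled sketch for illustration (Python's `statistics.median` does the same job), applied to the four sets of scores shown on the slide.

```python
def median(scores):
    """Center score if N is odd; average of the two centermost if N is even."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))      # N = 5, center score
print(median([2, 4, 6, 8, 15]))      # extreme score does not move the median
print(median([2, 4, 6, 8, 10, 12]))  # N = 6, average of 6 and 8
print(median([2, 4, 6, 8, 10, 15]))
```

Unlike the mean, the median is the same whether the highest score is 10 or 15.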
The Median
• If the median occurs at a value where there are tied scores, use the tied score as the median
[Figure: dot plot of the scores 2, 4, 6, 8, 8, 10, 10, 10, 15]
The Mode
• The most frequent score in the distribution
[Figure: two dot plots of scores ranging from 2 to 15, illustrating the mode]
One more thing…
• These measures of central tendency vary in their sampling stability = match between the sample mean (e.g., x̄) and the population mean (μ).
Mode (least sampling stability) < Median < Mean (most sampling stability)
Note: Roman (r, s, x) characters are used for sample statistics while Greek (ρ, σ, μ) characters are used for population statistics.
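Sampling stability can be explored with a small simulation: draw many samples from the same population and see how much each statistic bounces around from sample to sample. Everything below is an assumption made for the sketch (a normal population with mean 100 and standard deviation 15, samples of 25, scores rounded to whole numbers before taking the mode); none of it comes from the lecture.

```python
import random
from collections import Counter
from statistics import mean, median, pstdev

def rounded_mode(sample):
    """Most frequent score after rounding to whole numbers."""
    return Counter(round(x) for x in sample).most_common(1)[0][0]

def sampling_spread(statistic, n_samples=2000, sample_size=25, seed=0):
    """Standard deviation of a statistic across repeated random samples."""
    rng = random.Random(seed)  # same seed -> same samples for every statistic
    estimates = []
    for _ in range(n_samples):
        sample = [rng.gauss(100, 15) for _ in range(sample_size)]
        estimates.append(statistic(sample))
    return pstdev(estimates)

spread_mean = sampling_spread(mean)
spread_median = sampling_spread(median)
spread_mode = sampling_spread(rounded_mode)
print("mean:", round(spread_mean, 2),
      "median:", round(spread_median, 2),
      "mode:", round(spread_mode, 2))
```

A smaller spread means the statistic varies less from sample to sample, which is what greater sampling stability looks like in practice.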
Review of central tendency
• Which one is the only appropriate measure for nominal data?
  • The mode
• How do you find the median when there is an odd number of scores?
  • Simply locate the score in the middle
• …when there is an even number of scores?
  • Average the two middle scores
• Which measure is most sensitive to extreme scores and why?
  • The mean because it takes all scores into account and can be swayed by positive or negative skew
• Which measure has the most sampling stability and why?
  • The mean because it is the most accurate representation of the overall sample
Application of central tendency
• In 2006, the median home price in Boston was $386,300. (San Francisco was $518,400; Washington, D.C. was $258,700.)
• How do you interpret these numbers?
• Why are housing prices framed in terms of the median rather than the mean or the mode?
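One way to think about the last question is to see what a single very expensive home does to each measure. The prices below are invented for illustration and are not real housing data.

```python
from statistics import mean, median

# Hypothetical sale prices: four ordinary homes and one mansion.
prices = [250_000, 300_000, 350_000, 400_000, 5_000_000]

print("median:", median(prices))
print("mean:  ", mean(prices))
```

The median stays at the middle home, while the mean lands above four of the five prices, so it no longer describes a "typical" sale.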
Measures of variability
• Measures of central tendency
  • …indicate the typical scores in a distribution
  • …are related to skew (horizontal)
• Measures of variability
  • …show the dispersion of scores in a distribution
  • …are related to kurtosis (vertical)
Measures of variability
• Range - the difference between the highest and lowest score
• Variance - the average squared variation (distance) of all the scores from the mean
• Standard deviation - the average variation (distance) of all the scores from the mean (the square root of the variance)
Measures of variability
Range = Highest Score – Lowest Score
2 4 6 8 10
2 4 6 8 15
Most sensitive to extreme scores!
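A short sketch of the range formula, applied to the two sets of scores above (the function name `score_range` is chosen here to avoid clashing with Python's built-in `range`):

```python
def score_range(scores):
    """Highest score minus lowest score."""
    return max(scores) - min(scores)

print(score_range([2, 4, 6, 8, 10]))
print(score_range([2, 4, 6, 8, 15]))
```

Only the two most extreme scores enter the calculation, so changing a single score from 10 to 15 changes the range from 8 to 13.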
Measures of variability
• Again, variance captures the overall distance of all scores from the mean (requires squaring the distance of each score from the mean)
• Not as useful as the standard deviation: the average distance scores fall from the mean
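The relationship between the two measures can be sketched as follows. The slides do not say whether to divide by N or by N - 1; this example assumes division by N (the population form), and many textbooks divide by N - 1 when estimating from a sample. The function names are chosen for this example.

```python
from math import sqrt

def variance(scores):
    """Average squared distance of the scores from the mean (divides by N)."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

def standard_deviation(scores):
    """Square root of the variance, back in the original units of the scores."""
    return sqrt(variance(scores))

for scores in ([2, 4, 6, 8, 10], [2, 4, 6, 8, 15]):
    print(scores, "variance =", variance(scores),
          "standard deviation =", round(standard_deviation(scores), 2))
```

Squaring keeps negative and positive distances from canceling out; taking the square root afterward returns the result to the scale of the scores, which is part of why the standard deviation is easier to interpret.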
Measures of variability
• Standard deviation, like the mean, is the most informative and elegant measure of variability.
  • The average distance of scores from the mean score (deviation is distance!)
2 4 6 8 10
• Also like the mean, standard deviation has the most sampling stability
How would these standard deviations differ?
[Figure: two sets of scores, one with Mean = 6 and Range = 8, the other with Mean = 7.9 and Range = 10]
Standard deviation and shape of distribution
[Figure: three distributions, each with Mean = 15 and a different standard deviation (labels include Std. Dev. = 10 and Std. Dev. = 0.9)]
Properties of Normal Distributions
• All normal distributions are single peaked, symmetric, and
bell-shaped
• Normal distributions can have different values for mean and
standard deviation but…
• All normal distributions follow the 68-95-99.7 rule
68.3% of data within 1 standard deviation of the mean
95.4% of data within 2 standard deviations of the mean
99.7% of data within 3 standard deviations of the mean
[Figure: normal curve centered on the mean, with bands marking 68.3%, 95.4%, and 99.7% of the data]
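These percentages can be checked numerically. For a normal distribution, the proportion of data within k standard deviations of the mean equals erf(k / √2), where erf is the error function in Python's `math` module. The function name `proportion_within` is chosen for this sketch.

```python
from math import erf, sqrt

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * proportion_within(k):.1f}%")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```

The result does not depend on the particular mean or standard deviation, which is why the rule holds for all normal distributions.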