Chapter 2
Turning Data Into Information
Copyright ©2004 Brooks/Cole, a division of Thomson Learning, Inc.
2.1 Raw Data
• Raw data are numbers and category labels
that have been collected but have not yet been
processed in any way.
• When measurements are taken from a subset of a
population, they represent sample data.
• When all individuals in a population are measured,
the measurements represent population data.
• Descriptive statistics: summary numbers
for either a population or a sample.
2.2 Types of Data
• Categorical variables consist of group or
category names that don’t necessarily have a
logical ordering. Examples: eye color, country of
residence.
• Categorical variables for which the categories
have a logical ordering are called ordinal
variables. Examples: highest educational degree
earned, tee shirt size (S, M, L, XL).
• Quantitative variables consist of numerical
values taken on each individual. Examples:
height, number of siblings.
Asking the Right Questions
One Categorical Variable
Question: How many and what percentage of
individuals fall into each category?
Example: What percentage of college students favor the
legalization of marijuana, and what percentage of
college students oppose legalization of marijuana?
Asking the Right Questions
Two Categorical Variables
Question: Is there a relationship between the two variables,
so that the category into which individuals fall for one
variable seems to depend on which category they are in
for the other variable?
Example: In Case Study 1.6, we asked if the risk of having a
heart attack was different for the physicians who took
aspirin than for those who took a placebo.
Asking the Right Questions
One Quantitative Variable
Question: What are the interesting summary measures, like
the average or the range of values, that help us
understand the collection of individuals who were
measured?
Example: What is the average handspan measurement, and
how much variability is there in handspan measurements?
Asking the Right Questions
One Categorical and One Quantitative Variable
Question: Are the measurements similar across
categories?
Example: Do men and women drive at the same
“fastest speeds” on average?
Question: When the categories have a natural ordering
(an ordinal variable), does the measurement variable
increase or decrease, on average, in that same order?
Example: Do high school dropouts, high school
graduates, college dropouts, and college graduates
have increasingly higher average incomes?
Asking the Right Questions
Two Quantitative Variables
Question: If the measurement on one variable is high
(or low), does the other one also tend to be high (or low)?
Example: Do taller people also tend to have larger
handspans?
Explanatory and Response Variables
Many questions are about the relationship
between two variables.
It is useful to identify one variable as the
explanatory variable and the other variable
as the response variable.
In general, the value of the explanatory variable
for an individual is thought to partially explain the
value of the response variable for that individual.
2.3 Summarizing One or Two
Categorical Variables
Numerical Summaries
• Count how many fall into each category.
• Calculate the percent in each category.
• If two variables, have the categories of
the explanatory variable define the rows
and compute row percentages.
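The three steps above can be sketched in Python. This is only an illustration: the records, the category names ("group A", "yes", and so on), and the helper `row_percentages` are made up for the sketch and do not come from any example in the chapter.

```python
from collections import Counter

# Hypothetical (made-up) records: (explanatory category, response category).
records = [
    ("group A", "yes"), ("group A", "yes"), ("group A", "yes"), ("group A", "no"),
    ("group B", "yes"), ("group B", "no"), ("group B", "no"), ("group B", "no"),
]


def row_percentages(pairs):
    """Return {row: {column: percent}}; each row's percentages sum to 100."""
    cell_counts = Counter(pairs)                    # count per (row, column) cell
    row_totals = Counter(row for row, _ in pairs)   # count per explanatory category
    table = {}
    for (row, col), count in cell_counts.items():
        table.setdefault(row, {})[col] = 100 * count / row_totals[row]
    return table


table = row_percentages(records)
for row, cells in table.items():
    print(row, cells)
```

Dividing by the row total (rather than the grand total) is what makes the rows comparable when the explanatory groups have different sizes.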
Example 2.1 Importance of Order
Survey of n = 190 college students.
About half (92) given the question:
“Randomly pick a letter --- S or Q.”
Note: 66% picked the first choice of S.
Other half (98) given the question:
“Randomly pick a letter --- Q or S.”
Note: 54% picked the first choice of Q.
Example 2.2 Lighting the Way
to Nearsightedness
Survey of n = 479 children.
Those who slept with nightlight or in fully lit
room before age 2 had higher incidence of
nearsightedness (myopia) later in childhood.
Note: Study does not prove sleeping with light
actually caused myopia in more children.
Visual Summaries
• Pie Charts: useful for summarizing
a single categorical variable if not
too many categories.
• Bar Graphs: useful for summarizing
one or two categorical variables and
particularly useful for making comparisons
when there are two categorical variables.
Example 2.3 Humans Are Not
Good Randomizers
Survey of n = 190 college students.
“Randomly pick a number between 1 and 10.”
Results: Most chose 7, very few chose 1 or 10.
Example 2.4 Revisiting Nightlights
and Nearsightedness
Survey of n = 479 children.
Response: Degree of Myopia
Explanatory: Amount of Sleeptime Lighting
2.4 Finding Information
in Quantitative Data
Long list of numbers – needs to be organized
to obtain answers to questions of interest.
Five-Number Summaries
• Find extremes (high, low),
the median, and the quartiles
(medians of lower and upper
halves of the values).
• Quick overview of the data values.
• Information about the center,
spread, and shape of data.
Example 2.5 Right Handspans
About 25% of handspans of females are
between 12.5 and 19.0 centimeters,
about 25% are between 19 and 20 cm,
about 25% are between 20 and 21 cm, and
about 25% are between 21 and 23.25 cm.
Interesting Features of
Quantitative Variables
• Location: center or average.
e.g. median
• Spread: variability
e.g. difference between two
extremes or two quartiles.
• Shape: (later in Section 2.5)
Outliers and How to Handle Them
Outlier: a data point that is not
consistent with the bulk of the data.
• Look for them via graphs.
• Can have big influence on conclusions.
• Can cause complications in some
statistical analyses.
• Cannot discard without justification.
Example 2.6 Ages of Death
of U.S. First Ladies
Partial Data Listing and five-number summary:
Extremes are more interesting here:
Who died at 34? Martha Jefferson
Who lived to be 97? Bess Truman
Possible Reasons for Outliers
and Reasonable Actions
• Mistake made while taking measurement or entering it
into computer. If verified, should be discarded/corrected.
• Individual in question belongs to a different group than
bulk of individuals measured. Values may be discarded if
summary is desired and reported for the majority group
only.
• Outlier is legitimate data value and represents natural
variability for the group and variable(s) measured.
Values may not be discarded — they provide important
information about location and spread.
2.5 Pictures for
Quantitative Data
• Histograms: similar to bar graphs, used
for any number of data values.
• Stem-and-leaf plots and dotplots:
present all individual values, useful for
small to moderate sized data sets.
• Boxplot or box-and-whisker plot:
useful summary for comparing two
or more groups.
Interpreting Histograms, Stemplots,
and Dotplots
• Values are centered around 20 cm.
• Two possible low outliers.
• Apart from outliers, spans range from about 16 to 23 cm.
Describing Shape
• Symmetric, bell-shaped
• Symmetric, not bell-shaped
• Skewed Right: values trail off
to the right
• Skewed Left: values trail off
to the left
Example 2.8 Big Music Collection
About how many CDs do you own?
Stem is ‘100s’ and leaf unit is ‘10s’.
Final digit is truncated.
Numbers ranged from 0 to about 450,
with 450 being a clear outlier and
most values ranging from 0 to 99.
The shape is skewed right.
2.6 Numerical Summaries
of Quantitative Data
Notation for Raw Data:
n = number of individuals in a data set
x1, x2, x3, …, xn represent individual raw data values
Example: A data set consists of handspan
values in centimeters for six females;
the values are 21, 19, 20, 20, 22, and 19.
Then, n = 6
x1= 21, x2 = 19, x3 = 20, x4 = 20, x5 = 22, and x6 = 19
Describing the Location
of a Data Set
• Mean: the numerical average
• Median: the middle value (if n odd)
or the average of the middle two
values (n even)
Symmetric: mean = median
Skewed Left: mean < median
Skewed Right: mean > median
Determining the Mean and Median
The Mean
$\bar{x} = \frac{\sum x_i}{n}$
where $\sum x_i$ means “add together all the values”
The Median
If n is odd: M = middle of ordered values.
Count (n + 1)/2 down from top of ordered list.
If n is even: M = average of middle two ordered values.
Average values that are (n/2) and (n/2) + 1
down from top of ordered list.
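A minimal Python sketch of these two rules, applied to the six handspan values from Section 2.6 (21, 19, 20, 20, 22, 19). The function names `mean` and `median` are chosen for the sketch; the counting positions follow the slide, shifted by one because Python lists are indexed from 0.

```python
def mean(values):
    """Add together all the values and divide by n."""
    return sum(values) / len(values)


def median(values):
    """Middle ordered value if n is odd, else average of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Position (n + 1)/2 in 1-based counting.
        return ordered[(n + 1) // 2 - 1]
    # Positions n/2 and n/2 + 1 in 1-based counting.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


handspans = [21, 19, 20, 20, 22, 19]
print(mean(handspans))    # 121 / 6, about 20.17 cm
print(median(handspans))  # ordered: 19 19 20 20 21 22, so (20 + 20) / 2
```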
The Influence of Outliers
on the Mean and Median
Larger influence on mean than median.
High outliers will increase the mean.
Low outliers will decrease the mean.
If ages at death are: 70, 72, 74, 76, and 78
then mean = median = 74 years.
If ages at death are: 35, 72, 74, 76, and 78
then median = 74 but mean = 67 years.
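The two sets of ages can be checked with Python's standard `statistics` module; the variable names are chosen for this sketch.

```python
from statistics import mean, median

ages = [70, 72, 74, 76, 78]
ages_with_low_outlier = [35, 72, 74, 76, 78]  # 70 replaced by 35

# Without the outlier the mean and median agree.
print(mean(ages), median(ages))

# The low outlier pulls the mean down but leaves the median unchanged.
print(mean(ages_with_low_outlier), median(ages_with_low_outlier))
```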
Describing Spread: Range
and Interquartile Range
• Range = high value – low value
• Interquartile Range (IQR) =
upper quartile – lower quartile
• Standard Deviation
(covered later in Section 2.7)
Example 2.10 Fastest Speeds Ever Driven
Five-Number Summary for 87 males
• Median = 110 mph measures the center of the data
• Two extremes describe spread over 100% of data
Range = 150 – 55 = 95 mph
• Two quartiles describe spread over middle 50% of data
Interquartile Range = 120 – 95 = 25 mph
Notation and Finding the Quartiles
Split the ordered values into the half
that is below the median and the half
that is above the median.
Q1 = lower quartile
= median of data values
that are below the median
Q3 = upper quartile
= median of data values
that are above the median
Example 2.10 Fastest Speeds (cont)
Ordered Data (in rows of 10 values) for the 87 males:
55 60 80 80 80 80 85 85 85 85
90 90 90 90 90 92 94 95 95 95
95 95 95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150
• Median = (87+1)/2 = 44th value in the list = 110 mph
• Q1 = median of the 43 values below the median =
(43+1)/2 = 22nd value from the start of the list = 95 mph
• Q3 = median of the 43 values above the median =
(43+1)/2 = 22nd value from the end of the list = 120 mph
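The same calculation can be sketched in Python using the 87 speeds listed above. The helper names are made up for this sketch, and the quartiles follow the slide's rule (medians of the lower and upper halves); statistical libraries often use other quartile conventions and may give slightly different answers on other data.

```python
speeds = [int(v) for v in """
55 60 80 80 80 80 85 85 85 85
90 90 90 90 90 92 94 95 95 95
95 95 95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150
""".split()]


def median_of_ordered(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (low, Q1, median, Q3, high) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]         # values below the median position
    upper_half = ordered[(n + 1) // 2:]    # values above the median position
    return (ordered[0], median_of_ordered(lower_half),
            median_of_ordered(ordered), median_of_ordered(upper_half),
            ordered[-1])


print(five_number_summary(speeds))
```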
Percentiles
The kth percentile is a number that has
k% of the data values at or below it and
(100 – k)% of the data values at or above it.
• Lower quartile = 25th percentile
• Median = 50th percentile
• Upper quartile = 75th percentile
Picturing Location
and Spread with Boxplots
Boxplots for right handspans
of males and females.
• Box covers the middle
50% of the data
• Line within box marks
the median value
• Possible outliers are
marked with asterisk
How to Draw a Boxplot
of a Quantitative Variable
Step 1: Label either a vertical axis or a horizontal axis
with numbers from min to max of the data.
Step 2: Draw box with lower end at Q1 and upper end at Q3.
Step 3: Draw a line through the box at the median M.
Step 4: Draw a line from Q1 end of box to smallest data
value that is not further than 1.5 × IQR from Q1.
Draw a line from Q3 end of box to largest data value
that is not further than 1.5 × IQR from Q3.
Step 5: Mark data points further than 1.5 × IQR from either
edge of the box with an asterisk. Points represented
with asterisks are considered to be outliers.
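Steps 4 and 5 can be sketched in Python. As an illustration, the sketch reuses the 87 fastest-speed values from Example 2.10 with Q1 = 95 and Q3 = 120; the function name `whiskers_and_outliers` is made up for the sketch, and it only computes the positions, it does not draw anything.

```python
speeds = [int(v) for v in """
55 60 80 80 80 80 85 85 85 85
90 90 90 90 90 92 94 95 95 95
95 95 95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150
""".split()]


def whiskers_and_outliers(values, q1, q3):
    """Return (lower whisker end, upper whisker end, outliers)."""
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr    # furthest a whisker may reach below the box
    high_limit = q3 + 1.5 * iqr   # furthest a whisker may reach above the box
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = [v for v in values if v < low_limit or v > high_limit]
    return min(inside), max(inside), outliers


low_end, high_end, outliers = whiskers_and_outliers(speeds, q1=95, q3=120)
print(low_end, high_end, outliers)
```

With IQR = 25 the limits are 57.5 and 157.5 mph, so under this rule the 55 mph value is marked as an outlier and the whiskers end at 60 and 150 mph.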
2.7 Bell-Shaped Distributions
of Numbers
Many measurements follow a predictable pattern:
• Most individuals are clumped around the center
• The greater the distance a value is from the
center, the fewer individuals have that value.
Variables that follow such a pattern are said
to be “bell-shaped”. A special case is called
a normal distribution or normal curve.
Example 2.11 Bell-Shaped
British Women’s Heights
Data: representative sample of 199 married British couples.
A histogram of the wives’ heights is shown with a normal
curve superimposed. The mean height = 1602 millimeters.
Describing Spread
with Standard Deviation
Standard deviation measures variability
by summarizing how far individual
data values are from the mean.
Think of the standard deviation as
roughly the average distance
values fall from the mean.
Describing Spread
with Standard Deviation
Both sets have same mean of 100.
Set 1: all values are equal to the mean so there is
no variability at all.
Set 2: one value equals the mean and other four values
are 10 points away from the mean, so the average
distance away from the mean is about 10.
Calculating the Standard Deviation
Formula for the (sample) standard deviation:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
The value of $s^2$ is called the (sample) variance.
An equivalent formula, easier to compute, is:
$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$
Calculating the Standard Deviation
Consider four pulse rates: 62, 68, 74, 76
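Step 1: $\bar{x} = \frac{62 + 68 + 74 + 76}{4} = \frac{280}{4} = 70$
Steps 2 and 3: [table of deviations from the mean and their squares; the squared deviations sum to 120]
Step 4: $s^2 = \frac{120}{4 - 1} = 40$
Step 5: $s = \sqrt{40} = 6.3$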
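The five steps can be traced in Python for the four pulse rates. The labels for Steps 2 and 3 (deviations, then squared deviations) are an assumption based on the standard deviation formula, since the slide's table for those steps is not in the transcript. The last line checks the equivalent formula.

```python
import math

pulse = [62, 68, 74, 76]
n = len(pulse)

mean = sum(pulse) / n                        # Step 1: the mean
deviations = [x - mean for x in pulse]       # Step 2 (assumed): value minus mean
squared = [d ** 2 for d in deviations]       # Step 3 (assumed): squared deviations
variance = sum(squared) / (n - 1)            # Step 4: sample variance s^2
s = math.sqrt(variance)                      # Step 5: sample standard deviation

# Equivalent formula: (sum of x_i^2 - n * mean^2) / (n - 1)
shortcut_variance = (sum(x ** 2 for x in pulse) - n * mean ** 2) / (n - 1)

print(mean, sum(squared), variance, round(s, 1), shortcut_variance)
```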
Population Standard Deviation
Data sets usually represent a sample from a larger
population. If the data set includes measurements for
an entire population, the notations for the mean and
standard deviation are different, and the formula for
the standard deviation is also slightly different.
A population mean is represented by the symbol μ
(“mu”), and the population standard deviation is
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$
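To see the effect of dividing by n instead of n − 1, the sketch below applies both formulas to the four pulse rates from the previous calculation. Treating those four values as an entire population is purely hypothetical, done here only to compare the two results. `statistics.stdev` and `statistics.pstdev` are the standard-library sample and population versions.

```python
from statistics import pstdev, stdev

pulse = [62, 68, 74, 76]

sample_sd = stdev(pulse)       # divides by n - 1: sqrt(120 / 3)
population_sd = pstdev(pulse)  # divides by n:     sqrt(120 / 4)

print(round(sample_sd, 2), round(population_sd, 2))
```

The population version is always the smaller of the two, and the gap shrinks as n grows.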
Interpreting the Standard Deviation
for Bell-Shaped Curves:
The Empirical Rule
For any bell-shaped curve, approximately
• 68% of the values fall within 1 standard
deviation of the mean in either direction
• 95% of the values fall within 2 standard
deviations of the mean in either direction
• 99.7% of the values fall within 3 standard
deviations of the mean in either direction
The Empirical Rule, the Standard
Deviation, and the Range
• Empirical Rule => the range from the
minimum to the maximum data values equals
about 4 to 6 standard deviations for data with
an approximate bell shape.
• You can get a rough idea of the value of the
standard deviation by dividing the range by 6.
$s \approx \frac{\text{Range}}{6}$
Example 2.11 Women’s Heights (cont)
Mean height for the 199 British women is 1602 mm
and standard deviation is 62.4 mm.
• 68% of the 199 heights would fall in the range
1602 ± 62.4, or 1539.6 to 1664.4 mm
• 95% of the heights would fall in the interval
1602 ± 2(62.4), or 1477.2 to 1726.8 mm
• 99.7% of the heights would fall in the interval
1602 ± 3(62.4), or 1414.8 to 1789.2 mm
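The three intervals can be reproduced with a short Python sketch; the function name `empirical_rule_intervals` is made up for this illustration.

```python
def empirical_rule_intervals(mean, sd):
    """Return {k: (mean - k*sd, mean + k*sd)} for k = 1, 2, 3 standard deviations."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


intervals = empirical_rule_intervals(mean=1602, sd=62.4)
for k, (low, high) in intervals.items():
    print(f"within {k} SD: {low:.1f} to {high:.1f} mm")
```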
Example 2.11 Women’s Heights (cont)
Summary of the actual results:
Note: The minimum height = 1410 mm and the maximum
height = 1760 mm, for a range of 1760 – 1410 = 350 mm.
So an estimate of the standard deviation is:
$s \approx \frac{\text{Range}}{6} = \frac{350}{6} = 58.3$ mm
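As a quick check, the estimate can be computed and compared with the standard deviation of 62.4 mm reported for these heights. The function name `sd_from_range` is made up for this sketch.

```python
def sd_from_range(minimum, maximum, divisor=6):
    """Rough standard deviation estimate for bell-shaped data: range / 6."""
    return (maximum - minimum) / divisor


estimate = sd_from_range(1410, 1760)
reported_sd = 62.4  # standard deviation given for the 199 heights

print(round(estimate, 1), round(reported_sd - estimate, 1))
```

The estimate is about 4 mm below the reported value, which is consistent with the rule being described as only a rough idea.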
Standardized z-Scores
Standardized score or z-score:
$z = \frac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$
Example: Mean resting pulse rate for adult men is 70
beats per minute (bpm), standard deviation is 8 bpm.
The standardized score for a resting pulse rate of 80:
$z = \frac{80 - 70}{8} = 1.25$
A pulse rate of 80 is 1.25 standard deviations
above the mean pulse rate for adult men.
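The same calculation as a small Python function; the name `z_score` is chosen for this sketch.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / sd


z = z_score(observed=80, mean=70, sd=8)
print(z)  # positive: above the mean; a negative z would be below the mean
```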
The Empirical Rule Restated
For bell-shaped data,
• About 68% of the values have
z-scores between –1 and +1.
• About 95% of the values have
z-scores between –2 and +2.
• About 99.7% of the values have
z-scores between –3 and +3.