Chapter 5: Understanding and comparing distributions
This chapter continues the previous chapter, concentrating on the comparison
of distributions.
The boxplot is a particularly useful visual tool for comparison; it is a graphical display of
a 5-number summary with one modification: outliers are identified. Boxplots are useful for
visually displaying the center, spread, range, and any outliers of a distribution. They are
also useful for comparing several distributions simultaneously. The components are shown:
[Figure: boxplot with its components labeled: outlier, whiskers, Q1, M (median), Q3.]
• The central box shows Q1 , the median, and Q3 .
• The whiskers extend to the most extreme values that are within the fences.
• The lower fence is Q1 − 1.5 × IQR and the upper fence is Q3 + 1.5 × IQR.
• The fences are not shown; instead, the whiskers are drawn to the largest datum less than the
upper fence and the smallest datum greater than the lower fence.
• Any points outside the fences are outliers and are plotted individually.
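The components listed above can be computed directly. The following is a minimal Python sketch, not the method used to draw the figures in these notes. The function name boxplot_summary and the sample values are hypothetical, and the quartile convention (the "inclusive" method of statistics.quantiles) is an assumption; textbooks and software packages compute quartiles in slightly different ways.

```python
import statistics

def boxplot_summary(values):
    """Return quartiles, fences, whisker ends, and outliers for a sample."""
    data = sorted(values)
    # Assumption: the "inclusive" quartile convention; other conventions
    # can give slightly different Q1 and Q3 for small samples.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data values that lie within the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    # Points outside the fences are outliers and would be plotted individually.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "lower_whisker": min(inside), "upper_whisker": max(inside),
        "outliers": outliers,
    }

# Hypothetical sample, invented for illustration (not data from this chapter).
sample = [2, 4, 5, 6, 7, 8, 9, 10, 30]
summary = boxplot_summary(sample)
print(summary)
```

For this sample the quartiles are 5, 7, and 9, so the fences fall at −1 and 15; the whiskers stop at 2 and 10, and the value 30 is flagged as an outlier.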
Example: Was Tyrannosaurus Rex cold-blooded?
The question may be answered by examining the isotopic concentrations of oxygen in bone phosphate of fossilized skeletons. Isotopic absorption depends upon body temperature, implying that there will be variation in the mean isotopic concentration among bones from different locations if the animal is cold-blooded. On the other hand, there are only minor temperature variations in warm-blooded animals. Substantial variation in the mean isotopic concentration among different bones constitutes evidence supporting the position that T. Rex was cold-blooded.
[Figure: boxplots of isotopic concentration (axis ticks at 11.0, 11.5, 12.0) by bone: Vert2, Vert1, Tibia, Rib, Phal., PCau., Meta., MCau., Gast 2, Gast 1, Femur, DCau.]
The data shown in the figure are isotopic concentrations from different bones of a single T. Rex fossil.[1] Between 3 and 6 observations were obtained from each bone. The boxplots indicate that there are substantial differences among bones with respect to the mean isotopic concentrations of oxygen. However, the data are few in number (per bone), and so we need a quantitative measure of the strength of evidence supporting the position that there are differences among bones in isotope concentration.[2]
Outlier Detection using the IQR: A common rule for detecting outlying values is called
the 1.5 IQR rule: values more than 1.5 × IQR greater than Q3 or more than 1.5 × IQR less than
Q1 are outliers. Hence, outliers lie outside the interval:
[Q1 − 1.5 × IQR, Q3 + 1.5 × IQR]
Histograms and boxplots yield similar, but not identical impressions of the same data. Below
are histograms and boxplots representing the same data.
[Figure: paired histograms and boxplots for four distribution shapes: skewed right, skewed left, unimodal symmetric, and bimodal.]
[1] See Ramsey, F.L., Schafer, D.W. The Statistical Sleuth, 2nd Ed., p. 146.
[2] Discussed in STAT 452.
Remark: The boxplots show that Q3 is farther from M than Q1 is if the distribution is skewed
right. Q3 is closer to M than Q1 is if the distribution is skewed left. Q1 and Q3 are equidistant
from M if the distribution is symmetric.
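The remark can be turned into a rough numerical check by comparing the distances Q3 − M and M − Q1. The sketch below is illustrative only: the function name quartile_skew_direction, the tolerance parameter, and the three samples are hypothetical, and the quartile convention is an assumption. With real data the two distances are rarely exactly equal, so a positive tolerance would normally be needed before calling a distribution symmetric.

```python
import statistics

def quartile_skew_direction(values, tolerance=0.0):
    """Classify skew by comparing Q3 - M with M - Q1 (a rough heuristic)."""
    # Assumption: "inclusive" quartile convention; others differ slightly.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper = q3 - median   # distance from the median up to Q3
    lower = median - q1   # distance from the median down to Q1
    if upper - lower > tolerance:
        return "right"
    if lower - upper > tolerance:
        return "left"
    return "symmetric"

# Hypothetical samples, invented for illustration.
right_sample = [1, 2, 3, 4, 6, 9, 14, 22, 40]
left_sample = [-v for v in right_sample]      # mirror image of the first sample
symmetric_sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for name, data in [("right_sample", right_sample),
                   ("left_sample", left_sample),
                   ("symmetric_sample", symmetric_sample)]:
    print(name, quartile_skew_direction(data))
```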
Example: Female mice (n = 349) were randomly assigned to six treatment groups to investigate whether restricting dietary intake increases life expectancy.[3] Diet treatments were:
1. NP: mice ate an unlimited amount of a nonpurified, standard diet.
2. N/N85: mice fed normally before and after weaning. After weaning, the ration was controlled at 85 kcal/wk.
3. N/R50: normal diet before weaning and reduced calorie diet (50 kcal/wk) after weaning.
4. R/R50: reduced calorie diet of 50 kcal/wk both before and after weaning.
5. lopro: low protein diet before weaning, restricted diet (50 kcal/wk) after weaning, and dietary protein content decreased with advancing age.
6. N/R40: normal diet before weaning and reduced diet (40 kcal/wk) after weaning.
[Figure: boxplots of lifetime (months) for the six diet groups, in the order NP, N/N85, lopro, N/R50, R/R50, N/R40.]
Which group had the greatest spread?[4] The greatest median lifetime?[5] The smallest median?[6] What
is the predominant direction of skew?[7] Which group had the greatest number of outliers? The least?[8]
Time Plots: Time series plots are useful graphical descriptions for quantitative variables
collected over time. Below and to the left is a set of time plots showing the annual number
of TB cases per 100,000 persons from 1982 to 2005.
In three countries, there is a slow but consistent decline in the number of cases. In Armenia, there has been an increase since the fall of the Soviet Union (1991). The trend in Angola is very unusual; the data may be incorrect because of problems in monitoring and reporting TB cases during the civil war (1995 to 1997). HIV/AIDS is also potentially responsible for the increase.
[3] Weindruch, R., Walford, R.L., Fligiel, S. and Guthrie, D. (1986). The Retardation of Aging in Mice by Dietary Restriction: Longevity, Cancer, Immunity and Lifetime Energy Intake, Journal of Nutrition 116(4):641-54.
[4] lopro, using the IQR as the measure of spread.
[5] N/R40
[6] NP
[7] left
[8] N/N85 and lopro, respectively.
[Figure: time plots of the TB rate per 100,000 by year. Left panel: Angola, Argentina, Armenia, Germany, United States, India. Right panel: Zimbabwe, Botswana, South Africa.]
The plot to the right (and above) shows three countries from southern Africa. The data
are plotted differently: individual points are shown and a smooth or smoother is graphed.
The smooth shows the general trend of the data more clearly than the individual points.
It is apparent that the TB rate has increased dramatically, most likely attributable to the
development of HIV/AIDS epidemics.
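The notes do not say which smoother was used for the plot. As a stand-in, the sketch below uses a centered moving average, one of the simplest smoothers, to show how a smooth suppresses year-to-year noise. The function name moving_average, the window parameter, and the yearly rates are all hypothetical; the rates are invented and are not the TB data.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks near the ends of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighborhood = values[lo:hi]
        # Each smoothed value is the mean of the point and its neighbors.
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed

# Hypothetical yearly rates, invented for illustration only.
rates = [10, 14, 11, 17, 15, 21, 19]
trend = moving_average(rates, window=3)
print(trend)
```

The raw series rises and falls from year to year, while the smoothed values, apart from a small dip at the second point, climb steadily; that upward trend is the kind of feature a smooth is meant to expose.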
Transformations (re-expressions) of data: Sometimes, when the distribution of a quantitative variable is skewed, the data values can be transformed to a different scale to make
the distribution more symmetric. Symmetrically distributed data are, loosely speaking, a
prerequisite for accurate hypothesis tests and confidence intervals.
Example: Ozone pollution is believed to be a source of mortality, increasing the yearly risk of
death from respiratory diseases by 40% to 50% in heavily polluted cities like Los Angeles and
Riverside and by about 25% throughout the rest of the country.[9] Consequently, forecasting
daily ozone levels in large cities is an important use of statistics. The data shown in the
histogram are n = 366 measurements of the daily maximum one-hour-average ozone reading
from 1976 in Los Angeles.
[9] Maugh, T.H. L.A. Times, Low-level ozone exposure found to be lethal over time, March 12, 2009.
[Figure: histograms (frequency) of the daily maximum one-hour-average ozone reading, on the original scale and on the log scale.]
The panels show the data on the original scale and after transformation to a log scale. If the original observations are denoted by y_1, ..., y_366, then the log-transformed data are log(y_1), ..., log(y_366).
The log-transformed data are more symmetric in distribution. Two logarithm transformations are in common use. The common logarithm uses 10 as the base, whereas the natural logarithm uses
Euler’s number e = 2.718282... as the base.
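A small numerical sketch may help show what the transformation does. The sample below is hypothetical (it is not the ozone data) and was chosen so that the arithmetic is easy to follow. The gap between the mean and the median is used here as an informal indicator of skew; that choice is an assumption of this sketch, not a method from the notes. The last lines check that the natural and common logarithms differ only by the constant factor ln(10), so either base gives a distribution of the same shape.

```python
import math
import statistics

# Hypothetical right-skewed sample, invented for illustration.
y = [1, 10, 100, 1000, 10000]

log10_y = [math.log10(v) for v in y]   # common logarithm, base 10
ln_y = [math.log(v) for v in y]        # natural logarithm, base e

# Informal skew indicator: a mean far above the median suggests right skew.
gap_raw = statistics.mean(y) - statistics.median(y)
gap_log = statistics.mean(log10_y) - statistics.median(log10_y)
print("mean - median, original scale:", gap_raw)
print("mean - median, log10 scale:   ", gap_log)

# ln(y) = ln(10) * log10(y), so the two log scales differ by a constant factor.
# The first value is skipped because both logs of 1 are 0.
ratios = [ln / lg for ln, lg in zip(ln_y[1:], log10_y[1:])]
print("ln / log10 ratios:", ratios)
```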
The common logarithm of a positive number y is the exponent x to which 10 must be
raised to obtain y. Thus, y = 10^x. Some examples are
log10(10) = 1 since 10^1 = 10
log10(100) = 2 since 10^2 = 100.
Viewed another way, the log base 10 function extracts the exponent on the 10. Similarly,
log10(1000) = 3.
Another example is log10(500) ≈ 2.699 since 10^2.699 ≈ 500.
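The examples above can be checked with Python's standard math module. In this brief sketch the helper name common_log is hypothetical; it wraps math.log10 only to make the restriction to positive numbers explicit.

```python
import math

def common_log(y):
    """Return the exponent x such that 10**x equals y (y must be positive)."""
    if y <= 0:
        raise ValueError("the logarithm is defined only for positive numbers")
    return math.log10(y)

for y in (10, 100, 1000, 500):
    x = common_log(y)
    # Raising 10 to the power x recovers y, up to floating-point rounding.
    print(f"log10({y}) = {x:.3f}, and 10**{x:.3f} = {10 ** x:.1f}")
```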
Remarks:
1. The effect of transforming to the logarithmic scale is to shrink the right tail of a
distribution towards the median.
2. Log functions cannot be applied to zero or negative numbers; for instance, log10(−2) is undefined. If a log transformation is to be applied to a data set with zero or negative values, an
expedient solution is to add a constant c to all values (this is called shifting the data).
The constant c is chosen so that all values are positive after shifting.
3. Logarithmic transformations preserve the original order of the observations. That is,
if y_i < y_j, then log10(y_i) < log10(y_j), and if log10(y_i) < log10(y_j), then y_i < y_j. Order-preserving
transformations are nearly always better than a transformation that is not
order preserving. The transformation y → y^2 is not order preserving since −2 < −1
but (−2)^2 > (−1)^2.
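Remarks 2 and 3 can be illustrated together. In the sketch below, the function names shifted_log10 and preserves_order, the margin parameter, and the data values are hypothetical. Choosing c so that the smallest shifted value equals 1 is an assumption made only for this example; the notes require only that all shifted values be positive.

```python
import math

def shifted_log10(values, margin=1.0):
    """Shift the data by a constant c so all values are positive, then take log10.

    Assumption: c is chosen so the smallest shifted value equals `margin`.
    """
    smallest = min(values)
    c = margin - smallest if smallest <= 0 else 0.0
    return c, [math.log10(v + c) for v in values]

def preserves_order(values, transform):
    """True if a < b always implies transform(a) < transform(b) on these values."""
    ordered = sorted(set(values))
    images = [transform(v) for v in ordered]
    return all(a < b for a, b in zip(images, images[1:]))

# Hypothetical data containing zero and negative values.
data = [-2, -1, 0, 3, 10]
c, logged = shifted_log10(data)
print("shift constant c:", c)
print("log10 of shifted data:", logged)

# The shifted log keeps the original ordering; squaring does not.
print(preserves_order(data, lambda v: math.log10(v + c)))
print(preserves_order(data, lambda v: v ** 2))
```

Squaring fails the check for exactly the reason given in the remark: −2 < −1, yet 4 > 1.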