Measures of Location and Variability
Spring, 2009
Skill set:
You should know the definitions of the major measures of location (mean, median,
mode, geometric mean) and variability (standard deviation, variance, standard error of
the mean, skewness and kurtosis).
You should know how the descriptive statistics change when each observation is shifted by a constant c or multiplied by a constant c:

  Descriptive statistic     x_i       x_i + c      c·x_i
  Mean                      x̄         x̄ + c        c·x̄
  Variance                  s²        s²           c²s²
  Standard deviation        s         s            |c|s

  (|c| means the absolute value of c.)
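As a quick numerical check of this table, here is a sketch in Stata (it uses the 10-observation dbp variable created later in this handout; the constants 5 and 3 are arbitrary):

. gen dbpplus5 = dbp + 5       // shift each observation by c = 5
. gen dbptimes3 = 3*dbp        // rescale each observation by c = 3
. sum dbp dbpplus5 dbptimes3   // the mean shifts by 5 and then triples; the SD is unchanged and then tripled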
You should be able to use Stata to graph histograms and box plots. You should know
how to use the help menu.
Outline
  Scales of measurement
  Measures of Location
    Mean
    Median
    Mode
    Geometric Mean
    Properties of Means
  Stata commands used:
    Dropdown menus
    log using
    describe (des)
    summarize (sum)
    generate (gen)
    codebook
    label
    display (di)
    list
    ameans
  Measures of spread or variability
    Range
    Percentiles
    Interquartile range
    Variance
    Standard deviation
    Standard error of the mean
    Kurtosis
    Skewness
    Definition of whiskers
  Dropdown menus
  Box plots
Dataset used:
weight.dta
Scales used with data:
Four scales are used with variables: nominal, ordinal, interval and ratio.
nominal - the variable has no order, just category names
Gender (male, female) and hypertensive (yes, no) are examples
ordinal - the variable can be rank ordered but there is no consistent distance between
the categories
Income scaled as low, medium and high is an example. We know that
someone in the category low has a smaller income than someone in the
category high but we don’t know how much smaller.
Is the distance between low and medium the same as the distance between
medium and high? We just know the order not the difference or distance
between categories.
interval and ratio - both of these are scales of equally spaced units (i.e. consistent
distances) like height in inches.
A difference between the two scales is that variables on the ratio scale have a zero point that can be interpreted as "there is none of the quantity being measured," whereas variables on the interval scale do not have such a zero point.
Height is on the ratio scale and 0 inches tall means there is no height.
The Celsius scale is on the interval scale but not the ratio scale. Zero
degrees Celsius does not mean there is no heat.
In order to be on the ratio scale, the ratio of two numbers has to make sense.
A person 140 cm tall is twice as tall as one 70 cm tall. An oven at 300
degrees Celsius is not twice as hot as one at 150 degrees Celsius.
Measures of location:
We will consider several measures of location. The mean, which we consider first, is
the most commonly used measure of location.
Mean:
If the sample consists of n points x_1, x_2, x_3, ..., x_n, then the mean (x̄) is defined as

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

This is just the arithmetic mean of the n values. In order to calculate a mean, the variable has to be at least on the interval scale.
We will create and use the small data set "smalldbp.dta" with the diastolic blood pressures of 10 people to illustrate means. We will follow the steps in the picture below.
1) We click on the log button, which opens the "Begin logging Stata output" menu.
2) We select the folder in which we wish to save our log file (i.e. "Chapter2").
3) We tell Stata we want a "log" type of log file rather than the "smcl" type of log file.
4) We give our log file a name (smalldbp.log).
5) We save our log file to "Chapter2".
6) The results of 1 - 5.
Page -2-
6)
. log using "W:\WP51\Biometry\AAAABiostatFall2007\Data\Chapter2\smalldbp.log"
-----------------------------------------------------------------------------log: W:\WP51\Biometry\AAAABiostatFall2007\Data\Chapter2\smalldbp.log
log type: text
opened on: 29 Aug 2007, 18:49:36
“log on (text)”
tells you that you have a log file
running and that it is text as
opposed to smcl
We are going to enter our data using the data editor. Entering data here is just like entering data in Excel.
(1) I click on the data editor button (the highlighted button below), which brings up the Data Editor. I then just type in an ID variable and 10 diastolic blood pressures (DBP). (2) I preserve the data so I won't lose it, and (3) I close the data editor because Stata won't let me type on the command line while the data editor is open.
In the Introduction to Stata handout I show you how to use the dropdown menus to give the variables names other than var1 and var2 and to give the variables descriptive labels. Here I am just going to type the appropriate commands on the command line.

- preserve
. rename var1 id
. label variable id "Unique Identifier"
. rename var2 dbp
. label variable dbp "Diastolic Blood Pressure in mm Hg"
. des

Contains data
  obs:            10
 vars:             2
 size:            60 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              byte   %8.0g                  Unique Identifier
dbp             byte   %8.0g                  Diastolic Blood Pressure in mm Hg
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved

"des" is short for describe.
The mean diastolic blood pressure of these 10 people is:

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{90+85+100+87+92+78+80+96+93+99}{10} = \frac{900}{10} = 90.0$$
It is customary to write the value for the mean to one more decimal place than the
original data. The original DBP’s are integers so I report the mean of the DBP’s as
90.0. We usually report the standard deviation to two decimal places beyond the
original data (7.51).
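Since display works as a calculator, you can reproduce this calculation as a quick check (a sketch using the 10 DBP values entered above):

. di (90+85+100+87+92+78+80+96+93+99)/10
90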
The easy way to get the mean is to just type "sum dbp" or, for more information, "sum dbp, det", where sum is short for summarize and det is short for detail. The results are below.

. sum dbp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         dbp |        10          90    7.512952         78        100
. sum dbp,det

                Diastolic Blood Pressure in mm Hg
-------------------------------------------------------------
      Percentiles      Smallest
 1%           78             78
 5%           78             80
10%           79             85       Obs                  10
25%           85             87       Sum of Wgt.          10

50%           91                      Mean                 90
                        Largest       Std. Dev.      7.512952
75%           96             93
90%         99.5             96       Variance       56.44444
95%          100             99       Skewness       -.248569
99%          100            100       Kurtosis       1.914099
To use dropdown menus to do the same thing see the back of this handout.
Graph #1 based on original set of 10 DBP values.
The mean can be thought of as the center of gravity: if you hang weights of equal size off each sample point, the mean is the balance point.
Advantages of using the mean:
it uses all the observations in the sample
each sample has a unique mean
A disadvantage of using the mean is that it is sensitive to extreme values (and the
smaller the sample, the more impact the extreme values have).
Below I create a new variable which is equal to the old variable dbp except the value 99
is changed to 130 (we’ll call this set of 10 values the newdbp). Note that this changes
the mean of the sample from 90.0 to 93.1 (see graph below to understand how the
center of gravity has changed just by changing one value).
. gen newdbp = dbp
. replace newdbp = 130 if dbp == 99
(1 real change made)

"gen" is short for generate.

. sum newdbp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      newdbp |        10        93.1    14.64734         78        130
Graph #2 is based on the set of 10 DBP values with 99 replaced by 130.
Notice that the mean is pulled from 90.0 to 93.1 (i.e. the mean is pulled toward the outlying value).

. save smalldbp.dta
file smalldbp.dta saved

. log close
       log:  W:\WP51\Biometry\AAAABiostatFall2007\Data\Chapter2\smalldbp.log
  log type:  text
 closed on:  29 Aug 2007, 20:29:53
The largest value for baseline cholesterol in the dataset weight.dta is 412. Try changing
that to 1500 and comparing the mean of the original sample with the mean of the
changed sample. Notice that there are 10,273 participants with baseline cholesterol
values but there are 10,355 participants in the dataset.
The way to create the new DBP variable with dropdown menus is given at the back of
the handout.
When we study the Central Limit Theorem, we will find that the mean has some nice
properties that allow us to get confidence intervals and do hypothesis testing.
The type of data needed to calculate a mean is interval (i.e. you have to have the ability
to divide and still have a legitimate observation). So we calculate means for variables
such as age and diastolic blood pressure (i.e. continuous variables).
Median:
If the sample contains an odd number of observations, the median is the middle
observation provided the sample is ordered from smallest to largest.
If the sample contains an even number of observations, the median is the average of
the two middle observations given that the sample is ordered from smallest to largest.
You can see that this definition makes the median such that an equal number of points
are greater than or equal to and less than or equal to the median.
An advantage for the median over the mean is that the median is not sensitive to
extreme values. Notice that both the variable dbp and the variable newdbp have the
same median, but not the same mean. The median is the 50th percentile.
              dbp     newdbp
  Median       91         91
  Mean       90.0       93.1
. sum(dbp),det          (original set of 10 values for DBP)

                 Diastolic Blood Pressure (dbp)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           78             78
 5%           78             80
10%           79             85       Obs                  10
25%           85             87       Sum of Wgt.          10

50%           91                      Mean                 90
                        Largest       Std. Dev.      7.512952
75%           96             93
90%         99.5             96       Variance       56.44444
95%          100             99       Skewness       -.248569
99%          100            100       Kurtosis       1.914099
Note that in the Stata output below the 50th percentile is the median and that although
the largest value changes from 100 to 130 the median remains the same.
. sum(newdbp),det       (new version of DBP with 99 changed to 130)

-------------------------------------------------------------
      Percentiles      Smallest
 1%           78             78
 5%           78             80
10%           79             85       Obs                  10
25%           85             87       Sum of Wgt.          10

50%           91                      Mean               93.1
                        Largest       Std. Dev.      14.64734
75%           96             93
90%          115             96       Variance       214.5444
95%          130            100       Skewness       1.644196
99%          130            130       Kurtosis       5.212837
Another advantage for the median is that each sample has a unique median.
A disadvantage for the median is that it does not utilize all the data in the sample.
In order to obtain a median, the data has to be on at least the ordinal scale (i.e. you can
order the observations).
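If all you want is the median, Stata's centile command reports any requested percentile directly (a sketch using the dbp variable from above):

. centile dbp, centile(50)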
When should we use the mean and when should we use the median? The cartoon
below sort of gives the correct answer.
Mode:
The mode is the most frequently occurring value in a set of observations.
A disadvantage for the mode is that not all samples have a mode and some samples
have multiple modes.
Sample 1 = {1,2,3,4,5,6,7,8,9,10} has no mode.
Sample 2 = {1,1,1,2,3,4,4,4,5} has modes 1 and 4.
Sample 3 = {M, F, F, F, M, M, M, F, F, F} has mode F where M = male and F = female.
The mode can be calculated with data on the nominal scale (i.e. all you have to be able
to do is categorize each observation). The mode will not come up again in this course
unless it is in a discussion of a bimodal distribution because it is not amenable to
mathematical manipulation.
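summarize does not report the mode, but a frequency table sorted by count puts the mode in the first row, and egen's mode() function will store it in a variable (a sketch, assuming a discrete variable named x; maxmode breaks ties by taking the largest modal value):

. tab x, sort                     // the first row is the most frequent value
. egen xmode = mode(x), maxmode   // stores the mode in every row of xmode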
Things about logs you have probably long since forgotten. log here can be to any base (i.e. log_e, log_10):
1) log(a) is defined only if a > 0.
2) log(ab) = log(a) + log(b)
3) log(a/b) = log(a) - log(b)
4) log(a^k) = k·log(a)
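You can confirm rules 2) through 4) numerically with di (a quick sketch; in Stata, log() is the natural log):

. di log(2*3) - (log(2) + log(3))   // 0, up to machine rounding
. di log(2/3) - (log(2) - log(3))   // 0, up to machine rounding
. di log(2^5) - 5*log(2)            // 0, up to machine rounding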
Geometric mean:
If the sample is x_1, x_2, x_3, ..., x_n, then the geometric mean (x̄_g) is defined as

$$\bar{x}_g = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$$

(This is the nth root of the product of the sample elements.)

This can also be written as

$$\bar{x}_g = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n}$$

or as

$$\log(\bar{x}_g) = \frac{\sum_{i=1}^{n} \log(x_i)}{n}$$
The geometric mean turns up when doing such things as dilution assays.
So, using our newly remembered facts about logs, we have the following:

$$\log(\bar{x}_g) = \log\!\left((x_1 \cdot x_2 \cdots x_n)^{1/n}\right) = \frac{1}{n}\log(x_1 \cdot x_2 \cdots x_n) = \frac{\log(x_1) + \log(x_2) + \cdots + \log(x_n)}{n} = \frac{\sum_{i=1}^{n}\log(x_i)}{n}$$

So the mean of the logs is the log of the geometric mean.
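This identity gives a hands-on way to compute a geometric mean in Stata: take logs, average them, and exponentiate (a sketch, assuming a strictly positive variable named x):

. gen logx = log(x)
. quietly sum logx
. di exp(r(mean))    // the geometric mean of x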
Rosner gives a good example of the use of the geometric mean on pages 14 and 15,
Table 2.4.
The geometric mean is more appropriate than the arithmetic mean in the following
circumstances:
1) When losses/gains can best be expressed as a percentage rather than a fixed value.
2) When rapid growth is involved, as in the development of a bacterial or viral
population.
3) When the data span several orders of magnitude as with a concentration of
pollutants.
Taken from Common Errors in Statistics 2nd edition by Good and Hardin.
The most commonly used of the above measures of location is the mean, with the median second because it is used in non-parametric analyses.
Question:
Why would the CMS (Centers for Medicare and Medicaid Services) present the geometric mean to summarize the length of hospital stay?
Note that this doesn't fit any of the reasons given above. It has to do with transformed data.
Below is a small study of the length of hospital stay for 25 patients. The dataset used is
hospital.dta which is a file that is also used in the Introduction to Stata. hospital.dta is
on the class website.
The distribution of a variable is said to be symmetric if the pieces on either side of the
center point are mirror images. Otherwise the distribution is described as skewed. If
the distribution is symmetric the skewness value given in the detailed version of the
command summarize is zero.
The variable length of hospital stay is skewed to the right (also described as positively skewed). Notice that the skewness value is 2.2. A positive skewness value (i.e. value > 0) indicates that the skewness is to the right (see the histogram of hospital stay above). A negative skewness value indicates the distribution is skewed to the left. Individuals who have much longer hospital stays than most of the other patients are very common in length-of-stay data.
. sum stay,det

                 Length of hospital stay in days
-------------------------------------------------------------
      Percentiles      Smallest
 1%            3              3
 5%            3              3
10%            3              3       Obs                  25
25%            5              4       Sum of Wgt.          25

50%            8                      Mean                8.6
                        Largest       Std. Dev.      5.715476
75%           11             11
90%           14             14       Variance       32.66667
95%           17             17       Skewness       2.203535
99%           30             30       Kurtosis       8.959067
This is a case where the value 30 days is probably correct, so we can't just set it to missing. One thing that we can do is transform the data to bring the 30 days closer to the rest of the data. One of the transformations that will bring in the larger values is the natural (i.e. base e) logarithmic transformation (log to base 10 will also bring in the more distant data). To get the log transformation we simply generate a new variable that is equal to log base e of the variable stay.

. gen logstay = log(stay)
. label variable logstay "The natural logarithm of the variable length of hospital stay"

You can also use ln(stay) to get the log base e of stay. To get the log base 10 you use log10(stay). The things about logs that we've probably long since forgotten are true regardless of the base.
Notice in the histogram below that the log transformation has pulled the largest value in nearer the other values.
Histogram 2 above is the graph of the natural logarithm of the variable stay, so the log of the geometric mean of stay will equal the arithmetic mean of the variable logstay.

. ameans stay

    Variable |    Type          Obs        Mean      [95% Conf. Interval]
-------------+-------------------------------------------------------------
        stay | Arithmetic        25         8.6      6.240767    10.95923
             | Geometric         25    7.303239      5.774765    9.236272
             | Harmonic          25    6.308454      5.148257    8.143695
----------------------------------------------------------------------------

. ameans logstay

    Variable |    Type          Obs        Mean      [95% Conf. Interval]
-------------+-------------------------------------------------------------
     logstay | Arithmetic        25    1.988318      1.753498    2.223138
             | Geometric         25    1.907722      1.685849    2.158796
             | Harmonic          25      1.8248      1.613525     2.09974
----------------------------------------------------------------------------

. di log(7.303239)
1.9883179

Or the antilog of the arithmetic mean of the variable logstay is the geometric mean of the variable stay.

. di exp(1.988318)
7.3032394
The antilog in this case is the inverse of the log function, which is the exponential function (i.e. e^x, where e = 2.7182818...).
So what does the log transformation do?
If the ratios of two pairs of points are equal, then on the log scale the distance between the two members of a pair is the same for both pairs.

$$\frac{10}{100} = \frac{1}{10} \quad\text{so}\quad \log\!\left(\frac{10}{100}\right) = \log\!\left(\frac{1}{10}\right)$$

but

$$\log(10) - \log(100) = \log\!\left(\frac{10}{100}\right) = \log\!\left(\frac{1}{10}\right) = \log(1) - \log(10)$$

So we have

. di log(10/100)
-2.3025851
. di log(1/10)
-2.3025851
. di log(1) - log(10)
-2.3025851
. di log(10) - log(100)
-2.3025851

So instead of 1 and 10 being 9 units apart while 10 and 100 are 90 units apart, both pairs are 2.3 units apart on the natural log scale.
So the short answer to why CMS presents the geometric mean is to lessen the
influence of outlying values.
Properties of means:
Property 1:
Sometimes we wish to rescale the elements of our sample. For example, we may have
collected the weight of our participants in pounds and now we are going to publish our
paper in a journal that requires the weight to be reported in grams.
The data file we are using is "weight.dta". I double (left) clicked on the dataset weight, which was stored on the W drive, and the file opened in Stata. In the "use" statement below, everything from "W" to "weight.dta" gives the path to the data set. When we open a data set in this fashion, Stata will store any log file we create in the same folder where the dataset was stored.
There are several properties that I would like you to notice about the file above:
1) The file is sorted by the variable weight. This means if I list the variable weight, the smallest weight will be listed first and the largest weight will be listed last.
2) Each variable has a variable label describing the data the variable contains.
3) The categorical variables have value labels.

Notice in the description above that the number of observations is given as 10,355 but the summary of weight below says there are 10,341 values for weight.
. sum weight

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      weight |     10341    183.1275    39.37125         54        392
If I use the command codebook, we can see that there are 14 missing values for weight.

. codebook weight
-------------------------------------------------------------------------------
weight                                                 Weight (lbs) at Baseline
-------------------------------------------------------------------------------
          type:  numeric (float)
         range:  [54,392]                     units:  1
 unique values:  262                      missing .:  14/10355

          mean:   183.127
      std. dev:   39.3713

   percentiles:       10%       25%       50%       75%       90%
                      136       156       180       206       234
We know that 1 pound = 453.26 grams. So let us create a new variable called
“wtingms” that is the baseline weight in grams.
. gen wtingms = weight*453.26
(14 missing values generated)
. label variable wtingms “Weight in grams”
Note that wtingms is missing 14 values because weight is missing 14 values (i.e.
missing × 453.26 = missing). Stata uses the period to represent missing data.
Below I used the command "list" to list the values of weight and wtingms for the last 19 participants (when the data is ordered by weight), which includes the 14 people with missing values for wtingms. "noobs" asks Stata not to number the rows.

. list id weight wtingms if weight >= 364, noobs

  +---------------------------+
  |    id   weight    wtingms |
  |---------------------------|
  | 10337   364.00   164986.6 |
  | 10338   370.00   167706.2 |
  | 10339   382.00   173145.3 |
  | 10340   392.00   177677.9 |
  | 10341   392.00   177677.9 |
  |---------------------------|
  | 10342        .          . |
  | 10343        .          . |
  | 10344        .          . |
  | 10345        .          . |
  | 10346        .          . |
  |---------------------------|
  | 10347        .          . |
  | 10348        .          . |
  | 10349        .          . |
  | 10350        .          . |
  | 10351        .          . |
  |---------------------------|
  | 10352        .          . |
  | 10353        .          . |
  | 10354        .          . |
  | 10355        .          . |
  +---------------------------+

I have listed the last 19 observations for weight. The periods represent missing data. Since the missing data is listed last, we know that Stata considers missing values to be larger than any other values. The other thing to notice is that 164986.6 = 453.26 × 364, 167706.2 = 453.26 × 370, etc.
Below we see that the mean of the wtingms variable is 453.26 times the mean of the weight variable.

. sum weight

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
      weight |     10341   183.12745    39.37125   54.00000   392.00000

. sum wtingms

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
     wtingms |     10341    83004.35    17845.41   24476.04    177677.9

. di 453.26*183.12745
83004.348

The "di" above stands for display. The "*" says multiply 183.12745 by 453.26. That is, I'm using Stata like a calculator.
This shows that if c is a constant (here 453.26), the sample c·x_1, c·x_2, c·x_3, ..., c·x_n (wtingms) has mean c·x̄, where x̄ is the mean of the sample x_1, x_2, x_3, ..., x_n (weight). That is, you can obtain the mean of a sample and then multiply by the constant, or you can multiply each element by the constant and then get the mean.
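You can check this equivalence directly with summarize's stored results (a sketch, assuming weight.dta is in memory with the wtingms variable created above):

. quietly sum weight
. di r(mean)*453.26    // the constant times the mean of weight ...
. quietly sum wtingms
. di r(mean)           // ... equals the mean of the rescaled variable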
Property 2:
If the sample x_1, x_2, x_3, ..., x_n has mean x̄ and c is a constant, then the sample x_1 + c, x_2 + c, x_3 + c, ..., x_n + c has mean x̄ + c.
This says you can add (or subtract) a fixed value to each of the original values and then get the mean, or you can get the mean of the original values and then add (or subtract) the fixed value.
You will find later when doing regression that people sometimes "center" their data by subtracting the mean of the variable from each of the original observations. So instead of putting the original variable in the regression equation, the variable they use is the original variable minus its mean.
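A sketch of how such centering is typically done in Stata (the variable name cholcentered is illustrative):

. quietly sum chol
. gen cholcentered = chol - r(mean)    // the centered variable has mean 0
. label variable cholcentered "Baseline cholesterol centered at its mean"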
So let's take a look at what happens when you add a fixed value to each element of a sample. Let us take the variable chol (this is the baseline cholesterol from the dataset weight.dta) and add 50 to the baseline value for each of the 10,273 people who have a baseline value (i.e. 82 people have missing listed for the baseline value of cholesterol, and missing + 50 = missing).
. sum chol,det

                     Lipid BL Cholesterol
-------------------------------------------------------------
      Percentiles      Smallest
 1%          167            130
 5%          181          134.5
10%        189.5          142.5       Obs               10273
25%          205            144       Sum of Wgt.       10273

50%          223                      Mean           223.7146
                        Largest       Std. Dev.      26.80037
75%        241.5          320.5
90%          259            322       Variance       718.2601
95%          269            345       Skewness       .2067261
99%        288.5            412       Kurtosis       3.099006
. gen cholplus50 = chol + 50
(82 missing values generated)
. label variable cholplus50 "Baseline cholesterol + 50 mg/dL"
Soapbox moment: I recommend always labeling your variables. You think you’ll
remember how the variable is defined, but when you come back to the data six months
later you may find that you’ve forgotten.
. sum cholplus50,det

                Baseline cholesterol + 50 mg/dL
-------------------------------------------------------------
      Percentiles      Smallest
 1%          217            180
 5%          231          184.5
10%        239.5          192.5       Obs               10273
25%          255            194       Sum of Wgt.       10273

50%          273                      Mean           273.7146
                        Largest       Std. Dev.      26.80037
75%        291.5          370.5
90%          309            372       Variance       718.2601
95%          319            395       Skewness       .2067261
99%        338.5            462       Kurtosis       3.099006
So we can see that adding 50 to each baseline value shifts all of the percentiles, the mean, the minimum and the maximum up by 50 points. Notice that the standard deviation and the variance (which we will define later) remain unchanged (this is because they refer to shape, while the mean and percentiles refer to position). The skewness and kurtosis (to be defined later) also remain the same because the only thing we've done is to shift the curve up 50 points. See the graphs below.
Below is the codebook for both chol and cholplus50.

. codebook chol cholplus50
-------------------------------------------------------------------------------
chol                                                       Lipid BL Cholesterol
-------------------------------------------------------------------------------
          type:  numeric (float)
         range:  [130,412]                    units:  .1
 unique values:  326                      missing .:  82/10355

          mean:   223.715
      std. dev:   26.8004

   percentiles:       10%       25%       50%       75%       90%
                    189.5       205       223     241.5       259

-------------------------------------------------------------------------------
cholplus50                                      Baseline cholesterol + 50 mg/dL
-------------------------------------------------------------------------------
          type:  numeric (float)
         range:  [180,462]                    units:  .1
 unique values:  326                      missing .:  82/10355

          mean:   273.715
      std. dev:   26.8004

   percentiles:       10%       25%       50%       75%       90%
                    239.5       255       273     291.5       309
Below I have created a histogram for each of chol and cholplus50. You can see that the two histograms are the same shape. The lower one is just shifted 50 mg/dL to the right.

[Histogram: "Original Baseline Cholesterol" - Frequency (0 to 1000) versus Baseline Cholesterol mg/dL (100 to 450), with the mean marked at 224.]

[Histogram: "Baseline Cholesterol + 50" - Frequency (0 to 1000) versus Baseline Cholesterol mg/dL + 50 mg/dL (100 to 450), with the mean marked at 273.7.]
Box and whisker plots:

The bottom of the box is the 25th percentile and the top of the box is the 75th percentile. The line in the middle of the box is the median or 50th percentile. The height of the box (i.e. from the 25th to the 75th percentile) is called the interquartile range, and it is a measure of variability.

[Box plot: "Lipid BL Cholesterol" (100 to 400), with the upper whisker, 75th percentile, 50th percentile, 25th percentile, and lower whisker labeled.]

[Box plots: "Cholesterol for baseline and baseline + 50 - Adding a constant changes location but not variability" (100 to 500), comparing Lipid BL Cholesterol with Baseline cholesterol + 50 mg/dL.]
The box plot above shows even more clearly that the distribution is just shifted up
without changing the relationship of the various pieces. So what I’ve worked hard to
show is that adding a fixed number to each unit of a sample changes the location of the
distribution but leaves the shape unchanged. We will discover that multiplying each unit
of a sample by a fixed number changes the shape of the distribution.
Now let's go back to multiplying the original values by some constant. We'll generate a new variable which we obtain by multiplying each of the original baseline cholesterol values by 2.

. gen cholX2 = 2*chol
(82 missing values generated)
. label variable cholX2 "Baseline cholesterol times 2 mg/dL"
Notice below that almost all of the values produced by the summarize command are multiplied by 2. There are three exceptions. The variance is multiplied by 4 = 2² (we will later learn that variance = SD², where SD = standard deviation), and the skewness and kurtosis are the same as they were for baseline cholesterol (as opposed to being multiplied by 2). We'll discuss skewness and kurtosis later.
. sum cholX2,det

              Baseline cholesterol times 2 mg/dL
-------------------------------------------------------------
      Percentiles      Smallest
 1%          334            260
 5%          362            269
10%          379            285       Obs               10273
25%          410            288       Sum of Wgt.       10273

50%          446                      Mean           447.4292
                        Largest       Std. Dev.      53.60075
75%          483            641
90%          518            644       Variance        2873.04
95%          538            690       Skewness       .2067261
99%          577            824       Kurtosis       3.099006
. sum chol,det

                     Lipid BL Cholesterol
-------------------------------------------------------------
      Percentiles      Smallest
 1%          167            130
 5%          181          134.5
10%        189.5          142.5       Obs               10273
25%          205            144       Sum of Wgt.       10273

50%          223                      Mean           223.7146
                        Largest       Std. Dev.      26.80037
75%        241.5          320.5
90%          259            322       Variance       718.2601
95%          269            345       Skewness       .2067261
99%        288.5            412       Kurtosis       3.099006
I have created a histogram for each of baseline cholesterol and baseline cholesterol times 2. In order to compare the 2 graphs they need to be on the same scale. Notice that the smallest value for cholesterol is 130 mg/dL and the largest for cholesterol times 2 is 824 mg/dL. So I will select the x-axis scale as 125(100)825 for both versions of cholesterol. 125(100)825 says label the x-axis starting with the smallest value (i.e. 125) and then going up by units of 100 until you reach 825.

[Histogram: Frequency (0 to 1000) versus Baseline cholesterol mg/dL, x-axis labeled 125 to 825 by 100.]

[Histogram: Frequency (0 to 1000) versus Baseline cholesterol mg/dL times 2, x-axis labeled 125 to 825 by 100.]
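In Stata, the 125(100)825 numlist goes in the xlabel() option of histogram (a sketch; the frequency option makes the y-axis a count rather than a density):

. histogram chol, frequency xlabel(125(100)825)
. histogram cholX2, frequency xlabel(125(100)825)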
[Box plots: "Baseline cholesterol and baseline cholesterol times 2" - mg/dL (200 to 800), comparing Lipid BL Cholesterol with Baseline cholesterol times 2 mg/dL.]
Looking at the graphs above, we see that multiplying by 2 has changed not only the location (mean) but also the shape. The cholesterol times 2 is much more spread out (we'll come back to these graphs when we discuss measures of variability).
So we’ve learned that adding to the elements of a sample changes only the location but
multiplying changes both the location and the shape. We know that we can measure
location using the mean and median, but we don’t yet know how to indicate (other than
graphically) that the shape has changed.
Menus to get means:
Click on "Submit" to run the command but leave the menu up so you can make changes as needed.
Click "OK" just to run the command.
Click on "?" to bring up the help menu for summarize.
Click on "R" to clear the entries in the menu.
How to change the values of a variable.
. replace chol = 1500 if chol == 412
(1 real change made)
How to get geometric, arithmetic and harmonic means.

How to get a histogram.

[Histogram: Frequency (0 to 2000) versus cholplus50 (200 to 450).]
Measures of spread or variability:

Range:
range = largest value - smallest value
Note that codebook gives the range as an interval. Statisticians tend to use the definition as given, so that the range is a single number.
Advantage: This is the simplest measure of spread.
Disadvantage: Very sensitive to extreme values.
The range for the baseline cholesterol is 412 - 130 = 282. If we change the largest value (412) to 550, then the range becomes 550 - 130 = 420.
One of the problems with the range is that larger samples tend to have larger ranges.
How does adding 50 to the variable cholesterol or multiplying by 2 change the range?
The range for the baseline cholesterol is 412 - 130 = 282.
The range for the cholesterol + 50 is 462 - 180 = 282. So these two variables with the same shape also have the same range.
The range for cholesterol times 2 is 824 - 260 = 564 = 2 times the range of baseline cholesterol. We can see that in the histograms and the box-and-whisker plots in the Chapter 2 Part 1 handout.
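summarize stores the extremes in r(), so the single-number range is easy to compute (a sketch using baseline cholesterol):

. quietly sum chol
. di r(max) - r(min)    // 412 - 130 = 282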
Percentiles:
Rosner says that, intuitively, the pth percentile is the value V_p such that p percent of the sample points are less than or equal to V_p. The median is the 50th percentile. You will also see percentiles called quantiles.
Quartiles are the 25th, 50th, and 75th percentiles.
Quintiles are the 20th, 40th, 60th, and 80th percentiles.
Deciles are the 10th, 20th, 30th, 40th, ..., 90th percentiles.
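tabstat will report a chosen set of percentiles in one line (a sketch requesting the quartiles of baseline cholesterol):

. tabstat chol, statistics(p25 p50 p75)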
Below we can see the change in the 25th, 50th and 75th percentiles as you add a constant (here 50) to the original cholesterol or multiply the original cholesterol by a constant (here 2).

  Percent    Cholesterol    Cholesterol + 50       Cholesterol × 2
  25%            205        205 + 50 = 255         205 × 2 = 410
  50%            223        223 + 50 = 273         223 × 2 = 446
  75%          241.5        241.5 + 50 = 291.5     241.5 × 2 = 483
Interquartile range:
Interquartile range = value of the 75th percentile - value of the 25th percentile.
As we saw in the last handout, the interquartile range is the height of the box in the box plot graph.
Notice below that the values of baseline cholesterol cluster together whereas the values of baseline cholesterol times 2 are much more spread out. We would like to be able to describe this variability in a way that uses all of the data, as opposed to the range and interquartile range, which use only 2 of the values in the dataset. We'll call this new statistic the variance.
Variance:
A first guess at a definition for variance might be

$$\text{guess}(1) = \sum_{i=1}^{n} (x_i - \bar{x})$$

This definition uses all of the observations in the sample. It also seems reasonable to use the distance of each observation from the mean as a measure of how spread out the values are. The problem is that this sum is always equal to zero.

A second guess might be

$$\text{guess}(2) = \sum_{i=1}^{n} |x_i - \bar{x}|$$

This second guess solves the problem of the sum adding to zero, and it is scaled the same as the original data. However, it has two problems: (1) the absolute value is mathematically intractable, and (2) the sum gets larger as the sample size gets larger. The second problem could be dealt with by dividing the sum by the size of the sample, namely n.

Guess number 3 is to square the difference, because the square is easier to deal with mathematically than the absolute value and, like the absolute value, it prevents the sum from being zero. If we also divide by n, then we have provided a correction for the sample size (i.e. we adjusted the sum of squares so that the sum doesn't increase just because the sample size increases).

$$\text{guess}(3) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

The problem with this estimate, which we won't understand until we learn about biased and unbiased estimators, is that on the average it is too small (this means if we took a large number of repeated samples of size n from a given population and averaged all of the variances from these samples, the average would be smaller than the true variance of the population). To solve this problem we divide by n - 1 rather than n. What we haven't stated before is that the sample estimate for the variance is intended to estimate the variance of the population from which the sample was drawn.
So the variance (s²) is defined as follows:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The variance of each of the baseline cholesterol and the baseline cholesterol + 50 is 718.26. The variance of the cholesterol times 2 is 2873.04 (i.e. 2² × the baseline cholesterol variance). Notice that the variance is not in the same units as the original data (i.e. mg²/dL² versus mg/dL). See the Stata output earlier in this handout.
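A sketch of computing s² from the definition in Stata, next to the value summarize reports (using the 10-observation dbp data from earlier):

. egen dbpmean = mean(dbp)
. gen sqdev = (dbp - dbpmean)^2
. quietly sum sqdev
. di r(sum)/(r(N) - 1)    // 508/9 = 56.44444, matching the detail output
. quietly sum dbp
. di r(Var)               // the same value stored by summarize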
Standard deviation:
The only problem left with the above definition is that the variance is not in the same units as the original data. This can be solved by taking the square root of the variance. The square root of the variance is called the standard deviation and is denoted by s. We take the non-negative square root, so s ≥ 0.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Standard Error of the Mean:
The standard error of the mean, denoted either SEM or SE, is the standard deviation divided by the square root of n, or

$$SE = \frac{s}{\sqrt{n}}$$

The SE is going to come in handy when we get to confidence intervals and the Central Limit Theorem. Small preview: the standard deviation (s) tells us about the spread for a single sample. The standard error (SE) is actually the standard deviation of the distribution of all sample means from samples of size n. Notice that the size of the SE is dependent upon the size of the sample.
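A sketch of computing the SE from summarize's stored results, alongside the mean command, which reports it directly:

. quietly sum dbp
. di r(sd)/sqrt(r(N))    // 7.512952/sqrt(10), approximately 2.3758
. mean dbp               // reports the mean together with its standard error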
Kurtosis:
The kurtosis of a distribution describes its peakedness relative to the length and size of its tails. The kurtosis of the normal distribution is 3. Distributions with values of kurtosis higher than 3 tend to have sharp peaks and long tapering tails (see the histogram of triglycerides below). Values lower than 3 indicate distributions that are relatively flat with short tails.
Users of SAS need to be aware that the value SAS gives for kurtosis is Stata's value minus 3 (i.e. the normal distribution will have a kurtosis of 3 according to Stata and 0 according to SAS). There are at least two different definitions of kurtosis, and SAS and Stata have just selected different definitions.

[Histogram of triglycerides: Kurtosis = 17.6, Skewness = 1.8]
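For reference, the moment-based formulas usually behind these numbers (a sketch; to my understanding Stata's summarize uses the unadjusted sample moments below, while SAS applies small-sample adjustments and subtracts 3 from the kurtosis):

$$\text{skewness} = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}, \qquad \text{kurtosis} = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{2}}$$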
Skewness:
A symmetric distribution is one that you can fold over at the mean so that the two halves coincide. A symmetric distribution (e.g. the normal distribution) will have a skewness of zero. Distributions that are skewed to the right, like triglycerides, have a positive number for skewness. Those skewed to the left will have a negative number for skewness. The direction of the skewness goes with the side the longer tail is on, so the triglycerides graph above is said to be skewed to the right.
How to tell the graph is skewed when using a box plot:
The 50th percentile line is not in the center of the box. This is hard to see here, but the median line is a little below the middle of the box.
The whiskers are not the same length.
And, of course, there is a long string of points outside the upper whisker with no similar string outside the lower whisker.

[Box plot: "Lipid BL Triglycerides" (0 to 1,500)]
Definition of the whiskers.
First order the units of the sample in ascending order (smallest to largest). Let x[p] denote the pth percentile, so x[25] is the 25th percentile.
The box extends from x[25] to x[75]. The line in the "middle" is x[50].
Define

$$U = x[75] + 1.5 \cdot (x[75] - x[25])$$

and

$$L = x[25] - 1.5 \cdot (x[75] - x[25])$$
Notice that if the whiskers were defined by U and L, then the length of the upper and lower whiskers would always be the same. After we've looked at a bunch of examples you'll know the upper and lower whiskers are not always the same length. The length depends on the upper and lower adjacent values defined below.
The notation x(i) indicates that the x's are ordered from smallest to largest. If there are n x's, then x(1) is the smallest and x(n) is the largest.
The upper adjacent value (i.e. the end of the upper whisker) is defined as the x(i) such that x(i) ≤ U and x(i+1) > U (i.e. x(i) is just inside or on U).
The lower adjacent value (i.e. the end of the lower whisker) is defined as the x(i) such that x(i) ≥ L and x(i-1) < L (i.e. x(i) is just inside or on L).
Notice that Rosner refers to points outside the whiskers as outlying values.
The upper and lower adjacent values (defined above) are a creation of John Tukey (Exploratory Data Analysis, 1977).
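A worked example using the 10 DBP values from earlier (quartiles 85 and 96, so the interquartile range is 11): U = 96 + 1.5(11) = 112.5 and L = 85 - 1.5(11) = 68.5. All 10 observations lie inside [68.5, 112.5], so the upper whisker ends at the largest observation, 100, and the lower whisker ends at the smallest, 78. You can check the fence arithmetic with display:

. di 96 + 1.5*(96 - 85)    // U = 112.5
. di 85 - 1.5*(96 - 85)    // L = 68.5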
John Tukey - Statistician
He died at 85 in 2000. He coined the word "software" and the word "bit" (for binary digit); Tukey used the term software well before the founding of Microsoft.
John Wilder Tukey was one of the most influential statisticians of the last 50 years and a wide-ranging thinker. Mr. Tukey developed important theories about how to analyze data and compute series of numbers quickly. He spent decades as both a professor at Princeton University and a researcher at AT&T's Bell Laboratories, and his ideas continue to be a part of both doctoral statistics courses and high school math classes. In 1973, President Richard M. Nixon awarded him the National Medal of Science.
Taken in part from the New York Times obituary.
How to graph a box plot
In the menu above, click on box plot and you will get the menu on the right. There are a lot of fancy things you can do, but just putting "trig" in the variables window gets you the graph shown earlier.
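The command-line equivalent of those menu clicks is graph box (a sketch; trig is the triglycerides variable graphed above):

. graph box trig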