Download Chapter 2-1. Describing Variables, Levels of

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

Foundations of statistics wikipedia , lookup

Regression toward the mean wikipedia , lookup

History of statistics wikipedia , lookup

Misuse of statistics wikipedia , lookup

Categorical variable wikipedia , lookup

Transcript
Chapter 2-1. Describing Variables, Levels of Measurement, and Choice of
Descriptive Statistics
Statisticians
In the front of M.G. Kendall and A. Stuarts, The Advanced Theory of Statistics, Vol 2, is a
quotation attributed to the fictitious K.A.C. Manderville, The Undoing of Lamia Gurdleneck.
"You haven't told me yet," said Lady Nuttal, "what it is your fiance does for a living."
"He's a statistician," replied Lamia, with an annoying sense of being on the defensive.
Lady Nuttal was obviously taken aback. It had not occurred to her that statisticians
entered into normal social relationships. The species, she would have surmised, was
perpetuated in some collateral manner, like mules.
"But Aunt Sara, it's a very interesting profession," said Lamia warmly.
"I don't doubt it," said her aunt, who obviously doubted it very much. "To express
anything important in mere figures is so plainly impossible that there must be endless
scope for well-paid advice on the how to do it. But don't you think that life with a
statistician would be rather, shall we say, humdrum?"
Lamia was silent. She felt reluctant to discuss the surprising depth of emotional
possibility which she had discovered below Edward's numerical veneer.
"It's not the figures themselves," she said finally. "It's what you do with them that
matters."
Statistics is the Mathematics of Distributions
At the individual level, we can describe features with single numbers (e.g., Fred is 26 years old).
At the group level, however, we lose this ability. For example, there is not a single number to
describe the age of the people in a college classroom. There are many ages represented, which
we call a distribution of ages. Statistics can be said to be the mathematics of distributions. It
allows us to describe and test theories in our universe, since our universe can be conceived of as
an infinite set of distributions.
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
Chapter 2-1 (revision 8 Jan 2011)
p. 1
Displaying a Variable Distribution
In Windows Explorer, find the datasets & do files subdirectory of the course manual. Double
click on
births_with_missing.dta
to start Stata and read in the data.
If that does not work, because the file is not associated with the program name, for example, use
the following,
File
Open
Find the directory where you copied the course CD
Find the subdirectory datasets & do-files
Single click on births_with_missing.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
births_with_missing.dta", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do files"
use births_with_missing, clear
A way to discover the sample size of a dataset, is to use the Stata menus:
Data
Variable utilities
Count observations satisfying condition
OK
or simply execute the following command in the Command Window:
count
Notice that anytime you use the menus, a Stata command is constructed and executed and shown
in the Results Window.
The crudest way to see the distribution of a variable is to simply list its values. Let’s do this for
maternal age:
Chapter 2-1 (revision 8 Jan 2011)
p. 2
Data
Describe data
List data
Main tab: Variables: matage
OK
list matage
<hit abort button – the red dot with a white X>
<- abort after one screenful, since we know there are 500 lines of data
Clearly, a list of 500 numbers is difficult to comprehend. What we do in statistics is a process
called data reduction, which is to reduce the information into a form that is easier to
comprehend.
Chapter 2-1 (revision 8 Jan 2011)
p. 3
The first level of data reduction is the frequency table, which we get using
Statistics
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: matage
OK
tabulate matage
<or abbreviate command to:>
tab matage
<or abbreviate both command and variable name to:>
tab mat
<or use minimum abbreviation:>
ta m
This generates the following frequency table.
maternal |
age |
Freq.
Percent
Cum.
------------+----------------------------------23 |
1
0.21
0.21
24 |
1
0.21
0.41
25 |
5
1.03
1.44
26 |
9
1.86
3.30
27 |
14
2.89
6.19
28 |
12
2.47
8.66
29 |
26
5.36
14.02
30 |
27
5.57
19.59
31 |
29
5.98
25.57
32 |
39
8.04
33.61
33 |
45
9.28
42.89
34 |
51
10.52
53.40
35 |
44
9.07
62.47
36 |
31
6.39
68.87
37 |
44
9.07
77.94
38 |
44
9.07
87.01
39 |
27
5.57
92.58
40 |
24
4.95
97.53
41 |
7
1.44
98.97
42 |
2
0.41
99.38
43 |
3
0.62
100.00
------------+----------------------------------Total |
485
100.00
The left column is the actual values (scores) for the maternal age variable. The “Freq” column is
the number of times that particular value occurred in the data (the frequency). The “Percent”
column is the frequency count divided by the sample size (N=485, rather than N=500, which we
will discuss shortly). The “Cum” column is the cumulative sum of the Percent column,
informing you what percent of the sample had “equal to or less than” that value.
Chapter 2-1 (revision 8 Jan 2011)
p. 4
By default, Stata commands only operate on non-missing values, which is what you normally
want to report in your article. To see the number of missing values with the tabulate command,
we simply add the “missing” option:
Statistics
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: matage
Treat missing values like all other values
OK
tabulate matage , missing
maternal |
age |
Freq.
Percent
Cum.
------------+----------------------------------23 |
1
0.20
0.20
24 |
1
0.20
0.40
25 |
5
1.00
1.40
26 |
9
1.80
3.20
27 |
14
2.80
6.00
28 |
12
2.40
8.40
29 |
26
5.20
13.60
30 |
27
5.40
19.00
31 |
29
5.80
24.80
32 |
39
7.80
32.60
33 |
45
9.00
41.60
34 |
51
10.20
51.80
35 |
44
8.80
60.60
36 |
31
6.20
66.80
37 |
44
8.80
75.60
38 |
44
8.80
84.40
39 |
27
5.40
89.80
40 |
24
4.80
94.60
41 |
7
1.40
96.00
42 |
2
0.40
96.40
43 |
3
0.60
97.00
. |
15
3.00
100.00
------------+----------------------------------Total |
500
100.00
This time we see that n=15 observations have a value of “.”, which is Stata’s reserved symbol for
missing value for a numeric variable.
Chapter 2-1 (revision 8 Jan 2011)
p. 5
Let’s try this for a string variable, which is a variable of letters and/or special symbols (not
numbers).
Statistics
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: sexalph
Treat missing values like all other values
OK
tabulate sexalph , missing
sex coded |
as string |
Freq.
Percent
Cum.
------------+----------------------------------|
41
8.20
8.20
female |
220
44.00
52.20
male |
239
47.80
100.00
------------+----------------------------------Total |
500
100.00
The missing value looks like a space, but it is actually a null value. If you were entering a
missing string value using the Stata data editor, you would hit the tab key or the enter key to
input the missing value, rather than the space character.
When we look at the Freq or Percent columns of the frequency table, we get a sense for what
values are most likely.
Chapter 2-1 (revision 8 Jan 2011)
p. 6
The next level of data reduction is to see this graphically with a histogram.
Graphics
Histogram
Main tab: Variable: matage
OK
0
.05
Density
.1
.15
histogram matage
25
30
35
maternal age
40
45
Notice that the tick marks don’t line up precisely with the middle of the bars, and that there
appears to be a value of maternal age, perhaps age 33, that has a zero frequency. Looking back at
the frequency table, however, reveals no age with a zero frequency.
Stata is attempting to provide the nicest looking graph, by forming what it considers the number
of bins (cut-offs) that should provide the nicest looking graph. In the Results Window, we see
what it did:
(bin=22, start=23, width=.90909091)
This algorithm generally works for a continuous variable that has a large number of distinct
values. For our variable, which has a relatively small number of distinct values, we can bypass
this feature using the “discrete” option.
Chapter 2-1 (revision 8 Jan 2011)
p. 7
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
OK
.06
0
.02
.04
Density
.08
.1
histogram matage , discrete
20
25
30
35
maternal age
40
45
The default scale, “density”, is the proportion of the total sample of non-missing values for each
specific value of the matage. It is a graph of the Percent column, converted to a proportion, of
the frequency table.
Chapter 2-1 (revision 8 Jan 2011)
p. 8
To change the y-axis to percents, use the percent option
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Percent
OK
6
0
2
4
Percent
8
10
histogram matage , discrete percent
20
25
Chapter 2-1 (revision 8 Jan 2011)
30
35
maternal age
40
45
p. 9
To change the y-axis to frequency count, use the frequency option
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Frequency
OK
30
20
0
10
Frequency
40
50
histogram matage , discrete frequency
20
Chapter 2-1 (revision 8 Jan 2011)
25
30
35
maternal age
40
45
p. 10
Many people think its fun to overlay a normal distribution curve on their histogram, to get a feel
for how normally distributed the variable is. (Note: the importance of having a normally
distributed variable is overrated, as we will see in a later chapter.)
To get this, we request the normal option:
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Frequency
Density Plots tab: Add normal density plot
OK
30
20
0
10
Frequency
40
50
histogram matage , discrete frequency normal
20
Chapter 2-1 (revision 8 Jan 2011)
25
30
35
maternal age
40
45
p. 11
To see the line that passes through the top center of each bar, but with smoothing, we can overlay
on the histogram what is called a “kernal density plot”, using,
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Frequency
Density Plots tab: Add kernal density plot
OK
30
20
0
10
Frequency
40
50
histogram matage , discrete frequency kdensity
20
25
Chapter 2-1 (revision 8 Jan 2011)
30
35
maternal age
40
45
p. 12
To visually compare the histograms for males and females, we use the by( ) option.
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Frequency
By tab: Draw subgraphs for unique values of variables
Variables: sexalph
OK
histogram matage, discrete frequency by(sexalph)
male
0
10
Frequency
20
30
female
20
30
40
50
20
30
40
50
maternal age
Graphs by sex coded as string
Notice that it is difficult to tell how much the two histograms overlap, or diverge.
Chapter 2-1 (revision 8 Jan 2011)
p. 13
One approach we can take is to overlap the kernel density plots for the two sexes, using
.06
.04
0
.02
kdensity matage
.08
.1
twoway (kdensity matage if sexalph == "female", lcolor(pink)) ///
(kdensity matage if sexalph == "male" , lcolor(blue))
25
30
35
40
45
x
kdensity matage
kdensity matage
Here, the “///” in the command means to continue the command onto the next line, but this only
works in the do-file editor. To use that, click on the 5th icon from the right on the menu bar,
which looks like a notebook. Type in your command, and hit the “Do” button, which is the last
icon on the do-file editor menu bar. In the command window, you would just type the entire
command in on one line, without the “///”. The “( ) ( )”, with a command for a different graph
in each “( )”, tells Stata to overlay the graphs.
Notice it uses “density”, rather than “frequency”. With frequencies, the two sexes could have a
different looking line simply due to a different sample size, while using proportions, or densities,
make them directly comparable.
Chapter 2-1 (revision 8 Jan 2011)
p. 14
Another graph to show this, although more complicated to create, is
0
10
5
percent
5
10
clear
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "BiostatsCourse\datasets & do files"
use births_with_missing.dta
drop if matage==. | sex==.
count if sex==1
scalar Nmale=r(N)
count if sex==2
scalar Nfemale=r(N)
gen one = 1
collapse (count) one , by(sex matage)
rename one frequency
gen percent=frequency/Nmale*100 if sex==1
replace percent=frequency/Nfemale*100 if sex==2
replace percent = -percent if sex==1 // males on bottom
#delimit ;
twoway (bar percent matage if sex==1)
(bar percent matage if sex==2)
, legend(lab(1 "male") lab(2 "female"))
ylabel(-10 "10" -5 "5" 0 "0" 5 "5" 10 "10")
;
#delimit cr
25
30
35
maternal age
male
40
45
female
The commands used for this graph are taught in the K30 Computer Practicum (Stata) course.
Although fancy, it is still too much information to get your head around.
Chapter 2-1 (revision 8 Jan 2011)
p. 15
Finally, a frequently reported graph for comparing two or more distributions is the box and
whisker plot, or boxplot for short.
Graphics
Box plot
Main tab: Variables: matage
By tab: Draw subgraphs for unique values of variables
Variables: sexalph
OK
graph box matage, by(sexalph)
male
35
25
30
maternal age
40
45
female
Graphs by sex coded as string
The advantage of this graph is that it incorporates a side by side comparison, with just the right
amount of data reduction, which makes it the most popular approach for visually comparing
distributions.
Chapter 2-1 (revision 8 Jan 2011)
p. 16
Before we discuss what this graph represents, let’s add an outlier to the data to see how it is
represented.
Go into the data editor [spreadsheet with pencil icon, the “Date Editor (Edit)” in Stata-11], 5th
from right on menu bar) and change matage from 34 to 70 for the first observation. Click on the
X to exit, and click on accept changes in the exit data editor box. When you clicked on accept
changes, it executed the following command:
replace matage = 70 in 1
Re-creating the boxplot,
Graphics
Box plot
Main tab: Variables: matage
By tab: Draw subgraphs for unique values of variables
Variables: sexalph
OK
graph box matage, by(sexalph)
male
50
40
20
30
maternal age
60
70
female
Graphs by sex coded as string
Notice that the outlier is represented by a dot.
The interpretation of a boxplot is shown in the following box.
Chapter 2-1 (revision 8 Jan 2011)
p. 17
Boxplots
This graph is a box-and-whisker plot, or simply, a boxplox. The box shows the interquartile
range (IQR) (top of box is the 75th percentile, the bottom of the box is the 25th percentile). The
line inside the box is the median (50th percentile). The lines extending beyond the box, which
look like error bars, called the whiskers, represent the minimum and maximum. However, if a
data value extends beyond 1.5  IQR in either direction, the whiskers exclude that value and the
value is shown separately as a point on the graph. These points are called extreme values.
Extreme values might represent outliers, an outlier being a data value that appears to have not
come from the same population that the rest of sample came from.
This graphical approach for identifying outliers was proposed by Tukey (1977). Tukey referred
to outliers as “extreme values” to avoid the whole “outlier” issue, it be controversial how outliers
should be handled.
We will discuss outliers in a later chapter, what to do about them, and how to report or exclude
them in your article. For now, if you choose to describe this method for identifying outliers in
your study protocol, you could state the following:
We will identify outliers using the graphical method proposed by Tukey (1977), which
are values that extend beyond 1.5 times the interquartile range in either direction...(and
then tell what you intend to do about them).
The primary reason for using boxplots is not to identify outliers, but rather to graphically show
and perhaps compare distributions.
Exercise Look at the boxplot in Figure 3 of Bejon et al (2008). Notice in their footnote they
give the same explanation of the boxplot elements as given here.
Chapter 2-1 (revision 8 Jan 2011)
p. 18
Now that outliers have been described, return the data back to what it was, using
replace matage = 34 in 1
Re-creating the boxplot,
Graphics
Box plot
Main tab: Variables: matage
By tab: Draw subgraphs for unique values of variables
Variables: sexalph
OK
graph box matage, by(sexalph)
male
35
30
25
maternal age
40
45
female
Graphs by sex coded as string
Notice that the “by” option created a side-by-side graph for each value of the “by” variable.
Some other variations of this format will be tried next.
Chapter 2-1 (revision 8 Jan 2011)
p. 19
To get the box plots to be contained within the same graph, we replace “ by” with “over”.
Graphics
Box plot
Main tab: Variables: matage
Categories tab: check box for Group 1
Grouping variable: sexalph
OK
35
30
25
maternal age
40
45
graph box matage, over(sexalph)
female
male
Notice both groups are now in the same plot region, rather than two plots that are side by side.
The labels “female” and “male” are actually values of the variable sexalph. If we used the
variable sex, which has no value labels assigned, we will see the numeric values of the variable.
Chapter 2-1 (revision 8 Jan 2011)
p. 20
Graphics
Box plot
Main tab: Variables: matage
Categories tab: check box for Group 1
Grouping variable: sex
OK
35
25
30
maternal age
40
45
graph box matage, over(sex)
1
2
To assign labels to replace the 1 and 2, we use,
Chapter 2-1 (revision 8 Jan 2011)
p. 21
Graphics
Box plot
Main tab: Variables: matage
Categories tab: check box for Group 1
Grouping variable: sex
Click on “properties”
Check “override labels for this group”
Label specification: 1 Male 2 Female
Accept
OK
35
30
25
maternal age
40
45
graph box matage, over(sex, relabel(1 Male 2 Female))
Male
Female
Note: In the Label Specification, we put the label “Male” by 1 and “Female” by 2. Here, the 1
and 2 simply mean the sort order of the actual values of the variable. If the variable had been
scored 0 for male and 1 for female, we would still use “1 Male 2 Female” for the label
specification.
Chapter 2-1 (revision 8 Jan 2011)
p. 22
Multiple “over” groups can be specified to get a clustering effect. We simply specify more
grouping variables.
Graphics
Box plot
Main tab: Variables: matage
Categories tab: Check box for Group 1
Grouping variable: sex
Check box for Group 2
Grouping variable: hyp
Accept
OK
35
30
25
maternal age
40
45
graph box matage, over(sex) over(hyp)
1
2
0
1
2
1
At the bottom of the graph, the 1 and 2 come from the first specified “over” variable, and the 0
and 1 from the second variable, which are the actual values of these two variables.
Chapter 2-1 (revision 8 Jan 2011)
p. 23
To assign labels to these variables in the graph,
Graphics
Box plot
Main tab: Variables: matage
Categories tab: check box for Group 1
Grouping variable: sex
Click on “properties”
Check “override labels for this group”
Label specification: 1 Male 2 Female
check box for Group 1
Grouping variable: hyp
Click on “properties”
Check “override labels for this group”
Label specification: 1 "Maternal Hypertension Present"
2 "Maternal Hypertension Absent"
Accept
OK
35
30
25
maternal age
40
45
graph box matage, over(sex, relabel(1 Male 2 Female)) ///
over(hyp, relabel(1 "Maternal Hypertension Absent" ///
2 "Maternal Hypertension Present"))
Male
Female
Maternal Hypertension Absent
Male
Female
Maternal Hypertension Present
Notice for the label specification for variable hyp, we used 1 and 2 to denote the 1st and 2nd sort
order values of the variable, rather than using 0 and 1, which are the actual values of the variable.
Chapter 2-1 (revision 8 Jan 2011)
p. 24
The boxplot is a nice way to show whether or not two or more distributions overlap. However, it
is still too much information, and not specific enough to identify the size of the difference.
The next level of data reduction is descriptive statistics, which include the mean and standard
deviation. This is the most useful level of data reduction and is expected for publication.
The most commonly used descriptive statistics are shown on the following pages.
Chapter 2-1 (revision 8 Jan 2011)
p. 25
Descriptive Statistics
Descriptive statistics (or “descriptive measures”) are used to summarize data for a variable. Here
are some definitions:
n = sample size (can also use “N”)
Minimum = smallest value in the data
Maximum = largest value in the data
Range = Maximum minus Minimum
Mode = score with highest frequency
Mean = arithmetic average (also called “X bar”)
n
X
X
i 1
i
n
Median = 50th percentile (50% of values below and 50% above)
Interquartile Range (IQR) = 75% percentile minus 25% percentile, which is usually
indicated by just reporting the two percentiles
Variance = a measure of dispersion expressed in squared terms. This is only used in
formulas, and never reported as a discriptive measure.
n
s2 
Chapter 2-1 (revision 8 Jan 2011)
(X
i 1
i
 X )2
n 1
p. 26
Standard Deviation (SD) = the square root of the variance. This is a measure of dispersion
expressed in original units. For a generally symmetrical
distribution, approximately two-thirds of the values are in the
interval mean  1 S.D. and approximately 95% are in interval mean
 2 S.D.
n
s
(X
i 1
i
 X )2
n 1
Looking at the formula for the SD, or s, we observe it looks like the average of the
distance from the mean for each value of the variable (average distance from the
mean). The only thing strange is the denominator, n-1, instead of n. The n-1 is
called the “degrees of freedom.” Only n-1 values are “free to vary”, since two
equations are solved simultaneously, the formula for the mean and the formula for
the SD. That is, if you know n-1 values of the variable when computing the
variance or SD, the final value can only be a single specific number, a plugged
value, or otherwise a different value for the mean would have been necessary
(Munro, 2001, pp.84-85). So, the SD is the average distance from the mean of the
values which can contribute to the variability (those free to vary). Intuitively, you
can simply consider the SD as the average deviation, or distance, from the mean,
since n-1 is approximately n.
Standard Error of the Mean (SEM) = standard deviation of the sample mean, a measure of
variation in our estimate of the mean. It is the standard deviation
divided by the square root of the sample size.
SEM  s X 
s
n
If the context is clear, that you are talking specifically about the mean, this is also
just called the standard error (SE). This statistic will be discussed further below.
Chapter 2-1 (revision 8 Jan 2011)
p. 27
Frequency Proportion
Describing a Variable (Average and Variability)
When describing the distribution of a variable with descriptive statistics, the goal is to summarize
the distribution with as few numbers as possible. It is generally sufficient to report the average,
or center of the distribution, and the variability, or dispersion (how spread out the data are).
If the distribution is at least approximately symetrical, it will look somewhat like a theoretical
distribution called the normal distribution (bell shaped curve). A normal distribution with a
mean of 0 and a standard deviation (SD) of 1 will look like:
.5
Mean = 0
.4
.3
SD = 1
.2
.1
0
-4
-3
-2
-1
0
1
Variable Scores
2
3
4
For a normal curve, the SD on the x-axis is where the histogram, or curve, switches from
concave down to concave up.
Chapter 2-1 (revision 8 Jan 2011)
p. 28
If you are curious how the above graph was drawn, the following was run in the do-file editor,
#delimit ;
twoway (function y=normalden(x,0,1), clwidth(*1.5) range(-4 4))
(pci .4 0 0 0 , lwidth(*1.25) lcolor(red))
(pci .24 -1 .24 1, lwidth(*1.25) msize(*1.25) lcolor(red))
(pcarrowi .43 .3 .405 0 , lwidth(*1.5) msize(*1.25)
lcolor(black) mcolor(black))
(pcarrowi .27 1.4 .24 1.1 , lwidth(*1.5) msize(*1.25)
lcolor(black) mcolor(black))
, ytitle(Frequency Proportion)
xtitle(Variable Scores,height(5))
xlabels(-4(1)4) ylabels(0(.1).5,angle(horizontal))
legend(off)
text(.435 .32 "Mean = 0", placement(ne))
text(.27 1.5 "SD = 1", placement(ne))
;
#delimit cr
The standard deviation (SD) is a measure of the scatter, or dispersion, around the mean. To
illustrate this, in the next figure three normal distributions are drawn, each with the same mean,
for various SDs.
Frequency Proportion
.2
.15
.1
Mean = 50 , SD = 2
Mean = 50 , SD = 7
.05
Mean = 50 , SD = 20
0
0
10
20
Chapter 2-1 (revision 8 Jan 2011)
30
40
50
60
Variable Scores
70
80
90
100
p. 29
If you are curious how the above graph was drawn, the following was run in the do-file editor,
#delimit ;
twoway (function y=normalden(x,50,2),
clwidth(*1.5) lcolor(blue) range(0 100))
(function y=normalden(x,50,7),
clwidth(*1.5) lcolor(red) range(0 100))
(function y=normalden(x,50,20),
clwidth(*1.5) lcolor(green) range(0 100))
(pcarrowi .07 60 .065 55 , lwidth(*1.5) msize(*1.25)
lcolor(blue) mcolor(blue))
(pcarrowi .05 63 .04 58 , lwidth(*1.5) msize(*1.25)
lcolor(red) mcolor(red))
(pcarrowi .03 72 .02 66 , lwidth(*1.5) msize(*1.25)
lcolor(green) mcolor(green))
, ytitle(Frequency Proportion)
xtitle(Variable Scores,height(5))
xlabels(0(10)100) ylabels(0(.05).22,angle(horizontal))
legend(off)
text(.07 61 "Mean = 50 , SD = 2", placement(e) color(blue)
size(*1.1))
text(.05 64 "Mean = 50 , SD = 7", placement(e) color(red)
size(*1.1))
text(.03 73 "Mean = 50 , SD = 20", placement(e) color(green)
size(*1.1))
;
#delimit cr
Chapter 2-1 (revision 8 Jan 2011)
p. 30
Frequency Proportion
For a normal distribution, the following percentages of the distribution are between:
mean  1 S.D : 68.3%
mean  2 S.D : 95.4%
mean  3 S.D : 99.7%
as shown in the following graph,
Mean = 0 , SD = 1
Mean 1SD
(middle 68.3%)
.5
Mean 2SD
(middle 95.4%)
.4
.3
Mean 3SD
(middle 99.7%)
.2
.1
0
-4
-3
-2
-1
0
1
Variable Scores
2
3
4
If you are curious how this graph was drawn, the following was run in the do-file editor,
#delimit ;
twoway (function y=normalden(x,0,1), clwidth(*1.5) range(-4 4))
(pci .55 -1 0 -1 , lwidth(*2) lcolor(red))
(pci .55 1 0 1 , lwidth(*2) lcolor(red))
(pci .46 -2 0 -2 , lwidth(*2) lcolor(blue))
(pci .46 2 0 2 , lwidth(*2) lcolor(blue))
(pci .27 -3 0 -3 , lwidth(*2) lcolor(green))
(pci .27 3 0 3 , lwidth(*2) lcolor(green))
, ytitle(Frequency Proportion)
xtitle(Variable Scores,height(5))
xlabels(-4(1)4) title("Mean = 0 , SD = 1")
ylabels(0(.1).55,angle(horizontal))
legend(off)
text(.55 0 "Mean{&plusminus}1SD", placement(c) color(red))
text(.51 0 "(middle 68.3%)", placement(c) color(red))
text(.46 0 "Mean{&plusminus}2SD", placement(c) color(blue))
text(.42 0 "(middle 95.4%)", placement(c) color(blue))
text(.27 0 "Mean{&plusminus}3SD", placement(c) color(green))
text(.24 0 "(middle 99.7%)", placement(c) color(green))
;
#delimit cr
Chapter 2-1 (revision 8 Jan 2011)
p. 31
If a variable is approximately symmetrical, it will usually mimic a normal distribution. When
you report meanSD in your article, the reader knows that about 2/3 of the values fall in that
range.
The two numbers taken together, then, do a very nice job of describing the distribution. To take
a guess of what the age will be of next mother presenting at the London hospital for delivery, we
would say that our best guess is 34 years old (the mean), but it would not be surprising if her age
differs from that guess by up to 4 years younger or 4 years older ( one standard deviation). We
can confidently use that range because we will be right two-thirds of the time.
Consistent schema suggested by the J Thorac Cardiovasc Surg
The editors of The Journal of Thoracic and Cardivascular Surgery, in their statistical
instructions to authors, suggest a consistent scheme for bounding the same percent of
observations for various levels of measurement, approximately 70% (J Thorac Cardiovasc Surg,
2008),
“A consistent schema for expressive variability would include ±1 standard deviation for
normally distributed continuous variables, 15 and 85 percentiles for skewed distributions,
and 70% confidence limits for proportions.”
A “skewed” distribution is one that is not symmetrical. Do not worry about this suggestion.
There is really no need for such consistency, because there is nothing special about the middle
70% of observations. So, this suggestion will probably not gain widespread use, even in that
journal.
Chapter 2-1 (revision 8 Jan 2011)
p. 32
Frequency Proportion
When the distribution is perfectly symmetrical, the mean, median, and mode are identically the
same number.
.5
Mean = Median = Mode
.4
.3
.2
.1
0
-4
-3
-2
-1
0
1
Variable Scores
2
3
4
which was drawn using,
#delimit ;
twoway (function y=normalden(x,0,1), clwidth(*1.5) range(-4 4))
(pci .4 0 0 0 , lwidth(*1.25) lcolor(red))
(pcarrowi .44 0 .41 0 , lwidth(*1.5) msize(*1.25)
lcolor(black) mcolor(black))
, ytitle(Frequency Proportion)
xtitle(Variable Scores,height(5))
xlabels(-4(1)4)
ylabels(0(.1).5,angle(horizontal))
legend(off)
text(.46 0 "Mean = Median = Mode", placement(c))
;
#delimit cr
Chapter 2-1 (revision 8 Jan 2011)
p. 33
Frequency Proportion
For skewed distributions, the mode remains at the most frequent value, which is the variable
score corresponding to the maximun height of the frequency distribution. The mean always gets
pulled in the direction of the skewness, and the median always lies in between the mean and
mode. So, you can visually compare the mean to the median to determine if the distribution is
skewed, and in which direction.
Left Skewed Distribution
Mode
.25
Median
Mean
.2
.15
.1
.05
0
0
10
5
15
Variable Scores
Chapter 2-1 (revision 8 Jan 2011)
p. 34
Frequency Proportion
Right Skewed Distribution
.25
Mode
Median
Mean
.2
.15
.1
.05
0
0
5
10
15
Variable Scores
Chapter 2-1 (revision 8 Jan 2011)
p. 35
which was drawn using,
* --- left skewed distribution (guessing at values of statistics)
preserve
clear
set obs 140
gen x=_n/10
gen y=gammaden(4,1,0,x)
gsort -x
drop x
gen x=_n/10
#delimit ;
twoway (line y x , clwidth(*1.5))
(pci .24 11.1 0 11.1 , lwidth(*1.25) lcolor(red))
(pci .22 10 0 10 , lwidth(*1.25) lcolor(blue))
(pci .20 8 0 8 , lwidth(*1.25) lcolor(green))
, ytitle(Frequency Proportion) legend(off)
xtitle(Variable Scores,height(5))
title("Left Skewed Distribution")
ylabels(0(.05).25,angle(horizontal))
text(.25 11.1 "Mode", placement(c) color(red))
text(.23 10 "Median", placement(c) color(blue))
text(.21 8 "Mean", placement(c) color(green))
;
#delimit cr
restore
* --- right skewed distribution (guessing at values of statistics)
preserve
clear
set obs 140
gen x=_n/10
gen y=gammaden(4,1,0,x)
#delimit ;
twoway (line y x , clwidth(*1.5))
(pci .24 3 0 3 , lwidth(*1.25) lcolor(red))
(pci .22 4.2 0 4.2 , lwidth(*1.25) lcolor(blue))
(pci .20 6 0 6 , lwidth(*1.25) lcolor(green))
, ytitle(Frequency Proportion) legend(off)
xtitle(Variable Scores,height(5))
title("Right Skewed Distribution")
ylabels(0(.05).25,angle(horizontal))
text(.25 3 "Mode", placement(c) color(red))
text(.23 4.2 "Median", placement(c) color(blue))
text(.21 6 "Mean", placement(c) color(green))
;
#delimit cr
restore
Chapter 2-1 (revision 8 Jan 2011)
p. 36
Standard Error of the Mean
The standard error (SE) of the mean, or SEM, reflects the uncertainty in our sample estimate of
the mean, or the amount of variability in the mean that is expected by sampling variation.
If we took a large number of samples of size n, from a population with the same mean and
standard deviation as observed in the original sample, and computed the mean for each sample,
the distribution of these means would have a standard deviation (SD) equal to the standard error
(SE) from the original sample. In other words, the SE is the SD of the estimates of the mean.
To verify that the standard error (SE) is the standard deviation (SD) of the distribution of means
from repeated samples, we will simulate this.
Computing some descriptive statistics for matage,
use births_with_missing, clear
summarize matage
* <or>
sum matage
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------matage |
485
34.05979
3.905724
23
43
Next, we will draw 10,000 samples of size n=485 from a normal distribution with mean of
34.05979 and SD of 3.905724. Each time we will compute the mean and save it to a file.
clear
set obs 1
gen mean=.
save holdmeans, replace
*
forval i=1/10000 {
quietly clear
quietly set obs 485
quietly gen matage = invnorm(uniform())*3.905724+34.05979
quietly sum matage
quietly gen mean=r(mean)
quietly keep mean
quietly keep in 1
quietly append using holdmeans
quietly save holdmeans, replace
}
sum
display "original sample SE = SD/sqrt(N) = " 3.905724/sqrt(485)
histogram mean , normal
The result follows, which matches closely, only being off by 1 point on the 3rd decimal place.
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------mean |
10000
34.06112
.1767598
33.44242
34.73396
original sample SE = SD/sqrt(N) = .17734979
Chapter 2-1 (revision 8 Jan 2011)
p. 37
1.5
0
.5
1
Density
2
2.5
Here is the graph of the 10,000 means computed from the 10,000 samples. Notice it has a very
narrow distribution. The SE is 1/sqrt(N) as wide as the original maternal ages SD, or 1/sqrt(485)
= 1/22 as wide.
33.5
34
34.5
35
mean
The SE is always much narrower than the SD. That is because the mean is always very close to
the center of the individual scores, so sampling variability has a much smaller effect on it’s
estimate.
We are getting a bit ahead of the material covered, but this graph will immediate raise a question
in the mind of the reader who has reported research results using error bars, so we might as well
address it right now.
When you present your data using a bar chart with error bars, where the height of the bar is the
mean and the error bar is the amount of variability expected due to sampling variation, should
you use the SD or SE for the error bar? (Another choice is using the 95% confidence interval,
which will be introduced in Chapter 2-4.)
These graphs are very popular in basic science. To give an example that is more in line with the
small sample used in such studies, let’s take a random sample of n=10 from the maternal age
variable.
use births_with_missing, clear
*
set seed 999
// so we get the same random sample each time
sample 10 , count // random sample of n=10
*
sum matage
Variable |
Chapter 2-1 (revision 8 Jan 2011)
Obs
Mean
Std. Dev.
Min
Max
p. 38
-------------+-------------------------------------------------------matage |
10
35.3
3.653005
29
40
Computing the standard error,
display "SE = SD/sqrt(N) = " 3.653005/sqrt(10)
SE = SD/sqrt(N) = 1.1551816
To see the dramatic difference created by our choice for the error bars, the following figure
shows the mean with a SD error bar and a SE error bar.
50
45
40
Maternal Age
35
30
25
20
15
10
5
0
mean+/-SD
mean+/-SE
The second bar clearly looks “more precise” than the second bar. In the first bar, the mean is
35.3, but is likely to vary as much as 3.7, the height of the error bar. In the second bar, the mean
is 35.3, but is likely to vary only by as much as 1.6.
This makes it “seem like cheating” to some researchers, when the article displays the SE for the
error bar, rather than the SD. (You should always say in the figure legend which one you used.)
Chapter 2-1 (revision 8 Jan 2011)
p. 39
Which to use? The answer is simple if you just stop to think about it. It depends on if you are
presenting a description of your sample, which is the individual observations, or if you are
presenting an outcome. The individual observations are described with their mean and standard
deviation,where the SD reflects how much an individual observation is likely to vary (the middle
2/3rds) due to sampling variation. The outcome is always reported by some mean (mean,
regression slope, risk ratio, etc). The standard error of that estimate reflects how much the effect,
the mean in this case, is likely to vary (the middle 2/3rds) due to sampling variation.
A formal argument for using SEM, instead of SD, when describing results is given in Ch 2-2.
The argument requires the concept of statistical regularity, which is presented in that chapter.
If you are curious, the above graph was generated by the following Stata code executed in the dofile editor,
* mean +/- SD in first row: 35.30 +/- 3.65
* mean +/- SE in second row: 35.30 +/- 1.16
clear
input x mean lower upper
1 35.30 31.65 38.95
2 35.30 34.14 36.46
end
*
#delimit ;
twoway (bar mean x , bcolor(black) barwidth(.5))
(rcap lower upper x, lcolor(black) blwidth(medthick) msize(large))
,
legend(off) scheme(s1mono) plotregion(style(none))
xscale(range(.5 2.5)) ylabels(0(5)50, angle(horizontal))
xlabels(1 "mean+/-SD" 2 "mean+/-SE" , notick)
ytitle("Maternal Age") xtitle(" ") aspectratio(1)
;
#delimit cr
Chapter 2-1 (revision 8 Jan 2011)
p. 40
The choice of the many available descriptive statistics that we can choose from to report in a
manuscript is based on the level of measurement of the variable.
Measurement Scale (also called level of measurement)
Dichotomous scale (dichotomy, 2 categories) (e.g., sex: male or female)
Nominal scale name
(unordered named categories)
(e.g., therapy: radiation, surgical, chemo)
Ordinal scale name + order
(ordered categories)
(e.g., size: small, medium, large)
Interval scale name + order + equal intervals + arbitrary zero point
(e.g., temperature, 0 degrees does not imply
absence of temperature, the 0 point is just a
convention. Ratios do not make sense--you
would not say 80° is twice as hot as 40°.)
Ratio scale
name + order + equal intervals + absolute zero point
(e.g., height, 0mm means no height. Ratios make
sense–an object 10mm tall is twice the height as
an object 5mm tall.)
Note 1: it makes sense to do arithmetic on interval scaled variables, since this scale is
sufficiently close to our notion of integers and real numbers (both number
systems have equal intervals). It does not make sense to do arithmetic on
nominal and ordinal scales, since these scales do not have equal intervals.
Note 2: for purposes of choosing test statistics, interval and ratio scales are considered
equivalent.
A second measurement scale scheme is:
Binary data
Unordered categorical data
Ordered categorical data
Continuous data
Chapter 2-1 (revision 8 Jan 2011)
(dichotomous scale)
(nominal scale)
(ordinal scale)
(interval & ratio scales)
p. 41
Exercise Think about what arithmetic makes sense for a given level of measurement. An “X”
is shown in every cell of the following table for which the descriptive measure is a legitimate
statististic for that level of measurement. Do you understand why?
dichotomous nominal
ordinal
interval
binary
categorical
ordered
categorical
continuous
n
X
X
X
X
%
X
X
X
X
min
X
X
max
X
X
range
X
X
X
X
mode
X
X
mean
X
median
X
X
IQR
X
X
std. dev.
X
std. err.
X
note: In Chapter 2-6, it is shown that a dichotomous scale is also an
interval scale, though it is widely treated as a nominal scale.
Chapter 2-1 (revision 8 Jan 2011)
p. 42
Exercise This time an “X” is shown in every cell of the following table for which the
descriptive measure might be useful in a “Table 1. Patient Characteristics” table of a research
article. Do you understand why?
dichotomous nominal
ordinal
interval
binary
categorical
ordered
categorical
continuous
X
n
X
X
X
%
X
X
X
min
max
range
mode
mean
X
median
X
X
IQR
X
X, if shown
as 25th ,
75th
percentiles
std. dev.
X
std. err.
Chapter 2-1 (revision 8 Jan 2011)
p. 43
Exercise For each level of measurement, the “official” (most informative) measure of average
(central tendency) and the “official” (most informative) measure of variability (dispersion) is
shown. Do you understand why?
dichotomous nominal
ordinal
interval
binary
categorical
ordered
categorical
continuous
Avg, Var
Avg, Var
n
%
min
max
range
mode
mean
Avg
median
Avg
Avg if
skewed
IQR
Var
Var if
skewed
std. dev.
Var
std. err.
Chapter 2-1 (revision 8 Jan 2011)
p. 44
Bringing our original dataset back into Stata memory, provided our working directory is where
the data file is (otherwise you can read it back in with the “File” command on the Stata menu),
use births_with_missing, clear
Since maternal age is a continuous variable, or interval scaled variable, we can use the mean and
standard deviation to summarize the data.
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Summary statistics
Main tab: Variables: matage
OK
summarize matage
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------matage |
485
34.05979
3.905724
23
43
To obtain greater detail,
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Summary statistics
Main tab: Variables: matage
Options: display additional statistics
OK
summarize matage , detail
maternal age
------------------------------------------------------------Percentiles
Smallest
1%
25
23
5%
27
24
10%
29
25
Obs
485
25%
31
25
Sum of Wgt.
485
50%
75%
90%
95%
99%
34
37
39
40
42
Chapter 2-1 (revision 8 Jan 2011)
Largest
42
43
43
43
Mean
Std. Dev.
34.05979
3.905724
Variance
Skewness
Kurtosis
15.25468
-.241381
2.483957
p. 45
To see why median and interquartile range summarize the data better for skewed distributions,
let’s introduce some skewness. The original data are:
maternal age
------------------------------------------------------------Percentiles
Smallest
1%
25
23
5%
27
24
10%
29
25
Obs
485
25%
31
25
Sum of Wgt.
485
50%
75%
90%
95%
99%
34
37
39
40
42
Largest
42
43
43
43
Mean
Std. Dev.
34.05979
3.905724
Variance
Skewness
Kurtosis
15.25468
-.241381
2.483957
In this symmetrical distribution, the interval meanSD does a good job of representing 2/3 of the
data. We can recognize it is symmetrical because the mean is very close to the median. If it was
skewed, the mean would differ noticeably from the median.
Now, introducing some skewness, by replacing the 15 missing values with a maternal age of 100,
just for illustration
Data
Create or change variables
Change contents of a variable
Main tab: Variable: matage
New contents: 100
if/in tab: if expression: matage==.
OK
replace matage = 100 if matage==.
Chapter 2-1 (revision 8 Jan 2011)
p. 46
Obtaining the descriptive statistics again,
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Summary statistics
Main tab: Variables: matage
Options: display additional statistics
OK
summarize matage , detail
maternal age
------------------------------------------------------------Percentiles
Smallest
1%
25
23
5%
27
24
10%
29
25
Obs
500
25%
32
25
Sum of Wgt.
500
50%
75%
90%
95%
99%
34
37
40
41
100
Largest
100
100
100
100
Mean
Std. Dev.
36.038
11.89873
Variance
Skewness
Kurtosis
141.5797
4.609086
25.20207
In this distribution, meanSD (3612 is 24 to 48) is nearly the entire distribution, as can be seen
by the frequency table
Statistics
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: matage
OK
tabulate matage
Chapter 2-1 (revision 8 Jan 2011)
p. 47
meanSD: 24 to 48
maternal |
age |
Freq.
Percent
Cum.
------------+----------------------------------23 |
1
0.20
0.20
24 |
1
0.20
0.40
25 |
5
1.00
1.40
26 |
9
1.80
3.20
27 |
14
2.80
6.00
28 |
12
2.40
8.40
29 |
26
5.20
13.60
30 |
27
5.40
19.00
31 |
29
5.80
24.80
32 |
39
7.80
32.60
33 |
45
9.00
41.60
34 |
51
10.20
51.80
35 |
44
8.80
60.60
36 |
31
6.20
66.80
37 |
44
8.80
75.60
38 |
44
8.80
84.40
39 |
27
5.40
89.80
40 |
24
4.80
94.60
41 |
7
1.40
96.00
42 |
2
0.40
96.40
43 |
3
0.60
97.00
100 |
15
3.00
100.00
------------+----------------------------------Total |
500
100.00
|
|
|
|
|
|
interquartile range
32 to 37
The interquartile range, 32 to 37, however, still correctly encloses 1/2 of the distribution and the
median of 34 is still in the center of the distribution.
Exercise Look at Table 1 of the Brady et al. (2000) article. Notice which descriptive statistics
are used for each level of measurement of her variables.
Exercise Look at Table 1 of the Sulkowski et al. (2000) article. Notice which descriptive
statistics are used for each level of measurement of his variables. Why do you think he is using
the median and interquartile range, rather than mean and standard deviation, for obviously
interval level variables?
Exercise Look at Table 3 of the Sachs et al. (2007) article. Notice the use of mean±SD and
median (IQR) in the same table.
Chapter 2-1 (revision 8 Jan 2011)
p. 48
Another Command (table) For Displaying Descriptive Statistics
Perhaps we are interested in discovering if mothers with hypertension give birth earlier, or a
shorter gestational age, than mothers without hypertension. Treating gestational age as a
continuous, or interval scaled variable, we might want to produce a table showing the means and
standard deviations of gestational age, separately for mothers with and without hypertension.
We could do this with the “by” option for the summarize command,
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Summary statistics
Main tab: Variables: gestwks
by/if/in tab: Repeat command by groups:
Variables that define groups: hyp
OK
by hyp, sort : summarize gestwks
* <or>
bysort hyp: summarize gestwks
-> hyp = 0
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------gestwks |
380
38.97521
1.962778
26.95
43.16
-> hyp = 1
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------gestwks |
63
37.83111
2.744347
27.33
42.2
-> hyp = .
Variable |
Obs
Mean
Std. Dev.
Min
Max
-------------+-------------------------------------------------------gestwks |
4
36.9875
3.407701
32.41
39.65
We see the descriptive statistics for every value of hyp, including the subgroup with a missing
value for hyp. This is an easy command to memorize, but it not very close to a table of means
and standard deviations.
Chapter 2-1 (revision 8 Jan 2011)
p. 49
Using the table command,
Statistics
Summaries, tables & tests
Tables
Table of summary statistics (table)
Main tab: Row variable: hyp
Statistics 1: Mean
Variable: gestwks
Statistics 2: Standard deviation Variable: gestwks
Statistics 3: Count nonmissing Variable: gestwks
OK
table hyp, contents(mean gestwks sd gestwks count gestwks)
------------------------------------------------------hypertens | mean(gestwks)
sd(gestwks)
N(gestwks)
----------+-------------------------------------------0 |
38.97521
1.962778
380
1 |
37.83111
2.744347
63
-------------------------------------------------------
The table command has the nice feature that you can have a row variable, superrow variable,
column variable, and supercolumn variable, so you can show up to five different descriptive
statistics for all combinations of up to four categorical variables.
Adding sexalph as a column variable,
Statistics
Summaries, tables & tests
Tables
Table of summary statistics (table)
Main tab: Row variable: hyp
Statistics 1: Mean
Variable: gestwks
Statistics 2: Standard deviation Variable: gestwks
Statistics 3: Count nonmissing Variable: gestwks
Column variable: sexalph
OK
table hyp sexalph, contents(mean gestwks sd gestwks count gestwks)
-----------------------------|sex coded as string
hypertens |
female
male
----------+------------------0 | 38.94313 39.09039
| 2.21838 1.612922
|
179
179
|
1 | 37.54346 38.20912
| 3.12607 2.377861
|
26
34
------------------------------
Chapter 2-1 (revision 8 Jan 2011)
p. 50
To show the mean and standard deviation to one decimal place,
Statistics
Summaries, tables & tests
Tables
Table of summary statistics (table)
Main tab: Row variable: hyp
Statistics 1: Mean
Variable: gestwks
Statistics 2: Standard deviation Variable: gestwks
Statistics 3: Count nonmissing Variable: gestwks
Column variable: sexalph
Options tab: Override display format for numbers in cells:
Create…
Type of data: Numeric
Numeric type: Fixed numeric
Format properties: Justification: Right-justified
Total digits: 4
Digits to the right of decimal: 1
OK
OK
table hyp sexalph, contents(mean gestwks sd gestwks ///
count gestwks) format(%4.1f)
-------------------------| sex coded as
|
string
hypertens | female
male
----------+--------------0 |
38.9
39.1
|
2.2
1.6
|
179
179
|
1 |
37.5
38.2
|
3.1
2.4
|
26
34
--------------------------
Note: In the command box, the symbol “///” was used, which means continue the command on
the following line. That symbol can only be used in the do-file editor—it gives an error message
if run from the command line.
Chapter 2-1 (revision 8 Jan 2011)
p. 51
Another Command (tabstat) For Displaying Descriptive Statistics
There are two disadvantages to the table command: 1) it only permits five different descriptive
statistics, and 2) it does not label the statistics.
The tabstat command permits more than five different descriptive statistics, and it lets label the
statistics, as long as we use column headings for the statistics.
Statistics
Summaries, tables & tests
Tables
Table of summary statistics (tabstat)
Main tab: Variables: gestwks
Group statistics by variable: hyp
Statistics to display: Mean
Standard deviation
Count
Options tab: Use as columns: Statistics
OK
tabstat gestwks, statistics( mean sd count ) by(hyp) ///
columns(statistics)
Summary for variables: gestwks
by categories of: hyp (hypertens)
hyp |
mean
sd
N
---------+-----------------------------0 | 38.97521 1.962778
380
1 | 37.83111 2.744347
63
---------+-----------------------------Total | 38.81251 2.125999
443
----------------------------------------
Chapter 2-1 (revision 8 Jan 2011)
p. 52
Changing the columns to variables, which makes the statistics the rows,
Statistics
Summaries, tables & tests
Tables
Table of summary statistics (tabstat)
Main tab: Variables: gestwks
Group statistics by variable: hyp
Statistics to display: Mean
Standard deviation
Count
Options tab: Use as columns: variables
OK
tabstat gestwks, statistics( mean sd count ) by(hyp) ///
columns(variables)
-> hyp = 0
stats |
gestwks
---------+---------mean | 38.97521
sd | 1.962778
N |
380
--------------------> hyp = 1
stats |
gestwks
---------+---------mean | 37.83111
sd | 2.744347
N |
63
--------------------> hyp = .
stats |
gestwks
---------+---------mean |
36.9875
sd | 3.407701
N |
4
--------------------
Chapter 2-1 (revision 8 Jan 2011)
p. 53
References
Bejon P, Lusingu J, Olotu A, et al. (2008). Efficacy of RTS,S/AS01E vaccine against malaria in
children 5 to 17 months of age. N Engl J Med 359(24):2521-32.
Brady K, Pearlstein T, Asnis GM, et al. (2000). Efficacy and safety of sertraline treatment of
posttraumatic stress disorder: a randomized controlled trial. JAMA 283(14):1837-1844.
J Thorac Cardiovasc Surg (2008). Statistical Methods. J Thorac Cardiovasc Surg 135(1):40A41A.
Munro BH. (2001). Statistical Methods for Health Care Research. 4th ed. Philadelphia,
Lippincott.
Sachs GS, Nierenberg AA, Calabrese JR, et al. (2007). Effectiveness of adjunctive antidepressant
treatment for bipolar depression. N Engl J Med 356(17):1711-1722.
Sulkowski MS, Thomas DL, Chaisson RE, Moore RD. (2000). Hepatotoxicity associated with
antiretroviral therapy in adults infected with human immunodeficiency virus and the role
of hepatitis C or B virus infection. JAMA 283(1):74-80.
Tukey J. (1977). Exploratory data analysis. Reading, MA, Addison-Wesley.
Chapter 2-1 (revision 8 Jan 2011)
p. 54