Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Chapter 2-1. Describing Variables, Levels of Measurement, and Choice of Descriptive Statistics Statisticians In the front of M.G. Kendall and A. Stuarts, The Advanced Theory of Statistics, Vol 2, is a quotation attributed to the fictitious K.A.C. Manderville, The Undoing of Lamia Gurdleneck. "You haven't told me yet," said Lady Nuttal, "what it is your fiance does for a living." "He's a statistician," replied Lamia, with an annoying sense of being on the defensive. Lady Nuttal was obviously taken aback. It had not occurred to her that statisticians entered into normal social relationships. The species, she would have surmised, was perpetuated in some collateral manner, like mules. "But Aunt Sara, it's a very interesting profession," said Lamia warmly. "I don't doubt it," said her aunt, who obviously doubted it very much. "To express anything important in mere figures is so plainly impossible that there must be endless scope for well-paid advice on the how to do it. But don't you think that life with a statistician would be rather, shall we say, humdrum?" Lamia was silent. She felt reluctant to discuss the surprising depth of emotional possibility which she had discovered below Edward's numerical veneer. "It's not the figures themselves," she said finally. "It's what you do with them that matters." Statistics is the Mathematics of Distributions At the individual level, we can describe features with single numbers (e.g., Fred is 26 years old). At the group level, however, we lose this ability. For example, there is not a single number to describe the age of the people in a college classroom. There are many ages represented, which we call a distribution of ages. Statistics can be said to be the mathematics of distributions. It allows us to describe and test theories in our universe, since our universe can be conceived of as an infinite set of distributions. _____________________ Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385 Chapter 2-1 (revision 8 Jan 2011) p. 1 Displaying a Variable Distribution In Windows Explorer, find the datasets & do files subdirectory of the course manual. Double click on births_with_missing.dta to start Stata and read in the data. If that does not work, because the file is not associated with the program name, for example, use the following, File Open Find the directory where you copied the course CD Find the subdirectory datasets & do-files Single click on births_with_missing.dta Open use "C:\Documents and Settings\u0032770.SRVR\Desktop\ Biostats & Epi With Stata\datasets & do-files\ births_with_missing.dta", clear * which must be all on one line, or use: cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "Biostats & Epi With Stata\datasets & do files" use births_with_missing, clear A way to discover the sample size of a dataset, is to use the Stata menus: Data Variable utilities Count observations satisfying condition OK or simply execute the following command in the Command Window: count Notice that anytime you use the menus, a Stata command is constructed and executed and shown in the Results Window. The crudest way to see the distribution of a variable is to simply list its values. Let’s do this for maternal age: Chapter 2-1 (revision 8 Jan 2011) p. 2 Data Describe data List data Main tab: Variables: matage OK list matage <hit abort button – the red dot with a white X> <- abort after one screenful, since we know there are 500 lines of data Clearly, a list of 500 numbers is difficult to comprehend. What we do in statistics is a process called data reduction, which is to reduce the information into a form that is easier to comprehend. Chapter 2-1 (revision 8 Jan 2011) p. 3 The first level of data reduction is the frequency table, which we get using Statistics Summaries, tables & tests Tables One-way tables Main tab: Categorical variable: matage OK tabulate matage <or abbreviate command to:> tab matage <or abbreviate both command and variable name to:> tab mat <or use minimum abbreviation:> ta m This generates the following frequency table. maternal | age | Freq. Percent Cum. ------------+----------------------------------23 | 1 0.21 0.21 24 | 1 0.21 0.41 25 | 5 1.03 1.44 26 | 9 1.86 3.30 27 | 14 2.89 6.19 28 | 12 2.47 8.66 29 | 26 5.36 14.02 30 | 27 5.57 19.59 31 | 29 5.98 25.57 32 | 39 8.04 33.61 33 | 45 9.28 42.89 34 | 51 10.52 53.40 35 | 44 9.07 62.47 36 | 31 6.39 68.87 37 | 44 9.07 77.94 38 | 44 9.07 87.01 39 | 27 5.57 92.58 40 | 24 4.95 97.53 41 | 7 1.44 98.97 42 | 2 0.41 99.38 43 | 3 0.62 100.00 ------------+----------------------------------Total | 485 100.00 The left column is the actual values (scores) for the maternal age variable. The “Freq” column is the number of times that particular value occurred in the data (the frequency). The “Percent” column is the frequency count divided by the sample size (N=485, rather than N=500, which we will discuss shortly). The “Cum” column is the cumulative sum of the Percent column, informing you what percent of the sample had “equal to or less than” that value. Chapter 2-1 (revision 8 Jan 2011) p. 4 By default, Stata commands only operate on non-missing values, which is what you normally want to report in your article. To see the number of missing values with the tabulate command, we simply add the “missing” option: Statistics Summaries, tables & tests Tables One-way tables Main tab: Categorical variable: matage Treat missing values like all other values OK tabulate matage , missing maternal | age | Freq. Percent Cum. ------------+----------------------------------23 | 1 0.20 0.20 24 | 1 0.20 0.40 25 | 5 1.00 1.40 26 | 9 1.80 3.20 27 | 14 2.80 6.00 28 | 12 2.40 8.40 29 | 26 5.20 13.60 30 | 27 5.40 19.00 31 | 29 5.80 24.80 32 | 39 7.80 32.60 33 | 45 9.00 41.60 34 | 51 10.20 51.80 35 | 44 8.80 60.60 36 | 31 6.20 66.80 37 | 44 8.80 75.60 38 | 44 8.80 84.40 39 | 27 5.40 89.80 40 | 24 4.80 94.60 41 | 7 1.40 96.00 42 | 2 0.40 96.40 43 | 3 0.60 97.00 . | 15 3.00 100.00 ------------+----------------------------------Total | 500 100.00 This time we see that n=15 observations have a value of “.”, which is Stata’s reserved symbol for missing value for a numeric variable. Chapter 2-1 (revision 8 Jan 2011) p. 5 Let’s try this for a string variable, which is a variable of letters and/or special symbols (not numbers). Statistics Summaries, tables & tests Tables One-way tables Main tab: Categorical variable: sexalph Treat missing values like all other values OK tabulate sexalph , missing sex coded | as string | Freq. Percent Cum. ------------+----------------------------------| 41 8.20 8.20 female | 220 44.00 52.20 male | 239 47.80 100.00 ------------+----------------------------------Total | 500 100.00 The missing value looks like a space, but it is actually a null value. If you were entering a missing string value using the Stata data editor, you would hit the tab key or the enter key to input the missing value, rather than the space character. When we look at the Freq or Percent columns of the frequency table, we get a sense for what values are most likely. Chapter 2-1 (revision 8 Jan 2011) p. 6 The next level of data reduction is to see this graphically with a histogram. Graphics Histogram Main tab: Variable: matage OK 0 .05 Density .1 .15 histogram matage 25 30 35 maternal age 40 45 Notice that the tick marks don’t line up precisely with the middle of the bars, and that there appears to be a value of maternal age, perhaps age 33, that has a zero frequency. Looking back at the frequency table, however, reveals no age with a zero frequency. Stata is attempting to provide the nicest looking graph, by forming what it considers the number of bins (cut-offs) that should provide the nicest looking graph. In the Results Window, we see what it did: (bin=22, start=23, width=.90909091) This algorithm generally works for a continuous variable that has a large number of distinct values. For our variable, which has a relatively small number of distinct values, we can bypass this feature using the “discrete” option. Chapter 2-1 (revision 8 Jan 2011) p. 7 Graphics Histogram Main tab: Variable: matage Data is discrete OK .06 0 .02 .04 Density .08 .1 histogram matage , discrete 20 25 30 35 maternal age 40 45 The default scale, “density”, is the proportion of the total sample of non-missing values for each specific value of the matage. It is a graph of the Percent column, converted to a proportion, of the frequency table. Chapter 2-1 (revision 8 Jan 2011) p. 8 To change the y-axis to percents, use the percent option Graphics Histogram Main tab: Variable: matage Data is discrete Y-axis: Percent OK 6 0 2 4 Percent 8 10 histogram matage , discrete percent 20 25 Chapter 2-1 (revision 8 Jan 2011) 30 35 maternal age 40 45 p. 9 To change the y-axis to frequency count, use the frequency option Graphics Histogram Main tab: Variable: matage Data is discrete Y-axis: Frequency OK 30 20 0 10 Frequency 40 50 histogram matage , discrete frequency 20 Chapter 2-1 (revision 8 Jan 2011) 25 30 35 maternal age 40 45 p. 10 Many people think its fun to overlay a normal distribution curve on their histogram, to get a feel for how normally distributed the variable is. (Note: the importance of having a normally distributed variable is overrated, as we will see in a later chapter.) To get this, we request the normal option: Graphics Histogram Main tab: Variable: matage Data is discrete Y-axis: Frequency Density Plots tab: Add normal density plot OK 30 20 0 10 Frequency 40 50 histogram matage , discrete frequency normal 20 Chapter 2-1 (revision 8 Jan 2011) 25 30 35 maternal age 40 45 p. 11 To see the line that passes through the top center of each bar, but with smoothing, we can overlay on the histogram what is called a “kernal density plot”, using, Graphics Histogram Main tab: Variable: matage Data is discrete Y-axis: Frequency Density Plots tab: Add kernal density plot OK 30 20 0 10 Frequency 40 50 histogram matage , discrete frequency kdensity 20 25 Chapter 2-1 (revision 8 Jan 2011) 30 35 maternal age 40 45 p. 12 To visually compare the histograms for males and females, we use the by( ) option. Graphics Histogram Main tab: Variable: matage Data is discrete Y-axis: Frequency By tab: Draw subgraphs for unique values of variables Variables: sexalph OK histogram matage, discrete frequency by(sexalph) male 0 10 Frequency 20 30 female 20 30 40 50 20 30 40 50 maternal age Graphs by sex coded as string Notice that it is difficult to tell how much the two histograms overlap, or diverge. Chapter 2-1 (revision 8 Jan 2011) p. 13 One approach we can take is to overlap the kernel density plots for the two sexes, using .06 .04 0 .02 kdensity matage .08 .1 twoway (kdensity matage if sexalph == "female", lcolor(pink)) /// (kdensity matage if sexalph == "male" , lcolor(blue)) 25 30 35 40 45 x kdensity matage kdensity matage Here, the “///” in the command means to continue the command onto the next line, but this only works in the do-file editor. To use that, click on the 5th icon from the right on the menu bar, which looks like a notebook. Type in your command, and hit the “Do” button, which is the last icon on the do-file editor menu bar. In the command window, you would just type the entire command in on one line, without the “///”. The “( ) ( )”, with a command for a different graph in each “( )”, tells Stata to overlay the graphs. Notice it uses “density”, rather than “frequency”. With frequencies, the two sexes could have a different looking line simply due to a different sample size, while using proportions, or densities, make them directly comparable. Chapter 2-1 (revision 8 Jan 2011) p. 14 Another graph to show this, although more complicated to create, is 0 10 5 percent 5 10 clear cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "BiostatsCourse\datasets & do files" use births_with_missing.dta drop if matage==. | sex==. count if sex==1 scalar Nmale=r(N) count if sex==2 scalar Nfemale=r(N) gen one = 1 collapse (count) one , by(sex matage) rename one frequency gen percent=frequency/Nmale*100 if sex==1 replace percent=frequency/Nfemale*100 if sex==2 replace percent = -percent if sex==1 // males on bottom #delimit ; twoway (bar percent matage if sex==1) (bar percent matage if sex==2) , legend(lab(1 "male") lab(2 "female")) ylabel(-10 "10" -5 "5" 0 "0" 5 "5" 10 "10") ; #delimit cr 25 30 35 maternal age male 40 45 female The commands used for this graph are taught in the K30 Computer Practicum (Stata) course. Although fancy, it is still too much information to get your head around. Chapter 2-1 (revision 8 Jan 2011) p. 15 Finally, a frequently reported graph for comparing two or more distributions is the box and whisker plot, or boxplot for short. Graphics Box plot Main tab: Variables: matage By tab: Draw subgraphs for unique values of variables Variables: sexalph OK graph box matage, by(sexalph) male 35 25 30 maternal age 40 45 female Graphs by sex coded as string The advantage of this graph is that it incorporates a side by side comparison, with just the right amount of data reduction, which makes it the most popular approach for visually comparing distributions. Chapter 2-1 (revision 8 Jan 2011) p. 16 Before we discuss what this graph represents, let’s add an outlier to the data to see how it is represented. Go into the data editor [spreadsheet with pencil icon, the “Date Editor (Edit)” in Stata-11], 5th from right on menu bar) and change matage from 34 to 70 for the first observation. Click on the X to exit, and click on accept changes in the exit data editor box. When you clicked on accept changes, it executed the following command: replace matage = 70 in 1 Re-creating the boxplot, Graphics Box plot Main tab: Variables: matage By tab: Draw subgraphs for unique values of variables Variables: sexalph OK graph box matage, by(sexalph) male 50 40 20 30 maternal age 60 70 female Graphs by sex coded as string Notice that the outlier is represented by a dot. The interpretation of a boxplot is shown in the following box. Chapter 2-1 (revision 8 Jan 2011) p. 17 Boxplots This graph is a box-and-whisker plot, or simply, a boxplox. The box shows the interquartile range (IQR) (top of box is the 75th percentile, the bottom of the box is the 25th percentile). The line inside the box is the median (50th percentile). The lines extending beyond the box, which look like error bars, called the whiskers, represent the minimum and maximum. However, if a data value extends beyond 1.5 IQR in either direction, the whiskers exclude that value and the value is shown separately as a point on the graph. These points are called extreme values. Extreme values might represent outliers, an outlier being a data value that appears to have not come from the same population that the rest of sample came from. This graphical approach for identifying outliers was proposed by Tukey (1977). Tukey referred to outliers as “extreme values” to avoid the whole “outlier” issue, it be controversial how outliers should be handled. We will discuss outliers in a later chapter, what to do about them, and how to report or exclude them in your article. For now, if you choose to describe this method for identifying outliers in your study protocol, you could state the following: We will identify outliers using the graphical method proposed by Tukey (1977), which are values that extend beyond 1.5 times the interquartile range in either direction...(and then tell what you intend to do about them). The primary reason for using boxplots is not to identify outliers, but rather to graphically show and perhaps compare distributions. Exercise Look at the boxplot in Figure 3 of Bejon et al (2008). Notice in their footnote they give the same explanation of the boxplot elements as given here. Chapter 2-1 (revision 8 Jan 2011) p. 18 Now that outliers have been described, return the data back to what it was, using replace matage = 34 in 1 Re-creating the boxplot, Graphics Box plot Main tab: Variables: matage By tab: Draw subgraphs for unique values of variables Variables: sexalph OK graph box matage, by(sexalph) male 35 30 25 maternal age 40 45 female Graphs by sex coded as string Notice that the “by” option created a side-by-side graph for each value of the “by” variable. Some other variations of this format will be tried next. Chapter 2-1 (revision 8 Jan 2011) p. 19 To get the box plots to be contained within the same graph, we replace “ by” with “over”. Graphics Box plot Main tab: Variables: matage Categories tab: check box for Group 1 Grouping variable: sexalph OK 35 30 25 maternal age 40 45 graph box matage, over(sexalph) female male Notice both groups are now in the same plot region, rather than two plots that are side by side. The labels “female” and “male” are actually values of the variable sexalph. If we used the variable sex, which has no value labels assigned, we will see the numeric values of the variable. Chapter 2-1 (revision 8 Jan 2011) p. 20 Graphics Box plot Main tab: Variables: matage Categories tab: check box for Group 1 Grouping variable: sex OK 35 25 30 maternal age 40 45 graph box matage, over(sex) 1 2 To assign labels to replace the 1 and 2, we use, Chapter 2-1 (revision 8 Jan 2011) p. 21 Graphics Box plot Main tab: Variables: matage Categories tab: check box for Group 1 Grouping variable: sex Click on “properties” Check “override labels for this group” Label specification: 1 Male 2 Female Accept OK 35 30 25 maternal age 40 45 graph box matage, over(sex, relabel(1 Male 2 Female)) Male Female Note: In the Label Specification, we put the label “Male” by 1 and “Female” by 2. Here, the 1 and 2 simply mean the sort order of the actual values of the variable. If the variable had been scored 0 for male and 1 for female, we would still use “1 Male 2 Female” for the label specification. Chapter 2-1 (revision 8 Jan 2011) p. 22 Multiple “over” groups can be specified to get a clustering effect. We simply specify more grouping variables. Graphics Box plot Main tab: Variables: matage Categories tab: Check box for Group 1 Grouping variable: sex Check box for Group 2 Grouping variable: hyp Accept OK 35 30 25 maternal age 40 45 graph box matage, over(sex) over(hyp) 1 2 0 1 2 1 At the bottom of the graph, the 1 and 2 come from the first specified “over” variable, and the 0 and 1 from the second variable, which are the actual values of these two variables. Chapter 2-1 (revision 8 Jan 2011) p. 23 To assign labels to these variables in the graph, Graphics Box plot Main tab: Variables: matage Categories tab: check box for Group 1 Grouping variable: sex Click on “properties” Check “override labels for this group” Label specification: 1 Male 2 Female check box for Group 1 Grouping variable: hyp Click on “properties” Check “override labels for this group” Label specification: 1 "Maternal Hypertension Present" 2 "Maternal Hypertension Absent" Accept OK 35 30 25 maternal age 40 45 graph box matage, over(sex, relabel(1 Male 2 Female)) /// over(hyp, relabel(1 "Maternal Hypertension Absent" /// 2 "Maternal Hypertension Present")) Male Female Maternal Hypertension Absent Male Female Maternal Hypertension Present Notice for the label specification for variable hyp, we used 1 and 2 to denote the 1st and 2nd sort order values of the variable, rather than using 0 and 1, which are the actual values of the variable. Chapter 2-1 (revision 8 Jan 2011) p. 24 The boxplot is a nice way to show whether or not two or more distributions overlap. However, it is still too much information, and not specific enough to identify the size of the difference. The next level of data reduction is descriptive statistics, which include the mean and standard deviation. This is the most useful level of data reduction and is expected for publication. The most commonly used descriptive statistics are shown on the following pages. Chapter 2-1 (revision 8 Jan 2011) p. 25 Descriptive Statistics Descriptive statistics (or “descriptive measures”) are used to summarize data for a variable. Here are some definitions: n = sample size (can also use “N”) Minimum = smallest value in the data Maximum = largest value in the data Range = Maximum minus Minimum Mode = score with highest frequency Mean = arithmetic average (also called “X bar”) n X X i 1 i n Median = 50th percentile (50% of values below and 50% above) Interquartile Range (IQR) = 75% percentile minus 25% percentile, which is usually indicated by just reporting the two percentiles Variance = a measure of dispersion expressed in squared terms. This is only used in formulas, and never reported as a discriptive measure. n s2 Chapter 2-1 (revision 8 Jan 2011) (X i 1 i X )2 n 1 p. 26 Standard Deviation (SD) = the square root of the variance. This is a measure of dispersion expressed in original units. For a generally symmetrical distribution, approximately two-thirds of the values are in the interval mean 1 S.D. and approximately 95% are in interval mean 2 S.D. n s (X i 1 i X )2 n 1 Looking at the formula for the SD, or s, we observe it looks like the average of the distance from the mean for each value of the variable (average distance from the mean). The only thing strange is the denominator, n-1, instead of n. The n-1 is called the “degrees of freedom.” Only n-1 values are “free to vary”, since two equations are solved simultaneously, the formula for the mean and the formula for the SD. That is, if you know n-1 values of the variable when computing the variance or SD, the final value can only be a single specific number, a plugged value, or otherwise a different value for the mean would have been necessary (Munro, 2001, pp.84-85). So, the SD is the average distance from the mean of the values which can contribute to the variability (those free to vary). Intuitively, you can simply consider the SD as the average deviation, or distance, from the mean, since n-1 is approximately n. Standard Error of the Mean (SEM) = standard deviation of the sample mean, a measure of variation in our estimate of the mean. It is the standard deviation divided by the square root of the sample size. SEM s X s n If the context is clear, that you are talking specifically about the mean, this is also just called the standard error (SE). This statistic will be discussed further below. Chapter 2-1 (revision 8 Jan 2011) p. 27 Frequency Proportion Describing a Variable (Average and Variability) When describing the distribution of a variable with descriptive statistics, the goal is to summarize the distribution with as few numbers as possible. It is generally sufficient to report the average, or center of the distribution, and the variability, or dispersion (how spread out the data are). If the distribution is at least approximately symetrical, it will look somewhat like a theoretical distribution called the normal distribution (bell shaped curve). A normal distribution with a mean of 0 and a standard deviation (SD) of 1 will look like: .5 Mean = 0 .4 .3 SD = 1 .2 .1 0 -4 -3 -2 -1 0 1 Variable Scores 2 3 4 For a normal curve, the SD on the x-axis is where the histogram, or curve, switches from concave down to concave up. Chapter 2-1 (revision 8 Jan 2011) p. 28 If you are curious how the above graph was drawn, the following was run in the do-file editor, #delimit ; twoway (function y=normalden(x,0,1), clwidth(*1.5) range(-4 4)) (pci .4 0 0 0 , lwidth(*1.25) lcolor(red)) (pci .24 -1 .24 1, lwidth(*1.25) msize(*1.25) lcolor(red)) (pcarrowi .43 .3 .405 0 , lwidth(*1.5) msize(*1.25) lcolor(black) mcolor(black)) (pcarrowi .27 1.4 .24 1.1 , lwidth(*1.5) msize(*1.25) lcolor(black) mcolor(black)) , ytitle(Frequency Proportion) xtitle(Variable Scores,height(5)) xlabels(-4(1)4) ylabels(0(.1).5,angle(horizontal)) legend(off) text(.435 .32 "Mean = 0", placement(ne)) text(.27 1.5 "SD = 1", placement(ne)) ; #delimit cr The standard deviation (SD) is a measure of the scatter, or dispersion, around the mean. To illustrate this, in the next figure three normal distributions are drawn, each with the same mean, for various SDs. Frequency Proportion .2 .15 .1 Mean = 50 , SD = 2 Mean = 50 , SD = 7 .05 Mean = 50 , SD = 20 0 0 10 20 Chapter 2-1 (revision 8 Jan 2011) 30 40 50 60 Variable Scores 70 80 90 100 p. 29 If you are curious how the above graph was drawn, the following was run in the do-file editor, #delimit ; twoway (function y=normalden(x,50,2), clwidth(*1.5) lcolor(blue) range(0 100)) (function y=normalden(x,50,7), clwidth(*1.5) lcolor(red) range(0 100)) (function y=normalden(x,50,20), clwidth(*1.5) lcolor(green) range(0 100)) (pcarrowi .07 60 .065 55 , lwidth(*1.5) msize(*1.25) lcolor(blue) mcolor(blue)) (pcarrowi .05 63 .04 58 , lwidth(*1.5) msize(*1.25) lcolor(red) mcolor(red)) (pcarrowi .03 72 .02 66 , lwidth(*1.5) msize(*1.25) lcolor(green) mcolor(green)) , ytitle(Frequency Proportion) xtitle(Variable Scores,height(5)) xlabels(0(10)100) ylabels(0(.05).22,angle(horizontal)) legend(off) text(.07 61 "Mean = 50 , SD = 2", placement(e) color(blue) size(*1.1)) text(.05 64 "Mean = 50 , SD = 7", placement(e) color(red) size(*1.1)) text(.03 73 "Mean = 50 , SD = 20", placement(e) color(green) size(*1.1)) ; #delimit cr Chapter 2-1 (revision 8 Jan 2011) p. 30 Frequency Proportion For a normal distribution, the following percentages of the distribution are between: mean 1 S.D : 68.3% mean 2 S.D : 95.4% mean 3 S.D : 99.7% as shown in the following graph, Mean = 0 , SD = 1 Mean 1SD (middle 68.3%) .5 Mean 2SD (middle 95.4%) .4 .3 Mean 3SD (middle 99.7%) .2 .1 0 -4 -3 -2 -1 0 1 Variable Scores 2 3 4 If you are curious how this graph was drawn, the following was run in the do-file editor, #delimit ; twoway (function y=normalden(x,0,1), clwidth(*1.5) range(-4 4)) (pci .55 -1 0 -1 , lwidth(*2) lcolor(red)) (pci .55 1 0 1 , lwidth(*2) lcolor(red)) (pci .46 -2 0 -2 , lwidth(*2) lcolor(blue)) (pci .46 2 0 2 , lwidth(*2) lcolor(blue)) (pci .27 -3 0 -3 , lwidth(*2) lcolor(green)) (pci .27 3 0 3 , lwidth(*2) lcolor(green)) , ytitle(Frequency Proportion) xtitle(Variable Scores,height(5)) xlabels(-4(1)4) title("Mean = 0 , SD = 1") ylabels(0(.1).55,angle(horizontal)) legend(off) text(.55 0 "Mean{&plusminus}1SD", placement(c) color(red)) text(.51 0 "(middle 68.3%)", placement(c) color(red)) text(.46 0 "Mean{&plusminus}2SD", placement(c) color(blue)) text(.42 0 "(middle 95.4%)", placement(c) color(blue)) text(.27 0 "Mean{&plusminus}3SD", placement(c) color(green)) text(.24 0 "(middle 99.7%)", placement(c) color(green)) ; #delimit cr Chapter 2-1 (revision 8 Jan 2011) p. 31 If a variable is approximately symmetrical, it will usually mimic a normal distribution. When you report meanSD in your article, the reader knows that about 2/3 of the values fall in that range. The two numbers taken together, then, do a very nice job of describing the distribution. To take a guess of what the age will be of next mother presenting at the London hospital for delivery, we would say that our best guess is 34 years old (the mean), but it would not be surprising if her age differs from that guess by up to 4 years younger or 4 years older ( one standard deviation). We can confidently use that range because we will be right two-thirds of the time. Consistent schema suggested by the J Thorac Cardiovasc Surg The editors of The Journal of Thoracic and Cardivascular Surgery, in their statistical instructions to authors, suggest a consistent scheme for bounding the same percent of observations for various levels of measurement, approximately 70% (J Thorac Cardiovasc Surg, 2008), “A consistent schema for expressive variability would include ±1 standard deviation for normally distributed continuous variables, 15 and 85 percentiles for skewed distributions, and 70% confidence limits for proportions.” A “skewed” distribution is one that is not symmetrical. Do not worry about this suggestion. There is really no need for such consistency, because there is nothing special about the middle 70% of observations. So, this suggestion will probably not gain widespread use, even in that journal. Chapter 2-1 (revision 8 Jan 2011) p. 32 Frequency Proportion When the distribution is perfectly symmetrical, the mean, median, and mode are identically the same number. .5 Mean = Median = Mode .4 .3 .2 .1 0 -4 -3 -2 -1 0 1 Variable Scores 2 3 4 which was drawn using, #delimit ; twoway (function y=normalden(x,0,1), clwidth(*1.5) range(-4 4)) (pci .4 0 0 0 , lwidth(*1.25) lcolor(red)) (pcarrowi .44 0 .41 0 , lwidth(*1.5) msize(*1.25) lcolor(black) mcolor(black)) , ytitle(Frequency Proportion) xtitle(Variable Scores,height(5)) xlabels(-4(1)4) ylabels(0(.1).5,angle(horizontal)) legend(off) text(.46 0 "Mean = Median = Mode", placement(c)) ; #delimit cr Chapter 2-1 (revision 8 Jan 2011) p. 33 Frequency Proportion For skewed distributions, the mode remains at the most frequent value, which is the variable score corresponding to the maximun height of the frequency distribution. The mean always gets pulled in the direction of the skewness, and the median always lies in between the mean and mode. So, you can visually compare the mean to the median to determine if the distribution is skewed, and in which direction. Left Skewed Distribution Mode .25 Median Mean .2 .15 .1 .05 0 0 10 5 15 Variable Scores Chapter 2-1 (revision 8 Jan 2011) p. 34 Frequency Proportion Right Skewed Distribution .25 Mode Median Mean .2 .15 .1 .05 0 0 5 10 15 Variable Scores Chapter 2-1 (revision 8 Jan 2011) p. 35 which was drawn using, * --- left skewed distribution (guessing at values of statistics) preserve clear set obs 140 gen x=_n/10 gen y=gammaden(4,1,0,x) gsort -x drop x gen x=_n/10 #delimit ; twoway (line y x , clwidth(*1.5)) (pci .24 11.1 0 11.1 , lwidth(*1.25) lcolor(red)) (pci .22 10 0 10 , lwidth(*1.25) lcolor(blue)) (pci .20 8 0 8 , lwidth(*1.25) lcolor(green)) , ytitle(Frequency Proportion) legend(off) xtitle(Variable Scores,height(5)) title("Left Skewed Distribution") ylabels(0(.05).25,angle(horizontal)) text(.25 11.1 "Mode", placement(c) color(red)) text(.23 10 "Median", placement(c) color(blue)) text(.21 8 "Mean", placement(c) color(green)) ; #delimit cr restore * --- right skewed distribution (guessing at values of statistics) preserve clear set obs 140 gen x=_n/10 gen y=gammaden(4,1,0,x) #delimit ; twoway (line y x , clwidth(*1.5)) (pci .24 3 0 3 , lwidth(*1.25) lcolor(red)) (pci .22 4.2 0 4.2 , lwidth(*1.25) lcolor(blue)) (pci .20 6 0 6 , lwidth(*1.25) lcolor(green)) , ytitle(Frequency Proportion) legend(off) xtitle(Variable Scores,height(5)) title("Right Skewed Distribution") ylabels(0(.05).25,angle(horizontal)) text(.25 3 "Mode", placement(c) color(red)) text(.23 4.2 "Median", placement(c) color(blue)) text(.21 6 "Mean", placement(c) color(green)) ; #delimit cr restore Chapter 2-1 (revision 8 Jan 2011) p. 36 Standard Error of the Mean The standard error (SE) of the mean, or SEM, reflects the uncertainty in our sample estimate of the mean, or the amount of variability in the mean that is expected by sampling variation. If we took a large number of samples of size n, from a population with the same mean and standard deviation as observed in the original sample, and computed the mean for each sample, the distribution of these means would have a standard deviation (SD) equal to the standard error (SE) from the original sample. In other words, the SE is the SD of the estimates of the mean. To verify that the standard error (SE) is the standard deviation (SD) of the distribution of means from repeated samples, we will simulate this. Computing some descriptive statistics for matage, use births_with_missing, clear summarize matage * <or> sum matage Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------matage | 485 34.05979 3.905724 23 43 Next, we will draw 10,000 samples of size n=485 from a normal distribution with mean of 34.05979 and SD of 3.905724. Each time we will compute the mean and save it to a file. clear set obs 1 gen mean=. save holdmeans, replace * forval i=1/10000 { quietly clear quietly set obs 485 quietly gen matage = invnorm(uniform())*3.905724+34.05979 quietly sum matage quietly gen mean=r(mean) quietly keep mean quietly keep in 1 quietly append using holdmeans quietly save holdmeans, replace } sum display "original sample SE = SD/sqrt(N) = " 3.905724/sqrt(485) histogram mean , normal The result follows, which matches closely, only being off by 1 point on the 3rd decimal place. Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------mean | 10000 34.06112 .1767598 33.44242 34.73396 original sample SE = SD/sqrt(N) = .17734979 Chapter 2-1 (revision 8 Jan 2011) p. 37 1.5 0 .5 1 Density 2 2.5 Here is the graph of the 10,000 means computed from the 10,000 samples. Notice it has a very narrow distribution. The SE is 1/sqrt(N) as wide as the original maternal ages SD, or 1/sqrt(485) = 1/22 as wide. 33.5 34 34.5 35 mean The SE is always much narrower than the SD. That is because the mean is always very close to the center of the individual scores, so sampling variability has a much smaller effect on it’s estimate. We are getting a bit ahead of the material covered, but this graph will immediate raise a question in the mind of the reader who has reported research results using error bars, so we might as well address it right now. When you present your data using a bar chart with error bars, where the height of the bar is the mean and the error bar is the amount of variability expected due to sampling variation, should you use the SD or SE for the error bar? (Another choice is using the 95% confidence interval, which will be introduced in Chapter 2-4.) These graphs are very popular in basic science. To give an example that is more in line with the small sample used in such studies, let’s take a random sample of n=10 from the maternal age variable. use births_with_missing, clear * set seed 999 // so we get the same random sample each time sample 10 , count // random sample of n=10 * sum matage Variable | Chapter 2-1 (revision 8 Jan 2011) Obs Mean Std. Dev. Min Max p. 38 -------------+-------------------------------------------------------matage | 10 35.3 3.653005 29 40 Computing the standard error, display "SE = SD/sqrt(N) = " 3.653005/sqrt(10) SE = SD/sqrt(N) = 1.1551816 To see the dramatic difference created by our choice for the error bars, the following figure shows the mean with a SD error bar and a SE error bar. 50 45 40 Maternal Age 35 30 25 20 15 10 5 0 mean+/-SD mean+/-SE The second bar clearly looks “more precise” than the second bar. In the first bar, the mean is 35.3, but is likely to vary as much as 3.7, the height of the error bar. In the second bar, the mean is 35.3, but is likely to vary only by as much as 1.6. This makes it “seem like cheating” to some researchers, when the article displays the SE for the error bar, rather than the SD. (You should always say in the figure legend which one you used.) Chapter 2-1 (revision 8 Jan 2011) p. 39 Which to use? The answer is simple if you just stop to think about it. It depends on if you are presenting a description of your sample, which is the individual observations, or if you are presenting an outcome. The individual observations are described with their mean and standard deviation,where the SD reflects how much an individual observation is likely to vary (the middle 2/3rds) due to sampling variation. The outcome is always reported by some mean (mean, regression slope, risk ratio, etc). The standard error of that estimate reflects how much the effect, the mean in this case, is likely to vary (the middle 2/3rds) due to sampling variation. A formal argument for using SEM, instead of SD, when describing results is given in Ch 2-2. The argument requires the concept of statistical regularity, which is presented in that chapter. If you are curious, the above graph was generated by the following Stata code executed in the dofile editor, * mean +/- SD in first row: 35.30 +/- 3.65 * mean +/- SE in second row: 35.30 +/- 1.16 clear input x mean lower upper 1 35.30 31.65 38.95 2 35.30 34.14 36.46 end * #delimit ; twoway (bar mean x , bcolor(black) barwidth(.5)) (rcap lower upper x, lcolor(black) blwidth(medthick) msize(large)) , legend(off) scheme(s1mono) plotregion(style(none)) xscale(range(.5 2.5)) ylabels(0(5)50, angle(horizontal)) xlabels(1 "mean+/-SD" 2 "mean+/-SE" , notick) ytitle("Maternal Age") xtitle(" ") aspectratio(1) ; #delimit cr Chapter 2-1 (revision 8 Jan 2011) p. 40 The choice of the many available descriptive statistics that we can choose from to report in a manuscript is based on the level of measurement of the variable. Measurement Scale (also called level of measurement) Dichotomous scale (dichotomy, 2 categories) (e.g., sex: male or female) Nominal scale name (unordered named categories) (e.g., therapy: radiation, surgical, chemo) Ordinal scale name + order (ordered categories) (e.g., size: small, medium, large) Interval scale name + order + equal intervals + arbitrary zero point (e.g., temperature, 0 degrees does not imply absence of temperature, the 0 point is just a convention. Ratios do not make sense--you would not say 80° is twice as hot as 40°.) Ratio scale name + order + equal intervals + absolute zero point (e.g., height, 0mm means no height. Ratios make sense–an object 10mm tall is twice the height as an object 5mm tall.) Note 1: it makes sense to do arithmetic on interval scaled variables, since this scale is sufficiently close to our notion of integers and real numbers (both number systems have equal intervals). It does not make sense to do arithmetic on nominal and ordinal scales, since these scales do not have equal intervals. Note 2: for purposes of choosing test statistics, interval and ratio scales are considered equivalent. A second measurement scale scheme is: Binary data Unordered categorical data Ordered categorical data Continuous data Chapter 2-1 (revision 8 Jan 2011) (dichotomous scale) (nominal scale) (ordinal scale) (interval & ratio scales) p. 41 Exercise Think about what arithmetic makes sense for a given level of measurement. An “X” is shown in every cell of the following table for which the descriptive measure is a legitimate statististic for that level of measurement. Do you understand why? dichotomous nominal ordinal interval binary categorical ordered categorical continuous n X X X X % X X X X min X X max X X range X X X X mode X X mean X median X X IQR X X std. dev. X std. err. X note: In Chapter 2-6, it is shown that a dichotomous scale is also an interval scale, though it is widely treated as a nominal scale. Chapter 2-1 (revision 8 Jan 2011) p. 42 Exercise This time an “X” is shown in every cell of the following table for which the descriptive measure might be useful in a “Table 1. Patient Characteristics” table of a research article. Do you understand why? dichotomous nominal ordinal interval binary categorical ordered categorical continuous X n X X X % X X X min max range mode mean X median X X IQR X X, if shown as 25th , 75th percentiles std. dev. X std. err. Chapter 2-1 (revision 8 Jan 2011) p. 43 Exercise For each level of measurement, the “official” (most informative) measure of average (central tendency) and the “official” (most informative) measure of variability (dispersion) is shown. Do you understand why? dichotomous nominal ordinal interval binary categorical ordered categorical continuous Avg, Var Avg, Var n % min max range mode mean Avg median Avg Avg if skewed IQR Var Var if skewed std. dev. Var std. err. Chapter 2-1 (revision 8 Jan 2011) p. 44 Bringing our original dataset back into Stata memory, provided our working directory is where the data file is (otherwise you can read it back in with the “File” command on the Stata menu), use births_with_missing, clear Since maternal age is a continuous variable, or interval scaled variable, we can use the mean and standard deviation to summarize the data. Statistics Summaries, tables & tests Summary and descriptive statistics Summary statistics Main tab: Variables: matage OK summarize matage Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------matage | 485 34.05979 3.905724 23 43 To obtain greater detail, Statistics Summaries, tables & tests Summary and descriptive statistics Summary statistics Main tab: Variables: matage Options: display additional statistics OK summarize matage , detail maternal age ------------------------------------------------------------Percentiles Smallest 1% 25 23 5% 27 24 10% 29 25 Obs 485 25% 31 25 Sum of Wgt. 485 50% 75% 90% 95% 99% 34 37 39 40 42 Chapter 2-1 (revision 8 Jan 2011) Largest 42 43 43 43 Mean Std. Dev. 34.05979 3.905724 Variance Skewness Kurtosis 15.25468 -.241381 2.483957 p. 45 To see why median and interquartile range summarize the data better for skewed distributions, let’s introduce some skewness. The original data are: maternal age ------------------------------------------------------------Percentiles Smallest 1% 25 23 5% 27 24 10% 29 25 Obs 485 25% 31 25 Sum of Wgt. 485 50% 75% 90% 95% 99% 34 37 39 40 42 Largest 42 43 43 43 Mean Std. Dev. 34.05979 3.905724 Variance Skewness Kurtosis 15.25468 -.241381 2.483957 In this symmetrical distribution, the interval meanSD does a good job of representing 2/3 of the data. We can recognize it is symmetrical because the mean is very close to the median. If it was skewed, the mean would differ noticeably from the median. Now, introducing some skewness, by replacing the 15 missing values with a maternal age of 100, just for illustration Data Create or change variables Change contents of a variable Main tab: Variable: matage New contents: 100 if/in tab: if expression: matage==. OK replace matage = 100 if matage==. Chapter 2-1 (revision 8 Jan 2011) p. 46 Obtaining the descriptive statistics again, Statistics Summaries, tables & tests Summary and descriptive statistics Summary statistics Main tab: Variables: matage Options: display additional statistics OK summarize matage , detail maternal age ------------------------------------------------------------Percentiles Smallest 1% 25 23 5% 27 24 10% 29 25 Obs 500 25% 32 25 Sum of Wgt. 500 50% 75% 90% 95% 99% 34 37 40 41 100 Largest 100 100 100 100 Mean Std. Dev. 36.038 11.89873 Variance Skewness Kurtosis 141.5797 4.609086 25.20207 In this distribution, meanSD (3612 is 24 to 48) is nearly the entire distribution, as can be seen by the frequency table Statistics Summaries, tables & tests Tables One-way tables Main tab: Categorical variable: matage OK tabulate matage Chapter 2-1 (revision 8 Jan 2011) p. 47 meanSD: 24 to 48 maternal | age | Freq. Percent Cum. ------------+----------------------------------23 | 1 0.20 0.20 24 | 1 0.20 0.40 25 | 5 1.00 1.40 26 | 9 1.80 3.20 27 | 14 2.80 6.00 28 | 12 2.40 8.40 29 | 26 5.20 13.60 30 | 27 5.40 19.00 31 | 29 5.80 24.80 32 | 39 7.80 32.60 33 | 45 9.00 41.60 34 | 51 10.20 51.80 35 | 44 8.80 60.60 36 | 31 6.20 66.80 37 | 44 8.80 75.60 38 | 44 8.80 84.40 39 | 27 5.40 89.80 40 | 24 4.80 94.60 41 | 7 1.40 96.00 42 | 2 0.40 96.40 43 | 3 0.60 97.00 100 | 15 3.00 100.00 ------------+----------------------------------Total | 500 100.00 | | | | | | interquartile range 32 to 37 The interquartile range, 32 to 37, however, still correctly encloses 1/2 of the distribution and the median of 34 is still in the center of the distribution. Exercise Look at Table 1 of the Brady et al. (2000) article. Notice which descriptive statistics are used for each level of measurement of her variables. Exercise Look at Table 1 of the Sulkowski et al. (2000) article. Notice which descriptive statistics are used for each level of measurement of his variables. Why do you think he is using the median and interquartile range, rather than mean and standard deviation, for obviously interval level variables? Exercise Look at Table 3 of the Sachs et al. (2007) article. Notice the use of mean±SD and median (IQR) in the same table. Chapter 2-1 (revision 8 Jan 2011) p. 48 Another Command (table) For Displaying Descriptive Statistics Perhaps we are interested in discovering if mothers with hypertension give birth earlier, or a shorter gestational age, than mothers without hypertension. Treating gestational age as a continuous, or interval scaled variable, we might want to produce a table showing the means and standard deviations of gestational age, separately for mothers with and without hypertension. We could do this with the “by” option for the summarize command, Statistics Summaries, tables & tests Summary and descriptive statistics Summary statistics Main tab: Variables: gestwks by/if/in tab: Repeat command by groups: Variables that define groups: hyp OK by hyp, sort : summarize gestwks * <or> bysort hyp: summarize gestwks -> hyp = 0 Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------gestwks | 380 38.97521 1.962778 26.95 43.16 -> hyp = 1 Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------gestwks | 63 37.83111 2.744347 27.33 42.2 -> hyp = . Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------gestwks | 4 36.9875 3.407701 32.41 39.65 We see the descriptive statistics for every value of hyp, including the subgroup with a missing value for hyp. This is an easy command to memorize, but it not very close to a table of means and standard deviations. Chapter 2-1 (revision 8 Jan 2011) p. 49 Using the table command, Statistics Summaries, tables & tests Tables Table of summary statistics (table) Main tab: Row variable: hyp Statistics 1: Mean Variable: gestwks Statistics 2: Standard deviation Variable: gestwks Statistics 3: Count nonmissing Variable: gestwks OK table hyp, contents(mean gestwks sd gestwks count gestwks) ------------------------------------------------------hypertens | mean(gestwks) sd(gestwks) N(gestwks) ----------+-------------------------------------------0 | 38.97521 1.962778 380 1 | 37.83111 2.744347 63 ------------------------------------------------------- The table command has the nice feature that you can have a row variable, superrow variable, column variable, and supercolumn variable, so you can show up to five different descriptive statistics for all combinations of up to four categorical variables. Adding sexalph as a column variable, Statistics Summaries, tables & tests Tables Table of summary statistics (table) Main tab: Row variable: hyp Statistics 1: Mean Variable: gestwks Statistics 2: Standard deviation Variable: gestwks Statistics 3: Count nonmissing Variable: gestwks Column variable: sexalph OK table hyp sexalph, contents(mean gestwks sd gestwks count gestwks) -----------------------------|sex coded as string hypertens | female male ----------+------------------0 | 38.94313 39.09039 | 2.21838 1.612922 | 179 179 | 1 | 37.54346 38.20912 | 3.12607 2.377861 | 26 34 ------------------------------ Chapter 2-1 (revision 8 Jan 2011) p. 50 To show the mean and standard deviation to one decimal place, Statistics Summaries, tables & tests Tables Table of summary statistics (table) Main tab: Row variable: hyp Statistics 1: Mean Variable: gestwks Statistics 2: Standard deviation Variable: gestwks Statistics 3: Count nonmissing Variable: gestwks Column variable: sexalph Options tab: Override display format for numbers in cells: Create… Type of data: Numeric Numeric type: Fixed numeric Format properties: Justification: Right-justified Total digits: 4 Digits to the right of decimal: 1 OK OK table hyp sexalph, contents(mean gestwks sd gestwks /// count gestwks) format(%4.1f) -------------------------| sex coded as | string hypertens | female male ----------+--------------0 | 38.9 39.1 | 2.2 1.6 | 179 179 | 1 | 37.5 38.2 | 3.1 2.4 | 26 34 -------------------------- Note: In the command box, the symbol “///” was used, which means continue the command on the following line. That symbol can only be used in the do-file editor—it gives an error message if run from the command line. Chapter 2-1 (revision 8 Jan 2011) p. 51 Another Command (tabstat) For Displaying Descriptive Statistics There are two disadvantages to the table command: 1) it only permits five different descriptive statistics, and 2) it does not label the statistics. The tabstat command permits more than five different descriptive statistics, and it lets label the statistics, as long as we use column headings for the statistics. Statistics Summaries, tables & tests Tables Table of summary statistics (tabstat) Main tab: Variables: gestwks Group statistics by variable: hyp Statistics to display: Mean Standard deviation Count Options tab: Use as columns: Statistics OK tabstat gestwks, statistics( mean sd count ) by(hyp) /// columns(statistics) Summary for variables: gestwks by categories of: hyp (hypertens) hyp | mean sd N ---------+-----------------------------0 | 38.97521 1.962778 380 1 | 37.83111 2.744347 63 ---------+-----------------------------Total | 38.81251 2.125999 443 ---------------------------------------- Chapter 2-1 (revision 8 Jan 2011) p. 52 Changing the columns to variables, which makes the statistics the rows, Statistics Summaries, tables & tests Tables Table of summary statistics (tabstat) Main tab: Variables: gestwks Group statistics by variable: hyp Statistics to display: Mean Standard deviation Count Options tab: Use as columns: variables OK tabstat gestwks, statistics( mean sd count ) by(hyp) /// columns(variables) -> hyp = 0 stats | gestwks ---------+---------mean | 38.97521 sd | 1.962778 N | 380 --------------------> hyp = 1 stats | gestwks ---------+---------mean | 37.83111 sd | 2.744347 N | 63 --------------------> hyp = . stats | gestwks ---------+---------mean | 36.9875 sd | 3.407701 N | 4 -------------------- Chapter 2-1 (revision 8 Jan 2011) p. 53 References Bejon P, Lusingu J, Olotu A, et al. (2008). Efficacy of RTS,S/AS01E vaccine against malaria in children 5 to 17 months of age. N Engl J Med 359(24):2521-32. Brady K, Pearlstein T, Asnis GM, et al. (2000). Efficacy and safety of sertraline treatment of posttraumatic stress disorder: a randomized controlled trial. JAMA 283(14):1837-1844. J Thorac Cardiovasc Surg (2008). Statistical Methods. J Thorac Cardiovasc Surg 135(1):40A41A. Munro BH. (2001). Statistical Methods for Health Care Research. 4th ed. Philadelphia, Lippincott. Sachs GS, Nierenberg AA, Calabrese JR, et al. (2007). Effectiveness of adjunctive antidepressant treatment for bipolar depression. N Engl J Med 356(17):1711-1722. Sulkowski MS, Thomas DL, Chaisson RE, Moore RD. (2000). Hepatotoxicity associated with antiretroviral therapy in adults infected with human immunodeficiency virus and the role of hepatitis C or B virus infection. JAMA 283(1):74-80. Tukey J. (1977). Exploratory data analysis. Reading, MA, Addison-Wesley. Chapter 2-1 (revision 8 Jan 2011) p. 54