Chapter 6
Graphics
Modern computers with high-resolution displays and graphic printers have revolutionized the visual display of information in fields ranging from computer-aided design, through flow dynamics, to the spatiotemporal attributes of infectious diseases. The impact on statistics is just being felt. Whole books have been written on statistical graphics, and their contents are quite heterogeneous: simple how-to-do-it texts (e.g., ?; ?), reference works (e.g., Murrell, 2006), and generic treatments of the principles of graphic presentation (e.g., ?). There is even a web site devoted to learning statistics through visualization, http://www.seeingstatistics.com/. Hence, it is not possible to be comprehensive in this chapter. Instead, I focus on the types of graphics used most often in neuroscience (e.g., plots of means) and avoid those seldom used in the field (e.g., pie charts, geographical maps).
There are four major purposes for statistical graphics. First, they are used to examine and screen data to check for abnormalities and to assess the distributions of the variables. Second, graphics are a very useful aid to exploratory data analysis (??). Exploratory data analysis, however, is used for mining large data sets mostly for the purpose of hypothesis generation and modification, so that use of graphics will not be discussed here. Third, graphics can be used to assess both the assumptions and the validity of a statistical model applied to data. Finally, graphics are used to present data to others. The third of these purposes will be discussed in the appropriate sections on the statistics. This chapter deals with the first and last purposes: examining data and presenting results.
6.1 Examining data with graphics
The main purpose here is to view the data to detect outliers and to make
decisions about transforming variables. If the design has groups (even ordered
groups), then your plots should be constructed separately for each group. It is
possible for an outlier in one group to be hidden by the data points for other groups.

Figure 6.1: Examples of histograms: IQ scores from patients in a pediatric neurology clinic. [Two panels; horizontal axis: IQ, vertical axis: Frequency.]
6.1.1 Histograms
In the past, teachers of statistics tortured undergraduate students by giving
them a data set and requiring them to draw a histogram on graph paper. With
access to graphing software, those days are over (mostly). The histogram groups
a numeric variable into “bins” and then plots the midpoint of each bin on the horizontal axis and the frequency of scores within that bin on the vertical axis.
The frequency can be expressed as either the raw number of scores or the percent
of scores in the bin. In general, histograms are best used for relatively large
sample sizes.
The most significant decision for the user is the number of bins. When the
number is too small, information is effectively “hidden.” When the number is
too large, the histogram may have only one or two observations per bar and
provides little information about the distribution. Also, bin size is a function
of sample size. Smaller bins can be used with larger samples. Good statistical
and graphing software usually gives a reasonable default size for the bins based
on sample size. All software will give the user the option of specifying either the
bin size (i.e., width of the bar) or the number of bins.
Figure 6.1 gives histograms of the IQ scores of 63 children seen at a pediatric
neurology clinic. The left-hand plot in Figure 6.1 was generated using the software's default number of bins. The right-hand plot specified 25 bins.
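For instance, with base R's hist() function (assuming the scores are stored in a hypothetical numeric vector named iq), the two panels could be produced along these lines:

# default number of bins chosen by hist()
hist(iq, xlab = "IQ", main = "")
# request roughly 25 bins; hist() treats 'breaks' as a suggestion
# and may adjust it to give pretty cut points
hist(iq, breaks = 25, xlab = "IQ", main = "")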
Note that histograms result in a “wide” plot for a variable. Also, the range
of the horizontal axis is often determined by the software. Hence, comparing
groups using histograms can require extra work and the resulting graphic may not be easy to interpret. There are fancy routines that will plot histograms for two (sometimes three) groups using “transparent” color schemes, but the simplest way to examine groups is through dot plots and/or box plots.

Table 6.1: R code for producing a dot plot (strip chart).

stripchart(pkcgamma$open_arm ~ pkcgamma$genotype,
           method="jitter", vertical=TRUE,
           ylab="Percent Time in Open Arm",
           xlab="Genotype",
           col="blue", pch=1, cex=1.5)
6.1.2 Dot plots (strip charts)
A “dot plot” means different things to different statistical packages. Here, the
term is used in its traditional sense (?) and encompasses what R calls a “strip
chart.” Dot plots can be one of the most useful ways of displaying and perusing
data for neuroscience because sample sizes are usually small to moderate.
The type of dot plot most useful in neuroscience has the groups on the horizontal axis and the values for the variable on the vertical axis. Each observation
then becomes a point in the graph, plotted by its value. Figure 6.2 gives two
examples for the PKC-γ data. The left hand panel gives the traditional dot plot.
In the right hand panel, the points are “jittered” or moved slightly to the left
or right in order to avoid overlapping points. Table 6.1 presents the R code for
these plots. This code produced the right-hand or “jittered” figure. To produce the left-hand figure, omit the argument method="jitter".
The advantages of a dot plot are: (1) every value in the data can be visually
inspected; (2) the range of the data for each group is apparent in the event
that there may be significant differences in variance; and (3) outliers can be
readily identified. The major disadvantage comes when data sets are so large
that overlapping points make it difficult to appreciate the distribution of scores
(although some graphical software can overcome this limitation with moderately
sized data sets). A second disadvantage is that the dot plot does not give
information about the statistics of a distribution.
Figure 6.2: Example of a dot plot. [Left panel: traditional dot plot; right panel: jittered dot plot. Horizontal axis: Genotype (++, +-, --); vertical axis: Percent Time in Open Arm.]

6.1.3 Box (and whisker) plot
Figure 6.3 provides a box plot (aka box-and-whisker plot) of the same data.
In this plot, individual data points are not identified. Instead the shape of
the distribution is visually portrayed through the shape of a box and arms (or
whiskers). The bottom arm begins with the lowest connected data point in the
series (we postpone the definition of a connected data point for the moment).
The lower part of the box starts with the score at the first quartile and the upper part ends with the score at the third quartile. The horizontal line close to the
middle of the box is the median. Finally, the upper “whisker” starts above the
third quartile and ends with the uppermost connected data point. Hence, the
size of the box gives the interquartile range of the data, i.e., the 25th through
the 75th percentiles.
Box plots are more useful than dot plots when there is a moderate to large
number of observations per group. The symmetry of the plot permits one to assess skewness, and box plots are very helpful in detecting outliers. In symmetrical distributions, the median splits the box into equal halves and the two whiskers are of equal length.
A distribution with a positive skew (see the right hand box plot in Figure 6.4)
has a short whisker at the bottom, a median that is located below the half
way point in the box, a long whisker at the top, and a number of unconnected
data points at the high end. A variable with a negative skew has the opposite
shape—short whisker at the top, a median above the half way point in the box,
a long whisker at the bottom, and a number of unconnected data points at the
low end (see Figure 6.4).
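The call behind Figure 6.3 is not shown in the text; a minimal sketch, assuming the same pkcgamma data frame used in Table 6.1, would be:

boxplot(pkcgamma$open_arm ~ pkcgamma$genotype,
        xlab = "Genotype",
        ylab = "Percent Time in Open Arm")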
Figure 6.3: Box plots for the PKC-γ data by genotype. [Horizontal axis: Genotype (++, +-, --); vertical axis: Percent Time in Open Arm.]
6.1.3.1 Unconnected data points
Most graphing software offers two options for dealing with very high and very
low values. The first option is to plot them as unconnected data points above
and below the whiskers. This was the option used to generate Figures 6.3 and
6.4. The second option is to extend the whiskers to the lowest and to the highest values.
There are no formal criteria defining an unconnected data point. Hence, it is
always necessary to consult the manual for the software. Many programs define an unconnected data point as any value more than 1.5 times the interquartile range below the first quartile. For example, if the first quartile is at 31.4 and the third quartile is at 42.7, then the interquartile range is 42.7 − 31.4 = 11.3. Hence the cut-off for a lower unconnected data point would be 31.4 − 1.5 × 11.3 = 14.45. A similar criterion is used for an unconnected data point at the higher end of the distribution.
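In R, for example, these fences can be checked directly (x here is a hypothetical numeric variable):

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr   # e.g., 31.4 - 1.5 * 11.3 = 14.45
upper_fence <- q3 + 1.5 * iqr
# values below lower_fence or above upper_fence are the ones many
# box plot routines draw as unconnected points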
One cautionary note is in order—never consider an “unconnected” data point
an outlier without further evidence. An unconnected data point may achieve
its status simply because sample size is not large and the interquartile range for
a group is small. This is precisely the case for the heterozygote’s box plot in
Figure 6.3. Examination of the highest value for genotype +- in the dot plot of
Figure 6.2 reveals that it belongs with the rest of the distribution. Comparison of the size of the three boxes in Figure 6.3 demonstrates that genotype +- has a smaller interquartile range than the other two.

Figure 6.4: Box plots of skewed variables. [Left panel: negative skew; right panel: positive skew. Vertical axis: Score.]
In a box plot, never consider an unconnected data point as an outlier
without further evidence.
A further issue with unconnected data points is that you will (not may)
observe them in very large samples. Recall the property of the range (see Section
X.X) that as the sample size grows larger, the chance of observing extreme values
increases, and hence the range increases. Very large sample sizes will stabilize
the size of the box and length of the whiskers, but increase the likelihood that
extreme scores will be sampled.
6.1.3.2 Box plots and small samples
Upon learning to screen data using graphics, students often have trouble grasping the extent to which box plots can vary when sample size is small. Figure 6.5
provides an example. Here, ten different samples with an N of 8 in each sample
were generated using random numbers from a normal distribution. Thus, there
are no outliers in any of the samples and there are no significant differences in
variability in any of the samples. One can certainly use graphical methods for
screening, but it is imperative to use objective statistics to assess whether or not the distributions of scores differ across groups.

Figure 6.5: Box plots for 10 random samples of size 8.
6.1.4 Box and dot plots
The flexibility of modern graphics software permits hybrids of plots. The downside is that you must do the work to create the hybrids. One useful hybrid is a
combination of a box plot and a dot plot. This can be performed by creating
a box plot and then superimposing the dot plot over it. An example using the
simulated data from Figure 6.5 is given in Figure 6.6.
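The code for Figure 6.6 is likewise not shown; one possible sketch, reusing the simulated dat data frame from the previous sketch, superimposes stripchart() on boxplot():

boxplot(score ~ sample, data = dat,
        outline = FALSE,                 # suppress outlier symbols so points are not drawn twice
        xlab = "Sample", ylab = "Score")
stripchart(score ~ sample, data = dat,
           vertical = TRUE, method = "jitter",
           pch = 16, add = TRUE)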
6.1.5 Violin Plots
A violin plot combines information from a box plot with smoothed information
about the frequency of scores at a value. To understand a violin plot, it is easier
to look first and explain later, so examine Figure 6.7 which gives a violin plot
for the same data used to construct Figures 6.5 and 6.6.
Figure 6.6: A dot plot superimposed over a box plot.

Figure 6.7: Examples of violin plots. [Violin plots for the ten simulated samples; Sample 1-10 on the horizontal axis.]

In place of a rectangle, the figure in a violin plot is scaled so that the width represents the density of scores at a certain value. If the distribution of scores were normal, then a violin plot would resemble that of Sample 5 but the shape
would be completely symmetric. Sample 3 is almost normal but has a light
negative skew. Samples 1 and 8 illustrate positive skewness. Finally, Samples 4
and 10 depict a uniform distribution (i.e., one with a histogram that resembles
a rectangle).
Violin plots may also include symbols for the median (the white dot in Figure 6.7), lines denoting the interquartile range (the denser vertical line in the figure), and connected data points (the finer vertical line). As in a box plot, there are no universal standards for these symbols or for the definition of a “connected” data point, so always consult the software’s documentation.
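Base R has no violin plot function, and the text does not name the package used for Figure 6.7. As one possibility, the ggplot2 package provides geom_violin(); the median dot and interquartile lines would have to be added separately:

library(ggplot2)                            # assumed package; not named in the text
ggplot(dat, aes(x = sample, y = score)) +   # dat from the earlier simulation sketch
  geom_violin() +
  labs(x = "Sample", y = "Score")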
6.1.6 Assessing distributions
Dot, box, and violin plots are very useful in examining data and can give hints
about the underlying distributions. For many statistical analyses, these graphics
should be sufficient to detect potential problems. In some cases, however, it is
necessary to use stricter criteria to see if the scores fit a specific distribution.
Here, three types of graphics are often used: (1) a histogram with superimposed
plots of a theoretical distribution and/or kernel density; (2) a plot of observed
and theoretical cumulative distribution; and (3) a quantile-quantile or QQ plot.
These are illustrated in Figure 6.8 for 200 scores randomly sampled from a
normal distribution with a mean of 10 and standard deviation of 2.
6.1.6.1 Histogram with theoretical and kernel densities
If a distribution is normal with mean µ and standard deviation σ, then drawing
a normal curve based on those statistics over an observed histogram should
reveal a close fit. The only trick here is to make certain that the scale of the
vertical axis for the histogram is the proportion of data and not a raw count. To
further assess the fit, one should also plot a kernel density. A kernel density may
be viewed as an agnostic (technically, nonparametric) method of estimating the
shape of the distribution. It uses one of several functions (the kernel function)
and applies that function over a small section of the distribution to arrive at
an estimate of the density for a value within that small section. Think of it
as a smoothed histogram. Hence, if the kernel density agrees well with the
theoretical density, then there is good evidence that the observed data follow
the theoretical distribution. Table 6.2 gives the R code for producing the plot
in the upper panel of Figure 6.8. It is clear that the kernel estimate agrees well
with the theoretical normal.
6.1.6.2 Observed and theoretical cumulative distribution
The observed cumulative distribution function for X plots the value of X on
the horizontal axis and the proportion of all observed scores less than or equal
to X on the vertical axis. If the distribution is normal with a mean equal to the
observed mean and a standard deviation equal to the observed one then one can
construct a plot of the area under this normal curve from negative infinity to X
as a function of observed X. If the two plots agree, then there is good evidence that the observed data are normally distributed. Again, one can substitute any theoretical distribution for the normal to see which distribution best fits the data. R code for performing this comparison is given in Table 6.3, the resulting figure being the middle panel of Figure 6.8.

Figure 6.8: Three graphical means for assessing the fit between an observed and theoretical normal distribution: a histogram with kernel density and normal density plot (upper panel); observed and theoretical cumulative density plot (middle panel); and a quantile-quantile or QQ plot (lower panel).

Table 6.2: R code for plotting a histogram, theoretical normal, and kernel density.

hist(x, col="bisque", freq=FALSE)
meanx <- mean(x)
sdx <- sd(x)
curve(dnorm(x, mean=meanx, sd=sdx),
      from=min(x), to=max(x), add=TRUE,
      lty=1, col="black", lwd=3)
kernel <- density(x)
points(kernel$x, kernel$y, type="l",
       lty=2, col="blue", lwd=3)
legend("topleft", c("Normal", "Kernel"), lty=c(1, 2),
       bty="n", cex=.9, col=c("black", "blue"))

Table 6.3: R code for plotting an observed and theoretical cumulative distribution.

plot(ecdf(x), main="Cumulative Density Function",
     ylab="Frequency", xlab=expression(italic(X)),
     pch=1)
meanx <- mean(x)
sdx <- sd(x)
curve(pnorm(x, mean=meanx, sd=sdx),
      from=min(x), to=max(x), add=TRUE,
      lty=1, lwd=3, col="red")
text(min(x), .95, "Empirical cdf", pos=4, cex=.9)
text(min(x), .88, "Theoretical cdf", pos=4,
     col="red", cex=.9)
6.1.6.3 Quantile-quantile or QQ plots
A QQ plot plots the sample scores (aka quantiles) against the theoretical scores
(again, quantiles) expected if the observed data were distributed as a normal
with mean and standard deviation equal to the observed mean and standard
deviation. If the data are normally distributed, the points in this plot should fall on a straight line. Hence, many QQ plot routines will also automatically plot
the expected straight line. Again, substitute any theoretical distribution for the normal one to examine how well that distribution agrees with the observed data. The R commands for generating the graph in the lower panel of Figure 6.8 are

qqnorm(x, main="QQ Plot")
qqline(x, lwd=3, col="red")

Once again, there is very good agreement between the theoretical distribution and the observed distribution.

Figure 6.9: Example of a scatterplot with a bivariate outlier (red square). [Horizontal axis: X; vertical axis: Y.]
6.1.7 Scatterplots
The scatterplot (aka scatter graph or scattergram) depicts the relationship between two numeric variables. Each observation in a data set is an (X, Y) point
in a Cartesian coordinate system where X is the horizontal axis and Y is the
vertical axis. The X value is the score of the observation on the first variable
and the Y is the score on the second variable. Figure 6.9 gives an example.
Scatterplots can be used for various purposes, but in the context of screening
data, their main uses are to detect bivariate outliers and to examine linearity.
The red square in the lower left of Figure 6.9 would be considered a bivariate
outlier that would not be detected as an outlier using univariate dot or box
plots.
Table 6.4: R code to generate a scatterplot with a regression line from an R data set called Scatter.

plot(Scatter$X, Scatter$Y, pch=16, col="blue",
     xlab="X", ylab="Y",
     xlim=c(-3, 3), ylim=c(-3, 3))
abline(lm(Y ~ X, data=Scatter), lwd=3, col="blue")
Recall that it is possible to have a bivariate outlier even though the observation is not an outlier on either of the two variables (see Section X.X). It is not unusual to encounter a student who has always received honors. Neither is it odd to know about an undergraduate who has a 1.0 grade point average for the past semester. But a consistent honors student who just got a 1.0 this semester
is an outlier who demands attention. Bivariate outliers can play havoc with
correlation, regression, and other techniques.
To examine linearity, first remove any outliers. Then calculate a linear regression (see Chapter X.X) and draw the resulting straight line through the data. If the points cluster around the line, then there is visual evidence of linearity. If the shape of the points has a “bend,” then the relationship is not linear. See Section X.X for more information about linearity and Section X.X for methods of dealing with nonlinearity in data. R code for producing the scatterplot and straight line for the data in Figure 6.9 after removing the outlier is given in Table 6.4, and the resulting scatterplot is shown in Figure 6.10.
What happens when you have more than two or three numeric variables?
It is possible to add a third variable to produce a three-dimensional scatter
plot. Good graphical software will permit you to rotate the plot using a computer mouse so that you can view the data from various angles. There are also
complicated statistical algorithms with accompanying graphical tools to detect
what are called multivariate outliers, but they require considerable statistical
expertise. Mere mortals usually create and inspect a scatterplot for each pair
of variables.
For data screening, many statistical packages will create a matrix of scatterplots. Figure 6.11 gives an example using the scatterplotMatrix function
from the car library in R. This shows each possible scatterplot for four variables
(named “A” through “D”) with a box plot of the variable along the diagonal.
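The call that produced Figure 6.11 is not shown; a minimal sketch, assuming the four variables live in a hypothetical data frame named dat4, might be:

library(car)   # provides scatterplotMatrix(); assumed to be installed
# dat4 is a hypothetical data frame with numeric columns A, B, C, and D
scatterplotMatrix(~ A + B + C + D, data = dat4)
# the 'diagonal' argument controls what is drawn on the diagonal
# (box plots in Figure 6.11); its exact form depends on the car version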
6.2 Presenting data with graphics
The number of different graphic types used in neuroscience to present data is legion. They include autoradiographs and their first cousins depicting
fluorescent-labeled molecules, brain images under positron emission tomography
or functional magnetic resonance imaging, charts of electrophysiological data
before and after a stimulus, density of Western blots, and simple plots of means.
Here, we consider only the graphical presentation of basic statistics such as
percentages, means, variances, and correlations.

Figure 6.10: Example of a scatterplot with a regression line. [Horizontal axis: X; vertical axis: Y.]
6.2.1 Categorical data and percents
Surprisingly, plotting categorical variables can be challenging. A plot for a single
categorical variable must always convey either the raw count or the percent
for the individual categories. In popular culture, the typical graph used for
this is the ubiquitous pie chart. Pie charts, however, have been criticized by
several graphic experts (e.g., Cleveland, 1994 X.X; Wilkinson, 2005 X.X), mostly
because we humans do not compare geometric areas very well. Critics prefer
a horizontal bar chart or a simple table in which the categories are ordered by
size. Here, I take no strong position on the controversy. It is clear that in
certain cases, the pie and bar chart convey equivalent information (Spence &
Lewandowsky, 1991 X.X), so it appears best to choose on a case by case basis.
To illustrate both types of graphs, consider a future time when we can catalog the major causes of a complex neurological disorder (e.g., Alzheimer’s disease
or Parkinson’s disease) into the following types: Mendelian disorders, biological pathogens, toxins, trauma, polygenic, and unknown. Someone writing a review article would like to present the proportion of cases that are predominantly caused by each of these.¹ A pie chart of hypothetical data is given in Figure 6.12 and a horizontal bar chart in Figure 6.13. The respective R codes are presented in Tables 6.5 and 6.6.

¹ In reality, several different causes can contribute to a complex disorder. For example, someone’s disorder may arise because of polygenic sensitivity to a pathogen or a toxin.

Figure 6.11: Example of a scatterplot matrix. [Pairwise scatterplots of variables A through D, with a box plot of each variable along the diagonal.]
Many argue that the roughly 1:2:3 ratio of biological pathogens to toxins
to trauma is better captured with the bar chart. For others, it is a matter of
familiarity and style.
Table 6.5: R code for a pie chart.

pct <- c(2, 6, 11, 18, 27, 36)
causes <- c("Mendelian", "Pathogens", "Toxins",
            "Trauma", "Polygenic", "Unknown")
labels <- paste(causes, " ", pct, "%", sep="")
par(mar=c(2, 2, 2, 4))
pie(pct, labels=labels, col=rainbow(length(pct)))

Figure 6.12: Hypothetical causes of a complex neurological disorder: Pie chart. [Segments: Unknown 36%, Polygenic 27%, Trauma 18%, Toxins 11%, Pathogens 6%, Mendelian 2%.]

Figure 6.13: Hypothetical causes of a complex neurological disorder: Horizontal bar chart. [The same percentages as Figure 6.12, displayed as horizontal bars ordered by size.]

Table 6.6: R code for a horizontal bar chart.

pct <- c(2, 6, 11, 18, 27, 36)
causes <- c("Mendelian", "Pathogens", "Toxins",
            "Trauma", "Polygenic", "Unknown")
par(mar=c(2.1, 1, 2.2, 2))
plot(NULL, NULL, xlim=c(-13, 40), ylim=c(.5, 6.5),
     xlab="", xaxt="n", ylab="", yaxt="n")
xpoints <- c(0, seq(from=5, to=40, by=5))
axis(1, at=xpoints,
     labels=paste(xpoints, "%", sep=""),
     cex.axis=.8)
axis(3, at=xpoints,
     labels=paste(xpoints, "%", sep=""),
     cex.axis=.8)
for (i in 1:length(xpoints))
  lines(c(xpoints[i], xpoints[i]), c(0, 6.5),
        col="lightblue")
for (i in 1:length(pct)) {
  text(-1, i, causes[i], pos=2, cex=.9)
  rect(0, i-.3, pct[i], i+.3, col="blue")
  text(pct[i]-.2, i, paste(pct[i], "%", sep=""),
       pos=4, cex=.9)
}

6.2.1.1 Categorical data and group comparisons

A much trickier situation is the comparison of percentages across different groups. Consider a study in genetic epidemiology on genetic sensitivity to stress
and psychopathology. For simplicity, assume that the gene hypothesized to
moderate stress has two alleles, A and a. Stress is measured using a standardized interview that catalogues stressful events within a recent time period and then classifies them as being “controllable” (e.g., having a driver’s license revoked
because of driving while intoxicated) or “uncontrollable” (unexpected death of
a loved one).
A simplified hypothesis of genetic sensitivity would state that people with
all three genotypes experience the same amounts of stress, but given stress,
some genotypes are more likely to develop psychopathology than others. A
preliminary analysis, then, would compare the rates of the three types of stress
in the three genotypes. Graphically, we want to depict the relationship between
two different categorical variables.
The problem with this type of analysis lies in the error associated with the
percents. The standard error of the proportion for the ith category in the jth
group equals
$$\sqrt{\frac{p_{ij}\,(1 - p_{ij})}{N_j}} \qquad (6.1)$$

where $p_{ij}$ equals the proportion of the ith category in the jth group and $N_j$ is the sample size for the jth group. Hence, the standard error depends on the
sample size of the group as well as the observed proportion. When sample sizes are equal, the standard errors will be the same for the same proportion. When sample sizes are not equal, however, visual comparison of the pie charts across groups can be misleading. If, by sampling error alone, the first category is oversampled in one group but undersampled in a second group, then some of the remaining categories must be undersampled in the first group and oversampled in the second group. After all, the area of the whole pie must be 1 in
both groups.
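As a worked illustration (with hypothetical numbers, not taken from the text), a category observed at a proportion of 0.30 in a group of 50 has a standard error of about 0.065 by equation 6.1:

p <- 0.30    # hypothetical observed proportion
N <- 50      # hypothetical group size
sqrt(p * (1 - p) / N)    # equation 6.1: approximately 0.0648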
For this reason, the recommended presentation for group comparisons is a
side-by-side bar chart. Figure 6.14 presents such graphs for simulated data in
the candidate gene-stress example. Both panels convey the same information
but in different forms. Both plot the probability of a type of stress given a
genotype. The upper one places genotype on the horizontal axis while the lower
one has type of stress. Given that the purpose was to assess whether the three
types of stress are equal in the three genotypes, the lower panel is the better
presentation.
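The code that produced Figure 6.14 is not given in the text. A minimal base R sketch of one such side-by-side bar chart, using hypothetical proportions and omitting error bars, might be:

# hypothetical proportions: rows = stress type, columns = genotype
props <- matrix(c(0.55, 0.25, 0.20,
                  0.50, 0.30, 0.20,
                  0.52, 0.28, 0.20),
                nrow = 3,
                dimnames = list(c("None", "Controllable", "Uncontrollable"),
                                c("aa", "Aa", "AA")))
barplot(props, beside = TRUE,
        legend.text = rownames(props),
        xlab = "Genotype", ylab = "Proportion",
        ylim = c(0, 1))

With beside = TRUE, barplot() draws the bars within each column of the matrix next to one another rather than stacking them.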
In comparing percentages across groups, one must be very careful to note
that the errors are not independent. Once again, this derives from the common
sense observation that the proportions within a group must add to 1.0, so sampling error in one category in a group must be “felt” in the other categories. As
a result, do not rely on visual inspection and always defer to a statistical test.
When comparing proportions in different groups, visual interpretation
of error bars can be misleading. Always rely on a statistical test.
Figure 6.14: Distribution of genotype and type of stress. [Upper panel: proportion of each type of stress (None, Controllable, Uncontrollable) plotted by Genotype (aa, Aa, AA). Lower panel: the same proportions with Type of Stress on the horizontal axis and Genotype in the legend.]
Table 6.7: R code to create conditional density plots for audiogenic seizures as a function of sound level in mice administered phenobarbital and vehicle controls.

layout(matrix(1:2, 2))
pheno <- which(Treatment == "Phenobarbital")
veh <- which(Treatment == "Vehicle")
par(mar=c(4, 4, 2, 3))
cdplot(dB[veh], as.factor(Seizure[veh]),
       data=audiogenic,
       main="Vehicle", xlim=c(80, 110),
       xlab="", ylab="Seizure")
cdplot(dB[pheno], as.factor(Seizure[pheno]),
       data=audiogenic,
       main="Phenobarbital", xlim=c(80, 110),
       xlab="Noise (dB)", ylab="Seizure")
The present example is an excellent illustration of this principle. Visual
inspection of the proportions and error bars for the No Stress and Controllable
Stress categories in the lower panel of Figure 6.14 suggests significant differences
among the genotypes. Yet the appropriate statistical test reveals that those
differences fail to reach significance (χ²(4) = 6.64, p = 0.16).
6.2.1.2 Conditional density plots
Conditional density plots are useful for examining how the frequencies of a categorical variable change over a numerical variable. Later, in Section X.X, we
will see how to analyze the presence or absence of audiogenic seizures in mice.
The numerical variable is the intensity of the sound and there are two groups–a
vehicle control and a group administered a dose of phenobarbital. R code to
produce the plot is given in Table 6.7 and the conditional density plots for the
two groups are presented in Figure 6.15.
For each value of Noise, the black area gives the proportion of mice who
did not have seizures and the gray area the proportion who did have seizures.
It is clear that as the intensity of the noise increases, the proportion of mice
who seize also increases. The phenobarbital inhibited seizures at the low to mid
noise levels.
One very important issue about this form of a conditional density plot is
that it smooths the conditional probabilities. Without this smoothing, the plot
would look jagged as plots of observed means usually are.
6.2.2 Plots of means
Plots of means are the heart and soul of experimental neuroscience. There are two major types of graphs used to plot means, the bar chart and the line chart.
Figure 6.15: Conditional density plots for audiogenic seizures as a function of sound level in mice administered phenobarbital and vehicle controls. [Upper panel: Vehicle; lower panel: Phenobarbital. Horizontal axis: Noise (dB) from 80 to 110; vertical axis: Seizure (No/Yes).]
Figure 6.16: Bar and line plots for the means of four groups. [Four panels: bar plots (left) and line plots (right) of the means of groups A-D with +/− 1 sem error bars; the upper panels start the vertical axis at 0, the lower panels at 10.]
When sample size is small, one can also superimpose one of these types over a
dot plot of the individual data points. A sine qua non for plotting means is a
graphic giving the error in estimating the mean. Usually, this is the standard
error of the mean (see Section X.X) but in some cases, confidence intervals (see
Section X.X) are used. Make certain to always specify in the graph which error
statistics are used. If you are using confidence limits, also specify the level of
the intervals (e.g., 90%, 95%).
Figure 6.16 gives both bar and line plots for hypothetical data on four groups.
The difference between the upper two and the lower two plots is in the range
of values plotted on the vertical axis. Notice how the lower plots accentuate
the differences in group means by starting the vertical axis at 10 instead of
0. Which scale should be used? Common sense should be the arbiter here.
Given the observed means in Figure 6.16, both scales convey the appropriate information. Were the measurements in milliseconds and the range of means from 800 to 850 with small standard errors, then a scale from 0 to 1000 could obscure real differences.
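The code behind Figure 6.16 is not shown; a minimal base R sketch of a bar plot of group means with +/− 1 SEM error bars, assuming a hypothetical response y and grouping factor grp, is:

means <- tapply(y, grp, mean)
sems  <- tapply(y, grp, function(v) sd(v) / sqrt(length(v)))
mids  <- barplot(means, ylim = c(0, max(means + sems) * 1.1),
                 xlab = "Group", ylab = "Mean (+/- 1 sem)")
arrows(mids, means - sems, mids, means + sems,
       angle = 90, code = 3, length = 0.05)   # 90-degree arrowheads give flat error-bar caps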
Figure 6.17: Mean plots for two categorical variables: Bar plots. [One panel emphasizes Treatment (Drug A-D on the horizontal axis, bars grouped by Treatment: Control, Ethanol); the other emphasizes Drug (Treatment on the horizontal axis, bars grouped by Drug). Vertical axis: Mean (+/− 1 sem).]

When there is a single classification variable like Group in Figure 6.16, the choice between a bar plot and a line plot is immaterial. When there is more than one classification variable, however, the choice can be tricky. Suppose that A, B,
C, and D were four different drugs and that the design had a second condition
called Treatment, in which Control animals were injected with saline and Ethanol animals were injected with saline plus a certain dose of ethanol. This is a four by two design giving eight groups: Drug A Control, Drug A Ethanol, Drug B Control, etc.
There are two ways of constructing a bar plot and a line plot in this case. One
could put Drug on the horizontal axis with different colored bars or lines for the
two treatment groups. The alternative is to place Treatment on the horizontal
axis and have different colored bars or lines for the four Drug groups. Figure
6.17 presents the two types of bar charts and Figure 6.18 the two orderings for
line plots.
In a bar plot there is a tendency to compare the heights of adjacent bars or
bars that are grouped together. Hence, when Drug is on the horizontal axis, we
tend to compare the mean of the Control to the Ethanol group for each Drug.
When Treatment is on the horizontal axis, we are drawn to the differences
among the four drugs. In a line chart, we have a tendency to compare the lines.
Hence, when Drug is on the horizontal axis, we will usually be drawn to the similarities or differences in the shape of the lines for the Control and Ethanol groups. Placing Treatment on the horizontal axis can accentuate differences among the Drugs. In general, this type of graph emphasizes differences among the levels of the variable that appears in the legend of the graph. Note that this is the typical situation; there will always be exceptions.

Figure 6.18: Mean plots for two categorical variables: Line plots. [One panel emphasizes Treatment (Drug A-D on the horizontal axis, separate lines for Control and Ethanol); the other emphasizes Drug (Treatment on the horizontal axis, separate lines for Drugs A-D). Vertical axis: Mean (+/− 1 sem).]
In a line plot, a second consideration is the number of lines. Usually, the
organization that gives the smaller number of lines is preferable, so if all things are
equal, the left-hand panel of Figure 6.18 would be preferred to the right-hand
one.
The choice between a bar plot and a line plot is often arbitrary. Bar plots
have an advantage in that the error bars for each grouping are separated. In
a line plot with several groups, the error bars may overlap and obscure some
differences. There is one situation, however, where the line plot is superior–
conveying a significant statistical interaction among the variables. A plot of
means for a group is often called a profile and the profile has a shape to it akin
to a two-dimensional representation of a mountain range with peaks and troughs.
Figure 6.19 gives idealistic profile shapes for the means of three categories. The
profile can be flat (panel A), linear (panels B and C), dog legged (D, E, F, G),
or V or inverted-V shaped (H and I). Most observed profiles will have mixtures
of these idealistic shapes. For example, a profile may be fundamentally linear
but with a slight dog leg.
The key point is that a statistical interaction implies that two (or more)
groups have significantly different profile shapes. Using a line plot will easily
convey the different shapes to the reader. With a bar plot, you have to mentally connect the means for each group to assess the interaction.
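For a quick profile plot of this kind, base R's interaction.plot() can be used; a sketch with hypothetical factors drug and treatment and response y (this function does not draw error bars):

interaction.plot(x.factor = drug, trace.factor = treatment,
                 response = y, fun = mean,
                 xlab = "Drug", ylab = "Mean",
                 trace.label = "Treatment")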
As practical advice, try constructing the graph both ways and then accept
the one that best conveys the results of the statistical analysis for the hypothesis
to be tested. If in doubt, show the graphs to one or more colleagues and solicit
their feedback.
6.2.2.1 A note on error bars
There is an unwritten rule of thumb that if the error bars (measured as one standard error of the mean) for two means do not overlap, then the two means will be statistically different from each other. That is often, but not always, the case. There are a number of situations that violate this aphorism. It
is possible to list the most common ones, but there is a much safer and simpler
course of action: always use the appropriate statistical procedure to test for the
difference between means.
Visual inspection of error bars will not always lead to correct inference
about the statistical significance of the difference among means. Always
use the appropriate statistical procedure to test for differences between
means.
Figure 6.19: Idealistic examples of different profile shapes for a line plot of means for a categorical variable with three levels. [Nine panels, (A) through (I), showing flat, linear, dog-legged, and V or inverted-V shaped profiles.]