JMP Tutorial ~ Graphical Displays and Summary Statistics for Numeric Data
Graphical Displays for Numeric Data
Histograms and Outlier Boxplots
To obtain a histogram and boxplot for numeric data, select Distribution from the
Analyze pull-down menu and place the variable(s) that you wish to examine in the
right-hand box. These data are from a study of DDT levels found in fish in the Tennessee
River near the Wheeler Reservoir. The data are contained in the file Catfish.JMP.
The variables in this data file are:
location - location on the river from which the fish were sampled.
distance - distance of the sample location from the mouth of the Tennessee River.
species - numeric indicator of the fish species (1 = catfish, 2 = smallmouth buffalo, 3 = largemouth bass)
Spec. Name - fish species
length - length of fish sampled (cm)
weight - weight of fish sampled (g)
DDT - DDT concentration found in a fillet of the fish (parts per million - ppm)
log(DDT) - natural logarithm of the DDT concentration
We begin by examining histograms and boxplots for the length, weight and DDT
concentration of the fish sampled. To do this select Distribution from the Analyze menu
and place length, weight and DDT in the right hand box. The results are shown below.
The Horizontal Layout, Prob Axis, Normal Curve & Smooth Curve options have
been used in constructing the plots above. These options are illustrated in the
graphics below:
The normal curve and smooth curve density estimate are added by selecting these options
from the Fit Distribution pull-out menu.
We can see that the lengths of the fish sampled appear to have a skewed left distribution
with several outliers on the low end. These outliers are all largemouth bass. The typical
length appears to be somewhere between 42 and 45 cm. The weight distribution
appears to be slightly skewed to the right, but is not far from normal as evidenced by the
fairly close agreement between the normal curve and smooth curve distribution estimate.
There are also a couple of outliers flagged in the boxplot. A typical weight for the fish
sampled is approximately 1000 grams. The DDT concentrations of the fish sampled
follow a severely skewed right distribution with several obvious outliers on the high end.
Using the location and Spec. Name columns to label the points in succession shows
these observations correspond to catfish and smallmouth buffalo sampled from locations
1, 8, and 13. Examination of the map shows that locations 1 and 13 are in close proximity
to the plant that was the source of the DDT contamination of the ecosystem.
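The points flagged in an outlier boxplot can be reproduced by hand. The sketch below assumes the common 1.5 x IQR rule (points more than 1.5 interquartile ranges beyond the quartiles are flagged); the tutorial does not state the exact rule or quartile method JMP uses, so treat both as assumptions. The lengths are made-up values for illustration and are not taken from Catfish.JMP.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # Quartiles by linear interpolation; other software may use a different method.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical fish lengths in cm (illustrative only, not from Catfish.JMP).
lengths = [25.0, 27.5, 40.0, 42.0, 43.0, 43.5, 44.0, 44.5, 45.0, 46.0, 47.0, 48.5]

lower, upper, outliers = boxplot_outliers(lengths)
print(f"fences: [{lower}, {upper}]  flagged: {outliers}")
```

With these values the two shortest fish fall below the lower fence, which mirrors the kind of low-end outliers seen for length above.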
Transformations to Improve Normality
When the distribution of a variable is markedly skewed (left or right) we can often
use a transformation to obtain approximate normality. The common remedy is to consider
raising the variable to some power. This type of transformation is known as a power
transformation. To remove right skewness we consider using powers less than 1 such as
1/2 (i.e. square root), 1/3 (i.e. cube root), 0 (which corresponds to a log transformation),
-1/2 (i.e. reciprocal square root), -1 (i.e. reciprocal), etc. As a rule of thumb, we often
avoid using negative power transformations because they reverse the ordering of (positive)
data, i.e. the largest observed value will become the smallest and vice versa. Also the
associated units of a negative power transformed variable can be difficult to explain. To
remove left skewness, which is less common, we typically raise the power of the variable
in question (e.g. 1.5, 2 or 3).
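To see the effect of moving down this ladder of powers, the following sketch computes a simple moment-based skewness coefficient before and after several power transformations. The data are hypothetical right-skewed values chosen for illustration, and the skewness formula used here (the average cubed z-score with the population standard deviation) is one common definition that may differ slightly from the one JMP reports.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: average cubed z-score (population SD)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def power_transform(values, p):
    """Raise positive values to the power p; p == 0 is treated as the natural log."""
    if p == 0:
        return [math.log(v) for v in values]
    return [v ** p for v in values]

# Hypothetical right-skewed, strictly positive measurements (illustrative only).
x = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 12.0, 40.0]

for p in (1, 0.5, 0, -1):
    print(f"power {p:>4}: skewness = {skewness(power_transform(x, p)):.3f}")

# A negative power reverses the ordering: the largest raw value becomes the smallest.
reciprocal = power_transform(x, -1)
print(x.index(max(x)) == reciprocal.index(min(reciprocal)))
```

For these values the skewness shrinks as the power drops from 1 to 1/2 to 0, and the last line confirms the order reversal described above for the reciprocal.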
In this example we see that the distribution of the DDT concentration is extremely
skewed to the right. To improve normality we will consider transformation to the log
scale. To do this in JMP you must use the JMP Calculator which allows you to perform a
variety of data transformations and manipulations. To create a column containing a
function of another column, double click to the right of the last column to add a new
column to the spreadsheet. Next double click at the top of the column to obtain the
Column Info window. In the window change the name of the new column to
log10(DDT) and select Formula from the New Property pull-down menu and click Edit
Formula.
The JMP Calculator should then appear on the screen. To take the base 10 logarithm of
the DDT variable select Transcendental from the menu to the right of the calculator
keypad because the logarithm is a transcendental (non-algebraic) function. In the list that
appears in the rightmost menu select base 10 logarithm (i.e. log10). In the formula window
you should see log. Now you need to supply the name of the variable you wish to take
the logarithm of, which is DDT in this case.
From the leftmost menu select DDT from the list and the formula window will then look
like: Log10(DDT)
Finally click Apply and close the calculator window. The new column you created
should now contain the base 10 logarithm of the DDT concentrations. The histogram and
boxplot for the log scale DDT readings are shown below. We can clearly see approximate
normality has been achieved through transformation.
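Outside of JMP, the new column is simply the base 10 logarithm applied row by row. A minimal sketch, using hypothetical DDT concentrations (not the values in Catfish.JMP) and assuming every concentration is strictly positive, since the logarithm is undefined at zero:

```python
import math

# Hypothetical DDT concentrations in ppm (illustrative only, not from Catfish.JMP).
ddt = [0.5, 1.0, 4.2, 10.0, 33.0, 100.0]

# Equivalent of the column formula Log10(DDT), applied to every row.
log10_ddt = [math.log10(v) for v in ddt]

for raw, logged in zip(ddt, log10_ddt):
    print(f"{raw:8.2f} ppm -> log10 = {logged:.3f}")
```

Each unit on the log10 scale corresponds to a tenfold change in ppm, which is why the long right tail of the raw concentrations is pulled in so strongly.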
Summary Statistics - Measures of Central Tendency, Variability and Location
Next to each of the histograms and boxplots shown above you will find the basic
summary statistics for each variable shown below.
To obtain the variance and coefficient of variation you need to select More Moments
from the Display Options pull-out menu.
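For reference, these two extra moments are easy to compute by hand. The sketch below uses hypothetical weights and assumes the usual sample definitions (variance with divisor n - 1, and coefficient of variation equal to the standard deviation divided by the mean, expressed as a percentage).

```python
import statistics

# Hypothetical fish weights in grams (illustrative only, not from Catfish.JMP).
weights = [850.0, 920.0, 1000.0, 1040.0, 1100.0, 1290.0]

mean = statistics.fmean(weights)
variance = statistics.variance(weights)  # sample variance, divisor n - 1
std_dev = statistics.stdev(weights)      # square root of the sample variance
cv_percent = std_dev / mean * 100        # coefficient of variation, in percent

print(f"mean = {mean:.1f} g, variance = {variance:.1f} g^2, "
      f"sd = {std_dev:.1f} g, CV = {cv_percent:.1f}%")
```

Because the CV is unit-free, it allows the variability of length (cm) and weight (g) to be compared on a common footing.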
To obtain z-scores associated with each observation select Save Standardized from the
Save menu which is located within the main pull-down menu for the variable.
Three new columns labeled Std length, Std weight, and Std DDT will appear in the
original spreadsheet containing the z-scores. You could examine the distribution of the
z-scores themselves by using the Distribution command. Any observations with z-scores
exceeding 3 in absolute value could be classified as potential outliers.
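The standardization and the |z| > 3 screen can be sketched as follows. The lengths are hypothetical, and the sketch assumes the z-score is computed with the sample standard deviation (divisor n - 1); the tutorial does not say which divisor Save Standardized uses.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical fish lengths in cm with one unusually short fish (illustrative only).
lengths = [42.0, 43.0, 43.0, 44.0, 44.0, 44.0, 44.0, 45.0, 45.0, 45.0,
           45.0, 46.0, 46.0, 46.0, 47.0, 47.0, 48.0, 43.0, 44.0, 20.0]

z = z_scores(lengths)

# Flag observations whose z-score exceeds 3 in absolute value.
flagged = [(v, round(score, 2)) for v, score in zip(lengths, z) if abs(score) > 3]
print(flagged)
```

Note that a single extreme value inflates the standard deviation it is judged against, so in small samples even a clear outlier may not reach |z| = 3.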
The histogram below is for length standardized using z-scores.
All of the observations with extreme z-scores for length are largemouth bass.
Comparative Displays
In this study we could compare the DDT levels of the different fish species and also
compare DDT levels of fish by location. We first consider the potential difference in the
DDT levels in catfish found at different river locations by using comparative boxplots
and mean diamonds. To do this in JMP select Fit Y by X from the Analyze menu and put
Location in the X box and log(DDT) in the Y box. The resulting display will show the
log(DDT) levels plotted versus the location number. To add boxplots or items to this plot
use the Display Options menu located within the main pull-down menu.
The options and their effects are summarized below...
Box Plots - adds quantile boxplots to the display
Mean Diamonds - adds mean diamonds to the plot
Mean Lines – adds a horizontal line showing the mean for each group/population.
Mean CI Lines – adds lines depicting the 95% confidence interval for the mean to the
plot.
Mean Error Bars - adds the means and standard errors (Ch. 6) to the plot
Std Dev Lines - adds lines one standard deviation above and below the mean.
Connect Means - adds line segments connecting the individual means.
X-Axis Proportion - if checked the space allocated to the groups will be proportional to the
sample size for that group.
Points Jittered – “jitters” the points so individual observations are more easily seen.
Points Spread – staggers the points much more than jittering.
The display below shows comparative boxplots for log(DDT) level across location with
the X-Axis Proportion option turned off.
Here we can clearly see that the fish from locations 1 & 13 have the highest DDT levels
and locations 6 & 17 appear to have the lowest. It is important to note that the latter
locations are the only locations where largemouth bass were sampled.
We can construct a similar display for comparing the log DDT measurements across
species by placing Species Name instead of location in the X box.
To obtain summary statistics for the log(DDT) levels within each species type select
Quantiles and Mean, Std Dev, Std Err from the main pull-down menu. The results are
shown on the following page.
How do different species compare in terms of summary statistics?
Catfish have the highest mean and median DDT levels in the log scale while largemouth
bass have the smallest. Catfish have the smallest amount of variation, as seen by
comparing the standard deviations or the coefficients of variation
($\mathrm{CV} = \frac{s}{\bar{x}} \times 100\%$).
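The per-species summaries discussed above amount to splitting the column by group and computing the same statistics within each group. A minimal sketch with hypothetical (species, log10 DDT) pairs; the values are invented for illustration and are not the study results:

```python
import math
import statistics
from collections import defaultdict

# Hypothetical (species, log10 DDT) pairs (illustrative only, not from Catfish.JMP).
records = [
    ("catfish", 1.10), ("catfish", 0.95), ("catfish", 1.20), ("catfish", 1.05),
    ("smallmouth buffalo", 0.40), ("smallmouth buffalo", 1.30),
    ("smallmouth buffalo", 0.70), ("smallmouth buffalo", 0.10),
    ("largemouth bass", -0.50), ("largemouth bass", -0.20), ("largemouth bass", -0.80),
]

# Split the measurements by species.
groups = defaultdict(list)
for species, value in records:
    groups[species].append(value)

# Mean, median, standard deviation, and standard error of the mean per group.
summary = {}
for species, values in groups.items():
    n = len(values)
    sd = statistics.stdev(values)
    summary[species] = {
        "n": n,
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std_dev": sd,
        "std_err": sd / math.sqrt(n),
    }

for species, stats in summary.items():
    print(species, {k: round(v, 3) for k, v in stats.items()})
```

The standard error divides the standard deviation by the square root of the group size, so groups with few fish have less precisely estimated means even when their spread is similar.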
CDF Plots
The plot below gives the CDF plots for the DDT levels found in each of the fish species in
this study. To obtain these select CDF Plots from the Oneway Analysis... pull-down
menu. We can clearly see that we are much more likely to find a catfish with a high DDT
level, e.g. there is approximately a 50% chance that we sample a catfish with a
log10(DDT) level exceeding 1, which is 10 ppm in the original scale. The same probability
for smallmouth buffalo is less than 25% and is estimated to be 0 for bass.
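Reading an exceedance probability off a CDF plot corresponds to one minus the empirical CDF at that value. The sketch below uses hypothetical log10(DDT) values for each species, chosen only to mirror the pattern described above; they are not the study data.

```python
def ecdf(values, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Hypothetical log10(DDT) values per species (illustrative only).
samples = {
    "catfish": [0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 1.6],
    "smallmouth buffalo": [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.3],
    "largemouth bass": [-0.9, -0.6, -0.4, -0.2],
}

threshold = 1.0  # log10(DDT) = 1 corresponds to 10 ppm on the original scale

# Estimated chance of sampling a fish above the threshold, per species.
exceed = {name: 1 - ecdf(values, threshold) for name, values in samples.items()}

for name, p in exceed.items():
    print(f"{name}: P(log10 DDT > {threshold}) is about {p:.3f}")
```

A curve that sits further to the right reaches any given height later, so at a fixed threshold it has the lower CDF value and the higher exceedance probability.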