Methods for a Single Numeric Variable – Descriptive Statistics
In this set of notes we will look at the various measures used to summarize a single numeric variable.
For convenience, we’ll label the observations of the data set x1, x2, …, xn. That is, x1 is the first
measurement, x2 is the second measurement, and so on. Let n represent the total number of data points. We
will look at how to summarize the data with respect to the following:

_______________ of the data

_______________ (or variability) of the data

Shape/_______________ of the data
Measures of Location

Mean: The arithmetic average of all the values in the data set. This quantity measures the
center of the data set.
Sample mean = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
Note: the population mean is denoted by µ.

Median: The middle observation in a data set (after the values have been arranged in ascending
or descending order). The median is the 50th percentile of the data, so half the
observations fall below the median and the other half fall above it. If the data set
contains an even number of observations, the median is the average of the middle two
observations. This quantity measures the center of the data set.

Quartiles:
o Q1 – The median of the lower half of the data, ________ percentile.
o Q2 – The median, ________ percentile.
o Q3 – The median of the upper half of the data, ________ percentile.
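The definitions above can be sketched in a few lines of Python. This is only an illustration, not part of the JMP workflow: the lengths are hypothetical values (not taken from Catfish.jmp), the helper name quartiles_by_halves is made up for this sketch, and it assumes that when n is odd the middle observation is left out of both halves. Statistical software, JMP included, may use a slightly different quantile rule, so its quartiles can differ a little.

```python
from statistics import mean, median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the 'median of each half' definition.

    Assumption: when n is odd, the middle observation is excluded
    from both the lower half and the upper half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

# Hypothetical fish lengths in cm (illustration only, not from Catfish.jmp).
lengths = [30.0, 35.5, 41.0, 42.5, 44.0, 45.0, 47.5, 52.0]

print("mean:", mean(lengths))                      # 42.1875
print("quartiles:", quartiles_by_halves(lengths))  # (38.25, 43.25, 46.25)
```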
Example: From 1947 to 1971, DDT was manufactured in a plant located on Indian Creek, which flowed
into the Tennessee River 321 miles from the mouth of the river. In 1972 the EPA banned the use of DDT
in the United States. In the late 1970s, widespread DDT contamination was discovered at the plant site
and in nearby waterways. The data from a study conducted by the U.S. Army Corps of Engineers in the
summer of 1980 can be found in the file Catfish.jmp on the course website. The variables in the data
set are given below:

Fish ID – an identification number for each fish (1 – 44)
Location – the location on the river from which the fish was sampled.
   o FCM5 = Flint Creek, 5 miles from the mouth
   o LCM3 = Limestone Creek, 3 miles from the mouth
   o SCM1 = Spring Creek, 1 mile from the mouth
   o In general: TRM### = Tennessee River, ### miles from the mouth
Distance from mouth – approximate distance of the sample location from the mouth of the
Tennessee River
Species – fish species (catfish, smallmouth buffalo, largemouth bass)
Length – length of the sampled fish (in cm)
Weight – the weight of the sampled fish (in g)
DDT – the concentration of DDT found in a fillet of fish (in ppm)
A portion of the data set is given below.
We can use JMP to calculate measures of location for the variables Length, Weight, and DDT. To do so,
choose Analyze > Distribution and put all three variables in the Y, Columns box as shown below.
Click OK and JMP will return the following:
For Length:
For Weight:
For DDT:
Identify the following from the JMP output:

The number of observations in the data set: _______________

Sample mean for DDT: _______________

Median for DDT: _______________

Q1 for Length: _______________

Q2 for Length: _______________

Q3 for Length: _______________

The smallest observation for Weight: _______________

The largest observation for Weight: _______________
Measures of Variability or Spread
Consider the following data sets.
Questions:
1. What is the mean for each data set? The median?
            Data set A    Data set B    Data set C
Mean        __________    __________    __________
Median      __________    __________    __________
2. Is a measure of center enough to describe these data sets? If not, what else do you think should
be used?
There are several measures of variability or spread of a data set.

Range: The difference between the __________________ and __________________
measurements in the data set.
Range = ___________________________________

Interquartile Range (IQR): The difference between the _________ and __________ quartiles.
IQR = __________________

Average Distance from the Mean: To summarize the variability in a set of measurements, we
may want to use every observation in the data set to calculate the “average distance from the
mean.”
Average Distance from Mean = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}
Calculate the average distance from the mean for Data set B from above.
Observation    Sample Mean    Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances              ________
Average distance from mean    ________
Questions:
3. What is the problem with using this method?
4. It can be shown using a little algebra that we will always get zero for an answer. Do you have
any ideas as to how to overcome this problem?
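The algebra behind Question 4 is short: the sum of (x_i - x̄) equals the sum of the x_i minus n·x̄, and since x̄ is the sum of the x_i divided by n, the two pieces cancel. A quick numerical check with Data set B (the observations 0, 10, 20, 30, 40 from the table above) shows the same thing. The variable names below are only illustrative.

```python
# Observations of Data set B, as listed in the table above.
data_set_b = [0, 10, 20, 30, 40]

x_bar = sum(data_set_b) / len(data_set_b)      # sample mean
deviations = [x - x_bar for x in data_set_b]   # signed distances from the mean
average_distance = sum(deviations) / len(deviations)

print(deviations)        # [-20.0, -10.0, 0.0, 10.0, 20.0]
print(average_distance)  # 0.0
```

The negative and positive distances cancel exactly, which is why the signed average is not a useful measure of spread.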

Mean Absolute Deviation (MAD): This is the average distance from the mean calculated using
absolute difference. Compute the MAD for Data set B from above.
Observation    Sample Mean    Absolute Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances              ________
MAD                           ________
Although this gives a valid measure of variability in a data set, this quantity has difficult
statistical properties. Traditionally the ____________________ and __________________ are
used instead.
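As a way to check a hand calculation, the MAD can be sketched in Python as follows. The function name mean_absolute_deviation is just a label chosen for this sketch; the data are the Data set B observations from the table above.

```python
def mean_absolute_deviation(values):
    """Average of the absolute distances from the sample mean."""
    x_bar = sum(values) / len(values)
    return sum(abs(x - x_bar) for x in values) / len(values)

# Observations of Data set B, as listed in the table above.
data_set_b = [0, 10, 20, 30, 40]

mad = mean_absolute_deviation(data_set_b)
print(mad)  # 12.0
```

Taking absolute values keeps the negative and positive distances from cancelling.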

Variance: The average _______________ distance from the mean.
Sample variance = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
Compute the sample variance for Data set B.
Observation    Sample Mean    Squared Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances              ________
Sample variance               ________

Standard Deviation: The _____________ square root of the variance.
Sample standard deviation = s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
Compute the sample standard deviation for Data set B.
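A minimal Python sketch of both formulas, using the n - 1 divisor exactly as written above, might look like this. The helper name sample_variance is hypothetical; Python's built-in statistics module is used only as a cross-check, since its variance and stdev functions also divide by n - 1.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Observations of Data set B, as listed in the table above.
data_set_b = [0, 10, 20, 30, 40]

s2 = sample_variance(data_set_b)
s = math.sqrt(s2)

print(s2)  # 250.0
print(s)   # about 15.81

# Cross-check against the standard library.
print(statistics.variance(data_set_b), statistics.stdev(data_set_b))
```

Note that the variance is in squared units, while the standard deviation is back in the units of the original measurements.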

Coefficient of Variation: This measures the amount of variation relative to the size of the mean.
CV = \frac{s}{\bar{x}} \times 100\%
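One reason the CV is handy is that it does not depend on the units of measurement. The sketch below illustrates this with hypothetical lengths (not taken from Catfish.jmp) expressed in cm and again in mm; the function name is made up for this illustration, and the CV is only meaningful when the mean is not zero.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical lengths (illustration only), in cm and converted to mm.
lengths_cm = [30.0, 35.5, 41.0, 42.5, 44.0, 45.0, 47.5, 52.0]
lengths_mm = [10 * x for x in lengths_cm]

print(round(coefficient_of_variation(lengths_cm), 2))
print(round(coefficient_of_variation(lengths_mm), 2))  # same value as in cm
```

Because both s and the mean are multiplied by 10 when the units change, the ratio is unchanged. This is what makes it reasonable to compare the CV of Length (cm) with the CV of Weight (g).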
Example: We can obtain the range, sample variance, sample standard deviation, and coefficient of
variation in JMP for the variables Length, Weight, and DDT. You should already have the results from
selecting Analyze > Distribution open. Now, click the little red arrow next to each of the variables and
choose Display Options > Customize Summary Statistics. Then check the boxes next to Variance and
CV.
For Length:
For Weight:
For DDT:
Identify the following from the JMP output:

Range for Length: ___________________

IQR for Weight: ___________________

Variance for DDT: ___________________

Standard deviation for DDT: ___________________

Coefficient of variation for both Length and Weight: ____________________________________
Describing the Shape/Distribution of the Data
Determining the shape/distribution of the data is a very important step in many statistical procedures.
For example, some procedures require the distribution of the data to be bell-shaped. Most often,
graphical techniques are used to determine the shape of the distribution; however, a few numerical
measures exist and will be discussed later.
Graphical Summaries for Shape
We can use many different types of graphical summaries to describe the shape (or distribution) of the
observed data.
Comments:

When plotting numeric data, the __________________ axis is a number line of values (i.e.
CONTINUOUS!).

The _________________ axis usually represents counts or sometimes the relative frequency of
observations which have the same value.
We’ll again use the file Catfish.jmp to introduce several graphical techniques for numerical data.

Dotplot:
o __________ data point is plotted.
o Dotplots are normally used for small data sets.
o JMP does not create dotplots, but I’ve created one of the variable Length using a
different software package.
Questions:
1. Where are most of the fish located in terms of Length?
2. Would you consider any of the fish as extreme in terms of their Length? That is, would you
consider any of the fish as potential outliers? Explain.

Stem and Leaf Plots
o Again, every data point is plotted when creating a stem and leaf plot.
o Stem and Leaf plots are normally used for small data sets.
o To produce a Stem and Leaf plot in JMP, choose Analyze > Distribution, put pH in
the Y, Columns box, and then click on the little red arrow next to pH in the output.
Choose Stem and Leaf from the menu that appears. You should get the following plot.
Comments:
o The “leaf” always represents the last digit in the values recorded.
o The “stem” represents all the other digits in the values recorded.
o You’ll notice under the stem and leaf plot it says “1|8 represents 18.” This is the legend
which tells what the stem and leaf units are for that particular graph. In this case the
stem is the tens place and the leaf is the ones place.

Boxplot:
o The boxplot creates a picture of the data using the ______________ as reference points.
o The “box” portion is comprised of _____, _____ and _____.
o The “whiskers” represent one of two things:
 The endpoint of the lower whisker is the larger of: _____________ or
_______________________
 The endpoint of the upper whisker is the smaller of: _____________ or
_______________________
o Any measure beyond the endpoint of either _________________ is classified as a
potential ____________________________________ observation.
o An outlier boxplot is the other default plot produced when you choose Analyze >
Distribution in JMP. The boxplot for the Length data is given below.
Next, consider the outlier boxplot for DDT. What do you see in this plot?
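For readers who want to see the arithmetic behind an outlier boxplot, here is a rough Python sketch. It assumes the common 1.5 × IQR convention for the fences and draws each whisker to the most extreme observation that is still inside the fences; the notes do not state JMP's exact rule here, so treat both choices as assumptions. The DDT values are hypothetical and the function name is made up for this illustration.

```python
from statistics import median

def boxplot_summary(values, k=1.5):
    """Quartiles, whisker endpoints, and points flagged beyond the fences.

    Assumptions: quartiles are medians of the lower and upper halves,
    and the fences sit k * IQR beyond Q1 and Q3 (k = 1.5 by convention).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half:] if n % 2 == 0 else data[half + 1:])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Hypothetical DDT concentrations in ppm (illustration only).
ddt = [0.5, 1.2, 2.0, 3.1, 4.4, 5.0, 7.5, 9.0, 12.0, 60.0]

print(boxplot_summary(ddt))
# {'q1': 2.0, 'q3': 9.0, 'whiskers': (0.5, 12.0), 'outliers': [60.0]}
```

In this made-up example the single large value lies beyond the upper fence (9.0 + 1.5 × 7.0 = 19.5), so it would be drawn as a separate point rather than as part of the whisker.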

Histograms
o This is a good type of plot when you have a lot of observations.
o The observations are placed into “bins” and the height of each bin represents the
number of observations that fall into any particular bin.
o The histogram is one of the default plots produced when you choose Analyze >
Distribution in JMP. The histogram of the Length data is given below.
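The binning step can be sketched in Python as shown below. The lengths are hypothetical (not from Catfish.jmp), the function name and the starting point of 30 are choices made for this illustration, and bins with no observations are simply left out of the result.

```python
from collections import Counter

def histogram_counts(values, bin_width, start=0.0):
    """Count observations in bins [start + k*w, start + (k+1)*w)."""
    counts = Counter(int((x - start) // bin_width) for x in values)
    return {start + k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical fish lengths in cm (illustration only).
lengths = [30.0, 35.5, 41.0, 42.5, 44.0, 45.0, 47.5, 52.0]

# Keys are the left edges of the bins; values are the bin heights.
print(histogram_counts(lengths, bin_width=10, start=30.0))
# {30.0: 2, 40.0: 5, 50.0: 1}
print(histogram_counts(lengths, bin_width=5, start=30.0))
# {30.0: 1, 35.0: 1, 40.0: 3, 45.0: 2, 50.0: 1}
```

The same eight values look more sharply peaked with wide bins than with narrow ones, so the choice of bin width can change the impression a histogram gives.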

Smoothed Histograms: Changing the number of classes in a histogram may influence your
perception of the shape or distribution of the data. Therefore, it is good practice to use JMP to
carry out a process called smoothing. Click on the red drop-down arrow next to Length and
select Continuous Fit > Smooth Curve. JMP should return the following plot.
This smooth curve represents JMP’s best guess for the shape or distribution of the ___________
from which the data are a random sample. That is, the smooth curve represents the general
trends, not just the patterns that are specific to the data which were collected.
Numerical Summaries for Shape/Distribution
Two numerical summaries for shape exist: ____________________ and ___________________

Skewness: A data distribution is said to be ___________________ if it has the same shape on
both sides of the center. Skewness measures the amount of ___________________.
o The distribution of a set of data is said to be __________ skewed or _______________
skewed if the measurements tend to trail off to the __________.
o Similarly, the distribution is said to be _________ skewed or ______________ skewed if the
measurements tend to trail off to the __________.
Shape           Picture                                     Skewness Measure in JMP
Symmetric       (The most famous symmetric distribution     Zero
                is the normal. Others?)
Right Skewed                                                Greater than zero
Left Skewed                                                 Less than zero

Kurtosis: This is used to measure the amount of _______________ of the distribution of the
data relative to the normal distribution.
Shape                Picture                                    Kurtosis Measure in JMP
Normal                                                          Zero
Positive Kurtosis    Taller or skinnier than normal shape       Greater than zero
Negative Kurtosis                                               Less than zero
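To see where the signs in these tables come from, here is a small Python sketch of the simple moment-based versions of skewness and kurtosis. This is an assumption-laden illustration: JMP may apply small-sample adjustments, so its reported values can differ somewhat, although the signs should agree. The kurtosis below subtracts 3 so that a normal distribution scores zero, matching the table. The skewed data are hypothetical.

```python
def moment_skewness_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments.

    Assumption: plain moment formulas with divisor n, not any
    software-specific small-sample correction.
    """
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m3 = sum((x - x_bar) ** 3 for x in values) / n
    m4 = sum((x - x_bar) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

symmetric = [0, 10, 20, 30, 40]              # Data set B from earlier in the notes
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # hypothetical, long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image, long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    skew, kurt = moment_skewness_kurtosis(data)
    print(name, round(skew, 2), round(kurt, 2))
```

The symmetric data give a skewness of zero, the data trailing off to the right give a positive value, and the mirror image gives the same value with a negative sign.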
These values can be found using JMP by choosing Display Options > Customize Summary Statistics and
checking Skewness and Kurtosis. You should get the following output.
For Length:
For Weight:
For DDT:
Also shown are the smoothed histograms for each variable.
Length:
Weight:
DDT:
Questions:
3. Based on the histograms, how would you describe the shape/distribution for:
a. Length: ____________________________
b. Weight: ____________________________
c. DDT: ____________________________
4. Do the numerical measures of skewness agree with what you see in the histograms? Explain.
5. If the data are extremely right skewed, which should be larger, the mean or median? Explain.
6. If the data are extremely left skewed, which should be larger, the mean or median? Explain.
7. If the data are symmetric, which should be larger, the mean or median? Explain.
8. Which summary statistic do you think is more representative of a typical DDT measurement, the
mean or median? Explain.
Measuring the Position of an Observation
There are two commonly used methods for determining an observation’s position relative to all other
measurements in the data set.

Percentiles: The ______ percentile for a set of measurements is a number such that _____ of
the measurements fall at or below the pth percentile.

Z-score: This measures how many standard deviations an observation is away from the mean.
Sometimes it is called the ________________________ value.
z-score = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}
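The z-score formula translates directly into code. The sketch below is a minimal illustration using Data set B from earlier in these notes (0, 10, 20, 30, 40); the function name z_scores is made up here, and the sample standard deviation (n - 1 divisor) is assumed.

```python
import statistics

def z_scores(values):
    """Standardize each observation: (observation - mean) / standard deviation."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    if s == 0:
        raise ValueError("z-scores are undefined when the standard deviation is zero")
    return [(x - x_bar) / s for x in values]

# Data set B from earlier in these notes.
data_set_b = [0, 10, 20, 30, 40]

z = z_scores(data_set_b)
print([round(value, 2) for value in z])  # [-1.26, -0.63, 0.0, 0.63, 1.26]
```

The observation equal to the mean gets a z-score of zero, observations below the mean get negative z-scores, and observations above it get positive ones.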
Example: Once again, let’s consider the Length variable from the Catfish.jmp data set.
Questions:
9. What is the 50th percentile? The 25th percentile? The 75th percentile?
10. Identify the 10th and 90th percentiles. What percent of the observations lie between these two
values?
Example: On a related note, we can also create CDF plots in JMP. This shows the estimated probability
of observing a data point less than or equal to a given value. In JMP, select the red drop-down arrow
next to Length and choose CDF Plot. You should get the following plot.
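The idea behind a CDF plot can be sketched in Python as the proportion of observations at or below a given value. The lengths below are hypothetical (not from Catfish.jmp), the function name is chosen for this illustration, and the step-function estimate shown here may not match every detail of how JMP draws its plot.

```python
def empirical_cdf(values, x):
    """Proportion of observations less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Hypothetical fish lengths in cm (illustration only).
lengths = [30.0, 35.5, 41.0, 42.5, 44.0, 45.0, 47.5, 52.0]

print(empirical_cdf(lengths, 45))  # 0.75
print(empirical_cdf(lengths, 50))  # 0.875
```

Read the other way around, this is also the percentile idea: in this made-up sample, 75% of the observations fall at or below 45 cm.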
Questions:
11. Estimate the probability of observing a randomly selected fish that is less than 45 cm in length.
12. Estimate the probability of observing a randomly selected fish that is less than 50 cm in length.
Example: To obtain z-scores for the measurements of the variables Length, Weight and DDT, select Save
> Standardized from the red drop-down arrow next to the variable name. You should then see the
following output in the data table.
Question:
13. Using the first observation show how the z-score for length was calculated.
Interpretation of z-scores

The standardized values transform the data so that the data is placed on the standardized scale.
The standardized scale has a mean of _____ and a standard deviation of _____.

The smallest value in the data set will always have the smallest z-score. Likewise, the largest
value in the data set will have the largest z-score.

If a z-score is _______________, then the data point is that many standard deviations below the
mean.

If a z-score is _______________, then the data point is that many standard deviations above the
mean.

If the z-score is ______________, then the data point is the same as the mean.

If the standard deviation is ________, then the z-score is NOT defined and thus cannot be
computed.
The following graphic compares the data on its original scale to the data on the standardized scale.
Questions:
14. What changes between the two graphs?
15. Why do you think z-scores are so important?
Example: Which is more extreme…a catfish 44.5 cm long or a smallmouth buffalo 43.5 cm long?
Identification of Outliers
There are two basic methods for identifying outliers: ________________ and ________________

Boxplots: As we have already seen, these are commonly used to identify outliers. Recall that
any measurement beyond the endpoint of either whisker is classified as a potential outlier
(extreme observation).

Z-scores: Z-scores are also used to identify outliers. Any data value whose z-score is below -2 or
above 2 is considered to be a potential outlier. Any data value whose z-score is below -3 or
above 3 is considered an outlier and warrants further investigation.
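The z-score cutoffs of 2 and 3 described above can be applied mechanically, as in the following Python sketch. The data are hypothetical (twenty ordinary values plus one very large one), and the function name and labels are made up for this illustration.

```python
import statistics

def flag_by_z_score(values):
    """Label each observation using the |z| > 2 and |z| > 3 cutoffs."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    flags = []
    for x in values:
        z = (x - x_bar) / s
        if abs(z) > 3:
            label = "outlier"
        elif abs(z) > 2:
            label = "potential outlier"
        else:
            label = "typical"
        flags.append((x, round(z, 2), label))
    return flags

# Hypothetical measurements (illustration only): 20 small values and one large one.
values = [1, 2, 3, 4, 5] * 4 + [60]

for x, z, label in flag_by_z_score(values):
    if label != "typical":
        print(x, z, label)  # 60 4.34 outlier
```

One caution: an extreme value inflates both the mean and the standard deviation, so in small data sets even a very unusual point may not reach a z-score of 3.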
Rules for Data Concentration
Once you have estimated the mean and standard deviation for a set of measurements, you can utilize a
few rules to make statements about where the data is concentrated.


Empirical Rule: If the distribution of the data is _________________________ and symmetric,
then the Empirical Rule applies. This rule says that APPROXIMATELY…
o ________ of the values fall within one standard deviation of the mean.
o ________ of the values fall within two standard deviations of the mean.
o ________ of the values fall within three standard deviations of the mean.

Chebyshev’s Rule: This rule works for ANY distribution. Chebyshev’s Rule tells us that AT
LEAST…
o ________ of the values fall within two standard deviations of the mean.
o ________ of the values fall within three standard deviations of the mean.
o ________ of the values fall within k standard deviations of the mean.
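Chebyshev’s Rule is usually stated as a lower bound of 1 - 1/k² on the proportion of values within k standard deviations of the mean, for k greater than 1. The Python sketch below compares that bound with the observed proportion in a deliberately skewed, hypothetical data set; the function names are made up for this illustration.

```python
import statistics

def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(abs(x - x_bar) <= k * s for x in values) / len(values)

def chebyshev_lower_bound(k):
    """Chebyshev's guaranteed minimum proportion within k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

# Hypothetical, strongly right-skewed measurements (illustration only).
values = [1, 2, 3, 4, 5] * 4 + [60]

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 3), round(proportion_within(values, k), 3))
```

Even for data this skewed, the observed proportions stay above the bound, which is the point of a rule that holds for any distribution. The price is that the bound is often far from tight.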