Methods for a Single Numeric Variable – Descriptive Statistics
So far this semester we’ve been concentrating on categorical variables/data. We are now going to
discuss how to summarize and describe numeric data, as well as inferential procedures.
Measures of Center
We can use the following methods to describe the center or distribution of a given data set.
Mean: The arithmetic mean of the observations. Sample mean = $\bar{x} = \frac{\sum x_i}{n}$

Median: The middle number in a data set after the numbers have been arranged in ascending
(or descending) order. This is the value which cuts off the 50th percentile of the observations,
i.e. 50% of the data values lie above the median and 50% of the data values lie below the
median.
Example: Two researchers measured the pH (a scale on which a value of 7 is neutral and values below 7
are acidic) of water collected from rain and snow over a 6-month period in Allegheny County, PA. The
data can be found on the course website in the file pH.jmp. We can use JMP to find the mean and
median of the data.
1. Click on Analyze → Distribution. Then place pH in the Y, Columns box as shown below.
2. You should then get the following output:
The mean and median are circled in the above output.
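The same two summaries can be checked outside JMP. The following is a minimal Python sketch using a small, made-up list of readings (the values are hypothetical and are not taken from pH.jmp); it relies only on the standard library's statistics module.

```python
from statistics import mean, median

# Hypothetical readings for illustration only; these are NOT the values in pH.jmp.
readings = [4.1, 4.3, 4.4, 4.6, 5.6]

sample_mean = mean(readings)      # sum of the observations divided by n
sample_median = median(readings)  # middle value after sorting

print(f"mean = {sample_mean:.2f}, median = {sample_median:.2f}")
```

In this made-up list one reading is much larger than the rest, so the mean (4.60) lands above the median (4.40).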
In addition to the mean and median, quartiles (or percentiles or quantiles) give additional information
regarding the distribution of the data.

Q1: The first quartile which represents the _______ percentile. This is also the median of the
lower half of the data.

Q2: The second quartile which is the _______________.

Q3: The third quartile which represents the _______ percentile. This is also the median of the
upper half of the data.
Questions:
1. Referring to the JMP output above, identify the values for Q1 and Q3.
Q1 = ________
Q3 = ________
Note: There are a number of other percentiles listed in the JMP output.
2. What values do the 0th and 100th percentiles represent?
0th percentile = __________
100th percentile = __________
3. Together the ________, ________, ________, ________, and ________ form what’s called the
Five Number Summary. The Five Number Summary provides a numeric “picture” of the
distribution of the data.
4. What percent of the observations should fall between the 25th and 75th percentiles?
5. What about the 2.5th and 97.5th percentiles?
6. How about the 0.5th and 99.5th percentiles?
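As a rough check on hand calculations, the quartile description above (Q1 and Q3 as the medians of the lower and upper halves) can be sketched in Python. The function name and the data are hypothetical, and the sketch assumes the middle observation is left out of both halves when n is odd. Statistical packages, JMP included, may use a different quartile rule, so their values can differ slightly from these.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # lower half (excludes the middle value when n is odd)
    upper = data[-half:]   # upper half (excludes the middle value when n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical observations for illustration only.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
print(five_number_summary(sample))
```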
Measures of Variability or Spread
Consider the following data sets.
Questions:
7. What is the mean for each data set? The median?
            Data set A    Data set B    Data set C
Mean        ________      ________      ________
Median      ________      ________      ________
8. Is a measure of center enough to describe these data sets? If not, what else do you think should
be used?
There are several measures of variability or spread of a data set.

Range: The difference between the __________________ and __________________
measurements in the data set.
Range = ___________________________________

Interquartile Range (IQR): The difference between the _________ and __________ quartiles.
IQR = __________________

Average Distance from the Mean: To summarize the variability in a set of measurements, we
may want to use every observation in the data set to calculate the “average distance from the
mean.”
Average Distance from Mean = $\frac{\sum (x_i - \bar{x})}{n}$
Calculate the average distance from the mean for Data set B from above.
Observation    Sample Mean    Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances              ________
Average distance from mean    ________
Questions:
9. What is the problem with using this method?
10. It can be shown using a little algebra that we will always get zero for an answer. Do you have
any ideas as to how to overcome this problem?
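The algebra mentioned in Question 10 is short. Since $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the sum of the observations equals $n\bar{x}$, and the signed distances cancel:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Dividing by n does not change this, so the average signed distance from the mean is zero for every data set.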

Mean Absolute Deviation (MAD): This is the average distance from the mean calculated using
absolute difference. Compute the MAD for Data set B from above.
Observation    Sample Mean    Absolute Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances              ________
MAD                           ________
Although this gives a valid measure of variability in a data set, this quantity has difficult
statistical properties. Traditionally the ____________________ and __________________ are
used instead.

Variance: The average _______________ distance from the mean.
Sample variance = $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Compute the sample variance for Data set B.
Observation    Sample Mean    Squared Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances              ________
Sample variance               ________

Standard Deviation: The _____________ square root of the variance.
Sample standard deviation = $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Compute the sample standard deviation for Data set B.
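Hand calculations of these measures can be checked with a short script. This is a minimal sketch on hypothetical data (not one of the handout's data sets). It assumes the range is taken as largest minus smallest, and it uses the standard library's statistics.variance and statistics.stdev, which divide by n - 1 as in the formulas above.

```python
from statistics import mean, variance, stdev

# Hypothetical observations for illustration; not one of the handout's data sets.
data = [2, 4, 4, 6, 9]

x_bar = mean(data)
data_range = max(data) - min(data)

# Mean absolute deviation: average of |x_i - x_bar| over all n observations.
mad = sum(abs(x - x_bar) for x in data) / len(data)

# statistics.variance and statistics.stdev both use the n - 1 divisor.
s_squared = variance(data)
s = stdev(data)

print(f"range = {data_range}, MAD = {mad:.2f}, s^2 = {s_squared:.2f}, s = {s:.2f}")
```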
Example: Looking again at the pH data, compute the range, IQR, sample variance, and sample standard
deviation using the JMP output.
Range = ________
IQR = ________
Sample standard deviation (s) = ________
Sample variance ($s^2$) = ________
Example: Download the file messages.jmp from the course website. This data set contains the number
of text messages an individual sends which was collected from the student data survey you may have
completed at the beginning of the semester. Answer the following questions regarding the data set.
Questions:
11. Compute the average number of text messages sent in a day.
12. Find the Five Number Summary for the number of text messages sent in a day.
13. Compute the range for the number of text messages sent in a day.
14. Between what two values does the middle 50% of text messages sent in a day lie?
15. Compute the sample standard deviation and sample variance for the number of text messages
sent in a day.
We’ve been looking at numerical summaries used to describe a single numeric variable. We will now
look at the various methods to graphically summarize these types of variables.
Description of Shape
We can use many different types of graphical summaries to describe the shape (or distribution) of the
observed data.
Comments:

When plotting numeric data, the __________________ axis is a number line of values (i.e.
CONTINUOUS!).

The _________________ axis usually represents counts or sometimes the relative frequency of
observations which have the same value.
We will again use the file pH.jmp to discuss the various graphical techniques for describing the shape (or
distribution) of the observed data.

Dotplot
o _________ data point is plotted when creating a dotplot.
o Dotplots are normally used for small data sets.
o JMP does not create dotplots, but we’ve encountered them in TinkerPlots earlier this
semester.

Stem and Leaf Plots
o Again, every data point is plotted when creating a stem and leaf plot.
o Stem and Leaf plots are normally used for small data sets.
o JMP will produce a Stem and Leaf plot: click on Analyze → Distribution, put pH in
the Y, Columns box, and then click on the little red arrow next to pH in the output.
Choose Stem and Leaf from the menu that appears. You should get the following plot.
Comments:
o The “leaf” always represents the last digit in the values recorded.
o The “stem” represents all the other digits in the values recorded.
o You’ll notice under the stem and leaf plot it says “41|2 represents 4.12.” This is the
legend which tells what the stem and leaf units are for that particular graph. In this case
the stem is the ones and tenths place and the leaf is the hundredths place.
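The stem/leaf split described in these comments can be sketched in a few lines of Python. The function name and the sample values are hypothetical (they are not the pH.jmp data), and the sketch assumes non-negative values recorded to two decimal places. Unlike a full stem and leaf plot, it simply omits stems that have no leaves.

```python
from collections import defaultdict

def stem_and_leaf(values, decimals=2):
    """Group values into stems (all digits but the last) and leaves (the last digit)."""
    scale = 10 ** decimals
    plot = defaultdict(list)
    for v in sorted(values):
        digits = round(v * scale)        # e.g. 4.12 -> 412
        stem, leaf = divmod(digits, 10)  # 412 -> stem 41, leaf 2
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical values recorded to the hundredths place; not the pH.jmp data.
sample = [4.12, 4.15, 4.26, 4.31, 4.31, 4.57]

for stem, leaves in sorted(stem_and_leaf(sample).items()):
    print(f"{stem}|{''.join(str(leaf) for leaf in leaves)}")
```

With this layout a printed row such as 41|25 stands for the two values 4.12 and 4.15.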

Histograms
o This is a good type of plot when you have a lot of observations.
o The observations are placed into “bins” and the height of each bin represents the
number of observations that fall into any particular bin.
o The histogram is one of the default plots produced when you choose Analyze →
Distribution in JMP. The histogram of the pH data is given below.
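The binning step behind a histogram can be illustrated with a short sketch. The function name, the bin settings, and the data are all hypothetical, and the choice to count a value on a bin boundary in the bin to its right (except the maximum edge, which goes in the last bin) is an assumption; JMP chooses its own bins.

```python
def histogram_counts(values, low, width, n_bins):
    """Count observations per bin. Bins are [low, low + width), and so on;
    the last bin also includes its right endpoint."""
    high = low + width * n_bins
    counts = [0] * n_bins
    for v in values:
        if v < low or v > high:
            continue  # outside the plotted range; ignored in this sketch
        index = min(int((v - low) // width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical observations for illustration only.
sample = [1, 2, 2, 3, 5, 7, 8, 8, 9, 10]
low, width, n_bins = 0, 2, 5
counts = histogram_counts(sample, low, width, n_bins)

# Text version of the histogram: one star per observation in each bin.
for i, count in enumerate(counts):
    left = low + i * width
    print(f"{left:>2} to {left + width:>2}: {'*' * count}")
```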
When looking at a dotplot, stem and leaf plot, or histogram of the data, we can describe the
shape/distribution of the data using the following terminology.
o Right Skewed/Positively Skewed
o Left Skewed/Negatively Skewed
o Symmetric
Questions:
16. Describe the shape/distribution of the pH data.
17. Does the information given in the histogram agree with what was seen in the dotplot?
18. If the data were extremely right skewed, which should be larger: the mean or the median?
Explain why this is the case.
19. If the data were extremely left skewed, which should be larger: the mean or the median?
Explain why this is the case.
20. If the data were symmetric, which should be larger: the mean or the median? Explain why this
is the case.

Boxplot
o The boxplot creates a picture of the data using the ______________ as reference points.
o The “box” portion is comprised of _____, _____ and _____.
o The “whiskers” represent one of two things:
   ▪ The endpoint of the lower whisker is the larger of: _____________ or _______________________
   ▪ The endpoint of the upper whisker is the smaller of: _____________ or _______________________
o Any measure beyond the endpoint of either _________________ is classified as a
potential ____________________________________ observation.
o An outlier boxplot is the other default plot produced when you choose Analyze →
Distribution in JMP. The boxplot for the pH data is given below.
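One way to see how a boxplot flags observations is to compute the cutoffs directly. The sketch below assumes the widely used 1.5 × IQR convention for the cutoffs (an assumption on our part, as is the median-of-halves quartile rule; JMP's quartiles may differ slightly). The function names and data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])

def potential_outliers(values, k=1.5):
    """Return the observations lying more than k * IQR beyond Q1 or Q3."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lower_cutoff = q1 - k * iqr  # assumed convention: Q1 - 1.5 * IQR
    upper_cutoff = q3 + k * iqr  # assumed convention: Q3 + 1.5 * IQR
    return [v for v in values if v < lower_cutoff or v > upper_cutoff]

# Hypothetical observations for illustration only.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 30]
print(potential_outliers(sample))
```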
Example: Again, let’s look at the text messaging data set from the course website.
Questions:
21. Using JMP, create a histogram for the number of text messages sent in a day.
22. Looking at the histogram created in Question 21 describe the shape/distribution for the number
of text messages sent in a day.
23. Looking at the boxplot created in JMP, is there any evidence of potential outliers? Explain.
Example: Consider two populations in the same state, where both populations are the same size.
Population 1 consists of all students at the state university. Population 2 consists of all residents in a
small town. Consider the variable age. Which population would most likely have the larger standard
deviation? Explain.
Example: A test is given to 100 students, and the median score is determined. After grading the test,
the instructor realizes that the 10 students with the highest scores did exceptionally well. The instructor
decides to award these 10 students a bonus of five additional points. How will the median of the new
score distribution change compared to that of the original distribution? Explain.
Example: The following histogram shows the distribution of the ages of male Oscar winners.
24. Which boxplot is graphing the same data as the histogram? Explain.
[Figure: four candidate boxplots, labeled a, b, c, and d]
Example: Four histograms are presented below. Each histogram displays the quiz scores on a scale of 0
to 10 for one of four different STAT 110 classes.
25. Which of the classes would you expect to have the smallest standard deviation? Explain.
26. Which of the classes would you expect to have the largest standard deviation? Explain.
Measuring the Position of an Observation
There are two commonly used methods for determining an observation’s position relative to all other
measurements in the data set.

Z-score: This measures how many standard deviations an observation is away from the mean.
Sometimes it is called the ________________________ value.
z-score = $\frac{\text{observation} - \text{mean}}{\text{standard deviation}}$
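The formula translates directly into code. This is a minimal sketch with a hypothetical function name and made-up measurements; it assumes the sample standard deviation (n - 1 divisor) is the one used to standardize.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (observation - mean) / standard deviation for each observation."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor); an assumption
    if s == 0:
        raise ValueError("z-scores cannot be computed when the standard deviation is 0")
    return [(x - x_bar) / s for x in values]

# Hypothetical measurements for illustration; not taken from any course file.
measurements = [40, 44, 46, 50]

for x, z in zip(measurements, z_scores(measurements)):
    print(f"observation = {x}, z-score = {z:.2f}")
```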
Example: From 1947 to 1971 DDT was manufactured in a plant located on Indian Creek which flowed
into the Tennessee River 321 miles from the mouth of the river. In 1972 the EPA banned the use of DDT
in the United States. In the late 1970’s widespread DDT contamination was discovered at the plant site
and in nearby waterways. The data from a study conducted by the U.S. Army Corps of Engineers in the
summer of 1980 can be found in the file Catfish.jmp on the course website. The variables in the data
set are given below:
Fish ID – an identification number for each fish (1 – 44)
Location – the location on the river from which the fish was sampled.
o FCM5 = Flint Creek 5 miles from the mouth
o LCM3 = Limestone Creek 3 miles from the mouth
o SCM1 = Spring Creek 1 mile from the mouth
o In general: TRM### = Tennessee River ### miles from the mouth
Distance from mouth – approximate distance of the sample location from the mouth of the
Tennessee River
Species – fish species (catfish, smallmouth buffalo, largemouth bass)
Length – length of the sampled fish (in cm)
Weight – the weight of the sampled fish (in g)
DDT – the concentration of DDT found in a fillet of fish (in ppm)
A portion of the data set is given below.
Example: To obtain z-scores for the measurements of the variables Length, Weight and DDT, select Save
→ Standardized from the red drop-down arrow next to the variable name. You should then see the
following output in the data table.
Question:
27. Using the first observation show how the z-score for length was calculated.
Interpretation of z-scores

The standardized values transform the data so that the data is placed on the standardized scale.
The standardized scale has a mean of _____ and a standard deviation of _____.

The smallest value in the data set will always have the smallest z-score. Likewise, the largest
value in the data set will have the largest z-score.

If a z-score is _______________, then the data point is that many standard deviations below the
mean.

If a z-score is _______________, then the data point is that many standard deviations above the
mean.

If the z-score is ______________, then the data point is the same as the mean.

If the standard deviation is ________, then the z-score is NOT defined and thus cannot be
computed.
The following graphic compares the data on its original scale to the data on the standardized scale.
Questions:
28. What changes between the two graphs?
29. Why do you think z-scores are so important?
Example: Which is more extreme…a catfish 44.5 cm long or a smallmouth buffalo 43.5 cm long?
Identification of Outliers
There are two basic methods for identifying outliers: ________________ and ________________.

Boxplots: As we have already seen, these are commonly used to identify outliers. Recall that
any measurement beyond the endpoint of either whisker is classified as a potential outlier
(extreme observation).

Z-scores: Z-scores are also used to identify outliers. Any data value whose z-score is below -2 or
above 2 is considered to be a potential outlier. Any data value whose z-score is below -3 or
above 3 is considered an outlier and warrants further investigation.
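The z-score rule just stated can be written as a small helper. The function name and labels are hypothetical; the cutoffs of 2 and 3 are the ones given above.

```python
def classify_z(z):
    """Label a z-score using the cutoffs of 2 and 3 stated in the text."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "potential outlier"
    return "not flagged"

# A few made-up z-scores for illustration.
for z in [-3.2, -1.0, 0.4, 2.5]:
    print(f"z = {z:+.1f}: {classify_z(z)}")
```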
Rules for Data Concentration
Once you have estimated the mean and standard deviation for a set of measurements, you can utilize a
few rules to make statements about where the data is concentrated.

Empirical Rule: If the distribution of the data is _________________________ and symmetric,
then the Empirical Rule applies. This rule says that APPROXIMATELY…
o ________ of the values fall within one standard deviation of the mean.
o ________ of the values fall within two standard deviations of the mean.
o ________ of the values fall within three standard deviations of the mean.
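For any particular data set, the fraction of values within one, two, or three standard deviations of the mean can be counted directly and compared with the rule. The following is a minimal sketch with a hypothetical function name and made-up data. With a sample this small, the observed fractions should be expected to match the rule only roughly.

```python
from statistics import mean, stdev

def fraction_within(values, k):
    """Return the fraction of observations within k standard deviations of the mean."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    inside = [x for x in values if abs(x - x_bar) <= k * s]
    return len(inside) / len(values)

# Hypothetical, roughly symmetric observations for illustration only.
sample = [2, 4, 4, 5, 5, 5, 6, 6, 8]

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(sample, k):.1%}")
```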