Methods for a Single Numeric Variable – Descriptive Statistics
So far this semester we’ve been concentrating on categorical variables/data. We are now going to
discuss how to summarize and describe numeric data, as well as inferential procedures.
Measures of Center
We can use the following measures to describe the center of the distribution of a given data set.

Mean: The arithmetic mean of the observations. Sample mean = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Median: The middle number in a data set after the numbers have been arranged in ascending
(or descending) order. This is the value which cuts off the 50th percentile of the observations,
i.e. 50% of the data values lie above the median and 50% of the data values lie below the
median.
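Outside of JMP, the same two summaries can be checked with a few lines of Python. The sketch below uses the standard library's statistics module on a short list of made-up values (an assumption for illustration only; these are not the pH measurements).

```python
from statistics import mean, median

# Illustrative values only (an assumption; these are not the pH data).
values = [2, 4, 4, 7, 13]

sample_mean = mean(values)      # sum of the observations divided by n
sample_median = median(values)  # middle value after sorting

# With an even number of observations, the median is the average
# of the two middle values.
even_median = median([2, 4, 7, 13])

print(sample_mean, sample_median, even_median)
```

Notice that the single large value (13) pulls the mean above the median in this small list.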
Example: Two researchers measured the pH (a scale on which a value of 7 is neutral and values below 7
are acidic) of water collected from rain and snow over a 6-month period in Allegheny County, PA. The
data can be found on the course website in the file pH.jmp. We can use JMP to find the mean and
median of the data.
1. Click on Analyze > Distribution. Then place pH in the Y, Columns box as shown below.
2. You should then get the following output:
The mean and median are circled in the above output.
In addition to the mean and median, quartiles (or percentiles or quantiles) give additional information
regarding the distribution of the data.

Q1: The first quartile which represents the _______ percentile. This is also the median of the
lower half of the data.

Q2: The second quartile which is the _______________.

Q3: The third quartile which represents the _______ percentile. This is also the median of the
upper half of the data.
Questions:
1. Referring to the JMP output above, identify the values for Q1 and Q3.
Q1 = ________
Q3 = ________
Note: There are a number of other percentiles listed in the JMP output.
2. What values do the 0th and 100th percentiles represent?
0th percentile = __________
100th percentile = __________
3. Together the ________, ________, ________, ________, and ________ form what’s called the
Five Number Summary. The Five Number Summary provides a numeric “picture” of the
distribution of the data.
4. What percent of the observations should fall between the 25th and 75th percentiles?
5. What about the 2.5th and 97.5th percentiles?
6. How about the 0.5th and 99.5th percentiles?
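For readers who want to check quartiles outside of JMP, here is a minimal sketch using Python's statistics.quantiles on made-up values (an assumption; not taken from pH.jmp). Software packages use slightly different conventions for computing quartiles, so results need not match JMP's output exactly.

```python
from statistics import quantiles

# Illustrative values only (an assumption; not the pH data).
values = [1, 3, 4, 6, 8, 9, 12]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
# The default "exclusive" method is one of several conventions.
q1, q2, q3 = quantiles(values, n=4)

# Smallest value, the three quartiles, and the largest value.
five_number_summary = (min(values), q1, q2, q3, max(values))
print(five_number_summary)
```

For this list, Q1 and Q3 agree with the medians of the lower half (1, 3, 4) and the upper half (8, 9, 12).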
Measures of Variability or Spread
Consider the following data sets.
Questions:
7. What is the mean for each data set? The median?
          Data set A    Data set B    Data set C
Mean      __________    __________    __________
Median    __________    __________    __________
8. Is a measure of center enough to describe these data sets? If not, what else do you think should
be used?
There are several measures of variability or spread of a data set.

Range: The difference between the __________________ and __________________
measurements in the data set.
Range = ___________________________________

Interquartile Range (IQR): The difference between the _________ and __________ quartiles.
IQR = __________________

Average Distance from the Mean: To summarize the variability in a set of measurements, we
may want to use every observation in the data set to calculate the “average distance from the
mean.”
Average Distance from Mean = $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$
Calculate the average distance from the mean for Data set B from above.
Observation    Sample Mean    Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances = ________
Average distance from mean = ________
Questions:
9. What is the problem with using this method?
10. It can be shown using a little algebra that we will always get zero for an answer. Do you have
any ideas as to how to overcome this problem?
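The "little algebra" mentioned in Question 10 can be written out as follows, using only the definition of the sample mean:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

Because the sum of the distances is always zero, dividing by n also gives zero, no matter how spread out the data are.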

Mean Absolute Deviation (MAD): This is the average distance from the mean calculated using
absolute difference. Compute the MAD for Data set B from above.
Observation    Sample Mean    Absolute Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances = ________
MAD = ________
Although this gives a valid measure of variability in a data set, its statistical properties are
difficult to work with. Traditionally the ____________________ and __________________ are
used instead.
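To check a hand calculation of the MAD, the table for Data set B can be reproduced in a few lines of Python. This is only a sketch; the variable names are illustrative.

```python
# Observations of Data set B, as listed in the table above.
data_b = [0, 10, 20, 30, 40]

n = len(data_b)
x_bar = sum(data_b) / n  # sample mean

# Absolute distance of each observation from the mean.
abs_distances = [abs(x - x_bar) for x in data_b]

# Mean absolute deviation: the average of the absolute distances.
mad = sum(abs_distances) / n

print(abs_distances, mad)
```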

Variance: The average _______________ distance from the mean.
Sample variance = $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Compute the sample variance for Data set B.
Observation    Sample Mean    Squared Distance
0              20             ________
10             20             ________
20             20             ________
30             20             ________
40             20             ________
Sum of distances = ________
Sample variance = ________

Standard Deviation: The _____________ square root of the variance.
Sample standard deviation = $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Compute the sample standard deviation for Data set B.
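The sample variance and standard deviation formulas can be checked against Python's statistics module, which also divides by n - 1. The sketch below works through Data set B step by step; the variable names are illustrative.

```python
import math
from statistics import stdev, variance

# Observations of Data set B, as listed in the table above.
data_b = [0, 10, 20, 30, 40]

n = len(data_b)
x_bar = sum(data_b) / n

# Squared distance of each observation from the mean.
squared_distances = [(x - x_bar) ** 2 for x in data_b]

# Sample variance divides by n - 1 rather than n.
s2 = sum(squared_distances) / (n - 1)

# Sample standard deviation is the positive square root of the variance.
s = math.sqrt(s2)

# The library functions use the same n - 1 definitions.
print(s2, variance(data_b))
print(s, stdev(data_b))
```

The standard deviation is in the same units as the data, while the variance is in squared units.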
Example: Looking again at the pH data, compute the range, IQR, sample variance, and sample standard
deviation using the JMP output.
Range = ________
IQR = ________
Sample standard deviation (s) = ________
Sample variance ($s^2$) = ________
Example: Download the file messages.jmp from the course website. This data set contains the number
of text messages each individual sends in a day, collected from the student data survey you may have
completed at the beginning of the semester. Answer the following questions regarding the data set.
Questions:
11. Compute the average number of text messages sent in a day.
12. Find the Five Number Summary for the number of text messages sent in a day.
13. Compute the range for the number of text messages sent in a day.
14. Between what two values does the middle 50% of text messages sent in a day lie?
15. Compute the sample standard deviation and sample variance for the number of text messages
sent in a day.
We’ve been looking at numerical summaries used to describe a single numeric variable. We will now
look at the various methods to graphically summarize these types of variables.
Description of Shape
We can use many different types of graphical summaries to describe the shape (or distribution) of the
observed data.
Comments:

When plotting numeric data, the __________________ axis is a number line of values (i.e.
CONTINUOUS!).

The _________________ axis usually represents counts or sometimes the relative frequency of
observations which have the same value.
We will again use the file pH.jmp to discuss the various graphical techniques for describing the shape (or
distribution) of the observed data.

Dotplot
o _________ data point is plotted when creating a dotplot.
o Dotplots are normally used for small data sets.
o JMP does not create dotplots, but we’ve encountered them in TinkerPlots earlier this
semester.

Stem and Leaf Plots
o Again, every data point is plotted when creating a stem and leaf plot.
o Stem and Leaf plots are normally used for small data sets.
o JMP will produce a Stem and Leaf plot by clicking on Analyze > Distribution, put pH in
the Y, Columns box and then click on the little red arrow next to pH in the output.
Choose Stem and Leaf from the menu that appears. You should get the following plot.
Comments:
o The “leaf” always represents the last digit in the values recorded.
o The “stem” represents all the other digits in the values recorded.
o You’ll notice under the stem and leaf plot it says “41|2 represents 4.12.” This is the
legend which tells what the stem and leaf units are for that particular graph. In this case
the stem is the ones and tenths place and the leaf is the hundredths place.
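To see how the stem and leaf split works, here is a minimal Python sketch that builds a text stem and leaf plot for values recorded to the hundredths place. The values are hypothetical (not the pH data), and the layout is a simplified version of what JMP draws.

```python
from collections import defaultdict

# Hypothetical values recorded to two decimal places (not the pH data).
values = [4.12, 4.15, 4.31, 4.38, 4.38, 4.50]

leaves_by_stem = defaultdict(list)
for v in sorted(values):
    hundredths = round(v * 100)          # 4.12 -> 412 (round avoids float error)
    stem, leaf = divmod(hundredths, 10)  # 412 -> stem 41, leaf 2
    leaves_by_stem[stem].append(leaf)

# List every stem from smallest to largest, including empty ones,
# so that gaps in the data remain visible.
lines = []
for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
    leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
    lines.append(f"{stem}|{leaves}")

print("\n".join(lines))
```

Here the first line, 41|25, stands for the two values 4.12 and 4.15.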

Histograms
o This is a good type of plot when you have a lot of observations.
o The observations are placed into “bins” and the height of each bin represents the
number of observations that fall into that bin.
o The histogram is one of the default plots produced when you choose Analyze >
Distribution in JMP. The histogram of the pH data is given below.
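The binning step behind a histogram can be sketched in Python as follows. The observations, the bin width, and the starting edge are all assumptions chosen for illustration; JMP picks its own bins for the pH data.

```python
# Hypothetical observations (not the pH data).
values = [1, 2, 2, 3, 5, 6, 6, 7, 9]
bin_width = 2  # assumed bin width
low = 0        # assumed left edge of the first bin

# Count how many observations fall into each bin [left_edge, left_edge + bin_width).
counts = {}
for v in values:
    left_edge = low + ((v - low) // bin_width) * bin_width
    counts[left_edge] = counts.get(left_edge, 0) + 1

# Print a sideways text histogram: one star per observation in the bin.
for left_edge in sorted(counts):
    print(f"[{left_edge}, {left_edge + bin_width}): {'*' * counts[left_edge]}")
```

Changing bin_width changes the picture, which is why the same data can look somewhat different in two histograms.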
When looking at a dotplot, stem and leaf plot, or histogram of the data, we can describe the
shape/distribution of the data using the following terminology.
o Right Skewed/Positively Skewed
o Left Skewed/Negatively Skewed
o Symmetric
Questions:
16. Describe the shape/distribution of the pH data.
17. Does the information given in the histogram agree with what was seen in the dotplot?

Boxplot
o The boxplot creates a picture of the data using the ______________ as reference points.
o The “box” portion is comprised of _____, _____ and _____.
o The “whiskers” represent one of two things:
  - The endpoint of the lower whisker is the larger of: _____________ or
    _______________________
  - The endpoint of the upper whisker is the smaller of: _____________ or
    _______________________
o Any measure beyond the endpoint of either _________________ is classified as a
potential ____________________________________ observation.
o An outlier boxplot is the other default plot produced when you choose Analyze >
Distribution in JMP. The boxplot for the pH data is given below.
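One common convention for drawing the whiskers, assumed here because the handout leaves the details as blanks, places "fences" 1.5 × IQR beyond the quartiles and flags anything outside them. The sketch below applies that rule to hypothetical values; JMP's quartile calculation may differ slightly.

```python
from statistics import quantiles

# Hypothetical values with one unusually large observation (not the pH data).
values = [1, 3, 4, 6, 8, 9, 30]

q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Assumed rule: fences sit 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences.
lower_whisker = min(v for v in values if v >= lower_fence)
upper_whisker = max(v for v in values if v <= upper_fence)

# Observations beyond the fences are flagged for a closer look.
flagged = [v for v in values if v < lower_fence or v > upper_fence]

print(lower_whisker, upper_whisker, flagged)
```

In this example the upper whisker stops at 9, and the value 30 is plotted on its own beyond it.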
Numerical Measures for Shape
There are two numerical summaries for shape that exist: _______________ and _______________.

Skewness
o A data distribution is said to be symmetric if it has the same shape on both sides of the
center of the distribution. Skewness is a measure of __________________.
Shape           Picture                                   Skewness Measure in JMP
Symmetric       The most famous symmetric distribution    Zero
                is the normal. Others?
Right Skewed                                              Greater than zero
Left Skewed                                               Less than zero
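The sign pattern in the table can be illustrated with a simple moment-based skewness coefficient. This is a sketch under assumptions: the function below is the basic textbook formula, JMP may apply a small-sample adjustment so its values can differ, and the three data sets are made up.

```python
def skewness(values):
    """Moment-based skewness: third central moment over (second central moment)^1.5."""
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m3 = sum((x - x_bar) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Made-up data sets chosen to show each case.
symmetric = [0, 10, 20, 30, 40]
right_skewed = [1, 2, 2, 3, 12]               # long tail to the right
left_skewed = [-x for x in right_skewed]      # mirror image, long tail to the left

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```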

Kurtosis
o This is used to measure the amount of _________________ in the distribution of the
data relative to the normal distribution.
Shape                                                       Picture    Kurtosis Measure in JMP
Normal                                                                 Zero
Positive Kurtosis (taller or skinnier than normal shape)               Greater than zero
Negative Kurtosis                                                      Less than zero
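A kurtosis measure that is zero for the normal distribution is usually computed as "excess" kurtosis, that is, the fourth standardized moment minus 3. The sketch below uses that basic formula on made-up data; it is an assumption that this matches the idea behind JMP's value, and JMP may apply a small-sample adjustment, so its numbers can differ.

```python
def excess_kurtosis(values):
    """Fourth central moment over the squared second central moment, minus 3."""
    n = len(values)
    x_bar = sum(values) / n
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m4 = sum((x - x_bar) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

# Made-up data: most values packed at the center with a few far out.
peaked = [-10, -1, 0, 0, 0, 0, 1, 10]
# Made-up data: values spread evenly, with no peak at all.
flat = [1, 2, 3, 4, 5, 6, 7, 8]

print(excess_kurtosis(peaked), excess_kurtosis(flat))
```

The first data set gives a value greater than zero and the second a value less than zero.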
JMP will display both the skewness and kurtosis values by clicking on the red drop-down arrow next to
pH and choosing Display Options > Customize Summary Statistics and checking Skewness and
Kurtosis. You should then get the following output.
Questions:
18. How did we describe the shape/distribution of the pH data in Question 16?
19. Does the numerical measure for skewness agree with this? Explain.
20. If the data were extremely right skewed, which should be larger: the mean or the median?
Explain why this is the case.
21. If the data were extremely left skewed, which should be larger: the mean or the median?
Explain why this is the case.
22. If the data were symmetric, which should be larger: the mean or the median? Explain why this
is the case.
Example: Again, let’s look at the text messaging data set from the course website.
Questions:
23. Using JMP, create a histogram for the number of text messages sent in a day.
24. Looking at the histogram created in Question 23 describe the shape/distribution for the number
of text messages sent in a day.
25. Looking at the boxplot created in JMP, is there any evidence of potential outliers? Explain.
26. Give the values for skewness and kurtosis.
27. Does the value for skewness agree with your answer to Question 24?
Example: Consider two populations in the same state, where both populations are the same size.
Population 1 consists of all students at the state university. Population 2 consists of all residents in a
small town. Consider the variable age. Which population would most likely have the larger standard
deviation? Explain.
Example: A test is given to 100 students, and the median score is determined. After grading the test,
the instructor realizes that the 10 students with the highest scores did exceptionally well. The instructor
decides to award these 10 students a bonus of five additional points. How will the median of the new
score distribution change compared to that of the original distribution? Explain.
Example: The following histogram shows the distribution of the ages of male Oscar winners.
28. Which boxplot is graphing the same data as the histogram? Explain.
[Boxplots a, b, c, and d]
Example: Four histograms are presented below. Each histogram displays the quiz scores on a scale of 0
to 10 for one of four different STAT 110 classes.
29. Which of the classes would you expect to have the smallest standard deviation? Explain.
30. Which of the classes would you expect to have the largest standard deviation? Explain.