Notes - Describing Data Numerically

Describing Distributions Numerically
Center
**Whenever you find the center of data, always ask yourself how well it actually
summarizes the data!
Median: middle value that divides the histogram into two equal areas; same units as
data (be sure to include the units when describing the median)
Finding the Median: (with data in numerical order)
• When n is odd: the median is the value in position $\frac{n+1}{2}$
• When n is even: there are two middle values; the median is the average of the
values in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$
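The position rules above can be sketched in Python. This is only an illustration: the function name `median_by_position` and the sample values are made up for this example and do not come from the notes.

```python
def median_by_position(values):
    """Median using the position rules above (positions are counted from 1)."""
    ordered = sorted(values)          # the data must be in numerical order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # n odd: take the value in position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # n even: average the values in positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_example = median_by_position([7, 1, 5, 3, 9])   # ordered: 1, 3, 5, 7, 9
even_example = median_by_position([7, 1, 5, 3])     # ordered: 1, 3, 5, 7
print(odd_example, even_example)
```

Python lists count positions from 0, which is why each position has 1 subtracted before indexing.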
Mean: $\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$
Can be affected by outliers; the mean is “pulled” in the direction of the longer tail
Only appropriate when the shape is symmetric and there are no outliers
To check for symmetry and outliers, plot the data!
For fairly symmetric distributions, the mean and median are very close. In this
situation, the mean will be most useful (as we’ll see in later chapters)
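A small Python sketch of how one outlier pulls the mean but not the median. The two data sets are invented for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Made-up data: the second list is the first with its largest value
# replaced by an outlier.
symmetric = [10, 12, 14, 16, 18]
with_outlier = [10, 12, 14, 16, 98]

print("symmetric:    mean =", mean(symmetric), " median =", median(symmetric))
print("with outlier: mean =", mean(with_outlier), " median =", median(with_outlier))
```

In the symmetric list the mean and median agree. With the outlier, the mean moves toward the long tail while the median stays where it was.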
Spread
**The more the data vary, the less the median alone can tell us
Range:
• Range = max - min
• Range is always a single number (not an interval of values)
• Range is not often used in statistics because it is too susceptible to outliers
Interquartile Range:
• Lower quartile (25th percentile): the median of the lower half of the data; 25% of
the data lie below it
• Upper quartile (75th percentile): the median of the upper half of the data; 25% of
the data lie above it
• Interquartile Range (IQR): the width of the middle half of the data
IQR = upper quartile – lower quartile
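A rough Python sketch of the range and IQR, using the "median of each half" definition above. One assumption is not stated in the notes: when n is odd, the overall median is left out of both halves. Other textbooks and software use slightly different quartile conventions, so their answers may differ a little. The function name `quartiles` and the data are illustrative only.

```python
from statistics import median


def quartiles(values):
    """Lower and upper quartiles as the medians of the lower and upper halves.

    Assumption: for odd n the middle value belongs to neither half.
    Needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]        # smallest n // 2 values
    upper_half = ordered[n - half:]    # largest n // 2 values
    return median(lower_half), median(upper_half)


data = [2, 4, 5, 7, 8, 10, 13, 15]     # made-up values
q1, q3 = quartiles(data)
data_range = max(data) - min(data)     # a single number, not an interval
iqr = q3 - q1
print("range =", data_range, " Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
```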
Standard Deviation:
• Although the IQR is a reasonable summary of spread, it ignores much of the
information about how the individual data values vary
• Standard Deviation is a better measure, but is ONLY appropriate for symmetric data
• Variance: gives the “average” of the squared deviations from the mean (squaring keeps
positive and negative deviations from cancelling out), but the units are also squared
• To get back to the original (and more useful) units, we take the square root to get the
standard deviation
• Variance: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
• Standard Deviation: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
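The two formulas above can be checked with a short Python sketch. The function names `sample_variance` and `sample_sd` and the data are assumptions made for this example. The standard library's `statistics.variance` and `statistics.stdev` also divide by n - 1, so they serve as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)


def sample_sd(values):
    """s = square root of the variance, back in the original units."""
    return math.sqrt(sample_variance(values))


data = [4, 6, 8, 10, 12]               # made-up values; the mean is 8
print("variance =", sample_variance(data))
print("sd       =", sample_sd(data))
print("library  =", statistics.variance(data), statistics.stdev(data))
```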
Shape, Center, and Spread:
• If the shape is skewed, report the median and the IQR. You may want to include the
mean and standard deviation, but you should point out why the mean and median
differ. The fact that the mean and median do not agree is a sign that the distribution
may be skewed. A histogram will help you make that point.
• If the shape is symmetric, report the mean and standard deviation and possibly the
median and IQR as well. For symmetric data the IQR is usually a bit larger than the
standard deviation. If that is not true for your data set, look again to make sure the
distribution isn’t skewed and there are no outliers.
• If there are any clear outliers and you are reporting the mean and standard
deviation, report them with the outliers present and with the outliers removed. The
differences may be revealing. (Of course, the median and IQR are not likely to be
affected by the outliers.)
• Always pair the median with the IQR and the mean with the standard deviation
• Generally, report summary statistics to one or two more decimal places than the
original data
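A hedged Python sketch of the reporting advice above: summaries computed with and without a clear outlier, rounded to one more decimal place than the whole-number data. The helper `summarize`, the data, and the choice of 60 as the outlier are all invented for illustration.

```python
from statistics import mean, median, stdev


def summarize(values, decimals):
    """Center and spread, rounded for reporting (hypothetical helper)."""
    return {
        "mean": round(mean(values), decimals),
        "sd": round(stdev(values), decimals),
        "median": round(median(values), decimals),
    }


# Made-up whole-number data; 60 is the value judged to be a clear outlier.
data = [21, 23, 24, 25, 27, 60]
without_outlier = [y for y in data if y != 60]

with_summary = summarize(data, 1)
without_summary = summarize(without_outlier, 1)
print("with outlier:   ", with_summary)
print("without outlier:", without_summary)
```

The mean and standard deviation change a great deal when the outlier is removed, while the median barely moves.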
Categorical Data:
**Step-by-Step: pg. 85
**Read “What Can Go Wrong?” – pg. 86-87
Re-expressing to Equalize the Spread of Groups
• Useful when comparing groups that have very different spreads
• For measurements that can’t be negative and whose distributions are skewed to the
high end, a good first guess at a re-expression is the logarithm. This can also
improve the symmetry of the distribution and pull in most of the apparent outliers.
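A minimal Python sketch of a log re-expression. The two groups are made up: both are positive and skewed to the high end, and group B is simply ten times group A, so their spreads differ greatly on the original scale. Base-10 logs are an assumption; any base leads to the same conclusion.

```python
import math
from statistics import stdev

# Made-up positive, right-skewed measurements for two groups.
group_a = [1, 2, 4, 8, 16]
group_b = [10, 20, 40, 80, 160]

log_a = [math.log10(y) for y in group_a]
log_b = [math.log10(y) for y in group_b]

print("original SDs:", stdev(group_a), stdev(group_b))
print("log SDs:     ", stdev(log_a), stdev(log_b))
```

On the original scale group B's standard deviation is ten times group A's. After taking logs the two spreads match, because multiplying by 10 becomes adding 1 on the log scale.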
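Changing Units:
• Adding/Subtracting a constant – shifts measures of center (and position, such as the
quartiles) by that constant; measures of spread do not change
• Multiplying/Dividing by a (positive) constant – multiplies or divides every measure of
center and spread by that constant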
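The shifting and rescaling rules can be checked with a short Python sketch. The data and the constants 5 and 3 are arbitrary choices for illustration.

```python
import math
from statistics import mean, median, stdev

data = [10, 20, 30, 40, 50]            # made-up values
shifted = [y + 5 for y in data]        # add a constant to every value
scaled = [y * 3 for y in data]         # multiply every value by a constant

for label, values in [("original", data), ("shifted", shifted), ("scaled", scaled)]:
    print(label, "mean =", mean(values), "median =", median(values),
          "sd =", round(stdev(values), 2), "range =", max(values) - min(values))
```

Adding 5 moves the mean and median up by 5 and leaves the standard deviation and range alone. Multiplying by 3 multiplies all four summaries by 3.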
Notations:
• n = number of values
• $\bar{y}$ (y-bar) = the mean of y; a bar over any variable generally denotes its mean
• $\sum$ (sigma) = sum of the observations
• SQRT(y) or y^0.5 = common computer representations for square root
• s = standard deviation
• $s^2$ = variance
Boxplots
5-Number Summary: a description of a distribution that reports the median, quartiles,
and extremes (max and min).
Boxplot: a visual display of the 5-Number Summary; useful for comparison
To make a boxplot:
• Draw a single axis (vertical or horizontal) spanning the extent of the data.
• Draw short lines (horizontal or vertical, respectively) at the lower and upper quartiles
and at the median.
• Connect the lines to form a box.
• Outliers: Construct “fences.” Fences are just for construction, not part of the
display. Do not include them in your boxplot!
   o Upper fence = Q3 + 1.5(IQR)
   o Lower fence = Q1 - 1.5(IQR)
• Draw lines from the ends of the box up and down to the most extreme data values
found within the fences. If a data value falls outside one of the fences, do not connect it.
• Any data values beyond the fences are represented by special symbols (often dots
or x’s). These are the outliers. “Far outliers” – more than 3 IQRs from the quartiles
– often use a different symbol.
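The construction steps above can be sketched numerically in Python, without drawing anything. The function name `boxplot_parts` and the data are assumptions for illustration. The quartiles use the "median of each half" rule with the middle value left out of both halves when n is odd, which is one convention among several.

```python
from statistics import median


def boxplot_parts(values):
    """Numbers needed to draw a boxplot (hypothetical helper, needs n >= 2)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])          # median of the lower half
    q3 = median(ordered[n - half:])      # median of the upper half
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr         # construction only, never drawn
    upper_fence = q3 + 1.5 * iqr
    inside = [y for y in ordered if lower_fence <= y <= upper_fence]
    outliers = [y for y in ordered if y < lower_fence or y > upper_fence]
    far = [y for y in outliers if y < q1 - 3 * iqr or y > q3 + 3 * iqr]
    return {
        "q1": q1, "median": median(ordered), "q3": q3,
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),  # most extreme values inside the fences
        "outliers": outliers,
        "far_outliers": far,
    }


parts = boxplot_parts([2, 4, 5, 7, 8, 10, 13, 15, 40])   # made-up values
print(parts)
```

Here the value 40 lies beyond the upper fence, so it is plotted as an outlier and the upper whisker stops at 15.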
Summarizing a Boxplot
• If the median is roughly centered between the quartiles, then the middle half of the
data is roughly symmetric
• If it is not centered, the distribution is skewed.
• The “whiskers” show skewness as well if they are not roughly the same length
**Step-by-Step: pg. 78-79
**TI Tips: pg. 80
**Just Checking: pg. 84
Suggested Practice (Old Book): #3, 5-8, 12-14, 17, 19, 21, 24, 27, 28, 30, 34, 47
Suggested Practice (New Book):
pg. 160 #4.1-4.16
pg. 169 #4.17*, 4.19*, 4.21, 4.22, 4.23, 4.25, 4.29, 4.30
pg. 176 #4.32-4.37