Chapter 3
Measures of the Middle
1. mean (or average)- is computed by adding all the data points and dividing
by the total number of data points.
- common notation for the mean is $\bar{x}$
- other notation is $\bar{x} = \frac{1}{n}\sum x_i$
- is not a resistant measure of center (it is influenced by extreme
observations)
2. median- is found by ordering all observations from smallest to largest
and picking the observation in the middle; if two observations are in the
middle, then average the two to find the median
- notation is M for median
- is a resistant measure of center
- is the observation where 50% of the data is smaller and 50% of the
data is larger
3. mode- is the observation that occurs most often
- on a graph it would have the largest “bar”
- a set of data could be bimodal (have two modes) or multimodal
4. midrange- is the value halfway between the smallest and largest
observations, (minimum + maximum)/2
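The four measures of the middle can be sketched in a few lines of Python. The data values below are made up for illustration, and the variable names are hypothetical; only the standard library `statistics` module is assumed.

```python
from statistics import mean, median, multimode

# Made-up sample, used only for illustration.
data = [2, 3, 3, 5, 7, 10]

# Mean: add all the data points and divide by how many there are.
center_mean = mean(data)

# Median: middle of the sorted data; with an even count,
# the two middle observations (3 and 5) are averaged.
center_median = median(data)

# Mode(s): the observation(s) that occur most often.
# multimode returns a list, so bimodal data would give two values.
modes = multimode(data)

# Midrange: halfway between the smallest and largest observations.
midrange = (min(data) + max(data)) / 2

print(center_mean, center_median, modes, midrange)
```

Note that the four measures do not have to agree, even on a small data set like this one.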
Comparing the mean and the median
-Remember they are computed differently and thus represent different
things.
- If a distribution is completely symmetric, then the two will be the same.
- In skewed distributions the mean is pulled farther toward the tail than
the median.
- In skewed distributions it is better to use the median as a measure
of the center, because the median is resistant
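A quick way to see the comparison above is to compute both measures on a symmetric data set and on a skewed one. This is a minimal sketch; both data sets are invented for illustration.

```python
from statistics import mean, median

# Completely symmetric data: mean and median coincide.
symmetric = [1, 2, 3, 4, 5]

# Right-skewed data: one large value forms a long right tail.
skewed = [30, 32, 35, 38, 40, 45, 250]

sym_mean, sym_median = mean(symmetric), median(symmetric)
skew_mean, skew_median = mean(skewed), median(skewed)

print("symmetric:", sym_mean, sym_median)
print("skewed:   ", round(skew_mean, 2), skew_median)
```

In the skewed set the mean lands above most of the observations, while the median stays with the bulk of the data.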
Measures of Spread
1. range- is found by subtracting the smallest observation from the largest; it
is a single numerical value, not an interval
2. standard deviation- is the measure of spread around the mean
- notation is S for the standard deviation of a sample and $\sigma$ for the standard
deviation of an entire population
- for a sample it is computed by $S = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$, where n-1 is the
degrees of freedom
- for a population it is computed by $\sigma = \sqrt{\frac{1}{N}\sum (x_i - \mu)^2}$, where $\mu$ is the
population mean and N represents the total number of observations in the population
3. variance- is the square of the standard deviation
- the variance of a sample is $S^2$ and the variance of a population is $\sigma^2$
4. interquartile range- is a measure of the spread of the middle 50% of the
data
- is calculated by subtracting the first quartile from the third, IQR=Q3-Q1
- Q1 is the median of all observations to the left of Q2 (the median, M)
- Q3 is the median of all observations to the right of Q2
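The measures of spread above can be sketched as follows. The helper names (`sample_sd`, `population_sd`, `quartiles`) are hypothetical, and the data are made up. One assumption is worth flagging: when the number of observations is odd, this sketch leaves the median itself out of both halves when finding Q1 and Q3; other textbooks and software use slightly different quartile rules.

```python
from math import sqrt
from statistics import median

def sample_sd(xs):
    """Sample standard deviation S: divide by n - 1 (degrees of freedom)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))

def population_sd(xs):
    """Population standard deviation sigma: divide by N."""
    N = len(xs)
    mu = sum(xs) / N
    return sqrt(sum((x - mu) ** 2 for x in xs) / N)

def quartiles(xs):
    """Return (Q1, M, Q3); Q1 and Q3 are medians of the lower and upper halves."""
    s = sorted(xs)  # always sort before finding medians
    half = len(s) // 2
    lower = s[:half]
    # Assumption: with an odd count, the middle observation is excluded.
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(lower), median(s), median(upper)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

data_range = max(data) - min(data)   # a single number, not an interval
s = sample_sd(data)
sigma = population_sd(data)
q1, m, q3 = quartiles(data)
iqr = q3 - q1

print(data_range, round(s, 3), sigma, s ** 2, sigma ** 2, iqr)
```

Squaring `s` or `sigma` gives the corresponding variance, and `s` is always at least as large as `sigma` on the same numbers because it divides by n-1 instead of N.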
Five Number Summary
Minimum   Q1   M   Q3   Maximum
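As a rough sketch, the five number summary can be collected with one small function. The name `five_number_summary` and the data are hypothetical, and the quartile rule (leaving the median out of both halves when the count is odd) is an assumption rather than the only convention in use.

```python
from statistics import median

def five_number_summary(xs):
    """Return (minimum, Q1, M, Q3, maximum) for a list of numbers."""
    s = sorted(xs)
    half = len(s) // 2
    lower = s[:half]
    # Assumption: with an odd count, the middle observation is excluded.
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return s[0], median(lower), median(s), median(upper), s[-1]

data = [12, 1, 7, 3, 15, 4, 10, 6, 8]  # made-up, deliberately unsorted
summary = five_number_summary(data)
print(summary)
```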
Boxplot
- is a graph of the five number summary
- Steps
1. use only one axis (draw it with a scale)
2. one end of the box is Q1 the other is Q3
3. place the minimum and maximum on the graph and draw lines
connecting them to the box
4. draw a line through the box representing the median
Note: You can also draw a boxplot with outliers represented as dots.
- An outlier is any point more than 1.5 IQR below Q1 or above Q3.
- Thus, any number outside of the range (Q1 - 1.5 IQR, Q3 + 1.5 IQR)
is an outlier.
- The lines from the box would then end at the smallest and largest
observations that are not outliers, instead of at the minimum and maximum.
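The 1.5 IQR rule can be checked numerically. This is a minimal sketch with made-up data; the names `low_fence` and `high_fence` are illustrative, and the quartile rule (median excluded from both halves when the count is odd) is an assumption.

```python
from statistics import median

data = sorted([1, 3, 4, 6, 7, 8, 10, 12, 40])  # made-up data

# Quartiles as medians of the lower and upper halves.
half = len(data) // 2
lower = data[:half]
upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
q1, q3 = median(lower), median(upper)
iqr = q3 - q1

# Anything outside (Q1 - 1.5 IQR, Q3 + 1.5 IQR) is flagged as an outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# The boxplot lines end at the most extreme points that are not outliers.
inside = [x for x in data if low_fence <= x <= high_fence]
whisker_low, whisker_high = inside[0], inside[-1]

print(low_fence, high_fence, outliers, whisker_low, whisker_high)
```

Here the largest observation is drawn as a separate dot, and the upper line stops at the next number smaller than it.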
What can go wrong?
1. Not making a reality check. Make sure your calculated summaries make
sense.
2. Forgetting to sort the values before calculating the median or percentiles.
3. Computing numerical summaries for qualitative variables.
4. Not taking into account multimodal situations. If the data has gaps in it,
then you may want to discuss each cluster separately and not give only one
measure of center and spread.
5. Not scrutinizing outliers. If you have an outlier, check it out. It could be
an error. If it is, toss it. If it’s not, use resistant measures of center and spread
(median and IQR).
6. Not checking the picture (graph). If the data has outliers or is heavily
skewed then maybe the mean and standard deviation aren’t the best
measures to use.
7. Not taking spread into account when comparing different groups.