Describing distributions with numbers
• A large number of numerical methods are available for
describing quantitative data sets. Most of these methods
measure one of two data characteristics:
- The central tendency of the set of observations: the
tendency of the data to cluster, or center, about certain
numerical values.
- The variability of the set of observations: the spread
of the data.
week2
Measuring Center
• The mode is the observation that occurs most frequently.
• The mode for a categorical variable is the label of the
category with the highest count.
• Two common measures of center are the mean and the median.
- These two measures behave differently.
- The mean is the “average value” and the median is the “middle
value”.
Measuring center: the median
• The median M is the midpoint of the distribution, the number
such that half the observations are smaller than it and the
other half are larger.
• To find the median of a distribution:
1. Arrange the observations in order of size, from smallest
to largest.
2. If the number of observations n is odd, the median is the
center observation in the ordered list.
3. If the number of observations n is even, the median is the
average of the two center observations in the ordered list.
Example
The annual salaries (in thousands of $) of a random
sample of five employees of a company are:
40, 30, 25, 200, 28
Arranging the values in increasing order:
25 28 30 40 200
median = 30
Excluding 200, the median is (28 + 30)/2 = 29.
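The three-step median procedure and the salary example can be sketched in Python as follows. The function name `median` is chosen here for illustration and is not part of the slides.

```python
def median(values):
    """Return the median by sorting and taking the middle observation(s)."""
    ordered = sorted(values)              # step 1: smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # step 2: n odd -> center observation
        return ordered[mid]
    # step 3: n even -> average of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2

salaries = [40, 30, 25, 200, 28]          # thousands of dollars, from the example
without_200 = [s for s in salaries if s != 200]

print(median(salaries))       # 30
print(median(without_200))    # 29.0
```

Dropping the extreme salary moves the median only from 30 to 29.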
• MINITAB commands: Stat > Basic Statistics > Display
Descriptive Statistics
• MINITAB output for the data in the example above is given
below:
Variable  N  Median
salary    5  30.0
Measuring center: mean
• To find the mean x̄ of a set of observations, add their
values and divide by the number of observations. If the n
observations are x1, x2, …, xn, their mean is given by
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$
• Example
Find the mean of the following observations: 4, 5, 9, 3, 5.
Solution:
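The slide leaves the solution blank; a minimal Python sketch of the calculation, applying the definition of the mean directly, might look like this (the helper name `mean` is illustrative):

```python
def mean(values):
    """Add the observations and divide by how many there are."""
    return sum(values) / len(values)

observations = [4, 5, 9, 3, 5]
print(sum(observations))    # 26
print(mean(observations))   # 5.2
```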
Example
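• The annual salaries (in thousands of $) of a random sample of
five employees of a company are: 40, 30, 25, 200, 28.
The mean is (40 + 30 + 25 + 200 + 28)/5 = 323/5 = 64.6.
If we exclude 200 as an outlier, the mean of the remaining four
values is 123/4 = 30.75.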
• Mean is sensitive to the influence of a few extreme
observations. Because the mean cannot resist the influence of
extreme values, we say that it is NOT a resistant measure of
center.
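To see the lack of resistance numerically, the sketch below compares the mean and the median of the salary data with and without the value 200. It uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

salaries = [40, 30, 25, 200, 28]          # thousands of dollars
trimmed = [s for s in salaries if s != 200]

# The mean drops sharply when the extreme value is removed...
print(mean(salaries), mean(trimmed))      # 64.6 30.75
# ...while the median barely moves.
print(median(salaries), median(trimmed))  # 30 29.0
```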
Example - Calculation for grouped data
Determine the mean of the data represented by the following
frequency table.
Class Interval  Frequency
10-20           2
20-30           4
30-40           7
40-50           6
50-60           1
Solution:
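The slide does not show the solution. One common approach, assumed here, is to represent every observation in a class by the class midpoint and take the frequency-weighted mean of the midpoints. A sketch in Python:

```python
# (lower limit, upper limit, frequency) for each class in the table
classes = [(10, 20, 2), (20, 30, 4), (30, 40, 7), (40, 50, 6), (50, 60, 1)]

# Assumption: each observation is approximated by its class midpoint.
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)                                           # 20 observations
weighted_sum = sum(m * f for m, f in zip(midpoints, freqs))
grouped_mean = weighted_sum / n

print(n, weighted_sum, grouped_mean)   # 20 700.0 35.0
```

The result is an approximation, since the raw observations are not available from the table.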
Mean versus median
• The median and mean are the most common measures of the
center of a distribution.
• If the distribution is exactly symmetric, the mean and median
are exactly the same.
• The median is less influenced by extreme values than the mean.
• If the distribution is skewed to the right, then typically
mode < median < mean.
• If the distribution is skewed to the left, then typically
mean < median < mode.
Trimmed mean
• Trimmed mean is a measure of the center that is more resistant
than the mean but uses more of the available information than
the median.
• To compute the 10% trimmed mean, discard the highest 10%
and the lowest 10% of the observations and compute the mean
of the remaining 80%. Similarly, we can compute 5%, 20%
etc. trimmed mean.
• Trimming eliminates the effect of a small number of outliers.
Example
• Compute the 10% trimmed mean of the data given below.
20 40 22 22 21 21 20 10 20 20
20 13 18 50 20 18 15  8 22 25
• Solution:
- Arrange the values in increasing order:
 8 10 13 15 18 18 20 20 20 20
20 20 21 21 22 22 22 25 40 50
- There are 20 observations and 10% of 20 = 2. Hence, discard
the first 2 and the last 2 observations in the ordered data and
compute the mean of the remaining 16 values.
Variable  N   Mean
C2        16  19.812
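The trimming steps above can be sketched in Python as follows. The function `trimmed_mean` is written for this example; truncating the number of discarded values when the percentage of n is not a whole number is an assumption, since the slides only cover the whole-number case.

```python
def trimmed_mean(values, proportion):
    """Mean after discarding `proportion` of the data from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)    # number to drop from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [20, 40, 22, 22, 21, 21, 20, 10, 20, 20,
        20, 13, 18, 50, 20, 18, 15, 8, 22, 25]

print(trimmed_mean(data, 0.10))   # 19.8125
print(sum(data) / len(data))      # 21.25, the ordinary mean for comparison
```

The trimmed value agrees with the mean of 19.812 reported for the 16 remaining observations.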
Exercise 1.77 on page 63 in IPS.
Questions
1. You are asked to recommend a measure of center to
characterize the following data:
0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4.
What is your recommendation and why?
2. The mean is ____ sensitive to extreme values than the
median.
(a) more
(b) less
(c) equally
(d) can’t say without data
3. Changing the value of a single score in a data set will
necessarily cause the mean to change. (T/F)
4. Changing the value of a single score in a data set will
necessarily cause the median to change. (T/F)
Percentiles
• The simplest useful numerical description of a distribution
consists of both a measure of center and a measure of spread.
• We can describe the spread or variability of a distribution by
giving several percentiles.
• The pth percentile of a distribution is the value such that p
percent of the observations are less than or equal to it.
• The median is the 50th percentile.
• If a data set contains n observations, then the pth percentile is
the (n + 1)p/100 th value in the ordered data set.
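A minimal sketch of the position rule in Python, using the salary data from the earlier median example. The slides do not say what to do when (n + 1)p/100 is not a whole number; linear interpolation between the two neighbouring ordered values is assumed here, and the function name `percentile` is illustrative.

```python
def percentile(values, p):
    """Value at position (n + 1) * p / 100 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p / 100          # 1-based position
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if lower < 1:                         # position falls before the first value
        return ordered[0]
    if lower >= n:                        # position falls at or past the last value
        return ordered[-1]
    # Assumption: interpolate linearly for fractional positions.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

salaries = [40, 30, 25, 200, 28]
print(percentile(salaries, 50))   # 30.0, the median
print(percentile(salaries, 25))   # 26.5, position 1.5 lies between 25 and 28
```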
Example
• Find the 20th percentile of the data represented by the
following stem-and-leaf plot.
Stem-and-leaf of Rural   N = 29
Leaf Unit = 1.0          N* = 7
   1    2  1
   5    3  3589
 (12)   4  122333456788
  12    5  112467
   6    6  7
   5    7  04
   3    8  48
   1    9
   1   10  8
Solution
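The slide leaves the solution blank. A sketch of the calculation in Python, assuming the position rule (n + 1)p/100 from the Percentiles slide and reading the 29 observations off the stem-and-leaf plot (stem = tens, leaf = units):

```python
# Stems and leaves transcribed from the plot (Leaf Unit = 1.0)
stems = {2: "1", 3: "3589", 4: "122333456788", 5: "112467",
         6: "7", 7: "04", 8: "48", 9: "", 10: "8"}

data = sorted(10 * stem + int(leaf)
              for stem, leaves in stems.items()
              for leaf in leaves)

n = len(data)                       # 29
position = (n + 1) * 20 / 100       # 6.0 -> the 6th ordered value
p20 = data[int(position) - 1]       # convert 1-based position to 0-based index

print(n, position, p20)             # 29 6.0 41
```

Under this rule the 20th percentile is the 6th smallest observation, 41.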
Quartiles
• The 25th percentile is called the first quartile (Q1).
• The first quartile (Q1) is the median of the observations whose
position in the ordered list is to the left of the location of the
overall median.
• The 75th percentile is called the third quartile (Q3).
• The third quartile (Q3) is the median of the observations whose
position in the ordered list is to the right of the location of the
overall median.
NOTE: The median is the second quartile (Q2).
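The "median of each half" definition of the quartiles can be sketched in Python as follows, using the 20 highway mileages that appear in this section's examples. The function name `quartiles` is illustrative; when n is odd the overall median is left out of both halves, following the "to the left / to the right of the location of the median" wording.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) as medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # left of the median's location
    upper = ordered[half + n % 2:]      # right of the median's location
    return median(lower), median(ordered), median(upper)

mileages = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
            23, 24, 25, 25, 26, 28, 28, 28, 29, 32]

print(quartiles(mileages))   # (18.0, 23.0, 27.0)
```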
Example
The highway mileages of 20 cars, arranged in increasing order, are:
13 15 16 16 17 19 20 22 23 23 | 23 24 25 25 26 28 28 28 29 32.
The median is …
The first quartile Q1 is …
The third quartile Q3 is …
• Exercise: For the above data set, find
(a) the 10th percentile.
(b) the 90th percentile.
Measuring Spread
• The range (max-min) is a measure of spread but it is very
sensitive to the influence of extreme values.
• The distance between the first and third quartiles is called the
interquartile range (IQR), i.e. IQR = Q3 – Q1.
• The IQR is another measure of spread that is less sensitive to
the influence of extreme values.
The five-number summary
• The five-number summary of a set of observations consists of
the smallest observation, the first quartile, the median, the
third quartile and the largest observation.
• These five numbers give a reasonably complete description of
both the center and the spread of the distribution.
• MINITAB commands: Stat > Basic Statistics > Display
Descriptive Statistics
Example
• The highway mileages of 20 cars, arranged in increasing order are:
13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32.
Give the five number summary.
• Answer
From example 1.14 p45 we have, min = 13, first quartile = 18,
median = 23, third quartile = 27 , max. = 32.
The MINITAB output using the above commands is as follows:
Variable  N   Minimum  Q1     Median  Q3     Maximum
mileage   20  13.00    17.50  23.00   27.50  32.00
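The hand-computed quartiles (18 and 27) and the software quartiles (17.50 and 27.50) differ because software packages often interpolate between ordered values instead of taking the median of each half. As an illustration only, Python's `statistics.quantiles` with its default "exclusive" method interpolates at position i(n + 1)/4, and for these data it gives the same Q1 and Q3 as the output above. The helper name `five_number_summary` is hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using interpolated quartiles."""
    ordered = sorted(values)
    # n=4 asks for the three quartile cut points; the default method
    # ("exclusive") interpolates at position i * (n + 1) / 4.
    q1, med, q3 = quantiles(ordered, n=4)
    return ordered[0], q1, med, q3, ordered[-1]

mileages = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
            23, 24, 25, 25, 26, 28, 28, 28, 29, 32]

print(five_number_summary(mileages))   # (13, 17.5, 23.0, 27.5, 32)
```

Either rule is acceptable for describing the data; the differences are usually small.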
Box-plot
• A box-plot is a graph of the five-number summary.
• Example:
Make a box-plot for the data in the above example.
[Figure: Boxplot of Mileages; vertical axis: Mileages, scale 15 to 30]
• MINITAB commands: Graph > Boxplot
Exercise
• The dot-plot for a set of 20 observations is given below:
[Figure: Dotplot for Score; horizontal axis: Score, scale 90 to 120]
• Draw a box-plot for the data (use the same scale).
Exercise
The stemplot for a set of 50 observations is given below:
- Draw a box-plot for the data.
- Compute the 40% trimmed mean.
Stem-and-leaf of Fees   N = 50
Leaf Unit = 1.0
   2   0  89
   7   1  00234
  15   1  55558899
 (28)  2  0000000000000111112222222223
   7   2  59
   5   3  0
   4   3  5
   3   4  00
   1   4
   1   5  0
Exercise
• The box-plot, histogram and stem-and-leaf plot for a data set
are given below. Describe the distribution.
Stem-and-leaf of C2   N = 50
Leaf Unit = 1.0
 (29)  0  00011111111122222222233444444
  21   0  55555666788
  10   1  0234
   6   1  66
   4   2  1
   3   2  88
   1   3
   1   3  8
[Figures: histogram of C2 (Frequency against C2) and box-plot of C2; the C2 axes run from 0 to 40]
Exercise
• Consider the following Minitab generated box-plots of coagulation times in
seconds for samples of blood drawn from animals receiving three different diets
denoted 1, 2, and 3 :
[Figure: side-by-side box-plots of coagtimes (vertical axis, scale 60 to 70) for Diet 1, 2 and 3 (horizontal axis)]
• State whether the following statements are true or false:
a) The animal that had the longest coagulation time was given diet 3.
b) The greatest variability occurs with diet 2.
c) Diet 1 shows evidence of right (positive) skewness but diet 2 shows evidence
of left (negative) skewness.
d) Approximately 25% of animals on diet 2 had coagulation times less than 63.
e) The smallest upper (third) quartile is for diet 3.
f) We can see that the mean for diet 1 is less than 62 seconds.
Measuring spread: Standard deviation
• The variance (s²) of a set of n observations x1, x2, ..., xn is
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• The standard deviation (s) is the square root of the variance (s²),
i.e.
$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
• It can be shown that
$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$
This formula is usually quicker.
• The deviations xi − x̄ display the spread of the values xi about
their mean. Some of these deviations will be positive and some
negative because the observations fall on each side of the
mean.
• The sum of the deviations of the observations from their mean
will always be zero.
• Squaring the deviations makes them all positive, so that
observations far from the mean in either direction have large
positive squared deviations.
• The variance is essentially the average of the squared deviations:
their sum divided by n − 1 rather than n.
• The variance, s2, and the standard deviation, s, will be large if
the observations are widely spread about their mean, and small
if the observations are all close to the mean.
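These properties can be checked numerically. The sketch below reuses the observations 4, 5, 9, 3, 5 from the earlier mean example; the variable names are illustrative, and a small tolerance is used because floating-point sums are rarely exactly zero.

```python
data = [4, 5, 9, 3, 5]
n = len(data)
xbar = sum(data) / n                          # 5.2

deviations = [x - xbar for x in data]         # some negative, some positive
squared = [d ** 2 for d in deviations]        # all non-negative

# The deviations cancel out (up to rounding error)...
print(abs(sum(deviations)) < 1e-9)            # True
# ...so the squared deviations are summed instead, divided by n - 1.
variance = sum(squared) / (n - 1)
print(round(variance, 4))                     # 5.2
```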
Example
• Find the standard deviation of the following data set: 4, 8, 2, 9, 7.
• Solution: n = 5, Σxi = 30, so x̄ = 6, and Σxi² = 214.
• Using the second formula we have
s² = (214 − 5 × 6²)/4 = 34/4 = 8.5
and so s = √8.5 ≈ 2.92.
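As a check on the arithmetic, the sketch below computes s for this data set with both the defining formula and the shortcut formula; the variable names are illustrative.

```python
import math

data = [4, 8, 2, 9, 7]
n = len(data)
xbar = sum(data) / n                                   # 6.0

# Defining formula: sum of squared deviations over n - 1
s_def = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

# Shortcut formula: (sum of squares - n * xbar^2) over n - 1
sum_sq = sum(x * x for x in data)                      # 214
s_short = math.sqrt((sum_sq - n * xbar ** 2) / (n - 1))

print(round(s_def, 2), round(s_short, 2))              # 2.92 2.92
```

Both routes give the same value, since the two formulas are algebraically equivalent.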
• MINITAB commands: Stat > Basic Statistics >
Display Descriptive Statistics
• MINITAB output for the above data is given below:
Variable  N  StDev
C1        5  2.92
• Exercise: Find the standard deviation of the following data set:
5, 8, 7, 9, 7, 11.
Example -Calculation from grouped data
• Determine the standard deviation of the data represented by the
following frequency distribution.
Class Interval  Frequency
10-20           2
20-30           4
30-40           7
40-50           6
50-60           1
Solution
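The slide leaves the solution blank. A sketch of one standard approach, assumed here: treat every observation in a class as if it sat at the class midpoint, then apply the definition of s with frequency weights.

```python
import math

# (lower limit, upper limit, frequency) for each class in the table
classes = [(10, 20, 2), (20, 30, 4), (30, 40, 7), (40, 50, 6), (50, 60, 1)]

# Assumption: each observation is approximated by its class midpoint.
mids = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)                                             # 20
xbar = sum(m * f for m, f in zip(mids, freqs)) / n         # 35.0
ss = sum(f * (m - xbar) ** 2 for m, f in zip(mids, freqs)) # 2200.0
s = math.sqrt(ss / (n - 1))

print(xbar, ss, round(s, 2))    # 35.0 2200.0 10.76
```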
Properties of standard deviation (s)
• s measures the spread about the mean and should be used only
when the mean is chosen as the measure of center.
• s = 0 only when there is no spread. This happens only when all
observations have the same value. Otherwise, s > 0.
• s, like the mean, is not resistant to extreme values. A few
outliers can make s very large.
Ballpark approximation for s
• The ballpark approximation for the standard deviation s is
Range/4 (divide by 3 if there are fewer than 10 observations,
divide by 5 if there are more than 100 observations).
• For the data set 4, 8, 2, 9, 7, range = 9 – 2 = 7 and so
s ≈ 7/3 ≈ 2.33.
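The rule of thumb can be written as a small Python function and compared with the exact value. The function name `ballpark_sd` is made up for this sketch.

```python
from statistics import stdev

def ballpark_sd(values):
    """Rough standard deviation: range divided by 3, 4 or 5 depending on n."""
    n = len(values)
    spread = max(values) - min(values)
    if n < 10:
        divisor = 3
    elif n > 100:
        divisor = 5
    else:
        divisor = 4
    return spread / divisor

data = [4, 8, 2, 9, 7]
print(round(ballpark_sd(data), 2))   # 2.33
print(round(stdev(data), 2))         # 2.92, the exact sample standard deviation
```

The approximation is only meant to catch gross calculation errors; here it is in the right range but not equal to s.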
The empirical (68-95-99.7) rule
• With a bell-shaped distribution,
- about 68% of the data fall within 1 standard
deviation of the mean.
- about 95% fall within 2 standard deviations of the mean.
- about 99.7% fall within 3 standard deviations of the mean.
• What if the distribution is not bell-shaped?
There is another rule, named Chebyshev's Rule, that tells us
that there must be at least 75% of the data within 2 standard
deviations of the mean, regardless of the shape, and at least
89% within 3 standard deviations.
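The 75% and 89% figures come from Chebyshev's bound, which says that at least 1 − 1/k² of the data lie within k standard deviations of the mean (for k > 1). A short Python sketch, with an illustrative function name, evaluates the bound next to the bell-shaped percentages quoted above:

```python
def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

empirical = {1: 0.68, 2: 0.95, 3: 0.997}   # bell-shaped distributions only

for k in (2, 3):
    print(k, round(chebyshev_bound(k), 3), empirical[k])
# 2 0.75 0.95
# 3 0.889 0.997
```

Chebyshev's guarantee is weaker than the empirical rule because it has to hold for every possible shape.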
Outliers
An outlier is an observation that is unusually large or small
relative to the other values in a data set. Outliers are typically
attributable to one of the following causes:
1. The observation is observed, recorded, or entered incorrectly.
2. The observation comes from a different population.
3. The observation is correct but represents a rare event.
The 1.5×IQR Criterion for outliers
• Call an observation a suspected outlier if it falls more than
1.5×IQR above the 3rd quartile or below the 1st quartile.
• Example
Consider the data given in example 1.13 on page 43 in IPS
(mileage data with an extra observation of 66).
Variable  N   Mean   Min  Q1  Median  Q3  Max
Mileages  21  24.67  13   18  23      28  66
The IQR = 28 − 18 = 10, so 1.5×IQR = 15. The largest observation, 66,
lies above Q3 + 1.5×IQR = 43 and is therefore a suspected outlier.
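The criterion amounts to computing two "fences" and flagging anything outside them. A minimal Python sketch using the quartiles from the summary above (the function name `iqr_fences` is illustrative):

```python
def iqr_fences(q1, q3):
    """Return (lower, upper) cut-offs for the 1.5 x IQR criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = iqr_fences(18, 28)
print(lower, upper)                    # 3.0 43.0

smallest, largest = 13, 66             # Min and Max from the summary
print(largest > upper)                 # True: 66 is a suspected outlier
print(smallest < lower)                # False: 13 is not
```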
Choosing a summary
• The five-number summary is usually better than the mean and
the standard deviation for describing skewed distributions or
distributions with strong outliers.
• Use mean and standard deviation for reasonably symmetric
distributions that are free of outliers.
Questions
1. How do the mean, median, and mode compare, usually, when
a distribution is positively skewed? negatively skewed? Draw
a picture and try to estimate the locations of these measures.
2. Which type of display is the most useful type for clear direct
comparisons of the key characteristics of several data sets (e.g.
blood cholesterol changes for several different treatments) ?
3. In a frequency table of 300 scores, the mean is reported as 80
and the median as 65. One would expect this distribution to be
a. positively skewed.
b. negatively skewed.
c. symmetrical.
d. rectangular.
4. Find the median of the following frequency distribution.
Score      1  2  3  4  5
Frequency  2  9  5  3  1
5. On the STA220 term test, John scored at the 78th percentile, and Jack scored at the
63rd. State whether the following statements are true or false.
a. John is 15 times better than Jack.
b. John scored 15 more points than Jack.
c. 15% of those taking the test got scores ranging between
John's and Jack's scores.
d. 62 students scored less than John.
6. Estimate the mean and standard deviation of the distribution
represented by the following histogram.
[Figure: histogram; vertical axis: Frequency, scale 0 to 10; horizontal axis: Rate, scale 1 to 15]