1.2 Describing distributions with numbers
(p28)
 Measuring center: mean
 Two common measures of center are
the mean and the median.
 The two measures behave differently.
 Example
Find the mean of the following
observations.
4, 6, 9, 3, 6
Solution:
$\text{mean} = \frac{4 + 6 + 9 + 3 + 6}{5} = \frac{28}{5} = 5.6$
 If there are n observations $x_1, x_2, \ldots, x_n$ in a
sample, the sample mean (denoted by $\bar{x}$) is
given by
\[ \bar{x} = \frac{\text{sum of the } x_i\text{'s}}{n} = \frac{\sum x_i}{n}. \]
 Example
The annual salaries (in thousands) of a
random sample of five employees of a
company are:
40, 30, 25, 200, 28
$\text{mean} = \frac{40 + 30 + 25 + 200 + 28}{5} = \frac{323}{5} = 64.6$
If we exclude 200 as an outlier,
$\text{mean} = \frac{40 + 30 + 25 + 28}{4} = \frac{123}{4} = 30.75$
 The mean is sensitive to the influence of
extreme observations: a single outlier can
pull it far from the bulk of the data. The
mean is NOT a resistant measure of center. (p31)
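As a small illustration of how one extreme value moves the mean, the salary example can be reproduced in Python. This is only a sketch using the standard library's `statistics.mean`; the variable names are chosen here for illustration.

```python
from statistics import mean

# Annual salaries (in thousands) from the example above.
salaries = [40, 30, 25, 200, 28]

# Mean of all five observations: 323 / 5.
mean_all = mean(salaries)

# Mean after excluding the outlier 200: 123 / 4.
mean_without_outlier = mean([s for s in salaries if s != 200])

print(mean_all, mean_without_outlier)
```

Removing a single observation cuts the mean by more than half, which is what "not resistant" means in practice.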
Measuring center: the median (p31)
The median (M) is the midpoint of the
distribution, the number such that half
the observations are smaller and the
other half are larger.
 To find the median of a distribution:
1. Arrange all observations in order of
size, from smallest to largest.
2. If the number of observations is odd,
the median is the center observation
in the ordered list.
3. If the number of observations is even,
the median is the average of the two
center observations in the ordered
list.
 Examples
1. The annual salaries (in thousands) of
a random sample of five employees
of a company are:
40, 30, 25, 200, 28
Arranging the values in increasing
order:
25 28 30 40 200
median = 30
Excluding 200, median = (28 + 30) / 2 = 29.
 Note that the mean for this data set
was 64.6 (30.75 without 200), so the
extreme value 200 influences the median
much less than it influences the mean.
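The procedure for finding the median (sort, then take the center observation or the average of the two center observations) can be sketched as a short Python function. The function name `median_by_hand` is made up for this illustration; the standard library's `statistics.median` follows the same rule.

```python
def median_by_hand(values):
    """Median via the sort-and-pick-the-middle procedure."""
    ordered = sorted(values)          # step 1: arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd n, take the center observation
        return ordered[mid]
    # step 3: even n, average the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2

salaries = [40, 30, 25, 200, 28]
median_all = median_by_hand(salaries)
median_without_200 = median_by_hand([s for s in salaries if s != 200])

print(median_all, median_without_200)
```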
 StatCrunch commands Stat > Summary
Statistics
 StatCrunch output for the data in Example
above is as follows:
Summary statistics:
Column  n  Mean  Median
salary  5  64.6  30
Mean versus median (p32)
 The median and mean are the most
common measures of the center of a
distribution.
 If the distribution is exactly symmetric, the
mean and median are exactly the same.
 Median is less influenced by extreme
values.
 If the distribution is skewed to the right,
then typically mode < median < mean.
 If the distribution is skewed to the left,
then typically mean < median < mode.
 Examples
The distribution of CO2 (Table 1.3 p26)
Variable: CO2
Decimal point is 1 digit(s) to the right of the colon.
0 : 00000000001111111122233344444
0 : 555677888999
1 : 0001
1 : 67
2 : 0
Summary statistics:
Column  n   Mean       Median  Min  Max
CO2     48  4.5958333  3.2     0    19.9
The distribution is skewed to the right, and the
mean (about 4.6) is larger than the median (3.2).
Distribution of a simulated data set (100
values)
Variable: x
Decimal point is 1 digit(s) to the right of the colon.
0 : 7
1 :
1 :
2 : 1
2 :
3 :
3 :
4 : 1
4 : 5
5 : 012344
5 : 88
6 : 4
6 : 55689
7 : 1344
7 : 5567889
8 : 00011112223
8 : 555566666678888899
9 : 000001112333334
9 : 555555666778888888889999999
Summary statistics:
Column  n    Mean      Median   Min       Max
x       100  82.23267  86.8068  7.228082  99.321556
The distribution is skewed to the left, and the
mean (about 82.2) is smaller than the median (about 86.8).
Questions
1. You are asked to recommend a measure of
center to characterize the following data:
0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0,
22.5, 0.4.
What is your recommendation and why?
2. The mean is ____ sensitive to extreme
values than the median.
a) more
b) less
c) equally
d) can’t say without data
3. Changing the value of a single score in a
data set will necessarily cause the mean to
change. (T/F)
4. Changing the value of a single score in a
data set will necessarily cause the median
to change. (T/F)
Measuring Spread
 The range (max-min) is a measure of
spread but it is very sensitive to the
influence of extreme values.
Measuring spread: interquartile range
(IQR) (p38)
Quartiles (p33):
The first quartile (Q1) is the median of the
observations whose position in the ordered
list is to the left of the overall median.
The third quartile (Q3) is the median of the
observations whose position in the ordered
list is to the right of the overall median.
The interquartile range is the distance
between the quartiles, i.e. IQR = Q3 - Q1.
 Example
The highway mileages of 18 cars,
arranged in increasing order are:
13 13 16 19 21 21 23 23 24 26 26
27 27 27 28 28 30 30
n = 18 (n is even), $\frac{n + 1}{2} = \frac{18 + 1}{2} = 9.5$, and so
the median is the average of the 9th and 10th
values in the above ordered data set:
$\frac{24 + 26}{2} = 25$.
Q1 = 5th value = 21 (the median of the 9 values below the overall median).
Q3 = 5th value from the upper end = 27.
IQR = Q3 - Q1 = 27 - 21 = 6.
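The quartile calculation for the mileage data can be sketched in Python. This follows the hand method defined in the text (Q1 and Q3 are the medians of the observations to the left and right of the overall median); the helper names `median_sorted` and `quartiles_by_hand` are invented for this illustration, and statistical software may use slightly different quartile conventions.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles_by_hand(values):
    """Return (Q1, median, Q3) using the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                       # observations left of the median
    # For odd n the center observation belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)

mileage = [13, 13, 16, 19, 21, 21, 23, 23, 24, 26, 26,
           27, 27, 27, 28, 28, 30, 30]

q1, med, q3 = quartiles_by_hand(mileage)
iqr = q3 - q1
print(q1, med, q3, iqr)
```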
 The five-number summary p36
 The five-number summary of a set
of observations consists of the
minimum, the first quartile, median,
the third quartile and the maximum.
 These five numbers give a quick
summary of both the center and the
spread of the distribution.
 StatCrunch commands: Stat >
Summary Statistics
 Example
The highway mileages of 18 cars,
arranged in increasing order are:
13 13 16 19 21 21 23 23 24 26 26
27 27 27 28 28 30 30
Give the five number summary.
Ans: min = 13, first quartile = 21,
median = 25, third quartile = 27 , max. =
30.
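For readers without StatCrunch, a five-number summary can also be obtained with Python's standard library. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which is an assumption made here: it is one of several quartile conventions and is not the hand method from the text, although for this data set it gives the same quartiles. The function name `five_number_summary` is illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    # The "inclusive" convention is an assumption of this sketch; other
    # conventions can give slightly different quartiles on other data.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

mileage = [13, 13, 16, 19, 21, 21, 23, 23, 24, 26, 26,
           27, 27, 27, 28, 28, 30, 30]

summary = five_number_summary(mileage)
print(summary)
```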
The StatCrunch output using the above
commands is as follows:
Summary statistics:
Column   n   Mean       Median  Min  Max  Q1  Q3
mileage  18  23.444445  25      13   30   21  27
Boxplot p36
 A boxplot is a graph of the five-number
summary.
 Example: Make a boxplot for the data in
the above example.
 StatCrunch commands: Graphics >
Boxplot
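1.5 IQR rule for outliers (p37)
Call an observation a suspected outlier if it
falls more than 1.5 × IQR above the third
quartile or below the first quartile.
Example: Strength
0, 0, 550, 750, 950, 950, 1150, 1150,
1150, 1150, 1150, 1250, 1250, 1350,
1450, 1450, 1450, 1550, 1550, 1550,
1850, 2050, 3150
Summary statistics:
Column    n   Mean       Median  Min  Max   Q1   Q3    Range
strength  23  1254.3478  1250    0    3150  950  1550  3150
IQR = 1550 - 950 = 600
1.5 IQR = 900
Q3 + 1.5 IQR = 1550 + 900 = 2450
Q1 - 1.5 IQR = 950 - 900 = 50
The observations 0, 0 and 3150 fall outside the
interval from 50 to 2450, so they are suspected
outliers.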
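The fence calculation above can be checked with a few lines of Python. In this sketch Q1 and Q3 are simply copied from the summary statistics rather than recomputed, and the variable names are chosen for illustration.

```python
strength = [0, 0, 550, 750, 950, 950, 1150, 1150,
            1150, 1150, 1150, 1250, 1250, 1350,
            1450, 1450, 1450, 1550, 1550, 1550,
            1850, 2050, 3150]

# Quartiles taken from the summary statistics in the example.
q1, q3 = 950, 1550

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond either fence are flagged as suspected outliers.
suspected_outliers = [x for x in strength
                      if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, suspected_outliers)
```

The rule only flags observations for a closer look; it does not by itself justify deleting them.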
Side-by-side boxplots for comparison
Example: Consider Ex 1.41 (p26)
Summary statistics for StudyTime:
Group by: Gender
Gender  n   Mean        Min  Max  Q1   Q3   IQR
F       30  165.16667   60   360  120  180  60
M       30  117.166664  0    300  60   150  90
 Measuring spread Standard deviation
(p39)
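The variance ($s^2$) of a set of n
observations $x_1, x_2, \ldots, x_n$ is
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}. \]
The standard deviation ($s$) is the square
root of the variance ($s^2$), i.e.
\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}. \]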
Example
Find the standard deviation of the following
data set: 5, 8, 2
n = 3, mean: $\bar{x} = \frac{5 + 8 + 2}{3} = \frac{15}{3} = 5$
$s^2 = \frac{(5 - 5)^2 + (8 - 5)^2 + (2 - 5)^2}{3 - 1} = \frac{18}{2} = 9$
$s = \sqrt{9} = 3$.
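The same calculation can be written out in Python as a direct transcription of the variance formula with the n - 1 divisor. The function name `sample_variance` is made up for this sketch.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [5, 8, 2]
s2 = sample_variance(data)   # variance
s = sqrt(s2)                 # standard deviation

print(s2, s)
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which also divide by n - 1.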
Example: Consider Ex 1.41 (p26) again
Summary statistics for StudyTime:
Group by: Gender
Gender  n   Mean        Std. Dev.  Median  Range  Min  Max  Q1   Q3   Variance   IQR
F       30  165.16667   56.514927  175     300    60   360  120  180  3193.9368  60
M       30  117.166664  74.23963   120     300    0    300  60   150  5511.523   90
Properties of the standard deviation (s)
(p41)
s measures the spread about the mean $\bar{x}$.
s = 0 only when there is no spread. This
happens only when all observations have the
same value.
s, like the mean $\bar{x}$, is not resistant. A few
outliers can make s very large.
Choosing a summary p42
The five-number summary is usually better
than the mean and the standard deviation for
describing skewed distributions or
distributions with strong outliers.
Use mean and std. deviation for reasonably
symmetric distributions that are free of
outliers.
Effect of a Linear Transformation p43
• Multiplying each observation in a data set
by a number b multiplies the measures of
center (mean, median) by b and the measures
of spread (standard deviation, IQR) by |b|.
• Adding the same number a to each
observation in a data set adds a to the measures
of center, quartiles and percentiles but does not
change the measures of spread.
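Both rules can be checked numerically by applying a transformation of the form a + b·x to a data set. The sketch below reuses the salary data from earlier; the constants a = 10 and b = -2 are arbitrary values picked for illustration, with b negative so that the |b| in the spread rule matters.

```python
from math import isclose
from statistics import mean, median, stdev

data = [40, 30, 25, 200, 28]      # salary example from earlier
a, b = 10, -2                     # arbitrary illustrative constants
transformed = [a + b * x for x in data]

# Measures of center follow the transformation: new = a + b * old.
center_ok = (isclose(mean(transformed), a + b * mean(data))
             and isclose(median(transformed), a + b * median(data)))

# The standard deviation ignores a and is scaled by |b|.
spread_ok = isclose(stdev(transformed), abs(b) * stdev(data))

print(center_ok, spread_ok)
```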