GRAPHICAL METHODS FOR QUANTITATIVE DATA

Chapter 3 Part 2
Numerical Summaries of Symmetric Data
Measure of Center: Mean
Measure of Variability: Standard Deviation

Symmetric Data
[Figure: body temp. of 93 adults]
Recall: 2 characteristics of a data set to measure
- center: measures where the “middle” of the data is located
- variability: measures how “spread out” the data is
Measure of Center When Data Approx. Symmetric
- mean (arithmetic mean)
- notation
  $x_i$: $i$th measurement in a set of observations $x_1, x_2, x_3, \ldots, x_n$
  $n$: number of measurements in data set; sample size
  $\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$

Sample mean $\bar{x}$
$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$

Population mean $\mu$ (value typically not known)
$N$ = population size
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Recall: Warmup
- Six people in a room have a median age of 45 years and mean age of 45 years.
- One person who is 40 years old leaves the room.
- Questions:
  1. What is the median age of the 5 people remaining in the room? Can’t answer
  2. What is the mean age of the 5 people remaining in the room? 46
     (45 × 6 = 270; 270 − 40 = 230; 230/5 = 46)
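The warmup arithmetic can be checked with a short Python sketch. The helper name `mean_after_removal` and the two rooms of ages are hypothetical; the ages were chosen only so that both rooms have a median of 45 and a mean of 45. They suggest why the mean of the remaining five people is pinned down while the median is not.

```python
from statistics import mean, median

def mean_after_removal(old_mean, n, removed):
    """Mean of the remaining n - 1 values after one value is removed."""
    total = old_mean * n              # the mean and n determine the sum
    return (total - removed) / (n - 1)

new_mean = mean_after_removal(45, 6, 40)   # (270 - 40) / 5

# Two hypothetical rooms, each with median 45 and mean 45.
room_a = [40, 40, 45, 45, 50, 50]
room_b = [40, 44, 44, 46, 48, 48]

def drop_one(values, x):
    """Copy of values with a single occurrence of x removed."""
    rest = list(values)
    rest.remove(x)
    return rest

median_a = median(drop_one(room_a, 40))
median_b = median(drop_one(room_b, 40))
print(new_mean, median_a, median_b)
```

The two rooms end up with different medians after the 40-year-old leaves, so the median question cannot be answered from the summary values alone.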
Connection Between Mean and Histogram
- A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6
[Figure: histogram “Absences from Work”; vertical axis: Frequency]

Mean: balance point
Median: 50% area each half
[Figure: right histogram: mean 55.26 yrs, median 57.7 yrs]
Properties of Mean, Median
1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).
2. The mean uses the value of every number in the data set; the median does not.
   Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4 + 6}{2} = 5$
   Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$; $m = \frac{4 + 6}{2} = 5$
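The two small examples can be reproduced with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
from statistics import mean, median

first = [2, 4, 6, 8]
second = [2, 4, 6, 9]    # only the largest value has changed

mean_first, median_first = mean(first), median(first)
mean_second, median_second = mean(second), median(second)

# The mean moves from 5 to 5.25; the median stays at 5.
print(mean_first, median_first)
print(mean_second, median_second)
```

Changing one value moves the mean because every value enters the sum, while the median depends only on the middle of the ordered data.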
Example: class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
$n = 23$
$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$
$m$: location: 12th obs.; $m = 85$
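As a check on the pulse-rate example, the sketch below recomputes the mean and finds the median by position. The variable names are illustrative, and the (n + 1)/2 position rule is used here on the assumption that n is odd, as it is for these 23 observations.

```python
from statistics import median

pulse = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85,
         89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

n = len(pulse)
x_bar = sum(pulse) / n                        # sample mean

position = (n + 1) // 2                       # 12th observation when n = 23
m_by_position = sorted(pulse)[position - 1]   # lists are 0-indexed

print(n, round(x_bar, 2), position, m_by_position)
```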
2010, 2014 baseball salaries

Year   n     mean         median       max
2010   845   $3,297,828   $1,330,000   $33,000,000
2014   848   $3,932,912   $1,456,250   $28,000,000

Disadvantage of the mean
- Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
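One way to see this sensitivity is to drop the largest class pulse rate (140) and compare how far the mean and the median move. This is a sketch only; removing an observation is done here purely for illustration, not as a recommended way to handle unusual values.

```python
from statistics import mean, median

pulse = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85,
         89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

without_max = [x for x in pulse if x != 140]   # the 22 remaining rates

mean_shift = mean(pulse) - mean(without_max)
median_shift = median(pulse) - median(without_max)

print(round(mean_shift, 2), median_shift)
```

A single observation moves the mean by about 2.5 beats per minute but the median by only half a beat.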
Mean, Median, Maximum Baseball Salaries 1985 - 2014
[Figure: “Baseball Salaries: Mean, Median and Maximum 1985-2014”; series: Mean, Median, Maximum; axes: Year, Mean/Median Salary, Maximum Salary]
Skewness: comparing the mean and median
- Skewed to the right (positively skewed)
- mean > median
[Figure: histogram “2011 Baseball Salaries”; horizontal axis: Salary ($1,000's); vertical axis: Frequency]
Skewed to the left; negatively skewed
- mean < median
- mean = 78; median = 87
[Figure: “Histogram of Exam Scores”; horizontal axis: Exam Scores; vertical axis: Frequency]
Symmetric data
- mean, median approx. equal
[Figure: histogram “Bank Customers: 10:00-11:00 am”; horizontal axis: Number of Customers; vertical axis: Frequency]
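The mean-versus-median comparisons for right-skewed, left-skewed and symmetric data can be sketched as a rough rule of thumb. The function name `skew_direction`, the tolerance `tol`, and the three small data sets are all hypothetical; this is not a formal test of skewness, and a histogram remains the better check.

```python
from statistics import mean, median

def skew_direction(values, tol=0.05):
    """Compare mean and median; tol is an arbitrary fraction of the range."""
    m, med = mean(values), median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(m - med) <= tol * spread:
        return "approximately symmetric"
    return "skewed right" if m > med else "skewed left"

# Hypothetical data sets, made up for illustration only.
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]       # one large value pulls the mean up
left_tail = [20, 85, 87, 88, 90, 92, 95]     # one small value pulls the mean down
balanced = [1, 2, 3, 4, 5]

for data in (right_tail, left_tail, balanced):
    print(skew_direction(data))
```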
DESCRIBING VARIABILITY OF SYMMETRIC DATA

Describing Symmetric Data (cont.)
- Measure of center for symmetric data:
  Sample mean $\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
- Measure of variability for symmetric data?

Example
2 data sets:
$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$
$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$
On average, they’re both comfortable
[Figure: the two data sets, 0 and 100 versus 49 and 51]
Ways to measure variability
1. range = largest − smallest
   ok sometimes; in general, too crude; sensitive to one large or small obs.
2. measure spread from the middle, where the middle is the mean $\bar{x}$;
   - deviation of $x_i$ from the mean: $x_i - \bar{x}$
   - $\sum_{i=1}^{n} (x_i - \bar{x})$: sum the deviations of all the $x_i$'s from $\bar{x}$;
   - $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ always; tells us nothing
Previous Example
sum of deviations from mean:
$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$:
$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$
$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$:
$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$
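A minimal sketch of the two candidate measures for the two-value data sets: the range separates them, while the sum of deviations from the mean is zero for both. The function names are illustrative. With general floating-point data the computed sum may come out as a tiny nonzero number because of rounding, which is why the usage should compare against a small tolerance.

```python
def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

def sum_of_deviations(values):
    """Sum of (x_i - mean); zero in exact arithmetic for any data set."""
    x_bar = sum(values) / len(values)
    return sum(x - x_bar for x in values)

x = [49, 51]
y = [0, 100]

print(data_range(x), data_range(y))
print(sum_of_deviations(x), sum_of_deviations(y))
```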
The Sample Standard Deviation, a measure of spread around the mean
- Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations
  $(x_i - \bar{x})^2$; $\sum_{i=1}^{n} (x_i - \bar{x})^2$ and find the "average", then take the square root of the average
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, called the sample standard deviation
Calculations …
Women height (inches)

 i    x_i    mean    (x_i − mean)    (x_i − mean)²
 1    59     63.4    -4.4            19.0
 2    60     63.4    -3.4            11.3
 3    61     63.4    -2.4             5.6
 4    62     63.4    -1.4             1.8
 5    62     63.4    -1.4             1.8
 6    63     63.4    -0.4             0.1
 7    63     63.4    -0.4             0.1
 8    63     63.4    -0.4             0.1
 9    64     63.4     0.6             0.4
10    64     63.4     0.6             0.4
11    65     63.4     1.6             2.7
12    66     63.4     2.6             7.0
13    67     63.4     3.6            13.3
14    68     63.4     4.6            21.6
Sum                   0.0            85.2

Mean $\bar{x}$ = 63.4
Sum of squared deviations from mean = 85.2
(n − 1) = 13; (n − 1) is called degrees of freedom (df)
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
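The height calculation can be reproduced in software. The sketch below follows the same steps (mean, squared deviations, divide by n − 1, square root) and then cross-checks against `statistics.variance` and `statistics.stdev`, which also divide by n − 1. Variable names are illustrative.

```python
from statistics import mean, stdev, variance

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

x_bar = mean(heights)
sum_sq_dev = sum((x - x_bar) ** 2 for x in heights)   # sum of squared deviations
df = len(heights) - 1                                 # degrees of freedom
s2 = sum_sq_dev / df                                  # sample variance
s = s2 ** 0.5                                         # sample standard deviation

print(round(x_bar, 1), round(sum_sq_dev, 1), df, round(s2, 2), round(s, 2))

# Cross-check against the library functions.
print(round(variance(heights), 2), round(stdev(heights), 2))
```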
We’ll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
[Figure: mean ± 1 s.d.]
1. First calculate the variance $s^2$.
   $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
2. Then take the square root to get the standard deviation $s$.
   $s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Population Standard Deviation
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
value of population standard deviation $\sigma$ typically not known; use $s$ to estimate value of $\sigma$
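To make the two formulas concrete, the sketch below applies both to the two-value data sets from the earlier example. Treating those values as an entire population is an assumption made only for illustration. `statistics.pstdev` divides by N; `statistics.stdev` divides by n − 1.

```python
from statistics import pstdev, stdev

x = [49, 51]
y = [0, 100]

# Treated as a whole population: divide by N.
sigma_x, sigma_y = pstdev(x), pstdev(y)

# Treated as a sample: divide by n - 1.
s_x, s_y = stdev(x), stdev(y)

print(sigma_x, sigma_y)
print(round(s_x, 3), round(s_y, 3))
```

Both measures are far larger for the second data set, which is the difference in spread that the equal means of 50 hide.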
Remarks
1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Remarks (cont.)
2. Note that $s$ and $\sigma$ are always greater than or equal to zero.
3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.
When does $s = 0$? When does $\sigma = 0$? When all data values are the same.
Remarks (cont.)
4. The standard deviation is the most commonly used measure of risk in finance and business
   - Stocks, Mutual Funds, etc.
5. Variance
   - $s^2$ sample variance
   - $\sigma^2$ population variance
   - Units are squared units of the original data: square dollars, square gallons ??
Remarks 6): Why divide by n − 1 instead of n?
- degrees of freedom
- each observation has 1 degree of freedom
- however, when you estimate an unknown population parameter like $\mu$, you lose 1 degree of freedom
In the formula for $s$, we use $\bar{x}$ to estimate the unknown value of $\mu$:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Remarks 6) (cont.): Why divide by n − 1 instead of n? Example
- Suppose we have 3 numbers whose average is 9
- Choose ANY values for $x_1$ and $x_2$: $x_1 =$ ___, $x_2 =$ ___
- then $x_3$ must be ___
  Since the average (mean) is 9, $x_1 + x_2 + x_3$ must equal 9 × 3 = 27, so $x_3 = 27 - (x_1 + x_2)$
- once we selected $x_1$ and $x_2$, $x_3$ was determined since the average was 9
- 3 numbers but only 2 “degrees of freedom”
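The same idea in code: once the mean is fixed, the last value is forced by the others. The function name `last_value` and the choices x1 = 5, x2 = 10 are hypothetical; any two starting values work.

```python
def last_value(free_values, target_mean):
    """The one remaining value needed to make the mean equal target_mean."""
    n = len(free_values) + 1
    return target_mean * n - sum(free_values)

x1, x2 = 5, 10                   # chosen freely (hypothetical values)
x3 = last_value([x1, x2], 9)     # 27 - (5 + 10)

print(x3, (x1 + x2 + x3) / 3)
```

Only two of the three numbers could be chosen freely, which matches the 2 degrees of freedom in the example.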
Computational Example
observations 1, 3, 5, 9; $\bar{x} = \frac{18}{4} = 4.5$
$s = \sqrt{\frac{(1 - 4.5)^2 + (3 - 4.5)^2 + (5 - 4.5)^2 + (9 - 4.5)^2}{4 - 1}}$
$= \sqrt{\frac{(-3.5)^2 + (-1.5)^2 + (0.5)^2 + (4.5)^2}{3}}$
$= \sqrt{\frac{12.25 + 2.25 + 0.25 + 20.25}{3}} = \sqrt{\frac{35}{3}} = \sqrt{11.67} = 3.42$;
$s^2 = 11.67$
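The hand computation can be traced step by step in Python. This sketch keeps each intermediate quantity in its own variable so it can be compared with the worked lines; the names are illustrative.

```python
import math

obs = [1, 3, 5, 9]

n = len(obs)
x_bar = sum(obs) / n                          # 18 / 4
deviations = [x - x_bar for x in obs]         # -3.5, -1.5, 0.5, 4.5
squared = [d ** 2 for d in deviations]        # 12.25, 2.25, 0.25, 20.25
s2 = sum(squared) / (n - 1)                   # 35 / 3
s = math.sqrt(s2)

print(x_bar, squared, round(s2, 2), round(s, 2))
```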
class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
$n = 23$; $\bar{x} = 84.48$; $m = 85$
$s^2 = 290.26$ (beats per minute)$^2$
$s = 17.037$ beats per minute
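For the pulse rates, the sketch below gets the variance and standard deviation from the `statistics` module, then counts how many observations fall within one standard deviation of the mean. The count is an added illustration of what "mean ± 1 s.d." covers for this data set and is not a result stated in the slides.

```python
from statistics import mean, stdev, variance

pulse = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85,
         89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

x_bar = mean(pulse)
s2 = variance(pulse)      # divides by n - 1
s = stdev(pulse)          # square root of the sample variance

# Observations inside the interval (x_bar - s, x_bar + s).
low, high = x_bar - s, x_bar + s
within_one_sd = [x for x in pulse if low <= x <= high]

print(round(s2, 2), round(s, 3), len(within_one_sd), len(pulse))
```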
Review: Properties of $s$ and $\sigma$
- $s$ and $\sigma$ are always greater than or equal to 0
  when does $s = 0$? $\sigma = 0$?
- The larger the value of $s$ (or $\sigma$), the greater the spread of the data
- the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation

SAMPLE                        POPULATION
$\bar{y}$ sample mean         $\mu$ population mean
$m$ sample median             $m$ population median
$s^2$ sample variance         $\sigma^2$ population variance
$s$ sample stand. dev.        $\sigma$ population stand. dev.
End of Chapter 3