Warm up
The following graphs show foot sizes of
gongshowhockey.com users.
What shape are the distributions?
Calculate the mean, median and mode for one
[Figure: two bar graphs of frequency against foot size (8, 9, 10, 11, 12, 13, 13+): FreqSep5, with a vertical axis from 0 to 450, and FreqApr1, with a vertical axis from 0 to 80]
Measures of Spread
Chapter 3.3 – Tools for Analyzing Data
I can: calculate and interpret measures of spread
MSIP/Home Learning: p. 168 #2b, 3b, 4, 6, 7, 10
What is spread?
spread tells you how widely the data are dispersed
The histograms have identical mean and median, but the spread is different
[Figure: two histograms, with Count on the vertical axis, that share the same centre but differ in spread]
Why worry about spread?

spread indicates how close the values cluster
around the middle value

less spread means you have greater confidence
that values will fall within a particular range.
Vocabulary




spread and dispersion refer to the same
thing
1) range = max - min
a quartile is one of three numerical values
that divide a group of numbers into 4 equal
parts
2) the Interquartile Range (IQR) is the
difference between the first and third quartiles

IQR = Q3 – Q1
Quartiles Example








26 28 34 36 38 38 40 41 41 44 45 46 51 54 55
range = 55 – 26 = 29
Q2 = 41
Median
Q1 = 36
Median of lower half of data
Q3 = 46
Median of upper half of data
IQR = Q3 – Q1
= 46 – 36 = 10 (contains 50% of data)
if a quartile occurs between 2 values, it is
calculated as the average of the two values
Quartiles Example







26 28 34 36 38 38 40 41 44 45 46 51 54 55
range = 55 – 26 = 29
Q2 = 40.5
Median
Q1 = 36
Median of lower half of data
Q3 = 46
Median of upper half of data
IQR = Q3 – Q1
= 46 – 36 = 10 (contains 50% of data)
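The two quartile examples above can be checked with a short Python sketch. It assumes the median-of-halves method used on the slides (when the number of values is odd, the middle value is left out of both halves); the function name `quartiles` is only illustrative.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]           # values before the middle position
    upper = values[half + n % 2:]   # skip the middle value when n is odd
    return median(lower), median(values), median(upper)


odd_set = [26, 28, 34, 36, 38, 38, 40, 41, 41, 44, 45, 46, 51, 54, 55]
even_set = [26, 28, 34, 36, 38, 38, 40, 41, 44, 45, 46, 51, 54, 55]

for data in (odd_set, even_set):
    q1, q2, q3 = quartiles(data)
    print("range =", max(data) - min(data),
          "Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```

Both data sets give Q1 = 36 and Q3 = 46; only the median changes (41 for the 15-value set, 40.5 for the 14-value set). Other software may use a different quartile convention and report slightly different values.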
A More Useful Measure of Spread






Range is a very basic measure of spread.
Interquartile range is a somewhat useful
measure of spread.
Standard deviation is more useful.
To calculate it we need to find the mean and
the deviation for each data point
Mean is easy, as we have done that before
Deviation is the difference between a
particular point and the mean
Deviation

12 24 36 48 60 72 84
The mean of these numbers is 48
Deviation = (data) – (mean)
The deviation for 24 is 24 - 48 = -24
The deviation for 84 is 84 - 48 = 36
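As a small sketch of the deviation calculation, assuming the seven values on the slide are 12, 24, 36, 48, 60, 72 and 84:

```python
data = [12, 24, 36, 48, 60, 72, 84]

mean = sum(data) / len(data)

# Deviation = (data) - (mean), one value per data point
deviations = [x - mean for x in data]

for x, d in zip(data, deviations):
    print(f"deviation for {x}: {d:+.0f}")

# Negative and positive deviations cancel, so their plain sum is zero
print("sum of deviations:", sum(deviations))
```

Because the deviations always sum to zero, they are squared before being averaged when calculating the variance.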
Standard Deviation




deviation is the distance from the piece of
data you are examining to the mean
variance is a measure of spread found by
averaging the squares of the deviation
calculated for each piece of data
Taking the square root of variance, you get
standard deviation
Standard deviation is a very important and
useful measure of spread
Example of Standard Deviation






26 28 34 36
mean = (26 + 28 + 34 + 36) / 4 = 31
σ² = [(26 – 31)² + (28 – 31)² + (34 – 31)² + (36 – 31)²] / 4
σ² = (25 + 9 + 9 + 25) / 4
σ² = 17
σ = √17 ≈ 4.1
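The same calculation can be sketched step by step in Python. The last line compares the result with `statistics.pstdev`, the standard library function for the population standard deviation; the variable names are only illustrative.

```python
from math import sqrt
from statistics import pstdev

data = [26, 28, 34, 36]

mean = sum(data) / len(data)                     # step 1: mean
squared = [(x - mean) ** 2 for x in data]        # step 2: squared deviations
variance = sum(squared) / len(data)              # step 3: average them (population variance)
sigma = sqrt(variance)                           # step 4: square root

print("mean =", mean)
print("variance =", variance)
print("sigma =", round(sigma, 1))
print("pstdev =", round(pstdev(data), 1))        # library result for comparison
```

Note that `statistics.stdev` (without the `p`) divides by n – 1 instead of n, so it gives the sample standard deviation rather than the population value used here.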
Measure of Spread - Recap





Measures of Spread are numbers indicating how spread
out / consistent data is
Smaller measure of spread = more consistent data
1) Range = (max) – (min)
2) Interquartile Range: IQR = Q3 – Q1 where
   Q1 = median of the lower half of the data
   Q3 = median of the upper half of the data
3) Standard Deviation
   Find mean (average)
   Find deviations (data – mean)
   Square each deviation and average the squares - this is variance (#4), or σ²
   Take the square root to get the standard deviation, σ
Standard Deviation




σ² (lower case sigma
squared) is used to
represent variance
σ is used to represent
standard deviation
σ is commonly used to
measure the spread of
data, with larger values
of σ indicating greater
spread
we are using a population standard deviation:
$\sigma = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}}$
Standard Deviation with Grouped Data


Hours of TV:  2  3  4  5
Frequency:    2  6  6  2
grouped mean = (2×2 + 3×6 + 4×6 + 5×2) / 16 = 3.5
deviations:
2: 2 – 3.5 = -1.5
3: 3 – 3.5 = -0.5
4: 4 – 3.5 = 0.5
5: 5 – 3.5 = 1.5

σ² = [2(-1.5)² + 6(-0.5)² + 6(0.5)² + 2(1.5)²] / 16
σ² = 12 / 16 = 0.75
σ = √0.75 ≈ 0.9
$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$
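The grouped-data calculation above can be sketched in Python by weighting each value by its frequency. The list names `hours` and `freq` are only illustrative.

```python
from math import sqrt

hours = [2, 3, 4, 5]   # hours of TV
freq = [2, 6, 6, 2]    # how many people reported each value

n = sum(freq)          # total number of data points

# Grouped mean: each value counts as many times as its frequency
mean = sum(f * x for x, f in zip(hours, freq)) / n

# Grouped variance: frequency-weighted average of the squared deviations
variance = sum(f * (x - mean) ** 2 for x, f in zip(hours, freq)) / n
sigma = sqrt(variance)

print("n =", n)
print("mean =", mean)
print("variance =", variance)
print("sigma =", round(sigma, 1))
```

This agrees with the grouped mean of 3.5 and with σ ≈ 0.9.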
MSIP / Homework





read through the examples on pages 164-167
Complete p. 168 #2b, 3b, 4, 6, 7, 10
you are responsible for knowing how to do
simple examples by hand (~6 pieces of data)
we will use technology (Fathom/Excel) to
calculate larger examples
have a look at your calculator and see if you
have this feature (Σσn and Σσn-1)
Normal Distribution
Chapter 3.4 – Tools for Analyzing Data
Learning goal: Determine the % of data within
intervals of a Normal Distribution
MSIP / Home Learning: p. 176 #1, 3b, 6, 8-10
Histograms

Histograms may be skewed...
Right-skewed
Left-skewed
Histograms
... or symmetrical
[Figure: symmetrical histogram of Collection 1, Count against a]
Normal?



A normal distribution creates a histogram that is
symmetrical and has a bell shape, and is used quite
a bit in statistical analyses
Also called a Gaussian Distribution
It is symmetrical with equal mean, median and mode
that fall on the line of symmetry of the curve
A Real Example

the heights of 600 randomly chosen Canadian
students from the “Census at School” data set
the data approximates a normal distribution
[Figure: histogram of 600 student heights (Heightcm on the horizontal axis, Density on the vertical axis) with a normal density curve overlaid]
The 68-95-99.7% Rule





area under curve is 1 (i.e. it represents 100%
of the population surveyed)
approx 68% of the data falls within 1 standard
deviation of the mean
approx 95% of the data falls within 2 standard
deviations of the mean
approx 99.7% of the data falls within 3
standard deviations of the mean
http://davidmlane.com/hyperstat/A25329.html
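The three percentages in the rule are rounded values. As a quick check, the sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) to compute the area within 1, 2 and 3 standard deviations of the mean for a standard Normal curve.

```python
from statistics import NormalDist

z = NormalDist()   # standard Normal: mean 0, standard deviation 1

# Area between -k and +k standard deviations, for k = 1, 2, 3
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The same proportions hold for any Normal distribution, whatever its mean and standard deviation.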
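Distribution of Data
X ~ N(x̄, σ²)
[Figure: normal curve marked at x̄ – 3σ, x̄ – 2σ, x̄ – 1σ, x̄, x̄ + 1σ, x̄ + 2σ and x̄ + 3σ]
68% of the data lies within 1σ of the mean (34% on each side)
95% lies within 2σ (a further 13.5% on each side)
99.7% lies within 3σ (a further 2.35% on each side)
0.15% lies beyond 3σ in each tail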
Normal Distribution Notation
X ~ N ( x, )
2


The notation above is used to describe the Normal
distribution where x is the mean and σ² is the
variance (square of the standard deviation)
e.g. X~N (70,82) describes a Normal distribution
with mean 70 and standard deviation 8 (our class at
midterm?)
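One thing to watch when moving from this notation to software: the notation lists the variance, while many tools ask for the standard deviation. A minimal sketch with Python's `statistics.NormalDist`, which takes the standard deviation as its second argument (the name `marks` is only illustrative):

```python
from statistics import NormalDist

# X ~ N(70, 8²): pass the standard deviation 8, not the variance 64
marks = NormalDist(mu=70, sigma=8)

print("mean =", marks.mean)
print("standard deviation =", marks.stdev)
print("variance =", marks.variance)
```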
An example


Suppose the time before burnout for an LED
averages 120 months with a standard
deviation of 10 months and is approximately
Normally distributed. What is the length of
time a user might expect an LED to last with
68% confidence? With 95% confidence?
So X ~ N(120, 10²)
An example cont’d






68% of the data will be within 1 standard deviation of the
mean
This will mean that 68% of the bulbs will be between
120–10 months and 120+10
So 68% of the bulbs will last 110 - 130 months
95% of the data will be within 2 standard deviations of
the mean
This will mean that 95% of the bulbs will be between
120 – 2×10 months and 120 + 2×10
So 95% of the bulbs will last 100 - 140 months
Example continued…





Suppose you wanted to know how long
99.7% of the bulbs will last?
This is the area covering 3 standard
deviations on either side of the mean
This will mean that 99.7% of the bulbs will be
between 120 – 3×10 months and 120 + 3×10
So 99.7% of the bulbs will last 90-150 months
This assumes that all the bulbs are produced
to the same standard
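The three intervals in this example all follow the same pattern, mean ± k standard deviations. A short sketch, using the mean of 120 months and standard deviation of 10 months from the example (the dictionary names are only illustrative):

```python
mean, sigma = 120, 10   # LED lifetime in months

# Number of standard deviations paired with the share of data it covers
rule = {1: "68%", 2: "95%", 3: "99.7%"}

intervals = {label: (mean - k * sigma, mean + k * sigma)
             for k, label in rule.items()}

for label, (low, high) in intervals.items():
    print(f"{label} of the bulbs last between {low} and {high} months")
```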
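Example continued…
[Figure: normal curve of LED lifetimes with the horizontal axis marked at 90, 100, 120, 140 and 150 months; the bands are labelled 34%, 13.5% and 2.35% on each side of the mean, with brackets for 95% and 99.7%]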
Percentage of data between two values



The area under any normal curve is 1
The percent of data that lies between two
values in a normal distribution is equivalent to
the area under the normal curve between
these values
See examples 2 and 3 on page 175
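As an illustration of area between two values, the sketch below reuses the LED distribution (mean 120 months, standard deviation 10 months) and asks what percent of bulbs last between 100 and 130 months. The two endpoints and the helper name `percent_between` are assumptions chosen for this example, not taken from the textbook.

```python
from statistics import NormalDist


def percent_between(mean, sigma, low, high):
    """Percent of a Normal(mean, sigma) population lying between low and high."""
    dist = NormalDist(mean, sigma)
    # cdf(x) is the area under the curve to the left of x
    return 100 * (dist.cdf(high) - dist.cdf(low))


exact = percent_between(120, 10, 100, 130)

# Same interval estimated from the rule: 13.5% + 34% + 34%
by_rule = 13.5 + 34 + 34

print(f"exact area: {exact:.1f}%")
print(f"68-95-99.7 estimate: {by_rule}%")
```

The two answers differ slightly because the percentages in the 68-95-99.7 rule are rounded.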
Why is the Normal distribution so
important?

Many psychological and educational variables are distributed approximately normally:
   height, reading ability, memory, IQ, etc.
Normal distributions are statistically easy to work with
   All kinds of statistical tests are based on it
Lane (2003)
Exercises


Complete p. 176 #1, 3b, 6, 8-10
http://onlinestatbook.com/
References


Lane, D. (2003). What's so important about the normal distribution? Retrieved October 5, 2004 from http://davidmlane.com/hyperstat/normal_distribution.html
Wikipedia (2004). Online Encyclopedia. Retrieved September 1, 2004 from http://en.wikipedia.org/wiki/Main_Page