STAB22 Statistics I
Lecture 4
Spread of Distribution
Spread describes how much data vary about
their centre (a.k.a. dispersion or variability)
Spread is an important aspect of a distribution
E.g. Investments below have same mean return (3%)
but different spread; which do you consider safer?
[Figure: two histograms, Investment 1 and Investment 2; x-axis: % return (-10 to 15), y-axis: Frequency]
Range

Range = Maximum – Minimum value


Simple to calculate, but not always helpful



E.g. Data: 4, 6, 8, 8, 10, 15 → Range = 15 − 4 = 11
Data set 1: 4, 4, 4, 4, 4, 4, 10
Data set 2: 4, 5, 6, 7, 8, 9, 10
(same range, but not same spread!!)
Range is very sensitive to outliers


Depends only on extreme values (min & max)
Better alternative is Interquartile Range
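The range calculations on this slide can be checked with a short Python sketch. The function name `data_range` and the "trimmed" comparison are illustrative additions, not part of the lecture; dropping the largest value is only a way to show how much the range depends on a single extreme observation.

```python
def data_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

example = [4, 6, 8, 8, 10, 15]
set1 = [4, 4, 4, 4, 4, 4, 10]
set2 = [4, 5, 6, 7, 8, 9, 10]

print(data_range(example))                  # 15 - 4 = 11
print(data_range(set1), data_range(set2))   # 6 and 6: same range

# Illustrative assumption: drop the single largest value from each set
# and recompute, to see how much one extreme value drives the range.
set1_trimmed = sorted(set1)[:-1]
set2_trimmed = sorted(set2)[:-1]
print(data_range(set1_trimmed), data_range(set2_trimmed))  # 0 and 5
```

In data set 1 the whole range comes from one value (10); in data set 2 the values are spread evenly, so removing one value changes the range only slightly.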
Quartiles

Quartiles: 3 values that divide distribution
into 4 parts, each containing ¼ of data


1st quartile (Q1) has ¼ of data below & ¾ above
2nd quartile (Q2) has ½ of data below & ½ above


Q2 =?
3rd quartile (Q3) has ¾ of data below & ¼ above
[Figure: distribution split by Q1, Q2 and Q3 into four parts, each containing ¼ of data]
Quartiles

To find Q1 & Q3, split data in two halves &
calculate median of each half


When # data is odd (n=5, 7,…) include median
value in each half
E.g. Find Q1 & Q3 for following data sets:
(1 3 4 6 7 9)
(1 3 4 4 6 7 9)
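The split-halves rule above can be sketched in Python as follows. The function name `quartiles` is an illustrative choice; only `statistics.median` from the standard library is used. Note that statistical software often uses other quartile conventions, so its output may differ slightly from this rule.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) using the split-halves rule from the slide.

    When the number of values is odd, the median value is
    included in both halves.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2       # size of each half; middle value is shared when n is odd
    lower = data[:half]       # lower half
    upper = data[n - half:]   # upper half
    return median(lower), median(upper)

print(quartiles([1, 3, 4, 6, 7, 9]))      # (3, 7)
print(quartiles([1, 3, 4, 4, 6, 7, 9]))   # (3.5, 6.5)
```

For the second data set the median (4) belongs to both halves, giving the halves (1 3 4 4) and (4 6 7 9).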
Interquartile Range

IQR (Interquartile Range) = Q3 – Q1


Distance between Q1 and Q3
Resistant (not sensitive) to outliers
[Figure: IQR marked as the distance between Q1 and Q3]
E.g. Data 3, 5, 7, 7, 8 → IQR =
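A minimal Python sketch of the IQR calculation, assuming the split-halves quartile rule from the previous slide. The function name `iqr` and the second data set `with_outlier` are hypothetical additions, used only to illustrate what "resistant to outliers" means in comparison with the range.

```python
from statistics import median

def iqr(values):
    """IQR = Q3 - Q1, with quartiles found by the split-halves rule."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2            # median is shared by both halves when n is odd
    q1 = median(data[:half])
    q3 = median(data[n - half:])
    return q3 - q1

data = [3, 5, 7, 7, 8]
with_outlier = [3, 5, 7, 7, 80]    # hypothetical: largest value replaced by an extreme one

print(iqr(data), max(data) - min(data))                          # IQR 2, range 5
print(iqr(with_outlier), max(with_outlier) - min(with_outlier))  # IQR 2, range 77
```

The extreme value changes the range from 5 to 77 but leaves the IQR unchanged.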
Variance


Variance (s²): measures average squared
deviation of individual data from their mean
For values x₁, x₂, …, xₙ, variance is given by:
\[
s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}
\]
Units of variance are squared → measuring spread in
$² or in days² !!
→ Easy fix: take square root (√) of s² to correct units
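The variance formula can be written out directly in Python. This is a sketch only: the function name `sample_variance` is illustrative, and the data are borrowed from the Range slide (4, 6, 8, 8, 10, 15) rather than from this slide. The result is compared against `statistics.variance`, which also divides by n − 1.

```python
import math
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [4, 6, 8, 8, 10, 15]        # data reused from the Range slide
s2 = sample_variance(data)         # in squared units
s = math.sqrt(s2)                  # square root brings back the original units

print(s2, s)
print(math.isclose(s2, variance(data)))  # agrees with the standard library
```

Here the mean is 8.5, the squared deviations sum to 71.5, and the variance is 71.5 / 5 = 14.3.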
Standard Deviation

Standard Deviation (s): \( s = \sqrt{s^2} \)
→ Standard Deviation = √Variance
Both variance (s²) & st. dev. (s) measure
spread around mean (x̄)


They are sensitive to outliers; they work best for
symmetric distributions
Nevertheless, they are commonly used in Stats
because they have useful properties
Example

Find standard deviation of data 2, 3, 5, 6 & 9.

Data (xᵢ)    Deviation (xᵢ − x̄)    Deviation² (xᵢ − x̄)²
2
3
5
6
9
Total: 25
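One way to fill in the table is the following Python sketch, which follows the variance and standard deviation formulas from the previous slides step by step. The variable names are illustrative.

```python
import math

data = [2, 3, 5, 6, 9]
n = len(data)
mean = sum(data) / n                      # 25 / 5 = 5.0
deviations = [x - mean for x in data]     # -3, -2, 0, 1, 4
squared = [d ** 2 for d in deviations]    # 9, 4, 0, 1, 16

# Print the three table columns, then the column totals.
print(f"{'x':>5} {'dev':>6} {'dev^2':>6}")
for x, d, d2 in zip(data, deviations, squared):
    print(f"{x:>5} {d:>6.1f} {d2:>6.1f}")
print(f"{sum(data):>5} {sum(deviations):>6.1f} {sum(squared):>6.1f}")

variance = sum(squared) / (n - 1)         # 30 / 4 = 7.5
std_dev = math.sqrt(variance)             # about 2.74
print(variance, std_dev)
```

The deviations always sum to zero, which is why they are squared before averaging.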
Five-Number Summary

Five-number summary: set of five measures
giving quick summary of a distribution



Measures consist of: (1) Minimum (2) Q1
(3) Median (4) Q3 and (5) Maximum
Gives an idea of both center & spread
E.g. For data (10, 11, 14, 15, 17, 19, 21, 28, 35),
5-number summary is:
Min   Q1   Median   Q3   Max
10    14   17       21   35
StatCrunch: Stat > Summary Stats
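The five-number summary for the example data can be reproduced with a short Python sketch, assuming the split-halves quartile rule from the Quartiles slide. The function name `five_number_summary` is illustrative; StatCrunch or other software may use a different quartile convention.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the split-halves rule."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2            # median is shared by both halves when n is odd
    q1 = median(data[:half])
    q3 = median(data[n - half:])
    return data[0], q1, median(data), q3, data[-1]

print(five_number_summary([10, 11, 14, 15, 17, 19, 21, 28, 35]))
# (10, 14, 17, 21, 35)
```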
Boxplot

Boxplot: visual display
of 5-number summary
E.g. Cereal sodium content
[Figure: boxplot of cereal sodium content, labelled with Minimum, Q1, Median, Q3, Maximum and IQR]
Summary statistics:
Column   Min   Q1    Median   Q3    Max
sodium   0     130   180      210   320
StatCrunch: Graphics > Boxplot
Boxplot & Outliers

Boxplot helps identify
outliers using “fences”
Let:
Upper fence = Q3 + 1.5 × IQR
Lower fence = Q1 − 1.5 × IQR
Draw lines from box to
farthest value within fences
Plot values outside fences
individually
These are suspect outliers
that require examination
[Figure: boxplot of cereal sodium content with Q1 = 130, Q3 = 210, IQR = 80, showing the upper fence, the lower fence and a suspect outlier]
StatCrunch: Graphics > Boxplot (tick “use fences to identify outliers”)
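The fence rule can be sketched in Python using the cereal sodium quartiles (Q1 = 130, Q3 = 210). The function names `fences` and `is_suspect` are illustrative; the two values checked are the minimum (0) and maximum (320) from the sodium summary statistics, since the full data set is not given in the slides.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_suspect(value, lower, upper):
    """A value outside the fences is a suspect outlier."""
    return value < lower or value > upper

lower, upper = fences(130, 210)    # IQR = 80
print(lower, upper)                # 10.0 330.0

# Only the extremes of the sodium data are known from the summary table.
for value in (0, 320):
    print(value, is_suspect(value, lower, upper))
```

The minimum (0) falls below the lower fence of 10, so it is flagged as a suspect outlier; the maximum (320) lies within the upper fence of 330.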
Example

Cereal calorie content
Summary statistics:
Column     Min   Q1    Median   Q3    Max
calories   50    100   110      110   160
Choosing a Summary

Mean and standard deviation work well for
reasonably symmetric distributions w/o outliers


Five-number summary is better for describing
skewed distributions w/ outliers

In addition to numerical summaries, always try
to include a plot of the distribution (e.g. boxplot
or histogram)