AMS7: WEEK 2.CLASS 1
Measures of Variation. Introduction to
Probability
Wednesday April 8, 2015
Measures to understand Data Variation
• Range
• Standard Deviation
• Variance
• Coefficient of Variation
• RANGE: (Max. Value) – (Min. Value)
• STANDARD DEVIATION: Measure of the variation of
observations about the mean.
SAMPLE STANDARD DEVIATION:
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Measures to understand Data Variation
(Cont.)
• Short cut to the Standard Deviation Formula
$s = \sqrt{\dfrac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}}$
Note: This formula is better when you use a calculator with
the summation function
Example: Standard deviation
• Get the standard deviation of 1, 3, 14 (waiting times in
minutes):
1) Compute the mean: $\bar{x} = \dfrac{\sum x}{n} = \dfrac{18}{3} = 6$
2) Subtract the mean from each value ($x - \bar{x}$): −5, −3, 8
3) Square each value in 2) ($(x - \bar{x})^2$): 25, 9, 64
4) Sum the values in 3): $\sum (x - \bar{x})^2 = 98$
5) Divide the quantity in 4) by n − 1 = 3 − 1 = 2: $\dfrac{98}{2} = 49$
6) Find the square root: $s = \sqrt{49} = 7.0$ mins.
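As a cross-check of the worked example, the sketch below follows the six steps and then applies the shortcut formula to the same waiting times. The variable names are illustrative, and `statistics.stdev` from the Python standard library is used only as an independent check.

```python
import math
import statistics

# Waiting times in minutes, taken from the example above.
times = [1, 3, 14]
n = len(times)

# Definition: square root of the summed squared deviations divided by n - 1.
mean = sum(times) / n
sum_sq_dev = sum((x - mean) ** 2 for x in times)
s_definition = math.sqrt(sum_sq_dev / (n - 1))

# Shortcut: needs only the sum of x and the sum of x squared.
sum_x = sum(times)
sum_x2 = sum(x * x for x in times)
s_shortcut = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(mean, sum_sq_dev, s_definition, s_shortcut)
print(statistics.stdev(times))  # library cross-check
```

Both routes should agree with the 7.0 minutes obtained by hand.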
Standard Deviation of a population
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ ($\sigma$: sigma)
• VARIANCE OF A SAMPLE AND A POPULATION: Is the
square of the standard deviation
- Sample variance: $s^2$
- Population variance: $\sigma^2$
EXAMPLE: Waiting times. Sample variance = $(7.0)^2 = 49$ min² (Units are squared!)
Note: $s^2$ is an unbiased estimator of $\sigma^2$, and when n is
large the sample value converges to the
population value.
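A minimal sketch of the two variances, using the standard library functions `statistics.variance` (divides by n − 1) and `statistics.pvariance` (divides by N). Treating the three waiting times as a whole population is an assumption made only to show the difference in divisors.

```python
import statistics

# Waiting times in minutes from the standard deviation example.
times = [1, 3, 14]

# Sample variance: summed squared deviations divided by n - 1.
sample_var = statistics.variance(times)

# Population variance: divided by N. Assumption for illustration only:
# the three values are treated as if they were the whole population.
population_var = statistics.pvariance(times)

print(sample_var)      # units: min^2
print(population_var)  # units: min^2
```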
Coefficient of Variation
Ratio of the standard deviation to the mean (normalized
measure of dispersion)
CV = $\dfrac{s}{\bar{x}} \times 100\%$: SAMPLE VALUE
CV = $\dfrac{\sigma}{\mu} \times 100\%$: POPULATION VALUE
EXAMPLE: Waiting times: CV = $\dfrac{7.0}{6} \times 100\% = 116.67\%$
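The waiting-time CV can be reproduced in a few lines. This is only a sketch; the variable names are illustrative.

```python
import statistics

# Waiting times in minutes from the standard deviation example.
times = [1, 3, 14]

# Sample coefficient of variation: s divided by the mean, as a percentage.
cv = statistics.stdev(times) / statistics.mean(times) * 100

print(f"{cv:.2f}%")
```

Because the CV has no units, it can be compared across data sets measured on different scales.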
Rough Estimates
• STANDARD DEVIATION: $S \cong \dfrac{\text{Range}}{4}$ (Range is known)
EXAMPLE: Cotinine levels: $S \cong \dfrac{491 - 0}{4} = \dfrac{491}{4} \cong 123$
(Correct value is 119.5)
• Minimum and Maximum usual values:
- Min. = (mean) - 2⨯ (standard deviation)
- Max. = (mean)+ 2⨯ (standard deviation)
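The two rough estimates can be sketched as small helper functions. The function names are hypothetical; the inputs are the cotinine minimum and maximum (0 and 491) and the mean and standard deviation from the women's heights example in these slides (163 cm and 6 cm).

```python
def range_rule_sd(min_value, max_value):
    """Rough standard deviation: the range divided by 4."""
    return (max_value - min_value) / 4


def usual_limits(mean, sd):
    """Minimum and maximum usual values: mean minus/plus 2 standard deviations."""
    return mean - 2 * sd, mean + 2 * sd


s_estimate = range_rule_sd(0, 491)   # cotinine levels
low, high = usual_limits(163, 6)     # women's heights in cm

print(s_estimate)
print(low, high)
```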
Rules for data with a Bell-Shaped
distribution
• EMPIRICAL RULE:
- About 68% of all values fall within 1 standard deviation of
the mean
- About 95% of all values fall within 2 standard deviations
of the mean
- About 99.7% of all values fall within 3 standard deviations
of the mean
Example:
• Heights of women have a bell-shaped distribution with
mean of 163 cm and a standard deviation of 6 cm. What
percentage of women have heights between 145 and 181
cm?
- 145 and 181 are 3 standard deviations away from the
mean:
- 163-18 = 145
- 163+18 = 181
- About 99.7% of all values are within 3 standard deviations
of the mean
About 99.7% of women's heights are
between 145 cm and 181 cm.
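The percentages in the empirical rule can be checked numerically. The sketch below assumes the bell shape is an exact normal distribution, which the slides do not state, and uses `statistics.NormalDist` from the Python standard library with the heights example (mean 163 cm, standard deviation 6 cm).

```python
from statistics import NormalDist

# Assumption: the bell-shaped distribution is modeled as exactly normal.
heights = NormalDist(mu=163, sigma=6)

shares = {}
for k in (1, 2, 3):
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    # Share of the distribution lying within k standard deviations of the mean.
    shares[k] = heights.cdf(high) - heights.cdf(low)
    print(f"within {k} sd ({low:.0f} to {high:.0f} cm): {shares[k]:.1%}")
```

For k = 3 the interval is 145 to 181 cm, matching the example.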
Comparing values from different data sets
• Z Scores: Number of standard deviations above or below
the mean
- SAMPLE: $Z = \dfrac{x - \bar{x}}{s}$
- POPULATION: $Z = \dfrac{x - \mu}{\sigma}$
ORDINARY VALUES: -2 ≤ Z score ≤ 2
UNUSUAL VALUES: Z score < -2, Z score > 2
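A minimal sketch of the z score and the ordinary/unusual rule. The function names are hypothetical, the mean and standard deviation come from the heights example (163 cm and 6 cm), and the 170 cm value is made up for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def is_unusual(z):
    """Unusual values have a z score below -2 or above 2."""
    return z < -2 or z > 2


z_tall = z_score(181, 163, 6)      # 181 cm, from the heights example
z_typical = z_score(170, 163, 6)   # 170 cm, a hypothetical height

print(z_tall, is_unusual(z_tall))
print(round(z_typical, 2), is_unusual(z_typical))
```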
Quartiles
• Divide the sorted data into four equal parts:
1. Q1 (First quartile): Value with 75% of observations
above it, and 25% below it.
2. Q2 ( Second quartile): Same as the median: 50% of
observations above it and 50% of observations below it.
3. Q3 (Third quartile): 75% of observations are below it
and 25% above it.
EXAMPLE: Ordered data set: 1, 3, 6, 10, 15, 21, 28, 36
Q1 = 4.5, Q2 = 12.5, Q3 = 24.5
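The quartiles in this example are consistent with taking the median of the lower half and of the upper half of the sorted data. The sketch below assumes that convention; other conventions exist (for instance, `statistics.quantiles` interpolates and gives different values for this data set). The function name is hypothetical.

```python
import statistics


def quartiles(values):
    """Q1, Q2, Q3 as medians of the lower half, the whole set, and the upper half.

    Needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]  # leaves out the middle value when n is odd
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


q1, q2, q3 = quartiles([1, 3, 6, 10, 15, 21, 28, 36])
print(q1, q2, q3)
```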
Percentiles
• The 99 percentiles divide the data into 100 groups.
• For a given value X in the data set:
Percentile of value X = $\dfrac{\text{number of values less than } X}{\text{total number of values}} \times 100$
• EXAMPLE: Cotinine level of smokers
0 1 1 3 17 32 35 44 48 86 87 103 112 121 123 130 131 149
164 167 173 173 198 208 210 222 227 234 245 250 253 265 266
277 284 289 290 313 477 491
• Percentile corresponding to the level of 112 = $\dfrac{12}{40} \times 100 = 30$
112 is the 30th Percentile
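The percentile of a value can be computed by counting how many observations fall below it. A minimal sketch with the 40 cotinine levels listed above; the function name is hypothetical.

```python
# Cotinine levels of smokers, copied from the example above.
cotinine = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86, 87, 103, 112, 121, 123, 130, 131, 149,
    164, 167, 173, 173, 198, 208, 210, 222, 227, 234, 245, 250, 253, 265, 266,
    277, 284, 289, 290, 313, 477, 491,
]


def percentile_of(value, data):
    """Percentage of the observations that are strictly less than value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


p = percentile_of(112, cotinine)
print(p)
```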
Exploratory Data Analysis
(J. W. Tukey, 1977)
• Process of using Statistical Tools (graphs, measures of
center, measures of variation) to investigate data sets in
order to understand their important characteristics:
1) Center; 2) Variation and 3) Nature of the distribution.
• OUTLIERS: Values located very far away from almost all
the other values. Relative to the other data, an outlier is
an extreme value.
BOXPLOTS
• Very useful plot for comparing different data sets.
You need to have:
- Q1: First quartile
- Q2: Second quartile
- Q3: Third quartile
- Maximum value
- Minimum value
[Figure: COTININE DATA SET BOXPLOT. Min = 0, Q1 = 86.5, Q2 = 170, Q3 = 251.5, Max = 491; axis in ng/ml from 0 to 500]
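The five values needed for the boxplot can be computed from the cotinine data. This sketch assumes quartiles are taken as medians of the lower and upper halves of the sorted data, which reproduces the values shown on the plot; the function name is hypothetical.

```python
import statistics

# Cotinine levels of smokers (ng/ml), copied from the percentile example.
cotinine = [
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86, 87, 103, 112, 121, 123, 130, 131, 149,
    164, 167, 173, 173, 198, 208, 210, 222, 227, 234, 245, 250, 253, 265, 266,
    277, 284, 289, 290, 313, 477, 491,
]


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using medians of the two halves."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q2 = statistics.median(data)
    q3 = statistics.median(data[-half:])  # leaves out the middle value when n is odd
    return data[0], q1, q2, q3, data[-1]


summary = five_number_summary(cotinine)
print(summary)
```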
Outliers
• Some boxplot procedures identify extreme outliers
Extreme outliers
- Values less than Q1 − 3⨯(Q3 − Q1) (Q3 − Q1 is the
interquartile range)
- Values higher than Q3 + 3⨯(Q3 − Q1)
EXAMPLE: Include the value 551 in the cotinine levels
[Figure: boxplot of the cotinine levels with 551 marked as OUTLIER; axis from 0 to 500]
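The outlier limits can be sketched as a small function of the quartiles. The assumptions here are that the quartiles are those of the original 40 cotinine values (86.5 and 251.5, not recomputed after adding 551) and that a multiplier of 1.5 is used for ordinary outliers; 1.5 is a widely used cutoff, but the slides only state the multiplier 3 for extreme outliers. The function name is hypothetical.

```python
def fences(q1, q3, k):
    """Lower and upper limits placed k interquartile ranges beyond Q1 and Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of the original 40 cotinine values, as quoted on the boxplot slide.
q1, q3 = 86.5, 251.5
new_value = 551

# Extreme outliers: more than 3 interquartile ranges beyond the quartiles.
extreme_low, extreme_high = fences(q1, q3, k=3)
is_extreme = new_value < extreme_low or new_value > extreme_high

# Assumption: k = 1.5 for ordinary outliers (not stated in the slides).
mild_low, mild_high = fences(q1, q3, k=1.5)
is_outlier = new_value < mild_low or new_value > mild_high

print(extreme_low, extreme_high, is_extreme)
print(mild_low, mild_high, is_outlier)
```

With these inputs, 551 lies above the 1.5 × IQR limit but below the 3 × IQR limit.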
Boxplots and distributions