Chapter 3
3.2 Measures of Central Tendency, Variation, and Shape
Cover Summation Notation
Ex. 1
$\sum_{n=1}^{10} 3n$
Ex. 2
$\sum_{n=2}^{6} 2n + 3$
Ex. 3
$\sum_{n=1}^{5} X_n = X_1 + X_2 + X_3 + X_4 + X_5$
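The three examples can be checked numerically. The following is a minimal Python sketch, not part of the original notes. It assumes that the summand in Ex. 2 is the whole expression 2n + 3 (the notes show no parentheses), and the X values used for Ex. 3 are hypothetical.

```python
# Ex. 1: sum of 3n for n = 1..10
ex1 = sum(3 * n for n in range(1, 11))

# Ex. 2: sum of (2n + 3) for n = 2..6 (assumes the summand is 2n + 3 as a whole)
ex2 = sum(2 * n + 3 for n in range(2, 7))

# Ex. 3: sum of X_n for n = 1..5; these X values are hypothetical
X = [4, 8, 15, 16, 23]
ex3 = sum(X[n - 1] for n in range(1, 6))  # X_1 is stored at index 0

print(ex1, ex2, ex3)
```

Note that `range(1, 11)` stops at 10, which matches the upper limit written above the summation sign.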
The Arithmetic Mean
Arithmetic Mean (or just Mean) – Used to measure the central tendency.
It is easily thrown off by extreme values (or outliers).
Notation:
$\bar{X}$ represents the mean of a set of values.
$X_1, X_2, X_3, X_4, X_5$ represent individual values.
The mean of the above 5 values is computed as follows:
$\bar{X} = \frac{X_1 + X_2 + X_3 + X_4 + X_5}{5}$
In general,
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$
The Median
a. The median is the middle value in an ordered array of data.
b. Not affected by outliers (extreme values).
c. Can be used in place of the mean when outliers exist.
n - Odd
If the number of observations is odd, the median is the value in position (n + 1)/2.
Ex 4. 2 4 6 9 10
Position (5 + 1)/2 = 3, so the median is 6.
n - Even
If the number of observations is even, the median is the average of the two middle observations.
Ex 5. 2 4 6 9 10 11
Median is (6 + 9)/2 = 7.5
Note: Both the Mean and Median are measures of central tendency.
The Mode
The mode is the value in the set of observations that occurs most frequently.
a. Not affected by outliers (extreme values)
b. Used mainly for descriptive purposes since this value could vary from sample to sample.
Ex 6.
5 6 6 8 9 10 10 10 11 13
Mode is 10
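Examples 4 to 6 can be reproduced with Python's standard `statistics` module. This is an illustrative sketch only; the list containing the value 1000 is a made-up outlier added to show how the mean moves while the median does not.

```python
import statistics

odd = [2, 4, 6, 9, 10]                      # Ex 4
even = [2, 4, 6, 9, 10, 11]                 # Ex 5
ex6 = [5, 6, 6, 8, 9, 10, 10, 10, 11, 13]   # Ex 6

median_odd = statistics.median(odd)      # middle value
median_even = statistics.median(even)    # average of the two middle values
mode_ex6 = statistics.mode(ex6)          # most frequent value

# Hypothetical outlier: replace the largest value in Ex 4 with 1000
with_outlier = [2, 4, 6, 9, 1000]
mean_before = statistics.mean(odd)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(median_odd, median_even, mode_ex6)
print(mean_before, mean_after, median_after)
```

The mean jumps from 6.2 to 204.2 when the outlier is introduced, while the median stays at 6.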
Quartiles
First Quartile: Q1 is in position (n + 1)/4
Third Quartile: Q3 is in position 3(n + 1)/4
If the position is halfway between two integers, use the average of the values in those two positions.
If the position is neither an integer nor halfway between two integers, round it to the nearest integer and use the value in that position.
Ex 7.
2.4 3.6 4.7 5.8 6.7 6.8 6.9 7.1 8.4 (nine observations)
Q1: position (9 + 1)/4 = 10/4 = 2.5
Q1 = average of the 2nd and 3rd values.
Q1 = (3.6 + 4.7)/2 = 4.15
Ex 8.
2.3 2.4 3.5 3.6 5.6 6.2 6.9 8.0 (eight observations)
Q3: position 3(8 + 1)/4 = 27/4 = 6.75, which rounds to 7
Q3 = 6.9 (which is the seventh observation)
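The position rule used in Ex 7 and Ex 8 can be written as a small function. This is a sketch under stated assumptions, not the textbook's own code: `quartile` is a hypothetical helper name, the input must already be sorted and hold at least two observations, and the whole-number case (use the value in that position) is assumed because the notes do not spell it out. Note that this rule can give different answers from the quartile methods built into statistical software.

```python
from fractions import Fraction
from math import floor

def quartile(sorted_data, k):
    """Return Q_k (k = 1 or 3) using the position rule k(n + 1)/4."""
    n = len(sorted_data)
    pos = Fraction(k * (n + 1), 4)      # exact 1-based position
    lower = floor(pos)
    frac = pos - lower
    if frac == 0:                       # whole-number position (assumed case)
        return sorted_data[lower - 1]
    if frac == Fraction(1, 2):          # halfway: average the two neighbours
        return (sorted_data[lower - 1] + sorted_data[lower]) / 2
    return sorted_data[round(pos) - 1]  # otherwise round the position

ex7 = [2.4, 3.6, 4.7, 5.8, 6.7, 6.8, 6.9, 7.1, 8.4]
ex8 = [2.3, 2.4, 3.5, 3.6, 5.6, 6.2, 6.9, 8.0]
print(quartile(ex7, 1))
print(quartile(ex8, 3))
```

Using `Fraction` keeps the position exact, so the halfway test does not depend on floating-point rounding.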
Measures of Variation
See pg. 117
Measure of how spread out the data are.
Five measures of variation
1. Range
2. Interquartile Range
3. Variance
4. Standard Deviation
5. Coefficient of Variation
Range
a. A very simple measure of variation
b. Doesn’t take into account the data between the largest and smallest value
c. Measure of total spread (using end values)
Range = largest value – smallest value
Interquartile Range
a. Difference between the third and first quartiles
b. Not affected by outliers
c. Measure of middle spread
Interquartile Range = Q3 – Q1
Ex 9. 4 5 6 8 9 12 13 15 15 17 (ten observations)
Q1 is in position (10 + 1)/4 = 11/4 = 2.75 (approx. 3)
Q1 = 6 (number in the third position)
Q3 is in position 3(10 + 1)/4 = 33/4 = 8.25 (approx. 8)
Q3 = 15 (number in the eighth position)
Interquartile Range is Q3 – Q1 = 15 – 6 = 9
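The range and interquartile range for Ex 9 can be verified in a few lines. This sketch simply follows the hand calculation; it relies on the fact that neither position in this example is a whole or half number, so plain rounding is enough here.

```python
ex9 = [4, 5, 6, 8, 9, 12, 13, 15, 15, 17]
n = len(ex9)

data_range = max(ex9) - min(ex9)   # largest value - smallest value

# Positions from the rule in these notes, rounded to the nearest integer
q1_pos = round((n + 1) / 4)        # 2.75 -> 3
q3_pos = round(3 * (n + 1) / 4)    # 8.25 -> 8
q1 = ex9[q1_pos - 1]               # Python lists are 0-based
q3 = ex9[q3_pos - 1]
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```

The range (13) uses only the two end values, while the interquartile range (9) describes the spread of the middle of the data.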
Sample Variance See pg. 119 Exhibit 3.1
a. Takes into account all data
b. Shows how a set of data is distributed around the mean.
c. Measure of the average scatter around the mean.
d. The result is a number in squared units of the original data (if the data are in inches, the variance is in inches squared).
e. Variance is denoted by $S^2$
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
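Ex 10. Consider the set of values that represent the age at which a sample of 6 people graduated from college:
18 22 22 23 24 26

$X_i$    $\bar{X}$    $(X_i - \bar{X})$    $(X_i - \bar{X})^2$
18       22           (18 - 22) = -4       (-4)^2 = 16
22       22           (22 - 22) = 0        (0)^2 = 0
22       22           (22 - 22) = 0        (0)^2 = 0
23       22           (23 - 22) = 1        (1)^2 = 1
24       22           (24 - 22) = 2        (2)^2 = 4
26       22           (26 - 22) = 4        (4)^2 = 16
(Squaring makes each value positive.)

$S^2$ = (16 + 0 + 0 + 1 + 4 + 16)/(6 - 1) = 37/5 = 7.4 (note: in units squared)
Note: the exact mean of these six values is 135/6 = 22.5; the calculation above uses $\bar{X}$ = 22. With 22.5, the sum of squared deviations is 35.5 and $S^2$ = 35.5/5 = 7.1.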
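The sample variance formula can be checked against the Ex 10 data. In this sketch, `sample_variance` and its `center` parameter are hypothetical names introduced only so that the hand calculation (which uses 22 as the mean) and the exact calculation (the mean of these six values is 22.5) can be compared side by side.

```python
import statistics

ages = [18, 22, 22, 23, 24, 26]   # Ex 10

def sample_variance(values, center=None):
    """Sum of squared deviations from `center`, divided by n - 1."""
    if center is None:
        center = sum(values) / len(values)   # exact sample mean
    return sum((x - center) ** 2 for x in values) / (len(values) - 1)

as_in_notes = sample_variance(ages, center=22)   # reproduces 37/5
exact = sample_variance(ages)                    # uses the mean 22.5
print(as_in_notes, exact, statistics.variance(ages))
```

The hand-worked figure is 7.4, while the exact mean gives 7.1, which is also what `statistics.variance` returns.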
Sample Standard Deviation
a. Takes into account all data.
b. Primary measure of variation.
c. Shows how a set of data is distributed around the mean.
d. Measure of the average scatter around the mean.
e. The Standard Deviation is the square root of the sample variance.
f. This results in a number that has the same units as the individual values.
g. Standard Deviation is denoted by S.
$S = \sqrt{S^2}$
Ex 11. Using the above sample:
$S = \sqrt{7.4}$ = 2.720294101, approx. 2.72 (same units as the original values)
Which of the following will have a Standard Deviation of 0?
Between ex a. and ex c., which one will have the largest Standard Deviation?
ex a. 1 6 7 11 21 21 35
ex b. 3 3 3 3 3 3 3
ex c. 4 4 5 8 10 11 11
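One way to check answers to the two questions above is to compute the sample standard deviation of each set with the standard `statistics` module. This is only a verification sketch; the reasoning (no spread at all versus wide spread) should come first.

```python
import statistics

ex_a = [1, 6, 7, 11, 21, 21, 35]
ex_b = [3, 3, 3, 3, 3, 3, 3]
ex_c = [4, 4, 5, 8, 10, 11, 11]

# Print the sample standard deviation of each set, rounded to 2 places
for name, data in [("a", ex_a), ("b", ex_b), ("c", ex_c)]:
    print(name, round(statistics.stdev(data), 2))
```

A set whose values are all identical has no scatter around its mean, and a set whose values are spread far from the mean has a larger S than one whose values cluster near it.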
Pg. 121
For most sets of data, the majority of the values are within one Standard Deviation (1*S) of the mean, that is, between $\bar{X} - 1S$ and $\bar{X} + 1S$.
For example, using Ex 10, the majority of the values lie between 22 – 2.72 = 19.28 and 22 + 2.72 = 24.72.
Coefficient of Variation (CV)
1. Measures the scatter in the data relative to the mean.
2. Relative measure of variation expressed as a percentage
$CV = \frac{S}{\bar{X}} \cdot 100\%$
As S increases, CV increases.
As $\bar{X}$ increases, CV decreases.
The CV is useful when comparing two sets of data that are measured in different units.
(SD / Mean -- units cancel)
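The claim that the units cancel can be illustrated numerically. In this sketch the heights are hypothetical data, and `coefficient_of_variation` is an illustrative helper, not a library function. The same measurements expressed in inches and in centimetres give the same CV.

```python
import statistics

def coefficient_of_variation(values):
    """CV = (S / mean) * 100, expressed as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical heights, first in inches and then converted to centimetres
inches = [60, 62, 65, 68, 70]
centimetres = [x * 2.54 for x in inches]

cv_in = coefficient_of_variation(inches)
cv_cm = coefficient_of_variation(centimetres)
print(round(cv_in, 2), round(cv_cm, 2))
```

Both S and the mean are multiplied by 2.54 in the conversion, so their ratio is unchanged.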
Shape of a set of data
Asymmetrical data – not symmetrical – skewed left or right.
Skewed Left – Negative Skew
Mean < Median
Extreme low values throw the mean off (decrease the mean)
Skewed Right – Positive Skew
Mean > Median
Extreme high values throw the mean off (increase the mean)
Symmetrical – Not Skewed
Mean = Median
No extreme values
Low and high values balance each other.
Example data (33 observations, in order):
20 22 22 24 24 24 26 27 32 34 35 35 36 45 45 46 46 54 54 54 54 56 56 64 67 67 76 76 80 89 90 99 100
Mean = 50.87879
Median = 46
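The mean-versus-median comparison can be applied to the 33 values in the example data. In this sketch, `skew_direction` is a hypothetical helper that encodes only the rule of thumb from these notes; it is not a formal skewness statistic.

```python
import statistics

data = [20, 22, 22, 24, 24, 24, 26, 27, 32, 34, 35, 35, 36, 45, 45, 46, 46,
        54, 54, 54, 54, 56, 56, 64, 67, 67, 76, 76, 80, 89, 90, 99, 100]

def skew_direction(values):
    """Classify shape by comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right (positive) skew"
    if mean < median:
        return "left (negative) skew"
    return "symmetrical"

print(round(statistics.mean(data), 5), statistics.median(data))
print(skew_direction(data))
```

Here the mean (about 50.88) exceeds the median (46), so the rule of thumb points to a right skew: the high values pull the mean upward.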
Sec 3.3
Exploratory Data Analysis
The 5-Number summary
X smallest
Q1
Median
Q3
X largest
Right-Skewed Distributions – distance from median to Xlargest > distance from Xsmallest to median
Right-Skewed Distributions – distance from Q3 to Xlargest > distance from Xsmallest to Q1.
Left-Skewed Distributions -- distance from Xsmallest to median > distance from median to Xlargest
Left-Skewed Distributions – distance from Xsmallest to Q1 > distance from Q3 to Xlargest.
Recall: Q1 is the observation in position (n + 1)/4
Q3 is the observation in position 3(n + 1)/4
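The distance comparisons for the 5-number summary can be sketched as a function. `skew_from_five_numbers` is a hypothetical name, and the sketch requires both comparisons to agree before reporting a direction, which is an assumption rather than a rule stated in the notes. The sample call uses the five values printed with the box plot in these notes, read as minimum, Q1, median, Q3, and maximum (an assumed reading).

```python
def skew_from_five_numbers(x_min, q1, median, q3, x_max):
    """Apply the two distance comparisons for the 5-number summary."""
    right = (x_max - median > median - x_min) and (x_max - q3 > q1 - x_min)
    left = (median - x_min > x_max - median) and (q1 - x_min > x_max - q3)
    if right:
        return "right-skewed"
    if left:
        return "left-skewed"
    return "no clear skew"

# Assumed reading of the values shown with the box plot in these notes
print(skew_from_five_numbers(50, 77, 85, 89, 100))
```

For those values the distance from the smallest value to the median (35) is larger than the distance from the median to the largest value (15), which indicates a left skew.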
3.3 continued
Box-and-Whisker Plot (uses the 5-number summary)
Five-number Summary
Minimum
First Quartile
Median
Third Quartile
Maximum
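Plot (figure not reproduced): box-and-whisker plots illustrating skewed-left and skewed-right data, with one example labeled Min = 50, Q1 = 77, Median = 85, Q3 = 89, Max = 100.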
Sec 3.4
Recall:
$\bar{X}$ represents sample mean
$S^2$ represents sample variance
S represents sample standard deviation
If the data set represents an entire population instead of just a sample …
Population Mean
$\mu$ represents the mean of the population (read as mu).
N represents the number of observations.
$X_i$ represents the ith individual observation.
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
Population Variance
$\sigma^2$ (lowercase letter sigma; read as “sigma squared”)
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Population Standard Deviation
$\sigma$ (square root of the variance)
$\sigma = \sqrt{\sigma^2}$
Example (the sample from Ex 10)

$X_i$    $\bar{X}$    $(X_i - \bar{X})$    $(X_i - \bar{X})^2$
18       22           (18 - 22) = -4       (-4)^2 = 16
22       22           (22 - 22) = 0        (0)^2 = 0
22       22           (22 - 22) = 0        (0)^2 = 0
23       22           (23 - 22) = 1        (1)^2 = 1
24       22           (24 - 22) = 2        (2)^2 = 4
26       22           (26 - 22) = 4        (4)^2 = 16
(Squaring makes each value positive.)

$S^2$ = (16 + 0 + 0 + 1 + 4 + 16)/(6 - 1) = 37/5 = 7.4 (note: in units squared)
$S = \sqrt{7.4}$ = 2.720294101, approx. 2.72
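For comparison with the sample figures, the population formulas from this section divide by N rather than n - 1. The sketch below treats the same six values as if they were an entire population, which is an assumption made purely for illustration, and it uses their exact mean of 22.5.

```python
import statistics

values = [18, 22, 22, 23, 24, 26]   # treated here as a whole population
N = len(values)

mu = sum(values) / N                                # population mean
sigma_sq = sum((x - mu) ** 2 for x in values) / N   # divide by N
sigma = sigma_sq ** 0.5                             # population std. deviation

s_sq = statistics.variance(values)                  # sample version: N - 1
print(mu, round(sigma_sq, 4), round(sigma, 4), s_sq)
```

The population variance is always the smaller of the two because the same sum of squares is divided by N instead of N - 1.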
Empirical Rule -- Not to be used with data sets that are highly skewed
1. Approx. 68% (about 2/3) of the observations lie within a distance of ±1 S.D. of the mean.
About 68% lie between 22 – 2.72 and 22 + 2.72, that is, between 19.28 and 24.72.
2. Approx. 95% of the observations lie within a distance of ±2 S.D. of the mean.
About 95% lie between 22 – 2*2.72 and 22 + 2*2.72, that is, between 16.56 and 27.44.
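The two intervals are simple arithmetic on the mean and standard deviation. A minimal sketch, using the values from the worked example (mean 22, S = 2.72); `interval` is a hypothetical helper name.

```python
def interval(mean, sd, k):
    """Return (mean - k*sd, mean + k*sd)."""
    return (mean - k * sd, mean + k * sd)

mean, sd = 22, 2.72   # values used in the worked example
low1, high1 = interval(mean, sd, 1)
low2, high2 = interval(mean, sd, 2)
print(round(low1, 2), round(high1, 2))
print(round(low2, 2), round(high2, 2))
```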
Sec 3.5
Coefficient of Correlation
Error: Pg. 138 states “In section 2.5, scatter diagrams are used to ...”, but scatter diagrams were discussed in Section 2.3.
Recall: Scatter Diagram – graphically displays bivariate (two variables) numerical data.
Coefficient of correlation (r)
-- Numerical description for measuring the strength of the relationship between two variables.
-- Measures the degree of linear association between two variables.
-- Values range from –1 (perfect negative correlation) to 1 (perfect positive correlation).
-- “Perfect” means that all points could be connected with a straight line.
-- The relationship between 2 variables is described as a “tendency” and not as “cause & effect”.
-- Correlation alone cannot prove that the change in one variable caused the change in the other variable.
-- Further analysis is needed to prove causation.
-- Causation implies correlation but correlation does not imply causation.
The Coefficient of Correlation is computed as follows:
$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$
(Scatter diagram of Column B versus Column A not reproduced.)
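$X_i$   $Y_i$   $X_i - \bar{X}$      $Y_i - \bar{Y}$        $(X_i - \bar{X})^2$   $(Y_i - \bar{Y})^2$
3       20      3 - 7.5 = -4.5       20 - 14.75 = 5.25      20.25                 27.5625
7       15      7 - 7.5 = -0.5       15 - 14.75 = 0.25      0.25                  0.0625
9       14      9 - 7.5 = 1.5        14 - 14.75 = -0.75     2.25                  0.5625
11      10      11 - 7.5 = 3.5       10 - 14.75 = -4.75     12.25                 22.5625
$\bar{X}$ = (3 + 7 + 9 + 11)/4 = 7.5
$\bar{Y}$ = (20 + 15 + 14 + 10)/4 = 14.75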
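The coefficient of correlation for the four (X, Y) pairs can be computed directly from the deviations from each mean. This is a minimal sketch of the standard sample formula; `correlation` is a hypothetical helper name, and the function assumes that neither variable is constant (otherwise the denominator is zero).

```python
from math import sqrt

X = [3, 7, 9, 11]
Y = [20, 15, 14, 10]

def correlation(xs, ys):
    """Sample coefficient of correlation r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sums of squared deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(X, Y)
print(round(r, 4))
```

The result is roughly -0.98, a strong negative linear association: as X increases, Y tends to decrease.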
Positive Correlation. As one variable increases, the other variable increases.
Negative Correlation. As one variable increases, the other variable decreases.
Zero Correlation. As one variable changes, the other variable shows no linear tendency to increase or decrease.
(Scatter diagram of Column B versus Column A not reproduced.)
Coefficient of correlation: see Pg. 140
Explanations for a correlation between two variables:
-- Caused by chance
-- A third variable (not included in the study) may be the cause
-- Cause and effect
Sec 3.6
Pitfalls in numerical descriptive measures and ethical issues
Read, Read, Read