Download Coefficient of correlation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Degrees of freedom (statistics) wikipedia , lookup

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Transcript
Ch 4 實習
Numerical Descriptive Techniques

Measures of Central Location


Measures of Variability


Percentiles, Quartiles
Measures of Linear Relationship

2
Range, Standard Deviation, Variance, Coefficient of
Variation
Measures of Relative Standing


Mean, Median, Mode
Covariance, Correlation
Jia-Ying Chen
The Arithmetic Mean

This is the most popular and useful measure of
central location
Sum of the observations
Mean =
Number of observations

Drawback

3
Very sensitive to extreme values (outliers)
Jia-Ying Chen
The Median

The Median of a set of observations is the value that
falls in the middle when the observations are arranged
in order of magnitude.
Sample and population medians are computed the same way.
Example
Comment
Find the median of the time on the internet Suppose only 9 adults were sampled
for the 10 adults
Even number of observations
0, 0, 5,
0, 7,
5, 8,
7, 8,
9, 12,
14,14,
22,22,
33 33
8.59,, 12,
4
Odd number of observations
0, 0, 5, 7, 8 9, 12, 14, 22
Jia-Ying Chen
The Mode



The Mode of a set of observations is the value that occurs
most frequently.
Set of data may have one mode (or modal class), or two or
more modes.
When the number of all data appears only once, mode doesn’t
exit.
The modal class
5
For large data sets
the modal class is
much more relevant
than a single-value
mode.
Jia-Ying Chen
Example 1
The times (to the nearest minute) that a sample of
9 bank customers waited in line were recorded
and are listed here.
7 4 0 2 7 3 1 9 12
 Determine the mean, median, and mode for these
data.

6
Jia-Ying Chen
Solution



7
Mean=(7+4+0+2+7+3+1+9+12)/9=5
Ordered data:0 1 2 3 4 7 7 9 12
Median=4, Mode=7
Jia-Ying Chen
Relationship among Mean, Median, and Mode

If a distribution is symmetrical, the mean, median and
mode coincide

If a distribution is asymmetrical, and skewed to the
left or to the right, the three measures differ.
A positively skewed distribution
(“skewed to the right”)
8
Mode Mean
Median
Jia-Ying Chen
Relationship among Mean, Median, and Mode
9

If a distribution is symmetrical, the mean, median and
mode coincide

If a distribution is non symmetrical, and skewed to the
left or to the right, the three measures differ.
A positively skewed distribution
(“skewed to the right”)
A negatively skewed distribution
(“skewed to the left”)
Mode
Mean
Median
Mean
Mode
Median
Jia-Ying Chen
The range



The range of a set of observations is the difference between the
largest and smallest observations.
Its major advantage is the ease with which it can be computed.
Its major shortcoming is its failure to provide information on the
dispersion of the observations between the two end points.
But, how do all the observations spread out?
? ? ?
The range cannot assistRange
in answering this question
10
Smallest
observation
Largest
observation
Jia-Ying Chen
Variance…

The variance of a population is:
population size

The variance of a sample is:
population mean
sample mean
Note! the denominator is sample size (n) minus one !
11
Jia-Ying Chen
思考!!

如何測量離散程度??

每各數字跟平均值差的和?


每各數字跟平均值的差的平方的和?


都會是0
資料越多會越大
所以每各數字跟平均值的差的平方的和再取平均來
測量離散程度比較合理
原來是這樣~~
12
Jia-Ying Chen
Variance…


13
As you can see, you have to calculate the sample mean
(x-bar) in order to calculate the sample variance.
Alternatively, there is a short-cut formulation to
calculate sample variance directly from the data without
the intermediate step of calculating the mean. Its given
by:0
Jia-Ying Chen
Proof
s
2




14
1
  ( xi  x) 2 


n 1 
2
1 
2

(
x

2
x
x

x
)

i
i

n  1 
2
1 
2

(
x
)

2
x
x

nx


i
i

n  1 
xi

x
n
2
2

(
x
)
(
x
)
1 


i
i
2

  ( xi )  2

n  1 
n
n 
2

(
x
)
1 

i
2
  ( xi ) 

n  1 
n 
Jia-Ying Chen
Coefficient of Variation…

The coefficient of variation of a set of observations is
the standard deviation of the observations divided by
their mean, that is:



CV是相對離勢量數(measure of relative disperson)


15
Population coefficient of variation = CV =  / 
Sample coefficient of variation = cv = s / x
比較幾組資料單位不同的差異情形。
比較幾組資料單位相同,但平均數相差懸殊之差異情形。
Jia-Ying Chen
The Empirical Rule…
Approximately 68% of all observations fall
within one standard deviation of the mean.
Approximately 95% of all observations fall
within two standard deviations of the mean.
Approximately 99.7% of all observations fall
within three standard deviations of the mean.
16
Jia-Ying Chen
4.16
Chebysheff’s Theorem…
A more general interpretation of the standard deviation is derived from
Chebysheff’s Theorem, which applies to all shapes of histograms (not just
bell shaped).
The proportion of observations in any sample that lie within k standard
deviations of the mean is at least:
For k=2 (say), the theorem states
that at least 3/4 of all observations
lie within 2 standard deviations of
the mean. This is a “lower bound”
compared to Empirical Rule’s
approximation (95%).
17
Jia-Ying Chen
4.17
Example 2

Determine the variance, standard deviation,
range, and the cv of the following sample.
9 15 11 31 23 13 15 17 21
18
Jia-Ying Chen
Solution
x

x
155

 17.2222
n
9
2

 
(
x
)

1552 
i
2
  xi 
 3041 

n
9

  46.4444
s2  
n 1
8
i
s  46.4444  6.8150
Range  31  9  22
cv 
19
6.8150
 0.3957
17.2222
Jia-Ying Chen
Example 3
20

班上同學的第一次考試成績平均為60分,標準
差為11分,全班共有100人,成績分配不為常
態分配,請問班上至少有多少人位於82分和38
分之間?

Solution
Jia-Ying Chen
Example 4

王先生最近打算將手中的閒錢用來購買股票,在要求
的投資報酬率下,初步篩選可投資股票,王先生決定
由A、B、C三家上市公司的股票中擇一購買。由最近
30天的收盤資料得三股票的收盤平均值與標準差分別
為: (單位:元)
X A  52, X B  232, X C  176, S A  5, S B  24, SC  16

21
若王先生是一個保守的投資人,則他會購買何家公司
的股票?又他所依據的決策準則為何?
Jia-Ying Chen
Solution

追求相對風險,即追求三者中Risk最小者
5
CVA 
 9.62%
52
24
CVB 
 10.34%
232
16
CVC 
 9.09%
176

22
因為王先生為一個保守的投資人所以他追求為相
對風險最小,所以要用CV來衡量
Jia-Ying Chen
Measures of Relative Standing
and Box Plots

Percentile

The pth percentile of a set of measurements is the value
for which
p percent of the observations are less than that value
 100(1-p) percent of all the observations are greater than that
value.


Example

Suppose your score is the 60% percentile of a SAT test. Then
60% of all the scores lie here
23
Your score
40%
Jia-Ying Chen
Quartiles

Commonly used percentiles





24
First (lower)decile
First (lower) quartile, Q1,
Second (middle)quartile,Q2,
Third quartile, Q3,
Ninth (upper)decile
= 10th percentile
= 25th percentile
= 50th percentile
= 75th percentile
= 90th percentile
Jia-Ying Chen
Location of Percentiles

Find the location of any percentile using the
formula
P
LP  (n  1)
100
w hereLP is the location of the P th percentile
25
Jia-Ying Chen
Example 5

Determine the first, second, and third quartiles of
the following data
10.5 14.7 15.3 17.7 15.9 12.2 10.0 14.1 13.9
18.5 13.9 15.1 14.7
26
Jia-Ying Chen
Solution

排序後數列




27
10.0 10.5 12.2 13.9 13.9 14.1 14.7 14.7 15.1 15.3 15.9
17.7 18.5
First quartile: L25=(13+1)*25/100=3.5; the first
quartile is 13.05
Second quartile: L50=(13+1)*50/100=7; the first
quartile is 14.7
Third quartile: L75=(13+1)*75/100=10.5; the first
quartile is 15.6
Jia-Ying Chen
Interquartile Range


This is a measure of the spread of the middle 50%
of the observations
Large value indicates a large spread of the
observations
Interquartile range = Q3 – Q1
28
Jia-Ying Chen
Box Plot

This is a pictorial display that provides the main
descriptive measures of the data set:





L - the largest observation
Q3 - The upper quartile
Q2 - The median
Q1 - The lower quartile
S - The smallest observation
1.5(Q3 – Q1)
S
29
Whisker
1.5(Q3 – Q1)
Q1
Q2 Q 3
Whisker
L
Jia-Ying Chen
Measures of Linear Relationship…




30
We now present two numerical measures of linear relationship
that provide information as to the strength & direction of a
linear relationship between two variables (if one exists).
They are the covariance and the coefficient of correlation.
Covariance - is there any pattern to the way two variables
move together?
Coefficient of correlation - how strong is the linear
relationship between two variables?
Jia-Ying Chen
Covariance…
population mean of variable X, variable Y
sample mean of variable X, variable Y
Note: divisor is n-1, not n as you may expect.
31
Jia-Ying Chen
Covariance…

32
In much the same way there was a “shortcut” for
calculating sample variance without having to
calculate the sample mean, there is also a shortcut
for calculating sample covariance without having
to first calculate the mean:
Jia-Ying Chen
Covariance… (Generally speaking)
When
two variables move in the same direction (both
increase or both decrease), the covariance will be a large
positive number.
When
two variables move in opposite directions, the
covariance is a large negative number.
When
there is no particular pattern, the covariance is a
small number.
33
Jia-Ying Chen
Coefficient of Correlation…

The coefficient of correlation is defined as the
covariance divided by the standard deviations of the
variables:
Greek letter
“rho”
This coefficient answers the question:
How strong is the association between X and Y?
34
Jia-Ying Chen
Coefficient of Correlation…
The
advantage of the coefficient of correlation over
covariance is that it has fixed range from -1 to +1, thus:
If the two variables are very strongly positively related, the
coefficient value is close to +1 (strong positive linear
relationship).
If the two variables are very strongly negatively related, the
coefficient value is close to -1 (strong negative linear
relationship).
No straight line relationship is indicated by a coefficient
close to zero.
35
Jia-Ying Chen
Coefficient of Correlation…
+1 Strong positive linear relationship
r or r =
0
No linear relationship
-1 Strong negative linear relationship
36
Jia-Ying Chen
Example 6


37
A retailer wanted to estimate the monthly fixed and variable
selling expenses. As a first step she collected data from the
past 8 months. The total selling expenses (in $thousands) and
the total sales (in $thousands were recorded and listed below.
Total sales
Selling Expenses
20
14
40
16
60
18
50
17
50
18
55
18
60
18
70
20
Compute the covariance and the coefficient of correlation
Jia-Ying Chen
Solution
38
Jia-Ying Chen
Solution
39
Jia-Ying Chen
鑑往知來

40
樣本為96年學年度修郭老師統計課的考試成績
Jia-Ying Chen