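Things You Must Know Before the Lab Session
• Teaching assistant: 陳佳滎 (the last character is pronounced yíng)
• TA email: [email protected]
• Responsibilities
– Friday exercise sessions (these will not necessarily run until two o'clock)
– Each session reviews the key points of that week's lecture and works through practice problems
– Assigning homework
– Review sessions and key-point summaries before exams
Things You Must Know Before the Lab Session
• You may eat lunch and have drinks during class, and if you need to leave early you may do so on your own
• All slides will be uploaded to the instructor's web page
Contents of Statistics
• Descriptive statistics
– Studies how to simplify and present existing statistical data
– Graphical methods
– Numerical methods
• Probability distributions
• Inferential statistics
– The scientific method of using a sample drawn from a population to estimate, test, or predict unknown characteristics of that population
[Diagram: population → sampling → sample → inference → population]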
Chapter 4
Numerical Descriptive Techniques
Numerical Descriptive Techniques
• Measures of Central Location
– Mean, Median, Mode
• Measures of Variability
– Range, Standard Deviation, Variance,
Coefficient of Variation
• Measures of Relative Standing
– Percentiles, Quartiles
• Measures of Linear Relationship
– Covariance, Correlation, Determination, Least
Squares Line
The Arithmetic Mean
• This is the most popular and useful
measure of central location
$$\text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$
The Median
• The Median of a set of observations is the
value that falls in the middle when the
observations are arranged in order of
magnitude.
Sample and population medians are computed the same way.
Example
Find the median of the time on the internet for the 10 adults.
Even number of observations:
0, 0, 5, 7, 8, 9, 12, 14, 22, 33
The median is the average of the two middle values: (8 + 9)/2 = 8.5
Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations:
0, 0, 5, 7, 8, 9, 12, 14, 22
The median is the middle value, 8.
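The even and odd cases in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `median` and the variable names are assumptions made here, not part of the slides.

```python
def median(values):
    """Return the middle value of the sorted data.

    For an even number of observations, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Internet times for the 10 adults, then the same data without the longest time (33).
internet_times_10 = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
internet_times_9 = internet_times_10[:-1]

print(median(internet_times_10))  # 8.5
print(median(internet_times_9))   # 8
```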
The Mode
• The Mode of a set of observations is the value
that occurs most frequently.
• A set of data may have one mode (or modal class), or two or more modes.
The modal class: for large data sets the modal class is much more relevant than a single-value mode.
Example 1
• The times (to the nearest minute) that a
sample of 9 bank customers waited in line
were recorded and are listed here.
7 4 0 2 7 3 1 9 12
• Determine the mean, median, and mode
for these data.
Solution
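One way to check Example 1 is with Python's standard `statistics` module. This is a sketch rather than the solution shown in class; the variable names are chosen here for illustration.

```python
import statistics

# Waiting times (minutes) of the 9 bank customers in Example 1.
waiting_times = [7, 4, 0, 2, 7, 3, 1, 9, 12]

mean = statistics.mean(waiting_times)        # sum 45 divided by 9 observations
median = statistics.median(waiting_times)    # middle of 0 1 2 3 4 7 7 9 12
modes = statistics.multimode(waiting_times)  # every value with the highest count

print(mean)    # 5
print(median)  # 4
print(modes)   # [7]
```

Here the mean (5) is larger than the median (4) because the single long wait of 12 minutes pulls the mean upward.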
Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide
• If a distribution is not symmetrical, and skewed to the left or to the right, the three measures differ.
A positively skewed distribution (“skewed to the right”): Mode < Median < Mean
A negatively skewed distribution (“skewed to the left”): Mean < Median < Mode
The range
– The range of a set of observations is the difference
between the largest and smallest observations.
– Its major advantage is the ease with which it can be
computed.
– Its major shortcoming is its failure to provide
information on the dispersion of the observations
between the two end points.
But how do all the observations spread out between the smallest observation and the largest observation? The range cannot assist in answering this question.
Variance…
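• The variance of a population is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where $\mu$ is the population mean and $N$ is the population size.
• The variance of a sample is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where $\bar{x}$ is the sample mean. Note! The denominator is the sample size (n) minus one!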
Variance…
• As you can see, you have to calculate the
sample mean (x-bar) in order to calculate the
sample variance.
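• Alternatively, there is a shortcut formulation to calculate the sample variance directly from the data without the intermediate step of calculating the mean. It is given by:
$$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$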
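As a rough check that the definition and the shortcut agree, the sketch below computes the sample variance both ways for the waiting times from Example 1. The function names are assumptions made for this illustration.

```python
import math


def variance_definition(sample):
    """Sample variance from squared deviations about the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


def variance_shortcut(sample):
    """Sample variance from the sum and the sum of squares only."""
    n = len(sample)
    total = sum(sample)
    total_of_squares = sum(x * x for x in sample)
    return (total_of_squares - total ** 2 / n) / (n - 1)


waiting_times = [7, 4, 0, 2, 7, 3, 1, 9, 12]
s2_definition = variance_definition(waiting_times)
s2_shortcut = variance_shortcut(waiting_times)

print(s2_definition, s2_shortcut)  # 16.0 16.0
print(math.sqrt(s2_shortcut))      # 4.0 (sample standard deviation)
```

For these data the sum is 45 and the sum of squares is 353, so the shortcut gives (353 - 45²/9)/8 = 128/8 = 16.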
Coefficient of Variation…
• The coefficient of variation of a set of
observations is the standard deviation of the
observations divided by their mean, that is:
• Population coefficient of variation: $CV = \sigma / \mu$
• Sample coefficient of variation: $cv = s / \bar{x}$
The Empirical Rule…
For a bell-shaped histogram:
Approximately 68% of all observations fall
within one standard deviation of the mean.
Approximately 95% of all observations fall
within two standard deviations of the mean.
Approximately 99.7% of all observations fall
within three standard deviations of the mean.
Chebysheff’s Theorem…
A more general interpretation of the standard deviation is
derived from Chebysheff’s Theorem, which applies to all
shapes of histograms (not just bell shaped).
The proportion of observations in any sample that lie within k standard deviations of the mean is at least:
$$1 - \frac{1}{k^2} \quad \text{for } k > 1$$
For k = 2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a “lower bound” compared to the Empirical Rule’s approximation (95%).
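The bound can be compared with an actual sample. The sketch below uses the waiting times from Example 1; the function names are hypothetical and chosen only for this illustration.

```python
import statistics


def chebysheff_lower_bound(k):
    """Minimum proportion of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def proportion_within(sample, k):
    """Observed proportion of the sample within k sample standard deviations of the mean."""
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)
    inside = [x for x in sample if abs(x - mean) <= k * sd]
    return len(inside) / len(sample)


waiting_times = [7, 4, 0, 2, 7, 3, 1, 9, 12]

print(chebysheff_lower_bound(2))            # 0.75
print(proportion_within(waiting_times, 2))  # 1.0
```

With a mean of 5 and a standard deviation of 4, every waiting time lies within two standard deviations (between -3 and 13), which is consistent with the guaranteed minimum of 3/4.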
Example 2
• Determine the variance, standard deviation,
range, and the cv of the following sample.
9 15 11 31 23 13 15 17 21
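Solution
• Mean: x̄ = 155/9 ≈ 17.22
• Variance: s² = 371.56/8 ≈ 46.44
• Standard deviation: s = √46.44 ≈ 6.82
• Range = 31 − 9 = 22
• cv = 6.82/17.22 ≈ 0.396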
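The quantities asked for in Example 2 can be reproduced with the standard `statistics` module, as in this sketch (variable names are illustrative):

```python
import statistics

sample = [9, 15, 11, 31, 23, 13, 15, 17, 21]

sample_range = max(sample) - min(sample)
mean = statistics.mean(sample)
variance = statistics.variance(sample)  # divides by n - 1
std_dev = statistics.stdev(sample)      # square root of the sample variance
cv = std_dev / mean                     # sample coefficient of variation

print(sample_range)        # 22
print(round(mean, 2))      # 17.22
print(round(variance, 2))  # 46.44
print(round(std_dev, 2))   # 6.82
print(round(cv, 3))        # 0.396
```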
Measures of Relative Standing
and Box Plots
• Percentile
– The pth percentile of a set of measurements is the value for which
• p percent of the observations are less than that value
• (100 − p) percent of all the observations are greater than that value.
– Example
• Suppose your score is the 60th percentile of an SAT test. Then 60% of all the scores lie below your score and 40% lie above it.
Quartiles
• Commonly used percentiles
– First (lower) decile = 10th percentile
– First (lower) quartile, Q1 = 25th percentile
– Second (middle) quartile, Q2 = 50th percentile
– Third quartile, Q3 = 75th percentile
– Ninth (upper) decile = 90th percentile
Location of Percentiles
• Find the location of any percentile using the formula
$$L_P = (n + 1)\frac{P}{100}$$
where $L_P$ is the location of the Pth percentile.
Example 3 (Textbook 4.40)
• Determine the first, second, and third
quartiles of the following data
10.5 14.7 15.3 17.7 15.9 12.2 10.0
14.1 13.9 18.5 13.9 15.1 14.7
Solution
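A possible way to work Example 3 is to apply the location formula directly and interpolate linearly when the location falls between two ordered observations. The function `percentile` below is a sketch written for this example, and linear interpolation between neighbours is an assumption about how fractional locations are handled.

```python
def percentile(values, p):
    """Value at location (n + 1) * p / 100 in the sorted data.

    A fractional location is resolved by linear interpolation between
    the two neighbouring ordered observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100
    location = min(max(location, 1), n)  # keep the location inside 1..n
    lower = int(location)                # position of the observation just below
    fraction = location - lower
    if lower == n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = [10.5, 14.7, 15.3, 17.7, 15.9, 12.2, 10.0,
        14.1, 13.9, 18.5, 13.9, 15.1, 14.7]

q1 = percentile(data, 25)  # location 3.5: halfway between 12.2 and 13.9
q2 = percentile(data, 50)  # location 7: the 7th ordered value
q3 = percentile(data, 75)  # location 10.5: halfway between 15.3 and 15.9

print(round(q1, 2), round(q2, 2), round(q3, 2))  # 13.05 14.7 15.6
```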
Example 4 (Textbook 4.38)
• Find the third and eighth deciles (30th and
80th percentiles) of the following data set
26 23 29 31 24 22 15 31 30 20
Solution
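For Example 4, one option is the standard library's `statistics.quantiles`. Its `"exclusive"` method, as far as the Python documentation describes it, places the cut points on an (n + 1) basis, which should match the location formula used in these slides. Treat this as a cross-check rather than the textbook's worked solution.

```python
import statistics

data = [26, 23, 29, 31, 24, 22, 15, 31, 30, 20]

# n=10 asks for the nine cut points that split the data into deciles.
deciles = statistics.quantiles(data, n=10, method="exclusive")

third_decile = deciles[2]   # 30th percentile, location (10 + 1) * 30 / 100 = 3.3
eighth_decile = deciles[7]  # 80th percentile, location (10 + 1) * 80 / 100 = 8.8

print(round(third_decile, 2))   # 22.3
print(round(eighth_decile, 2))  # 30.8
```

The sorted data are 15, 20, 22, 23, 24, 26, 29, 30, 31, 31, so location 3.3 lies 0.3 of the way from 22 to 23, and location 8.8 lies 0.8 of the way from 30 to 31.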
Interquartile Range
• This is a measure of the spread of the
middle 50% of the observations
• Large value indicates a large spread of the
observations
Interquartile range = Q3 – Q1
Box Plot
– This is a pictorial display that provides the main descriptive measures of the data set:
• L - the largest observation
• Q3 - The upper quartile
• Q2 - The median
• Q1 - The lower quartile
• S - The smallest observation
[Figure: box plot. The box runs from Q1 to Q3 with a line at Q2; a whisker extends from each end of the box toward S and L, each with a maximum length of 1.5(Q3 – Q1).]
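The numbers behind a box plot can be sketched as follows, using the data from Example 3. The 1.5(Q3 – Q1) whisker limit comes from the figure; treating observations beyond those limits as outliers is a common convention assumed here, and the variable names are illustrative.

```python
import statistics

data = [10.5, 14.7, 15.3, 17.7, 15.9, 12.2, 10.0,
        14.1, 13.9, 18.5, 13.9, 15.1, 14.7]

smallest, largest = min(data), max(data)
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Whiskers may extend at most 1.5 * IQR beyond the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(smallest, round(q1, 2), round(q2, 2), round(q3, 2), largest)
print(round(iqr, 2), round(lower_limit, 3), round(upper_limit, 3))
print(outliers)
```

For these data the interquartile range is about 2.55, so the whisker limits are roughly 9.2 and 19.4, and every observation falls inside them.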
Measures of Linear Relationship…
• We now present two numerical measures of linear
relationship that provide information as to the
strength & direction of a linear relationship
between two variables (if one exists).
• They are the covariance and the coefficient of correlation.
– Covariance - is there any pattern to the way two variables move together?
– Coefficient of correlation - how strong is the linear relationship between two variables?
Covariance…
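• Population covariance:
$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
where $\mu_x$ and $\mu_y$ are the population means of variable X and variable Y.
• Sample covariance:
$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
where $\bar{x}$ and $\bar{y}$ are the sample means of variable X and variable Y.
Note: the divisor is n − 1, not n as you may expect.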
Covariance…
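• In much the same way there was a “shortcut” for calculating sample variance without having to calculate the sample mean, there is also a shortcut for calculating sample covariance without having to first calculate the means:
$$s_{xy} = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$$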
Covariance… (Generally speaking)
• When two variables move in the same direction (both increase or both decrease), the covariance will be a large positive number.
• When two variables move in opposite directions, the covariance is a large negative number.
• When there is no particular pattern, the covariance is a small number.
Coefficient of Correlation…
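• The coefficient of correlation is defined as the covariance divided by the standard deviations of the variables:
Population (ρ is the Greek letter “rho”):
$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Sample:
$$r = \frac{s_{xy}}{s_x s_y}$$
This coefficient answers the question:
How strong is the association between X and Y?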
Coefficient of Correlation…
• The advantage of the coefficient of correlation over covariance is that it has a fixed range from -1 to +1, thus:
• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
• No straight line relationship is indicated by a coefficient close to zero.
Coefficient of Correlation…
ρ or r =
+1 Strong positive linear relationship
0 No linear relationship
-1 Strong negative linear relationship
Example 5 (Textbook 4.58)
• Are the marks one receives in a course related to the
amount of time spent studying the subject? To analyze this
mysterious possibility, a student took a random sample of
10 students who had enrolled in an accounting class last
semester. She asked each to report his or her mark in the
course and the total number of hours spent studying
accounting. These data are listed here.
Time spent studying (hours): 40 42 37 47 25 44 41 48 35 28
Marks:                       77 63 79 86 51 78 83 90 65 47
• a. Calculate the covariance
• b. Calculate the coefficient of correlation
• c. Determine the least squares line
• d. What do the statistics calculated above tell you about
the relationship between marks and study time?
• e. Calculate the coefficient of determination
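Solution
• a. s_xy = (28,712 − (387)(719)/10)/9 = 886.7/9 ≈ 98.52
• b. With s_x ≈ 7.602 and s_y ≈ 14.708, r = s_xy/(s_x s_y) ≈ 0.881
• c. b1 = s_xy/s_x² = 98.52/57.79 ≈ 1.705 and b0 = ȳ − b1 x̄ = 71.9 − (1.705)(38.7) ≈ 5.92, so the least squares line is ŷ = 5.92 + 1.705x
• d. There is a strong positive linear relationship between marks and study time. On average, each additional hour of study is associated with about 1.7 more marks.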
• e. R² = r² = (0.8811)² = 0.7763
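The calculations for Example 5 can be sketched with the shortcut formulas. The least squares slope and intercept use the standard relations b1 = s_xy / s_x² and b0 = ȳ − b1·x̄, which the transcript does not show explicitly, so treat them as an assumption about the intended method. Variable names are illustrative.

```python
import math

hours = [40, 42, 37, 47, 25, 44, 41, 48, 35, 28]
marks = [77, 63, 79, 86, 51, 78, 83, 90, 65, 47]
n = len(hours)

sum_x, sum_y = sum(hours), sum(marks)
sum_xy = sum(x * y for x, y in zip(hours, marks))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in marks)

# a. Sample covariance (shortcut formula, divisor n - 1).
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)

# Sample variances of X and Y (shortcut formula).
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

# b. Coefficient of correlation.
r = cov_xy / math.sqrt(var_x * var_y)

# c. Least squares line: marks = b0 + b1 * hours.
b1 = cov_xy / var_x
b0 = sum_y / n - b1 * sum_x / n

# e. Coefficient of determination.
r_squared = r ** 2

print(round(cov_xy, 2))            # about 98.52
print(round(r, 3))                 # about 0.881
print(round(b0, 2), round(b1, 3))  # about 5.92 and 1.705
print(round(r_squared, 3))         # about 0.776
```

Read this way, roughly 78% of the variation in marks is explained by the variation in study time, which agrees with the R² reported in part e up to rounding.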