Transcript
```Lecture 4
DESCRIPTIVE STATISTICS:
Numerical summaries
Chap 4 (Keller)
Outline
• Measures of center:
- Mean, median, mode
- Selection of measures of location
• Measures of dispersion:
- Range, quartile range, quartile deviation, variance, standard deviation
• Empirical rule (general case: Chebyshev's law)
• Coefficient of variation
• Coefficient of skewness
Measures of center
• A measure of center or location shows where the center of the data is
• Three most useful measures of location:
- Arithmetic mean/average
- Median
- Mode
Arithmetic mean from raw data
• Arithmetic mean from population:
\mu = \frac{1}{N} \sum_{i=1}^{N} X_i
• Arithmetic mean from sample:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Where:
X_i, x_i - the value of each item
N, n - total number of items
Arithmetic mean from frequency table
• Apply this formula for the sample:
\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}
Where: x_i - the value of class i
f_i - frequency of class i
Advantages and disadvantages of the mean
– Easy to understand and calculate
– The value of every item is included => representative of the whole data set
– Sensitive to outliers
Mean is sensitive to outliers
• Sample: (43; 38; 37; ...; 27; 34) => \bar{x} = 33.5
• Contaminated sample: (43; 38; 37; ...; 27; 1934) => \bar{x} = 71.5
Median
• Median is the value of the observation which is located in the middle of the data set
• Steps to find median:
1. Arrange the observations in order of size (normally ascending order)
2. Find the number of observations and hence the middle observation
3. The median is the value of the middle observation
Calculate median from raw data
• If the data has an odd number of observations:
– Middle observation: the \frac{n+1}{2}-th observation
– Median = x_{(n+1)/2}
• If the data has an even number of observations:
– There are two observations located in the middle, and the median is their average
– Median = (x_{n/2} + x_{n/2+1}) / 2
Example
• E.g. 1. Raw data: 11, 11, 13, 14, 17 => find median
• E.g. 2. Raw data: 11, 11, 13, 14, 16, 17 => find median
Advantages and disadvantages of the median
– Easy to understand and calculate
– Not affected by outlying values => can be used when the mean would be misleading
– Value of one observation => fails to reflect the whole data set
– Not easy to use in other analysis
Mode
• Mode is the value which occurs most frequently in the data set
• Steps to find mode
1. Draw a frequency table for the data
2. Identify the mode as the most frequent value
Example to calculate mode
Value    Frequency
8        3
12       7
16       12
17       8
19       5
Mean, median and mode in normal and skewed distributions
Bimodal and multimodal data
• Bimodal (two modes)
• Multimodal (several modes)
Which measure of centre is best?
• Mean is generally the most commonly used
• Mean is sensitive to extreme values
• If data are skewed or extreme values are present, the median is better, e.g. real estate prices
• Mode is generally best for categorical data, e.g. restaurant service quality (below): the mode is "Very good" (ordinal data)
Rating          # customers
Excellent       20
Very good       50
Good            30
Satisfactory    12
Poor            10
Very Poor       6
Measures of dispersion (variability)
• Measures of dispersion tell you how spread out the other values of the distribution are from the central tendency
• Measures of dispersion:
– The range, quartile range, and quartile deviation
– Variance and standard deviation
Why do we need measures of dispersion?
• Two data sets of midterm marks of 5 students:
– First set: 100, 40, 40, 35, 35 => Mean: 50
– Second set: 70, 55, 50, 40, 35 => Mean: 50
=> Which mean (first or second) is more reliable?
• Need to know the spread of the other values around the central tendency; this is especially important in analysing the stock market.
Range
• Range is the difference between the largest and smallest value => Sort data before computing range
• Formula: Range = maximum value - minimum value
• Advantages of range: easy to calculate for ungrouped data
• Disadvantages:
– Takes into account only two values
– Affected by one or two extreme values
– More difficult to calculate for grouped data
Quartiles
• Quartiles are defined as the values of the observations which are a quarter of the way through the data
– Q1 - the first quartile: the value below which 25% of observations fall
– Q2 - the second quartile: the median (50% of the observations fall below)
– Q3 - the third quartile: the value below which 75% of observations fall
Quartile range and quartile deviation
• Quartile range = Q3 − Q1
• Quartile deviation = (Q3 − Q1) / 2
• Advantages of quartile deviation (semi-interquartile range): less affected by extreme values
• Disadvantages: takes into account only 50% of the data
Variance
• Variance from population:
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
• Variance from sample:
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
• Takes into account all values
• Disadvantages: the unit of variance (the squared unit of the data) has no meaning
Standard deviation (σ)
• Standard deviation (S.D) is the square root of variance
• S.D from population:
\sigma = \sqrt{\sigma^2}
• S.D from sample:
s = \sqrt{s^2}
• Easy to interpret the result
• Overcomes the disadvantage of the meaningless unit of variance
• The most widely used measure of dispersion (the bigger its value => the more spread out are the data)
Application of this in finance
• Variance (or S.D) of an investment can be used as a measure of risk, e.g. on profits/return.
• Larger variance => larger risk
• Usually, higher rate of return, higher risk
Example – 2 funds over 10 years (1)
• Rates of return (%)
A    8.3   -6.2   20.9   -2.7   33.6   42.9   24.4    5.2    3.1   30.5
B   12.1   -2.8    6.4   12.2   27.8   25.3   18.2   10.7   -1.3   11.4
\bar{x}_A = 16%
\bar{x}_B = 12%
s_A^2 = 280.34 (%)^2
s_B^2 = 99.37 (%)^2
• Which fund would you invest in?
Example – 2 funds over 10 years (2)
• Depending on how risk-averse you are:
– Fund A: higher risk, but also higher average rate of return.
– Fund B: lower risk, but also lower average rate of return.
Empirical rules or the law of 3σ
• For a normal (bell-shaped, symmetrical) distribution:
– About 68.27% of all obs fall within 1 standard deviation of the mean, i.e. in the range:
(\bar{x} − 1s) ↔ (\bar{x} + 1s)
– About 95.45% of all obs fall within 2 standard deviations of the mean, i.e. in the range:
(\bar{x} − 2s) ↔ (\bar{x} + 2s)
– About 99.73% of all obs fall within 3 standard deviations of the mean, i.e. in the range:
(\bar{x} − 3s) ↔ (\bar{x} + 3s)
Meaning of the law of 3σ
• Convert z-score to probability (next lecture)
• Identify outliers
Boxplot
Here is the boxplot of height of international students studying at UNSW
[Figure: Boxplot of Height. Vertical axis: Height, 150 to 200. Labelled parts, top to bottom: whisker, upper quartile, median, lower quartile, whisker; the box spans the lower to upper quartile.]
Boxplots
• Need MEDIAN and QUARTILES to create a boxplot
• MEDIAN = middle of observations, i.e. ½ way through observations
• QUARTILES = mark quarter points of observations, i.e. ¼ (Q1) and ¾ (Q3) of the way through data [positions (n+1)/4 and 3(n+1)/4]
• INTERQUARTILE RANGE (IQR) = Q3 − Q1
• Whiskers: max length is 1.5*IQR; stretch from box to furthest data point (within this range)
• Points further out from box marked with stars; called outliers
Shapes of Boxplots
• Skewness/symmetry
• Modality
• Range
[Figure: Boxplot of Symmetric, Positive skew, Negative skew, Bimodal. Vertical axis: Data, -5.0 to 5.0.]
Coefficient of skewness (C of S)
• This measures the shape of the distribution
• There are several measures of skewness.
• Below is a common one: Pearson's coefficient of skewness.
Coefficient of skewness = 3 x (mean − median) / standard deviation
• If C of S is nearly +1 or -1, the distribution is highly skewed
• If C of S is positive => distribution is skewed to the right (positive skew)
• If C of S is negative => distribution is skewed to the left (negative skew)
Activity 1
• Summary statistics of two data sets are as follows
                      Set 1: Ages of students    Set 2: Wages of staff
                      studying at UNSW
Mean                  22.4839                    294.3
Median                21                         292.5
Standard deviation    6.3756                     125.93
Compute Pearson's coefficient of skewness for these data sets and describe the shapes of their distributions
Distribution shapes
[Figure: two histograms. Left: Frequency against age, skewed to the right. Right: Frequency against wages, nearly normal.]
Investigating the relationship between variables
• Methods:
– Table: Cross-table
– Charts:
o Multiple bar chart
o Scatterplot (mentioned in lecture 8)
Cross-table
• A cross-table is used to investigate the relationship between two categorical variables, or discrete variables with few values.
• EX: use gss.sav data file to explore the relationship between internet use and degree
• Note:
– Need to identify dependent and independent variables.
– Know how to calculate row and column percentages
– Rule of thumb: independent variable in row and dependent variable in column
Multiple bar chart
• We can use a multiple bar chart to explore the relationship between variables.
• The skill is to know how to draw the chart
• EX: use gss.sav data file to explore the relationship between internet use, age, and degree
```
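The worked numbers in the transcript (the midterm marks, the two median examples, the two funds, and Activity 1) can be checked with a short Python sketch. It is only an illustration, not part of the lecture: it assumes the standard-library `statistics` module, and the helper names `summarize` and `pearson_skewness` are made up for this example. The variance used is the sample variance (dividing by n − 1), matching the sample formula in the slides.

```python
import statistics

# Data copied from the transcript.
marks_1 = [100, 40, 40, 35, 35]   # first set of midterm marks
marks_2 = [70, 55, 50, 40, 35]    # second set of midterm marks
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]


def summarize(values):
    """Hypothetical helper: numerical summaries of a sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance, divides by n - 1
        "std_dev": statistics.stdev(values),      # square root of the sample variance
    }


def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness: 3 x (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


# Same mean, different spread: why a measure of dispersion is needed.
summary_marks_1 = summarize(marks_1)
summary_marks_2 = summarize(marks_2)

# Median with an odd and an even number of observations.
median_odd = statistics.median([11, 11, 13, 14, 17])
median_even = statistics.median([11, 11, 13, 14, 16, 17])

# The two funds: mean return and variance as a measure of risk.
summary_a = summarize(fund_a)
summary_b = summarize(fund_b)

# Activity 1, using the summary statistics given in the transcript.
skew_ages = pearson_skewness(22.4839, 21, 6.3756)
skew_wages = pearson_skewness(294.3, 292.5, 125.93)

for name, summary in [("Fund A", summary_a), ("Fund B", summary_b)]:
    print(name, {key: round(value, 2) for key, value in summary.items()})
print("Skewness (ages, wages):", round(skew_ages, 2), round(skew_wages, 2))
```

Run as written, this should reproduce the means of 16% and 12% and the variances of 280.34 and 99.37 quoted for funds A and B. Both sets of marks have mean 50, but the first set has a much larger variance, which is the point of the dispersion slides. The skewness for ages comes out clearly positive and the one for wages close to zero, in line with the "skewed to the right" and "nearly normal" labels on the distribution shapes slide.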