Chapter 2 Describing Data: Graphs and Tables

Describing Data
Descriptive Statistics:
Central Tendency and Variation
Lecture Objectives
You should be able to:
1. Compute and interpret appropriate measures of centrality and variation.
2. Recognize distributions of data.
3. Apply properties of normally distributed data based on the mean and variance.
4. Compute and interpret covariance and correlation.
Summary Measures
1. Measures of Central Location
Mean, Median, Mode
2. Measures of Variation
Range, Percentile, Variance, Standard Deviation
3. Measures of Association
Covariance, Correlation
Measures of Central Location:
The Arithmetic Mean
It is the Arithmetic Average of data values:
Sample Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
The Most Common Measure of Central Tendency
Affected by Extreme Values (Outliers)
[Figure: data plotted on an axis from 0 to 10. Mean = 5]
[Figure: data plotted on an axis from 0 to 14, including an extreme value. Mean = 6]
Median
Important Measure of Central Tendency
In an ordered array, the median is the “middle” number.
If n is odd, the median is the middle number.
If n is even, the median is the average of the 2 middle
numbers.
Not Affected by Extreme Values
[Figure: data plotted on an axis from 0 to 10. Median = 5]
[Figure: data plotted on an axis from 0 to 14, including an extreme value. Median = 5]
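The contrast between the mean and the median can be sketched in a few lines of Python. The two data sets below are hypothetical (the plotted values are not recoverable from the slides); they were chosen only so that the results match the figures quoted above.

```python
from statistics import mean, median

# Hypothetical data, assumed for illustration only.
base = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 14]  # largest value replaced by an extreme one

mean_base, median_base = mean(base), median(base)
mean_out, median_out = mean(with_outlier), median(with_outlier)

# The mean moves when the extreme value appears; the median does not.
print("mean:", mean_base, "->", mean_out)
print("median:", median_base, "->", median_out)
```

With these assumed values the mean shifts from 5 to 6 while the median stays at 5.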
Mode
A Measure of Central Tendency
Value that Occurs Most Often
Not Affected by Extreme Values
There May Not be a Mode
There May be Several Modes
Used for Either Numerical or Categorical Data
[Figure: data plotted on an axis from 0 to 14. Mode = 9]
[Figure: data plotted on an axis from 0 to 6. No Mode]
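A small sketch of the mode properties listed above (none, one, or several modes; numerical or categorical data). The helper name `modes` and all sample values are assumptions made for illustration.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs exactly once: no mode
    return [value for value, count in counts.items() if count == top]

# Hypothetical data, assumed for illustration only.
one_mode = modes([1, 3, 3, 5, 9, 9, 9, 12])
two_modes = modes([2, 2, 5, 5, 8])
categorical = modes(["red", "blue", "red", "green"])
no_mode = modes([0, 1, 2, 3])

print(one_mode, two_modes, categorical, no_mode)
```

Treating "all values occur once" as "no mode" is a convention that matches the slide; other conventions exist.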
Measures of Variability
Range
The simplest measure
Percentile
Used with Median
Variance/Standard Deviation
Used with the Mean
Range
Difference Between Largest & Smallest Observations:
$$\text{Range} = x_{\text{Largest}} - x_{\text{Smallest}}$$
Ignores How Data Are Distributed:
[Figure: two differently distributed data sets plotted on an axis from 7 to 12. Both have Range = 12 - 7 = 5]
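To see why the range ignores how data are distributed, the sketch below compares two hypothetical data sets (assumed values, not the ones plotted on the slide) that share the same range but differ in spread.

```python
from statistics import stdev

def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

# Hypothetical data, assumed for illustration only.
evenly_spread = [7, 8, 9, 10, 11, 12]
clustered = [7, 10, 10, 10, 10, 12]

range_a = value_range(evenly_spread)
range_b = value_range(clustered)

# Same range, but the standard deviations differ.
print(range_a, round(stdev(evenly_spread), 2))
print(range_b, round(stdev(clustered), 2))
```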
Percentile
2008 Olympic Medal Tally for top 55 nations. What is the percentile
score for a country with 9 medals? What is the 50th percentile?
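| Obs | Medals | Obs | Medals | Obs | Medals | Obs | Medals | Obs | Medals |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 110 | 12 | 24 | 23 | 10 | 34 | 6 | 45 | 3 |
| 2 | 100 | 13 | 19 | 24 | 9 | 35 | 6 | 46 | 3 |
| 3 | 72 | 14 | 18 | 25 | 8 | 36 | 6 | 47 | 2 |
| 4 | 47 | 15 | 18 | 26 | 8 | 37 | 5 | 48 | 2 |
| 5 | 46 | 16 | 16 | 27 | 7 | 38 | 5 | 49 | 2 |
| 6 | 41 | 17 | 15 | 28 | 7 | 39 | 5 | 50 | 2 |
| 7 | 40 | 18 | 14 | 29 | 7 | 40 | 4 | 51 | 2 |
| 8 | 31 | 19 | 13 | 30 | 6 | 41 | 4 | 52 | 1 |
| 9 | 28 | 20 | 11 | 31 | 6 | 42 | 4 | 53 | 1 |
| 10 | 27 | 21 | 10 | 32 | 6 | 43 | 4 | 54 | 1 |
| 11 | 25 | 22 | 10 | 33 | 6 | 44 | 3 | 55 | 1 |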
Percentile - solutions
Order all data (ascending or descending).
1. Country with 9 medals ranks 24th out of 55. There
are 31 nations (56.36%) below it and 23 nations
(41.82%) above it. Hence it can be considered a 57th
or 58th percentile score.
2. The medal tally that corresponds to a 50th
percentile is the one in the middle of the group, or
the 28th country, with 7 medals. Hence the 50th
percentile (Median) is 7.
Now compute the first and third quartile values.
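The percentile reasoning can be checked with a short Python sketch using the 55 medal counts from the table. The quartile step relies on `statistics.quantiles` with its default "exclusive" method; that is one convention among several, so other tools may report slightly different quartiles.

```python
from statistics import median, quantiles

# Medal counts for observations 1 to 55, copied from the table.
medals = [110, 100, 72, 47, 46, 41, 40, 31, 28, 27, 25,
          24, 19, 18, 18, 16, 15, 14, 13, 11, 10, 10,
          10, 9, 8, 8, 7, 7, 7, 6, 6, 6, 6,
          6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3,
          3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1]

n = len(medals)
below = sum(1 for m in medals if m < 9)   # nations with fewer than 9 medals
above = sum(1 for m in medals if m > 9)   # nations with more than 9 medals
pct_below = 100 * below / n
pct_above = 100 * above / n

med = median(medals)                      # 50th percentile
q1, q2, q3 = quantiles(medals, n=4)       # default method: "exclusive"

print(below, round(pct_below, 2), above, round(pct_above, 2))
print(med, q1, q3)
```

Under this convention the first and third quartiles fall on the 14th and 42nd ordered values.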
Box Plot
The box plot shows 5 points, plus any outliers, as follows:
Smallest = 20
Q1 = 40
Median = 50
Q3 = 60
Largest (excluding outliers) = 80
Outlier = 105
Interquartile Range (IQR) = [Q3 – Q1] = 60-40 = 20
1 Step = [1.5 * IQR] = 1.5*20 = 30
Q1 – 30 = 40 - 30 = 10
Q3 + 30 = 60 + 30 = 90
Any point outside the limits (10, 90) is considered an outlier.
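The outlier limits can be reproduced with a minimal sketch. It takes the quartiles and plotted points from the slide as given; the variable names are illustrative.

```python
# Values taken from the box plot on the slide.
q1, q3 = 40, 60
points = [20, 40, 50, 60, 80, 105]

iqr = q3 - q1                 # interquartile range
step = 1.5 * iqr              # one "step"
lower_limit = q1 - step
upper_limit = q3 + step

# Any point outside the limits is flagged as an outlier.
outliers = [p for p in points if p < lower_limit or p > upper_limit]

print(iqr, step, (lower_limit, upper_limit), outliers)
```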
Variance
For the Population:
$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$
For the Sample:
$$s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$$
Variance is in squared units, and can be difficult to interpret. For
instance, if data are in dollars, variance is in “squared dollars”.
Standard Deviation
For the Population:
$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
For the Sample:
$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
Standard deviation is the square root of the variance.
Computing Standard Deviation
Computing Sample Variance and Standard Deviation
Mean of X = 6
| X | Deviation From Mean | Squared |
|---|---|---|
| 3 | -3 | 9 |
| 4 | -2 | 4 |
| 6 | 0 | 0 |
| 8 | 2 | 4 |
| 9 | 3 | 9 |
Sum of Squares (SS) = 26
Variance = SS/(n - 1) = 6.50
Stdev = Sqrt(Variance) = 2.55
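The same calculation, written out step by step in Python as a sketch. It uses the five X values from the worked example and the sample (n - 1) formula.

```python
from math import sqrt

x = [3, 4, 6, 8, 9]           # X values from the worked example
n = len(x)

mean_x = sum(x) / n
deviations = [xi - mean_x for xi in x]
sum_of_squares = sum(d ** 2 for d in deviations)

variance = sum_of_squares / (n - 1)   # sample variance
std_dev = sqrt(variance)              # sample standard deviation

print(mean_x, sum_of_squares, variance, round(std_dev, 2))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and should agree with these values.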
The Normal Distribution
A property of normally distributed data is as follows:
| Distance from Mean | Percent of observations included in that range |
|---|---|
| ± 1 standard deviation | Approximately 68% |
| ± 2 standard deviations | Approximately 95% |
| ± 3 standard deviations | Approximately 99.73% |
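These coverage figures can be checked against the standard normal distribution. The sketch below uses the identity P(|Z| < k) = erf(k / sqrt(2)); the function name `share_within` is an assumption made for illustration.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

coverage = {k: share_within(k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {100 * share:.2f}%")
```

To two decimals the shares come out near 68.27%, 95.45%, and 99.73%.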
Comparing Standard Deviations
[Figures: three data sets plotted on an axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57
Outliers
Typically, a number beyond a certain number of
standard deviations from the mean is considered an outlier.
In many cases, a number beyond 3 standard
deviations (about 0.27% chance of occurring) is
considered an outlier.
If identifying an outlier is more critical, one can make
the rule more stringent, and consider 2 standard
deviations as the limit.
Coefficient of Variation
Standard deviation relative to the mean.
Helps compare deviations for samples with
different means
$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$$
Computing CV
Stock A: Average Price last year = $50
Standard Deviation = $5
Stock B: Average Price last year = $100
Standard Deviation = $5
Coefficient of Variation:
Stock A: CV = 10%
Stock B: CV = 5%
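A minimal sketch of the CV calculation for the two stocks. The function name is illustrative; the inputs are the figures quoted above.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation relative to the mean, as a percentage."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"Stock A: CV = {cv_a:.0f}%")
print(f"Stock B: CV = {cv_b:.0f}%")
```

Both stocks have the same standard deviation in dollars, but Stock A varies twice as much relative to its average price.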
Standardizing Data
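| Obs | Age | Income | Z-Age | Z-Income |
|---|---|---|---|---|
| 1 | 25 | 25000 | -1.05 | -1.13 |
| 2 | 28 | 52000 | -0.86 | -0.63 |
| 3 | 35 | 63000 | -0.41 | -0.43 |
| 4 | 36 | 74000 | -0.34 | -0.22 |
| 5 | 39 | 69000 | -0.15 | -0.31 |
| 6 | 45 | 80000 | 0.23 | -0.11 |
| 7 | 48 | 125000 | 0.42 | 0.72 |
| 8 | 75 | 200000 | 2.15 | 2.11 |
| Mean | 41.38 | 86000.00 | | |
| Std Dev | 15.63 | 53973.54 | | |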
Which of the two numbers for person 8 is farther from the
mean? The age of 75 or the income of 200,000?
Z scores tell us the distance from the mean, measured in
standard deviations. Here the age (z = 2.15) is slightly
farther from its mean than the income (z = 2.11).
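The z scores in the table can be reproduced with a short sketch. It assumes the sample standard deviation (n - 1 divisor), which matches the Std Dev row; the helper name `z_scores` is illustrative.

```python
from statistics import mean, stdev

# Age and income columns from the table.
ages = [25, 28, 35, 36, 39, 45, 48, 75]
incomes = [25000, 52000, 63000, 74000, 69000, 80000, 125000, 200000]

def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

z_age = z_scores(ages)
z_income = z_scores(incomes)

# Person 8 is the last observation in each list.
print(round(z_age[-1], 2), round(z_income[-1], 2))
```

Because both columns are expressed in standard deviations, an age and an income can be compared directly even though their units differ.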
Measures of Association
Covariance and Correlation
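| X | Dev X | Y | Dev Y | Product |
|---|---|---|---|---|
| 1 | -1 | 6 | -3 | 3 |
| 2 | 0 | 8 | -1 | 0 |
| 3 | 1 | 13 | 4 | 4 |
Mean of X = 2, Mean of Y = 9
Stdev of X = 1, Stdev of Y = 3.6
Sum of Products = 7
Covariance = 3.5
Correlation = 0.97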
Covariance measures the average product of the
deviations of two variables from their means. For a
sample, the sum of the products is divided by n - 1
(here 7/2 = 3.5).
Correlation is the standardized form of covariance
(divided by the product of their standard deviations).
Correlation is always between -1 and +1.
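The covariance and correlation in the example can be recomputed with a sketch that follows the definitions directly. It uses X = 1, 2, 3 and Y = 6, 8, 13 from the worked example and the sample (n - 1) divisor.

```python
from math import sqrt

x = [1, 2, 3]
y = [6, 8, 13]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

dev_x = [xi - mean_x for xi in x]
dev_y = [yi - mean_y for yi in y]
products = [dx * dy for dx, dy in zip(dev_x, dev_y)]

covariance = sum(products) / (n - 1)

# Sample standard deviations of each variable.
sd_x = sqrt(sum(d ** 2 for d in dev_x) / (n - 1))
sd_y = sqrt(sum(d ** 2 for d in dev_y) / (n - 1))

correlation = covariance / (sd_x * sd_y)

print(covariance, round(sd_x, 2), round(sd_y, 2), round(correlation, 2))
```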