Data Summary Using Descriptive Measures
Sections 3.1 – 3.6, 3.8
Based on Introduction to Business Statistics
Kvanli / Pavur / Keeling
|»| Summary of Descriptive Measures
DESCRIPTIVE MEASURES
A single number computed from the sample data that provides information about the data.
An example of such a measure is the mean, which is the average of all the observations in a sample or a population.
Measures of Central Tendency: determine the center of the data values or possibly the most typical value.
 Mean: the average of the data values.
 Median: the value in the center of the ordered data values.
 Mode: the value that occurs more than once and the most often.
 Midrange: the average of the highest and the lowest values.
Measures of Variation: determine the spread of the data.
 Range: Range = H – L.
 Variance: the average of the sum of squared differences of the mean from individual values.
 Standard Deviation: the positive square root of the variance.
 Coefficient of Variation: the standard deviation in terms of the mean.
Measures of Position: indicate how a particular data point fits in with all the other data points.
 Percentile: P% of the data fall below the P-th percentile and (100 – P)% above it.
 Quartiles: the 25th, 50th, and 75th percentiles.
 Z-Score: expresses the number of standard deviations the value x is from the mean.
Measures of Shape: indicate how the data points are distributed.
 Skewness: the tendency of a distribution to stretch out in a particular direction.
 Kurtosis: a measure of the peakedness of a distribution.
|»| The Mean
 The mean represents the average of the data and is computed by dividing the sum of the
data points by the number of data points.
 It is the most popular measure of central tendency.
 We can easily compute and explain the mean.
 We have two types of mean depending on whether the data set includes
all items of a population or a subset of items of a population – Sample
Mean and Population Mean.
|»| Sample Mean
 It is the sum of the data values in a sample divided by the number of data
values in that sample.
 We use X (X-bar) to denote the sample mean, and n to denote the number
of data values in a sample.
 Therefore, for ungrouped data, we obtain,
X̄ = (x1 + x2 + x3 + ... + xn) / n = (Σx) / n
 Example 3.1 (Accident Data): The following sample represents the
number of accidents (monthly) over 11 months: 18, 10, 15, 13, 17, 15, 12,
15, 18, 16, 11. Compute the mean number of monthly accidents, i.e.,
compute the sample mean.
X̄ = (Σx) / n = (18 + 10 + 15 + 13 + 17 + 15 + 12 + 15 + 18 + 16 + 11) / 11 = 160 / 11 ≈ 14.55
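As a quick check, here is a minimal Python sketch of this calculation using the accident data from Example 3.1 (the variable names are illustrative):

```python
# Sample mean for the accident data of Example 3.1
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

n = len(accidents)            # number of data values, n = 11
x_bar = sum(accidents) / n    # X-bar = (sum of x) / n

print(n, sum(accidents), round(x_bar, 2))   # 11 160 14.55
```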
|»| Sample Mean (cont.)
 Example 3.2: The mean of a sample with 5 observations is 20. If the sum
of four of the observations is 75, what is the value of the fifth observation?
Sum of the four known observations: 75.
Sum of all five observations: X̄ × n = 20 × 5 = 100.
Therefore, the fifth observation is x5 = 100 − 75 = 25.
|»| Population Mean
 It is the sum of the data values in a population divided by the number of
data values in that population.
 We use μ to denote the population mean, and N to denote the number of
data values in a population.
 Therefore, we obtain,
μ = (x1 + x2 + x3 + ... + xN) / N = (Σx) / N
|»| The Median, Md
 The Median (Md) of a set of data is the value in the center of the data
values when they are arranged from lowest to highest.
 It has an equal number of items to its right and to its left.
 The median is preferred to the mean as a measure of central tendency for data sets
with outliers.
 Calculating the median from a sample involves the following steps:
 Arrange data values in ascending order.
 Find the position of the median. The median position is the (n + 1)/2-th ordered value.
 Find the median value.
|»| The Median (cont.)
 Example 3.3: Compute the median for the accident data given in Example
3.1.
Ascending order: 10, 11, 12, 13, 15, 15 , 15, 16, 17, 18, 18.
n = 11, Median Position = (11+1)/2 = 6th ordered value.
Md = 15.
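A short Python sketch of the (n + 1)/2 position rule, with the statistics module as a cross-check; the simple indexing below assumes an odd n, as in this example:

```python
import statistics

# Median of the accident data (Example 3.3)
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

ordered = sorted(accidents)             # arrange in ascending order
n = len(ordered)
position = (n + 1) / 2                  # median position = (11 + 1) / 2 = 6
md = ordered[int(position) - 1]         # 6th ordered value (works when n is odd)

print(position, md)                     # 6.0 15
print(statistics.median(accidents))     # 15  (library cross-check)
```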
|»| The Mode, Mo
 The Mode (Mo) of a data set is the value that occurs more than once and
the most often.
 Mode is not always a measure of central tendency; this value need not
occur in the center of the data.
 There may be more than one mode if several numbers occur the same
(and the largest) number of times.
 Mode is extensively used in areas such as manufacturing of clothing,
shoes, etc.
 Example 3.4: Find the mode for the accident data given in Example 3.1.
The data point 15 appears the most number of times, so Mo = 15.
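One possible way to find the mode in Python is collections.Counter, applied here to the same accident data:

```python
from collections import Counter

# Mode of the accident data (Example 3.4)
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

counts = Counter(accidents)                 # frequency of each value
value, freq = counts.most_common(1)[0]      # most frequent value and its count

print(value, freq)                          # 15 3
```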
|»| The Midrange, Mr
 It is the average of the highest and the lowest values of a data set.
 Midrange provides an easy-to-grasp measure of central tendency.
 If we use H to denote the highest value and L to denote the lowest value
of a data set, we obtain,
Mr = (L + H) / 2
 Example 3.5: Find the midrange for the accident data given in Example
3.1.
L = 10, H = 18, so Mr = (L + H)/2 = (10 + 18)/2 = 14.
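A small Python sketch of the midrange for the accident data; it also confirms the value of 14 computed above:

```python
# Midrange of the accident data (Example 3.5)
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

L, H = min(accidents), max(accidents)   # lowest and highest values
mr = (L + H) / 2                        # midrange = (10 + 18) / 2

print(L, H, mr)                         # 10 18 14.0
```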
|»| The Range, R
 The numerical difference between the largest value (H) and the smallest
value (L). That is, Range = H – L.
 Example 3.6: The range for the accident data given in Example 3.1 is H –
L = 18 – 10 = 8.
 The range is a crude measure of variation but easy to calculate and
contains valuable information in some situations.
 For instance, stock reports cite the high and low prices of the day.
 Similarly, weather forecasts use daily high and low temperatures. The range is strongly
influenced by outliers.
|»| The Variance
 Variance describes the spread of the data values from the mean.
 It is the average of the sum of the squared differences of the mean from
individual values.
 Two types of variance are (1) Sample variance, and (2) Population
variance.
|»| Sample Variance, S²
 S2 describes the variation of the sample values about the sample mean.
 It is the average of the sum of the squared differences of the sample
mean from individual values.
 That is,
S² = Σ(x − X̄)² / (n − 1) = [Σx² − (Σx)² / n] / (n − 1)
|»| Sample Variance - Example
 Example 3.7: Calculate the sample variance for the accident data.
x     x − X̄               (x − X̄)²
18    18 − 14.55 =  3.45    11.9025
10    10 − 14.55 = −4.55    20.7025
15    15 − 14.55 =  0.45     0.2025
13    13 − 14.55 = −1.55     2.4025
17    17 − 14.55 =  2.45     6.0025
15    15 − 14.55 =  0.45     0.2025
12    12 − 14.55 = −2.55     6.5025
15    15 − 14.55 =  0.45     0.2025
18    18 − 14.55 =  3.45    11.9025
16    16 − 14.55 =  1.45     2.1025
11    11 − 14.55 = −3.55    12.6025

Σ(x − X̄)² = 74.7275
s² = Σ(x − X̄)² / (n − 1) = 74.7275 / (11 − 1) = 74.7275 / 10 ≈ 7.47
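A Python sketch of the same calculation. Note that the slides round X̄ to 14.55, which gives the 74.7275 shown above; exact arithmetic gives 74.7273, and both round to s² ≈ 7.47:

```python
import statistics

# Sample variance of the accident data (Example 3.7)
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

n = len(accidents)
x_bar = sum(accidents) / n                        # 14.5454...
ss = sum((x - x_bar) ** 2 for x in accidents)     # sum of squared deviations
s2 = ss / (n - 1)                                 # divide by n - 1 for a sample

print(round(ss, 4), round(s2, 2))                 # 74.7273 7.47
print(round(statistics.variance(accidents), 2))   # 7.47  (library cross-check)
```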
|»| Sample Variance - Examples
 Example 3.8: From a sample of 50 data values, the statistics Σx and Σx² are calculated to
be 20 and 33, respectively. Compute the sample variance.
S² = [Σx² − (Σx)² / n] / (n − 1) = [33 − (20)² / 50] / (50 − 1) = (33 − 8) / 49 ≈ 0.51
 Example 3.9: The differences between the data values and the sample mean are -5, 1, -3,
2, 3, and 2. What is the variance of the data?
x − X̄     (x − X̄)²
 −5       (−5)² = 25
  1        (1)² =  1
 −3       (−3)² =  9
  2        (2)² =  4
  3        (3)² =  9
  2        (2)² =  4

Σ(x − X̄)² = 52
s² = Σ(x − X̄)² / (n − 1) = 52 / (6 − 1) = 52 / 5 = 10.4
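A brief Python check of the shortcut formula from Example 3.8 and the deviation-based calculation from Example 3.9:

```python
# Example 3.8: shortcut formula S^2 = (sum of x^2 - (sum of x)^2 / n) / (n - 1)
n, sum_x, sum_x2 = 50, 20, 33                 # summary statistics given in the example
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
print(round(s2, 2))                           # 0.51

# Example 3.9: variance directly from the deviations (x - X-bar)
deviations = [-5, 1, -3, 2, 3, 2]
s2_dev = sum(d ** 2 for d in deviations) / (len(deviations) - 1)
print(s2_dev)                                 # 10.4
```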
|»| Population Variance, σ²
 σ² describes the variation of the population values about the population
mean.
 It is the average of the sum of the squared differences of the population
mean from individual values.
 That is,
2 
2


x



N
|»| The Standard Deviation
 Standard deviation is the positive square root of the variance.
 The positive square root of the sample variance is the sample standard deviation,
denoted by S:
S = √(S²) = √[ Σ(x − X̄)² / (n − 1) ] = √[ (Σx² − (Σx)² / n) / (n − 1) ]
 The positive square root of the population variance is the population standard
deviation, denoted by σ:
σ = √(σ²) = √[ Σ(x − μ)² / N ]
|»| Standard Deviation
 Example 3.10: Find the sample standard deviations for Examples 3.7, 3.8,
and 3.9.
From Example 3.7: s = √(s²) = √7.47 ≈ 2.73
From Example 3.8: s = √(s²) = √0.51 ≈ 0.71
From Example 3.9: s = √(s²) = √10.4 ≈ 3.22
|»| Coefficient of Variation, CV
 Measures the standard deviation in terms of the mean.
 For example, what percentage of x-bar is s?
 The Coefficient of Variation (CV) is used to compare the variation of two or
more data sets where the values of the data differ greatly.
CV = (s / X̄) × 100
 Example 3.11: The scores for team 1 were 70, 60, 65, and 69. The scores
for team 2 were 72, 58, 61, and 73. Compare the coefficients of variation
for these two teams.
For team 1: X̄ = 66, S = 4.55, and CV = (S / X̄) × 100 = (4.55 / 66) × 100 = 6.89.
For team 2: X̄ = 66, S = 7.62, and CV = (S / X̄) × 100 = (7.62 / 66) × 100 = 11.55.
Since team 2 has the larger CV, its scores show relatively more variation than team 1's.
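A Python sketch of the comparison. Because statistics.stdev keeps full precision, team 2's CV comes out as 11.54 rather than 11.55 (the slides round s to 7.62 before dividing):

```python
import statistics

# Coefficient of variation for the two teams in Example 3.11
team1 = [70, 60, 65, 69]
team2 = [72, 58, 61, 73]

def cv(scores):
    """CV = (s / X-bar) * 100, using the sample standard deviation."""
    return statistics.stdev(scores) / statistics.mean(scores) * 100

print(round(cv(team1), 2))   # 6.89
print(round(cv(team2), 2))   # 11.54
```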
|»| Percentile
 The P-th percentile is a number such that P% of the measurements fall
below the P-th percentile and (100-P)% fall above it.
 Most common measure of position.
 How to calculate percentile
 Arrange the data
 Find the location of the Pth percentile: n × (P / 100).
 Find the percentile using the following rules:
 Location Rule 1: If n × P/100 is not a counting number, round it up, and
the Pth percentile will be the value in this position of the ordered data.
 Location Rule 2: If n × P/100 is a counting number, the Pth percentile is
the average of the number in this location (of the ordered data) and the
number in the next largest location.
|»| Percentile - Example
 Example 3.12: Find the 35th percentile from the following aptitude data
(Aptitude Data).
22 25 28 31 34 35 39 39 40 42 44 44 46 48 49 51 53 53 55 55
56 57 59 60 61 63 63 63 65 66 68 68 69 71 72 72 74 75 75 76
78 78 80 82 83 85 88 90 92 96
 Number of data values, n = 50
 35th Percentile = P35.
So, P35 location = n × (P / 100) = 50 × (35 / 100) = 17.5
 17.5 is NOT a counting number. So, using Location Rule 1, P35 = 18th value = 53.
|»| Quartiles and Interquartile Range
 Quartiles are merely particular percentiles that divide the data into quarters, namely:
 Q1 = 1st quartile = 25th percentile (P25)
 Q2 = 2nd quartile = 50th percentile (P50) = Median.
 Q3 = 3rd quartile = 75th percentile (P75)
 Example 3.13: Determine the quartiles for the aptitude data
P25 location = n × (P / 100) = 50 × (25 / 100) = 12.5
P50 location = n × (P / 100) = 50 × (50 / 100) = 25
P75 location = n × (P / 100) = 50 × (75 / 100) = 37.5
 Q1 = 13th ordered value = 46
 Q2 = Median = (61+63)/2 = 62
 Q3 = 38th ordered value = 75
 Interquartile Range (IQR)
 The range for the middle 50% of the data
 IQR = Q3 – Q1. For aptitude data: IQR = 75 – 46 = 29.
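A compact Python check of Example 3.13, applying the same location rules as the percentile sketch above to the aptitude data:

```python
# Quartiles and IQR for the aptitude data (Example 3.13)
aptitude = sorted([22, 25, 28, 31, 34, 35, 39, 39, 40, 42, 44, 44, 46, 48, 49, 51, 53, 53,
                   55, 55, 56, 57, 59, 60, 61, 63, 63, 63, 65, 66, 68, 68, 69, 71, 72, 72,
                   74, 75, 75, 76, 78, 78, 80, 82, 83, 85, 88, 90, 92, 96])

q1 = aptitude[13 - 1]                            # 50 * 25/100 = 12.5 -> 13th ordered value
q2 = (aptitude[25 - 1] + aptitude[26 - 1]) / 2   # 50 * 50/100 = 25  -> average 25th and 26th
q3 = aptitude[38 - 1]                            # 50 * 75/100 = 37.5 -> 38th ordered value
iqr = q3 - q1

print(q1, q2, q3, iqr)                           # 46 62.0 75 29
```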
|»| Z-Scores
 Z-score determines the relative position of any particular data value X and is
based on the mean and standard deviation of the data set.
xx
z
s
 The Z-score expresses the number of standard deviations the value x is from the mean.
 A negative Z-score implies that x is to the left of the mean and a positive Z-score
implies that x is to the right of the mean.
 Example 3.14: Find the z-score
for an aptitude test score of 83.
z = (x − X̄) / s = (83 − 60.36) / 18.61 ≈ 1.22
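A minimal Python sketch of the z-score calculation, using the sample mean and standard deviation reported on this slide (X̄ = 60.36, s = 18.61):

```python
# Z-score for an aptitude score of 83 (Example 3.14)
x_bar, s = 60.36, 18.61       # summary values reported on the slide

def z_score(x):
    return (x - x_bar) / s    # number of standard deviations x lies from the mean

print(round(z_score(83), 2))  # 1.22
```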
|»| Standardizing Sample Data
 The process of subtracting the mean and dividing by the standard deviation is
referred to as standardizing the sample data.
 The corresponding z-score is the standardized score.
|»| Skewness, Sk
 Skewness measures the tendency of a distribution to stretch out in a
particular direction.
 Pearson’s coefficient of skewness is used to calculate skewness.
Sk = 3(X̄ − Md) / s
 Example 3.15: Find the skewness for the aptitude data (a code check follows this list).
 Sk = 3(60.36 – 62)/18.61 = 3(-1.64)/18.61 = -4.92/18.61 = -0.26
 The values of Sk will always fall between -3 and 3
 A positive Sk number implies a shape which is skewed right and the
mode < median < mean
 In a data set with a negative Sk value the
mean < median < mode
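A small Python check of Pearson's coefficient of skewness for Example 3.15, using the mean, median, and standard deviation reported on the slides:

```python
# Pearson's coefficient of skewness for the aptitude data (Example 3.15)
x_bar, md, s = 60.36, 62, 18.61   # summary values reported on the slides

sk = 3 * (x_bar - md) / s
print(round(sk, 2))               # -0.26  (slightly left-skewed)
```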
|»| Skewness, Sk – In Graphs
[Figure: frequency histogram of symmetric data, where X̄ = Md = Mo.]
|»| Skewness, Sk – In Graphs
[Figure: relative frequency histogram with right (positive) skew, Sk > 0; Mode (Mo) < Median (Md) < Mean (X̄).]
|»| Skewness, Sk – In Graphs
[Figure: relative frequency histogram with left (negative) skew, Sk < 0; Mean (X̄) < Median (Md) < Mode (Mo).]
|»| Kurtosis
 Kurtosis is a measure of the peakedness of a distribution.
 Large values occur when there is a high frequency of data near the mean and in
the tails.
 The calculation is cumbersome and the measure is used infrequently.
|»| Interpreting X-bar and S
 How many, or what percentage, of the data values fall within two standard deviations
of the mean?
 There are usually three ways to answer this:
 Actual percentage based on the sample
 Chebyshev’s Inequality
 Empirical Rule
|»| Interpreting X-bar and S (cont.)
 According to Chebyshev’s Inequality, in general, at least (1 − 1/k²) × 100% of the data
values lie between X̄ − ks and X̄ + ks (i.e., have z-scores between −k and k) for any k > 1.
 Chebyshev’s Inequality is usually conservative but makes no
assumption about the distribution of the population.
 The Empirical Rule assumes a bell-shaped distribution of the population, i.e., a
normal population.
Between               Actual Percentage      Chebyshev’s Inequality    Empirical Rule
                      (Aptitude Data)        Percentage                Percentage
X̄ − s and X̄ + s       66% (33 out of 50)     (no bound)                ≈ 68%
X̄ − 2s and X̄ + 2s     98% (49 out of 50)     ≥ 75%                     ≈ 95%
X̄ − 3s and X̄ + 3s     100% (50 out of 50)    ≥ 89%                     ≈ 100%
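A Python sketch that compares the actual percentages for the aptitude data with Chebyshev's lower bound; the computed mean and standard deviation (about 60.36 and 18.61) reproduce the table above:

```python
import statistics

# Actual percentage of aptitude-data values within k standard deviations of the mean,
# compared with Chebyshev's lower bound of (1 - 1/k^2) * 100%
aptitude = [22, 25, 28, 31, 34, 35, 39, 39, 40, 42, 44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
            56, 57, 59, 60, 61, 63, 63, 63, 65, 66, 68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
            78, 78, 80, 82, 83, 85, 88, 90, 92, 96]

x_bar = statistics.mean(aptitude)
s = statistics.stdev(aptitude)

for k in (1, 2, 3):
    within = sum(1 for x in aptitude if x_bar - k * s <= x <= x_bar + k * s)
    actual = 100 * within / len(aptitude)
    bound = (1 - 1 / k ** 2) * 100      # Chebyshev's bound is only meaningful for k > 1
    print(k, actual, bound)             # actual: 66.0, 98.0, 100.0; bound: 0.0, 75.0, ~88.9
```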
|»| A Bell-Shaped (Normal) Population
|»| Bivariate Data
 Data collected on two variables for each item.
 Example 3.16: Data for 10 families on income (thousands of dollars) and
square footage of home (hundreds of square feet) (Income-Footage Data).
Income (000s), X:             32   36   55   47   38   60   66   44   70   50
Sq Footage of Home (00s), Y:  16   17   26   24   22   21   32   18   30   20
|»| Scatter Diagram
 Graphical illustration of bivariate data
 Each observation is represented by a point, where the X-axis is always
horizontal and the Y-axis is vertical.
[Figure: two scatter diagrams, (a) and (b), plotting square footage of home (hundreds) on the Y-axis against income (thousands) on the X-axis.]
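A minimal plotting sketch, assuming matplotlib is available, that reproduces a scatter diagram like the one described above for the income-footage data:

```python
import matplotlib.pyplot as plt

# Scatter diagram for the income-footage data of Example 3.16
income = [32, 36, 55, 47, 38, 60, 66, 44, 70, 50]    # thousands of dollars (X)
footage = [16, 17, 26, 24, 22, 21, 32, 18, 30, 20]   # hundreds of square feet (Y)

plt.scatter(income, footage)
plt.xlabel("Income (thousands)")
plt.ylabel("Square footage (hundreds)")
plt.show()
```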
|»| Coefficient of Correlation, r
 Measures the strength of the linear relationship between X variable
and Y variable.
r = Σ(x − X̄)(y − Ȳ) / √[ Σ(x − X̄)² × Σ(y − Ȳ)² ]
  = [Σxy − (Σx)(Σy) / n] / √{ [Σx² − (Σx)² / n] × [Σy² − (Σy)² / n] }
  = SCPxy / √( SSx × SSy )
 r ranges from -1 to 1.
 The larger the |r| is, the stronger the linear relationship is between X and Y.
 If r = 1 or r = -1, X and Y are perfectly correlated.
 If r > 0, X and Y have a positive relationship (i.e., large values of X are associated
with large values of Y).
 If r < 0, X and Y have a negative relationship (i.e., large values of X are associated
with small values of Y).
|»| Coefficient of Correlation – Example
 Example 3.17: Calculate r for Income-Footage Data.
Income, X   Footage, Y   XY               X²            Y²
32          16           32 × 16 = 512    1024          256
36          17           36 × 17 = 612    1296          289
55          26           55 × 26 = 1430   3025          676
47          24           47 × 24 = 1128   2209          576
38          22           38 × 22 = 836    1444          484
60          21           60 × 21 = 1260   3600          441
66          32           66 × 32 = 2112   4356          1024
44          18           44 × 18 = 792    1936          324
70          30           70 × 30 = 2100   4900          900
50          20           50 × 20 = 1000   2500          400
Σx = 498    Σy = 226     Σxy = 11782      Σx² = 26290   Σy² = 5370

r = [Σxy − (Σx)(Σy) / n] / √{ [Σx² − (Σx)² / n] × [Σy² − (Σy)² / n] }
  = [11782 − (498)(226) / 10] / √{ [26290 − (498)² / 10] × [5370 − (226)² / 10] }
  = 527.2 / √(1489.6 × 262.4)
  ≈ 0.843
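A Python check of this computation using the shortcut form of r; the intermediate sums match the table above:

```python
import math

# Coefficient of correlation r for the income-footage data (Example 3.17)
x = [32, 36, 55, 47, 38, 60, 66, 44, 70, 50]
y = [16, 17, 26, 24, 22, 21, 32, 18, 30, 20]
n = len(x)

scp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 11782 - (498)(226)/10
ss_x = sum(a ** 2 for a in x) - sum(x) ** 2 / n                   # 26290 - 498^2/10
ss_y = sum(b ** 2 for b in y) - sum(y) ** 2 / n                   # 5370 - 226^2/10

r = scp_xy / math.sqrt(ss_x * ss_y)
print(round(r, 3))   # 0.843
```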
|»| Coefficient of Correlation, r – In Graphs
[Figure: four scatter plots of y versus x illustrating (a) r = 0, (b) r = 1, (c) r = -1, and (d) r = .9.]
|»| Coefficient of Correlation, r – In Graphs
[Figure: two scatter plots of y versus x illustrating (e) r = -.8 and (f) r = .5.]