Download Describing Data: Numerical

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Foundations of statistics wikipedia , lookup

History of statistics wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
Topic 3
Describing Data: Numerical
Figures won´t lie, but liars will figure
Statistics for Business and Economics
Chap 3-1
Why descriptive statistics is important
to managers?


Managers also need to become acquainted
with numerical descriptive measures that
provide very brief and easy-to understand
summaries of a data collection
There are two broad categories into which
these measures fail: measures of central
tendency and measures of variability
Topic Goals
After completing this topic, you should be able to
compute and interpret the:

Measures of central tendency, variation, and
shape






(Arithmetic) Mean, median, mode, geometric mean
Quartiles
Five number summary and box-and-whisker plots
Range, interquartile range, variance and standard
deviation, coefficient of variation
Symmetric and skewed distributions
Flat and peaked distribution
Statistics for Business and Economics
Chap 3-3
Describing Data Numerically
Describing Data Numerically
Central Tendency
Variation
Arithmetic Mean
Range
Median
Interquartile Range
Mode
Variance
Geometric Mean
Standard Deviation
Coefficient of Variation
Statistics for Business and Economics
Chap 3-4
Measures of Central Tendency
Overview
Central Tendency
Mean
Median
Mode
Midpoint of
ranked values
Most frequently
observed value
n
x
x
i1
i
n
Arithmetic
average
Statistics for Business and Economics
Chap 3-5
Arithmetic Mean

The arithmetic mean (mean) is the most
common measure of central tendency

For a population of N values:
N
μ

x
i
i 1
N
Population
values
(simple) or μ =
N
 x n
i 1
i
i
N
Population size
(weighted)
For a sample of size n:
n
x
x
i 1
n
Statistics for Business and Economics
i
Observed
values
(simple) or x 
Sample size
n
 x n
i 1
i
n
i
(weighted)
Chap 3-6
Arithmetic mean
The arithmetic mean of a collection of numerical
values is the sum of these values divided by
the number of values. The symbol for the
population mean is the Greek letter μ (mu), and
the symbol for a sample mean is X (X-bar)
A population parameter is any measurable
characteristic of a population.
A sample statistic is any measurable
characteristic of a sample.

Characteristics of the arithmetic
mean





l. Every data set measured on an interval or ratio
level has a mean.
2. The mean has valuable mathematicaI properties
that make it convenient to use in further
computations.
3. The mean is sensitive to extreme values.
4. The sum of the deviations of the numbers in a
data set from the mean is zero
5. The sum of the squared deviations of the
numbers in a data set from the mean is a minimum
value.
Geometric Mean

The geometric mean is the most common
measure of central tendency for rates (growth
rates, interest rates, etc.)

For N values:
N
μ geo  N  x i  N x1  x 2  x N
i 1
Statistics for Business and Economics
Chap 3-9
Arithmetic Mean
(continued)



The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers) !!!
0 1 2 3 4 5 6 7 8 9 10
Mean = 3
1  2  3  4  5 15

3
5
5
Statistics for Business and Economics
0 1 2 3 4 5 6 7 8 9 10
Mean = 4
1  2  3  4  10 20

4
5
5
Chap 3-10
Median

The numerical value in the middle when data set
is arranged in order (50% above, 50% below)
0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10
Median = 3
Median = 3

Not affected by extreme values
Statistics for Business and Economics
Chap 3-11
Finding the Median

The location of the median:
n 1
Median position 
position in the ordered data
2



If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of
the two middle numbers
n 1
is not the value of the median, only the
2
position of the median in the ranked data
Note that
Statistics for Business and Economics
Chap 3-12
Mode






A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Mode = 9
Statistics for Business and Economics
0 1 2 3 4 5 6
No Mode
Chap 3-13
Review Example:
Summary Statistics
House Prices:
€2,000,000
500,000
300,000
100,000
100,000

Mean:

Median: middle value of ranked data
= €300,000

Mode: most frequent value
= €100,000
Sum 3,000,000
Statistics for Business and Economics
(€3,000,000/5)
= €600,000
Chap 3-14
Which measure of location
is the “best”?

Mean is generally used, unless
extreme values (outliers) exist

Then median is often used, since
the median is not sensitive to
extreme values.

Statistics for Business and Economics
Example: Median home prices may be
reported for a region – less sensitive to
outliers
Chap 3-15
Measures of Variability
Variation
Range

Interquartile
Range
Variance
Standard
Deviation
Coefficient
of Variation
Measures of variation give
information on the spread
or variability of the data
values.
Same center,
different variation
Statistics for Business and Economics
Chap 3-16
Range


Simplest measure of variation
Difference between the largest and the smallest
observations:
Range = Xlargest – Xsmallest
Example:
0 1 2 3 4 5 6 7 8 9 10 11 12
13 14
Range = 14 - 1 = 13
Statistics for Business and Economics
Chap 3-17
Disadvantages of the Range

Ignores the way in which data are distributed
7
8
9
10
11
12
Range = 12 - 7 = 5

7
8
9
10
11
12
Range = 12 - 7 = 5
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Statistics for Business and Economics
Chap 3-18
Quartiles

Quartiles split the ranked data into 4 segments with
an equal number of values per segment
25%
Q1



25%
25%
Q2
25%
Q3
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are
larger)
Only 25% of the observations are greater than the third
quartile
Statistics for Business and Economics
Chap 3-19
Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position:
Q1 = 0.25(n+1)
Second quartile position: Q2 = 0.50(n+1)
(the median position)
Third quartile position:
Q3 = 0.75(n+1)
where n is the number of observed values
Statistics for Business and Economics
Chap 3-20
Quartiles

Example: Find the first quartile
Sample Ranked Data: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 = is in the 0.25(9+1) = 2.5 position of the ranked data
so use the value half way between the 2nd and 3rd values,
so
Statistics for Business and Economics
Q1 = 12.5
Chap 3-21
Interquartile Range

Can eliminate some outlier problems by using
the interquartile range

Eliminate high- and low-valued observations
and calculate the range of the middle 50% of
the data

Interquartile range = 3rd quartile – 1st quartile
IQR = Q3 – Q1
Statistics for Business and Economics
Chap 3-22
Interquartile Range
Five number summary –Box plot
Example:
X
minimum
Q1
25%
12
Median
(Q2)
25%
30
25%
45
X
Q3
maximum
25%
57
70
Interquartile range
= 57 – 30 = 27
Statistics for Business and Economics
Chap 3-23
Population Variance

Average of squared deviations of values from
the mean

Population variance:
N
σ2 
2
(x

μ)
 i
i 1
N
Where
N
(simple) or σ 2 
2
(x

μ)
 ni
 i
i 1
N
(weighted)
μ = population mean
N = population size
xi = ith value of the variable x
Statistics for Business and Economics
ni = absolute frequency
Chap 3-24
Sample Variance

Average (approximately) of squared deviations
of values from the mean

Sample variance:
n
s2 
2
(x

x)
 i
i 1
n-1
Where
n
(simple) or s 2 
2
(x

x)
 ni
 i
i 1
n-1
(weighted)
x = arithmetic mean
n = sample size
xi = ith value of the variable x
Statistics for Business and Economics
ni = absolute frequency
Chap 3-25
Population Standard Deviation




The square root of population variance
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data

Population standard deviation:
N
σ
2
(x

μ)
 i
i 1
N
Statistics for Business and Economics
N
(simple) or σ 
2
(x

μ)
 ni
 i
i 1
N
(weighted)
Chap 3-26
Sample Standard Deviation




The square root of the sample variance
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data

Sample standard deviation:
n
s
2
(x

x)
 i
i 1
n-1
Statistics for Business and Economics
n
(simple) or s 
2
(x

x)
 i  ni
i 1
n-1
(weighted)
Chap 3-27
Calculation Example:
Sample Standard Deviation
Sample
Data (xi) :
10
12
14
n=8
s
15
17
18
18
24
Mean = x = 16
(10  X)2  (12  x)2  (14  x)2    (24  x)2
n 1

(10  16)2  (12  16)2  (14  16)2    (24  16)2
8 1

126
7

Statistics for Business and Economics
4.2426
A measure of the “average”
scatter around the mean
Chap 3-28
Measuring variation
Small standard deviation
Large standard deviation
Statistics for Business and Economics
Chap 3-29
Comparing Standard Deviations
Data A
11
12
13
14
15
16
17
18
19
20 21
Mean = 15.5
s = 3.338
20 21
Mean = 15.5
s = 0.926
20 21
Mean = 15.5
s = 4.570
Data B
11
12
13
14
15
16
17
18
19
Data C
11
12
13
Statistics for Business and Economics
14
15
16
17
18
19
Chap 3-30
Advantages of Variance and
Standard Deviation

Each value in the data set is used in the
calculation

Values far from the mean are given extra
weight
(because deviations from the mean are squared)
Statistics for Business and Economics
Chap 3-31
Coefficient of Variation

Measures relative variation

Always in percentage (%)

Shows variation relative to mean

Can be used to compare two or more sets of
data measured in different units
 s 
cv    100%
x 
Statistics for Business and Economics
Chap 3-32
Comparing Coefficient
of Variation

Stock A:
 Average price last year = $50
 Standard deviation = $5
s
cvA  
x

Stock B:



$5
100%  10%
 100% 
$50

Average price last year = $100
Standard deviation = $5
s
cvB  
x
Statistics for Business and Economics

$5
100%  5%
 100% 
$100

Both stocks
have the same
standard
deviation, but
stock B is less
variable relative
to its price
Chap 3-33
Skewness


Shows asymmetry and refers to the shape of a
distribution
Can take on positive, negative or zero values
n
1 
3
(x

x)
 i
i 1
s n
3
Statistics for Business and Economics
n
(simple) or  1 
3
(x

x)
 ni
 i
i 1
s n
3
(weighted)
Chap 3-34
Distribution Shape

The shape of the distribution is said to be symmetric
if the observations are balanced, or evenly distributed,
about the center; coefficient of skewness equal zero
Frequency
Symmetric Distribution
10
9
8
7
6
5
4
3
2
1
0
1
2
3
4
5
6
7
8
9
When the distribution is unimodal, the mean, median,
and mode are all equal to one another and are located
at the center of the distribution
Statistics for Business and Economics
Chap 3-35
Distribution Shape
(continued)

The shape of the distribution is said to be
skewed if the observations are not
symmetrically distributed around the center.
Positively Skewed Distribution
12
10
Frequency
A positively skewed distribution
(skewed to the right) has a tail that
extends to the right in the
direction of positive values.
8
6
4
2
0
1
3
4
5
6
7
8
9
7
8
9
Negatively Skewed Distribution
12
10
Frequency
A negatively skewed distribution
(skewed to the left) has a tail that
extends to the left in the direction
of negative values.
2
8
6
4
2
0
1
Statistics for Business and Economics
2
3
4
5
6
Chap 3-36
Shape of a Distribution

Describes how data are distributed

Measures of shape

Symmetric or skewed
Left-Skewed
Symmetric
Right-Skewed
Mean < Median < Mode Mean = Median = Mode Mean <Median < Mode
Statistics for Business and Economics
Chap 3-37
Kurtosis


Refers to the shape of a distribution
Can take positive (peaked distribution), negative
(flat distribution) or zero values
(a symmetrical, bell-shaped, normal distribution)
n
2 
 (x i  x)
n
4
i 1
s n
4
Statistics for Business and Economics
- 3 (simple) or  2 
4
(x

x)
 ni
 i
i 1
s n
4
-3 (weighted)
Chap 3-38
Kurtosis
(continued)
2 > 0
2  0
2 < 0
Statistics for Business and Economics
Chap 3-39
The Empirical Rule


If the data distribution is bell-shaped, then
the interval:
μ  1σ contains about 68% of the values in
the population or the sample
68%
μ
μ  1σ
Statistics for Business and Economics
Chap 3-40
The Empirical Rule


μ  2σ contains about 95% of the values in
the population or the sample
μ  3σ contains about 99.7% of the values
in the population or the sample
95%
99.7%
μ  2σ
μ  3σ
Statistics for Business and Economics
Chap 3-41
Using Microsoft Excel

Simple Descriptive Statistics can be
obtained from Microsoft® Excel

Use menu choice:
Data Tab / Data analysis / Descriptive statistics

Enter details in dialog box
Statistics for Business and Economics
Chap 3-42
Using Microsoft Excel
(continued)

Enter dialog box
details

Check box for
summary statistics

Click OK
Statistics for Business and Economics
Chap 3-43
Excel output
Microsoft Excel
Simple descriptive statistics
output, using the house price
data:
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Statistics for Business and Economics
Chap 3-44
Next topic

Theory of probability
Thank you!
Have a nice day!