Download Note

Document related concepts

Psychometrics wikipedia , lookup

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Student's t-test wikipedia , lookup

Regression toward the mean wikipedia , lookup

Transcript
Statistics for Business
(Env)
Chapter 3
Descriptive Statistics:
Numerical Methods
1
Descriptive Statistics
3.1 Describing Central Tendency
3.2 Measures of Variation
3.3 Percentiles, Quartiles and Box-andWhiskers Displays
3.4 Covariance, Correlation, and the Least
Square Line
3.5 Weighted Means and Grouped Data
(Optional)
3.6 The Geometric Mean (Optional)
Describing Central Tendency
• In addition to describing the shape of a
distribution, want to describe the data set’s
central tendency
– A measure of central tendency represents the
center or middle of the data
– It is most typical or most representative of the
entire data
Parameters and Statistics
• A population parameter is a number
calculated from all the population
measurements that describes some aspect of
the population
• A sample statistic is a number calculated
using the sample measurements that
describes some aspect of the sample
Measures of Central Tendency
Mean, 
Median, Md
Mode, Mo
The average or expected value
The value of the middle point of
the ordered measurements
The most frequent value
The Mean
Population X1, X2, …, XN

Sample x1, x2, …, xn
x
Population Mean
Sample Mean
n
N


Xi
i=1
N
x
x
i
i=1
n
The Sample Mean
For a sample of size n, the sample mean is defined as
n
x
x
i 1
n
i
x1  x2  ...  xn

n
and is a point estimate of the population mean 
• It is the value to expect, on average and in the long run
• And the amount each member gets when the total is
distributed equally within the sample
Mean as the balance point for a distribution
Data: 2, 2, 6, 10
mean=(2+2+6+10)/4=5
8
Data: 3, 6, 6, 9, 11
mean=(3+6+6+9+11)/5=7
What will happen to the mean if we add one more
number to the data?
9
The Median
The median Md is a value such that 50% of all
measurements, after having been arranged in
numerical order, lie above (or below) it
1. If the number of measurements is odd, the median is the
middlemost measurement in the ordering
2. If the number of measurements is even, the median is
the average of the two middlemost measurements in the
ordering
Data: 3, 5, 8, 10, 11
median=8
11
Data: 3, 3, 4, 5, 7, 8
median=(4+5)/2=4.5
12
Data: 1, 2, 2, 3, 4, 4, 4, 4, 4, 5
median=4??
13
14
Example for non-integer data
• Example 3.1: First five observations from
Table 3.1:
30.8, 31.7, 30.1, 31.6, 32.1
• In order: 30.1, 30.8, 31.6, 31.7, 32.1
• There is an odd so median is one in middle, or
31.6
Data: 2, 2, 2, 3, 3, 12
mean=4
median=(2+3)/2=2.5
16
The Mode
The mode Mo of a population or sample of
measurements is the measurement that occurs most
frequently
– Modes are the values that are observed “most typically”
– Sometimes higher frequencies at two or more values
• If there are two modes, the data is bimodal
• If more than two modes, the data is multimodal
– When data are in classes, the class with the highest
frequency is the modal class
• The tallest box in the histogram
Histogram Describing the 50 Mileages
19
Selecting a measure of Central Tendency
• Usually the mean is a good measure, because
it uses every score in the distribution.
• There are some extreme cases in which the
mean is not representative (or calculable).
Then the mode and the median are used.
20
Mean=(10+11*4+12*3+13+100)/10=20.3
Mode=11
Median=(11+12)/2=11.5
21
Mean – not computable
Median=(12+13)/2=12.5
Mode – not meaningful
Open-ended distributions A distribution is said to be
open-ended when there is no upper limit (or lower
limit) for one of the categories
22
Measures of Variation/variability
• Knowing the measures of central tendency is not
enough
• Both of the distributions below have identical
measures of central tendency
24
Measures of Variation
Range
Largest minus the smallest measurement
Variance
The average of the squared deviations of
all the population measurements from
the population mean
Standard
Deviation
The square root of the variance
They provide quantitative measures of the degree to which
data in a distribution are spread out or clustered together.
Range for discrete & continuous data
• The range is the distance between the largest
score (Xmax) and the smallest score (Xmin) in the
distribution for discrete data.
• For continuous data, you must also take into
account the real limits of the maximum and
minimum X values.
• range = URL Xmax - LRL Xmin
26
Population Variance and Standard
Deviation
• The population variance (σ2) is the average of
the squared deviations of the individual
population measurements from the
population mean (µ)
• The population standard deviation (σ) is the
positive square root of the population
variance
Variance
• For a population of size N, the population
variance σ2 is:
N
2 
2


x


 i
i 1
N
2
2
2

x1     x2       xN   

N
• For a sample of size n, the sample variance s2
is:
n
2
s2 
 x  x 
i 1
i
n 1
2
2
2

x1  x   x2  x     xn  x 

n 1
Sample variability tends to underestimate the
population value
29
Standard Deviation
• Population standard deviation (σ):
  
2
• Sample standard deviation (s):
s s
2
Example: Sample Variance and
Standard Deviation
• Data points are: 60, 41, 15, 30, 34
• Mean is 36
• Variance is:

2
2
2
2
2
2

60  36  41  36  15  36  30  36  34  36

5
576  25  441  36  4 1082


 216.4
5
5
Standard deviation is:
  216.4  14.71
X
32
Percentiles & Quartiles
•
•
•
•
For a set of measurements arranged in increasing
order, the pth percentile is a value such that p percent
of the measurements fall at or below the value and
(100-p) percent of the measurements fall at or above
the value
The first quartile Q1 is the 25th percentile
The second quartile (or median) is the 50th
percentile
The third quartile Q3 is the 75th percentile
The interquartile range IQR is Q3 - Q1
Cumulative percentages &
PERCENTILES
Q3
Q1
X=2 means that the measurement was
somewhere between the real limits of 1.5 and
2.5.
30% of the individuals have been
accumulated by the time you
reach the top of the interval for
X=2.
34
35
What is the 95th percentile? (Answer: X = 4.5.)
What is the percentile rank for X = 3.5? (Answer: 70%.)
What is the 50th percentile?
What is the percentile rank for X = 4?
estimates of these values by a standard procedure
known as interpolation
36
37
38
Using the following distribution of scores, we will use
interpolation to find the 50th percentile:
39
For the scores, the width of the interval is 5 points. For the percentages, the
width is 50 points.
The value of 50% is located 10 points from the top of the percentage interval. As
a fraction of the whole interval, this is 10 out of 50, or 1/5 of the total interval.
The 50th percentile is X = 8.5.
40
USING INTERPOLATION TO FIND THE MEDIAN
Answer: X = 3.70 is the median
Notice that this is exactly the same answer we
obtained using the graphic method of interpolation
in Figure 3.7
41
42
43
Example: Quartiles
A slightly different way to find the quartiles (without using interpolation).
20 customer satisfaction ratings:
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
Md = (8+8)/2 = 8
Q1 = (7+8)/2 = 7.5
Q3 = (9+9)/2 = 9
IQR = Q3  Q1 = 9  7.5 = 1.5
Five Number Summary in
descriptive statistic
1.
2.
3.
4.
5.
The smallest measurement
The first quartile, Q1
The median, Md
The third quartile, Q3
The largest measurement
•
Displayed visually using a box-and-whiskers
plot
Box-and-whisker plots
A box and whisker plot (sometimes called a boxplot) is a graph
that presents information from a five-number summary. It does
not show a distribution in as much detail as a stem and leaf plot
or histogram does, but is especially useful for indicating whether
a distribution is skewed and whether there are potential unusual
observations (outliers) in the data set.
46
Outliers
• Outliers are measurements that are very different
from other measurements
– They are either much larger or much smaller than most of
the other measurements
• Outliers lie beyond the fences of the box-andwhiskers plot
– Measurements between the inner and outer fences are
mild outliers
– Measurements beyond the outer fences are severe
outliers
Box-and-Whiskers Plots
• The box plots the:
– first quartile, Q1
– median, Md
– third quartile, Q3
– inner fences
– outer fences
From: Business Statistics in Practice, 5th Edition, Bowerman O’Connell Murphree,
Box-and-Whiskers Plots
Continued
• Inner fences
– Located 1.5IQR away from the quartiles:
• Q1 – (1.5  IQR)
• Q3 + (1.5  IQR)
• Outer fences
– Located 3IQR away from the quartiles:
• Q1 – (3  IQR)
• Q3 + (3  IQR)
From: Business Statistics in Practice, 5th Edition, Bowerman O’Connell Murphree,
Box-and-Whiskers Plots
Continued
• The “whiskers” are dashed lines that plot the
range of the data
– A dashed line drawn from the box below Q1 down
to the smallest measurement between the inner
fences
– Another dashed line drawn from the box above Q3
up to the largest measurement between the inner
fences
Box-and-Whiskers Plots
Continued
Symmetric distribution: A distribution having the same shape on
either side of the center
Skewed distribution: One whose shapes on either side of the center
differ; a nonsymmetrical distribution.
positive
negative
Can be positively or negatively skewed, or bimodal
Symmetric distribution: mean = median = mode
Mean
Median
Mode
The Relative Positions of the Mean, Median, and Mode: Symmetric Distribution
54
Relationships Among Mean, Median
and Mode
Relative Positions of the Mean, Median, and Mode
Mean<Median<Mode
The left tail is longer; the mass of the
distribution is concentrated on the
right of the figure. The distribution is
also said to be left-skewed.
Mode<Median<Mean
The right tail is longer; the mass of the
distribution is concentrated on the left
of the figure. The distribution is also
said to be right-skewed.
Skewness is the measurement of the lack of symmetry of the distribution.
coefficient of
skewness can range from -3.00
The
up to 3.00 when using the following
formula:
A value of 0 indicates a symmetric
distribution.
Negatively Skewed Distribution has
negative coefficient of skewness.
Positively Skewed Distribution has
positive coefficient of skewness.

3 X  Median
Sk 
s
Some software packages use a
different formula which results
in a wider range for the
coefficient.

Normal distribution
Normal distribution (Gaussian
distribution) is a symmetric
distribution that often gives a good
description of data that cluster
around the mean. The graph of the
distribution is bell-shaped, with a
peak at the mean, and is known as
the Gaussian function or bell curve.
Normal distribution in nature
The bean machine is a device
invented by Sir Francis Galton to
demonstrate how the normal
distribution appears in nature. This
machine consists of a vertical board
with interleaved rows of pins. Small
balls are dropped from the top and
then bounce randomly left or right
as they hit the pins. The balls are
collected into bins at the bottom
and settle down into a pattern
resembling the Gaussian curve.
Normal distribution in nature
Height (in.)
Distribution of the heights of 1052 women fits the normal distribution,
with a goodness of fit p value of 0.75
Histogram of daily percentage changes in the S&P 500 index
The Empirical Rule for a
Normal/Gaussian distribution
• If a population has mean µ and standard deviation σ
and is described by a normal distribution, then
– 68.26% of the population measurements lie within one
standard deviation of the mean: [µ-σ, µ+σ]
– 95.44% of the population measurements lie within two
standard deviations of the mean: [µ-2σ, µ+2σ]
– 99.73% of the population measurements lie within three
standard deviations of the mean: [µ-3σ, µ+3σ]
Origin and meaning of “Six Sigma Process"
Six Sigma – quality control of process outputs
If one has six standard deviations between the process mean and
the nearest specification limit, as shown in the graph, practically
no items will fail to meet specifications.
If the upper and lower specification limits (USL, LSL) are at a distance of 6σ from the
mean, values lying that far away from the mean are extremely unlikely. Even if the
mean were to move right or left by 1.5σ at some point in the future (1.5 sigma shift),
there is still a good safety cushion. This is why if the mean is at least 6σ away from
the nearest specification limit, the process will be under good quality control.
Chebyshev’s Theorem
• Let µ and σ be a population’s mean and
standard deviation, then for any value k> 1
• At least 100(1 - 1/k2 )% of the population
measurements lie in the interval [µ-kσ, µ+kσ]
• Or, at most 100/k2 % of the population
measurements lie outside the interval [µ-kσ,
µ+kσ]
Chebyshev’s Theorem: what is it?
• Chebyshev’s Theorem tells us roughly the
percentage of data with values that will fall
within a certain number of standard
deviations of the mean.
• Chebyshev's Theorem applies to all
distributions regardless of their shape and can
therefore be used for non normal distributions
and distributions where the shape is
unknown.
Chebyshev’s Theorem: comparision
Minimum percentage of the
population lie within the interval
Chebyshev’s Theorem: Example
One study of heights in the U.S. concluded that women
have a mean height of 65 inches with a standard
deviation of 2.5 inches. If this is correct, what is the
percentage of women that have heights between 60
and 70 inches according to Chebyshev’s Theorem? If
we assume the distribution is a normal distribution,
what is the percentage of women with heights
between 60 and 70 inches?
Since the interval between 60 & 70 inches is 2 σ away from the
mean, according to Chebyshev’s Theorem, there are at least
100(1 - 1/22 )% or 75% of women with heights between 60
and 70 inches. But if we know the distribution is a normal
distribution, then there is 95.44% of the population
measurements lie within two standard deviations from the
mean.
z Scores
• For any x in a population or sample, the associated z
score is defined as
x  mean
z
standard deviation
Z-score is: the exact location of a score
within a distribution
The z-score transforms each X value into a signed
number so that
1. The sign tells whether the score is located
above (+) or below (-) the mean, and
2. The number tells the distance between the
score and the mean in terms of the number of
standard deviations.
69
Z=(76-70)/3=2
Z=(76-70)/12=0.5
Two distributions of exam scores. For both distributions,  = 70, but
for one distribution, σ = 3, and for the other, σ = 12.
The position of X = 76 is very different for these two distributions.
70
z Scores: application
What if you score a 65 on a test?
On the surface it might look bad.
But what if that was the mean is 45 and the standard deviation is 10.
Then your z-Score is
Z = (65 – 45)/10 = 2.
That means you are 2 σ above the mean value.
From Chebyshev’s Theorem, you are at worse in the top 12.5% of the class if the
distribution is symmetric.
If the distribution is normal you are in the top 2-3% of the class.
z Scores: assignment
In order to get an “A”, you should be in the top 10% of the
class.
What should be your z-Score to get an “A”, according to
Chebyshev’s Theorem and assuming the distribution is
symmetrical?
Example: z Score
• Population of profit margins for five American
companies: 8%, 10%, 15%, 12%, 5%
• µ = 10%, σ = 3.406%
74
TRANSFORMING z-SCORES TO A DISTRIBUTION
WITH A PREDETERMINED mean and standard
deviation
An instructor gives an exam to a psychology class. For this exam, the distribution
of raw scores has a mean of 57 with sd= 14. The instructor would like to simplify
the distribution by transforming all scores into a new, standardized distribution
with mean= 50 and sd= 10. To demonstrate this process, we will consider what
happens to two specific students: Joe, who has a raw score of X= 64 in the original
distribution; and Maria, whose original raw score is X= 43.
Step 1: Transform each of the original raw scores into z-scores. For Joe, X= 64, so his
z-score is 0.5
For Maria, X= 43, and her z-score is -1.0
Step 2: Change the z-scores to the new, standardized scores.
Joe’s z-score, z=0.5, indicates that he is above the mean by exactly 1/2 standard
deviation, so his standardized score would be 55.
Maria’s score is located 1 standard deviation below the mean. So her new score
would be X= 40.
75
Example: A psychologist has developed a new intelligence test.
For years, the test has been given to a large number of people;
for this population, mean= 65 and sd= 10. The psychologist would
like to make the scores of this test comparable to scores on other
IQ tests, which have mean= 100 and sd= 15. If the test is
standardized so that it is comparable (has the same mean and sd)
to other tests, what would be the standardized scores for the
following individuals?
For A
B
C
z=(75-65)/10=1
z=(45-65)/10=-2
z=(67-65)/10=0.2
standardized score=100+1x15=115
100-2x15=70
100+0.2x15=103
76
Coefficient of Variation
• Measures the size of the standard deviation relative
to the size of the mean
• Coefficient of variation =standard deviation/mean ×
100%
• Used to:
– Compare the relative variabilities of values about the
mean
– Compare the relative variability of populations or samples
with different means and different standard deviations
– Measure risk
Covariance, Correlation, and the
Least
Squares
Line
• When points on a scatter plot seem to
fluctuate around a straight line, there is a
linear relationship between x and y
• A measure of the strength of a linear
relationship is the covariance sxy
 x  x y  y 
n
s xy 
i 1
i
i
n 1
Covariance
• A positive covariance indicates a positive
linear relationship between x and y
– As x increases, y increases
• A negative covariance indicates a negative
linear relationship between x and y
– As x increases, y decreases
Interpretation of the Sample Covariance
Interpretation of the Sample Covariance
Continued
• Points in quadrant I correspond to xi and yi both
greater than their averages
– (x- x ) and (y-y̅) both positive so covariance positive
• Points in quadrant III correspond to xi and yi both less
than their averages
– (x- x ) and (y-y̅) both negative so covariance positive
• If sxy is positive, the points having the greatest
influence are in quadrants I and III
• Therefore, a positive sxy indicates a positive linear
relationship
Interpretation of the Sample Covariance
Continued
• Points in quadrant II correspond to xi less than x
and yi greater than y
– (x- x) is negative and (y-y̅) is positive so covariance
negative
• Points in quadrant IV correspond to xi greater than x
and yi less than y̅
– (x- x) is positive and (y- y̅) is negative so covariance
negative
• If sxy is negative, the points having the greatest
influence are in quadrants II and IV
• Therefore, a negative sxy indicates a negative linear
relationship
Correlation Coefficient
• Magnitude of covariance does not indicate
the strength of the relationship
– Magnitude depends on the unit of
measurement used for the data
• Sample correlation coefficient (r) is a
measure of the strength of the relationship
that does not depend on the magnitude of
the data
r
s xy
sx s y
Correlation Coefficient
Continued
• Sample correlation coefficient r is always
between -1 and +1
– Values near -1 show strong negative correlation
– Values near 0 show no correlation
– Values near +1 show strong positive correlation
• Sample correlation coefficient is the point
estimate for the population correlation
coefficient ρ
Least Squares Line
• If there is a linear relationship between x and
y, might wish to predict y on the basis of x
• This requires the equation of a line describing
the linear relationship
• Line is calculated based on least squares line
– Discussed in detail in Chapter 13
Least Squares Line
Continued
• Need to calculate slope (b1) and y-intercept
(b0)
s xy
b1  2
•
sx
•
b0  y  b1 x
Least Squares Line for the Sales Volume Data
Weighted Means
• Sometimes, some measurements are more
important than others
– Assign numerical “weights” to the data
• Weights measure relative importance of the value
• Calculate weighted mean as
w x
w
i i
i
where wi is the weight assigned to the ith
measurement xi
Example: Weighted Mean
• June 2001 unemployment rates by census
region
– Northeast, 26.9 million in civilian labor force, 4.1%
unemployment rate
– South, 50.6 million, 4.7% unemployment
– Midwest, 34.7 million, 4.4% unemployment
– West, 32.5 million, 5.0 unemployment
• Want the mean unemployment rate for the US
Example: Weighted Mean
Continued
• Want the mean unemployment rate for the
U.S.
• Calculate it as a weighted mean
– So that the bigger the region, the more heavily it
counts in the mean
• The data values are the regional
unemployment rates
• The weights are the sizes of the regional labor
forces
Example: Weighted Mean
Continued

26 .9  4.1  50 .6  4.7   34 .7  4.4   32 .5  5.0 

26 .9  50 .6  34 .7  25 .5  32 .5
663 .29

 4.58 %
144 .7
• Note that the unweigthed mean is 4.55%,
which underestimates the true rate by 0.03%
– That is, 0.0003  144.7 million = 43,410
workers
Descriptive Statistics for Grouped Data
• Data already categorized into a frequency
distribution or a histogram is called grouped
data
• Can calculate the mean and variance even
when the raw data is not available
• Calculations are slightly different for data from
a sample and data from a population
Descriptive Statistics for Grouped Data (Sample)
• Sample mean for grouped data:
 fi M i
 fi M i
x

n
 fi
• Sample variance for2grouped data:
 f i M i  x 
2
s 
n 1
fi is the frequency for class i
Mi is the midpoint of class i
n = Σfi = sample size
Descriptive Statistics for Grouped Data
(Population)
• Population mean for grouped data:
 fi M i
 fi M i


N
 fi
• Population variance for grouped data:
2


f
M

x

i
i
2 
N
fi is the frequency for class i
Mi is the midpoint of class i
N = Σfi = population size
The Geometric Mean (Optional)
• For rates of return of an investment, use
the geometric mean to give the correct
wealth at the end of the investment
• Suppose the rates of return (expressed as
decimal fractions) are R1, R2, …, Rn for
periods 1, 2, …, n
• The mean of all these returns is the
calculated as the geometric mean:
Rg 
n
1  R1  1  R2  1  Rn  1
3- 96
Mean Deviation
The arithmetic mean of
the absolute values of the
deviations from the
arithmetic mean.
The main features:
 All
values are used in the calculation.
 It is sensitive to unusual data.
 The absolute values are difficult to
manipulate. So, it is not used in
inferential stat.
MD =
X-X
n
Mean Deviation
Chapter Three: Summary
ONE
Calculate the arithmetic mean, median, mode, weighted mean, and the geometric
mean.
TWO
Explain the characteristics of each measure of location.
THREE
Identify the position of the arithmetic mean, median, and mode for both a
symmetrical and a skewed distribution.
Chapter Three: Summary
FOUR
Compute and interpret the range, the mean deviation, the variance, and the standard
deviation of ungrouped data.
FIVE
Explain the characteristics, uses, advantages, and disadvantages of each measure of
dispersion.
SIX
Understand Chebyshev’s theorem and the Empirical Rule as they relate to a set of
observations.