Statistics for Business and
Economics:
Types of data and descriptives
STT 315: Section 201
Instructor: Abdhi Sarkar
What is statistics?
We see it every day and rely on it. Statistics is
information that can be derived from data. With
sound intuition and mathematical tools we are
able to describe what has occurred until
now and to predict and project
into the future. Statistics has
applications in almost every scientific field,
and especially in business and
economics.
Population and Sample
• Suppose we would like to estimate the fraction of
East Lansing residents who are students.
• In this case, the population is all East Lansing
residents.
• However, surveying the entire population may be
costly, time-consuming and laborious and therefore,
we can do our job by selecting a sample which is “a
good representative of the population”.
Parameter and Statistic
• Parameters are the values we calculate from the
population data.
Population mean, population variance, population
median etc. are examples of parameters.
• Statistics - a word with 2 meanings
– A subject, like mathematics or physics.
– Values we compute from sample data.
Sample mean, sample variance, sample proportion etc.
are examples of statistics.
The singular of statistics is “statistic”.
There are four basic processes in statistics:
1. Data Collection
2. Data Organization
3. Data Analyses
4. Interpretation of the Analyses
There are two broad categories of data:
a. Qualitative / Categorical
Example: Hair Color, Hometown, Nationality, Types of
Cars, Yes or No questions, Blood Type etc.
b. Quantitative / Numerical
Example: Height of a person, Car Mileage, Annual
Income, Age, Property values, etc.
Statistical Methods
Descriptive Statistics
• Involves collection of data
• Organization of data
Data are usually organized by careful
tabulation and by graphing plots
and figures. Descriptive statistics
also involves summarizing
characteristics in terms of Mean,
Median, Mode etc.
Inferential Statistics
• Point Estimation and Interval
estimation
• Testing of Hypotheses
Here lies the true essence of
statistics: understanding the
characteristics of a population
from the sample data
observed from it.
Descriptive Statistics
In order to visualize the observed or collected data, we first need to organize
it. Once the data have been tabulated in a spreadsheet or other
computer software, we use several different plots to describe them.
Data Presentation
• Qualitative Data: Summary Table, Bar Graph, Pie Chart, Pareto Diagram
• Quantitative Data: Dot Plot, Stem-&-Leaf Display, Histogram
Describing Qualitative Data:
Some key terms:
• A class is one of the categories into which qualitative data can be classified.
• The class frequency is the number of observations in the data set falling into
a particular class.
• The class relative frequency is the class frequency divided by the total
number of observations in the data set.
• The class percentage is the class relative frequency multiplied by 100.
Summary Table
• This table lists the different categories
/classes and the corresponding number of
elements for each category.
• The number is obtained by tallying
responses.
• Sometimes these numbers may be
represented as percentages
Major         Frequency/Count
Accounting    50
Business      30
Economics     20
Total         100
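The key terms above can be checked against this summary table. The following is a minimal sketch, assuming the counts are stored in a plain dictionary (the variable names are illustrative, not part of the slides):

```python
# Class frequencies taken from the summary table above.
frequencies = {"Accounting": 50, "Business": 30, "Economics": 20}

# Total number of observations in the data set.
total = sum(frequencies.values())

# Class relative frequency = class frequency / total number of observations.
relative = {major: count / total for major, count in frequencies.items()}

# Class percentage = class relative frequency * 100.
percentage = {major: 100 * rf for major, rf in relative.items()}

for major in frequencies:
    print(f"{major:<12}{frequencies[major]:>5}{relative[major]:>8.2f}{percentage[major]:>8.1f}%")
```

The relative frequencies always add up to 1, which is a quick way to check a hand-built table.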
Bar Graph and Pareto Diagram
• A bar graph is a chart that uses vertical
bars to show comparisons among
categories.
• The X-axis of the chart shows the
specific categories being compared, and
the Y-axis represents a discrete value.
• Each bar is of equal width.
• The heights of the bars may show the
frequency or relative frequency (in %).
• A Pareto diagram is a bar graph in which the bars are
arranged in descending order of height.
Pie Chart
• This chart gives a breakdown of all the
categories by dividing the circle in
terms of the angle proportional to the
frequencies in each category.
• Its utility mainly lies in showing relative
differences among categories.
[Figure: Bar Graph of frequency by major (Business, Economics, Accounting)]
[Figure: Pareto Diagram of the same data, bars in descending order (Accounting, Business, Economics)]
[Figure: Pie Chart of majors: Accounting 50%, Business 30%, Economics 20%]
Describing Quantitative Data:
Consider the data set of pulse rates, in beats per minute, for a group of 30 students.
68 60 76 68 64 80 72 76 92 68 56 72 68 60 84
72 56 88 76 80 68 80 84 64 80 72 64 68 76 72
Dot Plot:
1. Horizontal axis is a scale for the quantitative variable.
2. The numerical value of each measurement is represented on the horizontal scale
by a dot.
[Figure: dot plot of the pulse rate data; horizontal axis from 50 to 90 beats per minute]
Stem and Leaf plot:
1. Each observation of a quantitative variable is divided into a stem and a leaf.
2. The stem is usually the tens place, or a combination of the hundreds and tens
places, of the value.
3. The leaf is the units place; the leaves for each stem are placed alongside
each other in ascending order.
This makes it easy to read the data in ascending order. Below is the stem and leaf
plot for the same data used in the dot plot.
Stem | Leaf
5    | 6 6
6    | 0 0 4 4 4 8 8 8 8 8 8
7    | 2 2 2 2 2 6 6 6 6
8    | 0 0 0 0 4 4 8
9    | 2
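The stem and leaf construction can be sketched in a few lines of code. This is only an illustration, assuming two-digit values so that the stem is the tens digit and the leaf is the units digit; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Pulse rates (beats per minute) for the 30 students.
pulse = [68, 60, 76, 68, 64, 80, 72, 76, 92, 68, 56, 72, 68, 60, 84,
         72, 56, 88, 76, 80, 68, 80, 84, 64, 80, 72, 64, 68, 76, 72]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits) in ascending order."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(pulse)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```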
Histogram:
1. The possible numerical values of the
quantitative variable are partitioned
into class intervals, where each
interval has the same width.
2. These intervals form the scale of the
horizontal axis.
3. The frequency or relative frequency of
observations in each class interval is
determined.
4. A vertical bar is placed over each
class interval, with height equal to
either the class frequency or class
relative frequency.
5. Each bar is immediately adjacent to
its neighbours.
The histogram for the pulse rate data is
shown:
[Figure: histogram of the pulse rate data]
Numerical Measures of Central
Tendency
Central tendency is the tendency of the data to cluster, or center, about certain
numerical values. It is usually measured by the Mean, Median or
Mode.
Mean:
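1. Most common measure of central tendency
2. Acts as ‘balance point’
3. Affected by extreme values (‘outliers’)
4. Denoted by $\bar{x}$
Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
where $n$ = number of observations and $x_i$ = the $i$-th observation.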
Example: Mean pulse rate = (56+56+60+……+88+92)/30= 72.1333
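As a quick check of this example, the mean can be computed directly from its definition (sum of the observations divided by n). A minimal sketch using the pulse rate data:

```python
# Pulse rates (beats per minute) for the 30 students.
pulse = [68, 60, 76, 68, 64, 80, 72, 76, 92, 68, 56, 72, 68, 60, 84,
         72, 56, 88, 76, 80, 68, 80, 84, 64, 80, 72, 64, 68, 76, 72]

n = len(pulse)          # number of observations
mean = sum(pulse) / n   # sum of all observations divided by n

print(n, sum(pulse), round(mean, 4))
```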
Median:
1. The median is a measure of central tendency but it is a positional value.
2. When the data is ordered in ascending order, the median is the mid point of
the data set.
3. The position of the Median is $\frac{n+1}{2}$. Two situations arise:
a. When n is odd: e.g. n=9, the Median is at position (9+1)/2 = 5.
b. When n is even: e.g. n=10, the position is (10+1)/2 = 5.5, i.e.
the Median is the average of the 5th and 6th values in the ordered data set.
4. The median is not affected by extreme values.
Example:
Raw Data:            24.1  22.6  21.5  23.7  22.6
Ordered:             21.5  22.6  22.6  23.7  24.1
Position of Median:  1     2     3     4     5
Median = 22.6, since n=5 is odd and (n+1)/2 = 3.

Raw Data:            10.3  4.9   8.9   11.7  6.3   7.7
Ordered:             4.9   6.3   7.7   8.9   10.3  11.7
Position of Median:  1     2     3     4     5     6
Median = (7.7+8.9)/2 = 8.3, since n=6 is even and (n+1)/2 = 3.5.
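The position rule for the median translates directly into code. The sketch below is one possible implementation; the function name `median_by_position` is an illustrative choice, and Python's 0-based indexing means position (n+1)/2 corresponds to index n // 2 when n is odd.

```python
def median_by_position(values):
    """Median found from the position (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the value at 1-based position (n + 1) / 2.
        return ordered[mid]
    # n even: average of the two values on either side of position (n + 1) / 2.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_position([24.1, 22.6, 21.5, 23.7, 22.6]))
print(median_by_position([10.3, 4.9, 8.9, 11.7, 6.3, 7.7]))
```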
Quartiles: Just like the median above, where we find the position by splitting
the data into 2 parts, for quartiles we split the data into four parts.
• First Quartile (Q1): Median of the first half of the data
• Third Quartile (Q3): Median of the second half of the data
• Second Quartile (Q2) is the same as the median.
Mode:
1. The mode is the value with the highest observed frequency.
2. It is not affected by extreme values.
3. In some cases data may have multiple modes. A mode need not
necessarily be unique.
Example:
Raw Data:
68 60 76 68 64 80 72 76 92 68 56 72 68 60 84
72 56 88 76 80 68 80 84 64 80 72 64 68 76 72
The value with the highest frequency, i.e. the value that appears most times in the
data is 68. This can be verified from the dot plot on a previous slide.
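Counting how often each value occurs gives the mode without needing the dot plot. A minimal sketch, which also returns every tied value in case the mode is not unique:

```python
from collections import Counter

# Pulse rates (beats per minute) for the 30 students.
pulse = [68, 60, 76, 68, 64, 80, 72, 76, 92, 68, 56, 72, 68, 60, 84,
         72, 56, 88, 76, 80, 68, 80, 84, 64, 80, 72, 64, 68, 76, 72]

counts = Counter(pulse)                 # value -> observed frequency
highest = max(counts.values())          # the largest frequency
# All values that reach the largest frequency (more than one if there are ties).
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes, highest)
```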
Uses of Mode: The mode is useful in business scenarios that involve sales. For
example, the most commonly worn shoe sizes are the mid-range ones, because most
people have mid-sized feet, so it would be wasteful to produce many shoes that are
too large or too small.
Effect of Linear Transformation
• Suppose every observation is multiplied by a fixed
constant. Then
median of transformed observations is the median of the
original observations times that same constant.
mean of transformed observations is the mean of the
original observations times that same constant.
Data: 10, 13, 18, 22, 29
Mean = 18.40.
Median = 18.
Suppose transformed data = (-3)*original data.
So transformed data: -30, -39, -54, -66, -87
Mean = (-3)*18.40 = -55.20.
Median = (-3)*18 = -54.
Effect of Linear Transformation
• Suppose a fixed constant is added to (or subtracted
from) each observation. Then
median of transformed observations is the median of the
original observations plus (or minus) that same constant.
mean of transformed observations is the mean of the
original observations plus (or minus) that same constant.
Data: 10, 13, 18, 22, 29
Mean = 18.40.
Median = 18.
Suppose transformed data = original data + 2.5.
Hence transformed data: 12.5, 15.5, 20.5, 24.5, 31.5
Mean = 18.40 + 2.5 = 20.90. Median = 18 + 2.5 = 20.50.
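Both transformation rules for the mean and median can be verified numerically on the same data. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

data = [10, 13, 18, 22, 29]

# Multiply every observation by a fixed constant.
scaled = [-3 * x for x in data]
# Add a fixed constant to every observation.
shifted = [x + 2.5 for x in data]

print("original:", mean(data), median(data))
print("times -3:", mean(scaled), median(scaled))
print("plus 2.5:", mean(shifted), median(shifted))
```

Note that multiplying by a negative constant reverses the order of the data, but the middle observation stays in the middle, so the median rule still holds.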
Spread of a Distribution
Are the values concentrated around the center of the
distribution or are they spread out?
Range,
Interquartile Range,
Variance,
Standard Deviation.
Note: Variance and standard deviation are more
appropriate when the distribution is symmetric.
Range
• Range of the data is defined as the difference
between the maximum and the minimum values.
• Data: 23, 21, 67, 44, 51, 12, 35.
Range = maximum – minimum = 67 – 12 = 55.
• Disadvantage: A single extreme value can make it very
large, giving a value that does not really represent the
data overall. On the other hand, it is not affected at all
if some observation changes in the middle.
Interquartile Range (IQR)
• What is IQR?
IQR = Third Quartile (Q3) – First Quartile (Q1).
• What are quartiles?
Recall: Median divides the data into 2 equal
halves.
The first quartile, median and the third quartile
divide the data into 4 roughly equal parts.
Quartiles
• The first quartile (Q1, lower quartile) is that value
which is larger than 25% of observations, but smaller
than 75% of observations.
• The second quartile (Q2) is the median, which is
larger than 50% of observations, but smaller than
50% of observations.
• The third quartile (Q3, upper quartile) is that value
which is larger than 75% of observations, but smaller
than 25% of observations.
• In general, Q1 ≤ Q2 (= median) ≤ Q3.
• How to compute the quartiles?
We shall use TI 83/84 Plus.
IQR vs. Range
• IQR is a better summary of the spread of a
distribution than the range because it uses
information from the middle of the data,
whereas the range uses only the two
extreme values of the data.
• IQR is less outlier-sensitive than range.
Outlier-sensitivity
• Without the outlier
Data: 10, 13, 17, 21, 28, 32
IQR = 15, Range = 22
• With the outlier
Data: 10, 13, 17, 21, 28, 32, 59
IQR = 19, Range = 49
Conclusion: IQR is less outlier-sensitive than range.
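The comparison can be reproduced with a short script. This sketch assumes the convention that Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the middle observation left out when n is odd; that convention reproduces the numbers on this slide, but other software may use a different quartile rule and give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered data.

    Assumption: when n is odd, the middle observation belongs to neither half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1


def value_range(values):
    return max(values) - min(values)


without = [10, 13, 17, 21, 28, 32]
with_outlier = without + [59]

for label, data in [("without outlier", without), ("with outlier", with_outlier)]:
    print(label, "IQR =", iqr(data), "Range =", value_range(data))
```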
Variance and Standard Deviation
• The sample variance ($s^2$) is defined as:
$s^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]$.
• Subtract the mean from each value, square each
difference, add up the squares, divide by one fewer
than the sample size.
• The sample standard deviation (s) is the positive
square root of the sample variance, i.e.
$s = +\sqrt{s^2}$.
Variance and Standard Deviation
• The larger the variance (and standard deviation),
the more dispersed the observations are around
the mean.
• The unit of variance is square of the unit of
the original data,
whereas standard deviation has the same
unit as the original data.
• Both variance and standard deviation are
more appropriate for symmetric distributions.
Standard Deviation: An Example
Data: 3, 12, 8, 9, 3 (n=5 in this case)
Mean = (3+12+8+9+3)/5 = 35/5 =7.
Data    Deviations from mean    Squared Deviations
----------------------------------------------------
3       3 – 7 = -4              (-4) x (-4) = 16
12      12 – 7 = 5              5 x 5 = 25
8       8 – 7 = 1               1 x 1 = 1
9       9 – 7 = 2               2 x 2 = 4
3       3 – 7 = -4              (-4) x (-4) = 16
----------------------------------------------------
                                Total = 62
Now divide by n-1 = 4: s² = 62/4 = 15.50, so s = √15.5 = 3.94.
Answer: The standard deviation in this example is 3.94
and the variance is 15.50.
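The same steps (subtract the mean, square, add up, divide by n-1, take the square root) can be written out as a short script. This is a sketch of the hand calculation above, not a replacement for the calculator routine used in class.

```python
import math

data = [3, 12, 8, 9, 3]
n = len(data)

mean = sum(data) / n
# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]
# Sample variance: divide the total by n - 1, not n.
variance = sum(squared_deviations) / (n - 1)
# Sample standard deviation: positive square root of the variance.
std_dev = math.sqrt(variance)

print(mean, sum(squared_deviations), variance, round(std_dev, 2))
```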
Effect of Linear Transformation
• Suppose every observation is multiplied by a fixed
constant. Then
range/IQR/standard deviation of transformed observations is
the range/IQR/standard deviation of the original observations
times the absolute value of that same constant.
variance of transformed observations is the variance of the
original observations times the square of that same constant.
Temperature data (in °F): 10, 13, 18, 22, 29
Range = 19 °F, IQR = 14 °F, s = 7.503 °F, s² = 56.30 °F².
Suppose transformed data = (-3)*original data.
So transformed data: -30, -39, -54, -66, -87
Range = |-3|*19 = 57 °F, IQR = |-3|*14 = 42 °F,
s = |-3|*7.503 ≈ 22.51 °F, s² = (-3)²*56.30 = 506.70 °F².
Effect of Linear Transformation
• Suppose a fixed constant is added to (or subtracted
from) each observation. Then
range/IQR/standard deviation/variance of
transformed observations remains the same as
that of the original observations.
Temperature data (in °F): 10, 13, 18, 22, 29
Range = 19 °F, IQR = 14 °F, s = 7.503 °F, s² = 56.30 °F².
Suppose transformed data = original data + 2.5.
Hence transformed data (in °F): 12.5, 15.5, 20.5, 24.5, 31.5
Range = 19 °F, IQR = 14 °F, s = 7.503 °F, s² = 56.30 °F².
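The effect of both transformations on the measures of spread can be checked numerically. The sketch below uses the standard `statistics` module for the sample standard deviation and variance, and assumes the median-of-halves convention for quartiles (middle observation excluded when n is odd); the helper names `iqr` and `spread` are illustrative.

```python
from statistics import median, stdev, variance


def iqr(values):
    """IQR with Q1 and Q3 taken as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])


def spread(values):
    """Collect the four measures of spread for one data set."""
    return {
        "range": max(values) - min(values),
        "iqr": iqr(values),
        "sd": stdev(values),         # sample standard deviation (n - 1 divisor)
        "variance": variance(values),  # sample variance (n - 1 divisor)
    }


data = [10, 13, 18, 22, 29]
cases = [
    ("original", data),
    ("times -3", [-3 * x for x in data]),
    ("plus 2.5", [x + 2.5 for x in data]),
]
for label, values in cases:
    print(label, {name: round(value, 2) for name, value in spread(values).items()})
```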
Chebyshev’s rule
For any distribution, at least $1 - \frac{1}{k^2}$ of the observations
will fall within k standard deviations of the mean, where $k \ge 1$.
• Chebyshev’s rule is for any distribution, whereas the
empirical rule is valid only for approximately symmetric
unimodal (mound-shaped) distribution.
• If k=1, not much information is available from
Chebyshev’s rule.
• According to Chebyshev at least 75% observations fall
within 2 standard deviations of mean.
• According to Chebyshev at least 88.9% of observations
fall within 3 standard deviations of mean.
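The percentages quoted above come straight from the formula 1 - 1/k². A small sketch (the function name `chebyshev_lower_bound` is illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("Chebyshev's rule is stated here for k >= 1")
    return 1 - 1 / k**2


for k in (1, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.1f}%")
```

For k = 1 the bound is 0%, which is why the rule gives no useful information in that case.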
Empirical rule
For approximately symmetric unimodal (bell-shaped/mound-shaped) distribution
• Approximately 68% of observations fall within 1
standard deviation of mean.
• Approximately 95% of observations fall within 2
standard deviations of mean.
• Approximately 99.7% of observations fall within 3
standard deviations of mean.
Box Plot
Box plot is another graphical representation of
quantitative data using the following 5 number
summary:
1. Minimum Value,
2. Lower Quartile,
3. Median (the middle value),
4. Upper Quartile,
5. Maximum Value.
NOTE: Data must be ordered from lowest
value to highest value before finding the
5 number summary.
Box Plots
• Are a representation of the five
number summary (Minimum,
Maximum, Median, Lower
Quartile, Upper Quartile).
• Half the data are in the box
• One-quarter of the data are in
each whisker.
• If one part of the plot is long,
the data are skewed.
• Box-plot is very useful for
comparing distributions
• This box plot indicates data are
skewed to the left.
Box Plot
• Box Plot is a pictorial representation of the 5-number
summary.
Outliers
• Any observation farther than 1.5
times IQR from the closest
boundary of the box is an outlier.
• If it is farther than 3 times IQR, it is
an extreme outlier, otherwise a
mild outlier.
• One can also indicate the outliers in
a box plot, by drawing the whiskers
only up to 1.5 times IQR on both
sides, and indicating outliers with
stars or crosses (or other symbols).
An example
Suppose
min = 2, Q1 = 18, median = 20, Q3 = 22, max = 35.
Which of the following observations are
outliers?
A. 10
B. 15
C. 25
D. 30
Lower Fence = Q1 - 1.5*IQR = 18 - 1.5(22-18) = 12
Upper Fence = Q3 + 1.5*IQR = 22 + 1.5(22-18) = 28
Note: All observations below the lower fence
or above the upper fence are considered to be
outliers. Among the options, 10 and 30 are outliers.
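The fence calculation can be sketched as follows, using the quartiles and the four candidate values from this example. The function name `fences` and the `multiplier` parameter are illustrative; 1.5 is the multiplier stated on the Outliers slide.

```python
def fences(q1, q3, multiplier=1.5):
    """Lower and upper fences: multiplier * IQR beyond the box boundaries."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


lower, upper = fences(18, 22)

candidates = {"A": 10, "B": 15, "C": 25, "D": 30}
# An observation is an outlier if it lies below the lower or above the upper fence.
outliers = {label: x for label, x in candidates.items() if x < lower or x > upper}

print(lower, upper, outliers)
```

Calling `fences(18, 22, multiplier=3)` gives the boundaries beyond which an observation counts as an extreme outlier.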
Histogram vs. Box plot
• Both histogram and box plot capture the
symmetry or skewness of distributions.
• Box plot cannot indicate the modality of the
data.
• Box plot is much better in finding outliers.
• The shape of histogram depends to some
extent on the choice of bins.
Comparing Distributions
We can compare between distributions of
various data-sets using
Box Plots (or the 5-Number Summary),
Histograms.
We shall first compare distributions using box
plots.
Which type of car has the largest median Time to
accelerate?
A. upscale
B. sports
C. small
D. large
E. family
Which type of car has the smallest median
time value?
A. upscale
B. sports
C. small
D. large
E. luxury
Which type of car always takes less than 3.6
seconds to accelerate?
A. upscale
B. sports
C. small
D. large
E. luxury
Which type of car has the smallest IQR
for Time to accelerate?
A. upscale
B. sports
C. small
D. large
E. luxury
What is the shape of the distribution of
acceleration times for luxury cars?
A. Left skewed
B. Right skewed
C. Roughly symmetric
D. Cannot be determined from the information given.
What percent of luxury cars accelerate to 30
mph in less than 3.5 seconds?
A. Roughly 25%
B. Exactly 37.5%
C. Roughly 50%
D. Roughly 75%
E. Cannot be determined from the information given
What percent of family cars accelerate
to 30 mph in less than 3.5 seconds?
A. Less than 25%
B. More than 50%
C. Less than 50%
D. Exactly 75%
E. None of the above
Z-Scores
How to compare apples with oranges?
• A college admissions committee is looking at
the files of two candidates, one with a total
SAT score of 1500 and another with an ACT
score of 22. Which candidate scored better?
• How do we compare things when they are
measured on different scales?
• We need to standardize the values.
How to standardize?
• Subtract mean from the value and then divide
this difference by the standard deviation.
• The standardized value = the z-score:
$z = \frac{\text{value} - \text{mean}}{\text{std. dev.}}$
• z-scores are free of units.
z-scores: An Example
Data: 4, 3, 10, 12, 8, 9, 3 (n=7 in this case)
Mean = (4+3+10+12+8+9+3)/7 = 49/7 =7.
Standard Deviation = 3.65.
Original Value    z-score
------------------------------------------
4                 (4 – 7)/3.65 = -0.82
3                 (3 – 7)/3.65 = -1.10
10                (10 – 7)/3.65 = 0.82
12                (12 – 7)/3.65 = 1.37
8                 (8 – 7)/3.65 = 0.27
9                 (9 – 7)/3.65 = 0.55
3                 (3 – 7)/3.65 = -1.10
------------------------------------------
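The z-scores in this example can be recomputed with the standard `statistics` module, whose `stdev` function uses the n-1 divisor as in the sample standard deviation defined earlier. A minimal sketch:

```python
from statistics import mean, stdev

data = [4, 3, 10, 12, 8, 9, 3]

m = mean(data)    # sample mean
s = stdev(data)   # sample standard deviation (n - 1 divisor)

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(x - m) / s for x in data]

for x, z in zip(data, z_scores):
    print(f"{x:>3}  {z:6.2f}")
```

After standardization the z-scores have mean 0 and standard deviation 1, whatever the original units were.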
Interpretation of z-scores
• The z-scores measure the distance of the data values
from the mean in the standard deviation scale.
• A z-score of 1 means that data value is 1 standard
deviation above the mean.
• A z-score of -1.2 means that data value is 1.2
standard deviations below the mean.
• Regardless of the direction, the further a data value is
from the mean, the more unusual it is.
• A z-score of -1.3 is more unusual than a z-score of
1.2.
How to use z-scores?
• A college admissions committee is looking at the files
of two candidates, one with a total SAT score of 1500
and another with an ACT score of 22. Which
candidate scored better?
• SAT score mean = 1600, std dev = 500.
• ACT score mean = 23, std dev = 6.
• SAT score 1500 has z-score = (1500-1600)/500 = -0.2.
• ACT score 22 has z-score = (22-23)/6 = -0.17.
• ACT score 22 is better than SAT score 1500.
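The comparison above amounts to computing two z-scores and seeing which is larger. A small sketch, using the means and standard deviations given on this slide (the function name `z_score` is illustrative):

```python
def z_score(value, mean, std_dev):
    """Distance of a value from the mean, measured in standard deviations."""
    return (value - mean) / std_dev


sat_z = z_score(1500, mean=1600, std_dev=500)
act_z = z_score(22, mean=23, std_dev=6)

print(round(sat_z, 2), round(act_z, 2))
# The larger z-score is the better result relative to its own scale.
print("ACT" if act_z > sat_z else "SAT")
```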
Which is more unusual?
Heights of adult women have mean of 63.6 in. and std. dev. of 2.5 in.
Heights of adult men have mean of 69.0 in. and std. dev. of 2.8 in.
A. A 58 in tall woman
z-score = (58-63.6)/2.5 = -2.24.
B. A 64 in tall man
z-score = (64-69)/2.8 = -1.79.
C. They are the same.
Using z-scores to solve problems
An example using height data and U.S. Army and
U.S. Marine Corps height requirements
Question: Are the height restrictions set by the
U.S. Army and U.S. Marine Corps more restrictive for
men or for women, or are they roughly the same?
Data from a National Health Survey
Heights of adult women have
– mean of 63.6 in.
– standard deviation of 2.5 in.
Heights of adult men have
– mean of 69.0 in.
– standard deviation of 2.8 in.
Height Restrictions
                     Men Minimum    Women Minimum
U.S. Army            60 in          58 in
U.S. Marine Corps    64 in          58 in
Heights of adult men have
– mean of 69.0 in.
– standard deviation of 2.8 in.
Heights of adult women have
– mean of 63.6 in.
– standard deviation of 2.5 in.
                     Men Minimum               Women Minimum
U.S. Army            60 in, z-score = -3.21    58 in, z-score = -2.24
                     Less restrictive          More restrictive
U.S. Marine Corps    64 in, z-score = -1.79    58 in, z-score = -2.24
                     More restrictive          Less restrictive
Effect of Standardization
• Standardization into z-scores does not change
the shape of the histogram.
• Standardization into z-scores changes the
center of the distribution by making the mean
0.
• Standardization into z-scores changes the
spread of the distribution by making the
standard deviation 1.
Z-score and Empirical Rule
When data are bell shaped, the z-scores of the
data values follow the empirical rule.
Outlier detection with z-score
• The Empirical Rule tells us that if the data are mound-shaped,
then almost all the data points are within
plus or minus 3 standard deviations of the mean. So an
observation whose z-score has absolute value larger than 3 can be
considered an outlier.
2004 Olympics
Women’s Heptathlon
Austra Skujyte (Lithuania): Shot Put = 16.40m, Long Jump = 6.30m.
Carolina Kluft (Sweden): Shot Put = 14.77m, Long Jump = 6.78m.
(all contestants)    Shot Put    Long Jump
Mean                 13.29m      6.16m
Std.Dev.             1.24m       0.23m
n                    28          26
Which performance was better?
A. Skujyte’s shot put,
z-score of Skujyte’s shot put = 2.51.
B. Kluft’s long jump,
z-score of Kluft’s long jump = 2.70.
C. Both were the same.
(all contestants)    Shot Put    Long Jump
Mean                 13.29m      6.16m
Std.Dev.             1.24m       0.23m
n                    28          26
Based on shot put and long jump whose
performance was better?
A. Skujyte’s,
z-score: shot put = 2.51, long jump = 0.61.
Total z-score = (2.51+0.61) = 3.12.
B. Kluft’s,
z-score: shot put = 1.19, long jump = 2.70.
Total z-score = (1.19+2.70) = 3.89.
C. Both were the same.
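The total z-scores can be reproduced from the means and standard deviations over all contestants. The sketch below stores the slide's figures in dictionaries; the key names such as `shot_put` are illustrative.

```python
# (mean, standard deviation) over all contestants, in metres.
events = {"shot_put": (13.29, 1.24), "long_jump": (6.16, 0.23)}

# Each athlete's result in metres.
athletes = {
    "Skujyte": {"shot_put": 16.40, "long_jump": 6.30},
    "Kluft": {"shot_put": 14.77, "long_jump": 6.78},
}


def z_score(value, mean, std_dev):
    return (value - mean) / std_dev


totals = {}
for name, marks in athletes.items():
    # Standardize each event separately so the two events are comparable.
    zs = {event: z_score(marks[event], *events[event]) for event in events}
    totals[name] = sum(zs.values())
    print(name, {event: round(z, 2) for event, z in zs.items()}, round(totals[name], 2))
```

Adding z-scores works here because each one is unit-free; adding the raw distances in metres would let the shot put dominate simply because its values are larger.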