BBA240 Statistics for Economics and Finance
SCHOOL OF BUSINESS, ECONOMICS AND MANAGEMENT
BBA240 STATISTICS/ QUANTITATIVE METHODS FOR BUSINESS
AND ECONOMICS
Unit Two
Moses Mwale
e-mail: [email protected]
Contents
UNIT 2: Numerical Descriptions of Data
2.1 Measures of Central Tendency
   2.1.1 Mean, Median, and Mode
      Mean
         Finding the Mean of Grouped Data
         Finding the Mean of a Frequency Distribution
      Median
      Finding the Median of a Frequency Distribution
      Mode
         Finding the Mode of a Frequency Distribution
2.2 Measures of Variation
   2.2.1 Deviation, Variance, and Standard Deviation
         Finding the Sample Variance and Standard Deviation
         Standard deviation for grouped data
   2.2.2 Chebyshev's Theorem
2.3 Measures of Position
   2.3.1 Quartiles and Interquartile Range
      Calculating Interquartile Range
   2.3.2 Percentiles and other Fractiles
UNIT 2: Numerical Descriptions of Data
2.1 Measures of Central Tendency
In Section 1.4, you learned about the graphical representations of
quantitative data. In this section, you will learn how to supplement
graphical representations with numerical statistics that describe the center
and variability of a data set.
A measure of central tendency is a value that represents a typical, or
central, entry of a data set. The three most commonly used measures of
central tendency are the mean, the median, and the mode.
2.1.1 Mean, Median, and Mode
Mean.
The mean of a data set is the sum of the data entries divided by the
number of entries.
To find the mean of a data set, use one of the following formulas.
• Population Mean: $\mu = \dfrac{\sum x}{N}$
• Sample Mean: $\bar{x} = \dfrac{\sum x}{n}$
The lowercase Greek letter $\mu$ (pronounced mu) represents the population
mean and $\bar{x}$ (read as "x bar") represents the sample mean.
Note that N represents the number of entries in a population and n
represents the number of entries in a sample.
Recall that the uppercase Greek letter sigma, $\Sigma$, indicates a
summation of values.
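The mean formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the function name `mean` is hypothetical, and the data are the seven house prices (in thousands of dollars) used in the median example of this unit. The same arithmetic applies to a population or a sample; only the symbol ($\mu$ or $\bar{x}$) changes.

```python
def mean(values):
    """Return the sum of the entries divided by the number of entries."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# House prices in thousands of dollars (from the median example in this unit)
prices = [312, 257, 421, 289, 526, 374, 497]

# Sum is 2676 and there are 7 entries, so the mean is 2676 / 7
sample_mean = mean(prices)
print(round(sample_mean, 1))
```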
Finding the Mean of grouped Data
If data are presented in a frequency distribution, you can approximate the
mean as follows.
Definition
The mean of a frequency distribution for a sample is approximated by
$\bar{x} = \dfrac{\sum (x \cdot f)}{n}$
where x and f are the midpoints and frequencies of a class, respectively.
Note that $n = \sum f$.
Finding the Mean of a Frequency Distribution
1. Find the midpoint of each class.
$x = \dfrac{(\text{Lower limit}) + (\text{Upper limit})}{2}$
2. Find the sum of the products of the midpoints and the frequencies.
$\sum (x \cdot f)$
3. Find the sum of the frequencies.
$n = \sum f$
4. Find the mean of the frequency distribution.
$\bar{x} = \dfrac{\sum (x \cdot f)}{n}$
Example: Finding the Mean of a Frequency Distribution
Use the frequency distribution below to approximate the mean number of
minutes that a sample of Internet subscribers spent online during their
most recent session.
Solution
$\bar{x} = \dfrac{\sum (x \cdot f)}{n} = \dfrac{2089.0}{50} \approx 41.8$
So, the mean time spent online was approximately 41.8 minutes.
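The four steps for the mean of a frequency distribution can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `grouped_mean` is hypothetical, and because the Internet-usage table is not reproduced here, the classes and frequencies are those of the games survey (classes 1-5, 6-10, 11-15, 16-20 with frequencies 2, 7, 8, 3) used in the median example of this unit.

```python
def grouped_mean(classes, frequencies):
    """Approximate the mean of grouped data from class limits and frequencies."""
    # Step 1: midpoint of each class = (lower limit + upper limit) / 2
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    # Step 2: sum of the products of midpoints and frequencies
    total = sum(x * f for x, f in zip(midpoints, frequencies))
    # Step 3: n is the sum of the frequencies
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the frequencies must not all be zero")
    # Step 4: divide to get the approximate mean
    return total / n


# Games survey from the median example in this unit
classes = [(1, 5), (6, 10), (11, 15), (16, 20)]
frequencies = [2, 7, 8, 3]

# Midpoints are 3, 8, 13, 18, so the sum of products is 6 + 56 + 104 + 54 = 220
approx_mean = grouped_mean(classes, frequencies)  # 220 / 20 = 11.0
print(approx_mean)
```

The raw games data sum to 222, so their exact mean is 11.1; the grouped value of 11.0 is only an approximation because every entry is replaced by its class midpoint.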
Median
Another important measure of central tendency is the median. It is
defined as follows.
Definition: Median
The median is the value of the middle term in a data set that has been
ranked in increasing order.
As is obvious from the definition of the median, it divides a ranked data
set into two equal parts. The calculation of the median consists of the
following two steps:
1. Rank the data set in increasing order.
2. Find the middle term. The value of this term is the median.
Note that if the number of observations in a data set is odd, then the
median is given by the value of the middle term in the ranked data.
However, if the number of observations is even, then the median is given
by the average of the values of the two middle terms.
Example
The following data give the prices (in thousands of dollars) of seven
houses selected from all houses sold last month in a city.
312 257 421 289 526 374 497
Find the median.
Solution
First, we rank the given data in increasing order as follows:
257 289 312 374 421 497 526
Since there are seven homes in this data set and the middle term is the
fourth term, the median is given by the value of the fourth term in the
ranked data.
257 289 312 [374] 421 497 526
Thus, the median price of a house is 374, or $374,000.
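The two-step procedure, including the odd and even cases, can be sketched as follows. The function name `median` is a hypothetical helper written for illustration; the first data set is the house-price example above, and the four-value list is an assumed example added only to show the even case.

```python
def median(values):
    """Rank the data, then return the middle term (or the average of the two middle terms)."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ranked = sorted(values)          # Step 1: rank in increasing order
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                   # odd count: single middle term
        return ranked[mid]
    # even count: average of the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2


prices = [312, 257, 421, 289, 526, 374, 497]
median_price = median(prices)                # fourth of seven ranked values: 374

even_case = median([257, 289, 312, 374])     # (289 + 312) / 2 = 300.5
print(median_price, even_case)
```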
Finding the Median of a Frequency Distribution
To estimate the Median, let's look at an example.
Example: Alex did a survey of how many games each of 20 friends owned, and
got this:
9, 15, 11, 12, 3, 5, 10, 20, 14, 6, 8, 8, 12, 12, 18, 15, 6, 9, 18, 11
The frequency distribution for the data is as follows:

Number of games    Frequency
1 - 5              2
6 - 10             7
11 - 15            8
16 - 20            3

• The groups (1 - 5, 6 - 10, etc.), also called class intervals, are of width 5.
• The numbers 1, 6, 11 and 16 are the lower class boundaries.
• The numbers 5, 10, 15 and 20 are the upper class boundaries.
• The midpoints are halfway between the lower and upper class boundaries,
  so the midpoints are 3, 8, 13 and 18.
The median is in the class where the cumulative frequency reaches half
the sum of the absolute frequencies.
The median is the mean of the middle two numbers (the 10th and 11th
values) and they are both in the 11 - 15 group:
We can say "the median group is 11 - 15"
But if we need to estimate a single median value we can use this
formula:
$\text{Estimated Median} = L + \dfrac{(n/2) - cf_b}{f_m} \times w$
where:
• L is the lower class boundary of the group containing the median
• n is the total number of data
• $cf_b$ is the cumulative frequency of the groups before the median group
• $f_m$ is the frequency of the median group
• w is the group width
For our example:
• L = 11
• n = 20
• $cf_b$ = 2 + 7 = 9
• $f_m$ = 8
• w = 5
$\text{Estimated Median} = 11 + \dfrac{(20/2) - 9}{8} \times 5 = 11 + \dfrac{1}{8} \times 5 = 11.625$
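The estimated-median formula can be checked with a short Python sketch. The function name `estimated_median` and its argument layout are assumptions made for illustration; the lower boundaries 1, 6, 11, 16 follow the convention used in this example, and the loop finds the median group as the first class where the cumulative frequency reaches n/2.

```python
def estimated_median(lower_bounds, frequencies, width):
    """Estimate the median of grouped data: L + ((n/2 - cfb) / fm) * w."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the frequencies must not all be zero")
    half = n / 2
    cumulative = 0  # cumulative frequency of the groups before the current one
    for lower, f in zip(lower_bounds, frequencies):
        if f > 0 and cumulative + f >= half:
            # this is the median group
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("no median group found")


lower_bounds = [1, 6, 11, 16]
frequencies = [2, 7, 8, 3]

# n/2 = 10, cfb = 9, fm = 8, so the estimate is 11 + (1/8) * 5
games_median = estimated_median(lower_bounds, frequencies, width=5)
print(games_median)  # 11.625
```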
Mode
Mode is a French word that means fashion: an item that is most popular
or common. In statistics, the mode represents the most common value in
a data set.
Definition: Mode
The mode is the value that occurs with the highest frequency in a data set.
Example
The following data give the speeds (in miles per hour) of eight cars that
were stopped for speeding violations.
77 82 74 81 79 84 74 78
Find the mode.
Solution
In this data set, 74 occurs twice, and each of the remaining
values occurs only once. Because 74 occurs with the highest frequency, it
is the mode. Therefore, Mode = 74 miles per hour.
A major shortcoming of the mode is that a data set may have none or may have
more than one mode, whereas it will have only one mean and only one median.
For instance, a data set with each value occurring only once has no mode. A data
set with only one value occurring with the highest frequency has only one mode.
The data set in this case is called unimodal.
A data set with two values that occur with the same (highest) frequency has two
modes. The distribution, in this case, is said to be bimodal. If more than two
values in a data set occur with the same (highest) frequency, then the data set
contains more than two modes and it is said to be multimodal.
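The cases described above (no mode, unimodal, bimodal, multimodal) can be illustrated with a small Python sketch. The function name `modes` is hypothetical, and the two short lists at the end are assumed examples added to show the no-mode and bimodal cases; the speed data come from the example above.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency (empty list if no mode)."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # each value occurs only once, so the data set has no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)


speeds = [77, 82, 74, 81, 79, 84, 74, 78]
speed_modes = modes(speeds)           # [74]: unimodal

no_mode = modes([1, 2, 3])            # []: every value occurs once
two_modes = modes([1, 1, 2, 2, 3])    # [1, 2]: bimodal
print(speed_modes, no_mode, two_modes)
```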
Finding the Mode of a Frequency Distribution
Again, looking at our data:

Number of games    Frequency
1 - 5              2
6 - 10             7
11 - 15            8
16 - 20            3

We can easily identify the modal group (the group with the highest
frequency), which is 11 - 15.
We can say "the modal group is 11 - 15".
But the actual Mode may not even be in that group! Or there may be
more than one mode. Without the raw data we don't really know.
But we can estimate the mode using the following formula:
$\text{Estimated Mode} = L + \dfrac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$
where:
• L is the lower class boundary of the modal group
• $f_{m-1}$ is the frequency of the group before the modal group
• $f_m$ is the frequency of the modal group
• $f_{m+1}$ is the frequency of the group after the modal group
• w is the group width
In this example:
• L = 11
• $f_{m-1}$ = 7
• $f_m$ = 8
• $f_{m+1}$ = 3
• w = 5
$\text{Estimated Mode} = 11 + \dfrac{8 - 7}{(8 - 7) + (8 - 3)} \times 5 = 11 + \dfrac{1}{6} \times 5 = 11.833\ldots$
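The estimated-mode formula can be sketched in Python as follows. The function name `estimated_mode` is hypothetical. One assumption goes beyond the text: when the modal group is the first or last class, the missing neighbouring frequency is treated as 0, a case the notes do not cover.

```python
def estimated_mode(lower_bounds, frequencies, width):
    """Estimate the mode of grouped data from the modal group and its neighbours."""
    # index of the modal group (first group with the highest frequency)
    m = max(range(len(frequencies)), key=lambda i: frequencies[i])
    f_m = frequencies[m]
    # Assumption: a missing neighbour (modal group at either end) counts as frequency 0
    f_prev = frequencies[m - 1] if m > 0 else 0
    f_next = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    denominator = (f_m - f_prev) + (f_m - f_next)
    if denominator == 0:
        raise ValueError("all neighbouring frequencies equal the modal frequency")
    return lower_bounds[m] + (f_m - f_prev) / denominator * width


lower_bounds = [1, 6, 11, 16]
frequencies = [2, 7, 8, 3]

# L = 11, fm = 8, fm-1 = 7, fm+1 = 3, so the estimate is 11 + (1/6) * 5
games_mode = estimated_mode(lower_bounds, frequencies, width=5)
print(round(games_mode, 3))
```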
Exercises
1. Explain how the value of the median is determined for a data set that
contains an odd number of observations and for a data set that contains
an even number of observations.
2. Briefly explain the meaning of an outlier. Is the mean or the median a
better measure of central tendency for a data set that contains outliers?
Illustrate with the help of an example.
3. Using an example, show how outliers can affect the value of the mean.
4. Which of the three measures of central tendency (the mean, the median,
and the mode) can be calculated for quantitative data only, and which can
be calculated for both quantitative and qualitative data?
Illustrate with examples.
5. Which of the three measures of central tendency (the mean, the median,
and the mode) can assume more than one value for a data set? Give an
example of a data set for which this summary measure assumes more than
one value.
6. Is it possible for a (quantitative) data set to have no mean, no median, or
no mode? Give an example of a data set for which this summary measure
does not exist.
7. Explain the relationships among the mean, median, and mode for
symmetric and skewed histograms. Illustrate these relationships with
graphs.
8. Prices of cars have a distribution that is skewed to the right with outliers
in the right tail. Which of the measures of central tendency is the best to
summarize this data set? Explain.
9. The following data set belongs to a population:
5 -7 2 0 -9 16 10 7
Calculate the mean, median, and mode.
10. The following data set belongs to a sample:
14 18 -10 8 8 -16
Calculate the mean, median, and mode.
11. The following data give the 2007 gross domestic product (GDP) in
billions of dollars for all 50 states.
The data are entered in alphabetic order by state (Bureau of Economic
Analysis, June 2005).
166   45  247   95 1813  236  216   60  735  397
 62   51  610  246  129  117  154  216   48  269
352  382  255   89  229   34   80  127   57  465
 76 1103  399   28  466  139  158  531   47  153
 34  244 1142  106   25  383  311   58  232   32
a. Calculate the mean and median for these data. Are these values of the mean
and the median sample statistics or population parameters? Explain.
b. Do these data have a mode? Explain.
2.2 Measures of Variation
In this section, you will learn different ways to measure the variation of a
data set. The simplest measure is the range of the set.
Definition: Range
The range of a data set is the difference between the maximum and
minimum data entries in the set. To find the range, the data must be
quantitative.
Range = (Maximum data entry) – (Minimum data entry)
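As a quick illustration of the range, here is a minimal Python sketch. The function name `data_range` is hypothetical, and the data are the seven house prices (in thousands of dollars) from the median example earlier in this unit.

```python
def data_range(values):
    """Return (maximum data entry) - (minimum data entry) for quantitative data."""
    if not values:
        raise ValueError("the range of an empty data set is undefined")
    return max(values) - min(values)


prices = [312, 257, 421, 289, 526, 374, 497]
price_range = data_range(prices)   # 526 - 257 = 269
print(price_range)
```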
Example
2.2.1 Deviation, Variance, and Standard Deviation
As a measure of variation, the range has the advantage of being easy to
compute. Its disadvantage, however, is that it uses only two entries from
the data set.
Two measures of variation that use all the entries in a data set are the
variance and the standard deviation. However, before you learn about
these measures of variation, you need to know what is meant by the
deviation of an entry in a data set.
Definition: Deviation
The deviation of an entry x in a population data set is the difference
between the entry and the mean of the data set.
Deviation of $x = x - \mu$
Example
In the previous example, notice that the sum of the deviations is zero.
Because this is true for any data set, it doesn't make sense to find the
average of the deviations.
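The claim that the deviations always sum to zero follows directly from the definition of the population mean, $\mu = \sum x / N$, which gives $\sum x = N\mu$. A short derivation, using only the symbols already defined in this unit:

```latex
\sum (x - \mu) = \sum x - \sum \mu = \sum x - N\mu = N\mu - N\mu = 0
```

The same argument holds for a sample, with $\bar{x}$ and n in place of $\mu$ and N.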
To overcome this problem, you can square each deviation. When you add
the squares of the deviations, you compute a quantity called the sum of
squares, denoted $SS_x$. In a population data set, the mean of the squares of
the deviations is called the population variance.
The population variance of a population data set of N entries is
Population variance $= \sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
The symbol $\sigma$ is the lowercase Greek letter sigma.
The population standard deviation of a population data set of N entries
is the square root of the population variance.
Population standard deviation $= \sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
How to find the Population Variance and Standard Deviation
1. Find the mean of the population data set.
2. Find the deviation of each entry.
3. Square each deviation.
4. Add to get the sum of squares.
5. Divide by N to get the population variance.
6. Find the square root of the variance to get the population standard
deviation.
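The six steps above can be sketched in Python. This is an illustrative sketch only: the function name `population_variance` and the eight-entry data set are assumptions chosen so that the arithmetic is easy to follow by hand; they do not come from the course notes.

```python
import math


def population_variance(values):
    """Follow the six steps: mean, deviations, squares, sum of squares, divide by N."""
    N = len(values)
    if N == 0:
        raise ValueError("the variance of an empty data set is undefined")
    mu = sum(values) / N                            # Step 1: population mean
    deviations = [x - mu for x in values]           # Step 2: deviation of each entry
    squares = [d ** 2 for d in deviations]          # Step 3: square each deviation
    sum_of_squares = sum(squares)                   # Step 4: sum of squares
    return sum_of_squares / N                       # Step 5: divide by N


# Hypothetical population data: mean is 40 / 8 = 5, sum of squares is 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = population_variance(data)   # 32 / 8 = 4.0
std_dev = math.sqrt(variance)          # Step 6: square root gives 2.0
print(variance, std_dev)
```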
Example
Definition
The sample variance and sample standard deviation of a sample data
set of n entries are listed below.
Sample variance $= s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
Sample standard deviation $= s = \sqrt{s^2} = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Finding the Sample Variance and Standard Deviation
1. Find the mean of the sample data set.
2. Find the deviation of each entry.
3. Square each deviation.
4. Add to get the sum of squares.
5. Divide by n-1 to get the sample variance.
6. Find the square root of the variance to get the sample standard
deviation.
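The sample version differs from the population version only in the divisor, n - 1 instead of N. The sketch below makes that explicit and cross-checks the result against `statistics.variance` from the Python standard library, which also uses the n - 1 divisor. The function name `sample_variance` and the data set are hypothetical, chosen for easy hand arithmetic.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two entries")
    x_bar = sum(values) / n
    sum_of_squares = sum((x - x_bar) ** 2 for x in values)
    return sum_of_squares / (n - 1)


# Hypothetical sample data: mean is 5, sum of squares is 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

s_squared = sample_variance(data)     # 32 / 7
s = math.sqrt(s_squared)              # sample standard deviation
library_value = statistics.variance(data)
print(round(s_squared, 3), round(s, 3))
```

For the same eight entries the population variance would be 32 / 8 = 4, so the n - 1 divisor gives a slightly larger value.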
Example
When interpreting the standard deviation, remember that it is a measure
of the typical amount an entry deviates from the mean. The more the
entries are spread out, the greater the standard deviation.
Standard deviation for grouped data
We have learned that large data sets are usually best represented by
frequency distributions. The formula for the sample standard deviation
for a frequency distribution is
Sample standard deviation $= s = \sqrt{\dfrac{\sum (x - \bar{x})^2 f}{n - 1}}$
where $n = \sum f$ is the number of entries in the data set.
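A minimal Python sketch of the grouped-data formula follows. The function name `grouped_sample_std` is hypothetical, and the data are the class midpoints (3, 8, 13, 18) and frequencies (2, 7, 8, 3) of the games survey from Section 2.1, reused here as an assumed illustration and treated as a sample.

```python
import math


def grouped_sample_std(midpoints, frequencies):
    """Sample standard deviation of a frequency distribution, using class midpoints."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two entries")
    # approximate mean: sum of (midpoint * frequency) divided by n
    x_bar = sum(x * f for x, f in zip(midpoints, frequencies)) / n
    # each squared deviation is weighted by its class frequency
    weighted_ss = sum((x - x_bar) ** 2 * f for x, f in zip(midpoints, frequencies))
    return math.sqrt(weighted_ss / (n - 1))


midpoints = [3, 8, 13, 18]
frequencies = [2, 7, 8, 3]

# mean is 11; weighted sum of squares is 128 + 63 + 32 + 147 = 370; n - 1 = 19
s_grouped = grouped_sample_std(midpoints, frequencies)
print(round(s_grouped, 2))
```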
Example
EXERCISES: CONCEPTS AND PROCEDURES
1. The range, as a measure of spread, has the disadvantage of being
influenced by outliers. Illustrate this with an example.
2. Can the standard deviation have a negative value? Explain.
3. When is the value of the standard deviation for a data set zero?
Give one example. Calculate the standard deviation for the
example and show that its value is zero.
4. Briefly explain the difference between a population parameter and
a sample statistic. Give one example of each.
5. The following data set belongs to a population:
5 -7 2 0 -9 16 10 7
Calculate the range, variance, and standard deviation.
6. The following data set belongs to a sample:
14 18 -10 8 8 -16
Calculate the range, variance, and standard deviation.
7. The following data give the number of shoplifters apprehended
during each of the past 8 weeks at a large department store.
7 10 8 3 15 12 6 11
a. Find the mean for these data. Calculate the deviations of the data
values from the mean. Is the sum of these deviations zero?
b. Calculate the range, variance, and standard deviation.
8. The following data give the prices of seven textbooks randomly
selected from a university bookstore.
$89 $170 $104 $113 $56 $161 $147
a. Find the mean for these data. Calculate the deviations of the data
values from the mean. Is the sum of these deviations zero?
b. Calculate the range, variance, and standard deviation.
9. The following data give the numbers of car thefts that occurred in
a city in the past 12 days.
6 3 7 11 4 3 8 7 2 6 9 15
Calculate the range, variance, and standard deviation.
10. Refer to the data in Exercise 3.23, which contained the numbers of
tornadoes that touched down in 12 states that had the most
tornadoes during the period 1950 to 1994. The data are reproduced here.
1113 2009 1374 1137 2110 1086 1166 1039 1673 2300
1139 5490
Find the variance, standard deviation, and range for these data.
11. The following data give the numbers of pieces of junk mail
received by 10 families during the past month.
41 33 28 21 29 19 14 31 39 36
Find the range, variance, and standard deviation.
12. The following data give the number of highway collisions with
large wild animals, such as deer or moose, in one of the
northeastern states during each week of a 9-week period.
7 10 3 8 2 5 7 4 9
Find the range, variance, and standard deviation.
13. Attacks by stinging insects, such as bees or wasps, may become
medical emergencies if either the victim is allergic to venom or
multiple stings are involved. The following data give the number
of patients treated each week for such stings in a large regional
hospital during 13 weeks last summer.
1 5 2 3 0 4 1 7 0 1 2 0 1
Compute the range, variance, and standard deviation for these
data.
14. The following data give the number of hot dogs consumed by 10
participants in a hot-dog-eating contest.
21 17 32 8 20 15 17 23 9 18
Calculate the range, variance, and standard deviation for these
data.
2.2.2 Chebyshev's Theorem
Chebyshev's theorem gives a lower bound for the area under a curve
between two points that are on opposite sides of the mean and at the same
distance from the mean.
The portion of any data set lying within k standard deviations (k > 1) of the
mean is at least
$1 - \dfrac{1}{k^2}$
• k = 2: In any data set, at least $1 - \dfrac{1}{2^2} = \dfrac{3}{4}$, or 75%, of the data lie within 2
  standard deviations of the mean.
• k = 3: In any data set, at least $1 - \dfrac{1}{3^2} = \dfrac{8}{9}$, or 88.9%, of the data lie within 3
  standard deviations of the mean.
Example
The average systolic blood pressure for 4000 women who were screened
for high blood pressure was found to be 187 mm Hg with a standard
deviation of 22. Using Chebyshev's theorem, find at least what
percentage of women in this group have a systolic blood pressure
between 143 and 231 mm Hg.
Solution
Let $\mu$ and $\sigma$ be the mean and the standard deviation, respectively, of the
systolic blood pressures of these women. Then, from the given
information,
$\mu = 187$ and $\sigma = 22$
To find the percentage of women whose systolic blood pressures are
between 143 and 231 mm Hg, the first step is to determine k.
Each of the two points, 143 and 231, is 44 units away from the mean:
$187 - 143 = 44$ and $231 - 187 = 44$
The value of k is obtained by dividing the distance between the mean and
each point by the standard deviation. Thus,
$k = \dfrac{44}{22} = 2$ and $1 - \dfrac{1}{k^2} = 1 - \dfrac{1}{2^2} = 0.75$
Hence, according to Chebyshev's theorem, at least 75% of the women
have systolic blood pressure between 143 and 231 mm Hg.
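The blood-pressure calculation can be reproduced with a short Python sketch. The function name `chebyshev_lower_bound` is hypothetical; the numbers are those of the example above.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


mu, sigma = 187, 22          # mean and standard deviation from the example
low, high = 143, 231         # interval of interest, symmetric about the mean

k = (high - mu) / sigma      # 44 / 22 = 2.0 standard deviations
bound = chebyshev_lower_bound(k)
print(k, bound)              # 2.0 0.75, i.e. at least 75%
```

With k = 3 the same function gives 8/9, or about 88.9%, matching the second bullet in the statement of the theorem.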
2.3 Measures of Position
A measure of position determines the position of a single value in
relation to other values in a sample or a population data set. There are
many measures of position; however, only quartiles, percentiles, and
percentile rank are discussed in this section.
2.3.1 Quartiles and Interquartile Range
Quartiles are the summary measures that divide a ranked data set into
four equal parts. Three measures will divide any data set into four equal
parts. These three measures are the first quartile (denoted by Q1), the
second quartile (denoted by Q2), and the third quartile (denoted by
Q3). The data should be ranked in increasing order before the quartiles
are determined. The quartiles are defined as follows.
Definition: Quartiles are three summary measures that divide a ranked data set into
four equal parts. The second quartile is the same as the median of a data
set. The first quartile is the value of the middle term among the
observations that are less than the median, and the third quartile is the
value of the middle term among the observations that are greater than the
median.
Approximately 25% of the values in a ranked data set are less than Q1
and about 75% are greater than Q1. The second quartile, Q2, divides a
ranked data set into two equal parts; hence, the second quartile and the
median are the same. Approximately 75% of the data values are less than
Q3 and about 25% are greater than Q3.
The difference between the third quartile and the first quartile for a data
set is called the interquartile range (IQR).
Calculating Interquartile Range
The difference between the third and the first quartiles gives the
interquartile range; that is,
IQR = Interquartile range = Q3 - Q1
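The quartile definition above (Q2 is the median, Q1 and Q3 are the middle terms of the observations below and above the median) can be sketched in Python. The function names are hypothetical, and the data are the seven house prices from Section 2.1, reused as an assumed illustration. Statistical software often uses other interpolation rules, so library results may differ slightly from this hand method.

```python
def middle_value(ranked):
    """Middle term of ranked data (average of the two middle terms if the count is even)."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method described in the text."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ranked[:half]                                    # observations below the median
    upper = ranked[half + 1:] if n % 2 == 1 else ranked[half:]  # observations above it
    return middle_value(lower), middle_value(ranked), middle_value(upper)


prices = [312, 257, 421, 289, 526, 374, 497]
q1, q2, q3 = quartiles(prices)   # 289, 374, 497
iqr = q3 - q1                    # 497 - 289 = 208
print(q1, q2, q3, iqr)
```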
Example
2.3.2 Percentiles and other Fractiles
In addition to using quartiles to specify a measure of position, you can
also use percentiles and deciles. These common fractiles are summarized
as follows.
Percentiles are often used in education and health-related fields to
indicate how one individual compares with others in a group. They can
also be used to identify unusually high or unusually low values. For
instance, test scores and children's growth measurements are often
expressed in percentiles. Scores or measurements in the 95th percentile
and above are unusually high, while those in the 5th percentile and below
are unusually low.
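As a rough illustration of how one value compares with the rest of a group, the sketch below computes a percentile rank. This unit does not state a formula, so the definition used here is an assumption: the percentile rank of x is the percentage of values in the data set that are less than x. The function name `percentile_rank` is hypothetical, and the data are the seven house prices from Section 2.1.

```python
def percentile_rank(values, x):
    """Assumed definition: percentage of values in the data set that are less than x."""
    if not values:
        raise ValueError("the data set must not be empty")
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


prices = [312, 257, 421, 289, 526, 374, 497]

# Four of the seven prices (257, 289, 312, 374) are below 421
rank_421 = percentile_rank(prices, 421)
print(round(rank_421, 1))
```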
Exercise
1. The following data give the weights (in pounds) lost by 15 members of a
health club at the end of 2 months after joining the club.
5 10 8 7 25 12 5 14
11 10 21 9 8 11 18
a. Compute the values of the three quartiles and the interquartile range.
b. Calculate the (approximate) value of the 82nd percentile.
c. Find the percentile rank of 10.
2. The following data give the speeds of 13 cars (in mph) measured by radar,
traveling on I-84.
73 75 69 68 78 69 74
76 72 79 68 77 71
a. Find the values of the three quartiles and the interquartile range.
b. Calculate the (approximate) value of the 35th percentile.
c. Compute the percentile rank of 71.
3. The following data give the numbers of computer keyboards assembled at
the Twentieth Century Electronics Company for a sample of 25 days.
45 52 48 41 56 46 44 42 48 53
51 53 51 48 46 43 52 50 54 47
44 47 50 49 52
a. Calculate the values of the three quartiles and the interquartile range.
b. Determine the (approximate) value of the 53rd percentile.
c. Find the percentile rank of 50.
4. The following data give the numbers of minor penalties accrued by each
of the 30 National Hockey League franchises during the 2007-08 regular
season.
318 336 337 339 362 363 366 369 372 375
378 381 384 385 386 387 390 393 395 403
405 409 417 431 433 434 438 444 461 480
a. Calculate the values of the three quartiles and the interquartile
range.
b. Find the approximate value of the 57th percentile.
c. Calculate the percentile rank of 417.
5. According to Fair Isaac, "The Median FICO (Credit) Score in the U.S. is
723" (The Credit Scoring Site, 2009). Suppose the following data represent
the credit scores of 22 randomly selected loan applicants.
494 728 468 533 747 639 430 690 604 422 356
805 749 600 797 702 628 625 617 647 772 572
a. Calculate the values of the three quartiles and the interquartile
range. Where does the value 617 fall in relation to these quartiles?
b. Find the approximate value of the 30th percentile. Give a brief
interpretation of this percentile.
c. Calculate the percentile rank of 533. Give a brief interpretation of
this percentile rank.
6. The fatality rate on the nation's highways in 2007 was the lowest since
1994, but these numbers are still mind-boggling. The number of persons
killed in motor vehicle traffic crashes, by state, in 2007 is listed here.
1110 84 1066 650 3974 554 277 117 44 3214 1641
138 252 1249 898 445 416 864 985 183 614 417
1088 504 884 992 277 256 373 129 724 413 1333
1675 111 1257 754 455 1491 69 1066 146 1210 3363
299 66 1027 568 431 756 150
a. Draw a dotplot of fatality data.
b. Draw a stem-and-leaf display of these data. Describe how the three
large-valued data are handled.
c. Find the 5-number summary and draw a box-and-whisker display.
d. Find P10 and P90.
e. Describe the distribution of the number of fatalities per state, being
sure to include information learned in parts a through d.
f. Why might it be unfair to draw conclusions about the relative safety
level of highways in the 51 states based on these data?