Introduction: Descriptive Statistics

Speaking of Statistics (13)
Instructor: Prof. Ken Tsang
T.A. : Ms Lisa Liu
Office: E409
Tel: 362 0606(office) 362 0630(T.A.)
Email: [email protected] (Instructor)
[email protected] (TA)
What is Statistics all about?
 The subject of statistics involves the study of how
to collect, summarize, analyze and interpret data.
 Data are numerical facts and figures from which
conclusions can be drawn. Such conclusions are
important to the decision-making processes of
many professions and organizations.
Data

Some sources of data are:
 Data distributed by an organization or an individual
 A designed experiment
 A survey
 An observational study


Data can be
• Curves, figures
• Sounds
• Papers, books
• Web, telephone
Distinguished Statisticians in History!
Sir R. A. Fisher (1890-1962)
Karl Pearson (1857-1936)
W. Edwards Deming (1900-1993), the Father of the Quality Evolution
Data Scientist:
The Sexiest Job of the
21st Century
Final Evaluation Proportion
20%
40%
10%
20%
10%
Assignments
Quizzes
Group Project
Mid-term Test
Final Examination
CRA (Criterion-Referenced Assessment)



Adoption of the Criterion-Referenced
Assessment (CRA) for evaluating students’
performance
OBTL Syllabus
CRA model is directly compatible with the
OBTL philosophy.
UIC Regulations on CRA
Assessment rubrics should be developed for
assessment tasks, individually or in combination,
which contribute to 40% of the course grade.
• We will use rubrics for the following assessments:
1. Oral Presentation and Group Presentation (20%)
2. Final Examination (40%)
Rubric for Assessment of Oral and Group Presentation (1)
Criteria for assessment
Content of presentation
• Organization (20% weighting)
• Accuracy and Depth (40% weighting)
Presentation techniques
• Oral English (10% weighting)
• Body language and facial expressions (10% weighting)
• Time management (10% weighting)
Question and Answer Performance
• Responsiveness (10% weighting)
Performance levels
Excellent (4), Good (3), Satisfactory (2), Marginal Pass (1), Fail (0)
Oral and Group Presentation (1)

Choose your teammates

4-5 members in one team

Submit your group form before November

Study rubric for oral presentation (1)

Choose a topic for your team

Prepare your PPT

Oral presentations will be given (roughly) in
the 12th week (Nov-Dec 2014)
Suggested Grade Distribution
Assessment grade system:
• A (not more than 5%)
• A and A- together (not more than 15%)
• A and B grades together, that is A, A-, B+, B and B- (not more than 75%)
• Grades below C, not including C (no limit)
Letter Grade    Academic Performance
A               Excellent
A-              Excellent
B+              Good
B               Good
B-              Good
C+              Satisfactory
C               Satisfactory
D               Marginal Pass
F               Fail
Some notices on this Course



Assignments must be handed in before the deadline.
Assignments will not be accepted after the deadline!
For the mid-term test and final examination, you
may not bring anything except stationery and
water. Mobile phones are not allowed.
For the final examination, we cannot tell you the
score before the AR announces the official results. If
you have any questions about the score, you can check
the marked sheet via the AR.
General Information

Textbook
Essentials of Business Statistics
Bowerman/O'Connell/Murphree/Orris
McGraw Hill, International Edition
ISBN 978-0-07-131471-8

Advantages
 Unified textbook for all the year one students
 More applications
General Information

References




Basic Statistics for Business & Economics, Fifth Edition
D.A. Lind, W.G. Marchal and S.A. Wathen
2006, McGraw Hill, International Edition

Business Statistics: A First Course, Fourth Edition
D.M. Levine, T.C. Krehbiel and M.L. Berenson
2006, Pearson Prentice Hall, New Jersey

Statistics for Business and Economics, Ninth Edition
J.T. McClave, P.G. Benson and T. Sincich
2005, Pearson Prentice Hall, New Jersey

Modern Elementary Statistics, 11th Edition
J.E. Freund
2004, Prentice Hall

Statistics for the Behavioral Sciences, 8th Edition
Frederick J. Gravetter and Larry B. Wallnau
Wadsworth Publishing (December 10, 2008)
Chapter 1
An introduction to Business Statistics
Populations and Samples
Sampling a Population of
Existing Units
Sampling a Process
An Introduction to Survey
Sampling
Section 1.1 Populations(总体) and
Samples(样本)
Population A set of existing units (people, objects, or
events)
1. All of the last year’s graduates of Dartmouth College’s Master
of Business Administration program.
2. All Lincoln Town Cars that were produced last year.
3. All accounts receivable invoices accumulated last year by The
Procter & Gamble Company.
4. All fires reported last month to the Tulsa, Oklahoma, fire
department.
Variable(变量) A measurable characteristic of the population.
We carry out a measurement to assign a value of a
variable to each population unit.
A variable is said to be quantitative(定量的) if its
measurements represent quantities (for example,
“how much” or “how many”). For example, annual
starting salary, age and number of children are all
quantitative.
A variable is said to be qualitative(定性的) or
categorical(属性的) if its value is a descriptive category to
which a population unit belongs. For example, a person’s gender,
the make of an automobile and whether a person who
purchases a product is satisfied with the product are
qualitative.
There are two types of qualitative variables:

Nominative(无顺序分类的):
 Identifier or name
 Unranked categorization
 Example: gender, car color

Ordinal(顺序的):
 All characteristics of nominative plus…
 Rank-order categories
 Ranks are relative to each other
 Example: Low (1), moderate (2), or high (3) risk
Census(普查) An examination of the entire population of
measurements.
Note: A census is usually too expensive, too time consuming, and
too much effort for a large population.
Sample
A selected subset of the units of a population.
[Figure: a sample shown as a subset of the population]
For example, a university graduated 8,742 students
a. This is too large for a census.
b. So, we select a sample of these graduates and
learn their annual starting salaries.
Sample of measurements


Measured values of the variable of interest for the
sample units.
For example, the actual annual starting salaries of the
sampled graduates.
Descriptive statistics
The science of describing the important aspects of a set of
measurements
 For example, for a set of annual starting salaries, we want
to know:
• How much to expect
• What is a high versus low salary
• How much the salaries differ from each other
• If the population is small enough, we could take a census and
not have to sample and make any statistical inferences
• But if the population is too large, then we must sample and
make statistical inferences
Statistical Inference(统计推断)
The science of using a sample of measurements
to make generalizations about the important
aspects of a population of measurements.
• For example, use a sample of starting salaries
to estimate the important aspects of the
population of starting salaries
There is a criterion for choosing a sample: the
information contained in the sample should accurately reflect
the population under study.
The Lady Tasting Tea
Proposition: tea tastes different depending
upon whether the tea was poured
into the milk or the milk
was poured into the tea.
Let us test the
proposition!
Observational Study
---Smoking is harmful to health
Section 1.2 Sampling a Population of Existing
Units
Random sample(随机样本)
A random sample is a sample selected from a population so that:


Each population unit has the same chance of being selected as
every other unit
 Each possible sample (of the same size) has the same chance
of being selected
For example, randomly pick two different people from a group of 15:
• Number the people from 1 to 15, and write their numbers on 15
different slips of paper
• Thoroughly mix the slips and randomly pick two of them
• The numbers on the slips identify the people for the sample
Sample with replacement(有放回抽样)
Replace each sampled unit before picking next unit

The unit is placed back into the population for possible
reselection
 However, the same unit in the sample does not contribute
new information
Sample without replacement(无放回抽样)
A sampled unit is withheld from possibly being
selected again in the same sample

Guarantees a sample of different units
Each sampled unit contributes different information
Sampling without replacement is the usual and customary
sampling method
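The two sampling schemes can be sketched in a few lines of Python. This is only an illustration of the paper-slip example: the group of 15 numbered people comes from the slide, while the seed value and the variable names are arbitrary choices made here.

```python
import random

# The 15 people from the paper-slip example, identified by the numbers 1..15.
people = list(range(1, 16))

# Fixed seed (an arbitrary choice) so that the draw can be repeated.
rng = random.Random(2014)

# Sampling without replacement: a drawn slip is withheld, so the two
# sampled people are always different.
without_replacement = rng.sample(people, k=2)

# Sampling with replacement: each slip goes back before the next draw,
# so the same person may be picked twice.
with_replacement = rng.choices(people, k=2)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

Here `random.sample` can never return the same person twice, whereas `random.choices` can, which is why a repeated unit in a with-replacement sample adds no new information.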

Example 1.1 The Cell Phone Case: Estimating Cell Phone Costs
The bank has 2,136 employees on a 500-minute-per-month plan with a
monthly cost of $50. The bank will
estimate its cellular cost per minute for this plan by
examining the number of minutes used last month by each
of 100 randomly selected employees on this 500-minute
plan.
According to the cellular management service, if the
cellular cost per minute for the random sample of 100
employees is over 18 cents per minute, the bank should
benefit from automated cellular management of its calling
plans.
In order to randomly select the sample of 100 cell
phone users, the bank will make a numbered list of the
2,136 users on the 500-minute plan. This list is called a
frame(设计框架).
The bank can use a random number table, such as
Table 1.1(a), or a computer software package, whose output
is shown in Table 1.1(b), to select the needed sample.
The 100 cellular-usage figures are given in Table 1.2.
Approximately Random Samples
Sometimes it is not possible to list and thus number all
the units in a population. In such a situation we often
select a systematic sample, which approximates a random
sample.
A Systematic Sample(系统抽样)
Randomly enter the population and systematically
sample every kth unit.
Example 1.2 The Marketing Research Case: Rating a
New Bottle Design
To study consumer reaction to a new design, the brand group
will use the “mall intercept method”, in which shoppers at a large
metropolitan shopping mall are intercepted and asked to participate
in a consumer survey. The questionnaire is shown in Figure 1.1.
Each shopper will be exposed to the new bottle design and asked to
rate the bottle image using a 7-point “Likert scale.”
We select a systematic sample. To do this, every 100th shopper
passing a specified location in the mall will be invited to participate
in the survey. During a Tuesday afternoon and evening, a sample of
60 shoppers is selected by using the systematic sampling process.
The 60 composite scores are given in Table 1.3. From this table, we
can estimate that 95 percent of the shoppers would give the bottle
design a composite score of at least 25.
Another Sampling Method
Voluntary response sample
Participants select themselves to be in the sample
• Participants “self-select”
• For example, calling in to vote on American
Idol
• Commonly referred to as a “non-scientific”
sample
Usually not representative of the population
• Over-represent individuals with strong opinions
• Usually, but not always, negative opinions
Section 1.3 Sampling a Process
Process(过程)
A sequence of operations that takes inputs (labor, raw
materials, methods, machines, and so on) and turns them
into outputs (products, services, and the like)
Inputs → Process → Outputs
Processes produce output over time
• The “population” from a process is all output
produced in the past, present, and the yet-to-occur
future.
• For example, all automobiles of a particular make
and model, for instance, the Lincoln Town Car
• Cars will continue to be made over time
The Coffee Temperature Case: Monitoring
Coffee Temperatures
This case concerns monitoring coffee temperatures at a fast-food
restaurant. To do this, the restaurant personnel measure
the temperature of the coffee being dispensed (in degrees
F) at half-hour intervals from 10 A.M. to 9:30 P.M. on a
given day. The data are listed in Table 1.7.
A process is in statistical control if it does not exhibit
any unusual process variations.
To determine if a process is in control or not, sample
the process often enough to detect unusual variations
A runs plot is a graph of individual process
measurements over time. Figure 1.3 shows a runs plot of
the temperature data.
Example 1.3
Figure 1.3 Runs Plot of Coffee Temperatures: The Process
is in Statistical Control.
Results


Over time, temperatures appear to have a fairly constant
amount of variation around a fairly constant level
 The temperature is expected to be at the constant level
shown by the horizontal blue line
 Sometimes the temperature is higher and sometimes
lower than the constant level
 About the same amount of spread of the values (data
points) around the constant level
 The points are as far above the line as below it
 The data points appear to form a horizontal band
So, the process is in statistical control
 Coffee-making process is operating “consistently”
Remark


Because the coffee temperature has been and is presently
in control, it will likely stay in control in the future
• If the coffee-making process stays in control, then
coffee temperature is predicted to be between 152°
and 170° F
In general, if the process appears from the runs plot to be
in control, then it will probably remain in control in the
future
 The sample of measurements was approximately
random
 Future process performance is predictable
Section 1.4 An Introduction to Survey Sampling


Already know some sampling methods
 Also called sampling designs, they are:
 Random sampling
 The focus of this book
 Systematic sampling
 Voluntary response sampling
But there are other sample designs:
 Stratified random sampling(分层随机抽样)
 Cluster sampling(分块抽样)
Stratified Random Sample




Divide the population into non-overlapping groups,
called strata, of similar units
Separately, select a random sample from each and
every stratum
Combine the random samples from each stratum to
make the full sample
Appropriate when the population consists of two or
more different groups so that:
 The groups differ from each other with respect to the
variable of interest
 Units within a group are similar to each other
 For example, divide population into strata by age,
gender, income, etc
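A minimal sketch of stratified random sampling is given below. The slide does not say how many units to take from each stratum, so drawing the same number from every stratum is an assumption made here; the function name `stratified_sample`, the age-group labels and the population itself are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, rng=random):
    """Draw a simple random sample from every stratum and combine them."""
    # Divide the population into non-overlapping strata.
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    # Sample each stratum separately (equal allocation is an assumption).
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of 100 people labelled by age group.
population = (
    [("under 30", i) for i in range(50)]
    + [("30 to 50", i) for i in range(30)]
    + [("over 50", i) for i in range(20)]
)

sample = stratified_sample(population, lambda unit: unit[0], 5, random.Random(7))
print(sample)
```

Every stratum is guaranteed to appear in the combined sample, which a single simple random sample of 15 people would not guarantee.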
Cluster Sampling





“Cluster” or group a population into subpopulations
 Cluster by geography, time, and so on…
Each cluster is a representative small-scale version of the
population (i.e. heterogeneous group)
A simple random sample is chosen from each cluster
Combine the random samples from each cluster to make the
full sample
Appropriate for populations spread over a large geographic
area so that…
 There are different sections or regions in the area with
respect to the variable of interest
 A random sample of the cluster
More on Systematic Sampling





Want a sample containing n units from a population
containing N units
Take the ratio N/n and round down to the nearest whole
number
 Call the rounded result k
Randomly select one of the first k elements from the
population list
Step through the population from the first chosen unit and
select every kth unit
This method has the properties of a simple random sample,
especially if the list of the population elements is a random
ordering
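The steps above can be sketched as follows. The function name `systematic_sample` is hypothetical, and capping the result at n units is a choice made here: because N/n is rounded down, stepping through the whole list can otherwise give slightly more than n units.

```python
import random

def systematic_sample(population, n, rng=random):
    """Systematic sample of n units from an ordered population list."""
    N = len(population)
    k = N // n                      # the ratio N/n rounded down
    if k < 1:
        raise ValueError("sample size n cannot exceed population size N")
    start = rng.randrange(k)        # random position among the first k elements
    # Every kth unit from the starting position, capped at n units.
    return population[start::k][:n]

# Illustration with a frame the size of the one in the Cell Phone Case.
frame = list(range(1, 2137))        # units numbered 1..2136
sample = systematic_sample(frame, 100, random.Random(1))
print(sample[:5])
```

With N = 2,136 and n = 100 the step is k = 21, so the sample consists of a random starting unit among the first 21 and then every 21st unit after it.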
Sampling Problem


Random sampling should eliminate bias
But even a random sample may not be representative
because of:
 Under-coverage
 Too few sampled units or some of the population
was excluded
 Non-response
 When a sampled unit cannot be contacted or
refuses to participate
 Response bias
 Responses of selected units are not truthful
Chapter 2
Descriptive Statistics
Describing the Shape of a Distribution
Describing Central Tendency
Measures of Variation
Percentiles, Quartiles, and Box-and-Whiskers Displays
Describing Qualitative Data
Weighted Means
Section 2.1 Describing the Shape of a
Distribution


To know what the population looks like, find the
“shape” of its distribution
Picture the distribution graphically by any of the
following methods:
 Stem-and-leaf display(茎叶图)
 Frequency distributions(頻率分布表)
 Histogram(直方图)
 Dot plot(点图)
Stem-and-leaf display


The purpose of a stem-and-leaf display is to see
the overall pattern of the data, by grouping the
data into classes
 To see:
 the variation from class to class
 the amount of data in each class
 the distribution of the data within each
class
Best for small to moderately sized data
distributions
Example 2.1
The Car Mileage Case
In this case study, we consider a tax credit offered by
the federal government to automakers for improving
the fuel economy of midsize cars.
To find the combined city and highway mileage
estimate for a particular car model, the EPA tests a
sample of cars.
Table 2.1 presents the sample of 49 gas mileages that
have been obtained by the new midsize model.
Table 2.1 A sample of 49 mileages
30.8
31.7
30.1
31.6
30.9
30.4
32.5
30.3
32.0
31.4
30.8
32.8
32.3
32.7
31.2
30.6
32.6
31.4
31.8
31.9
32.1
33.3
31.3
31.3
32.1
32.5
32.0
31.5
32.4
31.7
31.4
32.2
32.8
31.5
31.6
31.0
32.0
32.4
31.8
30.4
30.5
31.0
29.8
31.1
31.5
31.7
30.6
30.6
32.2
The stem-and-leaf display of car mileages:
29 | 8                       (29 + 0.8 = 29.8)
30 | 13445666889
31 | 00123344455566777889
32 | 0001122344556788
33 | 3                       (33 + 0.3 = 33.3)
Another display of the same data using more classes
• Starred classes (*) extend from 0.0 to 0.4
• Unstarred classes extend from 0.5 to 0.9
29  | 8
30* | 1344
30  | 5666889
31* | 001233444
31  | 55566777889
32* | 0001122344
32  | 556788
33* | 3

Looking at the last stem-and-leaf display, the
distribution appears almost “symmetrical”
• The upper portion of the display (stems 29, 30*, 30, and 31*)
is almost a mirror image of the lower portion of
the display (stems 31, 32*, 32, and 33*)
• But not exactly a mirror reflection
• Maybe slightly more data in the lower portion
than in the upper portion
• Later, we will call this a slightly “left-skewed” distribution
Constructing a Stem-and-Leaf Display
1. Decide what units will be used for the stems and the
leaves. As a general rule, choose units for the stems so
that there will be somewhere between 5 and 20 stems.
2. Place the stems in a column with the smallest stem at
the top of the column and the largest stem at the
bottom.
3. Enter the leaf for each measurement into the row
corresponding to the proper stem. The leaves should
be single-digit numbers (rounded values).
4. If desired, rearrange the leaves so that they are in
increasing order from left to right.
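The four construction steps can be sketched in Python using the 49 mileages of Table 2.1. The function name `stem_and_leaf` is hypothetical, and the sketch assumes data recorded to one decimal place, so that the stem is the whole-number part and the leaf is the tenths digit. Stems with no data are simply omitted.

```python
from collections import defaultdict

# The 49 mileages of Table 2.1, in the order listed.
mileages = [
    30.8, 31.7, 30.1, 31.6, 30.9, 30.4, 32.5,
    30.3, 32.0, 31.4, 30.8, 32.8, 32.3, 32.7,
    31.2, 30.6, 32.6, 31.4, 31.8, 31.9, 32.1,
    33.3, 31.3, 31.3, 32.1, 32.5, 32.0, 31.5,
    32.4, 31.7, 31.4, 32.2, 32.8, 31.5, 31.6,
    31.0, 32.0, 32.4, 31.8, 30.4, 30.5, 31.0,
    29.8, 31.1, 31.5, 31.7, 30.6, 30.6, 32.2,
]

def stem_and_leaf(values):
    """Map each stem (whole number) to its sorted leaves (tenths digits)."""
    rows = defaultdict(list)
    for value in values:
        tenths = round(value * 10)        # 30.8 -> 308, avoids float artefacts
        stem, leaf = divmod(tenths, 10)   # 308 -> stem 30, leaf 8
        rows[stem].append(leaf)
    # Smallest stem first, leaves in increasing order from left to right.
    return {stem: "".join(str(leaf) for leaf in sorted(leaves))
            for stem, leaves in sorted(rows.items())}

display = stem_and_leaf(mileages)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```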
Example 2.2
The Payment Time Case: Reducing
Payment Times
In order to assess the effectiveness of the new billing system, the
consulting firm will study the payment times for invoices
processed during the first three months of the system’s
operation.
During this period, 7,823 invoices are processed using
the new system. To study the payment times of these
invoices, the consulting firm numbers the invoices from
0001 to 7823 and uses random numbers to select a
random sample of 65 invoices. The resulting 65 payment
times are given in Table 2.2.
Table 2.2 A Sample of Payment Times (in Days)
for 65 Randomly Selected Invoices.
22
19
16
18
13
16
29
17
15
23
18
21
16
10
16
22
17
25
15
21
20
16
15
19
18
15
22
16
24
20
17
14
14
19
15
27
12
17
25
13
17
16
13
18
19
18
14
17
17
12
23
24
18
16
16
20
15
24
17
21
15
14
19
26
21
• The number 26 (no parentheses) in the row for 16 days indicates
that there are 26 payments made in 16 or fewer days
• The number 8 in parentheses indicates that there are 8 payments
in the stem for 17 days
• The leftmost column of numbers gives cumulative counts: the
number of values in that stem and in all stems between it and the
nearer end of the display

Count  Stem  Leaves
   1    10   0                 (shorter tail)
   1    11
   3    12   00
   6    13   000
  10    14   0000
  17    15   0000000
  26    16   000000000
  (8)   17   00000000
  30    18   000000
  24    19   00000
  19    20   000
  16    21   000
  13    22   000
  10    23   00
   8    24   000
   5    25   00
   3    26   0
   2    27   0
   1    28
   1    29   0                 (longer tail)
The Payment Times: Results
Looking at this display, we see that all of the sampled
payment times are substantially less than the 39-day
typical payment time of the former billing system.
The stem-and-leaf display does not appear symmetrical.
The “tail” of the distribution consisting of the higher
payment times is longer than the “tail” of the
distribution consisting of the smaller payment times.
We say that the distribution is skewed with a tail to the
right.
Frequency Distribution and Histogram
A frequency distribution is a list of data classes
with the count or “frequency” of values that belong
to each class
• “Classify and count”
• The frequency distribution is a table
Show the frequency distribution in a histogram
• The histogram is a picture of the frequency
distribution
See Example 2.2, The Payment Time Case
Constructing the frequency distribution

Steps in making a frequency distribution:
1. Determine the number of classes K
2. Determine the class length
3. Set the starting value for the classes, that is, the
distribution “floor”
4. Calculate the class limits
5. Setup all the classes

Then tally the data into the K classes and record the
frequencies
The number of classes K


Group all of the n data into K classes
K is the smallest whole number for which
\[ 2^K \ge n \]

In Example 2.2, n = 65
• For K = 6, 2^6 = 64 < n
• For K = 7, 2^7 = 128 > n
• So use K = 7 classes
Class Length L

Class length L is the step size from one class to the next
\[ L = \frac{\text{largest value} - \text{smallest value}}{K} \]

In Example 2.2, The Payment Time Case, the largest
value is 29 days and the smallest value is 10 days, so
\[ L = \frac{(29 - 10)\ \text{days}}{7\ \text{classes}} = \frac{19\ \text{days}}{7\ \text{classes}} \approx 2.7143\ \text{days/class} \]

Arbitrarily round the class length up to 3 days/class
Starting the classes




The classes start on the smallest data value
 This is the lower limit of the first class
The upper limit of the first class is
smallest value + (L – 1)
 In the example, the first class starts at 10 days and goes
up to 12 days
The second class starts at the upper limit of the first class +
1 and goes up (L – 1) more
 The second class starts at 13 days and goes up to 15
days
And so on
Tallies and Frequencies: Example 2.2
Classes (days)   Tally                        Frequency
10 to 12         |||                           3
13 to 15         |||| |||| ||||               14
16 to 18         |||| |||| |||| |||| |||      23
19 to 21         |||| |||| ||                 12
22 to 24         |||| |||                      8
25 to 27         ||||                          4
28 to 30         |                             1
Total                                         65
Check: All frequencies must sum to n
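The whole procedure (number of classes, class length, class limits, tally) can be sketched for the 65 payment times of Table 2.2. The function name `frequency_distribution` is hypothetical, and the sketch assumes whole-number data as in this example. The extra guard that widens the classes when they would not reach the largest value is an addition made here and is not part of the slide's procedure.

```python
# The 65 payment times (in days) of Table 2.2, in the order listed.
payment_times = [
    22, 19, 16, 18, 13, 16, 29, 17, 15, 23, 18, 21, 16,
    10, 16, 22, 17, 25, 15, 21, 20, 16, 15, 19, 18, 15,
    22, 16, 24, 20, 17, 14, 14, 19, 15, 27, 12, 17, 25,
    13, 17, 16, 13, 18, 19, 18, 14, 17, 17, 12, 23, 24,
    18, 16, 16, 20, 15, 24, 17, 21, 15, 14, 19, 26, 21,
]

def frequency_distribution(data):
    """Return (K, L, classes); classes holds (lower, upper, frequency) tuples."""
    n = len(data)

    # Step 1: K is the smallest whole number with 2**K >= n.
    K = 1
    while 2 ** K < n:
        K += 1

    # Step 2: class length = (largest - smallest) / K, rounded up.
    low, high = min(data), max(data)
    L = -(-(high - low) // K)            # ceiling division for whole numbers
    if low + K * L - 1 < high:           # guard added here, see the lead-in
        L += 1

    # Steps 3 to 5: classes start at the smallest value; then tally.
    classes = []
    for j in range(K):
        lower = low + j * L
        upper = lower + L - 1
        frequency = sum(lower <= x <= upper for x in data)
        classes.append((lower, upper, frequency))
    return K, L, classes

K, L, classes = frequency_distribution(payment_times)
for lower, upper, frequency in classes:
    print(f"{lower} to {upper}: {frequency}")
```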
Relative Frequency(相对频率)

The relative frequency of a class is the proportion or
fraction of data that is contained in that class
 Calculated by dividing the class frequency by the
total number of data values
 Relative frequency may be expressed as either a
decimal or percent
 A relative frequency distribution is a list of all
the data classes and their associated relative
frequencies
Relative Frequency: Example 2.2
Classes (days)   Frequency   Relative Frequency
10 to 12          3          3/65 = 0.0462
13 to 15         14          14/65 = 0.2154
16 to 18         23          0.3538
19 to 21         12          0.1846
22 to 24          8          0.1231
25 to 27          4          0.0615
28 to 30          1          0.0154
Total            65          1.0000
Check: All relative frequencies must sum to 1
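As a small illustration, the relative frequencies can be computed directly from the class frequencies of Example 2.2. The dictionary layout and variable names are choices made here.

```python
# Class frequencies from the frequency distribution of Example 2.2.
frequencies = {
    "10 to 12": 3,
    "13 to 15": 14,
    "16 to 18": 23,
    "19 to 21": 12,
    "22 to 24": 8,
    "25 to 27": 4,
    "28 to 30": 1,
}

n = sum(frequencies.values())   # total number of data values

# Relative frequency = class frequency / total number of data values.
relative = {cls: freq / n for cls, freq in frequencies.items()}

for cls, rel in relative.items():
    # Shown both as a decimal and as a percent.
    print(f"{cls}: {rel:.4f} ({rel:.2%})")
```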
Classes     Frequency   Relative Frequency   Boundaries    Midpoint
10 to 12     3          0.0462                9.5, 12.5    11
13 to 15    14          0.2154               12.5, 15.5    14
16 to 18    23          0.3538               15.5, 18.5    17
19 to 21    12          0.1846               18.5, 21.5    20
22 to 24     8          0.1231               21.5, 24.5    23
25 to 27     4          0.0615               24.5, 27.5    26
28 to 30     1          0.0154               27.5, 30.5    29
Total       65          1.0000
Histogram



A graph in which rectangles represent the
classes
The base of the rectangle represents the class
length
The height of the rectangle represents
 the frequency in a frequency histogram, or
 the relative frequency in a relative frequency
histogram
Histogram
Example 2.2: The Payment Times Case
[Figures: Frequency Histogram and Relative Frequency Histogram of the payment times]
As with the earlier stem-and-leaf display, the tail on the
right appears to be longer than the tail on the left.
Example 2.1 The Car Mileage Case
Because 2^5 = 32 < 49 and 2^6 = 64 > 49, we should use K=6 classes.
The largest and smallest
mileages in Table 2.1 are 33.3 and 29.8, so we find the
class length by computing (33.3-29.8)/6=0.5833.
To obtain a more convenient class length, we round this
value up to 0.6.
To form the first class, we start with the smallest
mileage, 29.8, and add 0.5 to obtain the class 29.8-30.3.
Continuing in this way, we obtain all six classes.
Remark: Although we have given a procedure for
determining the number of classes, it is often desirable
to let the nature of the problem determine the classes.
Table: A Frequency Distribution and a Relative
Frequency Distribution of the 49 Mileages
Classes      Freq.   Relative Freq.   Boundaries      Midpoint
29.8-30.3     3      0.0612           29.75, 30.35    30.05
30.4-30.9     9      0.1837           30.35, 30.95    30.65
31.0-31.5    12      0.2449           30.95, 31.55    31.25
31.6-32.1    13      0.2653           31.55, 32.15    31.85
32.2-32.7     9      0.1837           32.15, 32.75    32.45
32.8-33.3     3      0.0612           32.75, 33.35    33.05
Back-to-Back Display
Comparing Two Distributions with a Back-to-Back Histogram
The Normal Curve(正态曲线)
Symmetrical and bell-shaped
curve for a normally distributed
population
The height of the normal curve over
any point represents the relative
proportion of values near that point
Example 2.1, The Car Mileages
Case
Normal distribution in nature
The bean machine is a device
invented by Sir Francis Galton to
demonstrate how the normal
distribution appears in nature. This
machine consists of a vertical board
with interleaved rows of pins. Small
balls are dropped from the top and
then bounce randomly left or right
as they hit the pins. The balls are
collected into bins at the bottom
and settle down into a pattern
resembling the Gaussian curve.
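A rough simulation of the bean machine is sketched below. It idealizes the device by assuming that every pin sends the ball left or right with equal probability, independently of the other pins; the function name `bean_machine`, the numbers of balls and rows, and the seed are arbitrary choices made here.

```python
import random
from collections import Counter

def bean_machine(n_balls=2000, n_rows=12, rng=None):
    """Simulate balls falling through n_rows of pins; return the bin counts."""
    rng = rng or random.Random()
    bins = Counter()
    for _ in range(n_balls):
        # The final bin is the number of times the ball bounced to the right.
        rights = sum(rng.random() < 0.5 for _ in range(n_rows))
        bins[rights] += 1
    return [bins[b] for b in range(n_rows + 1)]

counts = bean_machine(rng=random.Random(0))
for b, count in enumerate(counts):
    print(f"{b:2d} {'#' * (count // 10)}")
```

The printed bars pile up in the middle bins and thin out toward both ends, which is the bell-shaped pattern the slide describes.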
[Figure: histogram of the heights of 1052 women; horizontal axis: Height (in.)]
Distribution of the heights of 1052 women fits the normal distribution,
with a goodness of fit p value of 0.75
[Figure: Histogram of daily percentage changes in the S&P 500 index]
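So what is a normal distribution? Put simply, it means “many in the middle,
few at the two ends”. Take human height as an example: giants and dwarfs
make up a very small share of the population, while people of medium build
make up the largest share. In statistical terms, if we treat height as a
random variable, this pattern says that heights near the average are the
most probable, and the further a height deviates from the average, the
smaller its probability. In natural and social phenomena, a great many
random variables follow, or approximately follow, a normal distribution.
Because P{a-b<X≤a+b}=0.6826, P{a-2b<X≤a+2b}=0.9544 and
P{a-3b<X≤a+3b}=0.9974, we can see that for a normally distributed random
variable X (with mean a and standard deviation b) it is almost certain that
its value falls between a-3b and a+3b. This is the so-called “3b rule”.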
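The three probabilities quoted above can be checked numerically. For a normal variable with mean a and standard deviation b, the probability of falling within k standard deviations of the mean equals erf(k/√2), which Python's standard library can evaluate. The function name `prob_within` is hypothetical.

```python
import math

def prob_within(k):
    """P(a - k*b < X <= a + k*b) for a normal X with mean a and std. dev. b."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

This prints 0.6827, 0.9545 and 0.9973. The small differences in the fourth decimal from the figures quoted above are consistent with those figures having been read from a rounded four-decimal normal table.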
Skewness(偏度)
Skewed distributions are not symmetrical about their
center. Rather, they are lop-sided with a longer tail on
one side or the other.
• A population is distributed according to its relative
frequency curve
• The skew is the side with the longer tail
[Figure: three relative frequency curves: Left Skewed, Symmetric, Right Skewed]
Section 2.2 Describing Central
Tendency
Population Parameters(总体参数)
A population parameter is a number calculated from all
the population measurements that describes some
aspect of the population
The population mean, denoted μ, is a population
parameter and is the average of the population
measurements
Point Estimates and Sample Statistics
A point estimate(点估计) is a one-number estimate of
the value of a population parameter
A sample statistic is a number calculated using sample
measurements that describes some aspect of the sample
 Use sample statistics as point estimates of the
population parameters
The sample mean, denoted x̄, is a sample statistic and is
the average of the sample measurements
• The sample mean is a point estimate of the population
mean
Measures of Central Tendency
Mean, μ: The average or expected value
Median, Md: The value of the middle point of
the ordered measurements
Mode, Mo: The most frequent value
The Mean(均值)
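Population: X_1, X_2, …, X_N, with population mean μ
Sample: x_1, x_2, …, x_n, with sample mean x̄
Population Mean
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} \]
Sample Mean
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]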
The Sample Mean(样本均值)
For a sample of size n, the sample mean is defined as
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n} \]
and is a point estimate of the population mean μ
• It is the value to expect, on average and in the long run
Mean as the balance point for a distribution
Data: 2, 2, 6, 10
mean=(2+2+6+10)/4=5
What will happen to the mean if we add one more
number to the data?
Example: Car Mileage Case
Sample mean for first five car mileages from Table 2.1
30.8, 31.7, 30.1, 31.6, 32.1
\[ \bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} \]
\[ \bar{x} = \frac{30.8 + 31.7 + 30.1 + 31.6 + 32.1}{5} = \frac{156.3}{5} = 31.26 \]
Example: Car Mileage Case Continued
Sample mean for all the car mileages from Table 2.1
\[ \bar{x} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1546.1}{49} = 31.5531 \]
Based on this calculated sample mean, the point
estimate of mean mileage of all cars is 31.5531 mpg
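Both sample means can be reproduced with a few lines of Python. The data are the five mileages used in the first example and the 49 mileages of Table 2.1; the variable names are choices made here.

```python
first_five = [30.8, 31.7, 30.1, 31.6, 32.1]

# The 49 mileages of Table 2.1, in the order listed.
mileages = [
    30.8, 31.7, 30.1, 31.6, 30.9, 30.4, 32.5,
    30.3, 32.0, 31.4, 30.8, 32.8, 32.3, 32.7,
    31.2, 30.6, 32.6, 31.4, 31.8, 31.9, 32.1,
    33.3, 31.3, 31.3, 32.1, 32.5, 32.0, 31.5,
    32.4, 31.7, 31.4, 32.2, 32.8, 31.5, 31.6,
    31.0, 32.0, 32.4, 31.8, 30.4, 30.5, 31.0,
    29.8, 31.1, 31.5, 31.7, 30.6, 30.6, 32.2,
]

# Sample mean = sum of the measurements / number of measurements.
mean_first_five = sum(first_five) / len(first_five)
mean_all = sum(mileages) / len(mileages)

print(round(mean_first_five, 2))   # 31.26
print(round(sum(mileages), 1))     # 1546.1
print(round(mean_all, 4))          # 31.5531
```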
The Median(中位数)
The population or sample median Md is a value such that
50% of all measurements, after having been arranged in
numerical order, lie above (or below) it
The median Md is found as follows:
1. If the number of measurements is odd, the median
is the middlemost measurement in the ordered
values
2. If the number of measurements is even, the median
is the average of the two middlemost measurements
in the ordered values
Data: 3, 5, 8, 10, 11
median=8
Data: 3, 3, 4, 5, 7, 8
median=(4+5)/2=4.5
Example: Sample Median
Example 2.3 Internist’s Yearly Salaries (x$1000)
127 132 138 141 144 146 152 154 165 171 177 192 241
Because n = 13 (odd), the median is the middlemost
or 7th value of the ordered data, so
Md=152
• An annual salary of $180,000 is in the high end, well
above the median salary of $152,000
• In fact, $180,000 is a very high and competitive
salary
Data: 2, 2, 2, 3, 3, 12
mean=4
median=(2+3)/2=2.5
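The two-case rule for the median can be written out directly and checked against the examples above. The function below is a sketch written for illustration (Python's `statistics.median` gives the same results); the variable names are choices made here.

```python
from statistics import mean

def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Internists' yearly salaries (x $1000) from Example 2.3.
salaries = [127, 132, 138, 141, 144, 146, 152, 154, 165, 171, 177, 192, 241]

print(median([3, 5, 8, 10, 11]))       # odd n  -> 8
print(median([3, 3, 4, 5, 7, 8]))      # even n -> 4.5
print(median(salaries))                # 7th of 13 ordered values -> 152

# One large value pulls the mean up but leaves the median almost unchanged.
skewed = [2, 2, 2, 3, 3, 12]
print(mean(skewed), median(skewed))    # 4 2.5
```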
The Mode(众数)
The mode Mo of a population or sample of
measurements is the measurement that occurs most
frequently
• Modes are the values that are observed “most
typically”
• Sometimes higher frequencies at two or more values
• If there are two modes, the data is bimodal
• If more than two modes, the data is multimodal
• When data are in classes, the class with the highest
frequency is the modal class
• The tallest box in the histogram
Example 2.4
DVD Recorder Satisfaction
Satisfaction rankings on a scale of 1 (not satisfied) to 10
(extremely satisfied), arranged in increasing order
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
Because n = 20 (even), the median is the average of the
two middlemost ratings; these are the 10th and 11th
values. Both of these are 8, so
Md = 8
Because the rating 8 occurs with the highest frequency,
Mo = 8
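A short sketch that counts how often each rating occurs confirms both results for Example 2.4. Collecting every value that reaches the highest count also handles bimodal and multimodal data; the variable names are choices made here.

```python
from collections import Counter
from statistics import median

# Satisfaction ratings from Example 2.4, in increasing order.
ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]

counts = Counter(ratings)              # rating -> number of occurrences
highest = max(counts.values())

# All ratings that occur with the highest frequency (more than one if the
# data are bimodal or multimodal).
modes = sorted(value for value, count in counts.items() if count == highest)

print(counts[8], counts[9], counts[10])   # 6 5 4
print(modes)                              # [8]
print(median(ratings))                    # 8.0
```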
Relationships Among Mean,
Median and Mode
Comparing Mean, Median & Mode
Bell-shaped distribution: Mean = Median = Mode
Right skewed distribution: Mean > Median > Mode
Left-skewed distribution: Mean < Median < Mode
Also:
 The median is not affected by extreme values
• “Extreme values” are values much larger or much
smaller than most of the data
• The median is resistant to extreme values
 The mean is strongly affected by extreme values
• The mean is sensitive to extreme values
Selecting a measure of Central Tendency


Usually the mean is a good measure,
because it uses every score in the
distribution.
There are some extreme cases in which the
mean is not representative (or calculable).
Then the mode and the median are used.
Mean=(10+11*4+12*3+13+100)/10=20.3
Mode=11
Median=(11+12)/2=11.5
Mean – not computable
Median=(12+13)/2=12.5
Mode – not meaningful
Open-ended distributions: A distribution is said to be
open-ended when there is no upper limit (or lower
limit) for one of the categories
Payment Time Case
Mean=18.108 days
Median=17.000 days
Mode=16.000 days
So:
Expect the mean payment time to be 18.108
days
A long payment time would be > 17 days and a
short payment time would be < 17 days
The typical payment time is 16 days
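The three summary values can be reproduced from the 65 payment times of Table 2.2 with Python's standard `statistics` module. This is only a check of the figures above; the variable name is a choice made here.

```python
from statistics import mean, median, mode

# The 65 payment times (in days) of Table 2.2, in the order listed.
payment_times = [
    22, 19, 16, 18, 13, 16, 29, 17, 15, 23, 18, 21, 16,
    10, 16, 22, 17, 25, 15, 21, 20, 16, 15, 19, 18, 15,
    22, 16, 24, 20, 17, 14, 14, 19, 15, 27, 12, 17, 25,
    13, 17, 16, 13, 18, 19, 18, 14, 17, 17, 12, 23, 24,
    18, 16, 16, 20, 15, 24, 17, 21, 15, 14, 19, 26, 21,
]

print(round(mean(payment_times), 3))   # 18.108
print(median(payment_times))           # 17
print(mode(payment_times))             # 16
```

The ordering mean > median > mode matches the right-skewed shape seen earlier in the stem-and-leaf display and histogram.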
Section 2.3 Measures of Variation(变异数)
Figure 2.31 indicates that we need measures of
variation to express how the two distributions differ.
Figure 2.31 20 Repair Times for Personal Computers at Two Service Centers
Range(全距)
Largest minus the smallest measurement
The Population Variance σ² (pronounced sigma
squared) (总体方差)
The average of the squared deviations of all
the population measurements from the
population mean
Standard Deviation σ (pronounced sigma) (标准差)
The square root of the variance
The Range
Range = largest measurement - smallest measurement
The range measures the interval spanned by all the data
Example 2.3: Internist’s Salaries (in thousands of
dollars)
127 132 138 141 144 146 152 154 165 171 177 192 241
Range = 241 - 127 = 114 ($114,000)
The Variance
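Population: X_1, X_2, …, X_N, with population variance σ²
Sample: x_1, x_2, …, x_n, with sample variance s²
Population Variance
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]
Sample Variance
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]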
The Variance
For a population of size N, the population variance σ² is defined as
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$
For a sample of size n, the sample variance s² is defined as
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}$
and is a point estimate for σ²
Sample variability tends to underestimate the population value, which is why the sample variance divides by n − 1 rather than n
The Standard Deviation(标准差)
Population Standard Deviation, σ: $\sigma = \sqrt{\sigma^2}$
Sample Standard Deviation, s: $s = \sqrt{s^2}$
Example 2.5
Consider the population of profit margins for five of the
best big companies in America as rated by Forbes
magazine on its website on March 16, 2005. These profit
margins are 8%, 10%, 15%, 12% and 5%.
Population Mean
$\mu = \frac{8 + 10 + 15 + 12 + 5}{5} = \frac{50}{5} = 10\%$
Population Variance
$\sigma^2 = \frac{(8-10)^2 + (10-10)^2 + (15-10)^2 + (12-10)^2 + (5-10)^2}{5} = \frac{(-2)^2 + 0^2 + 5^2 + 2^2 + (-5)^2}{5} = \frac{4 + 0 + 25 + 4 + 25}{5} = \frac{58}{5} = 11.6$ (in percent squared)
Population Standard Deviation
$\sigma = \sqrt{\sigma^2} = \sqrt{11.6} = 3.406\%$
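The population calculation in Example 2.5 can be checked with a short sketch. It uses `pvariance` and `pstdev` from Python's standard `statistics` module, which divide by N; the variable names are illustrative only.

```python
from statistics import fmean, pvariance, pstdev

# The five profit margins (in percent) from Example 2.5, treated as a population.
margins = [8, 10, 15, 12, 5]

mu = fmean(margins)            # population mean
sigma_sq = pvariance(margins)  # sum of squared deviations divided by N
sigma = pstdev(margins)        # square root of the population variance

# The same variance written out from the definition.
by_hand = sum((x - mu) ** 2 for x in margins) / len(margins)

print(mu, sigma_sq, round(sigma, 3), by_hand)
```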
Example 2.6 The Car Mileage Case
Sample variance and standard deviation for first five car
mileages from Table 2.1
30.8, 31.7, 30.1, 31.6, 32.1
so $\bar{x} = 31.26$
$s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5-1} = \frac{(30.8-31.26)^2 + (31.7-31.26)^2 + (30.1-31.26)^2 + (31.6-31.26)^2 + (32.1-31.26)^2}{4} = \frac{2.572}{4} = 0.643$
$s = \sqrt{s^2} = \sqrt{0.643} = 0.8019$
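A sketch of the same sample calculation in Python, written out from the definition and then compared with `statistics.variance`, which also divides by n − 1. The last line divides by n instead, only to illustrate that this version comes out smaller; the names are illustrative.

```python
from statistics import mean, variance

# First five car mileages from Example 2.6.
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]

n = len(mileages)
x_bar = mean(mileages)
sum_sq_dev = sum((x - x_bar) ** 2 for x in mileages)  # about 2.572

s_sq = sum_sq_dev / (n - 1)     # sample variance, divisor n - 1
s = s_sq ** 0.5                 # sample standard deviation
divide_by_n = sum_sq_dev / n    # smaller than s_sq

print(round(x_bar, 2), round(s_sq, 3), round(s, 4), round(divide_by_n, 4))
```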
Sample variance and standard deviation for all 49 car mileages from Table 2.1:
$s^2 = \frac{\sum_{i=1}^{49} (x_i - \bar{x})^2}{49-1} = \frac{30.66204}{48} = 0.638793$
$s = \sqrt{s^2} = \sqrt{0.638793} = 0.7992$
The point estimate of the variance of all cars is 0.638793 mpg² and the point estimate of the standard deviation of all cars is 0.7992 mpg.
The computational formula for the sample variance
$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right]$
The Payment Time Case
Example 2.7
Consider the sample of 65 payment times in Table 2.2.
$\sum_{i=1}^{65} x_i = x_1 + x_2 + \cdots + x_{65} = 22 + 19 + \cdots + 21 = 1{,}177$
$\sum_{i=1}^{65} x_i^2 = x_1^2 + x_2^2 + \cdots + x_{65}^2 = (22)^2 + (19)^2 + \cdots + (21)^2 = 22{,}317$
Therefore
$s^2 = \frac{1}{65-1} \left[ 22{,}317 - \frac{(1{,}177)^2}{65} \right] = \frac{1{,}004.2462}{64} = 15.69135$
and $s = \sqrt{s^2} = \sqrt{15.69135} = 3.9612$ days.
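The computational formula needs only n, the sum of the measurements and the sum of their squares. A minimal sketch follows, using the totals quoted in Example 2.7; the function name `shortcut_variance` is hypothetical. It also cross-checks the shortcut against the definitional formula on the five car mileages from Example 2.6.

```python
from math import sqrt

def shortcut_variance(n, sum_x, sum_x_sq):
    """Sample variance from n, the sum of x_i and the sum of x_i squared."""
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)

# Totals quoted in Example 2.7 for the 65 payment times.
s_sq = shortcut_variance(65, 1177, 22317)
s = sqrt(s_sq)
print(round(s_sq, 5), round(s, 4))

# Cross-check on the five mileages: shortcut vs. definition.
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
x_bar = sum(mileages) / len(mileages)
definition = sum((x - x_bar) ** 2 for x in mileages) / (len(mileages) - 1)
shortcut = shortcut_variance(len(mileages), sum(mileages),
                             sum(x ** 2 for x in mileages))
print(round(definition, 3), round(shortcut, 3))
```

With large values and a small spread, the subtraction in the shortcut can lose precision in floating point, so the definitional form is usually the safer choice in code.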
The Empirical Rule(经验准则) for
Normal Populations
If a population has mean μ and standard deviation σ and is described by a normal curve, then
1. 68.26% of the population measurements lie within one standard deviation of the mean: [μ − σ, μ + σ]
2. 95.44% of the population measurements lie within two standard deviations of the mean: [μ − 2σ, μ + 2σ]
3. 99.73% of the population measurements lie within three standard deviations of the mean: [μ − 3σ, μ + 3σ]
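The three percentages can be reproduced from the normal curve itself. The sketch below uses the identity P(|Z| ≤ k) = erf(k/√2) for a standard normal Z; the function name `normal_coverage` is hypothetical. Direct computation gives 68.27% and 95.45% for the first two; the figures 68.26% and 95.44% quoted above presumably come from doubling four-decimal normal-table entries.

```python
from math import erf, sqrt

def normal_coverage(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * normal_coverage(k):.2f}%")
```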
Tolerance Intervals(容许区间)
An interval that contains a specified percentage of the individual measurements in a population is called a tolerance interval.
• The one, two, and three standard deviation intervals around μ given in (1), (2) and (3) are tolerance intervals containing, respectively, 68.26 percent, 95.44 percent and 99.73 percent of the measurements in a normally distributed population.
• The three-sigma interval [μ ± 3σ] is often taken to be a tolerance interval that contains almost all of the measurements in a normally distributed population.
Figure 2.32 The Empirical Rule and Tolerance Intervals
Example 2.8 The Car Mileage Case
• 68.26% of all individual cars will have mileages in the range [x̄ ± s] = [31.6 ± 0.8] = [30.8, 32.4] mpg
• 95.44% of all individual cars will have mileages in the range [x̄ ± 2s] = [31.6 ± 1.6] = [30.0, 33.2] mpg
• 99.73% of all individual cars will have mileages in the range [x̄ ± 3s] = [31.6 ± 2.4] = [29.2, 34.0] mpg
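A small sketch of how these three intervals are built from the rounded sample mean 31.6 and sample standard deviation 0.8 used in Example 2.8. The function name `tolerance_interval` is hypothetical, and the percentages rely on the assumption that mileages are approximately normally distributed.

```python
def tolerance_interval(center, spread, k):
    """Interval [center - k * spread, center + k * spread]."""
    return center - k * spread, center + k * spread

x_bar, s = 31.6, 0.8  # rounded estimates from Example 2.8

for k, pct in ((1, 68.26), (2, 95.44), (3, 99.73)):
    low, high = tolerance_interval(x_bar, s, k)
    print(f"{pct}%: [{low:.1f}, {high:.1f}] mpg")
```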
Skewness and the Empirical Rule
• The Empirical Rule holds for normally distributed populations.
• This rule also approximately holds for populations having mound-shaped (single-peaked) distributions that are not very skewed to the right or left.
• For example, recall the distribution of the 65 payment times: it indicates that the Empirical Rule holds for those data.
Section 2.4 Percentiles, Quartiles(四分之一分位点) and Box-and-Whiskers Display
For a set of measurements arranged in increasing order,
the pth percentile(百分位点) is a value such that p
percent of the measurements fall at or below the value
and (100-p) percent of the measurements fall at or above
the value
The first quartile Q1 is the 25th percentile
The second quartile (or median) Md is the 50th percentile
The third quartile Q3 is the 75th percentile
The interquartile range IQR(四分位距) is Q3 - Q1
Calculating the pth percentile
• Calculate the index i = (p/100) × n, where n is the number of measurements.
• If i is not an integer, the next integer greater than i denotes the position of the pth percentile in the ordered arrangement.
• If i is an integer, then the pth percentile is the average of the measurements in positions i and i+1 in the ordered arrangement.
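The index rule above can be sketched as a short function. The name `percentile` and the restriction to whole-number p are assumptions made for this illustration; integer arithmetic is used to decide whether i is an integer so that floating-point rounding cannot change the answer. Note that statistical software often uses other interpolation rules and may give slightly different values. The data are the 20 ratings from Example 2.10.

```python
def percentile(values, p):
    """pth percentile (whole number p, 0 < p < 100) using i = (p/100) * n."""
    if not values or not 0 < p < 100:
        raise ValueError("need data and 0 < p < 100")
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)  # i = whole + remainder / 100
    if remainder:
        # i is not an integer: take position whole + 1 (1-based).
        return ordered[whole]
    # i is an integer: average the measurements in positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2

ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]
q1, md, q3 = (percentile(ratings, p) for p in (25, 50, 75))
print(q1, md, q3, q3 - q1)  # 7.5 8.0 9.0 1.5
```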
Figure 2.33 Using stem-and-leaf displays to find percentiles.
(a) The 75th percentile of the 65 payment times, and a five-number summary
(b) The 5th percentile of the 60 bottle design ratings and a five-number summary
Example 2.10
DVD Recorder Satisfaction
20 customer satisfaction ratings:
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
Md = (8+8)/2 = 8
Q1 = (7+8)/2 = 7.5
Q3 = (9+9)/2 = 9
IQR = Q3 − Q1 = 9 − 7.5 = 1.5
Five-Number Summary in descriptive statistics
1. The smallest measurement
2. The first quartile, Q1
3. The median, Md
4. The third quartile, Q3
5. The largest measurement
• Displayed visually using a box-and-whiskers plot
Box-and-whisker plots
A box and whisker plot (sometimes called a boxplot) is a graph
that presents information from a five-number summary. It does
not show a distribution in as much detail as a stem and leaf plot
or histogram does, but is especially useful for indicating whether
a distribution is skewed and whether there are potential unusual
observations (outliers) in the data set.
The Box-and-Whiskers Plots(盒型图)
• The box plots the:
  - first quartile, Q1
  - median, Md
  - third quartile, Q3
  - inner fences, located 1.5 × IQR away from the quartiles:
    Q1 − (1.5 × IQR)
    Q3 + (1.5 × IQR)
  - outer fences, located 3 × IQR away from the quartiles:
    Q1 − (3 × IQR)
    Q3 + (3 × IQR)
• The “whiskers” are dashed lines that plot the range of the data
  - A dashed line drawn from the box below Q1 down to the smallest measurement
  - Another dashed line drawn from the box above Q3 up to the largest measurement
Note: Q1, Md, Q3, the smallest value, and the
largest value are sometimes referred to as the
five number summary
Outliers(异常值)
• Outliers are measurements that are very different from most of the other measurements
  - Because they are either very much larger or very much smaller than most of the other measurements
• Outliers lie beyond the fences of the box-and-whiskers plot
  - Measurements between the inner and outer fences are mild outliers
  - Measurements beyond the outer fences are severe outliers
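A sketch of the fence calculation and outlier classification, applied to the 20 ratings and the quartiles Q1 = 7.5 and Q3 = 9 from Example 2.10. The function name `classify_outliers` is hypothetical. One assumption is made here: a value lying exactly on an outer fence is not counted as "beyond" it and is therefore treated as mild.

```python
def classify_outliers(values, q1, q3):
    """Return inner fences, outer fences, mild outliers and severe outliers."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    mild, severe = [], []
    for x in values:
        if x < outer[0] or x > outer[1]:
            severe.append(x)   # beyond an outer fence
        elif x < inner[0] or x > inner[1]:
            mild.append(x)     # beyond an inner fence only
    return inner, outer, mild, severe

ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]
inner, outer, mild, severe = classify_outliers(ratings, q1=7.5, q3=9)
print(inner, outer)   # (5.25, 11.25) (3.0, 13.5)
print(mild, severe)   # [3, 5, 5] [1]
```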
Section 2.5 Describing Qualitative Data
Pie charts(饼图) of the proportion (as percent) of all
cars sold in the United States by different
manufacturers, 1970 versus 1997
Bar Chart(柱状图)
Percentage of Automobiles Sold by Manufacturer, 1970
versus 1997
Pie Chart
Percentage of Automobiles Sold by Manufacturer, 1997
A Bar Chart of U.S. Automobile Sales in 1997
Misleading Graphs and Charts:
Scale Break
Break the vertical scale to exaggerate effect
Mean Salaries at a Major University, 2002 - 2005
Misleading Graphs and Charts: Scale
Effects
Compress vs. stretch the vertical axis to exaggerate or minimize
the effect
Mean Salary Increases at a Major University, 2002 - 2005
You can use simple mathematical
operations (like averages) to create
nonsensical “facts” that can drive
whatever agenda you’d like.
Example: the average wealth of the
citizens of a particular town is
$100,000, therefore they don’t
need any government assistance.
(The town consists of 1 stingy
millionaire and 9 homeless people.)
Weighted Means(加权均值)
• Sometimes, some measurements are more important than others
  - Assign numerical “weights” to the data
• Weights measure the relative importance of the value
• Calculate the weighted mean as
$\frac{\sum w_i x_i}{\sum w_i}$
where wi is the weight assigned to the ith measurement xi
Example 2.12
June 2001 unemployment rates in the U.S. by region
Census Region    Civilian Labor Force (millions)    Unemployment Rate (%)
Northeast        26.9                               4.1
South            50.6                               4.7
Midwest          34.7                               4.4
West             32.5                               5.0
We want the mean unemployment rate for the U.S.
• Calculate it as a weighted mean
  - So that the bigger the region, the more heavily it counts in the mean
• The data values are the regional unemployment rates
• The weights are the sizes of the regional labor forces
26 .9  4.1  50 .6  4.7   34 .7  4.4  32 .5  5.0



26 .9  50 .6  34 .7  25 .5  32 .5
663 .29
 4.58 %
144 .7
Note that the unweigthed mean is 4.55%, which
underestimates the true rate by 0.03%

That is, 0.0003  144.7 million = 43,410 workers
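The weighted-mean calculation in Example 2.12 can be sketched as follows. The dictionary layout and variable names are illustrative; the figures are the labor-force sizes (weights) and unemployment rates (values) from the example.

```python
# region: (labor force in millions, unemployment rate in percent)
regions = {
    "Northeast": (26.9, 4.1),
    "South": (50.6, 4.7),
    "Midwest": (34.7, 4.4),
    "West": (32.5, 5.0),
}

total_weight = sum(w for w, _ in regions.values())
weighted = sum(w * x for w, x in regions.values()) / total_weight
unweighted = sum(x for _, x in regions.values()) / len(regions)

print(round(total_weight, 1), round(weighted, 2), round(unweighted, 2))
```

The unweighted mean treats the four regions as equally large, which is why it differs from the weighted result.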
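Population and Sample Proportions
X is a qualitative variable.
Population: X1, X2, …, XN (population proportion p)
Sample: x1, x2, …, xn (sample proportion p̂)
Sample Proportion
$\hat{p} = \frac{n(X_i = V_j)}{n}$
where n(X_i = V_j) is the number of sample values that fall in category V_j
p̂ is the point estimate of p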
Example 2.11 The Marketing Ethics Case
117 out of 205 marketing researchers disapproved
of action taken in a hypothetical scenario
X = 117, number of researchers who disapprove
n = 205, number of researchers surveyed
Sample Proportion:
$\hat{p} = \frac{X}{n} = \frac{117}{205} = 0.57$
Scatter Diagrams
• Scatter diagrams are used to examine possible relationships between two numerical variables
• The scatter diagram:
  - one variable is measured on the vertical axis and the other variable is measured on the horizontal axis
Scatter Plots(散点图)
Visualize the data to see patterns, especially “trends”
Restaurant Ratings: Mean Preference vs. Mean Taste
A Scatter Plot Showing a Positive
Linear Relationship
A Scatter Plot Showing Little or No Linear
Relationship
A Scatter Plot Showing a Negative Linear
Relationship