Chapter 3
Numerical Descriptive Measures
Last (Family) Name: __________________________________.
First (Given) Name: __________________________________
INTRODUCTION
FROM PREVIOUS CHAPTERS:
A parameter is a numerical measure that describes a characteristic of a population.
A statistic is a numerical measure that describes a characteristic of a population sample.
What are the following?
According to a Pew survey, twenty one percent of Americans believe that
President Obama is a Muslim.
The average age of children in our family is 17.
First time unemployment claims were down by 5% in the last quarter.
WHY NUMERICAL DESCRIPTIVE MEASURES?
Charts, frequency distributions, and cross-tabs reduce the detail in the data in ways that make it
easier to see key patterns in the data.
Numerical descriptive measures reduce detail even further; they say something important about a
variable with a single number.
“The average age of our employees is 34.”
“Nearly everyone has more than the average number of legs.”
“More than 80% of our customers live within three miles of our store.”
TYPES OF DESCRIPTIVE MEASURES:
Central tendency describes the extent to which all of the data values group around a
typical center value.
Variation describes the amount of dispersion, or scattering, of values away from a
central value.
Shape is the pattern of the distribution of values from the lowest value to the highest
value.
3.1
MEASURES OF CENTRAL TENDENCY
Where is the “center” of the distribution?
Is the measure of central tendency a meaningful “typical” value?
THE MEDIAN
The middle value in ordered data.
Ranked values (odd number)
Data: 12, 18, 27, 42, 150.
Median = 27
Ranked values (even number)
Data: 4, 8, 15, 16, 23, 42
Tie for middle value; take the average of 15 and 16.
Median = 15.5
Works with numerical data
Not sensitive to outliers.
Also works with ordered categorical data.
Rates self as above average: 42
Rates self as average: 19
Rates self as below average: 6
Median is “Rates self as above average.”
THE MEAN
The mean is what most people call “the average”
Mean = X̄ = sum of the values / number of values
Values: 1, 3, 5, 7, 9
Sum: 25
Number: 5
X̄ = 25/5 = 5
Only works with numerical data.
A few outliers can throw off the mean
Values: 1, 3, 5, 7, 30
Sum: 46
Number: 5
X̄ = 46/5 = 9.2; yet no value is even close to this.
Data do not have to be ordered:
Values: 1, 9, 5, 7, 3
Sum: 25
Number: 5
X̄ = 25/5 = 5
THE MODE
The category with the largest frequency:
California: 35
New York: 9
Nebraska: 72
The mode is Nebraska
Works with numerical and categorical data
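The three measures in this section can be checked with a short Python sketch. It uses the standard-library statistics module and the example values from the notes; the variable names are illustrative only.

```python
import statistics

odd_data = [12, 18, 27, 42, 150]
even_data = [4, 8, 15, 16, 23, 42]

median_odd = statistics.median(odd_data)     # middle of five ranked values
median_even = statistics.median(even_data)   # average of the two middle values

mean_plain = statistics.mean([1, 3, 5, 7, 9])
mean_outlier = statistics.mean([1, 3, 5, 7, 30])  # one outlier pulls the mean up

# Mode of categorical data: the category with the largest frequency.
state_counts = {"California": 35, "New York": 9, "Nebraska": 72}
mode_state = max(state_counts, key=state_counts.get)

print(median_odd, median_even, mean_plain, mean_outlier, mode_state)
```

Note that the median of the outlier data set (1, 3, 5, 7, 30) is still 5, while its mean moves to 9.2.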
3.2:
VARIATION AND SHAPE
THE RANGE
The range is the largest value minus the smallest value.
THE VARIANCE AND THE STANDARD DEVIATION
Measure the “average” scatter around the mean.
You do not have to know how to compute them.
However, you do need to know:
That the standard deviation (S) and the variance (V) measure the same thing.
That the variance is the square of the standard deviation.
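Although hand computation is not required here, a small Python sketch can confirm the relationship between the two measures. It assumes the sample versions of the formulas (statistics.stdev and statistics.variance) and reuses the values 1, 3, 5, 7, 9 from the mean example as an illustrative data set.

```python
import statistics

values = [1, 3, 5, 7, 9]  # illustrative sample

data_range = max(values) - min(values)  # largest value minus smallest value
s = statistics.stdev(values)            # sample standard deviation (S)
v = statistics.variance(values)         # sample variance

# The variance is the square of the standard deviation.
print(data_range, s, v, s ** 2)
```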
THE COEFFICIENT OF VARIATION
A relative measure of scatter compared to the size of the mean.
CV = (S/X̄) x 100%
Good for comparing things that have different units of measurement, like the weight and volume
differences among packing crates.
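As a rough illustration of the CV formula, the sketch below compares the scatter of crate weights with the scatter of crate volumes. The crate numbers are made up for this example, and the helper name coefficient_of_variation is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Return CV = (S / mean) * 100, expressed in percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Made-up measurements for five packing crates (not from the notes).
weights_kg = [20, 22, 25, 21, 24]
volumes_m3 = [1.0, 1.6, 0.9, 1.4, 1.1]

cv_weight = coefficient_of_variation(weights_kg)
cv_volume = coefficient_of_variation(volumes_m3)

# Because the CV is unit-free, the two percentages can be compared directly.
print(round(cv_weight, 1), round(cv_volume, 1))
```

With these made-up numbers the volumes vary more, relative to their mean, than the weights do.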
Z SCORES
The Z score measures how far a value is from the mean, in units of the standard deviation.
Z = (X - X̄)/S
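A minimal Python sketch of the Z score, assuming the sample standard deviation is used for S. The function name z_score is hypothetical, and the data set is the outlier example from the section on the mean.

```python
import statistics

def z_score(value, values):
    """Return (value - mean) / S for one value in a data set."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return (value - mean) / s

sample = [1, 3, 5, 7, 30]
z_for_30 = z_score(30, sample)  # positive: 30 lies above the mean
z_for_1 = z_score(1, sample)    # negative: 1 lies below the mean

print(round(z_for_30, 2), round(z_for_1, 2))
```

A value equal to the mean has a Z score of 0.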
3.3
NUMERICAL DESCRIPTIVE MEASURES OF A POPULATION
Means, standard deviations, and variances for the entire population, not just a sample
                     Sample    Population
Mean                 X̄         µ (mu)
Standard Deviation   S         σ (lower-case sigma)
Variance             S²        σ²
Note: There are slight differences in how sample and population standard deviations and
variances are measured, but you can ignore them.
THE EMPIRICAL RULE AND THE CHEBYSHEV RULE
Percentage of Values Within the Interval
Interval             Normal Bell-Shaped Distribution (Empirical Rule)   Any Distribution (Chebyshev Rule)
Mean +/- one σ       Approximately 68%                                  NA
Mean +/- two σs      Approximately 95%                                  At least 75%
Mean +/- three σs    Approximately 99.7%                                At least 88.89%
Typically, a value more than two standard deviations from the mean is viewed as an outlier.
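The Chebyshev percentages in the table come from the bound 1 - 1/k², where k is the number of standard deviations on either side of the mean (k greater than 1). The sketch below reproduces the two table entries; the function name chebyshev_lower_bound is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

bound_two = chebyshev_lower_bound(2)    # 0.75, i.e. at least 75%
bound_three = chebyshev_lower_bound(3)  # about 0.8889, i.e. at least 88.89%

print(round(bound_two * 100, 2), round(bound_three * 100, 2))
```

For k = 1 the bound is 0, which says nothing useful; that is why the table shows NA for one standard deviation.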
COMPUTING KEY STATISTICS IN EXCEL
3.4
QUARTILES AND THE BOX PLOT
QUARTILES
A crude but common way of expressing ranges
Process
Order the data
2, 3, 4, 5, 6, 7, 8, 12
Divide it into four parts
2, 3 | 4, 5 | 6, 7 | 8, 12
The lowest 25% of the data values form the first quartile, Q1 (2, 3)
The next 25% of the data values form the second quartile, Q2 (4, 5)
The next 25% of the data values form the third quartile, Q3 (6, 7)
The highest 25% of the data values form the fourth quartile, Q4 (8, 12)
EXAMPLES
In General
Q1 = (N+1)/4 ranked value
Q2 = 2(N+1)/4 ranked value
Q3 = 3(N+1)/4 ranked value
Example
Times in Minutes:
29, 31, 35, 39, 39, 40, 43, 44, 46, 52
10 Values
Q1 = (10+1)/4 = 2.75th value
Round up to the third value, which is 35
Q2 = 2*(10+1)/4 = 5.5th value
The 5th value is 39
The 6th value is 40
Choose 39.5
Q3 = 3*(10+1)/4 = 8.25th value
Round down to the 8th value, which is 44
THE INTERQUARTILE RANGE
The difference between the third and first quartile.
The middle 50 percent of the data.
Times in Minutes:
29, 31, 35, 39, 39, 40, 43, 44, 44, 52
10 Values
Q1 = (10+1)/4 = 2.75th value
This is 35 (the third value)
Q3 = 3*(10+1)/4 = 8.25th value
This is 44 (the 8th value)
Interquartile range = 44-35 = 9 minutes
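The ranked-value rule can be sketched in Python as follows. The rounding convention is an assumption inferred from the worked example: a position ending in .5 averages the two neighboring values, and any other fractional position is rounded to the nearest ranked value. The function name quartile is hypothetical, and other software (including Excel) may use a different convention and return slightly different quartiles.

```python
def quartile(sorted_values, k):
    """Return quartile k (1, 2, or 3) using the k*(N+1)/4 ranked-value rule."""
    n = len(sorted_values)
    position = k * (n + 1) / 4      # 1-based ranked position, may be fractional
    lower = int(position)           # whole part of the position
    fraction = position - lower
    if fraction == 0.5:
        # Halfway between two ranked values: average them.
        return (sorted_values[lower - 1] + sorted_values[lower]) / 2
    # Otherwise round to the nearest ranked value (assumed convention).
    nearest = lower + 1 if fraction > 0.5 else lower
    return sorted_values[nearest - 1]

times = sorted([29, 31, 35, 39, 39, 40, 43, 44, 44, 52])
q1 = quartile(times, 1)
q2 = quartile(times, 2)
q3 = quartile(times, 3)
iqr = q3 - q1                       # interquartile range

print(q1, q2, q3, iqr)
```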
THE FIVE-NUMBER SUMMARY
THE BOXPLOT
3.5
CORRELATION
RELATIONSHIPS
Relationships describe how two variables behave relative to one another.
POSITIVE RELATIONSHIP
As one variable increases, the other variable
also increases or
As one variable decreases, the other variable
also decreases.
NEGATIVE RELATIONSHIP
As one variable increases, the other variable
decreases or
As one variable decreases, the other variable
increases.
NO RELATIONSHIP
As one variable changes, this has no impact on
the other variable.
THE COVARIANCE
THE COEFFICIENT OF CORRELATION
The correlation coefficient r (Pearson’s r)
Is a number that describes the strength of the relationship between two variables.
The Excel function is CORREL(data series 1, data series 2)
Interpreting the correlation coefficient
1.0 is perfect positive correlation (rare)
0 is no correlation
-1.0 is perfect negative correlation (rare)
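Outside Excel, Pearson's r can be computed directly from its definition. The sketch below is one way to do it in Python; the function name pearson_r and the two small data series are made up for illustration and are not taken from the homework.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Made-up series: y rises in lockstep with x, then falls in lockstep.
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])
r_negative = pearson_r(x, [10, 8, 6, 4, 2])

print(r_positive, r_negative)
```

The two made-up series sit at the extremes of the scale: one rises exactly with x and the other falls exactly with x.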
3.6
PRESENTING DESCRIPTIVE STATISTICS: PITFALLS AND ETHICAL ISSUES
You need to summarize data to understand it.
Single numerical measures can be very powerful.
However, they may summarize too much and lose important specifics.
Often, when you read a report, you are only given one or two measures.
This may leave you unable to interpret the results meaningfully.
Example: You cannot tell if the mean and median are different, which would indicate a
skewed distribution.
Example: You cannot tell if the distribution is bimodal, so that the mean and median are
values for which there are no items in the distribution.
You must personally not be deceptive in what you report.
Giving a single number that is not characteristic of the distribution.
Giving single numbers that rely on characteristics your distribution does not have.
You must report the results fairly, not withholding information that detracts from the point you
are trying to make.
HOMEWORK
1. CREATE A SCATTER PLOT FOR ADVERTISING PER CAPITA AND SALES PER CAPITA USING THE
FOLLOWING DATA.
To do this, select the two data series and not the header row.
Then go to the Insert tab and select Scatter Chart.
Make Sales Per Capita the Vertical Axis Title
Make Advertising Per Capita the Horizontal Axis Title
Region          Advertising pc   Sales pc
Honolulu        $10              $153
Orlando         $6               $135
Omaha           $3               $116
Ventura         $2               $115
Dallas          $4               $126
Chicago         $1               $114
Nashville       $7               $140
Boston          $9               $153
San Francisco   $5               $127
Boise           $8               $143
Paste your chart onto the following line:
*
Compute Pearson’s r
The value you computed: ______________________
2. CREATE A SCATTER CHART FOR PRICE AND UNITS SOLD USING THE FOLLOWING DATA.
Use directions from Problem 1 but with these column names.
The table is for a sample.
Region          Price   Units Sold
San Francisco   5       21.7319557
Dallas          4       32.1765348
Omaha           3       35.6313442
Orlando         6       21.6527418
Chicago         1       29.7849633
Ventura         2       36.4823921
Boston          9       25.1103128
Boise           8       25.6841931
Honolulu        10      12.0050118
Nashville       7       30.7900904
Chart:
*
Correlation Coefficient
Answer:
3. CREATE A SCATTER CHART FOR SALES PER CAPITA AND INCOME PER CAPITA USING THE FOLLOWING
DATA.
Use directions from Problem 1 but with these column names.
The table is for a population.
Region          Sales pc   Income pc
Honolulu        $56        $46,079
Orlando         $48        $26,355
Omaha           $3         $83,108
Ventura         $54        $23,164
Dallas          $34        $74,177
Chicago         $34        $34,859
Nashville       $12        $16,414
Boston          $74        $92,937
San Francisco   $44        $31,164
Boise           $100       $89,303
Chart:
*
Correlation Coefficient
Answer:
4. FOR SALES PER CAPITA, COMPUTE THE FOLLOWING STATISTICS.
The table is for a population.
City            Sales per Capita
Chicago         $145.66
Ventura         $127.27
Omaha           $145.41
Dallas          $120.69
San Francisco   $151.42
Orlando         $120.92
Nashville       $149.06
Boise           $156.05
Boston          $125.91
Honolulu        $121.10
Cheyenne        $165.61
Raleigh         $138.08
Detroit         $112.99
Denver          $140.55
Seattle         $157.04
Bismarck        $140.40
New York        $161.95
Trenton         $109.60
Topeka          $139.17
New Orleans     $163.87
Mean: ______
Median: _____
Standard Deviation: _______
Variance: __________
Z value for Trenton: ________
Z value for Bismarck: _________