Some Preliminaries
© 2007 Prentice Hall
Basics of Analysis

The process of data analysis
Observation
Data
Encode
Information
Analysis
Example 1:

Gift Catalog Marketer

Mails 4 times a year to its customers

Company has 1 million customers on its file
Example 1

Cataloger would like to know whether new
customers buy more than old customers.

Classify new customers as anyone who
bought within the last twelve months.

Analyst takes a sample of 100,000
customers and notices the following.
Example 1

5000 orders received in the last month

3000 (60%) were from new customers

2000 (40%) were from old customers

So it looks like the new customers are
doing better.
Example 1


Is there a catch here?

Data at this gross level does not discriminate
between customers within either group.

A customer who bought within the last 11 days is
treated exactly the same as a customer who bought
within the last 11 months.
Example 1

Can we use some other variable to distinguish
between old and new Customers?

Answer: Actual dollars spent!

What can we do with this variable?

Find its mean and variation.

We might find that the average purchase amount for
old customers is two or three times larger than the
average among new customers.
Numerical Summaries of data

The two basic concepts are the center
and the spread of the data

Center of data
- Mean, which is given by $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Median
- Mode
Numerical Summaries of data

Forms of Variation

Sum of differences about the mean: $\sum_{i=1}^{n} (x_i - \bar{x})$

Variance: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Standard Deviation: Square root of the variance
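
The summaries above can be sketched in a few lines of Python. The purchase amounts below are hypothetical values chosen for illustration, not data from the catalog example. The sketch also shows why the plain sum of differences about the mean is of little use as a measure of spread: it is always zero, which is what motivates squaring the deviations.

```python
import math

# Hypothetical purchase amounts in dollars (illustrative only).
amounts = [60.0, 80.0, 80.0, 80.0, 90.0, 90.0, 110.0, 130.0]

n = len(amounts)
mean = sum(amounts) / n

# Deviations about the mean always sum to zero.
deviations = [x - mean for x in amounts]
sum_of_deviations = sum(deviations)

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}")
print(f"sum of deviations = {sum_of_deviations:.2f}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```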
Confidence Intervals

In the catalog example, the analyst wants to know the average
purchase amount of customers.
He draws two samples of 75 customers each
and finds the means to be $68 and $122.
Since the difference is large, he draws another 38
samples of 75 each.
The mean of the means of the 40 samples turns
out to be $94.85.
How confident should he be of this mean of
means?
Confidence Intervals

Analyst calculates the standard deviation of the
sample means, called the Standard Error (SE). It
is 12.91.

Basic premise for confidence intervals

95 percent of the time, the true mean purchase
amount lies within plus or minus 1.96 standard
errors of the mean of the sample means.

C.I. = Mean ± 1.96 × Standard Error
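
As a worked illustration, the formula can be applied to the figures quoted for the catalog example (mean of means $94.85, SE 12.91). This is a minimal sketch; the variable names are illustrative.

```python
# Figures quoted in the slides.
mean_of_means = 94.85   # mean of the 40 sample means, in dollars
standard_error = 12.91  # standard deviation of the 40 sample means
z = 1.96                # multiplier for a 95 percent interval

margin = z * standard_error
lower = mean_of_means - margin
upper = mean_of_means + margin

print(f"95% C.I.: ${lower:.2f} to ${upper:.2f}")
```

With these inputs the interval runs from roughly $69.55 to $120.15.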
Confidence Intervals

However, if the CI is calculated from only one
sample, then

Standard Error of sample mean
= Standard deviation of sample / $\sqrt{n}$

Basic premise for confidence intervals with one sample

95 percent of the time, the true mean lies within plus or
minus 1.96 standard errors of the sample mean.
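
A minimal sketch of the one-sample case, assuming a small set of hypothetical purchase amounts (not data from the slides). The helper name `confidence_interval` is illustrative. For a sample this small a t multiplier would normally replace 1.96; the value 1.96 is kept here only to match the slides.

```python
import math
import statistics

def confidence_interval(sample, z=1.96):
    """Return (mean, standard error, lower, upper) for one sample."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error = sample standard deviation / sqrt(n).
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean, se, mean - z * se, mean + z * se

# Hypothetical purchase amounts in dollars (illustrative only).
sample = [60, 80, 80, 80, 90, 90, 110, 130]
mean, se, lower, upper = confidence_interval(sample)
print(f"mean = {mean:.2f}, SE = {se:.2f}, 95% C.I. = {lower:.2f} to {upper:.2f}")
```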
Example 2: Confidence Intervals for response rates





You are the marketing analyst for Online Apparel
Company.
You want to run a promotion for all customers in
your database.
In the past you have run many such promotions.
Historically you needed a 4.5% response for the
promotions to break even.
You want to test the viability of the current full-scale
promotion by running a small test promotion.
Example 2: Confidence Intervals for response rates




Test 1,000 names selected at random from a new list.
To break even, the list must be expected to have a
response rate of 4.5 percent.
Confidence Interval = Expected Response ± 1.96 × SE
= p ± 1.96 × SE
In our case C.I. = 3.22% to 5.78%. Thus any test
response between 3.22% and 5.78% is consistent with the
hypothesis that the true response rate is 4.5%.
Example 2: Confidence Intervals for response rates





The list is mailed and actually pulls in 3.5%.
Since 3.5% lies inside the interval, the true response rate may still be 4.5%.
What if the actual rate pulled in were 5%?
Regression towards the mean: the phenomenon of a test
result being different from the true result.
Give more thought to lists whose cutoff
rates lie within the confidence interval.
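
The interval in Example 2 can be reproduced with a short sketch. The slides do not state how SE is obtained for a response rate; the code assumes the usual formula for a proportion, sqrt(p(1 - p)/n), which gives the quoted 3.22% to 5.78%. The function name `proportion_interval` is illustrative.

```python
import math

def proportion_interval(p, n, z=1.96):
    """Interval around a response rate p for a mailing of n names."""
    # Assumed standard error of a proportion: sqrt(p * (1 - p) / n).
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

break_even_rate = 0.045  # 4.5 percent, from the slides
test_size = 1000         # names in the test mailing

lower, upper = proportion_interval(break_even_rate, test_size)
print(f"C.I. = {lower:.2%} to {upper:.2%}")

# Check the two test outcomes discussed in the example.
for observed in (0.035, 0.05):
    inside = lower <= observed <= upper
    print(f"observed {observed:.1%}: inside interval = {inside}")
```

Both 3.5% and 5% fall inside the interval, so neither test result on its own contradicts a true rate of 4.5%.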
Frequency Distribution and
Cross-Tabulation
Chapter Outline
1) Frequency Distribution
2) Statistics Associated with Frequency Distribution
   i. Measures of Location
   ii. Measures of Variability
   iii. Measures of Shape
3) Cross-Tabulations
   i. Two Variable Case
   ii. Three Variable Case
   iii. General Comments on Cross-Tabulations
4) Statistics for Cross-Tabulation: Chi-Square
Internet Usage Data
Table 15.1

Respondent                     Internet  Attitude Toward       Usage of Internet
Number      Sex   Familiarity  Usage     Internet  Technology  Shopping  Banking
1           1.00  7.00         14.00     7.00      6.00        1.00      1.00
2           2.00  2.00         2.00      3.00      3.00        2.00      2.00
3           2.00  3.00         3.00      4.00      3.00        1.00      2.00
4           2.00  3.00         3.00      7.00      5.00        1.00      2.00
5           1.00  7.00         13.00     7.00      7.00        1.00      1.00
6           2.00  4.00         6.00      5.00      4.00        1.00      2.00
7           2.00  2.00         2.00      4.00      5.00        2.00      2.00
8           2.00  3.00         6.00      5.00      4.00        2.00      2.00
9           2.00  3.00         6.00      6.00      4.00        1.00      2.00
10          1.00  9.00         15.00     7.00      6.00        1.00      2.00
11          2.00  4.00         3.00      4.00      3.00        2.00      2.00
12          2.00  5.00         4.00      6.00      4.00        2.00      2.00
13          1.00  6.00         9.00      6.00      5.00        2.00      1.00
14          1.00  6.00         8.00      3.00      2.00        2.00      2.00
15          1.00  6.00         5.00      5.00      4.00        1.00      2.00
16          2.00  4.00         3.00      4.00      3.00        2.00      2.00
17          1.00  6.00         9.00      5.00      3.00        1.00      1.00
18          1.00  4.00         4.00      5.00      4.00        1.00      2.00
19          1.00  7.00         14.00     6.00      6.00        1.00      1.00
20          2.00  6.00         6.00      6.00      4.00        2.00      2.00
21          1.00  6.00         9.00      4.00      2.00        2.00      2.00
22          1.00  5.00         5.00      5.00      4.00        2.00      1.00
23          2.00  3.00         2.00      4.00      2.00        2.00      2.00
24          1.00  7.00         15.00     6.00      6.00        1.00      1.00
25          2.00  6.00         6.00      5.00      3.00        1.00      2.00
26          1.00  6.00         13.00     6.00      6.00        1.00      1.00
27          2.00  5.00         4.00      5.00      5.00        1.00      1.00
28          2.00  4.00         2.00      3.00      2.00        2.00      2.00
29          1.00  4.00         4.00      5.00      3.00        1.00      2.00
30          1.00  3.00         3.00      7.00      5.00        1.00      2.00
Frequency Distribution


In a frequency distribution, one variable is
considered at a time.
A frequency distribution for a variable produces a
table of frequency counts, percentages, and
cumulative percentages for all the values associated
with that variable.
Frequency Distribution of Familiarity
with the Internet
Table 15.2

                                                      Valid       Cumulative
Value label      Value  Frequency (N)  Percentage     percentage  percentage
Not so familiar  1      0              0.0            0.0         0.0
                 2      2              6.7            6.9         6.9
                 3      6              20.0           20.7        27.6
                 4      6              20.0           20.7        48.3
                 5      3              10.0           10.3        58.6
                 6      8              26.7           27.6        86.2
Very familiar    7      4              13.3           13.8        100.0
Missing          9      1              3.3
                 TOTAL  30             100.0          100.0
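
The counts and percentages in Table 15.2 can be reproduced from the Familiarity column of Table 15.1. The sketch below is one way to do it in plain Python. It treats the value 9 as the missing code, as Table 15.2 does, and bases the valid and cumulative percentages on the 29 non-missing responses.

```python
from collections import Counter

# Familiarity ratings for the 30 respondents in Table 15.1 (9 = missing).
familiarity = [7, 2, 3, 3, 7, 4, 2, 3, 3, 9, 4, 5, 6, 6, 6,
               4, 6, 4, 7, 6, 6, 5, 3, 7, 6, 6, 5, 4, 4, 3]
MISSING = 9

counts = Counter(familiarity)
n_total = len(familiarity)
n_valid = n_total - counts[MISSING]

rows = []
cumulative = 0.0
for value in range(1, 8):
    freq = counts[value]
    pct = 100 * freq / n_total          # share of all respondents
    valid_pct = 100 * freq / n_valid    # share of non-missing respondents
    cumulative += valid_pct
    rows.append((value, freq, round(pct, 1),
                 round(valid_pct, 1), round(cumulative, 1)))

for row in rows:
    print(row)
print("missing:", counts[MISSING], round(100 * counts[MISSING] / n_total, 1))
```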
Frequency Histogram
Fig. 15.1
[Figure: histogram of Frequency (vertical axis, 0 to 8) against Familiarity (horizontal axis, values 2 to 7)]
Statistics for Frequency Distribution:
Measures of Location

The mean, or average value, is the most commonly used
measure of central tendency. The mean, $\bar{X}$, is given by

$\bar{X} = \sum_{i=1}^{n} X_i / n$

where
$X_i$ = observed values of the variable X
n = number of observations (sample size)

The mode is the value that occurs most frequently. The
mode is a good measure of location when the variable is
inherently categorical or has otherwise been grouped into
categories.
Statistics for Frequency Distribution:
Measures of Location


The median of a sample is the middle
value when the data are arranged in
ascending or descending order.
If the number of data points is even, the
median is the midpoint between the two
middle values. The median is the 50th
percentile.
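
As an illustration, the three measures of location can be computed for the Familiarity column of Table 15.1 with Python's standard `statistics` module. The sketch assumes that the value 9 marks a missing response and drops it before computing anything.

```python
import statistics

# Familiarity ratings from Table 15.1; 9 codes a missing value.
familiarity = [7, 2, 3, 3, 7, 4, 2, 3, 3, 9, 4, 5, 6, 6, 6,
               4, 6, 4, 7, 6, 6, 5, 3, 7, 6, 6, 5, 4, 4, 3]
valid = [x for x in familiarity if x != 9]

mean = statistics.mean(valid)      # arithmetic average
median = statistics.median(valid)  # middle value of the sorted data
mode = statistics.mode(valid)      # most frequent value

print(f"mean = {mean:.3f}, median = {median}, mode = {mode}")
```

For these 29 valid responses the mean is about 4.724, the median is 5, and the mode is 6.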
Statistics for Frequency Distribution:
Measures of Variability




The range measures the spread of the data: it is the difference between the largest and smallest values.
The variance is the mean squared deviation from the
mean. The variance can never be negative.
The standard deviation is the square root of the variance.
The coefficient of variation is the ratio of the standard
deviation to the mean, expressed as a percentage, and is a
unitless measure of relative variability.
$CV = s_x / \bar{X}$
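
A short sketch of these measures for the Familiarity column of Table 15.1, again assuming 9 is the missing code. It uses the sample variance (divisor n - 1), matching the variance formula given earlier in the deck.

```python
import statistics

# Familiarity ratings from Table 15.1; 9 codes a missing value.
familiarity = [7, 2, 3, 3, 7, 4, 2, 3, 3, 9, 4, 5, 6, 6, 6,
               4, 6, 4, 7, 6, 6, 5, 3, 7, 6, 6, 5, 4, 4, 3]
valid = [x for x in familiarity if x != 9]

value_range = max(valid) - min(valid)
variance = statistics.variance(valid)  # sample variance, divides by n - 1
std_dev = statistics.stdev(valid)      # square root of the sample variance
cv_percent = 100 * std_dev / statistics.mean(valid)

print(f"range = {value_range}")
print(f"variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
print(f"coefficient of variation = {cv_percent:.1f}%")
```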
Statistics for Frequency Distribution:
Measures of Shape


Skewness is the tendency of the deviations from the mean
to be larger in one direction than in the other, that is, the tendency for
one tail of the distribution to be heavier than the other.
Kurtosis is a measure of the relative peakedness or
flatness of the frequency distribution curve. The kurtosis of
a normal distribution is zero.
- If kurtosis > 0, the distribution is more peaked than a normal distribution.
- If kurtosis < 0, the distribution is flatter than a normal distribution.
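
The slides do not give formulas for skewness or kurtosis. The sketch below assumes the simple moment-based definitions, with 3 subtracted from the kurtosis so that a normal distribution scores zero, as stated above. Statistical packages often apply small-sample corrections, so their output may differ slightly. The function names and the two small data sets are illustrative.

```python
def central_moment(data, k):
    """k-th central moment of the data, dividing by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def skewness(data):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3 (zero for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # deviations balance on both sides
right_tailed = [1, 1, 1, 2, 10]    # one large value stretches the right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

The symmetric data set has zero skewness and a negative kurtosis (flatter than normal), while the right-tailed data set has positive skewness.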
Skewness of a Distribution
Fig. 15.2
[Figure: (a) Symmetric Distribution, with the mean, median, and mode marked; (b) Skewed Distribution, with the mean, median, and mode marked]
Cross-Tabulation


While a frequency distribution describes one variable at
a time, a cross-tabulation describes two or more
variables simultaneously.
Cross-tabulation results in tables that reflect the joint
distribution of two or more variables with a limited
number of categories or distinct values, e.g., Table 15.3.
Gender and Internet Usage
Table 15.3

                 Gender
Internet Usage   Male    Female   Row Total
Light (1)        5       10       15
Heavy (2)        10      5        15
Column Total     15      15
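
The cell counts in Table 15.3 can be rebuilt from the Sex and Internet Usage columns of Table 15.1. The slides do not state the coding or the cut-off, so the sketch below makes two assumptions: sex 1 is male and 2 is female, and usage of 5 or less counts as light. Under those assumptions the counts match the table.

```python
from collections import Counter

# Sex and Internet usage for the 30 respondents in Table 15.1.
sex = [1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 1,
       2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1]
usage = [14, 2, 3, 3, 13, 6, 2, 6, 6, 15, 3, 4, 9, 8, 5,
         3, 9, 4, 14, 6, 9, 5, 2, 15, 6, 13, 4, 2, 4, 3]

LIGHT_MAX = 5                          # assumed cut-off, not given in the slides
SEX_LABEL = {1: "Male", 2: "Female"}   # assumed coding, not given in the slides

def usage_group(value):
    """Classify a usage value as Light or Heavy."""
    return "Light" if value <= LIGHT_MAX else "Heavy"

# Count respondents in each (usage group, gender) cell.
table = Counter((usage_group(u), SEX_LABEL[s]) for s, u in zip(sex, usage))

for group in ("Light", "Heavy"):
    print(group, table[(group, "Male")], table[(group, "Female")])
```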
Two Variables Cross-Tabulation


Since two variables have been cross-classified,
percentages could be computed either columnwise,
based on column totals (Table 15.4), or rowwise,
based on row totals (Table 15.5).
The general rule is to compute the percentages in the
direction of the independent variable, across the
dependent variable. The correct way of calculating
percentages is as shown in Table 15.4.
Internet Usage by Gender
Table 15.4

                 Gender
Internet Usage   Male     Female
Light            33.3%    66.7%
Heavy            66.7%    33.3%
Column total     100%     100%
Gender by Internet Usage
Table 15.5

           Internet Usage
Gender     Light    Heavy    Total
Male       33.3%    66.7%    100.0%
Female     66.7%    33.3%    100.0%
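
The two ways of computing percentages can be sketched from the counts in Table 15.3. The function names are illustrative. Because every row and column total in that table is 15, both directions happen to give the same figures here; with unequal totals they would differ.

```python
# Cell counts from Table 15.3: counts[usage][gender].
counts = {
    "Light": {"Male": 5, "Female": 10},
    "Heavy": {"Male": 10, "Female": 5},
}

def column_percentages(counts):
    """Percentages within each column (gender), as in Table 15.4."""
    col_totals = {}
    for row in counts.values():
        for col, value in row.items():
            col_totals[col] = col_totals.get(col, 0) + value
    return {r: {c: round(100 * v / col_totals[c], 1) for c, v in row.items()}
            for r, row in counts.items()}

def row_percentages(counts):
    """Percentages within each row (usage level)."""
    return {r: {c: round(100 * v / sum(row.values()), 1) for c, v in row.items()}
            for r, row in counts.items()}

print(column_percentages(counts))
print(row_percentages(counts))
```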
Introduction of a Third Variable in
Cross-Tabulation
Fig. 15.7
[Figure: flowchart. The original two variables show either some association or no association. In either case a third variable is introduced. Outcomes shown: refined association between the two variables, no association between the two variables, no change in the initial pattern, and some association between the two variables.]
3 Variables Cross-Tab:
Refine an Initial Relationship





As can be seen from Table 15.6, 52% of unmarried respondents and 31% of married
respondents fell in the high-purchase category.
Do unmarried respondents purchase more fashion clothing?
A third variable, the buyer's sex, was introduced.
As shown in Table 15.7,
- 60% of unmarried females and 25% of married females fell in the
high-purchase category
- 40% of unmarried males and 35% of married males fell in the
high-purchase category.
Unmarried respondents are more likely to fall in the high-purchase
category than married ones, and this effect is
much more pronounced for females than for males.
Purchase of Fashion Clothing by
Marital Status
Table 15.6

Purchase of              Current Marital Status
Fashion Clothing         Married    Unmarried
High                     31%        52%
Low                      69%        48%
Column                   100%       100%
Number of respondents    700        300
Purchase of Fashion Clothing by
Marital Status and Gender
Table 15.7

Purchase of          Sex
Fashion Clothing     Male                      Female
                     Married    Not Married    Married    Not Married
High                 35%        40%            25%        60%
Low                  65%        60%            75%        40%
Column totals        100%       100%           100%       100%
Number of cases      400        120            300        180
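
As a check on how Tables 15.6 and 15.7 fit together, the sketch below pools the Table 15.7 cells over sex and compares the gap between unmarried and married respondents within each sex. It assumes the percentages and case counts in Table 15.7 are exact; the names are illustrative.

```python
# (percent in high-purchase category, number of cases) from Table 15.7.
table_15_7 = {
    ("Male", "Married"): (35, 400),
    ("Male", "Not married"): (40, 120),
    ("Female", "Married"): (25, 300),
    ("Female", "Not married"): (60, 180),
}

def pooled_percent(status):
    """High-purchase percentage for a marital status, pooled over sex."""
    cells = [cell for (_sex, s), cell in table_15_7.items() if s == status]
    high = sum(pct / 100 * n for pct, n in cells)
    total = sum(n for _pct, n in cells)
    return 100 * high / total

# Gap between unmarried and married respondents within each sex.
for sex in ("Male", "Female"):
    gap = table_15_7[(sex, "Not married")][0] - table_15_7[(sex, "Married")][0]
    print(f"{sex}: unmarried minus married = {gap} points")

print(f"Married, pooled: {pooled_percent('Married'):.1f}%")
print(f"Not married, pooled: {pooled_percent('Not married'):.1f}%")
```

The pooled figures come out at about 30.7% and 52.0%, in line with the 31% and 52% of Table 15.6, while the gap is 5 points for males and 35 points for females.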
3 Variables Cross-Tab:
Initial Relationship was Spurious




Table 15.8 shows that 32% of those with college degrees and 21% of those
without own an expensive automobile.
Income may also be a factor.
In Table 15.9, when the data for the high income
and low income groups are examined separately,
the association between education and ownership
of expensive automobiles disappears.
The initial relationship observed between these two
variables was spurious.
Ownership of Expensive
Automobiles by Education Level
Table 15.8

Own Expensive      Education
Automobile         College Degree    No College Degree
Yes                32%               21%
No                 68%               79%
Column totals      100%              100%
Number of cases    250               750
Ownership of Expensive Automobiles
by Education Level and Income Levels
Table 15.9

Own                      Income
Expensive                Low Income                 High Income
Automobile               College    No College      College    No College
                         Degree     Degree          Degree     Degree
Yes                      20%        20%             40%        40%
No                       80%        80%             60%        60%
Column totals            100%       100%            100%       100%
Number of respondents    100        700             150        50
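
A minimal sketch of why the pooled table shows an association that vanishes within income groups. It assumes the percentages and counts in Table 15.9 are exact, and the function name `summarize` is illustrative. Within each income level the ownership rate is the same for both education groups, but college graduates are far more likely to be in the high income group, which raises their pooled rate.

```python
# (percent owning an expensive automobile, respondents) from Table 15.9.
table_15_9 = {
    ("Low income", "College degree"): (20, 100),
    ("Low income", "No college degree"): (20, 700),
    ("High income", "College degree"): (40, 150),
    ("High income", "No college degree"): (40, 50),
}

def summarize(education):
    """Return (pooled ownership %, % of the group that is high income)."""
    owners = total = high_income = 0
    for (income, edu), (pct, n) in table_15_9.items():
        if edu != education:
            continue
        owners += pct * n / 100
        total += n
        if income == "High income":
            high_income += n
    return 100 * owners / total, 100 * high_income / total

for education in ("College degree", "No college degree"):
    ownership, high_share = summarize(education)
    print(f"{education}: owns {ownership:.1f}%, high income {high_share:.1f}%")
```

The pooled ownership rates come out at 32.0% and about 21.3%, matching Table 15.8.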
3 Variables Cross-Tab:
Reveal Suppressed Association




Table 15.10 shows no association between desire to
travel abroad and age.
In Table 15.11, sex was introduced as the third
variable.
Controlling for effect of sex, the suppressed
association between desire to travel abroad and age
is revealed for the separate categories of males and
females.
Since the association between desire to travel abroad
and age runs in the opposite direction for males and
females, the relationship between these two
variables is masked when the data are aggregated
across sex as in Table 15.10.
Desire to Travel Abroad by Age
Table 15.10

Desire to                Age
Travel Abroad            Less than 45    45 or More
Yes                      50%             50%
No                       50%             50%
Column totals            100%            100%
Number of respondents    500             500
Desire to Travel Abroad by
Age and Gender
Table 15.11

Desire to          Sex
Travel Abroad      Male Age            Female Age
                   < 45      >= 45     < 45      >= 45
Yes                60%       40%       35%       65%
No                 40%       60%       65%       35%
Column totals      100%      100%      100%      100%
Number of cases    300       300       200       200
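
The masking can be verified by pooling Table 15.11 over sex. As a worked calculation, assuming the percentages and case counts in Table 15.11 are exact:

```latex
\begin{align*}
P(\text{Yes} \mid \text{age} < 45)
  &= \frac{0.60 \times 300 + 0.35 \times 200}{300 + 200}
   = \frac{180 + 70}{500} = 0.50 \\
P(\text{Yes} \mid \text{age} \ge 45)
  &= \frac{0.40 \times 300 + 0.65 \times 200}{300 + 200}
   = \frac{120 + 130}{500} = 0.50
\end{align*}
```

Both age groups pool to 50%, as in Table 15.10, even though the pattern runs in opposite directions for males and females.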
Three Variables Cross-Tabulations
No Change in Initial Relationship


Consider the cross-tabulation of family size and the
tendency to eat out frequently in fast-food
restaurants as shown in Table 15.12. No association
is observed.
When income was introduced as a third variable in
the analysis, Table 15.13 was obtained. Again, no
association was observed.
Eating Frequently in
Fast-Food Restaurants by Family Size
Table 15.12

Eat Frequently in        Family Size
Fast-Food Restaurants    Small    Large
Yes                      65%      65%
No                       35%      35%
Column totals            100%     100%
Number of cases          500      500
Eating Frequently in Fast-Food Restaurants
by Family Size and Income
Table 15.13

                         Income
Eat Frequently in        Low                  High
Fast-Food Restaurants    Family size          Family size
                         Small     Large      Small     Large
Yes                      65%       65%        65%       65%
No                       35%       35%        35%       35%
Column totals            100%      100%       100%      100%
Number of respondents    250       250        250       250
Statistics Associated with
Cross-Tab: Chi-Square

H0: there is no association between the two
variables

Use chi-square statistic

H0 will be rejected when the calculated value
of the test statistic is greater than the critical
value of the chi-square distribution



Statistics Associated with
Cross-Tab: Chi-Square
The chi-square statistic ($\chi^2$) compares the observed cell frequencies ($f_o$) to
the frequencies to be expected when there is no
association between variables ($f_e$).
The expected frequency for each cell can be calculated
by using a simple formula:

$f_e = \frac{n_r n_c}{n}$

where
$n_r$ = total number in the row
$n_c$ = total number in the column
$n$ = total sample size
Statistics for Cross-Tab: Chi-Square



From Table 3 in the Statistical Appendix, the
probability of exceeding a chi-square value
of 3.841 (for 1 degree of freedom) is 0.05.
The calculated chi-square is 3.333. Since
this is less than the critical value of 3.841,
the null hypothesis cannot be rejected.
Thus, the association is not statistically
significant at the 0.05 level.
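
The calculation can be sketched as follows. The slides do not spell out the chi-square formula or say which table the value 3.333 comes from. The sketch assumes the standard Pearson form, the sum over cells of (f_o - f_e)^2 / f_e, applied to the gender by Internet usage counts of Table 15.3, which does give 3.333.

```python
# Observed counts from Table 15.3 (rows: Light, Heavy; columns: Male, Female).
observed = [[5, 10],
            [10, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency for each cell: f_e = n_r * n_c / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square: sum over all cells of (f_o - f_e)^2 / f_e.
chi_square = sum((fo - fe) ** 2 / fe
                 for obs_row, exp_row in zip(observed, expected)
                 for fo, fe in zip(obs_row, exp_row))

CRITICAL_VALUE = 3.841  # quoted in the slides for the 0.05 level
print(f"chi-square = {chi_square:.3f}")
print("reject H0" if chi_square > CRITICAL_VALUE else "cannot reject H0")
```

Every expected frequency is 7.5, the statistic is about 3.333, and the null hypothesis of no association is not rejected.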