Chapter 3
Data Characterization
BUS304 – Data Characterization
Today: Mean and Variance
• Mean:
  - also called “average”
  - Formula: $\text{mean} = \dfrac{\text{sum of data}}{\text{number of data}}$
  - Characterizes the center of the data distribution
  - The most commonly used measure
• Sample mean $\bar{x}$
  - The average derived from a sample
• Population mean $\mu$
  - The average derived from the population
Exercise: compute the mean weight for the Chargers’ offense players and defense players.
  - Which mean should be higher? Why?
  - Are they population means or sample means?
Ways to compute the mean:
1. Use a calculator.
2. Use Excel (function: AVERAGE).
Sensitivity to outliers
Compute the mean for the following 2 groups of data (household income, unit = $10,000):

Household     #1  #2  #3  #4  #5  #6  #7   #8
Community a    5   4   3   4   3   5   4    5
Community b    5   4   3   4   3   5   4  100

If the mayor decides to provide more public facilities to poor communities, and the decision is based on whether the mean income in the community is below $50,000 per year, does such a decision make sense?
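The question above can be explored numerically. The following is a minimal sketch that uses only the eight incomes per community listed on the slide; the helper name `mean`, the `THRESHOLD` constant, and the extra "without household #8" comparison are illustrative additions, not part of the original exercise.

```python
# Household incomes in units of $10,000, as listed on the slide.
community_a = [5, 4, 3, 4, 3, 5, 4, 5]
community_b = [5, 4, 3, 4, 3, 5, 4, 100]


def mean(values):
    """Arithmetic mean: sum of data divided by number of data."""
    return sum(values) / len(values)


mean_a = mean(community_a)
mean_b = mean(community_b)
# Illustrative extra step: community b with household #8 left out.
mean_b_without_outlier = mean(community_b[:-1])

THRESHOLD = 5  # $50,000 expressed in units of $10,000

for name, m in [("a", mean_a), ("b", mean_b)]:
    print(f"community {name}: mean = ${m * 10_000:,.0f}, "
          f"below threshold: {m < THRESHOLD}")
print(f"community b without household #8: "
      f"mean = ${mean_b_without_outlier * 10_000:,.0f}")
```

Seven of the eight households in community b have the same incomes as in community a, so comparing the two means shows how strongly a single extreme value can move the mean.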
Compute the mean from frequency table
Below is a frequency table showing the number of days the teams take to finish their projects.

Days to Complete   Frequency   Relative Frequency
 5                  4          ?
 6                 12          ?
 7                  8          ?
 8                  6          ?
 9                  4          ?
10                  2          ?

Create a histogram using the data in the table and locate the mean on the graph.
• How to describe the shape of the histogram?
• What is the relationship between the mean and the peak?
Use relative frequency to find out the mean.
How many days on average does a team take to finish one project?
$\text{mean} = \dfrac{\text{total days}}{\text{total teams}}$
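As a rough illustration, the two routes to the mean (total days over total teams, and the relative-frequency weighted sum) can be sketched as follows. The dictionary layout and variable names are assumptions made for this example; the numbers are the ones in the frequency table.

```python
# Frequency table from the slide: days to complete -> number of teams.
frequency = {5: 4, 6: 12, 7: 8, 8: 6, 9: 4, 10: 2}

total_teams = sum(frequency.values())
total_days = sum(days * count for days, count in frequency.items())

# Route 1: total days divided by total teams.
mean_from_totals = total_days / total_teams

# Route 2: weight each value by its relative frequency.
relative_frequency = {days: count / total_teams
                      for days, count in frequency.items()}
mean_from_relative = sum(days * rf for days, rf in relative_frequency.items())

for days, rf in relative_frequency.items():
    print(f"{days:>2} days: relative frequency = {rf:.3f}")
print(f"mean (totals)             = {mean_from_totals:.3f}")
print(f"mean (relative frequency) = {mean_from_relative:.3f}")
```

Both routes give the same answer up to floating-point rounding, because dividing each frequency by the number of teams before summing is the same as dividing the total afterwards.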
Compute the mean from Histogram
[Figure: histogram with Frequency on the vertical axis; bins 15, 25, 35, 45, 55 with frequencies 3, 6, 5, 4, 2]
The histogram conveys the same information as the frequency table.
$\text{mean} = \dfrac{\text{total data value}}{\text{data size}} = \dfrac{15 \times 3 + 25 \times 6 + 35 \times 5 + 45 \times 4 + 55 \times 2}{3 + 6 + 5 + 4 + 2} = 33$
Mathematical expression: $\bar{x} = 33$ if sample, $\mu = 33$ if population
Weighted Mean
• The simple mean treats each piece of information equally.
  - E.g. the average score of the students.
• Sometimes different data should be given different weights.
  - One may be more important than the other.
  - E.g. an instructor assigns 60% to the homework score and 40% to the final exam. If a student’s homework score is 84 and the student got 70 on the exam, compute the student’s final score (the weighted mean of the homework score and the exam score).
    This teacher thinks homework reveals more comprehensive information about a student’s knowledge, and hence gives it more weight.
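The homework and exam example can be worked through in a few lines. This is a minimal sketch: the function name `weighted_mean` and its input checks are illustrative choices, while the scores (84, 70) and weights (60%, 40%) come from the slide.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


scores = [84, 70]       # homework score, final exam score
weights = [0.6, 0.4]    # 60% homework, 40% final exam

final_score = weighted_mean(scores, weights)
simple_mean = sum(scores) / len(scores)

print(f"weighted mean = {final_score:.1f}")
print(f"simple mean   = {simple_mean:.1f}")
```

Printing the simple mean next to the weighted one shows how putting more weight on the higher homework score pulls the final score toward it.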
When to use weighted mean?
• Some other examples of weighted mean:
  - A student’s GPA. A course with more credits carries more weight.
  - An economic growth indicator (some industries affect the economy more than others).
  - Crunch-time leader: the player who performs best in the last few minutes of the game. This can reveal the person’s performance under pressure.
  - Expectation (you will see this in chapter 4).
    E.g. in a gambling game, if with 60% chance you lose one dollar and with 40% chance you gain one dollar, the expectation is
    $60\% \times (-\$1) + 40\% \times \$1 = -\$0.20$
  - Other examples? (average Cal State tuition)
Always think about whether you should use the weighted mean or the simple mean.
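The gambling example can be read as a weighted mean in which the probabilities are the weights. A small sketch under that reading, with variable names chosen only for illustration:

```python
# Outcomes of the game in dollars, with their probabilities (from the slide).
outcomes = [-1.0, 1.0]
probabilities = [0.6, 0.4]

# The probabilities act as weights and already sum to 1,
# so the expectation is the sum of outcome * probability.
expectation = sum(x * p for x, p in zip(outcomes, probabilities))

print(f"expectation = {expectation:.2f} dollars per game")
```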
Break
Variance
• A measure of data spread.
• Also called “the average of squared deviations from the mean”.
• The larger the variance, the fatter the histogram.

Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Note the difference!
Steps to compute the variance
1. Identify whether the data are of a population or a sample (the formulae are different).
2. Use the following table to compute the deviations:

   Data list   Distance from the mean    Square the distance
   5           = 5 - mean = 1.167        = (1.167)^2 = 1.36
   4
   4
   5
   3
   2

   a) Find out the mean: $\text{mean} = \dfrac{5 + 4 + 4 + 5 + 3 + 2}{6} = 3.833$
   b) Find out the distance (fill out the 2nd column)
   c) Find out the squared distance (the 3rd column)
   d) Add up the 3rd column
   e) Divide by
      i. the population size; or
      ii. the sample size - 1
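The steps above can be sketched in code. The data list (5, 4, 4, 5, 3, 2) is the one on the slide; the variable names are illustrative, and the final cross-check against Python's standard `statistics` module is an addition for verification only.

```python
import statistics

data = [5, 4, 4, 5, 3, 2]  # data list from the slide

# a) Find the mean.
mean = sum(data) / len(data)

# b) Distance of each value from the mean (2nd column).
distances = [x - mean for x in data]

# c) Squared distances (3rd column).
squares = [d ** 2 for d in distances]

# d) Add up the 3rd column.
sum_of_squares = sum(squares)

# e) Divide by N for a population, or by n - 1 for a sample.
population_variance = sum_of_squares / len(data)
sample_variance = sum_of_squares / (len(data) - 1)

print(f"mean = {mean:.3f}")
for x, d, sq in zip(data, distances, squares):
    print(f"{x}  {d:+.3f}  {sq:.3f}")
print(f"population variance = {population_variance:.3f}")
print(f"sample variance     = {sample_variance:.3f}")

# Cross-check against the standard library.
assert abs(population_variance - statistics.pvariance(data)) < 1e-9
assert abs(sample_variance - statistics.variance(data)) < 1e-9
```

The sample variance is always the larger of the two for the same data, because the same sum of squares is divided by a smaller number.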
Comparing variance vs. histogram
Find the variance for the following groups of sample data:

Group 1   Group 2   Group 3
11        14        11
12        15        11
13        15        11
16        15        12
16        16        19
17        16        20
18        16        20
21        17        20

Compare the mean and variance.
Create the histogram to compare the distribution.
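A possible way to check the comparison is sketched below. It assumes the 24 numbers on the slide form three groups of eight, one per column of the original layout; that grouping and the group names are assumptions, not something the slide states explicitly.

```python
import statistics

# Assumed grouping: each column of the slide's data block is one group of 8.
groups = {
    "group 1": [11, 12, 13, 16, 16, 17, 18, 21],
    "group 2": [14, 15, 15, 15, 16, 16, 16, 17],
    "group 3": [11, 11, 11, 12, 19, 20, 20, 20],
}

summary = {}
for name, data in groups.items():
    # statistics.variance divides by n - 1 (sample variance).
    summary[name] = (statistics.mean(data), statistics.variance(data))

for name, (mean, variance) in summary.items():
    print(f"{name}: mean = {mean}, sample variance = {variance:.3f}")
```

Under this grouping the means coincide while the variances differ, which is the contrast the histograms are meant to show.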
What does variance mean?
Variance indicates variation:
• The larger the variance, the more spread out the data.
• Indicates unpredictability.
• E.g.
  - Weather data: when the weather changes dramatically, it is hard to predict tomorrow’s temperature.
    (Looking at temperature data: which has the larger variance, Chicago or San Diego?)
  - Stock: more risk on returns.
  - A person’s performance: consistency. emotional…
  - Other examples?
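Use frequency table to compute the population variance:
Data: 14, 15, 15, 15, 16, 16, 16, 17

Data value   Frequency   Relative Frequency
14           1           0.125
15           3           0.375
16           3           0.375
17           1           0.125

Worksheet using every data point (fill in the distance and square columns):

Data   distance   square
14
15
15
15
16
16
16
17

Worksheet using the distinct data values (fill in, then compute the weighted average):

Data   distance   square
14
15
16
17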
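The weighted-average route to the population variance might look like the sketch below. The frequencies are taken from the slide's table; the variable names and dictionary layout are illustrative.

```python
# Frequency table from the slide: data value -> frequency.
frequency = {14: 1, 15: 3, 16: 3, 17: 1}
N = sum(frequency.values())

# Relative frequency of each distinct value.
relative = {value: count / N for value, count in frequency.items()}

# Population mean as a weighted average of the distinct values.
mu = sum(value * rf for value, rf in relative.items())

# Population variance as a weighted average of squared distances from mu.
variance = sum(rf * (value - mu) ** 2 for value, rf in relative.items())

for value, rf in relative.items():
    print(f"{value}: relative frequency = {rf}, "
          f"distance = {value - mu:+.1f}, square = {(value - mu) ** 2:.2f}")
print(f"population mean = {mu}, population variance = {variance}")
```

Working with the four distinct values and their relative frequencies gives the same result as listing all eight data points and dividing the sum of squares by 8.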
Standard Deviation
• Square root of variance.
• An indicator of data deviation that can be directly compared to the mean.

Sample variance $s^2$, sample standard deviation $s = \sqrt{s^2}$
OR
Population variance $\sigma^2$, population standard deviation $\sigma = \sqrt{\sigma^2}$

Exercise: compute the standard deviation from the histogram on slide no. 5 (Compute the mean from Histogram) and locate it on the histogram.
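One way to approach the exercise is sketched here. It assumes, as the slide's own mean calculation does, that every observation in a bin takes the bin's label value (15, 25, 35, 45, 55 with frequencies 3, 6, 5, 4, 2). Both versions of the standard deviation are shown because the slide does not say whether the data are a sample or a population.

```python
import math

# Histogram data: bin value -> frequency (values as used in the slide's
# mean calculation; treating each bin as a single value is an assumption).
histogram = {15: 3, 25: 6, 35: 5, 45: 4, 55: 2}

n = sum(histogram.values())
mean = sum(value * count for value, count in histogram.items()) / n

# Sum of squared distances from the mean, counting each bin `count` times.
sum_sq = sum(count * (value - mean) ** 2 for value, count in histogram.items())

population_sd = math.sqrt(sum_sq / n)      # divide by N
sample_sd = math.sqrt(sum_sq / (n - 1))    # divide by n - 1

print(f"n = {n}, mean = {mean}")
print(f"population standard deviation = {population_sd:.3f}")
print(f"sample standard deviation     = {sample_sd:.3f}")
```

Because the standard deviation is in the same units as the data, it can be marked on the histogram as a distance to either side of the mean.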
Empirical Rule
• If the data are bell shaped (most of the time), then
  - about 68% of all data will fall in the range of $\mu \pm \sigma$
  - about 95% of all data will fall in the range of $\mu \pm 2\sigma$
  - about 99.7% of all data will fall in the range of $\mu \pm 3\sigma$
[Figure: bell curve centered at μ, with the ranges μ ± σ (68%), μ ± 2σ (95%), and μ ± 3σ (99.7%) marked]
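The three percentages can be checked against an exactly normal curve. This sketch assumes that "bell shaped" means normally distributed, which real data only approximate; it uses `statistics.NormalDist` from the Python standard library (available from Python 3.8).

```python
from statistics import NormalDist

# Standard normal curve (mu = 0, sigma = 1). The shares are the same for any
# mu and sigma because the ranges are expressed in units of sigma.
bell = NormalDist(mu=0.0, sigma=1.0)

coverage = {}
for k in (1, 2, 3):
    # Share of the distribution between mu - k*sigma and mu + k*sigma.
    coverage[k] = bell.cdf(k) - bell.cdf(-k)

for k, share in coverage.items():
    print(f"within {k} sigma of the mean: {share:.1%}")
```

For data that are only roughly bell shaped, the observed shares will differ somewhat from these values, which is why the rule is stated as an approximation.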