Download Empirical Loop Lecture Flow Chart

4/7/12
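Empirical Loop (diagram): Hypothesis, Research Design, Collect Data, Descriptive Statistics, Inferential Statistics
Lecture Flow Chart (diagram): Data, Frequency Distributions, Graphs, Measures of Central Tendency
Frequency Distributions example (bins shown: 5-7 ft, 3-5 ft; counts shown: 16, 13)
Measures of Central Tendency example: On average, these students are 5.2 ft tall.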
Chapter 3: Describing Data with Averages
Measures of Central Tendency
1. Quantitative Data
A. Mode
B. Median
C. Mean
D. Which Average?
2. Qualitative & Rank Data
A. Nominal vs. Ordinal
Mode: The value of the most frequent observation
Table 3.1: Terms in years of the last 20 U.S. Presidents
4, 4, 4, 8, 4, 8, 2, 6, 4, 12, 8, 8, 2, 6, 5, 3, 4, 8, 4, 8
B. Bimodal
Figure 3.1
Median: The value that is greater than or equal to 50% of the observations
Numbers of Siblings (as collected): 2, 5, 2, 3, 0
How to Calculate the Median
1. Order observations from least to most.
   Numbers of Siblings (sorted): 0, 2, 2, 3, 5
2. Find the middle position by adding one to the total number of observations and dividing by two.
   Middle: (5+1)/2=3
3. If the middle position is a whole number, then the value at the middle position is the median.
   Median=2 (the value at position 3)
4. If the middle position is NOT a whole number, then add the values at the positions immediately above and immediately below the middle position and divide by two. The result is the median.
Example with an even number of observations: 0, 2, 2, 3, 5, 6
Middle: (6+1)/2=3.5
Median=(2+3)/2=2.5
Compute the mode and median of the following sets of data
A) 2, 2, 8
Mode=2, Median=2
B) 2, 3, 5, 5
Mode=5, Median=4
C) 20.3, 21.4, 22.7
Mode=?, Median=21.4
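The mode definition and the four median steps can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function names `modes` and `median_by_position` are made up here, and returning an empty list when every value occurs only once is an assumption about how to read "Mode=?".

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest frequency.

    Assumption: if no value repeats, report no mode (an empty list).
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def median_by_position(values):
    """Median via the lecture's steps: sort, find (n+1)/2, read off or average."""
    ordered = sorted(values)                 # step 1: order least to most
    middle = (len(ordered) + 1) / 2          # step 2: 1-based middle position
    if middle == int(middle):                # step 3: whole-number position
        return ordered[int(middle) - 1]
    below = ordered[int(middle) - 1]         # step 4: value just below the middle
    above = ordered[int(middle)]             #         value just above the middle
    return (below + above) / 2


presidents = [4, 4, 4, 8, 4, 8, 2, 6, 4, 12, 8, 8, 2, 6, 5, 3, 4, 8, 4, 8]
print(modes(presidents))                          # mode of Table 3.1
print(median_by_position([2, 5, 2, 3, 0]))        # odd number of observations
print(median_by_position([0, 2, 2, 3, 5, 6]))     # even number of observations
```

Returning a list from `modes` keeps bimodal data representable, since two values can tie for most frequent.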
Mean: What people usually think of as the “average.”
$\bar{x} = \frac{\sum_i x_i}{n}$
Numbers of Siblings: 0, 2, 2, 3, 5
0+2+2+3+5=12
n=5
12/5=2.4
Mean: The “balance point” of a sample.
Figure 3.3
Compute the mode, median, and mean of the following sets of data
A) 2, 2, 8
Mode=2, Median=2, Mean=4
B) 2, 2, 800
Mode=2, Median=2, Mean=268
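A short Python check of the two data sets shows how one extreme value moves the mean while leaving the median and mode alone. This is a sketch using the standard-library `statistics` module; the helper name `summarize` is hypothetical.

```python
from statistics import mean, median, mode


def summarize(data):
    """Return the three averages for one data set as a dictionary."""
    return {"mode": mode(data), "median": median(data), "mean": mean(data)}


set_a = summarize([2, 2, 8])
set_b = summarize([2, 2, 800])
print("A:", set_a)
print("B:", set_b)
```

Only the largest observation differs between the two sets, so any change between the two summaries comes from that single value.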
Symmetric Unimodal Distributions
Mode=Median=Mean
Bimodal/Multimodal Distributions
(Figure, panel B. Bimodal, with the modes labeled)
+/- Skewed Unimodal Distributions
Self-Defense: Politics
Proponents of the Bush administration’s 2003 tax cut
proclaimed that, on average, families would receive a tax cut of
$1000. Opponents of the cut countered that more than half
of all families would receive a tax cut of less than $100.
Source: Best, J. (2004) More Damned Lies and Statistics
In 1984, the University of Virginia announced that its
department of rhetoric and communications
graduates’ MEAN STARTING SALARY was $55,000.
This was highly skewed by the salary of one graduate,
NBA center Ralph Sampson.
Source: Gonick, L. & Woollcott, S. (1993) The Cartoon Guide to
Statistics
Nominal Data: Can only use
mode.
Ordinal Data: Can use mode or
median.
Table 3.5
Exercise: 20 college students were surveyed to determine where they would most like to spend their spring vacation: Daytona Beach (DB), San Diego (SD), South Padre Island (SP), Lake Havasu (LH) or Other (O). The results, tallied by destination, were as follows:
Destination   # of students
DB            8
SD            4
LH            3
SP            2
O             3
Find the Mode and (if possible) the Median
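For nominal data such as the spring-vacation survey, the mode is just the most frequent label. A minimal Python sketch, assuming the tallies DB=8, SD=4, LH=3, SP=2, O=3 as read from the slide (the variable names are illustrative only):

```python
from collections import Counter

# Tallies of the 20 survey responses (assumed from the slide's frequency counts).
tallies = Counter({"DB": 8, "SD": 4, "LH": 3, "SP": 2, "O": 3})

# most_common(1) returns a list holding the single (label, count) pair
# with the highest count.
mode_label, mode_count = tallies.most_common(1)[0]
total = sum(tallies.values())
print(mode_label, mode_count, total)
```

No median is computed here: the destinations have no natural order, so there is nothing to sort them by.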
Lecture Flow Chart (diagram): Data, Frequency Distributions, Graphs, Measures of Central Tendency, Measures of Dispersion
Measures of Central Tendency example: On average, these students are 5.2 ft tall.
Measures of Dispersion example: The standard deviation of COGS 14 student height is 1.1 ft.
Chapter 4: Describing Variability
Measures of Dispersion
I. Quantitative Data
A. Range
B. Variance
C. Standard Deviation
1. Sample, Population, or Estimate of Population
D. Interquartile Range
E. New Graphs
II. Nominal & Ordinal Data
A. Entropy
Range: The maximum value observed minus the minimum value observed
Gambling Results: -$3, -$1, $0, $2, $2
max-min=$2-(-$3)=$2+$3=$5
Problems with Range
•  Its value is derived from only two observations, which means that it won’t replicate very well.
•  Its value depends on the size of the sample. Bigger samples will tend to have bigger ranges.
Estimated Population Variance (i.e. estimated from your sample): $s^2$
$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$
The mean: $\bar{x} = \frac{\sum_i x_i}{n}$
n=# of observations in the sample
Note: Textbook calls $s^2$ the “variance for sample”
Gambling Results: -$3, -$1, $0, $2, $2
$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$ = 4.5 dollars$^2$
Problems with Variance
•  Its value is in “units squared.”
Estimated Population Standard Deviation (i.e. estimated from your sample): s
$s = \sqrt{s^2} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$
$\bar{x} = \frac{\sum_i x_i}{n}$
n=# of observations in the sample
Note: Textbook calls this the “standard deviation for sample”
Gambling Results: -$3, -$1, $0, $2, $2
Sample Standard Deviation: s
Definition Formula: $s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$
Computation Formula: $s = \sqrt{\frac{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}{n^2}}$
Standard Deviation Not Enough
For Perverted Statistician
http://www.theonion.com/content/index/3625
Standard Deviation:
•  An intuitive interpretation: the standard deviation is APPROXIMATELY the average distance of the observations from the mean.
•  For many frequency distributions, a majority (approx. 68% for normal distributions) of all observations are within ± one standard deviation of the mean.
•  For many frequency distributions, a small minority (approx. 5% for normal distributions) of all observations are beyond ± two standard deviations of the mean.
pg. 86
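The 68% and 5% figures for normal distributions can be checked without simulation. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

# Proportion of observations within one standard deviation of the mean.
within_one = z.cdf(1) - z.cdf(-1)

# Proportion of observations beyond two standard deviations of the mean.
beyond_two = 1 - (z.cdf(2) - z.cdf(-2))

print(f"within ±1 sd: {within_one:.3f}")
print(f"beyond ±2 sd: {beyond_two:.3f}")
```

These proportions hold for normal distributions specifically; for skewed or heavy-tailed data they are only rough guides.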
Sample, Population, Estimate of Population
Empirical Loop (diagram): Hypothesis, Research Design, Collect Data, Descriptive Statistics, Inferential Statistics, with labels Sample and Population
Mean:
Estimate of Population Parameter from a Sample: $\bar{x} = \frac{\sum_i x_i}{n}$
Population Parameter (for discrete data): $\mu = \frac{\sum_i x_i}{N}$
Variance:
Estimate of Population Parameter from a Sample: $s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$
Population Parameter (for discrete data): $\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}$
Standard Deviation:
Estimate of Population Parameter from a Sample: $s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$
Population Parameter (for discrete data): $\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}$
Compute the range of the sample and estimate the population variance and standard deviation from the sample.
Scores on a Quiz (pts.): 1, -2, 4, 4, 3
Range=4-(-2)=4+2=6 pts.
Mean=10/5=2 pts.
Estimate of Pop. Var.=26/4=6.5 pts.^2
Estimate of Pop. Std.=sqrt(6.5)=2.5 pts.
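The difference between dividing by n-1 (estimating the population from a sample) and dividing by N (describing a full population) can be sketched in Python. The function names below are made up for illustration; the standard library's `statistics.variance` and `statistics.pvariance` make the same distinction.

```python
import math


def estimated_population_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(sample)
    x_bar = sum(sample) / n
    return sum((x - x_bar) ** 2 for x in sample) / (n - 1)


def population_variance(population):
    """Sum of squared deviations from the population mean, divided by N."""
    big_n = len(population)
    mu = sum(population) / big_n
    return sum((x - mu) ** 2 for x in population) / big_n


quiz = [1, -2, 4, 4, 3]        # Scores on a Quiz (pts.)
gambling = [-3, -1, 0, 2, 2]   # Gambling Results ($)

quiz_range = max(quiz) - min(quiz)
quiz_var = estimated_population_variance(quiz)
quiz_std = math.sqrt(quiz_var)
gambling_var = estimated_population_variance(gambling)

print(quiz_range, quiz_var, round(quiz_std, 1), gambling_var)
```

Taking the square root returns the measure to the original units (points or dollars), which is the usual reason to report the standard deviation rather than the variance.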
Outliers: very extreme observations
Estimate of Pop. Std.
Gambling Results: -$3, -$1, $0, $1, $2, $2
$\hat{\sigma} = s = 1.9$
Luckier Gambling Results: -$3, -$1, $0, $1, $2, $200
$\hat{\sigma} = s = 81.7$
Interquartile Range: IQR
1. Sort the data.
2. Split the data in half. If you have an odd number of data points, throw out the median.
3. The median of the lower half is the lower quartile. The median of the upper half is the upper quartile.
4. IQR=upper quartile-lower quartile
Gambling Results: -$3, -$1, $0, $1, $2, $2
IQR=$2-(-$1)=$2+$1=$3
Luckier Gambling Results: -$3, -$1, $0, $1, $2, $200
IQR=$2-(-$1)=$2+$1=$3
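The four-step IQR procedure from lecture can be sketched in Python as below. The function names are hypothetical, and the sketch assumes at least two observations. Library quartile routines (for example `statistics.quantiles`) use different interpolation rules by default and may not agree with this procedure.

```python
def _median_of_sorted(ordered):
    """Median of an already-sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Lower and upper quartile via the split-in-half procedure."""
    ordered = sorted(values)               # step 1: sort
    half = len(ordered) // 2               # step 2: split in half;
    lower = ordered[:half]                 #   for odd n the median is dropped
    upper = ordered[len(ordered) - half:]  #   because neither half includes it
    return _median_of_sorted(lower), _median_of_sorted(upper)  # step 3


def iqr(values):
    lq, uq = quartiles(values)
    return uq - lq                         # step 4


print(iqr([-3, -1, 0, 1, 2, 2]))      # Gambling Results
print(iqr([-3, -1, 0, 1, 2, 200]))    # Luckier Gambling Results
```

Both data sets give the same IQR even though one contains $200, which is the point of the comparison with the standard deviation.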
Approx. Interquartile Range: IQR (via upper & lower hinge)
NOTE: THIS IS NOT EXACTLY THE SAME AS THE IQR ALGORITHM IN THE TEXTBOOK!! (pg 92) USE THE IQR ALGORITHM FROM LECTURE
Compute the interquartile range of these scores.
Scores on a Quiz (pts): 1, -2, 4, 4, 3, -4, 7, 0, -1, 5
IQR=4-(-1)=4+1=5 pts.
Mean, Median, Standard Deviation, Interquartile Range
Comparing Multiple Data Sets
Frequency Polygons
Boxplot (“refined” boxplot to be precise)
Boxplot labels: Outliers; Maximum value that is not an outlier; Upper Quartile (UQ); Median; Lower Quartile (LQ); Minimum value that is not an outlier
Positive outliers>UQ+1.5*IQR
Negative outliers<LQ-1.5*IQR
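The two outlier rules for the refined boxplot can be sketched in Python. The quartiles here come from the split-in-half procedure, which is an assumption about which quartile definition the boxplot uses; the names `fences` and `outliers` are illustrative.

```python
from statistics import median


def fences(values):
    """Return (lower fence, upper fence) = (LQ - 1.5*IQR, UQ + 1.5*IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lq = median(ordered[:half])                  # median of the lower half
    uq = median(ordered[len(ordered) - half:])   # median of the upper half
    spread = uq - lq                             # interquartile range
    return lq - 1.5 * spread, uq + 1.5 * spread


def outliers(values):
    """Observations falling outside either fence."""
    low, high = fences(values)
    return [v for v in values if v < low or v > high]


luckier = [-3, -1, 0, 1, 2, 200]
print(fences(luckier))
print(outliers(luckier))
print(outliers([-3, -1, 0, 1, 2, 2]))
```

The whiskers of the boxplot would then end at the smallest and largest observations that are not in the `outliers` list.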
Comparing Multiple Data Sets
(if data are normally distributed)
Entropy (in bits): H
$H = -\sum_i f(x_i) \log_2 f(x_i)$
f(x) is the relative frequency of value x
Frequency Distribution example: frequencies 4 and 6, relative frequencies 0.4 and 0.6
Minimum Entropy:
Frequencies 10, 0; relative frequencies 1, 0: 0 bits
(When one outcome is much more probable, entropy (uncertainty) is lower.)
Maximum Entropy (for two values):
Frequencies 5, 5; relative frequencies 0.5, 0.5: 1 bit
(When all outcomes are equally probable, entropy (uncertainty) is at its highest.)
Maximum Entropy (for four values):
Frequencies 5, 5, 5, 5; relative frequencies 0.25, 0.25, 0.25, 0.25: 2 bits
Maximum Entropy: $H_{max}$
Given the # of alternatives, how big can entropy get?
$H_{max} = -\log_2\left(\frac{1}{k}\right) = \log_2 k$
Total # of Classes=k
Relative Entropy: J
$J = \frac{H}{H_{max}}$
Compute the sample relative entropy of the following data.
College Year: Fr, So, Fr, Fr, Jr, Jr, So, Fr
Entropy=.5+.5+.5+0=1.5 bits
Max Entropy=2 bits
Relative Entropy=1.5/2=.75
Converting log10 into log2:
$\log_2 x = \frac{\log_{10} x}{\log_{10} 2}$
For example: $\log_2 0.5 = \frac{\log_{10} 0.5}{\log_{10} 2} = \frac{-.301}{.301} = -1$
Base of log doesn’t matter for relative entropy.
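The entropy and relative entropy calculations can be sketched in Python. The function names are made up for illustration, and k (the number of possible classes) is passed in explicitly because a class with zero observations, such as seniors in the College Year data, cannot be inferred from the sample itself.

```python
import math
from collections import Counter


def entropy_bits(observations):
    """H = -sum f(x) * log2 f(x), with f the relative frequency of each value.

    Classes that never occur contribute 0 and are simply absent from the sum.
    """
    n = len(observations)
    counts = Counter(observations)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def relative_entropy(observations, k):
    """J = H / H_max, where H_max = log2(k) for k possible classes."""
    return entropy_bits(observations) / math.log2(k)


years = ["Fr", "So", "Fr", "Fr", "Jr", "Jr", "So", "Fr"]
h = entropy_bits(years)
j = relative_entropy(years, k=4)   # Fr, So, Jr, Sr
print(h, j)
```

Because J is a ratio of two logarithms taken in the same base, computing both with `math.log10` or `math.log` instead of `math.log2` would give the same J.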
This week’s homework:
•  In addition to Quiz 3!
•  Posted on WebCT
•  Due Monday 10/18 at start of lecture!
•  If you show your work, we can give you partial credit.
•  Graphs can be hand drawn but need to be legible.
•  It’s good to check answers with others, but do your
own work.
Free Online Software:
Mean, Median, Estimated Population Standard Deviation:
http://www.r-project.org/
• Examples of how to use R will be
posted on the course calendar.
• Go to labs for help
http://www.physics.csbsju.edu/stats/cstats_NROW_form.html
Google spreadsheets:
http://docs.google.com
Open Office (Software):
http://www.openoffice.org/
Lecture Flow Chart (diagram): Data, Measures of Central Tendency, Measures of Dispersion, Percentages
Measures of Central Tendency example: On average, these students are 5.2 ft tall.
Measures of Dispersion example: The standard deviation of COGS 14 student height is 1.1 ft.
Percentages example: I spent 33% of yesterday asleep.
Percentages
percentage = (# of instances of a class) / (total # of instances)
Example: 7 female students / 10 students = 70%
The denominator, total # of instances, is “The Base”
Percentages
A used car salesman originally lists a car on his lot for $1000. He puts a sale sticker on the car, announcing that he’s cut 20% off the list price. When you go to look at the car, he tells you that because he likes you so much, he’ll give you a further 10% discount. What’s the price he’s offering?
$1000*20%=$200
$800*10%=$80
Total Cut=$280 (28% cut)
Price offered=$1000-$280=$720
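The used-car example turns on the base changing between discounts: the second 10% applies to the already reduced price, not to the list price. A minimal Python sketch (the function name `apply_discounts` is hypothetical):

```python
def apply_discounts(list_price, percents_off):
    """Apply each percentage discount in turn to the current price."""
    price = list_price
    for pct in percents_off:
        price -= price * pct / 100   # the base is the current price, not the list price
    return price


list_price = 1000
offered = apply_discounts(list_price, [20, 10])
total_cut = list_price - offered
total_cut_percent = 100 * total_cut / list_price
print(offered, total_cut, total_cut_percent)
```

Adding the two percentages would suggest a 30% cut; the result is smaller because the 10% is taken from $800 rather than $1000.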
Percentages
Smoker Statehood (n=1000 smokers)
1. 20% of the smokers are Californians?
2. New Yorkers are twice as likely to smoke as Californians?
Correlations
•  Are two variables related?
•  Let’s quantify it!
*Caution: All of the following correlations are fictional. Any resemblance to real correlations, living or dead, is entirely coincidental.
Scatterplot axis labels: Coffee consumption (cups per week); Hydration level (healthy = 0); Proportion of obstacles avoided on driving test
Relationship can be strong and negative
Scatterplot axis labels: Kilopirates (thousands of pirates); Yearly feet of snow in Arctic
Scatterplot axis labels: Hours of studying per week (UCSD students)
RANGE RESTRICTION can lower a correlation
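Range restriction can be illustrated with a small Python sketch. Two things here are assumptions rather than content from the slides: that the correlation being quantified is Pearson's r, and the data, which are invented purely for illustration (hours of studying from 0 to 20, with a score that rises with hours plus a fixed up-and-down wobble).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: score = hours plus an alternating +3 / -3 wobble.
hours = list(range(21))
scores = [h + (3 if h % 2 == 0 else -3) for h in hours]

r_full = pearson_r(hours, scores)

# Keep only observations with 8 to 12 hours of studying (a restricted range).
restricted = [(h, s) for h, s in zip(hours, scores) if 8 <= h <= 12]
r_restricted = pearson_r(*zip(*restricted))

print(round(r_full, 2), round(r_restricted, 2))
```

The wobble is the same size in both cases, but in the restricted sample the hours vary much less, so the wobble accounts for a larger share of the variation in scores and the correlation drops.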