Download Z - UC Davis

Fun With Numbers
 Z-scores
 R code (wrt hw. 4)
Statistical Inference
 The normal distribution is our first choice in most cases because
it has nice properties:
   Distribution is symmetrical around the mean
   A known percentage of cases falls within each standard deviation of the mean
   Can identify the probability of values from the area under the curve
   A linear combination of independent normally distributed variables is itself normally distributed
 Central limit theorem
 Great flexibility in using the normal distribution
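The central limit theorem can be illustrated with a small simulation. This is only a sketch: the data are simulated, and the seed, sample sizes, and variable names are assumptions made for the illustration, not part of the course material.

```r
# Sketch only: simulated data, not the course data.
set.seed(42)          # arbitrary seed so the run is repeatable
n_samples <- 5000     # number of simulated samples (assumed)
n_obs     <- 50       # observations per sample (assumed)

# Each sample comes from an exponential distribution (mean 1, s.d. 1),
# which is strongly right-skewed, i.e. clearly not normal.
sample_means <- replicate(n_samples, mean(rexp(n_obs, rate = 1)))

# Central limit theorem: the sample means should centre on 1
# with a standard deviation close to 1 / sqrt(n_obs).
clt_mean <- mean(sample_means)
clt_sd   <- sd(sample_means)
c(observed_mean = clt_mean, observed_sd = clt_sd, predicted_sd = 1 / sqrt(n_obs))

# Share of sample means within 1 s.d. of their own mean
# (roughly 0.68 if they are approximately normal).
share_within_1sd <- mean(abs(sample_means - clt_mean) < clt_sd)
share_within_1sd
```

Even though the raw draws are skewed, the means behave much like a normal variable, which is why the normal distribution is so often a reasonable first choice.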
Normal Distribution
 Normal Distribution and areas under it.
 68-95-99.7 Percent Rule
 In a normal distribution, about 68 percent of the observations fall within +/- 1 standard deviation of the mean, about 95 percent within +/- 2, and about 99.7 percent within +/- 3.
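The rule can be checked directly from the standard normal cumulative distribution function. A minimal sketch using base R's pnorm(); the variable names are chosen here for illustration.

```r
# Area under the standard normal curve within k standard deviations of the mean.
k <- 1:3
area_within_k <- pnorm(k) - pnorm(-k)

# Rounded to four decimals: 0.6827 0.9545 0.9973
round(area_within_k, 4)
```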
 A Picture: [Figure: area under the normal curve, with some added annotations. Source: http://members.aol.com/svennord/ed/normal.htm]
 Another Picture: [Figure: a second picture of the normal curve]
What do we know?
 Area is useful to determine probabilities.
 Fun with Numbers
 Gas Prices (Let’s take a sidetrip)
 What are some research issues when looking at financial
data over time?
 Inflation!
 2007 dollars vs. 1990 dollars
 CPI: Price in 2007 dollars = 1990 Price * (2007 CPI / 1990 CPI)
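Inflation adjustment rescales an old price by the ratio of the price index in the target year to the index in the original year. The sketch below is illustrative only: the helper name adjust_price() is hypothetical, the CPI values are approximate annual averages that should be checked against the official series, and the 1990 price is made up.

```r
# Hypothetical helper: express a price from one year in another year's dollars.
adjust_price <- function(price, cpi_from, cpi_to) {
  price * (cpi_to / cpi_from)
}

cpi_1990   <- 130.7   # assumed (approximate) CPI for 1990
cpi_2007   <- 207.3   # assumed (approximate) CPI for 2007
price_1990 <- 1.00    # made-up 1990 gas price, in 1990 dollars

# The same price expressed in 2007 dollars.
price_2007_dollars <- adjust_price(price_1990, cpi_1990, cpi_2007)
round(price_2007_dollars, 2)
```

With these assumed index values, one 1990 dollar of gas corresponds to roughly 1.59 in 2007 dollars, which is why unadjusted and adjusted series can tell different stories.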
Visualizing Data is FUNdamental
[Figure: time-series plots of regular gas prices. Panel titles: "Non-CPI Adjusted Gas Prices" and "CPI Adjusted Gas Prices". x-axis: Year (with monthly measurements), ticks 1990 to 2010; y-axis: Gas Price (Regular, CPI-adjusted), ticks 1 to 3. Series shown: Unadjusted, CPI-Adjusted, and CPI-Adjusted w/o 05/06.]
Histograms
[Figure: two histograms, "Histogram of Gas Prices" (x-axis: Price of Gas (unadjusted)) and "Histogram of Gas Prices (CPI-Adjusted)" (x-axis: Price of Gas (CPI-adjusted)); y-axis: Density; prices range from about 1 to 3.]
Using z-scores
 Taking advantage of the normal distribution
 Area under the normal is probability area.
 Probabilities must sum to 1.
 Full density under normal is 1.
 Since it’s symmetric, we know the probability of “being
above” the mean is .50 (ditto on below)
Standard Normal Distribution
 Z ~ N(0,1)
 Easy to compute:
$$ z = \frac{X - \bar{X}}{\sigma} $$
 When X = mean, z = 0.
 Metric of z-score: standard deviations from the mean.
 Thus, if z = 1, X is 1 s.d. above the mean.
 NOW since we know the 68-95-99.7 Rule, we can identify probs.
Getting Gas
 Let’s look at the adjusted gas prices.
 Means (s.d. in parentheses):

Year   Mean   (s.d.)
2006   2.57   (.30)
2005   2.34   (.32)
2004   1.98   (.15)
2003   1.71   (.09)
2002   1.51   (.13)
2001   1.62   (.20)
2000   1.74   (.11)
1999   1.37   (.15)
1998   1.27   (.04)
1997   1.51   (.04)
1996   1.54   (.08)
1995   1.47   (.06)
1994   1.46   (.07)
1993   1.49   (.03)
1992   1.56   (.07)
1991   1.62   (.05)
1990   2.00   (.07)  [small n]
(Anything interesting here?)
Compute a z-score
 Mean adjusted price: 1.68 (s.d. .37)
 To derive the z-score for any year, substitute a value X into
$$ z = \frac{X - \bar{X}}{\sigma} $$
 Suppose X = 1.68?
 z = (1.68 - 1.68)/.37 = 0
 The mean is normalized to 0.
 1 s.d. above the mean? 1.68 + .37 = 2.05
 z = (2.05 - 1.68)/.37 = 1
 The metric of z is in standard deviations.
 “Standardizing” X allows us to use the “z distribution.”
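The arithmetic on this slide can be reproduced in R. A minimal sketch, using the mean (1.68) and standard deviation (.37) quoted above; the function name z_score() is chosen here for illustration.

```r
mean_price <- 1.68   # mean CPI-adjusted price, from the slide
sd_price   <- 0.37   # standard deviation, from the slide

# z-score: distance from the mean, measured in standard deviations.
z_score <- function(x, center, spread) {
  (x - center) / spread
}

# The mean itself, one s.d. above the mean, and a price of 2.50.
gas_z <- z_score(c(1.68, 2.05, 2.50), mean_price, sd_price)
round(gas_z, 2)
```

The first two values are 0 and 1, matching the hand calculation; the third, about 2.22, is the price examined on the probability slides.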
The Most “Average”
|------------------------------------------|
|    Price            z     Week      Year |
|------------------------------------------|
| 1.680374     -.009361    Feb 12     2001 |
| 1.681257    -.0069663    Nov 03     2003 |
| 1.681329    -.0067707    Apr 24     2000 |
| 1.682352    -.0039966    Aug 04     2003 |
| 1.683292     -.001449    Jun 03     1991 |
|------------------------------------------|
| 1.684771     .0025612    Feb 04     1991 |
|  1.68625     .0065716    May 27     1991 |
| 1.688924     .0138213    Oct 27     2003 |
| 1.689519     .0154355    Apr 17     2000 |
|  1.69062     .0184197    Sep 24     2001 |
|------------------------------------------|
The 10 Most “Above Average”
|------------------------------------------|
|    Price            z     Week      Year |
|------------------------------------------|
|    2.947     3.424879    May 15     2006 |
|    2.973     3.495373    Jul 10     2006 |
|    2.989     3.538755    Jul 17     2006 |
|        3      3.56858    Aug 14     2006 |
|------------------------------------------|
|    3.003     3.576713    Jul 24     2006 |
|    3.004     3.579425    Jul 31     2006 |
| 3.021628      3.62722    Oct 03     2005 |
|    3.038      3.67161    Aug 07     2006 |
| 3.049491     3.702766    Sep 12     2005 |
| 3.167136     4.021741    Sep 05     2005 |
|------------------------------------------|

The 10 Most “Below Average”
|------------------------------------------|
|    Price            z     Week      Year |
|------------------------------------------|
| 1.096723     -1.59183    Feb 22     1999 |
| 1.103978    -1.572159    Mar 01     1999 |
| 1.111233    -1.552488    Feb 15     1999 |
| 1.113652    -1.545931    Mar 08     1999 |
| 1.120907     -1.52626    Feb 08     1999 |
|------------------------------------------|
| 1.123325    -1.519703    Feb 01     1999 |
|  1.13058    -1.500032    Jan 04     1999 |
| 1.131789    -1.496754    Jan 25     1999 |
| 1.137835    -1.480361    Jan 11     1999 |
| 1.141463    -1.470526    Jan 18     1999 |
|------------------------------------------|
[Figure: "Z-Score for CPI-Adjusted Gas Price", graphs by year, one panel per year from 1990 to 2007; z-score axis from -2 to 4.]
Finding Probabilities
 What is the probability of a gas price of 2.50 or higher?
 The z-score is (2.50 - 1.68)/.37 = 2.22.
 In the z-distribution, if gas prices were truly normally distributed, a score this high or higher has a probability of .013, or about 1.3%. It’s an unlikely event.
 How computed? 1 - .9868 = .0132 gives the area above (consult a standard normal table)
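The table lookup can be replaced by a call to pnorm(). A minimal sketch using the z-score of 2.22 from this slide, plus the mean (1.68) and s.d. (.37) quoted earlier; the variable names are chosen for illustration.

```r
# Upper-tail area beyond z = 2.22, i.e. 1 - pnorm(2.22).
p_above_z <- pnorm(2.22, lower.tail = FALSE)
round(p_above_z, 4)

# Same question asked on the price scale, without rounding z first.
p_above_price <- pnorm(2.50, mean = 1.68, sd = 0.37, lower.tail = FALSE)
round(p_above_price, 4)
```

Both results are about .013; the small difference between them comes only from rounding z to 2.22 before the lookup.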
Finding Probabilities
 What is the probability of a gas-price z-score falling between -1.75 and 1.75?
 P(above)=.04; P(below)=.04
 Therefore, P(in between)=1-.08= .92
 The upper tail is .04; the lower tail is .04
 Any probability calculation is this straightforward.
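The same tail arithmetic can be done in R. A minimal sketch for the interval between -1.75 and 1.75; the variable names are chosen for illustration.

```r
z_cut <- 1.75

p_upper_tail <- pnorm(z_cut, lower.tail = FALSE)   # area above  1.75
p_lower_tail <- pnorm(-z_cut)                      # area below -1.75

# Area in between: total probability (1) minus the two tails.
p_between <- 1 - (p_upper_tail + p_lower_tail)
round(c(upper = p_upper_tail, lower = p_lower_tail, between = p_between), 2)
```

Because the curve is symmetric, the two tails are equal, and the middle area comes out at about .92.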
Issues
 The “gas price” example is pedagogical.
 Serious analysis of gas-pricing effects would require much
more sophisticated statistical techniques.
 z is useful to compare observations from historical eras or
across disparate cases.
 Hands-on examples in R
Plots and Z-scores
 How to do some of the “stuff” in HW 4
 Multiple plots on a single page
 Creating z-scores and finding p-values
 Visualizing political data
 Data: Obama vote share by county
Dot Chart: Obama Vote
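# labels=row.names assumes an object named row.names holding the county names
# is available (for example, a column of an attached data frame)
dotchart(obamapercent, labels=row.names, cex=.7, xlim=c(0, 100),
         main="Support for Obama", xlab="Percent Obama")
abline(v=50)   # reference line at 50 percent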
Returns:
[Figure: dot chart "Support for Obama"; x-axis: Percent Obama, 0 to 100, with a vertical line at 50. Counties listed from top to bottom: San Francisco, Alameda, Marin, Santa Cruz, Sonoma, San Mateo, Mendocino, Santa Clara, Los Angeles, Contra Costa, Monterey, Yolo, Napa, Solano, Humboldt, Alpine, Imperial, Santa Barbara, San Benito, Lake, Sacramento, Mono, Ventura, San Joaquin, San Diego, Merced, San Luis Obispo, Nevada, San Bernardino, Trinity, Riverside, Fresno, Stanislaus, Butte, Orange, Del Norte, Placer, El Dorado, Inyo, Siskiyou, Plumas, Tuolumne, Mariposa, Madera, Amador, Kings, Calaveras, Tulare, Sutter, Yuba, Kern, Colusa, Sierra, Glenn, Tehama, Shasta, Lassen, Modoc.]
Interpretation?
 Geographical Patterns?
 Central Valley
 Coastal
 SoCal, NorCal?
 Why might you observe these patterns?
 Z-scores
 NB: we’re doing this for learning purposes
Z-scores
 Easy: create mean, standard deviation
 Then derive z-score using formula from last slide set:
 R code on next slide
Z-scores and R
#Z scores for Obama
meanobama<-mean(obamapercent)
sdobama<-sd(obamapercent)
zobama<-(obamapercent-meanobama)/sdobama
Interpretation

Z-scores are in the metric of standard deviations.
A large z (in absolute value) implies the observation is farther from the mean than observations with a small z.
Z=0 means the observation is exactly at the mean.
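These properties can be verified on a small example. The sketch below uses a made-up vector of vote percentages (not the county data) and compares the manual formula with base R's scale(); the variable names are hypothetical.

```r
# Made-up vote shares for illustration only (not the actual county data).
vote_demo <- c(84.2, 61.1, 54.9, 49.8, 41.3, 35.7, 29.7)

# Manual standardization, same formula as in the slide code.
z_manual <- (vote_demo - mean(vote_demo)) / sd(vote_demo)

# scale() centers and rescales in one step; it returns a one-column matrix.
z_scale <- as.numeric(scale(vote_demo))

# The two approaches agree, and the z-scores have mean 0 and s.d. 1.
all.equal(z_manual, z_scale)
round(c(mean = mean(z_manual), sd = sd(z_manual)), 10)
```

The observation with the largest absolute z is the one farthest from the mean, whichever side it falls on.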

Dotchart (code):
par(mfcol=c(1,1))             # back to one plot per page
dotchart(zobama, labels=row.names, cex=.7, xlim=c(-3, 3),
         main="Obama Vote Z-scores", xlab="Z-score")
abline(v=1, col="red")        # 1 s.d. above the mean
abline(v=-1, col="red")       # 1 s.d. below the mean
abline(v=2, col="darkred")    # 2 s.d. above the mean
abline(v=-2, col="darkred")   # 2 s.d. below the mean
abline(v=0)                   # the mean
[Figure: dot chart "Obama Vote Z-scores"; x-axis: Z-score, -3 to 3, with vertical lines at 0, +/-1 and +/-2. Counties appear in the same order as in the "Support for Obama" chart, from San Francisco (top) to Modoc (bottom).]
Probability Values
 Z-scores far from zero (large in absolute value) are less likely to be observed than scores near zero.
 Consult a z-distribution table
 Probability area is given
 Can think about probabilities in the “tails”
 One-tail (upper or lower)
 Two-tail (upper + lower)
 R
R code
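# Two-tail probability: combined area in the upper and lower tails beyond |z|
twotailp <- 2*pnorm(-abs(zobama))
# One-tail probability: area in the single tail beyond |z|.
# Subtracting it from 1 gives the area below z (if z is positive)
# or the area above z (if z is negative).
onetailp <- pnorm(-abs(zobama))
# Combine into one table; data.frame keeps the numeric columns numeric
# (cbind with a character column would turn everything into text)
zp <- data.frame(county, onetailp, twotailp, zobama); zp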
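To see what the two quantities look like, the same expressions can be applied to a few round z-scores. A minimal sketch with illustrative values in place of zobama; the names z_demo and tails are hypothetical.

```r
# Illustrative z-scores (stand-ins for zobama).
z_demo <- c(-2, -1, 0, 1, 2)

tails <- data.frame(
  z        = z_demo,
  onetailp = pnorm(-abs(z_demo)),       # area in one tail beyond |z|
  twotailp = 2 * pnorm(-abs(z_demo))    # area in both tails beyond |z|
)
round(tails, 4)
```

The tail areas depend only on the size of z, not its sign, and the two-tail value at z = 0 is 1 because the two "tails" together cover the whole curve.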
Plots

4 plots on one page:
par(mfcol=c(2,2))
boxplot(obamapercent, ylab="Vote Percent", main="Obama Vote: Box Plot", col="blue")
hist(zobama, xlab="Obama Vote as Z-Scores", ylab="Frequency",
main="Histogram of Standardized Obama Vote", col="blue")
hist(obamapercent, ylab="Frequency", xlab="Vote Percent", main="Obama Vote: Histogram", col="blue")
plot(zobama, onetailp, ylab="One-Tail p", xlab="Z-score", main="Z-scores and p-values", col="blue")
[Figure: four plots on one page. "Obama Vote: Box Plot" (y-axis: Vote Percent); "Obama Vote: Histogram" (x-axis: Vote Percent, y-axis: Frequency); "Histogram of Standardized Obama Vote" (x-axis: Obama Vote as Z-Scores, y-axis: Frequency); "Z-scores and p-values" (x-axis: Z-score, y-axis: One-Tail p).]