Download Z - UC Davis

Fun With Numbers

- Z-scores
- R code (wrt hw. 4)

Statistical Inference

- The normal distribution is our first choice in most cases because it has nice properties:
  - The distribution is symmetrical around the mean.
  - A known percentage of cases is associated with each standard deviation from the mean.
  - We can identify the probability of values under the curve.
  - A linear combination of independent normally distributed variables is itself normally distributed.
- Central limit theorem
- Great flexibility in using the normal distribution

Normal Distribution

- Normal distribution and areas under it.

68-95-99.7 Percent Rule

- In a normal distribution, about 68 percent of the observations will fall within about +/- 1 standard deviation...

A Picture: Area (with some added stuff)

[Figure: area under the normal curve. Source: http://members.aol.com/svennord/ed/normal.htm]

Another Picture

[Figure: second picture of the normal curve.]

What do we know?

- Area is useful to determine probabilities.

Fun with Numbers

- Gas prices (let's take a side trip)
- What are some research issues when looking at financial data over time?
- Inflation!
- 2007 dollars vs. 1990 dollars
- CPI adjustment: 1990 price in 2007 dollars = 1990 price * (2007 CPI / 1990 CPI)

Visualizing Data is FUNdamental

[Figure: three time-series panels of regular gas prices by year (monthly measurements, 1990 onward): "Non-CPI Adjusted Gas Prices" (unadjusted), "CPI Adjusted Gas Prices" (CPI-adjusted), and "CPI Adjusted Gas Prices" (CPI-adjusted, w/o 05/06).]

Histograms

[Figure: two density histograms: "Histogram of Gas Prices" (x-axis: Price of Gas (unadjusted)) and "Histogram of Gas Prices (CPI-Adjusted)" (x-axis: Price of Gas (CPI-adjusted)).]

Using z-scores

- Taking advantage of the normal distribution
- Area under the normal curve is probability area.
- Probabilities must sum to 1.
- The full density under the normal curve is 1.
- Since it is symmetric, we know the probability of "being above" the mean is .50 (ditto for below).

Standard Normal Distribution

- Z ~ N(0,1)
- Easy to compute: $z = \frac{X - \bar{X}}{s}$, where $\bar{X}$ is the mean and $s$ is the standard deviation.
- When X = mean, z = 0.
- Metric of the z-score: standard deviations from the mean.
- Thus, if z = 1, X is 1 s.d. above the mean.
- NOW, since we know the 68-95-99.7 Rule, we can identify probabilities.

Getting Gas

- Let's look at the adjusted gas prices.
- Means (standard deviations in parentheses):

  Year   Mean   (s.d.)
  2006   2.57   (.30)
  2005   2.34   (.32)
  2004   1.98   (.15)
  2003   1.71   (.09)
  2002   1.51   (.13)
  2001   1.62   (.20)
  2000   1.74   (.11)
  1999   1.37   (.15)
  1998   1.27   (.04)
  1997   1.51   (.04)
  1996   1.54   (.08)
  1995   1.47   (.06)
  1994   1.46   (.07)
  1993   1.49   (.03)
  1992   1.56   (.07)
  1991   1.62   (.05)
  1990   2.00   (.07)   [small n]

- (Anything interesting here?)

Compute a z-score

- Mean adjusted price: 1.68 (.37)
- To derive the z-score for any year, substitute a value X into the z-score formula.
- Suppose X = 1.68? z = (1.68 - 1.68)/.37 = 0
- The mean is normalized to 0.
- 1 s.d. above the mean? 1.68 + .37 = 2.05
- z = (2.05 - 1.68)/.37 = 1
- The metric of z is standard deviations.
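The two worked z-scores above, and the 68-95-99.7 rule they rely on, can be checked in a few lines of R. This is a minimal sketch, not code from the slides: the names `xbar`, `s`, `z_score`, and `within_k` are illustrative, and the mean and standard deviation are the rounded values quoted on the slide.

```r
# Rounded summary statistics quoted on the slide (CPI-adjusted gas price).
xbar <- 1.68   # mean
s    <- 0.37   # standard deviation

# Illustrative helper: number of standard deviations between x and the mean.
z_score <- function(x, mean, sd) (x - mean) / sd

z_at_mean <- z_score(1.68, xbar, s)       # price equal to the mean
z_one_sd  <- z_score(xbar + s, xbar, s)   # price one s.d. above the mean

# Area under the standard normal curve within k standard deviations of 0.
within_k <- function(k) pnorm(k) - pnorm(-k)
rule <- sapply(1:3, within_k)

print(c(z_at_mean = z_at_mean, z_one_sd = z_one_sd))
print(round(rule, 4))
```

The three areas come out close to .68, .95, and .997, which is where the rule gets its name.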
z (X  X )  “Standardizing” X allows us to use “z distribution.” The Most “Average” Price z Week Year |--------------------------------------| | 1.680374 -.009361 Feb 12 2001 | | 1.681257 -.0069663 Nov 03 2003 | | 1.681329 -.0067707 Apr 24 2000 | | 1.682352 -.0039966 Aug 04 2003 | | 1.683292 -.001449 Jun 03 1991 | | | | 1.684771 .0025612 Feb 04 1991 | | 1.68625 .0065716 May 27 1991 | | 1.688924 .0138213 Oct 27 2003 | | 1.689519 .0154355 Apr 17 2000 | | 1.69062 .0184197 Sep 24 2001 | |--------------------------------------| The 10 Most “Above Average” The 10 Most “Below Average” Price Price Z Week Year |--------------------------------------| | 1.096723 -1.59183 Feb 22 1999 | | 1.103978 -1.572159 Mar 01 1999 | | 1.111233 -1.552488 Feb 15 1999 | | 1.113652 -1.545931 Mar 08 1999 | | 1.120907 -1.52626 Feb 08 1999 | |--------------------------------------| | 1.123325 -1.519703 Feb 01 1999 | | 1.13058 -1.500032 Jan 04 1999 | | 1.131789 -1.496754 Jan 25 1999 | | 1.137835 -1.480361 Jan 11 1999 | | 1.141463 -1.470526 Jan 18 1999 | |--------------------------------------| Z Week Year |-------------------------------------| | 2.947 3.424879 May 15 2006 | | 2.973 3.495373 Jul 10 2006 | | 2.989 3.538755 Jul 17 2006 | | 3 3.56858 Aug 14 2006 | |-------------------------------------| | 3.003 3.576713 Jul 24 2006 | | 3.004 3.579425 Jul 31 2006 | | 3.021628 3.62722 Oct 03 2005 | | 3.038 3.67161 Aug 07 2006 | | 3.049491 3.702766 Sep 12 2005 | | 3.167136 4.021741 Sep 05 2005 | |-------------------------------------| 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 1990 -2 2006 2007 0 1 2 3 4 2005 0 -2 0 2 4 -2 0 2 4 -2 0 2 4 Z-Score for CPI-Adjusted Gas Price Graphs by year 2 4 -2 0 2 4 Finding Probabilities  What is the probability of a Z gas price of 2.50 or higher?  The z-score is 2.22. 
- In the z-distribution, if gas prices were truly normally distributed, a score this high or higher has a probability of occurring of .013, or about 1.3%. It's an unlikely event.
- How is it computed? 1 - .9868 gives the area above (consult a standard normal table).

Finding Probabilities

- What is the probability of a gas price z-score being between -1.75 and 1.75?
- The upper tail is .04; the lower tail is .04: P(above) = .04; P(below) = .04
- Therefore, P(in between) = 1 - .08 = .92
- Any probability calculation is this straightforward.

Issues

- The "gas price" example is pedagogical.
- Serious analysis of gas-pricing effects would require much more sophisticated statistical techniques.
- z is useful for comparing observations from different historical eras or across disparate cases.

Hands-on examples in R

Plots and Z-scores

- How to do some of the "stuff" in HW 4
- Multiple plots on a single page
- Creating z-scores and finding p-values
- Visualizing political data
- Data: Obama vote share by county

Dot Chart: Obama Vote

    # labels expects a vector of county names; here it is the object the slides call row.names
    dotchart(obamapercent, labels=row.names, cex=.7, xlim=c(0, 100),
             main="Support for Obama", xlab="Percent Obama")
    abline(v=50)  # vertical reference line at 50 percent

Returns:

[Figure: dot chart "Support for Obama"; x-axis: Percent Obama, 0 to 100. County labels, in extracted order: San Francisco, Alameda, Marin, Santa Cruz, Sonoma, San Mateo, Mendocino, Santa Clara, Los Angeles, Contra Costa, Monterey, Yolo, Napa, Solano, Humboldt, Alpine, Imperial, Santa Barbara, San Benito, Lake, Sacramento, Mono, Ventura, San Joaquin, San Diego, Merced, San Luis Obispo, Nevada, San Bernardino, Trinity, Riverside, Fresno, Stanislaus, Butte, Orange, Del Norte, Placer, El Dorado, Inyo, Siskiyou, Plumas, Tuolumne, Mariposa, Madera, Amador, Kings, Calaveras, Tulare, Sutter, Yuba, Kern, Colusa, Sierra, Glenn, Tehama, Shasta, Lassen, Modoc.]

Interpretation?

- Geographical patterns?
- Central Valley
- Coastal
- SoCal, NorCal?
- Why might you observe these patterns?
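The tail areas quoted in the gas price "Finding Probabilities" slides can be reproduced with `pnorm()` instead of a printed table. This is a minimal sketch, assuming the rounded mean (1.68) and standard deviation (.37) from the slides and, as the slides caution, treating prices as normally distributed for illustration only. The variable names are illustrative and do not come from the slides.

```r
xbar <- 1.68   # rounded mean from the slides
s    <- 0.37   # rounded standard deviation from the slides

# Upper tail: P(price >= 2.50) under a normal approximation.
z <- (2.50 - xbar) / s
p_upper <- pnorm(z, lower.tail = FALSE)

# Same calculation with z rounded to two decimals, as when reading a table.
p_upper_table <- 1 - pnorm(round(z, 2))

# Central area: P(-1.75 <= Z <= 1.75) = 1 - (upper tail + lower tail).
p_tail    <- pnorm(-1.75)
p_between <- 1 - 2 * p_tail

print(round(c(z = z, p_upper = p_upper, p_upper_table = p_upper_table), 4))
print(round(c(p_tail = p_tail, p_between = p_between), 4))
```

Using `lower.tail = FALSE` avoids the subtraction from 1 and is the usual way to ask R for an upper-tail area directly.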
Z-scores

- NB: we're doing this for learning purposes

Z-scores

- Easy: create the mean and standard deviation
- Then derive the z-score using the formula from the last slide set: $z = \frac{X - \bar{X}}{s}$
- R code on next slide

Z-scores and R

    # Z-scores for Obama vote share
    meanobama <- mean(obamapercent)
    sdobama <- sd(obamapercent)
    zobama <- (obamapercent - meanobama)/sdobama

Interpretation

- Z-scores are in the metric of standard deviations.
- A large |z| implies the observation is farther from the mean than observations with a small |z|.
- z = 0 means the observation is exactly at the mean.

Dotchart (code):

    par(mfcol=c(1,1))
    # Title and axis label describe what is plotted: z-scores, not p-values
    dotchart(zobama, labels=row.names, cex=.7, xlim=c(-3, 3),
             main="Obama Vote Z-scores", xlab="Z-score")
    abline(v=1, col="red")       # +/- 1 standard deviation
    abline(v=-1, col="red")
    abline(v=2, col="darkred")   # +/- 2 standard deviations
    abline(v=-2, col="darkred")
    abline(v=0)                  # the mean

[Figure: dot chart "Obama Vote Z-scores"; x-axis: Z-score, -3 to 3; same county labels and order as the "Support for Obama" dot chart.]

Probability Values

- Z-scores that are large in absolute value are probabilistically less likely to be observed than smaller scores.
- Consult a z-distribution table
- Probability area is given
- Can think about probabilities in the "tails"
- One-tail (upper or lower)
- Two-tail (upper + lower)

R code

    twotailp <- 2*pnorm(-abs(zobama))
    # Gives us the area in the upper and lower tails of z
    onetailp <- pnorm(-abs(zobama))
    # Gives us the 1-tail probability area; if we subtract this from 1,
    # this gives us the area below this z-score (if z is positive) or
    # the area above this z-score (if z is negative)
    # data.frame keeps the numeric columns numeric (cbind with county names would coerce them to text)
    zp <- data.frame(county, onetailp, twotailp, zobama); zp

Plots

- 4 plots on one page:

    par(mfcol=c(2,2))  # 2 x 2 grid, filled column by column
    boxplot(obamapercent, ylab="Vote Percent", main="Obama Vote: Box Plot", col="blue")
    hist(zobama, xlab="Obama Vote as Z-Scores", ylab="Frequency",
         main="Histogram of Standardized Obama Vote", col="blue")
    hist(obamapercent, ylab="Frequency", xlab="Vote Percent",
         main="Obama Vote: Histogram", col="blue")
    plot(zobama, onetailp, ylab="One-Tail p", xlab="Z-score",
         main="Z-scores and p-values", col="blue")

[Figure: four panels on one page: "Obama Vote: Box Plot" (y-axis: Vote Percent), "Histogram of Standardized Obama Vote" (x-axis: Obama Vote as Z-Scores; y-axis: Frequency), "Obama Vote: Histogram" (x-axis: Vote Percent; y-axis: Frequency), and "Z-scores and p-values" (x-axis: Z-score; y-axis: One-Tail p).]
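For readers without the county data at hand, the z-score and p-value steps above can be tried on a small made-up vector. This is a sketch only: the five vote shares and the county labels "A" to "E" are invented for illustration and are not the Obama data used in the slides. It also compares the hand-computed z-scores with base R's `scale()`, which by default centers on the mean and divides by the sample standard deviation.

```r
# Hypothetical vote shares (percent); NOT the real county data.
county  <- c("A", "B", "C", "D", "E")
votepct <- c(84, 61, 50, 42, 33)

# Z-scores by hand, following the same steps as the slides.
z <- (votepct - mean(votepct)) / sd(votepct)

# scale() returns a one-column matrix; as.vector() drops its attributes.
z_scale <- as.vector(scale(votepct))

# One-tail and two-tail areas beyond each |z|.
onetailp <- pnorm(-abs(z))
twotailp <- 2 * onetailp

# Collect the results; a data frame keeps numeric columns numeric.
zp <- data.frame(county, votepct,
                 z = round(z, 3),
                 onetailp = round(onetailp, 3),
                 twotailp = round(twotailp, 3))
print(zp)
```

Standardized values always have mean 0 and standard deviation 1, which is a quick sanity check to run on `zobama` as well.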
