GV207 – Political Analysis, Week 04
Department of Government, University of Essex

Probabilities and the Normal Distribution

Importance of the normal distribution

Many variables in the real world are approximately normally distributed: height, shoe size, the length of tree leaves, IQ, and also marks in class tests.

In a normal distribution, mean = median = mode. Most importantly, we know the properties of the normal distribution: about 68% of the area under the curve lies within one standard deviation to the left and right of the mean, around 95% within two standard deviations,[1] and about 99.7% within three standard deviations.

A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution. Every normal distribution can be transformed into a standard normal distribution using the z-transformation.

The z-transformation[2]

When we have a sample, we can use the mean ($\bar{x}$) and the standard deviation ($s$) of a normally distributed variable ($X$) to calculate the z-score for a particular observation's value $x$:

$$z = \frac{x - \bar{x}}{s}$$

The z-score indicates by how many standard deviations the value deviates from the mean. Z-scores enable us to calculate the percentile of a specific observation because all z-scores and corresponding percentiles are already tabulated.

Starting from the z-score for an observation, plus the mean and standard deviation of our variable, we can rearrange the equation to figure out the value of our observation:

$$x = \bar{x} + z \cdot s$$

[1] The precise number is actually about 1.96, but we usually use 2 for the sake of simplicity.
[2] Not to be confused with the z-transformation that comes up on Wikipedia. What we mean here is the transformation of an observation's value so that we get a z-score.

How to compute z-scores in Stata

We can generate our z-scores in Stata using two commands, one of which we've already seen: summarize and generate.
First we need to find out the mean and the standard deviation of the variable we want to compute z-scores for. To do so we simply use:

    summarize varname

Using the mean and the standard deviation, we can generate our new variable (z_varname) by using the equation for the z-score as given above. Assume that the mean of our variable is 70 and the standard deviation is 20.

    generate z_varname = (varname - 70)/20

This generates a new variable z_varname whose values are the z-scores corresponding to the values of varname.

Calculating the area under the curve above or below an observation

In the lecture you were told how to find out what proportion of observations lie above or below a particular value of a variable.[3] In this case, you consulted a table (from a statistics textbook) that told you the proportion of observations that corresponds to the z-score. While this is a perfectly valid approach, it becomes time-consuming if we want to calculate this for the value of every observation on our variable. Thankfully, Stata allows us to do this much more quickly.

The function normal() gives us the value of the cumulative standard normal distribution,[4] i.e. what proportion of observations lie below a particular value. Thus we just combine this function with generate to create a new variable called below_varname that tells us what proportion of observations are below the value of each observation on our variable:

    generate below_varname = normal(z_varname)

If instead we want to know the proportion of observations above the value of a particular observation, we don't have to change much. If we know the proportion below ($P_{\text{below}}$), then simply:

$$P_{\text{above}} = 1 - P_{\text{below}}$$

Or, if we want to create a new variable called above_varname that provides us with the proportion of observations above a particular observation's value, we simply do:

    generate above_varname = 1 - normal(z_varname)

[3] Proportions here are equivalent to percentages. If the proportion of observations below a particular value is 0.2, this is equivalent to saying 20% of the observations are below this particular value.
[4] As noted above, the standard normal distribution is a normal distribution with mean 0 and standard deviation 1. This is why the z-transformation is sometimes called "standardisation".

Stata exercise

As always, we will use the data set Democracy small.dta.

1. Find a continuous variable in the data set.
2. Check to see if it is normally distributed using the commands shown last week (histogram and kdensity). Does it look normally distributed?
3. If it isn't, find another continuous variable and do the same checks. If you still don't have a normally distributed variable, just continue on to the next part with this variable.
4. Calculate the mean and standard deviation of this variable using the summarize command.
5. Using the list command and the appropriate if conditions, list the values of this variable as well as the names of the countries in the data set.
6. Find the value of this variable for your home country and calculate (on paper or in your head) the corresponding z-score. How many standard deviations is it above or below the mean?
7. Using the commands outlined above, create a new variable that is the z-score associated with the variable you chose. Make sure to give the variable a meaningful name. You can also label the variable for more detail.
8. Using your new variable, generate a new variable that is the proportion of observations below the associated z-score.
9. Use the list command again with the appropriate if conditions to find out what proportion of observations lie below your home country's value for this variable.
10. Knowing this, what proportion of countries have a value higher than that of your home country?
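The commands from this handout can be combined into one short do-file. The following is only a sketch: because the variable names in Democracy small.dta are not listed here, it uses the auto data set that ships with Stata and its variables price and make as stand-ins (both are assumptions for illustration). Instead of typing the mean and standard deviation by hand, it reads them from r(mean) and r(sd), which summarize stores after it runs.

```stata
* Sketch only. Assumption: Stata's bundled auto data set and its variable
* price stand in for Democracy small.dta and the variable you chose.
sysuse auto, clear

* summarize stores the mean in r(mean) and the standard deviation in r(sd)
summarize price

* z-score: (value - mean) / standard deviation
generate z_price = (price - r(mean)) / r(sd)
label variable z_price "z-score of price"

* Proportion of observations below and above each value under a normal curve
generate below_price = normal(z_price)
generate above_price = 1 - below_price

* Inspect the first five observations
list make price z_price below_price above_price in 1/5
```

To apply this to the exercise, replace price with the variable you chose and make with the variable that holds the country names. The proportions returned by normal() are only a good approximation when the variable is roughly normally distributed.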