CHAPTER 6
The Normal Probability Distribution
The normal probability distribution is the most widely used distribution in statistics, and many statistical
procedures are built around it. The central limit theorem is probably the main reason for the
importance of the normal distribution. It is essential for statistics students to learn how to use the
normal probability distribution to solve applied problems. In this chapter we are going to study the
normal probability distribution using the appropriate functions in JMP. We are also going to perform
simulations using a random function to generate a normally distributed random variable with a specified
mean and standard deviation, perform a statistical experiment to demonstrate the central limit theorem
numerically, and finally assess the normality of a given dataset.
Class Exercises: Compute probabilities for the normal distribution
Class example 1:
According to the National Health Survey, heights of adult males are normally distributed with a mean of
69” and a standard deviation of 2.8”. Compute the percentage of the population of adult males that falls
between 64” and 76”.
First, let’s open a new data table,
Figure 6.1
then, right click at the heading of “Column 1”
Figure 6.2
click on the text box for “Column Name” and change the name to “x”, as follows:
Figure 6.3
double-click the empty space to the right of the first column heading to create a new column
Figure 6.4
you can save the file as “Normal Dist” (or anything you like), then right click over “Column 2”, select
“Column Info” and change the name to “P(x)”, then click over “Column Properties” and select “Formula”,
as shown below,
Figure 6.5
a new window will open, then choose “Probability” from “Functions (grouped)” and select Normal
Distribution as shown below
Figure 6.6
then click twice over the variable “x”, click inside the parentheses and, after the variable x, type
“,69,2.8” as shown below:
Figure 6.7
click over “Apply”, you are going to see the following screen:
Figure 6.8
then click over “OK” on this window and in the next window. Next, we want to compute the cumulative
probability for x = 64 and x = 76, so let’s input these numbers in the first column as shown below:
Figure 6.9
the cumulative probabilities for these numbers are shown above. Thus, the probability that the height of
one person is between 64 and 76 is (rounding to three digits):
P(64<x<76) = 0.994 - 0.037 = 0.957
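The same cumulative probabilities can be cross-checked outside JMP. The following is only a minimal sketch in Python (not part of the JMP procedure), assuming the values entered in the formula above (mean 69, standard deviation 2.8); the variable names are ours:

```python
from statistics import NormalDist

# Heights of adult males, using the parameters typed into the JMP formula.
heights = NormalDist(mu=69, sigma=2.8)

p_below_64 = heights.cdf(64)  # cumulative probability P(x < 64)
p_below_76 = heights.cdf(76)  # cumulative probability P(x < 76)
p_between = p_below_76 - p_below_64

print(f"P(x < 64)      = {p_below_64:.3f}")
print(f"P(x < 76)      = {p_below_76:.3f}")
print(f"P(64 < x < 76) = {p_between:.3f}")
```

Rounded to three digits, these values should agree with the cumulative probabilities read from the JMP table.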
Class Exercise 2: We can also perform probability computations using a simulation. For example, let’s
generate 10,000 random numbers from a normal distribution with a mean of 69 and a standard deviation
of 2.8. To do this, let’s open a new data table as follows,
Figure 6.10
then, right click at the heading of “Column 1” and click over “Column Info”
Figure 6.11
click on the text box for “Column Name” and change the name to “x”, as follows:
Figure 6.12
then click over “Column Properties” and select “Formula”,
Figure 6.13
next, click over “Edit Formula” and select “Random”, then select “Random Normal”
Figure 6.14
click inside the parentheses, and input the numbers 69 and 2.8 separated by a comma as follows:
Figure 6.15
click over “OK” on this window and in the next window, then right click over the first column (below the
red arrow) and select “Add Rows…”, as below
Figure 6.16
type 10000 at the dialog box and click “OK”
Figure 6.17
at this point, you are going to see a sequence of randomly generated numbers from a normal
distribution,
Figure 6.18
you can draw a histogram using the “Analyze” menu and choosing the “Distribution” option (see
Chapter 3 for more details; this procedure is not shown here). You can check the shape of the
distribution and take a look at the summary statistics, which will be approximately equal to the requested
mean and standard deviation (this activity is highly recommended; please ask your lab instructor if you
do not know how to do it).
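For readers who want to see the same check outside JMP, here is a minimal sketch in Python. It assumes the parameters entered in the Random Normal formula (69 and 2.8); the seed value is arbitrary and only makes the run repeatable:

```python
import random
from statistics import mean, stdev

rng = random.Random(2024)  # arbitrary fixed seed so the run is repeatable

# 10,000 draws from a normal distribution with mean 69 and sd 2.8
sample = [rng.gauss(69, 2.8) for _ in range(10_000)]

sample_mean = mean(sample)
sample_sd = stdev(sample)

print(f"sample mean = {sample_mean:.3f} (requested 69)")
print(f"sample sd   = {sample_sd:.3f} (requested 2.8)")
```

As in JMP, the summary statistics will not match the requested parameters exactly, and they change with every new set of random numbers.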
Next, you need to sort the numbers from lowest to highest, by selecting “Tables” and “Sort”, then
choose the variable “x” and click over “By”, you will see the next window
Figure 6.19
click over OK, and you are going to see the sequence of random numbers ordered from lowest to
highest as follows:
Figure 6.20
Computing the simulated probabilities is just a matter of counting the number of observations that
match the requirements for this problem. First, you need to count the number of observations that are
less than 64”. You can do it by scanning the ordered dataset and looking at the index number on the
left side of the screen,
Figure 6.21
we can see in the figure above that there are 349 observations less than 64”, so this probability is
computed as follows:
P(x<64) = 349/10,000 = 0.0349
This is close to the probability of 0.0370 computed using the normal distribution formula (see Figure 6.9).
Please do not forget that this is a numeric simulation and the results shown here are
approximations to the true probabilities, but this result is close enough.
Next, we need to find the probability that a man selected at random has a height less than 76”, to do it
we need to count the number of observations that are less than 76 as shown below:
Figure 6.22
we found 9,940 observations that are less than 76”, thus the probability associated with that event is
computed as follows:
P(x<76) = 9,940/10,000 = 0.994
The probability that one man selected at random is between 64” and 76” is then:
P(64<x<76) = 0.994 - 0.035 = 0.959
which is very close to the probability computed using the formulas (0.957). As you can see here, the
simulation provided acceptable results!
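The sort-and-count procedure used above can be sketched in a few lines of Python. This is an illustration under the same assumptions (mean 69, standard deviation 2.8, 10,000 observations), not the JMP procedure itself; the seed is arbitrary, so the counts will differ from the 349 and 9,940 found in the figures:

```python
import random
from bisect import bisect_left

rng = random.Random(2024)  # arbitrary fixed seed so the run is repeatable

# Generate the heights and sort them from lowest to highest, as with Tables > Sort.
heights = sorted(rng.gauss(69, 2.8) for _ in range(10_000))

# In a sorted list, the position where 64 would be inserted equals
# the number of observations strictly less than 64 (same idea for 76).
count_below_64 = bisect_left(heights, 64)
count_below_76 = bisect_left(heights, 76)

p_below_64 = count_below_64 / len(heights)
p_below_76 = count_below_76 / len(heights)
p_between = p_below_76 - p_below_64

print(f"P(x < 64)      = {count_below_64}/10,000 = {p_below_64:.4f}")
print(f"P(x < 76)      = {count_below_76}/10,000 = {p_below_76:.4f}")
print(f"P(64 < x < 76) = {p_between:.4f}")
```

Reading the insertion position plays the same role as looking at the row index on the left side of the sorted JMP table.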
Class Exercise: The Central Limit Theorem
Please go to the website:
http://onlinestatbook.com/rvls.html
or search in your browser “Rice virtual labs”
1) Select “Simulations and Demonstrations”, and select “Sampling Distribution Simulation”
2) Select a normal distribution and choose a small sample size, then you can take 50,000 samples
(or more) and look at the graph for the sampling distribution of the mean
3) Select a skewed distribution and choose a small sample size (n = 2 or 5), repeat the same
procedure and see what happens.
4) Select a skewed distribution and choose the largest sample size available (n=25) and generate
again the sampling distribution of the means
5) What are your conclusions? Did you notice any difference among the previous simulations? How
can you relate your findings to the theory studied in class? Please remember the requirements
for the application of the central limit theorem
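The experiment in steps 3 and 4 can also be imitated in code. The sketch below is a rough stand-in for the web simulation, not a copy of it: we assume an exponential distribution as the skewed population (the website offers its own skewed shape), use 20,000 samples per sample size, and measure how skewed the sample means remain with a simple moment-based skewness helper of our own:

```python
import random
from math import sqrt
from statistics import fmean

rng = random.Random(7)  # arbitrary fixed seed so the run is repeatable


def skewness(values):
    """Moment-based skewness: about 0 for a symmetric, bell-shaped sample."""
    m = fmean(values)
    s = sqrt(fmean([(v - m) ** 2 for v in values]))
    return fmean([((v - m) / s) ** 3 for v in values])


def sample_means(n, repetitions=20_000):
    """Draw `repetitions` samples of size n from a skewed (exponential)
    population and return the mean of each sample."""
    return [
        fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repetitions)
    ]


# Skewness of the sampling distribution of the mean for each sample size
skew_by_n = {n: skewness(sample_means(n)) for n in (2, 5, 25)}

for n, skew in skew_by_n.items():
    print(f"n = {n:2d}: skewness of the sample means = {skew:.2f}")
```

The skewness should shrink toward zero as n grows, which is the numerical counterpart of the histograms becoming more bell shaped on the website.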
Now, let’s do a simulation using JMP. We are going to generate an integer uniform distribution using the
numbers 1 to 10 and we are going to obtain samples from this distribution.
First, let’s open a new data table:
Figure 6.23
then, right click over the heading of “Column 1” and select “Column Info”,
Figure 6.24
choose “Formula” from “Column Properties” and select “Edit Formula”
Figure 6.25
select “Random” and “Random Integer” as follows,
Figure 6.26
type 1 inside the red box, and hit enter, type “,” and 10, you should see the following window
Figure 6.27
then hit enter, click “OK” on this window and click “OK” again in the next window. You are not going to
see any changes in the data window, as we still have to add some rows. To do this, right click over the
cell below the red triangle and select “Add Rows…” as follows
Figure 6.28
type 200 inside the box
Figure 6.29
you can see randomly generated numbers from 1 to 10,
Figure 6.30
then left click twice over the space to the right of “Column 1” and keep doing that until you generate 4
new columns as follows
Figure 6.31
next, right click over the heading of “Column 1” and select, “Copy Column Properties”
Figure 6.32
then go over the heading of each new column and right click over the heading and select “Paste Column
Properties”, repeat this procedure for each column
Figure 6.33
you are going to see 5 columns with integer random numbers ranging from 1 to 10
Figure 6.34
now, let’s compute the mean for each row, and put these results in column 6. Let’s generate a new
column by double clicking on the space to the right of the heading of Column 5. Then, right click over the
heading of the new column and, as we have done before, select “Column Info”, then select “Formula”
from “Column Properties” and click over “Edit Formula” (as in Figures 6.1 to 6.4), then choose
“Statistical” from “Functions” and select “Mean” from the menu as follows,
Figure 6.35
then, click inside the parentheses and click twice over “Column 1” under “Table Columns”, type a comma
and click over “Column 2” and so on, until you have added every column up to “Column 5”. Your formula
should look like this:
Figure 6.36
click over “OK” on this window and in the next window. Now you can see the mean computed for every
row. The interesting thing about the new column is that its 200 values are sample means (n = 5) taken
from a uniform probability distribution of integers ranging from 1 to 10, so together they approximate
the sampling distribution of the means.
It will be interesting to take a look at the properties of the sampling distribution of the means that we
got in column 6. With that purpose in mind, let’s choose the “Analyze” menu and select “Distribution”,
then click over “Column 6” and next, click over “Y, Columns”, and click over “OK”. You are going to
obtain a histogram for the sampling distribution of the means. You can see a bell shaped distribution
with a mean of 5.618 and a standard deviation of 1.278534 (results may vary). You can get a horizontal
layout by choosing this option from the “Display Options” located under the second red triangle. Notice
that the mean of your sampling distribution approximates the mean of the uniform distribution of the
integers (the mean is 5.5).
Figure 6.37
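The whole experiment (5 columns of random integers from 1 to 10, 200 rows, and a column of row means) can be sketched in Python as follows. This is only an illustration of the same idea, with an arbitrary seed; the comparison values come from the standard formulas for a uniform distribution of integers and for the standard deviation of a sample mean, which are not derived in this chapter:

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(11)  # arbitrary fixed seed so the run is repeatable

# 200 rows by 5 columns of random integers from 1 to 10 (both ends included)
rows = [[rng.randint(1, 10) for _ in range(5)] for _ in range(200)]

# "Column 6": the mean of each row
row_means = [fmean(row) for row in rows]

# Reference values for the population of integers 1..10
population_mean = fmean(range(1, 11))            # 5.5
population_sd = sqrt((10 ** 2 - 1) / 12)         # sd of the integers 1..10
expected_sd_of_means = population_sd / sqrt(5)   # sd of a mean of 5 values

print(f"mean of the row means = {fmean(row_means):.3f} (population mean {population_mean})")
print(f"sd of the row means   = {stdev(row_means):.3f} (expected about {expected_sd_of_means:.3f})")
```

As in the JMP output, the exact numbers vary from run to run, but the mean of the row means should stay near 5.5.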
Also, you should observe that the sampling distribution of the means approximates a normal
distribution even though the original population is uniform with integers ranging from 1 to 10 and we used
a small sample size. The next step is to check your sampling distribution of the means for normality.
Class Exercise: Assessing normality
Using results from the previous exercise we will assess normality of the sampling distribution of the
means located on “Column 6”. Let’s proceed as follows:
click over the lower right triangle on the window shown in Figure 6.37 and select “Continuous Fit”, then
select “Normal”
Figure 6.38
This option overlaps a normal shape over the histogram as shown below, but probably this is not
enough to assess normality,
Figure 6.39
then click over the lower right triangle again, and choose the option “Normal Quantile Plot”
Figure 6.40
At this point, you can see a Q-Q plot (normal quantile plot) for the data in “Column 6” as shown below
Figure 6.41
we can see that the Q-Q plot follows a straight line pattern (more or less) and the dots are located
within the curves drawn with red dots. There is no obvious pattern of departure from the line on the
Q-Q plot, therefore we can accept normality of the sampling distribution of the means as predicted by
the central limit theorem (even though in this case the sample size was small).
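To make the idea behind the Q-Q plot concrete, the sketch below pairs each sorted row mean with the quantile a standard normal distribution would place at the same position, then measures how close the pairs are to a straight line. It is a simplified illustration: the plotting positions (i + 0.5)/n and the correlation summary are our own choices, the data are regenerated with an arbitrary seed, and the confidence curves that JMP draws are not reproduced:

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

rng = random.Random(11)  # arbitrary fixed seed so the run is repeatable

# Rebuild 200 row means of 5 random integers from 1 to 10, sorted low to high
row_means = sorted(
    fmean([rng.randint(1, 10) for _ in range(5)]) for _ in range(200)
)

# Theoretical standard normal quantile for each ordered observation
n = len(row_means)
std_normal = NormalDist()
theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]


def pearson(xs, ys):
    """Pearson correlation; values near 1 mean the points follow a line."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


qq_correlation = pearson(theoretical, row_means)
print(f"correlation between normal quantiles and sorted means = {qq_correlation:.4f}")
```

A correlation close to 1 corresponds to the straight line pattern seen in the Q-Q plot; it is a rough numeric summary, not a formal test of normality.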
Class Exercises:
1- Probability functions: Assume that women’s heights are normally distributed with a mean of
63.6” and a standard deviation of 2.5”. Then answer the following questions using the function
“Normal Distribution” as in class example 1 (shown at the beginning of this Chapter).
a. Find the probability that a woman selected at random is between the heights of 60” and
66”.
b. Find the probability that a woman selected at random is taller than 69”
2- Simulations: Solve the previous problems using a simulation (Generate a sequence of 10,000
normally distributed random numbers). Compare the simulated results with the computed
probabilities from problem 1.
3- Central Limit Theorem: Generate 4 columns with 250 numbers in each column, using a random
normal distribution with a mean of 63.6” and a standard deviation of 2.5”
a. Compute the mean for each row in a fifth column
b. Analyze the sampling distribution of the means in the fifth column, obtain summary
statistics, describe the shape of the distribution and make comments
c. Compare the population mean with the mean of the sample means in the fifth column. Are
they similar?
d. Compare the standard deviation of the fifth column with the standard deviation of the
population. How are they related? (Hint: take a look at the CLT)
e. Discuss your findings with your classmates
Team Assignment: Assessing Normality
Use your random sample that you obtained from the file “Small Town.xls” and do the following:
1- Assess normality using a Q-Q plot (Normal Probability Plot) for all numeric variables
2- Write a report showing your findings:
a. Show a histogram for each continuous variable
b. Show a Q-Q plot (normal probability plot) for each numeric continuous variable
c. Based on the previous graphs discuss if normality is acceptable for each variable, write
briefly the reasons that support your conclusion
d. Explore transformations for those variables for which normality was not acceptable, that is:
apply a mathematical function such as the logarithmic function or the square root to
transform every value, and discuss if the results are different (better) than before
e. Summarize your findings on a table, showing which variables can be considered
normally distributed and which variables can’t be considered normally distributed,
specify if a transformation was applied to achieve normality
3- Choose a variable that is normally distributed and compute its mean and standard deviation.
Simulate a normal random variable with these parameters, and find the probability that one
observation is between 1.5 standard deviations below the mean and 1.2 standard deviations
above the mean. Compare the result obtained by simulation with the probability for a standard
normal distribution, P(-1.5 < z < 1.2)
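The transformation idea in item 2d can be illustrated with made-up data. The sketch below does not use the “Small Town.xls” file: the right-skewed variable is hypothetical (generated from a lognormal distribution with arbitrary parameters), and the skewness helper is our own. It only shows how a logarithmic transformation can pull a skewed variable toward a symmetric shape:

```python
import random
from math import log, sqrt
from statistics import fmean

rng = random.Random(3)  # arbitrary fixed seed so the run is repeatable


def skewness(values):
    """Moment-based skewness: about 0 for a symmetric, bell-shaped sample."""
    m = fmean(values)
    s = sqrt(fmean([(v - m) ** 2 for v in values]))
    return fmean([((v - m) / s) ** 3 for v in values])


# Hypothetical right-skewed variable (NOT taken from the assignment file)
raw_values = [rng.lognormvariate(10, 0.75) for _ in range(500)]

# Apply the logarithmic function to every value
log_values = [log(v) for v in raw_values]

skew_raw = skewness(raw_values)
skew_log = skewness(log_values)

print(f"skewness before the transformation = {skew_raw:.2f}")
print(f"skewness after the log transform   = {skew_log:.2f}")
```

With real data the improvement is not guaranteed, so the histogram and Q-Q plot of the transformed variable should still be checked as described in items 2a to 2c.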