Julian Archer
To: Dr. Findsen
STAT 511 – Section 3
Project #1

Question #1(a)

Experimental Procedure:
Step #1: Use “roll-dice-online.com” to generate 100 random paired dice rolls using two 6-sided dice.
Step #2: Transfer the 100 observations for the paired dice rolls into Excel.
Step #3: Compute the combined total for each of the paired dice rolls (i.e. for each row of data).
Step #4: Count the number of occurrences where the total in Step #3 was “12”. This value was found to be 2.
Step #5: Compute the probability of getting a total of 12 by dividing the value found in Step #4 by 100. The answer is 0.02.
Step #6: Compute the absolute difference for each of the paired dice rolls (i.e. for each row of data).
Step #7: Count the number of occurrences where the absolute difference in Step #6 was “4”. This value was found to be 11.
Step #8: Compute the probability of getting an absolute difference of 4 by dividing the value found in Step #7 by 100. The answer is 0.11.
Step #9: Compute the probability of getting a total of 12 or an absolute difference of 4. The two events cannot occur together (a total of 12 requires the roll (6, 6), whose difference is 0), so this is simply the sum of the values found in Step #5 and Step #8. The answer was found to be 0.13.

Theoretical Calculation:
Number of ways to get a total of 12 after rolling two 6-sided dice: $N_X = 1$
Total number of possible outcomes when rolling two 6-sided dice: $N = 36$
Probability of getting a total of 12: $P(X) = N_X / N = 1/36 = 0.0278$
Number of ways to get an absolute difference of 4 after rolling two 6-sided dice: $N_D = 4$
Total number of possible outcomes when rolling two 6-sided dice: $N = 36$
Probability of getting an absolute difference of 4: $P(D) = N_D / N = 4/36 = 0.1111$
Probability of getting a total of 12 or an absolute difference of 4: $P(X \cup D) = P(X) + P(D) = 0.0278 + 0.1111 = 0.1389$

Answer Comparison:
In the experiment, the probability of getting a total of 12 or an absolute difference of 4 is slightly smaller than in theory, i.e.
0.13 versus 0.1389.

Question #1(b)

Experimental Procedure:
Step #1: Use “roll-dice-online.com” to generate 100 random paired dice rolls using two 6-sided dice.
Step #2: Transfer the 100 observations for the paired dice rolls into Excel.
Step #3: Using a logical test on each paired dice roll (i.e. each row of data), count the number of occurrences where at least one six occurred. This value was found to be 26.
Step #4: Compute the probability of at least one six by dividing the value found in Step #3 by 100. The answer is 0.26.
Step #5: Compute the absolute difference for each of the paired dice rolls (i.e. for each row of data).
Step #6: Count the number of occurrences where the absolute difference in Step #5 was “4”. This value was found to be 13.
Step #7: Compute the probability of getting an absolute difference of 4 by dividing the value found in Step #6 by 100. The answer is 0.13.
Step #8: Compute the probability of getting at least one six or an absolute difference of 4 by adding the values found in Step #4 and Step #7. The answer was found to be 0.39.

Theoretical Calculation:
Number of ways to get at least one 6 after rolling two 6-sided dice: $N_X = 11$
Total number of possible outcomes when rolling two 6-sided dice: $N = 36$
Probability of getting at least one 6: $P(X) = N_X / N = 11/36 = 0.3056$
Number of ways to get an absolute difference of 4 after rolling two 6-sided dice: $N_D = 4$
Total number of possible outcomes when rolling two 6-sided dice: $N = 36$
Probability of getting an absolute difference of 4: $P(D) = N_D / N = 4/36 = 0.1111$
Probability of getting at least one 6 or an absolute difference of 4: $P(X \cup D) = P(X) + P(D) = 0.3056 + 0.1111 = 0.4167$

Note: Correct this calculation with the intersection component. The rolls (2, 6) and (6, 2) belong to both events, so $P(X \cap D) = 2/36$ and $P(X \cup D) = P(X) + P(D) - P(X \cap D) = 11/36 + 4/36 - 2/36 = 13/36 = 0.3611$. The experimental sum in Step #8 double-counts such rolls in the same way.

Answer Comparison:
In the experiment, the probability of getting at least one 6 or an absolute difference of 4 is a bit smaller than in the theoretical calculation as originally computed, i.e.
0.39 versus 0.4167.

Question #2(a)

Experimental Procedure:
Step #1: Use “randomizer.org/form.htm” to generate 100 observations of 3 cards drawn randomly (without replacement) from one suit of playing cards. Parameters:
How many sets of numbers do you want to generate: 100
How many numbers per set: 3
Number range (e.g., 1-50): 1 to 13
Do you wish each number in a set to remain unique: Yes
Step #2: Transfer the 100 observations in Step #1 into Excel.
Step #3: Count the number of occurrences (i.e. rows of data) where there were no face cards among the three cards drawn. This value was found to be 53.
Step #4: Compute the probability of getting no face cards in the three cards drawn by dividing the value found in Step #3 by 100. The answer is 0.53.

Theoretical Calculation:
Using a hypergeometric distribution, the parameters are as follows:
Outcomes: not a face card (success) or face card (failure)
Finite population: $N = 13$; number of successes in the population: $M = 10$
Fixed sample size: $n = 3$; number of successes in the sample: $x = 3$

$$h(x = 3;\ n = 3,\ M = 10,\ N = 13) = P(X = 3) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}} = \frac{\binom{10}{3}\binom{3}{0}}{\binom{13}{3}} = 0.4196$$

$$\sigma_X = \sqrt{\frac{N-n}{N-1}\cdot n\cdot\frac{M}{N}\left(1-\frac{M}{N}\right)} = \sqrt{\frac{13-3}{13-1}\cdot 3\cdot\frac{10}{13}\left(1-\frac{10}{13}\right)} = 0.6662$$

Answer Comparison:
In the experiment, the probability of drawing 3 non-face cards was larger than what we would expect in theory, i.e. 0.53 versus 0.4196, which is approximately 0.17 standard deviations from the theoretical value, i.e. (0.53 - 0.4196) / (0.6662).

Question #2(b)

Experimental Procedure:
Step #1: Use “randomizer.org/form.htm” to generate 100 observations of 3 cards drawn randomly (with replacement) from one suit of playing cards. Parameters:
How many sets of numbers do you want to generate: 100
How many numbers per set: 3
Number range (e.g., 1-50): 1 to 13
Do you wish each number in a set to remain unique: No
Step #2: Transfer the 100 observations in Step #1 into Excel.
Step #3: Count the number of occurrences (i.e. rows of data) where there were no face cards among the three cards drawn. This value was found to be 45.
Step #4: Compute the probability of getting no face cards in the three cards drawn by dividing the value found in Step #3 by 100. The answer is 0.45.

Theoretical Calculation:
Using a binomial distribution, the parameters are as follows:
Sequence of independent trials: $n = 3$
Outcomes: not a face card (success) or face card (failure)
Number of successes: $x = 3$
Probability of success: $p = 10/13$

$$b(x = 3;\ n = 3,\ p = 10/13) = P(X = x) = \binom{n}{x} p^{x}(1-p)^{n-x} = \binom{3}{3}(10/13)^{3}(3/13)^{0} = 0.4552$$

$$\sigma_X = \sqrt{np(1-p)} = \sqrt{(3)(10/13)\left(1-\frac{10}{13}\right)} = 0.7298$$

Answer Comparison:
In the experiment, the probability of drawing 3 non-face cards was slightly smaller than what we would expect in theory, i.e. 0.45 versus 0.4552. It is close though: -0.0071 standard deviations from the theoretical value, i.e. (0.45 - 0.4552) / (0.7298).

Question #2(c)

Experimental Procedure:
Step #1: Use “randomizer.org/form.htm” to generate 100 observations of 3 cards drawn randomly (with replacement) from one suit of playing cards. Parameters:
How many sets of numbers do you want to generate: 100
How many numbers per set: 3
Number range (e.g., 1-50): 1 to 13
Do you wish each number in a set to remain unique: No
Step #2: Transfer the 100 observations in Step #1 into Excel.
Step #3: Count the number of occurrences (i.e. rows of data) where the first of the three cards drawn was not a face card. This value was found to be 71.
Step #4: Compute the probability of getting a non-face card on the first draw by dividing the value found in Step #3 by 100. The answer is 0.71.
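The theoretical values quoted for Questions 1(a), 1(b), 2(a), and 2(b) above can be cross-checked by direct enumeration. The following is a minimal Python sketch, not the Excel workflow used in the report; the helper name `prob` and the event names are hypothetical and introduced only for illustration. It also shows the inclusion-exclusion form of the union in Question 1(b).

```python
from fractions import Fraction
from itertools import product
from math import comb

# All 36 equally likely ordered outcomes of two six-sided dice.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event over the 36 outcomes (hypothetical helper)."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def total_12(r):
    return r[0] + r[1] == 12

def diff_4(r):
    return abs(r[0] - r[1]) == 4

def has_six(r):
    return 6 in r

# Question 1(a): the events cannot occur together, so P(A or B) = P(A) + P(B).
p_1a = prob(lambda r: total_12(r) or diff_4(r))

# Question 1(b): (2, 6) and (6, 2) satisfy both events, so the overlap is
# subtracted: P(A or B) = P(A) + P(B) - P(A and B).
p_overlap = prob(lambda r: has_six(r) and diff_4(r))
p_1b = prob(has_six) + prob(diff_4) - p_overlap

# Question 2(a): hypergeometric, 3 draws without replacement, 10 non-face cards of 13.
p_2a = Fraction(comb(10, 3) * comb(3, 0), comb(13, 3))

# Question 2(b): binomial, 3 independent draws with replacement, all non-face.
p_2b = Fraction(10, 13) ** 3

for label, p in [("1(a)", p_1a), ("1(b)", p_1b), ("2(a)", p_2a), ("2(b)", p_2b)]:
    print(f"Question {label}: {p} = {float(p):.4f}")
```

Using exact fractions avoids rounding when the two pieces of a union are added, which makes the effect of the intersection term in Question 1(b) easy to see.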
Theoretical Calculation:
Using a geometric distribution, the parameters are as follows:
Sequence of independent trials: $n = 3$
Outcomes: not a face card (success) or face card (failure)
Probability of success: $p = 10/13$
Interested in $x = 1$ trial until the 1st success

$$g(x = 1;\ p = 10/13) = p(1-p)^{x-1} = \frac{10}{13}\left(\frac{3}{13}\right)^{0} = 0.7692$$

$$\sigma_X = \sqrt{\frac{1-p}{p^{2}}} = \sqrt{\frac{1-\frac{10}{13}}{\frac{10^{2}}{13^{2}}}} = 0.6245$$

Answer Comparison:
In the experiment, the probability of drawing a non-face card on the first trial was a bit smaller than what we would expect in theory, i.e. 0.71 versus 0.7692. It is close though: -0.0948 standard deviations from the theoretical value, i.e. (0.71 - 0.7692) / (0.6245).

Question 3(a)

Step #1: Using “http://www.random.org/gaussian-distributions/”, generate two samples of size ten (10), two samples of size one hundred (100), and one sample of size one thousand (1000). Use the parameters below:
The distribution's mean should be 0.0 (limits ±1,000,000) and its standard deviation 1.0 (limits ±1,000,000). The numbers should have 2 significant digits (minimum 2, maximum 20).
Step #2: For each of the sets of random numbers generated, transfer the values into MATLAB and place them in an array. Call these arrays X1, X2, X3, X4, and X5, respectively.
Step #3: Generate Q-Q plots for X1, X2, X3, X4, and X5, respectively. Also get the standard deviation and mean values for each dataset so that they can be compared with the set parameters above. This is easily done using a few lines of code in MATLAB.

Note: after a lot of searching, to no avail, and experimenting with numbers, I have come to the deduction that MATLAB uses this formula for calculating the percentile of the $i$-th ordered value out of $n$: $\text{Percentile} = 100\,\frac{i - 0.5}{n}$.

This is what I got below:

[Figure: two Q-Q plots of sample data versus standard normal; x-axis: Standard Normal Quantiles; y-axis: Quantiles of Input Sample.]

Mean = -0.2391

This plot is not normal; it has an S-shape.
There appears to be a large deviation from the straight line at the lower and upper standard normal quantiles. The standard deviation and mean of the X1 data set are not 1 and 0, respectively (which were the set parameters). Overall, it appears that the data may have a long (heavy) tail and be slightly bimodal. I can’t really tell skewness from this.

Plot (X1): Sample size = 10. Standard deviation = 0.9457.

Plot (X2): Sample size = 10. Standard deviation = 1.078, mean = -0.7778.

This plot, like the X1 plot, is not normal; it has an S-shape. There appears to be a large deviation from the straight line at the lower and upper standard normal quantiles. The standard deviation and mean of the X2 data set are not 1 and 0, respectively (which were the set parameters). Overall, it appears that the data may have a long (heavy) tail and be slightly bimodal. I can’t really tell skewness from this.

[Figure: two Q-Q plots of sample data versus standard normal; x-axis: Standard Normal Quantiles; y-axis: Quantiles of Input Sample. Plot (X3): Sample size = 100. Plot (X4): Sample size = 100.]

Values reported for the two plots: standard deviations 1.091 and 0.9939; means -0.1199 and 0.09536.

Plot (X3): This plot is somewhat normal; it has a fairly straight-line shape. There still appears to be some deviation from the set standard deviation and mean. However, this deviation appears to be smaller than in the previous X1 and X2 plots. Apparently, with more data points the sample approaches the normal distribution.

Plot (X4): This plot is somewhat normal; it has a fairly straight-line shape. There still appears to be some deviation from the set standard deviation and mean. However, this deviation appears to be smaller than in the previous X1 and X2 plots, and even smaller than in X3, although 100 points were used in both. Apparently, with more data points the sample approaches the normal distribution.
[Figure: Q-Q plot of sample data versus standard normal; x-axis: Standard Normal Quantiles; y-axis: Quantiles of Input Sample. Plot (X5): Sample size = 1000.]

Standard deviation = 1.009, mean = -0.004889

This plot is very close to normal; it follows a straight line. There is very little deviation from the set standard deviation and mean. Compared to X1, X2, X3, and X4, this plot gave the closest match to a normal distribution.

Overall, it seems that the more data points there are, the more normal the sample appears when taken from the parent distribution (which was established by the set parameters). A simple explanation is that in a small sample, a few extreme values can noticeably shift the mean and standard deviation. Remember that the mean of “0” and standard deviation of “1” are only obtained after averaging all data points in the entire population. Therefore, the larger the sample pulled from the population, the more representative the values become.

Question 3(b)

Step #1: For each of the sample data sets in the Project1.xlx file, transfer the values into MATLAB and place them in an array. Call these arrays S1, S2, and S3, respectively.
Step #2: Generate Q-Q plots for S1, S2, and S3, respectively. Also get the standard deviation and mean values for each dataset so that they can be compared with the standard deviation and mean values of the normal line fitted to the data. This is easily done using a few lines of code in MATLAB.
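The Q-Q procedure used in Questions 3(a) and 3(b) can be sketched outside MATLAB as follows. This is a minimal Python illustration under stated assumptions: it uses the plotting positions $(i - 0.5)/n$ deduced in the note in Question 3(a), it draws its own standard normal samples rather than using the random.org data, and the function names `qq_points` and `fit_line` are hypothetical. The least-squares line is a stand-in chosen for simplicity and is not necessarily the reference line that MATLAB draws. For a sample that is close to normal, the slope of a straight Q-Q line estimates the standard deviation and its intercept estimates the mean.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each sorted value with the standard normal quantile at (i - 0.5) / n."""
    n = len(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))

def fit_line(points):
    """Least-squares slope and intercept of sample quantiles on normal quantiles."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

rng = random.Random(511)  # hypothetical seed, used only for reproducibility
for n in (10, 100, 1000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    slope, intercept = fit_line(qq_points(sample))
    print(f"n={n:5d}  mean={mean(sample):+.4f}  sd={stdev(sample):.4f}  "
          f"line slope={slope:.4f}  intercept={intercept:+.4f}")
```

Running the loop for several seeds gives a feel for how much the mean and standard deviation of a sample of size 10 can wander compared with a sample of size 1000.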
This is what I got below:

[Figure: three Q-Q plots of sample data versus standard normal; x-axis: Standard Normal Quantiles; y-axis: Quantiles of Input Sample. Plot (S1): Sample size = 150. Plot (S2): Sample size = 150. Plot (S3): Sample size = 20.]

Plots S1 and S2 are obviously not normal; they have an S-shape and a deep curve, respectively. Plot S1 appears to have somewhat short (light) tails, while Plot S2 appears to be positively skewed. There is tremendous deviation from the straight line that is expected of a Q-Q plot of normally distributed data.

Plot (S3), on the other hand, seems somewhat normally distributed. The points generally follow the straight line, as seen in the graph, with the exception of the first and last points.

Question 3(c)

Only plot (S3) seemed normal (if we disregard the first two data points and the last one), so I will explain this one.

[Figure: Q-Q plot of sample data versus standard normal for Plot (S3): Sample size = 20, annotated with rise = (8.8 - 1.6) = 7.2 and run = (1 - (-1)) = 2; x-axis: Standard Normal Quantiles; y-axis: Quantiles of Input Sample.]

Yes, it is possible to use the Q-Q plot to determine the mean and standard deviation of a sample that is normally distributed. We will assume that, because the sample is normally distributed, it is theoretically represented by the straight red line shown in the graph above. Using this graph as a guide, this is how we do it.

For the mean of the sample, simply identify the middle point of the graph, where the standard normal quantile is 0. I provided the dashed green lines and an arrow as a guide. At this middle point, reading from the y-axis, the red line is at roughly 5.1. Hence the mean of the sample is 5.1.
The standard deviation is simply the slope of the straight line. From basic math we know that this slope is determined by dividing the “rise” of the y-values by the “run” of the x-values. I have provided dashed purple lines as guides. The “rise” is roughly 8.8 minus 1.6, while the “run” is roughly 1 minus -1 (i.e. rise = 7.2 and run = 2). Hence the slope of the red line is 7.2 divided by 2, which is equal to 3.6, and the standard deviation of the sample is 3.6. Again, this assumes that all points are normally distributed.

How do these values compare to the theoretical values in MATLAB?

Standard deviation = 3.606, mean x̄ = 5.125

These values are pretty close to what I have computed. Note, however, that the actual standard deviation and mean values for the S3 dataset were 4.535 and 4.826, respectively. Remember that the data is only somewhat normal, as I said before based on a visual check, hence the deviation of these values from the theoretical values.

Question 3(d)

Step #1: Using “http://www.random.org/integers/”, generate 10 samples from a uniform distribution on the interval [1, 10] with 100 values each.
Step #2: For each dataset, transfer the values into MATLAB and place them in an array. Call these arrays U1,…,U10, respectively.
Step #3: Using MATLAB, average U1 (which is the same as the U1 array), then average U1+U2, then average U1+…+U6, then average U1+…+U10. Call these new arrays A1U, A2U, A6U, and A10U, respectively.
Step #4: Generate Q-Q plots and histograms for A1U, A2U, A6U, and A10U, respectively. Also get the standard deviation and mean values for each dataset so that they can be compared with the theoretical standard deviation and mean values of the normal line fitted to the data. This is easily done using a few lines of code in MATLAB.
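The averaging in Step #3 can be sketched as follows. This is a minimal Python illustration, assuming integer values drawn uniformly from 1 to 10 by a pseudo-random generator in place of the random.org data; the name `row_average` is hypothetical. For a single value the population mean is 5.5 and the standard deviation is $\sqrt{(10^2 - 1)/12} \approx 2.872$, and averaging $k$ independent columns divides that standard deviation by $\sqrt{k}$.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

rng = random.Random(3)  # hypothetical seed, used only for reproducibility

# Ten columns of 100 integers drawn uniformly from 1..10 (stand-ins for U1..U10).
columns = [[rng.randint(1, 10) for _ in range(100)] for _ in range(10)]

def row_average(cols):
    """Element-wise average across columns, e.g. A2U = (U1 + U2) / 2."""
    return [mean(row) for row in zip(*cols)]

# Population SD of one discrete uniform value on 1..10: sqrt((10^2 - 1) / 12).
sigma_one = pstdev(range(1, 11))

for k in (1, 2, 6, 10):
    averaged = row_average(columns[:k])
    print(f"A{k}U: sample mean={mean(averaged):.3f}  sample sd={stdev(averaged):.3f}  "
          f"expected mean=5.500  expected sd={sigma_one / sqrt(k):.3f}")
```

Under these assumptions the expected standard deviations for k = 1, 2, 6, and 10 are about 2.872, 2.031, 1.173, and 0.908, while the expected mean stays at 5.5 for every k.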
This is what I got below:

[Figure: Plot (A1U): Histogram (left; x-axis: x values, y-axis: frequency) and Q-Q plot (right; x-axis: Standard Normal Quantiles, y-axis: Quantiles of Input Sample) for the A1U dataset, i.e. average of 1 uniform distribution.]

Theoretical: standard deviation = 3.182, mean = 5.25
Sample: standard deviation = 2.846, mean = 5.33

This is obviously not normally distributed; it is also bimodal. (Look at the histogram and at the S-shape on the Q-Q plot.) The theoretical and sample values of the mean and standard deviation are somewhat “off”.

[Figure: Plot (A2U): Histogram (left) and Q-Q plot (right) for the A2U dataset, i.e. average of 2 uniform distributions.]

Theoretical: standard deviation = 2.121, mean = 5.5
Sample: standard deviation = 2.036, mean = 5.515

This is somewhat normally distributed. (Look at the histogram and at the relatively straight line on the Q-Q plot.) The theoretical versus sample standard deviation is a little off, but the mean of the sample is fairly similar to, and representative of, the theoretical value.

[Figure: Plot (A6U): Histogram (left) and Q-Q plot (right) for the A6U dataset, i.e. average of 6 uniform distributions.]

Theoretical: standard deviation = 1.061, mean = 5.583
Sample: standard deviation = 1.188, mean = 5.3

This is close to normally distributed. (Look at the histogram and at the relatively straight line on the Q-Q plot.) The theoretical and sample values of the standard deviation and mean are a little off.
[Figure: Plot (A10U): Histogram (left) and Q-Q plot (right) for the A10U dataset, i.e. average of 10 uniform distributions.]

Theoretical: standard deviation = 0.9899, mean = 5.6
Sample: standard deviation = 0.9528, mean = 5.57

This is a bit closer to a normal distribution than the previous 3 graphs. (Look at the histogram and at the relatively straight line capturing more data points on the Q-Q plot.) The theoretical and sample values of the standard deviation and mean are very close to each other.

Question 3(e)

Step #1: Using MATLAB, generate 50 sets of random values from an exponential distribution, each having a mean value of 5 and an array size of 100*1 (i.e. 100 rows, 1 column). Name the arrays E1, E2,…,E50, respectively. Because the code was explicitly requested, here is an example of what it looks like for one of the arrays: E1 = exprnd(5, [100,1]);
Step #2: Using MATLAB, keep averaging E1+…+E10, E1+…+E20, and so on. Call these new arrays A10E, A20E, and so on, respectively. E.g. A6E = ((E1+E2+E3+E4+E5+E6)/6)
Step #3: Generate Q-Q plots for A10E, A20E, and so on, respectively. This is easily done using a few lines of code in MATLAB. E.g. of code used to generate a plot: qqplot(A6E)

Answer:
About 45 columns need to be averaged before the resulting distribution appears normal. This number is larger than the value in part 3(d) because the exponential distribution with a mean value of 5 is right skewed, so most of its values pile up on the left of the graph. For the averages to “pile up” symmetrically around the center, more columns need to be averaged so that the skew is gradually averaged out.
Essentially, according to the central limit theorem, the columns that we average are independent of one another, so the averages vary from row to row and eventually become approximately normally distributed, as seen at about the 45th average in this case. In the case of the uniform distribution used in part 3(d), every value is equally likely and the distribution is already symmetric, so we only need the averages to start developing a peak in the center, as we observe in a normal distribution; hence fewer columns need to be averaged.
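The comparison between parts 3(d) and 3(e) can be made more concrete by tracking skewness. The following is a minimal Python sketch, assuming `random.expovariate` with rate 1/5 as a stand-in for the MATLAB call exprnd(5, [100,1]); the function name `skewness` and the seed are hypothetical. An exponential distribution has skewness 2, and the average of $k$ independent exponential columns has skewness $2/\sqrt{k}$, whereas a uniform distribution has skewness 0 to begin with.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

rng = random.Random(45)  # hypothetical seed, used only for reproducibility

# 50 columns of 100 exponential values with mean 5 (rate 1/5), standing in for E1..E50.
columns = [[rng.expovariate(1 / 5) for _ in range(100)] for _ in range(50)]

for k in (1, 10, 20, 30, 40, 50):
    # Element-wise average of the first k columns, e.g. A10E = (E1 + ... + E10) / 10.
    averaged = [mean(row) for row in zip(*columns[:k])]
    print(f"k={k:2d}  sample skewness={skewness(averaged):+.3f}  "
          f"theoretical skewness={2 / sqrt(k):.3f}")
```

With only 100 rows the sample skewness is noisy, so the point at which a Q-Q plot "looks normal" is a visual judgment; the theoretical column shows only that the skew shrinks slowly, in proportion to $1/\sqrt{k}$.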