NDSU 11: Random Processes
ECE 111 - JSG
April 4, 2017

Random Processes
Probability and hypothesis testing

Objective
- Determine the confidence interval for a random variable
- Determine the probability of an event exceeding a threshold
- Be able to use a t-table
- Be able to use www.stattrek.com to determine probabilities

Matlab Functions
mean()
std()

Central Limit Theorem

The Central Limit Theorem states that
- The sum (or average) of independent samples from any distribution with a finite variance converges to a normal distribution as the number of samples goes to infinity, and
- Once you have a normal distribution, you remain with a normal distribution (the sum of normally distributed variables is also normal).

For example, take a six-sided die with each number having a probability of 1/6.

(Figure: Percent of the time each number comes up when rolling a six-sided die 100,000 times)

If you sum 10 dice, the result is a bell curve (it approaches a normal distribution).

(Figure: Result from rolling ten 6-sided dice 100,000 times)

Normal (Gaussian) Distributions:

The normal distribution is written as N(x̄, s) and has the probability density function

$$ p(x) = \alpha \cdot \exp\left( \frac{-(x - \bar{x})^2}{2 s^2} \right) $$

where x̄ is the mean, s is the standard deviation (a measure of the spread), and α = 1/(s·sqrt(2π)) is the constant required to make the area equal to one (the probability that something happens is one).

N(0,1) is the standard normal distribution, with mean equal to zero and standard deviation equal to one. Its probability density function is:

>> s = [-3:0.001:3]';
>> p = exp(-(s.^2)/2) / sqrt(2*pi);
>> plot(s,p);
>> xlabel('deviations');
>> ylabel('p()');

The area under the curve is the probability of an event happening. For example, the area within X standard deviations of the mean is:

Range               Area
+/- 1 deviation     0.68
+/- 2 deviations    0.95
+/- 3 deviations    0.997

As a rough rule of thumb, 95% of the data should lie within +/- 2 standard deviations of the mean. (The mean tells you the average of the data, the standard deviation tells you the spread.)
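The areas quoted above can be checked numerically. The following is a minimal sketch, assuming only the base-Matlab function erf(); the variable names k and area are chosen here for illustration and do not come from the original notes. For a normal distribution, the area within k standard deviations of the mean is erf(k/sqrt(2)).

```matlab
% Area under a normal distribution within +/- k standard deviations
% of the mean. Uses only erf(), so no toolbox is assumed.
k = [1 2 3];                 % number of standard deviations
area = erf(k / sqrt(2));     % P(|x - mean| < k*s)
disp([k' area']);            % one row per value of k
```

The three results should come out at roughly 0.68, 0.95, and 0.997, which is where the "95% within two deviations" rule of thumb comes from.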
Student t-distribution

The t-distribution is like the normal distribution, but it takes the sample size into account. In a t-table:
- The left column is the degrees of freedom. This is the sample size minus one.
- The top row tells you the probability level (the cumulative area to the left of the table entry).
- The table entries tell you how many standard deviations away from the mean you have to go to capture that much area.
- With an infinite sample size, the t-distribution becomes the normal distribution.

Student t-Table (http://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf)

df \ p     0.75   0.80   0.85   0.90   0.95   0.975   0.99    0.995   0.999    0.9995
1          1.00   1.38   1.96   3.08   6.31   12.71   31.82   63.66   318.31   636.62
2          0.82   1.06   1.39   1.89   2.92   4.30    6.97    9.93    22.33    31.60
3          0.77   0.98   1.25   1.64   2.35   3.18    4.54    5.84    10.22    12.92
4          0.74   0.94   1.19   1.53   2.13   2.78    3.75    4.60    7.17     8.61
5          0.73   0.92   1.16   1.48   2.02   2.57    3.37    4.03    5.89     6.87
10         0.70   0.88   1.09   1.37   1.81   2.23    2.76    3.17    4.14     4.59
15         0.69   0.87   1.07   1.34   1.75   2.13    2.60    2.95    3.73     4.07
20         0.69   0.86   1.06   1.33   1.73   2.09    2.53    2.85    3.55     3.85
25         0.68   0.86   1.06   1.32   1.71   2.06    2.49    2.79    3.45     3.73
30         0.68   0.85   1.06   1.31   1.70   2.042   2.46    2.750   3.39     3.646
40         0.68   0.85   1.05   1.30   1.68   2.02    2.42    2.70    3.31     3.55
60         0.68   0.848  1.05   1.30   1.67   2.00    2.390   2.660   3.232    3.46
infinity   0.674  0.842  1.036  1.282  1.645  1.960   2.326   2.576   3.090    3.29

This is also available at StatTrek.com. For example, a probability of 0.95 with 10 degrees of freedom gives 1.81, the same as the above table.

(Figure: StatTrek.com t-distribution calculator)

t-test and Circuit Analysis:

Suppose you have 5% tolerance resistors. What is the 90% confidence interval for the voltage at Y?

(Figure: voltage divider with a 10V source, R2 = 1k in series with R1 = 1k, and the output Y taken across R1)

Ideally, Y should be 5.00V. Due to variations in R1 and R2, it will be a little different.
For example, one random pair of 5% resistors gives:

>> R1 = 1000 * (1 + 0.05*(rand*2-1) )
   1031.5
>> R2 = 1000 * (1 + 0.05*(rand*2-1) )
   1040.6
>> Y = (R1 / (R1 + R2)) * 10
   4.9780

To find the 90% confidence interval, we need to know the probability distribution of Y (i.e. its mean and standard deviation). If I repeat this 10 times:

   result = [];
   for i=1:10
      % pick each resistor at random within +/- 5% of 1k
      R1 = 1000 * (1 + 0.05*(rand*2-1) );
      R2 = 1000 * (1 + 0.05*(rand*2-1) );
      % voltage divider output
      Y = (R1 / (R1 + R2)) * 10;
      result = [result ; Y];
   end
   x = mean(result)
      4.9465
   s = std(result)
      0.0723

For a 90% confidence interval, each tail should be 5% (leaving 90% in the middle). The number of deviations you have to go out for a 5% tail comes from a t-table with 9 degrees of freedom (due to a sample size of 10).

You need to go 1.833 deviations away from the mean to capture 90% of the area:

   x - 1.833s < Y < x + 1.833s     with p = 0.9

>> x + 1.833*s
   5.0790
>> x - 1.833*s
   4.8140

The voltage at Y will be in the range of (4.8140V < Y < 5.0790V) with a probability of 0.9.

>> s1 = [-3:0.01:3]';
>> p = exp(-(s1.^2)/2);
>> plot(s1*s+x, p);
>> xlabel('Voltage at Y');
>> ylabel('probability');

(Figure: Distribution of the voltage at Y)

t-tests and Weather Data:

Example 2: Fargo Weather. The historical data for April in Fargo, ND is

Year   Low (F)   High (F)   Mean (F)   Precip (in)   Snow (in)
2015   15        82         47.1       0.98          0
2014   9         79         40.4       3.43          2
2013   11        73         33.9       2.11          16.7
2012   15        77         48.1       1.1           0
2011   26        70         42.4       2.02          4.7
2010   24        77         51.6       1.49          0
2009   14        82         41.9       0.81          0.2
2008   20        68         41         2.33          16.9
2007   10        80         42.9       3.16          7.8
2006   25        79         50.7       1.28          0
2005   22        87         49.1       0.87          0
2004   17        91         44.3       0.16          0.5
2003   20        89         45.3       1.32          3.6
2002   7         89         40.1       1.26          6.3
2001   21        85         44.4       2.7           8
2000   7         74         42.3       1.33          6.2
1999   24        77         45.1       1.04          0
1998   26        82         49.2       0.6           0.7
1997   7         69         37.8       2.14          0
1996   17        66         37.7       0.21          0.2

http://weather-warehouse.com/WeatherHistory/PastWeatherData_FargoHectorIntlArpt_Fargo_ND_April.html

What is the chance it will break 90F in April 2016?
Take the High column of the table (call it F) and find the mean and standard deviation:

>> x = mean(F)
   78.8000
>> s = std(F)
   7.3097

90F is 1.5322 deviations to the right of the mean:

>> (90 - x) / s
   1.5322

From a t-table with 19 degrees of freedom (sample size 20), the area to the right of 1.5322 deviations is 0.071. Based upon this data, there's a 7.1% chance that it will break 90F this coming April.

t-test and Global Temperatures:

The deviation in global temperatures is given by the following data:

https://www.ncdc.noaa.gov/cag/time-series/global/globe/land_ocean/p12/12/1880-2016.csv

What is the 90% confidence interval for the temperature deviation from the mean?

Solution: Find the mean and the standard deviation for the data (column #2 of the above link)

>> C = DATA(:,2);
>> x = mean(C)
   0.0471
>> s = std(C)
   0.3275

For 5% tails, you need to go 1.645 deviations left and right of the mean (the sample is large, so use the infinity row of the t-table).

Any given year will be in the range of (-0.4916C < T < 0.5858C) with a probability of 0.9.

What is the probability that a given month will be 1 degree Celsius above average (like Jan - May, 2016)?

Take the distance of 1C from the mean in terms of standard deviations:

>> (x - 1) / s
   ans = -2.9096

The area beyond 2.9096 deviations is 0.0018, so there is a 0.18% chance that any given month will be 1C above average.

What is the chance that you'll be 1C above average 4 months in a row?

Assuming these are uncorrelated, it is

   p = 0.0018^4
   p = 0.0000000000105

Chi-Squared Distribution

A t-test tests the mean. A chi-squared test tests the shape of the distribution.

Example: The following code in Matlab generates a 6-sided die

   d6 = ceil(6*rand);

Is this a fair die? To answer this, you need to use a chi-squared test. The way a chi-squared test works is
- Collect a bunch of data
- Separate the data into N bins (the six numbers in this case)
- Count the number of times the data wound up in each bin
- Compare it to the expected frequency using the metric

$$ \chi^2 = \sum_i \frac{(n p_i - N_i)^2}{n p_i} $$

  where n p_i is the expected frequency for bin i and N_i is the actual frequency
- Use a chi-squared table to convert this to a probability. A large number means that the data is inconsistent with the assumed distribution.

In the chi-squared table,
- df is the degrees of freedom (number of bins minus 1)
- % is the probability level
- The number in the table is the chi-squared value

Chi-Squared Table: Probability of rejecting the null hypothesis
http://people.richland.edu/james/lecture/m170/tbl-chi.html

df   99.5%   99%     97.5%   95%     90%     10%    5%     2.5%   1%     0.5%
1    7.88    6.64    5.02    3.84    2.71    0.02   0.00   0.00   0.00   0.00
2    10.60   9.21    7.38    5.99    4.61    0.21   0.10   0.05   0.02   0.01
3    12.84   11.35   9.35    7.82    6.25    0.58   0.35   0.22   0.12   0.07
4    14.86   13.28   11.14   9.49    7.78    1.06   0.71   0.48   0.30   0.21
5    16.75   15.09   12.83   11.07   9.24    1.61   1.15   0.83   0.55   0.41
6    18.55   16.81   14.45   12.59   10.65   2.20   1.64   1.24   0.87   0.68
7    20.28   18.48   16.01   14.07   12.02   2.83   2.17   1.69   1.24   0.99
8    21.96   20.09   17.54   15.51   13.36   3.49   2.73   2.18   1.65   1.34
9    23.59   21.67   19.02   16.92   14.68   4.17   3.33   2.70   2.09   1.74
10   25.19   23.21   20.48   18.31   15.99   4.87   3.94   3.25   2.56   2.16

Example: Fair Die:

Roll a 6-sided die 1200 times

   result = zeros(6,1);
   for i=1:1200
      D6 = ceil( 6 * rand );
      % count how many times each number comes up
      result(D6) = result(D6) + 1;
   end
   result
   % expected frequency is 1200/6 = 200 for every bin
   chi = sum( (result - 200).^2 / 200 )

Set up a table:

Number   Probability (p)   Expected Frequency (np)   Actual Frequency (N)   (np-N)^2 / np
1        1/6               200                       193                    0.2450
2        1/6               200                       184                    1.2800
3        1/6               200                       203                    0.0450
4        1/6               200                       204                    0.0800
5        1/6               200                       206                    0.1800
6        1/6               200                       210                    0.5000
Sum                                                                         2.33

From a chi-squared table with 5 degrees of freedom (6 bins), 2.33 is more than 10% and less than 90%.
- More than 10% means the data probably wasn't fudged. If the data is too perfect, be suspicious.
- Less than 90% means there is no reason to claim that Matlab's rand function is biased.

You can also use StatTrek.com.

(Figure: Chi-squared result from StatTrek.com)
p = 0.2 means
- the data wasn't fudged (a very small p means be suspicious of the data)
- the data is consistent with the assumed distribution (p = 0.2 is well short of the 0.9 needed to suspect that the distribution is not uniform)

Example: Loaded Die:

Suppose 5% of the time you cheat: the die is forced to be a one. Can you detect this with a chi-squared test?

   result = zeros(6,1);
   for i=1:1200
      D6 = ceil( 6 * rand );
      % cheat 5% of the time: force the roll to be a one
      if (rand < 0.05) D6 = 1; end
      result(D6) = result(D6) + 1;
   end
   result
   chi = sum( (result - 200).^2 / 200 )

Again, set up a chi-squared table:

Number   Probability (p)   Expected Frequency (np)   Actual Frequency (N)   (np-N)^2 / np
1        1/6               200                       251                    13.0050
2        1/6               200                       165                    6.1250
3        1/6               200                       185                    1.1250
4        1/6               200                       200                    0
5        1/6               200                       201                    0.0050
6        1/6               200                       198                    0.0200
Sum                                                                         20.28

From StatTrek.com, the chi-squared result is 0.999. I am 99.9% certain that this is not a fair die.

Example: Fudging the Data:

Instead of rolling the dice 12,000 times, just roll the dice 1200 times and add 1800 to each result, making it look like you rolled the dice 12,000 times. Can you detect the fudged data with a chi-squared test?

Use the results for the fair die rolled 1200 times and add 1800 to each result:

Number   Probability (p)   Expected Frequency (np)   Actual Frequency (N)   (np-N)^2 / np
1        1/6               2,000                     1,993                  0.02
2        1/6               2,000                     1,984                  0.13
3        1/6               2,000                     2,003                  0
4        1/6               2,000                     2,004                  0.01
5        1/6               2,000                     2,006                  0.02
6        1/6               2,000                     2,010                  0.05
Sum                                                                         0.23

From StatTrek, using a chi-squared distribution with 5 degrees of freedom (6 bins), a chi-squared value of 0.23 fits the expected distribution extremely well. In fact, it fits so well that there's only a 0.001 chance of generating data this good by chance. The data was most likely fudged.
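The StatTrek lookups in the three die examples can also be reproduced in Matlab. This is a minimal sketch, not the method used in the notes: it assumes the base-Matlab function gammainc() and uses the identity that, for df degrees of freedom, the cumulative chi-squared probability is P(chi2 <= x) = gammainc(x/2, df/2). The variable names df, chi, and p are chosen here for illustration.

```matlab
% Cumulative chi-squared probability without the Statistics Toolbox.
% gammainc(x, a) is the regularized lower incomplete gamma function.
df  = 5;                       % 6 bins minus 1
chi = [2.33 20.28 0.23];       % fair die, loaded die, fudged data
p   = gammainc(chi/2, df/2);   % P(chi-squared <= chi) for each case
disp([chi' p']);               % one row per example
```

The three probabilities should land near 0.2, 0.999, and 0.001, matching the fair, loaded, and fudged cases: a value close to one says the data does not fit the assumed distribution, and a value close to zero says it fits suspiciously well.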