NDSU
11: Random Processes
ECE 111 - JSG
Random Processes
Probability and hypothesis testing
Objective
Determine the confidence interval for a random variable
Determine the probability of an event exceeding a threshold
Be able to use a t-table
Be able to use www.stattrek.com to determine probabilities.
Matlab Functions
mean()
std()
Central Limit Theorem
The Central Limit Theorem states that
The sum (or average) of many independent samples from almost any distribution (any distribution with a finite variance) converges to a normal distribution as the number of samples goes to infinity, and
Once you have a normal distribution, you remain with a normal distribution: the sum of independent normally distributed variables is also normally distributed.
For example, take a six-sided die with each number having a probability of 1/6. A single roll has a flat (uniform) distribution.
Figure: Percent of the time each number comes up when rolling a six-sided die 100,000 times
If you sum 10 dice, the result is a bell curve (it approaches a normal distribution).
Figure: Result from rolling ten 6-sided dice 100,000 times
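The dice experiment can be reproduced with a few lines of Matlab. This is a minimal sketch, not the code behind the original figures: the names nRolls, nDice, rolls, total and within2 are assumptions, and because rand is not seeded the numbers change slightly on each run.

```matlab
% Sum ten 6-sided dice, repeated 100,000 times
nRolls = 100000;
nDice  = 10;
rolls  = ceil(6 * rand(nRolls, nDice));   % each entry is 1..6
total  = sum(rolls, 2);                   % one sum per row

x = mean(total)     % theory: 10 * 3.5 = 35
s = std(total)      % theory: sqrt(10 * 35/12), about 5.40

% fraction of the sums within 2 standard deviations of the mean
within2 = mean(abs(total - x) <= 2*s)
```

If the sum is close to normal, within2 should come out near 0.95.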
April 4, 2017
Normal (Gaussian) Distributions:
The normal distribution is written as
$N(\bar{x}, s)$
and has the probability density function
$$ p(x) = \alpha \cdot \exp\left( -\frac{(x - \bar{x})^2}{2 s^2} \right) $$
where
$\bar{x}$ is the mean,
$s$ is the standard deviation (a measure of the spread), and
$\alpha = 1 / (s \sqrt{2\pi})$ is the constant required to make the area equal to one (the probability that something happens is one)
N(0,1) is the standard normal distribution with
mean equal to zero, and
standard deviation equal to one
Its probability density function is:
>> s = [-3:0.001:3]';
>> p = exp(-(s.^2)/2) / sqrt(2*pi);
>> plot(s,p);
>> xlabel('deviations');
>> ylabel('p()');
The area under the curve is the probability of an event happening. For example, the area within X standard deviations of the mean is:
Range               Area
+/- 1 deviation     0.68
+/- 2 deviations    0.95
+/- 3 deviations    0.997
As a rough rule of thumb, 95% of the data should lie within +/- 2 standard deviations of the mean. (The mean tells you the average of the data, the standard deviation tells you the spread.)
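These areas can be checked with the error function, which is built into Matlab. For a normal distribution, the area within k standard deviations of the mean is erf(k / sqrt(2)). The variable names k and area below are only illustrative.

```matlab
k = [1; 2; 3];                 % number of standard deviations
area = erf(k / sqrt(2))        % 0.6827, 0.9545, 0.9973
```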
Student t-distribution
The t-distribution is like the normal distribution, but it takes the sample size into account. A t-table looks like the following:
The left column is the degrees of freedom. This is the sample size minus one.
The top row tells you the probability level p (the area to the left of the table entry)
The table entries tell you how many standard deviations away from the mean you have to go to capture that much area
An infinite sample size gives a normal distribution (central limit theorem)
Student t-Table
(http://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf)
df \ p      0.75    0.8     0.85    0.9     0.95    0.975   0.99    0.995   0.999    0.9995
1           1       1.38    1.96    3.08    6.31    12.71   31.82   63.66   318.31   636.62
2           0.82    1.06    1.39    1.89    2.92    4.3     6.97    9.93    22.33    31.6
3           0.77    0.98    1.25    1.64    2.35    3.18    4.54    5.84    10.22    12.92
4           0.74    0.94    1.19    1.53    2.13    2.78    3.75    4.6     7.17     8.61
5           0.73    0.92    1.16    1.48    2.02    2.57    3.37    4.03    5.89     6.87
10          0.7     0.88    1.09    1.37    1.81    2.23    2.76    3.17    4.14     4.59
15          0.69    0.87    1.07    1.34    1.75    2.13    2.6     2.95    3.73     4.07
20          0.69    0.86    1.06    1.33    1.73    2.09    2.53    2.85    3.55     3.85
25          0.68    0.86    1.06    1.32    1.71    2.06    2.49    2.79    3.45     3.73
30          0.68    0.85    1.06    1.31    1.7     2.042   2.46    2.750   3.39     3.646
40          0.68    0.85    1.05    1.3     1.68    2.02    2.42    2.7     3.31     3.55
60          0.68    0.848   1.05    1.3     1.67    2       2.390   2.660   3.232    3.46
infinity    0.674   0.842   1.036   1.282   1.645   1.960   2.326   2.576   3.090    3.29
These values are also available at StatTrek.com. For example, a probability of 0.95 with 10 degrees of freedom gives 1.81, the same as the table above.
Figure: StatTrek.com t-distribution calculator
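A table entry can also be computed in base Matlab, without the Statistics Toolbox, by inverting the incomplete beta function. This is a sketch under the assumption that p is greater than 0.5 (so the answer is positive). The names p, df, xb and t are illustrative.

```matlab
% Critical t-value: how many deviations capture area p to the left
p  = 0.95;    % probability level (top row of the table)
df = 10;      % degrees of freedom (left column of the table)

% For t > 0 the two-tailed area is I_x(df/2, 1/2) with x = df/(df + t^2),
% where I_x is the regularized incomplete beta function.
xb = betaincinv(2*(1 - p), df/2, 0.5);
t  = sqrt(df * (1 - xb) / xb)      % table entry: 1.81
```

Changing df to 9 should give about 1.833, the value needed for a sample size of 10.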
t-test and Circuit Analysis:
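Suppose you have 5% tolerance resistors. What is the 90% confidence interval for the voltage at Y?
Figure: voltage divider. A 10V source drives R2 (1k) in series with R1 (1k). Y is the node between them (the voltage across R1).
Ideally, Y should be 5.00V. Due to variations in R1 and R2, it will be a little different.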
>> R1 = 1000 * (1 + 0.05*(rand*2-1) )
1031.5
>> R2 = 1000 * (1 + 0.05*(rand*2-1) )
1040.6
>> Y = (R1 / (R1 + R2)) * 10
4.9780
To find the 90% confidence interval, we need to know the probability distribution of Y (i.e. its mean and
standard deviation). If I repeat this 10 times:
result = [];
for i=1:10
R1 = 1000 * (1 + 0.05*(rand*2-1) );
R2 = 1000 * (1 + 0.05*(rand*2-1) );
Y = (R1 / (R1 + R2)) * 10;
result = [result ; Y];
end
x = mean(result)
4.9465
s = std(result)
0.0723
For a 90% confidence interval, each tail should be 5% (leaving 90% in the middle). The number of deviations you have to go out for a 5% tail comes from a t-table with 9 degrees of freedom (due to a sample size of 10).
You need to go 1.833 deviations away from the mean to capture 90% of the area:
$\bar{x} - 1.833 s < Y < \bar{x} + 1.833 s$ with p = 0.9
>> x + 1.833*s
5.0790
>> x - 1.833*s
4.8140
The voltage at Y will be in the range of (4.8140V < Y < 5.0790V) with a probability of 0.9
>> s1 = [-3:0.01:3]';
>> p = exp(-(s1.^2)/2);
>> plot(s1*s+x, p);
>> xlabel('Voltage at Y');
>> ylabel('probability');
Figure: Distribution of the voltage at Y
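The ten-trial experiment can also be written without a loop. The sketch below is one way to do it and is not the code from the notes: the names N, t90, lo and hi are assumptions, and 1.833 is the 9-degrees-of-freedom value used above. Because rand is not seeded, the interval changes a little on each run.

```matlab
N   = 10;                                   % number of simulated circuits
R1  = 1000 * (1 + 0.05*(2*rand(N,1) - 1));  % 1k +/- 5%, uniform
R2  = 1000 * (1 + 0.05*(2*rand(N,1) - 1));
Y   = 10 * R1 ./ (R1 + R2);                 % divider output for each circuit

x   = mean(Y);
s   = std(Y);
t90 = 1.833;                                % t-table: 9 df, 5% in each tail

lo = x - t90*s
hi = x + t90*s
```

Raising N needs a new t-value from the row for N - 1 degrees of freedom.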
t-tests and Weather Data:
Example 2: Fargo Weather. The historical data for April in Fargo ND is
Year    Low (F)   High (F)   Mean (F)   Precip (in)   Snow (in)
2015    15        82         47.1       0.98          0
2014    9         79         40.4       3.43          2
2013    11        73         33.9       2.11          16.7
2012    15        77         48.1       1.1           0
2011    26        70         42.4       2.02          4.7
2010    24        77         51.6       1.49          0
2009    14        82         41.9       0.81          0.2
2008    20        68         41         2.33          16.9
2007    10        80         42.9       3.16          7.8
2006    25        79         50.7       1.28          0
2005    22        87         49.1       0.87          0
2004    17        91         44.3       0.16          0.5
2003    20        89         45.3       1.32          3.6
2002    7         89         40.1       1.26          6.3
2001    21        85         44.4       2.7           8
2000    7         74         42.3       1.33          6.2
1999    24        77         45.1       1.04          0
1998    26        82         49.2       0.6           0.7
1997    7         69         37.8       2.14          0
1996    17        66         37.7       0.21          0.2
http://weather-warehouse.com/WeatherHistory/PastWeatherData_FargoHectorIntlArpt_Fargo_ND_April.html
What is the chance it will break 90F in April 2016? Take column 2 (the high, stored here as F) and find the mean and standard deviation:
>> x = mean(F)
78.8000
>> s = std(F)
7.3097
90F is 1.5322 deviations to the right of the mean:
>> (90 - x) / s
1.5322
Look this up in a t-table with 19 degrees of freedom (sample size 20).
Based upon this data, there's a 7.1% chance that it will break 90F this coming April.
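The 7.1% figure can be reproduced in base Matlab from the High column of the table. This sketch uses the incomplete beta function in place of a t-table lookup. The names df, t and pTail are assumptions, and the formula as written only applies when t is positive.

```matlab
% April high temperatures (F) for Fargo, 2015 back to 1996 (from the table)
F = [82 79 73 77 70 77 82 68 80 79 87 91 89 89 85 74 77 82 69 66]';

x  = mean(F);              % 78.8
s  = std(F);               % 7.3097
df = length(F) - 1;        % 19 degrees of freedom
t  = (90 - x) / s;         % 1.5322 deviations above the mean

% Right-tail area of the t-distribution:
% P(T > t) = 0.5 * I_x(df/2, 1/2) with x = df/(df + t^2), valid for t > 0
pTail = 0.5 * betainc(df/(df + t^2), df/2, 0.5)   % about 0.071
```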
t-test and Global Temperatures:
The deviation in global temperatures is shown below:
https://www.ncdc.noaa.gov/cag/time-series/global/globe/land_ocean/p12/12/1880-2016.csv
What is the 90% confidence interval for the temperature deviation from the mean?
Solution: Find the mean and the standard deviation for the data (column #2 of the above link)
>> C = DATA(:,2);
>> x = mean(C)
0.0471
>> s = std(C)
0.3275
For 5% tails, you need to go 1.645 deviations left and right of the mean (the infinity row of the t-table, since the sample size is large).
Any given year will be in the range of (-0.4916C < T < 0.5858C) with a probability of 0.9
What is the probability that a given month will be 1 degree Celsius above average (like Jan - May, 2016)?
Take the distance of 1C from the mean in terms of standard deviations:
>> (x - 1) / s
ans =
-2.9096
There is a 0.18% chance that any given month will be at least 1C above average
What is the chance that you'll be 1C above average 4 months in a row?
Assuming these are uncorrelated, it is
p = 0.0018^4
p = 0.0000000000104
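The tail probability and the four-in-a-row result can be checked with the complementary error function, treating the data as normal (the same assumption as the 1.645 value used for the confidence interval). The mean and standard deviation are copied from the notes. The names z, p1 and p4 are illustrative.

```matlab
x = 0.0471;      % mean from the notes
s = 0.3275;      % standard deviation from the notes

z  = (1 - x) / s;                  % 2.9096 deviations above the mean
p1 = 0.5 * erfc(z / sqrt(2))       % normal right-tail area, about 0.0018
p4 = p1^4                          % four uncorrelated events in a row, about 1e-11
```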
Chi-Squared Distribution
A t-test tests the mean. A chi-squared test tests the shape of the distribution.
Example: The following code in Matlab generates a 6-sided die
d6 = ceil(6*rand);
Is this a fair die? To answer this, you need to use a chi-squared test.
The way a chi-squared test works is
You collect a bunch of data
Separate the data into bins (the six numbers in this case).
Count the number of times the data wound up in each bin
Compare it to the expected frequency using the metric
$$ \chi^2 = \sum_i \left( \frac{(n p_i - N_i)^2}{n p_i} \right) $$
where $n$ is the total number of samples, $p_i$ is the expected probability of bin $i$, and $N_i$ is the observed count in bin $i$
Use a chi-squared table to convert this to a probability. A large number means that the data is inconsistent with the assumed distribution.
df is the degrees of freedom (number of bins minus 1)
% is the probability level
The number in the table is the chi-square value
Chi-Squared Table
Probability of rejecting the null hypothesis
http://people.richland.edu/james/lecture/m170/tbl-chi.html
df    99.5%   99%     97.5%   95%     90%     10%    5%     2.5%   1%     0.5%
1     7.88    6.64    5.02    3.84    2.71    0.02   0      0      0      0
2     10.6    9.21    7.38    5.99    4.61    0.21   0.1    0.05   0.02   0.01
3     12.84   11.35   9.35    7.82    6.25    0.58   0.35   0.22   0.12   0.07
4     14.86   13.28   11.14   9.49    7.78    1.06   0.71   0.48   0.3    0.21
5     16.75   15.09   12.83   11.07   9.24    1.61   1.15   0.83   0.55   0.41
6     18.55   16.81   14.45   12.59   10.65   2.2    1.64   1.24   0.87   0.68
7     20.28   18.48   16.01   14.07   12.02   2.83   2.17   1.69   1.24   0.99
8     21.96   20.09   17.54   15.51   13.36   3.49   2.73   2.18   1.65   1.34
9     23.59   21.67   19.02   16.92   14.68   4.17   3.33   2.7    2.09   1.74
10    25.19   23.21   20.48   18.31   15.99   4.87   3.94   3.25   2.56   2.16
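The table entries can be cross-checked in base Matlab. The area to the left of a chi-squared value is the regularized incomplete gamma function, which Matlab provides as gammainc. The helper name chi2cdf_base is hypothetical (chosen to avoid clashing with the Statistics Toolbox function chi2cdf).

```matlab
% Chi-squared CDF from base Matlab: P(X <= chi2) = gammainc(chi2/2, df/2)
chi2cdf_base = @(chi2, df) gammainc(chi2/2, df/2);

df = 5;                              % 6 bins minus 1
pHigh = chi2cdf_base(11.07, df)      % table entry in the 95% column
pLow  = chi2cdf_base(1.15,  df)      % table entry in the 5% column
```

The two results should come out close to 0.95 and 0.05, matching the column headings.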
Example: Fair Die:
Roll a 6-sided die 1200 times
result = zeros(6,1);
for i=1:1200
D6 = ceil( 6 * rand );
result(D6) = result(D6) + 1;
end
result
chi = sum( (result - 200).^2 / 200 )
Set up a table:
Number   Probability (p)   Expected Frequency (np)   Actual Frequency (N)   (np−N)^2 / np
1        1/6               200                       193                    0.2450
2        1/6               200                       184                    1.2800
3        1/6               200                       203                    0.0450
4        1/6               200                       204                    0.0800
5        1/6               200                       206                    0.1800
6        1/6               200                       210                    0.5000
Sum                                                                         2.33
From a chi-squared table with 5 degrees of freedom (6 bins), 2.33 falls between the 10% entry (1.61) and the 90% entry (9.24).
More than the 10% entry means the data probably wasn't fudged. If the data is too perfect, be suspicious.
Less than the 90% entry means there is no reason to claim that Matlab's rand function is biased.
You can also use StatTrek.com
Figure: Chi-squared result from StatTrek.com. p = 0.2 means
the data wasn't fudged (a very small p means be suspicious of the data)
the data is consistent with the assumed distribution (a p close to 1 would suggest the distribution is not uniform)
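The same probability can be computed directly from the counts in the table, using gammainc from base Matlab in place of StatTrek. This is only a sketch. The names N, np, chi, df and p are illustrative.

```matlab
% Observed counts from the fair-die table (1200 rolls)
N  = [193; 184; 203; 204; 206; 210];
np = sum(N) / 6 * ones(6,1);          % expected count per bin (200)

chi = sum((np - N).^2 ./ np)          % 2.33
df  = numel(N) - 1;                   % 5 degrees of freedom
p   = gammainc(chi/2, df/2)           % area to the left, about 0.2
```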
Example: Loaded Die:
Suppose 5% of the time you cheat: the die is forced to be a one. Can you detect this with a chi-squared
test?
result = zeros(6,1);
for i=1:1200
D6 = ceil( 6 * rand );
if (rand < 0.05)
D6 = 1;
end
result(D6) = result(D6) + 1;
end
result
chi = sum( (result - 200).^2 / 200 )
Again, set up a chi-squared table:
Number   Probability (p)   Expected Frequency (np)   Actual Frequency (N)   (np−N)^2 / np
1        1/6               200                       251                    13.0050
2        1/6               200                       165                    6.1250
3        1/6               200                       185                    1.1250
4        1/6               200                       200                    0
5        1/6               200                       201                    0.0050
6        1/6               200                       198                    0.0200
Sum                                                                         20.28
From StatTrek.com, the chi-squared result is 0.999
I am 99.9% certain that this is not a fair die
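To see why the test catches a 5% cheat, it helps to look at the counts you would get on average, with no random scatter. The sketch below is an added illustration and is not part of the original notes. The names cheat, q, Nexp and chiBias are assumptions.

```matlab
n = 1200;                              % number of rolls
cheat = 0.05;                          % fraction of rolls forced to one

% Probability of each face for the loaded die
q = (1 - cheat) / 6 * ones(6,1);
q(1) = q(1) + cheat;

Nexp = n * q                           % average counts: 250 ones, 190 of each other face
np   = n / 6 * ones(6,1);              % fair-die expectation (200 each)

chiBias = sum((np - Nexp).^2 ./ np)    % 15
p = gammainc(chiBias/2, 5/2)           % about 0.99
```

The bias alone contributes 15 to the chi-squared value. Random scatter in real rolls typically adds to this, which is consistent with the 20.28 observed above.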
Example: Fudging the Data:
Instead of rolling the dice 12,000 times, just roll the dice 1200 times and add 1800 to each result making
it look like you rolled the dice 12,000 times. Can you detect the fudged data with a chi-squared test?
Use the results for the fair die rolled 1200 times and add 1800 to each result:
Number   Probability (p)   Expected Frequency (np)   Actual Frequency (N)   (np−N)^2 / np
1        1/6               2,000                     1,993                  0.02
2        1/6               2,000                     1,984                  0.13
3        1/6               2,000                     2,003                  0
4        1/6               2,000                     2,004                  0.01
5        1/6               2,000                     2,006                  0.02
6        1/6               2,000                     2,010                  0.05
Sum                                                                         0.23
From StatTrek, a chi-squared distribution with
5 degrees of freedom (6 bins) and
a chi-squared value of 0.23
fits the expected distribution extremely well. In fact, it fits so well that there's only a 0.001 chance of generating data this good by chance. The data was most likely fudged.
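The effect of padding the counts can be seen by running both data sets through the same formula. This sketch uses the counts from the tables. The names N1, N2, chi1, chi2 and p2 are illustrative, and gammainc stands in for the StatTrek lookup.

```matlab
N1 = [193; 184; 203; 204; 206; 210];   % fair die, 1200 rolls
N2 = N1 + 1800;                        % fudged to look like 12,000 rolls

chi1 = sum((sum(N1)/6 - N1).^2 ./ (sum(N1)/6))   % 2.33
chi2 = sum((sum(N2)/6 - N2).^2 ./ (sum(N2)/6))   % 0.233

% Same deviations, ten times the expected count: chi2 is chi1/10
p2 = gammainc(chi2/2, 5/2)             % about 0.001
```

Genuine data from 12,000 rolls would have larger deviations from the expected count (they grow roughly as the square root of the number of rolls), so the chi-squared value would not shrink like this.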