Dana Joaquin
ECE 3522: Stochastics
Department of Electrical and Computer Engineering, Temple University, Philadelphia, PA 19121
I. PROBLEM STATEMENT
The point of this assignment is to see how the central limit theorem works and to apply it to compute and display
the PDF of sums of multiple random variables. The central limit theorem is checked by comparing its result against a
Gaussian distribution using the root-mean-squared error. In addition, this assignment also explores the
Box-Muller transform.
II. APPROACH AND RESULTS
The assignment called for 100 random variables, each of length 10,000, uniformly distributed over [-1, 1]. This was
done using a for loop, iterating 100 times to generate the random variables, and a custom MATLAB function that
creates a random variable containing values within the specified range.
The custom function, called "updf", implements the random number generator for a uniform
distribution from the MathWorks website:
r = a + (b - a) .* rand(N, 1)
Equation 1. Random Number Generator within Specific Interval
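For readers outside MATLAB, the same interval scaling can be sketched in Python with NumPy (an illustrative analogue of Equation 1, not the report's actual code; the function name updf mirrors the custom MATLAB function):

```python
import numpy as np

def updf(n_samples, a, b):
    # Scale standard uniform [0, 1) draws onto [a, b),
    # mirroring r = a + (b - a) .* rand(N, 1)
    return a + (b - a) * np.random.rand(n_samples)

ss = updf(10_000, -1.0, 1.0)
print(ss.min() >= -1.0, ss.max() < 1.0)  # True True
```

Every draw lands inside the requested interval because rand produces values in [0, 1) and the affine map preserves the endpoints.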
The sum of each random variable for each sample size n was computed using MATLAB's sum function; the mean,
variance, and standard deviation for each n were computed as well. A histogram of the sums of random variables
was plotted using MATLAB's histogram function. Next, each sample-size sum's mean and variance were stored, and
Gaussian fit distributions for n = 1, 10, and 100 were plotted on top of the actual histogram for comparison, as
shown in Figure 1. As n increases, the Gaussian distribution fits the histogram more closely, which makes
sense since more samples are being accumulated and the spread of the data gets larger. Figure 1b provides a better view
of what the sum of random variables, Sn, looks like.
Figure 1a. Histogram of the sum of random variables with Gaussian distributions overlaid for n = 1, 10, and 100.
Figure 1b. Histogram of just the sum Sn's PDF, without the Gaussian distributions of Figure 1a.
Next, the root mean-squared (RMS) error is plotted with respect to n, as displayed in Figure 2. The plot shows
that as the sample size/number of random variables increases, the error between the PDF of Sn and a Gaussian
distribution decreases. This makes sense under the central limit theorem: as the number of random variables increases,
the sum's PDF looks more like a Gaussian one.
Figure 2. Root Mean-Squared Error with respect to the number of samples being computed.
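The error metric itself is simple. A Python sketch of the same root mean-squared error computation (an illustrative analogue of the compute_rmse helper in Section III, not the report's code):

```python
import numpy as np

def compute_rmse(sig_a, sig_b):
    # RMSE between two equal-length arrays:
    # sqrt( (1/n) * sum_i (sig_b_i - sig_a_i)^2 )
    sig_a = np.asarray(sig_a, dtype=float)
    sig_b = np.asarray(sig_b, dtype=float)
    return float(np.sqrt(np.mean((sig_b - sig_a) ** 2)))

print(compute_rmse([0.0, 0.0], [3.0, 4.0]))  # 3.5355... (= sqrt(12.5))
```

In the report, sig_a is the empirical PDF from the histogram and sig_b is the fitted Gaussian evaluated at the same points.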
The Box-Muller technique takes uniformly distributed random variables and transforms them into independent
random variables with a standard normal distribution. The algorithm was taken from the Box-Muller transform page on Wikipedia. The equations used are:
z0 = sqrt(-2 ln U1) cos(2π U2)
z1 = sqrt(-2 ln U1) sin(2π U2)
Z0 and Z1 are independent random variables with standard normal distributions. Because they are independent,
their joint PDF is the product of their individual PDFs. Figure 3 displays the joint PDF of the
computed independent random variables. A custom function, called "boxmuller", was coded to do this.
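A minimal Python sketch of the transform (illustrative only; the report's actual implementation is the MATLAB boxmuller function in Section III) confirms that the outputs behave like standard normals:

```python
import numpy as np

def box_muller(n):
    # Two independent uniform samples; 1 - rand() maps [0, 1) to (0, 1],
    # which avoids taking log(0)
    u1 = 1.0 - np.random.rand(n)
    u2 = np.random.rand(n)
    r = np.sqrt(-2.0 * np.log(u1))
    # z0 and z1 are independent standard-normal samples
    return r * np.cos(2.0 * np.pi * u2), r * np.sin(2.0 * np.pi * u2)

z0, z1 = box_muller(100_000)
# Sample mean should be near 0 and sample std near 1
print(abs(z0.mean()) < 0.05, abs(z0.std() - 1.0) < 0.05)  # True True
```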
Lastly, both the Box-Muller technique and the random-variable sum technique were timed using MATLAB's tic and
toc functions. Table I displays the time results for both methods.
Based on the table, the Box-Muller technique is the faster algorithm by about 0.1 seconds.
Table I. Amount of Time for Each Method

Method                                          Time Elapsed
Box-Muller Technique                            0.174884 seconds
Sum of Uniformly Distributed Random Variables   0.070228 seconds
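The tic/toc comparison can be reproduced outside MATLAB as well. The sketch below (Python, with time.perf_counter standing in for tic/toc; the batch sizes are assumptions, and the relative timings vary by hardware and implementation) times one batch of approximately Gaussian samples from each approach:

```python
import time
import numpy as np

def time_it(fn):
    # Equivalent of MATLAB's tic/toc around a single call
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

n_samples, n_rvs = 10_000, 10

def via_box_muller():
    u1 = 1.0 - np.random.rand(n_samples, n_rvs)
    u2 = np.random.rand(n_samples, n_rvs)
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

def via_sums():
    # Sums of 100 uniforms are approximately Gaussian by the CLT
    # (mean 50, variance 100/12 -- not standardized here)
    return np.random.rand(n_samples, n_rvs, 100).sum(axis=2)

print(f"Box-Muller: {time_it(via_box_muller):.6f} s")
print(f"Sum method: {time_it(via_sums):.6f} s")
```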
III. MATLAB CODE
% CA 8
% Dana Joaquin
function ca08
clear; clc
close all
% parameters
n = 100;      % # of RVs
N = 1e4;      % samples in each RV
minnie = -1;  % start of interval
maxie = 1;    % end of interval
for i=1:n
    ss = updf(N, minnie, maxie);  % get random variable N-long vector
    sssum(i) = sum(ss);           % get sum of random variable vector
    smean(i) = mean(ss)*i;        % estimated mean
    svar(i) = var(ss)*i;          % estimated variance
    sstdev(i) = sqrt(svar(i));    % estimated standard deviation
end
% plot Sn PDF
hlim = length(sssum)/2;
figure(1)
hs = histogram(sssum, 'Normalization', 'pdf');
oh_so_edgy = -hlim:1:hlim;
hs.BinEdges = oh_so_edgy;
sdata = hs.Values;
% plot n = 1, 10, 100 Gaussian Dist.
a = [1 10 100];
hold on
for i=1:3
    x = linspace(-hlim, hlim, length(sdata));
    gfit = normpdf(x, smean(a(i)), sstdev(a(i)));
    plot(x, gfit)
end
legend('Histogram', 'n = 1', 'n=10', 'n=100')
title('Sum of Random Variables PDF')
xlabel('Random Variables')
ylabel('Probability')
% computing root mean-squared error
for i=1:n
    x = linspace(-hlim, hlim, length(sdata));
    gfit = normpdf(x, smean(i), sstdev(i));
    rmserr(i) = compute_rmse(sdata, gfit, n);
end
% plotting RMSE with respect to # random variables
figure(2)
plot(1:n, rmserr)
title('Root Mean Square Error')
xlabel('Samples (n)')
ylabel('Error')
%Box-Muller Function
ss = 1e4;
rvs = 10;
figure(3)
boxmuller(ss, rvs);
title('Box-Muller PDF')
xlabel('Samples')
ylabel('Probability')
end
% function name: compute_rmse
% coded by Dana Joaquin
% input argument(s):
%    (1) sig_a: signal data (array)
%    (2) sig_b: signal data (array)
%    (3) sig_length: length of the signal data (scalar)
% output argument(s):
%    (1) rmse_val: scalar value of root mean-squared error
% objective:
%    This function computes the root mean-squared error of the two
%    data arrays brought in by sig_a and sig_b, using the formula:
%
%       MSE  = (1/n) * sum_{i=1}^{n} (Yhat_i - Y_i)^2
%       RMSE = sqrt(MSE)
%
%    where n is the number of samples (sig_a and sig_b must be the
%    same size), Yhat are the predicted values (sig_b), and Y are
%    the true values (sig_a).
%
function rmse_val = compute_rmse(sig_a, sig_b, sig_length)
n = sig_length;
for i=1:n
data_array(i) = (sig_b(i) - sig_a(i))^2;
end
rmse = sqrt((1/n)*(sum(data_array)));
rmse_val = rmse;
end
% function name: updf
% coded by Dana Joaquin (equation taken from MathWorks website though)
% input argument(s):
%    (1) samples: amount of random numbers desired
%    (2) minum: min. range value
%    (3) maxum: max. range value
% output argument(s):
%    (1) unipdf: a vector of uniformly distributed random variables
%        within the range of minum < x < maxum
% objective:
%    This function creates a vector of uniformly distributed random
%    variables.
%
function unipdf = updf(samples, minum, maxum)
unipdf = minum + (maxum-minum).*rand(samples,1);
end
% function name: boxmuller
% coded by Dana Joaquin
% input argument(s):
%    (1) samples: length of random variable vector
%    (2) rvs: number of random variables
% output argument(s):
%    (1) histogram plot of the data
% objective:
%    This function computes and plots the Gaussian distribution
%    of two independent random variables using the Box-Muller
%    transform. The equations used are taken from the Wikipedia
%    page about the Box-Muller transform:
%    http://en.wikipedia.org/wiki/Box%E2%80%93Muller_transfor
function boxmuller(samples, rvs)
for i=1:rvs
    % 2 uniformly distributed random variables
    x = rand(samples, 1);
    y = rand(samples, 1);
    a(:, i) = sqrt(-2.*log(x));
    b(:, i) = 2*pi.*y;
    % computing 2 independent random variables
    z0(:, i) = a(:, i).*cos(b(:, i));
    z1(:, i) = a(:, i).*sin(b(:, i));
    % computing joint pdf
    zz(:, i) = rvs.*(z0(:, i).*z1(:, i));
end
% plotting pdf
histogram(zz, 'Normalization', 'pdf')
end
IV. CONCLUSIONS
The central limit theorem says that as more random variables are summed together, the distribution of their sum
appears more Gaussian, no matter what the original probability distribution looked like. Therefore, if
100 random variables all had an exponential PDF, when added together the sum's distribution would still start to look
Gaussian. A good rule of thumb from the textbook is that once the number of samples/random variables exceeds
30, the sum starts to look Gaussian.
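The exponential example above can be checked numerically. A quick Python sketch (an illustration, not part of the original assignment) sums exponential variables and confirms the statistics the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 50_000

# Each row: the sum of n exponential(1) random variables.
# By the CLT the sums are approximately Gaussian with mean n, variance n.
sums = rng.exponential(scale=1.0, size=(trials, n)).sum(axis=1)
print(sums.mean())  # close to n = 100
print(sums.var())   # close to n = 100

# Skewness of the sum is about 2/sqrt(n) = 0.2, far below the
# exponential's skewness of 2 -- i.e., much more Gaussian
skew = ((sums - sums.mean()) ** 3).mean() / sums.std() ** 3
print(skew)
```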
For the first part of the assignment, 100 random variables, each with a uniform distribution, were summed together.
Based on the theorem, their sum's distribution should appear Gaussian. In the second part of the assignment, Sn's
distribution was plotted; based on the figure, it slowly takes on the Gaussian shape, with the histogram starting
to "center" itself on the apparent mean value. Plotting the Gaussian distributions for n = 1, 10, and 100 compares
the actual data against how close the PDF of Sn is to a Gaussian. At n = 1, with only one sample, the fit is nowhere
near the actual distribution because it accounts for the probability of just one sample. As the number of samples grows,
the fit gets closer to the probability values of the actual PDF, and by n = 100 the Gaussian distribution nearly
matches them. As the sample size increases further, that trend continues until the actual PDF and the Gaussian PDF
have very little error between them, which is exactly what the central limit theorem says: the
more random variables being added, the more their sum's PDF takes on a Gaussian shape. The third part of the
assignment asked for the root mean-squared error between the actual fit and a Gaussian fit with respect to the number
of random variables/samples, n. Based on the results in Figure 2, as the number of random variables and samples
increased, the error dropped rapidly toward zero. This plot helps reiterate the central limit theorem.
The Box-Muller technique accomplishes the same thing the central limit theorem does, but it starts from uniformly
distributed random variables on the interval between 0 and 1. The technique is well suited to applications restricted
to those limits because it computes the Gaussian PDF much faster than summing multiple random variables, but its
input requirements make it less generally applicable. The sum method takes a little longer than
Box-Muller, but it can be used with any range of values.