Download CM2104: Computational Mathematics Laboratory Worksheet (Week

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Statistics wikipedia , lookup

Probability interpretations wikipedia , lookup

Probability wikipedia , lookup

Foundations of statistics wikipedia , lookup

History of statistics wikipedia , lookup

Transcript
CM2104: Computational Mathematics
Laboratory Worksheet (Week 10 and 11)
Prof. D. Marshall
Aims and Objectives
After working through this worksheet you should be familiar with:
• Basic Discrete Probability Theory: Probability, Conditional Probability, Bayes’ Theorem
• Discrete Random Variables: Probability distributions, Expectation and Variance,
Uniform, Binomial, Geometric and Poisson Distributions.
• Estimators: Maximum likelihood estimation, Bias, Bayesian Inference
• Implementation of the above theory in MATLAB.
This is a combined worksheet for all the Statistics/Probability lectures, as
week 11 will be used to demonstrate the CM2104 coursework. Some of the
problems here may not be covered until a lecture after the week 10 lab class.
This lab sheet focuses more on MATLAB programming examples with a couple theoretical questions. A lot more theoretical questions are available in an additional work sheet
available online:
http://www.cs.cf.ac.uk/Dave/CM2104/Labs/Week 10/LabWeek10Additional.pdf
Solutions for both worksheets will be made available on Learning Central after the lab
class.
None of the work here is part of the assessed coursework for this module.
1
MATLAB Statistics/Discrete Probability Programming Problems
Write MATLAB code to solve the following problems:
1. Output the sample space for rolling two n sided dice.
SOLUTION:
% Throwing 2 n sided dice probability Distribution
n =8;
%n = 12;
%n = 36;
samples=1:n;
% Make Sample Space S = {(i,j)|1 <= i,j <= n}
sample_space = repmat(samples,n,1) + repmat(samples’,1,n);
% Count Unique elements
[unique_elements, unique_idx, element_idx] = unique(sample_space);
length(element_idx);
count = hist(element_idx,unique(element_idx));
px = count/length(element_idx);
% hist only deals with number so use ch
for i = 1:length(unique_elements);
fprintf(’px(%d) = %1.4f\n’,unique_elements(i),px(i));
end
%
%
%
%
%
%
This is an enumerated solution --- it DOES NOT scale well for additional
dice, for example. See Next Lab Class Exercise and also solution code,
ndicepx.m
Note that for two or more dice the number of combinations equals
the sum of combinations one column to the left,
starting from one row higher to NumFace+1 rows higher
2
Code at
http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/nsidedpx.m
2. Output the sample space for rolling m, n sided dice.
SOLUTION:
% Throwing N m sided dice probability Distribution
NumFace =6; % Number of sides on dice
NumDice = 5; % Number of Dice
% Note that for two or more dice the number of combinations equals
% the sum of combinations one column to the left,
% starting from one row higher to NumFace+1 rows higher
samples=1:NumFace;
% Make a Count Array for each dice column count for each dice, each row element is numb
% NumDice = NumFace*NumDice,
count = zeros(NumFace*NumDice,NumDice);
% initialise for single dice;
count(1:NumFace, 1) = 1;
for i = 2:NumDice
for j = i:i*NumFace
lowindex = j - NumFace;
if (lowindex < 1) lowindex =1; end% clip so we do not index beyond lowest array
count (j,i) = sum(count(lowindex:j-1,i-1));
end
end
px = count(:,NumDice)/(NumFace^NumDice);
for i = NumDice:NumFace*NumDice;
fprintf(’px(%d) = %1.4f\n’,i,px(i));
3
end
Code at
http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/ndicepx.m
3. Eight students need to complete a group assignment. There are four students needed
for programming, three students to do a requirement analysis, and one student to
write the report. In how many ways can the eight students be assigned a task
(programming / requirement analysis / report)?
SOLUTION:
my_string = ’PPPPAAAR’;
% Find unique characters
[unique_chars, unique_idx, char_idx] = unique(my_string);
% Count the occurence of each unique character
% hist only deals with number so use char_idx
count = hist(char_idx,unique(char_idx));
% number of distinguishable permutations
dist_perms = factorial(length(my_string))/prod(factorial(count))
Code at
http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/LabWk9 Ex3.m
4. Write MATLAB code to output the distinguishable permutations of a string. (Hint:
The solution should be one line of MATLAB, see doc/help unique.)
SOLUTION:
unique(perms(my_string),’rows’)
Code at
4
5. Given a dice biased such that
p(1) = p(2) = p(3) = p(4) = p(5) =
1
10
p(6) =
1
2
(a) When throwing 3 dice, What is the probability of getting 1 one, 1 three, and 1
six.
SOLUTION:
p = [1/10
1/10 1/10 1/10 1/10 1/10 1/2]
x = [1,0,1,0,0,1];
% Compute the pdf of the distribution.
Y = mnpdf(x,p)
(b) When throwing 6 dice, What is the probability of getting 3 ones, 1 three, 1 four
and 1 six.
SOLUTION:
This is a multinomial distribution problem.
p = [1/10
1/10 1/10 1/10 1/10 1/10 1/2]
x = [3,0,1,1,0,1];
% Compute the pdf of the distribution.
Y = mnpdf(x,p)
Code at
http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/LabWk9 Ex5.m
6. The probability that a person supports Cardiff City football club in Cardiff is 0.6.
Find the probability that out of 8 randomly selected people in Cardiff there are
(a) Exactly three supporters of Cardiff City
SOLUTION:
Binomial Distribution Problem:
p = 0.6;
Prob3SupportCardiff = binopdf(0,200,0.02)
5
(b) More than 5 who support Cardiff City
SOLUTION:
% We require P(X>1) = P(X=6) + P(X=7) + P(X=8)
ProbMorethan5 = binopdf(6,8,0.6) + binopdf(7,8,0.6) + binopdf(8,8,0.6)
Code at
http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/LabWk9 Ex6.m
7. A coin is biased so that the probability of obtaining a head is 0.6. If X is the random
variable for the number of tosses up to and including the first head find:
(a) P (X ≤ 4)
SOLUTION:
Geometric Distribution Problem:
We need some more theory on the Geometric Distribution
In general for a Geometric Distribution then P (X > r) = pr where p = q-1.
Now P (X = r) = pr−1 q
n
1 2
3
4
...
Probability distribution is:
2
3
P (X = n) q pq p q p q . . .
P (X ≤ r) = q + pq + p2 q + . . . + pr−1 q
= q(1 + p + . . . + pr−1 )
1 − pr
=q
1−p
1 − pr
=q
q
Since
n−1
X
pi =
i=0
1 − pn
(see formulae sheet)
1−p
Hence P (X ≤ r) = 1 − pr
Now for P (X > r)
6
P (X > r) = 1 − P (X ≤ r)
= 1 − (1 − pr )
= pr
q = 0.6
p = 1 - q;
order = 4;
Plessthanequal4 = 1- power(p,order);
(b) P (X > 5)
SOLUTION:
q = 0.6
p = 1 - q;
order = 5;
Pgreaterthan5
= power(p,order);
8. Find the probability that at least two double sixes are obtained when two dice are
thrown 90 times.
SOLUTION:
Probability of a double six is: P (double 6) =
11
66
=
1
36
1
The expected number of double sixes is given by X ∼ B(90, 36
which for Binomial
distribution:
E[X] = np = 90 ·
1
36
= 2.5
Now we can use the Poisson approximation to work out the probability:
7
P ∼ P ois(2.5), so
2.5
P (X = x) = e−2.5
, x = 0, 1, . . .
x!
N ow
P (X ≥ 2) = 1 − (P (X = 0) + P (X − 1))
= 1 − (e−2.5 + e−2.5 2.5
= 1 − 3.5e−2.5
p = 1/36;
n = 90;
Exdouble6 = binostat(n,p);
PMoreThanTwo = 1 - (poisspdf(0,Exdouble6) + poisspdf(1,Exdouble6))
Code at
http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/LabWk9 Ex8.m
9. Using the Spam Filter code from the lecture
(http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Spam Filter.zip) augment
the functionality of the Spam filter to
(a) account for stop words. Amend the sentence generating code to generate stop
words as well as spam/ham words to test you code.
SOLUTION:
There are a few possible solutions here.
• Essentially it involves the removal of a list of words from the sentence.
• One way based on the lecture example would be to find the unique() occurrences of all words and then delete the stop count items
However below is an elegant solutions in MATLAB.
stopwords_cellstring={’A’,’a’,’the’,’and’, ’in’, ’of’};
% for a source of all stopwords see:
% http://norm.al/2009/04/14/list-of-english-stop-words/
8
str1 = ’A lot of the spam in the work is in email,
of dubious origin and a lot of hassle.’
split1 = strsplit(str1,’ ’);
out_str1 = strjoin(split1(~ismember(split1,stopwords_cellstring)),’ ’)
This works as follows:
• Create a cell array list of all stopwords.
• Use strsplit() to split the sentence, as per Spam Filter lecture example.
• Use strjoin() to rebuild the sentence but only with the words that are
not members of the stopwords array, using ismember() — see doc/help
strjoin and ismember.
Code at
http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/stopwords.m
(b) accommodate n-gram sequence of words, for a variable number of sequence
lengths. Experiment with the classification success rate of different length ngrams on the test data.
SOLUTION:
%%% Example code to create a n-gram bag of words model
n= 5;
%%%% Read in training data
sentences = Read_TextFile(’spam_train.txt’);
for i = 1:length(sentences)
wordpersentence = strsplit(sentences{i},’ ’);
for j = 1:length(wordpersentence)-n+1
ngrampersentence{i,j} = {strjoin(wordpersentence(j:j+n-1))};
end
The code works as follows:
• Make sure that Read TextFile.m is in MATLAB path.
• Simply split the sentence into words, strsplit() — as seen in Spam Filter
example.
• Assemble n-gram by strjoin() words from j to n-1 looping through the
split sentence.
9
Code at
http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/ngram.m
Discrete Random Variables
1. Prove E[a] = a where a is any constant.
SOLUTION:
E[a] =
X
aP (X = x)
all x
X
=a
P (X = x)
all x
X
= a since
P (X = x) = 1
all x
2. Prove E[aX] = aE[X] where a is any constant.
SOLUTION:
E[a] =
X
axP (X = x)
all x
X
=a
xP (X = x)
all x
= aE[X]
3. Prove V ar[a] = 0 where a is any constant.
SOLUTION:
V ar[a] = E[a2 ] − E 2 [a]
= a2 − a2
=0
10
4. Prove that the variance of the geometric distribution is:
V ar[X] =
1−q
q2
SOLUTION:
Proof:
Recall (from Lecture notes)
Let p = q − 1
n
P (X = n)
The probability distribution maybe written as:
1
q
2
pq
3
p2 q
Generally,
V ar[X] = E[X 2 ] − E 2 [X]
We know E[X] = 1q , so
E 2 [X] =
1
q2
Let’s now calculate E[X 2 ]:
E[X 2 ] =
∞
X
n2 · P (X = n)
n=1
= q + 4pq + 9p2 q + 16p3 q + . . .
= q(1 + 4p + 9p2 + 16p3 + . . .)
= q(1 + 2p + 3p2 + 4p3 + . . .
2p + 6p2 + 12p3 + . . .)
From proof of E[X] we noted that via the generalised Binomial Theorem:
n
X
i=0
1
=
(1 − x)s
s+n−1 n
s+n−1 n
x ≡
x
n
s−1
11
4
p3 q
...
...
So with s = 2, x = p we see that:
(1 − p)−2 = 1 + 2p + 3p2 + 4p3
so
E[X 2 ] = q(((1 − p)−2 ) + 2p(1 + 3q + 6q 2 + . . .))
E[X 2 ] = q(((1 − p)−2 ) + 2p((1 − p)−3 )
2p
1
+ 3
=q
q2
q
1
2p
E[X 2 ] = + 2
q
q
Now
V ar[X] = E[X 2 ] − E 2 [X]
1 2p
1
= − 2 − 2
q
q
q
q + 2p − 1
since p = 1 − q
=
q2
V ar[X] =
1−q
q2
Additional Exercises
Additional Exercises are available at:
http://www.cs.cf.ac.uk/Dave/CM2104/Labs/Week 10/LabWeek10Additional.pdf
12