CM2104: Computational Mathematics
Laboratory Worksheet (Week 10 and 11)
Prof. D. Marshall

Aims and Objectives

After working through this worksheet you should be familiar with:

• Basic Discrete Probability Theory: Probability, Conditional Probability, Bayes' Theorem
• Discrete Random Variables: Probability distributions, Expectation and Variance, Uniform, Binomial, Geometric and Poisson Distributions
• Estimators: Maximum likelihood estimation, Bias, Bayesian Inference
• Implementation of the above theory in MATLAB

This is a combined worksheet for all the Statistics/Probability lectures, as week 11 will be used to demonstrate the CM2104 coursework. Some of the problems here may not be covered until a lecture after the week 10 lab class.

This lab sheet focuses more on MATLAB programming examples, with a couple of theoretical questions. Many more theoretical questions are available in an additional worksheet online:

http://www.cs.cf.ac.uk/Dave/CM2104/Labs/Week 10/LabWeek10Additional.pdf

Solutions for both worksheets will be made available on Learning Central after the lab class. None of the work here is part of the assessed coursework for this module.

1 MATLAB Statistics/Discrete Probability Programming Problems

Write MATLAB code to solve the following problems:

1. Output the sample space for rolling two n-sided dice.

SOLUTION:

    % Throwing 2 n-sided dice: probability distribution of the total
    n = 8; % n = 12; % n = 36;
    samples = 1:n;

    % Make Sample Space S = {(i,j) | 1 <= i,j <= n}; each entry holds the total i + j
    sample_space = repmat(samples,n,1) + repmat(samples',1,n);

    % Count unique elements
    [unique_elements, unique_idx, element_idx] = unique(sample_space);
    % Histogram the index vector returned by unique() to count each total
    count = hist(element_idx,unique(element_idx));
    px = count/length(element_idx);

    for i = 1:length(unique_elements)
        fprintf('px(%d) = %1.4f\n',unique_elements(i),px(i));
    end

    % This is an enumerated solution: it DOES NOT scale well for additional dice, for example.
    % See the next lab class exercise and also the solution code, ndicepx.m.
    % Note that for two or more dice the number of combinations equals
    % the sum of combinations one column to the left,
    % starting from one row higher to NumFace rows higher.

Code at http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/nsidedpx.m

2. Output the sample space for rolling m n-sided dice.

SOLUTION:

    % Throwing NumDice dice, each with NumFace sides: probability distribution of the total
    NumFace = 6; % Number of sides on dice
    NumDice = 5; % Number of dice

    % Note that for two or more dice the number of combinations equals
    % the sum of combinations one column to the left,
    % starting from one row higher to NumFace rows higher.
    samples = 1:NumFace;

    % Make a count array: column i is for i dice, and row j holds the number
    % of ways of throwing a total of j. The largest total is NumFace*NumDice.
    count = zeros(NumFace*NumDice,NumDice);

    % Initialise for a single dice
    count(1:NumFace,1) = 1;

    for i = 2:NumDice
        for j = i:i*NumFace
            lowindex = j - NumFace;
            if (lowindex < 1)
                lowindex = 1; % clip so we do not index below the first row of the array
            end
            count(j,i) = sum(count(lowindex:j-1,i-1));
        end
    end

    px = count(:,NumDice)/(NumFace^NumDice);

    for i = NumDice:NumFace*NumDice
        fprintf('px(%d) = %1.4f\n',i,px(i));
    end

Code at http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/ndicepx.m

3. Eight students need to complete a group assignment. Four students are needed for programming, three students for a requirement analysis, and one student to write the report. In how many ways can the eight students be assigned a task (programming / requirement analysis / report)?
SOLUTION:

    my_string = 'PPPPAAAR';

    % Find unique characters
    [unique_chars, unique_idx, char_idx] = unique(my_string);

    % Count the occurrence of each unique character
    % hist only deals with numbers, so use char_idx
    count = hist(char_idx,unique(char_idx));

    % Number of distinguishable permutations
    dist_perms = factorial(length(my_string))/prod(factorial(count))

Code at http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/LabWk9 Ex3.m

4. Write MATLAB code to output the distinguishable permutations of a string. (Hint: The solution should be one line of MATLAB, see doc/help unique.)

SOLUTION:

    unique(perms(my_string),'rows')

Code at

5. Given a dice biased such that $p(1) = p(2) = p(3) = p(4) = p(5) = \frac{1}{10}$ and $p(6) = \frac{1}{2}$:

(a) When throwing 3 dice, what is the probability of getting 1 one, 1 three, and 1 six?

SOLUTION:

    % Probabilities of faces 1 to 6 (they must sum to 1)
    p = [1/10 1/10 1/10 1/10 1/10 1/2]
    % Number of times each face is required
    x = [1,0,1,0,0,1];
    % Compute the pdf of the distribution.
    Y = mnpdf(x,p)

(b) When throwing 6 dice, what is the probability of getting 3 ones, 1 three, 1 four and 1 six?

SOLUTION: This is a multinomial distribution problem.

    p = [1/10 1/10 1/10 1/10 1/10 1/2]
    x = [3,0,1,1,0,1];
    % Compute the pdf of the distribution.
    Y = mnpdf(x,p)

Code at http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/LabWk9 Ex5.m

6. The probability that a person in Cardiff supports Cardiff City football club is 0.6. Find the probability that out of 8 randomly selected people in Cardiff there are

(a) Exactly three supporters of Cardiff City

SOLUTION: Binomial Distribution Problem:

    p = 0.6;
    % P(X = 3) for X ~ B(8, 0.6)
    Prob3SupportCardiff = binopdf(3,8,p)

(b) More than 5 who support Cardiff City

SOLUTION:

    % We require P(X > 5) = P(X = 6) + P(X = 7) + P(X = 8)
    ProbMorethan5 = binopdf(6,8,0.6) + binopdf(7,8,0.6) + binopdf(8,8,0.6)

Code at http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/LabWk9 Ex6.m

7. A coin is biased so that the probability of obtaining a head is 0.6.
If $X$ is the random variable for the number of tosses up to and including the first head, find:

(a) $P(X \le 4)$

SOLUTION: Geometric Distribution Problem: We need some more theory on the Geometric Distribution.

In general, for a Geometric Distribution with success probability $q$, $P(X > r) = p^r$ where $p = 1 - q$.

Now $P(X = r) = p^{r-1} q$, so the probability distribution is:

\[
\begin{array}{c|ccccc}
n & 1 & 2 & 3 & 4 & \dots \\ \hline
P(X = n) & q & pq & p^2 q & p^3 q & \dots
\end{array}
\]

\[
\begin{aligned}
P(X \le r) &= q + pq + p^2 q + \dots + p^{r-1} q \\
&= q(1 + p + \dots + p^{r-1}) \\
&= q \, \frac{1 - p^r}{1 - p} \\
&= q \, \frac{1 - p^r}{q}
\end{aligned}
\]

since

\[
\sum_{i=0}^{n-1} p^i = \frac{1 - p^n}{1 - p} \quad \text{(see formulae sheet)}
\]

Hence

\[
P(X \le r) = 1 - p^r
\]

Now for $P(X > r)$:

\[
P(X > r) = 1 - P(X \le r) = 1 - (1 - p^r) = p^r
\]

    q = 0.6;
    p = 1 - q;
    order = 4;
    Plessthanequal4 = 1 - power(p,order)

(b) $P(X > 5)$

SOLUTION:

    q = 0.6;
    p = 1 - q;
    order = 5;
    Pgreaterthan5 = power(p,order)

8. Find the probability that at least two double sixes are obtained when two dice are thrown 90 times.

SOLUTION: The probability of a double six is:

\[
P(\text{double 6}) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}
\]

The number of double sixes is $X \sim B(90, \frac{1}{36})$, so the expected number of double sixes, for a Binomial distribution, is:

\[
E[X] = np = 90 \cdot \frac{1}{36} = 2.5
\]

Now we can use the Poisson approximation to work out the probability: $X \sim \mathrm{Pois}(2.5)$, so

\[
P(X = x) = e^{-2.5} \, \frac{2.5^x}{x!}, \quad x = 0, 1, \dots
\]

Now

\[
\begin{aligned}
P(X \ge 2) &= 1 - (P(X = 0) + P(X = 1)) \\
&= 1 - (e^{-2.5} + 2.5 \, e^{-2.5}) \\
&= 1 - 3.5 \, e^{-2.5}
\end{aligned}
\]

    p = 1/36;
    n = 90;
    % Mean of B(n,p), used as the Poisson parameter
    Exdouble6 = binostat(n,p);
    % P(X >= 2) = 1 - (P(X = 0) + P(X = 1))
    PMoreThanTwo = 1 - (poisspdf(0,Exdouble6) + poisspdf(1,Exdouble6))

Code at http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/LabWk9 Ex8.m

9. Using the Spam Filter code from the lecture (http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Spam Filter.zip), augment the functionality of the Spam filter to

(a) account for stop words. Amend the sentence generating code to generate stop words as well as spam/ham words to test your code.

SOLUTION: There are a few possible solutions here.

• Essentially it involves the removal of a list of words from the sentence.
• One way, based on the lecture example, would be to find the unique() occurrences of all words and then delete the counts for the stop words.

However, below is an elegant solution in MATLAB.

    stopwords_cellstring = {'A','a','the','and','in','of'};
    % for a source of all stopwords see:
    % http://norm.al/2009/04/14/list-of-english-stop-words/

    str1 = 'A lot of the spam in the work is in email, of dubious origin and a lot of hassle.'
    split1 = strsplit(str1,' ');
    out_str1 = strjoin(split1(~ismember(split1,stopwords_cellstring)),' ')

This works as follows:

• Create a cell array list of all stopwords.
• Use strsplit() to split the sentence, as per the Spam Filter lecture example.
• Use strjoin() to rebuild the sentence but only with the words that are not members of the stopwords array, using ismember(); see doc/help strjoin and ismember.

Code at http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/stopwords.m

(b) accommodate n-gram sequences of words, for a variable number of sequence lengths. Experiment with the classification success rate of different length n-grams on the test data.

SOLUTION:

    %%% Example code to create an n-gram bag of words model
    n = 5;

    %%%% Read in training data
    sentences = Read_TextFile('spam_train.txt');

    for i = 1:length(sentences)
        wordpersentence = strsplit(sentences{i},' ');
        for j = 1:length(wordpersentence)-n+1
            % Join words j to j+n-1 into a single n-gram string
            ngrampersentence{i,j} = {strjoin(wordpersentence(j:j+n-1))};
        end
    end

The code works as follows:

• Make sure that Read_TextFile.m is in the MATLAB path.
• Simply split the sentence into words with strsplit(), as seen in the Spam Filter example.
• Assemble each n-gram by using strjoin() on words j to j+n-1, looping through the split sentence.

Code at http://www.cs.cf.ac.uk/Dave/CM2104/MATLAB/Statistics/Statistics Exercise Solutions/ngram.m

Discrete Random Variables

1. Prove $E[a] = a$ where $a$ is any constant.

SOLUTION:

\[
\begin{aligned}
E[a] &= \sum_{\text{all } x} a \, P(X = x) \\
&= a \sum_{\text{all } x} P(X = x) \\
&= a \quad \text{since } \sum_{\text{all } x} P(X = x) = 1
\end{aligned}
\]

2. Prove $E[aX] = aE[X]$ where $a$ is any constant.
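SOLUTION:

\[
\begin{aligned}
E[aX] &= \sum_{\text{all } x} a x \, P(X = x) \\
&= a \sum_{\text{all } x} x \, P(X = x) \\
&= aE[X]
\end{aligned}
\]

3. Prove $Var[a] = 0$ where $a$ is any constant.

SOLUTION:

\[
\begin{aligned}
Var[a] &= E[a^2] - E^2[a] \\
&= a^2 - a^2 \\
&= 0
\end{aligned}
\]

4. Prove that the variance of the geometric distribution is:

\[
Var[X] = \frac{1 - q}{q^2}
\]

SOLUTION: Proof: Recall (from the lecture notes) that with $p = 1 - q$ the probability distribution may be written as:

\[
\begin{array}{c|ccccc}
n & 1 & 2 & 3 & 4 & \dots \\ \hline
P(X = n) & q & pq & p^2 q & p^3 q & \dots
\end{array}
\]

Generally, $Var[X] = E[X^2] - E^2[X]$.

We know $E[X] = \frac{1}{q}$, so $E^2[X] = \frac{1}{q^2}$.

Let's now calculate $E[X^2]$:

\[
\begin{aligned}
E[X^2] &= \sum_{n=1}^{\infty} n^2 \cdot P(X = n) \\
&= q + 4pq + 9p^2 q + 16p^3 q + \dots \\
&= q(1 + 4p + 9p^2 + 16p^3 + \dots) \\
&= q\big((1 + 2p + 3p^2 + 4p^3 + \dots) + (2p + 6p^2 + 12p^3 + \dots)\big)
\end{aligned}
\]

From the proof of $E[X]$ we noted that, via the generalised Binomial Theorem:

\[
\frac{1}{(1 - x)^s} = \sum_{n=0}^{\infty} \binom{s+n-1}{n} x^n \equiv \sum_{n=0}^{\infty} \binom{s+n-1}{s-1} x^n
\]

So with $s = 2$, $x = p$ we see that:

\[
(1 - p)^{-2} = 1 + 2p + 3p^2 + 4p^3 + \dots
\]

and with $s = 3$, $x = p$:

\[
(1 - p)^{-3} = 1 + 3p + 6p^2 + \dots
\]

so

\[
\begin{aligned}
E[X^2] &= q\big((1 - p)^{-2} + 2p(1 + 3p + 6p^2 + \dots)\big) \\
&= q\big((1 - p)^{-2} + 2p(1 - p)^{-3}\big) \\
&= q\left(\frac{1}{q^2} + \frac{2p}{q^3}\right) \\
&= \frac{1}{q} + \frac{2p}{q^2}
\end{aligned}
\]

Now

\[
\begin{aligned}
Var[X] &= E[X^2] - E^2[X] \\
&= \frac{1}{q} + \frac{2p}{q^2} - \frac{1}{q^2} \\
&= \frac{q + 2p - 1}{q^2} \\
&= \frac{1 - q}{q^2} \quad \text{since } p = 1 - q
\end{aligned}
\]

Additional Exercises

Additional Exercises are available at:

http://www.cs.cf.ac.uk/Dave/CM2104/Labs/Week 10/LabWeek10Additional.pdf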
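
As a numerical sanity check on the geometric variance result proved in Exercise 4 of Discrete Random Variables, the following is a minimal sketch that sums a truncated version of the series for E[X] and E[X^2] and compares the outcome with (1 - q)/q^2. The value q = 0.6 is taken from the biased coin problem; the truncation point N = 200 and all variable names are assumptions chosen for illustration, not part of the original solutions.

```matlab
% Numerical check of Var[X] = (1 - q)/q^2 for the geometric distribution.
% Assumptions: q = 0.6 (as in the biased coin problem) and a series
% truncated at N = 200 terms, where the remaining tail is negligible.
q = 0.6;
p = 1 - q;
N = 200;

n   = 1:N;               % values taken by X
pmf = q * p.^(n - 1);    % P(X = n) = p^(n-1) q

EX   = sum(n .* pmf);    % E[X], should be close to 1/q
EX2  = sum(n.^2 .* pmf); % E[X^2], should be close to 1/q + 2p/q^2
VarX = EX2 - EX^2;       % Var[X] = E[X^2] - E^2[X]

fprintf('Numerical Var[X] = %1.4f\n', VarX);
fprintf('(1 - q)/q^2      = %1.4f\n', (1 - q)/q^2);
```

Both printed values should agree to four decimal places. Changing q to another value in (0, 1) gives a quick check of the formula for other coins, though values of q close to 0 would need a larger N for the truncated sums to be accurate.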