Statistical NLP Course for Master in Computational Linguistics, 2nd Year, 2016-2017
Diana Trandabăț

Intro to probabilities
• Probability deals with prediction:
– Which word will follow in this ...?
– How can parses for a sentence be ordered?
– Which meaning is more likely?
– Which grammar is more linguistically plausible?
– Seeing the phrase "more lies ahead", how likely is it that "lies" is a noun?
– Seeing "Le chien est noir", how likely is it that the correct translation is "The dog is black"?
• Any rational decision can be described probabilistically.

Notations
• Experiment (or trial)
– a repeatable process by which observations are made
– e.g. tossing 3 coins
• We observe a basic outcome from the sample space Ω, the set of all possible basic outcomes.
• Examples of sample spaces:
– one coin toss: Ω = {H, T}
– three coin tosses: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– part of speech of a word: Ω = {N, V, Adj, ...}
– next word in a Shakespeare play: |Ω| = size of the vocabulary
– number of words in your MSc thesis: Ω = {0, 1, ..., ∞}

Notation
• An event A is a set of basic outcomes, i.e., a subset of the sample space Ω.
Example: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
• A = Ω is the certain event: P(A = Ω) = 1
• A = ∅ is the impossible event: P(A = ∅) = 0
• For "not A", we write Ā.

Intro to probabilities
• A coin is tossed 3 times. What is the likelihood of 2 heads?
– Experiment: toss a coin three times
– Sample space: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– Event: the basic outcomes that have exactly 2 H's: A = {THH, HTH, HHT}
– The likelihood of 2 heads is 3 out of 8 possible outcomes: P(A) = 3/8

Probability distribution
• A probability distribution is an assignment of probabilities to a set of outcomes.
– A uniform distribution assigns the same probability to all outcomes (e.g. a fair coin).
– A Gaussian distribution assigns a bell curve over outcomes.
– Many others exist.
– Uniform and Gaussian distributions are popular in SNLP.
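The coin-toss example above can be checked by brute-force enumeration. A minimal sketch in Python (the variable names are my own, not from the slides):

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin tosses: all sequences over {H, T}
omega = list(product("HT", repeat=3))          # 8 basic outcomes

# Event A: basic outcomes with exactly two heads
A = [o for o in omega if o.count("H") == 2]

# Uniform distribution: each basic outcome has probability 1/|omega|
p_A = Fraction(len(A), len(omega))

print(sorted("".join(o) for o in A))           # ['HHT', 'HTH', 'THH']
print(p_A)                                     # 3/8
```

Because the distribution is uniform, P(A) reduces to counting outcomes: |A| / |Ω| = 3/8.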
Joint probabilities

Probabilities as sets
• For events A and B with intersection A∩B:
P(A|B) = P(A∩B) / P(B)
P(A∩B) = P(A|B) * P(B)
P(B|A) = P(B∩A) / P(A)
P(B∩A) = P(A∩B) = P(B|A) * P(A) = P(A|B) * P(B)
(Multiplication rule)

• Splitting A over B and its complement B̄:
P(A) = P(A∩B) + P(A∩B̄)
P(A) = P(A|B) * P(B) + P(A|B̄) * P(B̄)
(Additivity rule)

Bayes' Theorem
• Bayes' Theorem lets us swap the order of dependence between events.
• We saw that P(A|B) = P(A,B) / P(B)
• Bayes' Theorem:
P(A|B) = P(B|A) * P(A) / P(B)

Independent events
• Two events are independent if: P(A,B) = P(A) * P(B)
• Consider a fair die. Intuitively, each side (1, 2, 3, 4, 5, 6) has probability 1/6.
• Consider the events X = "the number on the die is divisible by 2" and Y = "the number is divisible by 3".
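The conditional-probability, Bayes, and additivity relations above can be checked numerically on the fair-die events just introduced. A minimal sketch, assuming a uniform distribution over the six faces (the helper `p` is my own, not from the slides):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                     # fair die, uniform distribution

def p(event):
    """Probability of an event (a set of basic outcomes) under uniformity."""
    return Fraction(len(event & omega), len(omega))

X = {2, 4, 6}                                  # divisible by 2
Y = {3, 6}                                     # divisible by 3

p_X_given_Y = p(X & Y) / p(Y)                  # P(X|Y) = P(X∩Y) / P(Y)
p_Y_given_X = p(Y & X) / p(X)                  # P(Y|X) = P(Y∩X) / P(X)

# Bayes' theorem: P(X|Y) = P(Y|X) * P(X) / P(Y)
assert p_X_given_Y == p_Y_given_X * p(X) / p(Y)

# Additivity rule: P(X) = P(X|Y)P(Y) + P(X|not-Y)P(not-Y)
not_Y = omega - Y
assert p(X) == p_X_given_Y * p(Y) + (p(X & not_Y) / p(not_Y)) * p(not_Y)

print(p_X_given_Y, p_Y_given_X)                # 1/2 1/3
```

Exact `Fraction` arithmetic avoids floating-point noise, so the identities hold as strict equalities.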
• X = {2, 4, 6}, Y = {3, 6}
• p(X) = p(2) + p(4) + p(6) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2
• p(Y) = p(3) + p(6) = 2/6 = 1/3
• p(X,Y) = p(6) = 1/6 = 1/2 * 1/3 = p(X) * p(Y)
• ⇒ X and Y are independent.

Independent events
• Consider Z the event "the number on the die is divisible by 4". Are X and Z independent?
p(Z) = p(4) = 1/6
p(X,Z) = 1/6
p(X|Z) = p(X,Z) / p(Z) = (1/6) / (1/6) = 1 ≠ 1/2 = p(X)
⇒ X and Z are not independent.

Other useful relations:
p(x) = Σ_{y∈Y} p(x|y) * p(y)   or   p(x) = Σ_{y∈Y} p(x,y)

Chain rule:
p(x1, x2, ..., xn) = p(x1) * p(x2|x1) * p(x3|x1,x2) * ... * p(xn|x1, x2, ..., xn-1)
The proof is easy, through successive reductions. Consider the event y as the conjunction of the events x1, x2, ..., xn-1:
p(x1, x2, ..., xn) = p(y, xn) = p(y) * p(xn|y) = p(x1, ..., xn-1) * p(xn|x1, ..., xn-1)
Similarly, for the event z = (x1, ..., xn-2):
p(x1, ..., xn-1) = p(z, xn-1) = p(z) * p(xn-1|z) = p(x1, ..., xn-2) * p(xn-1|x1, ..., xn-2)
...
p(x1, x2, ..., xn) = p(x1) * p(x2|x1) * p(x3|x1,x2) * ... * p(xn|x1, ..., xn-1)
prior, bigram, trigram, n-gram

Objections
• People don't compute probabilities. Why would computers? Or do they?
• "John went to ..." — candidate continuations: the market? go? red? if? number?

Objections
• Statistics only counts words and co-occurrences.
• These are two different concepts: a statistical model and a statistical method.
– The first doesn't need the second.
– A person who uses intuition to reason is using a statistical model without statistical methods.
• Objections refer mainly to the accuracy of statistical models.

Reference
• Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing

Great!
P(See you next time) = …
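The chain rule above is what n-gram models approximate: a bigram model truncates each conditioning history to the previous word. A minimal sketch on an invented toy corpus (the corpus and helper names are mine, purely for illustration):

```python
from collections import Counter

# Hypothetical toy corpus, chosen only to make the counts easy to follow
corpus = "the dog is black the dog is small the cat is black".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """Bigram estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# Chain rule with the bigram (Markov) approximation:
# P(the dog is black) ≈ P(dog|the) * P(is|dog) * P(black|is)
sentence = ["the", "dog", "is", "black"]
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= p_next(word, prev)

print(prob)        # 2/3 * 1 * 2/3 ≈ 0.444
```

The full chain rule would condition each word on its entire history; the bigram truncation is exactly the prior/bigram/trigram progression named on the slide.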