Introduction
LING 572
Fei Xia
Week 1: 1/3/06

Outline
• Course overview
• Problems and methods
• Mathematical foundation
  – Probability theory
  – Information theory

Course overview

Course objective
• Focus on statistical methods that produce state-of-the-art results
• Questions for each algorithm:
  – How the algorithm works: input, output, steps
  – What kinds of tasks the algorithm can be applied to
  – How much data is needed
    • Labeled data
    • Unlabeled data

General info
• Course website:
  – Syllabus (incl. slides and papers): updated every week
  – Message board
  – ESubmit
• Office hour: W 3-5pm
• Prerequisites:
  – Ling570 and Ling571
  – Programming: C, C++, or Java; Perl is a plus
  – Introduction to probability and statistics

Expectations
• Reading:
  – Papers are online; let me know if you don't have access to a printer.
  – Reference book: Manning & Schütze (M&S)
  – Finish reading before class. Bring your questions to class.
• Grade:
  – Homework (3): 30%
  – Project (6 parts): 60%
  – Class participation: 10%
  – No quizzes or exams

Assignments
• Hw1: FSA and HMM
• Hw2: DT, DL, and TBL
• Hw3: Boosting
• No coding. Bring the finished assignments to class.

Project
• P1: Method 1 (baseline): trigram
• P2: Method 2: TBL
• P3: Method 3: MaxEnt
• P4: Method 4: choose one of four tasks
• P5: Presentation
• P6: Final report
• Methods 1-3 are supervised methods. Method 4 is one of: bagging, boosting, semi-supervised learning, or system combination.
• P1 is an individual task; P2-P6 are group tasks. A group should have no more than three people.
• Use ESubmit. You will need to use others' code and write your own.

Summary of Ling570
• Overview: corpora, evaluation
• Tokenization
• Morphological analysis
• POS tagging
• Shallow parsing
• N-grams and smoothing
• WSD
• NE tagging
• HMM

Summary of Ling571
• Parsing
• Semantics
• Discourse
• Dialogue
• Natural language generation (NLG)
• Machine translation (MT)

570/571 vs. 572
• 572 focuses more on statistical approaches.
• 570/571 are organized by tasks; 572 is organized by learning methods.
• I assume that you know:
  – the basics of each task: POS tagging, parsing, ...
  – the basic concepts: PCFG, entropy, ...
  – some learning methods: HMM, FSA, ...

An example
• 570/571:
  – POS tagging: HMM
  – Parsing: PCFG
  – MT: Model 1-4 training
• 572:
  – HMM: forward-backward algorithm
  – PCFG: inside-outside algorithm
  – MT: EM algorithm
• All are special cases of the EM algorithm, one method of unsupervised learning.

Course layout
• Supervised methods
  – Decision tree
  – Decision list
  – Transformation-based learning (TBL)
  – Bagging
  – Boosting
  – Maximum Entropy (MaxEnt)

Course layout (cont)
• Semi-supervised methods
  – Self-training
  – Co-training
• Unsupervised methods
  – EM algorithm
    • Forward-backward algorithm
    • Inside-outside algorithm
    • EM for PM models

Outline
• Course overview
• Problems and methods
• Mathematical foundation
  – Probability theory
  – Information theory

Problems and methods

Types of ML problems
• Classification problem
• Estimation problem
• Clustering
• Discovery
• ...
• A learning method can be applied to one or more types of ML problems. We will focus on the classification problem.

Classification problem
• Given a set of classes and data x, decide which class x belongs to.
• Labeled data (a minimal sketch follows below):
  – $\{(x_i, y_i)\}$ is a set of labeled data.
  – $x_i$ is a list of attribute values.
  – $y_i$ is a member of a pre-defined set of classes.
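To make the $(x_i, y_i)$ representation concrete, here is a minimal Python sketch (not part of the course materials; the PP-attachment-style features and labels are invented for illustration). The "classifier" is only a majority-class baseline:

```python
# A minimal sketch of labeled data and a baseline classifier.
# The feature values and class labels are invented for illustration;
# they are not from the course materials.
from collections import Counter

# Labeled data: each instance is (x_i, y_i), where x_i is a list of
# attribute values and y_i is a class from a pre-defined set.
labeled_data = [
    (["saw", "man", "with", "telescope"], "verb-attach"),
    (["ate", "pizza", "with", "anchovies"], "noun-attach"),
    (["hit", "ball", "with", "bat"], "verb-attach"),
]

# A trivial baseline: always predict the most frequent class in training.
majority_class = Counter(y for _, y in labeled_data).most_common(1)[0][0]

def classify(x):
    """Ignore the attribute values x and return the majority class."""
    return majority_class

print(classify(["drank", "tea", "with", "milk"]))  # -> verb-attach
```

Every learning method in this course can be viewed as a (much better) replacement for `classify` above: it actually uses the attribute values of $x$.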
Examples of classification problems
• Disambiguation:
  – Document classification
  – POS tagging
  – WSD
  – PP attachment given a set of other phrases
• Segmentation:
  – Tokenization / word segmentation
  – NP chunking

Learning methods
• Modeling: represent the problem as a formula and decompose the formula into a function of parameters
• Training stage: estimate the parameters
• Test (decoding) stage: find the answer given the parameters

Modeling
• Joint vs. conditional models:
  – P(data, model)
  – P(model | data)
  – P(data | model)
• Decomposition:
  – Which variable conditions on which variable?
  – What independence assumptions are made?

An example of different modeling
• $P(A,B,C) = P(A)\,P(B|A)\,P(C|A,B) \approx P(A)\,P(B|A)\,P(C|B)$
• $P(A,B,C) = P(B)\,P(C|B)\,P(A|B,C) \approx P(B)\,P(C|B)\,P(A|B)$
• (The second form in each line drops a conditioning variable under an independence assumption.)

Training
• Objective functions:
  – Maximize likelihood: $\hat{\theta}_{ML} = \arg\max_{\theta} P(data \mid \theta)$
  – Minimize error rate
  – Maximum entropy
  – ...
• Supervised, semi-supervised, unsupervised:
  – Ex: maximize likelihood
    • Supervised: simple counting
    • Unsupervised: EM

Decoding
• DP algorithms:
  – CYK for PCFG
  – Viterbi for HMM
  – ...
• Pruning:
  – TopN: keep the top N hypotheses at each node.
  – Beam: keep hypotheses whose weights >= beam * max_weight.
  – Threshold: keep hypotheses whose weights >= threshold.
  – ...

Outline
• Course overview
• Problems and methods
• Mathematical foundation
  – Probability theory
  – Information theory

Probability Theory

Probability theory
• Sample space, event, event space
• Random variable and random vector
• Conditional probability, joint probability, marginal probability (prior)

Sample space, event, event space
• Sample space (Ω): a collection of basic outcomes.
  – Ex: toss a coin twice: {HH, HT, TH, TT}
• Event: an event is a subset of Ω.
  – Ex: {HT, TH}
• Event space ($2^\Omega$): the set of all possible events.

Random variable
• The outcome of an experiment need not be a number.
• We often want to represent outcomes as numbers.
• A random variable is a function that associates a unique numerical value with every outcome of an experiment: a function $X: \Omega \to R$.
• Ex: toss a coin once: X(H)=1, X(T)=0.

Two types of random variables
• Discrete random variable: X takes on only a countable number of distinct values.
  – Ex: toss a coin 10 times; X is the number of tails that are noted.
• Continuous random variable: X takes on an uncountable number of possible values.
  – Ex: X is the lifetime (in hours) of a light bulb.

Probability function
• The probability function of a discrete variable X gives the probability $p(x_i)$ that the random variable equals $x_i$: $p(x_i) = P(X = x_i)$, where
  $0 \le p(x_i) \le 1$ and $\sum_{x_i} p(x_i) = 1$.

Random vector
• A random vector is a finite-dimensional vector of random variables: $X = [X_1, \ldots, X_k]$.
• $P(x) = P(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n)$
• Ex: $P(w_1, \ldots, w_n, t_1, \ldots, t_n)$

Three types of probability
• Joint prob: P(x,y) = prob of x and y happening together
• Conditional prob: P(x|y) = prob of x given a specific value of y
• Marginal prob: P(x) = prob of x for all possible values of y

Common equations
• $P(A) = \sum_{B} P(A, B)$
• $P(A, B) = P(A)\,P(B|A) = P(B)\,P(A|B)$
• $P(B|A) = \frac{P(A, B)}{P(A)}$

More general cases
• $P(A_1) = \sum_{A_2, \ldots, A_n} P(A_1, \ldots, A_n)$
• $P(A_1, \ldots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$ (the chain rule)
• (A small numeric check of these equations follows below.)
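As a quick numeric check of the marginalization, conditional-probability, and chain-rule equations above, here is a small Python sketch (the joint distribution is made up for illustration):

```python
# A small numeric check of the marginalization and chain-rule equations.
# The joint distribution below is made up for illustration.
from itertools import product

# Joint distribution P(A, B) over two binary variables.
P = {(0, 0): 0.3, (0, 1): 0.2,
     (1, 0): 0.1, (1, 1): 0.4}

def marginal_A(a):
    """P(A=a) = sum over B of P(A=a, B)."""
    return sum(P[(a, b)] for b in (0, 1))

def cond_B_given_A(b, a):
    """P(B=b | A=a) = P(A=a, B=b) / P(A=a)."""
    return P[(a, b)] / marginal_A(a)

# Chain rule: P(A, B) = P(A) * P(B | A) must hold for every cell.
for a, b in product((0, 1), repeat=2):
    assert abs(P[(a, b)] - marginal_A(a) * cond_B_given_A(b, a)) < 1e-12

print(marginal_A(0))         # 0.5
print(cond_B_given_A(1, 0))  # 0.4
```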
Information Theory

Information theory
• Information theory is the use of probability theory to quantify and measure "information".
• Basic concepts:
  – Entropy
  – Joint entropy and conditional entropy
  – Cross entropy and relative entropy
  – Mutual information and perplexity

Entropy
• Entropy is a measure of the uncertainty associated with a distribution:
  $H(X) = -\sum_{x} p(x) \log p(x)$
• It is the lower bound on the number of bits it takes to transmit messages.
• An example:
  – Display the results of horse races.
  – Goal: minimize the number of bits to encode the results.

An example
• Uniform distribution: $p_i = 1/8$.
  $H(X) = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3$ bits
• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
  $H(X) = -\left(\frac{1}{2}\log\frac{1}{2} + \frac{1}{4}\log\frac{1}{4} + \frac{1}{8}\log\frac{1}{8} + \frac{1}{16}\log\frac{1}{16} + 4 \times \frac{1}{64}\log\frac{1}{64}\right) = 2$ bits
• An optimal code for the non-uniform case: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
• (These numbers are checked in the first sketch at the end of these notes.)

Entropy of a language
• The entropy of a language L:
  $H(L) = -\lim_{n \to \infty} \frac{\sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})}{n}$
• If we make certain assumptions that the language is "nice", then the entropy can be calculated as:
  $H(L) = -\lim_{n \to \infty} \frac{\log p(x_{1n})}{n} \approx -\frac{\log p(x_{1n})}{n}$

Joint and conditional entropy
• Joint entropy:
  $H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y)$
• Conditional entropy:
  $H(Y|X) = -\sum_{x}\sum_{y} p(x, y) \log p(y|x) = H(X, Y) - H(X)$

Cross entropy
• Entropy: $H(X) = -\sum_{x} p(x) \log p(x)$
• Cross entropy: $H_c(X) = -\sum_{x} p(x) \log q(x)$
• Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
• $H_c(X) \ge H(X)$

Cross entropy of a language
• The cross entropy of a language L:
  $H(L, q) = -\lim_{n \to \infty} \frac{\sum_{x_{1n}} p(x_{1n}) \log q(x_{1n})}{n}$
• If we make certain assumptions that the language is "nice", then the cross entropy can be calculated as:
  $H(L, q) = -\lim_{n \to \infty} \frac{\log q(x_{1n})}{n} \approx -\frac{\log q(x_{1n})}{n}$

Relative entropy
• Also called the Kullback-Leibler distance:
  $KL(p \| q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)} = H_c(X) - H(X)$
• Another distance measure between probability functions p and q.
• KL distance is asymmetric (not a true distance): $KL(p \| q) \ne KL(q \| p)$.

Relative entropy is non-negative
• For all $z > 0$, $\log z \le z - 1$. Therefore:
  $-KL(p \| q) = \sum_{x} p(x) \log \frac{q(x)}{p(x)} \le \sum_{x} p(x)\left(\frac{q(x)}{p(x)} - 1\right) = \sum_{x} q(x) - \sum_{x} p(x) = 0$
• Hence $KL(p \| q) \ge 0$.

Mutual information
• It measures how much is in common between X and Y:
  $I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} = H(X) + H(Y) - H(X, Y) = I(Y; X)$
• $I(X; Y) = KL(p(x, y) \,\|\, p(x)\,p(y))$
• (See the second sketch at the end of these notes for a numeric check.)

Perplexity
• Perplexity is $2^H$.
• Perplexity is the weighted average number of choices a random variable has to make.

Summary
• Course overview
• Problems and methods
• Mathematical foundation (M&S Ch 2)
  – Probability theory
  – Information theory

Next time
• FSA
• HMM: M&S Ch 9.1 and 9.2
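The two sketches below are not part of the original slides; they numerically check the examples above. First, the horse-race entropy example and the perplexity slide:

```python
# Numeric check of the horse-race entropy example and the perplexity slide.
from math import log2

def entropy(dist):
    """H(X) = -sum_x p(x) log2 p(x); zero-probability outcomes contribute 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

uniform = [1/8] * 8
nonuniform = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))     # 3.0 bits, matching the slide
print(entropy(nonuniform))  # 2.0 bits, matching the slide

# Perplexity is 2^H: 8 effective choices in the uniform case, 4 in the other.
print(2 ** entropy(uniform), 2 ** entropy(nonuniform))  # 8.0 4.0
```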
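Second, a sketch checking the relative-entropy and mutual-information slides (the distributions are made up for illustration):

```python
# Numeric check of the relative-entropy and mutual-information slides.
# The distributions below are made up for illustration.
from math import log2

def kl(p, q):
    """KL(p || q) = sum_x p(x) log2 (p(x) / q(x))."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]
# Non-negative and asymmetric, as the slides state.
print(kl(p, q), kl(q, p))  # ~0.085 vs. ~0.082

# Mutual information as KL between the joint and the product of marginals:
# I(X;Y) = KL(p(x,y) || p(x) p(y)).
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(v for (a, _), v in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in joint.items() if b == y) for y in (0, 1)}
mi = sum(v * log2(v / (px[x] * py[y])) for (x, y), v in joint.items())
print(mi)  # ~0.278 bits
```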