Introduction
LING 575
Week 1: 1/08/08

Plan for today
• General information
• Course plan
• HMM and n-gram tagger (recap)
• EM and forward-backward algorithm

Before next time
• Select papers that you’d like to present
  – Reply to the 1st message at GoPost by noon Saturday
• Read M&S 9.3.3
  – Remember to hand in your questions next time.

General information

General info
• Course url: http://courses.washington.edu/ling575x
  – Syllabus (incl. slides, assignments, and papers): updated every week.
  – GoPost:
  – Collect it:
• Please check your email at least once a day.

Office hour
• Email:
  – Email address: [email protected]
  – Subject line should include “ling575”
  – The 48-hour rule: it works both ways
• Office hour:
  – Time: Fri 10:30-11:30am
  – Location: Padelford A-210G

Slides
• The slides will be online before class if possible.
• The final version will be uploaded a few hours after class.

Prerequisites
• CS 326 (Data Structures) or equivalent:
• Stat 391 (Prob. and Stats for CS) or equivalent: Basic concepts in probability and statistics
• Programming in Perl, C, C++, Java, or Python
• LING570
• LING572
• Being comfortable with formulas

Grades for LING575
No midterm or final exams.
Graded:
• Assignments (5): 45-60%
• Presentation: 15-25%
Not graded:
• Reading: 5-10%
• Class participation: 10-20%

Assignments
• Assignments:
  – Due at 2:30pm on Tuesdays
  – 1% penalty for each hour after the due date. Nothing accepted after 4 days.
  – Submit via CollectIt
• Reading:
  – Papers should be read before class.
  – Bring at least two questions to class.
  – Your answers will be checked but not graded.

Presentation
• Select your week by noon this Saturday (1/12) by replying to the GoPost message:
  – first come, first served
• If, for whatever reason, the week you selected later no longer works for you, it is your responsibility to find someone to switch with.
• For your week, email Fei the slides by noon on Monday (i.e., the day before your presentation).
  – 1% penalty for each hour after the due date.

Patas
• If you need a patas account, email [email protected] right away to get one.
• The directory for LING575: ~/dropbox/07-08/575x/
  – hw1/, hw2/, ….: Assignments and solutions
  – hmm/: A pre-existing HMM package
  – misc_slides/: Solutions to exams and misc slides that are not on the course url.

Course plan

Machine learning
• Supervised learning: LING572
• Semi-supervised learning:
  – Some annotated data, plus a large amount of unannotated data
  – Ex: self-training, co-training, transductive SVM
• Unsupervised learning:
  – There is no annotated data
  – Ex: EM

Unsupervised learning
• No annotated data
• But the knowledge has to come from somewhere.
  – Dictionary / lexicon
  – Seed examples
  – …
We choose unsupervised POS tagging as our case study.

Supervised POS tagging
• It is a sequence labeling problem.
• Statistical approach:
  – Sequence labeling algorithms: HMM, MEMM, CRF, …
  – Classification algorithms: decision tree, naïve Bayes, MaxEnt, SVM, Boosting, ….
• Most unsupervised POS tagging algorithms use EM to estimate HMM parameters.

Major approaches to unsupervised tagging
• All assume a large amount of unannotated data
• Approach #1: use EM to estimate HMM parameters
  – No lexicon
  – With full lexicon
  – With filtered lexicon

Major approaches (cont)
• Approach #2: clustering the words based on
  – distributional cues
  – morphological cues
• Approach #3: cross-lingual approach:
  – It requires parallel data
  – Seeds are created by projecting POS info from one language to the other.

Major approaches (cont)
• Approach #4: Prototype learning:
  – It requires a small number of prototypes: e.g., “book” is a noun, “the” is a determiner.
  – Prototypes would help to label other words.

In this course
• We will
  – discuss the papers in each category
  – explore various methods aiming at improving the state of the art.
• Compared to last year’s LING573, this course focuses
  – more on machine learning
  – less on search and rule writing
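
Since the course plan centers on using EM to estimate HMM parameters from unannotated text, a small illustration may help fix the idea. The sketch below is one possible forward-backward (Baum-Welch) loop in plain Python, not the course's hmm/ package. The toy corpus, the choice of two hidden states, the random initialization, and all names (forward_backward, em_step, history) are assumptions made for illustration. It omits scaling, so it is only suitable for very short sentences.

```python
import math
import random

def normalize(row):
    """Turn expected counts into a distribution (uniform if all counts are 0)."""
    total = sum(row)
    if total == 0.0:  # illustrative fallback for a state with no expected counts
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]

def forward_backward(obs, pi, A, B):
    """Return (alpha, beta, likelihood) for one sequence of word ids."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[0.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][obs[0]]
        beta[T - 1][i] = 1.0
    for t in range(1, T):
        for j in range(n):
            prev = sum(alpha[t - 1][i] * A[i][j] for i in range(n))
            alpha[t][j] = prev * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return alpha, beta, sum(alpha[T - 1])

def em_step(corpus, pi, A, B):
    """One EM iteration: collect expected counts (E), then renormalize (M).

    Also returns the log-likelihood of the corpus under the input parameters.
    """
    n, V = len(pi), len(B[0])
    pi_c = [0.0] * n
    A_c = [[0.0] * n for _ in range(n)]
    B_c = [[0.0] * V for _ in range(n)]
    log_like = 0.0
    for obs in corpus:
        alpha, beta, like = forward_backward(obs, pi, A, B)
        log_like += math.log(like)
        T = len(obs)
        for t in range(T):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / like  # P(state i at t | obs)
                if t == 0:
                    pi_c[i] += gamma
                B_c[i][obs[t]] += gamma
                if t < T - 1:
                    for j in range(n):  # expected i -> j transition count
                        A_c[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                      * beta[t + 1][j] / like)
    return (normalize(pi_c), [normalize(r) for r in A_c],
            [normalize(r) for r in B_c], log_like)

# Hypothetical unannotated toy corpus: no tags, only words.
vocab = ["the", "a", "dog", "cat", "book"]
sentences = [["the", "dog"], ["a", "cat"], ["the", "book"],
             ["a", "dog"], ["the", "cat"]]
corpus = [[vocab.index(w) for w in s] for s in sentences]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_states = 2            # assumed number of hidden tags

def random_dist(k):
    return normalize([rng.random() + 0.1 for _ in range(k)])

pi = random_dist(n_states)
A = [random_dist(n_states) for _ in range(n_states)]
B = [random_dist(len(vocab)) for _ in range(n_states)]

history = []  # log-likelihood before each update
for _ in range(10):
    pi, A, B, ll = em_step(corpus, pi, A, B)
    history.append(ll)
print("first and last log-likelihood:", history[0], history[-1])
```

The log-likelihood values in history should never decrease from one iteration to the next, which is the standard EM guarantee. Note that the learned states are unlabeled: nothing in this sketch says which state is "noun" or "determiner", which is why the approaches above bring in a lexicon, seeds, or prototypes.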