Data Mining Workshop
Aachen, 23-25 November 2016
Gerasimos (Jerry) Spanakis, PhD
http://dke.maastrichtuniversity.nl/jerry.spanakis, @gerasimoss

Who am I?
(2001-2006, 2007-2012, 2007-2012, 2012-2013, 2013-2014, 2014-2016, 2016, 2016-)
• Diploma (MSc) in Electrical & Computer Engineering (neural networks and reinforcement learning)
• Diploma (MSc) in Civil (Transport) Engineering (machine learning in transport optimization problems)
• PhD in Computational Intelligence (intelligent techniques in text analysis)
• Army service: maintenance, update and development of the army personnel software system
• Lecturer at the Technical University of Crete
• Scientific Associate at the Technological Institute of Crete
• Visiting researcher at the University of Alberta (recommender systems for higher education)
• Postdoctoral Researcher at Maastricht University, Department of Data Science & Knowledge Engineering
• Assistant Professor at Maastricht University, Department of Data Science & Knowledge Engineering

Who am I (really)?
• Machine learner & data miner
• Coffee addict
• TV series binge-watcher
• Aviation enthusiast
• Gin & Tonic lover
• Proud geek

Outline
• Data mining: why has the world gone crazy?
- Data preparation
• Different forms of data & learning
- Classification
- Clustering, dimensionality reduction
- Recommender systems
- Topic modeling for texts
• The deep learning "hype"
- How does it work?
Some slide credits to:

Scheduling
• Mornings: "lectures"; afternoons: practicals
- Can be adjusted…
• Day 1: Introduction, data pre-processing, supervised learning: classification
• Day 2: Unsupervised learning: clustering, dimensionality reduction, recommender systems, topic modeling for texts
• Day 3: Deep learning for text & images
https://dke.maastrichtuniversity.nl/jerry.spanakis/aachen16/

Let's start…
• How many of you have already applied data mining?
• How many of you have heard words like classification, clustering, matrix factorization, SVM, accuracy, LDA, …?
• How many of you have worked with R?

What is data science or data mining?
• Learning = Representation + Evaluation + Optimization [Joel Grus, 2016]
- Representation: choice of model & hypothesis space
- Evaluation: choice of objective function
- Optimization: the algorithm to compute the best model

We mine data every day
• Yes, it works :)
• Spam detection
• Credit card fraud detection
• Digit recognition
• Voice recognition
• Face detection
• Stock trading
• Games

Why does it work?
• Supervised learning techniques ("What is this?": cars vs. motorcycles)
• Lots of (labeled) data
• Computational power
- "The number of people predicting the death of Moore's law doubles every two years" -- Peter Lee, VP, Microsoft Research

How is computer perception done?

Case Study: Bank
• Business goal: sell more home equity loans
• Current models:
- Customers with college-age children use home equity loans to pay for tuition
- Customers with variable income use home equity loans to even out their stream of income
• Data:
- Large data warehouse
- Consolidates data from 42 operational data sources

Case Study: Bank (contd.)
1. Select the subset of customer records who have received the home equity loan offer (customers who declined vs. customers who signed up):

Income | Number of Children | Average Checking Account Balance | Response
$40,000 | 2 | $1500 | Yes
$75,000 | 0 | $5000 | No
$50,000 | 1 | $3000 | No
… | … | … | …

2. Find rules to predict whether a customer would respond to the home equity loan offer, e.g.:
IF (Salary < 40k) AND (numChildren > 0) AND (ageChild1 > 18 AND ageChild1 < 22) THEN YES

3. Group customers into clusters and investigate the clusters (Group 1, Group 2, Group 3, Group 4)

4. Evaluate the results:
- Many "uninteresting" clusters
- One interesting cluster! Customers with both business and personal accounts, with an unusually high percentage of likely respondents

What is a Data Mining Model?
A data mining model is a description of a certain aspect of a dataset. It produces output values for an assigned set of inputs.
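A model in this sense is just a function from inputs to outputs. As a minimal illustration (not from the slides), the IF-THEN rule mined in step 2 of the bank case study can be written as a predicate; the field names `salary`, `num_children` and `age_child1` are illustrative:

```python
# Minimal sketch: the bank rule from step 2 as an input -> output model.
# Field names (salary, num_children, age_child1) are illustrative, not from the slides.
def responds_to_offer(salary, num_children, age_child1):
    """IF (Salary < 40k) AND (numChildren > 0) AND (18 < ageChild1 < 22) THEN YES."""
    return salary < 40_000 and num_children > 0 and 18 < age_child1 < 22

print(responds_to_offer(35_000, 2, 20))  # True: the rule predicts this customer signs up
print(responds_to_offer(75_000, 0, 0))   # False
```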
Examples:
• Clustering
• Linear regression model
• Classification model
• Frequent itemsets and association rules
• Support Vector Machines
• …

Learning from data is difficult
Supervised learning:
• It has little to do with human learning
• Transition to open but big data
Unsupervised learning:
• Exploratory data analysis: the goal is not clear
• Difficult to assess performance
• High-dimensional data

Our goal: vector representations

Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack | M | 1 | 0 | 1 | 0 | 0 | 0
Mary | F | 1 | 0 | 1 | 0 | 1 | 0
Jim | M | 1 | 1 | 0 | 0 | 0 | 0

• Discrete vs. continuous data
• Numeric, binary, ordinal features
• Different distance measures, e.g. between x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n):

d(x, y) = (|x_1 − y_1|^p + |x_2 − y_2|^p + … + |x_n − y_n|^p)^(1/p), p > 0

Why Data Preprocessing?
• Data in the real world is dirty
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation = "")
- noisy: containing errors or outliers (e.g., Salary = "-10")
- inconsistent: containing discrepancies in codes or names
- e.g., Age = "42" but Birthday = "03/07/1997"
- e.g., ratings were "1,2,3", now "A,B,C"
- e.g., discrepancies between duplicate records

Why Is Data Dirty?
• Incomplete data may come from
- "Not applicable" data values when collected
- Different considerations between the time the data was collected and the time it is analyzed
- Human/hardware/software problems
• Noisy data (incorrect values) may come from
- Faulty data collection instruments
- Human or computer error at data entry
- Errors in data transmission
• Inconsistent data may come from
- Different data sources
- Functional dependency violations (e.g., modifying some linked data)
• Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
• "Garbage in, garbage out"
- Quality decisions must be based on quality data
- e.g., duplicate or missing data may cause incorrect or even misleading statistics
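The Minkowski distance defined above can be sketched directly from its formula; p = 2 gives the Euclidean distance and p = 1 the Manhattan distance. Applied to the binary patient vectors from the table above:

```python
# Minkowski distance d(x, y) = (sum_i |x_i - y_i|^p)^(1/p), p > 0 (a sketch, not library code).
def minkowski(x, y, p=2):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

jack = [1, 0, 1, 0, 0, 0]  # Fever, Cough, Test-1 .. Test-4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(minkowski(jack, mary))      # 1.0: they differ in a single attribute
print(minkowski(jack, jim))       # ~1.414: two differing attributes
print(minkowski(jack, jim, p=1))  # 2.0: Manhattan distance
```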
- A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
• You may assume that in a data mining problem 80% of the work is the preparation of the data

Major Tasks in Data Preprocessing
• Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
- Integration of multiple databases, data cubes, or files
• Data transformation
- Normalization and aggregation
• Data reduction
- Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
- Part of data reduction, but of particular importance, especially for numerical data

Mining Data Descriptive Characteristics
• Motivation
- To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
- median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
- Data dispersion: analyzed with multiple granularities of precision
- Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
- Folding measures into numerical dimensions
- Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):

x̄ = (1/n) Σ_{i=1}^{n} x_i    µ = (Σ x) / N

- Weighted arithmetic mean: x̄ = (Σ_{i=1}^{n} w_i x_i) / (Σ_{i=1}^{n} w_i)
- Trimmed mean: chopping extreme values
• Median: a holistic measure
- Middle value if there is an odd number of values, or the average of the middle two values otherwise
- Estimated by interpolation (for grouped data): median = L_1 + ((n/2 − (Σ f)_l) / f_median) · c
• Mode
- Value that occurs most frequently in the data
- Unimodal, bimodal, trimodal
- Empirical formula: mean − mode = 3 × (mean − median)

Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 − Q1
- Five-number summary: min, Q1, M, Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers, and plot outliers individually
- Outlier: usually, a value higher/lower than 1.5 × IQR
• Variance and standard deviation (sample: s, population: σ)
- Variance (algebraic, scalable computation):

σ² = (1/N) Σ_{i=1}^{n} (x_i − µ)² = (1/N) Σ_{i=1}^{n} x_i² − µ²

s² = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)² = (1/(n−1)) [Σ_{i=1}^{n} x_i² − (1/n)(Σ_{i=1}^{n} x_i)²]

- The standard deviation s (or σ) is the square root of the variance s² (or σ²)

Properties of the Normal Distribution Curve
• The normal (distribution) curve
- From µ−σ to µ+σ: contains about 68% of the measurements (µ: mean, σ: standard deviation)
- From µ−2σ to µ+2σ: contains about 95% of it
- From µ−3σ to µ+3σ: contains about 99.7% of it

Boxplot Analysis
• Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
• Boxplot
- Data is represented with a box
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to Minimum and Maximum

Visualization of Data Dispersion: Boxplot Analysis

Data Cleaning
• Importance
- "Data cleaning is one of the three biggest problems in data warehousing" -- Ralph Kimball
- "Data cleaning is the number one problem in data-related tasks" -- unknown programmer
• Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration

Missing Data
• Data is not always available
- E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
- equipment malfunction
- being inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- not registering the history or changes of the data
• Missing data may need to be inferred.

How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
- a global constant: e.g., "unknown", a new class?!
- the attribute mean
- the attribute mean for all samples belonging to the same class: smarter
- the most probable value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
• Other data problems which require data cleaning
- duplicate records
- incomplete data
- inconsistent data

How to Handle Noisy Data?
• Binning
- first sort the data and partition it into (equal-frequency) bins
- then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
• Regression
- smooth by fitting the data to regression functions
• Clustering
- detect and remove outliers
• Combined computer and human inspection
- detect suspicious values and check by a human (e.g., deal with possible outliers)

Data Transformation
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
• Attribute/feature construction
- New attributes constructed from the given ones

Data Transformation: Normalization
• Min-max normalization to [new_min_A, new_max_A]:

v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

- Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (µ: mean, σ: standard deviation):

v' = (v − µ_A) / σ_A

- Ex. Let µ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

How is this applied to different data?
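The min-max and z-score normalization examples above can be checked in a few lines (a sketch of the formulas, not library code):

```python
# Min-max normalization to [new_lo, new_hi], and z-score normalization.
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    return (v - mu) / sigma

# The income example from the slide: range $12,000-$98,000, value $73,600.
print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(z_score(73_600, 54_000, 16_000))            # 1.225
```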
• Data can be:
- Numeric
- Categorical
- Text
- Images
• The time dimension creates more complex structures
- Time series (sequences of numerical data)
- Videos (sequences of images)

What's the problem with textual data?
• Unstructured (text) vs. structured (database) data in the mid-nineties
• Unstructured (text) vs. structured (database) data today

It all started from information retrieval (or web search)
• Collection: a set of documents
- Assume it is a static collection for the moment
• Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task

Unstructured data in 1620
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is that not a feasible answer?
- Slow (for large corpora)
- NOT Calpurnia is non-trivial
- Other operations (e.g., find the word Romans near countrymen) are not feasible
- Ranked retrieval (best documents to return)
• And it starts getting very complex…

Term-document incidence matrices
(1 if the play contains the word, 0 otherwise)

Term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 1 | 1 | 0 | 0 | 0 | 1
Brutus | 1 | 1 | 0 | 1 | 0 | 0
Caesar | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy | 1 | 0 | 1 | 1 | 1 | 1
worser | 1 | 0 | 1 | 1 | 1 | 0

Query: Brutus AND Caesar BUT NOT Calpurnia

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.

Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
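The incidence-vector query above can be sketched as bitwise operations on the 0/1 rows of the matrix (rows copied from the slide):

```python
# Boolean retrieval over the term-document incidence matrix:
# Brutus AND Caesar AND NOT Calpurnia.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

result = [b & c & (1 - p) for b, c, p in zip(incidence["Brutus"],
                                             incidence["Caesar"],
                                             incidence["Calpurnia"])]
print(result)                                       # [1, 0, 0, 1, 0, 0]
print([p for p, hit in zip(plays, result) if hit])  # the two answer plays
```

The two hits are exactly the plays quoted as answers on the slide.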
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg. 6 bytes/word including spaces/punctuation
- 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.
• A 500K x 1M matrix has half a trillion 0's and 1's.
• But it has no more than one billion 1's. Why?
- each document contains at most 1000 distinct terms, so there are at most 1M × 1000 = 10^9 ones
- the matrix is extremely sparse.
• Programmer's note: what's a better representation?
- We only record the 1 positions.

The problem with Boolean search: feast or famine
• Boolean queries often result in either too few (= 0) or too many (1000s) results.
• Query 1: "standard user dlink 650" → 200,000 hits
• Query 2: "standard user dlink 650 no card found" → 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits.
- AND gives too few; OR gives too many

Term-document count matrices
• Consider the number of occurrences of a term in a document:
- Each document is a count vector in ℕ^|V|: a column below

Term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 157 | 73 | 0 | 0 | 0 | 0
Brutus | 4 | 157 | 0 | 1 | 0 | 0
Caesar | 232 | 227 | 0 | 2 | 1 | 1
Calpurnia | 0 | 10 | 0 | 0 | 0 | 0
Cleopatra | 57 | 0 | 0 | 0 | 0 | 0
mercy | 2 | 0 | 3 | 5 | 5 | 1
worser | 2 | 0 | 1 | 1 | 1 | 0

Bag of words model
• The vector representation doesn't consider the ordering of words in a document
• "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
• This is called the bag of words model.
• Other well-known problems:
- Breaks multi-words (e.g. data mining)
- Does not consider synonymy, hyponymy, etc.
- The main problem is that it's sparse!

Term frequency tf
• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query-document match scores. But how?
• Raw term frequency is not what we want:
- A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
- But not 10 times more relevant.
• Relevance does not increase proportionally with term frequency.
NB: frequency = count in IR
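The John/Mary example above can be made concrete with a toy bag-of-words sketch (whitespace tokenization; not a full IR pipeline):

```python
from collections import Counter

# Bag of words: a document becomes an unordered multiset of its terms,
# so all word-order information is lost.
d1 = "john is quicker than mary"
d2 = "mary is quicker than john"

bow1, bow2 = Counter(d1.split()), Counter(d2.split())
print(bow1 == bow2)  # True: the two documents get identical count vectors
```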
Log-frequency weighting
• The log-frequency weight of term t in d is

w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:

score = Σ_{t ∈ q∩d} (1 + log tf_{t,d})

• The score is 0 if none of the query terms is present in the document.

Document frequency
• Rare terms are more informative than frequent terms
- Recall stop words
• Consider a term in the query that is rare in the collection (e.g., arachnophobia)
• A document containing this term is very likely to be relevant to the query arachnophobia
• → We want a high weight for rare terms like arachnophobia.

Document frequency, continued
• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• A document containing such a term is more likely to be relevant than a document that doesn't
• But it's not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high, increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.

idf weight
• df_t is the document frequency of t: the number of documents that contain t
- df_t is an inverse measure of the informativeness of t
- df_t ≤ N
• We define the idf (inverse document frequency) of t by

idf_t = log10(N / df_t)

- We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
- It will turn out that the base of the log is immaterial.

idf example, suppose N = 1 million

term | df_t | idf_t = log10(N/df_t)
calpurnia | 1 | 6
animal | 100 | 4
sunday | 1,000 | 3
fly | 10,000 | 2
under | 100,000 | 1
the | 1,000,000 | 0

There is one idf value for each term t in a collection.

Effect of idf on ranking
• Does idf have an effect on ranking for one-term queries, like "iPhone"?
• idf has no effect on ranking one-term queries
- idf affects the ranking of documents for queries with at least two terms
- For the query "capricious person", idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
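The idf table above follows directly from the formula; a quick check for N = 1 million:

```python
import math

# idf_t = log10(N / df_t) for the N = 1 million example above.
N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

idf = {t: math.log10(N / d) for t, d in df.items()}
print(idf)  # calpurnia: 6.0, animal: 4.0, sunday: 3.0, fly: 2.0, under: 1.0, the: 0.0
```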
Collection vs. document frequency
• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
• Example:

Word | Collection frequency | Document frequency
insurance | 10440 | 3997
try | 10422 | 8760

• Which word is a better search term (and should get a higher weight)?

tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = log(1 + tf_{t,d}) × log10(N / df_t)

• Best-known weighting scheme in information retrieval
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection

Score for a document given a query
• There are many variants
- How "tf" is computed (with/without logs)
- Whether the terms in the query are also weighted
- …

Binary → count → weight matrix

Term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 5.25 | 3.18 | 0 | 0 | 0 | 0.35
Brutus | 1.21 | 6.1 | 0 | 1 | 0 | 0
Caesar | 8.59 | 2.54 | 0 | 1.51 | 0.25 | 0
Calpurnia | 0 | 1.54 | 0 | 0 | 0 | 0
Cleopatra | 2.85 | 0 | 0 | 0 | 0 | 0
mercy | 1.51 | 0 | 1.9 | 0.12 | 5.25 | 0.88
worser | 1.37 | 0 | 0.11 | 4.15 | 0.25 | 1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

Documents as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
• These are very sparse vectors: most entries are zero.

Queries as vectors
• Key idea 1: do the same for queries: represent them as vectors in the space
• Key idea 2: rank documents according to their proximity to the query in this space
• proximity = similarity of vectors
• proximity ≈ inverse of distance
• Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
• Instead: rank more relevant documents higher than less relevant documents

Formalizing vector space proximity
• First cut: distance between two points
- (= distance between the endpoints of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea...
• ...because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea
• The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance
• Thought experiment: take a document d and append it to itself. Call this document d'.
• "Semantically" d and d' have the same content
• The Euclidean distance between the two documents can be quite large
• The angle between the two documents is 0, corresponding to maximal similarity.
• Key idea: rank documents according to the angle with the query.

From angles to cosines
• The following two notions are equivalent:
- Rank documents in decreasing order of the angle between query and document
- Rank documents in increasing order of cosine(query, document)
• Cosine is a monotonically decreasing function on the interval [0°, 180°]
• But how (and why) should we be computing cosines?

Length normalization
• A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm:

||x||_2 = sqrt(Σ_i x_i²)

• Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
• Effect on the two documents d and d' (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
- Long and short documents now have comparable weights

cosine(query, document)

cos(q, d) = (q · d) / (||q|| ||d||) = (Σ_{i=1}^{|V|} q_i d_i) / (sqrt(Σ_{i=1}^{|V|} q_i²) × sqrt(Σ_{i=1}^{|V|} d_i²))

- q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document
- cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.

Cosine for length-normalized vectors
• For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d, for q, d length-normalized.

Cosine similarity illustrated
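A minimal sketch of the cosine computation above: L2-normalize, then take the dot product. The toy vectors are illustrative (not from the slides); the doubled vector plays the role of "d appended to itself":

```python
import math

# Cosine similarity: L2-normalize both vectors, then take the dot product.
def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))

d = [3.0, 2.0, 1.0]
d_doubled = [6.0, 4.0, 2.0]  # d appended to itself: every count doubles
print(cosine(d, d_doubled))  # ~1.0: angle 0 despite a large Euclidean distance
```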
Cosine similarity amongst 3 documents
• How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

Term frequencies (counts):

term | SaS | PaP | WH
affection | 115 | 58 | 20
jealous | 10 | 7 | 11
gossip | 2 | 0 | 6
wuthering | 0 | 0 | 38

Note: to simplify this example, we don't do idf weighting.

3-documents example, contd.

After log-frequency weighting:

term | SaS | PaP | WH
affection | 3.06 | 2.76 | 2.30
jealous | 2.00 | 1.85 | 2.04
gossip | 1.30 | 0 | 1.78
wuthering | 0 | 0 | 2.58

After length normalization:

term | SaS | PaP | WH
affection | 0.789 | 0.832 | 0.524
jealous | 0.515 | 0.555 | 0.465
gossip | 0.335 | 0 | 0.405
wuthering | 0 | 0 | 0.588

cos(SaS, PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69
Why do we have cos(SaS, PaP) > cos(SaS, WH)?

The Dream for Natural Language Processing
• It'd be great if machines could
- Process our email (usefully)
- Translate languages accurately
- Help us manage, summarize, and aggregate information
- Use speech as a UI (when needed)
- Talk to us / listen to us
• But they can't:
- Language is complex, ambiguous, flexible, and subtle
- Good solutions need linguistics and machine learning knowledge
• So:

The mystery
• What's now impossible for computers (and any other species) to do is effortless for humans

What is NLP?
• Fundamental goal: deep understanding of broad language
- Not just string processing or keyword matching!

What is NLP?
• Computers use (analyze, understand, generate) natural language
• Text processing
- Lexical: tokenization, part of speech, head, lemmas
- Parsing and chunking
- Semantic tagging: semantic role, word sense
- Certain expressions: named entities
- Discourse: coreference, discourse segments
• Speech processing
- Phonetic transcription
- Segmentation (punctuation)
- Prosody

Why should you care?
• Tremendous progress in NLP
- An enormous amount of knowledge is now available in machine-readable form as natural language text
- Conversational agents are becoming an important form of human-computer communication
- Much of human-human communication is now mediated by computers
• The commercial world is blooming (see the sponsors @ EMNLP 2016)

What's the problem with images?
• Images can be represented as vectors of pixels (RGB values) and then can be (simply?) fed to a classification algorithm
• But for a simple image (256x256, one byte per pixel) there are 2^524,288 possible images
- There are about 10^24 stars in the universe
• Think also of the variations per class
- Fruit?
- Cats in different positions?

What is computer vision?
(Terminator 2)

Every picture tells a story
• The goal of computer vision is to write computer programs that can interpret images

Can computers match (or beat) human vision?
• Yes and no (but mostly no!)
- humans are much better at "hard" things
- computers can be better at "easy" things

How does it work?

Supervised learning

Classification
• In classification problems, each entity in some domain can be placed in one of a discrete set of categories: yes/no, friend/foe, good/bad/indifferent, blue/red/green, etc.
• Given a training set of labeled entities, develop a rule for assigning labels to entities in a test set
• Many variations on this theme:
- binary classification
- multi-category classification
- non-exclusive categories
- ranking
• Many criteria to assess rules and their predictions
- overall errors
- costs associated with different kinds of errors
- operating points

Representation of Objects
• Each object to be classified is represented as a pair (x, y):
- where x is a description of the object (see examples of data types in the following slides)
- where y is a label (assumed binary for now)
• Success or failure of a machine learning classifier often depends on choosing good descriptions of objects
- the choice of description can also be viewed as a learning problem, and indeed we'll discuss automated procedures for choosing descriptions in a later lecture
- but good human intuitions are often needed here

Data Types
• Vectorial data:
- physical attributes
- behavioral attributes
- context
- history
- etc.
• We'll assume for now that such vectors are explicitly represented in a table, but practically they can take different forms (e.g. kernels)

Data Types: text and hypertext

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Welcome to FairmontNET</title>
</head>
<STYLE type="text/css">
.stdtext {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; color: #1F3D4E;}
.stdtext_wh {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; color: WHITE;}
</STYLE>
<body leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" bgcolor="BLACK">
<TABLE cellpadding="0" cellspacing="0" width="100%" border="0">
<TR>
<TD width=50% background="/TFN/en/CDA/Images/common/labels/decorative_2px_blk.gif"> </TD>
<TD><img src="/TFN/en/CDA/Images/common/labels/decorative.gif"></td>
<TD width=50% background="/TFN/en/CDA/Images/common/labels/decorative_2px_blk.gif"> </TD>
</TR>
</TABLE>
<tr>
<td align="right" valign="middle"><IMG src="/TFN/en/CDA/Images/common/labels/centrino_logo_blk.gif"></td>
</tr>
</body>
</html>

Data Types: email

Return-path <[email protected]>
Received from relay2.EECS.Berkeley.EDU (relay2.EECS.Berkeley.EDU [169.229.60.28]) by imap4.CS.Berkeley.EDU (iPlanet Messaging Server 5.2 HotFix 1.16 (built May 14 2003)) with ESMTP id <[email protected]>; Tue, 08 Jun 2004 11:40:43 -0700 (PDT)
Received from relay3.EECS.Berkeley.EDU (localhost [127.0.0.1]) by relay2.EECS.Berkeley.EDU (8.12.10/8.9.3) with ESMTP id i58Ieg3N000927; Tue, 08 Jun 2004 11:40:43 -0700 (PDT)
Received from redbirds (dhcp-168-35.EECS.Berkeley.EDU [128.32.168.35]) by relay3.EECS.Berkeley.EDU (8.12.10/8.9.3) with ESMTP id i58IegFp007613; Tue, 08 Jun 2004 11:40:42 -0700 (PDT)
Date Tue, 08 Jun 2004 11:40:42 -0700
From Robert Miller <[email protected]>
Subject RE: SLT headcount = 25
In-reply-to <[email protected]>
To 'Randy Katz' <[email protected]>
Cc "'Glenda J. Smith'" <[email protected]>, 'Gert Lanckriet' <[email protected]>
Message-id <[email protected]>
MIME-version 1.0
X-MIMEOLE Produced By Microsoft MimeOLE V6.00.2800.1409
X-Mailer Microsoft Office Outlook, Build 11.0.5510
Content-type multipart/alternative; boundary="----=_NextPart_000_0033_01C44D4D.6DD93AF0"
Thread-index AcRMtQRp+R26lVFaRiuz4BfImikTRAA0wf3Q
the headcount is now 32.
----------------------------------------
Robert Miller, Administrative Specialist
University of California, Berkeley
Electronics Research Lab
634 Soda Hall #1776
Berkeley, CA 94720-1776
Phone: 510-642-6037 fax: 510-643-1289

Data Types: protein sequences

Data Types: sequences of Unix system calls

Data Types: network layout (graph)

Data Types: images

Example: Spam Filter
• Input: email
• Output: spam/ham
• Setup:
- Get a large collection of example emails, each labeled "spam" or "ham"
- Note: someone has to hand-label all this data
- Want to learn to predict labels of new, future emails
• Features: the attributes used to make the ham/spam decision
- Words: FREE!
- Text patterns: $dd, CAPS
- Non-text: SenderInContacts
- …
Example emails:
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99

Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
- Get a large collection of example images, each labeled with a digit
- Note: someone has to hand-label all this data
- Want to learn to predict labels of new, future digit images
• Features: the attributes used to make the digit decision
- Pixels: (6,8) = ON
- Shape patterns: NumComponents, AspectRatio, NumLoops
- …
• Current state of the art: human-level performance

Other Examples of Real-World Classification Tasks
• Fraud detection (input: account activity, classes: fraud / no fraud)
• Web page spam detection (input: HTML/rendered page, classes: spam/ham)
• Speech recognition and speaker recognition (input: waveform, classes: phonemes or words)
• Medical diagnosis (input: symptoms, classes: diseases)
• Automatic essay grading (input: document, classes: grades)
• Customer service email routing and foldering
• Link prediction in social networks
• Catalytic activity in drug design
• … many many more
• Classification is an important commercial technology

Training and Validation
• Data: labeled instances, e.g. emails marked spam/ham
- Training set
- Validation set
- Test set
• Training
• Evaluation
• Statistical issues
- Estimate parameters on the training set
- Tune hyperparameters on the validation set
- Report results on the test set
- Anything short of this yields over-optimistic claims
• Evaluation
- Many different metrics
- Ideally, the criteria used to train the classifier should be closely related to those used to evaluate the classifier
- Want a classifier which does well on test data
- Overfitting: fitting the training data very closely, but not generalizing well
- Error bars: want realistic (conservative) estimates of accuracy

Intuitive Picture of the Problem
• Two classes of points: Class 1 vs. Class 2

Linearly Separable Data
• A linear decision boundary separates Class 1 from Class 2

Nonlinearly Separable Data
• A non-linear classifier is needed to separate Class 1 from Class 2

Some Issues
• There may be a simple separator (e.g., a straight line in 2D or a hyperplane in general) or there may not
• There may be "noise" of various kinds
• There may be "overlap"
• One should not be deceived by one's low-dimensional geometrical intuition
• Some classifiers explicitly represent separators (e.g., straight lines), while for other classifiers the separation is done implicitly
• Some classifiers just make a decision as to which class an object is in; others estimate class probabilities

Methods
I) Instance-based methods:
1) Nearest neighbor
II) Probabilistic models:
1) Naïve Bayes
2) Logistic Regression
III) Linear models:
1) Perceptron
2) Support Vector Machine
IV) Decision models:
1) Decision Trees
2) Boosted Decision Trees
3) Random Forest

Nearest Neighbour Rule
• Non-parametric pattern classification. Consider a two-class problem where each sample consists of two measurements (x, y).
• k = 1: for a given query point q, assign the class of the nearest neighbour.
• k = 3: compute the k nearest neighbours and assign the class by majority vote.

Questions
• What distance measure to use?
- Often Euclidean distance is used
- Locally adaptive metrics
- More complicated with non-numeric data, or when different dimensions have different scales
• Choice of k?
- Cross-validation
- 1-NN often performs well in practice
- k-NN needed for overlapping classes
- Re-label all data according to k-NN, then classify with 1-NN
- Reduce the k-NN problem to 1-NN through dataset editing

Decision Regions
• Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample. A Voronoi diagram divides the space into such cells.
• Every query point will be assigned the classification of the sample within that cell. The decision boundary separates the class regions based on the 1-NN decision rule.
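The 1-NN / k-NN rule above fits in a few lines (Euclidean distance, majority vote; the toy training points are illustrative, not from the slides):

```python
import math
from collections import Counter

# k-nearest-neighbour classification: find the k training samples closest
# to the query point and take a majority vote over their labels.
def knn_predict(train, query, k=1):
    # train: list of ((x, y), label) pairs
    by_dist = sorted(train, key=lambda s: math.dist(s[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "class1"), ((1, 0), "class1"), ((0, 1), "class1"),
         ((5, 5), "class2"), ((6, 5), "class2"), ((5, 6), "class2")]

print(knn_predict(train, (0.5, 0.5), k=1))  # class1
print(knn_predict(train, (5.5, 5.5), k=3))  # class2
```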
• Knowledge of this boundary is sufficient to classify new points.
• The boundary itself is rarely computed; many algorithms seek to retain only those points necessary to generate an identical boundary.

Nearest Neighbor Classification…
• Scaling issues
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example:
- height of a person may vary from 1.5 m to 1.8 m
- weight of a person may vary from 90 lb to 300 lb
- income of a person may vary from $10K to $1M

Nearest Neighbor Classification…
• Problem with the Euclidean measure:
- High-dimensional data: curse of dimensionality
- Can produce counter-intuitive results, e.g.:

111111111110 vs. 011111111111: d = 1.4142
100000000000 vs. 000000000001: d = 1.4142

- Solution: normalize the vectors to unit length

k-NN summary
• Pros
- Can express a complex boundary (non-parametric)
- Very fast training
- Simple, but still good in practice (e.g. applications in computer vision)
- Somewhat interpretable by looking at the closest points
• Cons
- Large memory requirements for prediction
- Not the best accuracy among classifiers

Methods
I) Instance-based methods:
1) Nearest neighbor
II) Probabilistic models:
1) Naïve Bayes
2) Logistic Regression
III) Linear models:
1) Perceptron
2) Support Vector Machine
IV) Decision models:
1) Decision Trees
2) Boosted Decision Trees
3) Random Forest

Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:

P(C | A) = P(A, C) / P(A)
P(A | C) = P(A, C) / P(C)

• Bayes theorem:

P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem
• Given:
- A doctor knows that meningitis causes a stiff neck 50% of the time
- The prior probability of any patient having meningitis is 1/50,000
- The prior probability of any patient having a stiff neck is 1/20
• If a patient has a stiff neck, what's the probability he/she has meningitis?
P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Bayesian Classifiers
• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An)
- The goal is to predict class C
- Specifically, we want to find the value of C that maximizes P(C|A1, A2, …, An)
• Can we estimate P(C|A1, A2, …, An) directly from data?

• Approach:
- Compute the posterior probability P(C|A1, A2, …, An) for all values of C using Bayes theorem:
P(C|A1, A2, …, An) = P(A1, A2, …, An|C) P(C) / P(A1, A2, …, An)
- Choose the value of C that maximizes P(C|A1, A2, …, An)
- Equivalent to choosing the value of C that maximizes P(A1, A2, …, An|C) P(C)
• How to estimate P(A1, A2, …, An|C)?

Naïve Bayes Classifier
• Assume independence among the attributes Ai when the class is given:
- P(A1, A2, …, An|C) = P(A1|Cj) P(A2|Cj) … P(An|Cj)
- We can estimate P(Ai|Cj) for all Ai and Cj
- A new point is classified as Cj if P(Cj) Π P(Ai|Cj) is maximal

How to Estimate Probabilities from Data?
Training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Evade):

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Class prior: P(C) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai|Ck) = |Aik| / Nc, where |Aik| is the number of instances that have attribute value Ai and belong to class Ck
- Examples: P(Status=Married|No) = 4/7, P(Refund=Yes|Yes) = 0

• For continuous attributes:
- Discretize the range into bins
- one ordinal attribute per bin
- violates the independence assumption
- Two-way split: (A < v) or (A > v)
- choose only one of the two splits as the new attribute
- Probability density estimation:
- Assume the attribute follows a normal distribution
- Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
- Once the probability distribution is known, we can use it to estimate the conditional probability P(Ai|c)
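Both estimation routes, counting for discrete attributes and a fitted normal density for continuous ones, are easy to reproduce on the training table above. A minimal sketch (helper names are mine):

```python
import math

# Training data from the slides: (Refund, Marital Status, Taxable Income, Evade)
records = [("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
           ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
           ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
           ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
           ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes")]

def p_discrete(attr_idx, value, cls):
    """P(Ai = value | Class = cls) by simple counting.
    attr_idx 0 = Refund, 1 = Marital Status."""
    in_cls = [r for r in records if r[3] == cls]
    return sum(r[attr_idx] == value for r in in_cls) / len(in_cls)

def p_gaussian(value, cls):
    """Density of Income = value | Class = cls under a normal assumption,
    with mean and sample variance estimated from the data."""
    xs = [r[2] for r in records if r[3] == cls]
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)   # sample variance
    return math.exp(-(value - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

p_discrete(0, "No", "No")   # P(Refund=No | No) = 4/7
p_gaussian(120, "No")       # ≈ 0.0072, matching the slides
```

For class No, the income values give mean 110 and sample variance 2975, exactly the numbers used on the next slide.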
How to Estimate Probabilities from Data? (continuous case)
(The same training table as above; Taxable Income is the continuous attribute.)
• Normal distribution:
P(Ai|cj) = 1 / (√(2π) σij) · exp( −(Ai − μij)² / (2 σij²) )
- one pair (μij, σij) for each (Ai, cj) combination
• For (Income, Class=No):
- sample mean = 110, sample variance = 2975
P(Income=120|No) = 1 / (√(2π) · 54.54) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072

Example of Naïve Bayes Classifier
Given a test record: X = (Refund=No, Married, Income=120K)
P(Refund=Yes|No) = 3/7, P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0, P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7, P(Divorced|No) = 1/7, P(Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7, P(Divorced|Yes) = 1/7, P(Married|Yes) = 0
For taxable income: if class=No, sample mean = 110, sample variance = 2975; if class=Yes, sample mean = 90, sample variance = 25
● P(X|Class=No) = P(Refund=No|No) × P(Married|No) × P(Income=120K|No) = 4/7 × 4/7 × 0.0072 = 0.0024
● P(X|Class=Yes) = P(Refund=No|Yes) × P(Married|Yes) × P(Income=120K|Yes) = 1 × 0 × 1.2×10⁻⁹ = 0
Since P(X|No) P(No) > P(X|Yes) P(Yes), we have P(No|X) > P(Yes|X) => Class = No

Naïve Bayes Classifier
• If one of the conditional probabilities is zero, the entire expression becomes zero
• Probability estimation alternatives:
Original: P(Ai|C) = Nic / Nc
Laplace: P(Ai|C) = (Nic + 1) / (Nc + c)
m-estimate: P(Ai|C) = (Nic + m·p) / (Nc + m)
where c = number of classes, p = prior probability, m = parameter

Example of Naïve Bayes Classifier
[Table: 20 animals (human, python, salmon, whale, frog, komodo, bat, pigeon, cat, leopard shark, turtle, penguin, porcupine, eel, salamander, gila monster, platypus, owl, dolphin, eagle) with attributes Give Birth, Can Fly, Live in Water, Have Legs and class mammals / non-mammals.]
Test record: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no. With A the attributes, M = mammals, N = non-mammals:
P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A|M) P(M) = 0.06 × 7/20 = 0.021
P(A|N) P(N) = 0.0042 × 13/20 = 0.0027
P(A|M) P(M) > P(A|N) P(N) => Mammals

Naïve Bayes Summary
• Robust to isolated noise points
• Handles missing values by ignoring the instance during probability-estimate calculations
• Robust to irrelevant attributes
• The independence assumption may not hold for some attributes
- Use other techniques such as Bayesian Belief Networks (BBN)
• Naïve Bayes can produce a probability estimate, but it is usually a very biased one
- Logistic Regression is better for obtaining probabilities

Methods (recap): next up, linear models: Support Vector Machine

Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
• One possible solution: B1; another possible solution: B2; many other solutions exist
• Which one is better, B1 or B2? How do you define "better"?
• Find the hyperplane that maximizes the margin => B1 is better than B2

• The boundary and the margin are defined by:
w·x + b = 0, w·x + b = +1, w·x + b = −1
f(x) = 1 if w·x + b ≥ 1; f(x) = −1 if w·x + b ≤ −1
Margin = 2 / ||w||
Support Vector Machines
• We want to maximize the margin 2 / ||w||
- This is equivalent to minimizing L(w) = ||w||² / 2
- subject to the constraints:
f(xi) = 1 if w·xi + b ≥ 1; f(xi) = −1 if w·xi + b ≤ −1
- This is a constrained optimization problem
- Numerical approaches exist to solve it (e.g., quadratic programming)

• What if the problem is not linearly separable?
- Introduce slack variables ξi
- We then need to minimize: L(w) = ||w||² / 2 + C · Σi=1..N ξi^k
- subject to:
f(xi) = 1 if w·xi + b ≥ 1 − ξi; f(xi) = −1 if w·xi + b ≤ −1 + ξi

Nonlinearly Separable Data
• Slack variables allow some instances to fall within the margin, but penalize them
• Robustness: a soft-margin SVM tolerates such instances; a hard-margin SVM does not

Nonlinear Support Vector Machines
• What if a linear boundary is simply not good enough?
• Transform the data into a higher-dimensional space
• Linear decision surfaces have clear disadvantages on such data; nonlinear surfaces fit better
• Linear classifiers in high-dimensional spaces: find a function Φ(x) that maps the data to a different space

Mapping Data to a High-Dimensional Space
• Find a function Φ(x) to map to a different space; the SVM formulation then operates on Φ(x), and the weights w are weights in the new space
• Explicit mapping is expensive if Φ(x) is very high dimensional
• Solving the problem without explicitly mapping the data is desirable

The Kernel Trick
• Φ(xi)·Φ(xj) means: map the data into the new space, then take the inner product of the new vectors
• We can find a function such that K(xi, xj) = Φ(xi)·Φ(xj), i.e., the image of the inner product of the data is the inner product of the images of the data
• Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem

Example
wᵀΦ(x) + b = 0; X = [x z]; Φ(X) = [x² z² xz]
f(x) = sign(w1 x² + w2 z² + w3 xz + b)

Example
X1 = [x1 z1], X2 = [x2 z2]
Φ(X1) = [x1² z1² √2·x1z1], Φ(X2) = [x2² z2² √2·x2z2]
Φ(X1)ᵀΦ(X2) = x1²x2² + z1²z2² + 2·x1z1x2z2 = (x1x2 + z1z2)² = (X1ᵀX2)²
Computing Φ(X1)ᵀΦ(X2) directly is expensive, O(d²); computing (X1ᵀX2)² is efficient, O(d)

Kernel Trick
• Kernel function: a symmetric function k: R^d × R^d → R
• Inner-product kernels additionally satisfy k(x, z) = Φ(x)ᵀΦ(z)
• Example: as above, the kernel evaluates in O(d) what would cost O(d²) in the mapped space

Kernel Trick
• Implements an infinite-dimensional mapping implicitly
• Only inner products are explicitly needed for training and evaluation
• Inner products are computed efficiently, in finite dimensions
• The underlying mathematical theory is that of reproducing kernel Hilbert spaces, from functional analysis

Kernel Methods
• If a linear algorithm can be expressed only in terms of inner products
- it can be "kernelized"
- it finds a linear pattern in the high-dimensional space
- which is a nonlinear relation in the original space
• The specific kernel function determines the nonlinearity

Kernels
• Some simple kernels
- Linear kernel: k(x, z) = xᵀz => equivalent to the linear algorithm
- Polynomial kernel: k(x, z) = (1 + xᵀz)^d => polynomial decision rules
- RBF kernel: k(x, z) = exp(−||x − z||² / (2σ²)) => highly nonlinear decisions

Gaussian Kernel: Example
[figure: a nonlinear boundary in input space corresponds to a hyperplane in some feature space]

Kernel Matrix
• The kernel matrix K, with Kij = k(xi, xj), defines all pairwise inner products
• Mercer's theorem: K is positive semidefinite
• Any symmetric positive semidefinite matrix can be regarded as an inner-product matrix in some space

Kernel-Based Learning
• Data {(xi, yi)} => embedding via k(x, y) or the matrix K => linear algorithm
• Kernel design plus a kernel algorithm together form the learning method

Methods (recap): next up, decision models: Decision Trees

Example of a Decision Tree
Training data (class: Cheat):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: a decision tree over the splitting attributes:
- Refund = Yes => NO
- Refund = No => MarSt:
- Single, Divorced => TaxInc: < 80K => NO; > 80K => YES
- Married => NO

How to Specify the Test Condition?
• Depends on the attribute type: nominal, ordinal, continuous
• Depends on the number of ways to split: 2-way split, multi-way split

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. CarType ∈ {Family, Sports, Luxury}
• Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g. {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}

Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Size ∈ {Small, Medium, Large}
• Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g. {Small, Medium} vs {Large}, or {Medium, Large} vs {Small}
• What about the split {Small, Large} vs {Medium}? (It violates the order.)

Splitting Based on Continuous Attributes
• Different ways of handling
- Discretization to form an ordinal categorical attribute
- Static: discretize once at the beginning
- Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
- Binary decision: (A < v) or (A ≥ v)
- consider all possible splits and find the best cut
- can be more compute intensive
• Example: binary split "Taxable Income > 80K?" vs multi-way split "Taxable Income ∈ {<10K, [10K,25K), [25K,50K), [50K,80K), >80K}"

Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion
• Issues
- Determine how to split the records
- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting

How to Determine the Best Split
Before splitting: 10 records of class C0, 10 records of class C1. Candidate splits:
- Own Car? Yes => (C0: 6, C1: 4); No => (C0: 4, C1: 6)
- Car Type? Family => (C0: 1, C1: 3); Sports => (C0: 8, C1: 0); Luxury => (C0: 1, C1: 7)
- Student ID? c1…c10 => (C0: 1, C1: 0) each; c11…c20 => (C0: 0, C1: 1) each
Which test condition is the best?
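One common way to answer that question, though the slide does not name a specific measure here, is to score each candidate split by an impurity criterion such as the Gini index. A minimal sketch on the counts above (names are mine):

```python
def gini(counts):
    """Gini impurity of a node from its class counts: 1 - sum_i p_i^2.
    0 means a pure node; 0.5 is the worst case for two classes."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def weighted_gini(children):
    """Impurity of a split: size-weighted average of child impurities."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# "Own Car?" split from the slide: Yes -> (C0=6, C1=4), No -> (C0=4, C1=6)
own_car = weighted_gini([(6, 4), (4, 6)])
# "Car Type?" split: Family -> (1, 3), Sports -> (8, 0), Luxury -> (1, 7)
car_type = weighted_gini([(1, 3), (8, 0), (1, 7)])
car_type < own_car   # the Car Type split leaves purer children
```

The Car Type split scores 0.1625 against 0.48 for Own Car, so a greedy induction algorithm would prefer it. (Student ID would score 0, which is exactly why such high-cardinality splits overfit.)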
How to Determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
- (C0: 5, C1: 5): non-homogeneous, high degree of impurity
- (C0: 9, C1: 1): homogeneous, low degree of impurity

Ensemble Methods: Increasing the Accuracy
• Ensemble methods
- Use a combination of models to increase accuracy
- Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
• Popular ensemble methods
- Bagging: averaging the prediction over a collection of classifiers
- Boosting: weighted vote with a collection of classifiers
- Ensemble: combining a set of heterogeneous classifiers

Bagging and Randomised Trees
• Bagging: combine trees grown from "bootstrap" samples (i.e., re-sample the training data with replacement)
• Randomised Trees (Random Forest: trademark L. Breiman, A. Cutler): combine trees grown with
- a random bootstrap (or subset) of the training data only
- at each node, only a random subset of variables considered for the split
- no pruning!
• These combined classifiers work surprisingly well, are very stable, and are almost perfect "out of the box" classifiers

Random Forest
• Randomly sample 2/3 of the data for each tree
• At each node: pick randomly a small number m of input variables to split on
• Classify by vote over all trees
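The bootstrap-and-vote scheme can be sketched with one-dimensional decision stumps. This toy illustration (all names are mine) omits the per-node feature subsampling of a real random forest, so it is plain bagging:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Re-sample the training set with replacement (same size)."""
    return [rng.choice(data) for _ in data]

def bagged_predict(models, x):
    """Combine an ensemble of classifiers by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Toy 1-D data: label is 1 for points above the threshold 5
data = [(x, int(x > 5)) for x in range(10)]

def train_stump(sample):
    """Fit a threshold classifier x -> int(x > t) on a bootstrap sample."""
    best_t, best_err = 0, len(sample) + 1
    for t in range(10):
        err = sum(int(x > t) != y for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return lambda x, t=best_t: int(x > t)

rng = random.Random(0)
forest = [train_stump(bootstrap(data, rng)) for _ in range(25)]
bagged_predict(forest, 8)   # majority vote -> 1
bagged_predict(forest, 2)   # majority vote -> 0
```

Each stump sees a slightly different bootstrap sample, so individual thresholds vary; the majority vote smooths that variance out, which is the point of bagging.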
• Use the out-of-bag samples to:
- estimate the error
- choose m
- estimate variable importance

Boosting
• Train a classifier C(0)(x) on the training sample; re-weight the sample and train C(1)(x); re-weight again and train C(2)(x), and so on up to C(m)(x)
• The final classifier is a weighted combination:
y(x) = Σi wi C(i)(x)

Adaptive Boosting (AdaBoost)
• AdaBoost re-weights events misclassified by the previous classifier by (1 − ferr) / ferr, with ferr = misclassified events / all events
• AdaBoost also weights the classifiers themselves by the error rate of the individual classifier:
y(x) = Σi log( (1 − ferr(i)) / ferr(i) ) · C(i)(x)

Tricks and Evaluation

Let's tie classification to text
• Representations of text are usually very high dimensional
- "The curse of dimensionality"
• High-bias algorithms should generally work best in high-dimensional space
- They prevent overfitting
- They generalize more
• For most text categorization tasks, there are many relevant features and many irrelevant ones

Which classifier do I use for a given (text) classification problem?
• Is there a learning method that is optimal for all (text) classification problems?
- No, because there is a tradeoff between bias and variance.
• Factors to take into account:
- How much training data is available?
- How simple/complex is the problem? (linear vs. nonlinear decision boundary)
- How noisy is the data?
- How stable is the problem over time?
- For an unstable problem, it's better to use a simple and robust classifier.

Manually written rules
• No training data, but adequate editorial staff?
• Never forget the hand-written rules solution!
- If (wheat or grain) and not (whole or bread) then categorize as grain
• In practice, rules get a lot bigger than this
- Can also be phrased using tf or tf.idf weights
• With careful crafting (human tuning on development data) performance is high:
- Construe: 94% recall, 84% precision over 675 categories (Hayes and Weinstein 1990)
• The amount of work required is huge
- Estimate 2 days per class… plus maintenance

Very little data?
• If you're just doing supervised classification, you should stick to something high bias
- There are theoretical results that Naïve Bayes should do well in such circumstances (Ng and Jordan 2002, NIPS)
• The interesting theoretical answer is to explore semi-supervised training methods:
- Bootstrapping, EM over unlabeled documents, …
• The practical answer is to get more labeled data as soon as you can
- How can you insert yourself into a process where humans will be willing to label data for you?

A reasonable amount of data?
• We can use all our clever classifiers
• "Roll out the SVM!"
• But if you are using an SVM, NB, etc., you should probably be prepared with the "hybrid" solution where there is a Boolean overlay
- Or else use user-interpretable Boolean-like models such as decision trees
- Users like to hack, and management likes to be able to implement quick fixes immediately

A huge amount of data?
• This is great in theory for doing accurate classification…
• But it could easily mean that expensive methods like SVMs (train time) or kNN (test time) are quite impractical
• Naïve Bayes can come back into its own again!
- Or other methods with linear training/test complexity such as regularized logistic regression (though much more expensive to train)

How many categories?
• A few (well separated ones)?
- Easy!
• A zillion closely related ones?
- Think: Yahoo! Directory, Library of Congress classification, legal applications
- Quickly gets difficult!
- Classifier combination is always a useful technique
- Voting, bagging, or boosting multiple classifiers
- Much literature on hierarchical classification
- Mileage fairly unclear, but it helps a bit (Tie-Yan Liu et al. 2005)
- Definitely helps for scalability, even if not for accuracy
- May need a hybrid automatic/manual solution

Model Evaluation and Selection
• Evaluation metrics: how can we measure accuracy? Other metrics to consider?
• Use a validation test set of class-labeled tuples instead of the training set when assessing accuracy
• Methods for estimating a classifier's accuracy:
- Holdout method, random subsampling
- Cross-validation
- Bootstrap
• Comparing classifiers:
- Confidence intervals
- Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Confusion Matrix for 2 classes

Actual \ Predicted   C1                    ¬C1
C1                   True Positives (TP)   False Negatives (FN)
¬C1                  False Positives (FP)  True Negatives (TN)

Example confusion matrix:

Actual \ Predicted   buy_computer=yes  buy_computer=no  Total
buy_computer=yes     6954              46               7000
buy_computer=no      412               2588             3000
Total                7366              2634             10000

Classifier Evaluation Metrics: Confusion Matrix for more classes
• The (i, j) entry counts the documents actually in class i that were put in class j by the classifier (e.g., 53)
• In a perfect classification, only the diagonal has non-zero entries
• Look at common confusions and how they might be addressed

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
• Classifier accuracy, or recognition rate: percentage of test-set tuples that are correctly classified
Accuracy = (TP + TN) / All
• Error rate: 1 − accuracy, or Error rate = (FP + FN) / All
• Class imbalance problem: one class may be rare, e.g., fraud or HIV-positive, with a significant majority of the negative class and a minority of the positive class
- Sensitivity: true positive recognition rate, Sensitivity = TP / P
- Specificity: true negative recognition rate, Specificity = TN / N

Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision: exactness, i.e., what % of the tuples that the classifier labeled as positive are actually positive? Precision = TP / (TP + FP)
• Recall: completeness, i.e., what % of the positive tuples did the classifier label as positive? Recall = TP / (TP + FN)
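All of these metrics follow directly from the four cells of the 2-class confusion matrix; a minimal sketch using the buy_computer numbers above (the function name is mine):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from a 2-class confusion matrix."""
    accuracy  = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)          # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# buy_computer example from the confusion-matrix slide
acc, prec, rec, f1 = metrics(tp=6954, fn=46, fp=412, tn=2588)
# acc = 0.9542, prec = 6954/7366 ≈ 0.944, rec = 6954/7000 ≈ 0.993
```

Note that accuracy alone looks excellent here even though the negative class is three times rarer; precision and recall expose where the 412 + 46 errors actually fall.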
• A perfect score is 1.0
• There is an inverse relationship between precision and recall
• F-measure (F1 or F-score): harmonic mean of precision and recall,
F1 = 2 · Precision · Recall / (Precision + Recall)
• Fβ: weighted measure of precision and recall
- assigns β times as much weight to recall as to precision

Classifier Evaluation Metrics: Example

Actual \ Predicted  cancer=yes  cancer=no  Total  Recognition (%)
cancer=yes          90          210        300    30.00 (sensitivity)
cancer=no           140         9560       9700   98.56 (specificity)
Total               230         9770       10000  96.40 (accuracy)

Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%

Micro- vs. Macro-Averaging
• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: compute the performance for each class, then average.
• Microaveraging: collect the decisions for all classes, compute one contingency table, evaluate.

Micro- vs. Macro-Averaging: Example
- Class 1: classifier yes => (truth yes: 10, truth no: 10); classifier no => (truth yes: 10, truth no: 970)
- Class 2: classifier yes => (truth yes: 90, truth no: 10); classifier no => (truth yes: 10, truth no: 890)
- Micro-average table: classifier yes => (100, 20); classifier no => (20, 1860)
• Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
• Microaveraged precision: 100/120 ≈ 0.83
• The microaveraged score is dominated by the score on common classes

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
- The given data is randomly partitioned into two independent sets
- Training set (e.g., 2/3) for model construction
- Test set (e.g., 1/3) for accuracy estimation
- Random subsampling: a variation of holdout; repeat the holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold; k = 10 is most popular)
- Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
- At the i-th iteration, use Di as the test set and the others as the training set
- Leave-one-out: k folds where k = # of tuples, for small-sized data
- Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data

Estimating Confidence Intervals: Classifier Models M1 vs. M2
• Suppose we have two classifiers, M1 and M2; which one is better?
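Obtaining the per-fold error rates err(M)_i needed for such a comparison can be sketched as follows; this is a minimal illustration (function names are mine), shown with a trivial majority-class "model":

```python
import random

def kfold_indices(n, k=10, seed=1):
    """Randomly partition indices 0..n-1 into k (nearly) equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error_rates(data, train_fn, k=10):
    """Per-fold error rates: train on k-1 folds, test on the held-out one."""
    folds = kfold_indices(len(data), k)
    errs = []
    for held_out in folds:
        train = [data[j] for f in folds if f is not held_out for j in f]
        model = train_fn(train)
        errs.append(sum(model(x) != y
                        for x, y in (data[j] for j in held_out)) / len(held_out))
    return errs

# Toy example: 100 labelled points, 30 of them positive
data = [(i, int(i < 30)) for i in range(100)]

def train_majority(train):
    """Always predict the majority class of the training fold."""
    ones = sum(y for _, y in train)
    label = int(ones > len(train) / 2)
    return lambda x: label

errs = cv_error_rates(data, train_majority)
sum(errs) / len(errs)   # mean error rate of the majority baseline, 0.3 here
```

For a paired comparison of M1 and M2, the same fold partition would be reused for both models so that err(M1)_i and err(M2)_i are computed on identical splits.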
• Use 10-fold cross-validation to obtain the mean error rates of M1 and M2
• These mean error rates are just estimates of the error on the true population of future data cases
• What if the difference between the two error rates is just attributed to chance?
- Use a test of statistical significance
- Obtain confidence limits for our error estimates

Estimating Confidence Intervals: Null Hypothesis
• Perform 10-fold cross-validation
• Assume the samples follow a t-distribution with k − 1 degrees of freedom (here, k = 10)
• Use the t-test (Student's t-test)
• Null hypothesis: M1 and M2 are the same
• If we can reject the null hypothesis, then
- we conclude that the difference between M1 and M2 is statistically significant
- choose the model with the lower error rate

Estimating Confidence Intervals: t-test
• If only one test set is available: pairwise comparison
- For the i-th round of 10-fold cross-validation, the same cross-partitioning is used to obtain err(M1)i and err(M2)i
- Average over the 10 rounds to get the mean errors, then compute the t-statistic with k − 1 degrees of freedom
• If two test sets are available: use a non-paired t-test, where k1 and k2 are the numbers of cross-validation samples used for M1 and M2, respectively

Estimating Confidence Intervals: Table for t-distribution
• The distribution is symmetric
• Significance level: e.g., sig = 0.05 or 5% means M1 and M2 are significantly different for 95% of the population
• Confidence limit: z = sig/2

Estimating Confidence Intervals: Statistical Significance
• Are M1 and M2 significantly different?
- Compute t. Select a significance level (e.g., sig = 5%)
- Consult the table for the t-distribution: find the t value corresponding to k − 1 degrees of freedom (here, 9)
- The t-distribution is symmetric: typically the upper % points of the distribution are shown => look up the value for the confidence limit z = sig/2 (here, 0.025)
- If t > z or t < −z, then the t value lies in the rejection region:
- reject the null hypothesis that the mean error rates of M1 and M2 are the same
- conclude that there is a statistically significant difference between M1 and M2
- Otherwise, conclude that any difference is due to chance

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The vertical axis represents the true positive rate; the horizontal axis, the false positive rate; the plot also shows a diagonal line
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• A model with perfect accuracy will have an area of 1.0; the closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
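The ranking procedure described above translates directly into code: sort the test tuples by decreasing score, sweep the threshold down the list while accumulating (FPR, TPR) points, and integrate by trapezoids. A minimal sketch assuming distinct scores (names are mine):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive).
    Assumes distinct scores; tied scores would need grouped handling."""
    pairs = sorted(zip(scores, labels), reverse=True)   # most-likely first
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]                               # (FPR, TPR) curve
    for _, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    # trapezoidal integration along the FPR axis
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

perfect = roc_auc([0.9, 0.8, 0.7, 0.2, 0.1], [1, 1, 1, 0, 0])   # area 1.0
chancy  = roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])           # area 0.75
```

A ranking that places every positive above every negative yields an area of 1.0, matching the slide's statement about a model with perfect accuracy.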