Data Mining Workshop
Aachen, 23-25 November 2016
Gerasimos (Jerry) Spanakis, PhD
http://dke.maastrichtuniversity.nl/jerry.spanakis
@gerasimoss
Who am I?
2001-2006: Diploma (MSc) in Electrical & Computer Engineering (Neural Networks and Reinforcement Learning)
2007-2012: Diploma (MSc) in Civil (Transport) Engineering (Machine learning in transport optimization problems)
2007-2012: PhD in Computational Intelligence (Intelligent techniques in text analysis)
2012-2013: Army service: maintenance, update, development for the army personnel software system
2013-2014: Lecturer at Technical University of Crete; Scientific Associate at Technological Institute of Crete
2014-2016: Postdoctoral Researcher at Maastricht University, Department of Data Science & Knowledge Engineering
2016: Visiting researcher at University of Alberta (Recommender Systems for Higher Education)
2016-: Assistant Professor at Maastricht University, Department of Data Science & Knowledge Engineering
Who am I (really)?
•  Machine learner & data miner
•  Coffee addict
•  TV series binge-watcher
•  Aviation enthusiast
•  Gin & Tonic lover
•  Proud geek
Outline
•  Data mining: Why has the world gone crazy?
   -  Data preparation
•  Different forms of data & learning
   -  Classification
   -  Clustering, Dimensionality reduction
   -  Recommender Systems
   -  Topic modeling for texts
•  Deep learning "hype"
   -  How does it work?
(Some slide credits to: sources shown on the slide)
Scheduling
•  Mornings: "Lectures", Afternoons: Practicals
   -  Can be adjusted…
•  Day 1: Introduction, Data Pre-processing, Supervised learning: Classification
•  Day 2: Unsupervised learning: Clustering, Dimensionality Reduction, Recommender Systems, Topic modeling for texts
•  Day 3: Deep learning for text & images
https://dke.maastrichtuniversity.nl/jerry.spanakis/aachen16/

Let's start…
•  How many of you have already applied data mining?
•  How many of you have heard words like classification, clustering, matrix factorization, SVM, accuracy, LDA, …?
•  How many of you have worked with R?
What is data science or data mining?
•  Learning = Representation + Evaluation + Optimization [Joel Grus, 2016]
Representation: Choice of model & hypothesis space
Evaluation: Choice of objective function
Optimization: The algorithm to compute the best model

We mine data every day
•  Yes, it works ☺
•  Spam detection
•  Credit card fraud detection
•  Digit recognition
•  Voice recognition
•  Face detection
•  Stock trading
•  Games
Why does it work?
•  Supervised learning techniques
(Slide example: images labeled "Cars" and "Motorcycles", plus a query image: "What is this?")

Why does it work?
•  Lots of (labeled) data

Why does it work?
•  Computational power
"The number of people predicting the death of Moore's law doubles every two years"
-- Peter Lee, VP, Microsoft Research
How is computer perception done?

Case Study: Bank
•  Business goal: Sell more home equity loans
•  Current models:
   -  Customers with college-age children use home equity loans to pay for tuition
   -  Customers with variable income use home equity loans to even out their stream of income
•  Data:
   -  Large data warehouse
   -  Consolidates data from 42 operational data sources
Case Study: Bank (Contd.)
1.  Select a subset of customer records who have received the home equity loan offer
   -  Customers who declined
   -  Customers who signed up

   Income    Number of Children   Average Checking Account Balance   …   Response
   $40,000   2                    $1500                              …   Yes
   $75,000   0                    $5000                              …   No
   $50,000   1                    $3000                              …   No
   …         …                    …                                  …   …
Case Study: Bank (Contd.)
2.  Find rules to predict whether a customer would respond to the home equity loan offer
   IF (Salary < 40k) and
      (numChildren > 0) and
      (ageChild1 > 18 and ageChild1 < 22)
   THEN YES
   …

Case Study: Bank (Contd.)
3.  Group customers into clusters and investigate clusters
(Slide figure: scatter plot with four customer clusters, Groups 1-4)

Case Study: Bank (Contd.)
4.  Evaluate results:
   -  Many "uninteresting" clusters
   -  One interesting cluster! Customers with both business and personal accounts; unusually high percentage of likely respondents
What is a Data Mining Model?
A data mining model is a description of a certain aspect of a dataset. It produces output values for an assigned set of inputs.
Examples:
•  Clustering
•  Linear regression model
•  Classification model
•  Frequent itemsets and association rules
•  Support Vector Machines
•  …
Learning from data is difficult
Supervised learning:
•  It has nothing to do with human learning
•  Transition to open but big data
Unsupervised learning:
•  Exploratory data analysis: the goal is not clear
•  Difficult to assess performance
•  High-dimensional data
Our goal: Vector representations

   Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
   Jack   M        1       0       1        0        0        0
   Mary   F        1       0       1        0        1        0
   Jim    M        1       1       0        0        0        0

•  Discrete vs. Continuous data
•  Numeric, Binary, Ordinal features
•  Different distance measures, e.g. between $x = (x_1\ x_2 \cdots x_n)$ and $y = (y_1\ y_2 \cdots y_n)$:

   $d(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_n - y_n|^p \right)^{1/p}, \quad p > 0$
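A minimal sketch of this distance in plain Python (the function name and the use of the table's rows as example vectors are illustrative, not from the slides):

   def minkowski_distance(x, y, p):
       """Minkowski distance between two equal-length numeric vectors.
       p = 1 gives Manhattan distance, p = 2 gives Euclidean distance."""
       assert len(x) == len(y) and p > 0
       return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

   # Jack and Mary from the table above (binary symptom/test features):
   jack = [1, 0, 1, 0, 0, 0]
   mary = [1, 0, 1, 0, 1, 0]
   print(minkowski_distance(jack, mary, p=2))  # 1.0: they differ only on Test-3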
Why Data Preprocessing?
•  Data in the real world is dirty
   -  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      -  e.g., occupation = ""
   -  noisy: containing errors or outliers
      -  e.g., Salary = "-10"
   -  inconsistent: containing discrepancies in codes or names
      -  e.g., Age = "42", Birthday = "03/07/1997"
      -  e.g., Was rating "1,2,3", now rating "A,B,C"
      -  e.g., discrepancy between duplicate records

Why Is Data Dirty?
•  Incomplete data may come from
   -  "Not applicable" data value when collected
   -  Different considerations between the time when the data was collected and when it is analyzed
   -  Human/hardware/software problems
•  Noisy data (incorrect values) may come from
   -  Faulty data collection instruments
   -  Human or computer error at data entry
   -  Errors in data transmission
•  Inconsistent data may come from
   -  Different data sources
   -  Functional dependency violation (e.g., modify some linked data)
•  Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
•  No quality data, no quality mining results!
•  "Garbage in, garbage out"
   -  Quality decisions must be based on quality data
      -  e.g., duplicate or missing data may cause incorrect or even misleading statistics
   -  A data warehouse needs consistent integration of quality data
•  Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
•  You may assume that in a data mining problem 80% of the work is the preparation of the data

Major Tasks in Data Preprocessing
•  Data cleaning
   -  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
•  Data integration
   -  Integration of multiple databases, data cubes, or files
•  Data transformation
   -  Normalization and aggregation
•  Data reduction
   -  Obtains a reduced representation in volume that produces the same or similar analytical results
•  Data discretization
   -  Part of data reduction but of particular importance, especially for numerical data
Mining Data Descriptive Characteristics
•  Motivation
   -  To better understand the data: central tendency, variation and spread
•  Data dispersion characteristics
   -  median, max, min, quantiles, outliers, variance, etc.
•  Numerical dimensions correspond to sorted intervals
   -  Data dispersion: analyzed with multiple granularities of precision
   -  Boxplot or quantile analysis on sorted intervals
•  Dispersion analysis on computed measures
   -  Folding measures into numerical dimensions
   -  Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
•  Mean (algebraic measure) (sample vs. population):

   $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$   and   $\mu = \frac{\sum x}{N}$

   -  Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
   -  Trimmed mean: chopping extreme values
•  Median: A holistic measure
   -  Middle value if odd number of values, or average of the middle two values otherwise
   -  Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c$
•  Mode
   -  Value that occurs most frequently in the data
   -  Unimodal, bimodal, trimodal
   -  Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
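A quick sketch of these measures with Python's standard library (the sample data is invented for illustration):

   import statistics

   data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 70, 110]

   mean = statistics.mean(data)      # arithmetic mean
   median = statistics.median(data)  # middle value (or average of the middle two)
   mode = statistics.mode(data)      # most frequent value: 70
   trimmed = statistics.mean(sorted(data)[1:-1])  # trimmed mean: chop one extreme on each side
   print(mean, median, mode, trimmed)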
Symmetric vs. Skewed Data
•  Median, mean and mode of symmetric, positively and negatively skewed data
(Slide figure: three density curves showing the relative positions of mode, median and mean)
Measuring the Dispersion of Data
•  Quartiles, outliers and boxplots
   -  Quartiles: Q1 (25th percentile), Q3 (75th percentile)
   -  Inter-quartile range: IQR = Q3 - Q1
   -  Five number summary: min, Q1, M(edian), Q3, max
   -  Boxplot: the ends of the box are the quartiles, the median is marked, add whiskers, and plot outliers individually
   -  Outlier: usually, a value higher/lower than 1.5 x IQR
•  Variance and standard deviation (sample: s, population: σ)
   -  Variance (algebraic, scalable computation):

      $\sigma^2 = \frac{1}{N} \sum_{i=1}^{n} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{n} x_i^2 - \mu^2$

      $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$

   -  Standard deviation s (or σ) is the square root of variance s² (or σ²)
Properties of Normal Distribution Curve
•  The normal (distribution) curve
   -  From μ-σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
   -  From μ-2σ to μ+2σ: contains about 95% of it
   -  From μ-3σ to μ+3σ: contains about 99.7% of it

Boxplot Analysis
•  Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
•  Boxplot
   -  Data is represented with a box
   -  The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
   -  The median is marked by a line within the box
   -  Whiskers: two lines outside the box extend to Minimum and Maximum

Visualization of Data Dispersion: Boxplot Analysis
(Slide figure: boxplots visualizing data dispersion)
Data Cleaning
•  Importance
   -  "Data cleaning is one of the three biggest problems in data warehousing" -- Ralph Kimball
   -  "Data cleaning is the number one problem in data-related tasks" -- Unknown programmer
•  Data cleaning tasks
   -  Fill in missing values
   -  Identify outliers and smooth out noisy data
   -  Correct inconsistent data
   -  Resolve redundancy caused by data integration

Missing Data
•  Data is not always available
   -  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
•  Missing data may be due to
   -  equipment malfunction
   -  inconsistency with other recorded data (and thus deleted)
   -  data not entered due to misunderstanding
   -  certain data may not be considered important at the time of entry
   -  not registering history or changes of the data
•  Missing data may need to be inferred.
How to Handle Missing Data?
•  Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
•  Fill in the missing value manually: tedious + infeasible?
•  Fill it in automatically with (see the sketch below)
   -  a global constant: e.g., "unknown", a new class?!
   -  the attribute mean
   -  the attribute mean for all samples belonging to the same class: smarter
   -  the most probable value: inference-based, such as a Bayesian formula or a decision tree
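A minimal sketch of the mean-based fill-in strategies using pandas (the column names and values are invented for illustration):

   import pandas as pd

   df = pd.DataFrame({
       "income": [40.0, None, 50.0, 75.0, None, 60.0],
       "class":  ["yes", "yes", "no", "no", "yes", "no"],
   })

   # (a) fill with the attribute mean
   df["income_mean"] = df["income"].fillna(df["income"].mean())

   # (b) fill with the attribute mean of samples from the same class (smarter)
   df["income_class_mean"] = df.groupby("class")["income"].transform(
       lambda s: s.fillna(s.mean())
   )
   print(df)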
Noisy Data
•  Noise: random error or variance in a measured variable
•  Incorrect attribute values may be due to
   -  faulty data collection instruments
   -  data entry problems
   -  data transmission problems
   -  technology limitations
   -  inconsistency in naming conventions
•  Other data problems which require data cleaning
   -  duplicate records
   -  incomplete data
   -  inconsistent data
How to Handle Noisy Data?
•  Binning
   -  first sort data and partition into (equal-frequency) bins
   -  then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc. (see the sketch below)
•  Regression
   -  smooth by fitting the data into regression functions
•  Clustering
   -  detect and remove outliers
•  Combined computer and human inspection
   -  detect suspicious values and check by a human (e.g., deal with possible outliers)
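A small sketch of equal-frequency binning with smoothing by bin means (plain Python; the data and function name are illustrative):

   def smooth_by_bin_means(values, n_bins):
       """Equal-frequency binning followed by smoothing: every value is
       replaced by the mean of the bin it falls into."""
       data = sorted(values)
       size = len(data) // n_bins
       smoothed = []
       for b in range(n_bins):
           bin_vals = data[b * size:(b + 1) * size] if b < n_bins - 1 else data[b * size:]
           mean = sum(bin_vals) / len(bin_vals)
           smoothed.extend([mean] * len(bin_vals))
       return smoothed

   print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
   # -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]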
Data Transformation
•  Smoothing: remove noise from data
•  Aggregation: summarization, data cube construction
•  Generalization: concept hierarchy climbing
•  Normalization: scaled to fall within a small, specified range
   -  min-max normalization
   -  z-score normalization
   -  normalization by decimal scaling
•  Attribute/feature construction
   -  New attributes constructed from the given ones
Data Transformation: Normalization
•  Min-max normalization: to [new_min_A, new_max_A]

   $v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$

   -  Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
•  Z-score normalization (μ: mean, σ: standard deviation):

   $v' = \frac{v - \mu_A}{\sigma_A}$

   -  Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
•  Normalization by decimal scaling:

   $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v'|) < 1
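A direct sketch of the three normalizations in Python, reproducing the slide's numbers (function names are illustrative):

   def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
       return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

   def z_score(v, mu, sigma):
       return (v - mu) / sigma

   def decimal_scaling(v, j):
       # j: smallest integer such that the maximum scaled |v'| is below 1
       return v / 10 ** j

   print(min_max(73600, 12000, 98000))   # ~0.716
   print(z_score(73600, 54000, 16000))   # 1.225
   print(decimal_scaling(73600, 5))      # 0.736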
How is this applied to different data?
•  Data can be:
   -  Numeric
   -  Categorical
   -  Text
   -  Images
•  The time dimension creates more complex structures
   -  Time series (sequences of numerical data)
   -  Videos (sequences of images)
What's the problem with textual data?
•  Unstructured (text) vs. structured (database) data in the mid-nineties
(Slide figure: relative volumes of unstructured vs. structured data then)

What's the problem with textual data?
•  Unstructured (text) vs. structured (database) data today
(Slide figure: relative volumes of unstructured vs. structured data now)
All started from information retrieval (or web search)
•  Collection: A set of documents
   -  Assume it is a static collection for the moment
•  Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
Sec. 1.1
Unstructured data in 1620
•  Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
•  One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
•  Why is that not a feasible answer?
   -  Slow (for large corpora)
   -  NOT Calpurnia is non-trivial
   -  Other operations (e.g., find the word Romans near countrymen) not feasible
   -  Ranked retrieval (best documents to return)
•  And it starts getting very complex…
Sec. 1.1
Term-document incidence matrices
(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)

   term        A&C   JC   Temp   Haml   Oth   Macb
   Antony      1     1    0      0      0     1
   Brutus      1     1    0      1      0     0
   Caesar      1     1    0      1      1     1
   Calpurnia   0     1    0      0      0     0
   Cleopatra   1     0    0      0      0     0
   mercy       1     0    1      1      1     1
   worser      1     0    1      1      1     0

Query: Brutus AND Caesar BUT NOT Calpurnia
An entry is 1 if the play contains the word, 0 otherwise.
Sec. 1.1
Incidence vectors
•  So we have a 0/1 vector for each term.
•  To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) => bitwise AND.
(The slide repeats the term-document incidence matrix from the previous slide.)
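A tiny sketch of that Boolean query over the incidence matrix above (plain Python; variable names are illustrative):

   # 0/1 incidence vectors over the six plays, in the column order
   # (Antony&Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)
   incidence = {
       "Brutus":    [1, 1, 0, 1, 0, 0],
       "Caesar":    [1, 1, 0, 1, 1, 1],
       "Calpurnia": [0, 1, 0, 0, 0, 0],
   }
   plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
            "Hamlet", "Othello", "Macbeth"]

   # Brutus AND Caesar AND NOT Calpurnia, as a bitwise AND over the vectors
   answer = [b & c & (1 - p) for b, c, p in zip(
       incidence["Brutus"], incidence["Caesar"], incidence["Calpurnia"])]
   print([play for play, hit in zip(plays, answer) if hit])
   # -> ['Antony and Cleopatra', 'Hamlet']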
Sec. 1.1
Answers to query
•  Antony and Cleopatra, Act III, Scene ii
   Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
   When Antony found Julius Caesar dead,
   He cried almost to roaring; and he wept
   When at Philippi he found Brutus slain.
•  Hamlet, Act III, Scene ii
   Lord Polonius: I did enact Julius Caesar I was killed i' the
   Capitol; Brutus killed me.
Sec. 1.1
Bigger collections
•  Consider N = 1 million documents, each with about 1000 words.
•  Avg 6 bytes/word including spaces/punctuation
   -  6GB of data in the documents.
•  Say there are M = 500K distinct terms among these.
•  A 500K x 1M matrix has half-a-trillion 0's and 1's.
•  But it has no more than one billion 1's. Why?
   -  The matrix is extremely sparse.
•  Programmer's note: What's a better representation?
   -  We only record the 1 positions.
Ch. 6
Problem with Boolean search: feast or famine
•  Boolean queries often result in either too few (=0) or too many (1000s) results.
•  Query 1: "standard user dlink 650" → 200,000 hits
•  Query 2: "standard user dlink 650 no card found": 0 hits
•  It takes a lot of skill to come up with a query that produces a manageable number of hits.
   -  AND gives too few; OR gives too many
Sec. 6.2
Term-document count matrices
•  Consider the number of occurrences of a term in a document:
   -  Each document is a count vector in ℕ^|V|: a column below
(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)

   term        A&C   JC    Temp   Haml   Oth   Macb
   Antony      157   73    0      0      0     0
   Brutus      4     157   0      1      0     0
   Caesar      232   227   0      2      1     1
   Calpurnia   0     10    0      0      0     0
   Cleopatra   57    0     0      0      0     0
   mercy       2     0     3      5      5     1
   worser      2     0     1      1      1     0
Bag of words model
•  Vector representation doesn't consider the ordering of words in a document
•  "John is quicker than Mary" and "Mary is quicker than John" have the same vectors
•  This is called the bag of words model.
•  Other well-known problems:
   -  Breaks multi-words (e.g., data mining)
   -  Does not consider synonymy, hyponymy, etc.
   -  Main problem is that it's sparse!
Term frequency tf
•  The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
•  We want to use tf when computing query-document match scores. But how?
•  Raw term frequency is not what we want:
   -  A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
   -  But not 10 times more relevant.
•  Relevance does not increase proportionally with term frequency.
NB: frequency = count in IR
Sec. 6.2
Log-frequency weighting
•  The log frequency weight of term t in d is

   $w_{t,d} = \begin{cases} 1 + \log_{10} \text{tf}_{t,d}, & \text{if } \text{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$

•  0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
•  Score for a document-query pair: sum over terms t in both q and d:

   $\text{score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \text{tf}_{t,d})$

•  The score is 0 if none of the query terms is present in the document.
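A minimal sketch of this weighting and scoring in Python (the term counts are invented for illustration):

   import math

   def log_tf_weight(tf):
       """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
       return 1 + math.log10(tf) if tf > 0 else 0.0

   def score(query_terms, doc_tf):
       """Sum of log-tf weights over query terms present in the document.
       doc_tf maps term -> raw count in the document."""
       return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

   print(log_tf_weight(2))     # ~1.3
   print(log_tf_weight(1000))  # 4.0
   print(score(["brutus", "caesar"], {"brutus": 4, "caesar": 232}))  # ~4.97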
Sec. 6.2.1
Document frequency
•  Rare terms are more informative than frequent terms
   -  Recall stop words
•  Consider a term in the query that is rare in the collection (e.g., arachnophobia)
•  A document containing this term is very likely to be relevant to the query arachnophobia
•  → We want a high weight for rare terms like arachnophobia.

Sec. 6.2.1
Document frequency, continued
•  Frequent terms are less informative than rare terms
•  Consider a query term that is frequent in the collection (e.g., high, increase, line)
•  A document containing such a term is more likely to be relevant than a document that doesn't
•  But it's not a sure indicator of relevance.
•  → For frequent terms, we want high positive weights for words like high, increase, and line
•  But lower weights than for rare terms.
•  We will use document frequency (df) to capture this.

Sec. 6.2.1
idf weight
•  df_t is the document frequency of t: the number of documents that contain t
   -  df_t is an inverse measure of the informativeness of t
   -  df_t ≤ N
•  We define the idf (inverse document frequency) of t by

   $\text{idf}_t = \log_{10}(N/\text{df}_t)$

   -  We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
It will turn out the base of the log is immaterial.
Sec. 6.2.1
idf example, suppose N = 1 million

   term        df_t        idf_t
   calpurnia   1           6
   animal      100         4
   sunday      1,000       3
   fly         10,000      2
   under       100,000     1
   the         1,000,000   0

$\text{idf}_t = \log_{10}(N/\text{df}_t)$
There is one idf value for each term t in a collection.
Effect of idf on ranking
•  Does idf have an effect on ranking for one-term queries, like
   -  iPhone
•  idf has no effect on ranking one-term queries
   -  idf affects the ranking of documents for queries with at least two terms
   -  For the query "capricious person", idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

Sec. 6.2.1
Collection vs. Document frequency
•  The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
•  Example:

   Word        Collection frequency   Document frequency
   insurance   10440                  3997
   try         10422                  8760

•  Which word is a better search term (and should get a higher weight)?
Sec. 6.2.2
tf-idf weighting
•  The tf-idf weight of a term is the product of its tf weight and its idf weight:

   $w_{t,d} = \log(1 + \text{tf}_{t,d}) \times \log_{10}(N/\text{df}_t)$

•  Best known weighting scheme in information retrieval
•  Increases with the number of occurrences within a document
•  Increases with the rarity of the term in the collection
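A small sketch of the slide's tf-idf weight in Python (N and the df values are invented for illustration):

   import math

   def tf_idf(tf, df, n_docs):
       """tf-idf weight following the slide's formula: log(1 + tf) * log10(N/df)."""
       if tf == 0 or df == 0:
           return 0.0
       return math.log(1 + tf) * math.log10(n_docs / df)

   # Same raw count, rare term vs. common term (N = 1,000,000):
   print(tf_idf(tf=3, df=100, n_docs=1_000_000))      # ~5.5: rare term, high weight
   print(tf_idf(tf=3, df=100_000, n_docs=1_000_000))  # ~1.4: common term, low weight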
Sec. 6.2.2
Score for a document given a query
•  There are many variants
   -  How "tf" is computed (with/without logs)
   -  Whether the terms in the query are also weighted
   -  …

Sec. 6.3
Binary → count → weight matrix
(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)

   term        A&C    JC     Temp   Haml   Oth    Macb
   Antony      5.25   3.18   0      0      0      0.35
   Brutus      1.21   6.1    0      1      0      0
   Caesar      8.59   2.54   0      1.51   0.25   0
   Calpurnia   0      1.54   0      0      0      0
   Cleopatra   2.85   0      0      0      0      0
   mercy       1.51   0      1.9    0.12   5.25   0.88
   worser      1.37   0      0.11   4.15   0.25   1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|
Sec. 6.3
Documents as vectors
•  So we have a |V|-dimensional vector space
•  Terms are axes of the space
•  Documents are points or vectors in this space
•  Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
•  These are very sparse vectors: most entries are zero.

Sec. 6.3
Queries as vectors
•  Key idea 1: Do the same for queries: represent them as vectors in the space
•  Key idea 2: Rank documents according to their proximity to the query in this space
•  proximity = similarity of vectors
•  proximity ≈ inverse of distance
•  Recall: We do this because we want to get away from the you're-either-in-or-out Boolean model.
•  Instead: rank more relevant documents higher than less relevant documents
Sec. 6.3
Formalizing vector space proximity
•  First cut: distance between two points
   -  (= distance between the endpoints of the two vectors)
•  Euclidean distance?
•  Euclidean distance is a bad idea...
•  ...because Euclidean distance is large for vectors of different lengths.

Sec. 6.3
Why distance is a bad idea
•  The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
(Slide figure: 2-D vector-space plot of the query and several documents)
Sec. 6.3
Use angle instead of distance
•  Thought experiment: take a document d and append it to itself. Call this document dʹ.
•  "Semantically" d and dʹ have the same content
•  The Euclidean distance between the two documents can be quite large
•  The angle between the two documents is 0, corresponding to maximal similarity.
•  Key idea: Rank documents according to angle with query.
Sec. 6.3
From angles to cosines
•  The following two notions are equivalent.
   -  Rank documents in decreasing order of the angle between query and document
   -  Rank documents in increasing order of cosine(query, document)
•  Cosine is a monotonically decreasing function for the interval [0°, 180°]

Sec. 6.3
From angles to cosines
•  But how – and why – should we be computing cosines?
Sec. 6.3
Length normalization
•  A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

   $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$

•  Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
•  Effect on the two documents d and dʹ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
   -  Long and short documents now have comparable weights
Sec. 6.3
cosine(query, document)

   $\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$

   (the numerator is a dot product; the middle form is a dot product of unit vectors)

q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
Cosine for length-normalized vectors
•  For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

   $\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{|V|} q_i d_i$, for q, d length-normalized.

Cosine similarity illustrated
(Slide figure: two vectors in term space and the angle between them)
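A small sketch of cosine similarity and length normalization in plain Python (function names and example vectors are illustrative):

   import math

   def cosine_similarity(u, v):
       """Cosine of the angle between two equal-length vectors."""
       dot = sum(a * b for a, b in zip(u, v))
       norm_u = math.sqrt(sum(a * a for a in u))
       norm_v = math.sqrt(sum(b * b for b in v))
       return dot / (norm_u * norm_v)

   def length_normalize(v):
       """Divide each component by the vector's L2 norm -> unit vector."""
       norm = math.sqrt(sum(a * a for a in v))
       return [a / norm for a in v]

   d = [3, 2, 0, 5]
   d_doubled = [6, 4, 0, 10]  # d appended to itself
   print(cosine_similarity(d, d_doubled))  # ~1.0: angle 0, maximal similarity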
Sec. 6.3
Cosine similarity amongst 3 documents
How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

   term        SaS   PaP   WH
   affection   115   58    20
   jealous     10    7     11
   gossip      2     0     6
   wuthering   0     0     38

Term frequencies (counts)
Note: To simplify this example, we don't do idf weighting.
Sec. 6.3
3 documents example contd.
Log frequency weighting:

   term        SaS    PaP    WH
   affection   3.06   2.76   2.30
   jealous     2.00   1.85   2.04
   gossip      1.30   0      1.78
   wuthering   0      0      2.58

After length normalization:

   term        SaS     PaP     WH
   affection   0.789   0.832   0.524
   jealous     0.515   0.555   0.465
   gossip      0.335   0       0.405
   wuthering   0       0       0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
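A self-contained sketch reproducing these numbers (raw counts from the table, log-tf weighting, then cosine; plain Python):

   import math

   def log_tf(v):
       return [1 + math.log10(tf) if tf > 0 else 0.0 for tf in v]

   def cos(u, v):
       dot = sum(a * b for a, b in zip(u, v))
       return dot / (math.sqrt(sum(a * a for a in u)) *
                     math.sqrt(sum(b * b for b in v)))

   sas, pap, wh = [115, 10, 2, 0], [58, 7, 0, 0], [20, 11, 6, 38]
   print(round(cos(log_tf(sas), log_tf(pap)), 2))  # 0.94
   print(round(cos(log_tf(sas), log_tf(wh)), 2))   # 0.79
   print(round(cos(log_tf(pap), log_tf(wh)), 2))   # 0.69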
The Dream for Natural Language Processing
•  It'd be great if machines could
   -  Process our email (usefully)
   -  Translate languages accurately
   -  Help us manage, summarize, and aggregate information
   -  Use speech as a UI (when needed)
   -  Talk to us / listen to us
•  But they can't:
   -  Language is complex, ambiguous, flexible, and subtle
   -  Good solutions need linguistics and machine learning knowledge
•  So:

The mystery
•  What's now impossible for computers (and any other species) to do is effortless for humans
(Slide figure: example tasks marked ✕, ✕, ✓)

What is NLP?
•  Fundamental goal: deep understanding of broad language
   -  Not just string processing or keyword matching!
What is NLP?
•  Computers use (analyze, understand, generate) natural language
•  Text Processing
   -  Lexical: tokenization, part of speech, head, lemmas
   -  Parsing and chunking
   -  Semantic tagging: semantic role, word sense
   -  Certain expressions: named entities
   -  Discourse: coreference, discourse segments
•  Speech Processing
   -  Phonetic transcription
   -  Segmentation (punctuation)
   -  Prosody

Why should you care?
•  Tremendous progress in NLP
   -  An enormous amount of knowledge is now available in machine-readable form as natural language text
   -  Conversational agents are becoming an important form of human-computer communication
   -  Much of human-human communication is now mediated by computers
•  The commercial world is blooming
(Slide figure: sponsor logos @ EMNLP 2016)
What's the problem with images?
•  Images can be represented as vectors of pixels (RGB values) and then can be (simply?) fed to a classification algorithm
•  But for a simple image (256x256) there are 2^524,288 possible images
   -  There are about 10^24 stars in the universe
•  Think also of variations per class
   -  Fruit?
   -  Cats in different positions?

What is computer vision?
(Slide figure: machine-vision still from Terminator 2)

Every picture tells a story
•  The goal of computer vision is to write computer programs that can interpret images

Can computers match (or beat) human vision?
•  Yes and no (but mostly no!)
   -  humans are much better at "hard" things
   -  computers can be better at "easy" things

How does it work?

Supervised learning
Classification
•  In classification problems, each entity in some domain can be placed in one of a discrete set of categories: yes/no, friend/foe, good/bad/indifferent, blue/red/green, etc.
•  Given a training set of labeled entities, develop a rule for assigning labels to entities in a test set
•  Many variations on this theme:
   -  binary classification
   -  multi-category classification
   -  non-exclusive categories
   -  ranking
•  Many criteria to assess rules and their predictions
   -  overall errors
   -  costs associated with different kinds of errors
   -  operating points
Representation of Objects
•  Each object to be classified is represented as a pair (x, y):
   -  where x is a description of the object (see examples of data types in the following slides)
   -  where y is a label (assumed binary for now)
•  Success or failure of a machine learning classifier often depends on choosing good descriptions of objects
   -  the choice of description can also be viewed as a learning problem, and indeed we'll discuss automated procedures for choosing descriptions in a later lecture
   -  but good human intuitions are often needed here

Data Types
•  Vectorial data:
   -  physical attributes
   -  behavioral attributes
   -  context
   -  history
   -  etc.
•  We'll assume for now that such vectors are explicitly represented in a table, but practically they can take different forms (e.g., kernels)
Data Types
-  text and hypertext
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Welcome to FairmontNET</title>
</head>
<STYLE type="text/css">
.stdtext {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; color: #1F3D4E;}
.stdtext_wh {font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; color: WHITE;}
</STYLE>
<body leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" bgcolor="BLACK">
<TABLE cellpadding="0" cellspacing="0" width="100%" border="0">
<TR>
<TD width=50% background="/TFN/en/CDA/Images/common/labels/decorative_2px_blk.gif">&nbsp;</TD>
<TD><img src="/TFN/en/CDA/Images/common/labels/decorative.gif"></td>
<TD width=50% background="/TFN/en/CDA/Images/common/labels/decorative_2px_blk.gif">&nbsp;</TD>
</TR>
</TABLE>
<tr>
<td align="right" valign="middle"><IMG src="/TFN/en/CDA/Images/common/labels/centrino_logo_blk.gif"></td>
</tr>
</body>
</html>
Data Types
-  email
Return-path <[email protected]>Received from relay2.EECS.Berkeley.EDU
(relay2.EECS.Berkeley.EDU [169.229.60.28]) by imap4.CS.Berkeley.EDU (iPlanet Messaging Server
5.2 HotFix 1.16 (built May 14 2003)) with ESMTP id <[email protected]>;
Tue, 08 Jun 2004 11:40:43 -0700 (PDT)Received from relay3.EECS.Berkeley.EDU (localhost
[127.0.0.1]) by relay2.EECS.Berkeley.EDU (8.12.10/8.9.3) with ESMTP id i58Ieg3N000927; Tue, 08
Jun 2004 11:40:43 -0700 (PDT)Received from redbirds (dhcp-168-35.EECS.Berkeley.EDU
[128.32.168.35]) by relay3.EECS.Berkeley.EDU (8.12.10/8.9.3) with ESMTP id i58IegFp007613;
Tue, 08 Jun 2004 11:40:42 -0700 (PDT)Date Tue, 08 Jun 2004 11:40:42 -0700From Robert Miller
<[email protected]>Subject RE: SLT headcount = 25In-replyto <[email protected]>To 'Randy Katz'
<[email protected]>Cc "'Glenda J. Smith'" <[email protected]>, 'Gert Lanckriet'
<[email protected]>Messageid <[email protected]>MIME-version 1.0XMIMEOLE Produced By Microsoft MimeOLE V6.00.2800.1409X-Mailer Microsoft Office Outlook, Build
11.0.5510Content-type multipart/alternative; boundary="----=_NextPart_000_0033_01C44D4D.
6DD93AF0"Thread-index AcRMtQRp+R26lVFaRiuz4BfImikTRAA0wf3Qthe headcount is now 32.
---------------------------------------- Robert Miller, Administrative Specialist University of California,
Berkeley Electronics Research Lab 634 Soda Hall #1776 Berkeley, CA 94720-1776 Phone:
510-642-6037 fax: 510-643-1289
Data Types
-  protein sequences

Data Types
-  sequences of Unix system calls

Data Types
-  network layout: graph

Data Types
-  images
Example: Spam Filter
•  Input: email
•  Output: spam/ham
•  Setup:
   -  Get a large collection of example emails, each labeled "spam" or "ham"
   -  Note: someone has to hand-label all this data
   -  Want to learn to predict labels of new, future emails
•  Features: The attributes used to make the ham/spam decision
   -  Words: FREE!
   -  Text Patterns: $dd, CAPS
   -  Non-text: SenderInContacts
   -  …

Example emails:
   Dear Sir.
   First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …

   TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
   99 MILLION EMAIL ADDRESSES FOR ONLY $99

   Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Example: Digit Recognition
•  Input: images / pixel grids
•  Output: a digit 0-9
•  Setup:
   -  Get a large collection of example images, each labeled with a digit
   -  Note: someone has to hand-label all this data
   -  Want to learn to predict labels of new, future digit images
•  Features: The attributes used to make the digit decision
   -  Pixels: (6,8) = ON
   -  Shape Patterns: NumComponents, AspectRatio, NumLoops
   -  …
•  Current state-of-the-art: Human-level performance
(Slide figure: example digit images labeled 0, 1, 2, 1, ??)
Other Examples of Real-World Classification Tasks
•  Fraud detection (input: account activity, classes: fraud / no fraud)
•  Web page spam detection (input: HTML/rendered page, classes: spam/ham)
•  Speech recognition and speaker recognition (input: waveform, classes: phonemes or words)
•  Medical diagnosis (input: symptoms, classes: diseases)
•  Automatic essay grading (input: document, classes: grades)
•  Customer service email routing and foldering
•  Link prediction in social networks
•  Catalytic activity in drug design
•  … many many more
•  Classification is an important commercial technology
Training and Validation
•  Data: labeled instances, e.g., emails marked spam/ham
   -  Training set
   -  Validation set
   -  Test set
•  Training
   -  Estimate parameters on the training set
   -  Tune hyperparameters on the validation set
   -  Report results on the test set
   -  Anything short of this yields over-optimistic claims
•  Evaluation
   -  Many different metrics
   -  Ideally, the criteria used to train the classifier should be closely related to those used to evaluate the classifier
•  Statistical issues
   -  Want a classifier which does well on test data
   -  Overfitting: fitting the training data very closely, but not generalizing well
   -  Error bars: want realistic (conservative) estimates of accuracy
(Slide diagram: data split into Training Data, Validation Data, Test Data)
Intuitive Picture of the Problem
(Slide figure: scatter plot of Class 1 and Class 2 points)

Linearly Separable Data
(Slide figure: Class 1 and Class 2 separated by a linear decision boundary)

Nonlinearly Separable Data
(Slide figure: Class 1 and Class 2 separated by a nonlinear classifier)

Some Issues
•  There may be a simple separator (e.g., a straight line in 2D or a hyperplane in general) or there may not
•  There may be "noise" of various kinds
•  There may be "overlap"
•  One should not be deceived by one's low-dimensional geometrical intuition
•  Some classifiers explicitly represent separators (e.g., straight lines), while for other classifiers the separation is done implicitly
•  Some classifiers just make a decision as to which class an object is in; others estimate class probabilities
Methods
I) Instance-based methods:
   1) Nearest neighbor
II) Probabilistic models:
   1) Naïve Bayes
   2) Logistic Regression
III) Linear Models:
   1) Perceptron
   2) Support Vector Machine
IV) Decision Models:
   1) Decision Trees
   2) Boosted Decision Trees
   3) Random Forest
Nearest Neighbour Rule
•  Non-parametric pattern classification.
•  Consider a two-class problem where each sample consists of two measurements (x, y).
•  For a given query point q, assign the class of the nearest neighbour (k = 1).
•  Or compute the k nearest neighbours and assign the class by majority vote (e.g., k = 3).
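A minimal sketch of the nearest-neighbour rule in plain Python (the training points and labels are invented for illustration; math.dist needs Python 3.8+):

   import math
   from collections import Counter

   def knn_predict(train, query, k):
       """Classify `query` by majority vote among its k nearest training points.
       train: list of ((x, y), label) pairs; query: (x, y)."""
       dist = lambda p: math.dist(p, query)              # Euclidean distance
       neighbours = sorted(train, key=lambda s: dist(s[0]))[:k]
       votes = Counter(label for _, label in neighbours)
       return votes.most_common(1)[0][0]

   train = [((1, 1), "A"), ((2, 1), "A"), ((1, 2), "A"),
            ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
   print(knn_predict(train, (2, 2), k=1))  # A
   print(knn_predict(train, (4, 4), k=3))  # B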
Questions
•  What distance measure to use?
   -  Often Euclidean distance is used
   -  Locally adaptive metrics
   -  More complicated with non-numeric data, or when different dimensions have different scales
•  Choice of k?
   -  Cross-validation
   -  1-NN often performs well in practice
   -  k-NN needed for overlapping classes
   -  Re-label all data according to k-NN, then classify with 1-NN
   -  Reduce the k-NN problem to 1-NN through dataset editing

Decision Regions
•  Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample. A Voronoi diagram divides the space into such cells.
•  Every query point will be assigned the classification of the sample within that cell. The decision boundary separates the class regions based on the 1-NN decision rule.
•  Knowledge of this boundary is sufficient to classify new points.
•  The boundary itself is rarely computed; many algorithms seek to retain only those points necessary to generate an identical boundary.
Nearest Neighbor Classification…
•  Scaling issues
   -  Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
   -  Example:
      -  height of a person may vary from 1.5m to 1.8m
      -  weight of a person may vary from 90lb to 300lb
      -  income of a person may vary from $10K to $1M

Nearest Neighbor Classification…
•  Problem with the Euclidean measure:
   -  High dimensional data
      -  curse of dimensionality
   -  Can produce counter-intuitive results, e.g.:

      111111111110          100000000000
      011111111111    vs    000000000001
      d = 1.4142            d = 1.4142

   -  Solution: Normalize the vectors to unit length

k-NN summary
•  Pros
   -  Can express complex boundaries (non-parametric)
   -  Very fast training
   -  Simple, but still good in practice (e.g., applications in computer vision)
   -  Somewhat interpretable by looking at the closest points
•  Cons
   -  Large memory requirements for prediction
   -  Not the best accuracy among different classifiers
Methods
I) Instance-based methods:
   1) Nearest neighbor
II) Probabilistic models:
   1) Naïve Bayes
   2) Logistic Regression
III) Linear Models:
   1) Perceptron
   2) Support Vector Machine
IV) Decision Models:
   1) Decision Trees
   2) Boosted Decision Trees
   3) Random Forest
Bayes Classifier
•  A probabilistic framework for solving classification problems
•  Conditional probability:

   $P(C \mid A) = \frac{P(A, C)}{P(A)}$   and   $P(A \mid C) = \frac{P(A, C)}{P(C)}$

•  Bayes theorem:

   $P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$

Example of Bayes Theorem
•  Given:
   -  A doctor knows that meningitis causes stiff neck 50% of the time
   -  Prior probability of any patient having meningitis is 1/50,000
   -  Prior probability of any patient having stiff neck is 1/20
•  If a patient has stiff neck, what's the probability he/she has meningitis?

   $P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$
Bayesian Classifiers
•  Consider each attribute and class label as random variables
•  Given a record with attributes (A1, A2, …, An)
   -  Goal is to predict class C
   -  Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
•  Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers
•  Approach:
   -  compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:

      $P(C \mid A_1 A_2 \ldots A_n) = \frac{P(A_1 A_2 \ldots A_n \mid C)\, P(C)}{P(A_1 A_2 \ldots A_n)}$

   -  Choose the value of C that maximizes P(C | A1, A2, …, An)
   -  Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
•  How to estimate P(A1, A2, …, An | C)?

Naïve Bayes Classifier
•  Assume independence among attributes Ai when the class is given:
   -  P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
   -  Can estimate P(Ai | Cj) for all Ai and Cj
   -  A new point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal
How to Estimate Probabilities from Data?
(Attributes: Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class)

   Tid   Refund   Marital Status   Taxable Income   Evade
   1     Yes      Single           125K             No
   2     No       Married          100K             No
   3     No       Single           70K              No
   4     Yes      Married          120K             No
   5     No       Divorced         95K              Yes
   6     No       Married          60K              No
   7     Yes      Divorced         220K             No
   8     No       Single           85K              Yes
   9     No       Married          75K              No
   10    No       Single           90K              Yes

•  Class: P(C) = Nc/N
   -  e.g., P(No) = 7/10, P(Yes) = 3/10
•  For discrete attributes: P(Ai | Ck) = |Aik| / Nc
   -  where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
   -  Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
How to Estimate Probabilities from Data?
•  For continuous attributes:
   -  Discretize the range into bins
      -  one ordinal attribute per bin
      -  violates the independence assumption
   -  Two-way split: (A < v) or (A > v)
      -  choose only one of the two splits as the new attribute
   -  Probability density estimation:
      -  Assume the attribute follows a normal distribution
      -  Use data to estimate the parameters of the distribution (e.g., mean and standard deviation)
      -  Once the probability distribution is known, we can use it to estimate the conditional probability P(Ai | c)
How to Estimate Probabilities from Data?
(The slide repeats the Tid / Refund / Marital Status / Taxable Income / Evade table from two slides back.)
•  Normal distribution:

   $P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi \sigma_{ij}^2}}\, e^{-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$

   -  One for each (Ai, cj) pair
•  For (Income, Class=No):
   -  If Class=No
      -  sample mean = 110
      -  sample variance = 2975

   $P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)}\, e^{-\frac{(120-110)^2}{2(2975)}} = 0.0072$
Example of Naïve Bayes Classifier
Given a Test Record: X = (Refund = No, Married, Income = 120K)

Naïve Bayes Classifier:
   P(Refund=Yes|No) = 3/7,  P(Refund=No|No) = 4/7
   P(Refund=Yes|Yes) = 0,   P(Refund=No|Yes) = 1
   P(Marital Status=Single|No) = 2/7,   P(Marital Status=Divorced|No) = 1/7,  P(Marital Status=Married|No) = 4/7
   P(Marital Status=Single|Yes) = 2/7,  P(Marital Status=Divorced|Yes) = 1/7, P(Marital Status=Married|Yes) = 0
   For taxable income:
   If class=No: sample mean = 110, sample variance = 2975
   If class=Yes: sample mean = 90, sample variance = 25

•  P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
                 = 4/7 × 4/7 × 0.0072 = 0.0024
•  P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
                  = 1 × 0 × 1.2×10^-9 = 0
Since P(X|No)P(No) > P(X|Yes)P(Yes), we get P(No|X) > P(Yes|X) => Class = No
Naïve Bayes Classifier
•  If one of the conditional probabilities is zero, then the entire expression becomes zero
•  Probability estimation (see the sketch below):

   Original:   $P(A_i \mid C) = \frac{N_{ic}}{N_c}$

   Laplace:    $P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$

   m-estimate: $P(A_i \mid C) = \frac{N_{ic} + mp}{N_c + m}$

   (c: number of classes, p: prior probability, m: parameter)
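A small sketch of Laplace-smoothed Naïve Bayes for categorical attributes (toy data loosely based on the Refund/Marital Status table; function names are illustrative):

   from collections import Counter, defaultdict

   def train_nb(records, labels):
       """Count statistics for Naive Bayes over categorical attributes."""
       class_counts = Counter(labels)
       attr_counts = defaultdict(Counter)   # (attr index, class) -> value counts
       attr_values = defaultdict(set)       # attr index -> distinct values seen
       for rec, y in zip(records, labels):
           for i, v in enumerate(rec):
               attr_counts[(i, y)][v] += 1
               attr_values[i].add(v)
       return class_counts, attr_counts, attr_values

   def predict_nb(model, record):
       class_counts, attr_counts, attr_values = model
       n = sum(class_counts.values())
       scores = {}
       for c, nc in class_counts.items():
           p = nc / n                       # prior P(C) = Nc/N
           for i, v in enumerate(record):
               # Laplace: P(Ai|C) = (N_ic + 1) / (N_c + number of values of Ai)
               p *= (attr_counts[(i, c)][v] + 1) / (nc + len(attr_values[i]))
           scores[c] = p
       return max(scores, key=scores.get)

   records = [("Yes", "Single"), ("No", "Married"), ("No", "Single"),
              ("Yes", "Married"), ("No", "Divorced"), ("No", "Single")]
   labels = ["No", "No", "No", "No", "Yes", "Yes"]
   model = train_nb(records, labels)
   print(predict_nb(model, ("No", "Single")))  # -> "No"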
Example of Naïve Bayes Classifier

   Name            Give Birth   Can Fly   Live in Water   Have Legs   Class
   human           yes          no        no              yes         mammals
   python          no           no        no              no          non-mammals
   salmon          no           no        yes             no          non-mammals
   whale           yes          no        yes             no          mammals
   frog            no           no        sometimes       yes         non-mammals
   komodo          no           no        no              yes         non-mammals
   bat             yes          yes       no              yes         mammals
   pigeon          no           yes       no              yes         non-mammals
   cat             yes          no        no              yes         mammals
   leopard shark   yes          no        yes             no          non-mammals
   turtle          no           no        sometimes       yes         non-mammals
   penguin         no           no        sometimes       yes         non-mammals
   porcupine       yes          no        no              yes         mammals
   eel             no           no        yes             no          non-mammals
   salamander      no           no        sometimes       yes         non-mammals
   gila monster    no           no        no              yes         non-mammals
   platypus        no           no        no              yes         mammals
   owl             no           yes       no              yes         non-mammals
   dolphin         yes          no        yes             no          mammals
   eagle           no           yes       no              yes         non-mammals

Test record: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

A: attributes, M: mammals, N: non-mammals

   $P(A \mid M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06$
   $P(A \mid N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042$
   $P(A \mid M)\, P(M) = 0.06 \times \frac{7}{20} = 0.021$
   $P(A \mid N)\, P(N) = 0.0042 \times \frac{13}{20} = 0.0027$

P(A|M)P(M) > P(A|N)P(N) => Mammals
Naïve Bayes Summary
•  Robust to isolated noise points
•  Handles missing values by ignoring the instance during probability estimate calculations
•  Robust to irrelevant attributes
•  The independence assumption may not hold for some attributes
   -  Use other techniques such as Bayesian Belief Networks (BBN)
•  Naïve Bayes can produce a probability estimate, but it is usually a very biased one
   -  Logistic Regression is better for obtaining probabilities
Methods
I) Instance-based methods:
   1) Nearest neighbor
II) Probabilistic models:
   1) Naïve Bayes
   2) Logistic Regression
III) Linear Models:
   1) Perceptron
   2) Support Vector Machine
IV) Decision Models:
   1) Decision Trees
   2) Boosted Decision Trees
   3) Random Forest
Support Vector Machines
•  Find a linear hyperplane (decision boundary) that will separate the data
(Slide figure: scatter plot of two classes)

Support Vector Machines
•  One possible solution: hyperplane B1

Support Vector Machines
•  Another possible solution: hyperplane B2

Support Vector Machines
•  Other possible solutions

Support Vector Machines
•  Which one is better? B1 or B2?
•  How do you define better?

Support Vector Machines
•  Find the hyperplane that maximizes the margin => B1 is better than B2
(Slide figure: B1 and B2 with their margin hyperplanes b11, b12 and b21, b22)
Support Vector Machines
•  The separating hyperplane B1 and its margin hyperplanes b11, b12:

   $\vec{w} \cdot \vec{x} + b = 0$,  $\vec{w} \cdot \vec{x} + b = +1$,  $\vec{w} \cdot \vec{x} + b = -1$

   $f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \geq 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \leq -1 \end{cases}$

   $\text{Margin} = \frac{2}{\|\vec{w}\|^2}$
Support Vector Machines
•  We want to maximize: $\text{Margin} = \frac{2}{\|\vec{w}\|^2}$
   -  Which is equivalent to minimizing: $L(w) = \frac{\|\vec{w}\|^2}{2}$
   -  But subject to the following constraints:

      $f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \geq 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \leq -1 \end{cases}$

   -  This is a constrained optimization problem
   -  Numerical approaches to solve it (e.g., quadratic programming)
Support Vector Machines
•  What if the problem is not linearly separable?

Support Vector Machines
•  What if the problem is not linearly separable?
   -  Introduce slack variables
   -  Need to minimize:

      $L(w) = \frac{\|\vec{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i^k \right)$

   -  Subject to:

      $f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \geq 1 - \xi_i \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \leq -1 + \xi_i \end{cases}$
Nonlinearly Separable Data
•  Introduce slack variables
•  Allow some instances to fall within the margin, but penalize them
(Slide figure: Var1/Var2 scatter plot with margin violations)

Robustness of Soft vs. Hard Margin SVMs
(Slide figures: soft margin SVM with slack ξi vs. hard margin SVM)
Nonlinear Support Vector Machines
•  What if things are not that good?

Nonlinear Support Vector Machines
•  Transform the data into a higher dimensional space

Disadvantages of Linear Decision Surfaces
(Slide figure: Var1/Var2 data that no line separates well)

Advantages of Nonlinear Surfaces
(Slide figure: the same Var1/Var2 data separated by a nonlinear surface)

Linear Classifiers in High-Dimensional Spaces
•  Find a function Φ(x) to map to a different space
(Slide figure: original Var1/Var2 space mapped to Constructed Feature 1 / Constructed Feature 2 space, where the classes become linearly separable)
Mapping Data to a High-Dimensional Space
•  Find a function Φ(x) to map to a different space; the SVM formulation then stays the same
•  Data appear as Φ(x); the weights w are now weights in the new space
•  Explicit mapping is expensive if Φ(x) is very high dimensional
•  Solving the problem without explicitly mapping the data is desirable

The Kernel Trick
•  Φ(xi)⋅Φ(xj) means: map the data into the new space, then take the inner product of the new vectors
•  We can find a function such that K(xi, xj) = Φ(xi)⋅Φ(xj), i.e., the image of the inner product of the data is the inner product of the images of the data
•  Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem
Example
•  Decision boundary: $w^T \Phi(x) + b = 0$
   X = [x z],  Φ(X) = [x² z² xz]
   f(x) = sign(w1·x² + w2·z² + w3·xz + b)

Example
   X1 = [x1 z1],  X2 = [x2 z2]
   Φ(X1) = [x1² z1² √2·x1z1]
   Φ(X2) = [x2² z2² √2·x2z2]

   Φ(X1)ᵀΦ(X2) = x1²x2² + z1²z2² + 2·x1z1x2z2 = (x1x2 + z1z2)² = (X1ᵀX2)²

   Computing Φ(X1)ᵀΦ(X2) explicitly: O(d²) — expensive!
   Computing (X1ᵀX2)² directly: O(d) — efficient!
Kernel Trick
•  Kernel function: a symmetric function k: R^d x R^d -> R
•  Inner product kernels: additionally, k(x,z) = Φ(x)ᵀΦ(z)
•  Example: computing Φ(x)ᵀΦ(z) explicitly is O(d²); computing the kernel directly is O(d)

Kernel Trick
•  Implement an infinite-dimensional mapping implicitly
•  Only inner products are explicitly needed for training and evaluation
•  Inner products are computed efficiently, in finite dimensions
•  The underlying mathematical theory is that of reproducing kernel Hilbert spaces from functional analysis

Kernel Methods
•  If a linear algorithm can be expressed only in terms of inner products
   -  it can be "kernelized"
   -  find a linear pattern in the high-dimensional space
   -  nonlinear relation in the original space
•  The specific kernel function determines the nonlinearity
Kernels
•  Some simple kernels
   -  Linear kernel: k(x,z) = xᵀz
      → equivalent to the linear algorithm
   -  Polynomial kernel: k(x,z) = (1 + xᵀz)^d
      → polynomial decision rules
   -  RBF kernel: k(x,z) = exp(-||x-z||²/2σ)
      → highly nonlinear decisions
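A direct transcription of these three kernels in Python, keeping the slide's RBF form with 2σ in the denominator (vectors as plain lists; the example values are illustrative):

   import math

   def linear_kernel(x, z):
       return sum(a * b for a, b in zip(x, z))

   def polynomial_kernel(x, z, d=2):
       return (1 + linear_kernel(x, z)) ** d

   def rbf_kernel(x, z, sigma=1.0):
       sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
       return math.exp(-sq_dist / (2 * sigma))

   x, z = [1.0, 2.0], [2.0, 1.0]
   print(linear_kernel(x, z))      # 4.0
   print(polynomial_kernel(x, z))  # 25.0
   print(rbf_kernel(x, z))         # exp(-1) ≈ 0.37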
Gaussian Kernel: Example
(Slide figure: a nonlinear decision boundary in input space that is a hyperplane in some feature space)

Kernel Matrix
•  The kernel matrix K defines all pairwise inner products: K_ij = k(x_i, x_j)
•  Mercer theorem: K is positive semidefinite
•  Any symmetric positive semidefinite matrix can be regarded as an inner product matrix in some space

Kernel-Based Learning
(Slide diagram: Data {(xi, yi)} → Embedding via k(x,y) or the kernel matrix K → Linear algorithm)

Kernel-Based Learning
(Slide diagram: the same pipeline, split into kernel design and kernel algorithm)
Methods
I) Instance-based methods:
   1) Nearest neighbor
II) Probabilistic models:
   1) Naïve Bayes
   2) Logistic Regression
III) Linear Models:
   1) Perceptron
   2) Support Vector Machine
IV) Decision Models:
   1) Decision Trees
   2) Boosted Decision Trees
   3) Random Forest
Example of a Decision Tree

Training Data:

   Tid   Refund   Marital Status   Taxable Income   Cheat
   1     Yes      Single           125K             No
   2     No       Married          100K             No
   3     No       Single           70K              No
   4     Yes      Married          120K             No
   5     No       Divorced         95K              Yes
   6     No       Married          60K              No
   7     Yes      Divorced         220K             No
   8     No       Single           85K              Yes
   9     No       Married          75K              No
   10    No       Single           90K              Yes

Model: Decision Tree (splitting attributes):

   Refund?
   ├─ Yes → NO
   └─ No → MarSt?
            ├─ Single, Divorced → TaxInc?
            │                      ├─ < 80K → NO
            │                      └─ > 80K → YES
            └─ Married → NO
How to Specify Test Condition?
•  Depends on attribute types
   -  Nominal
   -  Ordinal
   -  Continuous
•  Depends on number of ways to split
   -  2-way split
   -  Multi-way split

Splitting Based on Nominal Attributes
•  Multi-way split: Use as many partitions as distinct values.
   CarType → Family / Sports / Luxury
•  Binary split: Divides values into two subsets; need to find the optimal partitioning.
   CarType → {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes
•  Multi-way split: Use as many partitions as distinct values.
   Size → Small / Medium / Large
•  Binary split: Divides values into two subsets; need to find the optimal partitioning.
   Size → {Small, Medium} vs. {Large}, OR {Medium, Large} vs. {Small}
•  What about this split? Size → {Small, Large} vs. {Medium}

Splitting Based on Continuous Attributes
•  Different ways of handling
   -  Discretization to form an ordinal categorical attribute
      -  Static: discretize once at the beginning
      -  Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
   -  Binary decision: (A < v) or (A ≥ v)
      -  consider all possible splits and find the best cut
      -  can be more compute intensive

Splitting Based on Continuous Attributes
(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K / [10K, 25K) / [25K, 50K) / [50K, 80K) / > 80K
Tree Induction
•  Greedy strategy
   -  Split the records based on an attribute test that optimizes a certain criterion
•  Issues
   -  Determine how to split the records
      -  How to specify the attribute test condition?
      -  How to determine the best split?
   -  Determine when to stop splitting

How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1
(Slide figure: candidate splits on Own Car? (Yes/No), Car Type? (Family/Sports/Luxury) and Student ID? (c1 … c20), with the resulting C0/C1 class counts in each child node)
Which test condition is the best?
How to Determine the Best Split
•  Greedy approach:
   -  Nodes with homogeneous class distribution are preferred
•  Need a measure of node impurity (see the sketch below):
   -  C0: 5, C1: 5 → non-homogeneous, high degree of impurity
   -  C0: 9, C1: 1 → homogeneous, low degree of impurity
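The slides do not fix a particular impurity measure here; a common choice is the Gini index, sketched below (an assumption on my part, as used by CART-style trees):

   def gini(counts):
       """Gini impurity of a node from its class counts: 1 - sum(p_i^2)."""
       total = sum(counts)
       return 1.0 - sum((c / total) ** 2 for c in counts)

   print(gini([5, 5]))   # 0.5 : non-homogeneous, maximal impurity for 2 classes
   print(gini([9, 1]))   # 0.18: nearly homogeneous, low impurity
   print(gini([10, 0]))  # 0.0 : pure node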
Ensemble Methods: Increasing the Accuracy
•  Ensemble methods
   -  Use a combination of models to increase accuracy
   -  Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
•  Popular ensemble methods
   -  Bagging: averaging the prediction over a collection of classifiers
   -  Boosting: weighted vote with a collection of classifiers
   -  Ensemble: combining a set of heterogeneous classifiers

Bagging and Randomised Trees
Other classifier combinations:
•  Bagging: combine trees grown from "bootstrap" samples (i.e., re-sample the training data with replacement)
•  Randomised Trees (Random Forest: trademark L. Breiman, A. Cutler): combine trees grown with:
   -  random bootstrap (or subsets) of the training data only
   -  considering at each node only a random subset of variables for the split
   -  NO pruning!
•  These combined classifiers work surprisingly well, are very stable, and are almost perfect "out of the box" classifiers
Random Forest
•  Randomly sample 2/3 of the data (for each tree)
•  At each node: pick randomly a small number m of input variables to split on
•  VOTE!
•  Use Out-Of-Bag samples to:
   -  Estimate error
   -  Choose m
   -  Estimate variable importance
Boosting
(Slide diagram: Training Sample → classifier C(0)(x); re-weight → Weighted Sample → classifier C(1)(x); re-weight → Weighted Sample → classifier C(2)(x); … → classifier C(m)(x))

   $y(x) = \sum_{i}^{N_{\mathrm{Classifier}}} w_i\, C^{(i)}(x)$

Adaptive Boosting (AdaBoost)
(Slide diagram: the same re-weighting cascade of classifiers C(0)(x) … C(m)(x))
•  AdaBoost re-weights events misclassified by the previous classifier by:

   $\frac{1 - f_{\mathrm{err}}}{f_{\mathrm{err}}}$,  with  $f_{\mathrm{err}} = \frac{\text{misclassified events}}{\text{all events}}$

•  AdaBoost also weights the classifiers using the error rate of the individual classifier, according to:

   $y(x) = \sum_{i}^{N_{\mathrm{Classifier}}} \log\left( \frac{1 - f_{\mathrm{err}}^{(i)}}{f_{\mathrm{err}}^{(i)}} \right) C^{(i)}(x)$

Tricks and Evaluation
Let's tie classification to text
•  Representations of text are usually very high dimensional
   -  "The curse of dimensionality"
•  High-bias algorithms should generally work best in high-dimensional space
   -  They prevent overfitting
   -  They generalize more
•  For most text categorization tasks, there are many relevant features and many irrelevant ones
Which classifier do I use for a given (text) classification problem?
•  Is there a learning method that is optimal for all (text) classification problems?
   -  No, because there is a tradeoff between bias and variance.
•  Factors to take into account:
   -  How much training data is available?
   -  How simple/complex is the problem? (linear vs. nonlinear decision boundary)
   -  How noisy is the data?
   -  How stable is the problem over time?
      -  For an unstable problem, it's better to use a simple and robust classifier.
Sec. 15.3.1
Manually written rules
•  No training data, adequate editorial staff?
•  Never forget the hand-written rules solution!
   -  If (wheat or grain) and not (whole or bread) then categorize as grain
•  In practice, rules get a lot bigger than this
   -  Can also be phrased using tf or tf.idf weights
•  With careful crafting (human tuning on development data) performance is high:
   -  Construe: 94% recall, 84% precision over 675 categories (Hayes and Weinstein 1990)
•  Amount of work required is huge
   -  Estimate 2 days per class… plus maintenance

Sec. 15.3.1
Very little data?
•  If you're just doing supervised classification, you should stick to something with high bias
   -  There are theoretical results that Naïve Bayes should do well in such circumstances (Ng and Jordan 2002 NIPS)
•  The interesting theoretical answer is to explore semi-supervised training methods:
   -  Bootstrapping, EM over unlabeled documents, …
•  The practical answer is to get more labeled data as soon as you can
   -  How can you insert yourself into a process where humans will be willing to label data for you??

Sec. 15.3.1
A reasonable amount of data?
•  We can use all our clever classifiers
•  "Roll out the SVM!"
•  But if you are using an SVM/NB etc., you should probably be prepared with the "hybrid" solution where there is a Boolean overlay
   -  Or else to use user-interpretable Boolean-like models like decision trees
   -  Users like to hack, and management likes to be able to implement quick fixes immediately

Sec. 15.3.1
A huge amount of data?
•  This is great in theory for doing accurate classification…
•  But it could easily mean that expensive methods like SVMs (train time) or kNN (test time) are quite impractical
•  Naïve Bayes can come back into its own again!
   -  Or other advanced methods with linear training/test complexity like regularized logistic regression (though much more expensive to train)

Sec. 15.3.2
How many categories?
•  A few (well separated ones)?
   -  Easy!
•  A zillion closely related ones?
   -  Think: Yahoo! Directory, Library of Congress classification, legal applications
   -  Quickly gets difficult!
      -  Classifier combination is always a useful technique
         -  Voting, bagging, or boosting multiple classifiers
      -  Much literature on hierarchical classification
         -  Mileage fairly unclear, but helps a bit (Tie-Yan Liu et al. 2005)
         -  Definitely helps for scalability, even if not in accuracy
      -  May need a hybrid automatic/manual solution
Model Evaluation and Selection
•  Evaluation metrics: How can we measure accuracy? Other metrics to consider?
•  Use a validation test set of class-labeled tuples instead of the training set when assessing accuracy
•  Methods for estimating a classifier's accuracy:
   -  Holdout method, random subsampling
   -  Cross-validation
   -  Bootstrap
•  Comparing classifiers:
   -  Confidence intervals
   -  Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Confusion Matrix for 2 classes
Confusion Matrix:

   Actual class \ Predicted class   C1                     ¬C1
   C1                               True Positives (TP)    False Negatives (FN)
   ¬C1                              False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:

   Actual class \ Predicted class   buy_computer=yes   buy_computer=no   Total
   buy_computer=yes                 6954               46                7000
   buy_computer=no                  412                2588              3000
   Total                            7366               2634              10000

Sec. 15.2.4
Classifier Evaluation Metrics: Confusion Matrix for more classes
•  The (i, j) entry counts the docs actually in class i that were put in class j by the classifier (e.g., an off-diagonal entry of 53)
•  In a perfect classification, only the diagonal has non-zero entries
•  Look at common confusions and how they might be addressed
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

   A \ P   C    ¬C
   C       TP   FN   (P)
   ¬C      FP   TN   (N)
           (P') (N') (All)

•  Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified: Accuracy = (TP + TN)/All
•  Error rate: 1 - accuracy, or Error rate = (FP + FN)/All
•  Class Imbalance Problem:
   -  One class may be rare, e.g., fraud, or HIV-positive
   -  Significant majority of the negative class and minority of the positive class
   -  Sensitivity: True Positive recognition rate; Sensitivity = TP/P
   -  Specificity: True Negative recognition rate; Specificity = TN/N

Classifier Evaluation Metrics: Precision and Recall, and F-measures
•  Precision: exactness; what % of tuples that the classifier labeled as positive are actually positive: $P = \frac{TP}{TP + FP}$
•  Recall: completeness; what % of positive tuples did the classifier label as positive: $R = \frac{TP}{TP + FN}$
•  Perfect score is 1.0
•  Inverse relationship between precision & recall
•  F measure (F1 or F-score): harmonic mean of precision and recall: $F_1 = \frac{2 \times P \times R}{P + R}$
•  Fβ: weighted measure of precision and recall: $F_\beta = \frac{(1+\beta^2) \times P \times R}{\beta^2 \times P + R}$
   -  assigns β times as much weight to recall as to precision
Classifier Evaluation Metrics: Example

   Actual class \ Predicted class   cancer=yes   cancer=no   Total   Recognition (%)
   cancer=yes                       90           210         300     30.00 (sensitivity)
   cancer=no                        140          9560        9700    98.56 (specificity)
   Total                            230          9770        10000   96.50 (accuracy)

Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%
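A small sketch computing all of these metrics from the four cells of a binary confusion matrix (plain Python; the function name is illustrative):

   def metrics(tp, fn, fp, tn):
       """Accuracy, sensitivity (recall), specificity, precision and F1
       from the four cells of a binary confusion matrix."""
       total = tp + fn + fp + tn
       accuracy = (tp + tn) / total
       sensitivity = tp / (tp + fn)   # recall
       specificity = tn / (tn + fp)
       precision = tp / (tp + fp)
       f1 = 2 * precision * sensitivity / (precision + sensitivity)
       return accuracy, sensitivity, specificity, precision, f1

   # The cancer example above: TP=90, FN=210, FP=140, TN=9560
   print(metrics(90, 210, 140, 9560))
   # -> (0.965, 0.30, 0.9856, 0.3913, 0.3396) up to rounding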
Sec. 15.2.4
Micro- vs. Macro-Averaging
•  If we have more than one class, how do we combine multiple performance measures into one quantity?
•  Macroaveraging: Compute performance for each class, then average.
•  Microaveraging: Collect decisions for all classes, compute a contingency table, evaluate.

Sec. 15.2.4
Micro- vs. Macro-Averaging: Example

   Class 1           Truth: yes   Truth: no
   Classifier: yes   10           10
   Classifier: no    10           970

   Class 2           Truth: yes   Truth: no
   Classifier: yes   90           10
   Classifier: no    10           890

   Micro Ave. Table   Truth: yes   Truth: no
   Classifier: yes    100          20
   Classifier: no     20           1860

•  Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
•  Microaveraged precision: 100/120 ≈ 0.83
•  The microaveraged score is dominated by the score on common classes
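A tiny sketch reproducing the two averages in Python (per-class counts taken from the tables above):

   def precision(tp, fp):
       return tp / (tp + fp)

   # Per-class (TP, FP) from the tables above
   class1 = (10, 10)   # precision 0.5
   class2 = (90, 10)   # precision 0.9

   macro = (precision(*class1) + precision(*class2)) / 2
   micro = precision(class1[0] + class2[0], class1[1] + class2[1])
   print(macro)  # 0.7
   print(micro)  # ~0.83: pooled counts, dominated by the more common class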
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
•  Holdout method
   -  The given data is randomly partitioned into two independent sets
      -  Training set (e.g., 2/3) for model construction
      -  Test set (e.g., 1/3) for accuracy estimation
   -  Random sampling: a variation of holdout
      -  Repeat holdout k times, accuracy = avg. of the accuracies obtained
•  Cross-validation (k-fold, where k = 10 is most popular)
   -  Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
   -  At the i-th iteration, use Di as the test set and the others as the training set
   -  Leave-one-out: k folds where k = # of tuples, for small-sized data
   -  *Stratified cross-validation*: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
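A minimal sketch of k-fold cross-validation in plain Python; train_fn and eval_fn are hypothetical placeholders for whatever classifier and accuracy function you plug in:

   import random

   def k_fold_indices(n, k, seed=0):
       """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
       idx = list(range(n))
       random.Random(seed).shuffle(idx)
       return [idx[i::k] for i in range(k)]

   def cross_validate(data, labels, train_fn, eval_fn, k=10):
       """At the i-th iteration, fold i is the test set and the rest is training."""
       accuracies = []
       for fold in k_fold_indices(len(data), k):
           test = set(fold)
           train_x = [x for j, x in enumerate(data) if j not in test]
           train_y = [y for j, y in enumerate(labels) if j not in test]
           model = train_fn(train_x, train_y)
           accuracies.append(eval_fn(model,
                                     [data[j] for j in fold],
                                     [labels[j] for j in fold]))
       return sum(accuracies) / k  # average accuracy over the k folds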
Estimating Confidence Intervals: Classifier Models M1 vs. M2
•  Suppose we have 2 classifiers, M1 and M2; which one is better?
•  Use 10-fold cross-validation to obtain the mean error rates of M1 and M2
•  These mean error rates are just estimates of error on the true population of future data cases
•  What if the difference between the 2 error rates is just attributed to chance?
   -  Use a test of statistical significance
   -  Obtain confidence limits for our error estimates

Estimating Confidence Intervals: Null Hypothesis
•  Perform 10-fold cross-validation
•  Assume samples follow a t distribution with k-1 degrees of freedom (here, k = 10)
•  Use the t-test (or Student's t-test)
•  Null Hypothesis: M1 & M2 are the same
•  If we can reject the null hypothesis, then
   -  we conclude that the difference between M1 & M2 is statistically significant
   -  Choose the model with the lower error rate

Estimating Confidence Intervals: t-test
•  If only 1 test set is available: pairwise comparison
   -  For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i
   -  Average over the 10 rounds to get the mean error rates; the t-test then computes the t-statistic with k-1 degrees of freedom (paired form):

      $t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{\mathrm{var}(M_1 - M_2)/k}}$

•  If two test sets are available: use the non-paired t-test

      $t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{\mathrm{var}(M_1)/k_1 + \mathrm{var}(M_2)/k_2}}$

   where k1 & k2 are the # of cross-validation samples used for M1 & M2, respectively

Estimating Confidence Intervals: Table for t-distribution
•  Symmetric
•  Significance level, e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of the population
•  Confidence limit, z = sig/2
(Slide figure: table of t-distribution critical values)

Estimating Confidence Intervals: Statistical Significance
•  Are M1 & M2 significantly different?
   -  Compute t. Select a significance level (e.g., sig = 5%)
   -  Consult the table for the t-distribution: find the t value corresponding to k-1 degrees of freedom (here, 9)
   -  The t-distribution is symmetric: typically the upper % points of the distribution are shown → look up the value for confidence limit z = sig/2 (here, 0.025)
   -  If t > z or t < -z, then the t value lies in the rejection region:
      -  Reject the null hypothesis that the mean error rates of M1 & M2 are the same
      -  Conclude: statistically significant difference between M1 & M2
   -  Otherwise, conclude that any difference is chance
Model Selection: ROC Curves
•  ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
•  Originated from signal detection theory
•  Shows the trade-off between the true positive rate and the false positive rate
•  The area under the ROC curve is a measure of the accuracy of the model
•  Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
•  The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
•  The vertical axis represents the true positive rate
•  The horizontal axis represents the false positive rate
•  The plot also shows a diagonal line
•  A model with perfect accuracy will have an area of 1.0