Data Informatics
Seon Ho Kim, Ph.D.
[email protected]
Classification and Prediction
Outline
• Classification vs. prediction
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian classification
• Prediction
• Summary
• Reference
Classification vs. Prediction
• Classification
  – predicts categorical class labels (discrete or nominal)
  – classifies data (constructs a model) based on the training data set and uses it in classifying new data
• Prediction
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – Credit approval, targeted marketing, medical diagnosis, fraud detection, etc.
Classification — A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
Classification — A Two-Step Process
• Model usage: for classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set must be independent of the training set; otherwise overfitting will occur
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Process (1): Model Construction
Training data:

  NAME   RANK            YEARS   TENURED
  Mike   Assistant Prof  3       no
  Mary   Assistant Prof  7       yes
  Bill   Professor       2       yes
  Jim    Associate Prof  7       yes
  Dave   Assistant Prof  6       no
  Anne   Associate Prof  3       no

A classification algorithm learns the classifier (model) from these data, e.g.:

  IF rank = ‘professor’ OR years > 6
  THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction
The classifier is checked against testing data and then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured? (A code sketch of the whole two-step process follows below.)

Testing data:

  NAME     RANK            YEARS   TENURED
  Tom      Assistant Prof  2       no
  Merlisa  Associate Prof  7       no
  George   Professor       5       yes
  Joseph   Assistant Prof  7       yes
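The two-step process can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: it assumes scikit-learn is available, and the numeric encoding of RANK is our own.

```python
# Minimal sketch of the two-step process, assuming scikit-learn.
# RANK encoded as 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]  # (rank, years)
y_train = ["no", "yes", "yes", "yes", "no", "no"]           # tenured

# Step 1: model construction from the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage on unseen data, e.g. (Jeff, Professor, 4).
print(model.predict([[2, 4]]))  # expect ['yes'] under the learned rule
```

With this training set the tree should recover a rule equivalent to the one on the slide: professors, or anyone with more than six years, are tenured.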
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues Regarding Classification and Prediction (1): Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove the irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
• Accuracy: classifier and predictor accuracy
• Speed
  – time to construct the model (training time)
  – time to use the model (classification/prediction time)
• Robustness
  – handling noise and missing values
• Scalability: efficiency as data size grows
• Interpretability
  – understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Training Dataset
(Example of “Buying Computer”)

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31…40   high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31…40   low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31…40   medium  no       excellent      yes
  31…40   high    yes      fair           yes
  >40     medium  no       excellent      no
Decision Trees
• A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences.
• Sort instances (data) according to feature values (e.g., age, income): a hierarchy of tests
  – data are classified/sorted according to specific feature values, which become increasingly specific.
• Nodes: features
  – Root node: the feature that best divides the data
    • Algorithms exist for determining the best root node
• Branches: values the node can assume
Output: A Decision Tree for “buying_computer”

  age?
  ├─ <=30  → student?
  │          ├─ no  → no
  │          └─ yes → yes
  ├─ 31…40 → yes
  └─ >40   → credit_rating?
             ├─ excellent → no
             └─ fair      → yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – Tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (if continuous, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Algorithm for Decision Tree Induction
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  – There are no samples left
Attribute Selection Measure: Information Gain
• Select the attribute with the highest information gain
• Let S contain s_i tuples of class C_i for i = 1, …, m
• Information measures the info required to classify any arbitrary tuple:

  I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}

• Entropy of attribute A with values {a_1, a_2, …, a_v}:

  E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})

• Information gained by branching on attribute A:

  Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)
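These three measures translate directly into code. A minimal sketch in Python (the helper names are ours):

```python
# Information-gain measures from the slide; base-2 logs, helper names ours.
from math import log2

def info(counts):
    """I(s1, ..., sm) for the class counts of a node."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

def expected_info(partitions):
    """E(A): info of each partition induced by A, weighted by its size."""
    s = sum(sum(p) for p in partitions)
    return sum(sum(p) / s * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(class_counts) - expected_info(partitions)
```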
Attribute Selection by Information Gain Computation
• Class P: buys_computer = “yes” (9 samples)
• Class N: buys_computer = “no” (5 samples)
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age over the “Buying Computer” training data shown earlier:

  age     p_i  n_i  I(p_i, n_i)
  <=30    2    3    0.971
  31…40   4    0    0
  >40     3    2    0.971

  E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

• \frac{5}{14} I(2,3) means that “age <= 30” covers 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

  Gain(age) = I(p, n) - E(age) = 0.246

• Similarly,

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
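The numbers above can be reproduced with the helpers sketched after the previous slide:

```python
# age splits the 14 samples into (2 yes, 3 no), (4, 0) and (3, 2).
print(f"{info([9, 5]):.3f}")                             # 0.940
print(f"{expected_info([[2, 3], [4, 0], [3, 2]]):.3f}")  # 0.694
print(f"{gain([9, 5], [[2, 3], [4, 0], [3, 2]]):.3f}")   # 0.246
```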
Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A (see the sketch below)
  – Sort the values of A in increasing order
  – Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    • (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
  – The point with the minimum expected information requirement for A is selected as the split point for A
• Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
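A sketch of this split-point search, reusing expected_info() from the earlier sketch (the function and variable names are ours):

```python
def best_split_point(values, labels):
    """Scan midpoints of adjacent sorted values; return the split point
    with minimum expected information requirement."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best_e, best_mid = float("inf"), None
    for i in range(len(pairs) - 1):
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2        # (ai + ai+1) / 2
        d1 = [l for v, l in pairs if v <= mid]           # A <= split-point
        d2 = [l for v, l in pairs if v > mid]            # A >  split-point
        e = expected_info([[d.count(c) for c in classes] for d in (d1, d2)])
        if e < best_e:
            best_e, best_mid = e, mid
    return best_mid
```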
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example (see the sketch after this list):
  IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  IF age = “31…40” THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
  IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
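With scikit-learn (continuing the earlier tenure sketch), the induced paths can be printed with export_text; each root-to-leaf path corresponds to one IF-THEN rule:

```python
# Print the tree's root-to-leaf paths (one per extracted rule).
from sklearn.tree import export_text

print(export_text(model, feature_names=["rank", "years"]))
```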
Avoid Overfitting in Classification
• Overfitting: an induced tree may overfit the training data
  – Too many branches, some of which may reflect anomalies (noise or outliers)
  – Poor accuracy for unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the “best pruned tree”
Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross-validation (see the sketch below)
• Use all the data for training
  – but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
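As one illustration of the cross-validation option, a hypothetical sketch that scores a few candidate tree sizes via max_depth (the parameter grid is ours, not from the slides), continuing the earlier tenure example:

```python
# Pick a tree size by cross-validated accuracy over candidate depths.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for depth in (1, 2, 3):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth),
                             X_train, y_train, cv=3)
    print(depth, scores.mean())
```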
Decision Trees: Assessment
• Advantages:
  – Classification based on a limited set of features is intuitive
  – Handles discrete/categorical features best
• Limitations:
  – Danger of “overfitting” the data
  – Not the best choice in terms of accuracy
Bayesian Classification: Why?
• Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayes' Theorem: Basics
• Let X be a data sample whose class label is unknown
• Let H be the hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge)
• P(X): probability that the sample data is observed
• P(X|H): probability of observing the sample X, given that the hypothesis holds
Bayes' Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}

• Informally, this can be written as: posterior = likelihood × prior / evidence
• MAP (maximum a posteriori) hypothesis:

  h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)

• Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
Naive Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class:

  P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)

• The probability of observing, say, two elements x_1 and x_2 given that the current class is C is the product of the probabilities of each element taken separately, given the same class: P([x_1, x_2] \mid C) = P(x_1 \mid C) \cdot P(x_2 \mid C)
• No dependence relation between attributes
• Greatly reduces the computation cost: only the class distribution needs to be counted
• Once P(X|C_i) is known, assign X to the class with maximum P(X|C_i) · P(C_i) (see the sketch below)
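A minimal sketch of this scoring rule (the helper name is ours):

```python
def nb_score(cond_probs, prior):
    """P(X|Ci) * P(Ci): product of per-attribute conditional
    probabilities P(xk|Ci) times the class prior."""
    p = prior
    for pk in cond_probs:
        p *= pk
    return p
```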
Steps
1. Convert the data set into a frequency table
2. Create a likelihood table by finding the probabilities
3. Use the naive Bayes equation to calculate the posterior probability for each class; the class with the highest posterior probability is the outcome of the prediction
Question: Players will play if the weather is sunny. Is this statement correct?
• We can solve it using the method of posterior probability discussed above.
• P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny)
• Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64
• Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher probability.
Training Dataset
• Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’
• Data sample to classify:
  X = (age = “<=30”, income = “medium”, student = “yes”, credit_rating = “fair”)
• Training data: the “Buying Computer” table shown earlier
Naive Bayesian Classifier: An Example
• Compute P(X|Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• For X = (age = “<=30”, income = “medium”, student = “yes”, credit_rating = “fair”):
  P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
• P(X|Ci) × P(Ci):
  P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.044 × 9/14 = 0.028
  P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.019 × 5/14 = 0.007
• Therefore, X belongs to class “buys_computer = yes”
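The example can be reproduced with nb_score() from the sketch above:

```python
# Conditional probabilities for X, per class, times the class priors.
p_yes = nb_score([2/9, 4/9, 6/9, 6/9], 9/14)  # ≈ 0.028
p_no = nb_score([3/5, 2/5, 1/5, 2/5], 5/14)   # ≈ 0.007
print("yes" if p_yes > p_no else "no")        # buys_computer = yes
```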
What Is Prediction?
• (Numerical) prediction is similar to classification
  – construct a model
  – use the model to predict a continuous or ordered value for a given input
• Prediction is different from classification
  – Classification predicts a categorical class label
  – Prediction models continuous-valued functions
• Major method for prediction: regression
  – model the relationship between one or more independent or predictor variables and a dependent or response variable
• Regression analysis
  – Linear and multiple regression
  – Nonlinear regression
  – Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
• In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted x. The case of one explanatory variable is called simple linear regression.

[Two scatter plots of Y vs. X] How would you draw a line through the points? How do you determine which line ‘fits best’?
Which Is More Logical?

[Four panels showing the same Sales-vs-Advertising scatter plot with different candidate lines]
Types of Regression Models
• 1 explanatory variable: simple regression
• 2+ explanatory variables: multiple regression
• In either case, the model may be linear or nonlinear
Least Squares
• ‘Best fit’ means the differences between actual Y values and predicted Y values are at a minimum. But positive differences offset negative ones, so square the errors:

  \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2

• LS (Least Squares) minimizes the Sum of the Squared Errors (SSE)
Least Squares Graphically
• LS minimizes \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2

[Scatter plot of four points around the fitted line \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i, with residuals \hat{\varepsilon}_1, …, \hat{\varepsilon}_4; e.g., Y_2 = \hat{\beta}_0 + \hat{\beta}_1 X_2 + \hat{\varepsilon}_2]
Coefficient Equations
• Prediction equation:

  \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

• Sample slope:

  \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

• Sample Y-intercept:

  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
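These equations are a few lines of NumPy; a minimal sketch (the helper name is ours; np.polyfit(x, y, 1) would give the same line):

```python
import numpy as np

def ls_coefficients(x, y):
    """Sample slope and intercept from the least-squares equations."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```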
Linear Regression Example
• Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.
  – What linear regression equation best predicts statistics performance, based on math aptitude scores?
  – If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
  – How well does the regression equation fit the data?
Linear Regression Example (Cont…)
• In the table below, the x_i column shows scores on the aptitude test. Similarly, the y_i column shows statistics grades. The last two rows show the sums and mean scores that we will use to conduct the regression analysis.

  Student   x_i    y_i
  1         95     85
  2         85     95
  3         80     70
  4         70     65
  5         60     70
  Sum       390    385
  Mean      78     77
Linear Regression Example (Cont…)
• The regression equation is a linear equation of the form ŷ = b0 + b1x. To conduct a regression analysis, we need to solve for b0 and b1. Computations are shown below (giving the minimum sum of squared residuals):

  b1 = Σ[(x_i − x̄)(y_i − ȳ)] / Σ[(x_i − x̄)²] = 470/730 = 0.644
  b0 = ȳ − b1·x̄ = 77 − (0.644)(78) = 26.768

• Therefore, the regression equation is: ŷ = 26.768 + 0.644x
Linear Regression Example (Cont…)
• Using the regression equation (see the code sketch below):
  – Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable.
  – In our example, the independent variable is the student's score on the aptitude test, and the dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:

    ŷ = 26.768 + 0.644x = 26.768 + 0.644 × 80 = 26.768 + 51.52 = 78.288
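The worked example can be checked with ls_coefficients() from the earlier sketch, using the data from the table:

```python
import numpy as np

x = np.array([95, 85, 80, 70, 60])  # aptitude test scores
y = np.array([85, 95, 70, 65, 70])  # statistics grades

b0, b1 = ls_coefficients(x, y)
print(b1)            # ≈ 0.644
print(b0)            # ≈ 26.78 (the slides round b1 first, getting 26.768)
print(b0 + b1 * 80)  # ≈ 78.288
```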
Linear Regression Example (Cont…)
• Warning: when you use a regression equation, do not use values for the independent variable that are outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates.
• In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95. Therefore, only use values inside that range to estimate statistics grades. Using values outside that range (less than 60 or greater than 95) is problematic.
Linear Regression Example (Cont…)
• Whenever you use a regression equation, you should ask how well the equation fits the data. One way to assess fit is to check the coefficient of determination, which can be computed from the following formula:

  R^2 = \left( \frac{(1/N) \sum (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \right)^2

• where N is the number of observations used to fit the model, Σ is the summation symbol, x_i is the x value for observation i, x̄ is the mean x value, y_i is the y value for observation i, ȳ is the mean y value, σ_x is the standard deviation of x, and σ_y is the standard deviation of y.
Linear Regression Example (Cont…)
• Computations for the sample problem of this lesson (using σ_x = √(730/5) ≈ 12.083 and σ_y = √(630/5) ≈ 11.225):

  R² = {(1/5) × 470 / (12.083 × 11.225)}² ≈ (0.693)² ≈ 0.48

• A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable).
• R² = 1 would indicate that the regression line perfectly fits the data.
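The R² computation, continuing with the same arrays as the previous sketch (np.std divides by N, matching the population standard deviations in the formula):

```python
r2 = (np.sum((x - x.mean()) * (y - y.mean())) / len(x)
      / (x.std() * y.std())) ** 2
print(round(r2, 2))  # 0.48
```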
Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into a linear regression model (as in the sketch below). For example,

  y = w_0 + w_1 x + w_2 x^2 + w_3 x^3

  is convertible to linear form with the new variables x_2 = x^2 and x_3 = x^3:

  y = w_0 + w_1 x + w_2 x_2 + w_3 x_3

• Other functions, such as the power function, can also be transformed into a linear model
• Some models are intractably nonlinear (e.g., a sum of exponential terms)
  – it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae
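A sketch of the transformation: treat x² and x³ as new variables and solve an ordinary linear least-squares problem (the data here is illustrative, not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 9.8, 28.7, 66.0, 126.9])  # roughly 1 + x^3, illustrative

# Design matrix in the new variables: columns 1, x, x2 = x^2, x3 = x^3.
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [w0, w1, w2, w3] of the linearized model
```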
Other Regression-Based Models
• Generalized linear model:
  – Foundation on which linear regression can be applied to modeling categorical response variables
  – The variance of y is a function of the mean value of y, not a constant
  – Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  – Poisson regression: models data that exhibit a Poisson distribution
• Log-linear models (for categorical data):
  – Approximate discrete multidimensional probability distributions
  – Also useful for data compression and smoothing
• Regression trees and model trees
  – Trees to predict continuous values rather than class labels
Summary
• Classification and prediction can be used to extract models describing important data classes or to predict future data trends.
• Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, and case-based reasoning, as well as other classification methods such as genetic algorithms, rough set, and fuzzy set approaches.
• Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.
Enhancements to Basic Decision Tree Induction
• Allow for continuous-valued attributes
  – Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values (see the sketch below)
  – Assign the most common value of the attribute
  – Assign a probability to each of the possible values
• Attribute construction
  – Create new attributes based on existing ones that are sparsely represented
  – This reduces fragmentation, repetition, and replication
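A small pandas sketch of the two missing-value strategies named above (the column name and data are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"income": ["high", None, "medium", "high", None]})

# Strategy 1: assign the most common value of the attribute.
df["income_filled"] = df["income"].fillna(df["income"].mode()[0])

# Strategy 2: assign a probability to each possible value, estimated
# from the observed value distribution (here high 2/3, medium 1/3).
probs = df["income"].value_counts(normalize=True)
print(probs)
```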
Classification in Large Databases
• Classification: a classical problem extensively studied by statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
  – relatively fast learning speed (compared to other classification methods)
  – convertible to simple and easy-to-understand classification rules
  – can use SQL queries for accessing databases
  – comparable classification accuracy with other methods