Data Informatics
Seon Ho Kim, Ph.D.
[email protected]

Classification and Prediction

Outline
• Classification vs. prediction
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian classification
• Prediction
• Summary
• Reference

Classification vs. Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training dataset and uses the model to classify new data
• Prediction
– models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
– credit approval, targeted marketing, medical diagnosis, fraud detection, etc.

Classification — A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
– Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• The accuracy rate is the percentage of test-set samples that are correctly classified by the model
• The test set is independent of the training set; otherwise overfitting will occur
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

A classification algorithm produces the classifier (model):

IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4). Tenured?
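A minimal sketch of the model-usage step in code: the learned rule is applied to the test set to estimate accuracy, then to the unseen tuple. The `classify` helper and the tuple encoding are illustrative, not from the slides.

```python
# The model learned from the training data, as stated on the slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on the independent test set
test_set = [  # (name, rank, years, known label)
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy: {correct}/{len(test_set)}")  # Merlisa (7 years, not tenured) is misclassified: 3/4

# Step 2b: classify the unseen tuple (Jeff, Professor, 4)
print(classify("Professor", 4))  # -> yes
```

Note that the 75% accuracy comes entirely from the independent test set; measuring it on the training data would hide overfitting.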
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues Regarding Classification and Prediction (1): Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
– Remove irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
• Accuracy: classifier and predictor accuracy
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness
– handling noise and missing values
• Scalability: efficiency as data size grows
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Training Dataset (Example of “Buying Computer”)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Decision Trees
• A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences.
• Sort instances (data) according to feature values (e.g., age, income, etc.): a hierarchy of tests
– data are classified/sorted according to specific feature values, and the tests become increasingly specific.
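The “hierarchy of tests” idea can be written directly as nested feature tests. This sketch hard-codes the tree that the buying-computer data yields (derived later in these notes); the function name and the `"31..40"` token are ours.

```python
# A decision tree is a hierarchy of feature tests: each level inspects one
# feature, and the tests become increasingly specific down the tree.
def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # age > 40: decided by the credit-rating test
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))      # young student -> yes
print(buys_computer(">40", "no", "excellent"))   # older, excellent credit -> no
```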
• Nodes: features
– Root node: the feature that best divides the data
• Algorithms exist for determining the best root node
• Branches: values the node can assume

Output: A Decision Tree for “buying_computer”

age?
├─ <=30: student?
│   ├─ no: no
│   └─ yes: yes
├─ 31…40: yes
└─ >40: credit rating?
    ├─ excellent: no
    └─ fair: yes

Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
– There are no samples left

Attribute Selection Measure: Information Gain
• Select the attribute with the highest information gain
• Let S contain s_i tuples of class C_i, for i = 1, …, m
• The information (entropy) required to classify any arbitrary tuple:

  I(s_1, s_2, …, s_m) = − Σ_{i=1}^{m} (s_i / s) log2(s_i / s)

• The entropy of attribute A with values {a_1, a_2, …, a_v}:

  E(A) = Σ_{j=1}^{v} ((s_{1j} + … + s_{mj}) / s) · I(s_{1j}, …, s_{mj})

• The information gained by branching on attribute A:

  Gain(A) = I(s_1, s_2, …, s_m) − E(A)

Attribute Selection by Information Gain Computation
• Class P: buys_computer = “yes”; Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that “age <= 30” covers 5 of the 14 samples, with 2 yes’es and 3 no’s. Hence

Gain(age) = I(p, n) − E(age) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is chosen as the root.

Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• We must determine the best split point for A
– Sort the values of A in increasing order
– Typically, the midpoint between each pair of adjacent values is considered as a possible split point
• (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
– The point with the minimum expected information requirement for A is selected as the split point for A
• Split: D_1 is the set of tuples in D satisfying A ≤ split-point, and D_2 is the set of tuples in D satisfying A > split-point

Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example (one rule per path of the tree above):

IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
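The entropy and information-gain numbers above can be verified with a short script. This is a sketch: `entropy`, `info_gain`, and the (value, label) encoding of the training table are ours.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    # I(s1, ..., sm) = -sum_i (si/s) log2(si/s); empty classes contribute 0
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(rows):
    # Gain(A) = I(s1, ..., sm) - E(A): base entropy minus the size-weighted
    # entropy of the partitions induced by the attribute's values
    total = len(rows)
    base = entropy(list(Counter(label for _, label in rows).values()))
    parts = defaultdict(list)
    for value, label in rows:
        parts[value].append(label)
    e_a = sum(len(p) / total * entropy(list(Counter(p).values()))
              for p in parts.values())
    return base - e_a

# (age, buys_computer) pairs from the training table
data = [
    ("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"),
    (">40", "yes"), (">40", "no"), ("31..40", "yes"), ("<=30", "no"),
    ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31..40", "yes"),
    ("31..40", "yes"), (">40", "no"),
]

print(round(entropy([9, 5]), 3))   # I(9, 5), about 0.940
print(round(info_gain(data), 3))   # about 0.247 (0.246 on the slide, which rounds intermediates)
```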
Avoid Overfitting in Classification
• Overfitting: an induced tree may overfit the training data
– Too many branches, some of which may reflect anomalies (noise or outliers)
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: halt tree construction early; do not split a node if the split would cause the goodness measure to fall below a threshold
• It is difficult to choose an appropriate threshold
– Postpruning: remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is the “best pruned tree”

Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross-validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve performance over the entire distribution

Decision Trees: Assessment
• Advantages:
– Classification of data based on limiting features is intuitive
– Handles discrete/categorical features best
• Limitations:
– Danger of “overfitting” the data
– Not the best choice for accuracy

Bayesian Classification: Why?
• Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
• Let X be a data sample whose class label is unknown
• Let H be the hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; it reflects the background knowledge)
• P(X): probability that the sample data is observed
• P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows from the Bayes theorem:

  P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be written as: posterior = likelihood × prior / evidence
• MAP (maximum a posteriori) hypothesis:

  h_MAP ≡ argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

• Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost

Naive Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class:

  P(X|C_i) = Π_{k=1}^{n} P(x_k|C_i)

• The probability of observing, say, two elements y_1 and y_2 together, given that the current class is C, is the product of the probabilities of each element taken separately, given the same class: P([y_1, y_2], C) = P(y_1, C) × P(y_2, C)
• No dependence relation between attributes is assumed
• This greatly reduces the computation cost: only the class distributions need to be counted
• Once P(X|C_i) is known, assign X to the class with maximum P(X|C_i) × P(C_i)

Steps
1. Convert the dataset into a frequency table
2. Create a likelihood table by finding the probabilities
3. Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Question: players will play if the weather is sunny. Is this statement correct?
• We can solve it using the method of posterior probability discussed above.
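The computation, sketched in code; the counts are the weather-data frequencies used in the next step (9 of 14 days are “yes”, 5 days are sunny, and 3 of the sunny days are “yes”).

```python
# Bayes theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes = 9 / 14                # prior: 9 of 14 days are "play = yes"
p_sunny = 5 / 14              # evidence: 5 of 14 days are sunny
p_sunny_given_yes = 3 / 9     # likelihood: 3 of the 9 "yes" days are sunny

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # -> 0.6
```

Since 0.6 > 0.5, “yes” is the more probable class for a sunny day.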
• P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
• Here we have P(Sunny|Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
• Now, P(Yes|Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher probability, so the prediction is “yes”.

Training Dataset
• Classes: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’
• Data sample X = (age = “<=30”, income = “medium”, student = “yes”, credit_rating = “fair”)
• The training data is the same 14-tuple “buying computer” dataset shown earlier.

Naive Bayesian Classifier: An Example
• Compute P(X|C_i) for each class:

P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• For X = (age = “<=30”, income = “medium”, student = “yes”, credit_rating = “fair”):

P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

• Multiplying by the priors P(C_i):

P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007

• Therefore, X belongs to class “buys_computer = yes”

What Is Prediction?
• (Numerical) prediction is similar to classification
– construct a model
– use the model to predict a continuous or ordered value for a given input
• Prediction is different from classification
– Classification refers to predicting a categorical class label
– Prediction models continuous-valued functions
• The major method for prediction is regression
– model the relationship between one or more independent (predictor) variables and a dependent (response) variable
• Regression analysis
– Linear and multiple regression
– Nonlinear regression
– Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression
• In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (independent variables) denoted x. The case of one explanatory variable is called simple linear regression.

[Scatter plots omitted. How would you draw a line through the points? How do you determine which line ‘fits best’?]

Which Is More Logical?

[Four sketches of candidate lines through a sales-vs-advertising scatter plot omitted.]

Types of Regression Models
• 1 explanatory variable: simple regression (linear or nonlinear)
• 2+ explanatory variables: multiple regression (linear or nonlinear)

Least Squares
• ‘Best fit’ means the differences between the actual Y values and the predicted Y values are at a minimum. But positive differences offset negative ones, so we square the errors:

  Σ_{i=1}^{n} (Y_i − Ŷ_i)² = Σ_{i=1}^{n} ε̂_i²

• LS (least squares) minimizes the sum of the squared differences (errors), the SSE

Least Squares Graphically
• LS minimizes Σ_{i=1}^{n} ε̂_i² = ε̂_1² + ε̂_2² + ε̂_3² + ε̂_4²
• Each observation satisfies Y_i = β̂_0 + β̂_1 X_i + ε̂_i (e.g., Y_2 = β̂_0 + β̂_1 X_2 + ε̂_2), and the fitted line is Ŷ_i = β̂_0 + β̂_1 X_i

Coefficient Equations
• Prediction equation: ŷ_i = β̂_0 + β̂_1 x_i
• Sample slope: β̂_1 = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²
• Sample Y-intercept: β̂_0 = ȳ − β̂_1 x̄

Linear Regression Example
• Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.
– What linear regression equation best predicts statistics performance, based on math aptitude scores?
– If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
– How well does the regression equation fit the data?

Linear Regression Example (Cont…)
• In the source table (not reproduced here), the x_i column shows scores on the aptitude test and the y_i column shows statistics grades. The last two rows show the sums and mean scores that we will use to conduct the regression analysis.

Linear Regression Example (Cont…)
• The regression equation is a linear equation of the form ŷ = b_0 + b_1 x. To conduct a regression analysis, we need to solve for b_0 and b_1 (the values giving the minimum sum of squared residuals). The computations are shown below.
• b_1 = Σ [(x_i − x̄)(y_i − ȳ)] / Σ [(x_i − x̄)²] = 470/730 = 0.644
• b_0 = ȳ − b_1 x̄ = 77 − (0.644)(78) = 26.768
• Therefore, the regression equation is: ŷ = 26.768 + 0.644x

Linear Regression Example (Cont…)
• Using the regression equation:
– Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable.
– In our example, the independent variable is the student’s score on the aptitude test, and the dependent variable is the student’s statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:
  ŷ = 26.768 + 0.644x = 26.768 + 0.644 × 80 = 26.768 + 51.52 = 78.288

Linear Regression Example (Cont…)
• Warning: when you use a regression equation, do not use values for the independent variable that are outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates.
• In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95. Therefore, only use values inside that range to estimate statistics grades. Using values outside that range (less than 60 or greater than 95) is problematic.

Linear Regression Example (Cont…)
• Whenever you use a regression equation, you should ask how well the equation fits the data. One way to assess fit is to check the coefficient of determination, which can be computed from the following formula.
  R² = { (1/N) · Σ [(x_i − x̄)(y_i − ȳ)] / (σ_x σ_y) }²

• where N is the number of observations used to fit the model, Σ is the summation symbol, x_i is the x value for observation i, x̄ is the mean x value, y_i is the y value for observation i, ȳ is the mean y value, σ_x is the standard deviation of x, and σ_y is the standard deviation of y.

Linear Regression Example (Cont…)
• For the sample problem of this lesson, the computation gives a coefficient of determination equal to 0.48, indicating that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable).
• R² = 1 would indicate that the regression line perfectly fits the data.

Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into a linear regression model. For example,
  y = w_0 + w_1 x + w_2 x² + w_3 x³
is convertible to linear form with the new variables x_2 = x², x_3 = x³:
  y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
• Other functions, such as the power function, can also be transformed to a linear model
• Some models are intractably nonlinear (e.g., sums of exponential terms)
– it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae

Other Regression-Based Models
• Generalized linear model:
– The foundation on which linear regression can be applied to modeling categorical response variables
– The variance of y is a function of the mean value of y, not a constant
– Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
– Poisson regression: models data that exhibit a Poisson distribution
• Log-linear models (for categorical data):
– Approximate discrete multidimensional probability distributions
– Also useful for data compression and smoothing
• Regression trees and model trees
– Trees to predict continuous values rather than class labels

Summary
• Classification and prediction can be used to extract models describing important data classes or to predict future data trends.
• Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, Bayesian belief networks, rule-based classifiers, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, and case-based reasoning, as well as other classification methods such as genetic algorithms, rough set, and fuzzy set approaches.
• Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.

Enhancements to Basic Decision Tree Induction
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign a probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely represented
– This reduces fragmentation, repetition, and replication

Classification in Large Databases
• Classification is a classical problem extensively studied by statisticians and machine learning researchers
• Scalability: classifying datasets with millions of examples and hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
– relatively faster learning speed than other classification methods
– convertible to simple and easy-to-understand classification rules
– can use SQL queries for accessing databases
– comparable classification accuracy with other methods
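Returning to the linear-regression example earlier in these notes: its coefficient arithmetic can be reproduced from the sums quoted there. This is a sketch; the variable names are ours, and, like the slides, it rounds the slope to three decimals before computing the intercept.

```python
# Sums and means from the aptitude-test regression example
sxy = 470            # sum of (xi - x_bar)(yi - y_bar)
sxx = 730            # sum of (xi - x_bar)^2
x_bar, y_bar = 78, 77

b1 = round(sxy / sxx, 3)             # sample slope: 470/730 = 0.644
b0 = round(y_bar - b1 * x_bar, 3)    # intercept: 77 - 0.644*78 = 26.768

# Predicted statistics grade for an aptitude score of 80
y_hat_80 = round(b0 + b1 * 80, 3)
print(b1, b0, y_hat_80)  # -> 0.644 26.768 78.288
```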