CLASSIFICATION AND PREDICTION
Part 1: Decision Trees and Bayesian Classification
Lecture notes (Skriptum) for course 401 192 / 1, WS 2002, VO 2.0
Peter Brezany, Institut für Softwarewissenschaft, Universität Wien (2002)

Classification and Prediction

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification predicts categorical labels; prediction models continuous-valued functions.

Example: classification of insurance customers into the risk classes "Hoch" (high) and "Niedrig" (low). Alter (age) is a numeric attribute, Autotyp (car type) a categorical attribute, and Risikoklasse (risk class) the class attribute.

----------------------------------------------
ID   Alter   Autotyp   Risikoklasse
----------------------------------------------
1    23      Familie   Hoch
2    17      Sport     Hoch
3    43      Sport     Hoch
4    68      Familie   Niedrig
5    32      LKW       Niedrig
----------------------------------------------

Categorical (nominal) attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, color, etc.).

Classification and Prediction (2)

Another example: a classification model may be built to categorize bank loan applications as either safe or risky. A prediction model may be built to predict the expenditures of potential customers on computer equipment, given their income and occupation.

Most algorithms proposed so far are memory resident, typically assuming a small data size. Recent database mining research has built on such work, developing scalable classification and prediction techniques capable of handling large disk-resident data. These techniques often consider parallel and distributed processing.

The Data Classification Process

Data classification is a two-step process (Fig. 1). In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data tuples are also referred to as samples, examples, or objects.

The data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the same population. Since the class label of each training sample is provided, this step is also known as supervised learning. In unsupervised learning (or clustering), the class label of each training sample is not known, and the number or set of classes to be learned may not be known in advance. We will learn more about clustering later.

The learned model is represented in the form of classification rules, decision trees, or mathematical formulae.

The Data Classification Process (2)

In Fig. 1(a), classification rules can be learned to identify customers as having excellent or fair credit ratings. The rules can be used to categorize future data samples.

In the second step, Fig. 1(b), the model is used for classification. The predictive accuracy of the model (classifier) is estimated. A simple technique for this uses a test set of class-labeled samples, which are independent of the training samples.
The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model; for each sample, the known class label is compared with the learned model's class prediction for that sample. If the accuracy were estimated based on the training set, the estimate could be optimistic. If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is not known. In our example, the classification rules learned can be used to predict the credit rating of new or future customers.

The Data Classification Process (3)

Figure 1(a): a classification algorithm analyzes the training data

--------------------------------------------------
name           age       income   credit_rating
--------------------------------------------------
Courtney Fox   31...40   high     excellent
Sandy Jones    <=30      low      fair
Bill Lee       <=30      low      excellent
Susan Lake     >40       med      fair
Claire Phips   >40       med      fair
Andre Beau     31...40   high     excellent
...            ...       ...      ...
--------------------------------------------------

and produces classification rules such as

if age = "31...40" and income = high then credit_rating = excellent

Figure 1(b): the rules are evaluated on independent test data

--------------------------------------------------
name           age       income   credit_rating
--------------------------------------------------
Frank Jones    >40       high     fair
Sylvia Crest   <=30      low      fair
Anne Yee       31...40   high     excellent
...            ...       ...      ...
--------------------------------------------------

and can then be applied to new data, e.g. (John Henri, 31...40, high) -> credit rating? excellent.

Figure 1: The two-step data classification process: (a) learning, (b) classification.

Comparing Classification Methods

Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
Speed: the computation costs involved in generating and using the model.
Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.
Scalability: the ability to construct the model efficiently given large amounts of data.
Interpretability: the level of understanding and insight that is provided by the model.

Classification by Decision Tree Induction

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes.

Example (insurance risk classes): the root tests Autotyp; if Autotyp = LKW, then Risikoklasse = niedrig; otherwise Alter is tested: if Alter > 60, Risikoklasse = niedrig, if Alter <= 60, Risikoklasse = hoch.

Classification by Decision Tree Induction (2)

Another typical decision tree is shown in Fig. 2. It represents the concept buys_computer - it predicts whether or not a customer at AllElectronics is likely to purchase a computer.

In order to classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample. Decision trees can easily be converted to classification rules.

Decision trees have been used in many application areas: medicine, business, game theory, etc. When a decision tree is built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, improving classification accuracy on unseen data.

An Example Decision Tree

age?
  <=30     -> student?        no -> no,         yes -> yes
  31...40  -> yes
  >40      -> credit_rating?  excellent -> no,  fair -> yes

Figure 2: A decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal (nonleaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).
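To make the two steps concrete, the following minimal sketch (not part of the original slides; the test tuples are invented for illustration) encodes the tree of Figure 2 as nested Python dictionaries, classifies a sample by tracing a path from the root to a leaf, and estimates accuracy on a small class-labeled test set:

------------------------------------------------------------------
# The decision tree of Figure 2 as nested dictionaries:
# an internal node names the attribute it tests, a leaf is a class label.
tree = {"attribute": "age",
        "branches": {"<=30":    {"attribute": "student",
                                 "branches": {"no": "no", "yes": "yes"}},
                     "31...40": "yes",
                     ">40":     {"attribute": "credit_rating",
                                 "branches": {"excellent": "no", "fair": "yes"}}}}

def classify(node, sample):
    """Trace a path from the root to the leaf that holds the class prediction."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

def accuracy(node, test_set):
    """Percentage of class-labeled test samples classified correctly."""
    correct = sum(1 for sample, label in test_set if classify(node, sample) == label)
    return 100.0 * correct / len(test_set)

# Hypothetical class-labeled test samples, independent of the training data.
test_set = [({"age": "<=30",    "student": "yes", "credit_rating": "fair"},      "yes"),
            ({"age": ">40",     "student": "no",  "credit_rating": "excellent"}, "no"),
            ({"age": "31...40", "student": "no",  "credit_rating": "fair"},      "yes")]

print(classify(tree, {"age": "<=30", "student": "no", "credit_rating": "fair"}))  # -> no
print(accuracy(tree, test_set))                                                   # -> 100.0
------------------------------------------------------------------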
Another Example - The Contact Lens Problem

Look at the contact lens data in the following table: it gives the conditions under which an optician might want to prescribe soft contact lenses, etc.

--------------------------------------------------------------------------------
age             spectacle      astigmatism   tear production   recommended
                prescription                 rate              lenses
--------------------------------------------------------------------------------
young           myope          no            reduced           none
young           myope          no            normal            soft
young           myope          yes           reduced           none
young           myope          yes           normal            hard
young           hypermetrope   no            reduced           none
young           hypermetrope   no            normal            soft
young           hypermetrope   yes           reduced           none
young           hypermetrope   yes           normal            hard
pre-presbyopic  myope          no            reduced           none
pre-presbyopic  myope          no            normal            soft
pre-presbyopic  myope          yes           reduced           none
pre-presbyopic  myope          yes           normal            hard
pre-presbyopic  hypermetrope   no            reduced           none
pre-presbyopic  hypermetrope   no            normal            soft
pre-presbyopic  hypermetrope   yes           reduced           none
pre-presbyopic  hypermetrope   yes           normal            none
presbyopic      myope          no            reduced           none
presbyopic      myope          no            normal            none
presbyopic      myope          yes           reduced           none
presbyopic      myope          yes           normal            hard
presbyopic      hypermetrope   no            reduced           none
presbyopic      hypermetrope   no            normal            soft
presbyopic      hypermetrope   yes           reduced           none
presbyopic      hypermetrope   yes           normal            none
--------------------------------------------------------------------------------

Part of a structural description of the information in this table might be as follows:

If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatism = no then recommendation = soft

Decision Tree for the Contact Lens Data

tear production rate?
  reduced -> none
  normal  -> astigmatism?
               no  -> soft
               yes -> spectacle prescription?
                        myope        -> hard
                        hypermetrope -> none

The Third Example - The Weather Problem

-------------------------------------------------------------------
outlook     temperature   humidity   windy   play
-------------------------------------------------------------------
sunny       hot           high       false   no
sunny       hot           high       true    no
overcast    hot           high       false   yes
rainy       mild          high       false   yes
rainy       cool          normal     false   yes
rainy       cool          normal     true    no
overcast    cool          normal     true    yes
sunny       mild          high       false   no
sunny       cool          normal     false   yes
rainy       mild          normal     false   yes
sunny       mild          normal     true    yes
overcast    mild          high       true    yes
overcast    hot           normal     false   yes
rainy       mild          high       true    no
-------------------------------------------------------------------

The Weather Problem - Rules

A set of rules learned from the information introduced in the previous slide - not necessarily a very good set - might look like this:

------------------------------------------------------
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
------------------------------------------------------

The above rules are meant to be interpreted in order: the first one first; then, if it doesn't apply, the second; and so on.
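A minimal sketch (not from the original slides) of how such an ordered rule list can be applied: the rules are tried top to bottom, the first matching condition decides the class, and the last rule acts as a default.

------------------------------------------------------------------
# The weather rules from above, tried in order; the first condition
# that matches an instance determines the predicted class.
rules = [
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "rainy" and x["windy"] == "true",    "no"),
    (lambda x: x["outlook"] == "overcast",                          "yes"),
    (lambda x: x["humidity"] == "normal",                           "yes"),
    (lambda x: True,                                                "yes"),  # default rule
]

def apply_rules(instance):
    for condition, label in rules:
        if condition(instance):
            return label

print(apply_rules({"outlook": "sunny", "humidity": "high",   "windy": "false"}))  # -> no
print(apply_rules({"outlook": "rainy", "humidity": "high",   "windy": "false"}))  # -> yes
print(apply_rules({"outlook": "rainy", "humidity": "normal", "windy": "true"}))   # -> no
------------------------------------------------------------------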
The Weather Data With Some Numeric Attributes

-------------------------------------------------------------------
outlook     temperature   humidity   windy   play
-------------------------------------------------------------------
sunny       85            85         false   no
sunny       80            90         true    no
overcast    83            86         false   yes
rainy       70            96         false   yes
rainy       68            80         false   yes
rainy       65            70         true    no
overcast    64            65         true    yes
sunny       72            95         false   no
sunny       69            90         false   yes
rainy       75            80         false   yes
sunny       75            70         true    yes
overcast    72            90         true    yes
overcast    81            75         false   yes
rainy       71            91         true    no
-------------------------------------------------------------------

Now the first rule from the last slide might take the form:

If outlook = sunny and humidity > 83 then play = no

The Weather Data and Association Rules

Many association rules can be derived from the weather data. Some good ones are:

If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high

Divide and Conquer: Constructing decision trees

This problem can be expressed recursively:

Select an attribute to place at the root node.
Make a branch for each possible value. This splits up the example set into subsets, one for every value of the attribute.
Now the process can be repeated recursively for each branch, using only those instances that actually reach the branch.
If at any time all instances at a node have the same classification, stop developing that part of the tree.

How do we determine which attribute to split on, given a set of examples with different classes? In our weather data, there are four possibilities for each split - see the next slide.

Tree Stumps for the Weather Data

[Figure: one-level tree stumps for the weather data, splitting on (a) outlook, (b) temperature, (c) humidity, and (d) windy, with the yes/no class counts shown at each leaf.]

Constructing decision trees (2)

If we had a measure of the purity of each node, we could choose the attribute that produces the purest daughter nodes. The measure of purity is called the information and is measured in units called bits. Associated with a node of the tree, it represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node.

The information is calculated based on the number of yes and no classes at each node; we will look at the details of the calculation shortly. When evaluating the first tree stump in the figure on the last slide (the split on outlook), the numbers of yes and no classes at the leaf nodes are [2,3], [4,0], and [3,2], and the respective information values (entropies) are

info([2,3]) = 0.971 bits
info([4,0]) = 0.0 bits
info([3,2]) = 0.971 bits

e.g. entropy(2/5, 3/5) = - 2/5 * log2(2/5) - 3/5 * log2(3/5)
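The info() values quoted above can be reproduced with a small sketch of the entropy calculation (a base-2 logarithm is assumed, as in the slides):

------------------------------------------------------------------
from math import log2

def info(counts):
    """Entropy, in bits, of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(round(info([2, 3]), 3))  # -> 0.971
print(round(info([4, 0]), 3))  # -> 0.0   (a pure node needs no further information)
print(round(info([3, 2]), 3))  # -> 0.971
------------------------------------------------------------------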
Constructing decision trees (3)

We calculate the average information value of these, taking into account the number of instances that go down each branch:

info([2,3], [4,0], [3,2]) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693 bits

The above average represents the amount of information that we expect would be necessary to specify the class of a new instance, given the tree structure (a) in our example. Before any of the tree structures were created, the training examples at the root comprised nine yes and five no examples, corresponding to an information value of

info([9,5]) = 0.940 bits

Thus tree (a) is responsible for an information gain of

gain(outlook) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits,

which can be interpreted as the information value of creating a branch on the outlook attribute.

Constructing decision trees (4)

The way forward is clear. We calculate the information gain for each attribute and choose the one that gains the most information to split on. In the situation of our figure,

gain(outlook) = 0.247 bits
gain(temperature) = 0.029 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits

so we select outlook as the splitting attribute at the root of the tree. Then we continue recursively. The next figure shows the possibilities for a further branch at the node reached when outlook is sunny. The information gain for each turns out to be

gain(temperature) = 0.571 bits
gain(humidity) = 0.971 bits
gain(windy) = 0.020 bits,

so we select humidity as the splitting attribute at this point.

Extended Tree Stumps for the Weather Data

[Figure: extended tree stumps at the "outlook = sunny" node, splitting further on temperature, humidity, and windy, with the yes/no class counts shown at each leaf.]

Decision Tree for the Weather Data

outlook?
  sunny    -> humidity?   high -> no,    normal -> yes
  overcast -> yes
  rainy    -> windy?      false -> yes,  true -> no

Calculating Information

The measure should be applicable to multiclass situations, not just to two-class ones. For example, in a 3-class situation:

info([2, 3, 4]) = entropy(2/9, 3/9, 4/9)
                = - 2/9 * log2(2/9) - 3/9 * log2(3/9) - 4/9 * log2(4/9)

The logarithms are expressed in base 2.

So far, we have addressed the decision tree topic in an informal way. A more formal approach follows.

Decision Tree Construction - The Basic Algorithm

The basic algorithm for decision tree construction builds decision trees in a top-down recursive divide-and-conquer manner.

Algorithm: Generate_decision_tree. Generate a decision tree from training data.
Input: The training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.
Output: A decision tree.
Method:
(1)  create a node N;
(2)  if samples are all of the same class, C, then
(3)    return N as a leaf node labeled with the class C;
(4)  if attribute-list is empty then
(5)    return N as a leaf node labeled with the most common class in samples; // majority voting
(6)  select test-attribute, the attribute among attribute-list with the highest information gain;

Decision Tree Construction - Basic Algorithm (2)

(7)  label node N with test-attribute;
(8)  for each known value a_i of test-attribute // partition the samples
(9)    grow a branch from node N
       for the condition test-attribute = a_i;
(10) let s_i be the set of samples in samples for which test-attribute = a_i;
(11) if s_i is empty then // a partition is empty
(12)   attach a leaf labeled with the most common class in samples;
(13) else attach the node returned by Generate_decision_tree(s_i, attribute-list - test-attribute);

The basic strategy of the algorithm is informally explained in the next slides.

Decision Tree Construction - Basic Algorithm (3)

1. The tree starts as a single node representing the training samples (step 1).

2. If the samples are all of the same class, then the node becomes a leaf and is labeled with that class (steps 2 and 3).

3. Otherwise, the algorithm uses an entropy-based measure known as information gain as a heuristic for selecting the attribute that will best separate the samples into individual classes (step 6). This attribute becomes the "test" or "decision" attribute at the node (step 7). In this version of the algorithm, all attributes are categorical, that is, discrete-valued. Continuous-valued attributes must be discretized.

4. A branch is created for each known value of the test attribute, and the samples are partitioned accordingly (steps 8-10).

5. The algorithm uses the same process recursively to form a decision tree for the samples at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants (step 13).

Decision Tree Construction - Basic Algorithm (4)

6. The recursive partitioning stops only when any one of the following conditions is true:

(a) All samples for a given node belong to the same class (steps 2 and 3), or

(b) There are no remaining attributes on which the samples may be further partitioned (step 4). In this case, majority voting is employed (step 5). This involves converting the given node into a leaf and labeling it with the class in majority among samples. Alternatively, the class distribution of the node samples may be stored.

(c) There are no samples for the branch test-attribute = a_i (step 11). In this case, a leaf is created with the majority class in samples (step 12).

Attribute Selection Measure

The information gain measure is used to select the test attribute at each node in the tree - it is also called an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node.

Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m distinct classes C_i (for i = 1, ..., m). Let s_i be the number of samples of S in class C_i. The expected information needed to classify a given sample is given by

I(s_1, s_2, ..., s_m) = - SUM_{i=1..m} p_i * log2(p_i)        (Eq. 1)

where p_i is the probability that an arbitrary sample belongs to class C_i and is estimated by s_i / s.

Attribute Selection Measure (2)

Let attribute A have v distinct values, {a_1, a_2, ..., a_v}. Attribute A can be used to partition S into v subsets, {S_1, S_2, ..., S_v}, where S_j contains those samples in S that have value a_j of A. If A were selected as the test attribute (i.e., the best attribute for splitting), then these subsets would correspond to the branches grown from the node containing the set S.

Let s_ij be the number of samples of class C_i in a subset S_j. The entropy, or expected information based on the partitioning into subsets by A, is given by

E(A) = SUM_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)
The term (s_1j + ... + s_mj) / s acts as the weight of the j-th subset: it is the number of samples in the subset (i.e., having value a_j of A) divided by the total number of samples in S. The smaller the entropy value, the greater the purity of the subset partitions.

Attribute Selection Measure (3)

For a given subset S_j,

I(s_1j, s_2j, ..., s_mj) = - SUM_{i=1..m} p_ij * log2(p_ij)

where p_ij = s_ij / |S_j| is the probability that a sample in S_j belongs to class C_i.

The encoding information that would be gained by branching on A is

Gain(A) = I(s_1, s_2, ..., s_m) - E(A)

Gain(A) is the expected reduction in entropy caused by knowing the value of A. The algorithm computes the information gain of each attribute. The attribute with the highest information gain is chosen as the test attribute for the given set S.

Example on the Induction of a Decision Tree

Table 2 (on a subsequent slide) presents a training set taken from the AllElectronics customer database. The class label attribute, buys_computer, has two distinct values; therefore, there are two distinct classes. Let C1 correspond to buys_computer = "yes" and C2 to buys_computer = "no". There are 9 samples of class yes and 5 samples of class no. The expected information needed to classify a given sample is

I(s_1, s_2) = I(9, 5) = - 9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940

Next, we need to compute the entropy of each attribute. For age:

For age = "<=30":    s_11 = 2, s_21 = 3, I(s_11, s_21) = 0.971
For age = "31...40": s_12 = 4, s_22 = 0, I(s_12, s_22) = 0
For age = ">40":     s_13 = 3, s_23 = 2, I(s_13, s_23) = 0.971

Example on the Induction of a Decision Tree (2)

The expected information needed to classify a given sample if the samples are partitioned according to age is

E(age) = 5/14 * I(s_11, s_21) + 4/14 * I(s_12, s_22) + 5/14 * I(s_13, s_23) = 0.694

Hence, the gain in information from such a partitioning would be

Gain(age) = I(s_1, s_2) - E(age) = 0.940 - 0.694 = 0.246

Similarly, we compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute - see Fig. 3. The final decision tree returned by the algorithm is shown in Fig. 2.

Training Data Tuples from the AllElectronics DB

Table 2
--------------------------------------------------------------------
RID   age       income   student   credit_rating   Class: buys_computer
--------------------------------------------------------------------
1     <=30      high     no        fair            no
2     <=30      high     no        excellent       no
3     31...40   high     no        fair            yes
4     >40       medium   no        fair            yes
5     >40       low      yes       fair            yes
6     >40       low      yes       excellent       no
7     31...40   low      yes       excellent       yes
8     <=30      medium   no        fair            no
9     <=30      low      yes       fair            yes
10    >40       medium   yes       fair            yes
11    <=30      medium   yes       excellent       yes
12    31...40   medium   no        excellent       yes
13    31...40   high     yes       fair            yes
14    >40       medium   no        excellent       no
--------------------------------------------------------------------

Sample Partitioning

age = "<=30":
  income   student   credit_rating   class
  high     no        fair            no
  high     no        excellent       no
  medium   no        fair            no
  low      yes       fair            yes
  medium   yes       excellent       yes

age = "31...40":
  income   student   credit_rating   class
  high     no        fair            yes
  low      yes       excellent       yes
  medium   no        excellent       yes
  high     yes       fair            yes

age = ">40":
  income   student   credit_rating   class
  medium   no        fair            yes
  low      yes       fair            yes
  low      yes       excellent       no
  medium   yes       fair            yes
  medium   no        excellent       no

Figure 3: The attribute age has the highest information gain and therefore becomes a test attribute at the root node of the decision tree. Branches are grown for each value of age, and the samples are shown partitioned according to each branch.

Extracting Classification Rules from Decision Trees

The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules. One rule is created for each path from the root to a leaf node.
The IF-THEN rules may be easier for humans to understand, particularly if the given tree is very large.

IF age = "<=30"    AND student = "no"              THEN buys_computer = "no"
IF age = "<=30"    AND student = "yes"             THEN buys_computer = "yes"
IF age = "31...40"                                 THEN buys_computer = "yes"
IF age = ">40"     AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40"     AND credit_rating = "fair"      THEN buys_computer = "yes"

BAYESIAN CLASSIFICATION

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

A simple Bayesian classifier known as the naive Bayesian classifier is comparable in performance with decision tree and neural network classifiers; it also exhibits high speed and accuracy when applied to large databases.

Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes - class conditional independence. Bayesian belief networks are graphical models which allow the representation of dependencies among subsets of attributes.

Bayes Theorem

Let X be a data sample whose class label is unknown, and let H be some hypothesis, such as that the data sample X belongs to a specified class C. For classification, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. Example: suppose the world of data samples consists of fruits, described by their color and shape; X is red and round, and H = "X is an apple". Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round.

P(H) is the prior probability, or a priori probability, of H. Example: this is the probability that any given data sample is an apple, regardless of how the data sample looks. The posterior probability P(H|X) is based on more information than the prior probability P(H).

Bayes Theorem (2)

P(X|H) is the posterior probability of X conditioned on H. Example: it is the probability that X is red and round given that we know that it is true that X is an apple.

P(X) is the prior probability of X. Example: it is the probability that a data sample from our set of fruits is red and round.

Bayes theorem is

P(H|X) = P(X|H) * P(H) / P(X)

In the next part, we will learn how Bayes theorem is used in the naive Bayesian classifier.

Naive Bayesian Classification

The naive Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Each data sample is represented by an n-dimensional feature vector, X = (x_1, x_2, ..., x_n), depicting n measurements made on the sample from n attributes, respectively A_1, A_2, ..., A_n.

2. Suppose that there are m classes, C_1, C_2, ..., C_m. Given an unknown data sample X (i.e., having no class label), the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naive Bayesian classifier assigns an unknown sample X to the class C_i if and only if

P(C_i|X) > P(C_j|X)   for 1 <= j <= m, j != i.

Thus we maximize P(C_i|X) - the class C_i for which P(C_i|X) is maximized is called the maximum posteriori hypothesis. By Bayes theorem, P(C_i|X) = P(X|C_i) * P(C_i) / P(X).

3. As P(X) is constant for all classes, only P(X|C_i) * P(C_i) need be maximized.

4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|C_i). Therefore, the naive assumption of class conditional independence is made. This presumes that there are no dependence relationships among the attributes.
Thus

P(X|C_i) = PROD_{k=1..n} P(x_k|C_i)

The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can be estimated from the training samples, where

(a) If A_k is categorical, then P(x_k|C_i) = s_ik / s_i, where s_ik is the number of training samples of class C_i having the value x_k for A_k, and s_i is the number of training samples belonging to C_i.

(b) If A_k is continuous-valued, then the attribute is typically assumed to have a Gaussian distribution, so that

P(x_k|C_i) = g(x_k, mu_Ci, sigma_Ci) = (1 / (sqrt(2*pi) * sigma_Ci)) * exp(- (x_k - mu_Ci)^2 / (2 * sigma_Ci^2))

where mu_Ci and sigma_Ci are the mean and standard deviation, respectively, of attribute A_k for training samples of class C_i.

5. In order to classify an unknown sample X, P(X|C_i) * P(C_i) is evaluated for each class C_i. Sample X is then assigned to the class C_i if and only if

P(X|C_i) * P(C_i) > P(X|C_j) * P(C_j)   for 1 <= j <= m, j != i.

Bayesian Classification - Example

Predicting a class label using naive Bayesian classification: we wish to predict the class label of an unknown sample using naive Bayesian classification, given the same training data as in our example for decision tree induction. The data samples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values (namely, yes and no). Let C1 correspond to the class buys_computer = "yes" and C2 correspond to buys_computer = "no". The unknown sample we wish to classify is

X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")

We need to maximize P(X|C_i) * P(C_i) for i = 1, 2. P(C_i), the prior probability of each class, can be computed based on the training samples:

P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no")  = 5/14 = 0.357

To compute P(X|C_i) for i = 1, 2, we compute the following conditional probabilities:

Bayesian Classification - Example (2)

P(age = "<=30" | buys_computer = "yes")           = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no")            = 3/5 = 0.600
P(income = "medium" | buys_computer = "yes")      = 4/9 = 0.444
P(income = "medium" | buys_computer = "no")       = 2/5 = 0.400
P(student = "yes" | buys_computer = "yes")        = 6/9 = 0.667
P(student = "yes" | buys_computer = "no")         = 1/5 = 0.200
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.400

Using the above probabilities, we obtain

P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no")  = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 x 0.643 = 0.028
P(X | buys_computer = "no")  * P(buys_computer = "no")  = 0.019 x 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts buys_computer = "yes" for sample X.

PREDICTION

What if we would like to predict a continuous value, rather than a categorical label? The prediction of continuous values can be modeled by the statistical technique of regression. For example, we may want to develop a model to predict the salary of college graduates with 10 years of work experience.

Many problems can be solved by linear regression, and even more can be tackled by applying transformations to the variables so that a nonlinear problem can be converted to a linear one.

Linear Regression

Data are modeled using a straight line. A random variable Y (called a response variable) is modeled as a linear function of another random variable X (called a predictor variable):

Y = alpha + beta * X

alpha and beta are regression coefficients. They can be computed by the method of least squares:

beta  = SUM_{i=1..s} (x_i - x_mean) * (y_i - y_mean) / SUM_{i=1..s} (x_i - x_mean)^2
alpha = y_mean - beta * x_mean

where x_mean is the average of x_1, ..., x_s, y_mean is the average of y_1, ..., y_s, and s is the number of samples.
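A minimal sketch of these least-squares formulas; the numbers fed in below are the salary example data shown on the next slide, so the printed coefficients can be compared with the values used there.

------------------------------------------------------------------
def least_squares(xs, ys):
    """Simple linear regression Y = alpha + beta * X by the method of least squares."""
    s = len(xs)
    x_mean = sum(xs) / s
    y_mean = sum(ys) / s
    beta = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
            / sum((x - x_mean) ** 2 for x in xs))
    alpha = y_mean - beta * x_mean
    return alpha, beta

# Salary example data from the next slide (years of experience, salary in $1000).
years  = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

alpha, beta = least_squares(years, salary)
print(round(beta, 2), round(alpha, 2))  # -> about 3.54 and 23.21 (the slide rounds these to 3.5 and 23.6)
print(round(alpha + beta * 10, 1))      # -> 58.6, the predicted salary for 10 years of experience
------------------------------------------------------------------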
Linear Regression - Example

Salary data
===========================================
X                     Y
Years of experience   Salary (in $1000)
-------------------------------------------
 3                    30
 8                    57
 9                    64
13                    72
 3                    36
 6                    43
11                    59
21                    90
 1                    20
16                    83
===========================================

A plot of the data is shown in the next figure.

Plot of the Data in the Previous Table

[Figure: scatter plot of years of experience (x-axis, 0-25) against salary in $1,000 (y-axis, 0-100).]

Although the points do not fall on a straight line, the overall pattern suggests a linear relationship between X (years of experience) and Y (salary).

Linear Regression - Example (2)

Using the linear regression formulas introduced before, we obtain approximately beta = 3.5 and alpha = 23.6, giving the estimated line

Y = 23.6 + 3.5 * X

Prediction for 10 years of experience: Y = 23.6 + 3.5 * 10 = 58.6, i.e., a salary of about $58.6K.

Multiple regression is an extension of linear regression:

Y = alpha + beta_1 * X_1 + beta_2 * X_2

The method of least squares can also be applied here to solve for alpha, beta_1, and beta_2.

Nonlinear Regression

Many relationships between response variables and predictor variables can be modeled by polynomial functions (we say that such relationships do not show a linear dependence). By applying transformations to the variables, we can convert the nonlinear model into a linear one, which can then be solved by the method of least squares. For example, consider the cubic polynomial

Y = alpha + beta_1 * X + beta_2 * X^2 + beta_3 * X^3

By defining the new variables X_1 = X, X_2 = X^2, X_3 = X^3, the equation above can be converted to the linear form

Y = alpha + beta_1 * X_1 + beta_2 * X_2 + beta_3 * X_3
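As an illustration of this transformation trick (not from the original slides), the following sketch fits a cubic model to synthetic data by treating X, X^2, and X^3 as three linear predictor variables and solving with a least-squares routine; numpy is used here purely for convenience.

------------------------------------------------------------------
import numpy as np

# Synthetic data generated from a known cubic: Y = 2 + 1.0*X - 0.5*X^2 + 0.1*X^3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2 + 1.0 * x - 0.5 * x**2 + 0.1 * x**3

# Transformed predictor variables X_1 = X, X_2 = X^2, X_3 = X^3 (plus an intercept column),
# so the cubic model becomes an ordinary multiple linear regression.
design = np.column_stack([np.ones_like(x), x, x**2, x**3])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)

print(np.round(coeffs, 3))  # -> [ 2.   1.  -0.5  0.1], recovering alpha, beta_1, beta_2, beta_3
------------------------------------------------------------------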