Classification and Learning

Classification is the process of distributing objects, items, or concepts into classes or categories of the same type. It is synonymous with grouping, indexing, relegation, taxonomy, etc. A classifier is the tool that carries out the classification of things. Classification is one of the important tools of data mining. What knowledge can we extract from a given set of data? Given data D, can we assert D_1 -> M_1, ..., D_i -> M_i? How do we learn these mappings? In what other forms could knowledge be extracted from the given data?

Data mining offers two related operations:
■ Classification: What type? What class? What group? -- a labeling process.
■ Clustering: partitioning data into similar groupings. A cluster is a grouping of 'similar' items; clustering partitions a set of objects into equivalence classes.

Classification is used extensively in
■ marketing
■ healthcare outcomes
■ fraud detection
■ homeland security
■ investment analysis
■ automatic website and image classification
Classification allows prioritization and filtering, and it accommodates key-word search.

A typical example. Data: a relational database comprising tuples about emails passing through a port. Each tuple = <sender, recipient, date, size>. Classify each mail as either authentic or junk.

Data preparation before data mining:
► Raw data to be mined is noisy, with many unwanted attributes, etc.
► Discretization of continuous data.
► Data normalization to the [-1 .. +1] or [0 .. 1] range.
► Data smoothing to reduce noise, removal of outliers, etc.
► Relevance analysis: feature selection, to ensure that only the relevant set of features is retained.

Process of classification:
► Model construction. Each tuple belongs to a predefined class and is given a class label; the set of all tuples thus offered is called the training set. The attendant model is expressed as
** classification rules (IF-THEN statements)
** a decision tree
** a mathematical formula
► Model evaluation. Estimate accuracy on the test set: compare the known label of each test sample with the computed label, and compute the percentage of error. Ensure that the test set is disjoint from the training set.
► Model use. Implement the model to classify unseen objects:
** assign a label to a new tuple
** predict the value of an actual attribute

Training Data -> Classification Algorithm -> Classifier model

Example rules (a sketch of such a rule-based classifier follows below):
Rule1: (term = "cough" && term = "chest Xray") -> ignore;
Rule2: (temp >= 103 && bp = 180/100) -> malaria || pneumonia;
Rule3: (term = "general pain") && (term = "LBC") -> infection

Different techniques from statistics, information retrieval, and data mining are used for classification. Included among them are:
■ Bayesian methods
■ Bayesian belief networks
■ Decision trees
■ Neural networks
■ Associative classifiers
■ Emerging patterns
■ Support vector machines
Combinations of these are used as well.

The basic approach to classification model construction: if one or several attributes or features a_i \in A occur together in more than one itemset (data sample, target data) assigned the topic T, then output a rule

Rule: a_i \wedge a_j \wedge \dots \wedge a_m \rightarrow T

or

Rule: \prod_{t=1}^{m} P(a_t \mid X) \ge thres \rightarrow T

Examples. Naïve Bayes classifiers: best for classifying texts, documents, etc. Their major drawback is the unrealistic independence assumption among individual items. The basic issue here: do we accept a document d into class C? If we do, what is the penalty for misclassification? A good mail classifier should assign the "junk" label to a junk mail with very high probability, because the cost of doing this to a good mail is very high.
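The rule-based model above can be made concrete with a short sketch. The record fields ("terms", "temp", "bp") and the rule encodings below are assumptions made for illustration, not a fixed schema:

```python
# A minimal sketch of rule-based classification over medical-record tuples,
# loosely following Rules 1-3 above. Field names are illustrative assumptions.

def classify(record):
    """Apply the IF-THEN rules in order; return the first matching label."""
    terms = record.get("terms", set())

    # Rule 1: cough together with a chest X-ray note -> ignore
    if "cough" in terms and "chest Xray" in terms:
        return "ignore"

    # Rule 2: high fever with elevated blood pressure -> malaria or pneumonia
    if record.get("temp", 0) >= 103 and record.get("bp") == "180/100":
        return "malaria || pneumonia"

    # Rule 3: general pain together with an LBC finding -> infection
    if "general pain" in terms and "LBC" in terms:
        return "infection"

    return "unclassified"  # default when no rule fires

print(classify({"terms": {"general pain", "LBC"}}))  # -> infection
```

The misclassification-cost point can likewise be sketched as a threshold on the posterior odds, anticipating the odds form O(C_j | d_i) used below; the two cost values are illustrative assumptions:

```python
# Cost-sensitive acceptance: label a mail "junk" only when the posterior
# odds clear a threshold set by the two misclassification costs.

cost_junk_as_good = 1.0   # penalty for letting a junk mail through
cost_good_as_junk = 50.0  # penalty for discarding a good mail: very high

def label(p_junk):
    """Decide from P(junk | d), minimizing expected misclassification cost."""
    odds = p_junk / (1.0 - p_junk)  # posterior odds O(junk | d)
    threshold = cost_good_as_junk / cost_junk_as_good
    return "junk" if odds > threshold else "good"

print(label(0.95))  # -> good : odds of 19 do not clear the threshold of 50
print(label(0.99))  # -> junk : odds of 99 do
```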
The probability that a document d_i belongs to topic C_j is computed by Bayes' rule:

P(C_j | d_i) = \frac{P(d_i | C_j) \, P(C_j)}{P(d_i)}    ... (1)

Define the prior odds on C_j as

O(C_j) = \frac{P(C_j)}{1 - P(C_j)}    ... (2)

Then Bayes' equation gives us the posterior odds

O(C_j | d_i) = O(C_j) \, \frac{P(d_i | C_j)}{P(d_i | \neg C_j)} = O(C_j) \, L(d_i | C_j)    ... (3)

where L(d_i | C_j) is the likelihood ratio. This is one way we could use the classifier to yield a posterior estimate for a document. Another way is to go back to (1), estimating the prior as

P(C_j) = \frac{nd(C_j)}{|D|}    ... (4)

where |D| is the total number of documents in the database and nd(C_j) is the number of documents in class C_j.

The following is outlined from Julia Itskevitch's work [1].

[1] Julia Itskevitch, "Automatic Hierarchical E-mail Classification Using Association Rules", M.Sc. thesis, Computing Science, Simon Fraser University, July ... www-sal.cs.uiuc.edu/~hanj/pubs/theses.html

Multi-variate Bernoulli model. Assumption: each document is a collection (set) of terms t (key-words, etc.). A specific term is either present or absent; we are interested neither in its count nor in its position in the document. In this model,

P(d_i | C_j) = \prod_t P(t | C_j)^{\delta_{it}} \, \big(1 - P(t | C_j)\big)^{1 - \delta_{it}}    ... (5)

where \delta_{it} = 1 if t \in d_i, and zero otherwise. If we use (5), the term probabilities are estimated with Laplace smoothing as

P(t | C_j) = \frac{1 + nd(C_j, t)}{2 + nd(C_j)}    ... (6)

where nd(C_j, t) is the number of documents in class C_j containing t.

An alternative to this would be a term-count model, as outlined below. In this model we count the frequency of occurrence n_{it} of every term that is present; positional effects are ignored. In that case,

P(d_i | C_j) = const \cdot \prod_t \frac{P(t | C_j)^{n_{it}}}{n_{it}!}    ... (7)

Jason Rennie's ifile takes a naïve Bayesian approach with such a multinomial model [2]. Every new item considered is allowed to change the frequency counts dynamically. Frequent terms are kept; infrequent terms are abandoned if their count < log_2(age) - 1, where age is the total time (space) elapsed since the term was first encountered.

[2] Reference: http://cbbrowne.com/info/mail.html#IFILE
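A minimal sketch of the multi-variate Bernoulli classifier of equations (4)-(6); the toy mail corpus is an illustrative assumption:

```python
import math
from collections import defaultdict

# Multi-variate Bernoulli naive Bayes: priors P(C_j) = nd(C_j)/|D| (eq. 4),
# smoothed term probabilities P(t|C_j) = (1 + nd(C_j,t))/(2 + nd(C_j)) (eq. 6),
# document likelihood per eq. (5) over presence/absence of every vocab term.

def train(docs):
    """docs: list of (set_of_terms, class_label) pairs."""
    nd, nd_t, vocab = defaultdict(int), defaultdict(int), set()
    for terms, c in docs:
        nd[c] += 1
        vocab |= terms
        for t in terms:
            nd_t[(c, t)] += 1
    prior = {c: nd[c] / len(docs) for c in nd}            # eq. (4)
    p_t = {(c, t): (1 + nd_t[(c, t)]) / (2 + nd[c])       # eq. (6)
           for c in nd for t in vocab}
    return prior, p_t, vocab

def classify(doc_terms, prior, p_t, vocab):
    """Pick the class maximizing log P(C_j) + log P(d_i | C_j), per eq. (5)."""
    best, best_lp = None, -math.inf
    for c in prior:
        lp = math.log(prior[c])
        for t in vocab:  # every vocabulary term contributes, present or absent
            p = p_t[(c, t)]
            lp += math.log(p if t in doc_terms else 1 - p)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [({"cheap", "pills"}, "junk"), ({"meeting", "agenda"}, "good"),
        ({"cheap", "offer"}, "junk"), ({"project", "meeting"}, "good")]
prior, p_t, vocab = train(docs)
print(classify({"cheap", "offer"}, prior, p_t, vocab))  # -> junk
```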
ID3 Classifier (Quinlan, 1983). A decision-tree approach. Suppose the problem domain is a feature space {a_i}. Should we use every feature to discriminate? Isn't there a possibility that some feature is more important (more revealing) than others and therefore should be used more heavily? Consider a TB / ~TB case.

Training space: three features, with possible values over a vector of features:
Coughing (yes, no)
Temp (hi, med, lo)
Chest-pain (yes, no)

Case  Description
1.    (yes, hi,  yes, T)
2.    (no,  hi,  yes, T)
3.    (yes, lo,  yes, ~T)
4.    (yes, med, yes, T)
5.    (no,  lo,  no,  ~T)
6.    (yes, med, no,  ~T)
7.    (no,  hi,  yes, T)
8.    (yes, lo,  no,  ~T)

Consider the feature "Coughing". On this feature alone, the training set splits into two groups, a "yes" group and a "no" group. The decision tree on this feature appears as:

Coughing = yes: cases 1, 3, 4, 6, 8
Coughing = no:  cases 2, 5, 7

Entropy of the overall system (4 positive, 4 negative cases), before further discrimination based on the "coughing" feature:

T = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1 bit

Entropy of the (yes | coughing) branch, with 2 positive and 3 negative cases:

T_{yes|coughing} = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.9710 bit

Similarly, the entropy of the (no | coughing) branch, with 2 positives and 1 negative, gives us

T_{no|coughing} = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.9183 bit

Entropy of the combined branches, weighted by their probabilities:

T_{coughing} = \frac{5}{8} T_{yes|coughing} + \frac{3}{8} T_{no|coughing} = 0.951 bit

Therefore, the information gained by testing the "coughing" feature is

T - T_{coughing} = 0.049 bit

ID3 yields a strategy of testing attributes in succession to discriminate on the feature space: it resolves what combination of features one should take, and in what order, to determine class membership. (A sketch of this computation, and of the perceptron that follows, appears below.)

Learning discriminants: the generalized perceptron.

Widrow-Hoff algorithm (1960). Adaline: an adaptive linear neuron that learns via the Widrow-Hoff algorithm. Given an object with components x_i on a feature space, a neuron is a unit that receives the object's components and processes them as follows.

1. Each neuron in a neural net has a set of links through which it receives weighted input. Each link i receives its input x_i, weighs it by \omega_i, and sends it to the adder to be summed. This is what the adder produces:

u = \sum_j \omega_j x_j

2. The sum, offset by a bias b, is passed through an activation function, and the neuron fires the output y = \varphi(u + b). The choice of \varphi(\cdot) determines the neuron model.

Step function: \varphi(v) = 1 if v \ge 0, and 0 otherwise.
Sigmoid function: \varphi(v) = \frac{1}{1 + \exp(-v)}
Gaussian function: \varphi(v) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{v^2}{2}\right)

We first consider a single-neuron system; this can be generalized to a more complex system. The perceptron is a single-neuron system with the step-function activation unit

\varphi(v) = +1 if v \ge 0, and -1 otherwise.

It is used for binary classification: given training vectors and two classes C_1 and C_2, if the output is \varphi(v) = 1, assign class C_1 to the vector; otherwise class C_2.

To train the system is equivalent to adjusting the weights associated with its links. How do we adjust the weights?

1. k = 1.
2. Get \omega^{(k)}; the initial weights may be chosen randomly in (0, 1).
3. While there are misclassified training examples:

\omega^{(k+1)} = \omega^{(k)} + \eta \, \delta \, x

where \eta is the learning-rate parameter and \delta is the error, the difference between the desired and the computed output for x. Since the correction to the weights can be expressed as \Delta\omega = \eta \delta x, the rule is known as the delta rule.

A perceptron can only model linearly separable functions, such as AND, OR, and NOT; it cannot model XOR.
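The ID3 numbers worked out above can be checked mechanically. A short sketch, using the eight training cases verbatim:

```python
import math

# Information gain of the "coughing" feature on the TB training table.
# Each case is (coughing, temp, chest_pain, class), as in the table above.
cases = [("yes","hi","yes","T"),  ("no","hi","yes","T"),
         ("yes","lo","yes","~T"), ("yes","med","yes","T"),
         ("no","lo","no","~T"),   ("yes","med","no","~T"),
         ("no","hi","yes","T"),   ("yes","lo","no","~T")]

def entropy(subset):
    """Binary entropy (in bits) of the class labels in `subset`."""
    p = sum(1 for c in subset if c[-1] == "T") / len(subset)
    return 0.0 if p in (0, 1) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def gain(cases, f):
    """Information gained by splitting the cases on feature index f."""
    values = {c[f] for c in cases}
    remainder = sum(len(sub)/len(cases) * entropy(sub)
                    for v in values
                    for sub in [[c for c in cases if c[f] == v]])
    return entropy(cases) - remainder

print(round(gain(cases, 0), 3))  # coughing -> 0.049 bit, as computed above
```

And a minimal perceptron trained with the delta rule; the AND data set, learning rate, and epoch cap are illustrative assumptions (AND is linearly separable, so training converges):

```python
import random

# Delta-rule training, w <- w + eta*delta*x, with the step activation
# phi(v) = +1 if v >= 0 else -1. The bias is folded in as x[0] = 1.

def train_perceptron(samples, eta=0.1, epochs=100):
    w = [random.uniform(0, 1) for _ in range(len(samples[0][0]))]
    for _ in range(epochs):
        errors = 0
        for x, t in samples:  # desired output t is in {+1, -1}
            y = 1 if sum(wi*xi for wi, xi in zip(w, x)) >= 0 else -1
            if y != t:
                delta = t - y  # error term of the delta rule
                w = [wi + eta*delta*xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:  # converged: every sample classified correctly
            break
    return w

AND = [((1,0,0),-1), ((1,0,1),-1), ((1,1,0),-1), ((1,1,1),1)]
w = train_perceptron(AND)
print([1 if sum(wi*xi for wi, xi in zip(w, x)) >= 0 else -1
       for x, _ in AND])  # -> [-1, -1, -1, 1]
```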
Generalized δ-rule for the semilinear feedforward net with backpropagation of error.

The network is layered: an output layer of nodes k, connected by weights w_kj to a hidden layer of nodes j, connected in turn by weights w_ji to an input layer of nodes i, which receives the input pattern.

The net input to a node in layer j is

net_j = \sum_i w_{ji} \, o_i    ... (1)

The output of a node j is

o_j = f(net_j) = \frac{1}{1 + e^{-(net_j + \theta_j)/\theta_0}}    ... (2)

This is the nonlinear sigmoid activation function; it tells us how the hidden-layer nodes would fire, if they fire at all. The input to the nodes of layer k (here, the output layer) is

net_k = \sum_j w_{kj} \, o_j    ... (3)

and the output of layer k is

o_k = f(net_k)    ... (4)

In the learning phase a number of training samples are introduced sequentially. Let x_p = \{i_{pi}\} be one such input object. The net, seeing it, adjusts the weights on its links. The output pattern \{o_{pk}\} might differ from the ideal pattern \{t_{pk}\}.

The network link-weight adjustment strategy is to adjust the link weights so that the net squared error

E = \frac{1}{2} \sum_p \sum_k (t_{pk} - o_{pk})^2

is minimized. For convenience, we omit the subscript p and ask for what changes in the weights

E = \frac{1}{2} \sum_k (t_k - o_k)^2    ... (5)

is minimized. We attempt to do so by gradient descent to the minimum. That is,

\Delta w_{kj} = -\mathrm{const} \, \frac{\partial E}{\partial w_{kj}} = -\eta \, \frac{\partial E}{\partial w_{kj}}    ... (6)

Now

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial net_k} \, \frac{\partial net_k}{\partial w_{kj}}

and

\frac{\partial net_k}{\partial w_{kj}} = \frac{\partial}{\partial w_{kj}} \sum_j w_{kj} o_j = o_j

Let us define

\delta_k = -\frac{\partial E}{\partial net_k} = -\frac{\partial E}{\partial o_k} \, \frac{\partial o_k}{\partial net_k}    ... (7)

According to (5),

\frac{\partial E}{\partial o_k} = -(t_k - o_k)    ... (8)

and

\frac{\partial o_k}{\partial net_k} = f_k'(net_k)    ... (9)

so that eqn (6) can now be expressed as

\Delta w_{kj} = \eta \, (t_k - o_k) \, f_k'(net_k) \, o_j = \eta \, \delta_k \, o_j    ... (10a)

Similarly, the weight adjustment for the hidden-layer links is

\Delta w_{ji} = -\eta \, \frac{\partial E}{\partial w_{ji}} = -\eta \, \frac{\partial E}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ji}} = \eta \, \delta_j \, o_i

with

\delta_j = -\frac{\partial E}{\partial net_j} = -\frac{\partial E}{\partial o_j} \, \frac{\partial o_j}{\partial net_j} = -\frac{\partial E}{\partial o_j} \, f_j'(net_j)

But \partial E / \partial o_j cannot be computed directly. Instead, we express it in terms of known quantities:

\frac{\partial E}{\partial o_j} = \sum_k \frac{\partial E}{\partial net_k} \, \frac{\partial net_k}{\partial o_j} = \sum_k \frac{\partial E}{\partial net_k} \, \frac{\partial}{\partial o_j} \sum_m w_{km} o_m = -\sum_k \delta_k \, w_{kj}

Thus, it follows that

\delta_j = f_j'(net_j) \sum_k \delta_k \, w_{kj}

In other words, the deltas at the hidden nodes can be evaluated from the deltas at the output layer. Note that, given

o_j = f(net_j) = \frac{1}{1 + e^{-(net_j + \theta_j)/\theta_0}},

we have \frac{\partial o_j}{\partial net_j} = o_j (1 - o_j) (taking \theta_0 = 1). This results in the following delta rules for the output and hidden layers, respectively:

\delta_{pk} = (t_{pk} - o_{pk}) \, o_{pk} (1 - o_{pk})

and

\delta_{pj} = o_{pj} (1 - o_{pj}) \sum_k \delta_{pk} \, w_{kj}
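The derived delta rules fit in a compact sketch: one hidden layer of sigmoid units trained on XOR, the function a bare perceptron cannot model. The network size, learning rate, epoch count, and bias handling are illustrative assumptions:

```python
import math, random

# Backpropagation with the generalized delta rule derived above:
#   delta_k = (t_k - o_k) o_k (1 - o_k)           (output layer)
#   delta_j = o_j (1 - o_j) sum_k delta_k w_kj    (hidden layer)
#   delta w = eta * delta * (input of the link)

random.seed(0)
sig = lambda v: 1 / (1 + math.exp(-v))

n_in, n_hid, n_out, eta = 2, 3, 1, 0.5
# each weight row carries an extra bias weight (its input is clamped to 1)
w_ji = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_kj = [[random.uniform(-1, 1) for _ in range(n_hid + 1)] for _ in range(n_out)]

data = [([0,0],[0]), ([0,1],[1]), ([1,0],[1]), ([1,1],[0])]  # XOR

def forward(x):
    o_j = [sig(sum(w*v for w, v in zip(row, x + [1]))) for row in w_ji]
    o_k = [sig(sum(w*v for w, v in zip(row, o_j + [1]))) for row in w_kj]
    return o_j, o_k

for _ in range(5000):
    for x, t in data:
        o_j, o_k = forward(x)
        # output deltas: (t_k - o_k) o_k (1 - o_k)
        d_k = [(tk - ok)*ok*(1 - ok) for tk, ok in zip(t, o_k)]
        # hidden deltas propagated back through w_kj
        d_j = [oj*(1 - oj)*sum(dk*w_kj[k][j] for k, dk in enumerate(d_k))
               for j, oj in enumerate(o_j)]
        # weight updates: delta w = eta * delta * input of the link
        for k in range(n_out):
            for j, oj in enumerate(o_j + [1]):
                w_kj[k][j] += eta * d_k[k] * oj
        for j in range(n_hid):
            for i, xi in enumerate(x + [1]):
                w_ji[j][i] += eta * d_j[j] * xi

print([round(forward(x)[1][0]) for x, _ in data])  # ideally [0, 1, 1, 0]
```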