Algorithms of Artificial Intelligence
Lecture 6: Learning
E. Tyugu

Content
• Basic concepts
  - transfer function
  - classification
  - stages of usage
• Perceptron
• Hopfield net
• Hamming net
• Carpenter-Grossberg's net
• Kohonen's feature maps
• Bayesian networks
• ID3
• AQ

Neural nets
Neural nets provide another form of massively parallel learning functionality. They are well suited for learning pattern recognition. A simple way to describe a neural net is to represent it as a graph. Each node of the graph has an associated variable called state and a constant called threshold. Each arc of the graph has an associated numeric value called weight. The behaviour of a neural net is determined by the transfer functions of its nodes, which compute new values of states from the previous states of neighbouring nodes.

Node of a net
A common transfer function is of the form

  xj = f(Σi wij*xi - tj),

where the sum is taken over the incoming arcs with weights wij, xi are the states of the neighbouring nodes, and tj is the threshold of the node j where the new state is computed. Learning in neural nets means changing the weights in the right way.
[Figure: node j with inputs x1, ..., xn on arcs weighted w1j, ..., wnj, producing the output xj.]

Transfer functions
[Figure: three common transfer functions f(x): hard limiter, threshold logic, sigmoid.]

Forward-pass and layered nets
1. A forward-pass neural net is an acyclic graph. Its nodes can be classified as input, output and internal nodes. Input nodes do not have neighbours on incoming arcs, output nodes do not have them on outgoing arcs, and internal nodes possess both kinds of neighbours.
2. A layered (n-layered) net is a forward-pass net where each path from an input node to an output node contains exactly n nodes. Each node in such a graph belongs to exactly one layer. An n-layered net is strongly connected if each node in the i-th layer is connected to all nodes of the (i+1)-st layer, i = 1, 2, ..., n-1.
States of the layered net can be interpreted as decisions made on the basis of the states of the input nodes.

Layered neural net
[Figure: layered net with input nodes at the bottom, intermediate nodes in the middle and output nodes at the top.]
Learning in a layered net can be performed by means of back-propagation. In this case, the states taken by the output nodes are evaluated, and credit or blame is assigned to each output node. The evaluations are then propagated back to the other nodes.

Stages of usage
1. Selection of the structure (of the network type)
2. Assignment of initial weights
3. Learning/teaching
4. Application

Perceptrons
Perceptrons' nodes are hard limiters or sigmoids.
[Figure: examples of single-layer, double-layer and three-layer perceptrons.]

Learning in a single-layer perceptron
1. Initialize the weights wi and the threshold t to small random values.
2. Take an input x1, ..., xn and the desired output d, where d = +1 if the input is from one class and d = -1 if it is from the other class.
3. Calculate the output x of the perceptron.
4. Adapt the weights: wi' = wi + h*(d - x)*xi, where h < 1 is a positive gain value.
Repeat steps 2-4 as needed.
NB! Note that the weights are changed only when the output x differs from the desired output d. A sketch of this rule in code is given below.
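A minimal sketch of this learning rule in Python. The function name, the gain value and the toy dataset are illustrative assumptions, not part of the lecture; a hard limiter is assumed as the transfer function, and the threshold is adapted as if it were a weight on a constant input.

```python
import random

def train_perceptron(samples, h=0.1, epochs=25):
    """Single-layer perceptron learning: samples is a list of
    (inputs, d) pairs with desired output d = +1 or -1."""
    n = len(samples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # step 1: small random weights
    t = random.uniform(-0.05, 0.05)                      # step 1: small random threshold
    for _ in range(epochs):
        for xs, d in samples:                            # step 2: input and desired output
            s = sum(wi * xi for wi, xi in zip(w, xs)) - t
            x = 1 if s >= 0 else -1                      # step 3: hard-limiter output
            # step 4: weights change only when x != d, since d - x = 0 otherwise
            w = [wi + h * (d - x) * xi for wi, xi in zip(w, xs)]
            t = t - h * (d - x)    # threshold treated as a weight on constant input -1
    return w, t

# Example: two linearly separable classes of points in the plane
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), +1)]
w, t = train_perceptron(data)
```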
Regions separable by perceptrons
[Figure: two classes A and B, and the regions that single-layered, double-layered and three-layered perceptrons can separate.]

Hopfield net
Every node is connected to all other nodes, and the weights are symmetric (wij = wji). The net works with binary (+1, -1) input signals. The output is also a tuple of values +1 or -1.
[Figure: fully connected net with inputs x1, ..., xn and outputs x1', ..., xn'; even a sigmoid can be used as the transfer function.]

Hopfield net
1. Initialize the connection weights: wij = Σs xis*xjs for i ≠ j, where xis is +1 or -1 as in the description x1s, ..., xns of the class s.
2. Initialize the states with an unknown pattern x1, ..., xn.
3. Iterate until convergence (this can even be done asynchronously): xj' = f(Σi wij*xi), where f is the hard limiter.
Remarks:
• A Hopfield net can be used either as a classifier or as an associative memory.
• It always converges, but no match may occur.
• It works well when the number of classes is less than 0.15*n.
• There are several modifications of the Hopfield net architecture.
A code sketch of this algorithm is given below.
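A compact sketch of steps 1-3 in Python, assuming synchronous updates, zero diagonal weights (wii = 0) and illustrative function names:

```python
def train_hopfield(patterns):
    """Step 1: w[i][j] = sum over stored patterns of x_i * x_j, for i != j."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]

def recall_hopfield(w, pattern, max_iters=100):
    """Steps 2-3: start from an unknown pattern and iterate
    x_j' = hard_limiter(sum_i w_ij * x_i) until the state stops changing."""
    x = list(pattern)
    n = len(x)
    for _ in range(max_iters):
        x_new = [1 if sum(w[i][j] * x[i] for i in range(n)) >= 0 else -1
                 for j in range(n)]
        if x_new == x:          # converged (possibly with no match to a stored pattern)
            break
        x = x_new
    return x

# Store two patterns, then restore a corrupted version of the first one
stored = [[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]]
w = train_hopfield(stored)
print(recall_hopfield(w, [1, -1, 1, -1, 1, 1]))   # last bit flipped on input
```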
Hamming net
The Hamming net calculates the Hamming distance to the exemplar of each class and shows a positive output for the class with the minimal distance. This net is widely used for restoring corrupted binary fixed-length signals. The Hamming net works faster than the Hopfield net and has fewer connections for a larger number of input signals. It implements the optimum minimum-error classifier when bit errors are random and independent.

Hamming net
[Figure: lower subnet from inputs x1, ..., xn to middle nodes z1, ..., zm calculates the Hamming distances; upper subnet from the middle nodes to outputs y1, ..., ym selects the best match.]

Hamming net
The value at a middle node zs is n - hds, where hds is the Hamming distance to the exemplar pattern ps. Therefore, in the lower subnet the weight from input xi to the middle node zs is wis = xis/2, with the offset ts = n/2 added at the node, for each exemplar s. Indeed, the node value is 0 for the most incorrect code, and 1 = (+1 - (-1))*xis/2 is added for each correct input signal, so that a completely correct code gives n.

Hamming net continued
1. Initialize the weights and offsets:
   a) lower subnet: wis = xis/2, ts = n/2 for each exemplar s;
   b) upper subnet: tk = 0, wsk = 1 if k = s, else -e, where 0 < e < 1/m.
2. Initialize the lower subnet with the (unknown) pattern x1, ..., xn and calculate yj = f(Σi wij*xi + tj).
3. Iterate in the upper subnet until convergence: yj' = f(yj - e*Σ(k≠j) yk).

A comparator subnet
Here is a comparator subnet that selects the maximum of two analog inputs x0, x1. By combining several of these nets one builds comparators for more inputs (4, 8, etc.; approximately log2 n layers are needed for n inputs). The output z is the maximum value, y0 and y1 indicate which input is the maximum, dark nodes are hard limiters, light nodes are threshold-logic nodes, all thresholds are 0, and the weights are shown on the arcs.
[Figure: comparator subnet with inputs x0, x1, outputs z, y0, y1, and arc weights 0.5, 1 and -1.]

Carpenter-Grossberg net
This net forms clusters without supervision. Its clustering algorithm is similar to the simple leader-clustering algorithm: select the first input as the exemplar for the first cluster; if the next input is close enough to some cluster exemplar, it is added to that cluster, otherwise it becomes the exemplar of a new cluster. The net includes much feedback and is described by nonlinear differential equations.
[Figure: Carpenter-Grossberg net for three binary inputs x0, x1, x2 and two classes.]

Kohonen's feature maps
A Kohonen self-organizing feature map (K-map) uses an analogy with biological neural structures in which the placement of neurons is orderly and reflects the structure of the external (sensed) stimuli, e.g. in the auditory and visual pathways. A K-map learns when continuous-valued input vectors are presented to it without specifying the desired output. The weights of the connections adjust to regularities in the input. A large number of examples is needed. The K-map mimics well the learning in biological neural structures. It is usable in speech recognizers.

Kohonen's feature maps continued
This is a flat (two-dimensional) structure with connections between neighbours and connections from each input node to all output nodes. It learns clusters of input vectors without any help from a teacher and preserves closeness (topology).
[Figure: two-dimensional grid of output nodes fed by a continuous-valued input vector.]

Learning in K-maps
1. Initialize the weights to small random numbers and set the initial radius of the neighbourhood of nodes.
2. Get an input x1, ..., xn.
3. Compute the distance dj to each output node: dj = Σi (xi - wij)².
4. Select the output node s with the minimal distance ds.
5. Update the weights of the node s and of all nodes in its neighbourhood: wij' = wij + h*(xi - wij), where h < 1 is a gain that decreases in time.
Repeat steps 2-5. A code sketch of this procedure follows.
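A minimal sketch of steps 1-5 in Python, under assumptions the lecture does not fix: a rectangular grid of output nodes, a gain and a neighbourhood radius that both shrink linearly over the epochs, and illustrative names throughout.

```python
import math
import random

def train_kmap(inputs, grid_w, grid_h, epochs=50):
    """Self-organizing feature map: each grid node keeps a weight vector
    of the same dimension as the input vectors."""
    n = len(inputs[0])
    # step 1: small random weights for every output node of the grid
    w = {(gx, gy): [random.random() * 0.1 for _ in range(n)]
         for gx in range(grid_w) for gy in range(grid_h)}
    for epoch in range(epochs):
        h = 0.5 * (1 - epoch / epochs)                       # gain decreases in time
        radius = max(1.0, max(grid_w, grid_h) / 2 * (1 - epoch / epochs))
        for x in inputs:                                     # step 2
            # steps 3-4: the node with minimal squared distance d_j wins
            s = min(w, key=lambda node: sum((xi - wi) ** 2
                                            for xi, wi in zip(x, w[node])))
            # step 5: move the winner and its grid neighbourhood towards x
            for node, wv in w.items():
                if math.dist(node, s) <= radius:
                    w[node] = [wi + h * (xi - wi) for xi, wi in zip(x, wv)]
    return w
```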
Bayesian networks
Bayesian networks use the conditional probability formula

  P(e,H) = P(H|e)*P(e) = P(e|H)*P(H)

binding the conditional probabilities of the evidence e and the hypothesis H. A Bayesian network is a graph whose nodes are variables denoting the occurrence of events; its arcs express causal dependence of the events. Each node x has conditional probabilities for every possible combination of events influencing the node, i.e. for every collection of events in the nodes of pred(x) immediately preceding the node x in the graph.

Bayesian networks
[Figure: example network with nodes x1, ..., x6.]
The joint probability assessment for all nodes x1, ..., xn,

  P(x1, ..., xn) = P(x1|pred(x1))*...*P(xn|pred(xn)),

constitutes a joint-probability model that supports the assessed event combination. For the present example it is as follows:

  P(x1, ..., x6) = P(x6|x5)*P(x5|x2,x3)*P(x4|x1,x2)*P(x3|x1)*P(x2|x1)*P(x1)

Bayesian networks continued
A Bayesian network can be used for diagnosis/classification: given some events, the probabilities of the events depending on the given ones can be predicted. To construct a Bayesian network, one needs to
• determine its structure (topology);
• find the conditional probabilities for each dependency.
A code sketch of such a computation on the example network is given below.
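The sketch below shows the factorization of the six-node example at work, in Python. All probability numbers are invented for illustration, the variables are assumed boolean, and the diagnosis is done by brute-force enumeration over the joint model:

```python
from itertools import product

# Made-up conditional probability tables for the example network,
# keyed by the values of each node's predecessors.
p_x1 = 0.3
p_x2 = {True: 0.8, False: 0.1}                      # P(x2 | x1)
p_x3 = {True: 0.6, False: 0.2}                      # P(x3 | x1)
p_x4 = {(True, True): 0.9, (True, False): 0.5,      # P(x4 | x1, x2)
        (False, True): 0.4, (False, False): 0.05}
p_x5 = {(True, True): 0.95, (True, False): 0.6,     # P(x5 | x2, x3)
        (False, True): 0.7, (False, False): 0.1}
p_x6 = {True: 0.85, False: 0.05}                    # P(x6 | x5)

def bern(p_true, value):
    """Probability that a boolean event takes `value` when P(true) = p_true."""
    return p_true if value else 1 - p_true

def joint(x1, x2, x3, x4, x5, x6):
    """P(x1,...,x6) factored exactly as on the slide."""
    return (bern(p_x6[x5], x6) * bern(p_x5[(x2, x3)], x5) *
            bern(p_x4[(x1, x2)], x4) * bern(p_x3[x1], x3) *
            bern(p_x2[x1], x2) * bern(p_x1, x1))

# Diagnosis: probability of the cause x1 given the observed effect x6 = true,
# i.e. P(x1 | x6) = P(x1, x6) / P(x6), summing out the unobserved nodes.
num = sum(joint(True, *rest, True) for rest in product([True, False], repeat=4))
den = sum(joint(x1, *rest, True)
          for x1 in (True, False) for rest in product([True, False], repeat=4))
print(num / den)
```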
Taxonomy of neural nets
NEURAL NETS
• binary-valued inputs
  - supervised learning: Hopfield nets, Hamming nets
  - unsupervised learning: Carpenter-Grossberg nets
• continuous inputs
  - supervised learning: single-layered perceptrons, multi-layered perceptrons
  - unsupervised learning: Kohonen maps

A decision tree
outlook?
  sunny    → humidity? (high → -, normal → +)
  overcast → +
  rain     → windy? (true → -, false → +)

ID3 algorithm
• To get the fastest decision-making procedure, one has to arrange the attributes in a decision tree in the proper order: the most discriminating attributes first. This is done by the algorithm called ID3.
• The most discriminating attribute can be defined in precise terms as the attribute for which fixing its value changes the entropy of the possible decisions the most. Let wj be the frequency of the j-th decision in a set of examples x. Then the entropy of the set is

  E(x) = -Σj wj*log(wj)

• Let fix(x,a,v) denote the set of those elements of x whose value of attribute a is v. The average entropy that remains in x after the value of a has been fixed is

  H(x,a) = Σv kv*E(fix(x,a,v)),

where kv is the ratio of examples in x with attribute a having the value v.

ID3 algorithm
ID3 uses the following variables and functions:
p -- pointer to the root of the decision tree being built;
x -- set of examples;
E(x) -- entropy of the set of examples x;
H(x,a) -- average entropy that remains in x after the value of a has been fixed;
atts(x) -- attributes of the set of examples x;
vals(a) -- values of the attribute a;
mark(p,d) -- mark node p with d;
newsucc(p,v) -- new successor of the node p with attribute value v; returns a pointer to the new node;
fix(x,a,v) -- subset of the given set of examples x with the value v of the attribute a.

ID3 continued
A.3.10:
ID3(x,p) =
  if empty(x) then failure
  elif E(x) = 0 then mark(p, decision(x))
  else
    h := bignumber;
    for a ∈ atts(x) do
      if H(x,a) < h then h := H(x,a); am := a fi
    od;
    mark(p, am);
    for v ∈ vals(am) do ID3(fix(x,am,v), newsucc(p,v)) od
  fi

AQ algorithm
This algorithm is for learning knowledge in the form of rules. The algorithm AQ(ex,cl) builds a set of rules from the given set of examples ex for the collection of classes cl, using the function aqrules(p,n,c) for building a set of rules for a class c from its given positive examples p and negative examples n.
pos(ex,c) -- the set of positive examples for the class c in ex;
neg(ex,c) -- the set of negative examples for the class c in ex;
covers(r,e) -- a predicate which is true when the example e satisfies the rule r;
prune(rules) -- throws away the rules covered by some other rule.

AQ continued
A.3.11:
AQ(ex,cl) =
  allrules := { };
  for c ∈ cl do allrules := allrules ∪ aqrules(pos(ex,c), neg(ex,c), c) od;
  return(allrules)

aqrules(pos,neg,c) =
  rules := {aqrule(selectFrom(pos), neg, c)};
  for e ∈ pos do
    L: { for r ∈ rules do if covers(r,e) then break L fi od;
         rules := rules ∪ {aqrule(e,neg,c)};
         prune(rules) }
  od;
  return(rules)

AQ continued
aqrule(seed,neg,c) -- builds a new rule from the initial condition seed and the negative examples neg for the class c.
newtests(r,seed,e) -- generates amendments q to the rule r such that r & q covers seed and does not cover e;
worstelements(star) -- chooses the least promising elements of star.

aqrule(seed,neg,c) =
  star := {true};
  for e ∈ neg do
    for r ∈ star do
      if covers(r,e) then star := (star ∪ {r & q | q ∈ newtests(r,seed,e)}) \ {r} fi;
      while size(star) > maxstar do star := star \ worstelements(star) od
    od
  od;
  return("if" bestin(star) "then" c)

A clustering problem (learning without a teacher)
[Figure: a set of points in the plane falling into natural clusters.]

Hierarchy of learning methods
Learning
• learning numerically
  - parametric learning
  - learning by automata
  - massively parallel learning: neural nets, genetic algorithms
• symbolic learning
  - search in concept space: specific to general, general to specific
  - inductive inference
  - inverse resolution

Decision tables
A decision table is a compact representation of knowledge when the knowledge serves for making a choice among a finite (and not very large) set of possibilities. A decision table is a composite table consisting of three types of areas. The condition list (C1, C2, ..., Ck are conditions written down in some formal language that can be translated into a program):

  C1 C2 ... Ck

Decision tables
The selection matrix consists of columns corresponding to the conditions and of rows corresponding to the choices. Each cell of the table can hold one of three values:
y -- yes, the condition must be satisfied;
n -- no, the condition must not be satisfied;
0 -- it does not matter whether the condition is satisfied or not (the cell is then often simply left empty).

Decision tables
The third area of the table contains the selectable decisions. When there are two areas of the first and of the second type each, the third area can also be given in the form of a matrix:
[Figure: a table with two condition areas holding y/n entries and a matrix of decisions between them.]

Bibliography
• Kohonen, T. (1984) Self-Organization and Associative Memory. Springer-Verlag, Berlin.
• Lippmann, R. (1987) An Introduction to Computing with Neural Nets. IEEE ASSP Magazine, No. 4, 4-22.
• Michalski, R. S. (1983) A Theory and Methodology of Inductive Learning. In: R. S. Michalski, J. Carbonell, T. Mitchell, eds. Machine Learning: An Artificial Intelligence Approach. Tioga, Palo Alto, 83-134.

Exercises
1. Calculate the entropies of the attributes;
2. build a decision tree.
A starting-point code sketch is given after the table.

Sample data for ID3:

Outlook   Temperature  Humidity  Windy  Class
Sunny     Hot          High      False  -
Sunny     Hot          High      True   -
Overcast  Hot          High      False  +
Rain      Mild         High      False  +
Rain      Cool         Normal    False  +
Rain      Cool         Normal    True   -
Overcast  Cool         Normal    True   +
Sunny     Mild         High      False  -
Sunny     Cool         Normal    False  +
Rain      Mild         Normal    False  +
Sunny     Mild         Normal    True   +
Overcast  Mild         High      True   +
Overcast  Hot          Normal    False  +
Rain      Mild         High      True   -
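A starting point for the exercises, sketched in Python: it computes E(x) and H(x,a) exactly as defined above for the sample data. The encoding of the table and all names are my own choices, not part of the lecture.

```python
import math
from collections import Counter

# Sample data for ID3: (outlook, temperature, humidity, windy, class)
EXAMPLES = [
    ("sunny", "hot", "high", False, "-"), ("sunny", "hot", "high", True, "-"),
    ("overcast", "hot", "high", False, "+"), ("rain", "mild", "high", False, "+"),
    ("rain", "cool", "normal", False, "+"), ("rain", "cool", "normal", True, "-"),
    ("overcast", "cool", "normal", True, "+"), ("sunny", "mild", "high", False, "-"),
    ("sunny", "cool", "normal", False, "+"), ("rain", "mild", "normal", False, "+"),
    ("sunny", "mild", "normal", True, "+"), ("overcast", "mild", "high", True, "+"),
    ("overcast", "hot", "normal", False, "+"), ("rain", "mild", "high", True, "-"),
]
ATTRS = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def entropy(x):
    """E(x) = -sum_j w_j * log(w_j) over the decision frequencies w_j."""
    counts = Counter(e[-1] for e in x)
    return -sum(c / len(x) * math.log2(c / len(x)) for c in counts.values())

def fix(x, a, v):
    """Subset of x whose attribute a has the value v."""
    return [e for e in x if e[ATTRS[a]] == v]

def remaining_entropy(x, a):
    """H(x,a) = sum_v k_v * E(fix(x,a,v))."""
    return sum(len(fix(x, a, v)) / len(x) * entropy(fix(x, a, v))
               for v in {e[ATTRS[a]] for e in x})

for a in ATTRS:
    print(a, round(remaining_entropy(EXAMPLES, a), 3))
# The attribute with the smallest H(x,a) -- outlook, for this data --
# is the most discriminating one and becomes the root of the tree.
```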