CSE 5290: Artificial Intelligence

Artificial Intelligence and Decision Making
Session 11: Probabilistic Reasoning Systems
15.1 Representing Knowledge in an Uncertain Domain
15.2 The Semantics of Belief Networks
15.2.1 Representing the joint probability distribution
15.2.1.1 A method for constructing belief networks
15.2.1.2 Compactness and node ordering
15.2.1.3 Representation of conditional probability tables
15.2.2 Conditional independence relations in belief networks
15.3 Inference in Belief Networks
15.3.1 The nature of probabilistic inferences
15.3.2 An algorithm for answering queries
15.4 Inference in Multiply Connected Belief Networks
15.4.1 Clustering methods
15.4.2 Cutset conditioning methods
15.4.3 Stochastic simulation methods
15.5 Knowledge Engineering for Uncertain Reasoning
15.5.1 Case Study: The Pathfinder system
15.6 Other Approaches to Uncertain Reasoning
15.6.1 Default reasoning
15.6.2 Rule-based methods for uncertain reasoning
15.6.3 Representing ignorance: Dempster-Shafer theory
15.6.4 Representing vagueness: Fuzzy sets and fuzzy logic
15.7 Summary
Web Sites
http://ist-socrates.berkeley.edu:4247/tech.reports/tech27.html
http://info.sm.umist.ac.uk/wp/abstract/wp9801.htm
http://www-cs-students.stanford.edu/~cathay/summer/inspec
http://citeseer.nj.nec.com/context/17451/0
http://citeseer.nj.nec.com/context/4306/0
http://www.cs.vu.nl/vakgroepen/ai/education/courses/ks/slides/KS05/tsld031.htm
From:
http://www.sci.brooklyn.cuny.edu/~kopec/cis718/Tong3.htm
Summary of "Psychological validity of uncertainty combining rules in expert systems"
by Bruce E. Tonn and Richard T. Goeltz, Expert Systems, vol. 7, no. 2, pp. 94-101, May 1990.
Summarized by Tao Tong.
Several approaches have been developed to deal with inexact reasoning in expert
systems. Examples are Certainty Factors (CF), the Dempster-Shafer approach, Fuzzy Sets,
and the Theory of Endorsements. They are used and justified because they are simple,
mathematically rigorous, easy to program, cautious, and able to capture the essence of
natural language.
Certainty Factors have been in use since MYCIN and remain the most extensively used
uncertainty representation and manipulation approach. They are simple and intuitive.
The Dempster-Shafer approach excels in mathematical rigor. It builds on the theory of
probability and traces back to great mathematicians such as Jacob Bernoulli in the 1600s.
Fuzzy Set theory has the advantage of being mathematically rigorous and sensitive to
natural language when dealing with uncertainties.
The Theory of Endorsements deals with complex uncertainty in a non-mathematical
manner; however, it has less support than the other approaches.
All the above uncertainty representation and manipulation approaches are undergoing
extensions and improvements. However, choosing among the various uncertainty
approaches is not a trivial task for knowledge engineers.
An important but often ignored aspect of these uncertainty approaches is their psychological
validity. Simply put, psychological validity asks whether human experts really use the
uncertainty approaches when they are solving a problem. When the uncertainty approaches
used by human experts and by the expert system shell are incompatible, the consequences
may be minor; sometimes, however, serious adverse consequences may result. For example,
when knowledge engineers try to accommodate an expert system shell that uses a mismatched
uncertainty approach, they may unknowingly compromise the knowledge base to fit the
flawed shell's uncertainty approach. If both the knowledge base and the uncertainty approach
are flawed, the performance of the resulting expert system must necessarily be unsatisfactory.
Knowledge engineers may be interested in relevant psychological studies of human inexact
reasoning. Hink, R. and Woods, D., in the paper How Humans Process Uncertain Knowledge:
An Introduction, provide an excellent review of the relevant psychological research. From
this research we can see that experts and average people alike can be highly inefficient and
frequently irrational when making decisions. Even mathematically trained individuals show
intransitive preferences, overconfidence, and poor probability calibration. People routinely
violate the axioms of Expected Utility Theory and handle probability clumsily. When
combining estimates of uncertainty, previous work shows that people consistently
overestimate conjunctive probabilities. This paper presents research conducted at Oak
Ridge National Laboratory to explore the cognitive validity of commonly accepted
uncertainty combining rules.
The Likelihood Elicitation System (LES), a computer program written in Common Lisp and
running on DEC VAX computer clusters, was developed to facilitate the research. LES has
two major components, Session 1 and Session 2. Session 1 elicits the likelihood of simple
propositions, and Session 2 elicits the likelihood of complex propositions derived from the
simple propositions elicited in Session 1. LES represents likelihood in three modalities:
probability, certainty factors, and natural language.
The subjects were chosen from Oak Ridge National Laboratory and the University of
Tennessee, Knoxville. The propositions are drawn from three fields: daily events, personality
traits, and cancer. The subjects were expected to have enough "common knowledge
expertise" in these fields.
First, the subjects are asked to evaluate the likelihood of simple propositions in Session 1.
LES then customizes 36 complex propositions for each subject by forming conjunctions of
the simple propositions from Session 1 and, using the likelihoods given by the subject and
the expert system uncertainty combining rules, calculates the likelihood of each complex
proposition. The subjects are then asked to evaluate the likelihood of the complex
propositions, and their estimates are compared with the likelihoods computed by LES.
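To make the comparison concrete, the following sketch (in Python; the actual LES was written in Common Lisp, and the specific rules and numbers here are illustrative assumptions, not taken from the paper) shows how expert-system-style combining rules would assign a likelihood to a conjunction formed from two Session 1 propositions, to be set against the subject's own Session 2 estimate.

# A minimal sketch, not the actual LES code; all values are hypothetical.

def conjunction_product(p_a, p_b):
    """Probability modality: product rule, assuming independence."""
    return p_a * p_b

def conjunction_min(p_a, p_b):
    """Fuzzy / MYCIN-style conjunction: take the minimum."""
    return min(p_a, p_b)

a, b = 0.8, 0.6          # likelihoods of two simple propositions from Session 1
human = 0.7              # what a subject might report in Session 2 for "A and B"

print("product rule:", conjunction_product(a, b))   # 0.48
print("min rule    :", conjunction_min(a, b))       # 0.6
print("human       :", human)                       # above both rules' answers,
                                                     # i.e. the conjunction is overestimated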
The results of this research are rather intriguing. The human uncertainty heuristics
seemed to be influenced by the modality of uncertainty representation. Moreover, none of
the results confirm the psychological validity of the uncertainty approaches used in
modern expert system shells; they indicate that humans do not perform inexact reasoning
using the expert systems' uncertainty approaches.
The conclusion: people's natural uncertainty combining rules are not well modeled by
any generally accepted expert system uncertainty approach, and these natural
combining rules may be highly idiosyncratic. Knowledge engineers should be
aware of this kind of mismatch and be careful when developing expert systems using the
uncertainty approaches.
Because of the importance of uncertain information representation and manipulation in the
operation of expert systems, knowledge engineers must eventually determine experts'
natural uncertainty processing heuristics. In the future, expert system shells could contain
advanced machine learning modules to model an expert's uncertainty combining rules, and
shells should offer alternative uncertainty combining rules. Knowledge engineers need to
collaborate with psychologists to provide better models of uncertainty representation and
manipulation.
From:
http://members.tripod.com/~RichardBowles/published/euro89.htm
This is a paper I gave at the European Conference on Speech Communication and
Technology, Paris, September 1989 (Eurospeech 89), pages 384-387.
Application of the Dempster-Shafer Theory of Evidence to Improved-Accuracy Isolated-Word Recognition
R.L. Bowles and R.I. Damper
Department of Electronics and Computer Science, University of Southampton, Highfield,
Southampton SO9 5NH, U.K.
Abstract
This paper describes experiments in which the outputs of three isolated-word recognition
algorithms were combined to yield a lower average error rate than that achieved by any
individual algorithm. The input tokens were simulated, fixed-length spectral speech
patterns subjected to additive noise and either a "high" or "low" degree of time distortion.
The three recognition techniques used were dynamic time warping, hidden Markov
modelling and a multi-layer perceptron. Two combination techniques were employed: the
formula derived from the Dempster-Shafer (D-S) theory of evidence and simple majority
voting (MV). D-S performed significantly better than MV under both time-distortion
conditions. Evidence is also presented that the assumption of independent word scores,
which is necessary for D-S theory to be strictly applicable, is questionable.
1 Introduction
The problem domains of interest in artificial intelligence are characterised by the need to
handle uncertain information. Several standard methodologies for this have arisen, each
with its own representation of the degree of uncertainty of the data, including Bayesian
updating, belief functions [1], fuzzy logic [2] and the MYCIN calculus [3]. Each
approach gives a way of combining evidence from distinct knowledge sources; a
comprehensive survey of the methods is presented in [4,5]. Apart from our own work [6],
these techniques have not (as far as we are aware) been employed in automatic speech
recognition.
According to Allerhand [7], the use of simple models in speech recognition creates an
inherent performance limitation. A plateau is reached, set by the assumptions implicit in
the model, where further improvement cannot be made. Combination of evidence offers a
possible means of overcoming the fundamental limitation imposed by use of a single
model. In [6], we showed that it was possible in certain circumstances to obtain a
significant increase in recognition accuracy by combining the outputs from two or three
distinct isolated-word recognition (IWR) algorithms subjected to the same simulated
speech input using Dempster-Shafer (D-S) theory (see below). In this case, the individual
algorithms act as separate sources of evidence (assumed independent) concerning word
identity. The algorithms used at that time were: dynamic time warping (DTW), hidden
Markov modelling (HMM) and a rather simple spectral peak-picking technique (SPP).
No real efforts were made to optimise the separate algorithms; indeed, our position was
that the combination approach should, if anything, compensate for their imperfections.
In this paper, we report results obtained when the three individual algorithms all achieve a
high degree of accuracy (typically in the range 90 to 99%). To effect this, we have improved
considerably the training of the HMM word models and used the more powerful multi-layer
perceptron (MLP) in place of SPP. We have also greatly improved the method of
obtaining belief functions from distance scores and this is described. We also compare
results using D-S combination with those obtained using the simplest possible
combination technique, namely majority voting (MV).
2 Dempster-Shafer Combination
The D-S formula [8] is a means of combining evidence based on belief functions so as to
select among competing hypotheses. In this theory, the frame of discernment, Θ, is a set
of exhaustive and mutually exclusive hypotheses (singletons). The set of possible
hypotheses is the powerset of Θ, including the empty set, called Ω. The subsets A of Ω
for which there is direct evidence are termed the focal elements of Ω.
2.1 Probability masses and belief
Evidence for the hypotheses takes the form of probability assignments - called basic
probability masses, m(A) - to each of the focal elements, A. The masses are probability-like
in that they are in the range [0,1] and sum to 1 over all hypotheses, but are not exactly
probabilities; rather, they represent the belief assigned to a focal element, however measured.
The belief in a hypothesis H is termed BEL(H) or, sometimes, the lower probability of H.
It is equal to the sum of all the probability masses of the subsets of H:

BEL(H) = Σ_{A ⊆ H} m(A)    (1)
The plausibility of H, sometimes called the upper probability, is defined to be 1 - BEL(¬H).
It can be considered as the extent to which the evidence does not contradict the hypothesis.
Belief values always lie in the range 0 to 1; BEL(H) = 1 means that H is effectively certain,
while BEL(H) = 0 means that there is a total lack of belief in H (not to be confused with
disbelief).
When the hypotheses, Ω, are all singletons, belief and plausibility become identical.
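As a quick illustration of equation (1) and of plausibility, the following Python fragment computes BEL and plausibility over a tiny frame of discernment; the masses are made up and are not from the paper.

theta = frozenset({"a", "b", "c"})     # tiny frame of discernment

# Basic probability masses on focal elements (subsets of theta); they sum to 1.
m = {
    frozenset({"a"}): 0.4,
    frozenset({"b", "c"}): 0.3,
    theta: 0.3,
}

def bel(h):
    """Equation (1): BEL(H) = sum of m(A) over focal elements A contained in H."""
    return sum(mass for a, mass in m.items() if a <= h)

def plausibility(h):
    """1 - BEL(not H): how far the evidence fails to contradict H."""
    return 1.0 - bel(theta - h)

h = frozenset({"a"})
print(bel(h), plausibility(h))   # 0.4 0.7, and BEL(H) <= plausibility(H) as expected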
The combination formula allows two different sets of probability masses from
independent sources, but relevant to the same hypotheses, to be combined to give overall
belief values. If F is a hypothesis and m(G) and m(H) are the probability assignments to
focal elements G and H, then the combined evidence for F is given by:

BEL(F) = [ Σ_{G ∩ H = F} m(G)·m(H) ] / [ Σ_{G ∩ H ≠ ∅} m(G)·m(H) ]    (2)
The denominator acts as a normalising term to ensure that the combined belief value lies
in the range 0 to 1. The combination is commutative and associative and, hence, any
number of masses can be combined in any order.
2.2 Isolated word recognition
In the case of IWR, the hypotheses, Ω, are the possible identities of the words and the
evidence for these words is the similarity scores corresponding to them. The hypotheses
reduce to a set of singletons since each score obtained relates only to a single word. That
is, for a vocabulary of N words, there are N hypotheses: Hi is "the ith word of the
vocabulary", 1 ≤ i ≤ N. In this case, Ω is actually identical to the
frame of discernment, Θ. (Note that we have ignored the possibility of out-of-vocabulary
utterances, corresponding to the inclusion of the empty set in Ω.) Thus, the only set
intersections which are non-empty are those pertaining to the probability masses for the
ith word obtained from the three algorithms, X, Y and Z, say. Thus, (2) becomes:

BEL(Hi) = m(Xi)·m(Yi)·m(Zi) / Σ_{j=1..N} m(Xj)·m(Yj)·m(Zj)    (3)
Applying the D-S formula in the form of (3), i.e. with singletons, is very close to
Bayesian updating but with important differences. First, probability masses and beliefs
appear in place of probabilities and, second, an assumption is made about independence
of the evidence (scores) which is not essential to the Bayesian calculus [5]. Because of
the way we derive probability masses from scores (see below), which means that they are
not true probabilities, we feel it is preferable to view our method in a D-S framework.
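The singleton form of the combination is simple enough to state as code. The sketch below (Python, with hypothetical mass vectors; in the paper the masses come from the DTW, HMM and MLP scores as described in Section 2.3) applies equation (3) and picks the word with the highest combined belief.

def combine_singletons(m_x, m_y, m_z):
    """Equation (3): normalised product of the three algorithms' masses per word."""
    products = [x * y * z for x, y, z in zip(m_x, m_y, m_z)]
    total = sum(products)                      # the normalising denominator
    return [p / total for p in products]

# Hypothetical masses over a 3-word "vocabulary" from algorithms X, Y and Z.
m_x = [0.5, 0.3, 0.2]
m_y = [0.4, 0.4, 0.2]
m_z = [0.6, 0.2, 0.2]

bel = combine_singletons(m_x, m_y, m_z)
recognised = max(range(len(bel)), key=bel.__getitem__)
print(bel, "-> word", recognised)              # word 0 has the highest combined belief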
2.3 Probability mass computation
A method of converting the similarity scores into probability masses is required for use in
(3). According to Lindley [9], probability is the best possible measure of belief in an
hypothesis. Further, if probabilities were available, we could use Bayes' rule to
determine the "belief" in hypothesis H, given the evidence available, E:

p(H | E) = p(E | H)·p(H) / p(E)    (4)

Since all "words" are equally likely in our simulation, and there are no out-of-vocabulary
utterances, p(H) is simply equal to 1/N. However, the probabilities p(E|H) and p(E) are
effectively unknown, but we can estimate them from the distribution of scores as we
now describe.
2,000 tokens were input to the three recognition algorithms and scores calculated for each
token matched against all the word models. For each algorithm, a distribution was
constructed of all scores obtained; these were then assumed to be Gaussian. Thus, for any
particular score, E, an estimate of p(E) can be obtained using the normal formula. In like
manner, p(E|H) can be estimated for each word by constructing a distribution of correct
match scores only; there are 32 of these distributions for each algorithm. This then allows
us to estimate p(H|E) using (4). Since the result is only probability-like, we prefer to think
of it as a probability mass, m(H). This method of computing probability masses ensures
that all the m(H)'s sum to 1.
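A minimal sketch of this procedure, assuming Gaussian score distributions as described above; the helper names and numbers are illustrative and not the authors' code.

import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def probability_masses(scores, overall, per_word):
    """scores: one algorithm's scores for a token against each of the N word models.
    overall: (mean, std) fitted to all scores; per_word[i]: (mean, std) fitted to the
    correct-match scores for word i. Returns masses m(H_i) that sum to 1."""
    n = len(scores)
    prior = 1.0 / n                                    # p(H): all words equally likely
    posts = []
    for e, (mu, sd) in zip(scores, per_word):
        p_e_given_h = gaussian_pdf(e, mu, sd)          # from the correct-match distribution
        p_e = gaussian_pdf(e, *overall)                # from the distribution of all scores
        posts.append(p_e_given_h * prior / p_e)        # equation (4)
    total = sum(posts)
    return [p / total for p in posts]                  # renormalise into probability masses

masses = probability_masses(scores=[0.2, 0.9, 1.4],
                            overall=(1.0, 0.5),
                            per_word=[(0.3, 0.2)] * 3)
print(masses, sum(masses))                             # masses sum to 1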
3 The Test System
The test system consisted of the three standard recognition algorithms - DTW, HMM and
MLP - each of which was subjected to simulated speech patterns drawn from a
"vocabulary" of N = 32 words. The outputs from these algorithms were then combined to
produce an overall score for each word.
3.1 Simulated speech data
Because of the large amount of input data needed to test the combination strategy, we
have chosen to use simulated speech. Each token is intended to mimic the output from a
16-filter bandpass analysis, time-normalized to a sequence of 16 frames. 32 "prototypes"
were manually created to resemble actual patterns. Further tokens were produced by
randomly adding and deleting frames and renormalising; Gaussian noise was also added
to the spectral values to produce a 16 dB SNR. Two degrees of time distortion were
applied: low (LTD) in which each frame had a 33% chance of being repeated or deleted,
and high (HTD) in which the chance was 67%.
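A rough sketch of this token-generation procedure is given below; the renormalisation step and the exact way repeats and deletions are chosen are assumptions on our part, since the paper does not spell them out.

import math, random

def distort(prototype, p_distort, snr_db=16.0):
    """prototype: 16 frames x 16 filter values; p_distort: 0.33 (LTD) or 0.67 (HTD)."""
    frames = []
    for frame in prototype:
        if random.random() < p_distort:
            if random.random() < 0.5:
                frames.extend([frame, frame])          # repeat the frame
            # else: delete the frame (skip it)
        else:
            frames.append(frame)
    if not frames:                                     # guard against deleting everything
        frames = list(prototype)

    # Renormalise back to the original number of frames by uniform resampling.
    n = len(prototype)
    frames = [frames[int(i * len(frames) / n)] for i in range(n)]

    # Add Gaussian noise scaled to give roughly the requested signal-to-noise ratio.
    power = sum(v * v for f in frames for v in f) / (n * len(frames[0]))
    noise_std = math.sqrt(power / (10 ** (snr_db / 10.0)))
    return [[v + random.gauss(0.0, noise_std) for v in f] for f in frames]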
3.2 Recognition algorithms
The DTW algorithm was exactly as described in [6]. This used a conventional
asymmetric local path constraint based on the Chebyshev distance metric. Warping paths
were constrained to end at the final frame in both the test word and the vocabulary
prototype. Global paths were constrained by setting a semi-window width of 5.
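For readers unfamiliar with DTW, the following generic sketch conveys the idea; note that it uses a plain symmetric step pattern and a simple band constraint rather than the asymmetric local path constraint actually used in [6], so it is illustrative only.

import math

def chebyshev(f1, f2):
    """Frame-to-frame distance: the largest difference across filter channels."""
    return max(abs(a - b) for a, b in zip(f1, f2))

def dtw_score(test, proto, window=5):
    """Cumulative distance of the best warping path ending at the final frames."""
    n, m = len(test), len(proto)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) > window:                    # stay inside the warping window
                continue
            cost = chebyshev(test[i - 1], proto[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i - 1][j - 1], d[i][j - 1])
    return d[n][m]                                     # lower means a better match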
As in [6], the HMMs used were discrete, 5-state, left-to-right, double-step models.
However, the models used here were trained on synthetic tokens (produced as above), the
frames of which were clustered into 32 classes using the c-means clustering algorithm in
conjunction with a Euclidean distance measure. The figure of 32 cluster centres was
chosen empirically to maximise the HMM recognition rate. Training employed the
Baum-Welch algorithm; the forward-backward procedure was used for matching,
producing a probability reflecting the degree of similarity.
For the MLP, a 3-layer network was used with 256 nodes in the input layer, 20 nodes in
the hidden layer and 32 in the output layer. The function of the input layer was to hard-limit
the speech-pattern values to 0 or 1. The nodes in the hidden layer were not fully
connected to the input layer but, rather like the "zonal units" of [10], were divided into 4
groups of 5. Each group was connected to 4 contiguous frequency bands in the input.
This was intended primarily to reduce computation time but it also slightly mimicked the
critical band filtering of the auditory system [10]. The output layer was fully
interconnected with the hidden layer and each output node corresponded to one
vocabulary word. The MLP was trained using error-back propagation by clamping one
output high when the corresponding word pattern was presented to the inputs. In practice,
a test word caused all the outputs to go to different values between 0 and 1, these values
being taken as the vector of similarity scores for the MLP.
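The layer sizes above translate into the following connectivity sketch (Python); the exact assignment of hidden-unit groups to frequency bands is an assumption, and training by error back-propagation is omitted.

import random

N_BANDS, N_FRAMES = 16, 16        # 16 filter bands x 16 frames = 256 binary inputs
N_HIDDEN, N_OUT = 20, 32

def zonal_mask():
    """Hidden units in 4 groups of 5; each group sees 4 contiguous frequency bands."""
    mask = [[0] * (N_BANDS * N_FRAMES) for _ in range(N_HIDDEN)]
    for h in range(N_HIDDEN):
        group = h // 5                                 # groups 0..3
        for band in range(group * 4, group * 4 + 4):   # 4 contiguous bands per group
            for t in range(N_FRAMES):
                mask[h][band * N_FRAMES + t] = 1
    return mask

mask = zonal_mask()
# Input-to-hidden weights are zeroed wherever the mask forbids a connection;
# the hidden-to-output layer is fully connected.
w_ih = [[random.uniform(-0.1, 0.1) * mask[h][i] for i in range(N_BANDS * N_FRAMES)]
        for h in range(N_HIDDEN)]
w_ho = [[random.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]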
3.3 Combination techniques
The D-S formula, equation (3), was used to obtain BEL values from the probability
masses found as in Section 2.3. The word recognised was that with the highest BEL
value.
For reference purposes, we also combined the outputs from the three individual
algorithms using majority voting with each algorithm contributing a single vote on word
identity. This is perhaps the simplest combination strategy possible. Further, it is
amenable to theoretical analysis under an independence assumption so that predicted and
obtained accuracies can be compared. The degree of differences between these two can
be taken as an indication of the validity of the assumption.
Taking P, Q and R to be the recognition rates of the three algorithms, and further taking
these to be synonymous with the probability of correct recognition, the combined
accuracy with majority voting, MV, is easily shown to be:
MV = PQ + PR + QR - 2PQR
(5)
Error rates were defined as (1 - MV), i.e. they include both false acceptances and
rejections (which occur when all three algorithms vote differently).
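Equation (5) is easy to check numerically. The snippet below evaluates it using the LTD recognition rates implied by Table 1 below and reproduces, to rounding, the MV(pred) figure reported later in Table 2.

def majority_vote_accuracy(p, q, r):
    """Equation (5): accuracy of 2-out-of-3 voting, assuming independent errors."""
    return p * q + p * r + q * r - 2.0 * p * q * r

# Recognition rates of DTW, HMM and MLP under low time distortion (from Table 1).
p, q, r = 1 - 0.0093, 1 - 0.0235, 1 - 0.0074
mv = majority_vote_accuracy(p, q, r)
print(round((1 - mv) * 100, 2))   # 0.05, the MV(pred) percentage error of Table 2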
4 Results
2,000 tokens of simulated speech data were input to the three recognition algorithms,
BEL values computed and the rule of combination applied to produce a composite score,
the maximum value of which indicated the word recognised from the 32 candidates. This
was repeated 10 times for both high- and low-time distortion in order to give some idea
of the variance on the figures. Table 1 shows percentage error rates for each algorithm
and for the combined (D-S) recogniser with the standard deviations in brackets.
            HTD            LTD
DTW         3.86 (0.09)    0.93 (0.10)
HMM        12.47 (0.15)    2.35 (0.16)
MLP         2.77 (0.09)    0.74 (0.08)
D-S         0.45 (1.64)    0.20 (1.87)

Table 1: Means and standard deviations of the percentage error rates for the recognition
processes and the combined D-S recogniser.
For both HTD and LTD conditions, D-S combination results in a very much lower
average error rate than that of the best performing algorithm (MLP). However, the D-S
recogniser has a much higher standard deviation as a natural consequence of the
combination strategy. As is clear from the values of the standard deviations, D-S
combination performed less well than the individual algorithms on some runs but, on
average, did much better.
Majority voting was also used to combine individual results; Table 2 shows the error
rates obtained with the D-S figure given for comparison. In the table, MV(act) refers to
the actual error rate and MV(pred) denotes the error rate predicted from equation (5).
            HTD     LTD
D-S         0.45    0.20
MV(act)     1.02    0.28
MV(pred)    0.81    0.05

Table 2: Error rates for the D-S combination and for majority voting.
This shows that D-S combination performs better than majority voting under both
conditions of time distortion. The differences between MV(act) and MV(pred) reveal that
the independence assumption is problematic, and more so for the lower time-distortion
condition. Interestingly, D-S combination under LTD conditions does not do as well as
MV ought to do. Whether this is because of the independence assumption implicit in D-S,
or because MV is inherently superior at very low error rates, is uncertain.
5 Conclusions
This work has shown that Dempster-Shafer combination can be used to increase the
accuracy of isolated word recognition over that obtainable from a single algorithm even
when the individual algorithms each achieve high accuracy.
Although the speech data in these trial runs was synthetic, efforts were made to ensure
the test words were realistic. Our next step is to see if a similar improvement is possible
using real speech, at least for the prototypes. We are also currently implementing
something closer to Bayesian updating which avoids the independence assumption.
In the long term, we foresee combination techniques as contributing to connected-word
and large vocabulary recognition based on sub-word modelling rather than whole-word
pattern matching. In this case, the individual sources will provide evidence for the
identity of sub-word units.
References
1. G Shafer (1982), "Belief functions and parametric models", J. Royal Stat. Soc.
Series B, 44, 322-352.
2. L A Zadeh (1968), "Probability measures of fuzzy events", J. Math. Anal. Appl.,
23, 421-427.
3. E H Shortliffe and B G Buchanan (1975), "A model of inexact reasoning in
medicine", Mathematical Biosciences, 23, 351-379.
4. H E Stephanou and A P Sage (1987) "Perspectives on imperfect information
processing", IEEE Trans. Systems, Man and Cybernetics, SMC-17, 780-798.
5. S J Henkind and M C Harrison (1988), "An analysis of four uncertainty calculi",
IEEE Trans. Systems, Man and Cybernetics, SMC-18, 700-714.
6. R L Bowles, R I Damper and S M Lucas (1988), "Combining evidence from
separate speech recognition processes", Proc. FASE Speech '88, Edinburgh, 669-674.
7. M Allerhand (1987), "Knowledge-Based Speech Pattern Recognition", Kogan
Page, London.
8. G Shafer (1976), "A Mathematical Theory of Evidence", Princeton Univ. Press,
Princeton, NJ.
9. D V Lindley (1987), "The probability approach to the treatment of uncertainty in
artificial intelligence and expert systems", Statistical Science, 2, 3-44.
10. T D Harrison and F Fallside (1989), "A connectionist model for phoneme
recognition in continuous speech", Proc. IEEE ICASSP '89, Glasgow, Vol. 1,
417-420.
FUZZY SET THEORY
From:
http://www.etc.tuiasi.ro/ecit/Zimm/Zimm_Origins.html
Origins, Present Developments and the Future
of Computational Intelligence
H.-J. Zimmermann
RWTH/ELITE, Aachen
Even though the first publication in the area of Fuzzy Set Theory (FST) appeared already
in 1965, the development of this theory for almost 20 years remained in the academic
realm. Almost all basic concepts, theories and methods were, however, developed during
this period.
Fuzzy Control opened the gate to real applications for FST. Particularly in Japan the
applications of the fuzzy control principle in consumer goods made FST known in the
public and made it commercially interesting for industry. This led to two developments:
since the development of fuzzy application systems had to be efficient, fuzzy CASE
tools and expert system shells were developed, turning FST into Fuzzy Technology.
The success in Japan drew the attention of the media and started – first in Germany
– the "Fuzzy Booms", which led to unprecedented growth in publications, university
teaching and other industrial applications in many countries.
Around 1993 FST, Neural Nets and Evolutionary Computing joined forces and were soon
considered to be one area called Soft Computing or Computational Intelligence.
Applications in Engineering as well as in Management will be described during the
presentation. Of particular interest for Europe might also be the development of ERUDIT
(European Network of Excellence for Fuzzy Sets and Uncertainty Modeling), a network
which grew from 15 nodes in 1995 to 250 nodes in 1997 and which was extended for
another two years by the European Commission.
The latest development is COIL (Computational Intelligence and Learning), a European
Network of Excellence cluster of ERUDIT, NeuroNet, Evonet and ML-Net.
Historical Development
Fuzzy Set Theory, Fuzzy Technology and Computational Intelligence
Fuzzy Set Theory was conceived in 1965 as a formal theory which could be considered
as a generalization of either classical set theory or of classical dual logic. Although
Prof. Zadeh, when publishing his first contribution, already had some applications in mind,
Fuzzy Set Theory for several reasons remained inside the academic sphere for more than
20 years. During these 20 years most of the basic concepts which are nowadays used
very successfully had already been invented. Starting at the beginning of the 80s Japan
was the leader in using a smaller part of Fuzzy Set Theory - namely fuzzy control - for
practical applications. In particular, improved consumer goods such as video cameras with
fuzzy stabilizers, washing machines including fuzzy control, rice-cookers etc. caught the
interest of the media, which led around 1989/1990 to the first "fuzzy boom" in Germany.
Many attractive practical applications - not so much in the area of consumer goods but
rather in automation and industrial control - led to the insight that the efficient and
affordable use of this approach could only be achieved via CASE-tools. Hence, since the
late 80s a large number of very user-friendly tools for fuzzy control, fuzzy expert
systems, fuzzy data analysis etc. have emerged. This really changed the character of this
area and started, to my mind, the era of "Fuzzy Technology". The next - and so far the
last - large step in the development occurred in 1992 when almost independently in
Europe, Japan and the USA the three areas of Fuzzy Technology, artificial neural nets
and genetic algorithms joined forces under the title of "Computational Intelligence" or
"Soft Computing". The synergies, which were possible between these three areas, have
been exploited since then very successfully. Figure 1 shows these developments as a
summary.
Fig. 1. From Fuzzy Set Theory to Computational Intelligence
Management, engineering and other areas can be supported by computational intelligence
in many ways. This support can refer to information processing as well as to data mining,
choice or evaluation activities or to other types of optimization. Classical decision
support systems consist of data bank systems for the information processing part and
algorithms for the optimization part. If, however, efficient algorithms are not available or
if decisions have to be made in ill-structured environments, knowledge-based
components are added to either supplement or substitute algorithms. In both cases Fuzzy
Technology can be useful.
In this context it may be useful to cite and briefly comment on the major goals of this
technology, and to correct the still very common view that Fuzzy Set Theory or Fuzzy
Technology is exclusively or primarily useful for modeling uncertainty:
a) Modeling of uncertainty
This is certainly the best known and oldest goal. I am not sure, however, whether it can
(still) be considered to be the most important goal of Fuzzy Set Theory. Uncertainty has
been a very important topic for several centuries. There are numerous methods and
theories which claim to be the only proper tool to model uncertainties. In general,
however, they do not even define sufficiently or only in a very specific and limited sense
what is meant by "uncertainty". I believe that uncertainty, if considered as a subjective
phenomenon, can and ought to be modeled by very different theories, depending on other
causes of uncertainty, the type and quantity of available information, the requirements of
the observer etc. In this sense Fuzzy Set Theory is certainly also one of the theories
which can be used to model specific types of uncertainty under specific types of
circumstances. It might then compete with other theories, but it might also be the most
appropriate way to model this phenomenon for well-specified situations. It would
certainly exceed the scope of this article to discuss this question in detail here [7].
b) Relaxation
Classical models and methods are normally based on dual logic. They, therefore,
distinguish between feasible and infeasible, belonging to a cluster or not, optimal or
suboptimal etc. Often this view does not capture reality adequately. Fuzzy Set Theory has
been used extensively to relax or generalize classical methods from a dichotomous to a
gradual character. Examples of this are fuzzy mathematical programming [6], fuzzy
clustering [2], fuzzy Petri Nets [3], and fuzzy multi criteria analysis [5].
c) Compactification
Due to the limited capacity of the human short term memory or of technical systems it is
often not possible to either store all relevant data, or to present masses of data to a human
observer in such a way, that he or she can perceive the information contained in these
data. Fuzzy Technology has been used to reduce the complexity of data to an acceptable
degree, usually either via linguistic variables or via fuzzy data analysis (fuzzy clustering
etc.).
d) Meaning Preserving Reasoning
Expert System Technology has already been in use for two decades and has led in many
cases to disappointment. One of the reasons for this might be that expert systems whose
inference engines are based on dual logic perform symbol processing (truth values true or
false) rather than knowledge processing. In Approximate Reasoning
meanings are attached to words and sentences via linguistic variables. Inference engines
then have to be able to process meaningful linguistic expressions, rather than symbols,
and arrive at membership functions of fuzzy sets, which can then be retranslated into
words and sentences via linguistic approximation.
e) Efficient Determination of Approximate Solutions
Already in the 70s Prof. Zadeh expressed his intention to have Fuzzy Set Theory
considered as a tool to determine approximate solutions of real problems in an efficient or
affordable way. This goal has never really been achieved successfully. In the recent past,
however, cases have become known which are very good examples for this goal.
Bardossy [1], for instance, showed in the context of water flow modeling that it can be
much more efficient to use fuzzy rule based systems to solve the problems than systems
of differential equations. Comparing the results achieved by these two alternative
approaches showed that the accuracy of the results was almost the same for all practical
purposes. This is particularly true if one considers the inaccuracies and uncertainties
contained in the input data.
The development of Fuzzy Technology during the last 30 years has, roughly speaking,
led to the following application oriented classes of approaches:
- Model-based (algorithmic) Applications
• fuzzy optimization (fuzzy linear program. etc.)
• fuzzy clustering (hierarchical and obj. function)
• fuzzy Petri Nets
• fuzzy multi criteria analysis
- Knowledge-based Applications
• fuzzy expert systems
• fuzzy control
• fuzzy data analysis
- Information Processing
• fuzzy data banks and query languages
• fuzzy programming languages
• fuzzy library systems.
For almost all of the classes of application mentioned above tools (Software and/or
hardware) are available to allow efficient modeling.
Institutionally Fuzzy Set Theory developed very differently in the different areas of the
world. The first European Working Group for Fuzzy Sets was started in 1975, at a time at
which Fuzzy Sets became visible in international conferences, such as NOAK (the
Scandinavian Operations Research Conference), the IFORS Conference in Toronto, and the
1st USA-Japan Symposium in Berkeley.
At the beginning of the 80s national societies were founded in the USA (NAFIPS) and
Japan (Soft) and almost at the same time a worldwide society IFSA was started.
When the 3rd World Congress of IFSA took place in Tokyo, Fuzzy Technology was
already well-known in the Japanese economy where it had been successfully applied to
consumer goods (washing machines, video cameras, rice cookers) but also to the
industrial processes (cranes etc.) and to public transportation (subway system in Sendai).
In the rest of the world it was still very little known and primarily considered as an
academic area.
The European Development
In contrast to Japan and the USA, Europe is very heterogeneous economically, culturally
and scientifically. When in 1989/90 the "Fuzzy Boom" was triggered by the media, which
had observed the fast development of this technology in Japan, there existed in different
European countries approximately ten research groups in the area of Fuzzy Sets, but they
hardly communicated with each other and hardly even knew of each other. They were
working on an international level but were not very application-oriented.
In this situation the fear grew that Europe would again lose one of the major market
potentials to Japan. What seemed to be needed most was communication and cooperation
between European countries and between science and economy. Neither a company nor a
university seemed to have the standing to bring this about. Hence, a foundation (ELITE =
European Laboratory for Intelligent Techniques Engineering) was established. It was much
smaller and had much less public support than LIFE in Japan, which had very similar
objectives. The media and the strong public interest had a strong influence on the
universities, and within one to two years the European Commission could be convinced of
the economic importance of this area. Via a European Working Group on Fuzzy Control
one of the European Networks of Excellence was dedicated to Fuzzy Technology
(ERUDIT). It became a European framework in which new theoretical and practical
developments were and are triggered, supported and advanced in a methodical and
interdisciplinary way.
Some of its important features are:
o its structure,
o its growth,
o its orientation, and
o its services.
These are depicted in the following figures. The structure is a matrix organization with the
functions and methods horizontally intersecting all sectors of the economy.
Fig. 2. ERUDIT – Structure
Fig. 3. Committee – Structure
A self-imposed constraint ensures the application orientation, and focussed activities in
several directions lead to steady growth.
Extensive surveys also allow very focussed activities to advance the area in scope and
depth. Figure 4 sketches results from these surveys.
Fig. 4. Application Areas
In 1999 the Networks of Excellence for Fuzzy Sets, Neural Nets, Evolutionary
Computing and Machine Learning joined into COIL (Computational Intelligence and
Learning).
Which conclusions can be drawn from the European experience described above? Perhaps
that strong and steady growth of a technology, even under difficult conditions such as in
Europe, can be achieved if the media are intensively included in the promoting activities
and if the development is not left to chance but communication, initialization, support and
technology transfer are improved systematically and steadily.
Future Perspectives
Fuzzy Technology caught public attention in the 80s and 90s primarily via technical
applications (washing machines, video cameras, subway systems etc.). These have not
disappeared, but the public interest has vanished. Strong research activities can still be
found in the areas of adaptive systems, robotics, vision, quality control, medicine etc.
Even centers which focus on intelligent engineering systems are being set up at present.
The general trend is not to concentrate on Fuzzy Technology only but to combine
classical approaches with Neural Net technology and often with Evolutionary
Computation.
Major new applications and developments have for a number of years been occurring in
Business Intelligence. In contrast to the engineering environment, there has been a
tremendous change in the management area in recent years: while until the beginning of
the 90s there existed, or was perceived, a serious lack of useful and EDP-readable data in
this area, now there is an abundance of data in many management sectors. The increasing
number of "data warehouses" is an indication of this development. There are two sides to
this coin: on the one hand, applications become possible which were not conceivable until
the 80s. On the other hand, managers very often have serious problems extracting the
information they need from the masses of stored data. This situation opens the door for all
kinds of data mining and knowledge discovery approaches and makes new fascinating
applications possible. Examples are the automatic generation of (credit) ratings of customers
as well as of suppliers (an application which will grow in importance with growing
e-commerce), market segmentation for focussed marketing actions, CRM (customer
relationship modeling) etc. Compared to the often rather local engineering applications of
fuzzy control, these are much more global issues with high profit or cost saving potentials.
Figure 5 summarizes future development potentials in research as well as in applications.
Research:
- Hybrid Methods and Models (FT, NN, EC)
- Computing with words
- Machine Learning (Fuzzy Decision Trees, Kohonen Nets etc.)
- Information Technology (Fuzzy Data Banks, Fuzzy SQL etc.)
- Intelligent Agents
- Improvement of Man-Machine-Interfaces
- Machine Intelligence (MIQ)

Applications:
- Business Intelligence: Financial Engineering (fraud detection, retention, creditworthiness,
  ratings, stock exchange); Customer Relationship Modeling (market segmentation, database
  marketing, campaign management etc.); Data Mining and Knowledge Discovery in Data
  Warehouses; Simultaneous Engineering
- Technical Intelligence: Automation; Non-FC-Applications; CAD; Simultaneous Engineering;
  Quality Management; Robotics

Fig. 5. Future Developments and Applications
A good survey of the present focus of developments can be found in [4].
The conclusion I draw in the present situation is that, while Soft Computing is no longer as
publicly visible as it was in the 90s, the potentials and challenges in this area have not
decreased but rather increased considerably.
References
1. A. Bardossy, 1996. The Use of Fuzzy Rules for the Description of Elements of
the Hydrological Cycle. Ecological Modelling, 85, 59 - 65.
2. J. C. Bezdek and S. K. Pal, 1992. Fuzzy Models for Pattern Recognition. New
York.
3. H.-P. Lipp, R. Günther and P. Sonntag, 1989. Unscharfe Petri Netze - Ein
Basiskonzept für Computerunterstützte Entscheidungssysteme in Komplexen
Systemen. Wissenschaftliche Schriftenreihe der TU Chemnitz, 7.
4. H.-N. Teodorescu et al. (editors), 2000. Intelligent Systems and Interfaces.
Kluwer Academic Publishers, Boston.
5. H.-J. Zimmermann, 1986. Multi Criteria Decision Making in Crisp and Fuzzy
Environments. In: Zimmermann, Jones and Kaufman (editors). Fuzzy Set Theory
and Applications. Dordrecht, 233 - 256.
6. H.-J. Zimmermann, 1996. Fuzzy Set Theory - and Its Applications. 3rd rev. edit.
Boston.
7. H.-J. Zimmermann, 1997. A Fresh Perspective on Uncertainty Modeling:
Uncertainty vs. Uncertainty Modeling. In: B. M. Ayyub and M. M. Gupta
(editors). Uncertainty Analysis in Engineering and Sciences: Fuzzy Logic,
Statistics, and Neural Network Approach. International Series in Intelligent
Technologies, Kluwer Academic Publishers, 353 – 364.
Certainty Factors
From:
http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/certainty.html
Certainty Factors
- Initial Definitions
- Reduction to Probabilities
- Composition of Beliefs
- Why Certainty Factors work in Mycin
- Certainty Factors in the Undergraduate AI course

Certainty Factors were introduced in MYCIN. The basic reference is
Buchanan, Shortliffe: Rule-Based Expert Systems, Addison-Wesley, 1984.
Initial Definitions
MB[H,E]: measure of increased belief in hypothesis H given the evidence E;
a real number in the interval [0,1].
MD[H,E]: measure of increased disbelief in hypothesis H given the evidence E;
a real number in the interval [0,1].
CF[H,E]: certainty factor for hypothesis H given the evidence E.

CF[H,E] was originally defined as MB[H,E] - MD[H,E]. It was later modified to

                 MB[H,E] - MD[H,E]
CF[H,E] = ---------------------------
           1 - Min{MB[H,E], MD[H,E]}
Experts normally come up with the values for MB and MD of some facts,
and with the value of CF for inference rules. Default initial values
for MB, MD are 0. Since the beliefs of the experts are not necessarily
consistent, it is necessary to carry out sanity checks. For example,
if H1, H2, ..., Hn are exhaustive, mutually exclusive hypotheses, the sum of their
beliefs should be at most 1 and the sum of their disbeliefs should be at most n-1.
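A small sketch of the modified definition and the sanity check, in Python with made-up numbers:

def cf(mb, md):
    """The modified definition: CF = (MB - MD) / (1 - min(MB, MD))."""
    return (mb - md) / (1 - min(mb, md))

def sanity_check(beliefs, disbeliefs):
    """For exhaustive, mutually exclusive H1..Hn: sum(MB) <= 1 and sum(MD) <= n - 1."""
    n = len(beliefs)
    return sum(beliefs) <= 1 and sum(disbeliefs) <= n - 1

print(cf(0.7, 0.2))                                    # 0.625
print(sanity_check([0.5, 0.3, 0.1], [0.9, 0.6, 0.4]))  # True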
Reduction to Probabilities
Though certainty factors were arrived at without a probabilistic
foundation, Heckerman in 1986 showed how MB and MD could be defined
in terms of probabilities
as indicated below (see Shafer-Pearl, pages 298-312 for more details):
MB[H,E] = 1                                                       if P(H) = 1
        = (max{P(H|E), P(H)} - P(H)) / ((1 - P(H)) * P(H|E))      otherwise

MD[H,E] = 1                                                       if P(H) = 0
        = (min{P(H|E), P(H)} - P(H)) / (- P(H) * P(H|E))          otherwise
Composition of Beliefs
Note that MB[H,E1&E2] is the belief when evidence E1 and evidence E2 both support
the hypothesis H. Similarly for MD[H,E1&E2].

MB[H,E1&E2] = 0                                       if MD[H,E1&E2] = 1
            = MB[H,E1] + MB[H,E2]*(1 - MB[H,E1])      otherwise

The belief in H goes rapidly to 1 when many pieces of evidence support it. For example,
if MB[H,E1] = MB[H,E2] = 0.5 then MB[H,E1&E2] = 0.75, and if also MB[H,E3] = 0.5
then MB[H,E1&E2&E3] = 0.875, ...
MD[H,E1&E2] = 0                                       if MB[H,E1&E2] = 1
            = MD[H,E1] + MD[H,E2]*(1 - MD[H,E1])      otherwise

CF[H,E1&E2] = CF[H,E1] + CF[H,E2] - CF[H,E1]*CF[H,E2]   if CF[H,E1] and CF[H,E2]
                                                         are both positive
            = CF[H,E1] + CF[H,E2] + CF[H,E1]*CF[H,E2]   if CF[H,E1] and CF[H,E2]
                                                         are both negative
            = (CF[H,E1] + CF[H,E2]) / (1 - Min{|CF[H,E1]|, |CF[H,E2]|})   otherwise

MB[H1&H2,E] = Min{MB[H1,E], MB[H2,E]}
MD[H1&H2,E] = Max{MD[H1,E], MD[H2,E]}
MB[H1vH2,E] = Max{MB[H1,E], MB[H2,E]}
MD[H1vH2,E] = Min{MD[H1,E], MD[H2,E]}
If we have a chain where evidence E supports hypothesis H1, which in turn supports
hypothesis H2, then

MB[H2,E] = MB[E] * CF[H1,E] * CF[H2,H1]

In a long chain the belief in the conclusion goes rapidly to 0.
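The parallel-combination rules above are easy to exercise in a few lines of Python; this sketch reproduces the 0.5, 0.75, 0.875 example and shows the mixed-sign CF case (the values are illustrative).

def combine_mb(mb1, mb2, md_combined=0.0):
    """MB[H,E1&E2], with the special case MD[H,E1&E2] = 1 forcing the result to 0."""
    return 0.0 if md_combined == 1 else mb1 + mb2 * (1 - mb1)

def combine_cf(cf1, cf2):
    """CF[H,E1&E2] for the three sign cases."""
    if cf1 > 0 and cf2 > 0:
        return cf1 + cf2 - cf1 * cf2
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 + cf1 * cf2
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

mb = 0.0
for evidence_mb in (0.5, 0.5, 0.5):
    mb = combine_mb(mb, evidence_mb)
print(mb)                      # 0.875, as in the example above
print(combine_cf(0.6, -0.4))   # mixed signs: (0.6 - 0.4) / (1 - 0.4) = 0.333...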
Why Certainty Factors work in Mycin
1. In Mycin we find short deduction chains
2. In Mycin the premises of rules are not too complex
3. In Mycin people have been careful to choose rules where the hypotheses are
mutually exclusive and exhaustive [typically, just finding a distinct value for an
attribute]
4. Experimentally it has been found that in Mycin the behavior is not substantially
affected by small changes in the values of CF, MB, MD. This is a strictly
pragmatic viewpoint: It is good because it works.
Certainty Factors in the undergraduate AI course
- It is easy to acquire a sense for the behavior of certainty factors and evaluate their
value by using "freeware" like CLIPS.
- It is a simple method that has been found to work well in some circumstances.
Unfortunately, the recognition usually comes after the fact, i.e. after the use has been
successful; it is not easily obtained in the design phase.
- Overall, it is a very minor topic for the course, best dealt with when presenting the
Expert System shell available to the course.
INFLUENCE DIAGRAMS
A DECISION-BASED APPROACH
SEE: http://www.icbemp.gov/spatial/lee_monitor/decision.html
Influence Diagrams: http://www.usbr.gov/guide/toolbox/influenc.htm
Introduction to Influence Diagrams: http://www.hugin.dk/hugintro/id_pane.html
Proceedings, Conference on Influence Diagrams for Decision: http://singapore.cs.ucla.edu/biblio.html
Preface
Monitoring has become a dominant theme among environmental scientists, land
managers, and policy makers alike. The number of publications and plans which
propose to do much the same, namely detect and identify system state and change,
continues to multiply, each suggesting alternative approaches and solutions. Despite
considerable effort by various institutions and individuals, effective environmental
monitoring remains an unanswered challenge. This is particularly the case for large-scale,
agency-led projects such as the Interior Columbia Basin Ecosystem Management Project
(ICBEMP), the Northwest Forest Plan (NWFP), and the Sierra Nevada Framework for
Conservation and Collaboration (SNFCC).
In the following report, we begin a dialogue about an appropriate conceptual framework
for organizing and developing a monitoring plan for broad-scale ecosystem management
efforts. We were asked to prepare this report for the group drafting a monitoring charter
for the ICBEMP. Because much effort is being invested in preparations for monitoring
within the NWFP and in the Sierra Nevada, it seems logical to look also at these efforts.
Our general impression is that the monitoring plans that are currently being developed for
broad-scale ecosystem management efforts, while they may be statistically sound, often
lack an integrated strategy that allows one to easily see why certain information is
important and how such information might influence future decisions and investments.
Thus, we believe that our comments offered here apply equally well outside the
ICBEMP.
We also have decided to try a different approach to communication—using a hypertext
approach instead of the traditional written report. Our purpose here is twofold. First, our
view of monitoring embedded in a decision analysis framework relies on the synthesis of
ideas ranging from ecological theory, to statistics, to decision analysis, to economics.
Understanding the framework requires at least a cursory understanding of all of these
ideas; operationalizing the framework will require in-depth understanding. Our intent is
to provide links from the main body of the document to supplemental material that will
provide greater detail and additional examples. [Few such links exist as of 10/26/98.] The
second reason for the hypertext format is that we expect this to be a dynamic document
that will undergo revision and expansion as the dialogue among scientists and managers
regarding monitoring in the ICBEMP proceeds. Having a centrally accessible, electronic
document that reflects that evolving dialogue should foster informed discussion.
Influence diagrams provide an attractive graphical scheme for explicitly codifying
conditional independence between critical probabilistic variables as justified by the
expert's knowledge and statistical data. The salient information in the diagram is, in fact,
not which variables influence each other, but rather, which ones do not influence each
other given the conditioning information. Thus influence is defined by its dual concept
"lack of influence".