CSE 5290: Artificial Intelligence

Artificial Intelligence and Decision Making
Session 11: Probabilistic Reasoning Systems
15.1 Representing Knowledge in an Uncertain Domain
15.2 The Semantics of Belief Networks
15.2.1 Representing the joint probability distribution
15.2.1.1 A method for constructing belief networks
15.2.1.2 Compactness and node ordering
15.2.1.3 Representation of conditional probability tables
15.2.2 Conditional independence relations in belief networks
15.3 Inference in Belief Networks
15.3.1 The nature of probabilistic inferences
15.3.2 An algorithm for answering queries
15.4 Inference in Multiply Connected Belief Networks
15.4.1 Clustering methods
15.4.2 Cutset conditioning methods
15.4.3 Stochastic simulation methods
15.5 Knowledge Engineering for Uncertain Reasoning
15.5.1 Case Study: The Pathfinder system
15.6 Other Approaches to Uncertain Reasoning
15.6.1 Default reasoning
15.6.2 Rule-based methods for uncertain reasoning
15.6.3 Representing ignorance: Dempster-Shafer theory
15.6.4 Representing vagueness: Fuzzy sets and fuzzy logic
15.7 Summary
Web Sites
http://ist-socrates.berkeley.edu:4247/tech.reports/tech27.html
http://info.sm.umist.ac.uk/wp/abstract/wp9801.htm
http://www-cs-students.stanford.edu/~cathay/summer/inspec
http://citeseer.nj.nec.com/context/17451/0
http://citeseer.nj.nec.com/context/4306/0
http://www.cs.vu.nl/vakgroepen/ai/education/courses/ks/slides/KS05/tsld031.htm
From:
http://www.sci.brooklyn.cuny.edu/~kopec/cis718/Tong3.htm
Summary of "Psychological validity of uncertainty combining rules in expert systems"
by Bruce E. Tonn and Richard T. Goeltz, Expert Systems, vol. 7, no. 2, pp. 94-101, May 1990.
Summarized by Tao Tong.
Several approaches have been developed to deal with inexact reasoning in expert
systems. Examples are Certainty Factors (CF), the Dempster-Shafer approach, Fuzzy Sets,
and the Theory of Endorsements. They are used and justified because they are simple,
mathematically rigorous, easy to program, cautious, and able to capture the essence of
natural language.
Certainty Factors have been in use since MYCIN and remain the most extensively used
uncertainty representation and manipulation approach. They are simple and intuitive.
The Dempster-Shafer approach excels in mathematical rigor. It builds on the theory of
probability and traces back to great mathematicians such as Jacob Bernoulli in the 1600s.
Fuzzy Set theory has the advantage of being mathematically rigorous and sensitive to
natural language when dealing with uncertainties.
The Theory of Endorsements deals with complex uncertainty in a non-mathematical
manner; however, it has less support than the other approaches.
All the above uncertainty representation and manipulation approaches are undergoing
extensions and improvements. However, choosing among the various uncertainty
approaches is not a trivial task for knowledge engineers.
An important but often ignored aspect of these uncertainty approaches is their psychological
validity. Simply put, psychological validity asks whether human experts really use the
uncertainty approaches when they are solving a problem. When the uncertainty approaches
used by human experts and by the expert system shell are incompatible, the consequences
may be minor; sometimes, however, serious adverse consequences may result. For example,
when knowledge engineers try to accommodate an expert system shell that uses a mismatched
uncertainty approach, they may unknowingly compromise the knowledge base to fit the
flawed shell's uncertainty approach. If both the knowledge base and the uncertainty approach
are flawed, the performance of the resulting expert system must necessarily be unsatisfactory.
Knowledge engineers may be interested in relevant psychological studies of human inexact
reasoning. Hink, R. and Woods, D., in the paper How Humans Process Uncertain Knowledge:
An Introduction, provide an excellent review of the relevant psychological research. From
this research we can see that experts and average people alike can be highly inefficient and
frequently irrational when making decisions. Even mathematically trained individuals show
intransitive preferences, overconfidence, and poor probability calibration. People routinely
violate the axioms of Expected Utility Theory and handle probability clumsily. When
combining estimates of uncertainty, previous work shows that people consistently
overestimate conjunctive probabilities. This paper presents research conducted at Oak
Ridge National Laboratory to explore the cognitive validity of commonly accepted
uncertainty combining rules.
The Likelihood Elicitation System (LES), a computer program written in Common Lisp and
running on DEC VAX computer clusters, was developed to facilitate the research. LES has
two major components, Session 1 and Session 2. Session 1 elicits the likelihood of simple
propositions, and Session 2 elicits the likelihood of complex propositions derived from the
simple propositions elicited in Session 1. LES represents likelihood in three modalities:
probability, certainty factors, and natural language.
The subjects were chosen from Oak Ridge National Laboratory and the University of
Tennessee, Knoxville. The propositions are drawn from three fields: daily events, personality
traits, and cancer. The subjects were expected to have enough "common knowledge
expertise" in these fields.
First, the subjects are asked to evaluate the likelihood of simple propositions in Session 1.
LES then customizes 36 complex propositions for each subject by forming conjunctions of
the simple propositions from Session 1 and, using the likelihoods given by the subject and
the expert system uncertainty combining rules, calculates the likelihood of each complex
proposition. The subjects are then asked to evaluate the likelihood of the complex
propositions, and their estimates are compared with the likelihoods computed by LES.
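To make the comparison concrete, the following sketch (in Python; the actual LES was written in Common Lisp, and the specific rules and numbers here are illustrative assumptions, not taken from the paper) shows how expert-system-style combining rules would assign a likelihood to a conjunction formed from two Session 1 propositions, to be set against the subject's own Session 2 estimate.

# A minimal sketch, not the actual LES code; all values are hypothetical.

def conjunction_product(p_a, p_b):
    """Probability modality: product rule, assuming independence."""
    return p_a * p_b

def conjunction_min(p_a, p_b):
    """Fuzzy / MYCIN-style conjunction: take the minimum."""
    return min(p_a, p_b)

a, b = 0.8, 0.6          # likelihoods of two simple propositions from Session 1
human = 0.7              # what a subject might report in Session 2 for "A and B"

print("product rule:", conjunction_product(a, b))   # 0.48
print("min rule    :", conjunction_min(a, b))       # 0.6
print("human       :", human)                       # above both rules' answers,
                                                     # i.e. the conjunction is overestimated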
The results of this research are rather intriguing. The human uncertainty heuristics
seemed to be influenced by the modality of uncertainty representation. Moreover, none of
the results confirm the psychological validity of the uncertainty approaches used in
modern expert system shells; they indicate that humans do not perform inexact reasoning
using the expert systems' uncertainty approaches.
The conclusion: people's natural uncertainty combining rules are not well modeled by
any generally accepted expert system uncertainty approach, and these natural
combining rules may be highly idiosyncratic. Knowledge engineers should be
aware of this kind of mismatch and be careful when developing expert systems using the
uncertainty approaches.
Because of the importance of uncertain information representation and manipulation in the
operation of expert systems, knowledge engineers must eventually determine experts'
natural uncertainty processing heuristics. In the future, expert system shells could contain
advanced machine learning modules to model an expert's uncertainty combining rules, and
shells should offer alternative uncertainty combining rules. Knowledge engineers need to
collaborate with psychologists to provide better models of uncertainty representation and
manipulation.
From:
http://members.tripod.com/~RichardBowles/published/euro89.htm
This is a paper I gave at the European Conference on Speech Communication and
Technology, Paris, September 1989 (Eurospeech 89), pages 384-387.
Application of the Dempster-Shafer Theory of Evidence to Improved-Accuracy Isolated-Word Recognition
R.L. Bowles and R.I. Damper
Department of Electronics and Computer Science, University of Southampton, Highfield,
Southampton SO9 5NH, U.K.
Abstract
This paper describes experiments in which the outputs of three isolated-word recognition
algorithms were combined to yield a lower average error rate than that achieved by any
individual algorithm. The input tokens were simulated, fixed-length spectral speech
patterns subjected to additive noise and either a "high" or "low" degree of time distortion.
The three recognition techniques used were dynamic time warping, hidden Markov
modelling and a multi-layer perceptron. Two combination techniques were employed: the
formula derived from the Dempster-Shafer (D-S) theory of evidence and simple majority
voting (MV). D-S performed significantly better than MV under both time-distortion
conditions. Evidence is also presented that the assumption of independent word scores,
which is necessary for D-S theory to be strictly applicable, is questionable.
1 Introduction
The problem domains of interest in artificial intelligence are characterised by the need to
handle uncertain information. Several standard methodologies for this have arisen, each
with its own representation of the degree of uncertainty of the data, including Bayesian
updating, belief functions [1], fuzzy logic [2] and the MYCIN calculus [3]. Each
approach gives a way of combining evidence from distinct knowledge sources; a
comprehensive survey of the methods is presented in [4,5]. Apart from our own work [6],
these techniques have not (as far as we are aware) been employed in automatic speech
recognition.
According to Allerhand [7], the use of simple models in speech recognition creates an
inherent performance limitation. A plateau is reached, set by the assumptions implicit in
the model, where further improvement cannot be made. Combination of evidence offers a
possible means of overcoming the fundamental limitation imposed by use of a single
model. In [6], we showed that it was possible in certain circumstances to obtain a
significant increase in recognition accuracy by combining the outputs from two or three
distinct isolated-word recognition (IWR) algorithms subjected to the same simulated
speech input using Dempster-Shafer (D-S) theory (see below). In this case, the individual
algorithms act as separate sources of evidence (assumed independent) concerning word
identity. The algorithms used at that time were: dynamic time warping (DTW), hidden
Markov modelling (HMM) and a rather simple spectral peak-picking technique (SPP).
No real efforts were made to optimise the separate algorithms; indeed, our position was
that the combination approach should, if anything, compensate for their imperfections.
In this paper, we report results obtained when the three individual algorithms all achieve a
high degree of accuracy (typically in the range 90 to 99%). To effect this, we have improved
considerably the training of the HMM word models and used the more powerful multi-layer
perceptron (MLP) in place of SPP. We have also greatly improved the method of
obtaining belief functions from distance scores and this is described. We also compare
results using D-S combination with those obtained using the simplest possible
combination technique, namely majority voting (MV).
2 Dempster-Shafer Combination
The D-S formula [8] is a means of combining evidence based on belief functions so as to
select among competing hypotheses. In this theory, the frame of discernment, Θ, is a set
of exhaustive and mutually exclusive hypotheses (singletons). The set of possible
hypotheses is the powerset of Θ, including the empty set, called Ω. The subsets A of Ω
for which there is direct evidence are termed the focal elements of Ω.
2.1 Probability masses and belief
Evidence for the hypotheses takes the form of probability assignments - called basic
probability masses, m(A) - to each of the focal elements, A. The masses are probability-like
in that they are in the range [0,1] and sum to 1 over all hypotheses, but are not exactly
probabilities; rather, they represent the belief assigned to a focal element, however measured.
The belief in a hypothesis H is termed BEL(H) or, sometimes, the lower probability of H.
It is equal to the sum of all the probability masses of the subsets of H:

BEL(H) = Σ_{A ⊆ H} m(A)    (1)
The plausibility of H, sometimes called the upper probability, is defined to be 1 - BEL(¬H).
It can be considered as the extent to which the evidence does not contradict the hypothesis.
Belief values always lie in the range 0 to 1; BEL(H) = 1 means that H is effectively certain,
while BEL(H) = 0 means that there is a total lack of belief in H (not to be confused with
disbelief).
When the hypotheses, Ω, are all singletons, belief and plausibility become identical.
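As a quick illustration of equation (1) and of plausibility, the following Python fragment computes BEL and plausibility over a tiny frame of discernment; the masses are made up and are not from the paper.

theta = frozenset({"a", "b", "c"})     # tiny frame of discernment

# Basic probability masses on focal elements (subsets of theta); they sum to 1.
m = {
    frozenset({"a"}): 0.4,
    frozenset({"b", "c"}): 0.3,
    theta: 0.3,
}

def bel(h):
    """Equation (1): BEL(H) = sum of m(A) over focal elements A contained in H."""
    return sum(mass for a, mass in m.items() if a <= h)

def plausibility(h):
    """1 - BEL(not H): how far the evidence fails to contradict H."""
    return 1.0 - bel(theta - h)

h = frozenset({"a"})
print(bel(h), plausibility(h))   # 0.4 0.7, and BEL(H) <= plausibility(H) as expected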
The combination formula allows two different sets of probability masses from
independent sources, but relevant to the same hypotheses, to be combined to give overall
belief values. If F is a hypothesis and m(G) and m(H) are the probability assignments to
focal elements G and H, then the combined evidence for F is given by:

BEL(F) = [ Σ_{G ∩ H = F} m(G)·m(H) ] / [ Σ_{G ∩ H ≠ ∅} m(G)·m(H) ]    (2)
The denominator acts as a normalising term to ensure that the combined belief value lies
in the range 0 to 1. The combination is commutative and associative and, hence, any
number of masses can be combined in any order.
2.2 Isolated word recognition
In the case of IWR, the hypotheses, Ω, are the possible identities of the words and the
evidence for these words is the similarity scores corresponding to them. The hypotheses
reduce to a set of singletons since each score obtained relates only to a single word. That
is, for a vocabulary of N words, there are N hypotheses: Hi is "the ith word of the
vocabulary", 1 ≤ i ≤ N. In this case, Ω is actually identical to the
frame of discernment, Θ. (Note that we have ignored the possibility of out-of-vocabulary
utterances, corresponding to the inclusion of the empty set in Ω.) Thus, the only set
intersections which are non-empty are those pertaining to the probability masses for the
ith word obtained from the three algorithms, X, Y and Z, say. Thus, (2) becomes:

BEL(Hi) = m(Xi)·m(Yi)·m(Zi) / Σ_{j=1..N} m(Xj)·m(Yj)·m(Zj)    (3)
Applying the D-S formula in the form of (3), i.e. with singletons, is very close to
Bayesian updating but with important differences. First, probability masses and beliefs
appear in place of probabilities and, second, an assumption is made about independence
of the evidence (scores) which is not essential to the Bayesian calculus [5]. Because of
the way we derive probability masses from scores (see below), which means that they are
not true probabilities, we feel it is preferable to view our method in a D-S framework.
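The singleton form of the combination is simple enough to state as code. The sketch below (Python, with hypothetical mass vectors; in the paper the masses come from the DTW, HMM and MLP scores as described in Section 2.3) applies equation (3) and picks the word with the highest combined belief.

def combine_singletons(m_x, m_y, m_z):
    """Equation (3): normalised product of the three algorithms' masses per word."""
    products = [x * y * z for x, y, z in zip(m_x, m_y, m_z)]
    total = sum(products)                      # the normalising denominator
    return [p / total for p in products]

# Hypothetical masses over a 3-word "vocabulary" from algorithms X, Y and Z.
m_x = [0.5, 0.3, 0.2]
m_y = [0.4, 0.4, 0.2]
m_z = [0.6, 0.2, 0.2]

bel = combine_singletons(m_x, m_y, m_z)
recognised = max(range(len(bel)), key=bel.__getitem__)
print(bel, "-> word", recognised)              # word 0 has the highest combined belief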
2.3 Probability mass computation
A method of converting the similarity scores into probability masses is required for use in
(3). According to Lindley [9], probability is the best possible measure of belief in an
hypothesis. Further, if probabilities were available, we could use Bayes' rule to
determine the "belief" in hypothesis H, given the evidence available, E:

p(H | E) = p(E | H)·p(H) / p(E)    (4)

Since all "words" are equally likely in our simulation, and there are no out-of-vocabulary
utterances, p(H) is simply equal to 1/N. However, the probabilities p(E|H) and p(E) are
effectively unknown, but we can estimate them from the distribution of scores as we
now describe.
2,000 tokens were input to the three recognition algorithms and scores calculated for each
token matched against all the word models. For each algorithm, a distribution was
constructed of all scores obtained; these were then assumed to be Gaussian. Thus, for any
particular score, E, an estimate of p(E) can be obtained using the normal formula. In like
manner, p(E|H) can be estimated for each word by constructing a distribution of correct
match scores only; there are 32 of these distributions for each algorithm. This then allows
us to estimate p(H|E) using (4). Since the result is only probability-like, we prefer to think
of it as a probability mass, m(H). This method of computing probability masses ensures
that all the m(H)'s sum to 1.
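A minimal sketch of this procedure, assuming Gaussian score distributions as described above; the helper names and numbers are illustrative and not the authors' code.

import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def probability_masses(scores, overall, per_word):
    """scores: one algorithm's scores for a token against each of the N word models.
    overall: (mean, std) fitted to all scores; per_word[i]: (mean, std) fitted to the
    correct-match scores for word i. Returns masses m(H_i) that sum to 1."""
    n = len(scores)
    prior = 1.0 / n                                    # p(H): all words equally likely
    posts = []
    for e, (mu, sd) in zip(scores, per_word):
        p_e_given_h = gaussian_pdf(e, mu, sd)          # from the correct-match distribution
        p_e = gaussian_pdf(e, *overall)                # from the distribution of all scores
        posts.append(p_e_given_h * prior / p_e)        # equation (4)
    total = sum(posts)
    return [p / total for p in posts]                  # renormalise into probability masses

masses = probability_masses(scores=[0.2, 0.9, 1.4],
                            overall=(1.0, 0.5),
                            per_word=[(0.3, 0.2)] * 3)
print(masses, sum(masses))                             # masses sum to 1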
3 The Test System
The test system consisted of the three standard recognition algorithms - DTW, HMM and
MLP - each of which was subjected to simulated speech patterns drawn from a
"vocabulary" of N = 32 words. The outputs from these algorithms were then combined to
produce an overall score for each word.
3.1 Simulated speech data
Because of the large amount of input data needed to test the combination strategy, we
have chosen to use simulated speech. Each token is intended to mimic the output from a
16-filter bandpass analysis, time-normalized to a sequence of 16 frames. 32 "prototypes"
were manually created to resemble actual patterns. Further tokens were produced by
randomly adding and deleting frames and renormalising; Gaussian noise was also added
to the spectral values to produce a 16 dB SNR. Two degrees of time distortion were
applied: low (LTD) in which each frame had a 33% chance of being repeated or deleted,
and high (HTD) in which the chance was 67%.
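A rough sketch of this token-generation procedure is given below; the renormalisation step and the exact way repeats and deletions are chosen are assumptions on our part, since the paper does not spell them out.

import math, random

def distort(prototype, p_distort, snr_db=16.0):
    """prototype: 16 frames x 16 filter values; p_distort: 0.33 (LTD) or 0.67 (HTD)."""
    frames = []
    for frame in prototype:
        if random.random() < p_distort:
            if random.random() < 0.5:
                frames.extend([frame, frame])          # repeat the frame
            # else: delete the frame (skip it)
        else:
            frames.append(frame)
    if not frames:                                     # guard against deleting everything
        frames = list(prototype)

    # Renormalise back to the original number of frames by uniform resampling.
    n = len(prototype)
    frames = [frames[int(i * len(frames) / n)] for i in range(n)]

    # Add Gaussian noise scaled to give roughly the requested signal-to-noise ratio.
    power = sum(v * v for f in frames for v in f) / (n * len(frames[0]))
    noise_std = math.sqrt(power / (10 ** (snr_db / 10.0)))
    return [[v + random.gauss(0.0, noise_std) for v in f] for f in frames]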
3.2 Recognition algorithms
The DTW algorithm was exactly as described in [6]. This used a conventional
asymmetric local path constraint based on the Chebyshev distance metric. Warping paths
were constrained to end at the final frame in both the test word and the vocabulary
prototype. Global paths were constrained by setting a semi-window width of 5.
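For readers unfamiliar with DTW, the following generic sketch conveys the idea; note that it uses a plain symmetric step pattern and a simple band constraint rather than the asymmetric local path constraint actually used in [6], so it is illustrative only.

import math

def chebyshev(f1, f2):
    """Frame-to-frame distance: the largest difference across filter channels."""
    return max(abs(a - b) for a, b in zip(f1, f2))

def dtw_score(test, proto, window=5):
    """Cumulative distance of the best warping path ending at the final frames."""
    n, m = len(test), len(proto)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) > window:                    # stay inside the warping window
                continue
            cost = chebyshev(test[i - 1], proto[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i - 1][j - 1], d[i][j - 1])
    return d[n][m]                                     # lower means a better match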
As in [6], the HMMs used were discrete, 5-state, left-to-right, double-step models.
However, the models used here were trained on synthetic tokens (produced as above), the
frames of which were clustered into 32 classes using the c-means clustering algorithm in
conjunction with a Euclidean distance measure. The figure of 32 cluster centres was
chosen empirically to maximise the HMM recognition rate. Training employed the
Baum-Welch algorithm; the forward-backward procedure was used for matching,
producing a probability reflecting the degree of similarity.
For the MLP, a 3-layer network was used with 256 nodes in the input layer, 20 nodes in
the hidden layer and 32 in the output layer. The function of the input layer was to hard-limit
the speech-pattern values to 0 or 1. The nodes in the hidden layer were not fully
connected to the input layer but, rather like the "zonal units" of [10], were divided into 4
groups of 5. Each group was connected to 4 contiguous frequency bands in the input.
This was intended primarily to reduce computation time but it also slightly mimicked the
critical band filtering of the auditory system [10]. The output layer was fully
interconnected with the hidden layer and each output node corresponded to one
vocabulary word. The MLP was trained using error-back propagation by clamping one
output high when the corresponding word pattern was presented to the inputs. In practice,
a test word caused all the outputs to go to different values between 0 and 1, these values
being taken as the vector of similarity scores for the MLP.
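The layer sizes above translate into the following connectivity sketch (Python); the exact assignment of hidden-unit groups to frequency bands is an assumption, and training by error back-propagation is omitted.

import random

N_BANDS, N_FRAMES = 16, 16        # 16 filter bands x 16 frames = 256 binary inputs
N_HIDDEN, N_OUT = 20, 32

def zonal_mask():
    """Hidden units in 4 groups of 5; each group sees 4 contiguous frequency bands."""
    mask = [[0] * (N_BANDS * N_FRAMES) for _ in range(N_HIDDEN)]
    for h in range(N_HIDDEN):
        group = h // 5                                 # groups 0..3
        for band in range(group * 4, group * 4 + 4):   # 4 contiguous bands per group
            for t in range(N_FRAMES):
                mask[h][band * N_FRAMES + t] = 1
    return mask

mask = zonal_mask()
# Input-to-hidden weights are zeroed wherever the mask forbids a connection;
# the hidden-to-output layer is fully connected.
w_ih = [[random.uniform(-0.1, 0.1) * mask[h][i] for i in range(N_BANDS * N_FRAMES)]
        for h in range(N_HIDDEN)]
w_ho = [[random.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]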
3.3 Combination techniques
The D-S formula, equation (3), was used to obtain BEL values from the probability
masses found as in Section 2.3. The word recognised was that with the highest BEL
value.
For reference purposes, we also combined the outputs from the three individual
algorithms using majority voting with each algorithm contributing a single vote on word
identity. This is perhaps the simplest combination strategy possible. Further, it is
amenable to theoretical analysis under an independence assumption so that predicted and
obtained accuracies can be compared. The degree of differences between these two can
be taken as an indication of the validity of the assumption.
Taking P, Q and R to be the recognition rates of the three algorithms, and further taking
these to be synonymous with the probability of correct recognition, the combined
accuracy with majority voting, MV, is easily shown to be:
MV = PQ + PR + QR - 2PQR
(5)
Error rates were defined as (1 - MV), i.e. they include both false acceptances and
rejections (which occur when all three algorithms vote differently).
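Equation (5) is easy to check numerically. The snippet below evaluates it using the LTD recognition rates implied by Table 1 below and reproduces, to rounding, the MV(pred) figure reported later in Table 2.

def majority_vote_accuracy(p, q, r):
    """Equation (5): accuracy of 2-out-of-3 voting, assuming independent errors."""
    return p * q + p * r + q * r - 2.0 * p * q * r

# Recognition rates of DTW, HMM and MLP under low time distortion (from Table 1).
p, q, r = 1 - 0.0093, 1 - 0.0235, 1 - 0.0074
mv = majority_vote_accuracy(p, q, r)
print(round((1 - mv) * 100, 2))   # 0.05, the MV(pred) percentage error of Table 2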
4 Results
2,000 tokens of simulated speech data were input to the three recognition algorithms,
BEL values computed and the rule of combination applied to produce a composite score,
the maximum value of which indicated the word recognised from the 32 candidates. This
was repeated 10 times for both high- and low-time distortion in order to give some idea
of the variance on the figures. Table 1 shows percentage error rates for each algorithm
and for the combined (D-S) recogniser with the standard deviations in brackets.
            HTD            LTD
DTW         3.86 (0.09)    0.93 (0.10)
HMM        12.47 (0.15)    2.35 (0.16)
MLP         2.77 (0.09)    0.74 (0.08)
D-S         0.45 (1.64)    0.20 (1.87)

Table 1: Means and standard deviations of the percentage error rates for the recognition
processes and the combined D-S recogniser.
For both HTD and LTD conditions, D-S combination results in a very much lower
average error rate than that of the best performing algorithm (MLP). However, the D-S
recogniser has a much higher standard deviation as a natural consequence of the
combination strategy. As is clear from the values of the standard deviations, D-S
combination performed less well than the individual algorithms on some runs but, on
average, did much better.
Majority voting was also used to combine individual results; Table 2 shows the error
rates obtained with the D-S figure given for comparison. In the table, MV(act) refers to
the actual error rate and MV(pred) denotes the error rate predicted from equation (5).
            HTD     LTD
D-S         0.45    0.20
MV(act)     1.02    0.28
MV(pred)    0.81    0.05

Table 2: Error rates for the D-S combination and for majority voting.
This shows that D-S combination performs better than majority voting under both
conditions of time distortion. The differences between MV(act) and MV(pred) reveal that
the independence assumption is problematic, and more so for the lower time-distortion
condition. Interestingly, D-S combination under LTD conditions does not do as well as
MV ought to do. Whether this is because of the independence assumption implicit in D-S,
or because MV is inherently superior at very low error rates, is uncertain.
5 Conclusions
This work has shown that Dempster-Shafer combination can be used to increase the
accuracy of isolated word recognition over that obtainable from a single algorithm even
when the individual algorithms each achieve high accuracy.
Although the speech data in these trial runs was synthetic, efforts were made to ensure
the test words were realistic. Our next step is to see if a similar improvement is possible
using real speech, at least for the prototypes. We are also currently implementing
something closer to Bayesian updating which avoids the independence assumption.
In the long term, we foresee combination techniques as contributing to connected-word
and large vocabulary recognition based on sub-word modelling rather than whole-word
pattern matching. In this case, the individual sources will provide evidence for the
identity of sub-word units.
References
1. G Shafer (1982), "Belief functions and parametric models", J. Royal Stat. Soc.
Series B, 44, 322-352.
2. L A Zadeh (1968), "Probability measures of fuzzy events", J. Math. Anal. Appl.,
23, 421-427.
3. E H Shortliffe and B G Buchanan (1975), "A model of inexact reasoning in
medicine", Mathematical Biosciences, 23, 351-379.
4. H E Stephanou and A P Sage (1987) "Perspectives on imperfect information
processing", IEEE Trans. Systems, Man and Cybernetics, SMC-17, 780-798.
5. S J Henkind and M C Harrison (1988), "An analysis of four uncertainty calculi",
IEEE Trans. Systems, Man and Cybernetics, SMC-18, 700-714.
6. R L Bowles, R I Damper and S M Lucas (1988), "Combining evidence from
separate speech recognition processes", Proc. FASE Speech '88, Edinburgh, 669-674.
7. M Allerhand (1987), "Knowledge-Based Speech Pattern Recognition", Kogan
Page, London.
8. G Shafer (1976), "A Mathematical Theory of Evidence", Princeton Univ. Press,
Princeton, NJ.
9. D V Lindley (1987), "The probability approach to the treatment of uncertainty in
artificial intelligence and expert systems", Statistical Science, 2, 3-44.
10. T D Harrison and F Fallside (1989), "A connectionist model for phoneme
recognition in continuous speech", Proc. IEEE ICASSP '89, Glasgow, Vol. 1,
417-420.
FUZZY SET THEORY
From:
http://www.etc.tuiasi.ro/ecit/Zimm/Zimm_Origins.html
Origins, Present Developments and the Future
of Computational Intelligence
H.-J. Zimmermann
RWTH/ELITE, Aachen
Even though the first publication in the area of Fuzzy Set Theory (FST) appeared already
in 1965, the development of this theory for almost 20 years remained in the academic
realm. Almost all basic concepts, theories and methods were, however, developed during
this period.
Fuzzy Control opened the gate to real applications for FST. Particularly in Japan the
applications of the fuzzy control principle in consumer goods made FST known in the
public and made it commercially interesting for industry. This led to two developments:
since the development of fuzzy application systems had to be efficient, fuzzy CASE
tools and expert system shells were developed, turning FST into Fuzzy Technology.
The success in Japan drew the attention of the media and started – first in Germany
– the "Fuzzy Booms", which led to unprecedented growth in publications, university
teaching and other industrial applications in many countries.
Around 1993 FST, Neural Nets and Evolutionary Computing joined forces and were soon
considered to be one area called Soft Computing or Computational Intelligence.
Applications in Engineering as well as in Management will be described during the
presentation. Of particular interest for Europe might also be the development of ERUDIT
(European Network of Excellence for Fuzzy Sets and Uncertainty Modeling), a network
which grew from 15 nodes in 1995 to 250 nodes in 1997 and which was extended for
another two years by the European Commission.
The latest development is COIL (Computational Intelligence and Learning), a European
Network of Excellence cluster of ERUDIT, NeuroNet, Evonet and ML-Net.
Historical Development
Fuzzy Set Theory, Fuzzy Technology and Computational Intelligence
Fuzzy Set Theory was conceived in 1965 as a formal theory which could be considered
as a generalization of either classical set theory or of classical dual logic. Although
Prof. Zadeh, when publishing his first contribution, already had some applications in mind,
Fuzzy Set Theory for several reasons remained inside the academic sphere for more than
20 years. During these 20 years most of the basic concepts which are nowadays used
very successfully had already been invented. Starting at the beginning of the 80s Japan
was the leader in using a smaller part of Fuzzy Set Theory - namely fuzzy control - for
practical applications. In particular, improved consumer goods such as video cameras with
fuzzy stabilizers, washing machines including fuzzy control, rice-cookers etc. caught the
interest of the media, which led around 1989/1990 to the first "fuzzy boom" in Germany.
Many attractive practical applications - not so much in the area of consumer goods but
rather in automation and industrial control - led to the insight that the efficient and
affordable use of this approach could only be achieved via CASE-tools. Hence, since the
late 80s a large number of very user-friendly tools for fuzzy control, fuzzy expert
systems, fuzzy data analysis etc. have emerged. This really changed the character of this
area and started, to my mind, the era of "Fuzzy Technology". The next - and so far the
last - large step in the development occurred in 1992 when almost independently in
Europe, Japan and the USA the three areas of Fuzzy Technology, artificial neural nets
and genetic algorithms joined forces under the title of "Computational Intelligence" or
"Soft Computing". The synergies, which were possible between these three areas, have
been exploited since then very successfully. Figure 1 shows these developments as a
summary.
Fig. 1. From Fuzzy Set Theory to Computational Intelligence
Management, engineering and other areas can be supported by computational intelligence
in many ways. This support can refer to information processing as well as to data mining,
choice or evaluation activities or to other types of optimization. Classical decision
support systems consist of data bank systems for the information processing part and
algorithms for the optimization part. If, however, efficient algorithms are not available or
if decisions have to be made in ill-structured environments, knowledge-based
components are added to either supplement or substitute algorithms. In both cases Fuzzy
Technology can be useful.
In this context it may be useful to cite and briefly comment on the major goals of this
technology, and to correct the still very common view that Fuzzy Set Theory or Fuzzy
Technology is exclusively or primarily useful for modeling uncertainty:
a) Modeling of uncertainty
This is certainly the best known and oldest goal. I am not sure, however, whether it can
(still) be considered to be the most important goal of Fuzzy Set Theory. Uncertainty has
been a very important topic for several centuries. There are numerous methods and
theories which claim to be the only proper tool to model uncertainties. In general,
however, they do not even define sufficiently or only in a very specific and limited sense
what is meant by "uncertainty". I believe that uncertainty, if considered as a subjective
phenomenon, can and ought to be modeled by very different theories, depending on other
causes of uncertainty, the type and quantity of available information, the requirements of
the observer etc. In this sense Fuzzy Set Theory is certainly also one of the theories
which can be used to model specific types of uncertainty under specific types of
circumstances. It might then compete with other theories, but it might also be the most
appropriate way to model this phenomenon for well-specified situations. It would
certainly exceed the scope of this article to discuss this question in detail here [7].
b) Relaxation
Classical models and methods are normally based on dual logic. They, therefore,
distinguish between feasible and infeasible, belonging to a cluster or not, optimal or
suboptimal etc. Often this view does not capture reality adequately. Fuzzy Set Theory has
been used extensively to relax or generalize classical methods from a dichotomous to a
gradual character. Examples of this are fuzzy mathematical programming [6], fuzzy
clustering [2], fuzzy Petri Nets [3], and fuzzy multi criteria analysis [5].
c) Compactification
Due to the limited capacity of the human short term memory or of technical systems it is
often not possible to either store all relevant data, or to present masses of data to a human
observer in such a way, that he or she can perceive the information contained in these
data. Fuzzy Technology has been used to reduce the complexity of data to an acceptable
degree, usually either via linguistic variables or via fuzzy data analysis (fuzzy clustering
etc.).
d) Meaning Preserving Reasoning
Expert System Technology has already been in use for two decades and has led in many
cases to disappointment. One of the reasons for this might be that expert systems whose
inference engines are based on dual logic perform symbol processing (truth values true or
false) rather than knowledge processing. In Approximate Reasoning
meanings are attached to words and sentences via linguistic variables. Inference engines
then have to be able to process meaningful linguistic expressions, rather than symbols,
and arrive at membership functions of fuzzy sets, which can then be retranslated into
words and sentences via linguistic approximation.
e) Efficient Determination of Approximate Solutions
Already in the 70s Prof. Zadeh expressed his intention to have Fuzzy Set Theory
considered as a tool to determine approximate solutions of real problems in an efficient or
affordable way. This goal has never really been achieved successfully. In the recent past,
however, cases have become known which are very good examples for this goal.
Bardossy [1], for instance, showed in the context of water flow modeling that it can be
much more efficient to use fuzzy rule based systems to solve the problems than systems
of differential equations. Comparing the results achieved by these two alternative
approaches showed that the accuracy of the results was almost the same for all practical
purposes. This is particularly true if one considers the inaccuracies and uncertainties
contained in the input data.
The development of Fuzzy Technology during the last 30 years has, roughly speaking,
led to the following application oriented classes of approaches:
- Model-based (algorithmic) Applications
• fuzzy optimization (fuzzy linear program. etc.)
• fuzzy clustering (hierarchical and obj. function)
• fuzzy Petri Nets
• fuzzy multi criteria analysis
- Knowledge-based Applications
• fuzzy expert systems
• fuzzy control
• fuzzy data analysis
- Information Processing
• fuzzy data banks and query languages
• fuzzy programming languages
• fuzzy library systems.
For almost all of the classes of application mentioned above tools (Software and/or
hardware) are available to allow efficient modeling.
Institutionally Fuzzy Set Theory developed very differently in the different areas of the
world. The first European Working Group for Fuzzy Sets was started in 1975, at a time at
which Fuzzy Sets became visible in international conferences, such as NOAK (the
Scandinavian Operations Research Conference), the IFORS Conference in Toronto, and the
1st USA-Japan Symposium in Berkeley.
At the beginning of the 80s national societies were founded in the USA (NAFIPS) and
Japan (Soft) and almost at the same time a worldwide society IFSA was started.
When the 3rd World Congress of IFSA took place in Tokyo, Fuzzy Technology was
already well-known in the Japanese economy where it had been successfully applied to
consumer goods (washing machines, video cameras, rice cookers) but also to the
industrial processes (cranes etc.) and to public transportation (subway system in Sendai).
In the rest of the world it was still very little known and primarily considered as an
academic area.
The European Development
In contrast to Japan and the USA, Europe is very heterogeneous economically, culturally
and scientifically. When in 1989/90 the "Fuzzy Boom" was triggered by the media, which
had observed the fast development of this technology in Japan, there existed in different
European countries approximately ten research groups in the area of Fuzzy Sets, but they
hardly communicated with each other and hardly even knew of each other. They were
working on an international level but were not very application-oriented.
In this situation the fear grew that Europe would again lose one of the major market
potentials to Japan. What seemed to be needed most was communication and cooperation
between European countries and between science and economy. Neither a company nor a
university seemed to have the standing to bring this about. Hence, a foundation (ELITE =
European Laboratory for Intelligent Techniques Engineering) was established. It was much
smaller and had much less public support than LIFE in Japan, which had very similar
objectives. The media and the strong public interest had a strong influence on the
universities, and within one to two years the European Commission could be convinced of
the economic importance of this area. Via a European Working Group on Fuzzy Control
one of the European Networks of Excellence was dedicated to Fuzzy Technology
(ERUDIT). It became a European framework in which new theoretical and practical
developments were and are triggered, supported and advanced in a methodical and
interdisciplinary way.
Some of its important features are:
o its structure,
o its growth,
o its orientation, and
o its services.
These are depicted in the following figures. The structure is a matrix organization with the
functions and methods horizontally intersecting all sectors of the economy.
Fig. 2. ERUDIT – Structure
Fig. 3. Committee – Structure
A self-imposed constraint ensures the application orientation, and focussed activities in
several directions lead to steady growth.
Extensive surveys also allow very focussed activities to advance the area in scope and
depth. Figure 4 sketches results from these surveys.
Fig. 4. Application Areas
In 1999 the Networks of Excellence for Fuzzy Sets, Neural Nets, Evolutionary
Computing and Machine Learning joined into COIL (Computational Intelligence and
Learning).
Which conclusions can be drawn from the European experience described above? Perhaps
that strong and steady growth of a technology, even under difficult conditions such as in
Europe, can be achieved if the media are intensively included in the promoting activities
and if the development is not left to chance but communication, initialization, support and
technology transfer are improved systematically and steadily.
Future Perspectives
Fuzzy Technology caught public attention in the 80s and 90s primarily via technical
applications (washing machines, video cameras, subway systems etc.). These have not
disappeared, but the public interest has vanished. Strong research activities can still be
found in the areas of adaptive systems, robotics, vision, quality control, medicine etc.
Even centers which focus on intelligent engineering systems are being set up at present.
The general trend is not to concentrate on Fuzzy Technology only but to combine
classical approaches with Neural Net technology and often with Evolutionary
Computation.
Major new applications and developments have for a number of years been occurring in
Business Intelligence. In contrast to the engineering environment, there has been a
tremendous change in the management area in recent years: while until the beginning of
the 90s there existed, or was perceived, a serious lack of useful and EDP-readable data in
this area, now there is an abundance of data in many management sectors. The increasing
number of "data warehouses" is an indication of this development. There are two sides to
this coin: on the one hand, applications become possible which were not conceivable until
the 80s. On the other hand, managers very often have serious problems extracting the
information they need from the masses of stored data. This situation opens the door for all
kinds of data mining and knowledge discovery approaches and makes new fascinating
applications possible. Examples are the automatic generation of (credit) ratings of customers
as well as of suppliers (an application which will grow in importance with growing
e-commerce), market segmentation for focussed marketing actions, CRM (customer
relationship modeling) etc. Compared to the often rather local engineering applications of
fuzzy control, these are much more global issues with high profit or cost saving potentials.
Figure 5 summarizes future development potentials in research as well as in applications.
Research:
- Hybrid Methods and Models (FT, NN, EC)
- Computing with words
- Machine Learning (Fuzzy Decision Trees, Kohonen Nets etc.)
- Information Technology (Fuzzy Data Banks, Fuzzy SQL etc.)
- Intelligent Agents
- Improvement of Man-Machine-Interfaces
- Machine Intelligence (MIQ)

Applications:
- Business Intelligence: Financial Engineering (fraud detection, retention, creditworthiness,
  ratings, stock exchange); Customer Relationship Modeling (market segmentation, database
  marketing, campaign management etc.); Data Mining and Knowledge Discovery in Data
  Warehouses; Simultaneous Engineering
- Technical Intelligence: Automation; Non-FC-Applications; CAD; Simultaneous Engineering;
  Quality Management; Robotics

Fig. 5. Future Developments and Applications
A good survey of the present focus of developments can be found in [4].
The conclusion I draw in the present situation is that, while Soft Computing is no longer as
publicly visible as it was in the 90s, the potentials and challenges in this area have not
decreased but rather increased considerably.
References
1. A. Bardossy, 1996. The Use of Fuzzy Rules for the Description of Elements of
the Hydrological Cycle. Ecological Modelling, 85, 59 - 65.
2. J. C. Bezdek and S. K. Pal, 1992. Fuzzy Models for Pattern Recognition. New
York.
3. H.-P. Lipp, R. Günther and P. Sonntag, 1989. Unscharfe Petri Netze - Ein
Basiskonzept für Computerunterstützte Entscheidungssysteme in Komplexen
Systemen. Wissenschaftliche Schriftenreihe der TU Chemnitz, 7.
4. H.-N. Teodorescu et al. (editors), 2000. Intelligent Systems and Interfaces.
Kluwer Academic Publishers, Boston.
5. H.-J. Zimmermann, 1986. Multi Criteria Decision Making in Crisp and Fuzzy
Environments. In: Zimmermann, Jones and Kaufman (editors). Fuzzy Set Theory
and Applications. Dordrecht, 233 - 256.
6. H.-J. Zimmermann, 1996. Fuzzy Set Theory - and Its Applications. 3rd rev. edit.
Boston.
7. H.-J. Zimmermann, 1997. A Fresh Perspective on Uncertainty Modeling:
Uncertainty vs. Uncertainty Modeling. In: B. M. Ayyub and M. M. Gupta
(editors). Uncertainty Analysis in Engineering and Sciences: Fuzzy Logic,
Statistics, and Neural Network Approach. International Series in Intelligent
Technologies, Kluwer Academic Publishers, 353 – 364.
Certainty Factors
From:
http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/certainty.html
Certainty Factors
- Initial Definitions
- Reduction to Probabilities
- Composition of Beliefs
- Why Certainty Factors work in Mycin
- Certainty Factors in the Undergraduate AI course

Certainty Factors were introduced in MYCIN. The basic reference is
Buchanan, Shortliffe: Rule-Based Expert Systems, Addison-Wesley, 1984.
Initial Definitions
MB[H,E]: measure of increased belief in hypothesis H given the evidence E;
a real number in the interval [0,1].
MD[H,E]: measure of increased disbelief in hypothesis H given the evidence E;
a real number in the interval [0,1].
CF[H,E]: certainty factor for hypothesis H given the evidence E.

CF[H,E] was originally defined as MB[H,E] - MD[H,E]. It was later modified to

                 MB[H,E] - MD[H,E]
CF[H,E] = ---------------------------
           1 - Min{MB[H,E], MD[H,E]}
Experts normally come up with the values for MB and MD of some facts,
and with the value of CF for inference rules. Default initial values
for MB, MD are 0. Since the beliefs of the experts are not necessarily
consistent, it is necessary to carry out sanity checks. For example,
if H1, H2, ..., Hn are exhaustive, mutually exclusive hypotheses, the sum of their
beliefs should be at most 1 and the sum of their disbeliefs should be at most n-1.
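A small sketch of the modified definition and the sanity check, in Python with made-up numbers:

def cf(mb, md):
    """The modified definition: CF = (MB - MD) / (1 - min(MB, MD))."""
    return (mb - md) / (1 - min(mb, md))

def sanity_check(beliefs, disbeliefs):
    """For exhaustive, mutually exclusive H1..Hn: sum(MB) <= 1 and sum(MD) <= n - 1."""
    n = len(beliefs)
    return sum(beliefs) <= 1 and sum(disbeliefs) <= n - 1

print(cf(0.7, 0.2))                                    # 0.625
print(sanity_check([0.5, 0.3, 0.1], [0.9, 0.6, 0.4]))  # True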
Reduction to Probabilities
Though certainty factors were arrived at without a probabilistic
foundation, Heckerman in 1986 showed how MB and MD could be defined
in terms of probabilities
as indicated below (see Shafer-Pearl, pages 298-312 for more details):
MB[H,E] = 1                                                       if P(H) = 1
        = (max{P(H|E), P(H)} - P(H)) / ((1 - P(H)) * P(H|E))      otherwise

MD[H,E] = 1                                                       if P(H) = 0
        = (min{P(H|E), P(H)} - P(H)) / (- P(H) * P(H|E))          otherwise
Composition of Beliefs
Note that MB[H,E1&E2] is the belief when evidence E1 and evidence E2 both support
the hypothesis H. Similarly for MD[H,E1&E2].

MB[H,E1&E2] = 0                                       if MD[H,E1&E2] = 1
            = MB[H,E1] + MB[H,E2]*(1 - MB[H,E1])      otherwise

The belief in H goes rapidly to 1 when many pieces of evidence support it. For example,
if MB[H,E1] = MB[H,E2] = 0.5 then MB[H,E1&E2] = 0.75, and if also MB[H,E3] = 0.5
then MB[H,E1&E2&E3] = 0.875, ...
MD[H,E1&E2] = 0                                       if MB[H,E1&E2] = 1
            = MD[H,E1] + MD[H,E2]*(1 - MD[H,E1])      otherwise

CF[H,E1&E2] = CF[H,E1] + CF[H,E2] - CF[H,E1]*CF[H,E2]   if CF[H,E1] and CF[H,E2]
                                                         are both positive
            = CF[H,E1] + CF[H,E2] + CF[H,E1]*CF[H,E2]   if CF[H,E1] and CF[H,E2]
                                                         are both negative
            = (CF[H,E1] + CF[H,E2]) / (1 - Min{|CF[H,E1]|, |CF[H,E2]|})   otherwise

MB[H1&H2,E] = Min{MB[H1,E], MB[H2,E]}
MD[H1&H2,E] = Max{MD[H1,E], MD[H2,E]}
MB[H1vH2,E] = Max{MB[H1,E], MB[H2,E]}
MD[H1vH2,E] = Min{MD[H1,E], MD[H2,E]}
If we have a chain where evidence E supports hypothesis H1, which in turn supports
hypothesis H2, then

MB[H2,E] = MB[E] * CF[H1,E] * CF[H2,H1]

In a long chain the belief in the conclusion goes rapidly to 0.
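The parallel-combination rules above are easy to exercise in a few lines of Python; this sketch reproduces the 0.5, 0.75, 0.875 example and shows the mixed-sign CF case (the values are illustrative).

def combine_mb(mb1, mb2, md_combined=0.0):
    """MB[H,E1&E2], with the special case MD[H,E1&E2] = 1 forcing the result to 0."""
    return 0.0 if md_combined == 1 else mb1 + mb2 * (1 - mb1)

def combine_cf(cf1, cf2):
    """CF[H,E1&E2] for the three sign cases."""
    if cf1 > 0 and cf2 > 0:
        return cf1 + cf2 - cf1 * cf2
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 + cf1 * cf2
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

mb = 0.0
for evidence_mb in (0.5, 0.5, 0.5):
    mb = combine_mb(mb, evidence_mb)
print(mb)                      # 0.875, as in the example above
print(combine_cf(0.6, -0.4))   # mixed signs: (0.6 - 0.4) / (1 - 0.4) = 0.333...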
Why Certainty Factors work in Mycin
1. In Mycin we find short deduction chains
2. In Mycin the premises of rules are not too complex
3. In Mycin people have been careful to choose rules where the hypotheses are
mutually exclusive and exhaustive [typically, just finding a distinct value for an
attribute]
4. Experimentally it has been found that in Mycin the behavior is not substantially
affected by small changes in the values of CF, MB, MD. This is a strictly
pragmatic viewpoint: It is good because it works.
Certainty Factors in the undergraduate AI course
- It is easy to acquire a sense for the behavior of certainty factors and evaluate their
value by using "freeware" like CLIPS.
- It is a simple method that has been found to work well in some circumstances.
Unfortunately, the recognition usually comes after the fact, i.e. after the use has been
successful; it is not easily obtained in the design phase.
- Overall, it is a very minor topic for the course, best dealt with when presenting the
Expert System shell available to the course.
INFLUENCE DIAGRAMS
A DECISION-BASED APPROACH
SEE: http://www.icbemp.gov/spatial/lee_monitor/decision.html
Influence Diagrams: http://www.usbr.gov/guide/toolbox/influenc.htm
Introduction to Influence Diagrams: http://www.hugin.dk/hugintro/id_pane.html
Proceedings, Conference on Influence Diagrams for Decision: http://singapore.cs.ucla.edu/biblio.html
Preface
Monitoring has become a dominant theme among environmental scientists, land
managers, and policy makers alike. The number of publications and plans which
propose to do much the same, namely detect and identify system state and change,
continues to multiply, each suggesting alternative approaches and solutions. Despite
considerable effort by various institutions and individuals, effective environmental
monitoring remains an unanswered challenge. This is particularly the case for large-scale,
agency-led projects such as the Interior Columbia Basin Ecosystem Management Project
(ICBEMP), the Northwest Forest Plan (NWFP), and the Sierra Nevada Framework for
Conservation and Collaboration (SNFCC).
In the following report, we begin a dialogue about an appropriate conceptual framework
for organizing and developing a monitoring plan for broad-scale ecosystem management
efforts. We were asked to prepare this report for the group drafting a monitoring charter
for the ICBEMP. Because much effort is being invested in preparations for monitoring
within the NWFP and in the Sierra Nevada, it seems logical to look also at these efforts.
Our general impression is that the monitoring plans that are currently being developed for
broad-scale ecosystem management efforts, while they may be statistically sound, often
lack an integrated strategy that allows one to easily see why certain information is
important and how such information might influence future decisions and investments.
Thus, we believe that our comments offered here apply equally well outside the
ICBEMP.
We also have decided to try a different approach to communication—using a hypertext
approach instead of the traditional written report. Our purpose here is twofold. First, our
view of monitoring embedded in a decision analysis framework relies on the synthesis of
ideas ranging from ecological theory, to statistics, to decision analysis, to economics.
Understanding the framework requires at least a cursory understanding of all of these
ideas; operationalizing the framework will require in-depth understanding. Our intent is
to provide links from the main body of the document to supplemental material that will
provide greater detail and additional examples. [Few such links exist as of 10/26/98.] The
second reason for the hypertext format is that we expect this to be a dynamic document
that will undergo revision and expansion as the dialogue among scientists and managers
regarding monitoring in the ICBEMP proceeds. Having a centrally accessible, electronic
document that reflects that evolving dialogue should foster informed discussion.
Influence diagrams provide an attractive graphical scheme for explicitly codifying
conditional independence between critical probabilistic variables as justified by the
expert's knowledge and statistical data. The salient information in the diagram is, in fact,
not which variables influence each other, but rather, which ones do not influence each
other given the conditioning information. Thus influence is defined by its dual concept
"lack of influence".