TAM002
Data mining with GUHA – Part 2
GUHA produces hypotheses
Esko Turunen
Tampere University of Technology
http://www.vrtuosi.com
GUHA does not test hypotheses; instead, GUHA produces them
In this chapter we learn the outlines of GUHA, e.g. that
• GUHA is suitable for exploratory analysis of large data
• GUHA goes through 2 × 2 contingency tables and picks out
the interesting ones
• in logic terms, GUHA is based on first order monadic logic
whose models are finite
• a fundamental reference, the ‘GUHA book’, is free to be
downloaded from www.cs.cas.cz/hajek/guhabook/
♣ 1. GUHA (General Unary Hypotheses Automaton) is a
method of automatic generation of hypotheses based on
empirical data, thus a method of data mining.
• GUHA is one of the oldest methods of data mining – GUHA
was introduced by Hájek, Havel and Chytil in The GUHA
method of automatic hypotheses determination published in
Computing 1 (1966) 293–308.
• GUHA is still being developed: dozens of new features have been
added to GUHA during the past ten years.
• GUHA is a kind of automated exploratory data analysis.
Instead of testing a given hypothesis against the data, a GUHA
procedure systematically generates hypotheses supported by
the data. With GUHA it is possible to find all interesting
dependencies hidden in the data.
♣ 2. GUHA is suitable for exploratory analysis of large data.
• The processed data form a rectangular matrix, where rows
correspond to objects belonging to the sample and each
column corresponds to one investigated variable. A typical data
matrix processed by GUHA has hundreds or thousands of rows
and tens of columns.
• Cells of the analyzed data matrix can contain arbitrary
symbols; however, before a concrete data mining task the data
must be categorized so that each cell contains only 0 or 1 (or is empty).
• Exploratory analysis means that there is no single specific
hypothesis that should be tested by our data; rather, our aim is
to get orientation in the domain of investigation, analyze the
behavior of chosen variables, interactions among them etc.
Such inquiry is not blind but directed by some general (possibly
vague) direction of research (some general problem).
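The categorization step mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the preprocessing used by any GUHA implementation: the raw fields (temperature, pressure), the cut-off values and the helper categorize are hypothetical names chosen for the example, and an empty cell is represented by None.

```python
def categorize(records, predicates):
    """Turn raw records into a 0/1 data matrix (a list of rows).

    records    -- list of dicts holding raw, uncategorized values
    predicates -- dict: column name -> (raw field, test function)
    A missing raw value (None) yields an empty cell (None).
    """
    matrix = []
    for record in records:
        row = {}
        for name, (field, test) in predicates.items():
            value = record.get(field)
            row[name] = None if value is None else int(test(value))
        matrix.append(row)
    return matrix


# Hypothetical raw data: three patients, one temperature is missing.
records = [
    {"temperature": 38.5, "pressure": "high"},
    {"temperature": 36.9, "pressure": "normal"},
    {"temperature": None, "pressure": "high"},
]

# Hypothetical categorization rules, one per 0/1 column.
predicates = {
    "TEMPERATURE>=38": ("temperature", lambda t: t >= 38.0),
    "PRESSURE:high": ("pressure", lambda p: p == "high"),
}

matrix = categorize(records, predicates)
for row in matrix:
    print(row)
```

Each raw variable may give rise to several 0/1 columns (one per category), which is why the rules are keyed by column name rather than by raw field.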
♣ 3. GUHA systematically creates all hypotheses interesting
from the point of view of a given general problem and on the
basis of the given data.
• This is the main principle: all interesting hypotheses. Clearly,
this contains a dilemma: all means as many as possible, whereas only
interesting means not too many. To cope with this dilemma,
one may choose among the different GUHA procedures implemented in
the LISp–Miner system and, having selected one, fix its numerous
parameters in various ways.
• The LISp–Miner system guides the user and makes the
selection of parameters relatively easy, though somewhat laborious.
LISp–Miner cannot be automated as much as e.g.
B–course; however, the results of LISp–Miner
are much more illustrative and detailed.
Three remarks:
• GUHA procedures produce polyfactorial hypotheses, i.e. not only
hypotheses relating one variable to another, but also hypotheses
expressing relations among single variables, pairs, triples,
quadruples of variables etc.
• GUHA offers hypotheses. The exploratory character implies that
the hypotheses produced by the computer (typically tens or
hundreds of them) are just
supported by the data, not verified. You are expected to use
this offer as inspiration, and possibly to select a few
hypotheses for further testing.
• GUHA is not suitable for testing a single hypothesis: routine
packages are good for this.
♣ 4. The 4ft–Miner procedure generates hypotheses (or
observational statements) on association between complex
Boolean formulae (attributes). These formulae are constructed
from unary predicates (corresponding to the columns of the
processed 0/1–data matrix) by logical connectives ∧, ∨, ¬
(conjunction, disjunction, negation).
Examples of predicates are
TEMPERATURE: ≥ 38 °C, PRESSURE: high, DIAGNOSIS: infection
Examples of formulae are
[TEMPERATURE: ≥ 38 °C] ∧ [PRESSURE: high] and [DIAGNOSIS: infection]
An example of a hypothesis is
[TEMPERATURE: ≥ 38 °C] ∧ [PRESSURE: high] ≈ [DIAGNOSIS: infection]
More generally, hypotheses are of the form φ ≈ ψ.
Notice that our terminology differs from the original GUHA
approach: we want to keep things as simple as possible!
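To make the construction of Boolean attributes concrete, here is a minimal Python sketch. It assumes that each object is a dict from predicate names to 0 or 1; the helper names atom, conj, disj, neg and the two sample patients are hypothetical and serve only as an illustration.

```python
def atom(name):
    """Unary predicate: true for a row whose column `name` holds 1."""
    return lambda row: row[name] == 1


def conj(*formulae):
    """Conjunction of formulae."""
    return lambda row: all(f(row) for f in formulae)


def disj(*formulae):
    """Disjunction of formulae."""
    return lambda row: any(f(row) for f in formulae)


def neg(formula):
    """Negation of a formula."""
    return lambda row: not formula(row)


fever = atom("TEMPERATURE>=38")
high_pressure = atom("PRESSURE:high")
infection = atom("DIAGNOSIS:infection")

# phi and psi of a hypothesis phi ≈ psi.
phi = conj(fever, high_pressure)
psi = infection

# Two hypothetical rows of a 0/1 data matrix.
patient_1 = {"TEMPERATURE>=38": 1, "PRESSURE:high": 1, "DIAGNOSIS:infection": 1}
patient_2 = {"TEMPERATURE>=38": 1, "PRESSURE:high": 0, "DIAGNOSIS:infection": 0}

print(phi(patient_1), psi(patient_1))
print(phi(patient_2), psi(patient_2))
```

Whether φ ≈ ψ holds is not decided row by row: it depends on how often φ and ψ hold together over the whole data matrix.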
Given the 0/1–data matrix, each pair of Boolean attributes φ, ψ
determines its four-fold frequency table; the association of φ with
ψ is defined by choosing an associational or generalized
quantifier, i.e. a function assigning to each four-fold table either
1 (associated, or true in the data) or 0 (not associated, or
false in the data) and satisfying some natural monotonicity
conditions.
♣ The four–fold table has the form:
         ψ            ¬ψ
φ        a            b            a + b = r
¬φ       c            d            c + d = s
         a + c = k    b + d = l    m
where a + b + c + d = m and
• a is the number of objects satisfying both φ and ψ,
• b is the number of objects satisfying φ but not ψ,
• c is the number of objects not satisfying φ but satisfying ψ,
• d is the number of objects satisfying neither φ nor ψ.
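The four frequencies can be counted directly from two 0/1 columns. The Python sketch below is a minimal illustration; the function name four_fold_table and the sample columns are hypothetical, and missing values are not handled.

```python
def four_fold_table(phi_values, psi_values):
    """Return (a, b, c, d) for two equally long 0/1 columns."""
    if len(phi_values) != len(psi_values):
        raise ValueError("columns must have the same length")
    a = b = c = d = 0
    for phi, psi in zip(phi_values, psi_values):
        if phi and psi:
            a += 1      # phi and psi
        elif phi:
            b += 1      # phi but not psi
        elif psi:
            c += 1      # psi but not phi
        else:
            d += 1      # neither phi nor psi
    return a, b, c, d


# Hypothetical values of phi and psi on m = 8 objects.
phi_values = [1, 1, 1, 0, 0, 1, 0, 0]
psi_values = [1, 0, 1, 1, 0, 1, 0, 0]

a, b, c, d = four_fold_table(phi_values, psi_values)
print(a, b, c, d)           # 3 1 1 3
print(a + b + c + d)        # 8, the number of objects m
```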
♣ 6. There are various types of generalized quantifiers
formalizing various kinds of associations:
• Implicational quantifiers formalize the association ‘many φ are
ψ’; they do not depend on the values c, d.
• Comparative quantifiers formalize the association ‘φ makes ψ
more likely than ¬φ does’.
• Some quantifiers just express observations on the data, some
others serve as tests of statistical hypotheses on unknown
probabilities.
• Some quantifiers are symmetric: φ ≈ ψ implies ψ ≈ φ; some
admit negation: φ ≈ ψ implies ¬φ ≈ ¬ψ.
The 4ft–Miner procedure contains a dozen generalized quantifiers;
a novel procedure, Ac4ft–Miner, also offers action quantifiers. An
advantage of GUHA is that new quantifiers can be defined and
their properties can be analyzed in a well-established logical
framework.
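As an illustration of quantifiers as 0/1-valued functions of the four-fold table, the sketch below implements two classical examples from the GUHA literature: founded implication (true when a/(a+b) ≥ p and a ≥ Base) and simple association (true when ad > bc). The Python function names, default parameter values and sample tables are assumptions made for this example; the parameterization in LISp–Miner may differ.

```python
def founded_implication(a, b, c, d, p=0.9, base=1):
    """Return 1 if at least a fraction p of the phi-objects satisfy psi
    and there are at least `base` objects satisfying both, else 0.

    Implicational: the result does not depend on c or d.
    """
    return int(a >= base and a + b > 0 and a / (a + b) >= p)


def simple_association(a, b, c, d):
    """Return 1 if a*d > b*c, i.e. phi and psi occur together more often
    than they would if they were independent in the data, else 0.

    Symmetric (swapping phi and psi swaps b and c) and admits negation
    (negating both phi and psi swaps a with d and b with c).
    """
    return int(a * d > b * c)


# A hypothetical four-fold table with a = 7, b = 5, c = 0, d = 3.
print(founded_implication(7, 5, 0, 3, p=0.5, base=5))   # 7/12 >= 0.5
print(founded_implication(7, 5, 0, 3, p=0.9, base=5))   # 7/12 <  0.9
print(simple_association(7, 5, 0, 3))                   # 21 > 0
```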
♣ 7. After converting the original data matrix to 0/1 format, the
user can start the 4ft–Miner procedure. The input of such a
procedure consists of
(1) the 0/1–data matrix
(2) parameters determining syntactic restrictions on the pairs φ, ψ
of Boolean attributes to be generated; φ is called the antecedent
and ψ is called the succedent
(3) the quantifier to be used and some parameters of the
quantifier
(4) some other settings: for example, the user has to
declare the predicates that can occur in the antecedent and the
succedent, the logical connectives allowed, the minimal and maximal
length of the antecedent and succedent, the way missing
data are processed, etc.
8. 4ft–Miner produces all associations φ ≈ ψ satisfying the
syntactic restrictions and true in the data. The generation is not
done blindly but uses various techniques serving to avoid
exhaustive search. The found associations together with
various parameters are not mechanically printed but saved in a
solution file for further processing.
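The overall flow (syntactic restrictions, a quantifier, and a list of associations true in the data) can be sketched as below. This is a deliberately naive illustration, not the 4ft–Miner algorithm: it enumerates candidates exhaustively, whereas 4ft–Miner uses techniques to avoid that. It also restricts antecedents to conjunctions and succedents to single predicates, uses founded implication as the quantifier, and ignores missing data. All function names, parameter names and data are hypothetical.

```python
from itertools import combinations


def four_fold(rows, antecedent, succedent):
    """Four-fold table (a, b, c, d) of a conjunction versus one predicate."""
    a = b = c = d = 0
    for row in rows:
        phi = all(row[name] == 1 for name in antecedent)
        psi = row[succedent] == 1
        if phi and psi:
            a += 1
        elif phi:
            b += 1
        elif psi:
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(table, p, base):
    """True if a/(a+b) >= p and a >= base."""
    a, b, _, _ = table
    return a >= base and a + b > 0 and a / (a + b) >= p


def mine(rows, antecedent_names, succedent_names, max_len, p, base):
    """Collect every (antecedent, succedent, table) true in the data."""
    found = []
    for length in range(1, max_len + 1):
        for antecedent in combinations(antecedent_names, length):
            for succedent in succedent_names:
                if succedent in antecedent:
                    continue
                table = four_fold(rows, antecedent, succedent)
                if founded_implication(table, p, base):
                    found.append((antecedent, succedent, table))
    return found


# Hypothetical 0/1 data matrix with six objects.
rows = [
    {"fever": 1, "high_pressure": 1, "infection": 1},
    {"fever": 1, "high_pressure": 1, "infection": 1},
    {"fever": 1, "high_pressure": 0, "infection": 0},
    {"fever": 0, "high_pressure": 1, "infection": 0},
    {"fever": 0, "high_pressure": 0, "infection": 0},
    {"fever": 1, "high_pressure": 1, "infection": 0},
]

results = mine(rows, ["fever", "high_pressure"], ["infection"],
               max_len=2, p=0.6, base=2)
for antecedent, succedent, table in results:
    print(" ∧ ".join(antecedent), "≈", succedent, table)
```

In this toy data neither predicate alone passes the quantifier (both reach only 2/4), while their conjunction does (2/3), which shows why hypotheses relating several variables at once can be worth generating.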
9. The 4ft–Result module for interpretation of results enables the
user to browse the associations, sort them according to
various criteria, select reasonably defined subsets and output
concise information of various kinds.
There are, however, other procedures implemented in the
LISp–Miner system that mine for other kinds of patterns,
even more complex ones than association rules.
10. The GUHA method has deep logical and statistical
foundations. GUHA is being further developed at the Institute of
Computer Science of the Academy of Sciences of the Czech
Republic (Petr Hájek and his group), at the Prague University of
Economics (Jan Rauch and his group) and at Tampere
University of Technology (Esko Turunen and his group).
Example
Assume we are observing children who have an allergic
reaction to, say, tomato, apple, orange, cheese or milk. These
observations are presented in the following table:
Child    Tomato   Apple   Orange   Cheese   Milk
Anna     1        1       0        1        1
Aina     1        1       1        0        0
Naima    1        1       1        1        1
Rauha    0        1       1        0        1
Kai      0        1       0        1        1
Kille    1        1       0        0        1
Lempi    0        1       1        1        1
Ville    1        0       0        0        0
Ulle     1        1       0        1        1
Dulle    1        0       1        0        0
Dof      1        0       1        0        1
Kinge    0        1       1        0        1
Laade    0        1       0        1        1
Koff     1        1       0        0        1
Olvi     0        1       1        1        1
Thus, we have observations such as ‘Child x is allergic to milk’ and
‘Child y is allergic to cheese’. We write these in short as Milk(x) and
Cheese(y).
• Milk(-), Cheese(-), Tomato(-), Orange(-) and Apple(-) are
unary predicates of our observational language and x, y, z, ...
are variables.
• Expressions like Milk(x) or Cheese(y) are atomic (open)
formulae. Formulae are combined by the logical connectives ¬ (not), ∧
(and), ∨ (or). E.g. Milk(x) ∧ ¬Cheese(x) means
‘Child x is allergic to milk and is not allergic to cheese’.
• However, instead of open formulae, we are more interested
in universal closed formulae, e.g. ‘All children are allergic to
milk’, ‘Most children are not allergic to orange’, ‘There is a child
allergic to tomato’, ‘In most cases, if a child is allergic to milk then
she/he is allergic to cheese, too’.
Everyone who has passed a basic course in logic knows that
statements like ‘All children are allergic to milk’ and ‘There is a
child allergic to tomato’ are expressible as ∀x Milk(x) and
∃x Tomato(x).
Unfortunately, the classical mathematical quantifiers ∀ and ∃ are
rather useless in the real world. Much more valuable are
generalized quantifiers like ‘In most cases’, e.g. in the statement
‘In most cases, if a child is allergic to milk then she/he is allergic
to cheese, too’.
The GUHA method is a logical formalism of generalized quantifiers for
data mining purposes.
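The difference between the classical quantifiers and a generalized one can be checked on the allergy data of the example. The Python sketch below evaluates ∀x Milk(x) and ∃x Tomato(x), and reads ‘In most cases’ as: more than a fraction p of the objects satisfying the first formula also satisfy the second. That reading, the threshold p and all function names are assumptions made for illustration, not a definition taken from the GUHA method.

```python
COLUMNS = ("Tomato", "Apple", "Orange", "Cheese", "Milk")

# The allergy table of the example, one row per child.
TABLE = {
    "Anna":  (1, 1, 0, 1, 1),
    "Aina":  (1, 1, 1, 0, 0),
    "Naima": (1, 1, 1, 1, 1),
    "Rauha": (0, 1, 1, 0, 1),
    "Kai":   (0, 1, 0, 1, 1),
    "Kille": (1, 1, 0, 0, 1),
    "Lempi": (0, 1, 1, 1, 1),
    "Ville": (1, 0, 0, 0, 0),
    "Ulle":  (1, 1, 0, 1, 1),
    "Dulle": (1, 0, 1, 0, 0),
    "Dof":   (1, 0, 1, 0, 1),
    "Kinge": (0, 1, 1, 0, 1),
    "Laade": (0, 1, 0, 1, 1),
    "Koff":  (1, 1, 0, 0, 1),
    "Olvi":  (0, 1, 1, 1, 1),
}


def column(name):
    """Values of one unary predicate over all children."""
    index = COLUMNS.index(name)
    return [row[index] for row in TABLE.values()]


def in_most_cases(phi_values, psi_values, p=0.5):
    """Assumed reading of 'in most cases phi implies psi': a/(a+b) > p."""
    a = sum(1 for phi, psi in zip(phi_values, psi_values) if phi and psi)
    b = sum(1 for phi, psi in zip(phi_values, psi_values) if phi and not psi)
    return a + b > 0 and a / (a + b) > p


milk, cheese, tomato = column("Milk"), column("Cheese"), column("Tomato")

all_allergic_to_milk = all(milk)                        # ∀x Milk(x)
some_allergic_to_tomato = any(tomato)                   # ∃x Tomato(x)
milk_then_cheese = in_most_cases(milk, cheese)          # a = 7, b = 5
cheese_then_milk = in_most_cases(cheese, milk, p=0.9)   # a = 7, b = 0

print(all_allergic_to_milk, some_allergic_to_tomato)
print(milk_then_cheese, cheese_then_milk)
```

On this table ∀x Milk(x) fails because of three children, even though 12 of 15 are allergic to milk; the generalized quantifier still reports that 7 of the 12 milk-allergic children are allergic to cheese as well.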