TAM002
Data mining with GUHA – Part 1
Does my data contain something interesting?
Esko Turunen, Tampere University of Technology
Esko Turunen
http://www.vrtuosi.com
The aim of data mining is to answer the question 'Does
my data contain something interesting?'
In this chapter we introduce the following basic concepts:
• knowledge discovery in databases and data mining
• data, typical data mining tasks and data mining tasks outputs
• GUHA and B–course: two dissimilar approaches to data
mining
We also get acquainted with real-life data collected in
Indonesia. We use this data to illustrate issues throughout
this course.
These push buttons ♣ open short verbal comments that might
be useful.
♣ Knowledge discovery in databases (KDD) was initially
defined as the ’non–trivial extraction of implicit, previously
unknown, and potentially useful information from data’ [1]. A
revised version of this definition states that ’KDD is the
non–trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data’ [2] .
According to this definition, data mining is the step in the KDD
process concerned with applying computational techniques
(i.e., data mining algorithms implemented as computer
programs) to actually find patterns in the data.
In a sense, data mining is the central step in the KDD process.
The other steps in the KDD process are concerned with
preparing data for data mining, as well as evaluating the
discovered patterns, the results of data mining.
♣ I Data: The input to a data mining algorithm is most
commonly a single flat table comprising a number of fields
(columns) and records (rows). In general, each row represents
an object and the columns represent properties of objects.
II Typical data mining tasks
• One task is to predict the value of one field (the class) from
the other fields. If the class is continuous, the task is called
regression; if the class is discrete, it is called classification.
• Clustering is concerned with grouping objects into classes of
similar objects. A cluster is a collection of objects that are
similar to each other and are dissimilar to objects in other
clusters.
• Association analysis is the discovery of association rules.
Association rules specify correlation between frequent item
sets.
• Data characterization sums up the general characteristics or
features of the target class of data: this class is typically
collected by a database query.
• Outlier detection is concerned with finding data objects that
do not fit the general behavior or model of the data: these are
called outliers.
• Evolution analysis describes and models regularities or
trends for objects whose behavior changes over time.
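The first two tasks above can be sketched in a few lines of code. The snippet below is a minimal, illustrative sketch (toy data and invented field names, not part of the course material): a least-squares fit for regression and a 1-nearest-neighbour rule for classification.

```python
# Regression: fit TotalSpent = a * Age + b by ordinary least squares
# on invented toy data.
ages = [20, 30, 40, 50]
spent = [200, 300, 400, 500]
n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(spent) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, spent)) \
    / sum((x - mean_x) ** 2 for x in ages)
b = mean_y - a * mean_x
print(a, b)  # slope and intercept of the fitted line

# Classification: predict a discrete class with a 1-nearest-neighbour
# rule on (age, income) records.
train = [((25, 40_000), "NotBigSpender"), ((60, 120_000), "BigSpender")]

def classify(point):
    # pick the label of the closest training record (squared Euclidean
    # distance is enough for choosing the minimum)
    return min(train, key=lambda r: sum((p - q) ** 2
                                        for p, q in zip(point, r[0])))[1]

print(classify((58, 110_000)))
```

Real applications would of course use a library implementation; the point here is only the distinction between predicting a continuous value and predicting a discrete class.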
III Outputs of data mining procedures can be
• Equations e.g. TotalSpent = 189.5275 × Age + 7146[$]
• Predictive rules e.g. IF income is ≤ 100.000[$] and Gender =
Male THEN Not a Big Spender
• Association rules e.g.
{Gender = Female, Age ≥ 52} ⇒ {Big Spender = Yes}
• Probabilistic models e.g. Bayesian networks
• Distance and similarity measures, decision trees
• Many others
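To make the association-rule output concrete, the sketch below evaluates a rule of the form shown above on a handful of invented records, using the two standard measures: support (fraction of all records satisfying both sides) and confidence (fraction of antecedent records also satisfying the consequent).

```python
# Checking a rule like {Gender = Female, Age >= 52} => {Big Spender = Yes}
# on a toy record set (values invented for illustration).
records = [
    {"gender": "F", "age": 55, "big_spender": True},
    {"gender": "F", "age": 60, "big_spender": True},
    {"gender": "F", "age": 30, "big_spender": False},
    {"gender": "M", "age": 58, "big_spender": False},
]
antecedent = [r for r in records if r["gender"] == "F" and r["age"] >= 52]
both = [r for r in antecedent if r["big_spender"]]
support = len(both) / len(records)        # fraction satisfying both sides
confidence = len(both) / len(antecedent)  # P(consequent | antecedent)
print(support, confidence)
```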
♣ Our aim is to study in detail a particular data mining method
called GUHA, whose principle was formulated in a paper by Hájek,
Havel and Chytil as early as 1966 [3]. GUHA is the acronym for
General Unary Hypotheses Automaton. We also use its computer
implementation, LISpMiner, developed at the Prague University of
Economics by Jan Rauch and Milan Šimunek.
LISpMiner is freely downloadable from http://lispminer.vse.cz/ .
The GUHA approach is suitable e.g. for association analysis,
classification, clustering and outlier detection tasks.
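GUHA evaluates a candidate rule φ ⇒ ψ via its four-fold contingency table: a (rows satisfying φ and ψ), b (φ but not ψ), c (ψ but not φ) and d (neither). A quantifier then decides whether the rule is accepted. The sketch below shows the founded-implication quantifier in this style; the parameter names p and Base follow common GUHA usage, but the numeric thresholds and example counts are invented for illustration.

```python
# Founded implication accepts phi => psi when the confidence a/(a+b)
# reaches the threshold p and the support a reaches the threshold Base.
def founded_implication(a, b, p=0.9, base=50):
    return a >= base and a / (a + b) >= p

print(founded_implication(120, 10))  # confidence 120/130, support 120
print(founded_implication(40, 1))    # high confidence, but support below Base
```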
♣ We start by introducing real-life data that will serve as a
benchmark test set throughout the whole course. To show how
GUHA differs from a Bayesian approach, we also take a quick
look at B–course, see http://b-course.cs.helsinki.fi/obc/.
The data we use is Tjen-Sien Lim’s publicly available data set
from the 1987 National Indonesia Contraceptive Prevalence
Survey. These are the responses from interviews of m = 1473
married women who were not pregnant at the time of interview.
The challenge is to learn to predict a woman’s contraceptive
method from knowledge about her demographic and
socio-economic characteristics. The 10 survey response
variables and their types are
Age: integer 16–49
Education: 4 categories
Husband's education: 4 categories
Number of children borne: integer 0–15
Islamic: binary (yes/no)
Working: binary (yes/no)
Husband's occupation: 4 categories
Standard of living: 4 categories
Good media exposure: binary (yes/no)
Contraceptive method used: 3 categories (None, Long-term, Short-term)
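If the survey data is available as a comma-separated file in the layout of the UCI repository's contraceptive-method data (10 integer-coded columns, one row per respondent), it can be read as below. This is a sketch under that assumption; the column names are taken from the slide, not from the file itself, and the file name is hypothetical.

```python
import csv

# Column names follow the variable list above (invented identifiers).
COLUMNS = ["age", "education", "husband_education", "children",
           "islamic", "working", "husband_occupation",
           "living_standard", "media_exposure", "contraceptive"]

def load_rows(fileobj):
    """Parse rows into dicts, converting every field to int
    (categories and yes/no answers are assumed integer-coded)."""
    return [dict(zip(COLUMNS, map(int, row)))
            for row in csv.reader(fileobj) if row]

# with open("cmc.data") as f:   # hypothetical local copy of the data
#     rows = load_rows(f)
```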
[Slides with screenshots of a B–course analysis, http://b-course.cs.helsinki.fi/obc/, omitted.]
Speculating about causalities
Remember that dependencies are not necessarily
causalities. However, the theory of inferred causation
makes it possible to speculate about the causalities
that have caused the dependencies of the model.
There are two different speculations (called the naive
model and the not-so-naive model) which are based on
different background assumptions.
How to read a naive causal model?
Naive causal models are easy to read, but they are built on an assumption that is often
unrealistic, namely that there are no latent (unmeasured) variables in the domain that cause the
dependencies between variables. A simple example of a situation where this assumption is violated
comes from Finland, where the cold winter covers lakes and the sea with ice. Because of that, most
drowning accidents happen in summertime. The warm summer also makes people eat much more
ice cream than in wintertime. If you measure both the number of drowning accidents and the
ice-cream consumption, but do not include a variable indicating the season, there is a clear dependency
between ice-cream consumption and drowning. Evidently this dependency is not causal (ice cream
does not cause drowning, or the other way round), but is due to the excluded variable summer (technically
this is called confounding). Naive causal models are built on the assumption that there is no
confounding.
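The ice-cream example is easy to reproduce in a short simulation. The sketch below (invented effect sizes, purely illustrative) lets the season drive both variables and then measures their correlation without conditioning on the season: a strong dependency appears although neither variable causes the other.

```python
import random

# The season (summer or not) drives both ice-cream consumption and
# the number of drowning accidents; the two never influence each other.
random.seed(0)
summer = [random.random() < 0.5 for _ in range(10_000)]
ice_cream = [10 + (8 if s else 0) + random.gauss(0, 1) for s in summer]
drownings = [2 + (6 if s else 0) + random.gauss(0, 1) for s in summer]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Strong correlation despite there being no causal link between the two:
print(corr(ice_cream, drownings))
```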
In naive causal models there may be two kinds of connections between variables: undirected arcs and
directed arcs. Directed arcs denote a causal influence from cause to effect, and undirected arcs
denote a causal influence whose direction cannot be automatically inferred from the data.
You can also read a naive causal model as representing the set of dependency models sharing the
same directed arcs. Unfortunately, this does not give you the freedom to re-orient the undirected arcs
any way you want. You are free to re-orient the undirected arcs as long as re-orienting them does not
create new V-structures in the graph. A V-structure is a configuration of three variables A, B, C such that there
is a directed arc from A to B and a directed arc from C to B, but there is no arc (neither directed
nor undirected) between A and C.
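The V-structure condition is simple enough to express directly in code. The sketch below (a hypothetical helper, not part of B-course) represents directed arcs as ordered pairs and undirected arcs as unordered pairs, and checks the definition just given.

```python
# A, B, C form a V-structure when A -> B and C -> B are directed arcs
# and there is no arc at all (directed or undirected) between A and C.
def is_v_structure(a, b, c, directed, undirected):
    """directed: set of (cause, effect) pairs;
    undirected: set of frozensets {x, y}."""
    return ((a, b) in directed and (c, b) in directed
            and (a, c) not in directed and (c, a) not in directed
            and frozenset((a, c)) not in undirected)

directed = {("A", "B"), ("C", "B")}
print(is_v_structure("A", "B", "C", directed, set()))
# Adding any arc between A and C destroys the V-structure:
print(is_v_structure("A", "B", "C", directed, {frozenset(("A", "C"))}))
```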
How to read a causal graph produced by B-course?
Causal models are not difficult to read once you learn the difference between the different kinds of
arcs. Arcs are drawn with two kinds of lines, solid and dashed. Solid lines indicate relations
that can be determined from the data; dashed lines are used when we know that there is a
dependency but are not sure about its exact nature. The list below gives the different types of
arcs that can be found in causal models.

Solid arc from A to B: A has a direct causal influence on B (direct meaning that the causal
influence is not mediated by any other variable included in the study).
Dashed arc from A to B: there are two possibilities, but we do not know which holds. Either A
is a cause of B, or there is a latent cause for both A and B.
Dashed line without arrow heads between A and B: there is a dependency, but we do not know
whether A causes B, B causes A, or a latent cause of both explains the dependency (confounding).
♣
[1] W. Frawley, G. Piatetsky-Shapiro and C. Matheus:
Knowledge Discovery in Databases: An Overview. In:
Knowledge Discovery in Databases, eds. G.
Piatetsky-Shapiro and W. Frawley (1991) 1–27. Cambridge,
Mass.: AAAI Press / The MIT Press.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R.
Uthurusamy: Advances in Knowledge Discovery and Data
Mining. AAAI/MIT Press (1996).
[3] I. Havel, M. Chytil and P. Hájek: The GUHA Method of
Automatic Hypotheses Determination. Computing, Vol. 1
(1966) 293–308.