Download FROM DATA MINING TO KNOWLEDGE MINING: SYMBOLIC DATA

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Principal component analysis wikipedia , lookup

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
FROM DATA MINING TO
KNOWLEDGE MINING:
SYMBOLIC DATA ANALYSIS
AND THE SODAS SOFTWARE
E. Diday
University of Paris IX Dauphine
and INRIA
AIM
FROM
HUDGE DATA IN AN ECONOMIC WAY
-Extract new knowledge
-Summarize
-Concatenate
-Solve confidentiality
-Explain correlation
HOW? By working on HIGHER LEVEL UNITS as
“classes”, “categories” or ”concepts“ necessary described by
more complex data extending Data Mining to Knowledge
Mining.
1
OUTLINE
1) THE MAIN IDEA:
FIRST AND SECOND ORDER OBJECTS.
2)THE INPUT OF A SYMBOLIC DATA ANALYSIS:
SYMBOLIC DATA TABLE.
3) MAIN SOURCES OF SYMBOLIC DATA: FROM
DATA BASES, FROM CATEGORICAL VARIABLES.
4) MAIN OUTPUT OF SYMBOLIC DATA ANALYSIS
ALGORITHMS: SYMBOLIC DESCRIPTIONS AND
SYMBOLIC OBJECTS.
5) THE MAIN STEPS OF A SDA.
6) SOME TOOLS OF SYMBOLIC DATA ANALYSIS
7) SYNTHETICAL VIEW OF THE SODAS SOFTWARE
THE M AIN IDEA:
FIRST AND SECOND ORDER OBJECTS
THE ARISTOTLE ORGANON (IV B.C.) CLEARLY
DISTINGUISHES "FIRST ORDER OBJECTS" (AS THIS
HORSE OR THIS MAN) CONSIDERED AS A UNIT
DESCRIBING AN INDIVIDUAL OF THE W ORLD ,
FROM "SECOND ORDER OBJECTS" (AS A HORSE
OR A MAN) ALSO TAKEN AS A UNIT DESCRIBING
A CLASS OF INDIVIDUALS.
2
Class, Category, Concept
• A CLASS is a set of units (birds of an island)
• A CATEGORY is a value of a categorical
variable (“swallows” is a value of the variable:
“species of bird”)
• A CONCEPT is defined by an intent and an
extent
• Intent: characteristic properties of the
swallows
• Extent: the set of swallows
•A concept is modeled by a “symbolic object”
From standard units to concepts
the statistic is not the same!
On an island there are 400 swallows, 100 ostrich, 100 penguins :
Standard Data Table
Symbolic Data Table
Oiseau
Species
Flying
Size (cm)
1
Penguin
No
80
2
swallows
Yes
70
600
Ostrich
No
125
Species
Flying
Color
Size
Migr
Swallows
yes
0.3b,0.7grey
[60, 85]
Yes
Ostrich
No
0.1noir,0.9g
[85, 160]
No
Penguin
No
0.5n,0.5gris
[70, 95]
Yes
The species is a concept , it becomes the new
statistical unit
The variation due to the individuals is
expressed by intervals or histograms
Birds (individuals)
The statistic of individuals is different
from the statistic of concepts!
400
Species
A « conceptual »
variable is added: it
applies to concepts
(concepts)
2
1
200
Flying
No flying
Flying
No flying
3
From Database to Concepts
Individuals
Columns: standard variables
Relational
Data Base
QUERY
Rows : individuals
Individuals description
Concepts Description
Columns: symbolic variables
Rows : concepts
Symbolic Data Table
Individuals versus Concepts
Classical observations
(Individuals)
Symbolic observations
(Concepts)
Town (Paris, Lyon, ...)
Region (North, South, ...)
Bird (this cardinal, ...)
Species (Cardinals, Wrens, ...)
Football Player (Zidane, ...)
Team (Spane, ...)
Photo (taken today, ...)
Types of Photos (Family, Mountains, ...)
Item Sold (hammer, phone, ...)
Types of Item (Electrical Goods, Auto,...)
WEB trace (Jones, temperature today,.)
WEB Usages (Medical Info, Weather, ...)
Subscription (to Time, ...)
Magazines
Reservation of John to Fly AF Paris Ljubjana
All Reservation to Fly AF Paris Ljubjana July 2006
Fly AF Paris Ljubjana N° 205 5 May 06
Fly AF Paris Ljubjana N° 205 in May 2006
Client
Socio professional category
Water consumer
Type of consuming
Geographical Position
Type of environment
4
FOUR PRINCIPLES
1) ONLY TWO LEVELS OF UNITS:
First level:
Individuals
Second level: concepts
2) THE CONCEPTS CAN BE CONSIDERED
AS NEW INDIVIDUALS OF HIGHER LEVEL.
3) A CONCEPT IS DESCRIBED BY USING THE
DESCRIPTION OF A CLASS OF INDIVIDUALS OF
ITS EXTENT.
4) THE DESCRIPTION OF A CONCEPT MUST
EXPRESSES THE VARIATION OF THE
INDIVIDUALS OF ITS EXTENT
FROM FIRST ORDER OBJECTS TO SECOND
ORDER OBJECTS IN OFFICIAL STATISTICS
Units
Classes
Case n°
Region
Bedroom
DiningLiving
Socio-Econ
Group
11401
NorthernMetropolitan
NorthernMetropolitan
NorthernMetropolitan
2
1
1
2
1
3
1
3
3
12315
East-Anglia
12316
East-Anglia
14524
Greater-London
1
2
1
3
2
2
3
1
3
11402
11403
Descr. Var. of the Units
5
FROM FIRST ORDER OBJECTS TO SECOND ORDER OBJECTS
IN OFFICIAL STATISTICS
Classes
Region
Descriptive variable of the units
Bedroom Dining-Liv Socio-Ec gr
2
1
1
NorthernMetropolitan
Northern2
Metropolitan
Northern1
Metropolitan
1
3
3
3
East-anglia
East-anglia
East-anglia
3
2
2
3
1
3
1
2
1
Classes
Region
Descriptive variables of the classes
Bedroom
Dining-Liv Socio-Ec gr
(2\3) 2, (1\3) 1 (2\3) 1, (1\3) 3 (1\3) 1, (2\3) 3
NorthernMetropolitan
East-anglia
(2\3) 1, (1\3) 2 (2\3) 2, (1\3) 3 (2\3) 3, (1\3) 1
Keeping rules after the generalization process
Individuals
Concepts
I1
C1
a
Y1
2
Y2
I2
C1
b
1
I3
C1
c
2
I4
C2
b
1
I5
C2
b
3
I6
C2
a
2
Initial classical data table where individuals are
described by three variables
Y1
Y2
C1
{a, b, c }
{1, 2}
C2
{a, b }
{1, 2, 3}
Induced Symbolic Data Table with background knowledge defined by two rules:
[Y1 = a] ⇒ [Y2 = 2] and [Y2 = 1] ⇒ [Y1 = b]
6
Keeping correlation after the generalization process
Player
Team
Age
Weight
Height
Nationality
Fernandez
Spain
29
85
1. 84
Spanish
____
World Cup
yes
PRodriguez
Spain
23
90
1.92
Brazilian
yes
Mballo
France
25
82
1.90
African
yes
Zedane
France
27
78
1.85
French
yes
_ ____
_ ___
___
____
____
_____
Renie
XX
23
91
1. 75
Spanish
no
Olar
XX
29
84
1.84
Brazilian
no
Mirce
XXX
24
83
1.83
African
no
Rumbum
XXX
30
81
1.81
French
no
____
Classical data table describing players by numerical and categorical variables.
World Cup
Age
Weight
Height
Cor(Weight, Height)
yes
[21, 26]
[78, 90]
[1.85, 1.98]
0.85
no
[23, 30]
[81, 92]
[1.75, 1.85]
0.65
Symbolic Data Table obtained by generalisation for the variables age, weight and height and keeping back
the correlations between the Weight and the Height: the correlation is higher for the world cup players.
Schweitzer (1984): "Distributions are the
numbers of the future”
MORE GENERALY, WHAT IS THE INPUT OF A
SYMBOLIC DATA ANALYSIS?
SYMBOLIC DATA TABLE
+ BACKGROUND KNOWLEDGE
SYMBOLIC DATA TABLE :
THE CELLS CAN CONTAIN:
- SEVERAL QUALITATIVE OR QUANTITATIVE
WEIGHTED VALUES
- INTERVALS
- HISTOGRAMS
7
SOME BACKGROUND KNOWLEDGE CAN BE GIVEN
VARIABLES CAN BE:
-TAXONOMIC: « A SOCIO-ECONOMIC GROUP IS
CONSIDERED TO BE "SELF-EMPLOYED" IF IT IS
"PROFESSIONAL SELF-EMPLOYED" OR "OWN ACCOUNT
NON-PROFESSIONAL".
-HIERARCHICALLY DEPENDENT :
THE VARIABLE: “DOES THE COMPANY HAS COMPUTERS? “
AND
THE VARIABLE: “ KIND OF COMPUTER” ARE
HIERARCHICALLY LINKED.
- RULES KEEPING LOGICAL DEPENDENCIES:
« IF AGE(W) IS LESS THAN 2 MONTHS THEN HEIGHT(W) IS
LESS THAN 10 ».
A TAXONOMIC VARIABLE
Compagnies
________
Region
Comp1
Paris
Comp2
London
Comp3
France
Comp4
England
Comp5
Glasgow
Comp6
Europe
_________
The taxonomic
Europe tree associated with the variable « region ».
United King Dom
Scotland
England
London
Glasgow
Paris
France
Region
Predecessor
Paris
France
London
England
France
Europe
England
Europe
Glasgow
Scotland
Europe
Europe
Database
relation
8
EXAMPLE OF SYMBOLIC DATA TABLE
PRODUCT
WEIGHT
TOWN
COLOUR
PRODUCT 1
[5, 9]
{Londres}
{0.1red, 0.9white}
PRODUCT 2
[ 3,8]
{Paris, Londres }
PRODUCT 3
PRODUCT 4
{ 0.3 red, 0.7 green}
[2,3]
THE CELLS CAN CONTAIN:
-SEVERAL QUALITATIVE OR
QUANTITATIVE WEIGHTED VALUES
-INTERVALS
- HISTOGRAMS
SYMBOLIC DATA
ANALYSIS: 3 STEPS
FIRST STEP: From individuals to categories
SECOND STEP: From Categories to Concepts described
by a Symbolic Data Table.
THIRD STEP: Extract new knowledge from this symbolic
data completed by some background knowledge.
CONSEQUENCE: need of extending Standard Statistics,
Exploratory Data Analysis, and Data Mining to Symbolic
Data Tables. This is the aim of the SODAS software.
9
From Standard Versus Symbolic Data and methods
THE DATA
Standard Data
• Numerical: Points de R (nbres réels)
• Categorical Ordinal : Points of N (naturel numbers)
• Categorical non ordered : categorical values
Symbolic Data
- Diagrams, Histograms ou Distributions
- Sequences of numbers or categories
- Sequences of weighted categories
- Functions, curves
- Rules
- Taxonomies
- Graphes
Methods
• Stat descriptive (Histos, Corrélations , biplots)
•Typologie (hiérarchies, pyramids, K-means, Nuées
dynamiques, Kohonen maps ,…)
•Mixture decoimposition
•Décision trees, boosting, baging, …
•Dissimilarities and their Représentation
•Rules inference and causal trees
• Visualisation (points)
•factorial Analysis (ACP, AFC, …)
•Classical Regression, PLS
•Neural Network, VSM (Vector Support Machine),
Etc.
• Galois Lattices (binary data)
Any standard method can be generalized
to concepts described by symbolic data
(PLUS)
+ Specific Methods of Symbolic Data
Analysis
Class description, Haussdorf dissimilarities,
symbolic prototypes, modeling concepts by
symbolic objects…
SOURCES OF SYMBOLIC DATA
.FROM CATEGORICAL VARIABLES:
- GIVEN (AS « TYPE OF EMPLOYMENT »)
- OBTAINED BY CLUSTERING.
.FROM DATA BASES: QUERY CREATING A
NEW CATEGORICAL VARIABLE: cartesian prod
.FROM EXPERT: NATIVE SYMBOLIC DATA:
Scenario of road accidents, species of insects
.FROM CONFIDENTIAL DATA
IN ORDER TO HIDE THE INITIAL DATA BY
LESS ACCURACY
10
.FROM STOCHASTIC DATA TABLE:
THE PROBABILITY DISTRIBUTION , THE HISTOGRAM THE PERCENTILES OR
THE RANGE OF ANY RANDOM VARIABLE ASSOCIATED TO EACH CELL OF A
DATA TABLE
EXAMPLE
Tom
M athem atics
Physics Litterature
XM
XP
XL
Paul
X M is the random variable which associates to each
exam of TO M his m ark in m athem atics.
From X M several kinds of sym bolic objects can be
defined by using in each cell: - Probability distr.
- H istogram s
- Inter-quartile intervals
.FROM TIME SERIES
- IN DESCRIBING INTERVALS OF TIME:
( the variation of the values each week)
- IN DESCRIBING A TIME SERIES BY THE
HISTOGRAM OF ITS VALUES.
100
80
60
40
20
0
N or d
E st
O u e st
Nord
E st
1er
tr im .
2e
tr im .
3e
tr im .
4e
tr im .
11
FROM FUZZY DATA TO SYMBOLIC DATA
Paul
Jef
Jim
Bill
height
1.60
1.85
0.65
1.95
weight
45
80
30
90
hair
yellow
yellow
black
black
From Numerical to Fuzzy Data
small
average high
1
Initial Data
Paul
Jef
Jim
Bill
small
0.70
0
0.50
0
height
average
0.30
0.50
0
0
0.5
weight hair
high
0
0.50
0
0.48
45
80
30
90
yellow
yellow
black
black
0
1.50
1.80
1.90
JEF
0.65 1.60
1.85
1.95
Fuzzy Data
small
{Paul, Jef }
{Jim, Bill}
[0, 0.70]
[0, 0.50]
height
average
weight
hair
high
[0.30, 0.50]
0
[0, 0.50] [45, 80] yellow
[0, 0.48] [30, 90] black
Symbolic Data
MAIN OUTPUT OF SYMBOLIC
DATA ANALYSIS ALGORITHMS:
SYMBOLIC DESCRIPTIONS
SYMBOLIC OBJECTS.
SYMBOLIC DESCRIPTIONS
Description AGE
D1
{12,20,28}
SPC
{employee,worker}
D2
{teacher,countryman}
[5, 33]
12
CONCEPTS ARE MODELED BY
SYMBOLIC OBJECTS
WHATS A CONCEPT?
A CONCEPT IS DEFINED BY AN
* INTENT : ITS CHARACTERISTIC
PROPERTIES
* EXTENT:THE SET OF INDIVIDUALS
WHICH SATISFY THESE PROPERTIES
LIKE OUR MIND, SYMBOLIC OBJECTS
MODEL CONCEPTS BY AN INTENT AND A
WAY OF COMPUTING THE EXTENT
SYMBOLIC OBJECT
It’s an animal(w) = 0.99 yes
dC
d
R
y
w
S = (a, R, dC)
a(w) = [y(w)RdC]
13
TWO KINDS OF SYMBOLIC OBJECTS
BOOLEAN SYMBOLIC OBJECTS
S = (a, R, d1)
d1= {12, 20 ,28} x {employee, worker}]
R = (⊆
⊆ , ⊆ ),
a(w) = [age(w) ⊆ {12, 20 ,28}] ∧ [SPC(w) ⊆
{employee, worker}]
a(w) ∈ {TRUE, FALSE}.
THE MEMBERSHIP
FUNCTION« a » MODAL CASE
S = (a, R, d):
a(w) = [age(w) R1 {(0.2)12, (0.8) [20 ,28]}] ∧
[SPC(w) R2 {(0.4)employee, (0.6)worker}]
a(w) ∈ [0,1].
First approach: simple or flexible matching
R= (R1, R2 ): r Ri q = ∑j=1 ,k r j q j e (r j - min (r j, q j)) .
Second approach:
Probabilistic: if dependencies, copulas,
derivation of the joint distribution,
transforming the joint density in [0,1].
14
EXTENT OF A SYMBOLIC OBJECT S:
BOOLEAN CASE:
EXT(s) = {W ∈ Ω / a(W) = TRUE}.
MODAL CASE
EXTα (S)= EXTENTα (a) = {W ∈ Ω / a( W ) ≥ α}.
THE LEARNING PROCESS OF SYMBOLIC OBJECTS
REAL WORLD
MODELED WORLD
DESCRIPTIONS
INDIVIDUALS
xw
x
x x
Ω
Ext(s/Ω
Ω)
x dw R
x
x
x x
T dC
SYMBOLIC
OBJECTS
CONCEPTS
x
x
x
x
s = (a, R,dC)
Learning symbolic objects modeling concepts by reducing two kinds of errors: the
individuals who are in the extent of the symbolic objects but not in the extent of the
concept, the individuals which are not in the extent of the symbolic object but are in
the extent of the concept.
15
THE FOUR CASES OF STATISTICAL
OR DATA MINING ANALYSIS.
Output Classical
Analysis
Input
Case 1
Classical
Data
Case 3
Symbolic
Data
Symbolic
Analysis
Case 2
Case 4
Comparing standard data versus symbolic data
WHY SYMBOLIC DATA ARE NOT STANDARD DATA?
Symbolic Data Table
Quality of Goal
Weight
Height
Nationality
Very Good
[80, 95]
[1.80, 1.95]
{0.7 Eur, 0.3 Afr}
Associated standard data
Quality of
Goal
Weight
Min
Weight
Max
Height
Min
Height
Max
Eur
Afr
Very Good
80
95
1.80
1.95
0. 7
0.3
Consequences:
. Initial variables are lost
. Variation are lost
16
WHY IT IS ENHANCING TO WORK ON SYMBOLIC DATA
THAN ON THEIR DECOMPOSITION IN STANDARD DATA
TABLE?
Town
Nb of pupils
Kind
Level
Paris
[320, 450]
(100%)Public
{1, 3}
Lyon
[200, 380]
(50%)Public , (50%)Private
{2, 3}
Toulouse
[210, 290]
(50%)Public , (50%)Private
{1, 2}
Town
Min
Nb of pupils
Max
Nb of pupils
Public
Private
Level 1
Level 2
Level 3
1
Paris
320
450
100
0
1
0
Lyon
200
380
50
50
0
1
1
Toulouse
210
290
50
50
1
1
0
LOST INFORMATION BY STANDARD ENCODING
OF SYMBOLIC DATA
Example 1: in decision tree
By standard encoding of symbolic data, the variable size
desapears as it is replaced by the « Size Min » and « Size max »
The symbolic approach allows the use of the variable « size »
itself and not the variables « Size Min » and « Size max »
Height
≥1.8O
Very good goal players
have generally more than
1,80m.
Very good
goal players
Explanatory Variable
< 1.80
Mean or bad
goal players
Here the Height is not a
minimum height or a
maximum height or a
mean, but the value of the
age which discriminate the
very good players from the
others.
17
Symbolic Histogram
It is not a minimum or a maximum histogram!
26
33
[
[
]
]
22
20
27
35
30
25
[
]
27
36
Sum of the portion of the
intervals
Sum of the portion of the
intervals
Lost of information by using standard coding of symbolic data in
Principal component analysis
By standard coding each concept is represented by a point
By symbolic coding
Each concept is represented by a surface which expresses the variation of the individuals of the extent of
the concept
Each concept can be described by a conjunction of properties in term of the new variables (the factors) .
Height, weight…
Very Weak
Height, weight…
Weak
Very weak
Mean
Very Good
x
x
Means
x
Very Good
Good
Good
x
x
Number of
goals
Number of
goals
…
Weak
Classical PCM
Symbolic PCM
18
AN OVERVIEW ON THE SODAS SOFTWARE
THE SODAS 2 SOFTWARE FROM ASSO
19
THE MAIN STEPS FOR A SYMBOLIC DATA
ANALYSIS IN SODAS
. PUT THE DATA IN A RELATIONAL DATA BASE
(Oracle, Acces, ...)
.DEFINE A CONTEXT BY GIVING
* The Units (Individuals,
Households,...)
* The Classes (Regions, Socioeconomics groups,...)
* The descriptive variables of the units
. BUILD A SYMBOLIC DATA TABLE WHERE
* The units are the preceding classes
* The descriptions of each class is obtained by a
generalization process applied to its members
APPLY
SYMBOLIC DATA ANALYSIS TOOLS
- Correlation, Mean, Mean Square
- Histogram of a symbolic variable
- Dissimilarities between symbolic descriptions
- Clustering of symbolic descriptions
- Principal component Analysis
- Decision Tree
- Graphical visualisation of Symbolic Objects
20
SODAS Software
Menu
Sds file
Methods
Report
Graphs
Graph
Chaining
Concatenation of summarized data
from several populations
SODAS
Join two or more sets of SO based on
different underlyingpopulations
household
DataBases
school
25 SO
25 SO
Concatenation
25 SO
21
DISSIMILARITY MEASURES
A straightforward extension of similarity measures for classical
data matrices with nominal variables.
Agreement
Disagreement
Total
Agreement
Disagreement
Total
α=µ
µ(Aj∩Bj)
χ=µ
µ(c(Aj)∩
∩Bj)
µ(Bj)
β=µ
µ(Aj∩c(Bj))
µ(Aj)
δ=µ
µ(c(Aj)∩
∩c(Bj)) µ(c(Aj))
µ(c(Bj))
µ(Yi)
where µ(Vj) is either the cardinality of the set Vj (if Yj is a
nominal variable) or the length of the interval Vj (if Yj is a
continuous variable).
DE CARVALHO’S DISSIMILARITY
MEASURES
Five different similarity measures si, i = 1, ..., 5, are defined:
si
C o m p a riso n F u n c tio n
R ange
P ro p e rty
s1
s2
s3
s4
S5
α/ (α+ β + χ)
2 α / (2 α + β + χ )
α/ (α+ 2 β + 2 χ)
½ [ α / ( α + β )+ α / ( α + χ )]
α / [( α + β ) ( α + χ )] ½
[0 ,1 ]
[0 ,1 ]
[0 ,1 ]
[0 ,1 ]
[0 ,1 ]
m e tric
s e m i m e tric
m e tric
s e m i m e tric
s e m i m e tric
The corresponding dissimilarities are di = 1 − si.
The di are aggregated on p variables by the generalised
Minkowski metric, thus obtaining:
SO_1
p
d i ( a ,b ) =
q
∑ [w
d i( A j, B j) ]
q
j
1≤ i≤ 5
j =1
22
Graphical Representation of a dissimilarity
SOE Software
Graphic Representation of the symbolic
description of a concept
Figure21: Examples of 2D and 3D ZOOM STAR:
each star describes a concept
23
Pyramid
Each level is a class of concepts modeled by a symbolic description.
PRODUCTS
WEIGHT
COST
AGE
PROFIT
PRODUCT1
[2,4]
[3,5]
[4,6]
[0,3]
PRODUCT2
[4,5]
[3,4]
[1,6]
[2,7]
PRODUCT3
[1,6]
[2,7]
[5,8]
[6,9]
AGE
PRODUCT1
PRODUCT2
6*
4*
3
*
*
2
5
*
COST
* 4
WEIGHT
24
Symbolic Principal
Component Analysis
xx
x
xx
Symbolic correlation
x
Principal Component Analysis
25
SYMBOLIC DESCRIPTION OF CLASSES
THE SODAS 2 SOFTWARE FROM ASSO
SS
YY
MMB B
OO
L IL C
IC
DDAATTAA
mi p l o t
HH
i s it so t
go
r agmr ,a B
S t a r s , G r a p h ic s
D eecci si isoino ntt rreeee
D
D is s im ila r it ie s
D is c r im in a t io n
C lu s t e r in g
F aFc at co tr oi ar li aAl n a l y s i s
R eFgar ce tsos ri oi an l A n a l y s i s
T TS S
S Y M BSOYLMI B
C. OOB BJ JE EC C
S
o ff b
b eesst t
S eel el ec tc ito ino n
o
SSy y
mm
b ob
c O
b bj jeecc tt ss
.l i O
NNe e
wSwy
Sm
y mbb o l i c
D a t a T a b le
S y mG br oa lpi ch i cD
Ssay tm
oafb
A n a ly s is
E x p o r t a t io n t o a
new D atab ase
S y m b o lic O b j e c t s
P r o p a g a t io n
26
NEW PROBLEMS APPEAR
.QUALITY, ROBUSTNESS RELIABILITY OF
THE APPROXIMATION OF A CONCEPT BY
A SYMBOLIC OBJECT,
.THE SYMBOLIC DESCRIPTION OF A
CLASS,
.THE CONSENSUS BETWEEN SYMBOLIC
DESCRIPTIONS ETC..
MUCH HAS TO BE DONE:
SYMBOLIC
.REGRESSION, FACTORIAL ANAL.
.MULTIDIMENSIONAL SCALING,
.MIXTURE DECOMPOSITION,
.NEURAL NETWORK, KOHONEN MAP
.CONCEPT PROPAGATION between
Databases…...
Some recent advances:
- Mixture decomposition of Distributions of
distributions (by Copulas, Dirichlet and Kraft
stochastic process)
- Stochastic Symbolic Conceptual lattices
using capacity theory
- Symbolic class description
-Symbolic Regression
-NEXT FUTUR
-- Spatial symbolic clustering by pyramids
- Symbolic time series.
- Consensus between different description of
the same set of units
27
AIM ATTAINED
FROM
HUDGE DATA BASES
IN AN ECONOMIC WAY
WE ARE ABLE TO: -Extract new knowledge
-Summarize
-Concatenate
-Solve confidentiality
-Explain Correlation
HOW? By working on HIGHER LEVEL UNITS
extending Data Mining to Knowledge Mining.
CONCLUSION
Symbolic Data Analysis is an extension of standard data
analysis therefore
First principle: any Symbolic Data Mining method must
have as a special case method of Data Mining on standard
data.
Second principle : the output must be a symbolic
description or a symbolic object
New problems appear as the quality, robustness and
reliability of the modelling of a concept by a symbolic
object, the symbolic description of a class, the consensus
between symbolic descriptions etc..
Due to the intensive development of the information
technology the great chapters of the standard statistics will
have to be think in these new terms.
28
References
SPRINGER, 2000 :
“Analysis of Symbolic Data”
H.H., Bock, E. Diday, Editors . 450 pages.
JASA (Journal of the American Statistical Association)
“From the Statistic of Data to the Statistic of Knowledge:
Symbolic Data Analysis” Billard, Diday June, 2003 .
Electronic Journal of S. D. A.: ESDA
E. Diday, R. Verde, Y. Lechevallier editors
WILEY 2006: To be published
SYMBOLIC DATA ANALYSIS AND THE SODAS SOFTWARE
Diday, Noirhomme (editors)
SYMBOLIC DATA ANALYSIS : conceptual statistics and data mining
Billard , Diday
Download SODAS with documentation and examples :
www.ceremade.dauphine.fr
then LISE, then SODAS
29