INTRODUCTION TO
SYMBOLIC DATA
ANALYSIS
E. Diday
CEREMADE. Paris–Dauphine University
TUTORIAL: 13 June 2014
Activity Center, Academia Sinica, Taipei, Taiwan
OUTLINE
• PART 1: BUILDING SYMBOLIC DATA FROM
STANDARD OR COMPLEX DATA
• PART 2: SYMBOLIC DATA ANALYSIS
Is Symbolic Data Analysis a new paradigm?
• PART 3: OPEN DIRECTIONS OF RESEARCH
• PART 4: SDA SOFTWARE: SODAS, SYR and R
• PART 5: INDUSTRIAL APPLICATIONS
PART 1
BUILDING SYMBOLIC DATA
FROM STANDARD OR
COMPLEX DATA
What is a standard Data Table?
It is a set of individuals (i.e. observations) described by a set of
• Numerical variables (such as age, weight, ...) or
• Categorical variables (such as nationality, club name, ...).
Example:
[Table: the rows are the individuals (players): Player 1, Messi, Ronaldo, ..., Player n. The columns are the variables: age, height, weight, Nationality, Club, Team.]
What are Complex Data?
Any data which cannot be considered as a
“standard observations x standard variables”
data table.
Example
The individuals are towers of nuclear power plants
described by
• Table 1) Observations: cracks.
Variables: crack description.
• Table 2) Observations: corrosions.
Variables: corrosion description.
• Table 3) Observations: vertices of a grid.
Variables: gap depression from the ground.
Why consider classes of individuals as
new individuals?
Example:
• If we wish to know what makes a player win, we are
interested in a standard data table where the individuals are the
players (in rows) described (in columns) by their standard
characteristic variables.
• If our wish is now to know what makes a team win, we are
interested in a data table where the teams (in rows) are described by
characteristic variables of the teams that take into account the variability of
the players inside each team.
• The teams can now be considered as new individuals of a higher level,
described by symbolic variables that take into account the variability of the
individuals inside each class.
From standard data tables to symbolic data tables
[Figure: two tables side by side. Left: the standard data table describing football players (individuals ind1, ..., indi, ..., indn in rows; variables X1, ..., Xj in columns); each cell Xij holds a number (age) or a category (nationality). Right: the symbolic data table describing teams, i.e. classes of individuals (classes C1, ..., Ci, ..., Ck in rows; variables X’1, ..., X’j in columns); each cell holds a symbolic data, e.g. the age bar chart of the Messi team, a weight interval or a nationalities bar chart.]
Some columns are contingency tables.
SYMBOLIC DATA EXPRESS VARIABILITY
INSIDE CLASSES OF INDIVIDUALS
TEAM OF THE MONDIAL | WEIGHT   | NATIONALITY     | NB OF GOALS
BARSA               | [75, 89] | {French}        | {0.8 (0), 0.2 (1)}
MANCHESTER          | [80, 95] | {Fr, Alg, Arg}  | {0.1 (0), 0.3 (1), …}
PARIS-ST G.         | [76, 95] | {Fr, Tun}       | {0.4 (0), 0.2 (1), …}
DORTMUND            | [70, 85] | {Fr, Engl, Arg} | {0.2 (0), 0.5 (1), …}
Here the variation (of weight, nationality, …)
concerns the players of each team.
Therefore each cell can contain:
a number, an interval, a set of categorical values, a
sequence of weighted values such as a bar chart, a distribution, …
THESE NEW KINDS OF VARIABLES ARE CALLED « SYMBOLIC »
BECAUSE THEY ARE NOT PURELY NUMERICAL: THEY
EXPRESS THE INTERNAL VARIATION INSIDE EACH CLASS.
What is the current failure which has
produced the SDA Paradigm?
The failure is that, in current practice,
only the “individual” kind of
observations is considered.
• Therefore these individual observations
are only described by standard numerical
and categorical variables.

The SDA paradigm shift
It is the transition
• from “individual observations” described by
standard variables with numerical or categorical
values
• to “classes of individuals” (considered as
“higher level observations”)
• described by “symbolic variables” with “symbolic
values” (intervals, probability distributions, sets
of categories or numbers, random variables, …)
• which take into account the variability inside the classes.
• “Symbolic values” cannot be treated as
numbers.
Building Symbolic Data needs three steps
First step: we have a standard data table (Table 1), where individuals are
described by numerical or categorical random variables Yj.
Second step: we have Table 2, where classes of individuals are described by
random variables Y’j whose values Yij are themselves random variables.
Third step: we have a symbolic data table (Table 3), where the random
variables Yij are represented by:
• Probability distributions, histograms, bar charts, percentiles, …
• Intervals: [Min, Max], interquartile interval, etc.
• Sets of numbers or categories
• Functions such as time series.
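The three steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the player rows, the function name build_symbolic_table and the choice of [min, max] intervals and relative-frequency bar charts are hypothetical, not taken from SODAS or SYR.

```python
from collections import Counter

# Step 1 (hypothetical data): a standard table, one row per individual.
players = [
    {"team": "A", "weight": 75, "nationality": "Fr"},
    {"team": "A", "weight": 89, "nationality": "Fr"},
    {"team": "B", "weight": 80, "nationality": "Arg"},
    {"team": "B", "weight": 95, "nationality": "Fr"},
    {"team": "B", "weight": 84, "nationality": "Alg"},
]


def build_symbolic_table(rows, class_key, numeric_vars, categorical_vars):
    """Aggregate individuals into classes described by symbolic values."""
    # Step 2: group the individuals by class.
    groups = {}
    for row in rows:
        groups.setdefault(row[class_key], []).append(row)

    # Step 3: represent each variable inside each class by a symbolic value.
    table = {}
    for cls, members in groups.items():
        description = {}
        for var in numeric_vars:
            values = [m[var] for m in members]
            description[var] = (min(values), max(values))  # interval
        for var in categorical_vars:
            counts = Counter(m[var] for m in members)
            # bar chart: relative frequency of each category in the class
            description[var] = {c: n / len(members) for c, n in counts.items()}
        table[cls] = description
    return table


symbolic = build_symbolic_table(players, "team", ["weight"], ["nationality"])
for team, description in symbolic.items():
    print(team, description)
```

Each row of the result is a class (a team), and each cell is an interval or a bar chart rather than a single number.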
VARIABLES
• Standard variables value:
  - numerical (income, profit, …),
  - categorical (countries, stock-exchange places, …)
• Symbolic variables value:
  - interval,
  - bar chart,
  - histogram, etc.
Ten examples of Symbolic variables
What kind of questions and how
are they structured?
Building symbolic data tables from complex data:
• Aggregation, by a discretisation maximizing the
dissimilarity between the classes and maximizing the
correlation between the bins of the symbolic variables
• Concatenation
• Fusion
Managing symbolic data tables:
• Sorting rows by min, max of intervals or frequencies
of bar charts
• Sorting variables by discriminant power
Analysing symbolic data tables, by extending to symbolic data:
• Statistics
• Data mining
• Machine learning
How to build symbolic data from
standard or complex data?
• How should the numerical, ordinal and
nominal ground variables be categorized in order to
obtain the symbolic histogram or bar chart
variables for each class?
• First: find the discretisation which discriminates
these classes as well as possible.
• Second, or simultaneously: maximize the
correlation between the bins.
SOME ADVANTAGES of SYMBOLIC DATA:
• Work at the needed level of generality without losing
variability.
• Reduce huge simple or complex data.
• Reduce the number of observations and the number of variables.
• Reduce missing data.
• Extract simplified knowledge and decisions from
complex data.
• Solve confidentiality (classes are not confidential, unlike
individuals).
• Facilitate interpretation of results: decision trees, factorial
analysis, new kinds of graphics.
• Extend Data Mining and Statistics to new kinds of data with
PART 2
SYMBOLIC DATA ANALYSIS
SYMBOLIC DATA ANALYSIS TOOLS
HAVE BEEN DEVELOPED
- Graphical visualisation of Symbolic Data
- Correlation, Mean, Mean Square, distribution of a
symbolic variable.
- Dissimilarities between symbolic descriptions
- Clustering of symbolic descriptions
- S-Kohonen Mappings
- S-Decision Trees
- S-Principal Component Analysis
- S-Discriminant Factorial Analysis
- S-Regression
- Etc...
From standard observations to classes,
the correlation is not the same!
[Figure: plane with axes Y1 and Y2; the observations fill a circle and four points are marked x.]
• Observations are uniformly distributed in the circle:
• no correlation between Y1 and Y2 for the initial observations.
• A correlation appears between the two variables for the centers of a given partition into
4 classes.
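A small numerical check of this effect, as a sketch under assumptions: the observations are taken on a regular grid filling a disc (a deterministic stand-in for a uniform sample), and the partition into 4 classes is assumed to cut the disc into strips along the value of Y1 + Y2. The slide does not say which partition is used, so this choice is only illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Individuals: grid points (Y1, Y2) filling the disc of radius 10.
points = [(i, j) for i in range(-10, 11) for j in range(-10, 11)
          if i * i + j * j <= 100]


def strip(point):
    """Assumed partition: 4 strips defined by the value of Y1 + Y2."""
    s = point[0] + point[1]
    if s < -7:
        return 0
    if s < 0:
        return 1
    if s < 7:
        return 2
    return 3


# Classes and their centres (mean of Y1, mean of Y2).
classes = {}
for p in points:
    classes.setdefault(strip(p), []).append(p)
centres = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
           for c in classes.values()]

r_individuals = pearson([p[0] for p in points], [p[1] for p in points])
r_centres = pearson([c[0] for c in centres], [c[1] for c in centres])
print("individuals:", r_individuals, "class centres:", r_centres)
```

With this partition the four centres lie on the diagonal, so they are strongly correlated although the individuals are not. A partition into quadrants would not show the effect, which is why the result depends on the "given partition".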
WHY SYMBOLIC DATA CANNOT BE REDUCED TO A
CLASSICAL STANDARD DATA TABLE?
Symbolic Data Table
Players category | Weight   | Size         | Nationality
Very good        | [80, 95] | [1.70, 1.95] | {0.7 Eur, 0.3 Afr}
Transformation into classical data
Players category | Weight Min | Weight Max | Size Min | Size Max | Eur | Afr
Very good        | 80         | 95         | 1.70     | 1.95     | 0.7 | 0.3
Concern:
The initial variables are lost and the variation is lost!
Divisive Clustering or Decision tree
[Figure panels: Symbolic Analysis (variable: Weight); Classical Analysis (variable: Max Weight).]
PCA and NETWORK OF BAR CHART DATA
of 30 Iris Fisher Data Clusters*
Any symbolic variable (set of bins variables) can be projected. Here the
species variable.
* SYROKKO Company [email protected]
The Symbolic Variables contributions are inside the smallest
hyper cube containing the correlation sphere of the bins
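Numerical versus symbolical space of representation
[Table: classes C1, ..., Ci, ..., Ck in rows; interval variables Y1 and Y2 in columns, given by their bounds a1, b1 and a2, b2. The row of Ci holds a1i, b1i, a2i, b2i, so that]
(Y1(Ci), Y2(Ci)) = ([a1i, b1i], [a2i, b2i])
[Figure, two panels. "Numerical representation of interval variables": each class Ci is a point x in the space of the bounds a1, b1, a2, b2. "Bi-plot of interval variables": axes Y1 and Y2; each class Ci is drawn from its bounds a1i, b1i, a2i, b2i.]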
Bi-plot of histogram variables
• The joint probability can be inferred by a
copula model
[Figure: bi-plot with axes Y1 and Y2 showing the classes C1, ..., Ci, ..., Ck; label: Copula.]
PART 3: OPEN DIRECTIONS OF RESEARCH
• Models of models
• Law of parameters of laws
• Laws of vectors of laws.
• Copulas needed.
• Four general convergence theorems.
• Optimisation in non-supervised learning
(hierarchical and pyramidal clustering).
From the lower level of individual observations
to the higher level of observations of classes:
higher level models are needed
[Figure: Table 1 (individuals ind1, ..., Messi, ..., indn in rows; variables X1, ..., Xj in columns; a cell Xij holds a number, e.g. the age of Messi) and Table 2 (teams C1, ..., Ci, ..., Ck in rows; variables X’1, ..., X’j in columns; a cell holds a symbolic data, e.g. the age of the Messi team).]
Xj is a standard random numerical variable.
X’j is a random variable with histogram values.
• Question: if the law of Xj is given, what is
the law of X’j? (Dirichlet models are useful.)
Why use copula models in
Symbolic Data Analysis?
f(i, j, j’) is the joint probability of the variables j and j’ for the
individual i.
• In case of independence, we have
f(i, j, j’) = f(i, j) . f(i, j’),
• If there is dependency:
f(i, j, j’) = Copula(f(i, j), f(i, j’))
Aim of the copula model in SDA:
• find the copula which minimises the difference with the joint probability,
• in order to avoid the restriction to independence hypotheses
and to reduce the cost of computing f(i, j, j’).
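A minimal sketch of this idea, under stated assumptions: it works on cumulative distribution values u and v of the two marginals (the usual setting of Sklar's theorem, whereas the slide speaks of joint probabilities), it uses the Clayton family as one possible copula, and it fits the parameter theta by a grid search on squared differences. The function names and the "observed" joint values are hypothetical.

```python
def independence_copula(u, v):
    """Copula of two independent variables: the plain product."""
    return u * v


def clayton_copula(u, v, theta):
    """Clayton copula for theta > 0; larger theta means stronger dependence."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def fit_theta(joint, thetas):
    """Return the theta whose copula is closest to the observed joint values.

    joint maps a pair (u, v) of marginal values to the observed joint value.
    """
    def loss(theta):
        return sum((clayton_copula(u, v, theta) - f) ** 2
                   for (u, v), f in joint.items())
    return min(thetas, key=loss)


# Hypothetical observed joint values, generated here with theta = 2.
grid = [0.2, 0.5, 0.8]
observed = {(u, v): clayton_copula(u, v, 2.0) for u in grid for v in grid}

best_theta = fit_theta(observed, [0.5, 1.0, 2.0, 4.0])
print("best theta:", best_theta)
print("independence:", independence_copula(0.5, 0.5))
print("fitted copula:", clayton_copula(0.5, 0.5, best_theta))
```

Only the two marginals and one parameter are stored per class, which is the cost reduction the slide refers to; when theta is close to 0 the Clayton copula approaches the independence product.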
FOUR THEOREMS TO BE PROVED FOR ANY METHOD EXTENDED
TO SYMBOLIC DATA.
M(n, k) is supposed to be an SDA method where k is the number of
classes obtained on n initial individuals.
THEOREM 1: If the k classes are fixed and n tends towards
infinity, then M(n, k) converges towards a stable position.
THEOREM 2: If k increases until there is a single individual per
class, then M(n, k) converges towards a standard method.
THEOREM 3: If k and n increase simultaneously towards
infinity, then M(n, k) converges towards a stable position.
THEOREM 4: If the k laws associated with the k classes are
considered as a sample of a law of laws, then M(n, k) applied to this
sample converges to M(n, k) applied to this law.
Examples:
Theorem 1: it was proved in Diday, Emilion (CRAS, Choquet 1998) for Galois lattices: as the size of the
population grows, the classes (described by vectors of distributions) organise into a Galois lattice which converges.
Emilion (CRAS, 2002) also gives a theorem for mixtures of laws of laws, using martingales and a
Dirichlet model.
Theorem 2: for example, classical PCA MO is a special case of the PCA denoted M(n, k) built on vectors of intervals.
Theorem 3: this is the setting of data arriving sequentially ("Data Stream" type) and of one-pass algorithms
(see e.g. Diday, Murty (2005)).
Theorem 4: for a hierarchical or pyramidal clustering (2D, 3D, etc.), convergence means that the upper levels and
their structure stabilise. For a PCA, convergence means that the factorial axes stabilise.
Optimisation in clustering
d is the given dissimilarity. Each class is described by symbolic data.
• Hierarchies: ultrametric dissimilarity = U, criterion W = |d - U|
• Pyramids: Robinsonian dissimilarity = R, criterion W = |d - R|
• 3D Spatial Pyramid: Yadidean dissimilarity = Y, criterion W = |d - Y|
[Figure: a hierarchy and a pyramid built on x1, x2, x3, x4, x5, and a 3D spatial pyramid with elements S1, S2, A1, B1, C1, C2, C3.]
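The hierarchy case W = |d - U| can be illustrated as follows. This is a sketch under assumptions: U is taken to be the subdominant (single-linkage) ultrametric of d, computed by minimax paths, and |.| is read as the sum of absolute differences over all pairs. The slide fixes neither choice, and the subdominant ultrametric is not claimed to minimise W. The matrix d is a hypothetical example.

```python
def subdominant_ultrametric(d):
    """Largest ultrametric below d, via minimax paths (Floyd-Warshall style)."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(u[i][k], u[k][j])  # height of the path through k
                if via_k < u[i][j]:
                    u[i][j] = via_k
    return u


def is_ultrametric(u):
    """Check u(i, j) <= max(u(i, k), u(k, j)) for every triple."""
    n = len(u)
    return all(u[i][j] <= max(u[i][k], u[k][j])
               for i in range(n) for j in range(n) for k in range(n))


def fit_criterion(d, u):
    """W = sum over pairs i < j of |d(i, j) - u(i, j)|."""
    n = len(d)
    return sum(abs(d[i][j] - u[i][j])
               for i in range(n) for j in range(i + 1, n))


# Hypothetical dissimilarity between four objects x1, ..., x4.
d = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]

U = subdominant_ultrametric(d)
W = fit_criterion(d, U)
print("U =", U)
print("W =", W)
```

The same fit_criterion applies unchanged to a Robinsonian or Yadidean dissimilarity once such a matrix has been built by a pyramidal method.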
PART 4: SDA SOFTWARE:
SODAS
RSDA
SYR
Software
To build symbolic data from standard or complex data and to analyze
symbolic data, different software packages exist today.
SODAS: free academic package; registration is required and a
code is needed for installation,
http://www.info.fundp.ac.be/asso/sodaslink.htm
Many symbolic databases can be found at
http://www.ceremade.dauphine.fr/SODAS/
RSDA: free academic packages are available on CRAN:
[email protected]
SYR: professional package, see:
[email protected]
SODAS SOFTWARE
KOHONEN MAP OF CONCEPTS
FACTORIAL ANALYSIS: PCA of interval-valued variables
Superposition of two stars associated with two
classes of the pyramid
Decision tree on histogram-valued or
interval-valued variables
The objective of SCLUST is the clustering of symbolic objects (SOs) by a dynamic algorithm based on symbolic
data tables. The aim is to build a partition of the SOs into a predefined number of classes. Each class has a
prototype in the form of an SO. The optimality criterion is based on the sum of proximities between
the individuals and the prototypes of the clusters.
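A dynamic clustering loop of this kind can be sketched for interval-valued objects. This is not the SCLUST implementation: the squared Euclidean distance on interval bounds, the mean-of-bounds prototype, the initialisation with the first k objects and all names are assumptions made for the example.

```python
def distance(a, b):
    """Squared Euclidean distance between two vectors of intervals (lo, hi)."""
    return sum((al - bl) ** 2 + (ah - bh) ** 2
               for (al, ah), (bl, bh) in zip(a, b))


def prototype(members):
    """Prototype of a class: the interval vector of mean bounds."""
    n = len(members)
    n_vars = len(members[0])
    return [(sum(m[v][0] for m in members) / n,
             sum(m[v][1] for m in members) / n) for v in range(n_vars)]


def dynamic_clustering(objects, k, max_iter=100):
    """Alternate allocation and prototype steps until the partition is stable."""
    prototypes = [list(obj) for obj in objects[:k]]  # assumed initialisation
    labels = None
    for _ in range(max_iter):
        # Allocation step: each object joins its closest prototype.
        new_labels = [min(range(k), key=lambda c: distance(obj, prototypes[c]))
                      for obj in objects]
        if new_labels == labels:
            break
        labels = new_labels
        # Representation step: recompute the prototype of each non-empty class.
        for c in range(k):
            members = [o for o, lab in zip(objects, labels) if lab == c]
            if members:
                prototypes[c] = prototype(members)
    return labels, prototypes


# Hypothetical symbolic objects described by one interval variable.
objects = [[(0.0, 1.0)], [(0.2, 1.1)], [(10.0, 12.0)], [(10.5, 12.5)]]
labels, prototypes = dynamic_clustering(objects, k=2)
print(labels, prototypes)
```

Each prototype is itself an interval vector, so it has the same form as the objects it represents, in the spirit of "a prototype in the form of an SO".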
Clustering pyramid
FROM DATA BASE TO SYMBOLIC DATA IN SODAS
[Figure: a QUERY on a relational database (individuals, classes) returns the description of the individuals and their classes; from it a symbolic data table is built, with the classes in rows (class descriptions), the symbolic variables in columns and symbolic data in the cells.]
SYR SOFTWARE
Produce a Symbolic Data
Table from complex data.
Manage Symbolic Data
Tables: sort rows and
columns by discriminant
power
Analyse Symbolic data
tables: SPCA,Sclustering…
Produce network, rules
and decision trees.
SYR: SYMBOLIC DATA TABLE MANAGEMENT
SYMBOLIC DATA TABLE
• Sorting rows by min, max of intervals or frequencies of bar charts is
possible.
• Sorting variables by discriminant power of the concepts is also
possible.
PART 5: INDUSTRIAL APPLICATIONS
Time Series Data table: Anomaly detection on a bridge
LCPC (Laboratoire Central des Ponts et Chaussées) and SNCF data
[Table: one row per train; columns Sensor 1, Sensor 2, Sensor 3, ..., Sensor N.]
Each row represents a train crossing the bridge at a given temperature;
each cell contains up to 800,000 values.
Each cell is transformed into a HISTOGRAM from a PROJECTION or from WAVELETS.
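Turning one cell, i.e. a long series of sensor values, into a histogram can be sketched as below. The sketch bins the raw values directly into equal-width bins, which is an assumption: the slide builds the histogram from a projection or from wavelets, and that preprocessing is not reproduced here. The signal values and the function name are hypothetical.

```python
def to_histogram(values, n_bins, lo, hi):
    """Relative frequencies of values on n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        index = int((v - lo) / width)
        index = min(max(index, 0), n_bins - 1)  # clamp values on or past the edges
        counts[index] += 1
    return [c / len(values) for c in counts]


# Hypothetical sensor signal for one train (a real cell holds far more values).
signal = [0.1, 0.2, 0.25, 0.6, 0.9, 1.0, 0.55, 0.3]
histogram = to_histogram(signal, n_bins=4, lo=0.0, hi=1.0)
print(histogram)
```

Whatever the length of the series, the cell is reduced to n_bins frequencies, so trains with different numbers of measurements become comparable.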
HIERARCHICAL DATA*
[Figure: description of pig respiratory diseases. Input: 19 variables on 125 farms x 30 animals. Symbolic procedure: median score (continuous variables) and animal frequencies (categorical variables). Output: 64 variables on 125 farms.]
*C. Fablet, S. Bougeard (AFSSA)
From numerical description of
pigs to symbolic description
of farms
• Numerical variables
and
• Categorical variables
are transformed into bar charts
of the frequencies based on
30 animals,
or into interval-valued variables.
Step 1: Symbolic Description of Farms*
* SYROKKO Company [email protected]
Nuclear Power Plant
Find Correlations Between
3 Standard Data Tables of Different
observation units and different Variables
NUCLEAR POWER PLANT
Nuclear thermal power station
Inspection:
Cartography of the tower by a grid
Inspection machine
Cracks
PROBLEM: FIND CORRELATIONS BETWEEN 3 CLASSICAL DATA TABLES OF DIFFERENT UNITS AND
VARIABLES:
Table 1) Observations: cracks. Variables: crack description.
Table 2) Observations: vertices of a grid. Variables: gap deviation at different periods compared
to the initial model position.
Table 3) Observations: vertices of a grid. Variables: gap depression from the ground.
These are transformed into ONE symbolic data table where the classes are the towers. On this new table
SDA can be applied.
FROM COMPLEX DATA TO SYMBOLIC DATA
Towers on PCA first axes
• PCA on chosen symbolic variables.
• Visualisation of three clusters.
• Interval and bar chart variables can be seen.
• A network of the strongest links can be represented.
NETSYR results (SYR software)
Symbolic variables projection inside the
hypercube of the correlation sphere
Telephone call text mining in order to
discover “themes” without using semantics
INITIAL DATA: 2 814 446 rows
Documents | Words
Doc1      | bonjour
Doc1      | oui
Doc1      | monsieur
...       | ...
Doc2      | panne
...       | ...
Correspondence between documents and words.
Each calling session is called a document.
After lemmatisation, we start with a table of
• 31454 documents
• 2258 words
First steps: building overlapping clusters
of documents and words: CLUSTSYR
• Input: correspondence between documents and words, 2 814 446 rows (31454 documents x 2258 words).
• 70 x 2258: 70 overlapping clusters of documents described by the tf-idf of 2258 words.
• 2258 x 70: 2258 words described by their tf-idf on the 70 clusters of documents.
• 80 x 70: 80 overlapping clusters of words described by their tf-idf in the 70 clusters of documents.
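Describing clusters of documents by the tf-idf of their words can be sketched as follows. The weighting used here (relative term frequency times the logarithm of the inverse cluster frequency) is one common tf-idf variant and is an assumption, since the slides do not say which variant CLUSTSYR uses. The two small clusters are hypothetical.

```python
from math import log


def tf_idf(cluster_words):
    """Map each cluster to {word: tf-idf}, treating a cluster as one document.

    cluster_words maps a cluster name to the list of words it contains.
    """
    n_clusters = len(cluster_words)

    # Number of clusters in which each word occurs.
    cluster_frequency = {}
    for words in cluster_words.values():
        for word in set(words):
            cluster_frequency[word] = cluster_frequency.get(word, 0) + 1

    scores = {}
    for cluster, words in cluster_words.items():
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        scores[cluster] = {
            word: (count / len(words)) * log(n_clusters / cluster_frequency[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical clusters of documents, each given by its lemmatised words.
clusters = {
    "docs_1": ["bonjour", "panne", "panne", "oui"],
    "docs_2": ["bonjour", "oui", "monsieur", "monsieur"],
}
scores = tf_idf(clusters)
for cluster, weights in scores.items():
    print(cluster, weights)
```

Words present in every cluster, such as greetings, get a weight of zero, so the description of each cluster is carried by the words that are specific to it.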
Next step: STATSYR
Each cluster of documents is described by the 80 clusters of words called “themes”
[Figure: classes of documents, themes, and the words in each theme.]
GRAPHICAL REPRESENTATION
by NETSYR from SYR software
• Graphical representation of themes and document classes by pie charts, with their bar chart description.
• Overlapping clusters.
• Social network based on dissimilarities.
• Annotation of themes and document classes.
• Moving, zooming, …
We finally obtain a clear representation of the main
themes, their classes and their links: “failures”,
“budget”, “addresses”, “vacation”, etc.
A Survey on Security
• A sample of people from three regions
(Vex, Val, Plai) answered
three questions:
• Gender: M or W,
• Security: priority to
  - Fight Against Unemployment (FAU),
  - Juvenile Delinquency (JD),
  - Drug addiction (D),
• Death penalty (Yes or No).
Gender, Security and Death Penalty are
« barchart value variables ».
M, W, FAU, JD, … are « bins ».
From barchart symbolic variables
to Metabin latent variables
Table 1: Initial bar chart data table
Region | Gender    | Insecurity     | Death Penalty
       | M    W    | FAU  JD   D    | Yes  No
Vex    | 0.8  0.2  | 0.4  0.5  0.1  | 0.5  0.5
Val    | 0.7  0.3  | 0.5  0.2  0.3  | 0.4  0.6
Plai   | 0.3  0.7  | 0.7  0.1  0.2  | 0.1  0.9
Table 2: Metabin latent variables
Region | S1cor          | S2cor          | S3cor
       | M    JD   Yes  | W    FAU  No   | NU   D    NU
Vex    | 0.8  0.5  0.5  | 0.2  0.4  0.5  | NU   0.1  NU
Val    | 0.7  0.2  0.4  | 0.3  0.5  0.6  | NU   0.3  NU
Plai   | 0.3  0.1  0.1  | 0.7  0.7  0.9  | NU   0.2  NU
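The grouping of bins into metabins can be explored with a simple correlation rule. The sketch below is hypothetical and is not presented as the metabin algorithm of the cited work: starting from one bin of the Gender variable, it picks in every other variable the bin that is most correlated with it across the three regions. The frequencies are those of the initial bar chart table for Vex, Val and Plai.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Bin frequencies per region, in the order (Vex, Val, Plai).
bar_charts = {
    "Gender": {"M": [0.8, 0.7, 0.3], "W": [0.2, 0.3, 0.7]},
    "Insecurity": {"FAU": [0.4, 0.5, 0.7], "JD": [0.5, 0.2, 0.1],
                   "D": [0.1, 0.3, 0.2]},
    "Death Penalty": {"Yes": [0.5, 0.4, 0.1], "No": [0.5, 0.6, 0.9]},
}


def metabin_for(seed_variable, seed_bin, table):
    """Assumed greedy rule: one bin per variable, most correlated with the seed."""
    seed = table[seed_variable][seed_bin]
    chosen = [seed_bin]
    for variable, bins in table.items():
        if variable == seed_variable:
            continue
        chosen.append(max(bins, key=lambda b: pearson(seed, bins[b])))
    return chosen


first = metabin_for("Gender", "M", bar_charts)
second = metabin_for("Gender", "W", bar_charts)
print(first, second)
```

On these data the rule recovers the bins grouped under S1cor and S2cor; the remaining bin D is then left alone, with NU in the other two positions, as in S3cor.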
CONCLUSION
• If you have standard units described by numerical
and (or) categorical variables, these variables
induce “classes” described by symbolic variables
that take their internal variation into account. SDA can then
be applied to these new units in order to get
complementary and enhanced results by extending
standard analysis to symbolic analysis.
• Symbolic data have to be built from given standard
or complex data.
• Symbolic data cannot be reduced to standard data.
• Complex data can be simplified into symbolic data.
• Big databases can be reduced to symbolic data.
• Symbolic data are not only distributions, they are the
numbers of the future.
References
Basic books and papers:
• Bock H.H., Diday E. (editors and co-authors) (2000): Analysis of
Symbolic Data. Exploratory methods for extracting statistical
information from complex data. Springer Verlag, Heidelberg, 425 pages,
ISBN 3-540-66619-2.
• L. Billard, E. Diday (2003) "From the statistics of data to the statistics of
knowledge: Symbolic Data Analysis". JASA, Journal of the American
Statistical Association. June, Vol. 98, N° 462.
• E. Diday, M. Noirhomme (eds and co-authors) (2008) “Symbolic Data
Analysis and the SODAS software”. 457 pages. Wiley. ISBN 978-0-470-01883-5.
• Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual
Statistics and Data Mining. 321 pages. Wiley series in computational
statistics. Wiley, Chichester, ISBN 0-470-09016-2.
• Noirhomme-Fraiture, M. and Brito, P. (2012) Far beyond the classical
data models: symbolic data analysis. Statistical Analysis and Data
Mining 4 (2), 157-170.
• Lazare N. (2013) "Symbolic Data Analysis". CHANCE magazine. Editor’s
Letter – Vol. 26, No. 3.
Building Symbolic Data and
representation References
• Stéphan V., Hébrail G., Lechevallier Y. (2000) « Generation
of symbolic objects from relational data base ». Chapter in
book: Analysis of Symbolic Data: Exploratory Methods for Extracting
Statistical Information from Complex Data (eds. H.-H. Bock and E. Diday).
Springer-Verlag, Berlin, 103-124.
• Chiun-How, K., Chih-Wen, O., Yin-Jing, T., Chuan-kai, Yang,
Chun-houh, Chen (2012) “A Symbolic Database for TIMSS”. Arroyo J.,
Maté C., Brito P., Noirhomme M. eds, 3rd Workshop in Symbolic Data Analysis.
Universidad Complutense de Madrid. http://www.sda-workshop.org/.
• E. Diday, F. Afonso, R. Haddad (2013): “The symbolic
data analysis paradigm, discriminate discretization and
financial application”. In Advances in Theory and Applications of
High Dimensional and Symbolic Data Analysis, HDSDA 2013. Revue des
Nouvelles Technologies de l'Information vol. RNTI-E-25, pp. 1-14
SOME SYMBOLIC DATA ANALYSIS REFERENCES
• In Principal Component Analysis
Cazes P., Chouakria A., Diday E., Schektman Y. (1997). Extension de l’analyse en composantes
principales à des données de type intervalle, Rev. Statistique Appliquées, Vol. XLV Num. 3, pp. 5-24,
France. 29.
Cazes P. (2002) Analyse factorielle d’un tableau de lois de probabilité. Revue de statistique appliquée,
tome 50, n0 3.
Diday E. (2013) "Principal Component Analysis for bar charts and Metabins tables". Statistical
Analysis and Data Mining. Article first published online: 20 May 2013. DOI: 10.1002/sam.11188.
2013 Wiley. Statistical Analysis and Data Mining,6,5, 403-430.
Ichino, M. (2011). The quantile method for symbolic principal component analysis. Statistical
Analysis and Data Mining, Wiley. 184-198.
Makosso-Kallyth S. and Diday E. (2012) Adaptation of interval PCA to symbolic histogram variables.
Advances in Data Analysis and Classification (ADAC). July, Volume 6, Issue 2, pp 147-159.
Rademacher, J., Billard , L., (2012) Principal component analysis for interval data. Wiley
interdisciplinary Reviews: Computational Statistics .Volume 4, Issue 6, pp. 535–540.
Shimizu N., Nakano J. (2012) Histograms Principal Component Analysis. Arroyo J., Maté C., Brito
P., Noirhomme M. eds, 3rd Workshop in Symbolic Data Analysis. Universidad Complutense de
Madrid. http://www.sda-workshop.org/
Wang H., Guan R., Wu J. (2012a). CIPCA: Complete-Information-based Principal Component
Analysis for interval-valued data, Neurocomputing, Volume 86, Pages 158-169.
Symbolic Data Analysis references
• In Symbolic Forecasting
Arroyo, J. and Maté, C. (2009). Forecasting histogram time series with k-nearest neighbors'
methods. International Journal of Forecasting 25, 192–207.
García-Ascanio, C.; Maté, C. (2010). Electric power demand forecasting using
interval time series: A comparison between VAR and iMLP. Energy Policy 38, 715-725.
Han, A., Hong, Y., Lai, K.K., Wang, S. (2008). Interval time series analysis with an
application to the sterling-dollar exchange rate. Journal of Systems Science and
Complexity, 21 (4), 550-565.
He, L.T. and C. Hu (2009). Impacts of Interval Computing on Stock Market
Variability Forecasting. Computational Economics 33, 263-276.
• In Symbolic rule extraction
Afonso, F. et Diday, E. (2005). Extension de l’algorithme Apriori et des regles d’association
aux cas des donnees symboliques diagrammes et intervalles. Revue RNTI, Extraction et
Gestion des Connaissances (EGC 2005), Vol. 1, pp 205-210, Cepadues, 2005.
Symbolic Data Analysis references
• In Symbolic Decision Tree
Ciampi, A., Diday, E., Lebbe, J., Perinel, E. et Vignes, R. (2000).
Growing a tree classifier with imprecise data. Pattern Recognition
letters 21: 787-803.
Mballo C., Diday E. (2006) The criterion of Smirnov-Kolmogorov
for binary decision tree : application to interval valued variables.
Intelligent Data Analysis. Volume 10, Number 4 . pp 325 – 341
Winsberg S., Diday E., Limam M. (2006). A tree structured
classifier for symbolic class description. Compstat 2006. Physica-Verlag.
Bravo, M. et Garcia-Santesmases, J. (2000). Symbolic Object
Description of Strata by Segmentation Trees, Computational
Statistics, 15:13-24, Physica-Verlag.
Symbolic Data Analysis references
• In Clustering
De Carvalho F., Souza R., Chavent M., and Lechevallier Y. (2006) Adaptive Hausdorff distances and dynamic
clustering of symbolic interval data. Pattern Recognition Letters Volume 27, Issue 3, February 2006, Pages 167-179.
De Souza R.M.C.R, De Carvalho F.A.T. (2004). Clustering of interval data based on City-Block distances. Pattern
Recognition Letters, 25, 353–365.
Diday E. (2008) Spatial classification. DAM (Discrete Applied Mathematics) Volume 156, Issue 8, Pages 1271-1294.
Diday, E., Murty, N. (2005) "Symbolic Data Clustering" in Encyclopedia of Data Warehousing and Mining. John
Wong editor. Idea Group Reference Publisher.
Irpino, A. and Verde, R. (2008): Dynamic clustering of interval data using a Wasserstein-based distance. Pattern
Recognition Letters 29, 1648-1658.
• In Multidimensional Scaling
Terada, Y., Yadohisa, H. (2011) Multidimensional scaling with hyperbox model for
percentile dissimilarities, In: Watada, J., Phillips-Wren, G., Jain, L. C., and Howlett, R. J.
(Eds.): Intelligent Decision Technologies, Springer Verlag, 779–788.
Groenen, P.J.F., Winsberg, S., Rodriguez, O., Diday, E. (2006). I-Scal: Multidimensional
scaling of interval dissimilarities. Computational Statistics and Data Analysis 51, 360–378.
Some Symbolic Data Analysis
references
• In Self Organizing map
Hajjar C., Hamdan H. (2011). Self-organizing map based on L2 distance for interval-valued data. In SACI 2011, 6th IEEE International Symposium on Applied
Computational Intelligence and Informatics (Timisoara, Romania), pp. 317–322.
• In Dissimilarities between Symbolic Data
Kim, J. and Billard, L. (2013): Dissimilarity measures for histogram-valued
observations, Communications in Statistics-Theory and Method, 42, 283-303.
Verde, R., Irpino, A. (2010). Ordinary Least Squares for
Histogram Data Based on Wasserstein Distance, in: Proc.
COMPSTAT’2010, Y. Lechevallier and G. Saporta (Eds), pp. 581-589. Physica Verlag
Heidelberg.
Some Symbolic Data Analysis
references
In Regression and Canonical analysis extended to
Symbolic Data
Dias, S., Brito, P., (2011). A New Linear Regression Model for Histogram-Valued
Variables. In Proceedings of the 58th ISI World Statistics Congress (Dublin, Ireland).
Lauro, C., Verde, R., Irpino, A. (2008). Generalized canonical analysis, in: Symbolic
Data Analysis and the Sodas Software, E. Diday and M. Noirhomme-Fraiture (Eds.),
313-330, Wiley, Chichester.
Tenenhaus A., Diday E., Emilion R., Afonso F. (2013) Regularized General Canonical
Correlation Analysis Extended To Symbolic Data. ADAC (publication on the way).
Neto, E.A, De Carvalho F.A.T. (2010). Constrained linear regression models for
symbolic interval-valued variables. Computational Statistics and Data
Analysis 54, 333-347.
Wang H., Guan R., Wu J. (2012c). Linear regression of interval-valued data
based on complete information in hypercubes, Journal of Systems Science
and Systems Engineering, Volume 21, Issue 4, Page 422-442.
Some Symbolic Data Models references
P. Bertrand, F. Goupil (2000) “ Descriptive Statistics for symbolic data“ . In H.H. Bock, E.
Diday (Eds) “Analysis of Symbolic Data “. Springer-Verlag, pp. 106-124.
Brito, P. and Duarte Silva, A.P. (2012). Modelling interval data with Normal and Skew-Normal distributions. Journal of Applied Statistics, 39 (1), 3-20.
E. Diday, M. Vrac (2005) "Mixture decomposition of distributions by Copulas in the
symbolic data analysis framework". Discrete Applied Mathematics (DAM). Volume 147,
Issue1, 1 April, pp. 27-41.
E. Diday (2011) Modélisation de données symboliques et application au cas des
intervalles. Journées Nationales de la Société Francophone de Classification. Orléans
E. Diday (2002) “From Schweizer to Dempster: mixture decomposition of distributions by
copulas in the symbolic data analysis framework” IPMU 2002, July, Annecy, France
Diday E., Emilion R. (1997) "Treillis de Galois Maximaux et Capacités de Choquet" . C.R.
Acad. Sc. t.325, Série 1, p 261-266. Présenté par G. Choquet en Analyse Mathématiques
Diday E., R. Emilion (2003) Maximal and stochastic Galois lattices. Discrete Applied Math.
Journal. Vol. 27 (2), pp. 271-284.
Emilion R., Classification et mélanges de processus. C.R. Acad. Sci. Paris, 335, série I,
189-193 (2002).
Emilion R., Unsupervised Classification and Analysis of objects described by
nonparametric probability distributions. Statistical Analysis and Data Mining (SAM), Vol
5, 5, 388-398 (2012).
J. Le-Rademacher, L. Billard (2011) “Likelihood functions and some maximum likelihood
estimators for symbolic data”. Journal of Statistical Planning and Inference 141 1593–
1602. Elsevier.
T. Soubdhan, R. Emilion, R. Calif (2009) “Classification of daily solar radiation
distributions”. Solar Energy 83 (2009) 1056–1063. Elsevier.
Some SDA Industrial Applications
Afonso F., Diday E., Badez N., Genest Y. (2010) Symbolic Data Analysis of Complex Data:
Application to nuclear power plant. COMPSTAT’2010 , Paris.
Bezerra B., Carvalho F. (2011) Symbolic data analysis tools for recommendation systems.
Knowl. Inf. Syst 01/2011; 26:385-418. DOI:10.1007/s10115-009-0282-3.
Bouteiller V., Toque C., A., Cherrier J-F., Diday E., Cremona C. (2011) Non-destructive
electrochemical characterizations of reinforced concrete corrosion: basic and symbolic data
analysis. Corros Rev . Walter de Gruyter • Berlin • Boston. DOI 10.1515/corrrev-2011-002.
Courtois, A., Genest, G., Afonso, F., Diday, E., Orcesi, A., (2012) In service inspection of
reinforced concrete cooling towers – EDF’s feedback ,IALCCE 2012, Vienna, Austria
Cury, A., Crémona, C., Diday, E. (2010). Application of symbolic data analysis for structural
modification assessment. Engineering Structures Journal. Vol 32, pp 762-775.
Christelle Fablet, Edwin Diday, Stephanie Bougeard, Carole Toque, Lynne Billard (2010).
Classification of Hierarchical-Structured Data with Symbolic Analysis. Application to
Veterinary Epidemiology. COMPSTAT’2010 , Paris.
Haddad R., Afonso F., Diday E., (2011) Approche symbolique pour l'extraction de
thématiques: Application à un corpus issu d'appels téléphoniques. In actes des XVIIIèmes
Rencontres de la Sociéte francophone de Classification. Université d'Orléans
Laaksonen, S. (2008). People’s Life Values and Trust Components in Europe - Symbolic Data
Analysis for 20-22 Countries. In. Edwin Diday and Monique Noirhomme-Fraiture, “Symbolic
Data Analysis and the SODAS Software", Chapter 22, pp. 405-419. Wiley and Sons:
Chichester, UK.
Quantin C., Billard L., Touati M., Andreu N., Cottin Y., Zeller M., Afonso F., Battaglia G., Seck
D., Le Teuff G., and Diday E.. (2011) Classification and Regression Trees on Aggregate Data
Modeling: An Application in Acute Myocardial Infarction. Journal of Probability and Statistics
Volume 2011 (2011), 19 pages.
Terraza V., Toque C. (2013) Mutual Fund Rating: A Symbolic Data Approach. In "Understanding
Investment Funds: Insights from Performance and Risk Analysis". Edited by Virginie Terraza and Hery
Razafitombo. Economics & Finance Collection 2013. Palgrave Macmillan, UK.
He, L.T. and C. Hu (2009). Impacts of Interval Computing on Stock Market Variability Forecasting.
Computational Economics 33, 263-276.
E. Diday, F. Afonso, R. Haddad (2013) : The symbolic data analysis paradigm, discriminate
discretization and financial application, in Advances in Theory and Applications of High Dimensional
and Symbolic Data Analysis, HDSDA 2013. Revue des Nouvelles Technologies de l'Information vol.
RNTI-E-25, pp. 1-14
Han, A., Hong, Y., Lai, K.K., Wang, S. (2008). Interval time series analysis with an application
to the sterling-dollar exchange rate. Journal of Systems Science and Complexity, 21 (4), 550-565.