Textual Information Clustering
and Visualization for Knowledge
Discovery and Management
Xavier Polanco
URI-INIST-CNRS
Introduction
• We are concerned with the design and
development of computer-based
information analysis tools
• Cluster analysis, computational linguistics
and artificial intelligence techniques are
combined
2
On the technology side
• A computer-based information analysis
system is an integrated environment that
assists a user in carrying out the complex
process of converting information from
textual data sources into knowledge
3
Information Analysis System
[Diagram: French or English text data; dataset or corpus; bibliometric statistics; lexicons or terminological resources; term extraction and indexation; clustering and mapping; DBMS-R; WWW server. Component labels: SDOC, MIRIAD, ILC, HENOCH, NEURODOC. Platforms: Mac, PC, WS.]
4
Home Pages
Intranet
Extranet
5
Plan
• Text Mining
• Cluster Analysis
• Visualization or Mapping
• Knowledge Discovery
• Knowledge Management
6
Textual Information
• A large amount of information is available in
textual form in databases and online sources
• At this scale, manual analysis and
effective extraction of useful information
are not feasible
• It is relevant to provide automatic tools for
analyzing large textual collections
7
Text Mining
• Text mining consists of extracting information
from hidden patterns in large text-data
collections
• The results can be important both:
– for the analysis of the collection, and
– for providing intelligent navigation and browsing
methods
8
Process
• The text mining process can be organized
roughly into five major steps:
1. Data Selection
2. Term Extraction and Filtering
3. Data Clustering and Classification
4. Mapping or Visualization
5. Result Interpretation
• Iterative and interactive process
9
Natural Language Processing
• Experience shows that a linguistic
engineering approach ensures higher
performance of the data mining algorithms
• Part-of-speech tagging (tagging texts) and
lemmatization are generally accepted tasks
10
The approach
• Our approach to text mining is based on
extracting meaningful terms from
documents
• In this presentation, the focus is on the term
extraction process, and
• on the need to organize the
generated terms in a taxonomy
11
The main tasks
• Term extraction or acquisition
• Indexation
• Human control and screening
– Indexing quality control
– Index screening clustering phase
12
Language Engineering
[Diagram: Lexicons and Text-DB feed the Natural Language Engineering System, which produces the Indexed Corpus.]
Lexicons: Management and Linguistic Processing
Texts: Part-of-speech tagging, lemmatization, and indexation
13
Variation
• Syntactic Variation. Normal Form: Resistance gene
– Resistance methylase gene
– Resistance and susceptibility gene
– Gene of the antibiotic resistance
• Morpho-syntactic Variation. Normal Form: Rare species
– Rarely encountered enterococcus species
14
Taxonomy
• A taxonomic structure should improve text mining
• Considering the clustering techniques that might
be used in text mining, one must be mindful that
more taxonomic classifying capabilities should be
incorporated into text mining
• A taxonomic classifying capability might also
facilitate cluster interpretation by giving the user
some kind of rules
15
Clustering
• Clustering is a descriptive task where one
seeks to identify a finite set of categories
• Clustering is used to segment a database
into subsets or clusters
• Clustering means finding the clusters
themselves from a given set of data
16
Clustering Process
[Diagram: Lexicons and Text-DB feed the Natural Language Engineering System, which produces the Indexed Corpus D(n,p); the Clustering Algorithm, using Similarity Measures s(x,y) or Dissimilarity Measures d(x,y), produces C(m,p).]
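The slide names similarity measures s(x,y) and dissimilarity measures d(x,y) without fixing a formula. As one possible illustration, the sketch below assumes the Jaccard coefficient on binary keyword vectors; the choice of Jaccard, the function names and the two example vectors are assumptions, not taken from the slides.

```python
def jaccard_similarity(x, y):
    """s(x, y): keywords shared by both vectors / keywords present in either."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Two empty vectors are treated as identical (an assumption).
    return both / either if either else 1.0


def jaccard_dissimilarity(x, y):
    """d(x, y) = 1 - s(x, y)."""
    return 1.0 - jaccard_similarity(x, y)


# Two illustrative binary keyword vectors over six keywords.
doc_a = [1, 0, 1, 0, 1, 1]
doc_b = [1, 0, 0, 1, 0, 1]

print(jaccard_similarity(doc_a, doc_b))     # 2 shared / 5 present
print(jaccard_dissimilarity(doc_a, doc_b))
```

Other coefficients (cosine, Dice, and so on) would fit the same two-function interface.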
17
Documents × Keywords
Cell values: Di × KWj = {1,0} (binary) or Di × KWj = {1, 2, …, n} (counts)

     KW1  KW2  KW3  KW4  KW5  KW6
D1    1    0    1    0    1    1
D2    1    0    1    0    1    1
D3    0    1    0    1    0    0
D4    1    0    0    1    0    1

C1 = ({D1,D2}, {KW1,KW3,KW5,KW6})
C2 = ({D4}, {KW1,KW4,KW6})
C3 = ({D3}, {KW2,KW4})
18
Clustering Algorithms
• Major families of clustering methods:
• Sequential algorithms
• Hierarchical algorithms
– Agglomerative algorithms
– Divisive algorithms
• Fuzzy clustering algorithms
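To make the agglomerative family concrete, here is a minimal sketch that merges the four documents of the preceding document-keyword example bottom-up. Single-link merging, the Jaccard distance and the stopping threshold are all assumptions chosen for illustration; the slides do not say which variant is used.

```python
from itertools import combinations

# Keywords per document, read from the document-keyword example slide.
DOCS = {
    "D1": {"KW1", "KW3", "KW5", "KW6"},
    "D2": {"KW1", "KW3", "KW5", "KW6"},
    "D3": {"KW2", "KW4"},
    "D4": {"KW1", "KW4", "KW6"},
}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two keyword sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def single_link(c1, c2):
    """Distance between clusters = distance of their closest pair of documents."""
    return min(jaccard_distance(DOCS[a], DOCS[b]) for a in c1 for b in c2)


def agglomerate(threshold):
    """Merge the two closest clusters until the closest pair exceeds `threshold`."""
    clusters = [frozenset([name]) for name in sorted(DOCS)]
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        if single_link(c1, c2) > threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


print(agglomerate(threshold=0.5))
```

With this threshold the sketch groups D1 with D2 and leaves D3 and D4 alone, which matches the document sets of C1, C2 and C3 on the example slide. A divisive algorithm would work in the opposite direction, starting from one cluster and splitting it.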
19
Information Analysis Process
• The text-data information analysis is
divided into two phases:
1. Cluster generation
2. Map display of clusters
• A hypertext user interface enables the
analyst to explore and interpret results
20
Example
• Topic: Antibiotic Resistance
• Data: 2 DB, 4025 documents (1998-1999); Medicine, Molecular Biology
• 30 Clusters
• Map
• Hypertext
21
Information Visualization
• Definition : The use of computer-supported,
interactive, visual representation of abstract data to
amplify the acquisition or use of knowledge (Card
et al., 1999)
• Visual artifacts aid human thought
• The progress of civilization can be read in the
invention of visual artifacts, from writing to
mathematics, to maps, to diagrams, to visual
computing
22
Process
• Raw Data → Data Tables
• Data Tables → Clustering
• Clustering → Visual Structures: Map
• Visual Structures → Views
23
Visual Structures
• Data Tables are mapped to Visual Structures,
which augment a spatial substrate with marks and
graphical properties to encode information
• A Graphic Representation is said to be expressive
if all and only the data in the Data Table are also
represented in the Visual Structure
• A Graphic Representation is said to be more
effective if it is faster to interpret
24
Map Display
• We are concerned with map display of the
clusters
• A problem of particular interest is how to
visualize a data set with many variables:
1. Multivariate data are clustered, and
2. Clusters are mapped
25
Mapping tools
• For mapping, we use the following
techniques:
– Density and Centrality Diagrams
– Principal Component Analysis (PCA)
– Multi-Layer Perceptrons (MLP)
– Self-Organizing Maps (SOM)
– Multi-SOMs
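As an illustration of one listed technique, PCA, the sketch below projects the rows of a small binary document-keyword matrix onto their first principal component, giving each row one map coordinate. It is a dependency-free sketch under stated assumptions: the power-iteration method, the function name and the iteration count are choices made here, and the matrix is the four-document example used elsewhere in the deck.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    Uses power iteration on the covariance structure X^T X of the
    mean-centered data; a production system would use a linear-algebra library.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    v = [1.0] * p  # arbitrary starting direction
    for _ in range(iterations):
        # One multiplication by X^T X, done as X^T (X v).
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Rows D1..D4 over keywords KW1..KW6.
MATRIX = [
    [1, 0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
]

direction, coords = first_principal_component(MATRIX)
for name, value in zip(["D1", "D2", "D3", "D4"], coords):
    print(name, round(value, 3))
```

Identical rows (D1 and D2) receive the same coordinate, and D3, which shares no keyword with them, falls on the opposite side of the axis. A two-dimensional map would add the second component in the same way.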
26
Multi-Layer Perceptron 1
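[Figure: multi-layer perceptron with inputs x1 … xi … xp, weights Wcij, written Wc(p,2), weights Wsjk, written Ws(2,p), and outputs s1 … sk … sp; error ISE = ||s − x||². Map labels: prion, proteins, human disease, spongiform encephalopathy, mankind, scrapie, CJD.]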
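The error on this slide, ISE = ||s − x||², compares the network output s with its own input x, and the weight shapes Wc(p,2) and Ws(2,p) indicate a two-unit layer in between. A minimal forward-pass sketch follows; linear activations, the absence of bias terms and the toy weights are assumptions made only for illustration.

```python
def forward(x, w_c, w_s):
    """Map input x (length p) to 2 hidden values, then back to p outputs.

    w_c is a p x 2 matrix (the slide's Wc), w_s is a 2 x p matrix (Ws).
    Linear units are assumed here.
    """
    p = len(x)
    hidden = [sum(x[i] * w_c[i][j] for i in range(p)) for j in range(2)]
    s = [sum(hidden[j] * w_s[j][k] for j in range(2)) for k in range(p)]
    return hidden, s


def squared_error(s, x):
    """ISE = ||s - x||^2."""
    return sum((sk - xk) ** 2 for sk, xk in zip(s, x))


# Hypothetical weights for p = 3: the hidden layer keeps only x1 and x2.
x = [1.0, 0.0, 1.0]
w_c = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

hidden, s = forward(x, w_c, w_s)
print(hidden, s, squared_error(s, x))
```

Under this reading, the two hidden values can serve as the (x, y) position of an item on a map, and training would adjust Wc and Ws to reduce the error.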
27
Multi-Layer Perceptron 2
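[Figure: multi-layer perceptron with an Input Layer (x1 … xp), a First Hidden Layer, a Second Hidden Layer (Cartography) with a Polarizer node, and an Output Layer (y1 … yp); data label C(m,p). Map labels: protein, infection, resistance, Agrobacterium, plasmids.]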
28
Multi-SOM Platform
[Diagram: Raw Data and DB; Processing System with Pre-processing, SOMPACK and Post-processing; MAPS; MULTISOM Java Application; Graphic-Hypertext User Interface.]
29
Multi-Self-Organizing Map
Display
Maps associated with 5 viewpoints:
• Map 1: Plants
• Map 2: Plant Parts
• Map 3: Pathogen Agents
• Map 4: Genetic Techniques
• Map 5: Patenting Firms
[Figure: the five maps, with the Rice area activated. Use of the inter-map communication mechanism.]
30
Knowledge Discovery
• KD is informally defined as the extraction
of useful knowledge from databases or large
amounts of data
• One of the most important research topics in
KD is rule discovery or extraction
• The discovered knowledge is usually
expressed in the form of "if-then" rules
31
Association Rules
• Association rules can be seen as one of the
key tasks of KDD
• The intuitive meaning of an association rule
X → Y, where X and Y are keywords or
descriptors, is: "a document set containing
keyword X is likely to also contain keyword Y"
32
Example
• In a given food-industry corpus:
• "98% of the documents that deal with
apple juice relate it to the
chromatography analytical technique"
• X → Y: "apple juice → chromatography"
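A percentage such as the one quoted above is what is usually called the confidence of a rule: among the documents containing X, the share that also contain Y. The sketch below computes support and confidence. The function names are assumptions, and since the food-industry corpus is not available here, the data is the small document-term relation from the deck's concept-lattice slide.

```python
# Document-term relation copied from the concept-lattice slide of this deck.
CORPUS = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def support(terms):
    """Fraction of documents that contain every term in `terms`."""
    terms = set(terms)
    hits = sum(1 for doc_terms in CORPUS.values() if terms <= doc_terms)
    return hits / len(CORPUS)


def confidence(x, y):
    """Among documents containing X, the fraction that also contain Y."""
    base = support(x)
    return support(set(x) | set(y)) / base if base else 0.0


print(confidence({"t3"}, {"t1"}))  # every document with t3 also has t1
print(confidence({"t4"}, {"t1"}))  # only one of the two documents with t4 has t1
```

A rule would normally be reported only when both its support and its confidence exceed thresholds chosen by the analyst.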
33
The Galois Lattice
• Our current research includes an approach
based on the lattice structure to discover
concepts and rules relating the objects
(documents) and their properties
(keywords)
• The Galois lattice approach is also known
as conceptual clustering
34
The concept lattice
Given the context (D1,T1) where
D1 = {d1,d2,d3,d4} & T1 = {t1,t2,t3,t4,t5,t6}

Table: The input relation R = documents × keywords

R    t1  t2  t3  t4  t5  t6
d1    1   0   1   0   1   1
d2    1   0   1   0   1   1
d3    0   1   0   1   0   0
d4    1   0   0   1   0   1

Hasse Diagram (top to bottom)
C1: (D1, Ø)
C2: ({d1,d2,d4}, {t1,t6})    C3: ({d3,d4}, {t4})
C4: ({d1,d2}, {t1,t3,t5,t6})    C5: ({d4}, {t1,t4,t6})    C6: ({d3}, {t2,t4})
C7: (Ø, T1)

The formal concept C4 has two own terms {t3,t5} and two inherited terms {t1,t6}
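The seven concepts of this slide can be recomputed from the input relation. The sketch below is a brute-force enumeration suited only to very small contexts: it closes every subset of documents and collects the distinct (extent, intent) pairs. The function names are assumptions, and this is not necessarily the algorithm used by the author.

```python
from itertools import combinations

# Input relation R from the slide: terms held by each document.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = frozenset().union(*CONTEXT.values())


def intent(docs):
    """Terms shared by every document in `docs` (all terms if `docs` is empty)."""
    shared = set(ALL_TERMS)
    for d in docs:
        shared &= CONTEXT[d]
    return frozenset(shared)


def extent(terms):
    """Documents that hold every term in `terms`."""
    return frozenset(d for d, held in CONTEXT.items() if terms <= held)


def formal_concepts():
    """All (extent, intent) pairs that are closed under the two maps above."""
    concepts = set()
    names = sorted(CONTEXT)
    for size in range(len(names) + 1):
        for docs in combinations(names, size):
            shared = intent(docs)
            concepts.add((extent(shared), shared))
    return concepts


for ext, itt in sorted(formal_concepts(), key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(ext), sorted(itt))
```

The enumeration visits 2^n document subsets, so larger corpora need a dedicated lattice-construction algorithm.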
35
Association Rules Extraction
• The formal concept C4 yields the following
rules
• R1: t3 → t1, t6
• R2: t5 → t1, t6
• R3: t3 ↔ t5
• Interpretation of R1 and R2: the use of term t3 or
t5 is always associated with that of terms t1 and t6
• The rule R3 expresses the mutual equivalence of the terms
{t3,t5}: all the documents that have the term t3 also have
the term t5, and conversely.
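The rules for C4 can be derived mechanically by splitting its intent into own and inherited terms and then checking each rule against the documents. In this sketch the helper names are assumptions; the intents of C4 and of its parent C2 are copied from the concept-lattice slide.

```python
# Input relation from the concept-lattice slide.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def split_intent(concept_intent, parent_intents):
    """Return (own, inherited): inherited terms already appear in a parent concept."""
    inherited_pool = set().union(*parent_intents)
    own = set(concept_intent) - inherited_pool
    inherited = set(concept_intent) & inherited_pool
    return own, inherited


def holds(antecedent, consequent):
    """True if every document containing the antecedent also contains the consequent."""
    return all(
        set(consequent) <= terms
        for terms in CONTEXT.values()
        if set(antecedent) <= terms
    )


c4_intent = {"t1", "t3", "t5", "t6"}
c2_intent = {"t1", "t6"}  # C2 is the only parent of C4 in the Hasse diagram

own, inherited = split_intent(c4_intent, [c2_intent])

# R1 and R2: each own term implies the inherited terms.
for term in sorted(own):
    print(term, "->", sorted(inherited), holds({term}, inherited))

# R3: the own terms imply each other.
print("t3 <-> t5", holds({"t3"}, {"t5"}) and holds({"t5"}, {"t3"}))
```

The converse of R1 does not hold in this context: d4 contains t1 and t6 but not t3.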
36
Summary
Text Mining
Clustering
Mapping
Knowledge Discovery
37
Knowledge Management
• A knowledge management system is
concerned with the identification,
acquisition, development, diffusion, use,
and preservation of the enterprise’s
knowledge
38
KM Objectives
• Using advanced technology
• For facilitating creation, access, and reuse
of knowledge
• For converting knowledge from the sources
accessible to an organization and
connecting people with that knowledge
39
Project
• Adding to the information analysis
system a formalized operator for
processing together:
– The knowledge that is extracted from
databases
– The knowledge that the experts produce when
they analyze the clusters, maps, concepts and
rules
40
We have reached our last subject,
but not the end !
41
Xavier Polanco
42