Visual Data Mining and Document Collections Visualization
Fernando Vieira Paulovich
[email protected]
Universidade de São Paulo, São Carlos/SP, Brasil
Instituto de Ciências Matemáticas e de Computação (ICMC)
Departamento de Ciências da Computação

Introduction
• Visualization and Data Analysis
• InfoVis2
  – Visualization
  – Sonification
  – Mining
• Partners
  – M. Cristina F. Oliveira
  – Alneu de Andrade Lopes
  – Luis Gustavo Nonato
  – Guilherme P. Telles
  – Haim Levkowitz
- Roberto Pinho
- Lionis Watanabe
- Pedro Vilela

Mining Large Data Sets - Motivation
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4.000.000.]

What is (not) Data Mining?
• What is not Data Mining?
  – Look up phone number in phone directory
  – Query a Web search engine for information about “Amazon”
• What is Data Mining?
  – Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
  – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data
[Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

Data Mining Tasks
• Prediction Methods
  – Use some variables to predict unknown or future values of other variables
• Description Methods
  – Find human-interpretable patterns that describe the data
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]

Data Mining Example: Classification
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram: Training Set → Learn Classifier → Model ← Test Set]

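To make the Training Set → Learn Classifier → Model flow on the classification slide concrete, here is a minimal sketch of a small decision-tree learner in plain Python. The slide does not say which classifier is meant; the Gini-based binary splits, the tie-breaking order, and the names gini, candidate_tests, build and predict are assumptions made only for this illustration.

```python
from collections import Counter

# Training set from the slide: (Refund, Marital Status, Taxable Income in K, Cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Test set from the slide (class unknown)
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    n = len(rows)
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def candidate_tests(rows, col):
    """Equality tests for categorical columns, midpoint thresholds for numeric ones."""
    values = sorted({r[col] for r in rows})
    if isinstance(values[0], str):
        return [(col, "==", v) for v in values]
    return [(col, "<=", (a + b) / 2) for a, b in zip(values, values[1:])]

def passes(row, test):
    col, op, value = test
    return row[col] == value if op == "==" else row[col] <= value

def build(rows):
    """Grow a binary tree until every leaf is pure (no pruning)."""
    labels = Counter(r[-1] for r in rows)
    best = None
    if len(labels) > 1:  # impure node: search for the lowest weighted Gini
        for col in range(len(rows[0]) - 1):
            for test in candidate_tests(rows, col):
                yes = [r for r in rows if passes(r, test)]
                no = [r for r in rows if not passes(r, test)]
                if not yes or not no:
                    continue  # a split must separate the rows
                score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
                if best is None or score < best[0]:
                    best = (score, test, yes, no)
    if best is None:
        return labels.most_common(1)[0][0]  # leaf: majority class
    _, test, yes, no = best
    return (test, build(yes), build(no))

def predict(tree, row):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        test, yes_branch, no_branch = tree
        tree = yes_branch if passes(row, test) else no_branch
    return tree

tree = build(TRAIN)
train_accuracy = sum(predict(tree, r) == r[-1] for r in TRAIN) / len(TRAIN)
predictions = [predict(tree, r) for r in TEST]
print(train_accuracy, predictions)
```

Calling predict(tree, row) on each test record fills in the "?" column. Because the tree is grown without pruning, it reproduces the ten training labels exactly, which says nothing about how well it generalizes.
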
Illustrating Clustering
• Euclidean Distance Based Clustering in 3-D space
  – Intracluster distances are minimized
  – Intercluster distances are maximized

Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection
  – Produce dependency rules which will predict occurrence of an item based on occurrences of other items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

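The two rules on the association slide can be checked against the five transactions with a few lines of Python. The slide does not name the measures used to select rules; support and confidence, as computed below, are the usual ones and are an assumption here, as are the function names.

```python
# The five transactions from the slide, one set of items per TID.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)

def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs | rhs) / len(TRANSACTIONS)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```

{Milk} --> {Coke} holds in 3 of the 5 transactions (support 0.6) and in 3 of the 4 that contain Milk (confidence 0.75); {Diaper, Milk} --> {Beer} holds in 2 of 5 (support 0.4) and in 2 of the 3 that contain both Diaper and Milk.
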
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications
  – Credit Card Fraud Detection
  – Network Intrusion Detection

Visualization
• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  – Humans have a well developed ability to analyze large amounts of information that is presented visually
  – Can detect general patterns and trends
  – Can detect outliers and unusual patterns

Example: Sea Surface Temperature
• The following shows the Sea Surface Temperature (SST) for July 1982
  – Tens of thousands of data points are summarized in a single figure
[Figure: Sea Surface Temperature (SST) for July 1982]

Iris Sample Data Set
• Many of the exploratory data techniques are illustrated with the Iris Plant data set.
  – Can be obtained from the UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html
  – Three flower types (classes):
    • Setosa
    • Virginica
    • Versicolour
  – Four (non-class) attributes
    • Sepal width and length
    • Petal width and length

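Visualization of the Iris Data Matrix
[Figure: Iris data matrix; legend label: Standard deviation]

Scatter Plot Array of Iris Attributes
[Figure: scatter plot array of the Iris attributes]

Visualization of the Iris Correlation Matrix
[Figure: Iris correlation matrix; legend label: Correlation]

Parallel Coordinates Plots for Iris Data
[Figure: parallel coordinates plots of the Iris data]
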
Visualizing Text Collections
• Large and high-dimensional data sets
• Dimension given by the terms in the collection
• Multidimensional Projection Technique
  – Proximity by similarity (metrics)

Projection Explorer Tool

Process Overview

Text Pre-Processing
• Text pre-processing involves
  1. Stopword elimination
  2. Extraction of word stems (stemming)
  3. Creation of n-grams
  4. Frequency count and Luhn’s lower cut (n-grams appearing fewer than x times are ignored)
  5. Weighting process (term frequency-inverse document frequency, tfidf)

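The five pre-processing steps can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the stopword list is a toy one, stem is a crude suffix stripper standing in for a real stemmer, the lower cut is applied to the total count over the whole collection, the weighting is taken to be freq(t, d) × log(n / dfreq(t)) with n the number of documents, and all function names are hypothetical rather than those of the Projection Explorer tool.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in", "is", "are", "to"}  # toy list

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def terms(text, max_n=2):
    """Steps 1-3: drop stopwords, stem, then emit all 1-grams up to max_n-grams."""
    words = [stem(w) for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

def tfidf_matrix(docs, lower_cut=2):
    """Steps 4-5: count terms, apply the lower cut, weight by freq * log(n / dfreq)."""
    counts = [Counter(terms(d)) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    # Lower cut: ignore terms seen fewer than lower_cut times in the collection.
    vocab = sorted(t for t, f in total.items() if f >= lower_cut)
    dfreq = {t: sum(1 for c in counts if t in c) for t in vocab}
    n = len(docs)
    rows = [[c[t] * math.log(n / dfreq[t]) for t in vocab] for c in counts]
    return vocab, rows

docs = [
    "Mining large data sets",
    "Visual data mining of documents",
    "Visualizing document collections",
]
vocab, rows = tfidf_matrix(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Each row of rows is one document and each column one surviving term, which is the shape of a documents x terms matrix. With only three short documents every bigram occurs once, so the lower cut of 2 removes all of them; on a real collection the cut value would be tuned.
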
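Example of Documents x Terms Matrix
      T1   T2   T3   T4   T5   T6   T7   T8   ...  Tm
Doc1  0.2  0.1  0.0  0.5  0.0  0.0  0.1  0.5  ...  0.1
Doc2  0.4  0.3  0.0  0.0  0.0  0.4  0.3  0.7  ...  0.5
Doc3  0.8  0.5  0.0  0.4  0.3  0.0  0.0  0.0  ...  0.0
...   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
Docn  0.4  0.0  0.0  0.0  0.3  0.7  0.0  0.5  ...  0.1

\[ \mathrm{tfidf}(t_i, d_j) = \mathrm{freq}(t_i, d_j) \times \log\left(\frac{n}{\mathrm{dfreq}(t_i)}\right) \]
where n is the number of documents and dfreq(t_i) is the number of documents that contain term t_i.

Projection Technique
[Diagram: X ⊂ R^n is mapped by α to P ⊂ R^2]
\[ \alpha: X \to P, \quad |d(x_i, x_j) - d_2(\alpha(x_i), \alpha(x_j))| \approx 0, \quad \forall x_i, x_j \in X \]
\[ d: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}, \qquad d_2: \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R} \]
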
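The condition |d(x_i, x_j) – d2(α(x_i), α(x_j))| ≈ 0 on the projection slide can be turned into a number that measures how well a given layout preserves distances. The sketch below averages that absolute difference over all pairs; the choice of the mean as the summary, the use of Euclidean distance for both d and d2, and the function names are assumptions for illustration only.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def projection_error(X, P, d=euclidean, d2=euclidean):
    """Mean of |d(x_i, x_j) - d2(p_i, p_j)| over all pairs i < j, with p_i = alpha(x_i)."""
    pairs = list(combinations(range(len(X)), 2))
    return sum(abs(d(X[i], X[j]) - d2(P[i], P[j])) for i, j in pairs) / len(pairs)

# Three points in 3-D and two candidate 2-D layouts.
X = [(0, 0, 0), (3, 4, 0), (0, 0, 5)]
P_good = [(0, 0), (5, 0), (0, 5)]   # keeps all three pairwise distances
P_drop = [(0, 0), (3, 4), (0, 0)]   # simply drops the third coordinate

print(projection_error(X, P_good))
print(projection_error(X, P_drop))
```

The first layout preserves every pairwise distance, so its error is zero. Dropping a coordinate collapses the first and third points onto each other and the error becomes clearly positive, which is the kind of distortion a projection technique tries to keep small.
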
Projection Technique (Force-Based Placement)
• Data instances are treated as a system obeying Newton's laws
  \[ f = m \cdot a \]
  \[ a = p'' \;\Rightarrow\; p'' = f / m \]
  \[ v' = a = f / m, \qquad p' = v \]
• Data instances connected through springs
  \[ f = -k_s \, (|d| - s) \, \frac{d}{|d|} \]

Projection Techniques
• Projection techniques for multidimensional data
  – Interactive Document Map (IDMAP)
  – Projection by Clustering (ProjClus)
  – Least-Square Projection (LSP)

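The spring model on the force-based placement slide, f = −ks (|d| − s) d/|d| integrated through v' = f / m and p' = v, can be sketched as a simple simulation. Several details below are assumptions not stated on the slide: every pair of points is joined by a spring, the rest length s of each spring is a target distance supplied by the caller (for a projection this would typically be the distance in the original space), a damping factor is added so that the system settles, and the function names and default parameter values are illustrative.

```python
import math

def step(P, V, rest, ks=1.0, mass=1.0, dt=0.05, damping=0.9):
    """One Euler step of v' = f / m, p' = v for 2-D points joined by springs."""
    n = len(P)
    F = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx, dy = P[i][0] - P[j][0], P[i][1] - P[j][1]   # d = p_i - p_j
            length = math.hypot(dx, dy) or 1e-9             # avoid division by zero
            mag = -ks * (length - rest[i][j])               # -ks (|d| - s)
            F[i][0] += mag * dx / length
            F[i][1] += mag * dy / length
    for i in range(n):
        for k in range(2):
            # Damping (an added assumption) removes energy so the layout comes to rest.
            V[i][k] = damping * (V[i][k] + dt * F[i][k] / mass)
            P[i][k] += dt * V[i][k]

def place(P, rest, steps=500, **params):
    """Run the simulation from the initial positions P and return the final layout."""
    V = [[0.0, 0.0] for _ in P]
    for _ in range(steps):
        step(P, V, rest, **params)
    return P

# Target pairwise distances of a 3-4-5 triangle, starting from a small layout.
rest = [[0.0, 3.0, 4.0],
        [3.0, 0.0, 5.0],
        [4.0, 5.0, 0.0]]
layout = place([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], rest)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(i, j, math.hypot(layout[i][0] - layout[j][0], layout[i][1] - layout[j][1]))
```

Each spring pushes two points apart when they are closer than its rest length and pulls them together when they are farther, so the printed distances can be compared directly with the entries of rest. Evaluating all pairs costs quadratic time per step, which is one reason faster projection techniques are of interest for large collections.
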