Lecture 2: VIS - information visualization and data mining

Parallel coordinates
[Figure: parallel-coordinates plot with axes Tax rates, Population, House price, Birth-rate. Example records drawn as polylines: (23, 34000, 2300000, 27), (28, 12000, 1900000, 25), …]
Positive correlation
No obvious correlation
Negative correlation
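A minimal sketch of how a record might be turned into a polyline in a parallel-coordinates plot, using the two example records from the slides. The function names and the choice to scale each axis to the minimum and maximum found in the data are assumptions made for illustration.

```python
# Axes and example records as they appear on the slides.
AXES = ["Tax rates", "Population", "House price", "Birth-rate"]
RECORDS = [
    (23, 34000, 2300000, 27),
    (28, 12000, 1900000, 25),
]

def axis_ranges(records):
    """Return (min, max) per axis; assumed scaling, not prescribed by the slides."""
    return [(min(column), max(column)) for column in zip(*records)]

def to_polyline(record, ranges):
    """Map a record to one vertex per axis: x = axis index, y = value scaled to [0, 1]."""
    points = []
    for i, (value, (lo, hi)) in enumerate(zip(record, ranges)):
        y = 0.5 if hi == lo else (value - lo) / (hi - lo)
        points.append((i, y))
    return points

ranges = axis_ranges(RECORDS)
polylines = [to_polyline(record, ranges) for record in RECORDS]
```

With only these two records the line segments between Tax rates and Population cross, which is the visual signature usually read as negative correlation between adjacent axes; roughly parallel segments suggest positive correlation.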
Table Lens
[Figure: Table Lens view with columns MPG, Horsepower, Weight, Acceleration, Cylinders, Year]
Mosaic plot
[Figure: mosaic plots of the Titanic data, split successively by class (1st, 2nd, 3rd, Crew), age (Child, Adult) and sex (Female / Male)]
3D Representations
• Use 3D wisely
• More dimensions do not mean that more
information is simultaneously displayed
Presentation
Space Limitations
• Scrolling
• Overview + Detail
• Zoom and Pan
Scrolling
Overview + detail
• Focus+context
• No information is hidden
• Micro / macro readings
Distortion
Perspective wall
[Figure: tabular view with distortion applied; two rows are expanded to show their values]
MPG  Horsepower  Weight  Acceleration  Cylinders  Year
14   150         4532    135           8          72
11   132         4821    110           6          71
Zoom and pan
[Figure panels: Geometric zoom; Geometric and semantic zoom]
Interaction
Interaction techniques
• Brushing
• A collection of techniques to dynamically
query and directly select elements in visual
representations
[Figure: brushing applied to a view with columns MPG, Horsepower, Weight, Acceleration, Cylinders, Year]
• Details on demand
[Figure: plot with axes Weight and Price; a pop-up shows Model = Saab, Boot = large, Cylinders = 4]
• Coordinated and multiple views (CMV)
• An action in one view is immediately
propagated to all other views
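The propagation described above can be sketched with a shared selection model that every view subscribes to. This is only one possible design; the class names `SelectionModel` and `View` and the view names are hypothetical and not taken from the lecture.

```python
class SelectionModel:
    """Holds the brushed item ids and notifies every subscribed view."""

    def __init__(self):
        self._listeners = []
        self.selected = frozenset()

    def subscribe(self, callback):
        self._listeners.append(callback)

    def brush(self, item_ids):
        # A brush in any view replaces the shared selection ...
        self.selected = frozenset(item_ids)
        # ... and the change is pushed to all views immediately.
        for callback in self._listeners:
            callback(self.selected)


class View:
    """Stand-in for a visual representation that highlights selected items."""

    def __init__(self, name, model):
        self.name = name
        self.highlighted = frozenset()
        model.subscribe(self.on_selection)

    def on_selection(self, selected):
        self.highlighted = selected


model = SelectionModel()
parallel_view = View("parallel coordinates", model)
table_view = View("table lens", model)
model.brush({3, 7, 9})  # e.g. the user drags a brush in one view
```

After the call to `brush`, both views hold the same highlighted set, regardless of which view the action started in.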
Demo
http://setebos.svt.ntnu.no/tomasz/gallery/Vul16/
Analysis of (very) Large Data
Data Mining
• Having an (enormous) amount of data
‣ Wonder what it can tell us
‣ Isolate (unexpected) relationships
‣ (Hopefully) find some which are
- Interesting
- Novel
- Informative
Data Mining:
• Extraction of interesting (non-trivial,
previously unknown and potentially useful)
information or patterns from data in
((very) large) databases
Data Mining and Visualization
• Data mining provides complex representations
• Fits (optimizes) them to the data
• Then visualize the data mining results.
Visual Data Mining
[Figure: process diagram with the labels Database(s), Data Cleaning, Data Warehouse, Selection, Relevant Data, Data Mining, InfoViz, Possible patterns, New Knowledge!]
Problems with Data
• Holes - Missing data values
• Errors and ‘estimates’
‣ Income of *exactly* 100000?
• Sample inconsistencies
‣ e.g. medical records with different
numbers of readings for the same person
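A small sketch of how the three kinds of problem listed above might be flagged before mining. The records, field names and the "multiple of 10000" rule for spotting estimates are all made up for illustration; real checks depend on the data set.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"person": "A", "income": 100000, "reading": 5.1},
    {"person": "A", "income": 100000, "reading": 5.3},
    {"person": "B", "income": 48250, "reading": None},
    {"person": "C", "income": None, "reading": 4.8},
]

def missing_fields(record):
    """Holes: names of the fields that have no value."""
    return sorted(key for key, value in record.items() if value is None)

def looks_estimated(value, step=10000):
    """Errors and 'estimates': a suspiciously round number (assumed rule)."""
    return value is not None and value % step == 0

holes = {i: missing_fields(r) for i, r in enumerate(records) if missing_fields(r)}
round_incomes = [i for i, r in enumerate(records) if looks_estimated(r["income"])]

# Sample inconsistencies: number of usable readings per person.
readings_per_person = Counter(
    r["person"] for r in records if r["reading"] is not None
)
```

Flagged rows are only candidates for inspection; an income of exactly 100000 may still be genuine.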
Data Mining Tasks
1. Exploratory Data Analysis
2. Descriptive Modelling
3. Predictive Modelling
4. Discovering Patterns and Rules
5. Retrieval by content
Exploratory analysis
• Pure data mining
• “Explore the data with no clear idea of what
we are looking for”
• Typically very visual approach
‣ Very tied to ‘Visual Data Mining’
• Problems with:
‣ Large number of data points
‣ Large numbers of dimensions in data
Descriptive Modelling
• Attempt to describe all of the data
• Perhaps use:
‣ Model of overall probability distribution
in the p-dimensional space
‣ Partitioning into groups e.g.:
- Cluster analysis for natural grouping
- Segmentation for user-desired groups
Predictive Modelling
• Form a model of the data set which allows
prediction of a variable based on the known
values of the others
• Classification
‣ Prediction of a discrete variable
• Regression analysis
‣ Prediction of a continuous variable
• (Prediction does not mean future here)
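To make the classification / regression distinction concrete, here is a sketch that predicts an unknown variable from the known ones by copying the value of the nearest known item. Nearest-neighbour prediction is used here only as a simple stand-in; the lecture does not name a specific method, and the data are invented.

```python
import math

def predict(known, query):
    """Return the target of the known item whose features are closest to query."""
    features, target = min(known, key=lambda item: math.dist(item[0], query))
    return target

# Classification: the predicted variable is discrete (a label).
labelled = [((1.0, 1.0), "small"), ((8.0, 9.0), "large")]
predicted_class = predict(labelled, (2.0, 1.5))

# Regression: the predicted variable is continuous (a number).
valued = [((1.0, 1.0), 10.0), ((8.0, 9.0), 95.0)]
predicted_value = predict(valued, (7.0, 8.0))
```

Nothing here refers to time: the "prediction" fills in a value that is unknown for an item, not a value in the future.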
Discovering Rules and Patterns
• Concerned with the identification of local
patterns in sub-sets of the space
• Examples:
‣ Frequently occurring sets of transactions
‣ Finding patterns of action indicating fraud
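As an illustration of a local pattern, the sketch below counts which pairs of items occur together in at least `min_support` transactions. The transactions and the threshold are hypothetical, and counting only pairs is a simplification of frequent-set mining.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

def frequent_pairs(transactions, min_support):
    """Return the item pairs that appear together in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # sorted() gives each pair one canonical order, so (a, b) and (b, a) are not counted apart.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(transactions, min_support=3)
```

The result describes only the subset of transactions containing the pair, which is what distinguishes a local pattern from a model of the whole data set.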
Retrieval by Content
• Using a pattern of interest to locate similar
patterns
• Examples: Automatically…
‣ Finding images with similar content
- Face recognition at airports
‣ Finding text documents with similar content
- e.g. Urkund
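A minimal sketch of retrieval by content for text: the query is the pattern of interest, and the document with the highest cosine similarity between word-count vectors is returned. The documents are invented, and this bag-of-words approach is an assumption for illustration, not a description of how any named system works.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text by its lower-cased word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def most_similar(query, documents):
    """Return the document whose content is most similar to the query."""
    q = bag_of_words(query)
    return max(documents, key=lambda doc: cosine(q, bag_of_words(doc)))

documents = [
    "clustering groups similar data items",
    "parallel coordinates show many dimensions",
    "zoom and pan handle space limitations",
]
best = most_similar("similar data items form a cluster", documents)
```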
Scoring functions
• All of the preceding classes of task share a
common feature:
‣ The notion of “is like” or “similarity”
- Or difference (dissimilarity)
‣ Defined through a ‘scoring function’
• In numerical data this is often easy
• In general it is not…
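For numerical data, one common scoring function is the Euclidean distance between two records. The sketch below applies it to the two example records from the parallel-coordinates slides; treating all four attributes with equal weight is an assumption.

```python
import math

# (Tax rates, Population, House price, Birth-rate), values from the slides.
a = (23, 34000, 2300000, 27)
b = (28, 12000, 1900000, 25)

def euclidean(x, y):
    """Dissimilarity of two numerical records: 0 means identical."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

raw_distance = euclidean(a, b)

# Contribution of the House price attribute alone.
house_price_only = abs(a[2] - b[2])
```

Even in the "easy" numerical case the result depends on scale: here the distance is almost entirely determined by House price, so attributes are usually rescaled before being compared.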
Scoring Functions
• Is an orange like an apple?
• Yes:
‣ Both are fruit.
‣ Both grow on trees.
• No:
‣ One is citrus, one isn’t.
‣ One is orange, one is green/red
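The apple / orange question can be sketched as a scoring function over attribute sets. Jaccard similarity (shared attributes divided by all attributes) is one possible choice, assumed here for illustration; the attribute names are taken from the bullets above.

```python
def jaccard(a, b):
    """Similarity of two attribute sets: 1.0 = identical, 0.0 = nothing shared."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

orange = {"fruit", "grows on trees", "citrus", "orange"}
apple = {"fruit", "grows on trees", "green/red"}

# Score using every attribute listed on the slide.
all_attributes = jaccard(orange, apple)

# Score when the user decides only kind and habitat matter.
relevant = {"fruit", "grows on trees"}
chosen_attributes = jaccard(orange & relevant, apple & relevant)
```

The same two items score 0.4 or 1.0 depending on which attributes are counted, so the answer to "is an orange like an apple?" is set by the scoring function, not by the data alone.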
Scoring Functions
• Specification of the scoring function(s) is
crucial to the effectiveness of the system.
• One of the biggest contributions the user
has to make!
DM for Vis
• Modelling, Patterns and Rules are valid filters
for mapping
• Simplification of data - modelling
• Extraction of interesting features:
‣ Patterns, Rules
• Form valid representations for data features
Sampling
• Take K items to be a representative set of M
items
• Data abstraction
• Many ways of doing this
‣ Random
‣ Systematic
‣ Density-based
‣ …
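Two of the sampling strategies listed above, sketched for a list of M items. The function names and the fixed seed are assumptions; density-based sampling is not shown because it needs a density estimate that the slides do not specify.

```python
import random

def random_sample(items, k, seed=0):
    """Random sampling: pick k distinct items uniformly at random."""
    return random.Random(seed).sample(items, k)

def systematic_sample(items, k):
    """Systematic sampling: pick k items at evenly spaced positions."""
    step = len(items) / k
    return [items[int(i * step)] for i in range(k)]

items = list(range(100))          # M = 100 items
by_chance = random_sample(items, 5)
by_position = systematic_sample(items, 5)
```

Both functions assume 0 < K <= M. Systematic sampling is deterministic and preserves the order of the data, which can matter if the items are sorted by some attribute.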
Cluster Analysis (Descriptive Modelling)
• Cluster: a collection of data items
‣ Similar to one another within the same
cluster
‣ Different from the items in other clusters
• Cluster analysis
‣ Grouping sets of data items into clusters
‣ Data abstraction
‣ Automatically
Major clustering approaches
• There are a number of approaches
‣ We will consider just one
• K-Means algorithm:
‣ Given a value k, find a partition of k
clusters that minimizes the total intracluster variance
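Written out, the quantity being minimised is usually the sum of squared distances from each item to the mean of its cluster. The symbols below are assumptions, since the slides give no notation: C_j is the j-th cluster, x a data item, and mu_j the centroid of C_j.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```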
K-means, example with K=3
K-means Method
1. Place K points into the space represented by the
items that are being clustered
- These points represent initial group centroids
2. Assign each data item to the group that has the
closest centroid
3. When all items have been assigned, recalculate the
positions of the K centroids
4. Repeat Steps 2 and 3 until the centroids no longer
move
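The four steps above can be sketched as follows. Choosing the initial centroids by sampling K of the data items, the fixed seed, the iteration cap and the handling of empty clusters are assumptions; the slides only require that K points be placed in the data space.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of equal-length tuples into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Step 1: place K initial centroids (assumed: K randomly chosen data items).
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each item to the group with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: recalculate each centroid as the mean of its items.
        # An empty cluster keeps its old centroid (assumption; not covered by the slides).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
centroids, clusters = kmeans(points, k=3)
```

The result depends on the initial centroids: the method converges to a local minimum of the intracluster variance, so different seeds can give different partitions.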
SMART Series: Sketch-based Matching
through Approximated Ratios in Time Series
Searching for all possible patterns in a time series is a
computationally complex problem.
DEMO