What Data Mining
Methods May Help
Bio-Informatics?
Jiawei Han
Database Systems Research Lab
Department of Computer Science
University of Illinois at Urbana-Champaign, U.S.A.
http://www.cs.uiuc.edu/~hanj
May 22, 2017
Data Mining & Bio-Informatics
1
Bio-informatics and Data Mining


Data mining: search for or discovery of patterns and
knowledge hidden in data
Biomedical/DNA data mining



Biological data is abundant and information rich
(e.g., gene chips, bio-testing data)
It is critical to find correlations, linkages between
disease and gene sequences, classification,
clustering, outliers, etc.
Lots of challenges and new techniques can be
developed: A field yet to be explored
Biomedical Data Mining and DNA
Analysis





DNA sequences
 Four basic building blocks (nucleotides): adenine (A),
cytosine (C), guanine (G), and thymine (T).
Gene: a sequence of hundreds of individual nucleotides
arranged in a particular order
Humans have roughly 20,000 to 30,000 genes (estimates have been
revised downward since the first genome drafts)
Tremendous number of ways that the nucleotides can be
ordered and sequenced to form distinct genes
DNA micro-arrays and protein arrays have accumulated
tremendous amount of data related to patients and
diseases
What Data Mining Methods May
Help Bio-Informatics?










Semantic integration of
heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and structured pattern mining
Advanced classification techniques
Cluster analysis and outlier detection
Multi-dimensional data mining environments
Visual data mining
Invisible data mining
Semantic Integration of Heterogeneous,
Distributed Genome Databases




Current situation—highly distributed,
uncontrolled generation and use of a wide
variety of DNA data
Semantic integration of different genome
databases—a critical task
It is highly desirable to build Web-based,
integrated, multi-dimensional genome
databases
Data cleaning and data integration methods
developed in data mining/data warehousing will
help
What Data Mining Methods May
Help Bio-Informatics?










Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast
and beyond
Similarity search in genome databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and structured pattern mining
Advanced classification techniques
Cluster analysis and outlier detection
Multi-dimensional data mining environments
Visual data mining
Invisible data mining
Discovery and Comparison of
DNA Sequences

Finding tandem repeats

Fault-tolerant sequential patterns (Is Blast enough?)
CACAC CACAC CACAC CACAC AC

Similarity search and comparison among DNA sequences


Compare the frequently occurring patterns of each class (e.g.,
diseased and healthy)
Query-based: Identify gene sequence patterns that play roles in
various diseases
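As a rough illustration of the tandem-repeat bullet above, the sketch below finds exact tandem repeats with a regular expression. The function name `find_tandem_repeats` and its parameters `max_unit` and `min_copies` are hypothetical, chosen only for this example; it handles exact repeats only, so fault-tolerant matching would need more than this.

```python
import re

def find_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats in a DNA string."""
    hits = []
    for unit_len in range(1, max_unit + 1):
        # One unit of unit_len bases followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(seq):
            hits.append((m.start(), m.group(1), len(m.group(0)) // unit_len))
    return hits

# The sequence from the slide, with the spaces removed.
slide_seq = "CACACCACACCACACCACACAC"
repeats = find_tandem_repeats(slide_seq, max_unit=5, min_copies=4)
```

Here the unit CACAC is reported with four copies starting at offset 0; the trailing AC is left over because it is not a full copy.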
What Data Mining Methods May
Help Bio-Informatics?










Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome
databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and structured pattern mining
Advanced classification techniques
Cluster analysis and outlier detection
Multi-dimensional data mining environments
Visual data mining
Invisible data mining
Similarity Search in Multimedia Data

Description-based retrieval systems


Build indices and perform object retrieval based on
image descriptions, such as keywords, captions, size,
and time of creation

Labor-intensive if performed manually

Results are typically of poor quality if automated
Content-based retrieval systems

Support retrieval based on the image content, such
as color histogram, texture, shape, objects, and
wavelet transforms
Approaches Based on Image
Signature


Color histogram-based signature
 The signature includes color histograms based on color
composition of an image regardless of its scale or
orientation
 No information about shape, location, or texture
 Two images with similar color composition may contain
very different shapes or textures, and thus could be
completely unrelated in semantics
Multifeature composed signature
 Define different distance functions for color, shape,
location, and texture, and subsequently combine them
to derive the overall result.
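A minimal sketch of the two signature ideas above, assuming an image is given as a plain list of (r, g, b) pixels. The bin count, the L1 distance, and the weighted sum in `combined_distance` are illustrative choices, not the method of any particular retrieval system.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized joint RGB histogram; assumes bins_per_channel divides 256."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    return [h / len(pixels) for h in hist]

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms, between 0 and 2."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def combined_distance(distances, weights):
    """Weighted sum of per-feature distances (color, shape, location, texture)."""
    return sum(weights[name] * d for name, d in distances.items())

red_block = [(250, 10, 10)] * 4
red_mixed = [(250, 10, 10)] * 2 + [(240, 5, 20)] * 2   # same colors, any layout
blue_block = [(10, 10, 250)] * 4

d_same = histogram_distance(color_histogram(red_block), color_histogram(red_mixed))
d_diff = histogram_distance(color_histogram(red_block), color_histogram(blue_block))
```

The histogram ignores where pixels sit, so `d_same` is zero for any two images with the same color composition, which is exactly the weakness the slide points out.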
One Signature for the Entire Image?



WALRUS: [NRS99] by Natsev, Rastogi, and Shim
Similar images may contain similar regions, but a region
in one image could be a translation or scaling of a
matching region in the other
Wavelet-based signature with region-based granularity
 Define regions by clustering signatures of windows of
varying sizes within the image
 Signature of a region is the centroid of the cluster
 Similarity is defined in terms of the fraction of the area
of the two images covered by matching pairs of
regions from two images
Similarity Search in Time-Series Analysis




Normal database query finds exact match
Similarity search finds data sequences that differ only
slightly from the given query sequence
Two categories of similarity queries
 Whole matching: find sequences that are similar, as a whole, to the
query sequence
 Subsequence matching: find sequences that contain subsequences
similar to the query sequence
Typical Applications
 Financial market
 Market basket data analysis
 Scientific databases
 Medical diagnosis
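The two query categories for time-series similarity search can be sketched as follows, assuming plain Euclidean distance and a user-chosen tolerance `eps`. Both are assumptions for illustration; practical systems usually normalize the series and use an index rather than a linear scan.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def whole_matching(query, sequences, eps):
    """Indices of equal-length sequences within distance eps of the query."""
    return [i for i, s in enumerate(sequences)
            if len(s) == len(query) and euclidean(query, s) <= eps]

def subsequence_matching(query, series, eps):
    """Offsets where a window of a longer series is within eps of the query."""
    n = len(query)
    return [k for k in range(len(series) - n + 1)
            if euclidean(query, series[k:k + n]) <= eps]

query = [1, 2, 3]
whole_hits = whole_matching(query, [[1, 2, 3.1], [5, 5, 5], [1, 2]], eps=0.5)
window_hits = subsequence_matching(query, [0, 1, 2, 3, 0, 1, 2, 3.2], eps=0.5)
```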
Similar time series analysis
Similar time series analysis
VanEck International Fund
Fidelity Selective Precious Metal and Mineral Fund
Two similar mutual funds in different fund groups
What Data Mining Methods May
Help Bio-Informatics?










Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome databases
Association, correlation, and linkage
analysis
Fault-tolerant sequential and structured pattern mining
Advanced classification techniques
Cluster analysis and outlier detection
Multi-dimensional data mining environments
Visual data mining
Invisible data mining
Rule Measures: Support and Confidence
Find all the rules X & Y ⇒ Z with minimum confidence and support
 support, s: probability that a transaction contains {X ∪ Y ∪ Z}
 confidence, c: conditional probability that a transaction
having {X ∪ Y} also contains Z
Transaction ID    Items Bought
2000              A,B,C
1000              A,C
4000              A,D
5000              B,E,F
Let minimum support be 50% and minimum confidence be 50%; we have
 A ⇒ C (50%, 66.6%)
 C ⇒ A (50%, 100%)
[Figure: Venn diagram of "Customer buys beer" and "Customer buys diaper", with the overlap labelled "Customer buys both"]
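The support and confidence figures for the four-transaction table can be checked with a short sketch. The function names are illustrative; the data is the table from the slide.

```python
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """support(lhs and rhs together) divided by support(lhs)."""
    return support(lhs | rhs) / support(lhs)

s_ac = support({"A", "C"})          # 2 of 4 transactions
c_a_to_c = confidence({"A"}, {"C"})  # 2 of the 3 transactions containing A
c_c_to_a = confidence({"C"}, {"A"})  # 2 of the 2 transactions containing C
```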
Association Rule Mining: A Road Map




Boolean vs. quantitative associations (based on the types of values
handled)
 buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”)
[0.2%, 60%]
 age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
Single-dimensional vs. multi-dimensional associations (see examples above)
Single-level vs. multiple-level analysis
 What brands of beers are associated with what brands of diapers?
Various extensions
 Correlation, causality analysis
Association does not necessarily imply correlation or causality
 Maxpatterns and closed itemsets
 Constraints enforced
E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
Construct FP-tree from a Transaction DB
TID    Items bought                (ordered) frequent items
100    {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200    {a, b, c, f, l, m, o}       {f, c, a, b, m}
300    {b, f, h, j, o}             {f, b}
400    {b, c, k, s, p}             {c, b, p}
500    {a, f, c, e, l, p, m, n}    {f, c, a, m, p}
min_support = 0.5
Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
Header Table
Item    frequency
f       4
c       4
a       3
b       3
m       3
p       3
FP-tree (each node is item:count; indentation shows parent and child; the head pointers of the header table link nodes of the same item)
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
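The three construction steps can be sketched in a few lines. This is an illustrative implementation, not the authors' code: the class `FPNode`, the function `build_fp_tree`, and the explicit `tie_break` argument (used to reproduce the header-table order f, c, a, b, m, p when counts are equal) are all assumptions, and the header-table node links are omitted.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count, tie_break):
    # Step 1: scan once, count items, keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Step 2: descending frequency; ties follow tie_break, which must list
    # every frequent item (an assumption made to match the slide's order).
    rank = {i: (-c, tie_break.index(i)) for i, c in frequent.items()}
    # Step 3: scan again, inserting each ordered transaction as a path.
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in frequent), key=rank.get):
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, frequent

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
# min_support = 0.5 over 5 transactions means a count of at least 3.
root, frequent = build_fp_tree(db, min_count=3, tie_break="fcabmp")
```

Shared prefixes such as f, c, a are stored once with accumulated counts, which is what makes the tree more compact than the original database.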
Classification of Constraints
[Figure: diagram relating the constraint classes: monotone, antimonotone, succinct, strongly convertible, convertible anti-monotone, convertible monotone, and inconvertible]
Association and Path Analysis in BioMedical and DNA Data Mining



Association analysis: identification of co-occurring gene
sequences
 Most diseases are not triggered by a single gene but
by a combination of genes acting together
 Association analysis may help determine the kinds of
genes that are likely to co-occur together in target
samples
Path analysis: linking genes to different disease
development stages
 Different genes may become active at different stages
of the disease
 Develop pharmaceutical interventions that target the
different stages separately
Visualization tools and genetic data analysis
What Data Mining Methods May
Help Bio-Informatics?










Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and
structured pattern mining
Advanced classification techniques
Cluster analysis and outlier detection
Multi-dimensional data mining environments
Visual data mining
Invisible data mining
What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set
of frequent subsequences
A sequence database
SID    sequence
10     <a(abc)(ac)d(cf)>
20     <(ad)c(bc)(ae)>
30     <(ef)(ab)(df)cb>
40     <eg(af)cbc>
A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items.
Items within an element are unordered
and we list them alphabetically.
<a(bc)dc> is a subsequence
of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern
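The subsequence test and the support count above can be reproduced with a small sketch. Sequences are modelled as lists of item sets; the helper names `seq`, `is_subsequence`, and `support` are illustrative only.

```python
def seq(*elements):
    """Build a sequence from strings, e.g. seq("a", "bc") for <a(bc)>."""
    return [set(e) for e in elements]

def is_subsequence(pattern, sequence):
    """True if each pattern element is a subset of a distinct, later element."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest element of the sequence that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for s in database.values() if is_subsequence(pattern, s))

database = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}
contained = is_subsequence(seq("a", "bc", "d", "c"), database[10])
sup_ab_c = support(seq("ab", "c"), database)
```

The pattern <(ab)c> is contained in sequences 10 and 30 only, which gives the support of 2 stated on the slide.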
Pair-wise Checking Using S-matrix
SDB
SID    sequence
10     <a(abc)(ac)d(cf)>
20     <(ad)c(bc)(ae)>
30     <(ef)(ab)(df)cb>
40     <eg(af)cbc>
Length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
S-matrix
     a          b          c          d          e          f
a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1
Reading the matrix: the diagonal entry 2 for a means <aa> happens twice; the entry (4, 2, 1) in row c, column a means <ac> happens 4 times, <ca> happens twice, and <(ac)> happens once
All length-2 sequential patterns are found in S-matrix
Constraint-Based Sequential Pattern Mining


Constraint-based sequential pattern mining

Constraints: User-specified, for focused mining of desired patterns

How to explore efficient mining with constraints? — Optimization
Classification of constraints

Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10

Monotone: E.g., count (S) > 5, S ⊇ {PC, digital_camera}

Succinct: E.g., length(S) ≥ 10, S ⊆ {Pentium, MS/Office, MS/Money}


Convertible: E.g., value_avg(S) < 25, profit_sum (S) > 160,
max(S)/avg(S) < 2, median(S) – min(S) > 5
Inconvertible: E.g., avg(S) – median(S) = 0
From Sequential Patterns to Structured
Patterns

Sets, sequences, trees and other structures
Transaction DB: Sets of items
{{i1, i2, …, im}, …}
Seq. DB: Sequences of sets:
{<{i1, i2}, …, {im, in, ik}>, …}
Sets of Sequences:
{{<i1, i2>, …, <im, in, ik>}, …}
Sets of trees (each element being a tree):
{t1, t2, …, tn}
Applications: Mining structured patterns in XML documents
What Data Mining Methods May
Help Bio-Informatics?

Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and structured pattern mining

Advanced classification techniques








Cluster analysis and outlier detection
Multi-dimensional data mining environments
Visual data mining
Invisible data mining
Classification Methods

Decision tree induction

Bayesian Classification

Classification by Neural Networks



Classification by Support Vector Machines
(SVM)
Classification based on concepts from
association rule mining
Other Classification Methods
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
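Read as code, the “buys_computer” tree is a chain of attribute tests from the root to a leaf. The sketch below assumes the branch layout of the usual version of this example (age at the root, then student or credit rating); the function name and argument types are illustrative.

```python
def buys_computer(age, student, credit_rating):
    """Walk the tree from the root test (age) down to a yes/no leaf."""
    if age <= 30:
        return "yes" if student else "no"
    if age <= 40:
        return "yes"
    return "no" if credit_rating == "excellent" else "yes"

young_student = buys_computer(25, True, "fair")
senior_excellent = buys_computer(50, False, "excellent")
```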
Classification in MultiMediaMiner
Bayesian Belief Network: An Example
[Figure: belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea]
The conditional probability table for the variable LungCancer:
       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
Shows the conditional probability for each possible
combination of its parents
$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i))$
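The product formula can be exercised on the LungCancer table. The CPT values below come from the slide; the prior probabilities `p_fh` and `p_s` are hypothetical placeholders, since the slides do not give them, and only the three nodes FamilyHistory, Smoker, and LungCancer are modelled.

```python
from itertools import product

# From the slide: P(LungCancer = True | FamilyHistory, Smoker).
cpt_lc = {(True, True): 0.8, (True, False): 0.5,
          (False, True): 0.7, (False, False): 0.1}
# Hypothetical priors for the two parent nodes (not given in the slides).
p_fh, p_s = 0.5, 0.5

def bernoulli(p, value):
    return p if value else 1.0 - p

def joint(fh, s, lc):
    """P(fh, s, lc) as the product of each node's probability given its parents."""
    return bernoulli(p_fh, fh) * bernoulli(p_s, s) * bernoulli(cpt_lc[(fh, s)], lc)

# Marginal P(LungCancer) under the assumed priors, summing out the parents.
p_lung_cancer = sum(joint(fh, s, True) for fh, s in product([True, False], repeat=2))
```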
Multi-Layer Perceptron
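[Figure: feed-forward network; the input vector x_i enters the input nodes, followed by hidden nodes and output nodes that produce the output vector; w_ij is the weight on the connection from unit i to unit j]
Net input to unit j: $I_j = \sum_i w_{ij} O_i + \theta_j$
Output of unit j: $O_j = \frac{1}{1 + e^{-I_j}}$
Error of an output node j: $Err_j = O_j (1 - O_j)(T_j - O_j)$
Error of a hidden node j: $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
Weight update: $w_{ij} = w_{ij} + (l)\, Err_j O_i$
Bias update: $\theta_j = \theta_j + (l)\, Err_j$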
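The multi-layer perceptron update rules translate almost line for line into code. This is a minimal sketch for a single unit, with illustrative function names; `rate` stands for the learning rate (l), and `downstream` is a hypothetical list of (Err_k, w_jk) pairs for the units that unit j feeds into.

```python
import math

def sigmoid(net):
    # O_j = 1 / (1 + e^(-I_j))
    return 1.0 / (1.0 + math.exp(-net))

def net_input(weights, inputs, bias):
    # I_j = sum_i w_ij * O_i + theta_j
    return sum(w * o for w, o in zip(weights, inputs)) + bias

def output_error(o_j, t_j):
    # Err_j = O_j (1 - O_j) (T_j - O_j)
    return o_j * (1.0 - o_j) * (t_j - o_j)

def hidden_error(o_j, downstream):
    # Err_j = O_j (1 - O_j) * sum_k Err_k w_jk
    return o_j * (1.0 - o_j) * sum(err_k * w_jk for err_k, w_jk in downstream)

def update(weights, bias, inputs, err_j, rate):
    # w_ij = w_ij + (l) Err_j O_i ; theta_j = theta_j + (l) Err_j
    new_weights = [w + rate * err_j * o for w, o in zip(weights, inputs)]
    return new_weights, bias + rate * err_j

o = sigmoid(net_input([0.0, 0.0], [1.0, 2.0], 0.0))   # net input 0 gives output 0.5
err = output_error(o, 1.0)
new_w, new_bias = update([0.0, 1.0], 0.0, [1.0, 2.0], err, rate=0.5)
```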
Linear Classification
[Figure: scatter plot of points labelled ‘x’ and ‘o’, separated by a red line]
Binary classification problem
The data above the red line belongs to class ‘x’
The data below the red line belongs to class ‘o’
Examples: SVM, Perceptron, Winnow, probabilistic classifiers
SVM – Support Vector Machines
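[Figure: two separating lines for the same data, one with a small margin and one with a large margin; the support vectors are the points lying on the margin]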
Association-Based Classification

Several methods for association-based classification
 ARCS: Quantitative association mining and clustering
of association rules (Lent et al’97)
It beats C4.5 in (mainly) scalability and also accuracy
 Associative classification: (Liu et al’98)
It mines high support and high confidence rules in the form of
“cond_set => y”, where y is a class label
 CAEP (Classification by aggregating emerging patterns)
(Dong et al’99)
Emerging patterns (EPs): the itemsets whose support
increases significantly from one class to another
Mine EPs based on minimum support and growth rate
The k-Nearest Neighbor Algorithm





All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of
Euclidean distance.
The target function could be discrete- or real-valued.
For discrete-valued, the k-NN returns the most
common value among the k training examples nearest
to xq.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: query point xq among training examples labelled + and -, with the Voronoi cells of the examples]
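A minimal sketch of the discrete-valued k-nearest-neighbor rule, assuming Euclidean distance and a simple majority vote. The function name and the toy training points are illustrative; `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """examples: list of (point, label); returns the majority label of the k nearest."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [((1, 1), "+"), ((1, 2), "+"),
            ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
label_near_plus = knn_classify((1.5, 1.5), training, k=3)
label_near_minus = knn_classify((5, 5.2), training, k=1)
```

With k = 1 the prediction is simply the label of the Voronoi cell that contains the query point.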
What Data Mining Methods May
Help Bio-Informatics?










Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and structured pattern mining
Advanced classification techniques
Cluster analysis and outlier
detection
Multi-dimensional data mining environments
Visual data mining
Invisible data mining
Cluster Analysis and Outlier
Detection

Partitioning Methods

K-means and k-medoids algorithms

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Constraint-Based Clustering

Outlier Analysis
The K-Means Clustering Method

Example (K=2)
[Figure: scatter-plot panels on 0 to 10 axes showing successive iterations]
1. Arbitrarily choose K objects as initial cluster centers
2. Assign each object to the most similar center
3. Update the cluster means
4. Reassign the objects and update the cluster means again (the figure shows two such rounds)
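The assign and update loop of k-means can be sketched as below, assuming Euclidean distance, points given as tuples, and initial centers supplied by the caller. Stopping when no object changes cluster is a common convention rather than something stated on the slide.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Basic k-means; centers is a list of initial centers and is updated in place."""
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        new_assignment = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # no object changed cluster
        assignment = new_assignment
        # Update the cluster means; an empty cluster keeps its old center.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
# K = 2, arbitrarily taking the first two objects as initial centers.
final_centers, labels = kmeans(points, [(1, 1), (2, 1)])
```

Even with both initial centers inside the same group, the second round of reassignment moves (2, 1) back to its neighbors and the centers settle on the two groups.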
Typical k-medoids algorithm (PAM)
[Figure: scatter-plot panels on 0 to 10 axes illustrating PAM with K=2; the initial assignment is labelled Total Cost = 20 and the candidate swap is labelled Total Cost = 26]
Arbitrarily choose k objects as initial medoids
Assign each remaining object to the nearest medoid
Do loop, until no change
  Randomly select a nonmedoid object, O_random
  Compute total cost of swapping
  Swap O and O_random if quality is improved
Hierarchical Clustering

Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
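[Figure: clustering of the objects a, b, c, d, e over Steps 0 to 4]
Agglomerative (AGNES), Step 0 to Step 4: the single objects a, b, c, d, e are merged into ab and de, then cde, then abcde
Divisive (DIANA), Step 0 to Step 4: the same diagram read in the opposite direction, splitting abcde back into single objects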
CF Tree
[Figure: CF tree with B = 7 and L = 6. The root holds entries CF1, CF2, CF3, ..., CF6, each pointing to a child (child1 to child6); a non-leaf node holds entries CF1, CF2, CF3, ..., CF5 with child pointers (child1 to child5); leaf nodes hold CF entries and are chained together by prev and next pointers]
CURE (Clustering Using
REpresentatives )

CURE: proposed by Guha, Rastogi & Shim, 1998


Stops the creation of a cluster hierarchy if a level
consists of k clusters
Uses multiple representative points to evaluate the
distance between clusters, adjusts well to arbitrary
shaped clusters and avoids single-link effect
Overall Framework of CHAMELEON
[Figure: pipeline] Data Set, then Construct Sparse Graph, then Partition the Graph, then Merge Partition, then Final Clusters
DBSCAN: Density Based Spatial
Clustering of Applications with Noise


Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]
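A compact sketch of the density-based idea behind DBSCAN, assuming Euclidean distance and a neighborhood that includes the point itself. The function name and the label convention (cluster ids from 0, -1 for outliers) are illustrative, and the linear neighbor scan is for clarity only.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for an outlier."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisional outlier; may become a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```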
[Figure: reachability plot; vertical axis: reachability-distance (may be undefined); horizontal axis: cluster-order of the objects]
Density-Based Cluster analysis:
OPTICS & Its Applications
Clustering and Distribution Density
Functions: Density Attractor
Center-Defined and Arbitrary Shaped
[Figure: grid panels plotting Salary (10,000) against age and Vacation (week) against age; the age axes run from 20 to 60, the Salary and Vacation axes from 0 to 7, and a threshold value of 3 is marked]
STING: A Statistical Information
Grid Approach



Wang, Yang and Muntz (VLDB’97)
Each cell stores statistical distribution of measure at
low level
Multi-level resolution
WaveCluster

G. Sheikholeslami, et al. (1998)
Multiple wavelet transformation-based cluster analysis
Constraint-Based Clustering: Planning
ATM Locations
[Figure, left panel: spatial data with obstacles (River, Mountain)]
[Figure, right panel: clustering without taking obstacles into consideration, with clusters C1, C2, C3, C4]
Clustering with Spatial Obstacles
Not Taking obstacles into account
Taking obstacles into account
What Data Mining Methods May
Help Bio-Informatics?










Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and structured pattern mining
Advanced classification techniques
Cluster analysis and outlier detection
Multi-dimensional data mining
environments
Visual data mining
Invisible data mining
Multidimensional Data and Data Cubes

Sales volume as a function of product, month,
and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Product: Industry, Category, Product
Location: Region, Country, City, Office
Time: Year, Quarter, Month or Week, Day
[Figure: data cube with axes Product and Month]
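Rolling up along the summarization paths amounts to re-keying each fact at a coarser level and summing. The sketch below is illustrative only: the fact rows, the `country_of` mapping, and the helper names are hypothetical, and it rolls up month to quarter and city to country.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, month, city, sales volume).
facts = [
    ("TV", "2001-01", "Chicago", 10),
    ("TV", "2001-02", "Chicago", 5),
    ("PC", "2001-01", "Urbana", 7),
    ("TV", "2001-04", "Toronto", 3),
]
# Hypothetical concept hierarchy for the Location dimension.
country_of = {"Chicago": "USA", "Urbana": "USA", "Toronto": "Canada"}

def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn' along the Time hierarchy."""
    year, mm = month.split("-")
    return "%s-Q%d" % (year, (int(mm) - 1) // 3 + 1)

def roll_up(rows):
    """Aggregate sales from (product, month, city) to (product, quarter, country)."""
    cube = defaultdict(int)
    for product, month, city, sales in rows:
        cube[(product, quarter_of(month), country_of[city])] += sales
    return dict(cube)

summary = roll_up(facts)
```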
Mining Multimedia Databases in
Mining and Explorative Analysis of Data
Cubes (and Multi-Dimensional Databases)

Efficient computation of data or iceberg cubes

Discovery-driven data cube analysis

Cube-gradient analysis


What are the changes of the average house value in
Silicon Valley in 2001 compared with 2000?
Under what conditions did the average house value
increase by 10% per year in the Chicago area in the 1990s?
What Data Mining Methods May
Help Bio-Informatics?

Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and structured pattern mining
Advanced classification techniques
Cluster analysis and outlier detection
Multi-dimensional data mining environments

Visual data mining

Invisible data mining







Visual Data Mining & Data Visualization


Integration of visualization and data mining
 data visualization
 data mining result visualization
 data mining process visualization
 interactive visual data mining
Data visualization
 Data in a database or data warehouse can be
viewed
 at different levels of abstraction
 as different combinations of attributes or
dimensions
 Data can be presented in various visual forms
Data Mining Result Visualization


Presentation of the results or knowledge obtained from
data mining in visual forms
Examples

Scatter plots and boxplots (obtained from descriptive
data mining)

Decision trees

Association rules

Clusters

Outliers

Generalized rules
Boxplots from Statsoft: Multiple
Variable Combinations
Visualization of Data Mining Results in
SAS Enterprise Miner: Scatter Plots
Visualization of Association Rules in
SGI/MineSet 3.0
Visualization of a Decision Tree in
SGI/MineSet 3.0
Visualization of Cluster Grouping in IBM
Intelligent Miner
Data Mining Process Visualization

Presentation of the various processes of data mining in
visual forms so that users can see

Data extraction process

Where the data is extracted

How the data is cleaned, integrated, preprocessed,
and mined

Method selected for data mining

Where the results are stored

How they may be viewed
Visualization of Data Mining
Processes by Clementine
See your solution
discovery
process clearly
Understand
variations with
visualized data
Interactive Visual Data Mining


Using visualization tools in the data mining process to
help users make smart data mining decisions
Example


Display the data distribution in a set of attributes
using colored sectors or columns (depending on
whether the whole space is represented by either a
circle or a set of columns)
Use the display to decide which sector should first be
selected for classification and where a good split
point for this sector may be
Interactive Visual Mining by
Perception-Based Classification (PBC)
Audio Data Mining





Uses audio signals to indicate the patterns of data or
the features of data mining results
An interesting alternative to visual mining
The inverse task of mining audio (such as music)
databases, which is to find patterns in audio data
Visual data mining may disclose interesting patterns
using graphical displays, but requires users to
concentrate on watching patterns
Instead, transform patterns into sound and music and
listen to pitches, rhythms, tune, and melody in order to
identify anything interesting or unusual
What Data Mining Methods May
Help Bio-Informatics?

Semantic integration of heterogeneous, distributed genome
databases
Discovery of tandem repeats: Blast and beyond
Similarity search in genome databases
Association, correlation, and linkage analysis
Fault-tolerant sequential and structured pattern mining
Advanced classification techniques
Cluster analysis and outlier detection
Multi-dimensional data mining environments
Visual data mining

Invisible data mining








Invisible Data Mining

Embed mining functions into information services

Web search engine (link analysis, authoritative pages, user
profiles)—adaptive web sites, etc.


Improvement of query processing: history + data

Making service smart and efficient
Benefits from/to data mining research

Data mining research has produced many scalable, efficient,
novel mining solutions


Applications feed new challenge problems to research
Can we make bio-informatics based data mining invisible?
Conclusions



Data mining and bio-informatics: Both are
young and promising disciplines
Data mining: A confluence of multiple
disciplines—database, data warehouse,
machine learning, statistics, high performance
computing, bio-technology, etc.
Lots of research issues: need biologists and
computer scientists working together
http://www.cs.uiuc.edu/~hanj
Thank you !!!