UCR CS 235: Data Mining
Winter 2011
Important Note
• All information about grades, homeworks, projects, etc. will be given out at the next meeting
• Someone give me a 15 minute warning before the end of this class
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• “Data mining” is not
– Databases
– (Deductive) query processing
– Expert systems or small ML/statistical programs
What is Data Mining?
An example from yesterday's newspapers
What is Data Mining?
Example from the NBA
• Play-by-play information recorded by teams
– Who is on the court
– Who shoots
– Results
• Coaches want to know what works best
– Plays that work well against a given team
– Good/bad player matchups
• Advanced Scout (from IBM Research) is a data
mining tool to answer these questions
[Bar chart: shooting percentage overall vs. while Starks, Houston, and Ward are playing together, on a 0–60 scale. Source: http://www.nba.com/news_feat/beyond/0126.html]
What is Data Mining?
Example from Keogh/Mueen
Beet Leafhopper (Circulifer tenellus)
[Diagram of the measurement apparatus: a voltage source and input resistor, conductive glue attaching a wire to the insect, an electrode in the soil near the plant, and the insect's stylet penetrating the plant membrane; the voltage reading is recorded over time.]
[Time series plot: approximately 14.4 minutes of insect telemetry, with zoomed-in views of the instances at 3,664 and 9,036.]
All these examples show…
• Lots of raw data in
• Some data mining
• Facts, rules, patterns out
Lots of data → [data mining] → some rules or facts or patterns
Knowledge Discovery in Databases: Process
Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge
[The diagram shows raw numbers such as “12.2 3434.00232, 11.2 3454.64555, 23.6 4324.53435” flowing in, and knowledge such as “There exists a planet at…” coming out.]
Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Course Outline
1. Introduction: What is data mining?
– What makes it a new and unique discipline?
– Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining
– Data mining tasks: Clustering, Classification, Rule learning, etc.
2. Data mining process
– Task identification
– Data preparation/cleansing
3. Association Rule mining
4. Classification
– Bayesian
– Nearest Neighbor
– Linear Classifier
– Tree-based approaches
5. Prediction
– Regression
– Neural Networks
6. Clustering
– Distance-based approaches
– Density-based approaches
7. Anomaly Detection
– Distance based
– Density based
– Model based
8. Similarity Search
– Problem Description
– Algorithms
Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of data to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Data Mining: History of the Field
• Knowledge Discovery in Databases workshops started ’89
– Now a conference under the auspices of ACM SIGKDD
– IEEE conference series started 2001
• Key founders / technology contributors:
– Usama Fayyad, JPL (then Microsoft, then his own company, Digimine, now Yahoo! Research labs, now CEO at Open Insights)
– Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners)
– Rakesh Agrawal (IBM Research)
The term “data mining” has been around since at least 1983 – as a pejorative term in the statistics community.
Data Mining: The Big Players
A data mining problem…
Wei Wang - School of Life Science, Fudan University, China
Wei Wang - Nonlinear Systems Laboratory, Department of Mechanical Engineering, MIT
Wei Wang - University of Maryland Baltimore County
Wei Wang - University of Naval Engineering
Wei Wang - ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Wei Wang - Rutgers University, New Brunswick, NJ, USA
Wei Wang - Purdue University Indianapolis
Wei Wang - INRIA Sophia Antipolis, Sophia Antipolis, France
Wei Wang - Institute of Computational Linguistics, Peking University
Wei Wang - National University of Singapore
Wei Wang - Nanyang Technological University, Singapore
Wei Wang - Computer and Electronics Engineering, University of Nebraska-Lincoln, NE, USA
Wei Wang - The University of New South Wales, Australia
Wei Wang - Language Weaver, Inc.
Wei Wang - The Chinese University of Hong Kong, Mechanical and Automation Engineering
Wei Wang - Center for Engineering and Scientific Computation, Zhejiang University, China
Wei Wang - Fudan University, Shanghai, China
Wei Wang - University of North Carolina at Chapel Hill
What Can Data Mining Do?
• Classify
– Categorical, Regression
• Cluster
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations
Why is Data Mining Hard?
• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis
• Overfitting
• Privacy issues
Scale of Data
Organization: Scale of Data
Walmart: ~20 million transactions/day
Google: ~8.2 billion Web pages
Yahoo: ~10 GB Web data/hr
NASA satellites: ~1.2 TB/day
NCBI GenBank: ~22 million genetic sequences
France Telecom: 29.2 TB
UK Land Registry: 18.3 TB
AT&T Corp: 26.2 TB
“The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don’t have a clue as to what any of that data actually means” (S. Cass, IEEE Spectrum, Jan 2004)
The Classification Problem
(informal definition)
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
Katydid or Grasshopper?
The Classification Problem
Given a collection of annotated data… Spam or email?
The Classification Problem
Given a collection of annotated data… Spanish or Polish?
The Classification Problem
Given a collection of annotated data… Stinging Nettle or False Nettle?
The Classification Problem
Given a collection of annotated data… Greek or Irish?
Greek names: Gunopulos, Papadopoulos, Kollios, Dardanos
Irish names: Keogh, Gough, Greenhaugh, Hadleigh
What about “Tsotras”?
For any domain of interest, we can measure features:
Color {Green, Brown, Gray, Other}, Has Wings?, Abdomen Length, Thorax Length, Antennae Length, Mandible Size, Spiracle Diameter, Leg Length
We can store features in a database.
The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a previously unseen instance.

My_Collection
Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 | Grasshopper
4 | 1.1 | 3.1 | Grasshopper
5 | 5.4 | 8.5 | Katydid
6 | 2.9 | 1.9 | Grasshopper
7 | 6.1 | 6.6 | Katydid
8 | 0.5 | 1.0 | Grasshopper
9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydids

Previously unseen instance: 11 | 5.1 | 7.0 | ???????

[Scatter plot: the training instances plotted with Abdomen Length on the X axis (1-10) and Antenna Length on the Y axis (1-10); the Grasshoppers and Katydids fall into two groups.]
We will also use this larger dataset as a motivating example…
[Scatter plot of a larger dataset: Abdomen Length vs. Antenna Length, with Grasshoppers and Katydids.]
Each of these data objects is called…
• an exemplar
• a (training) example
• an instance
• a tuple
We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game. I am going to show you some classification problems which were shown to pigeons! Let us see if you are as smart as a pigeon!
Pigeon Problem 1
Examples of class A (each example is a pair of bar heights): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
What class is this object: (8, 1.5)? What about this one, (1.5, 7), A or B?
The first one is a B! Here is the rule: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.
Pigeon Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Even I know this one: (8, 1.5) is a B. Oh! This one is hard: (7, 7)?
The rule is as follows: if the two bars are equal sizes, it is an A; otherwise it is a B. So this one is an A.
Pigeon Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
This one is really hard! What is this, A or B?
It is a B! The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.
Why did we spend so much time with this game? Because we wanted to show that almost all classification problems have a geometric interpretation; check out the next 3 slides…
Pigeon Problem 1 revisited
[Scatter plot with Right Bar on the X axis and Left Bar on the Y axis (both 1-10): class A examples (3, 4), (1.5, 5), (6, 8), (2.5, 5) and class B examples (5, 2.5), (5, 2), (8, 3), (4.5, 3) plotted as points.]
Here is the rule again: if the left bar is smaller than the right bar, it is an A, otherwise it is a B.
Pigeon Problem 2 revisited
[Scatter plot with Right Bar on the X axis and Left Bar on the Y axis (both 1-10): class A examples (4, 4), (5, 5), (6, 6), (3, 3) and class B examples (5, 2.5), (2, 5), (5, 3), (2.5, 3) plotted as points.]
Let me look it up… here it is: the rule is, if the two bars are equal sizes, it is an A. Otherwise it is a B.
Pigeon Problem 3 revisited
[Scatter plot with Right Bar on the X axis and Left Bar on the Y axis (both 10-100): class A examples (4, 4), (1, 5), (6, 3), (3, 7) and class B examples (5, 6), (7, 5), (4, 8), (7, 7) plotted as points.]
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
Previously unseen instance: 11 | 5.1 | 7.0 | ???????
• We can “project” the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.
[Scatter plot: Abdomen Length vs. Antenna Length with the Grasshopper and Katydid training points, and the unseen instance projected into the same space.]
Simple Linear Classifier
[Scatter plot of the insect data with a straight line separating the Katydids (above the line) from the Grasshoppers (below). Portrait: R. A. Fisher, 1890-1962.]
If previously unseen instance is above the line
then class is Katydid
else class is Grasshopper
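To make this concrete, here is a minimal sketch in Python. It is not Fisher's actual procedure: for illustration it simply places the separating line halfway between the two class centroids (an assumption made here) and classifies an unseen instance by which side of that line it falls on.

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fit_midline(katydids, grasshoppers):
    ck, cg = centroid(katydids), centroid(grasshoppers)
    w = (ck[0] - cg[0], ck[1] - cg[1])          # normal vector points toward the Katydid centroid
    mid = ((ck[0] + cg[0]) / 2, (ck[1] + cg[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])        # line passes through the midpoint of the centroids
    return w, b

def classify(w, b, instance):
    score = w[0] * instance[0] + w[1] * instance[1] + b
    return "Katydid" if score > 0 else "Grasshopper"

# (abdomen length, antennae length) pairs from the My_Collection table
katydids     = [(8.0, 9.1), (5.4, 8.5), (6.1, 6.6), (8.3, 6.6), (8.1, 4.7)]
grasshoppers = [(2.7, 5.5), (0.9, 4.7), (1.1, 3.1), (2.9, 1.9), (0.5, 1.0)]
w, b = fit_midline(katydids, grasshoppers)
print(classify(w, b, (5.1, 7.0)))               # instance 11 lands on the Katydid side of this line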
The simple linear classifier is defined for higher dimensional spaces… we can visualize it as being an n-dimensional hyperplane.
It is interesting to think about what would happen in this example if we did not have the 3rd dimension… We can no longer get perfect accuracy with the simple linear classifier…
We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier… However, as we will later see, this is probably a bad idea…
Which of the “Pigeon Problems” can be solved by the Simple Linear Classifier?
1) Perfect
2) Useless
3) Pretty Good
Problems that can be solved by a linear classifier are called linearly separable.
[The three Pigeon Problem scatter plots are shown again for reference.]
A Famous Problem
R. A. Fisher’s Iris Dataset.
• 3 classes
• 50 of each class
The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width.
[Photographs of Iris Setosa, Iris Versicolor, and Iris Virginica, and a scatter plot of the dataset with the Setosa, Versicolor, and Virginica regions marked.]
We can generalize the piecewise linear classifier to N classes, by fitting N-1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor.
If petal width > 3.272 – (0.325 * petal length) then class = Virginica
Elseif petal width…
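A minimal Python sketch of this two-rule decision list follows. The first boundary uses the coefficients quoted on the slide; the second boundary is elided there ("Elseif petal width…"), so the values A and B below are hypothetical placeholders rather than learned coefficients.

def classify_iris(petal_length, petal_width, A=1.0, B=0.25):
    if petal_width > 3.272 - 0.325 * petal_length:    # first line, coefficients from the slide
        return "Virginica"
    elif petal_width > A - B * petal_length:          # second line; A and B are hypothetical placeholders
        return "Versicolor"
    else:
        return "Setosa"

print(classify_iris(petal_length=6.0, petal_width=2.2))   # falls above the first line, so "Virginica"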
We have now seen one classification
algorithm, and we are about to see more.
How should we compare them?
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
– efficiency in disk-resident databases
• Robustness
– handling noise, missing values and irrelevant features, streaming data
• Interpretability:
– understanding and insight provided by the model
Predictive Accuracy I
• How do we estimate the accuracy of our classifier? We can use K-fold cross validation.
We divide the dataset into K equal sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead. (Here K = 5.)
Accuracy = Number of correct classifications / Number of instances in our database
[The My_Collection insect table (Insect ID, Abdomen Length, Antennae Length, Insect Class) and its scatter plot are shown again to illustrate the K = 5 folds.]
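A minimal sketch of K-fold cross validation in Python. The train and classify arguments are placeholders for whatever classifier is being evaluated; the dataset is assumed to be a list of (features, label) pairs.

def k_fold_accuracy(dataset, k, train, classify):
    folds = [dataset[i::k] for i in range(k)]            # K roughly equal-sized sections
    correct = 0
    for i in range(k):
        held_out = folds[i]
        training = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        model = train(training)                          # build the classifier without fold i
        correct += sum(classify(model, feats) == label   # ...but test it on fold i
                       for feats, label in held_out)
    return correct / len(dataset)                        # correct classifications / total instances

Each instance is held out exactly once, so the returned value is simply the number of correct classifications divided by the number of instances in the database, as in the formula above.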
Predictive Accuracy II
• Using K-fold cross validation is a good way to set any parameters we may
need to adjust in (any) classifier.
• We can do K-fold cross validation for each possible setting, and choose
the model with the highest accuracy. Where there is a tie, we choose the
simpler model.
• Actually, we should probably penalize the more complex models, even if they are
more accurate, since more complex models are more likely to overfit (discussed
later).
[Three scatter plots of the insect data fitted with different parameter settings; the reported accuracies are 94%, 100%, and 100%.]
Predictive Accuracy III
Accuracy = Number of correct classifications / Number of instances in our database
Accuracy is a single number; we may be better off looking at a confusion matrix. This gives us additional useful information…

Confusion matrix (rows = classified as, columns = true label):
                 True: Cat   Dog   Pig
Classified as Cat      100     0     0
Classified as Dog        9    90     1
Classified as Pig       45    45    10
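As a small illustration, a confusion matrix like the one above can be tallied by counting (true label, predicted label) pairs; the sketch below assumes plain Python lists of labels.

from collections import Counter

def confusion_matrix(true_labels, predicted_labels, classes):
    counts = Counter(zip(true_labels, predicted_labels))
    # one row per "classified as" class, one column per true class, as in the table above
    return [[counts[(t, p)] for t in classes] for p in classes]

classes = ["Cat", "Dog", "Pig"]
true = ["Cat", "Dog", "Pig", "Pig"]
pred = ["Cat", "Dog", "Dog", "Pig"]
for cls, row in zip(classes, confusion_matrix(true, pred, classes)):
    print("Classified as", cls, row)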
Speed and Scalability I
We need to consider the time and space requirements for the two distinct phases of classification:
• Time to construct the classifier
– In the case of the simple linear classifier, the time taken to fit the line; this is linear in the number of instances.
• Time to use the model
– In the case of the simple linear classifier, the time taken to test which side of the line the unlabeled instance is on; this can be done in constant time.
As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.
Speed and Scalability II
For learning with small datasets, this is the whole picture. However, for data mining with massive datasets, it is not so much the (main memory) time complexity that matters; rather it is how many times we have to scan the database.
This is because for most data mining operations, disk access times completely dominate the CPU times. For data mining, researchers often report the number of times you must scan the database.
[The Speed and Scalability I slide is shown again, shrunk, for reference.]
Robustness I
We need to consider what happens when we have:
• Noise
– For example, a person's age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy we can do nothing.)
• Missing values
– For example, suppose we want to classify an insect, but we only know the abdomen length (X axis), and not the antennae length (Y axis). Can we still classify the instance?
Robustness II
We need to consider what happens when we have:
• Irrelevant features
For example, suppose we want to classify people as either
• Suitable_Grad_Student
• Unsuitable_Grad_Student
and it happens that scoring more than 5 on a particular test is a perfect indicator for this problem…
If we also use “hair_length” as a feature, how will this affect our classifier?
Robustness III
We need to consider what happens when we have:
• Streaming data
For many real world problems, we don’t have a single fixed dataset. Instead, the data continuously arrives, potentially forever… (stock market, weather data, sensor data, etc.)
Can our classifier handle streaming data?
Interpretability
Some classifiers offer a bonus feature: the structure of the learned classifier tells us something about the domain.
As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny.
[Plot of Weight vs. Height with two linear decision boundaries.]
Nearest Neighbor Classifier
[Scatter plot of the insect data: Abdomen Length vs. Antenna Length, Katydids and Grasshoppers. Portraits: Evelyn Fix, 1904-1965, and Joe Hodges, 1922-2000.]
If the nearest instance to the previously unseen instance is a Katydid
then class is Katydid
else class is Grasshopper
We can visualize the nearest neighbor algorithm in terms of a decision surface… Note that we don’t actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions “belonging” to each instance.
This division of space is called a Dirichlet Tessellation (or Voronoi diagram, or Thiessen regions).
The nearest neighbor algorithm is sensitive to outliers…
The solution is to…
We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number. [Illustration with K = 1 and K = 3.]
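A minimal sketch of KNN classification in Python, assuming two numeric features per instance, Euclidean distance, and simple majority voting; the few instances used below are taken from the My_Collection table.

from collections import Counter
import math

def knn_classify(dataset, query, k=3):
    nearest = sorted(dataset, key=lambda pair: math.dist(pair[0], query))  # sort by Euclidean distance
    votes = Counter(label for _, label in nearest[:k])                     # let the K nearest vote
    return votes.most_common(1)[0][0]

data = [((8.0, 9.1), "Katydid"), ((5.4, 8.5), "Katydid"), ((6.1, 6.6), "Katydid"),
        ((2.7, 5.5), "Grasshopper"), ((0.9, 4.7), "Grasshopper"), ((1.1, 3.1), "Grasshopper")]
print(knn_classify(data, (5.1, 7.0), k=3))   # instance 11 from the My_Collection table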
The nearest neighbor algorithm is sensitive to irrelevant features…
Suppose the following is true: if an insect's antenna is longer than 5.5 it is a Katydid, otherwise it is a Grasshopper. Using just the antenna length we get perfect classification!
Suppose however, we add in an irrelevant feature, for example the insect's mass. Using both the antenna length and the insect's mass with the 1-NN algorithm we get the wrong classification!
[Training data shown on a one-dimensional antenna-length axis, and again in two dimensions with mass added.]
How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?
• Use more training instances
• Ask an expert what features are relevant to the task
• Use statistical tests to try to determine which features are useful
• Search over feature subsets (in the next slide we will see why this is hard)
Why searching over feature subsets is hard
Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant…
[Three scatter plots: using only Feature 1, using only Feature 2, and using Features 1 and 2 together.]
Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 − 1 possible subsets of the features, only one really works.
[Lattice of feature subsets for a four-feature example: {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {1,2,3,4}.]
• Forward Selection
• Backward Elimination
• Bi-directional Search
A sketch of greedy forward selection appears below.
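Because exhaustive search over all 2^100 − 1 subsets is hopeless, greedy strategies such as forward selection are used instead. A minimal sketch, assuming evaluate(subset) is a placeholder that returns (say) cross-validated accuracy for a candidate feature subset:

def forward_selection(all_features, evaluate):
    selected, best_score = [], 0.0
    improved = True
    while improved:
        improved = False
        for f in set(all_features) - set(selected):
            score = evaluate(selected + [f])          # try adding one more feature
            if score > best_score:
                best_score, best_feature, improved = score, f, True
        if improved:
            selected.append(best_feature)             # keep the single best addition, then repeat
    return selected, best_score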
The nearest neighbor algorithm is sensitive to the units of measurement.
With the X axis measured in centimeters and the Y axis measured in dollars, the nearest neighbor to the pink unknown instance is red. With the X axis measured in millimeters and the Y axis measured in dollars, the nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one:
X' = (X − mean(X)) / std(X)
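A minimal sketch of Z-normalizing one feature column in Python (this uses the population standard deviation; whether to use the population or sample version is a design choice):

import statistics

def z_normalize(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)                 # population standard deviation
    return [(v - mu) / sigma for v in values]

print(z_normalize([1.0, 2.0, 3.0, 4.0]))              # result has mean 0 and standard deviation 1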
We can speed up the nearest neighbor algorithm by “throwing away” some data. This is called data editing. Note that this can sometimes improve accuracy! One possible approach: delete all instances that are surrounded by members of their own class.
We can also speed up classification with indexing.
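One way to realize the "surrounded by members of their own class" idea is sketched below; it assumes the dataset is a list of ((x, y), label) pairs and treats an instance as surrounded when all of its k nearest neighbors share its label. This is only an illustrative rule of thumb, not a specific published editing algorithm.

import math

def edit_dataset(dataset, k=3):
    kept = []
    for feats, label in dataset:
        others = [pair for pair in dataset if pair != (feats, label)]
        others.sort(key=lambda pair: math.dist(pair[0], feats))
        neighbor_labels = {lab for _, lab in others[:k]}
        if neighbor_labels != {label}:                # keep instances near a class boundary
            kept.append((feats, label))
    return kept                                       # the edited (smaller) training set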
Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean Distance; however, this need not be the case…
Euclidean: D(Q,C) = sqrt( sum_{i=1..n} (q_i − c_i)^2 )
Minkowski: D(Q,C) = ( sum_{i=1..n} |q_i − c_i|^p )^(1/p)
Special cases and variants: Manhattan (p = 1), Max (p = ∞), Weighted Euclidean, Mahalanobis.
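A minimal sketch of the distance functions named above (Euclidean, Manhattan, and the general Minkowski form, with p = infinity giving the Max distance); the Weighted Euclidean and Mahalanobis variants would additionally need per-feature weights or a covariance matrix.

def minkowski(q, c, p=2):
    if p == float("inf"):                             # Max distance: largest single-feature difference
        return max(abs(qi - ci) for qi, ci in zip(q, c))
    return sum(abs(qi - ci) ** p for qi, ci in zip(q, c)) ** (1 / p)

def euclidean(q, c):
    return minkowski(q, c, p=2)

def manhattan(q, c):
    return minkowski(q, c, p=1)

print(euclidean((5.1, 7.0), (6.1, 6.6)), manhattan((5.1, 7.0), (6.1, 6.6)))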
…In fact, we can use the nearest neighbor algorithm with any distance/similarity function.
For example, is “Faloutsos” Greek or Irish? We could compare the name “Faloutsos” to a database of names using string edit distance…
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.

ID | Name | Class
1 | Gunopulos | Greek
2 | Papadopoulos | Greek
3 | Kollios | Greek
4 | Dardanos | Greek
5 | Keogh | Irish
6 | Gough | Irish
7 | Greenhaugh | Irish
8 | Hadleigh | Irish

Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints, etc…
Edit Distance Example
It is possible to transform any string Q into string C, using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.)
How similar are the names “Peter” and “Piotr”? Assume the following cost function: Substitution 1 unit, Insertion 1 unit, Deletion 1 unit. Then D(Peter, Piotr) is 3:
Peter → Piter (Substitution, i for e) → Pioter (Insertion, o) → Piotr (Deletion, e)
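A minimal dynamic-programming sketch of this edit distance with unit costs for substitution, insertion and deletion; this is the standard Levenshtein recurrence, which also answers the "how do we find the cheapest transformation" question deferred above.

def edit_distance(q, c):
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                   # delete all of q[:i]
    for j in range(n + 1):
        d[0][j] = j                                   # insert all of c[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution (free if characters match)
    return d[m][n]

print(edit_distance("Peter", "Piotr"))                # 3, as on the slide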
Dear SIR,
I am Mr. John Coleman and my sister is Miss Rose
Colemen, we are the children of late Chief Paul
Colemen from Sierra Leone. I am writing you in
absolute confidence primarily to seek your assistance
to transfer our cash of twenty one Million Dollars
($21,000.000.00) now in the custody of a private
Security trust firm in Europe the money is in trunk
boxes deposited and declared as family valuables by my
late father as a matter of fact the company does not
know the content as money, although my father made
them to under stand that the boxes belongs to his
foreign partner.
…
This mail is probably spam. The original message has been
attached along with this report, so you can recognize or block
similar unwanted mail in future. See
http://spamassassin.org/tag/ for more details.
Content analysis details:
(12.20 points, 5 required)
NIGERIAN_SUBJECT2 (1.4 points) Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS (0.7 points) From: ends in numbers
MIME_BOUND_MANY_HEX (2.9 points) Spam tool pattern in MIME boundary
URGENT_BIZ (2.7 points) BODY: Contains urgent matter
US_DOLLARS_3 (1.5 points) BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN)
DEAR_SOMETHING (1.8 points) BODY: Contains 'Dear (something)'
BAYES_30 (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]
Acknowledgements
Some of the material used in this lecture is drawn from
other sources:
• Chris Clifton
• Jiawei Han
• Dr. Hongjun Lu (Hong Kong Univ. of Science and Technology)
• Graduate students from Simon Fraser Univ., Canada, notably
Eugene Belchev, Jian Pei, and Osmar R. Zaiane
• Graduate students from Univ. of Illinois at Urbana-Champaign
• Dr. Bhavani Thuraisingham (MITRE Corp. and UT
Dallas)
Making Good Figures
• We are going to see many figures this quarter
• I personally feel that making good figures is very important to a paper's chance of acceptance.
[Two versions of the same figure, a small graph whose nodes are labeled with numbers. Left caption: “Fig. 1. Sequence graph example”. Right (redrawn) caption: “Fig. 1. A sample sequence graph. The line thickness encodes relative entropy”.]
What's wrong with this figure? Let me count the ways… None of the arrows line up with the “circles”. The “circles” are all different sizes and aspect ratios. The (normally invisible) white bounding box around the numbers breaks the arrows in many places. The figure caption has almost no information. The circles are not aligned…
On the right is my redrawing of the figure with PowerPoint. It took me 300 seconds.
This figure is an insult to reviewers. It says, “we expect you to spend an unpaid hour to review our paper, but we don’t think it worthwhile to spend 5 minutes to make clear figures”.
Note that there are figures drawn seven hundred years
ago that have much better symmetry and layout.
Peter Damian, Paulus Diaconus, and others, Various saints lives: Netherlands, S. or France, N. W.; 2nd quarter of the 13th century
Let us see some more examples of poor figures, then see some principles that can help.
This figure wastes 80% of the space it takes up. In any case, it could be replaced by a short English sentence: “We found that for selectivity ranging from 0 to 0.05, the four methods did not differ by more than 5%.”
Why did they bother with the legend, since you can’t tell the four lines apart?
This figure wastes almost a quarter of a page. The ordering on the X-axis is arbitrary, so the figure could be replaced with the sentence “We found the average performance was 198 with a standard deviation of 11.2”. The paper in question had 5 similar plots, wasting an entire page.
The figure below takes up 1/6 of a page, but it only
reports 3 numbers.
The figure below takes up 1/6 of a page, but it only
reports 2 numbers!
Actually, it really only reports one number! Only the relative times really
matter, so they could have written “We found that FTW is 1007 times faster
than the exact calculation, independent of the sequence length”.
Both figures below describe the classification of time series motions…
It is not obvious from this figure which
algorithm is best. The caption has almost
zero information
You need to read the text very carefully
to understand the figure
Redesign by Keogh
At a glance we can see that the
accuracy is very high. We can also see
that DTW tends to win when the...
The data is plotted in Figure 5. Note that any
correctly classified motions must appear in the
upper left (gray) triangle.
[Scatter plot annotations: “In this region our algorithm wins” (upper left) and “In this region DTW wins” (lower right); both axes run from 0 to 1.]
Figure 5. Each of our 100 motions plotted as a point
in 2 dimensions. The X value is set to the distance to
the nearest neighbor from the same class, and the Y
value is set to the distance to the nearest neighbor
from any other class.
This should be a bar chart; the four items are unrelated (in any case this should probably be a table, not a figure).
This pie chart takes up a lot of space to communicate two numbers (better as a table, or as simple text). [Pie chart labels: “People that have heard of Pacman”, “People that have not heard of Pacman”.]
A Database Architecture For
Real-Time Motion Retrieval
Principles to make Good Figures
• Think about the point you want to make; should it be made with words, a table, or a figure? If a figure, what kind?
• Color helps (but you cannot depend on it)
• Linking helps (sometimes called brushing)
• Direct labeling helps
• Meaningful captions help
• Minimalism helps (omit needless elements)
• Finally, taking great care, taking pride in your work, helps
Direct labeling helps
It removes one level of indirection, and allows the figures to be self-explaining (see Edward Tufte: Visual Explanations, Chapter 4).
[Stills labeled A-E from a video sequence, linked to a time series.]
Figure 10. Stills from a video sequence; the right hand is tracked, and converted into a time series: A) Hand at rest; B) Hand moving above holster; C) Hand moving down to grasp gun; D) Hand moving to shoulder level; E) Aiming gun.
Linking helps interpretability I
How did we get from here
To here?
What is Linking?
Linking is connecting the same data in two
views by using the same color (or thickness
etc). In the figures below, color links the data
in the pie chart, with data in the scatterplot.
It is not clear from the above figure how this happens; see the next slide for a suggested fix.
[Illustration of linking: a pie chart with categories Fish, Fowl, Neither, and Both, linked by color to a scatter plot of the same data.]
Linking helps interpretability II
In this figure, the color of the arrows inside the fish links to the colors of the arrows on the time series. This tells us exactly how we go from a shape to a time series.
Note that there are other links; for example, in II you can tell which fish is which based on color or line-thickness linking.
Minimalism helps: in this case, numbers on the X-axis do not mean anything, so they are deleted.
© Sinaue
1
EBEL
Detection Rate
ABEL
DROP1
DROP2
0
0
• Don’t cover the data with the labels!
You are implicitly saying “the results
are not that important”.
• Do we need all the numbers to
annotate the X and Y axis?
• Can we remove the text “With
Ranking”?
False Alarm Rate
1
Direct labeling helps
Note that the line thicknesses
differ by powers of 2, so even in
a B/W printout you can tell the
four lines apart.
Minimalism helps: delete the “with
Ranking”, the X-axis numbers, the grid…
Covering the data
with the labels is a
common sin
These two images, which are both used to discuss an anomaly detection algorithm, illustrate many of the points discussed in previous slides: color helps, direct labeling helps, meaningful captions help. The images should be as self-contained as possible, to avoid forcing the reader to look back to the text for clarification multiple times.
Note that while Figure 6 uses color to highlight the anomaly, it also uses line thickness (hard to see in PowerPoint); thus this figure also works well in B/W printouts.
Sometime between next week and the end of the quarter:
• Find a poor figure in a data mining paper.
• Create a fixed version of it.
• Present one to three slides about it at the beginning of a class.
• Before you start to work on the poor figure, run it by me.