Machine Learning
Part 2: Intermediate and Active Sampling Methods
Jaime Carbonell (with contributions from Pinar
Donmez and Jingrui He)
Carnegie Mellon University
[email protected]
December, 2008
© 2008, Jaime G Carbonell
Beyond “Standard” Learning:
 Multi-Objective Learning
 Structuring Unstructured Data
 Text Categorization
 Temporal Prediction
 Cycle & trend detection
 Semi-Supervised Methods
 Labeled + Unlabeled Data
 Active Learning
 Proactive Learning
 “Unsupervised” Learning
 Predictor attributes, but no explicit objective
 Clustering methods
 Rare category detection
Multi-Objective Supervised Learning
 Several objectives to predict, overlapping sets of
predictor attributes
[Diagram: predictor attributes p1–p6 feed objectives obj1–obj3.]
-- Independent predictions: each objective is solved ignoring the others
-- Dependent predictions: results of earlier predictions partially feed the next round
 Dependent case: sequence the predictions. If
feedback, cycle until stability (or fixed N)
The Vector Space Model
How to Convert Text to “Data”
 Definitions of document and query vectors, where wj = jth word, and c(wj, di) = count of the occurrences of wj in document di
 For topic categorization, use wn+1 as the objective category to predict (e.g. “finance”, “sports”)

Vocabulary = {w1, w2, ..., wn}
di = [c(w1, di), c(w2, di), ..., c(wn, di)]
qi = [c(w1, qi), c(w2, qi), ..., c(wn, qi)]
Refinements to Word-Based
Features
Well-known methods
Stop-word removal (e.g., “it”, “the”, “in”, …)
Phrasing (e.g., “White House”, “heart attack”, …)
Morphology (e.g., “countries” => “country”)
Feature Expansion
Query expansion (e.g., “cheap” =>
“inexpensive”, “discount”, “economic”,…)
Feature Transformation & Reduction
Singular-value decomposition (SVD)
Linear discriminant analysis (LDA)
Query-Document Similarity
(For Retrieval and for kNN)
Traditional “Cosine Similarity”:

Sim(q, d) = (q · d) / (|q| |d|),  where |d| = sqrt( Σ_{i=1..n} d_i² )

Each element in the query and document vectors is a word weight; rare words count more, e.g. d_i = log2(D_all / D_freq(word_i))

Getting the top-k documents (or web pages) is done by:

Retrieve(q, k) = Argmax_{d ∈ D} [k, Sim(d, q)]   (the k highest-scoring documents)
Multi-tier Text Categorization
News Event
  Terrorist Event
    Bombing
    Shooting
  Economic disaster
    Asian Crisis
    US tech crisis
Given text, predict category at each level
Issue: What if we need to go beyond words as features?
Time Series Prediction Process
 Find leading indicators
 “predictor” variables from earlier epochs
 Code values per distinct time interval
 E.g. “sales at t-1, at t-2, t-3 …”
 E.g. “advertisement $ at t, t-1, t-2”
 Objective is to predict desired variable at
current or future epochs
 E.g. “sales at t, t+1, t+2”
 Apply machine learning methods you learned
 Regression, d-trees, kNN, Bayesian, …
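For instance, a sketch of coding lagged predictor values per epoch (the column names, lag depth, and numbers are illustrative, not from the slides), assuming pandas is available:

```python
# Build lagged "leading indicator" features so a standard learner can be applied.
import pandas as pd

df = pd.DataFrame({
    "sales":       [9.5, 8.5, 7.5, 11, 11, 10, 8.5, 13, 12, 11, 9.5],
    "advertising": [2.0, 1.5, 1.5, 3.0, 2.5, 2.0, 1.5, 3.5, 3.0, 2.5, 2.0],
})

for lag in (1, 2, 3):
    df[f"sales_t-{lag}"] = df["sales"].shift(lag)              # objective at earlier epochs
    df[f"advertising_t-{lag}"] = df["advertising"].shift(lag)  # indicator at earlier epochs

df = df.dropna()                   # drop epochs that lack a full history
X = df.drop(columns="sales")       # predictor attributes (current + earlier epochs)
y = df["sales"]                    # objective at the current epoch
# X, y can now feed any learner: regression, d-trees, kNN, Bayesian, ...
```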
Time Series Prediction: caveat 1
2006 Total Sales: Q1: 9.5M, Q2: 8.5M, Q3: 7.5M, Q4: 11M
2007 Total Sales: Q1: 11M, Q2: 10M, Q3: 8.5M, Q4: 13M
2008 Total Sales: Q1: 12M, Q2: 11M, Q3: 9.5M, Q4: ??

1. Determine periodic cycle
2. Find within-cycle trend
3. Find cross-cycle trend
4. Combine both components
Time Series Prediction: caveat 2
2006 Total Sales: Q1: 9.5M, Q2: 8.5M, Q3: 7.5M, Q4: 11M
2007 Total Sales: Q1: 11M, Q2: 10M, Q3: 8.5M, Q4: 13M
2008 Total Airline Sales: Q1: 12M, Q2: 11M, Q3: 9.5M, Q4: ??

Watch for exogenous variables! (The World Trade Center attack wreaked havoc with airline industry predictions.)
 Less tragic and less obvious one-of-a-kind events matter too
Leveraging Existing Data Collecting Systems
1999 Influenza outbreak
Influenza cultures
Sentinel physicians
WebMD queries about ‘cough’ etc.
School absenteeism
Sales of cough and cold meds
Sales of cough syrup
ER respiratory complaints
ER ‘viral’ complaints
Influenza-related deaths
[Moore, 2002] (weekly counts for each signal, 1999-2000 season)
Adaptive Filtering over a Document Stream
[Diagram: for each topic (Topic 1, Topic 2, Topic 3, …), documents arrive along a time line: training documents (past), then unlabeled documents, then test documents. For each current document the system decides “On-topic?”; on-topic and off-topic judgments feed back as relevance feedback (RF).]
Classifier = Rocchio, Topic = Civil War (R76 in TREC10), Threshold = MLR
MLR threshold function:
locally linear, globally non-linear
Time Series in a Nutshell
 Time-Series Prediction requires regression, except
 Historical data per time period (aka “epoch”)
 Predictor attributes come from both current +
earlier epochs
 Objective attribute from earlier epochs → predictor attributes for current epoch
 Process Difference with Normal Machine Learning
 First detect cyclical patterns among epochs
 Predict within a cycle
 Predict cross-cycle using corresponding epochs
only (then combine with within-cycle prediction)
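As a hedged illustration of this process on the toy quarterly sales above: estimate a within-cycle (seasonal) component and a cross-cycle trend, then combine them to forecast Q4 2008. The method shown (mean seasonal offsets plus a linear trend on yearly means) is just one simple choice.

```python
# Decompose quarterly history into cross-cycle trend + within-cycle offsets.
import numpy as np

sales = {2006: [9.5, 8.5, 7.5, 11.0],
         2007: [11.0, 10.0, 8.5, 13.0]}

# Cross-cycle trend: fit a line to yearly means.
years = np.array(sorted(sales))
yearly_means = np.array([np.mean(sales[y]) for y in years])
slope, intercept = np.polyfit(years, yearly_means, 1)

# Within-cycle component: average offset of each quarter from its yearly mean.
seasonal = np.mean([np.array(q) - np.mean(q) for q in sales.values()], axis=0)

# Combine: predicted level for 2008 plus the Q4 seasonal offset.
q4_2008_forecast = (slope * 2008 + intercept) + seasonal[3]
print(round(q4_2008_forecast, 2))
```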
Active Learning
 Assume:
  Very few “labeled” instances {x, y}
  Very many “unlabeled” instances {x}
  An omniscient “oracle” which can assign a label to an unlabeled instance
 Objective:
 Select instances to label such that
learning accuracy is maximized with the
fewest oracle labeling requests
Active Learning (overall idea)
[Diagram: the Learning Mechanism draws unlabeled data from the Data Source, sends label requests to an Expert, receives labeled data back, learns a new model, and delivers output to the User.]
Why is Active Learning Important?
 Labeled data volumes << unlabeled data volumes
 1.2% of all proteins have known structures
 .01% of all galaxies in the Sloan Sky Survey
have consensus type labels
 .0001% of all web pages have topic labels
 If labeling is costly, or limited, we want to select
the points that will have maximal impact
Review of Supervised Learning
 Training data: {(x_i, y_i)}_{i=1,…,k} (for simplicity, y ∈ {−1, +1})
 Functional space: Φ = { f_{j,p_l} }, indexed by model family j and parameters p_l
 Fitness criterion:

arg min_{j,l} [ Σ_i | y_i − f_{j,p_l}(x_i) | + λ·Ω(f_{j,p_l}) ]
 Variants: online learning, noisy data, …
Active Learning
 Training data: {(x_i, y_i)}_{i=1,…,k} ∪ {x_i}_{i=k+1,…,n}, plus an oracle O: x_i → y_i
 Special case: k = 0
 Functional space: Φ = { f_{j,p_l} }
 Fitness criterion (a.k.a. loss function):

arg min_{j,l} [ Σ_i | y_i − f_{j,p_l}(x_i) | + λ·Ω(f_{j,p_l}) ]

 Sampling strategy:

arg min_{x_i ∈ {x_{k+1},…,x_n}} L( f̂(x_test, y_test) | {x_1,…,x_k} ∪ {x_i} )
Sampling Strategies
 Random sampling (preserves distribution)
 Uncertainty sampling (Tong & Koller, 2000); see the sketch after this list
  proximity to decision boundary
  maximal distance to labeled x’s
 Density sampling (kNN-inspired; McCallum & Nigam, 2004)
 Representative sampling (Xu et al, 2003)
 Instability sampling (probability-weighted)
 x’s that maximally change decision boundary
 Ensemble Strategies
 Boosting-like ensemble (Baram, 2003)
 DUAL (Donmez & Carbonell, 2007)
 Dynamically switches strategies from Density-Based to
Uncertainty-Based by estimating derivative of expected
residual error reduction
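A minimal uncertainty-sampling sketch, assuming scikit-learn is available; margin to the decision boundary is the uncertainty measure, and the commented lines show one round of the query loop.

```python
# Query the unlabeled point whose top-two class probabilities are closest,
# i.e. the point nearest the decision boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_query(model, X_unlabeled):
    """Return the index of the most uncertain unlabeled example."""
    proba = model.predict_proba(X_unlabeled)
    top2 = np.sort(proba, axis=1)[:, -2:]     # two largest class probabilities
    margins = top2[:, 1] - top2[:, 0]         # small margin = high uncertainty
    return int(np.argmin(margins))

# One round of the active-learning loop (X_labeled, y_labeled, X_unlabeled assumed given):
# model = LogisticRegression().fit(X_labeled, y_labeled)
# i = uncertainty_query(model, X_unlabeled)   # ask the oracle to label X_unlabeled[i]
```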
Which point to sample?
Green = unlabeled
Red = class A
Brown = class B
Density-Based Sampling: centroid of the largest unsampled cluster
Uncertainty Sampling: closest to the decision boundary
Maximal Diversity Sampling: maximally distant from labeled x’s
Ensemble-Based Possibilities: uncertainty + diversity criteria, or density + uncertainty criteria
Active Learning Issues
 Interaction of active sampling with underlying
classifier(s).
 On-line sampling vs. batch sampling.
 Active sampling for rank learning and for
structured learning (e.g. HMMs, sCRFs).
 What if Oracle is fallible, or reluctant, or differentially expensive → proactive learning.
 How does noisy data affect active learning?
 What if we do not have even the first labeled point(s) for one or more classes? → new class discovery.
 How to “optimally” combine A.L. strategies?
Strategy Selection:
No Universal Optimum
• Optimal operating
range for AL sampling
strategies differs
• How to get the best of
both worlds?
• (Hint: ensemble
methods, e.g. DUAL)
Motivation for DUAL
 Strength of DWUS:
 favors higher density samples close to the decision boundary
 fast decrease in error
 But! DWUS exhibits diminishing returns. Why?
• Early iterations -> many points are highly uncertain
• Later iterations -> points with high uncertainty are no longer in dense regions
• DWUS wastes time picking instances with no direct effect on the error
How does DUAL do better?
 Runs DWUS until it estimates a cross-over point
 Monitor the change in expected error at each iteration to detect when it is stuck in a local minimum:

∂ε̂(DWUS)/∂x_t ≈ 0,  where  ε̂(DWUS) = (1/n_t) Σ_i E[(ŷ_i − y_i)² | x_i]

 DUAL uses a mixture model after the cross-over (saturation) point:

x_s* = argmax_{i ∈ I_U} [ λ · E[(ŷ_i − y_i)² | x_i] + (1 − λ) · p(x_i) ]

 Our goal should be to minimize the expected future error
 If we knew the future error of Uncertainty Sampling (US) to be zero, then we’d force λ = 1
 But in practice, we do not know it
More on DUAL
 After the cross-over, US does better => the uncertainty score should be given more weight
 λ should reflect how well US performs
 λ can be calculated from the expected error of US on the unlabeled data* => λ = 1 − ε̂(US)
 Finally, we have the following selection criterion for DUAL:

x_s* = argmax_{i ∈ I_U} [ (1 − ε̂(US)) · E[(ŷ_i − y_i)² | x_i] + ε̂(US) · p(x_i) ]

* US is allowed to choose data only from among the already sampled instances, and ε̂(US) is calculated on the remaining unlabeled set.
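A hedged sketch of this post-cross-over selection rule; the inputs (expected per-point error, density, and the estimated error of uncertainty sampling) are assumed precomputed, and the names are illustrative:

```python
# Score each unlabeled point by a mixture of its expected error (uncertainty)
# and its density, weighted by how well uncertainty sampling is estimated to do.
import numpy as np

def dual_select(expected_error, density, err_us):
    """expected_error[i] ~ E[(y_i - yhat_i)^2 | x_i], density[i] ~ p(x_i),
    err_us ~ estimated error of pure uncertainty sampling on the unlabeled set."""
    scores = (1.0 - err_us) * expected_error + err_us * density
    return int(np.argmax(scores))

# As err_us -> 0 the rule trusts uncertainty alone; as it grows, density gets more weight.
# i = dual_select(expected_error, density, err_us=0.2)
```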
Results: DUAL vs DWUS
Paired Density-Based Sampling
(Donmez & Carbonell, 2008)
 Desiderata
 Balanced Sampling from both (all) classes
 Combine density-based with coverage-based
 Method
 Non-Euclidean distance function:

d(x_i, x_j) = ln( 1 + min_{p ∈ P_ij} Σ_{k=1}^{|p|−1} ( e^{||p_k − p_{k+1}||} − 1 ) )
 Select maximally separated pairs of points
based on maximizing a utility function
Paired Density Method (cont.)
 Utility function:

U(i, j) = log[ p̂(x_i) · p̂(x_j) ]
        + log[ Σ_{k ≠ i, k ∈ N_{x_i}} exp(−||x_i − x_k||²) · min_{y_k ∈ {±1}} P̂(y_k | x_k) ]
        + log[ Σ_{r ≠ j, r ∈ N_{x_j}} exp(−||x_j − x_r||²) · min_{y_r ∈ {±1}} P̂(y_r | x_r) ]
        + s · [ min_{y_i ∈ {±1}} P̂(y_i | x_i) + min_{y_j ∈ {±1}} P̂(y_j | x_j) ]

 Select the two points that optimize the utility and are maximally distant:

(i*, j*) = argmax_{i ≠ j ∈ I_U} ||x_i − x_j||² · U(i, j)
Results of Paired-Density Sampling
Active Learning model in NLP
[Diagram: an Active Learner selects samples from an un-annotated corpus / unlabeled set, sends them for annotation (e.g. translation), and adds them to the active training set; the resulting training data is used to build NLP components (parsing model, machine translation system, named entity recognition module, word sense disambiguation model), which are evaluated on test data.]
 Needed in NLP for parsing, translation, search…
 Example:
 Line → ax+by+c, rope, queue, track, …
 “Banco” → bench, financial institution, sand bank, …
 Challenge: How to disambiguate from context
 Approach: Build ML classifier (sense = class)
 Problem: Insufficient training data
 Amelioration: Active Learning
Word Sense Disambiguation:
Active Learning Methods
 Entropy Sampling
 Vector q represents the trained model’s predictions
 qc prediction probability of class c
 Pick the example whose prediction vector displays the
greatest entropy
 Margin Sampling
  If c and c′ are the two most likely categories, pick the example with the smallest margin between their prediction probabilities
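A compact sketch of both criteria, assuming `proba` is an (examples × senses) array of the trained model's prediction probabilities; the function names are illustrative:

```python
# Entropy sampling and margin sampling over a matrix of predicted sense probabilities.
import numpy as np

def entropy_sampling(proba):
    """Pick the example whose prediction vector has the greatest entropy."""
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return int(np.argmax(entropy))

def margin_sampling(proba):
    """Pick the example with the smallest margin between the top two senses."""
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return int(np.argmin(margin))
```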
Word Sense Disambiguation:
Experiment
On 5 English verbs that had coarse-grained senses.
 Double-blind tagging applied to 50 instances of the target word
 If the inter-tagger (ITA) agreement < 90%, the sense entry is
revised by adding examples and explanations
Word Sense Disambiguation Results
Active vs. Proactive Learning
ACTIVE LEARNING
 All x’s cost the same to label
 Max number of labels
 Omniscient oracle
  Never errs
 Indefatigable oracle
  Always answers
 Single oracle
  Oracle selection unnecessary

PROACTIVE LEARNING
 Labeling cost is f1(D(x), O)
 Max labeling budget
 Fallible oracles
  Errs with p(E(x)) ~ f2(D(x), O)
 Reluctant oracles
  Answers with p(A(x)) …
 Multiple oracles
  Joint optimization of oracle and instance selection
Scenario 1: Reluctance
 2 oracles:
 reliable oracle: expensive but always answers
with a correct label
 reluctant oracle: cheap but may not respond to
some queries
 Define a utility score as the expected value of information at unit cost:

U(x, k) = P(ans | x, k) · V(x) / C_k
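A minimal sketch of using this utility to pick an (instance, oracle) pair jointly; the answer probabilities, values V(x), and per-oracle costs C_k are assumed precomputed, and all names are illustrative:

```python
# Pick the unlabeled instance and oracle with the highest expected value of
# information per unit cost, U(x, k) = P(ans | x, k) * V(x) / C_k.
def best_instance_and_oracle(p_ans, value, cost):
    best = None
    for k, c_k in cost.items():                 # k indexes oracles
        for i, v in enumerate(value):           # i indexes unlabeled instances
            utility = p_ans[k][i] * v / c_k
            if best is None or utility > best[0]:
                best = (utility, i, k)
    return best[1], best[2]                     # instance to query, oracle to ask

# Example with a reliable-but-expensive and a reluctant-but-cheap oracle:
# p_ans = {"reliable": [1.0, 1.0], "reluctant": [0.9, 0.3]}
# i, k = best_instance_and_oracle(p_ans, value=[0.5, 0.8],
#                                 cost={"reliable": 10, "reluctant": 1})
```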
How to estimate P̂(ans | x, k)?
 Cluster unlabeled data using k-means
 Ask the label of each cluster centroid to the reluctant oracle. If
  label received: increase P̂(ans | x, reluctant) of nearby points
  no label: decrease P̂(ans | x, reluctant) of nearby points

P̂(ans | x, reluctant) = (1/Z) · exp{ h(x_c^t, y_c^t) · [ ln( maxd / ||x_c^t − x|| ) ]^0.5 },  ∀ x ∈ C^t

 h(x_c, y_c) ∈ {1, −1}: equals 1 when a label is received, −1 otherwise (Z a normalizing constant, maxd the maximum distance)
 # clusters depends on the clustering budget and oracle fee
Algorithm for Scenario 1
Scenario 2: Fallibility
 Two oracles:
 One perfect but expensive oracle
 One fallible but cheap oracle, always answers
 Alg. Similar to Scenario 1 with slight modifications
 During exploration:
 Fallible oracle provides the label with its confidence
 Confidence = P̂(y | x) of the fallible oracle
 If P̂(y | x) ∈ [0.45, 0.5] then we don’t use the label, but we still update P̂(correct | x, k)
Scenario 3: Non-uniform Cost
 Uniform cost: Fraud detection, face recognition, etc.
 Non-uniform cost: text categorization, medical diagnosis,
protein structure prediction, etc.
 2 oracles:
 Fixed-cost Oracle
 Variable-cost Oracle
C non unif (x )  1 
December, 2008
max y Y Pˆ(y | x )  1 Y
1 1 Y
© 2008, Jaime G. Carbonell
45
Outline of Scenario 3
Underlying Sampling Strategy
 Conditional entropy based sampling, weighted by a density measure
 Captures the information content of a close neighborhood

U(x) = log[ min_{y∈{±1}} P̂(y | x, ŵ) + Σ_{k ∈ N_x} exp(−||x − k||²) · min_{y∈{±1}} P̂(y | k, ŵ) ]

where N_x = the close neighbors of x
Results: Reluctance
Cost varies non-uniformly
statistically significant (p < 0.01)
Proactive Learning in General
 Multiple Experts (a.k.a. Oracles)
 Different areas of expertise
 Different costs
 Different reliabilities
 Different availability
 What question to ask and whom to query?
 Joint optimization of query & oracle selection
 Referrals among Oracles (with referral fees)
 Learn about Oracle capabilities as well as
solving the Active Learning problem at hand
Unsupervised Learning in DM
 What does it mean to learn without an objective?
 Explore the data for natural groupings
 Learn association rules, and later examine
whether they can be of any business use
 Illustrative examples
 Market basket analysis → later optimize shelf allocation & placements
 Cascaded or correlated mechanical faults
 Demographic grouping beyond known classes
 Plan product bundling offers
Example Similarity Functions
 Determine a similarity metric
  Euclidean
  Cosine
  KL-divergence

sim_euclid(d_i, d_j) = [ Σ_{k=1,…,n} (d_{i,k} − d_{j,k})² ]^{1/2}

sim_cos(q, d_i) = (q · d_i) / ( ||q||_2 · ||d_i||_2 )
 Determine a clustering algorithm
 Incremental, agglomerative, K-means, …
Hierarchical Agglomerative Clustering Methods
 Generic Agglomerative Procedure (Salton '89), resulting in nested clusters via iterations:
1. Compute all pairwise document-document similarity
coefficients
2. Place each of n documents into a class of its own
3. Merge the two most similar clusters into one;
- replace the two clusters by the new cluster
- recompute intercluster similarity scores w.r.t. the new
cluster
- If cluster radius > max-size, block further merging
4. Repeat the above step until there are only k clusters left
(note k could = 1).
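A compact sketch of this procedure (single-link merging over cosine similarities; the max-size check in step 3 is omitted, and the names are illustrative):

```python
# Generic agglomerative clustering: merge the two most similar clusters until k remain.
import numpy as np

def agglomerative(doc_vectors, k):
    X = np.asarray(doc_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                                   # step 1: pairwise similarities
    clusters = [[i] for i in range(len(X))]         # step 2: one class per document
    while len(clusters) > k:                        # step 4: repeat until k clusters
        best, pair = -np.inf, None
        for a in range(len(clusters)):              # step 3: find the two most similar
            for b in range(a + 1, len(clusters)):
                s = max(sim[i, j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]     # merge and replace the two clusters
        del clusters[b]
    return clusters
```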
Group Agglomerative Clustering
[Figure: nine numbered points merged step by step into nested clusters.]
K-Means Clustering
1. Select k-seeds s.t. d(ki,kj) > dmin
2. Assign points to clusters by minimum distance:
   Cluster(p_i) = Argmin_{s_j ∈ {s_1,…,s_k}} d(p_i, s_j)
3. Compute new cluster centroids:
   c_j = (1/n) Σ_{p_i ∈ j-th cluster} p_i
4. Reassign points to clusters (as in 2 above)
5. Iterate until no points change clusters
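A direct sketch of these steps in Python (random seeding is used instead of the d_min check; the names are illustrative):

```python
# k-means: seed, assign points to nearest centroid, recompute centroids,
# iterate until assignments stop changing.
import numpy as np

def kmeans(points, k, rng=np.random.default_rng(0)):
    X = np.asarray(points, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: seeds
    assignment = np.full(len(X), -1)
    while True:
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)                  # steps 2/4: nearest centroid
        if np.array_equal(new_assignment, assignment):         # step 5: stop when stable
            return centroids, assignment
        assignment = new_assignment
        for j in range(k):                                     # step 3: recompute centroids
            members = X[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
```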
K-Means Clustering: Initial Data Points
[Figure] Step 1: select k random seeds s.t. d(k_i, k_j) > d_min (initial seeds, if k = 3)

K-Means Clustering: First-Pass Clusters
[Figure] Step 2: assign points to clusters by minimum distance: Cluster(p_i) = Argmin_{s_j ∈ {s_1,…,s_k}} d(p_i, s_j)

K-Means Clustering: Seeds → Centroids
[Figure] Step 3: compute new cluster centroids: c_j = (1/n) Σ_{p_i ∈ j-th cluster} p_i

K-Means Clustering: Second-Pass Clusters
[Figure] Step 4: recompute Cluster(p_i) = Argmin_{c_j ∈ {c_1,…,c_k}} d(p_i, c_j); note that some data points are reassigned
Cluster Optimization (finding “k”)
 Choose k to minimize the ratio of within-cluster to between-cluster distances:

k = Argmin_{k∈[1,n]} [ average(d(x_i, x_j) : x_i, x_j in the same cluster, i ≠ j) / average(d(x_k, x_l) : x_k, x_l in different clusters, k ≠ l) ]

k = Argmin_{k∈[1,n]} [ (1/k) Σ_{c∈C_k} (1/|c|²) Σ_{x_i ≠ x_j ∈ c} d(x_i, x_j) ] / [ (1/k²) Σ_{c_l ≠ c_m ∈ C_k} d(cen(c_l), cen(c_m)) ]

k = Argmin_{k∈[1,n]} [ (1/k) Σ_{c∈C_k} (1/|c|²) Σ_{x_i ≠ x_j ∈ c} d(x_i, x_j) ] / [ Σ_{c_l ≠ c_m ∈ C_k} d(cen(c_l), cen(c_m)) ]
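A hedged sketch of applying this criterion: run k-means for each candidate k and keep the k with the smallest intra/inter distance ratio. It reuses the kmeans() sketch above; a real implementation would average over restarts.

```python
# Pick k by minimizing average intra-cluster distance / average inter-centroid distance.
import numpy as np

def choose_k(points, candidate_ks):
    X = np.asarray(points, dtype=float)
    best_k, best_score = None, np.inf
    for k in candidate_ks:
        centroids, assignment = kmeans(X, k)
        intra = np.mean([np.linalg.norm(x - centroids[j])
                         for x, j in zip(X, assignment)])
        inter = np.mean([np.linalg.norm(centroids[a] - centroids[b])
                         for a in range(k) for b in range(a + 1, k)])
        score = intra / inter if inter > 0 else np.inf
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```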
Clustering for Novelty Detection
Functionality
 Build background model
  Expected Events (clusters)
 Find divergences
  Individual outliers (but many false positives)
  New mini-clusters (unmasked new-event detection)
  Detect when a novel event is masked by ordinary ones
 Trigger Alerts
  Route & prioritize
  Formulate hypotheses for the Analyst

Technology
 (Hierarchical) k-means
 Divergence metrics
  Radial density gradients from cluster centroid
  Temporally-adaptive distance measures
  Secondary peaks in density function
 Modeling methods
  Create analyst profiles
  RETE-based SAMs methods (last PI-meeting ARGUS paper)
Cluster Evolution
Constant Event | New Obfuscated Event | New Un-obfuscated Event | Growing Event

Cluster Density Changes
Constant Event | New Obfuscated Event | New Unobfuscated Event | Growing Event

Divergence score (both views): δ(x) = λ · ∂ρ(x)/∂r + (1 − λ) · max_j( · )
Novelty Detection and Profile Management 1
[Diagram: data streams and analyst profiles feed a matcher; novelty detection proposes new profiles, which the analyst reviews.]
Results on Medical Data
New Mini-Cluster Analysis reveals outbreaks of:
• Tularemia
• Dengue Fever
• Myiasis
• Chagas Disease
SARS Outbreak simulation
 Added new records for patients
from a small geographical region
diagnosed with influenza in
9/2001
 Graph shows resulting secondary
peak in the pulmonary disease
density function
What’s Rare Category Detection
 Start de-novo
 Very skewed classes
 Majority classes
 Minority classes
 Labeling oracle
 Goal
 Discover minority classes
with a few label requests
Comparison with Outlier
Detection
 Rare classes
  A group of points
  Clustered
  Non-separable from the majority classes
 Outliers
  A single point
  Scattered
  Separable
Applications: fraud detection, network intrusion detection, astronomy, spam image detection
The Big Picture
[Diagram: Raw Data → Feature Extraction → Feature Representation (relational, temporal, …) → Unbalanced, Unlabeled Data Set → Rare Category Detection → Learning in Unbalanced Settings → Classifier.]
Questions We Want to Address
 How to detect rare categories in an unbalanced, unlabeled
data set with the help of an oracle?
 How to detect rare categories with different data types, such
as graph data, stream data, etc?
 How to do rare category detection with the least information
about the data set?
 How to select relevant features for the rare categories?
 How to design effective classification algorithms which fully
exploit the property of the minority classes (rare category
classification)?
Notation
 Unlabeled examples: S = {x_1, …, x_n}, x_i ∈ ℝ^d
 m classes: y_i ∈ {1, …, m}
 m − 1 rare classes with priors p_2, …, p_m
 One majority class with prior p_1, p_1 ≫ p_c, 2 ≤ c ≤ m
 Goal: find at least ONE example from each rare class by requesting a few labels
Assumptions
 The distribution of the majority class is sufficiently
smooth
 Examples from the minority classes form compact
clusters in the feature space
[Figure: a smooth majority-class density with compact minority-class clusters superimposed.]
Two Classes: NNDB
1. Calculate the class-specific radius r'
2. ∀ x_i ∈ S: NN(x_i, r') = { x : ||x − x_i|| ≤ r' },  n_i = |NN(x_i, r')|
3. s_i = max_{x_j ∈ NN(x_i, t·r')} ( n_i − n_j )
4. Query x' = argmax_{x_i ∈ S} s_i
5. Is x' from the rare class?
   No → increase t by 1 and go back to step 3
   Yes →
6. Output x'
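A hedged sketch of one NNDB round: count neighbors within a class-specific radius, score each point by the largest drop in neighbor count within a grown radius, and query the top-scoring point. The radius r and growth factor t are assumed given; in the full algorithm r is derived from the rare-class prior and t grows by 1 whenever the queried point is not from the rare class.

```python
# One NNDB scoring pass over a small data set.
import numpy as np

def nndb_query(X, r, t):
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = (dists <= r).sum(axis=1)                     # n_i = |NN(x_i, r)|
    scores = np.empty(len(X))
    for i in range(len(X)):
        neighborhood = np.where(dists[i] <= t * r)[0]
        scores[i] = (n[i] - n[neighborhood]).max()   # s_i = max change in local density
    return int(np.argmax(scores))                    # index to send to the oracle
```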
NNDB: Calculate Nearest Neighbors
[Figure: neighborhoods of radius r' around each point.]

NN(x_i, r') = { x : ||x − x_i|| ≤ r' },  n_i = |NN(x_i, r')|
NNDB: Calculate the Scores
[Figure: enlarged neighborhood of radius t·r'.]

s_i = max_{x_j ∈ NN(x_i, t·r')} ( n_i − n_j );  query x' = argmax_{x_i ∈ S} s_i
NNDB: Pick the Next Candidate
[Figure: neighborhood grown to radius (t+1)·r'.]

Increase t by 1;  s_i = max_{x_j ∈ NN(x_i, (t+1)·r')} ( n_i − n_j );  query x' = argmax_{x_i ∈ S} s_i
Why NNDB Works
 Theoretically
 Theorem 1 [He & Carbonell 2007]: under
certain conditions, with high probability, after a
few iteration steps, NNDB queries at least one
example whose probability of coming from the
minority class is at least 1/3
 Intuitively
 The score s_i measures the change in local density
[Figure: scatter plot illustrating the local density change around the minority cluster.]
Multiple Classes: ALICE
 m − 1 rare classes with priors p_2, …, p_m
 One majority class with prior p_1, p_1 ≫ p_c, 2 ≤ c ≤ m

1. For each rare class c, 2 ≤ c ≤ m:
2.   If we have already found examples from class c, move on to class c + 1; otherwise:
3.   Run NNDB with prior p_c
Why ALICE Works
 Theoretically
 Theorem 2 [He & Carbonell 2008]: under
certain conditions, with high probability, in
each outer loop of ALICE, after a few
iteration steps in NNDB, ALICE queries at least
one example whose probability of coming from
one minority class is at least 1/3
Implementation Issues
 ALICE
 Problem: repeatedly sampling from the same rare class
 MALICE
 Solution: relevance feedback (class-specific radius)
Results on Synthetic Data Sets
[Figure: results on a 2D synthetic data set.]
Summary of Real Data Sets
 Abalone
  4177 examples
  7-dimensional features
  20 classes
  Largest class: 16.50%
  Smallest class: 0.34%
 Shuttle
  4515 examples
  9-dimensional features
  7 classes
  Largest class: 75.53%
  Smallest class: 0.13%
Results on Real Data Sets
[Figures: Abalone and Shuttle; curves compare MALICE, Interleave, and Random sampling.]
Imprecise priors
[Figures: Abalone and Shuttle; classes discovered vs. number of selected examples when the supplied priors are perturbed by −20%, −10%, −5%, 0, +5%, +10%, +20%.]
Specially Designed Exponential
Families [Efron & Tibshirani 1996]
 Favorable compromise between parametric and nonparametric density estimation
 Estimated density:

g(x) = g_0(x) · exp( θ_0 + θ_1^T · t(x) )

where g_0(x) is the carrier density, θ_0 the normalizing parameter, θ_1 a p×1 parameter vector, and t(x) a p×1 vector of sufficient statistics
SEDER Algorithm
 Carrier density: kernel density estimator
 Sufficient statistics: t(x) = [ (x^1)², …, (x^d)² ]^T
 To decouple the estimation of the different parameters:
  Decompose θ_0 = Σ_{j=1}^d θ_0^j
  Relax the normalization constraint so that, for each feature j and each kernel centered at x_i:

∫ exp(θ_0i^j) · exp( θ_1^j (x^j)² ) · (1 / (√(2π) σ_j)) · exp( −(x^j − x_i^j)² / (2σ_j²) ) dx^j = 1
Parameter Estimation
 Theorem 3 [To appear]: the maximum likelihood estimates θ̂_0i^j and θ̂_1^j (j = 1, …, d) satisfy a coupled set of conditions: each θ̂_0i^j is fixed by the relaxed per-kernel normalization constraint above, and θ̂_1^j is set by matching a kernel-weighted average of (x^j)² to its empirical average (1/n) Σ_k (x_k^j)²
Parameter Estimation cont.
 Reparameterize θ_1^j through a positive parameter b^j
 For j = 1, …, d, b̂^j is obtained as the root of a quadratic, b̂^j = ( −B ± √(B² − 4AC) ) / (2A), where A, B, and C are kernel-weighted sums over the data (C involves (1/n) Σ_k (x_k^j)²)
 b̂^j ≈ 1 in most cases
Scoring Function
 The estimated density is a mixture of per-dimension Gaussian kernels rescaled by the b^j:

g̃_b(x) = (1/n) Σ_{i=1}^n Π_{j=1}^d (1 / (√(2π b^j) σ_j)) · exp( −(x^j − b^j x_i^j)² / (2 b^j (σ_j)²) )

 Scoring function: norm of the gradient

s_k = √( Σ_{l=1}^d [ Σ_{i=1}^n D_i(x_k) · (x_k^l − b^l x_i^l) / (b^l (σ_l)²) ]² )

where D_i(x) = (1/n) Π_{j=1}^d (1 / (√(2π b^j) σ_j)) · exp( −(x^j − b^j x_i^j)² / (2 b^j (σ_j)²) )
Results on Synthetic Data Sets
Summary of Real Data Sets
Data Set      n     d   m   Largest Class   Smallest Class
Ecoli         336   7   6   42.56%          2.68%      (moderately skewed)
Glass         214   9   6   35.51%          4.21%      (moderately skewed)
Page Blocks   5473  10  5   89.77%          0.51%      (extremely skewed)
Abalone       4177  7   20  16.50%          0.34%      (extremely skewed)
Shuttle       4515  9   7   75.53%          0.13%      (extremely skewed)
Moderately Skewed Data Sets
[Figures: Ecoli and Glass; MALICE shown for comparison.]
Extremely Skewed Data Sets
[Figures: Page Blocks, Abalone, and Shuttle; MALICE shown for comparison.]
Additional Notation
 W: n × n pairwise similarity matrix
 D: n × n diagonal matrix, D_ii = Σ_{j=1}^n W_ij
 W′ = D^{−1/2} W D^{−1/2}: normalized similarity matrix
 A = (I_{n×n} − αW′)^{−1}: global similarity matrix, where I_{n×n} is an identity matrix and α is a positive parameter close to 1
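A small sketch of computing this global similarity matrix with numpy; the choice α = 0.95 is illustrative.

```python
# Normalize a pairwise similarity matrix W and form A = (I - alpha * W')^{-1}.
import numpy as np

def global_similarity(W, alpha=0.95):
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    W_norm = D_inv_sqrt @ W @ D_inv_sqrt                       # W' = D^{-1/2} W D^{-1/2}
    return np.linalg.inv(np.eye(len(W)) - alpha * W_norm)      # A = (I - alpha W')^{-1}
```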
Global Similarity Matrix

A = ( I_{n×n} − αW′ )^{−1}

 Better than the pair-wise similarity matrix for rare category detection
GRADE: Full Prior Information
1. For each rare class c, 2 ≤ c ≤ m:
2.   Calculate the class-specific similarity threshold a_c
3.   ∀ x_i ∈ S: NN(x_i, a_c) = { x : A(x, x_i) ≥ a_c },  n_i^c = |NN(x_i, a_c)|
4.   s_i = max_{x_j ∈ NN(x_i, a_c^t)} ( n_i^c − n_j^c )   (relevance feedback)
5.   Query x' = argmax_{x_i ∈ S} s_i
6.   Is x' from class c?
       No → increase t by 1 and go back to step 4
       Yes →
7.   Output x'
GRADE-LI: Less Prior Information
1. Calculate the problem-specific similarity threshold a
2. ∀ x_i ∈ S: NN(x_i, a) = { x : A(x, x_i) ≥ a },  n_i = |NN(x_i, a)|
3. s_i = max_{x_j ∈ NN(x_i, a_t)} ( n_i − n_j )   (relevance feedback)
4. Query x' = argmax_{x_i ∈ S} s_i
5. Is x' from a new class?
     No → increase t by 1 and go back to step 3
     Yes →
6. Output x'
7. Repeat until the labeling budget is exhausted
Results on Real Data Sets
[Figures: Ecoli, Glass, Abalone, and Shuttle; MALICE shown for comparison.]
Applying Machine Learning for
Data Mining in Business
 Step 1: Have a clear objective to optimize
 Step 2: Have sufficient data
 Step 3: Clean, normalize, clean data some more
 Step 4: Make sure there isn’t an easy solution (e.g. a small number of rules from an expert)
 Step 5: Do the Data Mining for real
 Step 6: Cross-validate, improve, go to step 5
Managing the Data Mining Process
 Ingredients for successful DM
 Data (warehouse, stream, DBs, …)
 Right problems (objectives, …)
 Tools (Machine Learning tool suites, …)
 People (analogy to surgical team: next slide)
 Estimate (size) problem, approach, progress
 ROI (max, min, realistic)
 Determine if DM is likely best approach
 Deploy team
 Evaluate intermediate results
The Data Mining Team
 The Administrator (manager & domain)
 Pick problem, resources, ROI calc, monitor, …
 The Surgeon (ML specialist w/domain knowledge)
 Select ML method, predictor atts, objective, …
 The Anesthesiologist (preparer)
 Chief data specialist, sampling, coverage, …
 The Nurses (assistants)
 DB manager, programmers, gophers …
 The Medical Students
 Prepare new surgeons: learn by doing
Need Some Domain Expertise
 Data Preparation
 What are good candidate predictor att’s?
 How to combine multiple objectives?
 How to sample? (e.g. id cyclic periods)
 Progress monitoring and results interpretation
 How accurate must prediction be?
 Do we need more or different data?
 Are we pursuing reasonable objective(s)?
 Application of DM after accomplished
 Update of DM when/as environment evolves
Typical Data Mining Pitfalls
 Insufficient data to establish predictive patterns
 Incorrect selection of predictor attributes
 Statistics to the rescue (e.g. χ² test)
 Unrealistic objectives (e.g. fraud recovery)
 Inappropriate ML method selection
 Data preparation problems
 Failure to normalize across data sets
 Systematic bias in original data collection
 Belief in DM as panacea or black magic
 Giving up too soon (very common)
Final Words on Data Mining
 Data Mining is:
 1/3 science (math, algorithms, …)
 …and 1/3 engineering (data prep, analysis, …)
 …and 1/3 “art” (experience really counts)
 10 years ago it was mostly art
 10 years from now it will be mostly engineering
 What to expect from the research labs?
 Better supervised algorithms
 Focus on unsupervised learning + optimization
 Move to incorporate semi-structured (text) data
THANK YOU!