Data Mining:
Potentials and Challenges
Rakesh Agrawal
IBM Almaden Research Center
Thesis

• Data mining has started to live up to its promise in the commercial world, particularly in applications involving structured data
• Promising data mining applications in nonconventional domains are beginning to emerge, involving a combination of structured and unstructured data
• Investment in data mining research can have a large payoff
Outline

• Examples of some promising nonconventional data mining applications and technologies
• Some hurdles we need to cross
Identifying Social Links Using Association Rules
Input: Crawl of about 1 million pages
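As a rough illustration of how association rules could surface social links, the sketch below treats each crawled page as a transaction holding the person names found on it, then reports rules "A implies B" with their support and confidence. The slide does not say how names are extracted or which thresholds were used, so the function name `social_links`, the sample pages, and both thresholds are assumptions made for this example only.

```python
from collections import Counter
from itertools import combinations

def social_links(pages, min_support=2, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules between co-occurring names.

    pages: list of name lists, one list per crawled page (hypothetical input).
    support: number of pages that mention both names.
    confidence: support divided by the number of pages that mention lhs.
    """
    name_count = Counter()
    pair_count = Counter()
    for names in pages:
        unique = sorted(set(names))          # count each name once per page
        name_count.update(unique)
        pair_count.update(combinations(unique, 2))

    rules = []
    for (a, b), support in pair_count.items():
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):    # test the rule in both directions
            confidence = support / name_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Toy data standing in for the crawl; the names are made up.
pages = [
    ["Ann", "Bob"],
    ["Ann", "Bob", "Cy"],
    ["Ann", "Cy"],
    ["Bob", "Ann"],
    ["Dee"],
]
rules = social_links(pages)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support}, confidence={confidence:.2f}")
```

Counting only pairs keeps the sketch short; a crawl of a million pages would call for a proper frequent-itemset algorithm rather than this in-memory count.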
Website Profiling Using Classification
Input: Example pages for each category during training
Discovering Trends Using Sequential Patterns & Shape Queries
[Figure: support (%) by time period, 1990 to 1994, for the phrases "heat removal", "emergency cooling", "zirconium based alloy", and "feed water"]
Input: i) patent database ii) shape of interest
Discovering Micro-communities
Japanese elementary schools
Turkish student associations
Oil spills off the coast of Japan
Australian fire brigades
Aviation/aircraft vendors
Guitar manufacturers
complete 3-3 bipartite graph
Frequently co-cited pages are related. Pages with large
bibliographic overlap are related.
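To make the "complete 3-3 bipartite graph" idea concrete, the sketch below looks for three pages that all link to the same three pages, which is one way to read the co-citation signal described here. It is a brute-force illustration under assumed inputs (a small dictionary mapping each page to the set of pages it links to); the names `bipartite_cores`, `links`, and the page labels are hypothetical, and nothing here is meant to reflect how the search was done at web scale.

```python
from itertools import combinations

def bipartite_cores(links, i=3, j=3):
    """Find complete i-by-j bipartite subgraphs in a small link graph.

    links: dict mapping a page to the set of pages it links to.
    Returns a list of (fans, centers) tuples where every fan links to
    every center.
    """
    cores = []
    for fans in combinations(sorted(links), i):
        # Pages cited by all of the chosen fans.
        common = set.intersection(*(set(links[f]) for f in fans))
        common -= set(fans)                  # keep the two sides disjoint
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores

# Toy link graph: f1, f2 and f3 all cite c1, c2 and c3.
links = {
    "f1": {"c1", "c2", "c3", "x"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "y"},
    "f4": {"c1", "y"},
}
cores = bipartite_cores(links)
print(cores)
```

The loop over all fan triples grows cubically with the number of pages, so this only works on toy graphs.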
Technical Chasms

• Privacy Concerns?
– Privacy-preserving data mining
• Data for data mining?
– Data mining over compartmentalized databases
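Inducing Classifiers over Privacy Preserved Numeric Data
[Figure: individual records (labels: Alice's age, Alice's salary, John's age), such as 30 | 25K | … and 50 | 40K | …, each pass through a randomizer, giving 65 | 50K | … and 35 | 60K | …; for example, an age of 30 becomes 65 (30+35). The randomized records are used to reconstruct the age distribution and the salary distribution, which feed a decision tree algorithm that produces the model.]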
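The randomize-then-reconstruct flow can be sketched in a few lines. The example below adds uniform integer noise to each age (so 30 may become 65, as in 30+35) and then estimates the original age distribution from the noisy values with an iterative Bayesian update. The slide does not state which noise distribution or reconstruction procedure is used, so uniform noise, the range 0 to 99, the iteration count, and the names `randomize` and `reconstruct` are all assumptions made for illustration.

```python
import random
from collections import Counter

def randomize(values, alpha, rng):
    """Add independent uniform integer noise from [-alpha, alpha] to each value."""
    return [v + rng.randint(-alpha, alpha) for v in values]

def reconstruct(randomized, alpha, lo, hi, iterations=100):
    """Estimate the distribution of the original values over lo..hi.

    Only the randomized values and the noise range are used. Because the
    noise is assumed uniform, its density cancels in the Bayes ratio.
    """
    support = range(lo, hi + 1)
    est = {a: 1.0 / len(support) for a in support}   # start from uniform
    observed = Counter(randomized)
    n = len(randomized)
    for _ in range(iterations):
        new = dict.fromkeys(support, 0.0)
        for w, count in observed.items():
            # Original values that could have produced w under the noise.
            candidates = range(max(lo, w - alpha), min(hi, w + alpha) + 1)
            denom = sum(est[a] for a in candidates)
            if denom == 0:
                continue                     # w is not explainable; skip it
            for a in candidates:
                new[a] += count * est[a] / denom
        est = {a: new[a] / n for a in support}
    return est

rng = random.Random(0)
ages = [rng.choice(range(20, 40)) for _ in range(1000)]   # made-up ages
noisy = randomize(ages, alpha=35, rng=rng)
est = reconstruct(noisy, alpha=35, lo=0, hi=99)
print(sum(est.values()))
```

Only the distribution is estimated, never any individual's age; a classifier would then be trained against the reconstructed distribution rather than the raw records.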
Works Well
[Figure: number of people by age for the original, randomized, and reconstructed distributions]
Accuracy vs. Randomization
[Figure: Fn 3, accuracy versus randomization level for the original, randomized, and reconstructed data]
Discovering frequent itemsets
Breach level = 50%.

Soccer: smin = 0.2%
Itemset Size   True Itemsets   True Positives   False Drops   False Positives
1              266             254              12            31
2              217             195              22            45
3              48              43               5             26

Mailorder: smin = 0.2%
Itemset Size   True Itemsets   True Positives   False Drops   False Positives
1              65              65               0             0
2              228             212              16            28
3              22              18               4             5
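The counts reported above are easier to compare as rates. The short sketch below expresses false drops and false positives as a percentage of the true itemsets for each row; it also checks that true positives plus false drops equals the number of true itemsets. Assigning the first group of numbers to Soccer and the second to Mailorder follows the order in which the dataset names appear and is an assumption, as is the choice of the true itemset count as the denominator for false positives.

```python
# (dataset, itemset size) ->
#   (true itemsets, true positives, false drops, false positives)
results = {
    ("Soccer", 1): (266, 254, 12, 31),
    ("Soccer", 2): (217, 195, 22, 45),
    ("Soccer", 3): (48, 43, 5, 26),
    ("Mailorder", 1): (65, 65, 0, 0),
    ("Mailorder", 2): (228, 212, 16, 28),
    ("Mailorder", 3): (22, 18, 4, 5),
}

def rates(true_itemsets, true_positives, false_drops, false_positives):
    """Return (false drop %, false positive %) relative to the true itemsets."""
    # Every true itemset is either found or dropped.
    assert true_positives + false_drops == true_itemsets
    return (100 * false_drops / true_itemsets,
            100 * false_positives / true_itemsets)

for (dataset, size), counts in results.items():
    drop_pct, fp_pct = rates(*counts)
    print(f"{dataset:9s} size {size}: "
          f"{drop_pct:5.1f}% false drops, {fp_pct:5.1f}% false positives")
```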
Computation over Compartmentalized Databases
"Frequent Traveler" Rating Model
• Randomized data shipping
• Local computations followed by combination of partial models
• On-demand secure data shipping and data composition
[Figure: separate databases labeled Email, Phone, Demographic, Criminal Records, State, Birth, Marriage, Local, Credit Agencies]
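One possible reading of "local computations followed by combination of partial models" is sketched below with a naive Bayes classifier: each database keeps its own attributes, computes a local log-likelihood for a person, and only those partial scores are combined. This is an illustration under strong assumptions, not the approach the slide prescribes: the class labels are assumed to be known to every site, the attribute names (`intl_calls`, `airline_card`) and the classes are invented, and no cryptographic protection is shown.

```python
import math
from collections import Counter, defaultdict

class SiteModel:
    """Naive Bayes counts for the attributes held by one database."""

    def __init__(self, records, labels):
        # records: {person_id: {attribute: value}}; labels: {person_id: class}
        self.class_count = Counter(labels[p] for p in records)
        self.value_count = defaultdict(Counter)  # (attribute, class) -> counts
        self.values = defaultdict(set)           # attribute -> values seen
        for person, attrs in records.items():
            for attr, value in attrs.items():
                self.value_count[(attr, labels[person])][value] += 1
                self.values[attr].add(value)

    def log_likelihood(self, attrs, cls):
        """log P(attrs | cls) with add-one smoothing, from local data only."""
        total = 0.0
        for attr, value in attrs.items():
            count = self.value_count[(attr, cls)][value]
            denom = self.class_count[cls] + len(self.values[attr] | {value})
            total += math.log((count + 1) / denom)
        return total

def combined_predict(sites, site_attrs, prior):
    """Add the log prior and each site's partial score; return the best class."""
    scores = {
        cls: math.log(p) + sum(site.log_likelihood(attrs, cls)
                               for site, attrs in zip(sites, site_attrs))
        for cls, p in prior.items()
    }
    return max(scores, key=scores.get)

labels = {1: "frequent", 2: "frequent", 3: "rare", 4: "rare"}
phone = SiteModel({1: {"intl_calls": "many"}, 2: {"intl_calls": "many"},
                   3: {"intl_calls": "few"}, 4: {"intl_calls": "few"}}, labels)
credit = SiteModel({1: {"airline_card": "yes"}, 2: {"airline_card": "yes"},
                    3: {"airline_card": "no"}, 4: {"airline_card": "yes"}}, labels)
prior = {"frequent": 0.5, "rare": 0.5}
print(combined_predict([phone, credit],
                       [{"intl_calls": "many"}, {"airline_card": "yes"}], prior))
```

Raw records never leave a site in this sketch, but the partial scores themselves can still leak information, which is where the secure shipping and composition options listed above would come in.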
Some Hard Problems

• The past may be a poor predictor of the future
– Abrupt changes
– Wrong training examples
• Reliability and quality of data
• Actionable patterns (principled use of domain knowledge?)
• Over-fitting vs. not missing the rare nuggets
• Richer patterns
• Simultaneous mining over multiple data types
• When to use which algorithm?
• Automatic, data-dependent selection of algorithm parameters
Summary

• Data mining has shown promise but we need further research to realize its full potential
"We stand on the brink of great new answers, but even more, of great new questions" -- Matt Ridley