Will Data Mining Change the Functions of DBMS?
Jiawei Han
DAIS (Data And Information Systems) Lab
University of Illinois at Urbana-Champaign
Will DM Be Integrated with DB Functions?


DM: Already a functional component of DBMS

Microsoft/SQLServer: Analysis Manager

IBM/DB2 & IntelligentMiner

Oracle: Data Mining Package
But will DM “intrude” into the DBMS, i.e., be integrated with essential DBMS functions?

Indexing

Data integration

Data cleaning

Query processing
Indexing by Data Mining


Indexing graphs? The number of subgraphs is exponential!

Chemical Informatics/bioinformatics …

Discriminative frequent graph patterns (SIGMOD’04)
Indexing subsequences?


Shopping sequence, DNA/protein sequence (SDM’05)
When is discriminative frequent pattern indexing useful?

Complex objects, big (object) queries
[Figure: sample database of graphs (a), (b), (c) and a query graph]
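
The filter-then-verify idea behind frequent pattern indexing can be sketched on sequences, which are simpler than graphs. This is a minimal illustration and not the SIGMOD’04 or SDM’05 method: it indexes only fixed-length substrings that meet a minimum support and skips the discriminative-selection step entirely. The names `build_index`, `search`, `k`, and `min_support` are assumptions made for this example.

```python
from collections import defaultdict

def build_index(sequences, k=3, min_support=2):
    """Map each frequent length-k substring to the ids of sequences containing it."""
    postings = defaultdict(set)
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            postings[seq[i:i + k]].add(sid)
    # Keep only features that occur in at least min_support sequences.
    return {feat: ids for feat, ids in postings.items() if len(ids) >= min_support}

def search(query, sequences, index, k=3):
    """Filter candidates with indexed features, then verify by direct matching."""
    candidates = set(range(len(sequences)))
    for i in range(len(query) - k + 1):
        ids = index.get(query[i:i + k])
        if ids is not None:          # non-indexed features simply do not prune
            candidates &= ids
    return sorted(sid for sid in candidates if query in sequences[sid])

# Hypothetical DNA-like sequences, for illustration only.
db = ["ACGTAC", "TTACGA", "GGACGT", "CCCCCC"]
index = build_index(db)
hits = search("ACGT", db, index)
```

Larger queries contain more indexed features, so the candidate set shrinks faster before the expensive verification step. That is one reason this style of index suits big object queries.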
Data Cleaning by Data Mining


Load messy data into a structured database?
  - Inconsistent data: age = “1946”?
  - Field misalignments
  - Data glitches: completely messed-up inputs
  - Missing or unmatched delimiters: XML, HTML data
  - Big fields: BLOB, CLOB, multimedia and text
Data mining
  - Data cleaning by distribution/outlier analysis
  - Dependency/correlation analysis
  - Schema-directed cleaning or schema “discovery”
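
Outlier-based cleaning can be sketched with a robust statistic. The example below flags values such as age = 1946 using the modified z-score (median and median absolute deviation). The function name `flag_outliers`, the threshold of 3.5, and the sample ages are assumptions chosen for illustration, not something prescribed by the slides.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        # All typical values are identical; anything different is suspect.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical "age" column in which a birth year was entered by mistake.
ages = [34, 45, 29, 1946, 52, 41, 38]
suspect = flag_outliers(ages)
```

A median-based score is used here because a single extreme value like 1946 would distort a mean and standard deviation computed over the same column.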
Data Integration by Data Mining


Linking and mining across multiple data relations
  - CrossMine (classification across multiple data relations: ICDE’04)
Search across heterogeneous databases
  - Object identification/merge, reference reconciliation (Alon’s group)
  - Mining across heterogeneous DBs
  - Personalizing data from heterogeneous sources
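
Object identification across sources can be illustrated with a simple token-overlap match. This is only a sketch under stated assumptions: Jaccard similarity over name tokens stands in for the far richer reference reconciliation techniques cited above. The helper names, the 0.5 threshold, and the company names are hypothetical.

```python
def tokens(text):
    """Lowercase a string and split it into a set of punctuation-free tokens."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def match_records(left, right, threshold=0.5):
    """Pair records from two sources whose names are sufficiently similar."""
    return [(l, r) for l in left for r in right if jaccard(l, r) >= threshold]

# Hypothetical records from two heterogeneous sources.
source_a = ["Acme Corp Chicago", "Globex Inc"]
source_b = ["ACME Corp., Chicago", "Initech LLC"]
matches = match_records(source_a, source_b)
```

The all-pairs comparison is quadratic in the number of records. A realistic system would first block or index candidate pairs.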
Query Processing by Data Mining

Query plan refinement based on query execution history

Better query planning by investigating additional data statistics

Current optimizer: key/foreign key, cardinality, # distinct values

Additional information:

Strong dependency/correlation

Histogram, dense vs. sparse regions, etc.
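
Why dependency information matters to a planner can be shown with a small selectivity calculation. The sketch below compares an estimate made under the usual column-independence assumption with the true selectivity on correlated columns. The table contents and the `selectivity` helper are hypothetical.

```python
def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

# Hypothetical (make, model) table: model functionally determines make.
rows = ([("Honda", "Civic")] * 30
        + [("Honda", "Accord")] * 20
        + [("Toyota", "Camry")] * 50)

sel_make = selectivity(rows, lambda r: r[0] == "Honda")
sel_model = selectivity(rows, lambda r: r[1] == "Civic")

# Estimate under independence: multiply the per-column selectivities.
independent_estimate = sel_make * sel_model

# True selectivity of the conjunctive predicate.
actual = selectivity(rows, lambda r: r[0] == "Honda" and r[1] == "Civic")
```

In this sample the independence estimate is half the true selectivity, which is the kind of error that mined dependencies or correlations could correct.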
Conclusions



DBers have been “invading” DM and have made great contributions
It is time to consider that DM may invade DBMS to enhance its functionality
General philosophy
  - Invisible data mining
  - Google is doing this successfully for page ranking
  - Can we do it to enhance DBMS?
  - You can do better if you know your data better!