From DBMiner to WebMiner: What is the Future of Data Mining?
Jiawei Han
Intelligent Database System Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca/~han
Tuesday, January 11, 2000
1
Data Mining: “Necessity is the
mother of invention”
On-line databases are widely available

NASA’s EOS (Earth Observing System), WWW,
Digital Library, stock market data, e-commerce, telecommunication data, credit card transactions, market
basket data, bio-medical data, etc.
We are drowning in data, but starving for
knowledge!
Requirements: fast response, interactive and
exploratory analysis, mining hidden patterns
2
Data Mining: A KDD Process

Data mining: the core of the
knowledge discovery
process.
[Figure: the KDD process as a sequence of steps]
Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
3
Data Mining and Business Intelligence
[Figure: pyramid, listed from top to bottom; the potential to support business decisions increases toward the top]
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
4
Why Data Mining? — Potential
Applications
Database analysis and decision support

Market analysis and management
- target marketing, customer relationship management, market
basket analysis, cross selling, market segmentation.

Risk analysis and management
- forecasting, customer retention, improved underwriting,
quality control, competitive analysis.

Fraud detection and management
Text mining (news, emails, documents) and Web mining.
BioInformatics (DNA), GeoInformatics (Maps, Remote
sensing data), Intelligent query answering
5
Data Mining: On What Kind of Data?
Relational databases and Transactional databases
Data warehouses
Advanced DBMS and information repositories

Object-oriented and object-relational databases

Spatial databases

Time-series data and temporal data

Text databases and multimedia databases

Heterogeneous and legacy databases

WWW
6
Data Mining: Confluence of Multiple
Disciplines
Database systems, data warehouse and OLAP
Statistics
Machine learning
Visualization
Information science
High performance computing
Business and application domain knowledge expertise
Other disciplines:

Neural networks, mathematical modeling, information
retrieval, pattern recognition, etc.
7
Data Mining: Major Tasks
Characterization and descriptive data mining

Data distribution, dispersion and exception
Association, correlation, causality analysis

Find rules like “inside(x, city) → near(x, highway)”
Classification and predictive modeling


Classify countries based on climate
Predict sales based on product qualification
Clustering and outlier analysis

Cluster houses to find distribution patterns
Temporal and sequential pattern analysis

Trend and deviation, sequential patterns, periodicity
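The association task listed on this slide can be made concrete with a small sketch. It computes the support and confidence of a rule such as inside(x, city) → near(x, highway) over a toy set of objects. The function name `rule_stats`, the predicate labels, and the data are illustrative assumptions; nothing here is taken from DBMiner itself.

```python
from typing import FrozenSet, Iterable, Tuple


def rule_stats(
    records: Iterable[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of all records containing antecedent and consequent
    confidence = fraction of antecedent records that also contain consequent
    """
    records = list(records)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: each record lists the predicates true of one object x.
objects = [
    frozenset({"inside_city", "near_highway"}),
    frozenset({"inside_city", "near_highway"}),
    frozenset({"inside_city"}),
    frozenset({"near_highway"}),
]
support, confidence = rule_stats(
    objects, frozenset({"inside_city"}), frozenset({"near_highway"})
)
```

A rule is usually reported only when both numbers exceed user-chosen thresholds; the thresholds themselves are a tuning decision and are not specified in the slides.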
8
Batch Data Mining vs. On-Line
Analytical Mining
Data mining — A costly process



Deep analysis: association, classification, prediction,
clustering, sequence analysis, outlier analysis, etc.
Huge amounts of data with wide diversity
Batch processing (“submit and wait?!”) is the
status quo, but it is not the answer!
On-line analytical mining (OLAM)


Fast, interactive mining of multi-dimensional
databases: response in seconds!
OLAM operations: mining with drilling, etc.
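One way to picture "mining with drilling" is to rerun the same mining step at different levels of a concept hierarchy. The sketch below is only an illustration under assumed names: the `CITY_TO_COUNTRY` hierarchy, the `SALES` rows, and the threshold-based `strong_cells` step are all hypothetical stand-ins, and a real OLAM engine would work on a precomputed data cube rather than on raw rows.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY: Dict[str, str] = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Seattle": "USA",
}

# Hypothetical fact rows: (city, product, units sold).
SALES: List[Tuple[str, str, int]] = [
    ("Vancouver", "milk", 30),
    ("Toronto", "milk", 25),
    ("Seattle", "milk", 5),
    ("Vancouver", "bread", 4),
    ("Seattle", "bread", 40),
]


def aggregate(level: str) -> Counter:
    """Sum units per (location, product) cell at the chosen location level."""
    cells: Counter = Counter()
    for city, product, units in SALES:
        location = city if level == "city" else CITY_TO_COUNTRY[city]
        cells[(location, product)] += units
    return cells


def strong_cells(level: str, min_units: int) -> Dict[Tuple[str, str], int]:
    """Stand-in mining step: keep only cells whose total reaches min_units."""
    return {
        cell: total
        for cell, total in aggregate(level).items()
        if total >= min_units
    }


rolled_up = strong_cells("country", min_units=40)  # coarse view first
drilled_down = strong_cells("city", min_units=25)  # then drill down to cities
```

The point of the design is that the mining function takes the aggregation level as a parameter, so the analyst can move between levels interactively instead of resubmitting a batch job.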
9
Expected Features of On-Line
Analytical Mining
Ability to mine anywhere
OLAP-like exploratory mining (interactive,
progressive deepening, intelligent focusing)
Efficient, data cube-based mining methods
Dynamic selection and integration of data
mining, OLAP, and statistical functions
Fast response and high performance
Visualization and extensibility
10
On-Line Analytical Mining: An Architecture
[Figure: four-layer OLAM architecture, listed from top to bottom]
Layer 4, User Interface (mining query in, mining result out)
User GUI API
Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine
Data Cube API
Layer 2, MDDB: MDDB, Meta Data
Database API (Filtering & Integration, Filtering)
Layer 1, Data Repository: Databases, Data Warehouse (data cleaning, data integration)
11
From Research Prototypes to
Data Mining System Products
DBMiner — One of the pioneering data mining systems.
Integration of data warehousing (OLAP) with data
mining

On-Line Analytical Mining.
From research prototype to Enterprise 2.0 (6 years R&D
results).
Demonstrated at many conferences; in trial use at
Boeing, HP, and Hughes Research Labs.
12
Distinct Features of DBMiner
Multiple data mining functions.

OLAP service, cube exploration, statistical analysis,
classification (market/customer segmentation,
decision trees), association (basket data analysis),
cluster analysis, etc.
On-line analytical mining of Microsoft/PLATO
OLAP cubes.
Data and knowledge visualization tools: visual
data mining.
OLEDB and RDBMS connections.
13
A Few Snapshots of DBMiner
OLAP-based graphical user interface
OLAP-based multi-dimensional analysis
Association rule graph
Association 2-D plane
Classification (decision tree analysis)
Cluster analysis
3-D cube viewer and analyzer
14
DBMiner Manager
15
OLAP (Summarization)
16
3D Cube Browser
17
Data Dispersion: Boxplot Analysis
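The screenshot for this slide is not reproduced in the transcript. As a rough illustration of what a boxplot summarizes, the sketch below computes quartiles, whisker ends, and outliers using the common 1.5 x IQR fence rule. The function name `boxplot_summary` and the fence rule are assumptions; the slides do not say which convention DBMiner uses.

```python
import statistics
from typing import Dict, List, Tuple


def boxplot_summary(values: List[float]) -> Tuple[Dict[str, float], List[float]]:
    """Return the boxplot statistics and the points outside the whiskers."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    summary = {
        "whisker_low": inside[0],
        "q1": q1,
        "median": statistics.median(data),
        "q3": q3,
        "whisker_high": inside[-1],
    }
    return summary, outliers


# Hypothetical measurements with one extreme value.
summary, outliers = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```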
18
Market-Basket-Analysis (Association Ball Graph)
19
Market-Basket-Analysis (Association Plane)
20
Decision Tree (Classification)
21
Clustering (Data Segmentation)
22
Brief History of DBMiner
Technology Inc
Research on data mining since 1989.
International reputation and recognition.
Substantial research support and contracts.
DBMiner Technology Inc.: A Simon Fraser University
Spin-Off Company
Incorporated in March 1997, dedicated to data mining
system development and commercialization.
Major products: DBMiner 2.0 (Enterprise)

Customization and application-oriented data mining systems

GeoMiner, WebMiner, WebLogMiner, …, more miners in progress
23
Mining Complex Data: Costly and
Largely Unexplored Frontier
Spatial OLAP and spatial data mining

maps, satellite images, geo-spatial modeling and
reasoning
Time-series and sequential pattern mining

pattern match, pattern discovery, trend and
periodicity analysis.
Mining hypertext and hypermedia data
Visual data mining
Scientific data mining
Web mining
24
Spatial OLAP: Pre- vs On-line
Computation
Precomputing all: too much storage space
On-line merge:
very expensive
25
Spatial Classification
Generalization-based induction
Interactive classification
26
Multimedia OLAP
27
28
From Coarse to
Fine Resolution
Mining
Progressively mine finer resolutions only on candidate frequent item-sets
[Figure labels: feature localization at coarse resolution and at fine resolution; minimum bounding circles; tile size]
Progressive Resolution Refinement
i = 0; D_0 = D;
while (i < maxResLevel) do {
    R_i = {sufficiently frequent item-sets at resolution level i};
    i = i + 1; D_i = Filter(D_{i-1}, R_{i-1});
}
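The loop above can be sketched in runnable form. This is a minimal illustration under stated assumptions, not the MultiMediaMiner implementation: items are modeled as tuples whose prefixes are their coarser views, item-sets are limited to size two, and `Filter` is interpreted as dropping every item that appears in no frequent item-set at the current level. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Item = Tuple[str, ...]  # hypothetical: a feature path, coarse to fine
Transaction = FrozenSet[Item]


def at_resolution(item: Item, level: int) -> Item:
    """View an item at a resolution level by truncating its path."""
    return item[: level + 1]


def frequent_itemsets(
    data: List[Transaction], level: int, min_count: int, max_size: int = 2
) -> Set[FrozenSet[Item]]:
    """R_i: item-sets (up to max_size) occurring in at least min_count transactions."""
    counts: Counter = Counter()
    for transaction in data:
        coarse = sorted({at_resolution(x, level) for x in transaction})
        for size in range(1, max_size + 1):
            for combo in combinations(coarse, size):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_count}


def filter_data(
    data: List[Transaction], frequent: Set[FrozenSet[Item]], level: int
) -> List[Transaction]:
    """Filter(D, R): drop items whose view at this level is in no frequent set."""
    keep = set().union(*frequent)
    filtered = [
        frozenset(x for x in t if at_resolution(x, level) in keep) for t in data
    ]
    return [t for t in filtered if t]


def progressive_refinement(
    data: List[Transaction], max_res_level: int, min_count: int
) -> Dict[int, Set[FrozenSet[Item]]]:
    """Mine level by level, pruning the data before each finer level."""
    results: Dict[int, Set[FrozenSet[Item]]] = {}
    i, d_i = 0, list(data)
    while i < max_res_level:
        results[i] = frequent_itemsets(d_i, i, min_count)
        d_i = filter_data(d_i, results[i], i)  # becomes D_{i+1}
        i += 1
    return results


# Hypothetical image descriptions: (feature, finer detail).
images = [
    frozenset({("sky", "blue"), ("water", "dark")}),
    frozenset({("sky", "blue"), ("water", "light")}),
    frozenset({("sky", "grey"), ("sand", "fine")}),
]
results = progressive_refinement(images, max_res_level=2, min_count=2)
```

In this toy run, "sand" is infrequent at the coarse level, so its fine-resolution form is never counted. That pruning is the saving the slide describes.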
29
Web Mining: Lots To Be Done!
A taxonomy of Web mining

Web content mining

Web usage mining
Interesting and challenging problems on Web mining

Mining what Web search engine finds

Weblog mining (usage, access, and evolution)

Identification of authoritative Web pages

Web document classification

Warehousing a Meta-Web: Web yellow page service

Intelligent query answering in Web search
Web mining requires your response in seconds!
30
Challenges to Web Mining
Web: A huge, widely-distributed, highly heterogeneous,
semi-structured, interconnected, evolving,
hypertext/hypermedia information repository.
Problems:

the “abundance” problem

limited coverage of the Web (hidden Web sources)

limited query interface: keyword-oriented search

limited customization to individual users
DBMS, DBers, and data miners will play an increasingly
important role in the new generation of Internet
31
Mine What Web Search Engine Finds
Current Web search engines: convenient source for
mining

keyword-based, return too many answers, low quality answers,
still missing a lot, not customized, etc.
Data mining will help:

coverage: “Enlarge and then shrink,” using synonyms and
conceptual hierarchies

better search primitives: user preferences/hints

linkage analysis: authoritative pages and clusters

Web-based languages: XML + WebSQL + WebML

customization: home page + Weblog + user profiles
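The "enlarge and then shrink" idea can be read as: first widen the query with synonyms so that fewer relevant pages are missed, then narrow the larger candidate set with user hints. The sketch below follows that reading, which is an interpretation rather than something the slides spell out. The synonym table, the documents, and the function names are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical synonym table (one level of a concept hierarchy).
SYNONYMS: Dict[str, Set[str]] = {"car": {"automobile", "auto"}}

# Hypothetical indexed documents: id -> set of keywords.
DOCS: Dict[str, Set[str]] = {
    "d1": {"automobile", "repair"},
    "d2": {"car", "rental"},
    "d3": {"auto", "repair", "shop"},
    "d4": {"bicycle", "repair"},
}


def enlarge(terms: Set[str]) -> Set[str]:
    """Add known synonyms to the query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def search(terms: Set[str]) -> Set[str]:
    """Keyword search: any document sharing at least one term."""
    return {doc_id for doc_id, words in DOCS.items() if words & terms}


def shrink(candidates: Set[str], hints: Set[str]) -> Set[str]:
    """Keep only candidates that contain every user hint."""
    return {doc_id for doc_id in candidates if hints <= DOCS[doc_id]}


plain = search({"car"})
enlarged = search(enlarge({"car"}))
final = shrink(enlarged, {"repair"})
```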
32
Web Log Mining
Weblog provides rich information about Web dynamics
Multidimensional Weblog analysis:

disclose potential customers, users, markets, etc.
Plan mining (mining general Web accessing regularities):

Web linkage adjustment, performance improvements
Web accessing association/sequential pattern analysis:

Web caching, prefetching, swapping
Trend analysis:

Dynamics of the Web: what has been changing?
Customized to individual users
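As a small illustration of access-pattern analysis for prefetching, the sketch below counts page-to-page transitions per user in a Weblog and proposes, for each page, the most frequent next page. The log format, the URLs, and the function names are assumptions made for the example; it is not the WebLogMiner method.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

LogRow = Tuple[str, int, str]  # hypothetical format: (user, timestamp, url)


def transition_counts(log: List[LogRow]) -> Counter:
    """Count how often page B is requested right after page A by the same user."""
    by_user: Dict[str, List[str]] = defaultdict(list)
    for user, _, url in sorted(log, key=lambda row: (row[0], row[1])):
        by_user[user].append(url)
    counts: Counter = Counter()
    for pages in by_user.values():
        counts.update(zip(pages, pages[1:]))
    return counts


def prefetch_candidates(log: List[LogRow], min_count: int) -> Dict[str, str]:
    """For each page, the most frequent next page seen at least min_count times."""
    best: Dict[str, Tuple[str, int]] = {}
    for (src, dst), count in transition_counts(log).items():
        if count >= min_count and count > best.get(src, ("", 0))[1]:
            best[src] = (dst, count)
    return {src: dst for src, (dst, _) in best.items()}


# Hypothetical Weblog rows.
weblog = [
    ("u1", 1, "/"), ("u1", 2, "/products"), ("u1", 3, "/cart"),
    ("u2", 1, "/"), ("u2", 2, "/products"),
    ("u3", 1, "/"), ("u3", 2, "/about"),
]
candidates = prefetch_candidates(weblog, min_count=2)
```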
33
Discovery of Authoritative Pages in WWW
Page-rank method (Brin and Page, 1998):
- Rank the "importance" of Web pages, based on a model
of a "random browser."
Hub/authority method (Kleinberg, 1998):
- Prominent authorities often do not endorse one another
directly on the Web.
- Hub pages have a large number of links to many relevant
authorities.
- Thus hubs and authorities exhibit a mutually reinforcing
relationship: a good hub links to many good authorities, and a good authority is linked from many good hubs.
Both the page-rank and hub/authority methodologies have
been shown to provide qualitatively good search results for
broad query topics on the WWW.
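Both methods can be sketched as simple iterative computations on a link graph. The code below is a minimal illustration, not a reproduction of either paper: the damping value 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the toy graph are all assumptions.

```python
import math
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]  # page -> pages it links to


def all_pages(graph: Graph) -> List[str]:
    return sorted(set(graph) | {q for targets in graph.values() for q in targets})


def pagerank(graph: Graph, damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Random-browser model: follow a link with prob. damping, else jump anywhere."""
    pages = all_pages(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = graph.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:  # page without out-links: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


def _normalize(scores: Dict[str, float]) -> Dict[str, float]:
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / norm for p, v in scores.items()} if norm else scores


def hits(graph: Graph, iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores via mutual reinforcement."""
    pages = all_pages(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub: sum of authority scores of the pages it links to.
        new_hub = {p: sum(new_auth[q] for q in graph.get(p, [])) for p in pages}
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return hub, auth


# Hypothetical graph: two hub pages that both point at two authorities.
web = {"hub1": ["authA", "authB"], "hub2": ["authA", "authB"], "authA": [], "authB": []}
rank = pagerank(web)
hub, auth = hits(web)
```

In the toy graph the two authorities never link to each other, yet both receive high authority scores because the same hubs point at them, which is the situation the Kleinberg bullets describe.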
34
Web Document Classification
Web document classification:

Good classification: Yahoo!, CS term hierarchies

Training set and learning model
Keyword-based classification is different from
multi-dimensional classification



association or clustering based classification is often
more effective
multi-level classification is important
See K. Wang’s work and also S. Chakrabarti’s
COMPUTER Aug.’99 paper.
35
Warehousing a Meta-Web:
An MLDB Approach
Meta-Web: A structure which summarizes the contents,
structure, linkage, and access of the Web and which
evolves with the Web
Layer0: the Web itself
Layer1: the lowest layer of the Meta-Web
- an entry: a Web page summary, including class, time,
URL, contents, keywords, popularity, weight, links, etc.
Layer2 and up: summary/classification/clustering in various
ways and distributed for various applications
Meta-Web can be warehoused and incrementally updated
Querying and mining can be performed on or assisted by
meta-Web (a multi-layer digital library catalogue, yellow
page).
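To make the layering concrete, the sketch below models a Layer1 entry with the fields named on the slide and derives one possible Layer2 summary by grouping entries by class. The field types, the `ClassSummary` structure, and the grouping rule are assumptions for illustration; the slides leave the exact generalization open ("in various ways").

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageSummary:
    """A Layer1 entry. Field names follow the slide; the types are assumed."""
    url: str
    page_class: str
    time: str
    contents: str = ""
    keywords: List[str] = field(default_factory=list)
    popularity: int = 0
    weight: float = 1.0
    links: List[str] = field(default_factory=list)


@dataclass
class ClassSummary:
    """One hypothetical Layer2 entry: a summary of all pages in a class."""
    page_class: str
    page_count: int
    total_popularity: int
    keywords: List[str]


def generalize(layer1: List[PageSummary]) -> Dict[str, ClassSummary]:
    """Build a Layer2 view by grouping Layer1 entries by class."""
    groups: Dict[str, List[PageSummary]] = defaultdict(list)
    for entry in layer1:
        groups[entry.page_class].append(entry)
    return {
        cls: ClassSummary(
            page_class=cls,
            page_count=len(entries),
            total_popularity=sum(e.popularity for e in entries),
            keywords=sorted({k for e in entries for k in e.keywords}),
        )
        for cls, entries in groups.items()
    }


# Hypothetical Layer1 entries.
layer1 = [
    PageSummary("http://example.org/a", "database", "2000-01-11",
                keywords=["olap", "cube"], popularity=10),
    PageSummary("http://example.org/b", "database", "2000-01-10",
                keywords=["mining", "olap"], popularity=5),
    PageSummary("http://example.org/c", "travel", "2000-01-09",
                keywords=["maps"], popularity=2),
]
layer2 = generalize(layer1)
```

Because Layer2 is computed purely from Layer1 entries, it can be refreshed incrementally when pages are added or changed, in line with the "warehoused and incrementally updated" point above.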
36
A Multiple Layered Meta-Web Architecture
[Figure: layers listed from top to bottom]
Layer n: More Generalized Descriptions
...
Layer 1: Generalized Descriptions
Layer 0
37
Construction of Multi-Layer Meta-Web
XML: facilitates structured and meta-information extraction
Hidden Web: DB schema “extraction” + other meta info
Automatic classification of Web documents:

based on Yahoo!, etc. as training set + keyword-based
correlation/classification analysis (IR/AI assistance)
Automatic ranking of important Web pages

authoritative site recognition and clustering Web pages
Generalization-based multi-layer meta-Web construction

With the assistance of clustering and classification
analysis
38
Use of Multi-Layer Meta-Web
Benefits of Multi-Layer Meta-Web:

Multi-dimensional Web info summary analysis

Approximate and intelligent query answering

Web high-level query answering (WebSQL, WebML)

Web content and structure mining

Observing the dynamics/evolution of the Web
Is it realistic to construct such a meta-Web?


Benefits even if it is partially constructed
Benefits may justify the cost of tool development,
standardization and partial restructuring
39
Intelligent Web Query Answering
What is intelligent query answering?

Smart alternative answers, summary information, etc.

Based on user’s profiles or history
Web query needs more intelligent query answering
mechanism
How to develop it?

Data warehouse and Web Yellow Page service will help

Data mining will help too!
40
Conclusions
Data Mining

A rich, promising, young field with broad applications
and many challenging research issues
Progress

From research prototype to an on-line analytical
mining system: DBMiner 2.0 (Enterprise)
Future work

Application-specific data mining

From DBMiner to WebMiner, and many more!
41
Current On-Going Projects (1)
Spatial data mining

GeoMiner: (SIGMOD’97 demo)

Spatial data warehouse modeling and spatial OLAP (TKDE’99)

Spatial data cube and on-line aggregation (PAKDD’98, SSD’99)

Constraint-based spatial clustering (VLDB’00 sub?)
Multimedia mining

MultiMediaMiner: (SIGMOD’98 demo)

Multimedia data cube and multi-dimension analysis

Mining multimedia associations (ICDE’00)
Time-series data mining

Partial periodicity mining (KDD’98, ICDE’99)

Inter-transaction association mining (TOIS’99, KDD’99)
42
Current On-Going Projects (2)
Web mining (WebMiner and MetaWeb)

Three categories of Web mining: structure, usage, and content.

Web mining language: WebML (WIDM’98)

Document classification:

Weblog mining (ADL’98)
Plan mining: mining plan databases

Plan mining by divide-and-conquer (DMKD’99)
Intelligent query answering

Intelligent query answering by data mining techniques (TKDE’96)
Book

Data Mining: Concepts and Techniques (Han & Kamber’00)
43
References: http://www.cs.sfu.ca/~han
J. Han. Towards on-line analytical mining in large databases. ACM-SIGMOD Record, 27:97-107, 1998.
J. Han, et al. DBMiner: A system for data mining in relational databases and data
warehouses. Cascon'97 and KDD'96.
J. Han and Y. Fu. Discovery of multiple-level association rules from large databases.
VLDB'95, Zurich, Switzerland, Sept. 1995.
J. Han, K. Koperski, and N. Stefanovic. GeoMiner: A system prototype for spatial data
mining. SIGMOD'97 (demo), Tucson, Arizona, May 1997.
J. Han, L. V. S. Lakshmanan, and R. T. Ng. Human-centered, multidimensional data mining
-- the constraints way. COMPUTER, 8, 1999.
K. Koperski and J. Han. Discovery of spatial association rules in geographic information
databases. SSD'95, Portland, Maine, Aug. 1995.
L. V. S. Lakshmanan, R. Ng, J. Han, and A. Pang. Optimization of constrained frequent set
queries with 2-variable constraints. SIGMOD'99, Philadelphia, PA, June 1999.
R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning
optimizations of constrained associations rules. SIGMOD'98, Seattle, Washington.
O. R. Zaiane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying
OLAP and data mining technology on Web logs. ADL'98, Santa Barbara, CA.
O. R. Zaiane, J. Han, et al. MultiMedia-Miner: A system prototype for multimedia data
mining. SIGMOD'98 (demo), Seattle, Washington, June 1998.
44
http://db.cs.sfu.ca/
Thank you !!!
45