Data Mining: Current Status
and Research Directions
Jiawei Han
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
Data Mining: Status and Directions
Why is data mining hot?
Current status: Major technical progress
Is data mining flying high, or not?
How to fly data mining high?—
Research directions on data mining
Why Is Data Mining Hot?
Data mining (knowledge discovery in databases)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information (knowledge) or patterns from data in
large databases or other information repositories
Necessity is the mother of invention
Data is everywhere—data mining should be
everywhere, too!
Understand and use data—an imminent task!
Data, Data, Everywhere!!
Relational database—A commodity of every enterprise
Huge data warehouses are under construction
POS (Point of Sales): Transactional DBs in terabytes
Object-relational databases, distributed, heterogeneous,
and legacy databases
Spatial databases (GIS), remote sensing database (EOS),
and scientific/engineering databases
Time-series data (e.g., stock trading) and temporal data
Text (documents, emails) and multimedia databases
WWW: A huge, hyper-linked, dynamic, global information
Data Mining Is Everywhere, too!—A
Multi-Dimensional View of Data Mining
Databases to be mined
Relational, transactional, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
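Data Mining: Confluence of Multiple Disciplines
[Figure: data mining shown at the meeting point of several contributing fields; surviving labels: "Learning (AI)", "Data Mining"]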
Data Mining—One Can Trace Back
to Early Civilization
Most scientific discoveries involve “data mining”
Kepler’s Law, Newton’s Laws, periodic table of chemical
elements, …, from “big bang” to DNA
Statistics: A discipline dedicated to data analysis
Then why data mining? What are the differences?
Huge amounts of data, from gigabytes to terabytes
Fast computers: quick response, interactive analysis
Multi-dimensional, powerful, thorough analysis
High-level, “declarative”: user’s ease and control
Automated or semi-automated: mining functions
hidden or built into many systems
A Brief History of Data Mining Activities
1989 IJCAI Workshop on Knowledge Discovery in Databases
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD
More conferences on data mining
PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.
Research Progress in the Last Decade
Multi-dimensional data analysis: Data warehouse and
OLAP (on-line analytical processing)
Association, correlation, and causality analysis
Classification: scalability and new approaches
Clustering and outlier analysis
Sequential patterns and time-series analysis
Similarity analysis: curves, trends, images, texts, etc.
Text mining, Web mining and Weblog analysis
Spatial, multimedia, scientific data analysis
Data preprocessing and database compression
Data visualization and visual data mining
Many others, e.g., collaborative filtering
Multi-Dimensional Data Analysis
Data warehousing: integration from
heterogeneous or semi-structured databases
Multi-dimensional modeling of data: star &
snowflake schemas
Efficient and scalable computation of data cubes
or iceberg cubes
OLAP (on-line analytical processing): drilling,
dicing, slicing, etc.
Discovery-driven exploration of data cubes
From OLAP to OLAM: A multi-dimensional view
for on-line analytical mining
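To make the OLAP operations listed above concrete, here is a minimal Python sketch of roll-up, slice, and dice over a tiny in-memory fact table. The table contents, the dimension names (`region`, `product`, `quarter`), and the function names are illustrative assumptions, not the API of any data warehouse product; a real system would precompute and index the cube rather than scan rows.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter, sales).
FACTS = [
    ("West", "TV", "Q1", 100),
    ("West", "TV", "Q2", 120),
    ("West", "PC", "Q1", 200),
    ("East", "TV", "Q1", 80),
    ("East", "PC", "Q2", 150),
]
DIMS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Aggregate sales over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    cube = defaultdict(int)
    for row in facts:
        cube[tuple(row[i] for i in idx)] += row[-1]
    return dict(cube)


def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


def dice(facts, **allowed):
    """Dice: keep rows whose value in each named dimension is in the given set."""
    return [
        row for row in facts
        if all(row[DIMS.index(d)] in values for d, values in allowed.items())
    ]


by_region = roll_up(FACTS, ["region"])
q1_by_product = roll_up(slice_(FACTS, "quarter", "Q1"), ["product"])
west_tv = dice(FACTS, region={"West"}, product={"TV"})
```

Drilling down is the reverse of `roll_up`: call it again with more dimensions in `keep`.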
Association and Frequent Pattern Analysis
Efficient mining of frequent patterns and
association rules:
  - Apriori and FP-growth algorithms
  - Multi-level, multi-dimensional, quantitative association mining
From association to correlation, sequential
patterns, partial periodicity, cyclic rules, ratio
rules, etc.
Query and constraint-based association analysis
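As an illustration of level-wise frequent pattern mining, the following is a small Python sketch in the spirit of Apriori: count candidates, keep the frequent ones, then join and prune to form the next level. It is a teaching sketch, not an optimized implementation; the basket data is made up, and `min_support` is assumed here to be an absolute transaction count rather than a percentage.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
```

FP-growth reaches the same frequent itemsets without generating candidates, by compressing the transactions into a prefix tree; that variant is not shown here.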
Classification: Scalable Methods and
Handling of Complex Types of Data
Classification has been an essential theme in machine
learning and statistics research
  - Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
  - Tree pruning, boosting, and bagging techniques
Efficient and scalable classification methods
  - Exploration of attribute-class pairs
  - SLIQ, SPRINT, RainForest, BOAT, etc.
Classification of semi-structured and non-structured data
  - Classification by clustering association rules (ARCS)
  - Association-based classification
  - Web document classification
Clustering and Outlier Analysis
Partitioning methods
  - k-means, k-medoids, CLARANS
Hierarchical methods: micro-clusters
  - BIRCH, CURE, Chameleon
Density-based methods:
Grid-based methods
  - STING, CLIQUE, WaveCluster
Outlier analysis:
  - Statistics-based, distance-based, deviation-based
Constraint-based clustering
  - COD (Clustering with Obstructed Distance)
  - User-specified constraints
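Of the partitioning methods named above, k-means is the simplest to sketch. The Python below is a minimal version of the usual assign-then-update loop on 2-D points, assuming squared Euclidean distance and caller-supplied initial centroids so the run is deterministic. The sample points and the function name are illustrative only.

```python
def kmeans(points, centroids, max_iterations=100):
    """Assign points to the nearest centroid, then move each centroid to its cluster mean."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: recompute each centroid; an empty cluster keeps its old centroid.
        updated = [
            (
                sum(x for x, _ in members) / len(members),
                sum(y for _, y in members) / len(members),
            ) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters


# Hypothetical data: two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

k-medoids follows the same loop but restricts each cluster center to an actual data point, which makes it less sensitive to outliers.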
Sequential Patterns and Time-Series Analysis
Trend analysis
  - Trend movement vs. cyclic variations, seasonal variations and random fluctuations
Similarity search in time-series database
  - Handling gaps, scaling, etc.
  - Indexing methods and query languages for time-series
Sequential pattern mining
  - Various kinds of sequences, various methods
  - From GSP to PrefixSpan
Periodicity analysis
  - Full periodicity, partial periodicity, cyclic association
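One common way to separate trend movement from short-term fluctuations is a moving average. The Python sketch below assumes a plain (unweighted) trailing window; the series values and the window length of 4 are made up for illustration.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values to smooth out fluctuations."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical series with an upward trend plus noise.
series = [10, 12, 9, 13, 11, 15, 12, 16]
trend = moving_average(series, 4)
```

Choosing the window to match a known cycle length (for example, 4 for quarterly data) also averages out seasonal variation.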
Similarity Search: Similar Curves,
Trends, Images, and Texts
Various kinds of data, various similarity mining methods
Discovery of similar trends in time-series data
  - Data transformation & high-dimensional structures
Finding similar images based on color, texture, etc.
  - Content-based vs. keyword-based retrieval
  - Color histogram-based signature
  - Multi-feature composed signature
Finding documents with similar texts
  - Similar keywords (synonymy & polysemy)
  - Term frequency matrix
  - Latent semantic indexing
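A term frequency matrix supports a simple document similarity measure. The Python sketch below assumes whitespace tokenization and cosine similarity between raw term-count vectors; both are simplifying assumptions, and the three documents are invented. It does not handle synonymy or polysemy, which is the problem latent semantic indexing targets.

```python
import math
from collections import Counter


def term_frequencies(text):
    """One row of the term frequency matrix: term -> count for a single document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data sets",
    "d3": "the stock market closed higher today",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine(tf["d1"], tf["d2"])  # shares "data", "mining", "patterns"
sim_13 = cosine(tf["d1"], tf["d3"])  # shares no terms
```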
Spatial, Multimedia, Scientific
Data Analysis
Multi-dimensional analysis of spatial, multimedia and
scientific data
  - Geo-spatial data cube and spatial OLAP
  - The curse of dimensionality problem
  - Association analysis
  - A progressive refinement methodology
  - Micro-clustering can be used for preprocessing in the analysis of complex types of data
  - Classification
  - Association-based for handling high-dimensionality and sparse data
Data Mining Industry and Applications
From research prototypes to data mining
products, languages, and standards
  - IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.
  - A few data mining languages and standards (esp. MS OLEDB for Data Mining)
  - Application achievements in many domains
  - Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.
Is Data Mining Flying? Or Not??
Data mining is flying
  - R&D has been striding forward greatly
  - Applications have been broadened substantially
But not as high as some may have hoped. Why not?
  - Hope to see billions of dollars within years?
Not bread-and-butter, but a value-added service
DBMS, WWW, and other information systems will still be a
“data mining” aircraft carrier
Not off-the-shelf in nature
A young, up-and-coming technology, not hype!
Needs training, understanding, and customization (redevelopment)
Young technology: needs much R&D to fly high
Much research, development, and real problem solving!
How to Fly Data Mining High?—
Research Directions
Web mining
Towards integrated data mining environments
and tools
“Vertical” (or application-specific) data mining
Invisible data mining
Towards intelligent, efficient, and scalable data
mining methods
Web Mining: A Fast Expanding
Frontier in Data Mining
Mine what Web search engine finds
Automatic classification of Web documents
Discovery of authoritative Web pages, Web
structures and Web communities
Meta-Web warehousing: Web yellow page service
Web usage mining
Mine What Web Search Engine Finds
Current Web search engines: a convenient source for mining, but
keyword-based; they return too many, often low-quality
answers, still miss a lot, are not customized, etc.
Data mining will help:
coverage: “Enlarge and then shrink,” using synonyms
and conceptual hierarchies
better search primitives: user preferences/hints
linkage analysis: authoritative pages and clusters
Web-based languages: XML + WebSQL + WebML
customization: home page + Weblog + user profiles
Discovery of Authoritative Pages in WWW
Page-rank method (Brin and Page, 1998):
  - Rank the "importance" of Web pages, based on a model of a "random browser."
Hub/authority method (Kleinberg, 1998):
  - Prominent authorities often do not endorse one another directly on the Web.
  - Hub pages have a large number of links to many relevant authorities.
  - Thus hubs and authorities exhibit a mutually reinforcing relationship.
Both the page-rank and hub/authority methodologies have
been shown to provide qualitatively good search results for
broad query topics on the WWW.
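Both link-analysis ideas can be sketched as short power iterations. The Python below is a simplified illustration, not the published algorithms in full: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the four-page link graph are all assumptions made for the example.

```python
import math


def _pages(links):
    """All pages that appear as a source or a target of a link."""
    return sorted(set(links) | {q for targets in links.values() for q in targets})


def pagerank(links, damping=0.85, iterations=50):
    """Random-browser model: follow a link with probability `damping`, else jump anywhere."""
    pages = _pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, ())
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


def hits(links, iterations=50):
    """Hub/authority scores that reinforce each other across iterations."""
    pages = _pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: two hub pages point at pages "a" and "b"; "b" also cites "a".
links = {"hub1": ["a", "b"], "hub2": ["a", "b"], "b": ["a"]}
rank = pagerank(links)
hub, auth = hits(links)
```

In this toy graph, page "a" comes out on top under both methods, while the two hub pages score highest as hubs even though nothing links to them.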
Automatic Classification of Web Documents
Web document classification:
Good human classification: Yahoo!, CS term
These classifications can be used as training sets to
build learning models
Keyword-based classification is different from multi-dimensional classification
Association or clustering-based classification is often
more effective
Multi-level classification is important
A Multiple Layered Meta-Web Architecture
[Figure: layered meta-Web architecture; surviving labels: "More Generalized Descriptions", "Generalized Descriptions"]
May 22, 2017
Web Yellow Page Service: A Multi-Layer, Meta-Web Approach
XML: facilitates structured and meta-information extraction
Automatic classification of Web documents:
  - based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)
Automatic ranking of important Web pages
  - authoritative site recognition and clustering Web pages
Generalization-based multi-layer meta-Web construction
  - With the assistance of clustering and classification
Meta-Web can be warehoused and incrementally updated
Querying and mining can be performed on or assisted by
Importance of Constructing
Multi-Layer Meta Web
Benefits of Multi-Layer Meta-Web:
Multi-dimensional Web info summary analysis
Approximate and intelligent query answering
Web high-level query answering (WebSQL, WebML)
Web content and structure mining
Observing the dynamics/evolution of the Web
Is it realistic to construct such a meta-Web?
It benefits even if it is partially constructed
The benefit may justify the cost of tool development,
standardization, and partial restructuring
Web Usage (Click-Stream) Mining
Weblog provides rich information about Web dynamics
Multidimensional Weblog analysis:
Plan mining (mining general Web accessing regularities):
Web caching, prefetching, swapping
Trend analysis:
Web linkage adjustment, performance improvements
Web accessing association/sequential pattern analysis:
disclose potential customers, users, markets, etc.
Dynamics of the Web: what has been changing?
Customized to individual users
Towards Integrated Data Mining
Environments and Tools
OLAP Mining: Integration of Data Warehousing
and Data Mining
Querying and Mining: An Integrated
Information Analysis Environment
Basic Mining Operations and Mining Query Optimization
“Vertical” (or application-specific) data mining
Invisible data mining
OLAP Mining: An Integration of Data
Mining and Data Warehousing
Data mining systems, DBMS, Data warehouse systems
On-line analytical mining data
No coupling, loose-coupling, semi-tight-coupling, tight-coupling
integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
Necessity of mining knowledge and patterns at different levels of
abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
Characterized classification, first clustering and then association
An OLAM Architecture
[Figure: OLAM architecture; surviving labels: Mining query, Mining result, User Interface, Data Cube API, Meta Data, Database API, Data cleaning, Data integration, Warehouse]
Querying and Mining: An Integrated
Information Analysis Environment
Data mining as a component of DBMS, data warehouse, or
Web information system
Integrated information processing environment
MS/SQLServer-2000 (Analysis service)
IBM IntelligentMiner on DB2
SAS EnterpriseMiner: data warehousing + mining
Query-based mining
Querying database/DW/Web knowledge
Efficiency and flexibility: preprocessing, on-line
processing, optimization, integration, etc.
Basic Mining Operations and Mining
Query Optimization
Relational databases: There is a set of basic relational
operations and a standard query language, SQL
E.g., selection, projection, join, set difference,
intersection, Cartesian product, etc.
Is there a set of standard data mining operations on
which optimizations can be done?
Difficulty: different definitions of operations
Importance: optimization can be performed on them
systematically, standardization to facilitate information
exchange and system interoperability
“Vertical” Data Mining
Generic data mining tools? Too simple to match domain-specific, sophisticated applications
Expert knowledge and business logic represent many years of
work in their own fields!
Data mining + business logic + domain experts
A multi-dimensional view of data miners
Complexity of data: Web, sequence, spatial, multimedia, …
Complexity of domains: DNA, astronomy, market, telecom, …
Domain-specific data mining tools
Provide concrete, killer solutions to specific problems
Feedback to build more powerful tools
Invisible Data Mining
Build mining functions into daily information services
Web search engine (link analysis, authoritative pages,
user profiles)—adaptive web sites, etc.
Improvement of query processing: history + data
Making service smart and efficient
Benefits from/to data mining research
Data mining research has produced many scalable,
efficient, novel mining solutions
Applications feed new challenge problems to research
Towards Intelligent Tools for Data Mining
Integration paves the way to intelligent mining
Smart interface brings intelligence
One picture may be worth 1,000 words
Easy to use, understand and manipulate
Visual and audio data mining
Human-Centered Data Mining
Towards self-tuning, self-managing, self-triggering data mining
Integrated Mining: A Booster for
Intelligent Mining
Integration paves the way to intelligent mining
Data mining integrates with DBMS, DW, WebDB, etc
Integration inherits the power of up-to-date information
technology: querying, MD analysis, similarity search, etc.
Mining can be viewed as querying database knowledge
Integration leads to standard interface/language,
function/process standardization, utility, and reachability
Efficiency and scalability bring intelligent mining to reality
One Picture May Be Worth 1,000 Words!
Visual Data Mining
Visualization of data
Visualization of data mining results
Visualization of data mining processes
Interactive data mining: visual classification
One melody may be worth 1,000 words too!
Audio data mining: turn data into music and melody!
Uses audio signals to indicate the patterns of data or
the features of data mining results
Visualization of data mining results in SAS
Enterprise Miner: scatter plots
Visualization of association rules in
MineSet 3.0
Visualization of a decision tree in MineSet 3.0
Visualization of Data Mining
Processes by Clementine
Interactive Visual Mining by
Perception-Based Classification (PBC)
Human-Centered Data Mining
Finding all the patterns autonomously in a database? —
unrealistic because the patterns could be too many but
Data mining should be an interactive process
  - User directs what is to be mined
Users must be provided with a set of primitives to be used
to communicate with the data mining system — using a
data mining query language
Users should provide constraints on what is to be mined
System should use such constraints to guide the mining
process (constraint-based mining or mining query
Constraint-Based Mining
What kinds of constraints can be used in mining?
  - Knowledge type constraint: classification, association, etc.
  - Data constraint: SQL-like queries
  - Find products sold together in Vancouver in Feb. ’01.
  - Dimension/level constraints:
  - in relevance to region, price, brand, customer category.
  - Rule constraints:
  - small sales (price < $10) triggers big sales (sum >
  - Interestingness constraints:
  - E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0).
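The interestingness constraints above can be checked mechanically once support, confidence, and lift are computed for a rule. The Python sketch below uses the thresholds from the slide as defaults; the transactions, item names, and function names are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    # Lift compares confidence with the consequent's overall frequency.
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift


def is_strong(metrics, min_support=0.03, min_confidence=0.60, min_lift=3.0):
    """Apply the interestingness constraints: support and confidence floors, lift above threshold."""
    support, confidence, lift = metrics
    return support >= min_support and confidence >= min_confidence and lift > min_lift


# Hypothetical transactions.
transactions = (
    [{"bread", "milk"}] * 5
    + [{"bread", "milk", "pen", "ink"}] * 2
    + [{"bread"}, {"milk"}, {"eggs"}]
)
pen_ink = rule_metrics(transactions, {"pen"}, {"ink"})
bread_milk = rule_metrics(transactions, {"bread"}, {"milk"})
```

Here "bread -> milk" has high support and confidence but a lift close to 1, so the lift constraint filters it out, while the rarer "pen -> ink" passes all three.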
Rule Constraints: A Classification
Convertible constraints
Inconvertible constraints
Constraint-Based Clustering Analysis
User-specified constraints: no cluster has fewer than 1,000 gold customers
Resource allocation (clustering) with obstacles
Towards Automated Data Mining?
It is not realistic to automatically find all the knowledge
in a large database
Thus we promote human-centered, constraint-based mining
However, to achieve genuinely intelligent data mining, the
data mining process should be self-tuning, self-managing, and self-triggering
Functions should be developed to achieve such
Data mining—A promising research frontier
Data mining research has been striding forward greatly in
the last decade
However, data mining, as an industry, has not been flying
as high as expected
Much research and application exploration are needed
Web mining
Towards integrated data mining environments and tools
Towards intelligent, efficient, and scalable data mining methods
Thank you !!!
