Data Mining: Current Status and Research Directions
Jiawei Han
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca/~han
May 22, 2017
Data Mining: Status and Directions
1
Outline

Why is data mining hot?
Current status: Major technical progress
Is data mining flying high, or not?
How to fly data mining high? Research directions on data mining
Why Is Data Mining Hot?

Data mining (knowledge discovery in databases)

Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information (knowledge) or patterns from data in
large databases or other information repositories

Necessity is the mother of invention

Data is everywhere—data mining should be
everywhere, too!

Understand and use data—an imminent task!
Data, Data, Everywhere!!

Relational database—A commodity of every enterprise

Huge data warehouses are under construction

POS (Point of Sales): Transactional DBs in terabytes


Object-relational databases, distributed, heterogeneous,
and legacy databases
Spatial databases (GIS), remote sensing database (EOS),
and scientific/engineering databases

Time-series data (e.g., stock trading) and temporal data

Text (documents, emails) and multimedia databases

WWW: A huge, hyper-linked, dynamic, global information
system
Data Mining Is Everywhere, too!—A
Multi-Dimensional View of Data Mining

Databases to be mined

Relational, transactional, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.

Knowledge to be mined

Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.

Techniques utilized

Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning (AI), Visualization, Information Science, and Other Disciplines]
Data Mining—One Can Trace Back
to Early Civilization

Most scientific discoveries involve “data mining”

Kepler’s Law, Newton’s Laws, periodic table of chemical
elements, …, from “big bang” to DNA

Statistics: A discipline dedicated to data analysis

Then why data mining? What are the differences?

Huge amounts of data: gigabytes to terabytes

Fast computer—quick response, interactive analysis

Multi-dimensional, powerful, thorough analysis

High-level, “declarative”—user’s ease and control

Automated or semi-automated—mining functions
hidden or built-in in many systems
A Brief History of Data Mining Activities

1989 IJCAI Workshop on Knowledge Discovery in Databases
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
More conferences on data mining
  - PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.
Research Progress in the Last Decade











Multi-dimensional data analysis: Data warehouse and
OLAP (on-line analytical processing)
Association, correlation, and causality analysis
Classification: scalability and new approaches
Clustering and outlier analysis
Sequential patterns and time-series analysis
Similarity analysis: curves, trends, images, texts, etc.
Text mining, Web mining and Weblog analysis
Spatial, multimedia, scientific data analysis
Data preprocessing and database compression
Data visualization and visual data mining
Many others, e.g., collaborative filtering
Multi-Dimensional Data Analysis






Data warehousing: integration from heterogeneous or semi-structured databases
Multi-dimensional modeling of data: star & snowflake schemas
Efficient and scalable computation of data cubes or iceberg cubes
OLAP (on-line analytical processing): drilling, dicing, slicing, etc.
Discovery-driven exploration of data cubes
From OLAP to OLAM: A multi-dimensional view for on-line analytical mining
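
As a rough illustration of what computing a data cube or an iceberg cube means, the sketch below aggregates a tiny fact table over every subset of its dimensions (every cuboid) and keeps only the cells whose count reaches a threshold. The table, the dimension names, and `min_count` are made up for illustration, and this naive enumeration is not one of the efficient, scalable algorithms the slide refers to.

```python
from collections import Counter
from itertools import combinations

# Hypothetical fact table: one row per sale (region, product, quarter).
ROWS = [
    ("west", "tv", "Q1"),
    ("west", "tv", "Q2"),
    ("west", "radio", "Q1"),
    ("east", "tv", "Q1"),
    ("east", "tv", "Q1"),
]
DIMS = ("region", "product", "quarter")


def iceberg_cube(rows, dims, min_count):
    """Count rows per cell of every cuboid; keep cells with count >= min_count.

    A cell is a tuple of (dimension, value) pairs. The empty tuple is the
    apex cell, which aggregates over all dimensions.
    """
    cube = {}
    for k in range(len(dims) + 1):
        # Each subset of dimensions defines one cuboid (one GROUP BY).
        for group in combinations(range(len(dims)), k):
            counts = Counter(
                tuple((dims[i], row[i]) for i in group) for row in rows
            )
            for cell, count in counts.items():
                if count >= min_count:
                    cube[cell] = count
    return cube


cube = iceberg_cube(ROWS, DIMS, min_count=2)
for cell, count in sorted(cube.items()):
    print(dict(cell), count)
```

Setting `min_count=1` yields the full cube; raising it keeps only the "tip of the iceberg". Looking up a coarser cell such as `(("region", "west"),)` corresponds to a roll-up, and a finer cell to a drill-down.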
Association and Frequent Pattern Analysis

Efficient mining of frequent patterns and association rules:
  - Apriori and FP-growth algorithms
  - Multi-level, multi-dimensional, quantitative association mining

From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc.

Query and constraint-based association analysis
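
To make the Apriori idea concrete, here is a minimal sketch of level-wise frequent itemset mining: candidates of size k are built only from frequent itemsets of size k-1, and any candidate with an infrequent subset is pruned before counting. The basket data and the `min_support` count are invented for illustration; a production implementation would count candidates far more efficiently.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data per level: count containing transactions.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(items).items() if n >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical market-basket data.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(BASKETS, min_support=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

FP-growth reaches the same frequent itemsets without generating candidates; it is not shown here.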
Classification: Scalable Methods and
Handling of Complex Types of Data



Classification has been an essential theme in machine learning and statistics research
  - Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
  - Tree pruning, boosting, and bagging techniques
Efficient and scalable classification methods
  - Exploration of attribute-class pairs
  - SLIQ, SPRINT, RainForest, BOAT, etc.
Classification of semi-structured and non-structured data
  - Classification by clustering association rules (ARCS)
  - Association-based classification
  - Web document classification
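
One way to read "exploration of attribute-class pairs" is that a decision-tree split can be evaluated from counts of (attribute value, class label) pairs alone, without revisiting the raw records. The sketch below collects those counts in one scan and scores each attribute by the weighted Gini index. The records and attribute names are hypothetical, and this is only an illustration of the idea, not the actual SLIQ, SPRINT, RainForest, or BOAT algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (attribute dict, class label).
RECORDS = [
    ({"age": "young", "student": "no"}, "no"),
    ({"age": "young", "student": "yes"}, "yes"),
    ({"age": "middle", "student": "no"}, "yes"),
    ({"age": "senior", "student": "yes"}, "yes"),
    ({"age": "senior", "student": "no"}, "no"),
    ({"age": "middle", "student": "yes"}, "yes"),
]


def attribute_class_counts(records, attribute):
    """Count class labels per value of one attribute in a single scan."""
    counts = defaultdict(Counter)
    for attrs, label in records:
        counts[attrs[attribute]][label] += 1
    return counts


def gini(class_counts):
    """Gini impurity of one partition, given its class counts."""
    total = sum(class_counts.values())
    return 1.0 - sum((n / total) ** 2 for n in class_counts.values())


def split_gini(counts):
    """Weighted Gini impurity of a multiway split on the attribute."""
    total = sum(sum(c.values()) for c in counts.values())
    return sum(sum(c.values()) / total * gini(c) for c in counts.values())


scores = {
    a: split_gini(attribute_class_counts(RECORDS, a))
    for a in ("age", "student")
}
best = min(scores, key=scores.get)  # lower impurity is better
print(scores, best)
```

Because the count tables are usually much smaller than the data, they can fit in memory even when the training set does not, which is one reason this style of computation scales.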
Clustering and Outlier Analysis






Partitioning methods
  - k-means, k-medoids, CLARANS
Hierarchical methods: micro-clusters
  - BIRCH, CURE, Chameleon
Density-based methods:
  - DBSCAN, OPTICS, and DENCLUE
Grid-based methods
  - STING, CLIQUE, WaveCluster
Outlier analysis:
  - statistics-based, distance-based, deviation-based
Constraint-based clustering
  - COD (Clustering with Obstructed Distance)
  - User-specified constraints
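
For the distance-based flavor of outlier analysis mentioned above, a minimal sketch follows: a point is flagged when at least a given fraction of the other points lie farther away than a chosen radius. The points, `radius`, and `min_fraction` are illustrative assumptions, and the quadratic scan is the naive form rather than an indexed or cell-based method.

```python
import math


def distance_based_outliers(points, radius, min_fraction):
    """Flag points for which at least min_fraction of the *other* points
    lie farther away than radius (naive O(n^2) scan)."""
    outliers = []
    for i, p in enumerate(points):
        far = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) > radius
        )
        if far >= min_fraction * (len(points) - 1):
            outliers.append(p)
    return outliers


# Hypothetical 2-D data: a tight group plus one distant point.
POINTS = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (6.0, 6.0)]
outliers = distance_based_outliers(POINTS, radius=2.0, min_fraction=0.9)
print(outliers)
```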
Sequential Patterns and Time-Series Analysis




Trend analysis
  - Trend movement vs. cyclic variations, seasonal variations and random fluctuations
Similarity search in time-series database
  - Handling gaps, scaling, etc.
  - Indexing methods and query languages for time-series
Sequential pattern mining
  - Various kinds of sequences, various methods
  - From GSP to PrefixSpan
Periodicity analysis
  - Full periodicity, partial periodicity, cyclic association rules
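
As a small illustration of similarity search that tolerates differences in offset and amplitude scaling, the sketch below z-normalizes the query and each sliding window before comparing them by Euclidean distance. The series and query are invented, the linear scan uses no index, and gaps are not handled; it is a sketch of the idea only.

```python
import math


def z_normalize(series):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]


def best_match(series, query):
    """Return (start, distance) of the window most similar to query."""
    q = z_normalize(query)
    best = (None, math.inf)
    for start in range(len(series) - len(query) + 1):
        w = z_normalize(series[start:start + len(query)])
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, q)))
        if d < best[1]:
            best = (start, d)
    return best


# Hypothetical price series and a query shape at a different scale.
PRICES = [10, 10, 10, 20, 40, 20, 10, 10]
QUERY = [1, 2, 4, 2]
start, distance = best_match(PRICES, QUERY)
print(start, distance)
```

The window starting at index 2 is the query multiplied by 10, so normalization makes the two shapes coincide.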
Similarity Search: Similar Curves,
Trends, Images, and Texts




Various kinds of data, various similarity mining methods
Discovery of similar trends in time-series data
  - Data transformation & high-dimensional structures
Finding similar images based on color, texture, etc.
  - Content-based vs. keyword-based retrieval
  - Color histogram-based signature
  - Multi-feature composed signature
Finding documents with similar texts
  - Similar keywords (synonymy & polysemy)
  - Term frequency matrix
  - Latent semantic indexing
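
The term frequency matrix can be sketched in a few lines: each document becomes a row of word counts over a shared vocabulary, and two documents are compared by the cosine of the angle between their rows. The three documents are made up, and whitespace tokenization is a simplifying assumption; latent semantic indexing, which would further reduce this matrix with a singular value decomposition, is not shown.

```python
import math
from collections import Counter

# Hypothetical document collection.
DOCS = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data sets",
    "d3": "the weather is sunny today",
}


def term_frequency_matrix(docs):
    """Return (vocabulary, {doc name: row of term counts})."""
    vocab = sorted({w for text in docs.values() for w in text.split()})
    rows = {}
    for name, text in docs.items():
        counts = Counter(text.split())
        rows[name] = [counts[w] for w in vocab]
    return vocab, rows


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab, tf = term_frequency_matrix(DOCS)
sim_12 = cosine(tf["d1"], tf["d2"])
sim_13 = cosine(tf["d1"], tf["d3"])
print(round(sim_12, 3), sim_13)
```

Pure keyword overlap misses synonyms and conflates the senses of polysemous words, which is the limitation that latent semantic indexing is meant to address.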
Spatial, Multimedia, Scientific
Data Analysis
Multi-dimensional analysis of spatial, multimedia and scientific data
  - Geo-spatial data cube and spatial OLAP
  - The curse of dimensionality problem
Association analysis
  - A progressive refinement methodology
  - Micro-clustering can be used for preprocessing in the analysis of complex types of data
Classification
  - Association-based for handling high-dimensionality and sparse data

Data Mining Industry and Applications
From research prototypes to data mining products, languages, and standards
  - IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.
  - A few data mining languages and standards (esp. MS OLEDB for Data Mining).
Application achievements in many domains
  - Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.

Is Data Mining Flying? Or Not??


Data mining is flying
  - R & D have been striding forward greatly
  - Applications have been broadened substantially
But not as high as some may have hoped. Why not?
  - Hope to see billions of $’s within years?
    - A young and coming technology, not hype!
  - Not bread-and-butter but value-added service
    - DBMS, WWW, and other information systems will still be a “data mining” aircraft-carrier
  - Not off-the-shelf in nature
    - Need training, understanding, and customizing (re-develop.)
  - Young technology: needs much R&D to fly high
    - Much research, development, and real problem solving!
How to Fly Data Mining High?—
Research Directions


Web mining
Towards integrated data mining environments and tools
“Vertical” (or application-specific) data mining
Invisible data mining
Towards intelligent, efficient, and scalable data mining methods
Web Mining: A Fast Expanding
Frontier in Data Mining

Mine what Web search engine finds

Automatic classification of Web documents

Discovery of authoritative Web pages, Web
structures and Web communities

Meta-Web Warehousing: Web yellow page
service

Web usage mining
Mine What Web Search Engine Finds

Current Web search engines: A convenient source for mining
  - keyword-based; return too many, often low-quality answers; still miss a lot; not customized; etc.
Data mining will help:
  - coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies
  - better search primitives: user preferences/hints
  - linkage analysis: authoritative pages and clusters
  - Web-based languages: XML + WebSQL + WebML
  - customization: home page + Weblog + user profiles
Discovery of Authoritative Pages in WWW



PageRank method (Brin and Page, 1998):
  - Rank the "importance" of Web pages, based on a model of a "random browser."
Hub/authority method (Kleinberg, 1998):
  - Prominent authorities often do not endorse one another directly on the Web.
  - Hub pages have a large number of links to many relevant authorities.
  - Thus hubs and authorities exhibit a mutually reinforcing relationship: a good hub links to many good authorities, and a good authority is linked to by many good hubs.
Both the PageRank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
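
The mutually reinforcing relationship between hubs and authorities can be sketched as a simple iteration: a page's authority score is the sum of the hub scores of pages linking to it, its hub score is the sum of the authority scores of pages it links to, and both vectors are normalized each round. The link graph, page names, and iteration count below are illustrative assumptions; this is a minimal sketch, not a reproduction of any particular search system.

```python
import math


def hits(links, iterations=50):
    """Iterate hub/authority updates; links maps a page to pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: add up authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph: three hub pages pointing at three authorities.
LINKS = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB", "authC"],
    "hub3": ["authA"],
}
hub, auth = hits(LINKS)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
print(top_authority, top_hub)
```

Note that the authorities here never link to one another, yet they are still identified through the hubs that point to them.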
Automatic Classification of Web
Documents

Web document classification:
  - Good human classification: Yahoo!, CS term hierarchies
  - These classifications can be used as training sets to build up a learning model
Keyword-based classification is different from multi-dimensional classification
  - Association or clustering-based classification is often more effective
  - Multi-level classification is important
A Multiple Layered Meta-Web Architecture
[Figure: stacked layers. Layer n: more generalized descriptions; ...; Layer 1: generalized descriptions; Layer 0]
Web Yellow Page Service: A Multi-Layer, Meta-Web Approach






XML: facilitates structured and meta-information extraction
Automatic classification of Web documents:
  - based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)
Automatic ranking of important Web pages
  - authoritative site recognition and clustering Web pages
Generalization-based multi-layer meta-Web construction
  - With the assistance of clustering and classification analysis
Meta-Web can be warehoused and incrementally updated
Querying and mining can be performed on or assisted by meta-Web
Importance of Constructing
Multi-Layer Meta Web


Benefits of Multi-Layer Meta-Web:
  - Multi-dimensional Web info summary analysis
  - Approximate and intelligent query answering
  - Web high-level query answering (WebSQL, WebML)
  - Web content and structure mining
  - Observing the dynamics/evolution of the Web
Is it realistic to construct such a meta-Web?
  - It benefits even if it is partially constructed
  - The benefit may justify the cost of tool development, standardization, and partial restructuring
Web Usage (Click-Stream) Mining

Weblog provides rich information about Web dynamics
Multidimensional Weblog analysis:
  - disclose potential customers, users, markets, etc.
Plan mining (mining general Web accessing regularities):
  - Web linkage adjustment, performance improvements
Web accessing association/sequential pattern analysis:
  - Web caching, prefetching, swapping
Trend analysis:
  - Dynamics of the Web: what has been changing?
Customized to individual users
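
As a toy illustration of how Weblog access patterns could inform prefetching, the sketch below splits a click stream into sessions and counts which page most often follows each page. The log records, the 30-minute session gap, and the page names are all hypothetical; real Weblog preprocessing (user identification, robot filtering, and so on) is considerably more involved.

```python
from collections import Counter, defaultdict

# Hypothetical Weblog entries: (user, timestamp in seconds, page).
LOG = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 70, "/cart"),
    ("u2", 10, "/home"), ("u2", 50, "/products"),
    ("u1", 4000, "/home"), ("u1", 4020, "/products"), ("u1", 4100, "/cart"),
    ("u3", 20, "/home"), ("u3", 25, "/about"),
]
SESSION_GAP = 1800  # assumed: a pause longer than 30 minutes starts a new session


def sessionize(log, gap):
    """Group each user's page views into sessions separated by long pauses."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(log):
        by_user[user].append((ts, page))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def next_page_counts(sessions):
    """Count, for every page, how often each other page directly follows it."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            transitions[a][b] += 1
    return transitions


sessions = sessionize(LOG, SESSION_GAP)
transitions = next_page_counts(sessions)
prefetch_after_home = transitions["/home"].most_common(1)[0][0]
print(prefetch_after_home)
```

Keeping separate transition tables per user would be one simple way to approach the customization to individual users mentioned above.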
Towards Integrated Data Mining
Environments and Tools



OLAP Mining: Integration of Data Warehousing and Data Mining
Querying and Mining: An Integrated Information Analysis Environment
Basic Mining Operations and Mining Query Optimization

“Vertical” (or application-specific) data mining

Invisible data mining
OLAP Mining: An Integration of Data
Mining and Data Warehousing

Data mining systems, DBMS, Data warehouse systems coupling
  - No coupling, loose-coupling, semi-tight-coupling, tight-coupling
On-line analytical mining data
  - integration of mining and OLAP technologies
Interactive mining multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
  - Characterized classification, first clustering and then association
An OLAM Architecture
[Figure: four-layer OLAM architecture. Layer 4, User Interface: User GUI API (mining query in, mining result out). Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine over the Data Cube API. Layer 2, MDDB: MDDB and Meta Data, with Filtering & Integration over the Database API. Layer 1, Data Repository: Databases and Data Warehouse, with data cleaning, data integration, and filtering.]
Querying and Mining: An Integrated
Information Analysis Environment

Data mining as a component of DBMS, data warehouse, or Web information system
  - Integrated information processing environment
    - MS/SQLServer-2000 (Analysis service)
    - IBM IntelligentMiner on DB2
    - SAS EnterpriseMiner: data warehousing + mining
Query-based mining
  - Querying database/DW/Web knowledge
  - Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.
Basic Mining Operations and Mining
Query Optimization

Relational databases: There is a set of basic relational operations and a standard query language, SQL
  - E.g., selection, projection, join, set difference, intersection, Cartesian product, etc.
Is there a set of standard data mining operations on which optimizations can be done?
  - Difficulty: different definitions of operations
  - Importance: optimization can be performed on them systematically; standardization facilitates information exchange and system interoperability
“Vertical” Data Mining

Generic data mining tools? Too simple to match domain-specific, sophisticated applications
  - Expert knowledge and business logic represent many years of work in their own fields!
  - Data mining + business logic + domain experts
A multi-dimensional view of data miners
  - Complexity of data: Web, sequence, spatial, multimedia, …
  - Complexity of domains: DNA, astronomy, market, telecom, …
Domain-specific data mining tools
  - Provide concrete, killer solution to specific problems
  - Feedback to build more powerful tools
Invisible Data Mining

Build mining functions into daily information services
  - Web search engine (link analysis, authoritative pages, user profiles), adaptive Web sites, etc.
  - Improvement of query processing: history + data
  - Making service smart and efficient
Benefits from/to data mining research
  - Data mining research has produced many scalable, efficient, novel mining solutions
  - Applications feed new challenge problems to research
Towards Intelligent Tools for Data Mining

Integration paves the way to intelligent mining
Smart interface brings intelligence
  - Easy to use, understand and manipulate
One picture may be worth 1,000 words
  - Visual and audio data mining
Human-Centered Data Mining
Towards self-tuning, self-managing, self-triggering data mining
Integrated Mining: A Booster for
Intelligent Mining

Integration paves the way to intelligent mining
Data mining integrates with DBMS, DW, WebDB, etc.
  - Integration inherits the power of up-to-date information technology: querying, MD analysis, similarity search, etc.
  - Mining can be viewed as querying database knowledge
Integration leads to standard interface/language, function/process standardization, utility, and reachability
Efficiency and scalability bring intelligent mining to reality
One Picture May Be Worth 1000 Words!

Visual Data Mining
  - Visualization of data
  - Visualization of data mining results
  - Visualization of data mining processes
  - Interactive data mining: visual classification
One melody may be worth 1000 words too!
  - Audio data mining: turn data into music and melody!
  - Uses audio signals to indicate the patterns of data or the features of data mining results
Visualization of data mining results in SAS Enterprise Miner: scatter plots
Visualization of association rules in MineSet 3.0
Visualization of a decision tree in MineSet 3.0
Visualization of Data Mining Processes by Clementine
Interactive Visual Mining by Perception-Based Classification (PBC)
Human-Centered Data Mining





Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many but uninteresting
Data mining should be an interactive process
  - User directs what is to be mined
Users must be provided with a set of primitives for communicating with the data mining system, using a data mining query language
User should provide constraints on what is to be mined
System should use such constraints to guide the mining process (constraint-based mining or mining query optimization)
Constraint-Based Mining

What kinds of constraints can be used in mining?
  - Knowledge type constraint: classification, association, etc.
  - Data constraint: SQL-like queries
    - Find products sold together in Vancouver in Feb. ’01.
  - Dimension/level constraints:
    - in relevance to region, price, brand, customer category.
  - Rule constraints:
    - small sales (price < $10) triggers big sales (sum > $200).
  - Interestingness constraints:
    - E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0).
Rule Constraints: A Classification
Succinctness
Anti-monotonicity
Monotonicity
Convertible constraints
Inconvertible constraints
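
A small sketch may help distinguish some of these classes. A constraint is anti-monotone if every superset of a violating itemset also violates it (so the search can be pruned early), and monotone if every superset of a satisfying itemset also satisfies it. The code below checks both properties by brute force over a tiny, hypothetical price table; such a check only confirms the property on this data and is not a proof.

```python
from itertools import combinations

# Hypothetical item prices (non-negative).
PRICE = {"pen": 2, "ink": 5, "paper": 8, "stapler": 12}


def all_itemsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            yield frozenset(combo)


def total(itemset):
    return sum(PRICE[i] for i in itemset)


def sum_at_most_10(itemset):
    return total(itemset) <= 10


def sum_at_least_10(itemset):
    return total(itemset) >= 10


def avg_at_least_6(itemset):
    return total(itemset) / len(itemset) >= 6


def is_anti_monotone(constraint, items):
    """True if no proper superset of a violating itemset satisfies it."""
    sets = list(all_itemsets(items))
    return all(
        not constraint(t)
        for s in sets if not constraint(s)
        for t in sets if s < t
    )


def is_monotone(constraint, items):
    """True if every proper superset of a satisfying itemset satisfies it."""
    sets = list(all_itemsets(items))
    return all(
        constraint(t)
        for s in sets if constraint(s)
        for t in sets if s < t
    )


for c in (sum_at_most_10, sum_at_least_10, avg_at_least_6):
    print(c.__name__, is_anti_monotone(c, PRICE), is_monotone(c, PRICE))
```

The average-price constraint is neither anti-monotone nor monotone here; constraints of that kind are typically handled as convertible constraints, by processing items in a suitable order (for example, by descending price).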
Constraint-Based Clustering Analysis

User-specified constraints: no cluster has fewer than 1000 gold customers

Resource allocation (clustering) with obstacles
Towards Automated Data Mining?




It is not realistic to automatically find all the knowledge in a large database
Thus we promote human-centered, constraint-based mining
However, to achieve genuinely intelligent data mining, the data mining process should be self-tuning, self-managing, and self-triggering
Functions should be developed to achieve such performance
Conclusions

Data mining: a promising research frontier
Data mining research has been striding forward greatly in the last decade
However, data mining, as an industry, has not been flying as high as expected
Much research and application exploration are needed
  - Web mining
  - Towards integrated data mining environments and tools
  - Towards intelligent, efficient, and scalable data mining methods
http://www.cs.sfu.ca/~han
http://db.cs.sfu.ca
Thank you !!!