Download Introduction to Data Mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Introduction to Data Mining
MIT-652 Data Mining Applications
Thimaporn Phetkaew
School of Informatics,
Walailak University
MIT-652: DM
1: Introduction to Data Mining
1
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
2
Motivation:
“Necessity is the Mother of
Invention”
„
Data explosion problem
„
Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
„
We are drowning in data, but starving for knowledge!
„
Solution: Data warehousing and data mining
„
Data warehousing and on-line analytical processing (OLAP)
„
Data Mining: extraction of interesting knowledge (rules,
regularities, patterns, constraints) from data in large databases
MIT-652: DM
1: Introduction to Data Mining
3
Evolution of Database Technology
„
„
„
„
„
1960s: Data collection, database creation, IMS and
network DBMS
1970s: Relational data model, relational DBMS
implementation
1980s: RDBMS, advanced data models (extendedrelational, OO, deductive, etc.) and application-oriented
DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining and data warehousing, multimedia
databases, and Web databases
2000s: Data mining and its applications, Web technology
(XML, data integration) and global information systems
MIT-652: DM
1: Introduction to Data Mining
4
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
5
What Is Data Mining?
„
Data mining (knowledge discovery in databases):
„
„
„
Alternative names:
„
„
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Watch out: is everything “data mining”?
„
„
MIT-652: DM
Simple search and query processing
Expert systems
1: Introduction to Data Mining
6
Potential Data Mining Applications
„
Database analysis and decision support
„
Market analysis and management
„
„
Risk analysis and management
„
„
„
Target marketing, customer relation management (CRM),
market basket analysis, cross selling, market segmentation
Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
„
Text mining (news group, email, documents) and Web mining
„
Bioinformatics and bio-data analysis
MIT-652: DM
1: Introduction to Data Mining
7
Data Mining: A KDD Process
Pattern Evaluation
„
Data mining: the core of
knowledge discovery
Data Mining
process
Data Transformation
Task-relevant Data
Data Warehouse
Data Selection
Data Cleaning
Data Integration
Databases
MIT-652: DM
1: Introduction to Data Mining
8
Steps of a KDD Process
„
„
„
„
Data cleaning: to remove noise and incomplete data
Data integration: where multiple data sources may be
combined
Data selection: where data relevant to the analysis task
are retrieved from the database/repository
Data transformation: where data are transformed into
forms appropriate for mining
MIT-652: DM
1: Introduction to Data Mining
9
Steps of a KDD Process
„
„
„
Data mining: where intelligent methods are applied in
order to extract data patterns
Pattern evaluation: to identify the truly interesting
patterns representing knowledge
Knowledge presentation: where visualization and
knowledge representation techniques are used to
present the mined knowledge to the user
MIT-652: DM
1: Introduction to Data Mining
10
Data Mining and Business Intelligence
Increasing potential
to support
business decisions
Decisions
Making
Data Presentation
Visualization Techniques
End User
Business
Analyst
Data Mining
Information Discovery
Data
Analyst
Data Exploration
Statistical Summary, Querying and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific Experiments, Database Systems
MIT-652: DM
1: Introduction to Data Mining
DBA
11
Architecture of a Typical Data
Mining System
Graphical User Interface
Pattern Evaluation
Data Mining Engine
Knowl
edgeBase
Database or Data
Warehouse Server
data cleaning, integration, and selection
Database
MIT-652: DM
Data
World-Wide Other Info
Repositories
Warehouse
Web
1: Introduction to Data Mining
12
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
13
Data Mining: On What Kind of Data?
„
„
„
„
Relational databases
Transactional databases
Data warehouses
Advanced DB and information repositories
„
„
„
„
„
„
„
MIT-652: DM
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases
Multimedia databases
Heterogeneous and legacy databases
WWW
1: Introduction to Data Mining
14
Relational Databases
„
„
„
„
„
A collection of tables consisting of a set of tuples
(records or rows)
Each tuple owns a set of attributes (columns or fields)
A tuple is identified by a unique key
SQL “Show me total sales of last year grouped by
branch”
Data mining can search for trends and data patterns
e.g. detect deviations such as
„
„
MIT-652: DM
Items whose sales are far from those expected compared to last
year;
Predict credit risk of new customers based on their income, age,
credit history
1: Introduction to Data Mining
15
Transactional Databases
„
„
„
Consist of a file where each record represents a
transaction (TransID + ItemList, such as items
purchased in a store)
Q: How many transactions include item #13?
Data mining: Market basket analysis: identify frequent
itemsets to enable maximizing sales
„
„
MIT-652: DM
Knowing printers are commonly purchased together with
computers, thus discounting an expensive model printers with a
purchase of selected computers (hoping selling more expensive
printers)
Shelf diapers, beers and potato chips nearby to increase sales
1: Introduction to Data Mining
16
Data Warehouses
„
„
„
„
A repository of information collected from multiple
sources, organized under a unified schema at a single
site in order to facilitate management decision making
Provide a multidimensional view of data and allows
precomputation and fast accessing of summarized data
Data warehouse systems are well suited for On-line
Analytical Processing (OLAP)
Data mining uses data warehouse tools to support data
analysis
„
MIT-652: DM
OLAP operations such as drill-down and roll-up
1: Introduction to Data Mining
17
Object-Oriented Databases
„
„
„
Based on OO programming paradigm
Data and code encapsulated into a single unit
Entity is considered as object associated with
„
„
„
„
A set of variables describe the object (attributes in E-R model)
A set of messages the object uses to communicate with other
object
A set of methods which return a value in response upon
receiving a message
Superclass/subclass with inheritance benefits information
sharing
MIT-652: DM
1: Introduction to Data Mining
18
Object-Relational Databases
„
„
Extend basic relational model by adding power to handle
complex data types, class hierarchies, and object
inheritance
Data mining in OO and OR systems share some
similarities with relational Data mining
„
MIT-652: DM
Techniques need to be developed for handling complex object
structure, class and subclass hierarchies, property inheritance,
and methods and procedures
1: Introduction to Data Mining
19
Spatial Databases
„
„
„
Include geographic (map) databases, medical image
databases and satellite image databases
Applications: forestry and ecology planning, vehicle
navigation, providing public service information
regarding location of telephone and electric cables,
pipes, and sewage systems
Data mining uncover patterns describing
„
„
MIT-652: DM
Characteristics of houses located near a park
Climate of mountainous areas located at various altitudes
1: Introduction to Data Mining
20
Time-series and Temporal Databases
„
„
„
Time-series database stores sequences of values that
change with time e.g. stock exchange
Temporal database stores relational data that include
time-related attributes
Data mining finds characteristics of object evolution or
the trend of changes for objects in database e.g.
„
„
MIT-652: DM
Aid in scheduling of bank tellers according to the volume of
customer traffic
Uncover trends that could help investment strategy planning
(when is the best time to purchase XXX stock?)
1: Introduction to Data Mining
21
Text Databases
„
„
Contain word descriptions for objects such as library
databases, articles, product specifications, and bug
reports
Data mining may uncover keyword or content
associations, clustering behavior of text objects e.g.
search engines cluster documents according to the
words contained
MIT-652: DM
1: Introduction to Data Mining
22
Multimedia Databases
„
„
„
Store text, image, audio,video data
Application: picture content-based retrieval, voice mail
system, speech-based user interfaces
Data mining needs storage and search techniques to be
integrated to support real-time retrieval of countinuous
media data
MIT-652: DM
1: Introduction to Data Mining
23
Heterogeneous and Legacy Databases
„
„
„
Heterogenous database consists of a set of
interconnected, autonomous component databases
Legacy database is a group of heterogeneous databases
that combined different kinds of data systems, such as
relational or OO databases, spreadsheets, or file systems
Data mining may provide an interesting solution to the
information exchange problem by transforming the given
data into higher, more generalized, conceptual levels
MIT-652: DM
1: Introduction to Data Mining
24
World Wide Web
„
„
A huge, widely distributed information repository made
available by internet
Data mining: Mining path traversal patterns: understand
user access patterns
„ Help providing efficient access between highly
correlated objects (improve system desigh)
„ Lead to better marketing decisions e.g. placing
advertisements
„ Providing better customer classification and behavior
analysis
MIT-652: DM
1: Introduction to Data Mining
25
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
26
Data Mining Functionalities
„
Concept description: Characterization and discrimination
„
„
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
Frequent patterns, association (correlation and causality)
„
Multi-dimensional vs. single-dimensional association
„
„
MIT-652: DM
age(X, “20..29”) ^ income(X, “20..29K”) -> buys(X, “PC”)
[support = 20%, confidence = 60%]
contains(X, “computer”) -> contains(X, “software”) [10%,
75%]
1: Introduction to Data Mining
27
Data Mining Functionalities
„
Classification and Prediction
„
„
„
Finding models (functions) that describe and distinguish classes
or concepts for future prediction
E.g., classify countries based on climate, or classify cars based
on gas mileage
„
Presentation: decision-tree, classification rule, neural network
„
Prediction: Predict some unknown or missing numerical values
Cluster analysis
„
„
MIT-652: DM
Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
Clustering based on the principle: maximizing the intra-class
similarity and minimizing the interclass similarity
1: Introduction to Data Mining
28
Data Mining Functionalities
„
Outlier analysis
„
Outlier: a data object that does not comply with the general
behavior of the data
„
It can be considered as noise or exception but is quite useful in
fraud detection, rare events analysis
„
Trend and evolution analysis
„
Trend and deviation: regression analysis
„
Sequential pattern mining: e.g., digital camera Æ large SD memory
„
Periodicity analysis
„
Similarity-based analysis
MIT-652: DM
1: Introduction to Data Mining
29
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
30
Are All the “Discovered” Patterns
Interesting?
„
A data mining system/query may generate thousands of
patterns, not all of them are interesting.
„
Suggested approach: Human-centered, query-based, focused
mining
„
Interestingness measures:
„
A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty,
potentially useful, novel, or validates some hypothesis that a
user seeks to confirm
MIT-652: DM
1: Introduction to Data Mining
31
Are All the “Discovered” Patterns
Interesting?
„
Objective vs. subjective interestingness measures:
„
Objective: based on statistics and structures of
patterns, e.g., support, confidence, etc.
„
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
MIT-652: DM
1: Introduction to Data Mining
32
Can We Find All and Only Interesting
Patterns?
„
„
Find all the interesting patterns: Completeness
„
Can a data mining system find all the interesting patterns?
„
Heuristic vs. exhaustive search
„
Association vs. classification vs. clustering
Search for only interesting patterns: Optimization
„
Can a data mining system find only the interesting patterns?
„
Approaches
„
„
MIT-652: DM
First general all the patterns and then filter out the
uninteresting ones
Generate only the interesting patterns—mining query
optimization
1: Introduction to Data Mining
33
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
34
Data Mining: Confluence of Multiple
Disciplines
Database
Technology
Machine
Learning
Information
Science
MIT-652: DM
Statistics
Data Mining
Algorithm
1: Introduction to Data Mining
Visualization
Other
Disciplines
35
Data Mining: Classification Schemes
„
General functionality
„
Descriptive data mining characterize general properties of data
in database e.g. association rules
„
Predictive data mining perform inference on current data in
order to make predictions
„
Different views, different classifications
„
Data view: kinds of databases to be mined
„
Knowledge view: kinds of knowledge to be discovered
„
Method view: kinds of techniques utilized
„
Application view: kinds of applications adapted
MIT-652: DM
1: Introduction to Data Mining
36
Data Mining: Classification Schemes
„
Different views, different classifications (1)
„ Databases to be mined
„ Relational, transactional, object-oriented, objectrelational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
„ Knowledge to be mined
„ Characterization, discrimination, association,
classification, clustering, trend, deviation and
outlier analysis, etc.
„ Multiple/integrated functions and mining at
multiple levels
MIT-652: DM
1: Introduction to Data Mining
37
Data Mining: Classification Schemes
„
Different views, different classifications (2)
„ Techniques utilized
„ Database-oriented, data warehouse (OLAP),
machine learning, statistics, visualization, neural
network, etc.
„ Applications adapted
„ Retail, telecommunication, banking, fraud
analysis, DNA mining, stock market analysis, Web
mining, Weblog analysis, etc.
MIT-652: DM
1: Introduction to Data Mining
38
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
39
What Defines a Data Mining Task ?
„
Task-relevant data
„
Type of knowledge to be mined
„
Background knowledge
„
Pattern interestingness measurements
„
Visualization of discovered patterns
MIT-652: DM
1: Introduction to Data Mining
40
Task-Relevant Data
„
Database or data warehouse name
„
Database tables or data cubes
„
Condition for data selection
„
Relevant attributes or dimensions
„
Data grouping criteria
MIT-652: DM
1: Introduction to Data Mining
41
Types of knowledge to be mined
„
Characterization
„
Discrimination
„
Association
„
Classification/prediction
„
Clustering
„
Outlier analysis
„
Other data mining tasks
MIT-652: DM
1: Introduction to Data Mining
42
Types of knowledge to be mined:
Metapatterns
„
Metapatterns: all discovered patterns must match
„
For example, metarule for association rules
„
P(X:customer,W) ^ Q(X,Y) ⇒ buys(X,Z)
„
age(X,”30..39”) ^ income(X,”40K..49K”) ⇒ buys(X,”VCR”)
[2.2%, 60%]
„
occupation(X,”student”) ^ age(X,”20..29”) ⇒
buys(X,”Computer”) [1.4%, 70%]
MIT-652: DM
1: Introduction to Data Mining
43
Background Knowledge: Concept
Hierarchies
„
„
„
„
Schema hierarchy
„ E.g., street < city < province_or_state < country
Set-grouping hierarchy
„ E.g., {20-39} = young, {40-59} = middle_aged, {6089} = senior
Operation-derived hierarchy
„ E.g., email address: [email protected]
login-name < department < university < country
Rule-based hierarchy
„ low_profit_margin (X) <= price(X, P1) and cost (X, P2)
and (P1 - P2) < $50
MIT-652: DM
1: Introduction to Data Mining
44
Measurements of Pattern
Interestingness
„
Simplicity
e.g., (association) rule length, (decision) tree size
„
Certainty
e.g., confidence, P(A|B) = n(A and B)/ n (B), classification
reliability or accuracy, certainty factor, rule strength, rule
quality, discriminating weight, etc.
„
Utility
potential usefulness, e.g., support (association), noise
threshold (description)
„
Novelty
not previously known, contribute new information, e.g.,
exception
MIT-652: DM
1: Introduction to Data Mining
45
Visualization of Discovered Patterns
„
Different backgrounds/usages may require different forms
of representation
„
„
Concept hierarchy is also important
„
„
„
E.g., rules, tables, crosstabs, pie/bar chart etc.
Discovered knowledge might be more understandable when
represented at high level of abstraction
Interactive drill up/down, pivoting, slicing and dicing provide
different perspective to data
Different kinds of knowledge require different
representation: association, classification, clustering, etc.
MIT-652: DM
1: Introduction to Data Mining
46
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
47
Major Issues in Data Mining
„
Mining methodology and user interaction
„
„
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of
abstraction
„
Incorporation of background knowledge
„
Data mining query languages and ad-hoc data mining
„
Expression and visualization of data mining results
„
Handling noise and incomplete data
„
Pattern evaluation: the interestingness problem
MIT-652: DM
1: Introduction to Data Mining
48
Major Issues in Data Mining
„
„
Issues relating to the diversity of data types
„ Handling relational and complex types of data
„ Mining information from heterogeneous databases and
global information systems (WWW)
Performance and scalability
„
Efficiency and scalability of data mining algorithms
„
Parallel, distributed and incremental mining methods
MIT-652: DM
1: Introduction to Data Mining
49
Introduction
„
Motivation: Why data mining?
„
What is data mining?
„
Data Mining: On what kind of data?
„
Data mining functionality
„
Are all the patterns interesting?
„
Classification of data mining systems
„
Data mining task primitives
„
Major issues in data mining
„
Summary
MIT-652: DM
1: Introduction to Data Mining
50
Summary
„
„
„
„
Database technology: evolved from primitive file processing
to the development of database management systems
Data mining: discovering interesting patterns from large
amounts of data in a variety of information repositories
A KDD process: includes data cleaning, data integration,
data selection, data transformation, data mining, pattern
evaluation, and knowledge presentation
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend
analysis, etc.
MIT-652: DM
1: Introduction to Data Mining
51
Summary
„
Pattern interestingness: easily understood, valid (with some
degree of certainty), useful, novel
„
„
Measures of pattern interestingness, either objective or
subjective can be used to guide the discovery process
Data mining systems: can be classified according to the
kinds of databases mined, the kinds of knowledge mined,
the techniques used, or the applications adapted
MIT-652: DM
1: Introduction to Data Mining
52
Summary
„
Five primitives for specification of a data mining task
„
task-relevant data (i.e., the data set to be mined)
„
kind of knowledge to be mined
„
background knowledge
„
interestingness measures
„
knowledge presentation and visualization techniques to
be used for displaying the discovered patterns
MIT-652: DM
1: Introduction to Data Mining
53