DATA MINING
Introductory and Advanced Topics
Part I
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Data Mining Outline
PART I
– Introduction
– Related Concepts
– Data Mining Techniques
PART II
– Classification
– Clustering
– Association Rules
PART III
– Web Mining
– Spatial Mining
– Temporal Mining
Ming-Yen Lin, IECS, FCU
Introduction Outline
Goal: Provide an overview of data mining.
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated
information
simple listing vs. purchase detail
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
Finding hidden information in a database
Fit data to a model
Similar terms
Exploratory data analysis
Data driven discovery
Deductive learning
...
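Data Mining: Alternative Names
Names in use include:
– Knowledge Discovery in Databases (KDD)
– Knowledge extraction
– Knowledge mining
– Pattern discovery
– Data/pattern analysis
– Data mining
– Data dredging
– Data archeology
– Information harvesting
– Business intelligence
In Chinese, "data mining" is rendered variously as 資料探勘, 資料挖掘, 資料採礦, or 資料勘測; 知識挖掘 means "knowledge mining".
Data mining: the process of querying and extracting (usually) previously unknown, useful knowledge, patterns, or trends from large amounts of data (stored in databases).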
Database Processing vs. Data Mining Processing
[Fig. 1.1]
Database processing
– Query: well defined; SQL
– Data: operational data
– Output: precise; subset of database
Data mining processing
– Query: poorly defined; no precise query language
– Data: not operational data
– Output: fuzzy; not a subset of database
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)
– [ex. 1.1: D.M. helps to authorize a credit card
transaction: 4 classes]
Data Mining Algorithm
Objective: Fit Data to a Model
Characterize D.M. Algorithms as 3 parts
Model
Preference – Criteria to fit the best model
Search – Technique to search the data
[ex. 1.1 illustrated]
Models
Predictive: predict about values of data
Descriptive: identify patterns/relationships in
data [explore the properties of data]
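The three parts (model, preference, search) can be sketched on a toy problem. The data, the line-through-the-origin model, and the candidate grid below are all assumptions made for illustration; they are not taken from ex. 1.1.

```python
# Hypothetical toy data: (x, y) pairs that roughly follow y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

def model(slope, x):
    """Model: a line through the origin, y = slope * x."""
    return slope * x

def preference(slope, points):
    """Preference criterion: sum of squared errors (lower is better)."""
    return sum((y - model(slope, x)) ** 2 for x, y in points)

def search(points, candidates):
    """Search: try every candidate slope and keep the one with the lowest error."""
    return min(candidates, key=lambda s: preference(s, points))

# Candidate slopes 0.0, 0.1, ..., 5.0 (an arbitrary grid chosen for this sketch).
candidates = [i / 10 for i in range(0, 51)]
best_slope = search(data, candidates)
print(best_slope)
```

Swapping in a different model, error measure, or search strategy changes the algorithm while leaving the three-part structure intact.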
Data Mining Models and Tasks
illustrative examples only, not exhaustive listing
Predictive Data Mining
• Classification maps data into predefined groups or classes.
   – Supervised learning
   – Examples: loan approval, credit risk
   – Pattern recognition is a type of classification.
   – Example: airport security screening based on face patterns
• Regression maps a data item to a real-valued prediction variable.
   – Linear regression; error analysis to find the best fit
• Prediction predicts future data (rather than current data).
   – Examples: flooding, speech recognition, …
   – Flood prediction uses data collected over time by sensors upriver.
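As a minimal sketch of classification as supervised learning, the block below assigns a credit applicant to one of two predefined classes by copying the label of the closest labeled example (1-nearest neighbor). The features, values, and class names are hypothetical, and nearest neighbor is only one of many possible classifiers.

```python
import math

# Hypothetical labeled training data:
# (income in $1000s, debt in $1000s) -> predefined class.
training = [
    ((80.0, 10.0), "good"),
    ((95.0, 20.0), "good"),
    ((30.0, 25.0), "poor"),
    ((25.0, 40.0), "poor"),
]

def classify(applicant, examples):
    """Return the class of the closest labeled example (1-nearest neighbor)."""
    _, nearest_label = min(
        examples, key=lambda ex: math.dist(applicant, ex[0])
    )
    return nearest_label

print(classify((85.0, 15.0), training))
print(classify((28.0, 30.0), training))
```

The classes are fixed before any data is seen, which is what distinguishes classification from clustering.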
Time Series Analysis
• Example: stock market
• Predict future values
• Determine similar patterns over time
• Classify behavior: Y[6..20] is similar to Z[13..27]
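One simple way to look for a similar pattern is to slide a window over a second series and keep the position with the smallest Euclidean distance. This is a sketch under assumed toy data; the slide does not say which similarity measure is intended, so Euclidean distance is an assumption.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def most_similar_window(query, series):
    """Slide a window of len(query) over series.

    Returns (start index, distance) of the closest-matching window.
    """
    width = len(query)
    best_start, best_dist = None, float("inf")
    for start in range(len(series) - width + 1):
        d = euclidean(query, series[start:start + width])
        if d < best_dist:
            best_start, best_dist = start, d
    return best_start, best_dist

# Hypothetical data: a short pattern from series Y and a longer series Z.
y_pattern = [1.0, 3.0, 2.0]
z_series = [5.0, 4.0, 1.1, 2.9, 2.0, 6.0]
start, dist = most_similar_window(y_pattern, z_series)
print(start, round(dist, 3))
```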
Descriptive Data Mining
• Clustering groups similar data together into clusters [vs. classification].
   – Unsupervised learning
   – Segmentation/partitioning of data
   – Example: demographic groups and specialized catalogs
• Summarization maps data into subsets with associated simple descriptions.
   – Characterization/generalization
• Link analysis uncovers relationships among data.
   – Affinity analysis/associations
   – Association rules [store example]
   – Sequential analysis (sequence discovery) determines sequential patterns.
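To make the contrast with classification concrete, here is a minimal one-dimensional k-means sketch: no labels are given, and the groups emerge from the data. The ages and starting centers are hypothetical, and k-means is just one clustering technique among many.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on 1-D data.

    Repeatedly assigns each value to its nearest center, then moves each
    center to the mean of the values assigned to it.
    """
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customer ages and two arbitrary starting centers.
ages = [18, 21, 23, 45, 47, 52]
centers, clusters = k_means_1d(ages, centers=[20.0, 50.0])
print(centers)
print(clusters)
```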
Data Mining Functions (I)
• Concept description: characterization and discrimination
   – Generalize, summarize
   – Contrast data characteristics
• Association (correlation and causality)
   – Diaper -> Beer [support 0.5%, confidence 75%]
• Classification and prediction
   – Build models (functions) that describe and distinguish classes or concepts, for use in future prediction
   – Examples: classify countries based on climate, or classify cars based on gas mileage
   – Predict unknown or missing values
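The two numbers attached to a rule such as Diaper -> Beer are its support and confidence. The sketch below computes both on a handful of made-up baskets; the transactions are hypothetical, so the values differ from the 0.5% and 75% quoted on the slide.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction also containing consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Here two of five baskets contain both items (support 0.4), and two of the three diaper baskets also contain beer (confidence about 0.67).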
Data Mining Functions (II)
• Cluster analysis
   – Class labels are unknown: group data by similarity
   – e.g., cluster houses to find distribution patterns
   – Maximize intra-class similarity
   – Minimize inter-class similarity
• Outlier analysis
   – Outlier: a data object that does not comply with the general behavior (model) of the data
   – Noise? Exception? No! Useful in fraud detection and rare-event analysis
• Trend and evolution analysis
   – Trend and deviation: regression analysis
   – Sequential pattern mining
   – Periodicity analysis
   – Similarity-based analysis
• Estimation, visualization
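Outlier analysis can be illustrated with a simple robust heuristic: flag values that lie many median absolute deviations (MAD) away from the median. The threshold of 5 and the transaction amounts are assumptions for this sketch; the slide does not prescribe a particular detection method.

```python
from statistics import median

def find_outliers(values, threshold=5.0):
    """Flag values whose distance from the median exceeds threshold * MAD."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # All typical values are identical; this heuristic cannot scale distances.
        return []
    return [v for v in values if abs(v - center) / mad > threshold]

# Hypothetical card transaction amounts; one is far from the usual behavior.
amounts = [20, 22, 19, 21, 20, 23, 500]
print(find_outliers(amounts))
```

In a fraud-detection setting the flagged value is the interesting result, not noise to be discarded.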
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD):
process of finding useful information and
patterns in data.
Data Mining: Use of algorithms to extract
the information and patterns derived by the
KDD process.
KDD Process
Modified from [FPSS96C]
• Selection: Obtain data from various (heterogeneous) sources.
• Preprocessing: Cleanse (incorrect/missing) data.
• Transformation: Convert to a common format; transform to a new format; reduce the amount of data.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to the user in a meaningful manner.
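The preprocessing and transformation steps might look like the following on a single numeric attribute. Mean imputation and min-max scaling are common choices picked here for illustration, not methods named in the slides, and the raw values are hypothetical.

```python
def impute_missing(values):
    """Preprocessing: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Transformation: rescale values linearly onto [0, 1].

    Assumes the values are not all identical (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw attribute with one missing value.
raw = [10.0, None, 30.0, 20.0]
clean = impute_missing(raw)
scaled = min_max_scale(clean)
print(clean)
print(scaled)
```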
Data Mining: The KDD Process
Data mining is the core of the knowledge discovery process.
Stages in the figure, from source data to result:
Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
KDD: Knowledge Discovery in Databases
• The KDD process is interactive and iterative.
• Learn the application domain (relevant prior knowledge and goals of the application).
• Steps
   – Data selection: create a target data set
   – Data cleaning and preprocessing (may take 60% of the effort!)
   – Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
   – Data mining: choose the function (summarization, classification, regression, association, clustering), choose the algorithm(s), and search for patterns of interest
   – Pattern evaluation and knowledge presentation: visualization, transformation
KDD Process Ex.: Web Log
• Selection:
   – Select log data (dates and locations) to use
• Preprocessing:
   – Remove identifying URLs
   – Remove error logs
• Transformation:
   – Sessionize (sort and group)
• Data Mining:
   – Identify and count patterns
   – Construct data structure
• Interpretation/Evaluation:
   – Identify and display frequently accessed sequences
• Potential User Applications:
   – Cache prediction
   – Personalization
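The transformation and mining steps of the web log example can be sketched as follows. The record layout, the 30-minute session timeout, and the choice to count consecutive page pairs are all assumptions for illustration; the slides do not fix any of them.

```python
from collections import Counter

# Hypothetical, already-cleaned log records: (user, timestamp in seconds, page).
log = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 60, "/cart"),
    ("u2", 10, "/home"), ("u2", 40, "/products"),
    ("u1", 5000, "/home"), ("u1", 5030, "/products"),
]

def sessionize(records, timeout=1800):
    """Sort by user and time; start a new session after `timeout` idle seconds."""
    sessions = []
    last_seen = {}
    current = {}
    for user, ts, page in sorted(records):
        if user not in last_seen or ts - last_seen[user] > timeout:
            current[user] = []
            sessions.append(current[user])
        current[user].append(page)
        last_seen[user] = ts
    return sessions

def count_pairs(sessions):
    """Count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts

sessions = sessionize(log)
pair_counts = count_pairs(sessions)
print(pair_counts.most_common(1))
```

The most frequent pair is the kind of result the interpretation step would display, and a cache could use it to prefetch the likely next page.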
Visualization Techniques
Graphical
bar chart, pie charts, histograms, line graphs
Geometric
box plot, scatter diagram
Icon-based
figures, colors
Pixel-based
unique colored pixel
Hierarchical
Hybrid
Data Mining Development
•Relational Data Model
•SQL
•Association Rule Algorithms
•Data Warehousing
•Scalability Techniques
•Similarity Measures
•Hierarchical Clustering
•IR Systems
•Imprecise Queries
•Textual Data
•Web Search Engines
•Bayes Theorem
•Regression Analysis
•EM Algorithm
•K-Means Clustering
•Time Series Analysis
•Algorithm Design Techniques
•Algorithm Analysis
•Data Structures
•Neural Networks
•Decision Tree Algorithms
[Table 1.1]
Data Mining Techniques
Data mining draws on several disciplines:
– Decision support
– Statistics
– Machine learning
– Database management and warehousing
– Parallel processing
– Visualization
– Algorithms
– Others
Evolution of Database Technology
• 1960s: Data collection
   – Data collection, database creation, information management systems, and network DBMS
• 1970s: Databases
   – Relational data model, relational DBMS implementation
• 1980s: Advanced databases
   – RDBMS, advanced data models (extended-relational, OO, deductive, etc.), and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s to 2000s: Data mining
   – Data mining and data warehousing, multimedia databases, and Web databases
D. M. Implementation Issues
Human Interaction
domain experts/technical experts
Overfitting
model does not fit future states
Outliers
Interpretation
expert/common users
Visualization
Large Datasets
High Dimensionality
Implementation Issues (cont’d)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
into traditional DBMS
Application
determine the intended use, business practice
Data Mining: On What Kinds of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced databases and information repositories
   – Object-oriented and object-relational databases
   – Spatial databases
   – Time-series data and temporal data
   – Text databases and multimedia databases
   – Heterogeneous and legacy databases
   – WWW
Data Mining Metrics
Effectiveness/Usefulness measure
Return on Investment (ROI)
Accuracy in classification
Space/Time complexity analysis
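Two of these metrics reduce to short formulas. The sketch below computes classification accuracy and a simple ROI ratio; the labels and dollar figures are hypothetical, and ROI is taken here as (gain - cost) / cost, which is one common definition.

```python
def accuracy(predicted, actual):
    """Fraction of items whose predicted class matches the actual class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def roi(gain, cost):
    """Return on investment as a ratio: net gain divided by cost."""
    return (gain - cost) / cost

# Hypothetical credit-risk predictions versus the true outcomes.
predicted = ["good", "poor", "good", "good"]
actual = ["good", "poor", "poor", "good"]
print(accuracy(predicted, actual))

# Hypothetical project: $100,000 spent, $150,000 gained.
print(roi(gain=150_000, cost=100_000))
```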
Social Implications of DM
Privacy
Profiling
Unauthorized use
Database Perspective on Data
Mining
Scalability
Real World Data: noisy, missing values
Updates
Ease of Use
abstraction of data definition/access
primitives, query processing support
Architecture of a Typical Data Mining System
Components, from top to bottom:
– Graphical user interface
– Pattern evaluation
– Data mining engine
– Knowledge base
– Database or data warehouse server
– Data cleaning, data integration, and filtering
– Databases and data warehouse
The Future
DMQL (data mining query language)
access to concept hierarchy
example (p.18)
rule_spec
generalized relation/characteristic rule/discriminant rule/classification rule
KDD process model: CRISP-DM (Cross-Industry Standard Process for Data Mining)
5A: assess, access, analyze, act, automate
Reference Websites
KDD
http://www.kdnuggets.com/
http://www.acm.org/sigkdd/
http://www.acm.org/sigmod/
Ref. slides
http://www.cs.uiuc.edu/~hanj/book
Research papers
http://www.researchindex.com/
http://www.google.com/
(p.20)