Related Concepts Outline
Goal: Examine some areas which are related to
data mining.
 Database/OLTP Systems
 Fuzzy Sets and Logic
 Information Retrieval (Web Search Engines)
 Dimensional Modeling
 Data Warehousing
 OLAP/DSS
 Statistics
 Machine Learning
 Pattern Matching
Ming-Yen Lin, IECS, FCU
DB & OLTP Systems
On-Line Transaction Processing
 Schema
 (ID,Name,Address,Salary,JobNo)
 Data Model
 Entity-Relationship
 Relational
 Transaction
 Query:
SELECT Name
FROM T
WHERE Salary > 100000
[Fig. 2.1]
DM: Only imprecise queries
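A minimal sketch of the precise query above, run with Python's built-in sqlite3 module. The table name T and its columns come from the slide; the three sample rows are invented for illustration only.

```python
import sqlite3

# In-memory database; the rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (ID INTEGER, Name TEXT, Address TEXT, Salary REAL, JobNo INTEGER)"
)
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "Dallas", 120000, 10),
        (2, "Bob", "Houston", 90000, 20),
        (3, "Cho", "Chicago", 150000, 10),
    ],
)

# The precise query from the slide: each row either satisfies the predicate or it does not.
high_earners = [
    row[0]
    for row in conn.execute("SELECT Name FROM T WHERE Salary > 100000 ORDER BY Name")
]
print(high_earners)
conn.close()
```

A data mining request such as "find employees who look like high earners" has no such exact predicate, which is the contrast the slide draws.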
Fuzzy Sets and Logic
 Fuzzy Set: Set membership function is a real valued
function with output in the range [0,1].
 f(x): Probability x is in F.
 1-f(x): Probability x is not in F.
 EX:
 T = {x | x is a person and x is tall}
 Let f(x) be the probability that x is tall
 Here f is the membership function
 {x | x ∈ R and x.salary > 100,000} vs. {x | x ∈ R and x is tall}
DM: Prediction and classification are fuzzy.
Fuzzy Sets & Fuzzy Logic
Fuzzy logic: reasoning with uncertainty; multiple valued logic
retrieve data with imprecise/missing values
mem(¬x) = 1 − mem(x)
mem(x ∧ y) = min(mem(x), mem(y))
mem(x ∨ y) = max(mem(x), mem(y))
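The three fuzzy-logic rules above can be sketched in a few lines of Python. The function names and the `tall` membership function (0 at 160 cm, 1 at 190 cm, linear in between) are assumptions made for illustration, not part of the slides.

```python
def mem_not(mx: float) -> float:
    """Membership of NOT x: 1 - mem(x)."""
    return 1.0 - mx


def mem_and(mx: float, my: float) -> float:
    """Membership of x AND y: the smaller of the two memberships."""
    return min(mx, my)


def mem_or(mx: float, my: float) -> float:
    """Membership of x OR y: the larger of the two memberships."""
    return max(mx, my)


def tall(height_cm: float) -> float:
    """Hypothetical membership function for 'tall', clipped to [0, 1]."""
    return min(1.0, max(0.0, (height_cm - 160.0) / 30.0))


degree = tall(175.0)  # halfway between the two assumed thresholds
print(degree, mem_not(degree), mem_and(degree, 0.75), mem_or(degree, 0.75))
```

A crisp set such as "salary > 100,000" would return only 0 or 1 here; the fuzzy set returns any value in between.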
Classification/Prediction is Fuzzy
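[Figure: Accept/Reject decision plotted against loan amount (Amnt), in two panels, "Simple" and "Fuzzy"; the fuzzy panel has a grey area between Reject and Accept.]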
Information Retrieval
 Information Retrieval (IR): retrieving desired information
from textual data.
 Library Science
 Digital Libraries
 Web Search Engines
 Traditionally keyword based
 Sample query:
Find all documents about “data mining”.
DM: Similarity measures;
Mine text/Web data.
Information Retrieval (cont’d)
 Similarity: measure of how close a query is to a document.
 Documents which are “close enough” are retrieved.
sim(q,Di); sim(Di, Dj)
 Metrics:
 Precision = |Relevant and Retrieved| / |Retrieved|
 Recall = |Relevant and Retrieved| / |Relevant|
 Inverse Document Frequency:
 IDF_k = log(n / |documents containing k|) + 1
 Concept hierarchy [Fig. 2.7]
 Replace ‘tiger’ with ‘CAT’
 May be a Directed Acyclic Graph
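The metrics on this slide can be sketched as follows. Documents are modeled as sets of words and identified by integer IDs; both choices, and the use of the natural logarithm (the slide does not state the base), are assumptions for illustration.

```python
import math


def precision(relevant: set, retrieved: set) -> float:
    """|Relevant and Retrieved| / |Retrieved|."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant: set, retrieved: set) -> float:
    """|Relevant and Retrieved| / |Relevant|."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


def idf(term: str, documents: list) -> float:
    """IDF_k = log(n / |documents containing k|) + 1, natural log assumed."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(len(documents) / containing) + 1


relevant_ids = {1, 2, 3, 4}   # hypothetical ground truth
retrieved_ids = {3, 4, 5}     # hypothetical query result
docs = [{"data", "mining"}, {"data", "warehouse"}, {"data", "cube"}]

print(precision(relevant_ids, retrieved_ids), recall(relevant_ids, retrieved_ids))
print(idf("data", docs), idf("mining", docs))
```

A term that appears in every document, such as "data" here, gets the minimum IDF of 1; rarer terms score higher.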
IR Query Result Measures and Classification
[Figure: query result measures used to calculate precision/recall, in two panels, "IR" and "Classification"]
Decision Support Systems
Improve decision making by providing
specific information needed by management
Executive Information Systems and Executive Support Systems:
as a suite of tools, they assist in the overall DSS process
Dimensional Modeling
 a different way to view and interrogate data in DB
 View data in a hierarchical manner more as
business executives might
 Useful in decision support systems and mining
 Dimension: collection of logically related attributes;
axis for modeling data.
 Facts: data stored
 Ex: Dimensions – products, locations, date
Facts – quantity, unit price
DM: May view data as dimensional.
Relational View of Data
ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25
Dimensional Modeling Queries
Roll Up: more general dimension
Drill Down: more specific dimension
Dimension (Aggregation) Hierarchy
SQL uses aggregation
Multidimensional schemas
star schema
snowflake schema
fact constellation schema
Multidimensional indexing
bitmap index, join index
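As the slide notes, SQL expresses roll up and drill down through aggregation. The sketch below shows one way to do that with sqlite3 and GROUP BY. The rows are read from the "Relational View of Data" slide; the table name `sales` is an assumption.

```python
import sqlite3

# (ProdID, LocID, Date, Quantity, UnitPrice), as read from the earlier slide.
ROWS = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (ProdID INTEGER, LocID TEXT, Date TEXT, "
    "Quantity INTEGER, UnitPrice INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", ROWS)

# More specific view (drill down): total quantity per product and location.
by_product_location = conn.execute(
    "SELECT ProdID, LocID, SUM(Quantity) FROM sales "
    "GROUP BY ProdID, LocID ORDER BY ProdID, LocID"
).fetchall()

# More general view (roll up): the location dimension is aggregated away.
by_product = conn.execute(
    "SELECT ProdID, SUM(Quantity) FROM sales GROUP BY ProdID ORDER BY ProdID"
).fetchall()
conn.close()

print(by_product)
```

Removing a column from the GROUP BY list rolls up; adding one drills down.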
Cube view of Data
Aggregation Hierarchies
order relationship
second < minute
aggregate sum
additive
Star Schema
[Figure: star schema with the Sales fact table (facts) at the center, linked to the dimension tables Day, product, Division, and Location]
aggregate facts for efficiency
Example of Star Schema
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_street, country
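A minimal sketch of how the star schema above could be declared and queried with sqlite3. Table and column names follow the slide; the fact table name `sales`, the column types, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_street TEXT, country TEXT);
-- Fact table: one foreign key per dimension, plus the measures.
CREATE TABLE sales    (time_key INTEGER REFERENCES time,
                       item_key INTEGER REFERENCES item,
                       branch_key INTEGER REFERENCES branch,
                       location_key INTEGER REFERENCES location,
                       units_sold INTEGER, dollars_sold REAL, avg_sales REAL);
""")

# Hypothetical rows: two days in different quarters and three sales facts.
conn.executemany(
    "INSERT INTO time VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 15, "Sat", 1, "Q1", 2000), (2, 20, "Thu", 4, "Q2", 2000)],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 1, 5, 125.0, 25.0), (1, 2, 1, 1, 2, 200.0, 100.0), (2, 1, 1, 1, 4, 100.0, 25.0)],
)

# A typical star join: the fact table joined to one dimension, then aggregated.
dollars_by_quarter = conn.execute(
    "SELECT t.quarter, SUM(s.dollars_sold) FROM sales AS s "
    "JOIN time AS t ON s.time_key = t.time_key "
    "GROUP BY t.quarter ORDER BY t.quarter"
).fetchall()
conn.close()

print(dollars_by_quarter)
```

Every query against the star takes this shape: join the fact table to the dimensions of interest, then group by dimension attributes.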
Options to implement star schema
(a) flattened: store the data for each dimension in
exactly one table; roll up by SQL aggregation
(b) normalized: one table exists for each level in each
dimension; each table has one tuple for every
occurrence at that level
(c) expanded: the number of dimension tables is the same
as in the normalized option; the lowest-level table is the
same as in the flattened option
(d) levelized: one dimension table, as in the
flattened option, but the aggregations have already been
performed.
[Fig. 2.12]
Example of Snowflake Schema
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_key
    supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
    city: city_key, city, province_or_street, country
Example of Fact Constellation
Also called a galaxy schema
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
  keys: time_key, item_key, shipper_key, from_location, to_location
  Measures: dollars_cost, units_shipped
Dimension tables
  time: time_key, day, day_of_the_week, month, quarter, year
  item: item_key, item_name, brand, type, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_street, country
  shipper: shipper_key, shipper_name, location_key, shipper_type
Data Warehousing
“Subject-oriented, integrated, time-variant,
nonvolatile” William Inmon
 Operational Data: Data used in day to day needs of
company.
 Informational Data: Supports other functions such
as planning and forecasting.
 Data mining tools often access data warehouses
rather than operational data.
DM: May access data in warehouse.
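What is a Data Warehouse?
 Definition
 A decision-support database that is maintained separately from the
organization's operational databases
 Supports information processing by providing a solid platform of
consolidated, historical data for analysis
 “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in
support of management’s decision-making
process.” (W. H. Inmon)
 Data warehousing
 The process of constructing and using data warehouses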
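Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product,
sales
Focuses on the data models and analysis that decision makers need,
not on daily operations or transaction processing
Provides a simple, concise view around a particular subject by
excluding data that is not useful in the decision support process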
Data Warehouse: Integrated
 Constructed by integrating multiple, heterogeneous data sources
 relational databases
 flat files
 on-line transaction records
 Data cleaning and data integration techniques are applied
 Ensure consistency across the different data sources in
 naming conventions
 encoding structures
 attribute measures
 Example: Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it has already been converted
Data Warehouse: Time Variant
 The time horizon of a data warehouse is significantly longer than that of operational systems
 Operational database: current value data.
 Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 contains an element of time, explicitly or implicitly
 operational data: may or may not contain a “time element”
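Data Warehouse: Non-Volatile
A physically separate store of data transformed
from the operational environment
Operational updates do not occur in the data warehouse
No transaction processing, recovery, or
concurrency control mechanisms are required
Only two operations are required
initial loading of data
access of data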
Data Warehousing
traditional db: operational data
data warehouse: informational data
‘what if’ questions -> warehouse + query
e.g. analyze trends from historical data
basic components
data migration
warehouse
access tool
Transformation in DWing
 Transformation [Fig. 2.14]
 remove unwanted data
 convert heterogeneous source into one common format
 merge snapshots to create historical view
 summarize data at levels
 add derived data
 handle missing/erroneous data
 also called data scrubbing/data staging
 Improve performance of data warehouse
applications
 Summarization
 Denormalization (speed up join!)
 Partitioning
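The transformation steps listed above can be sketched as a small pipeline. The field names (`prod_id`, `clerk_note`, and so on), the two source date formats, and the choice to skip rows with missing values are all assumptions for illustration; a real warehouse load would pick its own rules.

```python
from collections import defaultdict
from datetime import datetime


def to_iso(date_text: str) -> str:
    """Convert two assumed source formats (YYYY-MM-DD, MMDDYY) into one common format."""
    for fmt in ("%Y-%m-%d", "%m%d%y"):
        try:
            return datetime.strptime(date_text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {date_text!r}")


def transform(rows: list) -> list:
    """Remove unwanted fields, unify formats, handle missing data, add derived data."""
    clean = []
    for row in rows:
        if row.get("quantity") is None or row.get("unit_price") is None:
            continue  # missing data: this sketch simply skips the row
        clean.append({
            "prod_id": row["prod_id"],          # unwanted fields are not copied over
            "date": to_iso(row["date"]),
            "quantity": row["quantity"],
            "unit_price": row["unit_price"],
            "amount": row["quantity"] * row["unit_price"],  # derived data
        })
    return clean


def summarize_by_month(clean: list) -> dict:
    """Summarize the derived amount at the month level (YYYY-MM)."""
    totals = defaultdict(int)
    for row in clean:
        totals[row["date"][:7]] += row["amount"]
    return dict(totals)


source_rows = [
    {"prod_id": 123, "date": "022900", "quantity": 5, "unit_price": 25, "clerk_note": "x"},
    {"prod_id": 123, "date": "2000-02-01", "quantity": 10, "unit_price": 20},
    {"prod_id": 150, "date": "031500", "quantity": None, "unit_price": 100},
]
cleaned = transform(source_rows)
monthly = summarize_by_month(cleaned)
print(monthly)
```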
Operational vs. Informational
              Operational Data     Data Warehouse
Application   OLTP                 OLAP
Use           Precise Queries      Ad Hoc
Temporal      Snapshot             Historical
Modification  Dynamic              Static
Orientation   Application          Business
Data          Operational Values   Integrated
Size          Gigabits             Terabits
Level         Detailed             Summarized
Access        Often                Less Often
Response      Few Seconds          Minutes
Data Schema   Relational           Star/Snowflake
OLAP
Online Analytical Processing (OLAP):
supports more complex queries than OLTP.
OnLine Transaction Processing (OLTP):
traditional database/transaction processing.
Dimensional data; cube view
Visualization of operations:
Slice: examine a sub-cube by selecting on one dimension.
Dice: examine a sub-cube by selecting on two or more dimensions.
Roll Up/Drill Down
DM: May use OLAP queries.
A Concept Hierarchy
Dimension (location), levels from most general to most specific: all, region, country, city, office
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (in Europe); Canada, ..., Mexico (in North_America)
city: Frankfurt, ... (in Germany); Vancouver, ..., Toronto (in Canada)
office: L. Chan, ..., M. Wind
Used for multi-level abstraction (for interactive mining)
Typical OLAP Operations
 Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction
 Drill down (roll down): the reverse of roll-up
 from higher level summary to lower level summary or detailed
data, or introducing new dimensions
 Slice and dice: (select a part)
 project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D planes.
 Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its backend relational tables (using SQL)
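The operations above can be sketched on a tiny cube held in a plain Python dictionary keyed by (city, quarter, item). The cube contents, the item names, and the helper function names are assumptions for illustration; the city-to-country mapping follows the location concept hierarchy in these slides.

```python
from collections import defaultdict

# Hypothetical cube: (city, quarter, item) -> units sold.
cube = {
    ("Vancouver", "Q1", "phone"): 10,
    ("Vancouver", "Q2", "phone"): 20,
    ("Toronto", "Q1", "phone"): 5,
    ("Toronto", "Q1", "tv"): 7,
    ("Frankfurt", "Q1", "tv"): 3,
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}


def roll_up_location(cells: dict) -> dict:
    """Roll up: climb the location hierarchy from city to country, summing units."""
    out = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(out)


def slice_time(cells: dict, quarter: str) -> dict:
    """Slice: select on one dimension (time), leaving a 2-D plane (city, item)."""
    return {(c, i): u for (c, q, i), u in cells.items() if q == quarter}


def dice(cells: dict, cities: set, quarters: set, items: set) -> dict:
    """Dice: select on several dimensions at once, leaving a smaller cube."""
    return {
        key: u for key, u in cells.items()
        if key[0] in cities and key[1] in quarters and key[2] in items
    }


def pivot(plane: dict) -> dict:
    """Pivot: swap the two axes of a 2-D plane."""
    return {(b, a): u for (a, b), u in plane.items()}


by_country = roll_up_location(cube)
q1_plane = slice_time(cube, "Q1")
small_cube = dice(cube, {"Vancouver", "Toronto"}, {"Q1"}, {"phone"})
print(by_country[("Canada", "Q1", "phone")], pivot(q1_plane)[("tv", "Toronto")])
```

Drill down is the reverse of `roll_up_location`: it needs the detailed cells, so a system must keep (or be able to fetch) the lower-level data.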
Cube Operations
[Figure: operations applied to a data cube]
dice (location = X AND time = Y AND item = Z)
roll-up (city to location)
drill-down (quarter to month)
slice (time = Q1)
pivot
OLAP Operations
[Figure: OLAP operations on a cube; panels: Roll Up, Drill Down, Single Cell, Multiple Cells, Slice, Dice]
OLAP tools: ROLAP (relational) or MOLAP (multidimensional)
ROLAP: a ROLAP server (middleware) creates the multidimensional (MD) view for users
MOLAP: a specialized DBMS and software directly support MD data
or a hybrid tool
Web Search Engines
Can be viewed as query systems, like IR systems
query: keyword, boolean, weighted, …
Conventional search engines suffer from:
Abundance
Limited coverage
Limited query
Limited customization
Web Mining
content/structure/usage
Web search => content mining
Statistics
Simple descriptive models
Statistical inference: generalizing a model
created from a sample of the data to the
entire dataset.
Exploratory Data Analysis:
Data can actually drive the creation of the
model
Opposite of traditional statistical view.
Data mining is targeted to the business user
DM: Many data mining methods come
from statistical techniques.
Machine Learning
 Machine Learning: area of AI that examines how to
write programs that can learn.
 Often used in classification and prediction
 Supervised Learning: learns by example.
 Unsupervised Learning: learns without knowledge
of correct answers.
 Machine learning often deals with small static
datasets.
 [table 2.3]
DM: Uses many machine learning
techniques.
Pattern Matching (Recognition)
Pattern Matching: finds occurrences of a
predefined pattern in the data.
Applications include speech recognition,
information retrieval, time series analysis.
DM: Type of classification.
DM vs. Related Topics
Area     Query     Data              Results  Output
DB/OLTP  Precise   Database          Precise  DB Objects or Aggregation
IR       Precise   Documents         Vague    Documents
OLAP     Analysis  Multidimensional  Precise  DB Objects or Aggregation
DM       Vague     Preprocessed      Vague    KDD Objects