What is Cluster Analysis? (1/4)
• Cluster: a collection of data objects (birds of a feather flock together)
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
– Partitioning a diverse group into more homogeneous clusters or subgroups
• Clustering is unsupervised classification: no
predefined classes
– Data objects group together according to their own self-similarity; what each cluster means only becomes clear through interpretation afterwards
2017/5/5
Data Mining
1
What is Cluster Analysis? (2/4)
• Uncover hidden phenomena or internal structure
What is Cluster Analysis? (3/4)
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
  − Clustering might be the first step in a market segmentation effort
  − Wrong question: a one-size-fits-all rule for "what kind of promotion do customers respond to best"
  − Right question: what kind of promotion works best for each cluster (customers with similar buying habits)
What is Cluster Analysis? (4/4)
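• User segments and spending power on an online shopping site
– People with similar basic profiles usually show similar behavior patterns

Member  Age  Average monthly income (thousands)
1       20   20
2       21   26
3       22   25
4       41   30
5       43   32
6       52   40
7       55   38

[Figure: scatter plot of age (x-axis, 0 to 60) against average monthly income in thousands (y-axis, 0 to 50), showing three clusters C1, C2, and C3]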
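The grouping in the scatter plot can be reproduced with a small k-means sketch. The slides do not say which algorithm produced C1 to C3, so k-means, the choice of three clusters, and the starting centroids (members 1, 4, and 7) are all assumptions made for illustration.

```python
# Member data from the slide: (age, average monthly income in thousands).
members = {
    1: (20, 20), 2: (21, 26), 3: (22, 25),
    4: (41, 30), 5: (43, 32),
    6: (52, 40), 7: (55, 38),
}

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, max_iter=100):
    """Plain k-means; returns one cluster index per point."""
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for k in range(len(centroids)):
            group = [p for p, lab in zip(points, labels) if lab == k]
            if group:
                centroids[k] = tuple(sum(c) / len(group) for c in zip(*group))
    return labels

ids = sorted(members)
points = [members[i] for i in ids]
# Assumption: members 1, 4 and 7 serve as the starting centroids.
start = [members[1], members[4], members[7]]
labels = k_means(points, list(start))

clusters = {}
for member_id, label in zip(ids, labels):
    clusters.setdefault(f"C{label + 1}", []).append(member_id)
print(clusters)
```

With these starting points the young, low-income members fall into C1, the middle group into C2, and the oldest, highest-income members into C3. Other starting centroids or unscaled attributes with very different ranges could give a different partition.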
What Is Good Clustering? (1/2)
• A good clustering method will produce high quality clusters with
– high intra-class similarity and low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
− Example: among a dozen or so clusters of credit card usage behavior, one cluster contains a high proportion of bad-debt cases, while the other clusters show nothing distinctive
What Is Good Clustering? (2/2)
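Issues in Cluster Analysis
• Which information (features, attributes) should the clustering be based on?
• Deciding the number of clusters in advance is difficult
• Which cluster a data object belongs to should be a matter of degree (fuzzy) rather than a yes-or-no question (crisp)
• Unsupervised learning has no single best model
• Visualization tools vs. clustering algorithms (expert experience)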
A scatter graph helps to understand and
visualize clusters of customers (1/2)
A scatter graph helps to understand and
visualize clusters of customers (2/2)
• Each axis
– A purchase of an item associated with a given pet
• The box at the intersection
– The number of customers who purchased the corresponding items
• Four segments of customers
1. Only-dog-owners
2. Only-cat-owners
3. Only-fish-owners and cat-and-dog-owners
4. The rest can be lumped together as "others"
Cluster Analysis based on RFM (1/2)
• Analyzing RFM values quantifies customer purchasing behavior and measures customer loyalty and contribution, which supports customer segmentation and targeting
– R (Recency): the time period since the last purchase
– F (Frequency): the number of purchases made in a certain time period
– M (Monetary): the amount of money spent during a certain period of time
Cluster Analysis based on RFM (2/2)
• Obtain the customers' RFM values for a given time period
• Run the cluster analysis
• Average RFM values of each cluster (Vc) are compared with the total average RFM values of all clusters (Vt)
– If Vc > Vt, assign ↑; otherwise assign ↓
• Target customers and marketing strategy
 R  F  M : Promising customers
 R  F  M : Loyal customers
 R  F  M : Vulnerable customers
• Some combinations of changes are hard to interpret, and the magnitude of each change is not taken into account
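The comparison of cluster averages against the overall average can be sketched as follows. The cluster values are hypothetical, Vt is taken here as the unweighted mean of the cluster averages (the slides do not say whether it is weighted by cluster size), and ↑ and ↓ are assumed to be the two marks.

```python
# Hypothetical average RFM values per cluster (not from the slides).
cluster_avgs = {
    "cluster_1": {"R": 12.0, "F": 8.0, "M": 300.0},
    "cluster_2": {"R": 45.0, "F": 2.0, "M": 80.0},
    "cluster_3": {"R": 20.0, "F": 5.0, "M": 250.0},
}

def rfm_pattern(cluster_avgs):
    """Mark each of R, F, M with an up arrow if Vc > Vt, else a down arrow."""
    keys = ("R", "F", "M")
    # Assumption: Vt is the unweighted mean of the cluster averages.
    total = {
        k: sum(avg[k] for avg in cluster_avgs.values()) / len(cluster_avgs)
        for k in keys
    }
    return {
        name: "".join(k + ("↑" if avg[k] > total[k] else "↓") for k in keys)
        for name, avg in cluster_avgs.items()
    }

patterns = rfm_pattern(cluster_avgs)
for name, pattern in patterns.items():
    print(name, pattern)
```

Because only the direction of the difference is kept, a cluster that sits barely above the average gets the same mark as one far above it, which is the limitation the slide points out.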
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
• Land use: Identification of areas of similar land use in an
earth observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to
their house type, value, and geographical location
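• Text mining: document classification, customer complaint handling, patient case-record analysis, military and criminal intelligence management (based on similarity of keyword structure)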
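Data Classification vs. Data Clustering
• Data Classification
– Classifies data according to its attributes and a set of pre-established rules
– Requires some prior understanding of the structure of the data
– Finds the relationships between the many input variables and the target (output variable)
• Data Clustering
– Groups data into clusters without needing to understand the characteristics and structure of the data in the database
– Maximizes the similarity of data within a cluster and minimizes the similarity between clusters
– Reveals the structure among variables, leaving more room for interpretation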
Description and Visualization (1/2)
• Describe what is actually going on in a complex database. Doing so gives us a better understanding of our customers, products, and processes.
− A good enough description of a behavior will often suggest an explanation for it as well
  − Example: parental movie viewing habits are strongly influenced by the taste of children
Description and Visualization (2/2)
• Data visualization is one powerful form of descriptive data mining.
− It is not always easy to come up with meaningful visualizations, but the right picture really can be worth a thousand association rules
− Data cube, scatter graph, histogram, …
Data Mining Techniques
• Statistical analysis
• Association analysis
• Classification
• Cluster analysis
• Other techniques
– Trend analysis, time series analysis, regression analysis, outlier analysis, neural networks from the field of artificial intelligence, and so on
All six tasks in one small database
Using the moviegoers database as an example
• We wondered
– What movies a person watches
– Who goes to see a movie
• The moviegoers database contains
– The responses to an informal survey conducted during August and September of 1996
• The sample populations
– The survey was distributed to four different populations in hopes that interesting intergroup differences might be revealed
• The survey asked for age, sex, and last movies seen in a movie theater
The layout of the moviegoers database
[Figure: entity-relationship layout of the moviegoers database; the tables are linked by one-to-many (1 to ∞) relationships]
Moviegoer Survey (The first few rows are shown)
Name     Sex  Age  Source          Movie
Amy      F    27   Oberlin         Independence Day
Andrew   M    25   Oberlin         12 Monkeys
Andy     M    34   Oberlin         The Birdcage
Anne     F    30   Oberlin         Trainspotting
Ansje    F    25   Oberlin         I Shot Andy Warhol
Beth     F    30   Oberlin         Chain Reaction
Bob      M    51   Pinewoods       Schindler's List
Brian    M    23   Oberlin         Super Cop
Candy    F    29   Oberlin         Eddie
Cara     F    25   Oberlin         Phenomenon
Cathy    F    39   124 Mt. Auburn  The Birdcage
Charles  M    25   Oberlin         Kingpin
Curt     M    30   MRJ             T2 Judgment Day
David    M    40   MRJ             Independence Day
Erica    F    23   124 Mt. Auburn  Trainspotting
What can data mining do? (1/3)
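• Moviegoer classification
– Determine sex from age, source, and movies seen
– Determine source from sex, age, and movies seen
– Determine which movie a person will see (most recent movie) from previously seen movies, age, sex, and source
– Technique: decision trees
• Moviegoer estimation
– Age is a continuous variable, so it can serve as the target variable of an estimation task
– age = f(source, sex, movies seen)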
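As a rough illustration of the estimation task, the sketch below estimates age from source and sex using group means over the survey rows shown earlier. The group-mean estimator and its fallback order are assumptions made here, not the method used in the slides, and the "movies seen" input is left out to keep the example short. The source name "124 Mt. Auburn" is a normalized spelling.

```python
from collections import defaultdict

# (sex, age, source) for the first survey rows shown in the transcript.
rows = [
    ("F", 27, "Oberlin"), ("M", 25, "Oberlin"), ("M", 34, "Oberlin"),
    ("F", 30, "Oberlin"), ("F", 25, "Oberlin"), ("F", 30, "Oberlin"),
    ("M", 51, "Pinewoods"), ("M", 23, "Oberlin"), ("F", 29, "Oberlin"),
    ("F", 25, "Oberlin"), ("F", 39, "124 Mt. Auburn"), ("M", 25, "Oberlin"),
    ("M", 30, "MRJ"), ("M", 40, "MRJ"), ("F", 23, "124 Mt. Auburn"),
]

by_source_sex = defaultdict(list)
by_source = defaultdict(list)
for sex, age, source in rows:
    by_source_sex[(source, sex)].append(age)
    by_source[source].append(age)
all_ages = [age for _, age, _ in rows]

def mean(values):
    return sum(values) / len(values)

def estimate_age(source, sex):
    """Estimate age as a group mean, falling back to broader groups."""
    if (source, sex) in by_source_sex:
        return mean(by_source_sex[(source, sex)])
    if source in by_source:
        return mean(by_source[source])
    return mean(all_ages)

print(estimate_age("MRJ", "M"))        # mean of the two MRJ men
print(estimate_age("Pinewoods", "F"))  # falls back to the Pinewoods mean
```

A decision tree or regression model would play the role of f in practice; the group mean is simply the smallest estimator that makes the input-to-target mapping concrete.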
What can data mining do? (2/3)
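• Moviegoer prediction
− When a new movie is released, who will its audience be?
  − Cluster both the moviegoers and the movies
  − For each cluster of moviegoers, mine rules that explain that group's taste in movies
  − For each cluster of movies, mine rules that describe its best target audience
  − When a new movie is released, the cluster it belongs to identifies its target audience
• Moviegoer affinity grouping
− Which movies are always watched by the same kind of people (which movies go together?)
− Use the generated association rules to analyze the classification by sex (virtual items)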
What can data mining do? (3/3)
• Moviegoer clustering
− To find groups of movies that go together because they are seen by the same people
− To find groups of people that go together because they see the same movies
  − People with young children form a clearly recognizable cluster in the moviegoers database
• Moviegoer description
− Basic statistics: average age, percentage of women
− Association rules: people who have seen movie X also see movie Y
− A rule can also be read as a description: males aged 12 to 17 like movie X
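The basic statistics mentioned for moviegoer description can be computed directly. This sketch uses only the fifteen survey rows shown in the transcript, so the figures describe that excerpt and not the full database.

```python
# (sex, age) for the first survey rows shown in the transcript.
respondents = [
    ("F", 27), ("M", 25), ("M", 34), ("F", 30), ("F", 25),
    ("F", 30), ("M", 51), ("M", 23), ("F", 29), ("F", 25),
    ("F", 39), ("M", 25), ("M", 30), ("M", 40), ("F", 23),
]

# Average age over all listed respondents.
mean_age = sum(age for _, age in respondents) / len(respondents)

# Share of women, as a percentage.
female_count = sum(1 for sex, _ in respondents if sex == "F")
pct_female = 100 * female_count / len(respondents)

print(f"average age: {mean_age:.1f}")
print(f"percentage of women: {pct_female:.1f}%")
```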
Evaluation and Interpretation
• Model validation
– After building a model, you must evaluate its results and interpret their significance
– Accuracy by itself is not necessarily the right metric for selecting the best model. You need to know more about the types of errors and the costs associated with them
• Confusion matrices
– For classification problems, a confusion matrix is a very useful tool for understanding results
– It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong
Confusion matrix (1/2)
Model X (rows: prediction, columns: actual)

Prediction  Actual A  Actual B  Actual C
Class A     45        2         3
Class B     10        38        2
Class C     4         6         40
– this is much more informative than simply telling us an overall
accuracy rate of 82% (123/150)
– If there are different costs associated with different errors, a
model with a lower overall accuracy may be preferable to one
with higher accuracy but a greater cost to the organization due to
the types of errors it makes
Confusion matrix (2/2)
Model Y (rows: prediction, columns: actual)

Prediction  Actual A  Actual B  Actual C
Class A     40        12        10
Class B     6         38        1
Class C     2         1         40
– The accuracy has dropped to 79% (118/150)
– Suppose each correct answer had a value of $10 and each incorrect answer for class A had a cost of $5, for class B a cost of $10, and for class C a cost of $20
  − The net value of Model X = (123*10) - (5*5) - (12*10) - (10*20) = 885
  − The net value of Model Y = (118*10) - (22*5) - (7*10) - (3*20) = 940
  − Model Y is therefore worth more than Model X despite its lower accuracy
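The accuracy and net-value arithmetic for the two models can be checked with a short sketch. It assumes each matrix row is a predicted class and each column an actual class, and that an error is charged at the cost of the class that was predicted; both readings match the per-class error counts (5, 12, 10 and 22, 7, 3) used in the slide's formulas.

```python
def accuracy(matrix):
    """Fraction of cases on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def net_value(matrix, value_correct, cost_by_predicted_class):
    """Value of correct answers minus the cost of errors.

    Assumption: an error costs the amount attached to the predicted class (row).
    """
    total = 0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i == j:
                total += count * value_correct
            else:
                total -= count * cost_by_predicted_class[i]
    return total

# Rows: predicted class A, B, C. Columns: actual class A, B, C.
model_x = [[45, 2, 3], [10, 38, 2], [4, 6, 40]]
model_y = [[40, 12, 10], [6, 38, 1], [2, 1, 40]]
costs = [5, 10, 20]  # cost of a wrong answer for class A, B, C

for name, matrix in (("X", model_x), ("Y", model_y)):
    print(name, round(accuracy(matrix), 3), net_value(matrix, 10, costs))
```

Model X wins on accuracy and Model Y wins on net value, which is the point of weighing errors by their cost.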
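Using the Confusion Matrix (1/4)
• Data mining: use historical data to find rare events
– Such events bring high profit or serious loss, yet acting on every customer is not worthwhile
• A confusion matrix provides three kinds of information, the 3 Rs:
– Response Rate: how many rare events do we find in the list we predicted?
– Recall: what share of all rare events did we predict?
– Range Reduce: how much does the list shrink when we use the data mining model to look for rare events?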
Using the Confusion Matrix (2/4)
0: will not buy   1: will buy

Prediction  Actual 0  Actual 1
Class 0     6855      2171
Class 1     2497      6961

• Response Rate: the ability to go without rather than settle for poor candidates
– Response Rate = 6961 / (2497 + 6961) = 73.6%
– Overall Response Rate = (6961 + 2171) / (6855 + 2171 + 2497 + 6961) = 49.4%
– The model's response rate is 1.49 times the overall rate
Using the Confusion Matrix (3/4)
0: will not buy   1: will buy

Prediction  Actual 0  Actual 1
Class 0     6855      2171
Class 1     2497      6961

• Recall: better to flag ten thousand by mistake than to let a single one slip through
– Recall = 6961 / (6961 + 2171) = 76.2%
• Range Reduce: the cost of running a campaign according to the model
– Range Reduce = (6961 + 2497) / (6855 + 2171 + 2497 + 6961) = 51.2%
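The three measures can be computed from the four cells of the purchase matrix. The variable names below are illustrative; the sketch assumes the "Class 1" row holds the customers the model predicted would buy, which is the reading the slide's own formulas imply.

```python
# Cells of the purchase confusion matrix (0: will not buy, 1: will buy).
pred0_actual0, pred0_actual1 = 6855, 2171   # predicted "will not buy"
pred1_actual0, pred1_actual1 = 2497, 6961   # predicted "will buy"

total = pred0_actual0 + pred0_actual1 + pred1_actual0 + pred1_actual1
predicted_buyers = pred1_actual0 + pred1_actual1
actual_buyers = pred0_actual1 + pred1_actual1

# Response rate: share of the predicted list that really buys.
response_rate = pred1_actual1 / predicted_buyers
# Overall response rate: share of buyers if everyone were contacted.
overall_rate = actual_buyers / total
# How many times better the model's list is than contacting everyone.
improvement = response_rate / overall_rate
# Recall: share of all real buyers that the model caught.
recall = pred1_actual1 / actual_buyers
# Range reduce: share of the population that ends up on the list.
range_reduce = predicted_buyers / total

print(f"response rate {response_rate:.1%}, overall {overall_rate:.1%}, "
      f"improvement {improvement:.2f}x")
print(f"recall {recall:.1%}, range reduce {range_reduce:.1%}")
```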
Using the Confusion Matrix (4/4)
• Which model is best depends on the business problem
– For a marketing response problem, we want to get as many potential responders as possible and we do not care about false positives
– For a medical diagnostic test for cancer, we might use such a model as an initial screen. We care a lot about false negatives and want as few as possible
The Lift (Gain) Chart
• It shows how responses are changed by applying the
model. This change ratio is called the lift
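One common way to compute lift is cumulatively by decile: rank customers by model score, then divide the response rate in the top deciles by the response rate of the whole list. The counts below are hypothetical and the cumulative-decile definition is an assumption, since the chart itself did not survive in the transcript.

```python
# Hypothetical responders per decile of a model-ranked list
# (1,000 customers per decile, best-scored decile first).
responders_per_decile = [300, 200, 150, 100, 80, 60, 40, 30, 25, 15]
decile_size = 1000

overall_rate = sum(responders_per_decile) / (decile_size * len(responders_per_decile))

cumulative_lift = []
found = 0
for i, responders in enumerate(responders_per_decile, start=1):
    found += responders
    rate_so_far = found / (decile_size * i)   # response rate in the top i deciles
    cumulative_lift.append(rate_so_far / overall_rate)

for i, lift in enumerate(cumulative_lift, start=1):
    print(f"top {i * 10:3d}%: lift {lift:.2f}")
```

Lift is highest for the first decile and falls to 1.0 once the whole list is contacted, because at that point the model no longer changes who receives the offer.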
The ROI (Return on Investment) Chart
• A pattern may be interesting, but acting on it may cost more than the revenue or savings it generates
• Here, ROI is defined as the ratio of profit to cost
The Profit Chart
• Profit = revenue minus cost
• The maximum lift was achieved at the 1st decile (10%), the
maximum ROI at the 2nd decile (20%), and the maximum profit
at the 3rd and 4th deciles
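The two definitions (profit = revenue minus cost, ROI = profit / cost) can be applied decile by decile. All figures below are hypothetical, chosen only so that lift, ROI, and profit peak at different depths of the list; they are not the data behind the original charts.

```python
# Hypothetical campaign figures (not from the slides).
responders_per_decile = [300, 200, 150, 100, 80, 60, 40, 30, 25, 15]
decile_size = 1000          # customers per decile
revenue_per_response = 10   # dollars earned per responder
cost_per_contact = 1        # dollars spent per customer contacted
fixed_cost = 600            # dollars spent regardless of list size

profits, rois = [], []
found = 0
for i, responders in enumerate(responders_per_decile, start=1):
    found += responders
    revenue = found * revenue_per_response
    cost = fixed_cost + cost_per_contact * decile_size * i
    profit = revenue - cost            # profit = revenue minus cost
    profits.append(profit)
    rois.append(profit / cost)         # ROI = ratio of profit to cost

best_roi_decile = max(range(len(rois)), key=lambda i: rois[i]) + 1
best_profit = max(profits)
best_profit_deciles = [i + 1 for i, p in enumerate(profits) if p == best_profit]

print("max ROI at decile", best_roi_decile)
print("max profit", best_profit, "at deciles", best_profit_deciles)
```

With these numbers ROI peaks at the 2nd decile while profit peaks at the 3rd and 4th, which illustrates why the depth of a mailing depends on which measure the business wants to maximize.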
External Validation
No matter how good the accuracy of a model is estimated
to be, there is no guarantee that it reflects the real world
– One of the main reasons for this problem is that there are
always assumptions implicit in the model
  − For example, the inflation rate may not have been included as a variable in a model that predicts the propensity of an individual to buy
• It is important to test a model in the real world
– do a test mailing to verify the model
– try the model on a small set of applicants before full
deployment
Deploy the model and results (1/2)
The first way is for an analyst to recommend actions
based on simply viewing the model and its results
– The analyst may look at the clusters the model has identified,
the rules that define the model, or the lift and ROI charts that
depict the effect of the model
The second way is to apply the model to different data
sets
– to flag records based on their classification,
– to assign a score such as the probability of an action, or
– to select some records from the database and subject these to further analyses with an OLAP tool, and so on
Deploy the model and results (2/2)
The amount of time to process each new transaction, and
the rate at which new transactions arrive, will determine
whether a parallelized algorithm is needed
– Monitoring credit card transactions or cellular telephone calls
for fraud
When delivering a complex application, data mining is
often only a small, albeit critical, part of the final product
– In a fraud detection system, known patterns of fraud may be
combined with discovered patterns
You must measure how well your model has worked after
you use it (model monitoring)
– The model may need to be retested, retrained, and possibly completely rebuilt
Acting on the Results (1/2)
• Sometimes it is valuable to incorporate a bit of experimental design into the process
– If we are predicting customer response to a product, we might have three different groups
1) A group of customers selected by the data mining model, who get the marketing message
2) A group of customers chosen at random, who get the marketing message
3) A group of customers chosen at random, who do not get the marketing message
Acting on the Results (2/2)
– What we hope is that
  − The first group will have a high response rate
  − The second group will have a mediocre response rate
  − The third will have a negligible response rate
– We can test the strength of the marketing message
  − The difference in response between the second and third groups
– We can test the strength of the data mining
  − The difference between the first and second groups
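The two comparisons reduce to differences between group response rates. The group sizes and response counts below are hypothetical, and a real test would also check whether the differences are statistically significant.

```python
# Hypothetical results: (responders, customers in group). Not from the slides.
groups = {
    "model_selected_with_message": (120, 1000),
    "random_with_message": (40, 1000),
    "random_without_message": (10, 1000),
}

rates = {name: responders / size for name, (responders, size) in groups.items()}

# Strength of the marketing message: second group versus third group.
message_effect = rates["random_with_message"] - rates["random_without_message"]
# Strength of the data mining: first group versus second group.
model_effect = rates["model_selected_with_message"] - rates["random_with_message"]

print(f"message effect: {message_effect:+.1%}")
print(f"model effect:   {model_effect:+.1%}")
```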
Measuring the Model’s Effectiveness
• We need to compare the results to what actually happened in the real world
– Did the predicted behavior actually happen?
  − Did the prospects accept the offer, did the customers purchase the new product, did they churn?
– The lift charts and confusion matrices can be adapted to compare actual results to predicted results
– The score set is usually more recent than the model set
  − Model performance usually degrades over time
  − The model captures patterns from the past and, over time, the patterns become less relevant
What Makes Predictive Modeling
Successful?
A. Modeling Shelf-Life
B. The whole process of predictive
modeling is based on some key
assumptions
A. Modeling Shelf-Life
• Looking at time frames brings up two critical questions about models and their predictions:
– What is the shelf-life of a model?
  − The things being modeled change over time
  − A model created five years ago, or last year, or last month, may no longer be valid
  − You need to train a new model on more recent data
– What is the shelf-life of a prediction?
  − Predictions are valid during a particular time frame
B. Key Assumption 1 (1/2)
• The Past Is a Good Predictor of the Future
– How patients reacted to a drug in the past
– However, external factors will always have an influence on the model being built
  − Retail sales decrease during cold weather and blizzards
  − Mortgage lending increases when interest rates go down
  − Seasonal patterns: the Christmas season and back-to-school season drive many retail sales
  − Models developed during years of relatively stable financial markets were not applicable in the more volatile markets
B. Key Assumption 1 (2/2)
• The Past Is a Good Predictor of the Future
– How do we know when the past is a good predictor of the future?
  − We can never know for sure
  − It is critical to
    · Include domain experts (who have insight about important factors) in the modeling process
    · Include enough of the right data (such as seasonal factors) to make good decisions
B. Key Assumption 2
• The Data Is Available
– Data may not be available for any number of different reasons
  − The data may not be collected by the operational systems
  − The database is too busy most of the time to prepare extracts
  − The data is owned by an outside vendor
  − And so on
– Ensuring that the right data is available is critical to building successful predictive models
B. Key Assumption 3
• The Data Contains What We Want to Predict
– To apply the lessons of the past to the future, we need to be comparing apples to apples and oranges to oranges
  − Often, the business people phrase their needs very ambiguously
    · "We are interested in people who do not pay their bills"
  − Sometimes business users have unreasonable expectations from their data
    · When building a response model, we must know who responded to the campaign and who received the campaign
    · For advertising campaigns, the second group is not known
    · However, we can compare the responders to a random sample of the general population
Selecting Data Mining Products (1/3)
• There are three main types of data mining products
1) Tools that are analysis aids for OLAP
  − Help OLAP users identify the most important dimensions and segments on which they should focus attention
  − Business Objects Business Miner, Cognos Scenario
2) The "pure" data mining products
  − Horizontal tools aimed at data mining analysts concerned with solving a broad range of problems
  − IBM Intelligent Miner, Oracle Darwin, SAS Enterprise Miner, SGI MineSet, and SPSS Clementine
3) Analytic applications which implement specific business processes for which data mining is an integral part
  − Customized packages with the data mining embedded
Selecting Data Mining Products (2/3)
• Basic capabilities
– Nothing substitutes for actual hands-on experience
– Depending on your particular circumstances (system architecture, staff resources, database size, problem complexity), some data mining products will be better suited than others to meet your needs
– System architecture
  − Work on a stand-alone desktop machine or a client-server architecture
– Data preparation
– Data access
  − No single product can support the large variety of database servers
– Algorithms
Selecting Data Mining Products (3/3)
• Basic capabilities (continued)
– Interfaces to other products
  − Many tools can help you understand your data before you build your model, and help you interpret the results of your model
  − These include traditional query and reporting tools, graphics and visualization tools, and OLAP tools
– Model evaluation and interpretation
– Model deployment
  − When you need to apply the model to new cases as they come, it is usually necessary to incorporate the model into a program using an API or code generated by the data mining tool
– Scalability
– User interface
  − The people who build, deploy, and use the results of the models may be different groups with varying skills
The Virtuous Cycle of DM (1/2)
• Data mining can be applied to many problems in many industries
– Most common applications are in marketing, specifically for CRM
  − Applied to prospecting for new customers, retaining existing ones, and increasing customer value
  − Applied to understanding customer behavior and optimizing manufacturing processes
• Although they may have much in common, every application has its own unique characteristics
– Within a single industry, different companies have different strategic plans and different approaches
The Virtuous Cycle of DM (2/2)
• The virtuous cycle is a high-level process, consisting of four major business processes:
1. Identifying the business problem
2. Transforming data into actionable results
3. Acting on the results
4. Measuring the results
• There are no shortcuts: success in DM requires all four processes
– Expertise grows as organizations focus on the right business problems, learn about data and modeling techniques, and improve data mining processes based on the results of previous efforts
Data Description and Data Mining
Model Building (1/2)
• Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions
• The first and simplest analytical step in data mining is to describe the data
– Summarize its statistical attributes (such as means and standard deviations)
– Visually review it using charts and graphics (visualization)
– Look for potentially meaningful links among variables (such as values that often occur together)
– Clustering
• Collecting, exploring, and selecting the right data are critically important
Data Description and Data Mining
Model Building (2/2)
• In general, data description alone cannot provide an action plan
– You must build a predictive model based on patterns determined from known results (model training), then test that model on results outside the original sample (model testing)
  − The accuracy (or error) rate is a good estimate of how the model will perform on future datasets that are similar to the training and test datasets
– Finally, you must empirically verify the model
  − E.g., send a mailing to a portion of the new list and see what results you get
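The training and testing steps can be illustrated with a deliberately tiny model. Everything here is hypothetical: the records, the fixed split, and the one-split "decision stump" that predicts a response when age reaches a learned cut-off. It is a sketch of the workflow, not a recommended model.

```python
# Hypothetical labeled records: (age, responded). Not from the slides.
records = [
    (22, 0), (25, 0), (27, 0), (31, 0), (33, 1), (36, 0),
    (38, 1), (41, 1), (45, 1), (48, 1), (52, 1),
    (24, 0), (29, 0), (35, 1), (39, 1), (43, 1),
]
# Hold out the last five records; the model never sees them during training.
train, test = records[:11], records[11:]

def predict(age, threshold):
    """Predict a response (1) when age is at or above the cut-off."""
    return 1 if age >= threshold else 0

def accuracy(data, threshold):
    """Share of records whose label matches the prediction."""
    return sum(predict(age, threshold) == label for age, label in data) / len(data)

def fit_threshold(data):
    """Model training: pick the cut-off with the best accuracy on the data."""
    candidates = sorted({age for age, _ in data})
    return max(candidates, key=lambda t: accuracy(data, t))

threshold = fit_threshold(train)              # model training
train_accuracy = accuracy(train, threshold)
test_accuracy = accuracy(test, threshold)     # model testing on held-out records

print(f"cut-off {threshold}: train {train_accuracy:.2f}, test {test_accuracy:.2f}")
```

The held-out accuracy is the figure to trust as an estimate, and only for future data that resembles these records.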
Predictive Data Mining (1/2)
• A hierarchy of choices
– Business goal
  − What is the ultimate purpose of mining this data?
  − Retain good customers, identify customers likely to leave, or predict customer profitability
– Type of prediction
  − Classification or regression
– Model type
  − Neural networks or decision trees
  − Your choice of model type will influence what data preparation you must do and how you go about it
– Algorithm
– Product
  − Products generally have different implementations of a particular algorithm even when they identify it with the same name
Predictive Data Mining (2/2)
• No tool or technique is perfect for all data
– Many business goals are best met by building
multiple model types using a variety of algorithms
– You may not be able to determine which model type
is best until you’ve tried several approaches
Summary (1/2)
Data mining offers great promise in helping
organizations uncover patterns hidden in their data
that can be used to predict the behavior of customers,
products and processes
However, data mining tools need to be guided by users
who understand the business, the data, and the general
nature of the analytical methods involved
Summary (2/2)
Building models is only one step in knowledge discovery
– It is vital to properly collect and prepare the data, and to check
the models against the real world
– The “best” model is often found after building models of
several different types, or by trying different technologies or
algorithms
Choosing the right data mining products means finding a
tool with good basic capabilities
– an interface that matches the skill level of the people who’ll be
using it, and features relevant to your specific business
problems
– After you’ve narrowed down the list of potential solutions, get a
hands-on trial of the likeliest ones
Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
A Multi-Dimensional View of Data Mining
Classification
• Databases to be mined
– Relational, transactional, object-oriented, object-relational, active, spatial,
time-series, text, multi-media, heterogeneous, WWW, etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining and Business
Intelligence
[Figure: pyramid whose potential to support business decisions increases toward the top]
Layers, top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Techniques Related to Knowledge Discovery in Databases
Architecture of a Typical Data
Mining System
[Figure: layered system architecture]
Components, top to bottom:
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Knowledge base
• Database or data warehouse server
• Data cleaning and data integration; filtering
• Databases; data warehouse
Basic Components and Conceptual Architecture of Data Mining
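Applications of Data Mining in Customer Relationship Management
• For retailers
– Understand customers' spending characteristics, discover their purchasing patterns, and strengthen customer relationships in order to retain customers
• For banks
– Understand the abuses that issuing credit cards may give rise to, and identify the most profitable and most loyal customers
• For insurers
– Analyze the patterns of policyholders' claims and strengthen auditing to prevent fraud
• Benefits
– Effectively increases company revenue at different levels and helps meet business goals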
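Applications of Data Mining in Internet Marketing
• Analyze customers' behavior patterns on the website
– When customers visit a website they often provide a great deal of valuable data, such as personal information, the pages they click, the time they spend on each page, the keywords they use in search engines, and the times at which they visit. Companies can analyze this information to understand customer behavior patterns and thereby raise customer satisfaction with the company's products and services.
• Example application
– Visitors can be characterized by the following traits
  • Geographic segmentation
    – Visitor address, income, and purchasing power
  • Personality traits
    – The visitor's buying style, such as impulsive or budget-conscious
  • Computing equipment used by the visitor
    – Network bandwidth, operating system, browser, or server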
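Applications of Data Mining in Network Intrusion Analysis
• Discover abnormal network behavior
– Traditional analysis of sudden network incidents takes a long time
– Use high-speed computation to analyze abnormal network behavior and to dynamically adjust and update defense mechanisms
• Example application
– Help network administrators carry out advanced network control and dynamically adjust and update defense mechanisms, thereby curbing the potential threat of network intrusion attacks
– Help network administrators build models of normal and abnormal network behavior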
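Applications of Data Mining in E-Learning
• Adaptive e-learning
– Provide suitable learning paths for learners with different backgrounds
– Build a learning concept map to plan students' learning paths
– Analyze grades to understand how test items relate to each other, and derive the corresponding concepts
• Example application
– Use association rule mining techniques
– Analyze learners' grades and understand the relationships among test items
– Derive the relationships among the concepts that correspond to the test items
– Find rules that help domain experts build the learning concept map
– Construct an appropriate course concept map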
Please Do Not Underestimate Data Mining
• Popular application areas of data mining
1. Biotechnology industry and DNA data analysis
2. Financial data analysis
3. Retail data analysis
4. Telecommunications industry
• Data stream mining
• Privacy-preserving mining
• Distributed data mining
• Mining of sequence data, multimedia, Web data
• Biological and biomedical data analysis
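Please Do Not Overestimate Data Mining
• Data mining is not a panacea
• Successful data mining requires domain knowledge and experience
• Applying data mining requires many kinds of experts
Discussion question
– Consider a data mining project at a bank
– The goal is to mine which kinds of people are likely to have bad credit
– Which kinds of experts might be needed?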
Data Mining: Confluence of
Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Information Science, Visualization, and Other Disciplines]
How to Become a Data Mining Expert
• Data mining concepts and techniques
• Domain knowledge
• Experience gained from continual practice