Chapter 7
DATA, TEXT,
AND WEB MINING
Learning Objectives
• Define data mining and list its objectives
and benefits
• Understand different purposes and
applications of data mining
• Understand different methods of data
mining, especially clustering and decision
tree models
• Build expertise in use of some data
mining software
• Learn the process of data mining projects
• Understand data mining pitfalls and
myths
• Define text mining and its objectives and
benefits
• Appreciate use of text mining in business
applications
• Define Web mining and its objectives and
benefits
Data Mining Concepts
and Applications
• Six factors behind the sudden rise in popularity of data mining
1. General recognition of the untapped value in large databases;
2. Consolidation of database records tending toward a single customer view;
3. Consolidation of databases, including the concept of an information warehouse;
4. Reduction in the cost of data storage and processing, providing for the ability to collect and accumulate data;
5. Intense competition for a customer’s attention in an increasingly saturated marketplace; and
6. The movement toward the de-massification of business practices
• Data mining (DM)
A process that uses statistical,
mathematical, artificial intelligence and
machine-learning techniques to extract and
identify useful information and subsequent
knowledge from large databases
• Major characteristics and objectives of data
mining
– Data are often buried deep within very large
databases, which sometimes contain data from
several years; sometimes the data are
cleansed and consolidated in a data
warehouse
– The data mining environment is usually
client/server architecture or a Web-based
architecture
– Sophisticated new tools help to remove the
information ore buried in corporate files or
archival public records; finding it involves
massaging and synchronizing the data to get
the right results.
– The miner is often an end user, empowered by
data drills and other power query tools to ask
ad hoc questions and obtain answers quickly,
with little or no programming skill
– Striking it rich often involves finding an
unexpected result and requires end users to
think creatively
– Data mining tools are readily combined with
spreadsheets and other software development
tools; the mined data can be analyzed and
processed quickly and easily
– Parallel processing is sometimes used
because of the large amounts of data and
massive search efforts
• How data mining works
– Data mining tools find patterns in data and
may even infer rules from them
– Three methods are used to identify patterns in
data:
1. Simple models
2. Intermediate models
3. Complex models
• Classification
Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior
• Common tools used for classification are:
– Neural networks
– Decision trees
– If-then-else rules
• Clustering
• Cluster analysis is an exploratory data analysis tool that aims to sort different objects into groups so that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise.
• Cluster analysis simply discovers structures in data without explaining why they exist.
• The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories.
• Example: classification of people and animals
• Methods: Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering
• k-Means Clustering: the k-means method will produce exactly k different clusters of greatest possible distinction.
• Algorithm:
– Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$
where μi is the mean of points in Si.
See paper.
• Steps of the k-means algorithm:
1) k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color).
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.
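The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`kmeans`, `sq_dist`, `centroid`), the sample points, and the choice to seed the initial means from the observations themselves (rather than from random points in the data domain) are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1 (assumption): use k distinct observations as the initial means.
    means = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: associate every observation with its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, means[i]))
            clusters[nearest].append(p)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster keeps its previous mean).
        new_means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
        # Step 4: repeat until the means stop moving.
        if new_means == means:
            break
        means = new_means
    return means, clusters

# Two well-separated groups of three points each (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(points, k=2)
```

On data this clearly separated, the two clusters recover the two groups whichever observations are drawn as initial means; on real data the result can depend on the initialization.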
• Figure: EM clustering on an artificial dataset ("mouse"). The tendency of k-means to produce equal-sized clusters leads to bad results, while EM benefits from the Gaussian distribution present in the data set.
• EM (Expectation Maximization) Clustering: detects clusters in observations (or variables) and assigns those observations to the clusters.
• A typical example application: a number of variables related to consumer behavior are measured for a large sample of respondents.
• Association
A category of data mining algorithm that establishes relationships about items that occur together in a given record
– These powerful exploratory techniques have a wide range of applications in
many areas of business practice and also research - from the analysis of
consumer preferences or human resource management, to the history of
language.
– These techniques enable analysts and researchers to uncover hidden patterns in
large data sets, such as "customers who order product A often also order product
B or C" or "employees who said positive things about initiative X also frequently
complain about issue Y but are happy with issue Z."
– For example: if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High). Book store recommendations are another example.
– The implementation of the so-called a-priori algorithm (see Agrawal and Swami,
1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten
and Frank, 2000) allows us to process rapidly huge data sets for such
associations, based on predefined "threshold" values for detection.
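As a rough illustration of how threshold-based association mining works, the sketch below computes support and confidence and performs a level-wise (a-priori style) search for frequent itemsets. It is a simplified sketch rather than the algorithm as published in the cited papers; the function names and the `baskets` data are made up for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search: larger candidates are built only from frequent smaller sets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support([i], transactions) >= min_support]
    frequent = {}
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        # Join step: merge pairs of frequent sets that differ in one item.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        # Prune step: keep a candidate only if all its subsets one item smaller
        # are frequent and it meets the support threshold itself.
        level = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, len(c) - 1))
            and support(c, transactions) >= min_support
        ]
    return frequent

# Hypothetical market baskets.
baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
frequent = frequent_itemsets(baskets, min_support=0.5)
rule_conf = confidence({"A"}, {"B"}, baskets)  # how often B accompanies A
```

With a minimum support of 0.5, only the single items and the pair {A, B} survive; the rule "A implies B" then has a confidence of about 0.75.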
Sequence Analysis. Sequence analysis is concerned with a subsequent
purchase of a product or products given a previous buy. For instance, buying an
extended warranty is more likely to follow (in that specific sequential order) the
purchase of a TV or other electric appliances. Sequence rules, however, are not
always that obvious, and sequence analysis helps you to extract such rules no
matter how hidden they may be in your market basket data.
Link Analysis. In retailing or marketing, knowledge of purchase "patterns" can
help with the direct marketing of special offers to the "right" or "ready" customers
(i.e., those who, according to the rules, are most likely to purchase specific items
given their observed past consumption patterns). "Link analysis" is often used
when these techniques - for extracting sequential or non-sequential association
rules - are applied to organize complex "evidence." It is easy to see how the
"transactions" or "shopping basket" metaphor can be applied to situations where
individuals engage in certain actions, open accounts, contact other specific
individuals, and so on.
Unique data analysis requirements. Crosstabulation tables, and in particular
Multiple Response tables
• Visualization can be used in conjunction with data mining to gain a clearer understanding of many underlying relationships
• a-priori algorithm
See paper.
• Regression is a well-known statistical technique that is used to map data to a prediction value
• Forecasting estimates future values based on patterns within large sets of data
• Hypothesis-driven data mining
Begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition
• Discovery-driven data mining
Finds patterns, associations, and relationships among the data in order to uncover facts that were previously unknown or not even contemplated by an organization
• Data mining applications
– Marketing
– Banking
– Retailing and sales
– Manufacturing and production
– Brokerage and securities trading
– Insurance
– Computer hardware and software
– Government and defense
– Airlines
– Health care
– Broadcasting
– Police
– Homeland security
Data Mining
Techniques and Tools
• Data mining tools and techniques can be classified based on the structure of the data and the algorithms used:
– Statistical methods
– Decision trees
Defined as a root followed by internal nodes. Each node (including the root) is labeled with a question, and the arcs associated with each node cover all possible responses
– Case-based reasoning
– Neural computing
– Intelligent agents
– Genetic algorithms
– Other tools
• Rule induction
• Data visualization
• A general algorithm for building a decision tree:
1. Create a root node and select a splitting attribute.
2. Add a branch to the root node for each split candidate value and label it.
3. Take the following iterative steps:
a. Classify data by applying the split value.
b. If a stopping point is reached, then create a leaf node and label it. Otherwise, build another subtree.
• Gini index
Used in economics to measure the diversity of the population. The same concept can be used to determine the ‘purity’ of a specific class as a result of a decision to branch along a particular attribute/variable
Formula:
$$Gini(S) = 1 - \sum_{j=1}^{n} p_j^2$$
where S is a data set that contains examples from n classes and $p_j$ is the relative frequency of class j in S.
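The formula translates directly into a few lines of Python. This is a small sketch; the function name `gini` and the label lists are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini index 1 - sum(p_j^2) over the class frequencies in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini(["High", "High", "High"])          # one class only: index is 0
mixed = gini(["High", "High", "Low", "Low"])   # even two-class mix: index is 0.5
```

A pure set scores 0, and the score grows as the classes become more evenly mixed.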
Example: sample patterns for training a decision tree to predict loan risk

Pattern #   Income   Credit Rating   Loan Risk
0           23       High            High
1           17       Low             High
2           43       Low             High
3           68       High            Low
4           32       Moderate        Low
5           20       High            High

There are only two classes, High and Low. If the data set S has p High elements and n Low elements, the Gini computation is as follows:
$$p_{High} = \frac{p}{p+n}, \qquad p_{Low} = \frac{n}{p+n}$$
$$Gini(S) = 1 - p_{High}^2 - p_{Low}^2$$
If data set S is split into S1 and S2, the splitting index is defined as follows:
$$Gini_{SPLIT}(S) = \frac{p_1 + n_1}{p + n} \, Gini(S_1) + \frac{p_2 + n_2}{p + n} \, Gini(S_2)$$
where $p_1$ and $n_1$ ($p_2$ and $n_2$) denote the numbers of High and Low elements in the data set S1 (S2).
In this definition, the best split point is the one with the lowest value of the $Gini_{SPLIT}$ index. For our example, reorder the data according to income:

Income   Pattern #   Loan Risk
17       1           High
20       5           High
23       0           High
32       4           Low
43       2           High
68       3           Low
Possible values of a split point for the Income attribute are Income<=17, Income<=20, Income<=23, Income<=32, Income<=43, and Income<=68.
Now we can compute the Gini index for each of these splits.
Consider dividing the data at Income<=17. We have the following classification counts:

Pattern Count   High   Low
Income<=17      1      0
Income>17       3      2

So the Gini index for Income<=17 and Income>17 will be:
$G(Income \le 17) = 1 - (\text{proportion of records with High risk})^2 - (\text{proportion of records with Low risk})^2 = 1 - 1^2 - 0^2 = 0$
Similarly,
$G(Income > 17) = 1 - ((3/5)^2 + (2/5)^2) = 12/25$
The Gini index for the split choice is computed as follows:
$G_{SPLIT}$ = (proportion of records at Income<=17) × G(Income<=17) + (proportion of records at Income>17) × G(Income>17)
That is,
$G_{SPLIT} = (1/6) \times 0 + (5/6) \times (12/25) = 2/5$
Now consider the choice Income<=20.

Pattern Count   High   Low
Income<=20      2      0
Income>20       2      2

So the Gini index for Income<=20 and Income>20 will be:
$G(Income \le 20) = 1 - (1^2 + 0^2) = 0$
$G(Income > 20) = 1 - ((2/4)^2 + (2/4)^2) = 1/2$
$G_{SPLIT} = (2/6) \times 0 + (4/6) \times (1/2) = 1/3$
For the split choice Income<=23:

Pattern Count   High   Low
Income<=23      3      0
Income>23       1      2

$G(Income \le 23) = 1 - (1^2 + 0^2) = 0$
$G(Income > 23) = 1 - ((1/3)^2 + (2/3)^2) = 4/9$
$G_{SPLIT} = (3/6) \times 0 + (3/6) \times (4/9) = 2/9$
For the split choice Income<=32:

Pattern Count   High   Low
Income<=32      3      1
Income>32       1      1

$G(Income \le 32) = 1 - ((3/4)^2 + (1/4)^2) = 3/8$
$G(Income > 32) = 1 - ((1/2)^2 + (1/2)^2) = 1/2$
$G_{SPLIT} = (4/6) \times (3/8) + (2/6) \times (1/2) = 5/12$
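The split computations can be checked with a short script that evaluates every candidate Income threshold on the six training patterns. This is a sketch: the helper names are made up for the example, and exact fractions are used so the results can be compared with the hand calculations.

```python
from collections import Counter
from fractions import Fraction

def gini(labels):
    """Exact Gini index of a list of class labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return Fraction(0)
    return 1 - sum(Fraction(count, n) ** 2 for count in Counter(labels).values())

def gini_split(records, threshold):
    """Weighted Gini index of splitting (income, risk) records at income <= threshold."""
    left = [risk for income, risk in records if income <= threshold]
    right = [risk for income, risk in records if income > threshold]
    n = len(records)
    return Fraction(len(left), n) * gini(left) + Fraction(len(right), n) * gini(right)

# (Income, Loan Risk) for patterns 0..5 of the training table.
records = [(23, "High"), (17, "High"), (43, "High"), (68, "Low"), (32, "Low"), (20, "High")]

scores = {t: gini_split(records, t) for t in sorted(income for income, _ in records)}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The minimum falls at Income<=23 with a split index of 2/9, which is the split the example goes on to use.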
The lowest value of $G_{SPLIT}$ is for Income<=23, so we take 23 and the next higher income value, 32, and average them. Thus, we have a split point at Income = (23+32)/2 = 27.5.
Attribute lists are divided at the split point. That is, we expect to have a rule that
says:
If Income<=27.5
Then
Else if Income>27.5
Then
The following is the attribute list for Income<=27.5:

Income   Pattern #   Loan Risk   Credit Rating
17       1           High        Low
20       5           High        High
23       0           High        High

So the conclusion is that if Income<=27.5, the loan risk is High.
But what about Income>27.5?
The following table suggests that Income>27.5 is not a definitive indicator of Loan Risk.

Income   Pattern #   Loan Risk   Credit Rating
32       4           Low         Moderate
43       2           High        Low
68       3           Low         High

So we can examine Credit Rating to develop the subtree for the Income>27.5 case.
However, Credit Rating is a categorical variable. The rules for a categorical variable are slightly different from those for a continuous variable. The Gini index formula will be
$$Gini(\text{two proportions}) = 1 - p_{\text{one proportion}}^2 - p_{\text{the other proportion}}^2$$
In the case of a categorical variable, one proportion is the set of records with Credit Rating = {Low}, and the other is the set of records with Credit Rating = not {Low}, that is, {Moderate, High}. Thus we have to compute the proportion of each category and its complement.

Pattern Count                Loan Risk High   Loan Risk Low
Credit Rating={Low}          1                0
Credit Rating={Moderate}     0                1
Credit Rating={High}         0                1

First, compute the Gini index for each category:
$G(\text{Credit Rating} = \{Low\}) = 1 - 1^2 - 0^2 = 0$
$G(\text{Credit Rating} = \{Moderate\}) = 1 - 0^2 - 1^2 = 0$
$G(\text{Credit Rating} = \{High\}) = 1 - 0^2 - 1^2 = 0$
Next, compute the Gini index for the complement categories:
$G(\text{Credit Rating} \in \{Low, Moderate\}) = 1 - (1/2)^2 - (1/2)^2 = 1/2$
$G(\text{Credit Rating} \in \{Low, High\}) = 1/2$
$G(\text{Credit Rating} \in \{Moderate, High\}) = 1 - 0^2 - 1^2 = 0$
Third, compute the Gini index for the possible branches.
For the branch choice Credit Rating = {Low} versus Credit Rating = {Moderate, High}, we would have
$G_{SPLIT}$ = (proportion of records with Credit Rating = Low) × G(Credit Rating = {Low})
+ (proportion of records with Credit Rating = Moderate or High) × G(Credit Rating = {Moderate, High})
$G_{SPLIT}(\text{Credit Rating} = \{Low\}) = (1/3) \times 0 + (2/3) \times 0 = 0$
Last, compute the Gini index for the other categories:
$G_{SPLIT}(\text{Credit Rating} = \{Moderate\}) = (1/3) \times 0 + (2/3) \times (1/2) = 1/3$
$G_{SPLIT}(\text{Credit Rating} = \{High\}) = (1/3) \times 0 + (2/3) \times (1/2) = 1/3$
$G_{SPLIT}(\text{Credit Rating} = \{Low, Moderate\}) = (2/3) \times (1/2) + (1/3) \times 0 = 1/3$
$G_{SPLIT}(\text{Credit Rating} = \{Low, High\}) = (2/3) \times (1/2) + (1/3) \times 0 = 1/3$
$G_{SPLIT}(\text{Credit Rating} = \{Moderate, High\}) = (2/3) \times 0 + (1/3) \times 0 = 0$
The lowest value of the Gini index for the split is zero, at Credit Rating = {Low} versus Credit Rating = {Moderate, High}. Thus this is the split point, and these are the next branches of the subtree. See figure.
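The categorical case can be sketched the same way by enumerating every grouping of Credit Rating values against its complement. The records are the three Income>27.5 patterns (4, 2, and 3) as listed in the training table; the helper names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def gini(labels):
    """Exact Gini index of a list of class labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return Fraction(0)
    return 1 - sum(Fraction(count, n) ** 2 for count in Counter(labels).values())

def gini_split(records, group):
    """Weighted Gini index of splitting on rating-in-group versus its complement."""
    inside = [risk for rating, risk in records if rating in group]
    outside = [risk for rating, risk in records if rating not in group]
    n = len(records)
    return Fraction(len(inside), n) * gini(inside) + Fraction(len(outside), n) * gini(outside)

# (Credit Rating, Loan Risk) for patterns 4, 2, and 3 of the training table.
subset = [("Moderate", "Low"), ("Low", "High"), ("High", "Low")]

ratings = sorted({rating for rating, _ in subset})
groups = [frozenset(g) for size in range(1, len(ratings)) for g in combinations(ratings, size)]
scores = {g: gini_split(subset, g) for g in groups}
best_score = min(scores.values())
best_groups = [set(g) for g, s in scores.items() if s == best_score]
```

Each grouping and its complement describe the same partition, so the zero-valued split shows up twice: once as {Low} and once as {High, Moderate}.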
• The ID3 algorithm decision tree approach
– Entropy
Measures the extent of uncertainty or
randomness in a data set. If all the data in a
subset belong to just one class, then there is
no uncertainty or randomness in that dataset,
therefore the entropy is zero
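The slide gives no formula, so the sketch below assumes the usual Shannon entropy with a base-2 logarithm, which is the measure commonly associated with ID3. The function name and sample labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

single_class = entropy(["High", "High", "High"])   # no uncertainty
even_mix = entropy(["High", "Low"])                # maximal for two classes
loan_risk = entropy(["High"] * 4 + ["Low"] * 2)    # the six training patterns
```

A subset with one class has entropy 0, an even two-class mix has entropy 1 bit, and the loan-risk training set (four High, two Low) lies in between.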
• Cluster analysis for data mining
– Cluster analysis is an exploratory data
analysis tool for solving classification
problems
– The object is to sort cases into groups so that
the degree of association is strong between
members of the same cluster and weak
between members of different clusters
• Cluster analysis results may be used to:
– Help identify a classification scheme
– Suggest statistical models to describe
populations
– Indicate rules for assigning new cases to
classes for identification, targeting, and
diagnostic purposes
– Provide measures of definition, size, and
change in what were previously broad
concepts
– Find typical cases to represent classes
• Cluster analysis methods
– Statistical methods
– Optimal methods
– Neural networks
– Fuzzy logic
– Genetic algorithms
• Each of these methods generally works with one of two general method classes:
– Divisive
– Agglomerative
• Hierarchical clustering method and example
1. Decide which data to record from the items
2. Calculate the distances between all initial clusters.
Store the results in a distance matrix
3. Search through the distance matrix and find the two
most similar clusters
4. Fuse those two clusters together to produce a cluster
that has at least two items
5. Calculate the distances between this new cluster and
all the other clusters
6. Repeat steps 3 to 5 until you have reached the
prespecified maximum number of clusters
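The agglomerative procedure above can be sketched as follows. Two simplifications are assumed here: the distance between clusters is taken as the smallest pairwise item distance (single linkage), and distances are recomputed on each pass instead of being stored in a matrix. The function name and sample values are made up.

```python
def agglomerate(items, max_clusters, distance):
    """Fuse the two closest clusters until only max_clusters remain."""
    clusters = [[x] for x in items]  # every item starts as its own cluster

    def linkage(a, b):
        # Assumption: single linkage, the smallest distance between members.
        return min(distance(x, y) for x in a for y in b)

    while len(clusters) > max_clusters:
        # Steps 2-3: find the two most similar clusters.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        # Step 4: fuse them into one cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        # Step 5: distances to the new cluster are recomputed on the next pass.
    return clusters

values = [1, 2, 9, 10, 25]
result = agglomerate(values, max_clusters=3, distance=lambda a, b: abs(a - b))
```

For these values the nearby pairs (1, 2) and (9, 10) are fused first, leaving 25 as a cluster of its own.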
• Classes of data mining tools and techniques as they relate to information and business intelligence (BI) technologies
– Mathematical and statistical analysis packages
– Personalization tools for Web-based marketing
– Analytics built into marketing platforms
– Advanced CRM tools
– Analytics added to other vertical industry-specific platforms
– Analytics added to database tools (e.g., OLAP)
– Standalone data mining tools
Data Mining Project Processes
• Knowledge discovery in databases (KDD)
A comprehensive process of using data
mining methods to find useful information
and patterns in data
• KDD process
1. Selection
2. Preprocessing
3. Transformation
4. Data mining
5. Interpretation/evaluation
Text Mining
• Text mining
Application of data mining to
nonstructured or less structured text files.
It entails the generation of meaningful
numerical indices from the unstructured
text and then processing these indices
using various data mining algorithms
• Text mining helps organizations:
– Find the “hidden” content of documents, including additional useful relationships
– Relate documents across previously unnoticed divisions
– Group documents by common themes
• Applications of text mining
– Automatic detection of e-mail spam or
phishing through analysis of the document
content
– Automatic processing of messages or e-mails
to route a message to the most appropriate
party to process that message
– Analysis of warranty claims, help desk
calls/reports, and so on to identify the most
common problems and relevant responses
– Analysis of related scientific publications in
journals to create an automated summary
view of a particular discipline
– Creation of a “relationship view” of a
document collection
– Qualitative analysis of documents to detect
deception
• How to mine text
1. Eliminate commonly used words (stop-words)
2. Replace words with their stems or roots
(stemming algorithms)
3. Consider synonyms and phrases
4. Calculate the weights of the remaining terms
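The four steps above might look like this in a toy pipeline. Everything specific here is an assumption for illustration: the tiny `STOP_WORDS` set, the `SYNONYMS` map, the crude suffix-stripping `stem` function (real stemmers are far more careful), and the use of TF-IDF as the weighting scheme, which the slide does not name.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for"}  # illustrative
SYNONYMS = {"mail": "email"}  # hypothetical synonym map

def stem(word):
    """Crude stemming: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    words = re.findall(r"[a-z]+", text.lower())
    words = [w for w in words if w not in STOP_WORDS]   # step 1: drop stop-words
    words = [stem(w) for w in words]                     # step 2: stemming
    return [SYNONYMS.get(w, w) for w in words]           # step 3: merge synonyms

def tfidf(documents):
    """Step 4 (assumed scheme): weight = term frequency x log(N / document frequency)."""
    doc_terms = [Counter(terms(d)) for d in documents]
    n = len(documents)
    doc_freq = Counter(t for counts in doc_terms for t in counts)
    return [{t: tf * math.log(n / doc_freq[t]) for t, tf in counts.items()}
            for counts in doc_terms]

docs = ["Claims for broken screens", "The screen is broken", "Routing emails to agents"]
weights = tfidf(docs)
```

Terms that appear in fewer documents receive higher weights, so "claim" outweighs "screen" in the first document.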
Web Mining
• Web mining
The discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools
• Web content mining
The extraction of useful information from Web pages
• Web structure mining
The development of useful information from the links included in the Web documents
• Web usage mining
The extraction of useful information from the data being generated through webpage visits, transactions, etc.
• Uses for Web mining:
– Determine the lifetime value of clients
– Design cross-marketing strategies across
products
– Evaluate promotional campaigns
– Target electronic ads and coupons at user
groups
– Predict user behavior
– Present dynamic information to users