Section 5
Data Mining
Section Content
• 5.1 Introduction
• 5.2 Knowledge Discovery
• 5.3 Association Rules
• 5.4 Sequential Patterns
• 5.5 Classification and Regression
• 5.6 Other Forms of Data Mining
• 5.7 Applications of Data Mining
5.1 Data Mining Introduction
• Data mining:
+ the discovery of new information in terms of patterns or rules from huge
amounts of data
+ mining tools should identify these patterns, rules and trends with minimal user
input
+ data mining is related to
• statistics: exploratory data analysis
• artificial intelligence: knowledge discovery and machine learning
+ techniques from machine learning, statistics, neural networks and genetic
algorithms are used
+ due to the vastness of the amount of data, efficiency/scalability of data
mining algorithms is a key issue
Data Mining and Data Warehousing
• The goal of data warehousing is to support decision making with data.
• Data mining can be used in conjunction with a data warehouse to help with
certain types of decisions.
• Data mining helps to extract new patterns/rules that cannot be
found by merely querying or processing data.
• Aggregated or summarised collections of data in warehouses
improve the efficiency of data mining in these cases.
• The potential use of data mining needs to be considered early in the
design of a data warehouse.
5.2 Knowledge Discovery
• Data mining is part of the knowledge discovery process:
data selection
data cleansing
data transformation / encoding
data mining
reporting and display
• Example:
+ Database: Transaction database for a goods retailer
+ Client data: name, zip code, phone, date of purchase, item code, price,
quantity, total amount
Knowledge Discovery - Example
• New knowledge can be discovered from the client data
+ data selection:
• data about specific items or categories of items
• items from stores in specific regions
+ data cleansing:
• correct incorrect zip codes
• eliminate records with incorrect phone numbers
+ enrichment: add additional information
• age, income, credit rating of client
+ data transformation: reduce the amount of data
• group items into product categories
• group zip codes into regions
Data Mining - Knowledge Discovery
• Data mining might discover
+ co-occurrences - items that are typically bought together
+ association rules - when a customer buys video equipment, he/she also buys
another electronic gadget
+ sequential patterns - when a customer buys a camera, then within 3 months
he/she buys photographic supplies
+ classification trees - customers can be classified by frequency of visits, types
of finance used, etc. combined with statistics about the classes
• This information can then be used, for example, to
+ optimise store locations
+ run promotions
+ plan seasonal marketing strategies
Goals of Data Mining
• Prediction
+ show how certain attributes within the data will behave in the future
+ example: predict what customers will buy under certain discounts
+ example: predict sales volume for some period
• Identification
+ data patterns can be used to identify the existence of an item, an event, or an
activity
+ example: detecting intruders by the commands they execute
Goals of Data Mining
• Classification
+ partition data such that different classes or categories can be identified
+ example: customers can be categorised into regular and infrequent shoppers,
into discount-seeking customers etc.
+ categorisation - e.g. into food categories - can reduce the complexity of data
• Optimisation
+ optimise the use of limited resources (time, space, money, etc)
+ example: what are the best products to spend our money on over the next
three months?
Types of Knowledge Discovered
• Co-occurrences
+ collection of items/actions/events that occur together
+ example: items that are bought together by a consumer in a shop
• Association rules
+ correlation of a set of items with a range of values for another set of
variables
+ example: when someone buys bread, he/she is likely to buy cheese
• Classification hierarchies
+ create a hierarchy of classes from an existing set of events or transactions
+ example: customers might be divided into a credit worthiness hierarchy based
on their previous credit transactions
Types of Knowledge Discovered
• Sequential patterns
+ search for a sequence of events or actions
+ example: a patient that underwent cardiac surgery and later developed high
blood urea, is likely to suffer from kidney problems
• Patterns within time series
+ detection of similarities within positions of the time series
+ example: a pattern in a time series of stock market prices may be used to
predict employment rates
• Categorisation and segmentation
+ partition a set of events or items into segments/categories/classes
+ example: treatment data on a disease can be partitioned into groups based on
the side effects that are caused
Counting Co-occurrences
• The problem is to count co-occurring itemsets - motivated by
market basket analysis.
• A database of consumer transactions forms the basis
+ transaction: a single visit to a store, an order at a virtual store (Web site), or
a single order through a mail-order catalog
+ a transaction consists of a transaction ID, customer ID, date, item and
• The goal is to identify items that are typically purchased together.
• This can be used to improve the layout of shops or catalogs.
Frequent Itemsets (1)
• Consider the following transaction table:
Date     Items bought
11/09    milk, bread, juice
12/09    milk, juice
14/09    milk, eggs
14/09    bread, coffee, biscuits
Items bought in one visit are already grouped together into
• Support of an itemset: the fraction of transactions that contain all
items in the itemset
• Examples
+ {milk, juice} has a support of 50 %
+ {bread, coffee} has a support of 25 %
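The support figures above can be checked with a few lines of Python. This is a minimal sketch: the four transactions are taken from the table, while the function name `support` and the use of Python sets are illustrative choices, not part of the original material.

```python
# Transactions from the table above; each visit is one set of items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


print(support({"milk", "juice"}, transactions))    # 0.5
print(support({"bread", "coffee"}, transactions))  # 0.25
```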
Frequent Itemsets (2)
• Large itemsets are itemsets that have a certain minimum support,
i.e. are itemsets that occur frequently.
• Example:
+ for a minimum support of 40%, the large itemsets are {milk, juice}, {milk},
{juice}, {bread}
• Proposition:
+ every subset of a large itemset is also a large itemset
• Algorithm:
+ large itemsets can be computed incrementally
+ start with itemsets of cardinality 1 that have the required support
5.3 Association Rules
• A database can be regarded as a collection of transactions.
• Each transaction involves a set of items.
• Example: the items in a basket that a shopper uses in a supermarket
Items bought
milk, bread, juice
milk, juice
milk, eggs
bread, coffee, biscuits
Association Rules
• An association rule is of form X => Y where X and Y are two
disjoint sets of items
• Example:
+ for sets of goods as itemsets X and Y, the expression X => Y means that if a
customer buys X, he/she is also likely to buy Y.
+ if the customer buys milk, he/she is also likely to buy juice.
• The support for a rule X => Y is the percentage of transactions
that hold all of the items in the union X ∪ Y.
• Examples:
+ Milk => Juice has 50% support
+ Bread => Juice has 25% support
Association Rules
• The confidence of a rule X => Y is the percentage (fraction) of all
transactions including X that also include Y.
• Example:
+ the rule Milk => Juice has confidence 66.7%
+ that means that 2/3 of all transactions with milk also include juice
• Note that support and confidence might be different.
• The goal is to discover rules with a certain minimum support and
confidence.
• These rules can be used for prediction: for a rule
Pen => Ink
offer discounts on pens and you might increase ink sales.
Association Rules
• How to compute these rules?
+ Generate large itemsets (itemsets with a certain minimum support)
+ For each large itemset X, generate all rules with a certain minimum confidence
for X and Y ⊂ X, let Z = X - Y
(divide X into Y and Z)
if support(X) / support(Y) > mconf then
Y => Z is a valid rule
the confidence of rule Y => Z is defined as support(X) / support(Y)
+ Example:
for X={milk, juice} and Y={milk} ⊂ {milk, juice},
let Z={juice}
X, Y, Z have support 50%, 75% and 50%, resp. (see the supports under Frequent Itemsets)
for mconf=40% {milk} => {juice} is a valid rule with confidence 66.7% ( 50/75 )
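The rule-generation step can be sketched in Python as follows. The sketch assumes that X is already known to be a large itemset (so no subset has support zero) and uses the strict comparison from the pseudocode; the names `support` and `rules_from_itemset` are hypothetical.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(x, mconf):
    """Return (Y, Z, confidence) for every split of X into Y => Z above mconf."""
    x = frozenset(x)
    found = []
    for size in range(1, len(x)):              # proper, non-empty subsets Y of X
        for y in combinations(sorted(x), size):
            y = frozenset(y)
            z = x - y                          # Z = X - Y
            confidence = support(x) / support(y)
            if confidence > mconf:
                found.append((y, z, confidence))
    return found


for y, z, conf in rules_from_itemset({"milk", "juice"}, mconf=0.4):
    print(sorted(y), "=>", sorted(z), f"{conf:.1%}")
# ['juice'] => ['milk'] 100.0%
# ['milk'] => ['juice'] 66.7%
```

Note that the same itemset yields two rules with different confidence, because the denominator support(Y) differs.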
Generating Association Rules
• In principle, generating rules based on large itemsets and their
support is straightforward.
• Computing all large itemsets and their support creates an efficiency
problem if the number of items is very high.
• If m is the number of items, then 2^m is the number of different
itemsets.
• Example: a typical supermarket might have several thousands of
items.
+ Computing the support of all itemsets might take a long time.
+ Reducing the combinatorial search space is therefore important - the following
properties can be used:
• subsets of large itemsets are large
• extensions of small itemsets are small
Association Rules - Algorithms
• Outline of an algorithm that finds large itemsets:
• Step 1:
+ test the support for itemsets of length 1 - called 1-itemsets - by scanning the
database;
+ discard those that do not meet the minimum requirement.
• Step 2:
+ extend large 1-itemsets into 2-itemsets by appending one item each time (this
generates all itemsets of length two);
+ test the support and eliminate all 2-itemsets that do not meet the minimum
support.
• Step 3:
+ repeat the above steps: extend (k-1)-itemsets into k-itemsets.
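The three steps can be sketched as a level-wise search in Python. This is an illustrative sketch only: it rescans the whole transaction list for every candidate and does no further candidate pruning, and the function name `large_itemsets` is an assumption. The data is the transaction table from the Frequent Itemsets slides.

```python
def large_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: keep the 1-itemsets that meet the minimum support.
    level = {}
    for item in {item for t in transactions for item in t}:
        candidate = frozenset([item])
        sup = support(candidate)
        if sup >= min_support:
            level[candidate] = sup
    large_items = [next(iter(s)) for s in level]

    result = {}
    while level:
        result.update(level)
        # Steps 2 and 3: extend each large (k-1)-itemset by one large item,
        # then discard the candidates that are not large themselves.
        candidates = {s | {item} for s in level for item in large_items if item not in s}
        level = {}
        for candidate in candidates:
            sup = support(candidate)
            if sup >= min_support:
                level[candidate] = sup
    return result


transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]
for itemset, sup in large_itemsets(transactions, 0.4).items():
    print(sorted(itemset), sup)
```

With a minimum support of 40% this yields {milk}, {juice}, {bread} and {milk, juice}. Extending only large itemsets with large items relies on the property that extensions of small itemsets are small.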
Association Rules among Hierarchies
• Items might be divided among disjoint hierarchies based on some
classification, e.g. Beverage can be divided into Juice and Milk
• Associations might occur among the hierarchies of items.
• Example: healthy frozen yoghurt => bottled water
• Particularly interesting are associations across hierarchies.
+ this kind of information can be used to arrange different kinds of items in a
Negative Associations
• Negative associations are more difficult to detect than positive
ones.
• Example: 60% of customers who buy crisps do not buy bottled
water.
• There are usually more negative associations than positive ones.
• The majority of itemset combinations do not occur in databases.
• Finding interesting negative associations can be difficult.
Association Rules - Additional Considerations
• Sampling:
+ For very large databases, sampling improves efficiency.
+ Truly representative samples can help to find most of the rules.
+ The danger is that
• false positives might be discovered (large itemsets that are not truly large);
• true positives might be missing.
• Other problems:
+ Cardinality of itemsets and volume of transactions can be very high.
+ Variability of transactions (geographical, seasonal) makes sampling difficult.
+ Multiple classifications along different dimensions.
5.4 Sequential Patterns
• Sequential patterns are based on sequences of itemsets.
• Assume transactions to be ordered by time.
• Example:
+ transactions in a supermarket
+ {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} may be based on
three visits of a customer
• A subsequence of a sequence is obtained by deleting one or more
itemsets.
• Example:
+ let {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} be the original
sequence
+ {milk, bread, juice} ; {bread, eggs} is a subsequence
+ {milk, bread, juice} ; {milk, coffee, biscuits} is a subsequence
Support for Sequences
• A sequence a1 ; ... ; an is contained in another sequence S if
S has a subsequence b1 ; ... ; bn such that ai ⊆ bi for 1 <= i <= n
• Example:
+ {milk, bread} ; {coffee, biscuits} is contained in {milk, bread, juice} ; {bread,
eggs} ; {milk, coffee, biscuits}
• The support of a sequence S is the percentage of a set of given
sequences that contain S as a subsequence.
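Containment and support for sequences can be sketched in Python as below. The first customer sequence is the one used in the example; the other three are hypothetical and are added only so that the support is not trivially 100%. The function names are illustrative.

```python
def contains(sequence, pattern):
    """True if `pattern` is contained in `sequence` (both are lists of itemsets)."""
    pos = 0
    for itemset in pattern:
        # Advance to the next itemset of the sequence that covers this one.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # later pattern itemsets must match strictly later visits
    return True


def sequence_support(pattern, sequences):
    """Fraction of the given sequences that contain `pattern`."""
    return sum(1 for s in sequences if contains(s, pattern)) / len(sequences)


sequences = [
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"milk", "coffee", "biscuits"}],
    [{"milk", "bread"}, {"juice"}],                             # hypothetical
    [{"milk", "bread", "eggs"}, {"coffee", "biscuits", "juice"}],  # hypothetical
    [{"coffee", "biscuits"}, {"milk", "bread"}],                # hypothetical
]
pattern = [{"milk", "bread"}, {"coffee", "biscuits"}]

print(contains(sequences[0], pattern))       # True
print(sequence_support(pattern, sequences))  # 0.5
```

The last hypothetical sequence shows that order matters: it holds both itemsets, but in the wrong order, so it does not contain the pattern.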
Discovery of Patterns in Time Series
• Time series are sequences of events.
• An event might be a fixed type of transaction.
• Example:
+ closing price of a stock or fund each day.
• Analysis of time series:
+ find a period of time in which the stock did not fluctuate by more than 1%
+ find the period (week/month/quarter) with the greatest loss
+ identify stocks with similar behaviour
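Two of these analyses can be sketched in Python. The closing prices are hypothetical, and the sketch assumes one particular reading of the tasks: "fluctuation" is taken to be the relative day-to-day change, and a "period" is a window of a fixed number of days.

```python
# Hypothetical daily closing prices of one stock.
prices = [100.0, 100.5, 100.2, 103.0, 99.0, 98.5, 98.9, 99.3, 104.0]


def longest_calm_run(prices, limit=0.01):
    """Return (start, end) indices of the longest run of days whose
    day-to-day relative change never exceeds `limit`."""
    best = (0, 0)
    start = 0
    for i in range(1, len(prices)):
        change = abs(prices[i] - prices[i - 1]) / prices[i - 1]
        if change > limit:
            start = i  # the run is broken; start again at day i
        if i - start > best[1] - best[0]:
            best = (start, i)
    return best


def worst_window(prices, length):
    """Return (start, loss) for the window of `length` days with the largest drop."""
    best_start, best_loss = 0, float("-inf")
    for start in range(len(prices) - length):
        loss = prices[start] - prices[start + length]
        if loss > best_loss:
            best_start, best_loss = start, loss
    return best_start, best_loss


print(longest_calm_run(prices))  # (4, 7)
print(worst_window(prices, 2))   # (3, 4.5)
```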
5.5 Classification and Regression
• Classification Rules
• Regression
• Tree-structured Rules
Discovery of Classification Rules
• Classification means defining/identifying a function that maps an
object into one of many possible classes.
• Example: a bank wants to classify loan applicants into “loanworthy”
and “not loanworthy”
+ a classification rule could define the classification
• not loanworthy: current monthly debt obligation exceeds 25% of monthly net
income
• loanworthy: otherwise
+ loanworthiness is a dependent, categorical attribute
• In general there is one rule (set) per class
(var1 in range1) and ... and (varn in rangen)
=> object O in class C1
var1 , ..., varn are the predictor attributes
Support and Confidence
• Again we can define support and confidence for these rules.
• The support for a classification condition C is the percentage of
tuples that satisfy C.
• The support for a rule C1 => C2 is the support for the condition
C1 ∧ C2. (C1 AND C2 is the set of objects in both C1 and C2.)
• Consider those tuples that satisfy condition C1. The confidence for
a rule C1 => C2 is the percentage of such tuples that also satisfy
condition C2.
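Support and confidence for a classification rule can be computed as in the following Python sketch. The applicant tuples are hypothetical, and the sketch assumes that the actual outcome of each past loan is recorded, so that the rule "debt exceeds 25% of net income => not repaid" can be checked against the data.

```python
# Hypothetical past applicants: monthly debt, monthly net income, loan repaid?
applicants = [
    {"debt": 500, "income": 4000, "repaid": True},
    {"debt": 1500, "income": 4000, "repaid": False},
    {"debt": 900, "income": 3000, "repaid": False},
    {"debt": 200, "income": 2500, "repaid": True},
    {"debt": 1200, "income": 3000, "repaid": True},
]


def c1(t):
    """Condition C1: monthly debt exceeds 25% of monthly net income."""
    return t["debt"] > 0.25 * t["income"]


def c2(t):
    """Condition C2: the loan was not repaid."""
    return not t["repaid"]


n = len(applicants)
support_c1 = sum(1 for t in applicants if c1(t)) / n
support_rule = sum(1 for t in applicants if c1(t) and c2(t)) / n
confidence = support_rule / support_c1

print(support_c1)    # 0.6
print(support_rule)  # 0.4
print(f"{confidence:.1%}")  # 66.7%
```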
Regression (1)
• Regression is similar to classification, except that the dependent
variable is numerical (and not categorical).
• Rules (such as classification rules) can be regarded as functions.
• A regression rule is a function that maps variables into a target
class variable.
• Example:
LabTest(patientID, test1, ... , testn)
+ the values in that relation result from a series of lab tests
+ the target variable P is the probability of survival - a numerical variable
+ the regression rule:
(test1 in range1) and ... and (testn in rangen) => P = x
+ the regression function is P = f(test1, ... , testn)
Regression (2)
• If P appears as a function y = f(x1, ... , xn)
and f is linear in the domain variables,
then the process of deriving f from a given set of
tuples <x1, ... , xn, y> is called linear regression.
• Linear regression is a common statistical technique.
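For the special case of a single predictor x, the least-squares line y = a*x + b can be derived in a few lines of Python. This is a minimal sketch under that assumption (the slides allow n predictors); the sample tuples are hypothetical and lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical tuples <x, y>.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```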
Tree-Structured Rules
• Specific classification and regression rules shall now be examined.
• These are rules that can be represented as trees - called
classification trees or decision trees.
• These trees are typically the output of the data mining activity.
• Each path from the root to a leaf node represents one classification
rule.
• Example: Insurance risk determination for motor insurance
[Figure: decision tree with splits on Age (<= 25 / > 25) and Car Type]
Decision Trees
• A decision tree is a graphical representation of a collection of
classification rules.
• Each internal node in the tree is labelled with a predictor or splitting attribute.
• Each outgoing edge of an internal node is labelled with a predicate
that involves the splitting attribute.
• Each leaf node is labelled with a value of the dependent attribute.
• A classification rule can be associated with each leaf node, constructed as the conjunction of the predicates on the path from the root to that leaf:
+ Age <= 25 and Car Type = sports for the YES-leaf
• Decision trees are constructed in two phases:
+ growth phase: create tree based on specialised rules from an input database
+ pruning phase: reduce tree size by generalising rules
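The relationship between a decision tree and its classification rules can be sketched in Python. Only the YES-leaf (Age <= 25 and Car Type = sports) is taken from the example; the two NO-leaves and the tuple-based tree encoding are assumptions made for illustration. The sketch does not cover the growth and pruning phases.

```python
# Internal node: (splitting attribute, [(edge label, predicate, subtree), ...]).
# Leaf node: a string holding the value of the dependent attribute.
tree = ("Age", [
    ("Age <= 25", lambda r: r["Age"] <= 25, ("Car Type", [
        ("Car Type = sports", lambda r: r["Car Type"] == "sports", "YES"),
        ("Car Type != sports", lambda r: r["Car Type"] != "sports", "NO"),  # assumed leaf
    ])),
    ("Age > 25", lambda r: r["Age"] > 25, "NO"),  # assumed leaf
])


def classify(node, record):
    """Follow the edges whose predicates hold until a leaf is reached."""
    while not isinstance(node, str):
        _, branches = node
        node = next(sub for _, pred, sub in branches if pred(record))
    return node


def rules(node, path=()):
    """One rule per leaf: the conjunction of the edge predicates on its path."""
    if isinstance(node, str):
        return [(" and ".join(path), node)]
    found = []
    for label, _, sub in node[1]:
        found.extend(rules(sub, path + (label,)))
    return found


print(classify(tree, {"Age": 23, "Car Type": "sports"}))  # YES
for condition, outcome in rules(tree):
    print(condition, "=>", outcome)
```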
5.6 Other Types of Data Mining
• Neural Networks
• Genetic Algorithms
• Clustering and Segmentation
Neural Networks
• Techniques from artificial intelligence can be used to generalise
regression.
• Neural networks provide an iterative method to carry out this
generalised regression.
• Neural networks use a curve-fitting approach to infer a function
from a set of samples.
• This process is based on learning: a test sample is the initial input,
the system then incrementally infers functions based on more
samples.
• Neural networks can be applied to classification problems.
• Modelling time series with neural networks is difficult.
Genetic Algorithms (1)
• Genetic algorithms (GA) are a class of randomised search
procedures for adaptive and robust search over a wide range of
search topologies.
• Principle:
+ Genetic algorithms extend the idea of characterising human DNA by a four-letter alphabet (A,C,T,G).
• Construction:
+ Devise an alphabet that allows the encoding of a solution to the decision
problem in terms of strings of that alphabet.
• Usage:
+ Study the cutting and combination of strings (compare natural reproduction
and evolution).
+ New generations of individuals (solutions) are generated and assessed: survival of the fittest.
Genetic Algorithms (2)
• Generation of solutions - comparison with other techniques.
+ GA search uses a set of solutions during each generation rather than a single
solution.
+ The search in the string-space represents a much larger parallel search in the
space of encoded solutions.
+ The memory of the search completed is represented solely by the set of
solutions available for generation.
+ A GA is a randomised algorithm since search mechanisms use probabilistic
operators.
+ While progressing from one generation to the next, a GA finds a near-optimal
balance between knowledge acquisition and exploitation by manipulating
encoded solutions.
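The ideas of encoding, cutting and combining strings, and survival of the fittest can be sketched in Python. Everything specific here is an assumption chosen for illustration: a two-letter alphabet {0, 1}, a toy objective that counts the 1s in the string, one-point crossover, a small mutation rate, and a fixed random seed so that runs are repeatable.

```python
import random


def fitness(bits):
    """Toy objective (an assumption): the number of 1s in the string."""
    return sum(bits)


def crossover(a, b, rng):
    """Cut both parents at one random point and combine the pieces."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(bits, rng, rate=0.02):
    """Flip each position with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(length=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # survival of the fittest
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [parents[0]] + children   # keep the best solution unchanged
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

The only memory of the search is the current population, and each generation works on a whole set of solutions rather than a single one. Keeping the best individual unchanged is a design choice of this sketch; it guarantees that the best fitness never decreases between generations.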
Clustering and Segmentation
• Clustering is about identification and classification.
• Clustering tries to identify categories (or clusters) to which a data
object can be mapped.
• The categories can be disjoint or might overlap; they might be
organised into trees.
• A related problem: multivariate probability density functions.
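One way to identify disjoint clusters is the k-means procedure, which the slides do not name; it is used here only as an illustrative stand-in. The sketch works on one-dimensional hypothetical values and takes the initial cluster centres as an argument so that the result is repeatable.

```python
def kmeans_1d(values, centres, iterations=10):
    """Assign each value to its nearest centre, then move each centre to the
    mean of its cluster; repeat for a fixed number of iterations."""
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters


# Hypothetical data objects described by a single numeric attribute.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centres, clusters = kmeans_1d(values, centres=[0.0, 5.0])
print(centres)   # [2.0, 11.0]
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```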
5.7 Applications of Data Mining
• Decision-making contexts:
+ marketing:
• analysis of customer behaviour based on buying patterns;
• determination of marketing strategies (store locations, advertising campaigns,
etc.);
• segmentation of customers, stores, products.
+ finance:
• analysis of creditworthiness of clients;
• performance analysis of financial investments;
• evaluation of financing options;
• fraud detection.
+ Manufacturing:
• optimisation of resources (machines, manpower, material);
• optimal design of manufacturing process, shop-floor layout, etc.
+ Health care:
• analysis of effectiveness of certain treatments;
• optimisation of processes in a hospital;
• analysing side effects of drugs;
• relating patient wellness and doctor qualifications.