Information Retrieval
from Data Bases
for Decisions
Dr. Gábor SZŰCS, Ph.D.
Assistant professor
BUTE, Department of Information
and Knowledge Management
Contents
- Aims
- General steps in the procedure
- Market basket analysis
- Frequent itemsets
- Conclusion
Aims
- search for hidden relationships in existing databases (DBs)
- help to make well-grounded decisions
Data mining techniques are able to find such relationships:
- they provide the ability to optimize decision-making
- they are among the most powerful tools for retrieving important information

Steps of data mining
1. Declaration of the key and the predictor variables to be analysed
   (sampling from a large amount of data).
2. Modification of variables, where we should examine whether some
   variables should be integrated (large DBs always contain some mistakes;
   some transformations should be executed).
Additional steps of data mining
3. Modelling with data mining techniques: neural networks, decision trees,
   regression procedures, cluster analysis, factor analysis, discriminant
   analysis, etc.
4. Comparison of the data mining models built on the same DB (the best
   model can be selected).
The procedure can be repeated cyclically.
After the whole procedure, the hidden relationships between different
aspects can be shown.
Market Basket Analysis
Market basket analysis is used for finding groups of items that tend to
occur together.
- The models give the likelihood of different products being purchased
  together.
- Market basket analysis is useful for finding:
1. items that occur together
2. items that occur in a particular sequence
Table of Co-Occurrence of Products

           Product 1  Product 2  Product 3  Product 4  Product 5
Product 1        234         12          0        125         54
Product 2         12        175         65         23         75
Product 3          0         65        229         67         62
Product 4        125         23         67        315         55
Product 5         54         75         62         55        292
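A table like the one above can be built by counting, for every pair of products, the baskets that contain both. The following is only a minimal sketch of that counting step; the function name `co_occurrence` and the sample baskets are hypothetical and are not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(baskets):
    """Count, for every pair of products, the baskets containing both.

    The diagonal entry (p, p) is the number of baskets containing p.
    """
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            counts[(item, item)] += 1
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the table symmetric
    return counts


# Hypothetical baskets, for illustration only.
baskets = [
    {"Product 1", "Product 4"},
    {"Product 1", "Product 4", "Product 5"},
    {"Product 2", "Product 3"},
    {"Product 3"},
]
table = co_occurrence(baskets)
print(table[("Product 1", "Product 4")])
```

Pairs that never occur together are simply absent from the counter and read as 0, which matches the zero cells in the table.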
Procedure of market basket analysis
1. Choose the right level of the product hierarchy for the items.
2. Calculate the probabilities and joint probabilities of the items.
3. Determine the association rules.
Example

Item(s)                                           Count
Bicycle (A)                                         140
Hand tools for bicycle (B)                          100
Tool rack (C)                                        61
Bicycle and hand tool (A & B)                        50
Bicycle and tool rack (A & C)                         7
Hand tool and tool rack (B & C)                      45
Bicycle and hand tool and tool rack (A & B & C)       5
Table of probabilities and joint probabilities of items

A            14 %
B            10 %
C            6.1 %
A & B        5 %
A & C        0.7 %
B & C        4.5 %
A & B & C    0.5 %
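The percentages follow from the purchase counts of the example by dividing by the total number of transactions. The slides do not state that total; the sketch below assumes 1000, because this is the value that turns 140 purchases into 14 %. The names `counts`, `N_TRANSACTIONS` and `probabilities` are illustrative only.

```python
# Purchase counts from the example.
counts = {
    "A": 140, "B": 100, "C": 61,
    "A&B": 50, "A&C": 7, "B&C": 45,
    "A&B&C": 5,
}

# Assumption: 1000 transactions in total. This number is not given on the
# slides; it is inferred from 140 purchases corresponding to 14 %.
N_TRANSACTIONS = 1000


def probabilities(counts, n):
    """Relative frequency of each item or item combination."""
    return {key: value / n for key, value in counts.items()}


probs = probabilities(counts, N_TRANSACTIONS)
for key, p in probs.items():
    print(f"{key:6s} {p:.1%}")
```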
Association rules
The rules (A → B) consist of two parts:
1. condition and
2. consequence
A confidence can be defined for the rules:
confidence = p(condition & consequence) / p(condition)
Example
P(A → B) = 0.05 / 0.14 = 0.357
P((A&B) → C) = 0.005 / 0.05 = 0.1
P((A&C) → B) = 0.005 / 0.007 = 0.714
P((B&C) → A) = 0.005 / 0.045 = 0.111
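The confidence values of the example can be checked with a few lines of code. This is a small sketch that uses the probabilities from the table of the bicycle example; the dictionary `p` and the function `confidence` are illustrative names, not part of the original material.

```python
# Probabilities and joint probabilities from the bicycle example.
p = {
    frozenset("A"): 0.14,
    frozenset("B"): 0.10,
    frozenset("C"): 0.061,
    frozenset("AB"): 0.05,
    frozenset("AC"): 0.007,
    frozenset("BC"): 0.045,
    frozenset("ABC"): 0.005,
}


def confidence(condition, consequence):
    """confidence = p(condition & consequence) / p(condition)."""
    condition = frozenset(condition)
    both = condition | frozenset(consequence)
    return p[both] / p[condition]


print(round(confidence("A", "B"), 3))   # 0.357
print(round(confidence("AB", "C"), 3))  # 0.1
print(round(confidence("AC", "B"), 3))  # 0.714
print(round(confidence("BC", "A"), 3))  # 0.111
```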
Can this association rule help us?
- If we offer product A to everybody, 14 % of them will purchase it.
- If we offer A only to those who purchased B and C, 11 % of them will
  purchase it.
Improvement
improvement(X → Y) = p(X → Y) / p(Y)
where p(X → Y) is the confidence of the rule X → Y.
This helps us to decide whether the association rule is useful or not.
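In our example
- Improvement((B&C) → A) = 0.111 / 0.14 = 0.794
- Improvement((A&B) → C) = 0.1 / 0.061 = 1.639
The value of improvement shows the usefulness of the analysis:
a) improvement > 1: the rule predicts the consequence better than
   offering it to everybody
b) improvement < 1: the rule predicts the consequence worse than
   offering it to everybody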
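The two improvement values of the example can be reproduced as follows. This is a minimal sketch: the function `improvement` and the variable names are illustrative, and the inputs are the probabilities of the bicycle example.

```python
def improvement(confidence, p_consequence):
    """improvement(X -> Y) = confidence(X -> Y) / p(Y)."""
    return confidence / p_consequence


# (B&C) -> A: confidence 0.005 / 0.045, p(A) = 0.14
rule_bc_a = improvement(0.005 / 0.045, 0.14)
# (A&B) -> C: confidence 0.005 / 0.05, p(C) = 0.061
rule_ab_c = improvement(0.005 / 0.05, 0.061)

print(round(rule_bc_a, 3))  # 0.794
print(round(rule_ab_c, 3))  # 1.639
```

In this example only the rule (A&B) → C has an improvement above 1.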

Dissociation rules
- similar to association rules: (A & B) → C
- count the inverse of each original item, i.e. modify each transaction:
  a transaction includes an inverse item if, and only if, it does not
  contain the original item.
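The modification of the transactions can be sketched as below. The marker `"not <item>"` used for inverse items and the function name `add_inverse_items` are assumptions chosen for illustration; the slides do not prescribe a notation.

```python
def add_inverse_items(transactions, items):
    """Return copies of the transactions extended with inverse items.

    For every item in `items`, a transaction gets the marker "not <item>"
    if, and only if, it does not contain the item itself.
    """
    extended = []
    for transaction in transactions:
        inverse = {f"not {item}" for item in items if item not in transaction}
        extended.append(set(transaction) | inverse)
    return extended


# Hypothetical transactions, for illustration only.
transactions = [{"A", "B"}, {"C"}]
extended = add_inverse_items(transactions, ["A", "B", "C"])
print(extended)
```

After this step the extended transactions can be analysed with the same procedure as ordinary association rules, at the cost of doubling the number of possible items.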
Time series
The transactions must have two additional features:
- time information (e.g. time sequence or time stamp)
- identifying information (e.g. customer id, account number in a bank)
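With these two features, the transactions can be arranged into one time-ordered sequence per customer, which is the starting point for finding items that occur in a particular sequence. The record layout `(customer_id, timestamp, item)` and the function name below are assumptions made for this sketch only.

```python
from collections import defaultdict


def purchase_sequences(records):
    """Group (customer_id, timestamp, item) records into per-customer
    item sequences ordered by time."""
    by_customer = defaultdict(list)
    for customer_id, timestamp, item in records:
        by_customer[customer_id].append((timestamp, item))
    return {
        customer: [item for _, item in sorted(events)]
        for customer, events in by_customer.items()
    }


# Hypothetical records, for illustration only.
records = [
    ("c1", 3, "tool rack"),
    ("c1", 1, "bicycle"),
    ("c2", 2, "hand tools"),
    ("c1", 2, "hand tools"),
]
sequences = purchase_sequences(records)
print(sequences["c1"])
```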
Frequent itemsets
Frequent itemsets are sets of items that appear in at least a fixed ratio
of the baskets (transactions).
- problem
- a-priori trick:
  If a set of items S is frequent, then every subset of S is also frequent.
- the procedure is built from the lower level to the upper level
  (frequent items, frequent pairs, etc.)

A-Priori Algorithm
1. Define a threshold for relative frequency. All items are examined.
   The set of the frequent items is L1.
2. Pairs of items in L1 become the candidate pairs (C2). Their frequency
   is compared with the threshold. L2 contains the frequent pairs.
A-Priori Algorithm (cont.)
3. The candidate triples (C3) are those sets {A,B,C} for which all of the
   pairs {A,B}, {A,C} and {B,C} are in L2. L3 will contain the frequent
   triples.
4. In general, Li is the set of frequent sets of size i and Ci+1 is the set
   of candidate sets of size i+1. The procedure continues until the sets
   become empty.
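The level-wise procedure can be sketched in a few lines. This is a simple illustration under stated assumptions, not an optimised implementation: the function name `apriori`, the parameter `min_ratio` (the threshold for relative frequency) and the sample baskets are hypothetical.

```python
from itertools import combinations


def apriori(baskets, min_ratio):
    """Return {itemset: relative frequency} for all frequent itemsets."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def frequent(candidates):
        # Keep the candidates whose relative frequency reaches the threshold.
        result = {}
        for c in candidates:
            ratio = sum(1 for b in baskets if c <= b) / n
            if ratio >= min_ratio:
                result[c] = ratio
        return result

    # L1: frequent single items.
    items = {item for b in baskets for item in b}
    level = frequent({frozenset([i]) for i in items})
    all_frequent = dict(level)

    size = 1
    while level:
        size += 1
        # Candidates of the next size: unions of two frequent sets ...
        candidates = {
            a | b for a, b in combinations(level, 2) if len(a | b) == size
        }
        # ... all of whose subsets one size smaller are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        level = frequent(candidates)
        all_frequent.update(level)
    return all_frequent


# Hypothetical baskets, for illustration only.
baskets = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"},
]
result = apriori(baskets, min_ratio=0.6)
print(sorted("".join(sorted(s)) for s in result))
```

In this sample all three pairs are frequent, so {A,B,C} becomes a candidate triple, but it appears in only two of the five baskets and is dropped.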
Criticism of A-Priori Algorithm
- good if we would like to know only the frequent pairs
- too many steps may be needed when searching for maximal frequent
  itemsets
- limited by the physical capacity of computers
Market Basket Mining with
High Correlation Analysis
- The data are organised in a matrix.
- The cells contain Boolean values:
  1: yes
  0: no
- This matrix is very sparse.
- We want to find the highly correlated pairs.
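The slides do not define how the correlation of two columns is measured. The sketch below assumes the Jaccard similarity (rows where both columns hold 1, divided by rows where at least one holds 1), which is one common choice for sparse Boolean data; another measure could be substituted. All names and the sample matrix are hypothetical.

```python
from itertools import combinations


def column_sets(matrix):
    """Store each column of a 0/1 matrix as the set of row indices
    holding a 1; this is compact when the matrix is sparse."""
    n_cols = len(matrix[0]) if matrix else 0
    cols = [set() for _ in range(n_cols)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value:
                cols[c].add(r)
    return cols


def jaccard(a, b):
    """Assumed similarity measure: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(matrix, threshold):
    """Return (i, j, similarity) for column pairs reaching the threshold."""
    cols = column_sets(matrix)
    pairs = []
    for i, j in combinations(range(len(cols)), 2):
        similarity = jaccard(cols[i], cols[j])
        if similarity >= threshold:
            pairs.append((i, j, similarity))
    return pairs


# Hypothetical Boolean matrix, for illustration only.
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
print(similar_pairs(matrix, 0.5))
```

Comparing every pair of columns is only feasible for small matrices; it is used here to keep the sketch short.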
Applications of High
Correlation Mining
1. Rows are documents, columns are words. The highly correlated pairs of
   columns give the words that almost always appear together.
2. Rows and columns are Web pages. A cell contains 1 if the page of the
   row links to the page of the column. Result: pages about the same
   topic.
3. Rows and columns are Web pages again, but a cell contains 1 if the page
   of the column links to the page of the row. Result: the mirror pages.
Conclusion
- Planning store layout
- Bundling products
- Offering coupons
Future
Further development:
- hierarchical association rules
- maintenance of association rules
- sequential pattern mining
- functional dependency mining
Thank you!
The floor is open for discussion.
E-mail: [email protected]