CSCI6405 Fall 2003
Data Mining and Data Warehousing
- Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: [email protected]
- Teaching Assistant: Christopher Jordan, Email: [email protected]
- Office Hours: TR, 1:30 - 3:00 PM
Lectures Outline
Part III: Data Mining Methods/Algorithms
4. Data mining primitives (ch4, optional)
5. Classification data mining (ch7)
6. Association data mining (ch6)
7. Characterization data mining (ch5)
8. Clustering data mining (ch8)
Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)
Project Presentations
Ass3: Oct (14) 16 – Oct 30
Ass4: Oct 30 – Nov 13
Project Due: Dec 8
~prof6405/Doc/proj.guide
* Final Exam: Monday, Dec 08, 19:00
Classification by Neural Networks
- Machine learning by simulating the architecture of the human brain
- A human brain has about 10^11 neurons connected to each other via a huge number of so-called synapses
- The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs using the perceptron learning rule
- Neural networks are modeled on the human brain: nodes (neurons) and connections (synapses)
  Input nodes: receive the input signals (object features).
  Output nodes: give the output signals (classification).
  Intermediate nodes: contained in hidden layers, carrying the weight pattern.
- Back-propagation networks: the most often used form of neural network for updating the distributed weights.
A Neuron
[Figure: a single neuron with input vector x = (x0, x1, ..., xn), weight vector w = (w0, w1, ..., wn), a weighted sum ∑, a bias -µk, an activation function f, and output y.]
The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping.
A Neuron (cont)
For example, with a sign activation function:
y = sign( ∑_{i=0}^{n} w_i x_i + µ_k )
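As a concrete illustration of this mapping, here is a minimal sketch in Python of the weighted sum followed by a sign activation; the input, weight, and bias values are made up for illustration and are not part of the original slides.

    import numpy as np

    def neuron_output(x, w, mu):
        """Single neuron: weighted sum of the inputs plus bias, passed through a sign activation."""
        return np.sign(np.dot(w, x) + mu)

    # Made-up example values
    x = np.array([1.0, 0.5, -0.2])   # input vector (object features)
    w = np.array([0.4, -0.3, 0.8])   # weight vector
    mu = 0.1                         # bias term µk
    print(neuron_output(x, w, mu))   # 1.0 or -1.0, i.e. the predicted class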
Categorization of customers for buying magazines
The input nodes are fully connected to the hidden nodes, and the hidden nodes are fully connected to the output nodes. In an untrained network the branches between the nodes have equal weights. During the training stage the network receives examples of input and output pairs corresponding to records in the DB, and adapts the weights of the different branches until the inputs match the appropriate outputs.
The network learning to recognize the readers of both the car magazine and the comics.
This shows the internal state of the network after training. The configuration of the internal nodes shows that there is a certain connection between the car magazine readers and the comic readers. However, the network does not provide a rule to identify this association.
Network Training
- The ultimate objective of training:
  obtain a set of weights that classifies almost all the tuples in the training data correctly
- Steps (a sketch of one training pass is given after this list):
  - Initialize weights with random values
  - Feed the input tuples into the network one by one
  - For each unit:
    - Compute the net input to the unit as a linear combination of all the inputs to the unit
    - Compute the output value using the activation function
    - Compute the error
    - Update the weights and the bias
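As an illustration of the steps above, the following is a minimal sketch in Python of a training loop for a single unit with a perceptron-style error-driven update (full back-propagation through hidden layers is not shown); the learning rate, epoch count, and data are made-up values.

    import numpy as np

    def train_single_unit(X, t, lr=0.1, epochs=10):
        """Train one unit: weighted sum, sign activation, error-driven weight/bias update."""
        rng = np.random.default_rng(0)
        w = rng.uniform(-0.5, 0.5, X.shape[1])   # initialize weights with random values
        bias = 0.0
        for _ in range(epochs):
            for x, target in zip(X, t):          # feed the input tuples one by one
                net = np.dot(w, x) + bias        # net input: linear combination of inputs
                y = 1.0 if net >= 0 else -1.0    # output value via the activation function
                error = target - y               # compute the error
                w += lr * error * x              # update the weights
                bias += lr * error               # update the bias
        return w, bias

    # Made-up training data: two features, class labels in {-1, +1}
    X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
    t = np.array([1.0, 1.0, -1.0, -1.0])
    print(train_single_unit(X, t))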
Neural networks (cont)
- Analogy to biological systems (indeed a great example of a good learning system)
- Can handle real-valued attributes, and learn nonlinear functions
- Massive parallelism, allowing for computational efficiency
- The learned knowledge is a weighted network, which is not directly interpretable by humans
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
- Classification is probably one of the most widely used data mining techniques, with a lot of extensions
- Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.
6. MINING ASSOCIATION RULES (ch6)
- Association rule mining
- Apriori algorithm
- Mining various kinds of association/correlation rules
- Constraint-based association mining
- Summary
What Is Association Mining?
- Association rule mining
  - First proposed by Agrawal, Imielinski and Swami [AIS93]
  - Finding frequent patterns of associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, etc.
  - Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
- Motivation: finding regularities in data about relationships
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Web log (click stream) analysis
Applications
* Each association rule is a piece of knowledge describing some relationship between two particular itemsets.
Association rules can reveal meaningful togetherness of items and combinations of attributes.
The generated rules are often intuitive, and a business owner can combine them with knowledge of their own industry to improve their business.
* Market basket analysis: analyze customer buying habits by finding associations among the different items that customers place in their "shopping baskets".
"Which items are likely to be purchased together by a customer on a given trip to the store?"
One basket tells us about one customer, but all purchases made by all customers together carry much more information.
Market basket analysis (cont)
1) Analyze grocery sale transactions:

Customer | Items
---------+------------------------------------
    1    | orange juice, soda
    2    | orange juice, milk, window cleaner
    3    | orange juice, detergent
    4    | orange juice, detergent, soda
    5    | soda, window cleaner

The association rule:
When a customer buys orange juice, he/she will also buy soda:
orange juice => soda (40%, 50%)
Support: 40% of the transactions contain both items.
Confidence: 50% of the transactions containing orange juice also contain soda.
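As a quick illustration of how these two measures are computed, here is a minimal Python sketch that derives the support and confidence of the orange juice => soda rule from the five transactions above; the function and variable names are just for illustration.

    def support_confidence(transactions, X, Y):
        """Support: fraction of transactions containing X ∪ Y.
           Confidence: fraction of transactions containing X that also contain Y."""
        both = sum(1 for t in transactions if X <= t and Y <= t)
        has_x = sum(1 for t in transactions if X <= t)
        return both / len(transactions), both / has_x

    transactions = [
        {"orange juice", "soda"},
        {"orange juice", "milk", "window cleaner"},
        {"orange juice", "detergent"},
        {"orange juice", "detergent", "soda"},
        {"soda", "window cleaner"},
    ]
    supp, conf = support_confidence(transactions, {"orange juice"}, {"soda"})
    print(supp, conf)   # 0.4, 0.5 -> the (40%, 50%) rule above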
Applications (cont)
2) Video store data analysis: ~prof6405/Doc/thesisPothier.pdf
Coupon => New releases (10%, 52%)
If a customer uses a coupon in a transaction, then there is 52% confidence that
they will rent a new release. This occurs in over 10% of all transactions.
3) CRM application: www.amazon.com
"Customers who bought this book (book x) also bought:"
book x => book y, book z, ...
4) Add customer data into market basket analysis.
{age(30...39), income(42K...48K)} => {buys(DVD player)}
What is an association rule?
An association rule is an expression of the form:
X => Y, where X and Y are sets of items, called itemsets.
The meaning of an association rule for a transaction database is that the transactions
which contain the items in X tend to also contain the items in Y.
E.g., the database:

ID | Items
---+---------
 1 | A,B,C,E
 2 | A,B,D,G
 3 | A,B,C,D
 4 | B,C,K

Some association rules:
{A} => {B} (75, 100)
{B} => {A} (75, 75)
{A,B} => {C} (50, 66)
{C} => {A,B} (50, 66)
{B,C} => {K} (25, 33) ...
- The rule "X => Y" is stated over a transaction database in which each transaction is a set of items that may contain X, or Y, or both.
- The rule is only meaningful if it passes the required measures: the support rate and the confidence rate.
Basic Concepts: Frequent Patterns and Association Rules

Transaction-id | Items bought
---------------+-------------
      10       | A, B, C
      20       | A, C
      30       | A, D
      40       | B, E, F

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.]

- Itemset X = {x1, ..., xk}
- Find all the rules X ⇒ Y with min confidence and support
  - support, s: probability that a transaction contains X ∪ Y, i.e. P(X ∪ Y)
  - confidence, c: conditional probability that a transaction having X also contains Y, i.e. P(Y|X)
- Let min_support = 50%, min_conf = 50%:
  A ⇒ C (50%, 66.7%)
  C ⇒ A (50%, 100%)
  {A,C} ⇒ B ?  (checked in the sketch below)
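To check the rules listed above, including the question about {A,C} ⇒ B, here is a brute-force sketch in Python that enumerates every candidate rule over the four transactions and keeps those meeting min_support = 50% and min_conf = 50%. The enumeration is deliberately naive; avoiding this blow-up is what the Apriori algorithm below is designed for.

    from itertools import combinations

    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    items = set().union(*transactions)
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Enumerate all rules X => Y with disjoint, non-empty X and Y
    for size in range(1, len(items)):
        for X in map(set, combinations(items, size)):
            for y_size in range(1, len(items) - size + 1):
                for Y in map(set, combinations(items - X, y_size)):
                    s = support(X | Y)
                    if s >= 0.5 and s / support(X) >= 0.5:
                        print(sorted(X), "=>", sorted(Y), (s, s / support(X)))

    # {A,C} => {B}: support({A,B,C}) = 25% < 50%, so that rule is rejected;
    # only A => C (50%, 66.7%) and C => A (50%, 100%) survive.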
Mining Association Rules—an Example
Transaction-id | Items bought
---------------+-------------
      10       | A, B, C
      20       | A, C
      30       | A, D
      40       | B, E, F

Min. support 50%
Min. confidence 50%

Frequent pattern | Support
-----------------+--------
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For rule {A} ⇒ {C}:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.6%
Strategy of Apriori Algorithm
Two major phases of the algorithm (a minimal sketch of phase 1 follows below):
1. Mine all frequent itemsets from the database
   * Use the supp factor as a filter.
   * Apply the heuristic knowledge (the Apriori property) for an efficient search.
2. Derive all valid rules from the mined large itemsets
   * Use the conf factor as a filter.
   * Apply the prior knowledge again.
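The following is a minimal sketch in Python of phase 1 (frequent-itemset mining) with Apriori-style candidate generation and pruning; the pruning step relies on the Apriori property discussed on the next slides. The min_support value and the data are the made-up 4-transaction example used earlier, and rule derivation (phase 2) is omitted.

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, min_support):
        """Phase 1: level-wise mining of all frequent itemsets, Apriori style."""
        n = len(transactions)

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t) / n

        # Level 1: frequent single items
        items = sorted(set().union(*transactions))
        level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
        frequent = list(level)

        k = 2
        while level:
            prev = set(level)
            # Join step: build size-k candidates from frequent (k-1)-itemsets
            candidates = {a | b for a in level for b in level if len(a | b) == k}
            # Prune step (Apriori property): every (k-1)-subset must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in prev for s in combinations(c, k - 1))}
            level = [c for c in candidates if support(c) >= min_support]
            frequent.extend(level)
            k += 1
        return frequent

    # The 4-transaction example used above, with min support 50%
    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    for itemset in apriori_frequent_itemsets(transactions, 0.5):
        print(sorted(itemset))   # {A}, {B}, {C}, {A, C}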
What is Apriori property?
Apriori property:
All non-empty subsets of a frequent itemset must also be frequent.
E.g., if supp(I1,I2,I3) >= T_supp is true, then it is also true that
supp(I1) >= T_supp, supp(I2) >= T_supp, supp(I3) >= T_supp,
supp(I1,I2) >= T_supp, supp(I1,I3) >= T_supp, supp(I2,I3) >= T_supp,
where T_supp is the minimum support rate.
Conversely, if supp(I1) < T_supp, then
supp(I1,I2) < T_supp, supp(I1,I3) < T_supp, supp(I1,I2,I3) < T_supp.
What is Apriori property? (cont)
In general, if S and R are two itemsets and supp(S) < T_supp, then supp(S ∪ R) < T_supp.
If itemset S is not frequent, then adding an itemset R to S, i.e. forming the itemset S ∪ R,
cannot make it occur more frequently than S alone.

E.g.:
TID | Items
----+--------
 1  | A,B,C
 2  | A,C
 3  | A,B
 4  | B,C,D

If supp(A) < 80%, then without any calculation we can be sure that any itemset
which contains A will have support < 80%, that is supp(A,B) < 80%, supp(A,C) < 80%,
supp(A,B,C) < 80%, and supp(A,B,C,D) < 80%.