Data Mining
Chapter 34 in textbook + Chapter 4 in DATA MINING by P. Adriaans and D. Zantinge
Data Mining
Data Mining: the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
• Involves analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
• Examples:
  • A customer with an income between 10,000 and 20,000 and an age between 20 and 25 who purchased milk and bread is likely to purchase diapers within 5 years.
  • The amount of fish sold to people who live in a certain area and have an income between 20,000 and 35,000 is increasing.
Data Mining
• The most accurate and reliable results require large volumes of data.
• Data mining can provide huge paybacks for companies that have made a significant investment in a DW.
• A relatively new technology; however, it is already in use in many industries.
Data Mining
Examples of applications:
• Retail / Marketing
  • Identifying buying patterns of customers.
  • Predicting response to mailing campaigns.
• Banking
  • Detecting patterns of credit card fraud.
  • Identifying loyal customers.
• Insurance
  • Claims analysis.
  • Predicting which customers will buy new policies.
• Medicine
  • Characterizing patient behaviour to predict surgery visits.
  • Identifying successful medical therapies.
Data Mining and DW
Challenge: identifying suitable data to mine.
• Data mining requires a single, separate, clean, integrated, and self-consistent source of data.
• A DW is well equipped to provide data for mining:
  • Data quality and consistency are essential to ensure the accuracy of the predictive models.
  • DWs are populated with clean, consistent data.
Data Mining and DW
• It is advantageous to mine data from multiple sources to discover as many interrelationships as possible.
  • DWs contain data from a number of sources.
• Selecting relevant subsets of records and fields for data mining requires the query capabilities of the DW.
• The results of a data mining study are useful only if one can further investigate the uncovered patterns.
  • DWs provide the capability to go back to the data source.
The Knowledge Discovery Process
• Six stages:
1. Data selection.
2. Cleaning.
3. Enrichment.
4. Coding.
5. Data mining.
6. Reporting.
The KDD Process
1. Data Selection
• We will illustrate the process using a magazine publisher's operational data.
• We selected data about people who subscribed to magazines.
• A copy of this operational data is made.
Original Selected Data
2. Cleaning
• Types of cleaning:
  • Some errors are detected before starting.
  • Some are detected during the coding or discovery stages.
• Elements of cleaning:
1. De-duplication.
2. Lack of domain consistency.
2.1 De-duplication
• Some clients are represented by several records.
  • Very common.
• Reasons:
  • Negligence: typing errors.
  • Data changed for a client without notifying the company: e.g., moving to a new address.
  • Deliberately giving wrong information: e.g., misspelling a name to avoid rejection.
• Solution: pattern analysis algorithms.
Data before De-duplication
Data after De-duplication
2.2 Lack of Domain Consistency
• Hard to trace.
• Greatly influences the DM results.
• Solutions:
  • Replace inconsistent values with NULL.
  • Replace them with the correct values.
Data before Correcting Lack of Domain Consistency
Data after Correcting Lack of Domain Consistency
3. Enrichment
• A company can purchase extra information about its clients.
4. Coding
1. Add purchased data to the DB.
2. Select records with enough information to be of value.
  • E.g., we could not get extra information on client King, so we choose to remove him from the data.
3. Keep important columns only.
  • E.g., we are not interested in clients' names, so we remove this column from the data.
4. Code the information.
  • What is coding? Changing the data in columns to ranges and enumerations.
  • Why code? The information is too detailed for the pattern recognition algorithms.
  • E.g., if we use the date of birth, the algorithm would put people with the same DOB in the same category. Better to use an age group instead.
5. Flattening: an n-cardinality attribute is replaced by n binary attributes.
4. Coding (continued)
• Some examples of coding (a short code sketch follows this list):
1. Address → region.
2. Birth date → age.
3. Divide income by 1,000.
4. Divide credit by 1,000.
5. yes-no fields → 1-0 fields.
6. Purchase date → month numbers.
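The following is a minimal sketch of these coding and flattening steps in Python with pandas; the column names, values, and the fixed reference year are hypothetical, not the publisher data from the slides.

# Minimal sketch of the coding and flattening steps using pandas.
# The columns and values here are hypothetical examples.
import pandas as pd

clients = pd.DataFrame({
    "birth_date": ["1975-03-01", "1990-07-15", "1983-11-30"],
    "income":     [28000, 45000, 12000],
    "car_mag":    ["yes", "no", "yes"],
    "region":     ["north", "south", "north"],
})

# Coding: birth date -> age (2024 is a hypothetical reference year),
# income -> thousands, yes/no -> 1/0.
clients["age"] = 2024 - pd.to_datetime(clients["birth_date"]).dt.year
clients["income_k"] = clients["income"] / 1000
clients["car_mag"] = (clients["car_mag"] == "yes").astype(int)

# Flattening: the n-cardinality attribute "region" becomes n binary columns.
flat = pd.get_dummies(clients.drop(columns=["birth_date", "income"]),
                      columns=["region"], dtype=int)
print(flat)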
Data before Removing Insufficient Records and Columns
Data after Removing Insufficient Records and Columns and before Coding
Data after Coding and before Flattening
Data after Flattening
5. Data Mining
• Now, after we have cleaned the data and prepared it, we perform the actual discovery (DM).
• Techniques:
1. Query tools & statistical techniques.
2. Visualization.
3. Online analytical processing (OLAP).
4. Case-based learning (k-nearest neighbor).
5. Decision trees.
6. Association rules.
7. Neural networks.
8. Genetic algorithms.
5.1 Query Tools and Statistical Techniques
• Perform a preliminary analysis of the data.
• Should be done before any complex DM step.
• Uses simple SQL queries.
• Does not uncover hidden patterns.
• But it discovers about 80% of the interesting information to be extracted.
  • The remaining 20% is discovered by the more complex techniques.
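A minimal sketch of this kind of preliminary SQL analysis, using Python's built-in sqlite3 module; the clients table, its columns, and its values are assumptions, not the publisher data.

# Preliminary analysis with plain SQL: averages and a simple group-by,
# run against an in-memory SQLite table filled with hypothetical data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clients (age INTEGER, income INTEGER, car_mag INTEGER)")
con.executemany("INSERT INTO clients VALUES (?, ?, ?)",
                [(25, 28000, 0), (47, 61000, 1), (33, 35000, 0), (52, 70000, 1)])

# Average age and income of all clients.
print(con.execute("SELECT AVG(age), AVG(income) FROM clients").fetchone())

# Average age of car-magazine readers vs. non-readers.
for row in con.execute("SELECT car_mag, AVG(age) FROM clients GROUP BY car_mag"):
    print(row)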
Data Averages
Age Distributions of Sports Magazine Readers
5.2 Visualization Techniques
• Useful at the beginning of DM.
• Gives a feeling for where patterns may be hidden.
• Example: a scatter diagram.
  • A projection of 2 attributes onto a Cartesian plane.
• Better example: 3D interactive diagrams.
  • A projection of 3 attributes.
Scatter Diagram
3D Interactive Diagram
5.2 Visualization Techniques (continued)
• The importance of visualizing points in multi-dimensional space lies in detecting similarity via distance.
  • If the distance between 2 points is small → the records representing them are similar → it is likely that they will behave in the same manner.
  • If the distance between 2 points is large → the records representing them have little in common.
5.2 Visualization Techniques (continued)
• E.g., age, credit, and income are 3 attributes/dimensions in our space.
• First, normalize them so that they have the same effect:
  • Age runs from 1 to 100, while income and credit run from 0 to 100,000.
  • Divide income and credit by 1,000.
• The Euclidean distance is used:
  distance = √[(x1 - x2)² + (y1 - y2)² + (z1 - z2)²]
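A minimal sketch of the normalization and distance computation in Python; the client values are hypothetical.

import math

def normalize(record):
    # Scale an (age, income, credit) record so all dimensions are comparable:
    # age is already roughly 0-100; income and credit are divided by 1,000.
    age, income, credit = record
    return (age, income / 1000, credit / 1000)

def euclidean(a, b):
    # Euclidean distance between two normalized records.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical clients: (age, income, credit).
client_1 = normalize((35, 28000, 15000))
client_2 = normalize((37, 31000, 17000))
print(euclidean(client_1, client_2))  # a small distance means similar clients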
5.2 Visualization Techniques (continued)
• A benefit of viewing points in multi-dimensional space is finding clusters.
  • Clusters are groups of similar records.
  • They are likely to behave in the same manner.
  • They can be targeted in marketing campaigns.
• Low dimensionality → clusters are easy to detect visually.
• Higher dimensionality → special programs are needed to detect clusters.
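A minimal sketch of detecting clusters programmatically, here with the k-means algorithm from scikit-learn; the library choice and the toy records are assumptions, since the slides only say that a special program is needed.

# Clustering sketch using k-means over hypothetical normalized records.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical records: (age, income/1000, credit/1000).
records = np.array([
    [22, 12, 5], [25, 14, 6], [23, 11, 4],      # younger, lower income
    [45, 60, 40], [48, 65, 45], [44, 58, 38],   # older, higher income
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: two clusters of similar clients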
Finding Clusters
35
5.3 OLAP Tools
• OLAP: OnLine Analytical Processing.
• Expands the idea of dimensionality.
  • A table with n attributes = a space with n dimensions.
• Managers usually ask multi-dimensional questions.
  • Not easy in traditional DBs.
  • Multi-dimensional relationships require multiple keys, while traditional DBs have 1 key per record.
• OLAP is useful for multi-dimensional queries.
  • It stores data in a special multi-dimensional format kept in memory.
• DM vs. OLAP:
  • OLAP doesn't learn → less powerful than DM.
  • OLAP gives you multi-dimensional knowledge, NOT new knowledge.
  • OLAP needs data in a special format, unlike DM.
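A minimal sketch of the kind of multi-dimensional question OLAP answers ("total sales by region and quarter"), expressed here as a pandas pivot table over hypothetical data; pandas is an assumption and is not an OLAP engine, but the pivot illustrates summarizing along several dimensions.

# Multi-dimensional summary of hypothetical sales data. A real OLAP tool
# would answer this from a pre-built multi-dimensional cube instead.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "north", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 80, 95, 110, 90],
})

cube = pd.pivot_table(sales, values="amount", index="region",
                      columns="quarter", aggfunc="sum")
print(cube)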
5.4 k-Nearest Neighbor
• When records are points in data space:
  • Neighborhood: records close to each other are in the same neighborhood.
• Useful in prediction.
  • Records in the same neighborhood behave similarly.
  • If you know how some will behave, you can assume that the rest will behave in the same way.
  • "Do as your neighbors do."
• To predict an individual's behavior:
  • Get the closest k neighbors by applying the k-nearest neighbor algorithm.
  • See how they behave.
  • Average their behavior.
  • Your target is likely to behave in the same way.
• A search algorithm, NOT a learning algorithm.
• Not efficient with large data sets.
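A minimal pure-Python sketch of the k-nearest-neighbor prediction step; the normalized records and the "responded to a mailing campaign" behavior are hypothetical.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(target, records, k=3):
    # Predict a value for `target` by averaging the behavior of its
    # k nearest neighbors among the already-known records.
    # records: list of (point, known_behavior) pairs.
    nearest = sorted(records, key=lambda r: euclidean(target, r[0]))[:k]
    return sum(behavior for _, behavior in nearest) / k

# Hypothetical normalized clients (age, income/1000) and whether they
# responded to a mailing campaign (1 = yes, 0 = no).
known = [((25, 14), 1), ((23, 12), 1), ((47, 60), 0), ((44, 58), 0), ((26, 15), 1)]
print(knn_predict((24, 13), known, k=3))  # close to 1.0: likely to respond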
Predictions with k-Nearest Neighbor
5.5 Decision Trees
• Useful in classification and prediction.
  • Puts records into classes.
  • Predicts the behavior of an individual by observing the behavior of individuals in his/her class.
• Advantages:
  • Good with large data sets.
  • Intuitive and simple → simulates how humans make decisions.
• Steps (a short code sketch follows this list):
1. Choose the most effective attribute.
  • E.g., age could be the most effective in determining who would buy a car magazine.
2. Split the range into 2 based on sales.
3. Go on to the next attribute (or the same attribute).
4. Repeat step 2 until we run out of attributes.
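A minimal sketch of fitting such a tree with scikit-learn; the library and the toy records are assumptions, since the slides describe the splitting idea rather than a specific tool.

# Fit a small decision tree to predict car-magazine buyers from
# hypothetical records: [age, income in thousands] -> bought (1) or not (0).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[50, 40], [46, 36], [33, 30], [29, 20], [52, 45], [40, 33], [27, 18], [47, 50]]
y = [1, 1, 1, 0, 1, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the split thresholds
print(tree.predict([[45, 35]]))  # predicted class for a new client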
Decision Trees for the Car Magazine
• First tree: split on Age at 44.5.
  • Age > 44.5 → 99% buy.
  • Age ≤ 44.5 → 38% buy.
• Four-level tree: the Age > 44.5 branch is split again on Age at 48.5 (Age > 48.5 → 100%, Age ≤ 48.5 → 92%); the Age ≤ 44.5 branch is split further on Income at 34.5 and then on Age at 31.5, with leaf percentages of 100%, 46%, and 0%.
5.6 Association Rules
• Marketing managers like rules such as:
  • "90% of women with red sports cars and small dogs wear Chanel No. 5."
  • Customer profiles for marketing campaigns.
• A relationship between attributes → an association rule.
• Association rules work on binary attributes → flattening tables is important.
• Algorithms for finding associations may find both good and bad associations.
  • We need to introduce some measures of accuracy to get rid of the bad (useless) associations.
5.6 Association Rules (continued)
• Association rule:
  • MUSIC_MAG, HOUSE_MAG => CAR_MAG
  • Somebody who reads music and house magazines is very likely to read a car magazine.
• An interesting association rule is a rule that occurs in the DB with a high percentage = high support.
  • Records with music, house, and car make up a big percentage of the total records in the DB.
• We may have lots of records with music and house but not car; high support, but not a good rule.
• We need another measure: confidence.
  • Confidence is the percentage of records with music-house-car relative to the records with music-house.
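A minimal sketch of computing support and confidence for the rule MUSIC_MAG, HOUSE_MAG => CAR_MAG over hypothetical flattened records.

# Each record: (music_mag, house_mag, car_mag) with 1 = subscribes.
records = [
    (1, 1, 1), (1, 1, 1), (1, 1, 0), (0, 1, 0),
    (1, 0, 1), (1, 1, 1), (0, 0, 0), (1, 1, 0),
]

both_sides = sum(1 for m, h, c in records if m and h and c)
left_side  = sum(1 for m, h, c in records if m and h)

support    = both_sides / len(records)   # how often the whole rule occurs
confidence = both_sides / left_side      # how often car follows music + house
print(f"support = {support:.2f}, confidence = {confidence:.2f}")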
Binary Associations between Magazines
5.7 Neural Networks
• Modeled after the human brain.
• Input nodes: receive input signals.
• Output nodes: produce output signals.
• Intermediate (hidden) nodes:
  • Connect input and output nodes.
  • Organized into layers.
  • Unlimited in number.
• 2 phases:
  • Encoding: the NN is trained to perform a task.
  • Decoding: the NN classifies examples or makes predictions.
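A minimal sketch of the two phases on a tiny one-hidden-layer network; NumPy, the XOR-style toy task, and all training settings are assumptions, since the slides do not prescribe an architecture.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # toy targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 intermediate nodes connecting input and output.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 1.0

# Encoding phase: train with gradient descent on the squared error.
for _ in range(10000):
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    d_out = (out - y) * out * (1 - out)               # backpropagated error
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 -= lr * (hidden.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hidden)
    b1 -= lr * d_hidden.sum(axis=0)

# Decoding phase: classify the examples with the trained network.
print((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int).ravel())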
Example NN: Learning
Example NN: Classifying
5.8 Genetic Algorithms
• Based on evolution theory, Darwin's theories, and the structure of DNA.
• Genetic algorithm (a short code sketch follows this list):
1. Encode the problem into strings over a limited alphabet → like DNA's building blocks (an alphabet of 4 letters).
2. Invent an artificial environment and a measure of success/failure (a fitness function) → survival of the fittest.
3. Combine solutions to produce new ones from the combined parts → DNA inherited from mother and father.
4. Provide an initial population and start generating solutions from it. Remove bad solutions from each generation and combine the good ones to produce the next generation of solutions, until you reach a family of successful solutions → evolution.
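A minimal sketch of such a genetic algorithm over bit strings; the fitness function (count the 1 bits) and all parameters are hypothetical stand-ins for a real encoded problem.

import random

random.seed(0)
LENGTH, POP_SIZE, GENERATIONS = 20, 30, 50

def fitness(bits):
    return sum(bits)  # toy fitness: how many 1s the string contains

def crossover(mother, father):
    cut = random.randrange(1, LENGTH)  # single-point crossover
    return mother[:cut] + father[cut:]

def mutate(bits, rate=0.02):
    return [b ^ 1 if random.random() < rate else b for b in bits]

# Initial population of random solutions.
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Survival of the fittest: keep the better half of each generation.
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]
    # Combine good solutions to produce the next generation.
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children

print(max(fitness(p) for p in population))  # best fitness found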
Example Genetic Algorithm