Download ADM15 File

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Lecture 15
Data Mining Concepts
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Outline






Data Mining
Data Warehousing
Knowledge Discovery in Databases (KDD)
Goals of Data Mining and Knowledge Discovery
Association Rules
Additional Data Mining Algorithms





Sequential pattern analysis
Time Series Analysis
Regression
Neural Networks
Genetic Algorithms
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 2
Definitions of Data Mining



The discovery of new information in terms of
patterns or rules from vast amounts of data.
The process of finding interesting structure in
data.
The process of employing one or more computer
learning techniques to automatically analyze and
extract knowledge from data.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 3
Data Warehousing



The data warehouse is a historical database
designed for decision support.
Data mining can be applied to the data in a
warehouse to help with certain types of decisions.
Proper construction of a data warehouse is
fundamental to the successful use of data mining.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 4
Knowledge Discovery in Databases
(KDD)


Data mining is actually one step of a larger
process known as knowledge discovery in
databases (KDD).
The KDD process model comprises six phases






Data selection
Data cleansing
Enrichment
Data transformation or encoding
Data mining
Reporting and displaying discovered knowledge
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 5
Goals of Data Mining and Knowledge
Discovery (PICO)

Prediction:


Identification:


Identify the existence of an item, event, or activity.
Classification:


Determine how certain attributes will behave in the
future.
Partition data into classes or categories.
Optimization:

Optimize the use of limited resources.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 6
Types of Discovered Knowledge





Association Rules
Classification Hierarchies
Sequential Patterns
Patterns Within Time Series
Clustering
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 7
Association Rules



Association rules are frequently used to generate rules
from market-basket data.
 A market basket corresponds to the sets of items a
consumer purchases during one visit to a supermarket.
The set of items purchased by customers is known as an
itemset.
An association rule is of the form X=>Y, where X ={x1,
x2, …., xn }, and Y = {y1,y2, …., yn} are sets of items, with
xi and yi being distinct items for all i and all j.
 For an association rule to be of interest, it must satisfy
a minimum support and confidence.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 8
Association Rules
Confidence and Support

Support:



The minimum percentage of instances in the database that
contain all items listed in a given association rule.
Support is the percentage of transactions that contain all of
the items in the itemset, LHS U RHS.
Confidence:


Given a rule of the form A=>B, rule confidence is the
conditional probability that B is true when A is known to be
true.
Confidence can be computed as
 support(LHS U RHS) / support(LHS)
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 9
Generating Association Rules

The general algorithm for generating association
rules is a two-step process.


Generate all itemsets that have a support
exceeding the given threshold. Itemsets with this
property are called large or frequent itemsets.
Generate rules for each itemset as follows:


For itemset X and Y a subset of X, let Z = X – Y;
If support(X)/Support(Z) > minimum confidence, the
rule Z=>Y is a valid rule.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 10
Reducing Association Rule
Complexity

Two properties are used to reduce the search
space for association rule generation.

Downward Closure


A subset of a large itemset must also be large
Anti-monotonicity

A superset of a small itemset is also small. This
implies that the itemset does not have sufficient
support to be considered for rule generation.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 11
Generating Association Rules:
The Apriori Algorithm

The Apriori algorithm was the first algorithm
used to generate association rules.

The Apriori algorithm uses the general algorithm
for creating association rules together with
downward closure and anti-monotonicity.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 12
Generating Association Rules:
The Sampling Algorithm

The sampling algorithm selects samples from
the database of transactions that individually fit
into memory. Frequent itemsets are then formed
for each sample.


If the frequent itemsets form a superset of the
frequent itemsets for the entire database, then the
real frequent itemsets can be obtained by
scanning the remainder of the database.
In some rare cases, a second scan of the
database is required to find all frequent itemsets.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 13
Generating Association Rules:
Frequent-Pattern Tree Algorithm



The Frequent-Pattern Tree Algorithm reduces
the total number of candidate itemsets by
producing a compressed version of the database
in terms of an FP-tree.
The FP-tree stores relevant information and
allows for the efficient discovery of frequent
itemsets.
The algorithm consists of two steps:


Step 1 builds the FP-tree.
Step 2 uses the tree to find frequent itemsets.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 14
Step 1: Building the FP-Tree





First, frequent 1-itemsets along with the count of transactions
containing each item are computed.
The 1-itemsets are sorted in non-increasing order.
The root of the FP-tree is created with a “null” label.
For each transaction T in the database, place the frequent 1-itemsets
in T in sorted order. Designate T as consisting of a head and the
remaining items, the tail.
Insert itemset information recursively into the FP-tree as follows:
 if the current node, N, of the FP-tree has a child with an item
name = head, increment the count associated with N by 1 else
create a new node, N, with a count of 1, link N to its parent and
link N with the item header table.
 if tail is nonempty, repeat the above step using only the tail, i.e.,
the old head is removed and the new head is the first item from
the tail and the remaining items become the new tail.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 15
Step 2: The FP-growth Algorithm For
Finding Frequent Itemsets
Input: Fp-tree and minimum support, mins
Output: frequent patterns (itemsets)
procedure FP-growth (tree, alpha);
Begin
if tree contains a single path P then
for each combination, beta of the nodes in the path
generate pattern (beta U alpha)
with support = minimum support of nodes in beta
else
for each item, i, in the header of the tree do
begin
generate pattern beta = (i U alpha) with support = i.support;
construct beta’s conditional pattern base;
construct beta’s conditional FP-tree, beta_tree;
if beta_tree is not empty then
FP-growth(beta_tree, beta);
end;
End;
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 16
Generating Association Rules:
The Partition Algorithm






Divide the database into non-overlapping
subsets.
Treat each subset as a separate database where
each subset fits entirely into main memory.
Apply the Apriori algorithm to each partition.
Take the union of all frequent itemsets from each
partition.
These itemsets form the global candidate
frequent itemsets for the entire database.
Verify the global set of itemsets by having their
actual support measured for the entire database.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 17
Complications seen with
Association Rules




The cardinality of itemsets in most situations is
extremely large.
Association rule mining is more difficult when
transactions show variability in factors such as
geographic location and seasons.
Item classifications exist along multiple
dimensions.
Data quality is variable; data may be missing,
erroneous, conflicting, as well as redundant.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 18
Classification




Classification is the process of learning a model
that is able to describe different classes of data.
Learning is supervised as the classes to be
learned are predetermined.
Learning is accomplished by using a training set
of pre-classified data.
The model produced is usually in the form of a
decision tree or a set of rules.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 19
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 20
An Example Rule

Here is one of the rules extracted from the
decision tree of Figure 28.7.
IF 50K > salary >= 20K
AND age >=25
THEN class is “yes”
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 21
Clustering



Unsupervised learning or clustering builds
models from data without predefined classes.
The goal is to place records into groups where
the records in a group are highly similar to each
other and dissimilar to records in other groups.
The k-Means algorithm is a simple yet effective
clustering technique.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 22
Additional Data Mining Methods





Sequential pattern analysis
Time Series Analysis
Regression
Neural Networks
Genetic Algorithms
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 23
Sequential Pattern Analysis



Transactions ordered by time of purchase form a
sequence of itemsets.
The problem is to find all subsequences from a
given set of sequences that have a minimum
support.
The sequence S1, S2, S3, .. is a predictor of the
fact that a customer purchasing itemset S1 is
likely to buy S2 , and then S3, and so on.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 24
Time Series Analysis



Time series are sequences of events. For
example, the closing price of a stock is an event
that occurs each day of the week.
Time series analysis can be used to identify the
price trends of a stock or mutual fund.
Time series analysis is an extended functionality
of temporal data management.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 25
Regression Analysis




A regression equation estimates a dependent
variable using a set of independent variables
and a set of constants.
The independent variables as well as the
dependent variable are numeric.
A regression equation can be written in the form
Y=f(x1,x2,…,xn) where Y is the dependent
variable.
If f is linear in the domain variables xi, the
equation is call a linear regression equation.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 26
Neural Networks




A neural network is a set of interconnected
nodes designed to imitate the functioning of the
brain.
Node connections have weights which are
modified during the learning process.
Neural networks can be used for supervised
learning and unsupervised clustering.
The output of a neural network is quantitative
and not easily understood.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 27
Genetic Learning




Genetic learning is based on the theory of
evolution.
An initial population of several candidate
solutions is provided to the learning model.
A fitness function defines which solutions survive
from one generation to the next.
Crossover, mutation and selection are used to
create new population elements.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 28
Data Mining Applications

Marketing


Finance


Fraud detection, creditworthiness and investment
analysis
Manufacturing


Marketing strategies and consumer behavior
Resource optimization
Health

Image analysis, side effects of drug, and treatment
effectiveness
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 29
Overview of
Data Warehousing and OLAP
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Outline










Purpose of Data Warehousing
Introduction, Definitions, and Terminology
Comparison with Traditional Databases
Characteristics of Data Warehouses
Classification of Data Warehouses
Multi-dimensional Schemas
Building a Data Warehouse
Functionality of a Data Warehouse
Warehouse vs. Data Views
Implementation difficulties and open issues
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 31
Purpose of Data Warehousing





Traditional databases are not optimized for data access only they
have to balance the requirement of data access with the need to
ensure integrity of data.
Most of the times the data warehouse users need only read access
but, need the access to be fast over a large volume of data.
Most of the data required for data warehouse analysis comes from
multiple databases and these analysis are recurrent and predictable
to be able to design specific software to meet the requirements.
There is a great need for tools that provide decision makers with
information to make decisions quickly and reliably based on historical
data.
The above functionality is achieved by Data Warehousing and Online
analytical processing (OLAP)
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 32
Introduction, Definitions, and Terminology

W. H Inmon characterized a data warehouse as:

“A subject-oriented, integrated, nonvolatile,
time-variant collection of data in support of
management’s decisions.”
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 33
Introduction, Definitions, and Terminology

Data warehouses have the distinguishing characteristic
that they are mainly intended for decision support
applications.


Traditional databases are transactional.
Applications that data warehouse supports are:



OLAP (Online Analytical Processing) is a term used to
describe the analysis of complex data from the data
warehouse.
DSS (Decision Support Systems) also known as EIS
(Executive Information Systems) supports organization’s
leading decision makers for making complex and important
decisions.
Data Mining is used for knowledge discovery, the process
of searching data for unanticipated new knowledge.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 34
Conceptual Structure of Data Warehouse

Data Warehouse processing involves



Cleaning and reformatting of data
OLAP
Data Mining Back Flushing
Data Warehouse
OLAP
Cleaning
Databases
Other Data Inputs
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Data
Reformatting
Metadata
DSSI
EIS
Data
Mining
Updates/New Data
Slide 29- 35
Comparison with Traditional
Databases

Data Warehouses are mainly optimized for appropriate
data access.




Traditional databases are transactional and are optimized for
both access mechanisms and integrity assurance measures.
Data warehouses emphasize more on historical data as
their main purpose is to support time-series and trend
analysis.
Compared with transactional databases, data warehouses
are nonvolatile.
In transactional databases transaction is the mechanism
change to the database. By contrast information in data
warehouse is relatively coarse grained and refresh policy
is carefully chosen, usually incremental.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 36
Characteristics of Data Warehouses












Multidimensional conceptual view
Generic dimensionality
Unlimited dimensions and aggregation levels
Unrestricted cross-dimensional operations
Dynamic sparse matrix handling
Client-server architecture
Multi-user support
Accessibility
Transparency
Intuitive data manipulation
Consistent reporting performance
Flexible reporting
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 37
Classification of Data Warehouses


Generally, Data Warehouses are an order of magnitude
larger than the source databases.
The sheer volume of data is an issue, based on which
Data Warehouses could be classified as follows.



Enterprise-wide data warehouses
 They are huge projects requiring massive investment of time
and resources.
Virtual data warehouses
 They provide views of operational databases that are
materialized for efficient access.
Data marts
 These are generally targeted to a subset of organization, such
as a department, and are more tightly focused.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 38
Data Modeling for Data Warehouses

Traditional Databases generally deal with twodimensional data (similar to a spread sheet).


However, querying performance in a multidimensional data storage model is much more
efficient.
Data warehouses can take advantage of this
feature as generally these are


Non volatile
The degree of predictability of the analysis that will
be performed on them is high.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 39
Data Modeling for Data Warehouses

Example of Two- Dimensional vs. MultiDimensional
Two Dimensional Model
T hree d imensio nal d at a cub e
REGION
REG1
P
R
O
D
U
C
T
REG2
REG3
P123
P124
P125
P126
:
:
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
P
r
o
d
u
c
t
P1 2 3
er
a r t Q tr 4
u
Q
a l Qtr 3
c
F i s tr 2
Q
1
r
t
Q
Reg 1 Reg 2
Reg 3
P1 2 4
P1 2 5
P1 2 6
:
:
Region
Slide 29- 40
Data Modeling for Data Warehouses

Advantages of a multi-dimensional model


Multi-dimensional models lend themselves readily
to hierarchical views in what is known as roll-up
display and drill-down display.
The data can be directly queried in any
combination of dimensions, bypassing complex
database queries.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 41
Multi-dimensional Schemas

Multi-dimensional schemas are specified using:

Dimension table


It consists of tuples of attributes of the dimension.
Fact table

Each tuple is a recorded fact. This fact contains
some measured or observed variable (s) and
identifies it with pointers to dimension tables. The
fact table contains the data, and the dimensions to
identify each tuple in the data.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 42
Multi-dimensional Schemas

Two common multi-dimensional schemas are
 Star schema:


Consists of a fact table with a single table for each
dimension
Snowflake Schema:

It is a variation of star schema, in which the
dimensional tables from a star schema are
organized into a hierarchy by normalizing them.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 43
Multi-dimensional Schemas

Star schema:

Consists of a fact table with a single table for each
dimension.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 44
Multi-dimensional Schemas

Snowflake Schema:

It is a variation of star schema, in which the
dimensional tables from a star schema are
organized into a hierarchy by normalizing them.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 45
Multi-dimensional Schemas

Fact Constellation

Fact constellation is a set of tables that share
some dimension tables. However, fact
constellations limit the possible queries for the
warehouse.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 46
Multi-dimensional Schemas

Indexing



Data warehouse also utilizes indexing to support
high performance access.
A technique called bitmap indexing constructs a bit
vector for each value in domain being indexed.
Indexing works very well for domains of low
cardinality.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 47
Building A Data Warehouse

The builders of Data warehouse should take a
broad view of the anticipated use of the
warehouse.


The design should support ad-hoc querying
An appropriate schema should be chosen that
reflects the anticipated usage.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 48
Building A Data Warehouse

The Design of a Data Warehouse involves
following steps.



Acquisition of data for the warehouse.
Ensuring that Data Storage meets the query
requirements efficiently.
Giving full consideration to the environment in
which the data warehouse resides.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 49
Building A Data Warehouse

Acquisition of data for the warehouse



The data must be extracted from multiple,
heterogeneous sources.
Data must be formatted for consistency within the
warehouse.
The data must be cleaned to ensure validity.


Difficult to automate cleaning process.
Back flushing, upgrading the data with cleaned
data.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 50
Building A Data Warehouse

Acquisition of data for the warehouse (contd.)


The data must be fitted into the data model of the
warehouse.
The data must be loaded into the warehouse.

Proper design for refresh policy should be
considered.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 51
Building A Data Warehouse







Storing the data according to the data model of
the warehouse
Creating and maintaining required data structures
Creating and maintaining appropriate access
paths
Providing for time-variant data as new data are
added
Supporting the updating of warehouse data.
Refreshing the data
Purging data
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 52
Building A Data Warehouse







Usage projections
The fit of the data model
Characteristics of available resources
Design of the metadata component
Modular component design
Design for manageability and change
Considerations of distributed and parallel
architecture

Distributed vs. federated warehouses
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 53
Functionality of a Data Warehouse

Functionality that can be expected:







Roll-up: Data is summarized with increasing
generalization
Drill-Down: Increasing levels of detail are
revealed
Pivot: Cross tabulation is performed
Slice and dice: Performing projection operations
on the dimensions.
Sorting: Data is sorted by ordinal value.
Selection: Data is available by value or range.
Derived attributes: Attributes are computed by
operations on stored derived values.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 54
Warehouse vs. Data Views


Views and data warehouses are alike in that they both
have read-only extracts from the databases.
However, data warehouses are different from views in the
following ways:





Data Warehouses exist as persistent storage instead of being
materialized on demand.
Data Warehouses are not usually relational, but rather multidimensional.
Data Warehouses can be indexed for optimization.
Data Warehouses provide specific support of functionality.
Data Warehouses deals huge volumes of data that is
contained generally in more than one database.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 55
Difficulties of implementing Data
Warehouses

Lead time is huge in building a data warehouse



Both quality and consistency of data are major concerns.
Revising the usage projections regularly to meet the
current requirements.


Potentially it takes years to build and efficiently maintain a data
warehouse.
The data warehouse should be designed to accommodate
addition and attrition of data sources without major redesign
Administration of data warehouse would require far
broader skills than are needed for a traditional database.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 56
Open Issues in Data Warehousing


Data cleaning, indexing, partitioning, and views could be
given new attention with perspective to data warehousing.
Automation of






data acquisition
data quality management
selection and construction of access paths and structures
self-maintainability
functionality and performance optimization
Incorporating of domain and business rules appropriately
into the warehouse creation and maintenance process
more intelligently.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 57
Recap






Data Mining
Data Warehousing
Knowledge Discovery in Databases (KDD)
Goals of Data Mining and Knowledge Discovery
Association Rules
Additional Data Mining Algorithms





Sequential pattern analysis
Time Series Analysis
Regression
Neural Networks
Genetic Algorithms
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 28- 58
Recap










Purpose of Data Warehousing
Introduction, Definitions, and Terminology
Comparison with Traditional Databases
Characteristics of data Warehouses
Classification of Data Warehouses
Multi-dimensional Schemas
Building A Data Warehouse
Functionality of a Data Warehouse
Warehouse vs. Data Views
Implementation difficulties and open issues
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Slide 29- 59