MCA 204, Data Warehousing & Data Mining
UNIT-4
Data Mining Basics
© Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi-63, by Dr. Deepali Kamthania
Learning Objective
• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Evolution of Sciences: New Data Science Era
Before 1600: Empirical science
1600-1950s: Theoretical science
 Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
1950s-1990s: Computational science
 Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
1990-now: Data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally
accessible
 Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes
 Data mining is a major new challenge!
What is Data Mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data.
It is also known as Knowledge Discovery from Data.
What is Data Mining?
Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amount
of data
 Data mining: a misnomer?
Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
Knowledge Discovery (KDD) Process
• This is a view from typical database
systems and data warehousing
communities
• Data mining plays an essential role in the
knowledge discovery process
Typical flow:
Databases → Data Cleaning & Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Example: A Web Mining Framework
Web mining usually involves
 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored into
knowledge-base
Data Mining in Business Intelligence
Layers, bottom to top, with increasing potential to support business decisions:
• Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Exploration: statistical summary, querying, and reporting (Data Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Presentation: visualization techniques (Business Analyst)
• Decision Making (End User)
Example: Mining vs. Data Exploration
• Business intelligence view
 Warehouse, data cube, reporting but not much mining
• Business objects vs. data mining tools
• Supply chain example: tools
• Data presentation
• Exploration
KDD Process: A Typical View from ML and
Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities
Example: Medical Data Mining
• Health care & medical data mining often adopt such a view from statistics and machine learning.
• Preprocessing of the data (including feature extraction and
dimension reduction)
• Classification or/and clustering processes
• Post-processing for presentation
Multi-Dimensional View of Data Mining
Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social
and information networks
Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
 Data cleaning, transformation, integration, and
multidimensional data model
Data cube technology
 Scalable methods for computing (i.e., materializing) multidimensional aggregates
 OLAP (online analytical processing)
Multidimensional concept description: Characterization and
discrimination
 Generalize, summarize, and contrast data
characteristics, e.g. dry vs. wet region
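A roll-up over one dimension can be sketched in a few lines of Python (the sales records, region names, and amounts here are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

# Hypothetical sales records: (region, year, amount) -- illustrative only.
sales = [
    ("dry", 2023, 100),
    ("dry", 2024, 150),
    ("wet", 2023, 80),
    ("wet", 2024, 120),
]

def roll_up(records, dim):
    """Materialize one aggregate: total amount grouped by dimension
    `dim` (0 = region, 1 = year)."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[dim]] += rec[2]
    return dict(totals)

by_region = roll_up(sales, 0)  # contrasts dry vs. wet regions
by_year = roll_up(sales, 1)
```

A full data cube would materialize every such group-by, including the grand total; this sketch computes one aggregate at a time.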
Data Mining Function: (2) Association and
Correlation Analysis
Frequent patterns (or frequent itemsets)
 What items are frequently purchased together in your
Walmart?
Association, correlation vs. causality
 A typical association rule:
Milk => Bread [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly
correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering, and
other applications?
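Counting frequent item pairs, the simplest case of frequent-itemset mining, can be sketched as follows (the transactions and the `frequent_pairs` helper are hypothetical, for illustration only):

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions, illustrative only.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def frequent_pairs(baskets, min_support):
    """Count every item pair and keep those whose support (fraction of
    baskets containing the pair) meets the threshold."""
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(transactions, min_support=0.5)
```

Real miners (e.g., Apriori) prune the search space instead of enumerating all pairs, but the support computation is the same.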
Data Mining Function: (3) Classification
Classification and label prediction
 Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future
prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
 Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
Typical applications:
 Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
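As a minimal stand-in for the classifiers listed above, a one-nearest-neighbor sketch in Python (the training data, feature names, and labels are made up for illustration):

```python
def nn_classify(train, point):
    """Predict the label of `point` from its single nearest training
    example (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda ex: dist2(ex[0], point))
    return label

# Hypothetical training examples: ((mpg, weight in lb), class label).
train = [
    ((35.0, 1800.0), "economy"),
    ((12.0, 4500.0), "gas-guzzler"),
    ((28.0, 2400.0), "economy"),
    ((15.0, 4100.0), "gas-guzzler"),
]

prediction = nn_classify(train, (30.0, 2000.0))
```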
Data Mining Function: (4) Cluster Analysis
• Unsupervised learning (i.e., Class label is unknown)
• Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing
interclass similarity
• Many methods and applications
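The intra-/inter-class similarity principle is what k-means optimizes; a minimal one-dimensional sketch (illustrative house-price data and hand-picked starting centroids):

```python
def kmeans(points, centroids, iterations=10):
    """Minimal 1-D k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical house prices (in $1000s) clustered into two groups.
centers, groups = kmeans([100, 110, 120, 400, 420], [100.0, 400.0])
```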
Data Mining Function: (5) Outlier Analysis
Outlier analysis
 Outlier: A data object that does not comply with the
general behavior of the data
 Noise or exception? ― One person’s garbage could be
another person's treasure
 Methods: by-product of clustering or regression
analysis, …
 Useful in fraud detection, rare events analysis
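A simple statistical stand-in for these methods is z-score-based outlier flagging (the transaction amounts and the 2.0 threshold are illustrative assumptions):

```python
def zscore_outliers(values, threshold=2.0):
    """Flag values whose z-score (distance from the mean in standard
    deviations) exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical transaction amounts: one looks fraudulent.
suspicious = zscore_outliers([20, 22, 19, 21, 20, 500])
```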
Relationships
Let us take the case study of a fast food restaurant.
The combo meals that are available are designed after
applying data mining to the sales trends’ data over some
months or years.
• Data mining discovers relationships of this type. The
relationships may be between two or more different objects
along with the time dimension or between the attributes of
the same object.
• Discovery of knowledge is a key result of data mining.
Case Study
The Fast Food industry is highly competitive, one where a
very small change in operations can have a significant
impact on the bottom line.
For this reason, quick access to comprehensive
information for both standard and on demand reporting is
essential. Implement the various data mining techniques to address this requirement for ABC Corporation, a fast food franchisee operating approximately 80 outlets at different places.
The results should provide strategic and tactical
decision support to all levels of management within the
Corporation.
Time and Ordering: Sequential Pattern, Trend
and Evolution Analysis
Sequence, trend and evolution analysis
 Trend, time-series, and deviation analysis: e.g.,
regression and value prediction
 Sequential pattern mining
e.g., first buy digital camera, then buy large SD memory
cards
 Periodicity analysis
 Motifs and biological sequence analysis
Approximate and consecutive motifs
 Similarity-based analysis
Mining data streams
 Ordered, time-varying, potentially infinite, data streams
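Sequential-pattern support can be sketched by checking whether an ordered pattern occurs as a (not necessarily contiguous) subsequence of each customer's purchase sequence (the purchase data below is invented for illustration):

```python
def is_subsequence(pattern, sequence):
    """True if `pattern`'s items appear in `sequence` in the same order,
    not necessarily contiguously."""
    it = iter(sequence)
    return all(item in it for item in pattern)

# Hypothetical per-customer purchase sequences, ordered by time.
purchases = [
    ["digital camera", "case", "SD card"],
    ["laptop", "digital camera", "SD card"],
    ["SD card", "digital camera"],
]
pattern = ["digital camera", "SD card"]

# Fraction of customers whose sequence contains the pattern.
support = sum(is_subsequence(pattern, s) for s in purchases) / len(purchases)
```

Note the third customer bought the items in the opposite order, so the pattern does not match there.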
Patterns
• Data mining tools mine the usage pattern of the customers
which helps the restaurant owner to launch different special
offers at different places at different times.
• This potential usage pattern also deduces results which
help in designing a marketing campaign.
Structure and Network Analysis
Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
 Social networks: actors (objects, nodes) and relationships (edges)
 e.g., author networks in CS, terrorist networks
 Multiple heterogeneous networks
 A person could be in multiple information networks: friends, family, classmates, …
 Links carry a lot of semantic information: Link mining
Web mining
 Web is a big information network: from PageRank to Google
 Analysis of Web information networks
 Web community discovery, opinion mining, usage mining, …
Evaluation of Knowledge
Are all mined knowledge interesting?
 One can mine tremendous amount of “patterns” and knowledge
 Some may fit only certain dimension space (time, location, …)
 Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only
interesting knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
 …
Data Mining: Confluence of Multiple Disciplines
Data mining sits at the confluence of multiple disciplines:
• Machine Learning
• Pattern Recognition
• Statistics
• Visualization
• Database Technology
• Algorithms
• High-Performance Computing
• Applications
Why Confluence of Multiple Disciplines?
Tremendous amount of data
 Algorithms must be highly scalable to handle terabytes of data
High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
New and sophisticated applications
Steps in Data Mining
Typical distribution of effort across the steps:
• Determination of Business Objectives: 20%
• Selection and Preparation of Data: 45%
• Application of Suitable Data Mining Techniques: 15%
• Evaluation and Application of Results: 20%
Steps in Data Mining (Cont...)
We do not try to predict the knowledge we are going to
discover but define the business objectives of the
engagement.
Step 1: Define Business Objectives
• State why you need a data mining solution.
• Define your expectations and express how the final results will be used in the operational system.
Step 2: Prepare Data
• Consists of data selection, pre-processing of data and data
transformation.
• Use the business objectives to determine what data has to
be selected. The variables selected are called active
variables.
Steps in Data Mining (Cont...)
Pre-processing is meant to improve the quality of the selected data. It involves enriching the selected data with external data, removing noisy data, and handling missing values.
Step 3: Perform Data Mining
• The knowledge discovery engine applies the selected algorithm to the prepared data.
• The output from this step is a relationship or pattern.
Step 4: Evaluate Results
• In this step, all the resulting patterns are examined.
• A filtering mechanism is applied and only the promising
patterns are selected.
Cont...
Step 5: Present Discoveries
• This may be in the form of visual navigation, charts, graphs, or
free-form texts.
• It also includes storing of interesting discoveries in the knowledge
base for repeated use.
Step 6: Incorporate Usage of Discoveries
• This step is for using the results to create actionable items in the
business.
• The results are assembled in the best way so that they can be
exploited to improve the business.
OLAP versus Data Mining
Features | OLAP | Data Mining
Motivation for information request | What is happening in the enterprise? | Predict the future based on why this is happening.
Data granularity | Summary data. | Detailed transaction-level data.
Number of business dimensions | Limited number of dimensions. | Large number of dimensions.
Number of dimension attributes | Small number of attributes. | Many dimension attributes.
Sizes of datasets for the dimensions | Not large for each dimension. | Usually very large for each dimension.
Analysis approach | User-driven interactive analysis. | Data-driven automatic knowledge discovery.
Analysis techniques | Multidimensional, drill-down, and slice-and-dice. | Prepare data, launch mining tool and sit back.
State of the technology | Mature and widely used. | Still emerging; some parts of the technology more mature.
Data Mining in the Data Warehouse Environment
(Diagram) Source operational systems feed a data staging area containing flat files with extracted and transformed data, and load-image files ready for loading into the data warehouse, with several options for data extraction. From the enterprise data warehouse, data is selected, extracted, transformed, and prepared for mining; the warehouse also feeds the OLAP system alongside the data mining engine.
Functions and Application Areas
Application Areas | Examples | Mining Processes | Mining Techniques
Fraud Detection | Credit card frauds; internal audits; warehouse pilferage | Determination of variations from norms | Data visualization; memory-based reasoning
Risk Assessment | Credit card upgrades; mortgage loans; customer retention; credit ratings | Detection and analysis of links | Decision trees; memory-based reasoning
Market Analysis | Market basket analysis; target marketing; cross selling; customer relationship marketing | Predictive modelling; database segmentation | Cluster detection; decision trees; link analysis; genetic algorithms
Applications of Data Mining
• Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS
SQL-Server Analysis Manager, Oracle Data Mining Tools) to
invisible data mining
Major Issues in Data Mining (1)
Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining
methods
Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
Summary
• Data mining: Discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of science and information technology, in
great demand, with wide applications
• A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination,
association, classification, clustering, trend and outlier analysis,
etc.
• Data mining technologies and applications
• Major issues in data mining
Cont…
• Motivation: the need to extract useful information and knowledge from large amounts of data (the data explosion problem)
• Data Mining tools perform data analysis and may uncover
important data patterns, contributing greatly to business
strategies, knowledge bases, and scientific and medical
research.
What is Data Mining?
• Data mining refers to extracting or “mining” knowledge
from large amounts of data. Also referred as Knowledge
Discovery in Databases.
• It is a process of discovering interesting knowledge from
large amounts of data stored either in databases, data
warehouses, or other information repositories.
Architecture of a Typical Data Mining System
• Graphical user interface
• Pattern evaluation
• Knowledge base
• Data mining engine
• Database or data warehouse server (data cleansing, data integration, filtering)
• Database / data warehouse
Cont…
• Misconception: Data mining systems can autonomously
dig out all of the valuable knowledge from a given large
database, without human intervention.
• If there were no user intervention, the system would uncover a large set of patterns that might even surpass the size of the database. Hence, user intervention is required.
• This user communication with the system is provided by
using a set of data mining primitives.
Data Mining Primitives
Data mining primitives define a data mining task, which
can be specified in the form of a data mining query.
• Task Relevant Data
• Kinds of knowledge to be mined
• Background knowledge
• Interestingness measure
• Presentation and visualization of discovered patterns
Task Relevant Data
• Data portion to be investigated.
• Attributes of interest (relevant attributes) can be
specified.
• Initial data relation
• Minable view
Example
• If a data mining task is to study associations between
items frequently purchased at AllElectronics by customers
in Canada, the task relevant data can be specified by
providing the following information:
• Name of the database or data warehouse to be used
(e.g., AllElectronics_db)
• Names of the tables or data cubes containing relevant data (e.g., item, customer, purchases and items_sold)
• Conditions for selecting the relevant data (e.g., retrieve
data pertaining to purchases made in Canada for the
current year)
• The relevant attributes or dimensions (e.g., name and
price from the item table and income and age from the
customer table)
Kind of Knowledge to be Mined
• It is important to specify the knowledge to be mined, as
this determines the data mining function to be
performed.
• Kinds of knowledge include concept description, association, classification, prediction and clustering.
• Users can also provide pattern templates, also called metapatterns, metarules, or metaqueries.
Example
A user studying the buying habits of AllElectronics customers may choose to mine association rules of the form:

P(X: customer, W) ^ Q(X, Y) => buys(X, Z)

Metarules such as the following can be specified:

age(X, “30…39”) ^ income(X, “40K…49K”) => buys(X, “VCR”) [2.2%, 60%]
occupation(X, “student”) ^ age(X, “20…29”) => buys(X, “computer”) [1.4%, 70%]
Background Knowledge
• It is the information about the domain to be mined
• Concept hierarchy: is a powerful form of background
knowledge.
• Four major types of concept hierarchies:
 schema hierarchies
 set-grouping hierarchies
 operation-derived hierarchies
 rule-based hierarchies
Concept Hierarchies (1)
• Defines a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts.
• Allows data to be mined at multiple levels of abstraction.
• These allow users to view data from different perspectives, allowing further insight into the relationships.
• Example (location)
Example
(Concept hierarchy for location)
Level 0: all
Level 1: Canada, USA
Level 2: British Columbia, Ontario (Canada); New York, Illinois (USA)
Level 3: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); New York, Buffalo (New York); Chicago (Illinois)
Concept Hierarchies (2)
• Rolling Up - Generalization of data
 Allows to view data at more meaningful and explicit
abstractions.
 Makes it easier to understand
 Compresses the data
 Would require fewer input/output operations
• Drilling Down - Specialization of data
 Concept values replaced by lower-level concepts
• There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints
• Example:
 Regional sales manager may prefer the previous concept hierarchy
but marketing manager might prefer to see location with respect to
linguistic lines in order to facilitate the distribution of commercial
ads.
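Rolling up along such a hierarchy can be sketched as a table lookup plus aggregation (the city-to-province-to-country mapping and the sales figures are illustrative assumptions):

```python
from collections import defaultdict

# Hypothetical city -> (province/state, country) hierarchy; illustrative.
hierarchy = {
    "Vancouver": ("British Columbia", "Canada"),
    "Victoria": ("British Columbia", "Canada"),
    "Toronto": ("Ontario", "Canada"),
    "Buffalo": ("New York", "USA"),
    "Chicago": ("Illinois", "USA"),
}

def roll_up_sales(sales_by_city, level):
    """Generalize city-level sales upward: level 0 = province/state,
    level 1 = country."""
    totals = defaultdict(int)
    for city, amount in sales_by_city.items():
        totals[hierarchy[city][level]] += amount
    return dict(totals)

sales = {"Vancouver": 5, "Victoria": 3, "Toronto": 7, "Buffalo": 2}
by_province = roll_up_sales(sales, 0)
by_country = roll_up_sales(sales, 1)
```

Drilling down is the inverse direction: replacing each country total by its per-province (or per-city) components.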
Schema Hierarchies
• Schema hierarchy is the total or partial order among
attributes in the database schema.
• May formally express existing semantic relationships
between attributes.
• Provides metadata information.
• Example: location hierarchy
street < city < province/state < country
Set-grouping Hierarchies
• Organizes values for a given attribute into groups or sets
or range of values.
• Total or partial order can be defined among groups.
• Used to refine or enrich schema-defined hierarchies.
• Typically used for small sets of object relationships.
• Example: Set-grouping hierarchy for age
{young, middle_aged, senior} ⊂ all(age)
{20…39} ⊂ young
{40…59} ⊂ middle_aged
{60…89} ⊂ senior
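Mapping a raw age onto such a set-grouping hierarchy is a simple range lookup (a sketch, assuming the ranges young: 20…39, middle_aged: 40…59, senior: 60…89):

```python
def age_group(age):
    """Map a raw age onto the set-grouping hierarchy; ages outside
    every range fall through to None."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    return None
```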
Operation-derived Hierarchies
• Operation-derived hierarchies are based on specified operations, which may include decoding of information-encoded strings, information extraction from complex data objects, and data clustering.
• Example: a URL or email address such as [email protected] gives login name < dept. < univ. < country
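Decoding such an address into its hierarchy levels can be sketched as follows (the address used is hypothetical, since the slide's example address is redacted):

```python
def email_hierarchy(address):
    """Decode an address of the form login@dept.univ.country into its
    hierarchy levels, most specific first."""
    login, domain = address.split("@")
    return [login] + domain.split(".")

# Hypothetical address for illustration only.
levels = email_hierarchy("dmining@cs.iitd.in")
```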
Rule-based Hierarchies
• Rule-based hierarchies occur when either the whole or a
portion of a concept hierarchy is defined as a set of rules and
is evaluated dynamically based on the current database data and
the rule definitions.
• Example: The following rules are used to categorize items as
low_profit_margin, medium_profit_margin, and high_profit_margin.
• low_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1−P2) < 50)
• medium_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1−P2) ≥ 50) ^ ((P1−P2) ≤ 250)
• high_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1−P2) > 250)
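The rule set above can be sketched as a small function; the thresholds 50 and 250 come from the rules, while the function name and example values are illustrative.

```python
def profit_margin_category(price, cost):
    """Categorize an item by profit margin using the three rules above."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    elif margin <= 250:
        return "medium_profit_margin"
    return "high_profit_margin"

print(profit_margin_category(300, 280))  # low_profit_margin (margin 20)
print(profit_margin_category(500, 100))  # high_profit_margin (margin 400)
```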
Interestingness Measure (1)
• Used to limit the number of uninteresting patterns returned by
the mining process.
• Based on the structure of patterns and the statistics underlying them.
• Each measure is associated with a threshold that can be controlled by the user.
• Patterns not meeting the threshold are not presented to the user.
• Objective measures of pattern interestingness:
 Simplicity
 certainty (confidence)
 utility (support)
 novelty
Interestingness Measure (2)
• Simplicity
A pattern's interestingness is based on its overall
simplicity for human comprehension.
Example: Rule length is a simplicity measure
• Certainty (confidence)
Assesses the validity or trustworthiness of a pattern.
Confidence is a certainty measure:
confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
A confidence of 85% for the rule buys(X, “computer”) => buys(X, “software”)
means that 85% of all customers who purchased a computer also bought
software
Interestingness Measure (3)
• Utility (support)
Measures the usefulness of a pattern:
support(A => B) = (# tuples containing both A and B) / (total # of tuples)
A support of 30% for the previous rule means that 30% of
all customers in the computer department purchased both
a computer and software.
• Association rules that satisfy both the minimum confidence
and minimum support thresholds are referred to as strong
association rules.
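As a sketch, both measures can be computed over a toy transaction set (the transactions and item names are illustrative):

```python
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

def support(transactions, itemset):
    # fraction of all transactions that contain every item in itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # fraction of transactions containing the antecedent that also contain the consequent
    both = sum((antecedent | consequent) <= t for t in transactions)
    ante = sum(antecedent <= t for t in transactions)
    return both / ante

print(support(transactions, {"computer", "software"}))       # 0.6
print(confidence(transactions, {"computer"}, {"software"}))  # 0.75
```

With a minimum support of 30% and minimum confidence of 70%, the rule computer => software would count as a strong association rule in this toy data.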
Interestingness Measure (4)
• Novelty
Patterns contributing new information to the given pattern
set are called novel patterns (example: a data exception).
Removing redundant patterns is a strategy for detecting
novelty.
Presentation and Visualization
• For data mining to be effective, data mining systems
should be able to display the discovered patterns in
multiple forms, such as rules, tables, crosstabs (crosstabulations), pie or bar charts, decision trees, cubes, or
other visual representations.
• The user must be able to specify the forms of presentation to
be used for displaying the discovered patterns.
Architectures of Data Mining System
• With the popular and diverse applications of data mining, it is
expected that a good variety of data mining systems will be
designed and developed.
• Comprehensive information processing and data analysis
will be continuously and systematically surrounded by
data warehouses and databases.
• A critical question in design is whether to integrate the data
mining system with a database (DB) or data warehouse (DW) system.
• This gives rise to four architectures:
 No coupling
 Loose coupling
 Semi-tight coupling
 Tight coupling
Cont.
• No coupling
 The DM system does not utilize any functionality of a DB or DW system
• Loose coupling
 The DM system uses some facilities of a DB or DW system, such as
storing the data in one of these systems and using them for
data retrieval
• Semi-tight coupling
 Besides linking the DM system to a DB/DW system, efficient
implementations of a few essential DM primitives are provided
• Tight coupling
 The DM system is smoothly integrated with the DB/DW system; each of
DM and DB/DW is treated as a main functional component of an
information retrieval system
Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling,
…
• Timeliness: timely update?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be
understood?
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g.,
due to faulty instruments, human or computer error, or transmission error
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 history or changes of the data not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same
class: smarter
 the most probable value: inference-based such as
Bayesian formula or decision tree
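A minimal sketch of the attribute-mean strategy, assuming missing entries are marked as None (the income values are illustrative):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values
    (one of the automatic fill-in strategies above)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

incomes = [30000, None, 50000, None, 40000]
print(fill_missing_with_mean(incomes))
# [30000, 40000.0, 50000, 40000.0, 40000]
```

The class-conditional variant works the same way, but computes one mean per class label before filling.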
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
 smooth by fitting the data into regression functions
Clustering
 detect and remove outliers
Combined computer and human inspection
 detect suspicious values and check by human (e.g.,
deal with possible outliers)
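A minimal sketch of equal-frequency binning with smoothing by bin means (the sorted price list is a common textbook example, used here for illustration):

```python
def smooth_by_bin_means(data, n_bins):
    """Sort, partition into equal-frequency bins, then replace each
    value by the mean of its bin."""
    data = sorted(data)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        # last bin absorbs any leftover values
        bin_ = data[i * size:(i + 1) * size] if i < n_bins - 1 else data[(n_bins - 1) * size:]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin boundaries would instead replace each value with the nearest of the bin's minimum and maximum.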
Data Cleaning as a Process
Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: analyze the data to discover rules and
relationships, and detect violators (e.g., use correlation and clustering
to find outliers)
Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel)
Data Integration
Data integration:
 Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
Entity identification problem:
 Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple
databases
 Object identification: The same attribute or object may
have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Correlation Analysis (Nominal Data)
Χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ² value, the more likely the variables are
related
• The cells that contribute the most to the Χ² value are
those whose actual count is very different from the
expected count
• Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction    50 (210)   1000 (840)       1050
Sum (col.)                 300         1200             1500

• Χ² (chi-square) calculation (numbers in parentheses are
expected counts calculated based on the data distribution
in the two categories):
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• It shows that like_science_fiction and play_chess are
correlated in the group
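The calculation above can be verified in a few lines of plain Python:

```python
# observed and expected counts from the contingency table, cell by cell
observed = [250, 50, 200, 1000]
expected = [90, 210, 360, 840]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 507.94 (the slide truncates this to 507.93)
```

A value this large is far beyond the critical χ² value at any common significance level for 1 degree of freedom, so the two attributes are strongly correlated.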
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product
moment coefficient):
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σᵢ aᵢbᵢ − n Ā B̄) / ((n − 1) σ_A σ_B)
 where n is the number of tuples, Ā and B̄ are the respective
means of A and B, σ_A and σ_B are the respective standard deviations
of A and B, and Σ aᵢbᵢ is the sum of the AB cross-products.
• If r(A,B) > 0, A and B are positively correlated (A’s values
increase as B’s do). The higher the value, the stronger the correlation.
• r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
(the attributes discourage each other)
Cont...
• Mean: Ā = (Σ A) / n
• Standard deviation: σ_A = sqrt( Σ (A − Ā)² / (n − 1) )
Visually Evaluating Correlation
[Figure: scatter plots showing correlations ranging from –1 to 1.]
Correlation (viewed as linear relationship)
• Correlation measures the linear relationship
between objects
• To compute correlation, we standardize data
objects, A and B, and then take their dot product
a′ₖ = (aₖ − mean(A)) / std(A)
b′ₖ = (bₖ − mean(B)) / std(B)
correlation(A, B) = A′ · B′
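A sketch of this standardize-then-dot-product view; dividing the dot product by n − 1 matches the (n − 1) factor in the sample correlation formula above:

```python
import math

def pearson(a, b):
    """Pearson correlation as a standardized dot product: r = (A' . B') / (n - 1)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / (n - 1))
    a_std = [(x - mean_a) / std_a for x in a]   # standardize A
    b_std = [(x - mean_b) / std_b for x in b]   # standardize B
    return sum(x * y for x, y in zip(a_std, b_std)) / (n - 1)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0 (perfect positive correlation)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0 (perfect negative correlation)
```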
Covariance (Numeric Data)
• Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n
Correlation coefficient: r(A,B) = Cov(A, B) / (σ_A σ_B)
 where n is the number of tuples, Ā and B̄ are the
respective mean or expected values of A and B, and σ_A and
σ_B are the respective standard deviations of A and B.
Covariance (Numeric Data)
• Positive covariance: If Cov(A,B) > 0, then A and B both
tend to be larger than their expected values.
• Negative covariance: If Cov(A,B) < 0, then if A is larger
than its expected value, B is likely to be smaller than its
expected value.
• Independence: If A and B are independent, Cov(A,B) = 0, but the converse is not true:
 Some pairs of random variables may have a
covariance of 0 but are not independent. Only under
some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of
0 imply independence
Co-Variance: An Example
It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
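The stock example can be checked with the simplified formula Cov(A, B) = E(A·B) − Ā·B̄:

```python
A = [2, 3, 5, 4, 6]    # stock A's prices over the week
B = [5, 8, 10, 11, 14] # stock B's prices over the week
n = len(A)

mean_a = sum(A) / n  # E(A) = 4
mean_b = sum(B) / n  # E(B) = 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b
print(round(cov, 2))  # 4.0 > 0, so the two prices tend to rise together
```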
Data Reduction Strategies
• Data reduction
Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or
almost the same) analytical results
• Why data reduction?
• A database/data warehouse may store terabytes of data.
• Complex data analysis may take a very long time to run on
the complete data set.
Data Reduction Strategies
Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant
attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data
Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
 Data compression
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering,
outlier analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
Mapping Data to a New Space
 Fourier transform
 Wavelet transform
[Figure: two sine waves, the same waves with added noise, and their frequency-domain representations.]
What Is Wavelet Transform?
• Decomposes a signal into
different frequency subbands
 Applicable to n-dimensional signals
• Data are transformed to
preserve relative distance
between objects at different
levels of resolution
• Allow natural clusters to
become more distinguishable
• Used for image compression
Wavelet Transformation
• Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
• Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
• Similar to the discrete Fourier transform (DFT), but better
lossy compression, localized in space
[Figure: Haar-2 and Daubechies-4 wavelets.]
• Method:
 Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
 Each transform has 2 functions: smoothing, difference
 Applies to pairs of data, resulting in two sets of data of length L/2
 Applies the two functions recursively, until reaching the desired length
Wavelet Decomposition
• Wavelets: A math tool for space-efficient hierarchical
decomposition of functions
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
Ŝ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
• Compression: many small detail coefficients can be
replaced by 0’s, and only the significant coefficients are
retained
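The transform of S above can be reproduced with a minimal recursive Haar decomposition (pairwise averages for smoothing, pairwise half-differences for detail):

```python
def haar_decompose(signal):
    """Recursive Haar wavelet decomposition: at each level keep pairwise
    averages (smoothing) and pairwise half-differences (detail),
    until a single overall average remains."""
    detail_coeffs = []
    s = list(signal)
    while len(s) > 1:
        averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        detail_coeffs = details + detail_coeffs  # coarser details go first
        s = averages
    return s + detail_coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Zeroing the small detail coefficients and inverting the transform gives the lossy compressed approximation described above.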
Why Wavelet Transform?
• Use hat-shape filters
 Emphasize region where points cluster
 Suppress weaker information in their boundaries
• Effective removal of outliers
 Insensitive to noise, insensitive to input order
• Multi-resolution
 Detect arbitrary shaped clusters at different scales
• Efficient
 Complexity O(N)
• Only applicable to low dimensional data
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation
in data
• The original data are projected onto a much smaller space,
resulting in dimensionality reduction. We find the eigenvectors
of the covariance matrix, and these eigenvectors define the
new space.
[Figure: data points in the (x1, x2) plane with principal axis e.]
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
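The steps above can be sketched with NumPy (assuming NumPy is available; the data points are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project n-dimensional data onto its k strongest principal components."""
    X = X - X.mean(axis=0)                  # normalize: center each attribute
    cov = np.cov(X, rowvar=False)           # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort components by decreasing variance
    components = eigvecs[:, order[:k]]      # keep the k strongest components
    return X @ components                   # each row is a linear combination of them

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
Z = pca(X, 1)      # reduce 2-D data to 1-D
print(Z.shape)     # (8, 1)
```

Reconstructing `Z @ components.T + X.mean(axis=0)` would give the approximation of the original data mentioned in the last step.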
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
 Duplicate much or all of the information contained in one
or more other attributes
 E.g., purchase price of a product and the amount of sales
tax paid
• Irrelevant attributes
 Contain no information that is useful for the data mining
task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Heuristic Search in Attribute Selection
• There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
 Best single attribute under the attribute independence
assumption: choose by significance tests
 Best step-wise feature selection:
The best single attribute is picked first
Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:
Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
 Optimal branch and bound:
Use attribute elimination and backtracking
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the
important information in a data set more effectively than the
original ones
• Three general methodologies
 Attribute extraction
 Domain-specific
 Mapping data to new space (see: data reduction)
E.g., Fourier transformation, wavelet transformation,
manifold approaches (not covered)
 Attribute construction
Combining features (see: discriminative frequent patterns
in Chapter 7)
Data discretization
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods (e.g., regression)
 Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
 Ex.: Log-linear models: obtain the value at a point in
m-D space as the product of values on appropriate marginal
subspaces
• Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling, …
Parametric Data Reduction: Regression and
Log-Linear Models
• Linear regression
 Data modeled to fit a straight line
 Often uses the least-square method to fit the line
• Multiple regression
 Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
• Log-linear model
 Approximates discrete multidimensional probability
distributions
Regression Analysis
• Regression analysis: a collective name for
techniques for the modeling and analysis of
numerical data consisting of values of a
dependent variable (also called response
variable or measurement) and of one or
more independent variables (aka.
explanatory variables or predictors)
• The parameters are estimated so as to
give a "best fit" of the data
• Most commonly the best fit is evaluated by
using the least squares method, but other
criteria have also been used
• Used for prediction (including forecasting of
time-series data), inference, hypothesis
testing, and modeling of causal relationships
[Figure: data points with fitted line y = x + 1; X1 is mapped to the prediction Y1′.]
Regress Analysis and Log-Linear Models
Linear regression: Y = w X + b
 Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
 Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above
Log-linear models:
 Approximate discrete multidimensional probability distributions
 Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
 Useful for dimensionality reduction and data smoothing
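A minimal least-squares estimate of the two regression coefficients w and b in Y = wX + b, using the closed-form solution:

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b in Y = w*X + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope: covariance of X and Y divided by variance of X
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x   # intercept passes through the means
    return w, b

# Points lying exactly on y = x + 1 recover w = 1, b = 1
w, b = fit_line([1, 2, 3, 4], [2, 3, 4, 5])
print(w, b)  # 1.0 1.0
```

For numerosity reduction, only w and b need to be stored; the original (x, y) tuples can be discarded apart from possible outliers.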
Histogram Analysis
• Divide data into buckets and store the average (sum) for each
bucket
• Partitioning rules:
 Equal-width: equal bucket range
 Equal-frequency (or equal-depth): each bucket holds about the same number of tuples
[Figure: equal-width histogram over values 10,000 to 90,000.]
Clustering
• Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multidimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
Sampling
• Sampling: obtaining a small sample s to represent the
whole data set N
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Key principle: choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling
• Note: Sampling may not reduce database I/Os (page at a
time)
Types of Sampling
• Simple random sampling
 There is an equal probability of selecting any particular
item
• Sampling without replacement
 Once an object is selected, it is removed from the
population
• Sampling with replacement
 A selected object is not removed from the population
• Stratified sampling:
 Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
 Used in conjunction with skewed data
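The three schemes can be sketched with Python's random module (the population and strata here are illustrative; the seed just makes the run reproducible):

```python
import random

random.seed(42)
population = list(range(100))

# Simple random sampling without replacement: selected items leave the population
srswor = random.sample(population, 10)

# Simple random sampling with replacement: an item may be drawn more than once
srswr = [random.choice(population) for _ in range(10)]

# Stratified sampling: draw proportionally (here ~10%) from each partition
strata = {"young": list(range(30)), "middle_aged": list(range(30, 80)),
          "senior": list(range(80, 100))}
stratified = {name: random.sample(items, max(1, len(items) // 10))
              for name, items in strata.items()}

print(len(srswor), len(srswr))                      # 10 10
print({k: len(v) for k, v in stratified.items()})   # {'young': 3, 'middle_aged': 5, 'senior': 2}
```

Because each stratum is sampled proportionally, a skewed group like `senior` is still represented, which is exactly where simple random sampling tends to fail.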
Sampling: With or Without Replacement
[Figure: raw data with samples drawn without (SRSWOR) and with (SRSWR) replacement.]
Sampling: Cluster or Stratified Sampling
[Figure: raw data and its cluster/stratified sample.]
Data Cube Aggregation
• The lowest level of a data cube (base cuboid)
 The aggregated data for an individual entity of interest
 E.g., a customer in a phone calling data warehouse
• Multiple levels of aggregation in data cubes
 Further reduce the size of the data to deal with
• Reference appropriate levels
 Use the smallest representation which is enough to
solve the task
• Queries regarding aggregated information should be
answered using data cube, when possible
Data Reduction 3: Data Compression
String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless, but only limited manipulation is
possible without expansion
Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequence is not audio
 Typically short and varies slowly with time
Dimensionality and numerosity reduction may also be
considered as forms of data compression
Data Compression
[Figure: Original Data reduced to Compressed Data; lossless compression recovers the Original Data exactly, while lossy compression yields only an approximated Original Data]
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified with
one of the new values
• Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: Concept hierarchy climbing
Normalization
Min-max normalization: to [new_minA, new_maxA]
  v' = (v - minA) / (maxA - minA) × (new_maxA - new_minA) + new_minA
 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
  v' = (v - μA) / σA
 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
  v' = v / 10^j
 Where j is the smallest integer such that Max(|v'|) < 1
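The three normalization formulas can be checked directly in Python; the small functions below (names are my own) reproduce the slide's worked income examples.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Map v linearly from [min_a, max_a] into [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    # Express v as a number of standard deviations from the mean.
    return (v - mu) / sigma

def decimal_scaling(values):
    # Divide by the smallest power of 10 that pulls every |v'| below 1.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For instance `min_max(73600, 12000, 98000)` gives roughly 0.716 and `z_score(73600, 54000, 16000)` gives 1.225, matching the examples above.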
Discretization
• Three types of attributes
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic
rank
 Numeric—numbers, e.g., integer or real values
• Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied recursively
 Binning (top-down split, unsupervised)
 Histogram analysis (top-down split, unsupervised)
 Clustering analysis (unsupervised, top-down split or
bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., χ2) analysis (unsupervised, bottom-up
merge)
Simple Discretization: Binning
• Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well
• Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately
same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
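A short Python sketch of the equal-frequency binning and smoothing steps above; it reproduces the three bins and both smoothed versions from the price example (it assumes, as the example does, that the value count divides evenly into the bins).

```python
def equal_frequency_bins(values, n_bins):
    # Sort, then cut into bins of equal size (assumes an even division).
    xs = sorted(values)
    size = len(xs) // n_bins
    return [xs[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Every value in a bin is replaced by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Each value moves to the nearer of its bin's min and max.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out
```

Running these on the sorted prices 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 yields exactly the bins and smoothed bins shown above.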
Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data discretized by equal-frequency binning and by equal-interval-width binning; K-means clustering leads to better results]
Discretization by Classification & Correlation
Analysis
• Classification (e.g., decision tree analysis)
 Supervised: Given class labels, e.g., cancerous vs. benign
 Using entropy to determine split point (discretization point)
 Top-down, recursive split
 Details to be covered in Chapter 7
• Correlation analysis (e.g., Chi-merge: χ2-based discretization)
 Supervised: use class information
 Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to merge
 Merge performed recursively, until a predefined stopping condition
Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
• Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
 street < city < state < country
• Specification of a hierarchy for a set of values by explicit
data grouping
 {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
 E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated
based on the analysis of the number of distinct values
per attribute in the data set
 The attribute with the most distinct values is placed
at the lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year
country (15 distinct values)
 province_or_state (365 distinct values)
  city (3,567 distinct values)
   street (674,339 distinct values)
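The distinct-value heuristic can be sketched as follows (the toy address table and function name are my own; real data would be far larger). Sorting attributes by ascending distinct-value count recovers the country < state < city < street ordering.

```python
def auto_hierarchy(table, attrs):
    # Heuristic from the slide: the attribute with the fewest distinct
    # values goes to the top of the hierarchy, the most to the bottom.
    counts = {a: len({row[a] for row in table}) for a in attrs}
    return sorted(attrs, key=lambda a: counts[a])

# Toy address table (illustrative values).
addresses = [
    {"street": "12 Oak St", "city": "Urbana",    "state": "IL", "country": "USA"},
    {"street": "1 King St", "city": "Urbana",    "state": "IL", "country": "USA"},
    {"street": "3 Elm Ave", "city": "Chicago",   "state": "IL", "country": "USA"},
    {"street": "4 Main St", "city": "Palo Alto", "state": "CA", "country": "USA"},
    {"street": "9 Pine Rd", "city": "Toronto",   "state": "ON", "country": "Canada"},
    {"street": "7 Lake Dr", "city": "Ottawa",    "state": "ON", "country": "Canada"},
]
```

As the slide notes, the heuristic has exceptions (weekday has only 7 distinct values but does not belong above month), so the result should be reviewed by a domain expert.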
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five-number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
 Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
  s² = 1/(n−1) · Σᵢ₌₁ⁿ (xᵢ − x̄)² = 1/(n−1) · [ Σᵢ₌₁ⁿ xᵢ² − (1/n)(Σᵢ₌₁ⁿ xᵢ)² ]
  σ² = 1/N · Σᵢ₌₁ⁿ (xᵢ − μ)² = 1/N · Σᵢ₌₁ⁿ xᵢ² − μ²
 Standard deviation s (or σ) is the square root of variance s² (or σ²)
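A Python sketch of the dispersion measures above (all names are my own; the percentile rule here is a simple nearest-rank variant, so the quartiles are approximate rather than interpolated):

```python
def five_number_summary(xs):
    xs = sorted(xs)
    def pct(p):
        # Nearest-rank percentile; production code would interpolate.
        return xs[min(len(xs) - 1, int(p * len(xs)))]
    return xs[0], pct(0.25), pct(0.5), pct(0.75), xs[-1]

def iqr_outliers(xs):
    # Flag values more than 1.5 * IQR beyond the quartiles.
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

def sample_variance(xs):
    # Scalable one-pass form: (sum(x^2) - (sum x)^2 / n) / (n - 1)
    n = len(xs)
    s, sq = sum(xs), sum(x * x for x in xs)
    return (sq - s * s / n) / (n - 1)
```

The one-pass variance form matters for large data: it needs only running sums of x and x², so it can be computed in a single scan without first knowing the mean.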
Conclusions
• Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
 Entity identification problem
 Remove redundancies
 Detect inconsistencies
• Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
• Data transformation and data discretization
 Normalization
 Concept hierarchy generation
Review Questions
Objective Questions:
1) The types of information that can be garnered from data mining
include:
a) sequences, classifications, and clusters.
b) model-driven and data-driven.
c) associations and forecasts.
d) a and c.
e) a, b and c.
2) The term “associations” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
Review Questions Cont...
3)DSS assist management by combining ________ into a single
powerful system to support unstructured decision-making.
a) hardware and the Internet
b) data, analytical models and tools, and user-friendly software
c) analytical models and tools and data from the Internet
d) group decision processes and electronics
e) data and people
4)DSS, GDSS, and ESS are part of a special category of information
systems that are explicitly designed to:
a) make decisions for managers.
b) enhance Web performance.
c) gather data and build data warehouses.
d) enhance managerial decision-making.
e) interpret data for management.
Review Questions Cont...
5)The term “sequences” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
6) The earliest DSS tended to:
a) rely on Internet data.
b) draw on small subsets of corporate data.
c) be heavily model-driven.
d) b and c.
e) a and c.
Review Questions Cont...
7)The term “classifications” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
8) Model-driven DSS:
a) analyze large pools of data.
b) are an outgrowth of data mining.
c) use TPS and OLAP.
d) begin with a given group of data and change variables.
e) use events linked over time.
Review Questions Cont...
9)The term “forecasting” is associated with:
a) Occurrences linked to a single event.
b) Classifications when no groups have been defined.
c) Pattern recognition describing the group to which an item belongs.
d) A series of existing values used to predict other values.
e) Events linked over time.
10) A goal of data mining includes which of the following?
a) To explain some observed event or condition
b) To confirm that data exists
c) To analyze data for expected relationships
d) To create a new data warehouse
e) None of these
Review Questions Cont...
Short answer type Questions
1. Define data mining in two or three sentences
2. How is data mining different from OLAP?
3. Is the data warehouse prerequisite for data mining? Does the data
warehouse help data mining? If so, in what ways?
4. Name the three common problems of link analysis technique?
5. What is market basket analysis? Give two examples of this application
in business.
6. Give three broad reasons why you think data mining is being used in
today’s businesses.
7. What business problems can data mining help solve?
8. What is Predictive Analytics?
9. What is the difference between data mining and online analytical
processing (OLAP)?
10. State various benefits of Data mining.
Review Questions Cont...
Long answer type Questions
1. Describe how decision trees work. Explain with the help of an example.
2. What do you mean by KDD? Explain all the steps of KDD in detail.
3. What are the basic principles of genetic algorithms? Use the example
to describe how this technique works
4. Describe cluster detection technique?
5. Discuss Data mining Application in the field of Banking and finance.
6. Do neural networks and genetic algorithms have anything in common?
Point out differences.
7. How does the memory-based reasoning technique work? What is the
underlying principle?
8. Explain Neural Network in detail?
9. What are the golden rules for data mining?
10. Discuss Data mining Application in the field of Retail Industry.
Suggested Reading/References
1. Kamber and Han, “Data Mining Concepts and Techniques”, Harcourt India P. Ltd., 2001
2. Paul Raj Poonia, “Fundamentals of Data Warehousing”, John Wiley & Sons, 2003
3. Sam Anahory, “Data Warehousing in the real world: A practical guide for building decision support systems”, John Wiley, 2004
4. W. H. Inmon, “Building the Operational Data Store”, 2nd Ed., John Wiley, 1999
5. E. Alpaydin, Introduction to Machine Learning, 2nd ed., MIT Press, 2011
6. S. Chakrabarti, Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data, Morgan Kaufmann, 2002
Suggested Reading/References
7. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000
8. T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning, John Wiley & Sons, 2003
9. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
10. U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
11. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2nd ed., 2006 (3rd ed. 2011)
References
12. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and Prediction,
2nd ed., Springer-Verlag, 2009.
13. B. Liu, Web Data Mining, Springer 2006.
14. T. M. Mitchell, Machine Learning, McGraw Hill, 1997
15. P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data
Mining, Wiley, 2005
16. S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan
Kaufmann, 1998
17. I. H. Witten and E. Frank,
Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations,
Morgan Kaufmann, 2nd ed. 2005
References
18. D. P. Ballou and G. K. Tayi. Enhancing data quality in data
warehouse environments. Comm. of ACM, 42:73-78, 1999
19. T. Dasu and T. Johnson. Exploratory Data Mining and Data
Cleaning. John Wiley, 2003
20. T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining
Database Structure; Or, How to Build a Data Quality Browser.
SIGMOD’02
21. H. V. Jagadish et al., Special Issue on Data Reduction
Techniques. Bulletin of the Technical Committee on Data
Engineering, 20(4), Dec. 1997
22. D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann,
1999
23. E. Rahm and H. H. Do. Data Cleaning: Problems and Current
Approaches. IEEE Bulletin of the Technical Committee on Data
Engineering. Vol.23, No.4.
References
24. V. Raman and J. Hellerstein. Potter’s Wheel: An Interactive
Framework for Data Cleaning and Transformation, VLDB’2001
25. T. Redman. Data Quality: Management and Technology.
Bantam Books, 1992
26. R. Wang, V. Storey, and C. Firth. A framework for analysis of
data quality research. IEEE Trans. Knowledge and Data
Engineering, 7:623-640, 1995
Data Mining Techniques
Learning Objective
• Data Mining Query Language
• Major Data Mining Techniques and Benefits
• Data Mining Applications
Data Mining Query Language
• There are two powerful tools:
• Database Management Systems
• Efficient and effective data mining algorithms and
frameworks
• Generally, this work asks:
 “How can we merge the two?”
 “How can we integrate data mining more closely with
traditional database systems, particularly querying?”
 The answer lies in Data Mining Query Language
(DMQL).
Data Mining Query Languages
• Data mining language must be designed to facilitate
flexible and effective knowledge discovery.
• Having a query language for data mining may help
standardize the development of platforms for data mining
systems.
• But designing a language is challenging because data
mining covers a wide spectrum of tasks, and each task has
different requirements.
• Hence, the design of a language requires a deep
understanding of the limitations and underlying
mechanisms of the various kinds of tasks.
Cont…
• So… how would you design an efficient query language?
• Based on the primitives discussed earlier.
• DMQL allows mining of different kinds of knowledge from
relational databases and data warehouses at multiple
levels of abstraction.
Cont…
• DMQL commands specify the following:
• The set of data relevant to the data mining task (the training set)
• The kinds of knowledge to be discovered
• Generalized relation
• Characteristic rules
• Discriminant rules
• Classification rules
• Association rules
DMQL
• Adopts SQL-like syntax
• Hence, can be easily integrated with relational query
languages
• Defined in BNF grammar
• [ ] represents 0 or one occurrence
• { } represents 0 or more occurrences
• Words in sans serif represent keywords
DMQL Syntax
• DMQL syntax for task-relevant data specification
• Names of the relevant database or data warehouse,
conditions and relevant attributes or dimensions must be
specified
• use database ‹database_name› or use data warehouse
‹data_warehouse_name›
• from ‹relation(s)/cube(s)› [where condition]
• in relevance to ‹attribute_or_dimension_list›
• order by ‹order_list›
• group by ‹grouping_list›
• having ‹condition›
Example
Syntax for Kind of Knowledge to be Mined
Characterization
‹Mine_Knowledge_Specification› ::=
mine characteristics [as ‹pattern_name›]
analyze ‹measure(s)›
Example:
• mine characteristics as customerPurchasing analyze count%
Discrimination
‹Mine_Knowledge_Specification› ::=
mine comparison [as ‹ pattern_name›]
for ‹target_class› where ‹target_condition›
{versus ‹contrast_class_i where ‹contrast_condition_i›}
analyze ‹measure(s)›
• Example:
 mine comparison as purchaseGroups
  for bigspenders where avg(I.price) >= $100
  versus budgetspenders where avg(I.price) < $100
  analyze count
Syntax for Kind of Knowledge to be Mined (2)
• Association:
‹Mine_Knowledge_Specification› ::=
mine associations [as ‹pattern_name›]
[matching ‹metapattern›]
• Example: mine associations as buyingHabits
 matching P(X: customer, W) ^ Q(X,Y) => buys (X,Z)
• Classification:
‹Mine_Knowledge_Specification› ::=
mine classification [as ‹pattern_name›]
analyze ‹classifying_attribute_or_dimension›
• Example: mine classification as classifyCustomerCreditRating
 analyze credit_rating
Syntax for Concept Hierarchy Specification
• More than one concept hierarchy per attribute can be specified
• Use hierarchy ‹hierarchy_name› for ‹attribute_or_dimension›
• Examples:
• Schema concept hierarchy (ordering is important)
 define hierarchy location_hierarchy on address as
 [street, city, province_or_state, country]
• Set-grouping concept hierarchy
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Syntax for Concept Hierarchy Specification (2)
operation-derived concept hierarchy
 define hierarchy age_hierarchy for age on customer as
 {age_category(1), ..., age_category(5)} := cluster (default,
age, 5) < all(age)
rule-based concept hierarchy
 define hierarchy profit_margin_hierarchy on item as
  level_1: low_profit_margin < level_0: all
   if (price - cost) < $50
  level_1: medium_profit_margin < level_0: all
   if ((price - cost) > $50) and ((price - cost) <= $250)
  level_1: high_profit_margin < level_0: all
   if (price - cost) > $250
Syntax For Interestingness Measure
Specification
• with [‹interest_measure_name›] threshold =
‹threshold_value›
• Example:
with support threshold = 5%
with confidence threshold = 70%
Syntax for Pattern Presentation and
Visualization Specification
• display as ‹result_form›
• The result form can be rules, tables, cubes, crosstabs, pie or bar
charts, decision trees, curves or surfaces.
• To facilitate interactive viewing at different concept levels or different
angles, the following syntax is defined:
‹Multilevel_Manipulation› ::= roll up on ‹attribute_or_dimension›
 | drill down on ‹attribute_or_dimension›
 | add ‹attribute_or_dimension›
 | drop ‹attribute_or_dimension›
Major Data Mining Techniques
 Cluster Detection
 Decision Trees
 Memory-Based Reasoning
 Link Analysis
 Neural Networks
 Genetic Algorithms
Cluster Detection
• Cluster means forming groups.
• Clustering helps you take specific and proper action for the
individual pieces that make up the cluster.
• The algorithm searches for groups or clusters of data elements
that are similar to one another. This is because similar customers
or similar products are expected to behave in the same way.
• It is not always easy to discern the meaning of every cluster the
data mining algorithm formed. If there are two or three dimensions
or variables, it is fairly easy to spot the clusters. But while dealing
with 500 variables from 100,000 records, a special tool is needed.
Cluster Detection
If there are two variables, then points in a 2-D graph represent the values of sets of these two variables.
[Figure: scatter plot of customers, x-axis “Total value to the enterprise”, y-axis “Number of years as customer”]
Cluster Detection
•But if we want the algorithm to use 50 different variables for each customer,
we’ll have to have a point in 50-dimensional space.
•Suppose that the number of clusters or groups is 15. So, for the K-means
clustering algorithm, we’ll set K=15.
•15 initial records (“seeds”) are chosen as the first set of centroids based on best
guesses.
•In the next step, the algorithm assigns each customer record in the database to a
cluster based on the seed to which it is the closest. Now, we have the first set of 15
clusters. The value of the cluster is taken to be the values of the 50 variables in
each centroid.
•In the next iteration, each customer record is re-matched with the new sets of
centroids and cluster boundaries are redrawn.
•After a few iterations, the final clusters emerge.
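The iteration described above is essentially K-means. A compact sketch (with toy 2-D points and K=2 standing in for the 50 variables and K=15 of the example; all names are my own):

```python
import math
import random

def kmeans(records, k, iters=10, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k seed records as the initial centroids.
    centroids = rng.sample(records, k)
    for _ in range(iters):
        # Step 2: assign every record to its nearest centroid,
        # using Euclidean distance as the comparison method.
        clusters = [[] for _ in range(k)]
        for r in records:
            i = min(range(k), key=lambda c: math.dist(r, centroids[c]))
            clusters[i].append(r)
        # Step 3: recompute each centroid as the mean of its cluster,
        # i.e. redraw the cluster boundaries for the next pass.
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters
```

After a few passes the assignments stop changing, which is the “final clusters emerge” step of the slide.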
Cluster Detection
[Figure: (1) Initial cluster boundaries based on initial seeds. (2) Centroids of new clusters calculated. (3) Cluster boundaries redrawn at each iteration. Markers distinguish initial seeds from calculated centroids.]
Cluster Detection
• How does the algorithm redraw the cluster boundaries?
• What factors determine that one customer record is near one
centroid and not the other?
• Each implementation of the cluster detection algorithms adopts a
method for comparing the values of the variables in individual
records with those in the centroids.
• The algorithm uses these comparisons to calculate the distances
of individual customer records from the centroids. After
calculating the distances, the algorithm redraws the cluster
boundaries.
Decision Trees
• This technique applies to classification and prediction.
• By following a tree, we can decipher the rules and understand
why a record is classified in a certain way.
• A decision tree represents a series of questions. Each question
determines what follow-up question is best to be asked.
• The question at the root must be the one that best
differentiates among the target classes. The leaf node
determines the classification of the record.
• A tree showing a high level of correctness is more effective.
• Also, attention must be paid to the branches. Some paths are
better than others because the rules are better. By pruning the
incompetent branches, you can enhance the predictive
effectiveness of the whole tree.
Decision Trees
• How do the decision tree algorithms build the trees?
• First, the algorithm attempts to find the test that will split the
records in the best possible manner among the wanted
classifications.
• At each lower level node from the root, whatever rule works best
to split the subset is applied. This process of finding each
additional level of the tree continues.
• The tree is allowed to grow until you cannot find better ways to
split input records.
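The “find the best split” step can be sketched using entropy as the splitting criterion (as mentioned on the earlier classification-analysis slide). The data, attribute names, and function names below are illustrative, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    # Impurity of a set of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, attr_count):
    # Try every attribute/value test and keep the one with the highest
    # information gain (the biggest drop in class entropy).
    base = entropy(labels)
    best = (0.0, None, None)  # (gain, attribute index, test value)
    for a in range(attr_count):
        for v in {r[a] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[a] == v]
            right = [l for r, l in zip(rows, labels) if r[a] != v]
            if not left or not right:
                continue
            rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
            if base - rem > best[0]:
                best = (base - rem, a, v)
    return best
```

A tree algorithm applies this search at the root, then recursively to each resulting subset, stopping when no split improves the classification, which is exactly the growth process described above.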
Memory-Based Reasoning
MBR uses known instances of a model to predict unknown
instances.
This data mining technique maintains a dataset of known
records. The algorithm knows the characteristics of the
records in this training dataset.
When a new record arrives at the data mining tool, first the
tool calculates the “distance” between this record and the
records in the training dataset using its distance function.
The results determine which data records in the training
dataset qualify to be considered as neighbours to the
incoming data records.
Next, the algorithm uses a combination function to combine
the results of the various distance functions to obtain the final
answer.
Memory-Based Reasoning
• For solving a data mining problem using MBR, we are
concerned with three critical issues:
• Selecting the most suitable historical records to form the
training dataset.
• Establishing the best way to compose the historical record.
• Determining the two essential functions, namely, the distance
function and the combination function.
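Both essential functions can be shown in a small nearest-neighbour sketch. This is illustrative Python with an invented training dataset: Euclidean distance as the distance function and a majority vote as the combination function.

```python
import math
from collections import Counter

# Training dataset of known records: (age, income in thousands) -> class
training = [
    ((25, 30), "standard"),
    ((30, 45), "standard"),
    ((45, 80), "gold"),
    ((50, 95), "gold"),
]

def distance(a, b):
    """Distance function: Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(new_record, k=3):
    """Combination function: majority vote among the k nearest neighbours."""
    neighbours = sorted(training, key=lambda rec: distance(rec[0], new_record))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(classify((48, 85)))  # the nearest known records are mostly "gold"
```

Choosing how records are composed (which fields, how they are scaled) changes the distances and therefore the neighbours, which is why record composition is listed above as a critical issue.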
Link Analysis
• This algorithm is extremely useful for finding patterns from
relationships.
• The link analysis technique mines relationships and discovers
knowledge.
• For example, if the fast food restaurant owner in the case study
applies the link analysis technique to mine data from the data
warehouse, he might find that in more than 80% of the cases,
customers order a soft drink if they order a pizza. The restaurant
owner can then analyse the link between the two products and
promote them together.
• Depending upon the types of knowledge discovery, link analysis
techniques have three types of applications: associations
discovery, sequential pattern discovery and similar time
sequence discovery.
Link Analysis
Associations Discovery:
• These algorithms find combinations where the presence of one item suggests
the presence of another.
• When we apply these algorithms to the daily sales of the fast food restaurant,
they will uncover affinities among menu items that are likely
to be ordered together.
• Example rule (from the slide diagram): the rule body is "whenever the
customer orders a pizza"; the rule head is "the customer also orders a soft
drink", which holds in 65% of the cases (the confidence factor); the
pizza-and-soft-drink combination occurs in 20% of all orders (the support
factor).
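Support and confidence factors can be computed directly from order data. A minimal sketch in illustrative Python, with invented orders (so the percentages here differ from the 65%/20% in the diagram):

```python
# Each order is the set of items bought together in one visit
orders = [
    {"pizza", "soft drink"},
    {"pizza", "soft drink"},
    {"pizza", "garlic bread"},
    {"burger", "soft drink"},
    {"burger", "fries"},
    {"pizza", "soft drink", "fries"},
    {"salad"},
    {"fries", "soft drink"},
    {"pizza", "soft drink"},
    {"burger"},
]

def support(itemset):
    """Support factor: fraction of all orders containing every item."""
    return sum(1 for o in orders if itemset <= o) / len(orders)

def confidence(body, head):
    """Confidence factor: how often the head appears given the body appears."""
    return support(body | head) / support(body)

print(support({"pizza", "soft drink"}))       # 4 of 10 orders -> 0.4
print(confidence({"pizza"}, {"soft drink"}))  # 4 of 5 pizza orders -> 0.8
```

The 0.8 confidence mirrors the case-study finding that customers order a soft drink with a pizza in more than 80% of cases.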
Link Analysis
Sequential Pattern Discovery:
These algorithms discover patterns where one set of items follows another
specific set. Time plays a role in these patterns. When we select records for
analysis, we must have date and time as data items to enable discovery of
sequential patterns.
For example, consider the transaction data file given below:

SALE DATE    NAME OF CUSTOMER    PRODUCTS PURCHASED
15/11/2000   ABC                 Desktop PC, MP3 Player
15/11/2000   DEF                 Desktop PC, MP3 Player, Digital Camera
15/11/2000   EFG                 Laptop PC
19/12/2000   GHI                 Laptop PC
19/12/2000   ABC                 Digital Camera
19/12/2000   GHI                 Digital Camera
19/12/2000   EFG                 Digital Camera
20/12/2000   DEF                 Tape Backup Drive
20/12/2000   XYZ                 Desktop PC, MP3 Player
Link Analysis
Sequential Patterns -- Customer Sequences

NAME OF CUSTOMER    PRODUCT SEQUENCE FOR CUSTOMER
ABC                 Desktop PC, MP3 Player, Digital Camera
DEF                 Desktop PC, MP3 Player, Digital Camera, Tape Backup Drive
EFG                 Laptop PC, Digital Camera
GHI                 Laptop PC, Digital Camera
XYZ                 Desktop PC, MP3 Player

Sequential Patterns (Support Factor >60%)      Supporting Customers
Desktop PC, MP3 Player                         ABC, DEF, XYZ

Sequential Patterns (Support Factor >40%)      Supporting Customers
Desktop PC, MP3 Player, Digital Camera         ABC, DEF
Laptop PC, Digital Camera                      EFG, GHI
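The customer-sequence table above can be mined with a short subsequence check. A minimal sketch in illustrative Python: a pattern is supported by a customer if it occurs in that customer's purchase sequence in order (not necessarily contiguously).

```python
# Product sequence per customer, from the customer-sequence table
sequences = {
    "ABC": ["Desktop PC", "MP3 Player", "Digital Camera"],
    "DEF": ["Desktop PC", "MP3 Player", "Digital Camera", "Tape Backup Drive"],
    "EFG": ["Laptop PC", "Digital Camera"],
    "GHI": ["Laptop PC", "Digital Camera"],
    "XYZ": ["Desktop PC", "MP3 Player"],
}

def contains(seq, pattern):
    """True if pattern occurs in seq as an in-order subsequence."""
    it = iter(seq)                       # 'item in it' consumes the iterator,
    return all(item in it for item in pattern)  # so order is enforced

def supporters(pattern):
    """Customers whose purchase sequence supports the pattern."""
    return sorted(c for c, seq in sequences.items() if contains(seq, pattern))

pat = ["Desktop PC", "MP3 Player"]
sup = supporters(pat)
print(sup, len(sup) / len(sequences))  # ['ABC', 'DEF', 'XYZ'] with support 0.6
```

This reproduces the table: the pattern (Desktop PC, MP3 Player) is supported by ABC, DEF and XYZ, a support factor of 60%.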
Link Analysis
Typical discoveries include associations of the following types:
• Purchase of a digital camera is followed by purchase of a colour printer 60%
of the time
• Purchase of a desktop is followed by purchase of a tape backup drive 65% of
the time
Similar Time Sequence Discovery:
• This technique depends on the availability of time sequences.
• The results of the previous technique indicate sequential events over time.
This technique finds a sequence of events and then comes up with other
similar sequences of events.
Neural Networks
“A type of artificial intelligence that attempts to
imitate the way a human brain works”
Cont…
 Neural networks resemble the human brain in the following
two ways:
 A neural network acquires knowledge through learning.
 A neural network's knowledge is stored within inter-neuron
connection strengths known as synaptic weights.
Basic Neural Network Structure
Cont…
 Input Layer: Consists of neurons that receive input from
external environment
 Output Layer: Consists of neurons that communicate to
the user or external environment
 Hidden Layer: Consists of neurons that only communicate
with other layers of the network
Neural Network Models
• Supervised: network given facts about various cases along
with expected outputs
• Unsupervised: network receives only inputs and no
expected outputs
Data Mining Process Based on Neural Networks
Data Preparation
• Data Cleansing
• Data Option
• Data Processing
• Data Expression
Rule Extraction
• Extraction of hidden predictive information from large
databases
• Some rule-extraction methods: the LRE method, the Black Box
method, etc.
Rule Assessment
• Process of extracting and collecting evidence and making
judgments
• Tells how well a rule can achieve the intended output
Implementing Neural Networks Using MATLAB
MATLAB
• Matrix Laboratory
• High-level technical computing language
• Programming environment for algorithm development, data
analysis, visualization, and numerical computation
MATLAB Applications
• Signal and image processing
• Communications
• Control design
• Test and measurement
• Financial modeling and analysis
• Computational biology
• Neural networks
Neural Network Process Using MATLAB
Some Important Terms
• Training Function
• Adaption Learning Function
• Performance Function
• Transfer Function
• Network Simulation
• Feed-forward neural networks
• Back-propagation
Training Function
 Mathematical procedures used to automatically adjust the
network's weights and biases
 Some backpropagation training functions:
 TRAINLM
 TRAINOSS
 TRAINGDX
 TRAINBFG etc…
Adaption Learning Function
 Used for learning. It can be applied to individual weights
and biases within a network.
 Some functions:
 LEARNGDM
 LEARNHD
 LEARNPN
 LEARNSOM etc…
Performance Function
 Used for comparing the observed and inferred outputs for a
data sample
 Some of the functions:
 MAE: Mean absolute error performance function
 MSE: Mean squared normalized error performance function
 SSE: Sum squared error performance function
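The three error measures are straightforward to state in code. A minimal sketch in illustrative Python (plain formulas, not the Toolbox implementations), with invented target and output values:

```python
def mae(targets, outputs):
    """Mean absolute error."""
    return sum(abs(t - o) for t, o in zip(targets, outputs)) / len(targets)

def mse(targets, outputs):
    """Mean squared error."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def sse(targets, outputs):
    """Sum squared error."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

t = [0, 1, 1, 0]           # desired (observed) outputs
o = [0.1, 0.8, 0.9, 0.2]   # network (inferred) outputs
print(mae(t, o), mse(t, o), sse(t, o))  # approx. 0.15, 0.025, 0.1
```

Note SSE is simply MSE before dividing by the number of samples, which is why a small SSE goal (as set in the training walkthrough below) drives the fit very tight.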
Transfer Function
 Used to describe the system with all input-output pairs.
 Calculate a layer's output from its net input.
 Some functions:
 TANSIG
 LOGSIG
 PURELIN etc…
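The three transfer functions have simple closed forms; TANSIG is mathematically equivalent to the hyperbolic tangent. A minimal sketch in illustrative Python:

```python
import math

def tansig(n):
    """Hyperbolic tangent sigmoid: output in (-1, 1)."""
    return math.tanh(n)

def logsig(n):
    """Log-sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-n))

def purelin(n):
    """Linear transfer function: output equals the net input."""
    return n

print(tansig(0.0), logsig(0.0), purelin(2.5))  # 0.0 0.5 2.5
```

The choice matters: TANSIG in both layers (as in the XOR walkthrough below) lets the network output values near -1 and 1, while LOGSIG restricts outputs to (0, 1).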
Network Simulation
 Way of testing on the network to see if it meets our
expectations
Feed-forward Neural Networks
 First and simplest type of artificial neural network
 The information moves in only one direction, forward, from
the input nodes, through the hidden nodes (if any) and to
the output nodes
 There are no cycles or loops in the network
Back-propagation
 Back-propagation is a common method of training artificial
neural networks so as to minimize the objective function.
 It is a systematic method of training multi-layer artificial neural
networks.
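The mechanics can be sketched for the XOR problem used in the walkthrough below. This is illustrative Python implementing plain gradient-descent back-propagation on a 2-2-1 network (tanh hidden layer, linear output); the network size, learning rate, and initialization are assumptions, and this is simple gradient descent, not the Levenberg-Marquardt (TRAINLM) algorithm the GUI uses.

```python
import math
import random

random.seed(1)

# XOR training data: the same P and T used in the nntool walkthrough
P = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 1, 1, 0]

# 2-2-1 network with small random weights; each row is [w_in1, w_in2, bias]
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden
w2 = [random.uniform(-1, 1) for _ in range(3)]                      # output

def forward(x):
    h = [math.tanh(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w1]
    y = w2[0] * h[0] + w2[1] * h[1] + w2[2]  # linear output neuron
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(P, T)) / len(P)

initial = loss()
lr = 0.05
for _ in range(3000):
    for x, t in zip(P, T):
        h, y = forward(x)
        err = y - t
        # back-propagate: hidden deltas use the pre-update output weights
        deltas = [err * w2[j] * (1 - h[j] ** 2) for j in range(2)]
        for j in range(2):                     # update output-layer weights
            w2[j] -= lr * err * h[j]
        w2[2] -= lr * err
        for j in range(2):                     # update hidden-layer weights
            w1[j][0] -= lr * deltas[j] * x[0]
            w1[j][1] -= lr * deltas[j] * x[1]
            w1[j][2] -= lr * deltas[j]

final = loss()
print(round(initial, 4), round(final, 4))  # error falls as training proceeds
```

Each pass propagates the output error backwards through the layers and nudges every weight against its error gradient, which is exactly what the Toolbox's training functions do with more sophisticated step-size control.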
Process of XOR Network
 Design
 Training
 Simulation
Steps to Implement a XOR Network in MATLAB
 Open the MATLAB toolbox
 To begin using the NN GUI:
 >> nntool
Design Phase
Cont…
Let:
Input :
P = [0 0 1 1; 0 1 0 1]
Target/ Output:
T = [0 1 1 0]
Click on New Data
Cont…
Cont…
 Click on Create to confirm
 Now, to create an XORNet, click on New Network
 Set the parameters as follows:
 Network Type = Feedforward Backprop
 Input Ranges = [0 1; 0 1]
 Train Function = TRAINLM
 Adaption Learning Function = LEARNGDM
 Performance Function = MSE
 Number of Layers = 2
Cont…
 Set Layer 1 properties as:
 Number of Neurons = 2
 Transfer Function = TANSIG
 Set Layer 2 properties as:
 Number of Neurons = 1
 Transfer Function = TANSIG
 Confirm by hitting the create button
Cont…
Network Training
 Highlight XORNet with One click
 Click on Train button
 On Training Info, select P as Inputs and T as Targets
Cont…
Cont…
 On Training Parameters, set:
 epochs = 1000 (train the network for a longer duration)
 goal = 0.000000000000001 (for a precise result)
 max_fail = 50
 Hit Train Network
Cont…
Cont…
Cont…
 To confirm the XORNet structure and the values of the various
weights and biases of the trained network, click on View in the
Network/Data Manager window
Cont…
Network Simulation
 Create new test data S = [1; 0] and follow the same procedure
as before (as for input P)
Cont…
 Again click on XORNet and then click on Simulate button
on the Network Manager.
 Select S as the Inputs
 Type in XORNet_outputSim as Outputs
 Hit the Simulate Network button
Cont…
Cont…
 Check the result of XORNet_outputSim on the NN Network
Manager by clicking View
Neural Networks
 Neural networks mimic the human brain by learning from a
training dataset and applying the learning to generate
patterns for classification and prediction.
 These algorithms are effective when the data is shapeless
and lacks any apparent pattern.
 The basic unit of a neural network is called a node and is one
of the two main structures of the neural network model. The
other structure is the link between these nodes.
Cont…
Neural Network Model (figure): values for the input variables arrive at the
input nodes; the input values are weighted along the links; the output from
each node becomes the input to the next node; the output node delivers the
discovered value for the output variable.
Cont…
Neural Network for pre-approval of Gold Credit Card (figure): the input
Age = 35 is scaled to 0.35 and enters with weight 0.9; the input
Income = $75,000 is scaled to 0.75 and enters with weight 1.0. The node
combines the weighted inputs as 0.35 × 0.9 + 0.75 × 1.0 = 1.065, producing
the output "Upgrade to Gold Credit Card: Pre-approved".
Genetic Algorithms
 Genetic algorithms apply the principle of ‘natural selection
and survival of the fittest’ to data mining.
 This technique uses a highly iterative process of selection,
cross-over and mutation operators to evolve successive
generations of models.
 At each iteration, every model competes with every other by
inheriting traits from previous generations, until only the
most predictive model survives.
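The selection, cross-over and mutation loop can be sketched on the coupon question from the case study below. This is illustrative Python: the profit curve, population size, rates and ranges are all invented for the example.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

def profit(coupons):
    """Hypothetical profit for a mailer with this many coupons: too few
    attract little extra business, too many give away revenue."""
    return 100 - (coupons - 13) ** 2

def evolve(generations=20, pop_size=10):
    # First generation: random coupon counts per mailer, 0 to 40
    pop = [random.randint(0, 40) for _ in range(pop_size)]
    history = [max(profit(c) for c in pop)]
    for _ in range(generations):
        # Selection: the fitter half survives unchanged
        pop.sort(key=profit, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = (a + b) // 2                  # cross-over: blend two parents
            if random.random() < 0.3:             # mutation: small random change
                child = max(0, child + random.randint(-3, 3))
            children.append(child)
        pop = survivors + children
        history.append(max(profit(c) for c in pop))
    return max(pop, key=profit), history

best, history = evolve()
print(best, profit(best))
```

Because the fittest candidates survive each generation unchanged, the best profit in the population can never decrease across generations, which is the sense in which only the most predictive model survives.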
Cont…
• Example: consider the fast food restaurant case study.
• Suppose the owner wants to do a promotional mailing and wants to
include free coupons in the mailing, with the goal of increasing profits. At
the same time, the promotional mailing must not produce the opposite result
of lost revenue.
• The question is: what is the optimum number of coupons to be placed in
each mailer to maximize profits?
• The slide diagram shows candidate coupon counts evolving over a First,
Second, and Third Generation until the best value survives.
Comparison
Data Mining Technique   | Underlying Structure                                         | Basic Process                                        | Validation Method
Cluster Detection       | Distance calculation in n-vector space                       | Grouping of values in the same neighbourhood         | Cross-validation to verify accuracy
Decision Trees          | Binary tree                                                  | Splits at decision points based on entropy           | Cross-validation
Memory-Based Reasoning  | Predictive structure based on distance and combination funcs | Association of unknown instances with known instances | Cross-validation
Link Analysis           | Based on linking of variables                                | Discover links among variables by their values       | Not applicable
Neural Networks         | Forward propagation network                                  | Weighted inputs of predictors at each node           | Not applicable
Genetic Algorithms      | Not applicable                                               | Survival of the fittest on mutation of derived values | Mostly cross-validation
Review Questions
Objective Questions:
1) The types of information that can be garnered from data mining
include:
a) sequences, classifications, and clusters.
b) model-driven and data-driven.
c) associations and forecasts.
d) a and c.
e) a, b and c.
2) The term “associations” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
Review Questions cont..
3)DSS assist management by combining ________ into a single
powerful system to support unstructured decision-making.
a) hardware and the Internet
b) data, analytical models and tools, and user-friendly software
c) analytical models and tools and data from the Internet
d) group decision processes and electronics
e) data and people
4)DSS, GDSS, and ESS are part of a special category of information
systems that are explicitly designed to:
a) make decisions for managers.
b) enhance Web performance.
c) gather data and build data warehouses.
d) enhance managerial decision-making.
e) interpret data for management.
Review Questions cont..
5)The term “sequences” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item
belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
6)The earliest DSS tended to:
a) rely on Internet data.
b) draw on small subsets of corporate data.
c) be heavily model-driven.
d) b and c.
e) a and c.
Review Questions cont..
7)The term “classifications” is associated with:
a) occurrences linked to a single event.
b) classifications when no groups have been defined.
c) pattern recognition describing the group to which an item
belongs.
d) a series of existing values used to predict other values.
e) events linked over time.
8)Model-driven DSS:
a) analyze large pools of data.
b) are an outgrowth of data mining.
c) use TPS and OLAP.
d) begin with a given group of data and change variables.
e) use events linked over time.
Review Questions cont..
9)The term “forecasting” is associated with:
a) Occurrences linked to a single event.
b) Classifications when no groups have been defined.
c) Pattern recognition describing the group to which an item
belongs.
d) A series of existing values used to predict other values.
e) Events linked over time.
10)A goal of data mining includes which of the following?
a) To explain some observed event or condition
b) To confirm that data exists
c) To analyze data for expected relationships
d) To create a new data warehouse
e) None of these
Review Questions cont..
Short answer type Questions
1. Define data mining in two or three sentences.
2. How is data mining different from OLAP?
3. Is the data warehouse a prerequisite for data mining?
Does the data warehouse help data mining? If so, in
what ways?
4. Name the three common problems of the link analysis
technique.
5. What is market basket analysis? Give two
examples of this application in business.
Review Questions cont..
6. Give three broad reasons why you think data
mining is being used in today’s businesses.
7. What business problems can data mining help
solve?
8. What is Predictive Analytics?
9. What is the difference between data mining and online
analytical processing (OLAP)?
10. State various benefits of Data mining.
Review Questions cont..
Long answer type Questions
1. Describe how decision trees work. Explain with
the help of an example.
2. What do you mean by KDD? Explain all the steps
of KDD in detail.
3. What are the basic principles of genetic
algorithms? Use an example to describe how this
technique works.
4. Describe the cluster detection technique.
5. Discuss Data mining Application in the field of
Banking and finance.
Review Questions cont..
6. Do neural networks and genetic algorithms have
anything in common? Point out differences.
7. How does the memory-based reasoning technique
work? What is the underlying principle?
8. Explain neural networks in detail.
9. What are the golden rules for data mining?
10. Discuss Data mining Application in the field of
Retail Industry.
Suggested Reading/References
1. Paulraj Ponniah, "Data Warehousing Fundamentals",
John Wiley & Sons, 2003.
2. Sam Anahory, "Data Warehousing in the Real World: A
Practical Guide for Building Decision Support Systems",
John Wiley, 2004.
3. W. H. Inmon, "Building the Operational Data Store", 2nd Ed.,
John Wiley, 1999.
4. Han and Kamber, "Data Mining: Concepts and Techniques",
Harcourt India P. Ltd., 2001.