UNIT - I
Data Mining
• Introduction:
Fundamentals of Data Mining, Data Mining Functionalities,
Classification of Data Mining Systems, Major Issues in Data Mining
• Data Preprocessing:
Need for Preprocessing the Data, Data Cleaning, Data Integration
and Transformation, Data Reduction, Discretization and Concept
Hierarchy Generation, Data Mining Primitives, Data Mining Query
Languages, Architectures of Data Mining Systems
• Applications:
Medical/Pharmacy, Insurance, and Health Care
We are in a data-rich situation
 Most of the data is never analyzed at all.
 There is a gap between the generation of data and
our understanding of it.
 But potentially useful knowledge may lie hidden
in the data.
 We need to use computers to automate
extraction of that knowledge from the data.
Need for Mining?
 Lots of data is being collected and warehoused
◦ Web data, e-commerce
◦ purchases at department/grocery stores
◦ Bank/Credit Card transactions
Data
 Data are raw facts and figures that on their own have no meaning
 These can be any alphanumeric characters,
i.e. text, numbers, symbols
 E.g.
Yes, Yes, No, Yes, No, Yes, No, Yes
42, 63, 96, 74, 56, 86
 Neither of the above data sets has any meaning until it is given a
CONTEXT and PROCESSED into a usable form
 Data must be processed in a context in order to give it meaning
Information
 Data that has been processed into a form that gives it meaning
In the next examples we will see what information can be derived from
the data.
Example I
Raw Data
Yes,Yes, No,Yes, No,Yes, No,
Yes, No,Yes,Yes
Context
Responses to the market
research question – “Would
you buy brand x at price y?”
Processing
Information
???
Example II
Raw Data
42, 63, 96, 74, 56, 86
Context
Jayne’s scores in the six
AS/A2 ICT modules
Processing
Information
???
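Example II can be sketched in Python: the same raw numbers become information only once a context is attached and processing is applied. The module names below are hypothetical placeholders for the six AS/A2 ICT modules.

```python
# Raw data: meaningless on its own.
raw_scores = [42, 63, 96, 74, 56, 86]

# Context (assumed): Jayne's scores in six ICT modules, out of 100.
modules = ["M1", "M2", "M3", "M4", "M5", "M6"]

# Processing: pair each score with its module, then summarize.
scores = dict(zip(modules, raw_scores))
average = sum(raw_scores) / len(raw_scores)
best = max(scores, key=scores.get)

print(scores)                                   # data placed in context
print(f"average={average:.1f}, best module={best}")
```

Only the last two lines are information: they answer questions about Jayne's performance that the bare number list could not.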
What is Data Mining?
• Extracting ("mining") knowledge from large amounts
of data (KDD: knowledge discovery from data).
• Data mining is the process of automatically
discovering useful information in large data
repositories.
• We need computational techniques to extract
knowledge out of data.
This information can be used for any of the
following applications:
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration
Need for Data Mining
• In the field of information technology we have huge amounts of
data available that need to be turned into useful information.
• Data mining is nothing but the extraction of useful information
from large databases for some specialized work.
• This information can further be used for various
applications such as consumer research, marketing, product
analysis, demand and supply analysis, e-commerce,
investment trends in stocks & real estate,
telecommunications, and so on.
Data Mining Applications
 Market Analysis and Management
 Corporate Analysis & Risk Management
 Fraud Detection
 Other Applications
Market Analysis and Management
Following are the various fields of the market where data mining is used:
• Customer Profiling: Data mining helps determine what kind of people buy what kind of products.
• Identifying Customer Requirements: Data mining helps in identifying the best products for different customers.
It uses prediction to find the factors that may attract new customers.
• Cross-Market Analysis: Data mining finds associations/correlations between product sales.
• Target Marketing: Data mining helps find clusters of model customers who share the same
characteristics, such as interests, spending habits, and income.
• Determining Customer Purchasing Patterns: Data mining helps in determining customer purchasing patterns.
• Providing Summary Information: Data mining provides various multidimensional summary reports.
Corporate Analysis & Risk Management
Following are the various fields of the corporate
sector where data mining is used:
• Finance Planning and Asset Evaluation: involves cash flow analysis and prediction, and contingent
claim analysis to evaluate assets.
• Resource Planning: involves summarizing and comparing resources and spending.
• Competition: involves monitoring competitors and market directions.
Fraud Detection
• Data mining is also used in the fields of credit card services
and telecommunications to detect fraud.
• For fraudulent telephone calls, it helps to find the destination of the call,
duration of the call, and time of day or week.
• It also analyzes patterns that deviate from expected
norms.
Other Applications
• Data mining is also used in other fields such as sports,
astrology, and Internet Web Surf-Aid.
What is Not Data Mining?
 Data mining isn't ...
◦ Looking up a phone number in a directory
◦ Issuing a search engine query for "amazon"
◦ Query processing
◦ Expert systems or statistical programs
 Data mining is ...
◦ Discovering that certain names are more prevalent in certain Indian
locations, e.g. Mumbai, Bangalore, Hyderabad ...
◦ Grouping together similar documents returned by a search engine,
e.g. Google.com
Examples of Data Mining
 Safeway:
◦ Your purchase data -> relevant coupons
 Amazon:
◦ Your browse history -> items you may like
 State Farm:
◦ Your likelihood of filing a claim, based on people like you
 Neuroscience:
◦ Finding functionally connected brain regions from functional MRI
data.
 Many more ...
Origins of Data Mining
 Draws ideas from machine learning/AI, pattern
recognition, and databases.
 Traditional techniques may be unsuitable due to:
 Enormity of data
 Dimensionality of data
 Distributed nature of data
Data mining overlaps with many disciplines
Statistics
Machine Learning
Information Retrieval (Web mining)
Distributed Computing
Database Systems
We can say that they are all related, but they are all different things,
although they have things in common; for example, both statistics and
data mining use clustering methods.
Brief definitions of each:
Statistics is a very old discipline based mainly on classical mathematical
methods, which can be used for some of the same purposes as data
mining: classifying and grouping things.
Data mining consists of building models in order to detect the patterns that
allow us to classify or predict situations, given a set of facts or
factors.
Artificial intelligence (see Marvin Minsky) is the discipline that tries to
emulate how the brain works with programming methods, for example
building a program that plays chess.
Machine learning is the task of building knowledge and storing it in some
form in the computer; that form can be mathematical models,
algorithms, etc.; anything that can help detect patterns.
Why Not Traditional Data Analysis?
 Tremendous amount of data
◦ Algorithms must be highly scalable to handle terabytes of data
 High dimensionality of data
◦ Micro-arrays may have tens of thousands of dimensions
 High complexity of data
◦ Data streams and sensor data
◦ Time-series data, temporal data, sequence data
◦ Structured data, graphs, social networks, and multi-linked data
◦ Heterogeneous databases and legacy databases
◦ Spatial, spatiotemporal, multimedia, text, and Web data
◦ Software programs, scientific simulations
 New and sophisticated applications
Definition of Knowledge Discovery in Data
"The KDD process is the process of using data mining methods (algorithms)
to extract (identify) what is deemed knowledge according to the
specifications of measures and thresholds, using database F along with
any required preprocessing, subsampling, and transformation of F."
"The nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data."
Goals (e.g., Fayyad et al. 1996):
– Verification of the user's hypothesis (this goes against the EDA principle ...)
– Autonomous discovery of new patterns and models
– Prediction of future behavior of some entities
– Description of interesting patterns and models
KDD Process
Data mining plays an essential role in the knowledge discovery
process:
Original Data → (Selection) → Target Data → (Preprocessing) →
Preprocessed Data → (Transformation) → Transformed Data →
(Data Mining) → Patterns → (Interpretation) → Knowledge
KDD versus DM
 DM is the component of the KDD process that is mainly concerned with the
means by which patterns and models are extracted and enumerated from the data
◦ DM is quite technical
 Knowledge discovery involves evaluation and interpretation of the patterns
and models to decide what constitutes knowledge and what
does not
◦ KDD requires a lot of domain understanding
 It also includes, e.g., the choice of encoding schemes, preprocessing, sampling,
and projections of the data prior to the data mining step
 The terms DM and KDD are often used interchangeably
 Perhaps DM is the more common term in the business world, and KDD in the
academic world
The main steps of the KDD process
7 steps in the KDD process:
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are retrieved
from the database
4. Data transformation: where data are transformed and consolidated into
forms appropriate for mining by performing summary or aggregation
operations
5. Data mining: an essential process where intelligent methods are applied
to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns representing
knowledge, based on interestingness measures
7. Knowledge presentation: where visualization and knowledge
representation techniques are used to present mined knowledge to users
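The seven steps can be illustrated as a minimal pipeline over invented toy records. The field names, thresholds, and intermediate names below are assumptions for illustration only, not part of any standard library or method.

```python
from collections import Counter

records = [
    {"id": 1, "age": 25, "income": 21000, "buys": "CD player"},
    {"id": 2, "age": None, "income": 54000, "buys": "laptop"},
    {"id": 3, "age": 27, "income": 23000, "buys": "CD player"},
]

# 1-2. Cleaning + integration: drop records with missing values.
cleaned = [r for r in records if all(v is not None for v in r.values())]

# 3. Selection: keep only the attributes relevant to the task.
selected = [{"age": r["age"], "buys": r["buys"]} for r in cleaned]

# 4. Transformation: consolidate age into a coarser form.
transformed = [{"age_group": f"{r['age'] // 10 * 10}s", "buys": r["buys"]}
               for r in selected]

# 5. Mining: count co-occurring (age_group, item) patterns.
patterns = Counter((r["age_group"], r["buys"]) for r in transformed)

# 6-7. Evaluation + presentation: report patterns above a threshold.
interesting = {p: n for p, n in patterns.items() if n >= 2}
print(interesting)
```

Real systems replace each step with far heavier machinery, but the ordering of the stages is the same.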
Typical Data Mining System Architecture
 Database, data warehouse, World Wide Web, or
other information repository:
This is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories.
 Data cleaning and data integration techniques may be performed on
the data.
 Database or data warehouse server:
The database or data warehouse server is responsible for fetching
the relevant data, based on the user’s data mining request.
 Knowledge base:
This is the domain knowledge that is used to guide the search or evaluate
the interestingness of resulting patterns.
 Such knowledge can include concept hierarchies, used to organize
attributes or attribute values into different levels of abstraction.
 Data mining engine:
This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier
analysis, and evolution analysis.
 Pattern evaluation module:
This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search toward
interesting patterns.
 User interface:
This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data
mining query or task, providing information to help focus the search,
and performing exploratory data mining based on the intermediate data
mining results.
Data Mining and Business Intelligence
Increasing potential to support business decisions, from the bottom
layer up, with the typical user of each layer:
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments,
Database Systems (DBA)
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
◦ Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
◦ Data streams and sensor data
◦ Time-series data, temporal data, sequence data (incl. bio-sequences)
◦ Structured data, graphs, social networks, and multi-linked data
◦ Heterogeneous databases and legacy databases
◦ Spatial data and spatiotemporal data
◦ Multimedia database
◦ Text databases
◦ The World-Wide Web
Database-oriented data sets
Relational Database:
• A relational database is a collection of tables, each of which is assigned a unique name.
• Each table consists of a set of attributes (columns or fields) and usually stores a large set of
tuples (records or rows). Each tuple in a relational table represents an object identified by a
unique key and described by a set of attribute values. A semantic data model, such as an
entity-relationship model, is often constructed for relational databases.
Data Warehouse:
• A data warehouse is usually modeled by a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the
value of some aggregate measure, such as count or sales amount.
• The actual physical structure of a data warehouse may be a relational data store or a
multidimensional data cube. A data cube provides a multidimensional view of data and allows the
precomputation and fast accessing of summarized data.
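The precomputation of summarized data that a data cube provides can be mimicked over a flat sales table. The dimensions (region, quarter) and all figures below are invented for illustration.

```python
from collections import defaultdict

# Flat sales facts: (region, quarter, amount).
sales = [
    ("East", "Q1", 100), ("East", "Q2", 150),
    ("West", "Q1", 200), ("West", "Q2", 250),
]

# Pre-compute an aggregate measure (sum) along each dimension,
# as a data cube would, so later queries are simple lookups.
by_region = defaultdict(int)
by_quarter = defaultdict(int)
for region, quarter, amount in sales:
    by_region[region] += amount
    by_quarter[quarter] += amount

print(dict(by_region))    # fast access to summarized data by region
print(dict(by_quarter))   # fast access by quarter
```

A real cube materializes many such aggregates at once (one per combination of dimensions); this sketch shows just two one-dimensional "slices".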
Transactional Database:
• A transactional database consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction identity number (trans ID) and a list of the
items making up the transaction (such as items purchased in a store). The transactional database
may have additional tables associated with it, which contain other information regarding the sale,
such as the date of the transaction, the customer ID number, the ID number of the salesperson and
of the branch at which the sale occurred, and so on.
Advanced data sets
Object-Relational Databases:
• Object-relational databases are constructed based on an object-relational data model.
• This model extends the relational model by providing a rich data type for handling complex
objects and object orientation. Because most sophisticated database applications need to handle
complex objects and structures, object-relational databases are becoming increasingly popular in
industry and applications.
Temporal Databases:
• A temporal database typically stores relational data that include time-related attributes.
• These attributes may involve several timestamps, each having different semantics.
Sequence Databases:
• A sequence database stores sequences of ordered events, with or without a concrete notion of
time. Examples include customer shopping sequences, Web click streams, and biological
sequences.
Advanced data sets
Time-Series Databases:
• A time-series database stores sequences of values or events obtained over repeated measurements
of time (e.g., hourly, daily, weekly).
• Examples include data collected from the stock exchange, inventory control, and the observation
of natural phenomena (like temperature and wind).
Spatial Databases:
• Spatial databases contain spatial-related information. Examples include geographic (map)
databases, very large-scale integration (VLSI) or computer-aided design databases, and medical
and satellite image databases.
• Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel
maps.
Spatiotemporal Databases:
• A spatial database that stores spatial objects that change with time is called a spatiotemporal
database, from which interesting information can be mined.
Advanced data sets
Text Databases:
• Text databases are databases that contain word descriptions for objects. These word descriptions
are usually not simple keywords but rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages, summary reports, notes, or other
documents.
• Text databases may be highly unstructured (such as some Web pages on the World Wide Web).
Multimedia Databases:
• Multimedia databases store image, audio, and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide
Web, and speech-based user interfaces that recognize spoken commands.
Heterogeneous Databases:
• A heterogeneous database consists of a set of interconnected, autonomous component databases.
The components communicate in order to exchange information and answer queries.
Legacy Databases:
• A legacy database is a group of heterogeneous databases that combines different kinds of data
systems, such as relational or object-oriented databases, hierarchical databases, network
databases, spreadsheets, multimedia databases, or file systems. The heterogeneous databases in a
legacy database may be connected by intra- or inter-computer networks.
Advanced data sets
Data Streams:
• Many applications involve the generation and analysis of a new kind of data, called stream data,
where data flow in and out of an observation platform (or window) dynamically.
• Such data streams have the following unique features: huge or possibly infinite volume,
dynamically changing, flowing in and out in a fixed order, allowing only one or a small number
of scans, and demanding fast (often real-time) response times.
• Typical examples of data streams include various kinds of scientific and engineering data,
time-series data, and data produced in other dynamic environments, such as power supply, network
traffic, stock exchange, telecommunications, Web click streams, video surveillance, and weather
or environment monitoring.
World Wide Web:
• The World Wide Web and its associated distributed information services, such as Yahoo!, Google,
America Online, and AltaVista, provide rich, worldwide, on-line information services, where data
objects are linked together to facilitate interactive access.
• For example, understanding user access patterns will not only help improve system design (by
providing efficient access between highly correlated objects), but also lead to better marketing
decisions (e.g., by placing advertisements in frequently visited documents, or by providing better
customer/user classification and behavior analysis). Capturing user access patterns in such
distributed information environments is called Web usage mining (or Weblog mining).
Data Mining Functionalities –
What kinds of patterns can be mined?
 Descriptive Mining:
Descriptive mining tasks characterize the general properties of the
data in the database.
Example : Identifying web pages that are accessed together.
(human interpretable pattern)
 Predictive Mining:
Predictive mining tasks perform inference on the current data in
order to make predictions.
Example: Judge if a patient has specific disease based on his/her
medical tests results.
Data Mining Functionalities –
What kinds of patterns can be mined?
1. Characterization and Discrimination
2. Mining Frequent Patterns
3. Classification and Prediction
4. Cluster Analysis
5. Outlier Analysis
6. Evolution Analysis
Data Mining Functionalities:
Characterization and Discrimination
 Data can be associated with classes or concepts; it can be useful
to describe individual classes or concepts in summarized,
concise, and yet precise terms.
 For example, in the AllElectronics store,
classes of items for sale include computers and printers, and
concepts of customers include bigSpenders and budgetSpenders.
 Such descriptions of a concept or class are called class/concept
descriptions. These descriptions can be derived via:
- Data Characterization
- Data Discrimination
Data characterization
Data characterization is a summarization of the general
characteristics or features of a target class of data.
The data corresponding to the user-specified class are typically
collected by a query.
For example, to study the characteristics of software products
whose sales increased by 10% in the last year, the data related to such
products can be collected by executing an SQL query.
The output of data characterization can be presented in pie charts,
bar charts, multidimensional data cubes, and multidimensional
tables. They can also be presented as generalized relations or in rule
form (called characteristic rules).
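A minimal sketch of that collection step, using Python's built-in sqlite3 module: a query retrieves the user-specified class (products with at least 10% sales growth), and its general characteristics are then summarized. The table, column names, and figures are hypothetical.

```python
import sqlite3

# Hypothetical product table with last year's sales growth per product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, sales_growth REAL, category TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    ("EditorPro", 0.15, "software"),
    ("PhotoFix", 0.12, "software"),
    ("OldSuite", -0.05, "software"),
])

# Collect the target class with a query, as in data characterization.
rows = conn.execute(
    "SELECT name, sales_growth FROM products WHERE sales_growth >= 0.10"
).fetchall()

# Summarize the general characteristics of the collected class.
avg_growth = sum(g for _, g in rows) / len(rows)
print(rows, f"avg growth={avg_growth:.3f}")
```

The summary here is a single average; in practice the output would feed the charts, cubes, or characteristic rules mentioned above.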
Data discrimination
Data discrimination is a comparison of the target class data objects
against the objects from one or more contrasting classes, with
respect to specified generalized feature(s).
A data mining system should be able to compare two groups of
AllElectronics customers, such as
those who shop for computer products regularly (more than twice
a month) versus those who rarely shop for such products (i.e., less than
three times a year).
The resulting description provides a general comparative profile of the
customers, such as 80% of the customers who frequently purchase computer
products are between 20 and 40 years old and have a university education,
whereas 60% of the customers who infrequently buy such products are either
seniors or youths, and have no university degree.
Data discrimination
The forms of output presentation are similar to those for characteristic
descriptions, although discrimination descriptions should include
comparative measures that help to distinguish between the target and
contrasting classes.
Drilling down on a dimension, such as occupation, or adding new
dimensions, such as income level, may help in finding even more
discriminative features between the two classes.
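A sketch of how such a comparative profile could be computed: the same measures are evaluated on a target group and a contrasting group, and the differences are the discriminative features. Both groups and the attributes are invented.

```python
# Two hypothetical customer groups: frequent vs. infrequent buyers.
frequent = [
    {"age": 25, "degree": True}, {"age": 35, "degree": True},
    {"age": 30, "degree": False},
]
infrequent = [
    {"age": 68, "degree": False}, {"age": 17, "degree": False},
    {"age": 40, "degree": True},
]

def profile(group):
    """Comparative measures: share aged 20-40, share with a degree."""
    n = len(group)
    return {
        "aged_20_40": sum(20 <= c["age"] <= 40 for c in group) / n,
        "has_degree": sum(c["degree"] for c in group) / n,
    }

print("frequent:  ", profile(frequent))
print("infrequent:", profile(infrequent))
```

The gap between the two profiles plays the role of the "80% vs. 60%" contrast in the AllElectronics example.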
Data Mining Functionalities:
Mining Frequent Patterns, Associations & Correlations
 Frequent patterns, as the name suggests, are patterns that occur
frequently in data.
 There are many kinds of frequent patterns, including itemsets,
subsequences, and substructures.
 A frequent itemset typically refers to a set of items that frequently
appear together in a transactional data set, such as milk and bread.
 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
 A data mining system may find association rules like:
age(X, "20..29") ∧ income(X, "20K..29K") ⇒ buys(X, "CD player")
[support = 2%, confidence = 60%]
 The rule indicates that of the AllElectronics customers under study,
2% are 20 to 29 years of age with an income of 20,000 to 29,000 and have purchased
a CD player at AllElectronics.
There is a 60% probability that a customer in this age and income group will purchase
a CD player.
 The above rule can be referred to as a multidimensional
association rule.
Single-dimensional association rule
 A marketing manager wants to know which items are frequently
purchased together, i.e., within the same transaction.
 An example mined from the AllElectronics transactional database is:
buys(T, "computer") ⇒ buys(T, "software") [support = 1%, confidence = 50%]
 where
T is a transaction.
A confidence, or certainty, of 50% means that if a customer buys a computer, there
is a 50% chance that she/he will buy software as well. A support of 1% means that
1% of all transactions under analysis contain both items.
Data Mining Functionalities:
Classification & Prediction
Classification:
 Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts, for the purpose of
being able to use the model to predict the class of objects whose class
label is unknown.
 The derived model is based on the analysis of a set of training data (i.e.,
data objects whose class label is known).
Prediction:
 Prediction models continuous-valued functions; that is, prediction is
used to predict missing or unavailable numerical data values rather
than class labels.
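A minimal sketch of classification: a model derived from labeled training data predicts the class of an object whose label is unknown. A 1-nearest-neighbour rule is used here purely for illustration (one of many possible model types); the test results and labels are invented.

```python
# Training data: (test_result_1, test_result_2) -> known class label.
train = [
    ((1.0, 0.2), "healthy"),
    ((0.9, 0.1), "healthy"),
    ((0.2, 0.9), "disease"),
    ((0.1, 1.0), "disease"),
]

def predict(x):
    """1-nearest-neighbour: return the label of the closest training object."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda t: sq_dist(t[0], x))[1]

print(predict((0.95, 0.15)))   # near the "healthy" training objects
print(predict((0.15, 0.95)))   # near the "disease" training objects
```

Swapping the nearest-neighbour rule for a decision tree, neural network, or Bayesian model changes the "model" but not the classification setup.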
Representation of the data
Data Mining Functionalities:
Cluster Analysis
 Unlike classification and prediction, which analyze class-labeled data
objects, clustering analyzes data objects without consulting a known
class label.
 The class labels are not present in the training data simply because they
are not known to begin with.
 Clustering can be used to generate such labels.
 The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass
similarity.
 That is, clusters of objects are formed so that objects within a cluster
have high similarity in comparison to one another, but are very dissimilar
to objects in other clusters
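The principle above (maximize intra-cluster similarity, minimize inter-cluster similarity) can be sketched with a tiny k-means on one-dimensional data. The data points and the two starting centres are invented; no class labels are used.

```python
# Unlabeled 1-D data: two visible groups, but no labels given.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]

def kmeans_1d(data, centers, rounds=10):
    """Tiny k-means: assign each point to its nearest centre,
    then recompute each centre as the mean of its cluster."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for x in data:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
print(centers)    # one centre per discovered group
```

The resulting cluster memberships could then serve as the generated class labels the slide mentions.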
Data Mining Functionalities:
Outlier Analysis
A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are outliers.
Most data mining methods discard outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events
can be more interesting than the more regularly occurring ones.
The analysis of outlier data is referred to as outlier mining.
Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account
number in comparison to regular charges incurred by the same account.
Outlier values may also be detected with respect to the location and type
of purchase, or the purchase frequency.
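The credit-card example can be sketched with a simple statistical test: a charge is flagged when it lies far from the account's regular behaviour. The charges and the two-standard-deviation rule are invented for illustration.

```python
import statistics

# Charges on one account (hypothetical): one extremely large purchase.
charges = [40, 55, 38, 60, 47, 52, 900]

mean = statistics.mean(charges)
stdev = statistics.stdev(charges)

# Flag charges more than two standard deviations from the mean.
outliers = [c for c in charges if abs(c - mean) > 2 * stdev]
print(outliers)
```

Real outlier mining uses more robust measures (the single huge charge also inflates the mean and deviation here), but the idea of "deviation from the account's regular charges" is the same.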
Data Mining Functionalities:
Evolution Analysis
Data evolution analysis describes and models regularities or trends for
objects whose behavior changes over time.
Although this may include characterization, discrimination, association
and correlation analysis, classification, prediction, or clustering of time
related data, distinct features of such an analysis include time-series data
analysis, sequence or periodicity pattern matching, and similarity-based
data analysis.
Are all Patterns Interesting?
What makes a pattern interesting?
Not known before
Validates a hypothesis that the user sought to confirm
Novel, potentially useful or desired, understandable, and valid
Easily understood by humans
Valid on a new set of data with some degree of certainty
Are all Patterns Interesting?
Objective measures of interestingness (measurable):
Support: the percentage of transactions from the
transaction database that the given rule satisfies
support(X ⇒ Y) = P(X ∪ Y)
Confidence: the degree of certainty of the given rule
confidence(X ⇒ Y) = P(Y | X)
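The two objective measures can be computed directly from a transaction table. A sketch over a toy transaction set (items and counts invented):

```python
# Toy transactional data: each transaction is a set of items.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

X, Y = {"computer"}, {"software"}
n = len(transactions)

# support(X => Y) = P(X u Y): fraction of transactions containing both.
support = sum((X | Y) <= t for t in transactions) / n

# confidence(X => Y) = P(Y | X): of the transactions containing X,
# the fraction that also contain Y.
contains_x = [t for t in transactions if X <= t]
confidence = sum(Y <= t for t in contains_x) / len(contains_x)

print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here two of four transactions contain both items (support 0.50), and two of the three computer-buying transactions also contain software (confidence about 0.67).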
Are all Patterns Interesting?
Many patterns that are interesting by objective standards may represent common
sense and are therefore actually uninteresting.
So objective measures are coupled with subjective measures that reflect users'
needs and interests.
Subjective interestingness measures are based on user beliefs about the data.
These measures find patterns interesting if the patterns are unexpected
(contradicting a user's beliefs), actionable (offering strategic information on which
the user can act), or expected (confirming a hypothesis).
Are all Patterns Interesting?
Can a data mining system generate all of the interesting
patterns?
A data mining algorithm is complete if it mines all interesting patterns.
It is often unrealistic and inefficient for data mining systems to generate all
possible patterns. Instead, user-provided constraints and interestingness measures
should be used to focus the search.
For some mining tasks, such as association, this is often sufficient to ensure the
completeness of the algorithm.
Are all Patterns Interesting?
Can a data mining system generate only interesting patterns?
A data mining algorithm is consistent if it mines only interesting patterns.
It is an optimization problem.
It is highly desirable for data mining systems to generate only interesting
patterns. This would be efficient for users and data mining systems because
neither would have to search through the patterns generated to identify the
truly interesting ones.
Sufficient progress has been made in this direction, but it is still a
challenging issue in data mining.
Data Mining Softwares
Angoss Software
CART and MARS
Clementine
Data Miner Software kit
DBMiner Technologies
Enterprise Miner
GhostMiner
Intelligent Miner
JDA Intellect
Mantas
MCubiX from Diagnos
MineSet
Mining Mart
Oracle
Weka 3
Classification of Data Mining Systems
 Data mining is an interdisciplinary field
 It is necessary to provide a clear classification of data mining systems,
which may help potential users distinguish between such systems and
identify those that best match their needs.
Data mining draws on database technology, machine learning, statistics,
pattern recognition, algorithms, visualization, and other disciplines.
Data mining systems can be categorized according to
various criteria, as follows:
Classification according to the kinds of databases mined
Classification according to the kinds of knowledge mined
Classification according to the kinds of techniques utilized
Classification according to the applications adapted
 Data to be mined
◦ Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text, multimedia,
heterogeneous, legacy, WWW
 Knowledge to be mined
◦ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
◦ Multiple/integrated functions and mining at multiple levels
 Techniques utilized
◦ Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, etc.
 Applications adapted
◦ Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Data Mining Task Primitives
 Task-relevant data
◦ Database or data warehouse name
◦ Database tables or data warehouse cubes
◦ Condition for data selection
◦ Relevant attributes or dimensions
◦ Data grouping criteria
 Type of knowledge to be mined
◦ Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
 Background knowledge
 Pattern interestingness measurements
 Visualization/presentation of discovered patterns
Major Issues in Data Mining
Mining methodology and user interaction issues:
◦ Mining different kinds of knowledge in databases
◦ Interactive mining of knowledge at multiple levels of
abstraction
◦ Incorporation of background knowledge
◦ Data mining query languages and ad hoc data mining
◦ Presentation and visualization of data mining results
◦ Handling noisy or incomplete data
◦ Pattern evaluation—the interestingness problem
Performance issues
These include efficiency, scalability, and
parallelization of data mining algorithms.
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, and incremental mining
algorithms
Issues relating to the diversity of database types
 Handling of relational and complex types of data
 Mining information from heterogeneous
databases and global information systems
Integrating a Data Mining System with a
DB/DW System
If a data mining system is not integrated with a
database or a data warehouse system, then there will be
no system to communicate with.
This scheme is known as the no-coupling scheme.
In this scheme, the main focus is on data mining design
and on developing efficient and effective algorithms
for mining the available data sets.
Integrating a Data Mining System with a
DB/DW System
Data mining systems, DBMS, and data warehouse systems can be coupled:
No coupling, loose coupling, semi-tight coupling, tight coupling
On-line analytical mining of data:
integration of mining and OLAP technologies
Interactive mining of multi-level knowledge:
necessity of mining knowledge and patterns at different levels of abstraction by
drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions:
e.g., characterized classification, or first clustering and then association
Coupling Data Mining with DB/DW Systems
No coupling: flat file processing, not recommended
Loose coupling:
fetching data from a DB/DW
Semi-tight coupling: enhanced DM performance
Provides efficient implementations of a few data mining primitives in a
DB/DW system, e.g., sorting, indexing, aggregation, histogram
analysis, multiway join, precomputation of some statistical functions
Tight coupling: a uniform information processing environment
DM is smoothly integrated into a DB/DW system, and the mining query is
optimized based on mining query analysis, indexing, query processing
methods, etc.
Different coupling schemes:
With this analysis, it is easy to see that a data mining system should be
coupled with a DB/DW system.
Loose coupling, though not efficient, is better than no coupling because
it uses both data and system facilities of a DB/DW system.
Tight coupling is highly desirable, but its implementation is nontrivial
and more research is needed in this area.
Semi tight coupling is a compromise between loose and tight coupling.
It is important to identify commonly used data mining primitives
and provide efficient implementations of such primitives in DB or
DW systems.
DBMS, OLAP, and Data Mining
Summary
 Data mining: Discovering interesting patterns from large amounts of
data
 A natural evolution of database technology, in great demand, with wide
applications
 A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
 Mining can be performed in a variety of information repositories
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
 Data mining systems and architectures
 Major issues in data mining
UNIT – I
Data Mining:
Concepts and Techniques
— Chapter 2 —
Data Preprocessing
Data Preprocessing
 Why preprocess the data?
 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Why Data Preprocessing?
 Data in the real world is dirty
◦ incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
 e.g., occupation=“ ”
◦ noisy: containing errors or outliers
 e.g., Salary=“-10”
◦ inconsistent: containing discrepancies in codes or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
Why Is Data Dirty?
 Incomplete data may come from
◦ “Not applicable” data value when collected
◦ Different considerations between the time when the data was collected
and when it is analyzed.
◦ Human/hardware/software problems
 Noisy data (incorrect values) may come from
◦ Faulty data collection instruments
◦ Human or computer error at data entry
◦ Errors in data transmission
 Inconsistent data may come from
◦ Different data sources
◦ Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
 No quality data, no quality mining results!
◦ Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading
statistics.
◦ Data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprises the majority of the work of
building a data warehouse
Multi-Dimensional Measure of Data Quality
 A well-accepted multidimensional view:
◦ Accuracy
◦ Completeness
◦ Consistency
◦ Timeliness
◦ Believability
◦ Value added
◦ Interpretability
◦ Accessibility
 Broad categories:
◦ Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
 Data cleaning
◦ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
 Data integration
◦ Integration of multiple databases, data cubes, or files
 Data transformation
◦ Normalization and aggregation
 Data reduction
◦ Obtains reduced representation in volume but produces the same or
similar analytical results
 Data discretization
◦ Part of data reduction but with particular importance, especially for
numerical data
Forms of Data
Preprocessing
Data Pre-processing
 Why preprocess the data?
 Descriptive data summarization
 Descriptive data summarization techniques can be used to
identify the typical properties of your data and highlight which
data values should be treated as noise or outliers.
 Need to study central tendency and dispersion of the data.
 Measures of central tendency include mean, median, mode,
and midrange
 Measures of data dispersion include quartiles, interquartile
range (IQR), and variance.
 These descriptive statistics are of great help in understanding the
distribution of the data.
Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
◦ Distributive measures: sum() and count()
◦ Algebraic measure: avg()

x̄ = (1/n) · Σ_{i=1..n} x_i

◦ Weighted arithmetic mean / weighted avg:

x̄ = (Σ_{i=1..n} w_i x_i) / (Σ_{i=1..n} w_i)

◦ Trimmed mean: the mean obtained after chopping off
values at the high and low extremes.
For example, we can sort the values observed for salary and remove the top and
bottom 2% before computing the mean. We should avoid trimming too large a portion
(such as 20%) at both ends as this can result in the loss of valuable information.
Problem : Mean is sensitive to extreme values
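These measures are easy to sketch in code. A minimal illustration; the salary values and the 1/12 trim fraction are hypothetical, chosen only for the example:

```python
# Sketch of mean, weighted mean, and trimmed mean (hypothetical data).

def mean(xs):
    # Distributive building blocks: sum() and count (len)
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    # Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def trimmed_mean(xs, frac):
    # Chop off a fraction `frac` of the values at each extreme, then average
    xs = sorted(xs)
    k = int(round(len(xs) * frac))
    return mean(xs[k:len(xs) - k])

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries))                        # 58.0
print(trimmed_mean(salaries, 1/12))          # 55.6 (drops 30 and 110)
print(weighted_mean([1, 2, 3], [1, 1, 2]))   # 2.25
```

Note how trimming the two extremes pulls the mean down from 58.0 to 55.6, illustrating the sensitivity to extreme values.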
Measuring the Central Tendency
Median:
◦ Middle value if odd number of values, or average of the
middle two values otherwise
◦ A holistic measure: a measure that must be computed on the
entire data set as a whole.
◦ Holistic measures are much more expensive to compute than
distributive measures
◦ Estimated by interpolation (for grouped data):
median = L1 + ((n/2 − (Σf)_l) / f_median) · c

where L1 is the lower boundary of the median interval, (Σf)_l is the sum of
the frequencies of all intervals below it, f_median is the frequency of the
median interval, and c is the width of the median interval.
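The interpolation formula can be sketched on a hypothetical frequency table (each interval given as lower bound, width, frequency):

```python
# Median of grouped data: median = L1 + ((n/2 - cum_f_below) / f_median) * c

def grouped_median(intervals):
    # intervals: list of (lower_bound, width, frequency), sorted by lower_bound
    n = sum(f for _, _, f in intervals)
    cum = 0  # sum of frequencies of the intervals below the current one
    for lower, width, f in intervals:
        if cum + f >= n / 2:  # this is the median interval
            return lower + ((n / 2 - cum) / f) * width
        cum += f

# Hypothetical age table: [10,20) freq 5, [20,30) freq 10, [30,40) freq 5
print(grouped_median([(10, 10, 5), (20, 10, 10), (30, 10, 5)]))  # 25.0
```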
Measuring the Central Tendency
 Mode
◦ Value that occurs most frequently in the data set
◦ Unimodal, bimodal, trimodal
◦ Empirical formula (for unimodal, moderately skewed data):

mean − mode ≈ 3 × (mean − median)
The midrange can also be used to assess the central tendency of a data set.
It is the average of the largest and smallest values in the set. This algebraic
measure is easy to compute using the SQL aggregate functions, max() and
min().
Symmetric vs. Skewed
Data
Median, mean and mode of symmetric, positively and
negatively skewed data
Measuring the Dispersion of Data
Quartiles, Range, outliers and boxplots :
◦ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
◦ Range : The range of the set is the difference between the largest (max()) and
smallest (min()) values.
◦ Inter-quartile range: IQR = Q3 – Q1
Distance between the first and third quartiles is a simple measure of spread that gives the range
covered by the middle half of the data.
◦ Five number summary: min, Q1, M, Q3, max
◦ Boxplot: ends of the box are the quartiles, median is marked, whiskers, and
plot outlier individually
◦ Outlier: usually, a value higher/lower than 1.5 x IQR
Box plot example
Where: Q1 = 60, Q3 = 100, median = 80
IQR = Q3 − Q1 = 40
Outlier cutoff: Q3 + 1.5 × IQR = 100 + 60 = 160
Outliers: 175 & 202 (both above 160)
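The five-number summary and the 1.5 × IQR outlier rule can be sketched as follows; the data values are hypothetical, chosen so the quartiles work out cleanly under the median-of-halves convention:

```python
# Five-number summary (min, Q1, median, Q3, max) and IQR-based outliers.

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(xs):
    xs = sorted(xs)
    n = len(xs)
    # Median-of-halves convention: Q1/Q3 are medians of the lower/upper half
    lower, upper = xs[:n // 2], xs[(n + 1) // 2:]
    return min(xs), median(lower), median(xs), median(upper), max(xs)

def iqr_outliers(xs):
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

data = [52, 56, 60, 64, 70, 80, 90, 96, 100, 104, 175, 202]
print(five_number_summary(data))  # (52, 62.0, 85.0, 102.0, 202)
print(iqr_outliers(data))         # [175, 202]
```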
Measuring the Dispersion of Data
 Variance and standard deviation (sample: s, population: σ)
◦ Variance (algebraic, scalable computation):

s² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)²  =  (1/(n−1)) [ Σ_{i=1..n} x_i² − (1/n)(Σ_{i=1..n} x_i)² ]

◦ Standard deviation s (or σ) is the square root of the variance s² (or σ²)

σ² = (1/N) Σ_{i=1..n} (x_i − μ)²  =  (1/N) Σ_{i=1..n} x_i² − μ²
The computation of the variance and standard deviation is scalable in large
databases.
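The algebraic identity is what makes the computation scalable: one scan accumulating only Σx and Σx² suffices. A minimal check on hypothetical data:

```python
# Two-pass definition vs. one-pass (scalable) form of the sample variance.

def sample_variance_two_pass(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    # Accumulates only sum(x) and sum(x^2), so a single scan over the data suffices
    n = len(xs)
    s = sq = 0.0
    for x in xs:
        s += x
        sq += x * x
    return (sq - s * s / n) / (n - 1)

xs = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
# Both forms agree (up to floating-point error)
assert abs(sample_variance_two_pass(xs) - sample_variance_one_pass(xs)) < 1e-9
```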
Visualization of Data Dispersion: Boxplot Analysis
Graphic Displays of Basic Descriptive Data Summaries
Aside from the bar charts, pie charts, and line graphs used in most statistical or graphical
data presentation software packages, there are other popular types of graphs for the display
of data summaries and distributions.
These include histograms, quantile plots, q-q plots, scatter plots, and loess curves. Such
graphs are very helpful for the visual inspection of your data.
Data Preprocessing
 Why preprocess the data?
 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Data Cleaning
 Importance
◦ “Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
◦ “Data cleaning is the number one problem in data
warehousing”—DCI survey
 Data cleaning tasks
1. Fill in missing values
2. Identify outliers and smooth out noisy data
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
Missing Data
 Data is not always available
◦ E.g., many tuples have no recorded value for several
attributes, such as Customer Income in sales data
 Missing data may be due to
◦ equipment malfunction
◦ inconsistent with other recorded data and thus deleted
◦ data not entered due to misunderstanding
◦ certain data may not be considered important at the time of entry
◦ not register history or changes of the data
 Missing data may need to be inferred.
How to Handle Missing Data?
1. Ignore the tuple: usually done when the class label is missing (assuming a
classification task); not effective when the percentage of missing values per
attribute varies considerably.
2. Fill in the missing value manually:
time-consuming and infeasible for large data sets
3. Fill it in automatically with:
◦ a global constant: e.g., “unknown”, a new class?
(if so, the mining program may mistakenly think that these values form an
interesting concept, since they all share the common value “unknown”; the
method is simple but not foolproof)
◦ the attribute mean or median
◦ the attribute mean for all samples belonging to the same class: smarter
(e.g., if classifying customers according to credit risk, we may replace the
missing value with the mean income of customers in the same credit-risk
category as the given tuple)
◦ the most probable value: inference-based, e.g., using a Bayesian
formula or a decision tree
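The mean-based strategies can be sketched in a few lines; the record layout and field names below are hypothetical:

```python
# Fill a missing attribute with the global mean, or with the mean of the
# tuple's own class (hypothetical records).

def fill_with_mean(rows, attr):
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[attr] is None:
            r[attr] = mean

def fill_with_class_mean(rows, attr, cls):
    # The smarter variant: average only over tuples of the same class
    for label in {r[cls] for r in rows}:
        group = [r for r in rows if r[cls] == label]
        known = [r[attr] for r in group if r[attr] is not None]
        mean = sum(known) / len(known)
        for r in group:
            if r[attr] is None:
                r[attr] = mean

customers = [
    {"income": 30000, "risk": "high"},
    {"income": None,  "risk": "high"},
    {"income": 90000, "risk": "low"},
    {"income": None,  "risk": "low"},
]
fill_with_class_mean(customers, "income", "risk")
print([c["income"] for c in customers])  # [30000, 30000.0, 90000, 90000.0]
```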
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may due to
◦ faulty data collection instruments
◦ data entry problems
◦ data transmission problems
◦ technology limitation
◦ inconsistency in naming convention
 Other data problems which requires data cleaning
◦ duplicate records
◦ incomplete data
◦ inconsistent data
How to Handle Noisy Data?
 Binning
◦ first sort data and partition into (equal-frequency) bins
◦ then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
 Regression
◦ smooth by fitting the data into regression functions
 Clustering
◦ detect and remove outliers
 Semi-automated method: combined computer and human inspection
◦ detect suspicious values and check manually
Simple Discretization Methods: Binning
 Equal-width (distance) partitioning
◦ Divides the range into N intervals of equal size: uniform grid
◦ if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
◦ The most straightforward, but outliers may dominate presentation
◦ Skewed data is not handled well
 Equal-depth (frequency) partitioning
◦ Divides the range into N intervals, each containing approximately same
number of samples
◦ Good data scaling
◦ Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
 Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
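The example above can be reproduced with a small sketch (equal-frequency bins, assuming the value count divides evenly into the number of bins):

```python
# Smoothing by bin means and by bin boundaries over equal-frequency bins.

def equal_frequency_bins(values, n_bins):
    # Assumes len(values) is divisible by n_bins
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value by its (rounded) bin mean
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of its bin's min and max
    return [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b]
            for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```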
Regression
Data can be smoothed by fitting the data to a function, such as with
regression.
• Linear regression: find the best line to fit two variables and use the
regression function to smooth the data
• Multiple linear regression (more than two variables): fit the data to a
multidimensional surface
[Figure: data points in the (x, y) plane with the fitted regression line y = x + 1]
Cluster Analysis
 Detect and remove outliers, where similar values are organized into
groups or “clusters”
How to Handle Inconsistent Data?
 Manual correction using external references
 Semi-automatic using various tools
◦ To detect violation of known functional dependencies and data
constraints
◦ To correct redundant data
Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Data Integration
 Data integration:
◦ Combines data from multiple sources into a coherent store
Issues to be considered
 Schema integration: e.g., “cust-id” & “cust-no”
◦ Integrate metadata from different sources
◦ Entity identification problem:
 Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
◦ Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources are
different
 Possible reasons: different representations, different scales,
e.g., metric vs. British units
Handling Redundancy in Data Integration
 Redundant data occur often when integration of multiple databases is
done.
◦ Object identification: The same attribute or object may have
different names in different databases
◦ Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue, age
 Redundant attributes can be detected by correlation analysis
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality.
Correlation Analysis (Numerical Data)
 Correlation coefficient (also called Pearson’s product moment
coefficient)
r_A,B = Σ(A − Ā)(B − B̄) / ((n − 1) σ_A σ_B) = (Σ(AB) − n·Ā·B̄) / ((n − 1) σ_A σ_B)

Where: n is the number of tuples,
Ā and B̄ are the respective means of A and B,
σ_A and σ_B are the respective standard deviations of A and B,
Σ(AB) is the sum of the AB cross-products.
 If rA,B > 0, A and B are positively correlated (A’s values increase as
B’s). The higher, the stronger correlation.
 rA,B = 0: independent;
 rA,B < 0: negatively correlated
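The cross-product form of r_A,B can be sketched directly; the height/weight values are hypothetical:

```python
# Pearson's product-moment coefficient via the cross-product form.
import math

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample standard deviations (n - 1 denominator, matching the formula)
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))  # sum of AB cross-products
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

height = [150, 160, 170, 180, 190]
weight = [50, 58, 64, 74, 80]
print(round(pearson_r(height, weight), 3))  # 0.997: strongly positively correlated
```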
Correlation analysis of categorical (discrete) attributes uses the
chi-square (χ²) test: χ² = Σ (observed − expected)² / expected, where the
expected frequency of a cell (A = a_i, B = b_j) is
count(A = a_i) × count(B = b_j) / n.
For the given example, the expected frequency of the cell (male, fiction) is
count(male) × count(fiction) / n.
For 1 degree of freedom, the χ² value needed to reject the independence
hypothesis at the 0.001 significance level is 10.828.
If the computed value is above this, we can reject the hypothesis that gender
and preferred_reading are independent.
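The computation (whose worked table sits in an omitted slide figure) can be sketched as follows; the 2×2 counts here are illustrative stand-ins for the gender vs. preferred-reading example:

```python
# Chi-square statistic for a contingency table (hypothetical counts).

def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency: e_ij = count(row_i) * count(col_j) / n
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

observed = [[250, 50],     # male:   fiction, non-fiction
            [200, 1000]]   # female: fiction, non-fiction
print(round(chi_square(observed), 2))  # 507.94, far above the 10.828 cutoff
```

Because 507.94 greatly exceeds 10.828, independence would be rejected for these counts.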
Data Transformation
 Smoothing: remove noise from data using smoothing techniques
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
◦ min-max normalization
◦ z-score normalization
◦ normalization by decimal scaling
 Attribute/feature construction:
◦ New attributes constructed from the given ones
Data Transformation: Normalization
 Min-max normalization: a linear transformation to [new_min_A, new_max_A]:

v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

 Z-score normalization (μ: mean, σ: standard deviation):

v' = (v − μ_A) / σ_A

Ex. Let μ (mean) = 54,000 and σ (std. dev) = 16,000. Then
v' = (73,600 − 54,000) / 16,000 = 1.225

 Normalization by decimal scaling:

v' = v / 10^j

where j is the smallest integer such that Max(|v'|) < 1
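The three methods can be sketched directly; the prints reproduce the income example above:

```python
# Min-max, z-score, and decimal-scaling normalization.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(v, max_abs):
    # Find the smallest j such that max(|v'|) = max_abs / 10^j < 1
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(round(z_score(73600, 54000, 16000), 3))  # 1.225
print(decimal_scaling(917, 986))               # 0.917 (j = 3)
```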
Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Data Reduction
Problem:
• Data Warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on the complete
data set
Solution?
◦ Data reduction…
Data Reduction
Obtains a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same)
analytical results
•Data reduction strategies
– Data cube aggregation :
– Dimensionality reduction:
e.g., remove unimportant attributes
– Data compression :
– Numerosity reduction:
e.g., fit data into models
– Discretization and concept hierarchy generation :
2- Dimensional Aggregation
Imagine that you have collected the data for your analysis.
These data consist of the AllElectronics sales per quarter, for the years 2002 to
2004. You are, however, interested in the annual sales (total per year), rather than
the total per quarter.
Thus the data can be aggregated so that the resulting data summarize the total sales
per year instead of per quarter.
Data cube
Data cubes store multidimensional aggregated information.
Each cell holds an aggregate data value, corresponding to the data point in
multidimensional space.
Data cubes provide fast access to precomputed, summarized data, thereby
benefiting OLAP as well as data mining.
Data Cube Aggregation
 The lowest level of a data cube (base cuboid)
◦ The cube created at the lowest level of abstraction is referred to as the base
cuboid.
◦ The aggregated data for an individual entity of interest
◦ E.g., a customer in a phone calling data warehouse
 A cube at the highest level of abstraction is the apex cuboid.
 Multiple levels of aggregation in data cubes
◦ Further reduce the size of data to deal with
 Queries regarding aggregated information should be answered
using data cube, when possible
Dimensionality reduction:
Attribute Subset Selection
 Feature selection (i.e., attribute subset selection):
◦ Select a minimum set of features such that the probability distribution of
different classes given the values for those features is as close as possible to
the original distribution given the values of all features
◦ reduces the number of attributes appearing in the discovered patterns, making them easier to understand
 Heuristic methods (due to exponential # of choices):
◦ Step-wise forward selection
◦ Step-wise backward elimination
◦ Combining forward selection and backward elimination
◦ Decision-tree induction
“How can we find a ‘good’ subset of the original attributes?”
 For n attributes, there are 2^n possible subsets.
 An exhaustive search for the optimal subset of attributes can be prohibitively expensive,
especially as n and the number of data classes increase.
 Therefore, heuristic methods that explore a reduced search space are commonly used for
attribute subset selection.
 These methods are typically greedy in that, while searching through attribute space, they
always make what looks to be the best choice at the time.
 Their strategy is to make a locally optimal choice in the hope that this will lead to a
globally optimal solution.
 Such greedy methods are effective in practice and may come close to estimating an
optimal solution.
 The “best” (and “worst”) attributes are typically determined using tests of statistical
significance, which assume that the attributes are independent of one another.
Heuristic Feature Selection Methods
 Several heuristic feature selection methods:
1. Best single features under the feature independence assumption:
   choose by significance tests
2. Best step-wise forward selection:
   ◦ The best single feature is picked first
   ◦ Then the next best feature conditioned on the first, ...
3. Step-wise backward elimination:
   ◦ Repeatedly eliminate the worst feature
4. Best combined forward selection and backward elimination
5. Optimal branch and bound:
   ◦ Use feature elimination and backtracking
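Best step-wise forward selection can be sketched as a greedy loop; the `score` function below is a hypothetical stand-in for the significance tests mentioned above, and the relevance weights are invented for illustration:

```python
# Greedy step-wise forward selection (sketch; scoring function is hypothetical).

def forward_selection(attributes, score, k):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # Greedy step: add the attribute that most improves the score
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining attribute improves the current subset
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy additive score; weights invented so {A4, A6, A1} come out on top
relevance = {"A1": 0.6, "A2": 0.1, "A3": 0.0, "A4": 0.9, "A5": 0.2, "A6": 0.7}

def score(subset):
    return sum(relevance[a] for a in subset)

print(forward_selection(list(relevance), score, k=3))  # ['A4', 'A6', 'A1']
```

Each step makes a locally optimal choice, which is exactly the greedy strategy the slide describes.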
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Decision tree: the root tests A4; its Y branch tests A6 (Y → Class 1, N → Class 2),
and its N branch tests A1 (Y → Class 1, N → Class 2)]
> Reduced attribute set: {A1, A4, A6}
Dimensionality Reduction
 Data transformations are applied so as to obtain a reduced or compressed
representation of the original data.
 If the original data can be reconstructed from the compressed data without
any loss of information, the data reduction is called lossless.
 If we can reconstruct only an approximation of the original data, the data
reduction is called lossy.
Data Compression
 String compression
◦ There are extensive theories and well-tuned algorithms
◦ Typically lossless
◦ But only limited manipulation is possible without expansion
 Audio/video compression
◦ Typically lossy compression, with progressive refinement
◦ Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
 Time sequences are not audio
◦ Typically short and vary slowly with time
Data Compression
[Figure: lossless compression reconstructs the original data exactly from the
compressed data; lossy compression yields only an approximation]
How to handle Dimensionality Reduction
• DWT (Discrete Wavelet Transform)
 Principal Components Analysis
 Numerosity Reduction
Wavelet transforms (e.g., Haar-2, Daubechies-4)
 DWT (Discrete Wavelet Transform): a linear signal processing technique.
 It transforms the data vector X into a numerically different vector X′ of
wavelet coefficients.
 The two vectors are of the same length.
 A compressed approximation of the data can be retained by storing only a
small fraction of the strongest wavelet coefficients.
 Similar to the discrete Fourier transform (DFT), but gives better lossy
compression and is localized in space.
Implementing 2-D DWT
Decomposition is applied first along each row i, then along each column j.
[Screenshot: 2-D DWT in MATLAB: load an image (must be a .mat file), choose
the wavelet type, hit Analyze, then choose the display options]
Data Compression: Principal Component Analysis (PCA)
 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
 Steps:
◦ Normalize input data: Each attribute falls within the same range
◦ Compute k orthonormal (unit) vectors, i.e., principal components
◦ Each input data (vector) is a linear combination of the k principal component vectors
◦ The principal components are sorted in order of decreasing “significance” or
strength
◦ Since the components are sorted, the size of the data can be reduced by eliminating
the weak components, i.e., those with low variance (using the strongest
principal components, it is possible to reconstruct a good approximation of the
original data)
 Works for numeric data only
 Used when the number of dimensions is large
Principal Component Analysis
[Figure: for data plotted on axes X1 and X2, Y1 and Y2 are the first
principal components for the given data]
Numerosity Reduction
 Reduce data volume by choosing alternative, smaller forms of data
representation
 Parametric methods
◦ Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
◦ Example:
Log-linear models—obtain value at a point in m-D space as the product on
appropriate marginal subspaces
 Non-parametric methods
◦ Do not assume models
◦ Major families: histograms, clustering, sampling
Parametric methods
1. Regression
 Linear regression: data are modeled to fit a straight line
◦ Often uses the least-square method to fit the line
◦ Two parameters, w and b, specify the line and are to be estimated by using
the data at hand,
◦ applying the least-squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Data Reduction Method (1):
Regression and Log-Linear Models
 Linear regression: Data are modeled to fit a straight line
◦ Often uses the least-square method to fit the line
 Multiple regression: allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
 Log-linear model: approximates discrete multidimensional probability
distributions
Regress Analysis and Log-Linear Models
 Linear regression: Y = w X + b
◦ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
◦ Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
 Multiple regression: Y = b0 + b1 X1 + b2 X2.
◦ Many nonlinear functions can be transformed into the above
 Log-linear models:
◦ The multi-way table of joint probabilities is approximated by a
product of lower-order tables
◦ e.g., the joint probability p(a, b, c, d) is approximated by a product of lower-order tables
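The least-squares estimates for the simple linear model Y = wX + b have a standard closed form, sketched below; the sample points are hypothetical and lie exactly on y = x + 1, matching the earlier regression figure:

```python
# Least-squares fit of Y = w X + b:
#   w = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b = y_bar - w * x_bar

def linear_fit(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    w = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    b = ybar - w * xbar
    return w, b

# Hypothetical points lying exactly on y = x + 1
print(linear_fit([1, 2, 3, 4], [2, 3, 4, 5]))  # (1.0, 1.0)
```

Storing only the two parameters (w, b) in place of the raw points is exactly the parametric data-reduction idea above.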
Data Reduction Method (2): Histograms
 Divide data into buckets
and store average (sum) for
each bucket
 Partitioning rules:
◦ Equal-width: equal bucket
range
◦ Equal-frequency (or
equal-depth)
◦ V-optimal: with the least
histogram variance
(weighted sum of the
original values that each
bucket represents)
◦ MaxDiff: set a bucket
boundary between each pair
of adjacent values whose
difference is among the β–1
largest
[Histogram: value counts (0 to 40) for equal-width price buckets from
10,000 to 90,000]
Data Reduction Method (3): Clustering
 Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
 Can be very effective if data is clustered but not if data is “smeared”
 Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
 There are many choices of clustering definitions and clustering
algorithms
 Cluster analysis will be studied in depth in Chapter 7
Clustering
Raw Data
Data Reduction Method (4): Sampling
 Sampling: obtaining a small sample s to represent the whole data set N
 Allow a mining algorithm to run in complexity that is potentially sublinear to the size of the data
 Choose a representative subset of the data
◦ Simple random sampling may have very poor performance in the
presence of skew
 Develop adaptive sampling methods
◦ Stratified sampling:
 Approximate the percentage of each class (or subpopulation of
interest) in the overall database
 Used in conjunction with skewed data
 Note: Sampling may not reduce database I/Os (page at a time)
Sampling: with or without Replacement
Raw Data
Sampling: Cluster or Stratified Sampling
Cluster/Stratified Sample
Raw Data
Chapter 2: Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Discretization
 Three types of attributes:
◦ Nominal — values from an unordered set, e.g., color, profession
◦ Ordinal — values from an ordered set, e.g., military or academic rank
◦ Continuous — real-valued, e.g., temperature or salary
 Discretization:
◦ Divide the range of a continuous attribute into intervals
◦ Some classification algorithms only accept categorical attributes.
◦ Reduce data size by discretization
◦ Prepare for further analysis
Discretization and Concept Hierarchy
 Discretization
◦ Reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals
◦ Interval labels can then be used to replace actual data values
◦ Supervised vs. unsupervised
 If the discretization process uses class information, it is supervised; otherwise it is unsupervised.
◦ Split (top-down) vs. merge (bottom-up)
If the process starts by first finding one or a few points (called split points or cut
points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals, it is called top-down discretization or splitting.
In contrast, bottom-up discretization (merging) starts by considering all of the
continuous values as potential split-points, removes some by merging neighborhood
values to form intervals, and then recursively applies this process to the resulting
intervals.
◦ Discretization can be performed recursively on an attribute
 Concept hierarchy formation
◦ Recursively reduce the data by collecting and replacing low level
concepts (such as numeric values for age) by higher level concepts (such
as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation for
Numeric Data
 Typical methods: All the methods can be applied recursively
◦ Binning (covered above)
 Top-down split, unsupervised,
◦ Histogram analysis (covered above)
 Top-down split, unsupervised
◦ Clustering analysis (covered above)
 Either top-down split or bottom-up merge, unsupervised
◦ Entropy-based discretization: supervised, top-down split
◦ Interval merging by χ² (chi-square) analysis: unsupervised, bottom-up merge
◦ Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization
 Given a set of samples S, if S is partitioned into two intervals S1 and S2 using
boundary T, the expected information requirement after partitioning is

I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)

 Entropy is calculated based on the class distribution of the samples in the set.
Given m classes, the entropy of S1 is

Entropy(S1) = − Σ_{i=1..m} p_i log2(p_i)

where p_i is the probability of class i in S1
 The boundary that minimizes the entropy function over all possible
boundaries is selected as a binary discretization
 The process is recursively applied to partitions obtained until some stopping
criterion is met
 Such a boundary may reduce data size and improve classification accuracy
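Selecting the boundary T that minimizes I(S, T) can be sketched by scanning midpoints between adjacent sorted values; the labeled samples below are hypothetical:

```python
# Entropy-based discretization: pick the split boundary minimizing I(S, T).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(samples):
    # samples: list of (value, class_label); returns the boundary T minimizing I(S, T)
    samples = sorted(samples)
    n = len(samples)
    best_t, best_i = None, float("inf")
    for k in range(1, n):
        t = (samples[k - 1][0] + samples[k][0]) / 2  # candidate midpoint
        left = [c for v, c in samples if v <= t]
        right = [c for v, c in samples if v > t]
        i = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if i < best_i:
            best_t, best_i = t, i
    return best_t

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
print(best_split(data))  # 6.5: splits the classes perfectly, I(S, T) = 0
```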
Segmentation by Natural Partitioning
 A simple 3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
◦ If an interval covers 3, 6, 7, or 9 distinct values at the most
significant digit, partition the range into 3 equi-width intervals
◦ If it covers 2, 4, or 8 distinct values at the most significant
digit, partition the range into 4 intervals
◦ If it covers 1, 5, or 10 distinct values at the most significant
digit, partition the range into 5 intervals
Concept Hierarchy Generation for Categorical Data
 Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
◦ street < city < state < country
 Specification of a hierarchy for a set of values by explicit data
grouping
◦ {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
◦ E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
◦ E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
 Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the data
set
◦ The attribute with the most distinct values is placed at the lowest level
of the hierarchy
◦ Exceptions, e.g., weekday, month, quarter, year
country (15 distinct values)
  province_or_state (365 distinct values)
    city (3,567 distinct values)
      street (674,339 distinct values)
Chapter 2: Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Summary
 Data preparation or preprocessing is a big issue for both data
warehousing and data mining
 Descriptive data summarization is needed for quality data preprocessing
 Data preparation includes
◦ Data cleaning and data integration
◦ Data reduction and feature selection
◦ Discretization
 Many methods have been developed, but data preprocessing remains an
active area of research