Data Mining: What is Data Mining?
Overview
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data
from different perspectives and summarizing it into useful information - information that can be used to
increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for
analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it,
and summarize the relationships identified. Technically, data mining is the process of finding
correlations or patterns among dozens of fields in large relational databases.
Continuous Innovation
Although data mining is a relatively new term, the technology is not. Companies have used powerful
computers to sift through volumes of supermarket scanner data and analyze market research reports
for years. However, continuous innovations in computer processing power, disk storage, and statistical
software are dramatically increasing the accuracy of analysis while driving down the cost.
Example
For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze
local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they
also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery
shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded
that they purchased the beer to have it available for the upcoming weekend. The grocery chain could
use this newly discovered information in various ways to increase revenue. For example, they could
move the beer display closer to the diaper display. And, they could make sure beer and diapers were
sold at full price on Thursdays.
Data, Information, and Knowledge
Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are
accumulating vast and growing amounts of data in different formats and different databases. This
includes:
• operational or transactional data, such as sales, cost, inventory, payroll, and accounting data
• nonoperational data, such as industry sales, forecast data, and macroeconomic data
• metadata (data about the data itself), such as logical database design or data dictionary definitions
Information
The patterns, associations, or relationships among all this data can provide information. For example,
analysis of retail point of sale transaction data can yield information on which products are selling and
when.
Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example,
summary information on retail supermarket sales can be analyzed in light of promotional efforts to
provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine
which items are most susceptible to promotional efforts.
Data Warehouses
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are
enabling organizations to integrate their various databases into data warehouses. Data warehousing is
defined as a process of centralized data management and retrieval. Data warehousing, like data mining,
is a relatively new term although the concept itself has been around for years. Data warehousing
represents an ideal vision of maintaining a central repository of all organizational data. Centralization of
data is needed to maximize user access and analysis. Dramatic technological advances are making this
vision a reality for many companies. And, equally dramatic advances in data analysis software are
allowing users to access this data freely. The data analysis software is what supports data mining.
What can data mining do?
Data mining is primarily used today by companies with a strong consumer focus - retail, financial,
communication, and marketing organizations. It enables these companies to determine relationships
among internal factors such as price, product positioning, or staff skills, and external factors such as
economic indicators, competition, and customer demographics. And, it enables them to determine the
impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to drill down into
summary information to view detail transactional data.
With data mining, a retailer could use point-of-sale records of customer purchases to send targeted
promotions based on an individual's purchase history. By mining demographic data from comment or
warranty cards, the retailer could develop products and promotions to appeal to specific customer
segments.
For example, Blockbuster Entertainment mines its video rental history database to recommend rentals
to individual customers. American Express can suggest products to its cardholders based on analysis of
their monthly expenditures.
WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures
point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to
its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access
data on their products and perform data analyses. These suppliers use this data to identify customer
buying patterns at the store display level. They use this information to manage local store inventory and
identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million
complex data queries.
The National Basketball Association (NBA) is exploring a data mining application that can be used in
conjunction with image recordings of basketball games. The Advanced Scout software analyzes the
movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the
play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on
January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four
jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is
interesting because it differs considerably from the average shooting percentage of 49.30% for the
Cavaliers during that game.
By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the
jump shots attempted by Williams with Price on the floor, without needing to comb through hours of
video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks'
defense and then finds Williams for an open jump shot.
How does data mining work?
While large-scale information technology has been evolving separate transaction and analytical systems,
data mining provides the link between the two. Data mining software analyzes relationships and
patterns in stored transaction data based on open-ended user queries. Several types of analytical
software are available: statistical, machine learning, and neural networks. Generally, any of four types
of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain
could mine customer purchase data to determine when customers visit and what they typically order.
This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For
example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an example of
associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor
equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's
purchase of sleeping bags and hiking shoes.
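Both association and sequential-pattern discovery start from co-occurrence counts over transactions. A minimal sketch of that counting step in Python (the baskets and item names below are invented for illustration; real association miners such as Apriori prune the candidate space rather than enumerating every pair):

```python
from itertools import combinations
from collections import Counter

# Invented transactions, one set of items per shopping basket.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "bread"},
    {"milk", "bread"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of all transactions that contain the pair.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
```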
Data mining consists of five major elements:
Extract, transform, and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data by application software.
Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
Artificial neural networks: Non-linear predictive models that learn through training and resemble
biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation,
and natural selection in a design based on the concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules
for the classification of a dataset. Specific decision tree methods include Classification and Regression
Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision
tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a
new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset
by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART
typically requires less data preparation than CHAID.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of
the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called
the k-nearest neighbor technique; a minimal sketch follows this list.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics
tools are used to illustrate data relationships.
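As promised above, a minimal k-nearest-neighbor sketch in Python (the two-feature records, labels, and query point are invented for illustration):

```python
import math
from collections import Counter

def knn_classify(query, records, labels, k=3):
    """Classify `query` by majority vote among the k closest records."""
    dist = lambda r: math.dist(query, r)  # Euclidean distance
    nearest = sorted(range(len(records)), key=lambda i: dist(records[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Invented historical dataset: (age, monthly spend) -> segment label.
records = [(25, 200), (30, 250), (45, 900), (50, 1100), (28, 300)]
labels = ["budget", "budget", "premium", "premium", "budget"]
print(knn_classify((48, 950), records, labels, k=3))  # -> "premium"
```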
What technological infrastructure is required?
Today, data mining applications are available on all size systems for mainframe, client/server, and PC
platforms. System prices range from several thousand dollars for the smallest applications up to $1
million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes
to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are
two critical technological drivers:
Size of the database: the more data being processed and maintained, the more powerful the system
required.
Query complexity: the more complex the queries and the greater the number of queries being
processed, the more powerful the system required.
Relational database storage and management technology is adequate for many data mining applications
of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support
larger applications. Some vendors have added extensive indexing capabilities to improve query
performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to
achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link
hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the
largest supercomputers.
--------------------------------------------------------------------------------
Data Types (Data Mining)
SQL Server 2008 R2
When you create a mining model or a mining structure in Microsoft SQL Server Analysis Services, you
must define the data types for each of the columns in the mining structure. The data type tells the data
mining engine whether the data in the data source is numerical or text, and how the data should be
processed. For example, if your source data contains numerical data, you can specify whether the
numbers should be treated as integers or as numbers with decimal places.
Each data type supports one or more content types. By setting the content type, you can customize the
way that data in the column is processed or calculated in the mining model.
For example, if you have numeric data in a column, you can choose to handle it either as a numeric or
text data type. If you choose the numeric data type, you can set several different content types: you can
discretize the numbers, or handle them as continuous values. For a list of all the content types, see
Content Types (Data Mining).
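For a feel of what "discretize the numbers" means, here is a rough equal-width bucketing sketch in Python; this is only an illustration of the concept, not Analysis Services code, and the income values and bucket count are invented:

```python
def equal_width_buckets(values, n_buckets):
    """Map each continuous value to an equal-width bucket index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    # Values at the upper edge fall into the last bucket.
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
print(equal_width_buckets(incomes, 4))  # -> [0, 0, 0, 0, 0, 0, 1, 1, 1, 3]
```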
Analysis Services supports the following data types for mining structure columns:
Data Type | Supported Content Types
Text      | Cyclical, Discrete, Discretized, Key Sequence, Ordered, Sequence
Long      | Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered, Sequence, Time, Classified
Boolean   | Cyclical, Discrete, Ordered
Double    | Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered, Sequence, Time, Classified
Date      | Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered
Note
The Time and Sequence content types are only supported by third-party algorithms. The Cyclical and
Ordered content types are supported, but most algorithms treat them as discrete values and do not
perform special processing.
Specifying a Data Type
If you create the mining model directly by using Data Mining Extensions (DMX), you can define the data
type for each column as you define the model, and Analysis Services will create the corresponding
mining structure with the specified data types at the same time. If you create the mining model or
mining structure by using a wizard, Analysis Services will suggest a data type, or you can choose a data
type from a list.
Changing a Data Type
If you change the data type of a column, you must always reprocess the mining structure and any mining
models that are based on that structure. Sometimes if you change the data type, that column can no
longer be used in a particular model. In that case, Analysis Services will either raise an error when you
reprocess the model, or will process the model but leave out that particular column.
See Also
Reference
Content Types (DMX)
Data Types (DMX)
Concepts
Content Types (Data Mining)
Data Mining Algorithms (Analysis Services - Data Mining)
Mining Structures (Analysis Services - Data Mining)
Mining Model Columns
Other Resources
Mining Structure Columns
--------------------------------------------------------------------------------
What are the Types of Data Mining?
Web mining, an extension of data mining, involves applying the techniques of data mining to
documents on the Internet. Web mining is used to study various aspects of a website and to recognize
the relationships and patterns in user behavior in order to gain insight into crucial information. For
example, if you want to improve the accessibility of your website, you need to know which points need
improvement. Web mining services present the required results. They take into consideration the IP
addresses of website visitors, browser logs, cookies, and so on.
Web mining tools analyze these logs and process them accordingly to produce meaningful and
understandable information. For example, various bits of information can be analyzed to track the
browsing route of website visitors. This may assist you in devising ways to make your website more
effective.
The whole process of web mining involves extracting information from the internet through traditional
practices of data mining and applying it to specific features of the website.
Types of Web Mining
Web mining helps to discover information, find related data and documents, identify
patterns and trends and make sure that the web resources remain efficient. There are three main types
of web mining:
• Web Content Mining
• Web Usage Mining
• Web Structure Mining
Web Content Mining
This process seeks to discover all the hyperlinks within a document in order to generate a structural
report on the web page. Information about various facets is evaluated and analyzed for further
research: whether users are able to find information, whether the website structure is too deep or too
shallow, whether the web page elements are placed correctly, which areas of a website are the most
and least visited, and whether those have anything to do with the page design.
Web Usage Mining
In this process, data mining techniques are applied to discover patterns and trends in the browsing
behavior of website visitors. Navigation patterns are extracted so that browsing patterns can be
deciphered and the website structured and designed accordingly. For instance, if there is any particular
feature of the website that visitors tend to use very often, you should seek to make it more pronounced
and enhanced in order to increase usability and appeal more to users. This process makes use of web
logs and access records.
By understanding visitor movement and behavior as they surf the internet, you can seek to cater to their
needs and preferences better and thus make your website popular among the internet masses.
Web Structure Mining
Web structure mining involves the use of graph theory to analyze the node and connection structure of
a website. Based on the nature and type of web structure data, structure mining is further divided into
two types.
The first is extracting patterns from hyperlinks on the internet. A hyperlink is a structural web address
that connects the web page to another location. The second is mining the document structure: a
tree-like structure is used to analyze and describe the HTML or XHTML tags within the web page.
--------------------------------------------------------------------------------
Data Mining
Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4
Introduction to Data Mining
by Tan, Steinbach, Kumar
Classification: Definition
O Given a collection of records (training set )
– Each record contains a set of attributes, one of the
attributes is the class.
O Find a model for class attribute as a function
of the values of other attributes.
O Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
Illustrating Classification Task

A learning algorithm performs induction on a training set to learn a model; the
model is then applied (deduction) to a test set.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
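A minimal sketch of this induction/deduction loop in Python, using the two tables above; scikit-learn's DecisionTreeClassifier stands in for the generic learning algorithm, and the integer encodings of the categorical attributes are our own choice:

```python
from sklearn.tree import DecisionTreeClassifier

# Our encoding: Attrib1 Yes/No -> 1/0; Attrib2 Small/Medium/Large -> 0/1/2;
# Attrib3 in thousands; Class Yes/No -> 1/0.
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]

model = DecisionTreeClassifier().fit(X_train, y_train)  # induction
print(model.predict(X_test))                            # deduction: labels for Tid 11-15
```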
Examples of Classification Task
O Predicting tumor cells as benign or malignant
O Classifying credit card transactions
as legitimate or fraudulent
O Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random
coil
O Categorizing news stories as finance,
weather, entertainment, sports, etc.
Classification Techniques
O Decision Tree based Methods
O Rule-based Methods
O Memory based reasoning
O Neural Networks
O Naïve Bayes and Bayesian Belief Networks
O Support Vector Machines
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Refund and Marital Status are categorical, Taxable Income is continuous, and
Cheat is the class. Model: Decision Tree (splitting attributes at each node):

Refund?
├─ Yes → NO
└─ No → Marital Status?
    ├─ Single, Divorced → Taxable Income?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO
Another Example of Decision Tree

Using the same training data, an alternative tree splits on Marital Status first:

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
    ├─ Yes → NO
    └─ No → Taxable Income?
        ├─ < 80K → NO
        └─ > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

As in the general classification task above, a tree induction algorithm learns a
decision tree from the training set (induction), and the tree is then applied to
the test set (deduction).
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record
at each node: Refund = No → take the No branch; Marital Status = Married →
take the Married branch, which is a leaf. Assign Cheat to “No”.
Decision Tree Induction
O Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm

O Let D_t be the set of training records that reach a node t
O General Procedure (illustrated on the Tid/Refund training data above):
– If D_t contains records that all belong to the same class y_t, then t is a
leaf node labeled as y_t
– If D_t is an empty set, then t is a leaf node labeled by the default class, y_d
– If D_t contains records that belong to more than one class, use an attribute
test to split the data into smaller subsets. Recursively apply the procedure
to each subset.
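A minimal recursive sketch of this procedure in Python; the split-selection rule here is simplified to "first attribute whose values differ" (real implementations choose the best split by an impurity measure, discussed below):

```python
from collections import Counter

def hunt(records, attrs, default):
    """records: list of (features_dict, label) pairs; returns a nested tree."""
    if not records:                        # empty D_t: leaf labeled y_d
        return default
    labels = [label for _, label in records]
    if len(set(labels)) == 1:              # pure D_t: leaf labeled y_t
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    for attr in attrs:
        values = {feats[attr] for feats, _ in records}
        if len(values) > 1:                # an attribute test that splits D_t
            return {attr: {v: hunt([(f, l) for f, l in records if f[attr] == v],
                                   attrs, majority)
                           for v in values}}
    return majority                        # no attribute separates the records

# Tiny example on the Refund/Marital Status data (Taxable Income omitted):
data = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
        ({"Refund": "No", "MarSt": "Married"}, "No"),
        ({"Refund": "No", "MarSt": "Single"}, "Yes")]
print(hunt(data, ["Refund", "MarSt"], default="No"))
```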
Hunt’s Algorithm (on the Tid/Refund data)

Step 1: a single node predicting Don’t Cheat.
Step 2: split on Refund: Yes → Don’t Cheat; No → Don’t Cheat.
Step 3: under Refund = No, split on Marital Status: Single, Divorced → Cheat;
Married → Don’t Cheat.
Step 4: under Single, Divorced, split on Taxable Income: < 80K → Don’t Cheat;
>= 80K → Cheat.
Tree Induction

O Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion.
O Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?
O Depends on attribute types
– Nominal
– Ordinal
– Continuous
O Depends on number of ways to split
– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes

O Multi-way split: Use as many partitions as distinct values.
CarType → {Family}, {Sports}, {Luxury}
O Binary split: Divides values into two subsets; need to find optimal partitioning.
CarType → {Family, Luxury} vs {Sports}, OR CarType → {Sports, Luxury} vs {Family}
Splitting Based on Ordinal Attributes

O Multi-way split: Use as many partitions as distinct values.
Size → {Small}, {Medium}, {Large}
O Binary split: Divides values into two subsets; need to find optimal partitioning.
Size → {Small, Medium} vs {Large}, OR Size → {Medium, Large} vs {Small}
O What about this split? Size → {Small, Large} vs {Medium}
(It does not respect the ordering, so it is not a valid split for an ordinal attribute.)
Splitting Based on Continuous Attributes

O Different ways of handling
– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing, equal frequency
bucketing (percentiles), or clustering.
– Binary Decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute intensive
Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1.

Candidate splits:
– Own Car? : Yes → (C0: 6, C1: 4); No → (C0: 4, C1: 6)
– Car Type? : Family → (C0: 1, C1: 3); Sports → (C0: 8, C1: 0); Luxury → (C0: 1, C1: 7)
– Student ID? : c1 → (C0: 1, C1: 0); ... ; c10 → (C0: 1, C1: 0); c11 → (C0: 0, C1: 1); ... ; c20 → (C0: 0, C1: 1)

Which test condition is the best?
How to determine the Best Split

O Greedy approach:
– Nodes with homogeneous class distribution are preferred
O Need a measure of node impurity:
– (C0: 5, C1: 5): non-homogeneous, high degree of impurity
– (C0: 9, C1: 1): homogeneous, low degree of impurity
Measures of Node Impurity
O Gini Index
O Entropy
O Misclassification error
How to Find the Best Split

Before splitting, the node has class counts (C0: N00, C1: N01) with impurity M0.
A split on attribute A yields nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21),
whose impurities M1 and M2 combine into M12; a split on attribute B yields nodes
N3 and N4 with impurities M3 and M4, combining into M34.

Gain = M0 − M12 vs. M0 − M34: choose the split with the larger gain.
Measure of Impurity: GINI

O Gini Index for a given node t:

GINI(t) = 1 − Σ_j [p(j|t)]²

(NOTE: p(j|t) is the relative frequency of class j at node t.)
– Maximum (1 − 1/n_c) when records are equally distributed among all classes,
implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most
interesting information

Examples: (C1: 0, C2: 6) → Gini = 0.000; (C1: 1, C2: 5) → Gini = 0.278;
(C1: 2, C2: 4) → Gini = 0.444; (C1: 3, C2: 3) → Gini = 0.500
Examples for computing GINI

GINI(t) = 1 − Σ_j [p(j|t)]²

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444
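These values can be checked with a direct transcription of the GINI formula into Python:

```python
def gini(counts):
    """Gini index from a list of per-class record counts at a node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]), gini([1, 5]), gini([2, 4]), gini([3, 3]))
# -> 0.0  0.2777...  0.4444...  0.5
```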
Splitting Based on GINI

O Used in CART, SLIQ, SPRINT.
O When a node p is split into k partitions (children), the quality of the split
is computed as

GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index

O Splits into two partitions
O Effect of weighting partitions:
– Larger and purer partitions are sought for.

Parent: (C1: 6, C2: 6), Gini = 0.500
Split on B: Node N1 gets (C1: 5, C2: 2), Node N2 gets (C1: 1, C2: 4)

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
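The weighted children Gini (GINI_split) for this example, in Python (the gini() helper is repeated so the snippet stands alone):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: one per-class count list per partition."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# N1 holds (C1: 5, C2: 2); N2 holds (C1: 1, C2: 4).
print(gini_split([[5, 2], [1, 4]]))  # -> ~0.371
```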
Categorical Attributes: Computing Gini Index

O For each distinct value, gather counts for each class in the dataset
O Use the count matrix to make decisions

Multi-way split:
       Family  Sports  Luxury
C1     1       2       1
C2     4       1       1
Gini = 0.393

Two-way split (find best partition of values):
       {Sports, Luxury}  {Family}
C1     3                 1
C2     2                 4
Gini = 0.400

       {Sports}  {Family, Luxury}
C1     2         2
C2     1         5
Gini = 0.419
Continuous Attributes: Computing Gini Index

O Use binary decisions based on one value (e.g., Taxable Income > 80K?)
O Several choices for the splitting value
– Number of possible splitting values = number of distinct values
O Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v
O Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its
Gini index
– Computationally inefficient! Repetition of work.
Continuous Attributes: Computing Gini Index...

O For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and
computing the gini index
– Choose the split position that has the least gini index

Sorted values (Taxable Income): 60 70 75 85 90 95 100 120 125 220
Class labels (Cheat):           No No No Yes Yes Yes No No No No
Candidate split positions:      55 65 72 80 87 92 97 110 122 172 230
Gini at each position:          0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The least Gini (0.300) is achieved at the split Taxable Income < 97.
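A sketch of this scan in Python. For clarity it recomputes the class counts at every candidate position instead of updating a running count matrix, which is exactly the inefficiency the incremental update above avoids; on the Taxable Income data it recovers the best cut (the midpoint comes out as 97.5, which the slide rounds to 97):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """Try the midpoint between each pair of adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_gini = None, float("inf")
    for i in range(1, n):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        g = (len(left) / n) * gini([left.count("Yes"), left.count("No")]) \
          + (len(right) / n) * gini([right.count("Yes"), right.count("No")])
        if g < best_gini:
            best_cut, best_gini = cut, g
    return best_cut, best_gini

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(incomes, cheat))  # -> (97.5, ~0.300)
```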
Alternative Splitting Criteria based on INFO

O Entropy at a given node t:

Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)
– Measures homogeneity of a node.
• Maximum (log n_c) when records are equally distributed among all classes,
implying least information
• Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are similar to the GINI index computations
Examples for computing Entropy

Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Entropy = − (1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Entropy = − (2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
Splitting Based on INFO...

O Information Gain:

GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i.
– Measures reduction in entropy achieved because of the split. Choose the
split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5
– Disadvantage: tends to prefer splits that result in a large number of
partitions, each being small but pure.
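Entropy and GAIN translate directly; a minimal sketch, checked against the entropy examples above and the Car Type? candidate split from the earlier best-split slide:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def info_gain(parent, children):
    """parent: class counts at p; children: class counts per partition."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(entropy([1, 5]))  # -> ~0.65
print(entropy([2, 4]))  # -> ~0.92
# Car Type? split of (C0: 10, C1: 10): Family (1,3), Sports (8,0), Luxury (1,7)
print(info_gain([10, 10], [[1, 3], [8, 0], [1, 7]]))  # -> ~0.62
```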
Splitting Based on INFO...

O Gain Ratio:

GainRATIO_split = GAIN_split / SplitINFO
SplitINFO = − Σ_{i=1..k} (n_i / n) log₂ (n_i / n)

Parent node p is split into k partitions; n_i is the number of records in partition i.
– Adjusts Information Gain by the entropy of the partitioning (SplitINFO).
Higher-entropy partitioning (large number of small partitions) is penalized!
– Used in C4.5
– Designed to overcome the disadvantage of Information Gain
Splitting Criteria based on Classification Error

O Classification error at a node t:

Error(t) = 1 − max_i P(i|t)

O Measures misclassification error made by a node.
• Maximum (1 − 1/n_c) when records are equally distributed among all classes,
implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most
interesting information
Examples for Computing Error

Error(t) = 1 − max_i P(i|t)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 − max(0, 1) = 1 − 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
Comparison among Splitting Criteria

For a 2-class problem: [figure omitted: Gini, entropy, and misclassification
error plotted against the fraction of records in one class]
Misclassification Error vs Gini

Parent: (C1: 7, C2: 3), Gini = 0.42
Split on A: Node N1 gets (C1: 3, C2: 0), Node N2 gets (C1: 4, C2: 3)

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, even though the misclassification error is unchanged at 3/10!
Stopping Criteria for Tree Induction
O Stop expanding a node when all the records
belong to the same class
O Stop expanding a node when all the records have
similar attribute values
O Early termination (to be discussed later)
Decision Tree Based Classification
O Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Example: C4.5
O Simple depth-first construction.
O Uses Information Gain
O Sorts Continuous Attributes at each node.
O Needs entire data to fit in memory.
O Unsuitable for Large Datasets.
– Needs out-of-core sorting.
O You can download the software from:
http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Practical Issues of Classification
O Underfitting and Overfitting
O Missing Values
O Costs of Classification
Underfitting and Overfitting (Example)

500 circular and 500 triangular data points.

Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
Underfitting and Overfitting

[figure omitted: training and test error versus number of nodes; the widening
gap at high complexity is overfitting]
Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise
Decision boundary is distorted by noise point
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Notes on Overfitting
O Overfitting results in decision trees that are more
complex than necessary
O Training error no longer provides a good estimate
of how well the tree will perform on previously
unseen records
O Need new ways for estimating errors
Estimating Generalization Errors

O Re-substitution errors: error on training (Σ e(t))
O Generalization errors: error on testing (Σ e’(t))
O Methods for estimating generalization errors:
– Optimistic approach: e’(t) = e(t)
– Pessimistic approach:
• For each leaf node: e’(t) = e(t) + 0.5
• Total errors: e’(T) = e(T) + N × 0.5 (N: number of leaf nodes)
• For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances):
Training error = 10/1000 = 1%
Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
– Reduced error pruning (REP):
• uses a validation data set to estimate generalization error
Occam’s Razor
O Given two models of similar generalization errors,
one should prefer the simpler model over the
more complex model
O For complex models, there is a greater chance
that they were fitted accidentally by errors in data
O Therefore, one should include model complexity
when evaluating a model
Minimum Description Length (MDL)

O Cost(Model, Data) = Cost(Data|Model) + Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
O Cost(Data|Model) encodes the misclassification errors.
O Cost(Model) uses node encoding (number of children) plus splitting
condition encoding.

[figure omitted]
How to Address Overfitting

O Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if class distribution of instances is independent of the available
features (e.g., using the χ² test)
• Stop if expanding the current node does not improve impurity measures
(e.g., Gini or information gain).
How to Address Overfitting…

O Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If generalization error improves after trimming, replace sub-tree by a leaf node.
– Class label of leaf node is determined from majority class of instances in
the sub-tree
– Can use MDL for post-pruning
Example of Post-Pruning

Node before splitting: Class = Yes 20, Class = No 10
Training error (before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30

Split on A produces four children with class counts:
A1: Yes 8, No 4;  A2: Yes 3, No 4;  A3: Yes 4, No 1;  A4: Yes 5, No 1
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30

Since 11/30 > 10.5/30: PRUNE!
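The pruning decision reduces to comparing two pessimistic estimates; a minimal sketch:

```python
def pessimistic_error(train_errors, n_leaves, n_records):
    """Training errors plus a 0.5 penalty per leaf, as a rate."""
    return (train_errors + 0.5 * n_leaves) / n_records

before = pessimistic_error(10, 1, 30)  # the node kept as a single leaf
after = pessimistic_error(9, 4, 30)    # the node split into A1..A4
print(before, after, "PRUNE!" if after >= before else "keep the split")
# -> 0.35 0.3666... PRUNE!
```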
Examples of Post-pruning

Case 1: children with (C0: 11, C1: 3) and (C0: 2, C1: 4)
Case 2: children with (C0: 14, C1: 3) and (C0: 2, C1: 2)

– Optimistic error? Don’t prune for both cases
– Pessimistic error? Don’t prune case 1, prune case 2
– Reduced error pruning? Depends on validation set
Handling Missing Attribute Values
O Missing values affect decision tree construction in
three different ways:
– Affects how impurity measures are computed
– Affects how to distribute instance with missing
value to child nodes
– Affects how a test instance with missing value
is classified
Computing Impurity Measure

Training data: the same ten records as before, except Tid 10 now has
Refund = ? (a missing value).

Counts (Class = Yes / Class = No):
Refund = Yes: 0 / 3
Refund = No:  2 / 4
Refund = ?:   1 / 0

Before splitting:
Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813

Split on Refund:
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551

Gain = 0.9 × (0.8813 − 0.551) = 0.3303
(The factor 0.9 is the fraction of records whose Refund value is known.)
Distribute Instances

Records with a known Refund value are sent down the tree normally:
Refund = Yes: Class=Yes 0, Class=No 3
Refund = No:  Class=Yes 2, Class=No 4

Record Tid 10 (Refund = ?, Single, 90K, Class = Yes) is sent to both children
with fractional weights:
Probability that Refund=Yes is 3/9; probability that Refund=No is 6/9.
Assign the record to the left child with weight = 3/9 and to the right child
with weight = 6/9, giving:
Refund = Yes: Class=Yes 0 + 3/9, Class=No 3
Refund = No:  Class=Yes 2 + 6/9, Class=No 4
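A sketch of this fractional-weight bookkeeping in Python, using the known-value counts from above:

```python
# Class counts from the nine records whose Refund value is known.
known = {"Yes": {"Yes": 0.0, "No": 3.0}, "No": {"Yes": 2.0, "No": 4.0}}
total_known = sum(sum(d.values()) for d in known.values())  # 9 records

# Tid 10 has Refund = ? and Class = Yes: send it down both branches,
# weighted by how often each Refund value occurs (3/9 and 6/9).
counts = {}
for branch, d in known.items():
    weight = sum(d.values()) / total_known
    counts[branch] = dict(d)
    counts[branch]["Yes"] += weight  # the record's class is Yes, weight < 1

print(counts["Yes"], counts["No"])
# -> {'Yes': 0.333..., 'No': 3.0} {'Yes': 2.666..., 'No': 4.0}
```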
Classify Instances

New record:
Tid  Refund  Marital Status  Taxable Income  Class
11   No      ?               85K             ?

Weighted class counts at the Marital Status node:
            Married  Single  Divorced  Total
Class=No    3        1       0         4
Class=Yes   6/9      1       1         2.67
Total       3.67     2       1         6.67

Probability that Marital Status = Married is 3.67/6.67
Probability that Marital Status = {Single, Divorced} is 3/6.67
Other Issues
O Data Fragmentation
O Search Strategy
O Expressiveness
O Tree Replication
Data Fragmentation
O Number of instances gets smaller as you traverse
down the tree
O Number of instances at the leaf nodes could be
too small to make any statistically significant
decision
Search Strategy

O Finding an optimal decision tree is NP-hard
O The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution
O Other strategies?
– Bottom-up
– Bi-directional
Expressiveness

O Decision trees provide an expressive representation for learning discrete-valued functions
– But they do not generalize well to certain types of Boolean functions
‹ Example: parity function:
– Class = 1 if there is an even number of Boolean attributes with truth value = True
– Class = 0 if there is an odd number of Boolean attributes with truth value = True
‹ For accurate modeling, must have a complete tree
O Not expressive enough for modeling continuous variables
– Particularly when the test condition involves only a single attribute at a time
Decision Boundary

[Figure: a two-dimensional data set on attributes x, y ∈ [0, 1], partitioned by the tree
x < 0.43? → Yes: y < 0.47? (leaf class counts 4:0 and 0:4)
          → No:  y < 0.33? (leaf class counts 0:3 and 4:0)]

• Border line between two neighboring regions of different classes is known as decision boundary
• Decision boundary is parallel to axes because the test condition involves a single attribute at a time
Oblique Decision Trees

[Figure: a 2-D data set split by the single oblique test x + y < 1, separating Class = + from Class = –]

• Test condition may involve multiple attributes
• More expressive representation
• Finding optimal test condition is computationally expensive
Tree Replication

[Figure: a decision tree rooted at P in which the same subtree, rooted at Q with test S, appears under two different branches]

• Same subtree appears in multiple branches
Model Evaluation
O Metrics for Performance Evaluation
– How to evaluate the performance of a model?
O Methods for Performance Evaluation
– How to obtain reliable estimates?
O Methods for Model Comparison
– How to compare the relative performance
among competing models?
Metrics for Performance Evaluation

O Focus on the predictive capability of a model
– Rather than how fast it classifies or builds models, scalability, etc.
O Confusion Matrix:

                        PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes        a           b
CLASS    Class=No         c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation…

O Most widely-used metric:

                        PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes     a (TP)      b (FN)
CLASS    Class=No      c (FP)      d (TN)

Accuracy = (a + d)/(a + b + c + d) = (TP + TN)/(TP + TN + FP + FN)
Limitation of Accuracy
O Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
O If model predicts everything to be class 0,
accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does
not detect any class 1 example
Cost Matrix

C(i|j): Cost of misclassifying a class j example as class i

                        PREDICTED CLASS
C(i|j)                Class=Yes    Class=No
ACTUAL   Class=Yes    C(Yes|Yes)   C(No|Yes)
CLASS    Class=No     C(Yes|No)    C(No|No)
Computing Cost of Classification

Cost Matrix C(i|j):
                     PREDICTED CLASS
                        +        -
ACTUAL      +          -1       100
CLASS       -           1         0

Model M1:            PREDICTED CLASS
                        +        -
ACTUAL      +         150        40
CLASS       -          60       250

Accuracy = 80%
Cost = 3910

Model M2:            PREDICTED CLASS
                        +        -
ACTUAL      +         250        45
CLASS       -           5       200

Accuracy = 90%
Cost = 4255
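A minimal sketch of this computation (illustrative, not from the lecture notes): total cost is the sum over confusion-matrix cells of count × cost, and it reproduces the numbers for M1 and M2 above.

# Illustrative sketch: cost and accuracy of a model from its confusion
# matrix and the cost matrix above. Cells are keyed (actual, predicted).
COST = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}

def total_cost(confusion):
    return sum(n * COST[cell] for cell, n in confusion.items())

def accuracy(confusion):
    n = sum(confusion.values())
    return (confusion[("+", "+")] + confusion[("-", "-")]) / n

M1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
M2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5,  ("-", "-"): 200}

print(accuracy(M1), total_cost(M1))   # 0.8, 3910
print(accuracy(M2), total_cost(M2))   # 0.9, 4255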
Cost vs Accuracy

Count:                  PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes        a           b
CLASS    Class=No         c           d

Cost:                   PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes        p           q
CLASS    Class=No         q           p

N = a + b + c + d
Accuracy = (a + d)/N
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
Cost-Sensitive Measures

Precision (p) = a/(a + c)
Recall (r) = a/(a + b)
F-measure (F) = 2rp/(r + p) = 2a/(2a + b + c)

O Precision is biased towards C(Yes|Yes) & C(Yes|No)
O Recall is biased towards C(Yes|Yes) & C(No|Yes)
O F-measure is biased towards all except C(No|No)

Weighted Accuracy = (w1 a + w4 d)/(w1 a + w2 b + w3 c + w4 d)
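A short sketch of these measures from confusion-matrix counts a = TP, b = FN, c = FP, d = TN; the example counts are assumed for illustration and the final line checks that the two F-measure forms agree.

# Illustrative sketch of the cost-sensitive measures above.
def precision(a, c):
    return a / (a + c)

def recall(a, b):
    return a / (a + b)

def f_measure(a, b, c):
    return 2 * a / (2 * a + b + c)   # harmonic mean of precision and recall

a, b, c, d = 40, 10, 20, 30          # assumed example counts
p, r = precision(a, c), recall(a, b)
print(round(p, 4), round(r, 4))                            # 0.6667, 0.8
print(round(f_measure(a, b, c), 4),
      round(2 * r * p / (r + p), 4))                       # both 0.7273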
Methods for Performance Evaluation
O How to obtain a reliable estimate of
performance?
O Performance of a model may depend on other
factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets
Learning Curve

O Learning curve shows how accuracy changes with varying sample size
O Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley, et al)
– Geometric sampling (Provost et al)

Effect of small sample size:
- Bias in the estimate
- Variance of estimate
Methods of Estimation
O Holdout
– Reserve 2/3 for training and 1/3 for testing
O Random subsampling
– Repeated holdout
O Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
O Stratified sampling
– oversampling vs undersampling
O Bootstrap
– Sampling with replacement
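A minimal sketch of k-fold cross-validation as listed above (illustrative; a real project would typically use sklearn.model_selection.KFold). The train_fn and score_fn callables are assumed placeholders for any learner and any accuracy function.

# Illustrative sketch: partition the data into k disjoint folds, train on
# k-1 of them, test on the held-out fold, and average the scores.
import random

def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]   # k disjoint subsets covering all n

def cross_validate(data, labels, train_fn, score_fn, k=10):
    scores = []
    for fold in k_fold_indices(len(data), k):
        test = set(fold)
        train = [i for i in range(len(data)) if i not in test]
        model = train_fn([data[i] for i in train], [labels[i] for i in train])
        scores.append(score_fn(model, [data[i] for i in fold],
                               [labels[i] for i in fold]))
    return sum(scores) / k                 # average score over the k folds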
ROC (Receiver Operating Characteristic)
O Developed in 1950s for signal detection theory to
analyze noisy signals
– Characterize the trade-off between positive
hits and false alarms
O ROC curve plots TP (on the y-axis) against FP
(on the x-axis)
O Performance of each classifier represented as a
point on the ROC curve
– changing the threshold of algorithm, sample
distribution or cost matrix changes the location
of the point
ROC Curve

[Figure: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive]

At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve

(TP, FP):
O (0, 0): declare everything to be negative class
O (1, 1): declare everything to be positive class
O (1, 0): ideal
O Diagonal line:
– Random guessing
– Below diagonal line:
‹ prediction is opposite of the true class
Using ROC for Model Comparison

O No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR
O Area Under the ROC curve (AUC)
– Ideal: Area = 1
– Random guess: Area = 0.5
How to Construct an ROC curve

Instance  P(+|A)  True Class
   1       0.95       +
   2       0.93       +
   3       0.87       -
   4       0.85       -
   5       0.85       -
   6       0.85       +
   7       0.76       -
   8       0.53       +
   9       0.43       -
  10       0.25       +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP + FN)
• FP rate, FPR = FP/(FP + TN)
How to construct an ROC curve

Threshold >=  0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00
Class          +    -    +    -    -    -    +    -    +    +
TP             5    4    4    3    3    3    3    2    2    1    0
FP             5    5    4    4    3    2    1    1    0    0    0
TN             0    0    1    1    2    3    4    4    5    5    5
FN             0    1    1    2    2    2    2    3    3    4    5
TPR            1   0.8  0.8  0.6  0.6  0.6  0.6  0.4  0.4  0.2   0
FPR            1    1   0.8  0.8  0.6  0.4  0.2  0.2   0    0    0

ROC Curve: plot TPR against FPR at each threshold.
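A sketch of this construction (illustrative, not library code): sort by P(+|A) in decreasing order, sweep a threshold through the unique score values, and count TP and FP each time. On the ten instances above it reproduces the (FPR, TPR) pairs of the table, read right to left.

# Illustrative sketch of ROC-curve construction.
def roc_points(scores, labels):              # labels: True for +, False for -
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)                          # number of positive instances
    N = len(labels) - P                      # number of negative instances
    points = []
    for threshold in sorted({s for s, _ in pairs} | {1.0}, reverse=True):
        tp = sum(1 for s, y in pairs if s >= threshold and y)
        fp = sum(1 for s, y in pairs if s >= threshold and not y)
        points.append((fp / N, tp / P))      # (FPR, TPR)
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [True, True, False, False, False, True, False, True, False, True]
print(roc_points(scores, labels))   # (0,0), (0,0.2), ..., (1,1) as in the table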
Test of Significance
O Given two models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances
O Can we say M1 is better than M2?
– How much confidence can we place on accuracy of
M1 and M2?
– Can the difference in performance measure be
explained as a result of random fluctuations in the test
set?
Confidence Interval for Accuracy

O Prediction can be regarded as a Bernoulli trial
– A Bernoulli trial has 2 possible outcomes
– Possible outcomes for prediction: correct or wrong
– Collection of Bernoulli trials has a Binomial distribution:
‹ x ~ Bin(N, p), where x is the number of correct predictions
‹ e.g.: Toss a fair coin 50 times, how many heads would turn up?
Expected number of heads = N×p = 50 × 0.5 = 25
O Given x (# of correct predictions) or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?
Confidence Interval for Accuracy

O For large test sets (N > 30),
– acc has a normal distribution with mean p and variance p(1 – p)/N

    P( Z_{α/2} < (acc – p)/sqrt( p(1 – p)/N ) < Z_{1-α/2} ) = 1 – α

  (the area under the normal curve between Z_{α/2} and Z_{1-α/2} is 1 – α)

O Confidence Interval for p:

    p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2}·sqrt( Z²_{α/2} + 4·N·acc – 4·N·acc² ) ) / ( 2(N + Z²_{α/2}) )
Confidence Interval for Accuracy

O Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
– N = 100, acc = 0.8
– Let 1 – α = 0.95 (95% confidence)
– From probability table, Z_{α/2} = 1.96

1 – α     Z
0.99     2.58
0.98     2.33
0.95     1.96
0.90     1.65

Resulting interval for p at different test-set sizes:

N          50     100    500    1000   5000
p(lower)  0.670  0.711  0.763  0.774  0.789
p(upper)  0.888  0.866  0.833  0.824  0.811
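A minimal sketch of the interval formula above (normal approximation); it reproduces the p(lower)/p(upper) rows of the table for acc = 0.8 at 95% confidence.

# Illustrative sketch: confidence interval for the true accuracy p.
from math import sqrt

def accuracy_interval(acc, n, z=1.96):
    centre = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (centre - spread) / denom, (centre + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.8, n)
    print(n, round(lo, 3), round(hi, 3))   # e.g. 100 -> 0.711, 0.866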
Comparing Performance of 2 Models

O Given two models, say M1 and M2, which is better?
– M1 is tested on D1 (size = n1), found error rate = e1
– M2 is tested on D2 (size = n2), found error rate = e2
– Assume D1 and D2 are independent
– If n1 and n2 are sufficiently large, then
    e1 ~ N(μ1, σ1)
    e2 ~ N(μ2, σ2)
– Approximate: σ̂i² = ei (1 – ei) / ni
Comparing Performance of 2 Models

O To test if the performance difference is statistically significant: d = e1 – e2
– d ~ N(dt, σt) where dt is the true difference
– Since D1 and D2 are independent, their variances add up:
    σt² = σ1² + σ2² ≅ σ̂1² + σ̂2²
        = e1(1 – e1)/n1 + e2(1 – e2)/n2
– At the (1 – α) confidence level,
    dt = d ± Z_{α/2}·σ̂t
An Illustrative Example

O Given: M1: n1 = 30, e1 = 0.15
         M2: n2 = 5000, e2 = 0.25
O d = |e2 – e1| = 0.1 (2-sided test)

    σ̂d² = 0.15(1 – 0.15)/30 + 0.25(1 – 0.25)/5000 = 0.0043

O At 95% confidence level, Z_{α/2} = 1.96

    dt = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128

=> Interval contains 0 => the difference may not be statistically significant
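A short sketch of this test (illustrative, not from the lecture notes): variance of the difference of two independent error rates, then a Z-based interval. It reproduces 0.100 ± 0.128.

# Illustrative sketch: significance test for two independent error rates.
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    d = abs(e2 - e1)
    return d, z * sqrt(var)

d, margin = difference_interval(0.15, 30, 0.25, 5000)
print(d, round(margin, 3))   # 0.1, 0.128
print("significant" if d - margin > 0 else "may not be significant")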
Comparing Performance of 2 Algorithms

O Each learning algorithm may produce k models:
– L1 may produce M11, M12, …, M1k
– L2 may produce M21, M22, …, M2k
O If models are generated on the same test sets D1, D2, …, Dk (e.g., via cross-validation)
– For each set: compute dj = e1j – e2j
– dj has mean dt and variance σt²
– Estimate:
    σ̂t² = Σ_{j=1..k} (dj – d̄)² / ( k(k – 1) )
    dt = d̄ ± t_{1-α, k-1}·σ̂t
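A minimal sketch of this paired comparison, assuming made-up per-fold differences for illustration; the hard-coded t critical value would normally come from a table or scipy.stats.t.ppf.

# Illustrative sketch: paired comparison of two algorithms over k folds.
from math import sqrt

def paired_interval(d, t_crit):
    k = len(d)
    mean = sum(d) / k
    var = sum((x - mean) ** 2 for x in d) / (k * (k - 1))   # slide's estimate
    return mean, t_crit * sqrt(var)

# Assumed example: differences dj = e1j - e2j over k = 5 folds;
# t_{0.95, 4} is approximately 2.776.
mean, margin = paired_interval([0.02, 0.01, 0.03, -0.01, 0.02], t_crit=2.776)
print(round(mean, 3), round(margin, 3))   # interval d_bar +/- margin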