Data Mining: What is Data Mining?

Overview
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Continuous Innovation
Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Example
For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays, however, they bought only a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display, and they could make sure beer and diapers were sold at full price on Thursdays.

Data, Information, and Knowledge

Data
Data are any facts, numbers, or text that can be processed by a computer.
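At its core, the beer-and-diapers discovery in the example above is association mining over transaction records: counting how often items co-occur (support) and how often one item appears given another (confidence). A minimal sketch with invented baskets (the transactions are made up for illustration, not the grocery chain's actual data):

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions (one set of items per trip).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "beer", "bread"},
    {"milk", "eggs"},
]

# Count how often each pair of items appears together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: fraction of transactions containing both items.
# Confidence of diapers -> beer: fraction of diaper baskets that also hold beer.
diapers_total = sum(1 for b in transactions if "diapers" in b)
both = pair_counts[("beer", "diapers")]
print("support(beer, diapers) =", both / len(transactions))   # 3/5 = 0.6
print("confidence(diapers -> beer) =", both / diapers_total)  # 3/3 = 1.0
```

Real tools (Apriori and its descendants) avoid enumerating every pair over huge transaction files, but the quantities they report are these.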
Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
* operational or transactional data, such as sales, cost, inventory, payroll, and accounting
* nonoperational data, such as industry sales, forecast data, and macroeconomic data
* meta data - data about the data itself, such as logical database design or data dictionary definitions

Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Data Warehouses
Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies, and equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.

What can data mining do?
Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations.
It enables these companies to determine relationships among internal factors such as price, product positioning, or staff skills, and external factors such as economic indicators, competition, and customer demographics. It enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to drill down into summary information to view detailed transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies.
For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game. By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.

How does data mining work?
While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
* Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
* Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
* Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
* Sequential patterns: Data is mined to anticipate behavior patterns and trends.
For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:
* Extract, transform, and load transaction data onto the data warehouse system.
* Store and manage the data in a multidimensional database system.
* Provide data access to business analysts and information technology professionals.
* Analyze the data with application software.
* Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:
* Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
* Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
* Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
* Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
* Rule induction: The extraction of useful if-then rules from data based on statistical significance.
* Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.

What technological infrastructure is required?
Today, data mining applications are available on all size systems for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical technological drivers:
* Size of the database: the more data being processed and maintained, the more powerful the system required.
* Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.
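The nearest-neighbor method listed under the levels of analysis above can be sketched in a few lines. A toy illustration with made-up two-dimensional records; the choice of k, Euclidean distance, and the data are all assumptions for the sketch:

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    by_distance = sorted(
        labeled_points,
        key=lambda item: math.dist(query, item[0]),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (features, class).
history = [
    ((1.0, 1.0), "buys"),
    ((1.2, 0.8), "buys"),
    ((5.0, 5.0), "skips"),
    ((5.5, 4.5), "skips"),
    ((0.9, 1.1), "buys"),
]

print(knn_classify((1.1, 1.0), history, k=3))  # -> buys
```

Production nearest-neighbor systems add spatial indexes (e.g. k-d trees) so classification does not require scanning the whole historical dataset for every record.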
--------------------------------------------------------------------------------
Data Types (Data Mining)
SQL Server 2008 R2 Books Online (Analysis Services - Data Mining)

When you create a mining model or a mining structure in Microsoft SQL Server Analysis Services, you must define the data types for each of the columns in the mining structure. The data type tells the data mining engine whether the data in the data source is numerical or text, and how the data should be processed. For example, if your source data contains numerical data, you can specify whether the numbers should be treated as integers or with decimal places.

Each data type supports one or more content types. By setting the content type, you can customize the way that data in the column is processed or calculated in the mining model. For example, if you have numeric data in a column, you can choose to handle it as either a numeric or a text data type. If you choose the numeric data type, you can set several different content types: you can discretize the numbers, or handle them as continuous values. For a list of all the content types, see Content Types (Data Mining).
Analysis Services supports the following data types for mining structure columns:

Data Type  Supported Content Types
Text       Cyclical, Discrete, Discretized, Key Sequence, Ordered, Sequence
Long       Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered, Sequence, Time, Classified
Boolean    Cyclical, Discrete, Ordered
Double     Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered, Sequence, Time, Classified
Date       Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered

Note: The Time and Sequence content types are only supported by third-party algorithms. The Cyclical and Ordered content types are supported, but most algorithms treat them as discrete values and do not perform special processing.

Specifying a Data Type
--------------------------------------------------------------------------------
If you create the mining model directly by using Data Mining Extensions (DMX), you can define the data type for each column as you define the model, and Analysis Services will create the corresponding mining structure with the specified data types at the same time. If you create the mining model or mining structure by using a wizard, Analysis Services will suggest a data type, or you can choose a data type from a list.

Changing a Data Type
--------------------------------------------------------------------------------
If you change the data type of a column, you must always reprocess the mining structure and any mining models that are based on that structure. Sometimes, if you change the data type, that column can no longer be used in a particular model. In that case, Analysis Services will either raise an error when you reprocess the model, or will process the model but leave out that particular column.
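The difference between the Continuous and Discretized content types amounts to whether numeric values are modeled as-is or first grouped into buckets. A rough Python sketch of equal-width bucketing - illustrative only, not the engine's actual method (Analysis Services provides its own discretization methods, covered under Discretization Methods (Data Mining)); the bucket count and income values are invented:

```python
def discretize_equal_width(values, buckets=3):
    """Map each numeric value to a bucket index using equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    def bucket(v):
        # Clamp the maximum value into the last bucket.
        return min(int((v - lo) / width), buckets - 1)
    return [bucket(v) for v in values]

incomes = [55, 67, 80, 95, 110, 125]  # hypothetical numeric column values
print(discretize_equal_width(incomes, buckets=3))  # -> [0, 0, 1, 1, 2, 2]
```

After discretization, an algorithm that only handles discrete values sees the bucket labels instead of the raw numbers.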
See Also
--------------------------------------------------------------------------------
Reference: Content Types (DMX), Data Types (DMX)
Concepts: Content Types (Data Mining), Data Mining Algorithms (Analysis Services - Data Mining), Mining Structures (Analysis Services - Data Mining), Mining Model Columns
Other Resources: Mining Structure Columns
© 2011 Microsoft. All rights reserved.

--------------------------------------------------------------------------------
What are the Types of Data Mining?
Posted: Jan 21, 2009

Web mining, an extension of data mining, implies employing the techniques of data mining on documents on the Internet. Web mining is used to study various aspects of a website and to recognize the relationships and patterns in user behavior, in order to gain insight into crucial information. For example, if you have to improve the accessibility of your website, you need to know the crucial points that need improvement. Web mining presents the required results.
It takes into consideration the IP addresses of website visitors, browser logs, cookies, and so on. Web mining tools analyze these logs and process them to produce meaningful and understandable information. For example, various bits of information can be analyzed to track the browsing route of website visitors. This may assist you in devising ways to make your website more effective. The whole process of web mining involves extracting information from the Internet through traditional practices of data mining and applying it to specific features of the website.

Types of Web Mining
Web mining helps to discover information, find related data and documents, identify patterns and trends, and make sure that web resources remain efficient. There are three main types of web mining:
• Web Content Mining
• Web Usage Mining
• Web Structure Mining

Web Content Mining
This process seeks to discover all the hyperlinks within a document in order to generate a structural report on the web page. Various facets are evaluated and analyzed for further research: whether users are able to find information, whether the website structure is too deep or too shallow, whether the web page elements are placed correctly, and which are the most and least visited areas of a website and whether they have anything to do with the page design.

Web Usage Mining
In this process, data mining techniques are applied to discover patterns and trends in the browsing behavior of website visitors. Navigation patterns are extracted so that browsing patterns can be deciphered and the website structured and designed accordingly. For instance, if there is a particular feature of the website that visitors tend to use very often, you should seek to make it more pronounced and enhanced in order to increase usability and appeal to users. This process makes use of web logs and access records.
By understanding visitor movement and behavior as they surf the Internet, you can cater to their needs and preferences better and thus make your website popular among Internet users.

Web Structure Mining
Web structure mining involves the use of graph theory to analyze the node and connection structure of a website. According to the nature and type of web structure data, it is further divided into two kinds. The first is extracting patterns from hyperlinks on the Internet; a hyperlink is a structural web address that connects a web page to another location. The second kind is mining the document structure: a tree-like structure is used to analyze and describe the HTML or XHTML tags within the web page.

Source: http://www.articlesbase.com/business-articles/what-are-the-types-of-data-mining-731327.html
About the Author: Maneet Puri is the managing director of LeXolution IT Services, an offshore outsourcing company specializing in KPO services such as data mining, internet market research, and virtual private assistance.
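Web structure mining, as described above, starts by extracting the hyperlink graph: each page is a node and each `<a href>` is an outgoing edge. A minimal sketch using only Python's standard-library HTML parser; the sample page is invented:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags, i.e. the page's outgoing edges."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="/contact">Contact</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # -> ['/about', '/contact']
```

Running this extractor over every page of a site yields the adjacency structure on which graph-theoretic measures (in-degree, out-degree, connectivity) can then be computed.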
--------------------------------------------------------------------------------
Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4, Introduction to Data Mining, by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Classification: Definition
* Given a collection of records (training set), each record contains a set of attributes; one of the attributes is the class.
* Find a model for the class attribute as a function of the values of the other attributes.
* Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Illustrating Classification Task
A learning algorithm performs induction on the training set to learn a model; the model is then applied (deduction) to the test set.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Task
* Predicting tumor cells as benign or malignant
* Classifying credit card transactions as legitimate or fraudulent
* Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
* Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
* Decision Tree based Methods
* Rule-based Methods
* Memory based reasoning
* Neural Networks
* Naïve Bayes and Bayesian Belief Networks
* Support Vector Machines

Example of a Decision Tree
Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
  Refund?
    Yes -> NO
    No  -> MarSt?
             Married          -> NO
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

Another Example of a Decision Tree (same training data)
  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!
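The first tree above can be transcribed directly as nested conditionals, which makes the "apply model" step concrete. A sketch, with thresholds and labels taken from the slide:

```python
def classify(refund, marital_status, taxable_income):
    """First decision tree from the slides: split on Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"                      # leaf: NO
    if marital_status == "Married":
        return "No"                      # leaf: NO
    # Single or Divorced: split on taxable income (in thousands) at 80K.
    return "Yes" if taxable_income > 80 else "No"

# Check the tree against a few training records (Tid 1, 5, 8).
print(classify("Yes", "Single", 125))   # -> No
print(classify("No", "Divorced", 95))   # -> Yes
print(classify("No", "Single", 85))     # -> Yes
```

Note the tree reproduces the training labels here but was hand-built; tree induction algorithms search for such splits automatically.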
Decision Tree Classification Task
A tree induction algorithm learns a decision tree from the training set (induction); the tree is then applied to the test set to assign class labels (deduction).

Apply Model to Test Data
Start from the root of the tree and follow the test conditions. For the test record
Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
the Refund = No branch leads to the MarSt node, and the Married branch leads to a leaf, so assign Cheat = "No".

Decision Tree Induction
O Many Algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

General Structure of Hunt's Algorithm
O Let Dt be the set of training records that reach a node t
O General Procedure:
– If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

Hunt's Algorithm (on the Refund / Marital Status / Taxable Income data)
Start with a single leaf, Don't Cheat. Split on Refund (Yes → Don't Cheat; No → Don't Cheat), then split the Refund = No branch on Marital Status (Married → Don't Cheat; Single, Divorced → Cheat), and finally split the Single, Divorced branch on Taxable Income (< 80K → Don't Cheat; >= 80K → Cheat).

Tree Induction
O Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
O Issues
– Determine how to split the records
? How to specify the attribute test condition?
? How to determine the best split?
– Determine when to stop splitting

How to Specify Test Condition?
O Depends on attribute types
– Nominal
– Ordinal
– Continuous
O Depends on number of ways to split
– 2-way split
– Multi-way split

Splitting Based on Nominal Attributes
O Multi-way split: Use as many partitions as distinct values.
CarType → Family | Sports | Luxury
O Binary split: Divides values into two subsets; need to find the optimal partitioning.
CarType → {Family, Luxury} | {Sports}   OR   CarType → {Sports, Luxury} | {Family}

Splitting Based on Ordinal Attributes
O Multi-way split: Use as many partitions as distinct values.
Size → Small | Medium | Large
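The general procedure of Hunt's algorithm maps directly onto a short recursive function. The following is a minimal sketch, not the book's implementation: the split attribute is chosen naively in input order (a real inducer would choose it with an impurity measure such as Gini or entropy), and the `hunt` and `records` names are illustrative.

```python
from collections import Counter

def hunt(records, labels, attributes, default_class):
    """Hunt's general procedure: records is a list of dicts, labels the
    parallel list of class values, attributes the tests still available."""
    if not records:                       # Dt is empty -> leaf labeled yd
        return default_class
    if len(set(labels)) == 1:             # all of Dt in one class -> leaf yt
        return labels[0]
    if not attributes:                    # no test left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                  # naive choice; real trees pick by impurity
    majority = Counter(labels).most_common(1)[0][0]
    branches = {}
    for value in sorted(set(r[attr] for r in records)):
        sub = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        branches[value] = hunt([r for r, _ in sub], [y for _, y in sub],
                               attributes[1:], majority)
    return {attr: branches}

# Tiny slice of the Refund / Marital Status data from the notes
records = [{"Refund": "Yes", "MarSt": "Single"},
           {"Refund": "No",  "MarSt": "Married"},
           {"Refund": "No",  "MarSt": "Divorced"},
           {"Refund": "No",  "MarSt": "Single"}]
labels = ["No", "No", "Yes", "Yes"]
tree = hunt(records, labels, ["Refund", "MarSt"], "No")
print(tree["Refund"]["Yes"])   # No
```

On this slice the algorithm reproduces the behavior described above: the Refund = Yes branch becomes a pure leaf, and the Refund = No branch is split further on MarSt.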
O Binary split: Divides values into two subsets; need to find the optimal partitioning, and the partitions must preserve the order: Size → {Medium, Large} | {Small} OR Size → {Small, Medium} | {Large}
O What about this split? Size → {Small, Large} | {Medium} — it violates the order property of an ordinal attribute.

Splitting Based on Continuous Attributes
O Different ways of handling
– Discretization to form an ordinal categorical attribute
? Static – discretize once at the beginning
? Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
– Binary Decision: (A < v) or (A >= v)
? consider all possible splits and find the best cut
? can be more compute intensive
Examples: a binary split (Taxable Income > 80K? Yes / No) or a multi-way split (Taxable Income? < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K).

How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1. Candidate test conditions:
– Own Car? — Yes: C0 6, C1 4 | No: C0 4, C1 6
– Car Type? — Family: C0 1, C1 3 | Sports: C0 8, C1 0 | Luxury: C0 1, C1 7
– Student ID? — c1: C0 1, C1 0 | ... | c10: C0 1, C1 0 | c11: C0 0, C1 1 | ... | c20: C0 0, C1 1
Which test condition is the best?
How to Determine the Best Split
O Greedy approach:
– Nodes with a homogeneous class distribution are preferred
O Need a measure of node impurity:
C0: 5, C1: 5 — non-homogeneous, high degree of impurity
C0: 9, C1: 1 — homogeneous, low degree of impurity

Measures of Node Impurity
O Gini Index
O Entropy
O Misclassification error

How to Find the Best Split
Before splitting, the node has class counts (N00, N01) and impurity M0. A test A? yields children N1 and N2 with impurities M1 and M2 (combined M12); a test B? yields children N3 and N4 with impurities M3 and M4 (combined M34). Compare Gain = M0 – M12 vs M0 – M34 and choose the larger.

Measure of Impurity: GINI
O Gini Index for a given node t:
GINI(t) = 1 – Σj [p(j | t)]²
(NOTE: p(j | t) is the relative frequency of class j at node t).
– Maximum (1 – 1/nc) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information
C1 0, C2 6 → Gini = 0.000; C1 1, C2 5 → Gini = 0.278; C1 2, C2 4 → Gini = 0.444; C1 3, C2 3 → Gini = 0.500

Examples for computing GINI
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0
P(C1) = 1/6, P(C2) = 5/6 → Gini = 1 – (1/6)² – (5/6)² = 0.278
P(C1) = 2/6, P(C2) = 4/6 → Gini = 1 – (2/6)² – (4/6)² = 0.444

Splitting Based on GINI
O Used in CART, SLIQ, SPRINT.
O When a node p is split into k partitions (children), the quality of the split is computed as
GINIsplit = Σ(i=1..k) (ni / n) GINI(i)
where ni = number of records at child i, and n = number of records at node p.

Binary Attributes: Computing GINI Index
O Splits into two partitions
O Effect of weighing partitions:
– Larger and purer partitions are sought.
Example: the parent has C1 6, C2 6 (Gini = 0.500). A binary test B? gives N1 (C1 5, C2 2) and N2 (C1 1, C2 4):
Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing Gini Index
O For each distinct value, gather counts for each class in the dataset
O Use the count matrix to make decisions
Multi-way split — CarType (Family | Sports | Luxury): C1 1, 2, 1; C2 4, 1, 1; Gini 0.393
Two-way split (find the best partition of values) — CarType {Sports, Luxury} | {Family}: C1 3, 1; C2 2, 4; Gini 0.400; CarType {Sports} | {Family, Luxury}: C1 2, 2; C2 1, 5; Gini 0.419

Continuous Attributes: Computing Gini Index
O Use binary decisions based on one value
O Several choices for the splitting value
– Number of possible splitting values = number of distinct values
O Each splitting value has a count matrix associated with it
– Class counts in each of the partitions, A < v and A >= v
O Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.
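The node-impurity and split-quality formulas above can be checked numerically. The following is a minimal sketch (the `gini` / `gini_split` names are illustrative); each child's Gini is weighted by that child's own share of the records, here 7/12 and 5/12:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: sum_i (n_i / n) * GINI(child i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Worked node examples from the notes
print(round(gini([1, 5]), 3))                  # 0.278
print(round(gini([2, 4]), 3))                  # 0.444
# Binary split B? with children N1 = (5, 2) and N2 = (1, 4)
print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371
```

The same `gini_split` works for multi-way splits: pass one count list per child, e.g. the three CarType partitions.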
O For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index
For Taxable Income (sorted values 60, 70, 75, 85, 90, 95, 100, 120, 125, 220 with classes No, No, No, Yes, Yes, Yes, No, No, No, No), the candidate split positions are the midpoints 55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230, with Gini values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420; the best split is Taxable Income <= 97 (Gini = 0.300).

Alternative Splitting Criteria based on INFO
O Entropy at a given node t:
Entropy(t) = – Σj p(j | t) log2 p(j | t)
(NOTE: p(j | t) is the relative frequency of class j at node t).
– Measures the homogeneity of a node.
? Maximum (log nc) when records are equally distributed among all classes, implying least information
? Minimum (0.0) when all records belong to one class, implying most information
– Entropy based computations are similar to the GINI index computations

Examples for computing Entropy
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0
P(C1) = 1/6, P(C2) = 5/6 → Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
P(C1) = 2/6, P(C2) = 4/6 → Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

Splitting Based on INFO...
O Information Gain:
GAINsplit = Entropy(p) – Σ(i=1..k) (ni / n) Entropy(i)
where parent node p is split into k partitions and ni is the number of records in partition i.
– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
– Used in ID3 and C4.5
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

O Gain Ratio:
GainRATIOsplit = GAINsplit / SplitINFO, where SplitINFO = – Σ(i=1..k) (ni / n) log2 (ni / n)
and parent node p is split into k partitions with ni records in partition i.
– Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher entropy partitioning (a large number of small partitions) is penalized!
– Used in C4.5
– Designed to overcome the disadvantage of Information Gain

Splitting Criteria based on Classification Error
O Classification error at a node t:
Error(t) = 1 – maxi P(i | t)
O Measures the misclassification error made by a node.
? Maximum (1 – 1/nc) when records are equally distributed among all classes, implying least interesting information
? Minimum (0.0) when all records belong to one class, implying most interesting information

Examples for Computing Error
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Error = 1 – max(0, 1) = 1 – 1 = 0
P(C1) = 1/6, P(C2) = 5/6 → Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6
P(C1) = 2/6, P(C2) = 4/6 → Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3

Comparison among Splitting Criteria
For a 2-class problem, all three measures peak when the classes are evenly mixed and vanish for a pure node, but they can prefer different splits.

Misclassification Error vs Gini
The parent has C1 7, C2 3 (Gini = 0.42). A test A? gives N1 (C1 3, C2 0) and N2 (C1 4, C2 3):
Gini(N1) = 1 – (3/3)² – (0/3)² = 0
Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342 — Gini improves, while the misclassification error stays at 3/10 before and after.

Tree Induction
O Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
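The entropy, gain, gain-ratio, and classification-error definitions above can also be checked numerically. A minimal sketch (function names illustrative), reproducing the worked node examples:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t); empty classes contribute 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    """GAIN_split = Entropy(p) - sum_i (n_i/n) Entropy(child i)."""
    n = sum(sum(c) for c in children)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def gain_ratio(parent, children):
    """GainRATIO_split = GAIN_split / SplitINFO."""
    n = sum(sum(c) for c in children)
    split_info = -sum(sum(c) / n * math.log2(sum(c) / n) for c in children)
    return info_gain(parent, children) / split_info

def classification_error(counts):
    """Error(t) = 1 - max_i P(i|t)."""
    return 1.0 - max(counts) / sum(counts)

print(round(entropy([1, 5]), 2))               # 0.65
print(round(entropy([2, 4]), 2))               # 0.92
print(round(classification_error([2, 4]), 3))  # 0.333
```

For a perfect binary split of a balanced parent, e.g. `info_gain([3, 3], [[3, 0], [0, 3]])`, the gain is 1.0 bit and the gain ratio is also 1.0, since SplitINFO for two equal halves is 1 bit.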
O Issues – Determine how to split the records ?How to specify the attribute test condition? ?How to determine the best split? – Determine when to stop splitting© Tan,Steinbach, Kumar Introduction to Data Mining 47 Stopping Criteria for Tree Induction O Stop expanding a node when all the records belong to the same class O Stop expanding a node when all the records have similar attribute values O Early termination (to be discussed later) 4/18/2004 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 48 Decision Tree Based Classification O Advantages: – Inexpensive to construct – Extremely fast at classifying unknown records – Easy to interpret for small-sized trees – Accuracy is comparable to other classification techniques for many simple data sets© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 49 Example: C4.5 O Simple depth-first construction. O Uses Information Gain O Sorts Continuous Attributes at each node. O Needs entire data to fit in memory. O Unsuitable for Large Datasets. – Needs out-of-core sorting. O You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 50 Practical Issues of Classification O Underfitting and Overfitting O Missing Values O Costs of Classification© Tan,Steinbach, Kumar Introduction to Data Mining Underfitting and Overfitting (Example) 500 circular and 500 triangular data points. 
Circular points: 0.5 <= sqrt(x1² + x2²) <= 1
Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1

Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large. Overfitting: as the tree grows more complex, training error keeps falling while test error rises.

Overfitting due to Noise
The decision boundary is distorted by noise points.

Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region
– An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

Notes on Overfitting
O Overfitting results in decision trees that are more complex than necessary
O Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
O Need new ways of estimating errors

Estimating Generalization Errors
O Re-substitution errors: error on training (Σ e(t))
O Generalization errors: error on testing (Σ e'(t))
O Methods for estimating generalization errors:
– Optimistic approach: e'(t) = e(t)
– Pessimistic approach:
? For each leaf node: e'(t) = e(t) + 0.5
? Total errors: e'(T) = e(T) + N × 0.5 (N: number of leaf nodes)
? For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): Training error = 10/1000 = 1%; Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
– Reduced error pruning (REP):
? uses a validation data set to estimate generalization error

Occam's Razor
O Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
O For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
O Therefore, one should include model complexity when evaluating a model

Minimum Description Length (MDL)
O Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
O Cost(Data | Model) encodes the misclassification errors.
O Cost(Model) uses node encoding (number of children) plus splitting condition encoding.

How to Address Overfitting
O Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
? Stop if all instances belong to the same class
? Stop if all the attribute values are the same
– More restrictive conditions:
? Stop if the number of instances is less than some user-specified threshold
? Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
? Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).

How to Address Overfitting...
O Post-pruning
– Grow the decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If generalization error improves after trimming, replace the sub-tree by a leaf node.
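The pessimistic estimate above is simple enough to state as a one-line helper; this is a sketch under the 0.5-error-per-leaf penalty used in the notes, and the function name is illustrative:

```python
def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    """e'(T) = (e(T) + N * penalty) / n_instances, with N = number of leaves."""
    return (train_errors + n_leaves * penalty) / n_instances

# Example from the notes: 30 leaf nodes, 10 training errors, 1000 instances
print(pessimistic_error(10, 30, 1000))  # 0.025, i.e. 2.5%
```

The same helper supports pruning decisions: compare the pessimistic error of a sub-tree (many leaves) with that of the single leaf that would replace it.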
– The class label of the leaf node is determined from the majority class of instances in the sub-tree
– Can use MDL for post-pruning

Example of Post-Pruning
A node with Class = Yes: 20, Class = No: 10 (Error = 10/30) is split by A? into four children A1–A4 with counts (Yes 8, No 4), (Yes 3, No 4), (Yes 4, No 1), (Yes 5, No 1).
Training error (before splitting) = 10/30; Pessimistic error = (10 + 0.5)/30 = 10.5/30
Training error (after splitting) = 9/30; Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30 → PRUNE!

Examples of Post-pruning
Case 1: C0: 11, C1: 3 and C0: 2, C1: 4. Case 2: C0: 14, C1: 3 and C0: 2, C1: 2.
– Optimistic error? Don't prune for both cases
– Pessimistic error? Don't prune case 1, prune case 2
– Reduced error pruning? Depends on the validation set

Handling Missing Attribute Values
O Missing values affect decision tree construction in three different ways:
– Affects how impurity measures are computed
– Affects how to distribute an instance with a missing value to child nodes
– Affects how a test instance with a missing value is classified

Computing Impurity Measure
Training data (record 10 has a missing Refund value):
Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   ?       Single          90K             Yes

Class counts: Refund=Yes → 0 Yes, 3 No; Refund=No → 2 Yes, 4 No; Refund=? → 1 Yes, 0 No
Before splitting: Entropy(Parent) = –0.3 log(0.3) – 0.7 log(0.7) = 0.8813
Split on Refund:
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = –(2/6) log(2/6) – (4/6) log(4/6) = 0.9183
Entropy(Children) = 0.3 × (0) + 0.6 × (0.9183) = 0.551
Gain = 0.9 × (0.8813 – 0.551) = 0.9 × 0.3303 = 0.2973

Distribute Instances
For record 10 (Refund = ?): based on the nine records with known Refund, the probability that Refund=Yes is 3/9 and the probability that Refund=No is 6/9. Assign the record to the left child (Refund=Yes) with weight 3/9 and to the right child (Refund=No) with weight 6/9, giving child counts Class=Yes 0 + 3/9, Class=No 3 and Class=Yes 2 + 6/9, Class=No 4.

Classify Instances
New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?
Weighted counts at the MarSt node:
           Married  Single  Divorced  Total
Class=Yes  6/9      1       1         2.67
Class=No   3        1       0         4
Total      3.67     2       1         6.67
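The missing-value gain computation above can be reproduced in a few lines. A minimal sketch: the children's entropies are weighted out of all 10 records (0.3 and 0.6), and the gain is scaled by 0.9, the fraction of records whose Refund value is known.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

parent = entropy([3, 7])             # all 10 records: 3 Yes, 7 No
e_yes = entropy([0, 3])              # Refund = Yes child
e_no = entropy([2, 4])               # Refund = No child
children = 0.3 * e_yes + 0.6 * e_no  # child weights out of 10 records
gain = 0.9 * (parent - children)     # scaled by fraction with known Refund
print(round(parent, 4), round(children, 3), round(gain, 4))  # 0.8813 0.551 0.2973
```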
The probability that Marital Status = Married is 3.67/6.67, and the probability that Marital Status = {Single, Divorced} is 3/6.67; the new record is sent down the branches with these weights.

Other Issues
O Data Fragmentation
O Search Strategy
O Expressiveness
O Tree Replication

Data Fragmentation
O The number of instances gets smaller as you traverse down the tree
O The number of instances at the leaf nodes could be too small to make any statistically significant decision

Search Strategy
O Finding an optimal decision tree is NP-hard
O The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution
O Other strategies?
– Bottom-up
– Bi-directional

Expressiveness
O A decision tree provides an expressive representation for learning discrete-valued functions
– But decision trees do not generalize well to certain types of Boolean functions
? Example: parity function:
– Class = 1 if there is an even number of Boolean attributes with truth value = True
– Class = 0 if there is an odd number of Boolean attributes with truth value = True
? For accurate modeling, must have a complete tree
O Not expressive enough for modeling continuous variables
– Particularly when the test condition involves only a single attribute at a time

Decision Boundary
(Figure: points in the unit square partitioned by the tests x < 0.43?, then y < 0.47? and y < 0.33?)
• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time

Oblique Decision Trees
(Figure: the single test x + y < 1 separates Class = + from Class = –)
• A test condition may involve multiple attributes
• More expressive representation
• Finding an optimal test condition is computationally expensive

Tree Replication
(Figure: the same subtree appears in multiple branches of the tree)

Model Evaluation
O Metrics for Performance Evaluation
– How to evaluate the performance of a model?
O Methods for Performance Evaluation
– How to obtain reliable estimates?
O Methods for Model Comparison
– How to compare the relative performance among competing models?

Metrics for Performance Evaluation
O Focus on the predictive capability of a model
– Rather than how fast it takes to classify or build models, scalability, etc.
O Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL  Class=Yes    a (TP)      b (FN)
CLASS   Class=No     c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation...
O Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Limitation of Accuracy
O Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
O If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example

Cost Matrix
C(i | j): cost of misclassifying a class j example as class i

                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL  Class=Yes    C(Yes|Yes)   C(No|Yes)
CLASS   Class=No     C(Yes|No)    C(No|No)

Computing Cost of Classification
Cost matrix: C(+|+) = –1, C(–|+) = 100, C(+|–) = 1, C(–|–) = 0

Model M1 (rows: actual, columns: predicted):
     +     –
+    150   40
–    60    250
Accuracy = 80%, Cost = 3910

Model M2:
     +     –
+    250   45
–    5     200
Accuracy = 90%, Cost = 4255

Cost vs Accuracy
With counts a, b, c, d as above and N = a + b + c + d: Accuracy = (a + d) / N.
Accuracy is proportional to cost if C(Yes|No) = C(No|Yes) = q and C(Yes|Yes) = C(No|No) = p, since
Cost = p (a + d) + q (b + c) = p (a + d) + q (N – a – d) = q N – (q – p)(a + d) = N [q – (q – p) × Accuracy]

Cost-Sensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2 r p / (r + p) = 2a / (2a + b + c)
O Precision is biased towards C(Yes|Yes) & C(Yes|No)
O Recall is biased towards C(Yes|Yes) & C(No|Yes)
O F-measure is biased towards all except C(No|No)
Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)

Methods for Performance Evaluation
O How to obtain a reliable estimate of performance?
O Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets

Learning Curve
O A learning curve shows how accuracy changes with varying sample size
O Requires a sampling schedule for creating the learning curve:
– Arithmetic sampling (Langley et al.)
– Geometric sampling (Provost et al.)
Effect of small sample size:
– Bias in the estimate
– Variance of the estimate

Methods of Estimation
O Holdout
– Reserve 2/3 for training and 1/3 for testing
O Random subsampling
– Repeated holdout
O Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k–1 partitions, test on the remaining one
– Leave-one-out: k = n
O Stratified sampling
– oversampling vs undersampling
O Bootstrap
– Sampling with replacement

Model Evaluation
O Metrics for Performance Evaluation
– How to evaluate the performance of a model?
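The confusion-matrix metrics and the cost computation above can be checked against the M1 example. A minimal sketch (function names illustrative), using the a = TP, b = FN, c = FP, d = TN convention from the notes:

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F-measure from a confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)                  # a / (a + c)
    recall = tp / (tp + fn)                     # a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure

def total_cost(tp, fn, fp, tn, c_tp, c_fn, c_fp, c_tn):
    """Total cost = counts weighted by the cost matrix C(i|j)."""
    return tp * c_tp + fn * c_fn + fp * c_fp + tn * c_tn

# Model M1 from the notes, cost matrix C(+|+)=-1, C(-|+)=100, C(+|-)=1, C(-|-)=0
acc, p, r, f = metrics(150, 40, 60, 250)
print(round(acc, 2))                                # 0.8
print(total_cost(150, 40, 60, 250, -1, 100, 1, 0))  # 3910
```

Running the same two calls on M2's counts (250, 45, 5, 200) reproduces the other pair: higher accuracy (90%) but higher cost (4255), which is the point of the example.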
O Methods for Performance Evaluation
– How to obtain reliable estimates?
O Methods for Model Comparison
– How to compare the relative performance among competing models?

ROC (Receiver Operating Characteristic)
O Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms
O The ROC curve plots TP rate (on the y-axis) against FP rate (on the x-axis)
O The performance of each classifier is represented as a point on the ROC curve
– Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point

ROC Curve
– 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive
At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88

ROC Curve (TP, FP):
O (0,0): declare everything to be negative class
O (1,1): declare everything to be positive class
O (1,0): ideal
O Diagonal line:
– Random guessing
– Below the diagonal line:
? prediction is opposite of the true class

Using ROC for Model Comparison
O No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR
O Area Under the ROC curve
– Ideal: Area = 1
– Random guess: Area = 0.5

How to Construct an ROC curve
• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP+TN)

Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    –
4         0.85    –
5         0.85    –
6         0.85    +
7         0.76    –
8         0.53    +
9         0.43    –
10        0.25    +

Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP            5     4     4     3     3     3     3     2     2     1     0
FP            5     5     4     4     3     2     1     1     0     0     0
TN            0     0     1     1     2     3     4     4     5     5     5
FN            0     1     1     2     2     2     2     3     3     4     5
TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

Test of Significance
O Given two models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances
O Can we say M1 is better than M2?
– How much confidence can we place on the accuracy of M1 and M2?
– Can the difference in performance be explained as a result of random fluctuations in the test set?

Confidence Interval for Accuracy
O Prediction can be regarded as a Bernoulli trial
– A Bernoulli trial has 2 possible outcomes
– Possible outcomes for prediction: correct or wrong
– A collection of Bernoulli trials has a Binomial distribution:
? x ~ Bin(N, p), where x is the number of correct predictions
? e.g.: Toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N × p = 50 × 0.5 = 25
O Given x (# of correct predictions) or, equivalently, acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?
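The construction recipe above can be sketched in a few lines. This simplified version steps through the sorted instances one at a time rather than grouping tied scores into a single threshold as the table does, so it emits one point per instance plus the all-negative starting point; the `roc_points` name and the toy score list are illustrative.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from sorted scores, one point per instance processed."""
    pos = sum(1 for y in labels if y == "+")
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]                 # threshold above every score
    tp = fp = 0
    for _, y in pairs:                    # lowering the threshold one step
        if y == "+":
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# The 10-instance example from the notes (5 positives, 5 negatives)
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
pts = roc_points(scores, labels)
print(pts[0], pts[-1])   # (0.0, 0.0) (1.0, 1.0)
```

The endpoints match the table: declaring everything negative gives (0, 0) and declaring everything positive gives (1, 1).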
Confidence Interval for Accuracy
O For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1 - p)/N:

  P( -Z_{α/2} ≤ (acc - p) / sqrt(p(1 - p)/N) ≤ Z_{1-α/2} ) = 1 - α

O Solving for p gives the confidence interval:

  p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2} · sqrt( Z²_{α/2} + 4·N·acc - 4·N·acc² ) ) / ( 2·(N + Z²_{α/2}) )

Confidence Interval for Accuracy (example)
O Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  – N = 100, acc = 0.8
  – Let 1 - α = 0.95 (95% confidence)
  – From the probability table, Z_{α/2} = 1.96:

  1-α:  0.90  0.95  0.98  0.99
  Z:    1.65  1.96  2.33  2.58

O The interval narrows as N grows (acc = 0.8, 95% confidence):

  N:         50     100    500    1000   5000
  p(lower):  0.670  0.711  0.763  0.774  0.789
  p(upper):  0.888  0.866  0.833  0.824  0.811

Comparing Performance of 2 Models
O Given two models, say M1 and M2, which is better?
  – M1 is tested on D1 (size = n1), with error rate e1
  – M2 is tested on D2 (size = n2), with error rate e2
  – Assume D1 and D2 are independent
  – If n1 and n2 are sufficiently large, then approximately:

      e1 ~ N(μ1, σ1),  e2 ~ N(μ2, σ2),  with  σ̂_i² = e_i(1 - e_i) / n_i

Comparing Performance of 2 Models
O To test whether the performance difference is statistically significant, consider d = e1 - e2
  – d ~ N(d_t, σ_t), where d_t is the true difference
  – Since D1 and D2 are independent, their variances add up:

      σ_t² = σ1² + σ2² ≈ σ̂1² + σ̂2² = e1(1 - e1)/n1 + e2(1 - e2)/n2

  – At the (1 - α) confidence level:
  d_t = d ± Z_{α/2} · σ̂_t

An Illustrative Example
O Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
O d = |e2 - e1| = 0.1 (2-sided test)

  σ̂_d² = 0.15·(1 - 0.15)/30 + 0.25·(1 - 0.25)/5000 = 0.0043

O At the 95% confidence level, Z_{α/2} = 1.96:

  d_t = 0.100 ± 1.96 · sqrt(0.0043) = 0.100 ± 0.128

O The interval contains 0, so the difference may not be statistically significant

Comparing Performance of 2 Algorithms
O Each learning algorithm may produce k models:
  – L1 may produce M11, M12, …, M1k
  – L2 may produce M21, M22, …, M2k
O If the models are generated on the same test sets D1, D2, …, Dk (e.g., via cross-validation):
  – For each set, compute d_j = e1j - e2j
  – d_j has mean d_t and variance σ_t²
  – Estimate:

      σ̂_t² = Σ_{j=1}^{k} (d_j - d̄)² / (k(k - 1))
      d_t = d̄ ± t_{1-α, k-1} · σ̂_t

Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4, Introduction to Data Mining, by Tan, Steinbach, Kumar

Classification: Definition
O Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
O Find a model for the class attribute as a function of the values of the other attributes.
O Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model.
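The confidence-interval formula and the two-model significance test from the preceding slides are easy to check numerically. A sketch, with the Z value hard-coded from the slide's table rather than computed from a distribution:

```python
from math import sqrt

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, using the
    closed-form solution of the normal approximation from the slides."""
    center = 2 * n * acc + z ** 2
    spread = z * sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Interval for the true difference d_t between two error rates
    measured on independent test sets."""
    d = abs(e2 - e1)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    return d - z * sqrt(var), d + z * sqrt(var)

lo, hi = accuracy_interval(0.8, 100)                   # the N=100 example
dlo, dhi = difference_interval(0.15, 30, 0.25, 5000)   # the M1-vs-M2 example
```

Reproducing the slides: `(lo, hi)` matches the tabulated (0.711, 0.866), and the M1-vs-M2 interval straddles 0, so the difference may not be significant.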
Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

Illustrating Classification Task
O A learning algorithm induces a model from the training set (induction); the model is then applied to the test set (deduction).

  Training set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

  Test set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Examples of Classification Task
O Predicting tumor cells as benign or malignant
O Classifying credit card transactions as legitimate or fraudulent
O Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
O Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
O Decision Tree based methods
O Rule-based methods
O Memory based reasoning
O Neural Networks
O Naïve Bayes and Bayesian Belief Networks
O Support Vector Machines

Example of a Decision Tree
O Training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Cheat):

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

O Model (decision tree): split on Refund at the root (Yes -> NO); for Refund = No, split on MarSt (Married -> NO); for Single or Divorced, split on TaxInc (< 80K -> NO, > 80K -> YES).

Another Example of Decision Tree
O Using the same 10-record training data, an alternative tree splits on MarSt first (Married -> NO); for Single or Divorced, split on Refund (Yes -> NO); for Refund = No, split on TaxInc (< 80K -> NO, > 80K -> YES).
O There could be more than one tree that fits the same data!

Decision Tree Classification Task
O Same induction/deduction picture as before: a tree induction algorithm learns a decision tree from the training set, and the tree is then applied to the test set.

Apply Model to Test Data
O Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
O Start from the root of the tree and follow the branch that matches the record at each node:
  – Refund = No, so take the "No" branch to the MarSt node
  – Marital Status = Married, so take the "Married" branch, which leads to the leaf NO
O The record reaches the leaf NO, so assign Cheat to "No".

Decision Tree Classification Task
(Recap: a tree induction algorithm learns a decision tree from the training set; the tree is applied to the test set.)

Decision Tree Induction
O Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT

General Structure of Hunt's Algorithm
O Let Dt be the set of training records that reach a node t
O General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled with the default class yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
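Hunt's general procedure above can be sketched in a few lines. This is a minimal, categorical-attributes-only illustration, not the full algorithm: the attribute to test is taken in a fixed order instead of being chosen by an impurity measure, and the toy records are a hypothetical miniature of the Refund/Marital-Status table:

```python
from collections import Counter

def hunt(records, attrs, default="No"):
    """Grow a tree from (features_dict, label) pairs; returns either a
    class label (leaf) or a nested dict {attr, children}."""
    if not records:                       # empty Dt: leaf with default class
        return default
    labels = [y for _, y in records]
    if len(set(labels)) == 1:             # pure Dt: leaf labeled y_t
        return labels[0]
    if not attrs:                         # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]
    attr, majority = attrs[0], Counter(labels).most_common(1)[0][0]
    values = {x[attr] for x, _ in records}
    return {"attr": attr,
            "children": {v: hunt([(x, y) for x, y in records if x[attr] == v],
                                 attrs[1:], default=majority)
                         for v in values}}

def classify(tree, x):
    """Walk from the root, following the branch matching each attribute
    value, until a leaf label is reached (unseen values would raise)."""
    while isinstance(tree, dict):
        tree = tree["children"][x[tree["attr"]]]
    return tree

data = [({"Refund": "Yes", "MarSt": "Single"},  "No"),
        ({"Refund": "No",  "MarSt": "Married"}, "No"),
        ({"Refund": "No",  "MarSt": "Single"},  "Yes")]
tree = hunt(data, ["Refund", "MarSt"])
```

The resulting tree splits on Refund at the root, then on MarSt in the Refund = No branch, mirroring the recursive case / pure case / empty case of the procedure.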
Hunt's Algorithm (on the 10-record Refund / Marital Status / Taxable Income data)
O Step 1: a single leaf predicting the majority class, Don't Cheat
O Step 2: split on Refund. Yes -> Don't Cheat; No -> still mixed
O Step 3: in the Refund = No branch, split on Marital Status. Married -> Don't Cheat; Single, Divorced -> still mixed
O Step 4: in the Single, Divorced branch, split on Taxable Income. < 80K -> Don't Cheat; >= 80K -> Cheat

Tree Induction
O Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
O Issues:
  – Determine how to split the records
    ‹ How to specify the attribute test condition?
    ‹ How to determine the best split?
  – Determine when to stop splitting

How to Specify the Test Condition?
O Depends on the attribute type: nominal, ordinal, continuous
O Depends on the number of ways to split: 2-way split, multi-way split

Splitting Based on Nominal Attributes
O Multi-way split: use as many partitions as distinct values.
O Binary split: divides the values into two subsets. Need to find the optimal partitioning.
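Finding the optimal binary partitioning means searching over all two-subset groupings of the attribute's values; for k distinct values there are 2^(k-1) - 1 non-trivial binary splits. A small sketch of the enumeration:

```python
from itertools import combinations

def binary_splits(values):
    """Enumerate every way to divide a set of attribute values into two
    non-empty subsets, listing each unordered pair exactly once."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]   # fix one value to avoid mirrored pairs
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(values) - left
            if right:                      # skip the trivial all-vs-nothing split
                splits.append((left, right))
    return splits

splits = binary_splits({"Family", "Sports", "Luxury"})
```

For the three CarType values this yields exactly the 3 binary splits the slides draw ({Family} vs {Sports, Luxury}, and so on); the count grows exponentially in k, which is why CART-style learners put effort into this search.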
  Example (CarType):
  – Multi-way: {Family} | {Sports} | {Luxury}
  – Binary: {Family, Luxury} | {Sports}, or {Sports, Luxury} | {Family}

Splitting Based on Ordinal Attributes
O Multi-way split: use as many partitions as distinct values, e.g. Size: {Small} | {Medium} | {Large}
O Binary split: divides the values into two subsets that preserve the order, e.g. {Small, Medium} | {Large} or {Small} | {Medium, Large}
O What about the split {Small, Large} | {Medium}? (It violates the ordering.)

Splitting Based on Continuous Attributes
O Different ways of handling:
  – Discretization to form an ordinal categorical attribute
    ‹ Static: discretize once at the beginning
    ‹ Dynamic: ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    ‹ Consider all possible splits and find the best cut
    ‹ Can be more compute-intensive
O Examples: binary split "Taxable Income > 80K?" vs. multi-way split "< 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K"

How to Determine the Best Split
O Before splitting: 10 records of class 0, 10 records of class 1
O Candidate tests: Own Car? (two mixed children), Car Type? (one nearly pure child), Student ID? (every child pure but containing a single record)
O Which test condition is the best?
How to Determine the Best Split
O Greedy approach: nodes with a homogeneous class distribution are preferred
O Need a measure of node impurity:
  – C0: 5, C1: 5 (non-homogeneous, high degree of impurity)
  – C0: 9, C1: 1 (homogeneous, low degree of impurity)

Measures of Node Impurity
O Gini Index
O Entropy
O Misclassification error

How to Find the Best Split
O Compute the impurity M0 of the node before splitting
O For each candidate test (say A? with children N1, N2, and B? with children N3, N4), compute the weighted impurity of the children: M12 for A, M34 for B
O Gain = M0 - M12 vs. M0 - M34; choose the test with the higher gain

Measure of Impurity: GINI
O Gini index for a given node t:

  GINI(t) = 1 - Σ_j [p(j | t)]²

  (p(j | t) is the relative frequency of class j at node t)
  – Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information
  – Minimum (0.0) when all records belong to one class, implying the most interesting information

Examples for computing GINI:
  – C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1; Gini = 1 - 0² - 1² = 0.000
  – C1 = 1, C2 = 5: Gini = 1 - (1/6)² - (5/6)² = 0.278
  – C1 = 2, C2 = 4: Gini = 1 - (2/6)² - (4/6)² = 0.444
  – C1 = 3, C2 = 3: Gini = 1 - (3/6)² - (3/6)² = 0.500

Splitting Based on GINI
O Used in CART, SLIQ, SPRINT
O When a node p is split into k partitions (children), the quality of the split is computed as the weighted average of the children's Gini values, with n_i records at child i out of n records at node p.
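The Gini computations above are easy to verify in code; a small sketch covering both a single node and a weighted multi-way split:

```python
def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: children is a list of per-class count
    lists, each weighted by its share of the parent's records."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Multi-way CarType split from the slides:
# Family (C1=1, C2=4), Sports (C1=2, C2=1), Luxury (C1=1, C2=1)
car_type = gini_split([[1, 4], [2, 1], [1, 1]])
```

The single-node values reproduce the slide's examples (0.000, 0.278, 0.444, 0.500), and the CarType multi-way split comes out at 0.393 as on the categorical-attributes slide.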
  GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

Binary Attributes: Computing the GINI Index
O The split creates two partitions; the weighting favors larger and purer partitions.
O Example: the parent has C1 = 6, C2 = 6 (Gini = 0.500); split B? gives N1 with (C1 = 5, C2 = 2) and N2 with (C1 = 1, C2 = 4):
  – Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
  – Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
  – Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing the Gini Index
O For each distinct value, gather the counts for each class in the dataset, and use the count matrix to make decisions.
O Multi-way split (CarType):

  CarType:  Family  Sports  Luxury
  C1:       1       2       1
  C2:       4       1       1
  Gini = 0.393

O Two-way splits (find the best partition of values):

  CarType:  {Sports, Luxury} | {Family}       C1: 3, 1   C2: 2, 4   Gini = 0.400
  CarType:  {Sports} | {Family, Luxury}       C1: 2, 2   C2: 1, 5   Gini = 0.419

Continuous Attributes: Computing the Gini Index
O Use binary decisions based on one value v
O Several choices for the splitting value: the number of possible splitting values equals the number of distinct values
O Each splitting value v has a count matrix associated with it: class counts in each of the partitions A < v and A ≥ v
O Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient: repeats work (illustrated on the Taxable Income attribute of the 10-record training data).
O For efficient computation, for each continuous attribute:
  – Sort the attribute on its values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index
O Example (Cheat vs. sorted Taxable Income):

  Taxable Income:  60  70  75  85   90   95   100  120  125  220
  Cheat:           No  No  No  Yes  Yes  Yes  No   No   No   No

  Candidate split positions (midpoints): 55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230
  Gini at each: 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420

  The best split is at 97 (Gini = 0.300).

Alternative Splitting Criteria Based on INFO
O Entropy at a given node t:

  Entropy(t) = - Σ_j p(j | t) log₂ p(j | t)

  (p(j | t) is the relative frequency of class j at node t)
  – Measures the homogeneity of a node:
    ‹ Maximum (log nc) when records are equally distributed among all classes, implying the least information
    ‹ Minimum (0.0) when all records belong to one class, implying the most information
  – Entropy-based computations are similar to the GINI index computations

Examples for computing Entropy:
  – C1 = 0, C2 = 6: Entropy = -0·log 0 - 1·log 1 = 0
  – C1 = 1, C2 = 5: Entropy = -(1/6) log₂(1/6) - (5/6) log₂(5/6) = 0.65
  – C1 = 2, C2 = 4: Entropy = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.92

Splitting Based on INFO…
O Information Gain: when a parent node p is split into k partitions, with n_i records in partition i:
  – Measures the reduction in entropy achieved by the split; choose the split that achieves the most reduction (maximizes GAIN)
  – Used in ID3 and C4.5
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each small but pure.
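The midpoint-scan procedure above can be sketched directly for the Taxable Income example. For clarity this sketch recomputes the class counts at every candidate rather than updating them incrementally as the slide recommends:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Scan midpoints between consecutive sorted values; return the
    (threshold, weighted Gini) pair with the least Gini."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        v = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate midpoint
        left = [y for x, y in pairs if x <= v]
        right = [y for x, y in pairs if x > v]
        g = (len(left) * gini([left.count(c) for c in classes])
             + len(right) * gini([right.count(c) for c in classes])) / len(pairs)
        if g < best[1]:
            best = (v, g)
    return best

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
threshold, g = best_split(incomes, cheat)
```

The scan lands on the cut between 95 and 100 (midpoint 97.5, shown as 97 on the slide) with weighted Gini 0.300, matching the table.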
  GAIN_split = Entropy(p) - Σ_{i=1}^{k} (n_i / n) · Entropy(i)   (the Information Gain of the split)

Splitting Based on INFO…
O Gain Ratio: when a parent node p is split into k partitions, with n_i records in partition i:

  GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = - Σ_{i=1}^{k} (n_i / n) log(n_i / n)

  – Adjusts Information Gain by the entropy of the partitioning (SplitINFO): higher-entropy partitioning (a large number of small partitions) is penalized!
  – Used in C4.5
  – Designed to overcome the disadvantage of Information Gain

Splitting Criteria Based on Classification Error
O Classification error at a node t:

  Error(t) = 1 - max_i P(i | t)

O Measures the misclassification error made by a node:
  ‹ Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information
  ‹ Minimum (0.0) when all records belong to one class, implying the most interesting information

Examples for computing Error:
  – C1 = 0, C2 = 6: Error = 1 - max(0, 1) = 0
  – C1 = 1, C2 = 5: Error = 1 - max(1/6, 5/6) = 1/6
  – C1 = 2, C2 = 4: Error = 1 - max(2/6, 4/6) = 1/3

Comparison Among Splitting Criteria
O For a 2-class problem, all three measures peak at a 50/50 class distribution and vanish at a pure node.

Misclassification Error vs. Gini
O Example: the parent has C1 = 7, C2 = 3 (Gini = 0.42); split A? gives N1 with (C1 = 3, C2 = 0) and N2 with (C1 = 4, C2 = 3):
  – Gini(N1) = 1 - (3/3)² - (0/3)² = 0
  – Gini(N2) = 1 - (4/7)² - (3/7)² = 0.489
  – Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342, so Gini improves!

Tree Induction
O Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
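The entropy and classification-error examples above can be checked with two small functions (a sketch mirroring the slides' numbers, with 0·log 0 taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from per-class counts, in bits."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def error(counts):
    """Misclassification error of a node from per-class counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n
```

These reproduce the worked examples: entropy 0, 0.65, 0.92 and error 0, 1/6, 1/3 for the (0, 6), (1, 5), (2, 4) nodes.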
O Issues:
  – Determine how to split the records
    ‹ How to specify the attribute test condition?
    ‹ How to determine the best split?
  – Determine when to stop splitting

Stopping Criteria for Tree Induction
O Stop expanding a node when all the records belong to the same class
O Stop expanding a node when all the records have similar attribute values
O Early termination (to be discussed later)

Decision Tree Based Classification
O Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy comparable to other classification techniques for many simple data sets

Example: C4.5
O Simple depth-first construction
O Uses Information Gain
O Sorts continuous attributes at each node
O Needs the entire data set to fit in memory; unsuitable for large datasets (would need out-of-core sorting)
O The software can be downloaded from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz

Practical Issues of Classification
O Underfitting and Overfitting
O Missing Values
O Costs of Classification

Underfitting and Overfitting (Example)
O 500 circular and 500 triangular data points.
  Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
  Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5

Underfitting and Overfitting
O Underfitting: when the model is too simple, both training and test errors are large
O Overfitting: as the tree grows more complex, training error keeps falling while test error rises

Overfitting Due to Noise
O The decision boundary is distorted by a noise point

Overfitting Due to Insufficient Examples
O A lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region
  – An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

Notes on Overfitting
O Overfitting results in decision trees that are more complex than necessary
O Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
O We need new ways of estimating errors

Estimating Generalization Errors
O Re-substitution errors: error on training (Σ e(t))
O Generalization errors: error on testing (Σ e'(t))
O Methods for estimating generalization errors:
  – Optimistic approach: e'(t) = e(t)
  – Pessimistic approach:
    ‹ For each leaf node: e'(t) = e(t) + 0.5
    ‹ Total errors: e'(T) = e(T) + N × 0.5 (N: number of leaf nodes)
    ‹ For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 × 0.5)/1000 = 2.5%
  – Reduced error pruning (REP):
    ‹ Uses a validation data set to estimate the generalization error

Occam's Razor
O Given two models with similar generalization errors, one should prefer the
simpler model over the more complex model
O For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
O Therefore, one should include model complexity when evaluating a model

Minimum Description Length (MDL)
O Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
  – Cost is the number of bits needed for the encoding; search for the least costly model
O Cost(Data | Model) encodes the misclassification errors
O Cost(Model) uses node encoding (number of children) plus splitting-condition encoding

How to Address Overfitting
O Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node:
    ‹ Stop if all instances belong to the same class
    ‹ Stop if all the attribute values are the same
  – More restrictive conditions:
    ‹ Stop if the number of instances is less than some user-specified threshold
    ‹ Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test)
    ‹ Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

How to Address Overfitting…
O Post-pruning
  – Grow the decision tree to its entirety
  – Trim the nodes of the decision tree in a bottom-up fashion
  – If the generalization error improves after trimming, replace the sub-tree by a leaf node
  – The class label of the leaf node is determined from the majority class of instances in the sub-tree
  – MDL can be used for post-pruning

Example of Post-Pruning
O Consider a node with 20 instances of Class = Yes and 10 of Class = No, split four ways (A1-A4):
O Before splitting (Class = Yes: 20, Class = No: 10):
  – Training error = 10/30
  – Pessimistic error = (10 + 0.5)/30 = 10.5/30
O After splitting into A1 (8 Yes, 4 No), A2 (3 Yes, 4 No), A3 (4 Yes, 1 No), A4 (5 Yes, 1 No):
  – Training error = 9/30
  – Pessimistic error = (9 + 4 × 0.5)/30 = 11/30
O The pessimistic error increases after splitting, so PRUNE!

Examples of Post-Pruning
O Case 1: C0: 11, C1: 3 at the node, vs. C0: 2, C1: 4 in a sub-tree. Case 2: C0: 14, C1: 3 at the node, vs. C0: 2, C1: 2 in a sub-tree.
  – Optimistic error? Don't prune in either case
  – Pessimistic error? Don't prune case 1, prune case 2
  – Reduced error pruning? Depends on the validation set

Handling Missing Attribute Values
O Missing values affect decision tree construction in three different ways:
  – They affect how impurity measures are computed
  – They affect how an instance with missing values is distributed to child nodes
  – They affect how a test instance with missing values is classified

Computing the Impurity Measure
O Training data as before, except that record 10 has a missing Refund value (Tid 10: Refund = ?, Single, 90K, Class = Yes)
O Class counts by Refund value:

  Refund = Yes:  Class = Yes: 0, Class = No: 3
  Refund = No:   Class = Yes: 2, Class = No: 4
  Refund = ?:
Class = Yes: 1, Class = No: 0 (the missing-value record)

O Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813
O Split on Refund, computed on the 9 records with known Refund and weighted by their fraction of the data:
  – Entropy(Refund = Yes) = 0
  – Entropy(Refund = No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
  – Entropy(Children) = 0.3 × (0) + 0.6 × (0.9183) = 0.551
  – Gain = 0.9 × (0.8813 - 0.551) = 0.297

Distribute Instances
O The 9 records with known Refund go to their branches: Refund = Yes gets (Class = Yes: 0, Class = No: 3); Refund = No gets (Class = Yes: 2, Class = No: 4)
O For record 10 (Refund missing):
  – Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9
  – Assign the record to the left child with weight 3/9 and to the right child with weight 6/9, giving counts (0 + 3/9, 3) and (2 + 6/9, 4)

Classify Instances
O New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?
O Weighted counts at the MarSt node:

  Class        Married  Single  Divorced  Total
  Class = No   3        1       0         4
  Class = Yes  6/9      1       1         2.67
  Total        3.67     2       1         6.67
O Probability that Marital Status = Married is 3.67/6.67; probability that Marital Status ∈ {Single, Divorced} is 3/6.67

Other Issues
O Data Fragmentation
O Search Strategy
O Expressiveness
O Tree Replication

Data Fragmentation
O The number of instances gets smaller as you traverse down the tree
O The number of instances at the leaf nodes could be too small to make any statistically significant decision

Search Strategy
O Finding an optimal decision tree is NP-hard
O The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution
O Other strategies?
  – Bottom-up
  – Bi-directional

Expressiveness
O A decision tree provides an expressive representation for learning discrete-valued functions
  – But trees do not generalize well to certain types of Boolean functions
    ‹ Example: parity function:
      – Class = 1 if there is an even number of Boolean attributes with truth value = True
      – Class = 0 if there is an odd number of Boolean attributes with truth value = True
    ‹ For accurate modeling, a complete tree is required
O Not expressive enough for modeling continuous variables
  – Particularly when the test condition involves only a single attribute at a time

Decision Boundary
O Example tree on the unit square: split on x < 0.43 at the root, then on y < 0.47 and y < 0.33 in the two branches, giving pure leaves such as (4 : 0) and (0 : 4)
O (Figure: the unit square partitioned by the axis-parallel splits x < 0.43, y < 0.47, y < 0.33.)
O The border line between two neighboring regions of different classes is known as the decision boundary
O The decision boundary is parallel to the axes because each test condition involves a single attribute at a time

Oblique Decision Trees
O The test condition may involve multiple attributes, e.g., x + y < 1
O More expressive representation
O Finding the optimal test condition is computationally expensive

Tree Replication
O The same subtree can appear in multiple branches

Model Evaluation
O Metrics for Performance Evaluation
  – How to evaluate the performance of a model?
O Methods for Performance Evaluation
  – How to obtain reliable estimates?
O Methods for Model Comparison
  – How to compare the relative performance among competing models?

Metrics for Performance Evaluation
O Focus on the predictive capability of a model
  – Rather than how fast it classifies or builds models, scalability, etc.
O Confusion Matrix:

                     PREDICTED Class=Yes     PREDICTED Class=No
  ACTUAL Class=Yes   a (TP: true positive)   b (FN: false negative)
  ACTUAL Class=No    c (FP: false positive)  d (TN: true negative)

Metrics for Performance Evaluation…
O Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Limitation of Accuracy
O Consider a 2-class problem:
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
O If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  – Accuracy is misleading because the model does not detect any class 1 example

Cost Matrix
O C(i | j): the cost of misclassifying a class j example as class i

                     PREDICTED Class=Yes  PREDICTED Class=No
  ACTUAL Class=Yes   C(Yes | Yes)         C(No | Yes)
  ACTUAL Class=No    C(Yes | No)          C(No | No)

Computing the Cost of Classification
O Cost matrix: C(+ | +) = -1, C(- | +) = 100, C(+ | -) = 1, C(- | -) = 0
O Model M1 (actual + predicted as +/-: 150, 40; actual - predicted as +/-: 60, 250): Accuracy = 80%, Cost = 3910
O Model M2 (actual +: 250, 45; actual -: 5, 200): Accuracy = 90%, Cost = 4255
O The more accurate model (M2) incurs the higher cost

Cost vs. Accuracy
O With counts a, b, c, d as above, N = a + b + c + d, and Accuracy = (a + d)/N:

  Cost = p(a + d) + q(b + c)
       = p(a + d) + q(N - a - d)
       = qN - (q - p)(a + d)
       = N[q - (q - p) × Accuracy]

O Accuracy is proportional to cost if:
  1. C(Yes | No) = C(No | Yes) = q
  2.
  2. C(Yes|Yes) = C(No|No) = p

Cost-Sensitive Measures

  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

• Precision is biased towards C(Yes|Yes) and C(Yes|No)
• Recall is biased towards C(Yes|Yes) and C(No|Yes)
• F-measure is biased towards all except C(No|No)

  Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)

Methods for Performance Evaluation

• How to obtain a reliable estimate of performance?
• Performance of a model may depend on factors other than the learning algorithm:
  – Class distribution
  – Cost of misclassification
  – Size of training and test sets

Learning Curve

• A learning curve shows how accuracy changes with varying sample size
• Requires a sampling schedule for creating the curve:
  – Arithmetic sampling (Langley et al.)
  – Geometric sampling (Provost et al.)
• Effect of small sample size:
  – Bias in the estimate
  – Variance of the estimate

Methods of Estimation

• Holdout
  – Reserve 2/3 for training and 1/3 for testing
• Random subsampling
  – Repeated holdout
• Cross validation
  – Partition data into k disjoint subsets
  – k-fold: train on k − 1 partitions, test on the remaining one
  – Leave-one-out: k = n
• Stratified sampling
  – oversampling vs undersampling
• Bootstrap
  – Sampling with replacement
ROC (Receiver Operating Characteristic)

• Developed in the 1950s in signal detection theory to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• An ROC curve plots TP rate (y-axis) against FP rate (x-axis)
• The performance of a classifier is represented as a point on the ROC curve
  – changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point

ROC Curve

[Figure: 1-dimensional data set containing two classes (positive and negative); any point located at x > t is classified as positive]

• At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88

ROC Curve

• (TP, FP) reference points:
  – (0, 0): declare everything to be the negative class
  – (1, 1): declare everything to be the positive class
  – (1, 0): ideal
• Diagonal line:
  – Random guessing
  – Below the diagonal line: prediction is the opposite of the true class

Using ROC for Model Comparison

[Figure: ROC curves for models M1 and M2]

• Neither model consistently outperforms the other
  – M1 is better for small FPR
  – M2 is better for large FPR
• Area Under the ROC Curve (AUC)
  – Ideal: Area = 1
  – Random guess: Area = 0.5

How to Construct an ROC Curve

  Instance   P(+|A)   True Class
      1       0.95        +
      2       0.93        +
      3       0.87        −
      4       0.85        −
      5       0.85        −
      6       0.85        +
      7       0.76        −
      8       0.53        +
      9       0.43        −
     10       0.25        +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances in decreasing order of P(+|A)
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
  – TP rate: TPR = TP / (TP + FN)
  – FP rate: FPR = FP / (FP + TN)
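The construction procedure above can be sketched in Python using the slides' 10-instance example (scores and true classes copied from the table):

```python
# Sketch of the ROC-construction procedure from the slides, using the
# 10-instance example: P(+|A) scores with their true classes.

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+",  "+",  "-",  "+",  "-",  "-",  "-",  "+",  "-",  "+"]

P = labels.count("+")  # total positives (5)
N = labels.count("-")  # total negatives (5)

roc = []  # (threshold, TPR, FPR), sweeping thresholds from high to low
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
    roc.append((t, tp / P, fp / N))

for t, tpr, fpr in roc:
    print(f"threshold >= {t}: TPR={tpr}, FPR={fpr}")
```

At threshold 0.85 this yields (TPR, FPR) = (0.6, 0.6), matching the table in the deck; adding a degenerate threshold above every score would contribute the (0, 0) corner of the curve.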
How to Construct an ROC Curve (cont.)

  Threshold ≥  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  Class         +     −     +     −     −     −     +     −     +     +
  TP            5     4     4     3     3     3     3     2     2     1     0
  FP            5     5     4     4     3     2     1     1     0     0     0
  TN            0     0     1     1     2     3     4     4     5     5     5
  FN            0     1     1     2     2     2     2     3     3     4     5
  TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

[Figure: the resulting ROC curve]

Test of Significance

• Given two models:
  – Model M1: accuracy = 85%, tested on 30 instances
  – Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
  – How much confidence can we place on the accuracies of M1 and M2?
  – Can the difference in performance be explained as the result of random fluctuations in the test set?

Confidence Interval for Accuracy

• A prediction can be regarded as a Bernoulli trial
  – A Bernoulli trial has 2 possible outcomes
  – Possible outcomes for a prediction: correct or wrong
  – A collection of Bernoulli trials has a binomial distribution:
    x ∼ Bin(N, p), where x is the number of correct predictions
  – e.g., toss a fair coin 50 times; how many heads would turn up?
    Expected number of heads = N × p = 50 × 0.5 = 25
• Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we predict p, the true accuracy of the model?
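The question above can be answered with the interval derived on the following slide (solving the normal approximation of the binomial for p). A minimal sketch, with z = 1.96 for 95% confidence:

```python
from math import sqrt

def accuracy_confidence_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, given observed
    accuracy acc on n test instances (z = 1.96 for 95% confidence).
    Solves the normal approximation of the binomial for p."""
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

# Slide example: acc = 0.8 measured on N = 100 test instances
lo, hi = accuracy_confidence_interval(0.8, 100)
print(lo, hi)  # roughly 0.711 and 0.867 (the slide's table lists 0.711-0.866)
```

Increasing N tightens the interval, which is exactly the trend the deck's table shows as N grows from 50 to 5000.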
Confidence Interval for Accuracy

• For large test sets (N > 30),
  – acc has a normal distribution with mean p and variance p(1 − p)/N

    P( Z_{α/2} < (acc − p) / √(p(1 − p)/N) < Z_{1−α/2} ) = 1 − α

• Solving for p gives the confidence interval:

    p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2} · √( Z²_{α/2} + 4·N·acc − 4·N·acc² ) ) / ( 2(N + Z²_{α/2}) )

Confidence Interval for Accuracy (example)

• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  – N = 100, acc = 0.8
  – Let 1 − α = 0.95 (95% confidence)
  – From the probability table, Z_{α/2} = 1.96

  1 − α:  0.99   0.98   0.95   0.90
  Z:      2.58   2.33   1.96   1.65

  N:          50     100    500    1000   5000
  p(lower):   0.670  0.711  0.763  0.774  0.789
  p(upper):   0.888  0.866  0.833  0.824  0.811

Comparing Performance of 2 Models

• Given two models, say M1 and M2, which is better?
  – M1 is tested on D1 (size = n1), observed error rate e1
  – M2 is tested on D2 (size = n2), observed error rate e2
  – Assume D1 and D2 are independent
  – If n1 and n2 are sufficiently large, then approximately
    e1 ∼ N(μ1, σ1), e2 ∼ N(μ2, σ2)
  – with estimated variance σ̂i² = ei(1 − ei) / ni

Comparing Performance of 2 Models (cont.)

• To test whether the performance difference is statistically significant, let d = e1 − e2
  – d ∼ N(dt, σt), where dt is the true difference
  – Since D1 and D2 are independent, their variances add up:
    σt² = σ1² + σ2² ≈ σ̂1² + σ̂2² = e1(1 − e1)/n1 + e2(1 − e2)/n2
  – At the (1 − α) confidence level: dt = d ± Z_{α/2} · σ̂t

An Illustrative Example

• Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
• d = |e2 − e1| = 0.1 (2-sided test)
  σ̂d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043
• At the 95% confidence level, Z_{α/2} = 1.96:
  dt = 0.100 ± 1.96 × √0.0043 = 0.100 ±
0.128
• The interval contains 0, so the difference may not be statistically significant

Comparing Performance of 2 Algorithms

• Each learning algorithm may produce k models:
  – L1 may produce M11, M12, …, M1k
  – L2 may produce M21, M22, …, M2k
• If the models are tested on the same sets D1, D2, …, Dk (e.g., via cross-validation):
  – For each set j, compute dj = e1j − e2j
  – dj has mean dt and variance σt²
  – Estimate:
    σ̂t² = Σ_{j=1}^{k} (dj − d̄)² / (k(k − 1))
    dt = d̄ ± t_{1−α, k−1} · σ̂t
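The two-model comparison above can be sketched as follows, using the normal approximation and the values from the illustrative example:

```python
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference d_t between the
    error rates of two models tested on independent test sets."""
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    d = abs(e1 - e2)
    margin = z * sqrt(var_d)
    return d - margin, d + margin

# Slide example: M1 (n1 = 30, e1 = 0.15) vs. M2 (n2 = 5000, e2 = 0.25)
lo, hi = difference_interval(0.15, 30, 0.25, 5000)
print(lo, hi)                        # roughly -0.028 and 0.228
print("contains 0:", lo <= 0 <= hi)  # True: difference may not be significant
```

Note that most of the variance comes from M1's tiny test set (n1 = 30); with a larger n1 the same observed difference of 0.1 could become significant.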