Data Mining
Instructor: Bajuna Salehe
Email: [email protected]
Web: http://www.ifm.ac.tz/staff/bajuna/courses

Classification and Prediction

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can give us a better understanding of the data at large.

An example application

An emergency room in a hospital measures 17 variables (e.g., blood pressure, age) for newly admitted patients. A decision is needed: whether to put a new patient in an intensive-care unit (ICU). Due to the high cost of the ICU, those patients who may survive less than a month are given higher priority.
Problem: to predict high-risk patients and discriminate them from low-risk patients.

Another application

A credit card company receives thousands of applications for new cards. Each application contains information about an applicant:
- age
- marital status
- annual salary
- outstanding debts
- credit rating
- etc.
Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.

Machine learning and our focus

Machine learning is like human learning from past experiences. A computer does not have “experiences”; a computer system learns from data, which represent some “past experiences” of an application domain.
Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk.
The task is commonly called supervised learning, classification, or inductive learning.

Classification and Prediction

Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

Classification

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

What is Classification

Classification is the task of assigning objects to their respective categories.
Examples include classifying email messages as spam or non-spam based on the message header and content, and classifying galaxies based on their shapes.
Classification can provide valuable support for informed decision making in an organisation. For example, suppose a mobile phone company would like to promote a new cellphone product to the public. Instead of mass mailing the promotional catalog to everyone, the company may be able to reduce the campaign cost by targeting only a small segment of the population.
It may classify each person as a potential buyer or non-buyer based on personal information such as income, occupation, lifestyle, and credit rating.

Discrete Data

Discrete data: a set of data is said to be discrete if the values or observations belonging to it are distinct and separate, i.e., they can be counted (1, 2, 3, ...). Examples might include the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).
Discrete data are any data measurements that are not quantified on an infinitely divisible numeric scale.
This includes items like counts, proportions, ratios, or percentages of a characteristic (e.g., sex, loan forms, department attendance) that have measurements like pass or fail, leak or no leak, small, medium, or large, or go/no-go tests. (SixSigma.com Dictionary)

Continuous Data

Continuous (variable) data: a set of data is said to be continuous if the values or observations belonging to it may take on any value within a finite or infinite interval. You can count, order, and measure continuous data. Examples include height, weight, temperature, the amount of sugar in an orange, and the time required to run a mile.
Variable data have real-number measurements such as 2.34 or 2.55, i.e., they can be measured on a continuous scale.

Categorical Data

Categorical data: a set of data is said to be categorical if the values or observations belonging to it can be sorted according to category. Each value is chosen from a set of non-overlapping categories. For example, shoes in a cupboard can be sorted according to colour: the characteristic 'colour' can have non-overlapping categories 'black', 'brown', 'red', and 'other'. People have the characteristic 'gender' with categories 'male' and 'female'.

Nominal Data

Nominal data: a set of data is said to be nominal if the values or observations belonging to it can be assigned a code, such as a number, where the codes are simply labels. You can count but not order or measure nominal data. For example, in a data set males could be coded as 0 and females as 1; the marital status of an individual could be coded as Y if married and N if single.

Ordinal Data

Ordinal data: a set of data is said to be ordinal if the values or observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
The categories of an ordinal data set have a natural order. For example, suppose a group of people were asked to taste varieties of biscuit and classify each biscuit on a rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, and strongly like. A rating of 5 indicates more enjoyment than a rating of 4, so such data are ordinal.

Preliminaries

The input data for a classification task is given as a collection of records.
Each record, also known as an instance or example, is characterised by a tuple (x, y), where x is the attribute set and y is the class label.

Table 1. Vertebrate Data Set

Table 1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian.
The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water.
The attribute set may contain both discrete and continuous features; in Table 1, however, it contains mostly discrete values.
The class label, on the other hand, must be a discrete attribute.
This is the key characteristic that distinguishes classification from another predictive modeling task known as regression, where y is a continuous attribute.

What is Classification

Classification can be described as the task of assigning objects to one of several predefined categories.
Diagram: Input attribute set (x) -> Classification model -> Output class label (y)
The diagram shows classification as the task of mapping an input attribute set x into its class label y.

Simple Definition

Classification is the task of learning a target function f that maps each attribute set x into one of the predefined class labels y.
The target function is also known informally as a classification model.
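
As a minimal sketch of this notation, each record can be held as a pair (x, y), and a classification model as any function from an attribute set to a class label. The attribute names, the three sample rows, and the hand-written rule `f` below are illustrative assumptions; they are not the contents of Table 1, and `f` is not a learned model.

```python
from typing import Dict, List, Tuple

AttributeSet = Dict[str, str]        # x: attribute name -> discrete value
Record = Tuple[AttributeSet, str]    # (x, y): attribute set and class label

# Illustrative records only (assumed rows, not copied from Table 1).
records: List[Record] = [
    ({"body_temperature": "warm-blooded", "gives_birth": "yes", "can_fly": "no"}, "mammal"),
    ({"body_temperature": "warm-blooded", "gives_birth": "no", "can_fly": "yes"}, "bird"),
    ({"body_temperature": "cold-blooded", "gives_birth": "no", "can_fly": "no"}, "reptile"),
]


def f(x: AttributeSet) -> str:
    """Hand-written stand-in for a target function f: x -> y.

    It only distinguishes the three labels used in the sample rows above.
    """
    if x["body_temperature"] == "cold-blooded":
        return "reptile"
    return "mammal" if x["gives_birth"] == "yes" else "bird"


# The function is consistent with a record when f(x) equals its label y.
consistent = [f(x) == y for x, y in records]
```

In practice f is not written by hand; a learning algorithm induces it from the training records.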
Usefulness of Classification Model

A classification model is useful for the following purposes:
- It may serve as an explanatory tool to distinguish between objects of different classes (descriptive modeling).
- It may also be used to predict the class label of unknown records (predictive modeling).
Consider the table below:
A classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record.
For example, you may be given the characteristics of a creature known as a gila monster.
By building a classification model from the data set shown in Table 1, you may use the model to determine the class to which the creature belongs.
Classification models are most suited for predicting or describing data sets with binary or nominal target attributes.

Classification & Prediction

Classification:
- Predicts categorical class labels.
- Constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
Prediction:
- Models continuous-valued functions, i.e., predicts unknown or missing values.
Typical applications:
- Credit approval
- Target marketing
- Medical diagnosis
- Treatment effectiveness analysis

Classification Techniques

A classification technique is a systematic approach for building classification models from an input data set.
Examples of classification techniques include:
- Decision tree classifiers
- Rule-based classifiers
- Neural networks
- Support vector machines
- Naïve Bayes classifiers
- Nearest-neighbor classifiers
Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data, i.e., a model that produces outputs consistent with the class labels of the input data.
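
To make the idea of a learning algorithm concrete, the sketch below implements a one-nearest-neighbor classifier, one of the techniques listed above. The training rows, the attribute names, and the mismatch-count distance are assumptions chosen for illustration; they are not taken from Table 1. For this technique the "model" is simply the stored training set.

```python
from typing import Dict, List, Tuple

Record = Tuple[Dict[str, str], str]  # (attribute set x, class label y)


def mismatches(a: Dict[str, str], b: Dict[str, str]) -> int:
    """Count the attributes of a whose values differ from those in b."""
    return sum(1 for key in a if a[key] != b.get(key))


def predict_1nn(training: List[Record], x: Dict[str, str]) -> str:
    """Return the label of the training record with the fewest mismatches to x."""
    nearest = min(training, key=lambda rec: mismatches(x, rec[0]))
    return nearest[1]


# Assumed training rows: a human, a pigeon, and a python.
training: List[Record] = [
    ({"body_temperature": "warm-blooded", "skin_cover": "hair",
      "gives_birth": "yes", "has_legs": "yes"}, "mammal"),
    ({"body_temperature": "warm-blooded", "skin_cover": "feathers",
      "gives_birth": "no", "has_legs": "yes"}, "bird"),
    ({"body_temperature": "cold-blooded", "skin_cover": "scales",
      "gives_birth": "no", "has_legs": "no"}, "reptile"),
]

# An unseen record: the gila monster.
gila_monster = {"body_temperature": "cold-blooded", "skin_cover": "scales",
                "gives_birth": "no", "has_legs": "yes"}

predicted = predict_1nn(training, gila_monster)  # nearest row is the python
```

The gila monster differs from the python row on one attribute, from the pigeon row on two, and from the human row on three, so the sketch labels it a reptile.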
A good classification model must correctly predict the class labels of records it has never seen before.
Building models with good generalization capability, i.e., models that accurately predict the class labels of previously unseen records, is therefore a key objective of the learning algorithm.

General Approach to Solve a Classification Problem

A general strategy for solving a classification problem is as follows:
First, the input data is divided into two disjoint sets, known as the training set and the test set.
The training set is used to build a classification model. The induced model is then applied to the test set to predict the class label of each test record.
Why do we divide the data into two sets?
Dividing the data into independent training and test sets allows us to obtain an unbiased estimate of the performance of a model on previously unseen records.
The figure on the next slide depicts the general approach to solving a classification problem.

Performance Measurement of Model

Evaluation of the performance of a classification model is based on the number of test records predicted correctly and incorrectly by the model.
These counts are tabulated in a table known as a confusion matrix.
Table 2 depicts the confusion matrix for a binary classification problem.
Each entry $f_{ij}$ in this table denotes the number of records from class i predicted to be of class j.
For instance, $f_{01}$ is the number of records from class 0 incorrectly predicted as class 1.
Based on the entries in the confusion matrix, the total number of correct predictions made by the model is ($f_{11} + f_{00}$) and the total number of incorrect predictions is ($f_{10} + f_{01}$).
Although a confusion matrix provides the information needed to determine how good a classification model is, it is useful to summarize this information in a single number.
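
The counting behind a binary confusion matrix can be sketched as follows. The label lists are made-up test-set values for illustration, and `confusion_counts` is a hypothetical helper name; the first index is the actual class and the second the predicted class, matching the definition of the entries above.

```python
from typing import Dict, List, Tuple


def confusion_counts(actual: List[int], predicted: List[int]) -> Dict[Tuple[int, int], int]:
    """Return f[(i, j)]: the number of records from class i predicted as class j."""
    f = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for a, p in zip(actual, predicted):
        f[(a, p)] += 1
    return f


# Made-up labels for eight test records of a binary problem (illustrative only).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_counts(actual, predicted)
correct = f[(1, 1)] + f[(0, 0)]  # f11 + f00
wrong = f[(1, 0)] + f[(0, 1)]    # f10 + f01
```

For these labels the model gets six of the eight test records right and two wrong.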
Summarizing the confusion matrix in a single number makes it more convenient to compare the performance of different classification models.
Several performance metrics are available for this purpose. One of the most popular is model accuracy, which is defined as:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

Equivalently, the performance of a model can be expressed in terms of its error rate, given by the following equation:

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

Decision Trees