ISAM 5931
Team: William Hellela, Rauf Gadar, Alex Prewett

Classification as a Data Mining Tool

As the world turns, information becomes more and more important: businesses become more complex, corporations expand globally, and industries become more competitive. Data mining is a tool that business executives can use to find information useful in making strategic decisions. There are different data mining techniques, but the most commonly used are classification, regression, estimation, prediction, and clustering. This paper focuses on defining classification as a data mining tool and illustrating it with examples.

Classification is the process of dividing a dataset into mutually exclusive groups so that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to the specific variable(s) we are trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad" (Dunham, Data Mining 1).

Classification and regression are the most common types of problems to which data mining is applied today. Data miners use classification and regression to predict customer behavior, to signal potentially fraudulent transactions, to predict store profitability, and to identify candidates for medical procedures, to name just a few of many applications. A common thread in these applications is that they have a very high payoff. Data mining can increase revenue, prevent theft, save lives, and help make better decisions.

What distinguishes classification from regression is the type of output that is predicted. Classification, as the name implies, predicts class membership. For example, a model predicts that Jane Doe, a potential customer, will respond to an offer.
With classification, the predicted output (the class) is categorical. A categorical variable has only a few possible values, such as "Yes" or "No," or "Low," "Middle," or "High." Regression, by contrast, predicts a specific numeric value. For example, a model predicts that Mike Smith’s customer profitability will be $854 (Brand, 4).

[Figure: example of a classified data table versus an unclassified data table, as demonstrated by Bruce Moxon at <http://dsg.harvard.edu/courses/hst951/Spring02/951-8P.pdf>]

Classification differs from other data mining approaches in the classes of problems it is able to solve. Classification, perhaps the most commonly applied data mining technique, employs a set of pre-classified examples to develop a model that can classify the population of records at large. Fraud detection and credit-risk applications are particularly well suited to this type of analysis. This approach frequently employs decision tree or neural network-based classification algorithms.

The use of classification algorithms begins with a training set of pre-classified example transactions. For a fraud detection application, this would include complete records of both fraudulent and valid activities, determined on a record-by-record basis. The classifier training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination. The algorithm then encodes these parameters into a model called a classifier (Moxon, 9).

There are two main kinds of models in data mining: predictive and descriptive. Classification is predictive rather than descriptive. Predictive models can be used to forecast explicit values, based on patterns determined from known results. For example, from a database of customers who have already responded to a particular offer, a model can be built that predicts which prospects are most likely to respond to the same offer.
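The training step described above (pre-classified examples go in, a classifier comes out) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the single attribute (transaction amount), the data values, and the names `TRAINING_SET`, `Classifier`, and `train` are all hypothetical, and the one-threshold rule is a deliberately simplified stand-in for a decision tree (a one-level "decision stump"), not the algorithm of any cited author.

```python
from dataclasses import dataclass

# Hypothetical pre-classified training set: (transaction amount, is_fraud).
# The values are invented for illustration only.
TRAINING_SET = [
    (12.0, False), (25.5, False), (40.0, False), (60.0, False),
    (950.0, True), (1200.0, True), (700.0, True), (85.0, False),
]


@dataclass
class Classifier:
    """A one-rule model: amounts above `threshold` are labeled fraudulent."""
    threshold: float

    def predict(self, amount: float) -> bool:
        return amount > self.threshold


def train(examples):
    """Choose the threshold that misclassifies the fewest training examples."""
    amounts = sorted(amount for amount, _ in examples)
    # Candidate thresholds: midpoints between consecutive sorted amounts.
    candidates = [(lo + hi) / 2 for lo, hi in zip(amounts, amounts[1:])]

    def errors(threshold):
        # Count examples whose predicted label differs from the known label.
        return sum((amount > threshold) != label for amount, label in examples)

    # The learned parameter is encoded into the model (the classifier).
    return Classifier(threshold=min(candidates, key=errors))


model = train(TRAINING_SET)
print(model.predict(30.0))    # classify a new, unseen small transaction
print(model.predict(1000.0))  # classify a new, unseen large transaction
```

The learned threshold plays the role of the "set of parameters required for proper discrimination"; a real decision tree or neural network would learn many such parameters over many attributes.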
Descriptive models describe patterns in existing data, and are generally used to create meaningful subgroups, such as demographic clusters, and relationships between events, such as “people who purchased a VCR are three times more likely to purchase a camera in the following four months” (Long, 13).

[Diagram: breakdown of data mining models, showing classification as a part of the predictive model (diagram by William Long)]

Differentiating classification from other data mining tasks

Classification is probably the most studied data mining task. In this task, the goal is to predict the value (class) of user-defined goal attributes based on the values of other attributes, called the predicting attributes. The following are some common ways of differentiating classification from other techniques (Alex Freitas, 1-12).

Classification and estimation

Classification and estimation are closely related and often go hand in hand within a data mining model. While classification is used to answer questions from a finite set of classes, estimation is best used when the answer lies within an unknown, continuous set of answers. For example, estimation can use census tract information to predict household incomes.

Classification and clustering

In the classification task, the class of a training example is given as input to the data mining algorithm, characterizing a form of supervised learning. In contrast, in the clustering task, the data mining algorithm must discover classes by itself, by partitioning the examples into clusters, which is a form of unsupervised learning. Clustering is similar to classification, except that clustering does not require a finite set of predefined classes; clustering simply groups data according to the patterns and rules inherent in the data, based on the similarity of its attributes.

Classification and association

Although both classification and association rules have an If-Then structure, there are important differences between them.
Association rules can have more than one item in the rule consequent, whereas classification rules always have one attribute in the consequent. Unlike the association task, the classification task is asymmetric with respect to the predicting attributes and the goal attributes. Predicting attributes can only occur in the rule antecedent, whereas the goal attribute occurs only in the rule consequent.

Classification and prediction

Classification is the task of finding a function that maps records into one of several discrete classes, while prediction is the task of learning a pattern from examples and using the developed model to predict future values of the target variable.

Classification is a classic data mining task, with roots in machine learning. A typical application is the following: given past records of customers who switched to another supplier, predict which current customers are likely to do the same. This specific application is known as churn prediction, but there are many other applications, such as predicting response to a direct marketing campaign or separating good products from faulty ones.

The classification problem involves data that is divided into two or more groups, or classes. In the example above, the two classes are "switched supplier" and "didn't switch". The data mining software is asked to tell us which of the groups a new example falls into. So, we might train the software using customer records from the last year, divided into our two groups. We then ask the software to predict which of our customers we're likely to lose. Of course, to ensure we can trust the predictions, there is generally a testing or validation stage as well (Brand, 3).

[Diagram: classification as a part of supervised machine learning. Note: the diagram is taken from www.ir.iit.edu/~nazli/cs422/CS422-Slides/DMClassification.pdf]

Works Cited

Freitas, Alex A. www.ppgia.pucpr.br/alex

Moxon, Bruce. "The Hows and Whys of Data Mining and How it Differs From Other Analytical Techniques." DBMS Data Warehouse Supplement (1998): 7.

www.ir.iit.edu/~nazli/cs422/CS422-Slides/DM-Classification.pdf

Dunham, Margaret. "Data Mining: Introductory and Advanced Topics." 2002 <http://www.thearling.com/books.htm>.

Brand, Estelle. "Classification and Regression." Predicting outcomes is the most popular application of data mining 07 02 2000 5. <http://www.dbmsmag.com/9807m04.html>.

Long, William. "Classification Tree." Computer Science 2001 <http://dsg.harvard.edu/courses/hst951/Spring02/951-8P.pdf>.