Download ISAM 5931 - UHCL MIS

ISAM 5931
Team: William Hellela
Rauf Gadar
Alex Prewett
Classification as a Data Mining Tool
As the world turns, information becomes more and more important: businesses grow more
complex, corporations expand globally, and industries become more competitive.
Data mining is a tool that business executives can use to find
information that is useful in making strategic decisions. There are many data mining
techniques, but the most commonly used are classification, regression, estimation,
prediction, and clustering.
This paper focuses on defining classification as a data mining tool and illustrating it
with examples.
Classification is the process of dividing a dataset into mutually exclusive groups so that
the members of each group are as "close" as possible to one another, and different groups
are as "far" as possible from one another, where distance is measured with respect to
the specific variable(s) we are trying to predict. For example, a typical classification
problem is to divide a database of companies into groups that are as homogeneous as
possible with respect to a creditworthiness variable with values "Good" and "Bad"
(Dunham, Data Mining 1).
Classification and regression are the most common types of problems to which
data mining is applied today. Data miners use classification and regression to predict
customer behavior, to signal potentially fraudulent transactions, to predict store
profitability, and to identify candidates for medical procedures, to name just a few of
many applications. A common thread in these applications is that they have a very high
payoff. Data mining can increase revenue, prevent theft, save lives, and help make better
decisions. What distinguishes classification from regression is the type of output that is
predicted. Classification, as the name implies, predicts class membership. For example, a
model predicts that Jane Doe, a potential customer, will respond to an offer. With
classification, the predicted output (the class) is categorical. A categorical variable has
only a few possible values, such as "Yes" or "No," or "Low," "Middle," or "High."
Regression, in contrast, predicts a specific numeric value. For example, a model predicts
that Mike Smith’s customer profitability will be $854 (Brand, 4).
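The difference in output type can be made concrete with a minimal sketch. It uses a simple nearest-neighbor approach, which is an illustrative choice and not a method taken from the sources cited here; the records, feature values, and function names are all hypothetical.

```python
# Hypothetical toy records: features are (age, income in $1000s).
def nearest(records, query, k=3):
    """Return the k records whose features are closest to the query."""
    def sq_dist(features):
        return sum((a - b) ** 2 for a, b in zip(features, query))
    return sorted(records, key=lambda r: sq_dist(r[0]))[:k]

def classify(records, query, k=3):
    """Classification: majority vote over categorical labels."""
    labels = [label for _, label in nearest(records, query, k)]
    return max(set(labels), key=labels.count)

def regress(records, query, k=3):
    """Regression: average of the numeric targets of the neighbors."""
    values = [value for _, value in nearest(records, query, k)]
    return sum(values) / len(values)

# Same customers, two different targets (made-up data).
responses = [((25, 30), "No"), ((30, 45), "No"), ((45, 80), "Yes"),
             ((50, 95), "Yes"), ((40, 70), "Yes")]
profits = [((25, 30), 120.0), ((30, 45), 200.0), ((45, 80), 800.0),
           ((50, 95), 950.0), ((40, 70), 700.0)]

predicted_class = classify(responses, (42, 75))  # a category: "Yes" or "No"
predicted_value = regress(profits, (42, 75))     # a number (dollars)
```

The same neighbors feed both predictions; only the kind of answer differs, a class label in one case and a dollar amount in the other.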
Here is an example of a classified data table versus an unclassified data table, as
demonstrated by Bruce Moxon at <http://dsg.harvard.edu/courses/hst951/Spring02/951-8P.pdf>
Classification differs from other data mining approaches in the classes of problems
it is able to solve. Classification, perhaps the most commonly applied
data mining technique, employs a set of pre-classified examples to develop a model that
can classify the population of records at large. Fraud detection and credit-risk
applications are particularly well suited to this type of analysis. This approach frequently
employs decision tree or neural network-based classification algorithms. The use of
classification algorithms begins with a training set of pre-classified example transactions.
For a fraud detection application, this would include complete records of both fraudulent
and valid activities, determined on a record-by-record basis. The classifier training
algorithm uses these pre-classified examples to determine the set of parameters required
for proper discrimination. The algorithm then encodes these parameters into a model
called a classifier (Moxon, 9).
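As a rough sketch of that training step, the code below learns a one-level decision tree (a "decision stump") from pre-classified transactions. The single feature (transaction amount), the data, and the function names are assumptions made for illustration; real fraud models use many attributes and more capable algorithms.

```python
def train_stump(examples):
    """Learn one threshold on a numeric feature from (value, label) pairs.

    Returns (threshold, label_above, label_below), chosen to minimize
    errors on the training set. Assumes at least two distinct values.
    """
    values = sorted({v for v, _ in examples})
    labels = sorted({label for _, label in examples})
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for above in labels:
            for below in labels:
                errors = sum((above if v > t else below) != label
                             for v, label in examples)
                if best is None or errors < best[0]:
                    best = (errors, t, above, below)
    _, t, above, below = best
    return t, above, below

def make_classifier(model):
    """Encode the learned parameters into a callable classifier."""
    t, above, below = model
    return lambda v: above if v > t else below

# Hypothetical pre-classified transactions: (amount, label).
training = [(12.0, "valid"), (40.0, "valid"), (55.0, "valid"),
            (900.0, "fraud"), (1500.0, "fraud"), (75.0, "valid")]

model = train_stump(training)        # the learned parameters
classifier = make_classifier(model)  # the resulting classifier
```

Here the "parameters required for proper discrimination" are just a threshold and two labels; calling `classifier(2000.0)` applies them to a record the training algorithm never saw.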
There are two main kinds of models in data mining: predictive and descriptive.
Classification is predictive rather than descriptive. Predictive models can be used to
forecast explicit values, based on patterns determined from known results. For example,
from a database of customers who have already responded to a particular offer, a model
can be built that predicts which prospects are most likely to respond to the same offer.
Descriptive models describe patterns in existing data, and are generally used to create
meaningful subgroups, such as demographic clusters, and relationships between events,
such as “people who purchased a VCR are three times more likely to purchase a camera
in the following four months” (Long, 13).
Breakdown of data mining models, showing classification as part of the
predictive models (diagram by William Long)
Differentiating classification from other data mining tasks:
Classification is probably the most studied data mining task. In this task, the goal is to
predict the value (class) of user-defined goal attributes based on the values of other
attributes, called the predicting attributes. The following are some common
distinctions between classification and other techniques (Freitas, 1-12).
Classification and estimation
Classification and estimation are closely related and often go hand in hand within a
data mining model. While classification is used to answer questions from a finite set of
classes, estimation is best used when the answer lies within an unknown, continuous set
of answers. An example of estimation is using census tract information to predict
household incomes.
Classification and clustering
In the classification task, the class of a training example is given as input to the data
mining algorithm, characterizing a form of supervised learning. In contrast, in the
clustering task, the data mining algorithm must discover classes by itself, by partitioning
the examples into clusters, which is a form of unsupervised learning.
Clustering is similar to classification, except that clustering does not require a finite set of
predefined classes; clustering simply groups data according to the patterns and rules
inherent in the data based on the similarity of its attributes.
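The supervised/unsupervised contrast can be sketched in a few lines. In the first function the class labels are part of the input; in the second, no labels are given and the groups emerge from the data alone. The one-dimensional data, the starting centers, and the choice of a centroid model and k-means are assumptions for illustration only.

```python
def train_centroids(labeled):
    """Supervised: each training value arrives with its class label."""
    sums = {}
    for value, label in labeled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def kmeans_1d(values, centers, iterations=10):
    """Unsupervised: a tiny 1-D k-means that discovers groups by itself."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest current center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # Move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

values = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0]

# Classification input: the analyst supplies the classes.
labeled = list(zip(values, ["low", "low", "low", "high", "high", "high"]))
class_centroids = train_centroids(labeled)

# Clustering input: values only; the starting centers are arbitrary guesses.
cluster_centers, clusters = kmeans_1d(values, centers=[1.0, 22.0])
```

Both runs end up with the same two groups, but only the classification run knows them as "low" and "high"; the clusters are unnamed until someone interprets them.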
Classification and Association
Although both classification and association rules have an If-Then structure, there are
important differences between them. Association rules can have more than one item in
the rule consequent, whereas classification rules always have one attribute in the
consequent. Unlike the association task, the classification task is asymmetric with respect
to the predicting attributes and the goal attributes. Predicting attributes can only occur in
the rule antecedent, whereas the goal attribute occurs only in the rule consequent.
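One way to picture the asymmetry is to represent a rule as an antecedent and a consequent and check the two constraints just described. The `Rule` type, the attribute names, and the helper function are hypothetical, introduced only for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: dict  # the IF part: attribute -> value
    consequent: dict  # the THEN part: attribute -> value

def is_classification_rule(rule, goal):
    """True if the consequent holds exactly the goal attribute
    and the goal attribute does not appear in the antecedent."""
    return set(rule.consequent) == {goal} and goal not in rule.antecedent

# IF income = high AND debt = low THEN credit = Good
credit_rule = Rule({"income": "high", "debt": "low"}, {"credit": "Good"})

# IF bread THEN butter AND milk (two items in the consequent)
basket_rule = Rule({"bread": True}, {"butter": True, "milk": True})

# The goal attribute used as a predictor, which classification does not allow.
reversed_rule = Rule({"credit": "Good"}, {"income": "high"})

checks = [is_classification_rule(r, "credit")
          for r in (credit_rule, basket_rule, reversed_rule)]
```

Only the first rule passes: the second has more than one item in its consequent, and the third puts the goal attribute in the antecedent, which is acceptable for association rules but not for classification rules.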
Classification and prediction
Classification is the task of finding a function that maps records into one of several
discrete classes, while prediction is the task of learning a pattern from examples and
using the developed model to predict future values of the target variable.
Classification is a classic data mining task, with roots in machine learning. A
typical application is the following: given past records of customers who switched to
another supplier, predict which current customers are likely to do the same. This specific
application is known as churn prediction, but there are many other applications, such
as predicting response to a direct marketing campaign or separating good products from
faulty ones. The classification problem involves data which is divided into two or
more groups, or classes. In the example above, the two classes are "switched supplier"
and "didn't switch". The data mining software is asked to tell us which of the groups a
new example falls into.
So, we might train the software using customer records from the last year, divided into
our two groups. We then ask the software to predict which of our customers we're likely
to lose. Of course, to ensure we can trust the predictions, there is generally a testing or
validation stage as well (Brand, 3).
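The train-then-validate workflow can be sketched as follows. The customer records, the single "contract type" attribute, and the deliberately simple model (predict the most common outcome for each contract type) are all assumptions for illustration; they are not taken from Brand's article.

```python
from collections import Counter, defaultdict

def train(records):
    """Learn the most common outcome for each contract type."""
    by_type = defaultdict(Counter)
    for contract, outcome in records:
        by_type[contract][outcome] += 1
    # Fallback for contract types never seen during training.
    overall = Counter(o for _, o in records).most_common(1)[0][0]
    table = {c: counts.most_common(1)[0][0] for c, counts in by_type.items()}
    return lambda contract: table.get(contract, overall)

def accuracy(model, records):
    """Fraction of records whose known outcome the model predicts correctly."""
    return sum(model(c) == o for c, o in records) / len(records)

# Hypothetical records from last year: (contract type, known outcome).
last_year = [
    ("monthly", "switched"), ("monthly", "switched"), ("monthly", "stayed"),
    ("annual", "stayed"), ("annual", "stayed"), ("annual", "stayed"),
    ("monthly", "switched"), ("annual", "stayed"),
    ("monthly", "switched"), ("annual", "switched"),
]

# Hold back some records: the model never sees them during training.
train_set, test_set = last_year[:6], last_year[6:]

model = train(train_set)
test_accuracy = accuracy(model, test_set)

# Only after checking the held-out accuracy do we score current customers.
prediction_for_current_customer = model("monthly")
```

Measuring accuracy on held-out records, rather than on the training records, is what gives some grounds for trusting the predictions on customers whose outcome is not yet known.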
The following diagram illustrates classification as a part of supervised machine learning.
Note: The diagram is taken from www.ir.iit.edu/~nazli/cs422/CS422-Slides/DMClassification.pdf
Works Cited
Freitas, Alex A. www.ppgia.pucpr.br/alex
Moxon, Bruce. "The Hows and Whys of Data Mining and How it Differs From Other
Analytical Techniques." DBMS Data Warehouse Supplement (1998): 7.
www.ir.iit.edu/~nazli/cs422/CS422-Slides/DM-Classification.pdf
Dunham, Margaret. "Data Mining: Introductory and Advanced Topics." 2002
<http://www.thearling.com/books.htm>.
Brand, Estelle. "Classification and Regression." Predicting outcomes is the most popular
application of data mining 07 02 2000 5.
<http://www.dbmsmag.com/9807m04.html>.
Long, William. "Classification Tree." Computer Science 2001
<http://dsg.harvard.edu/courses/hst951/Spring02/951-8P.pdf>.