
Lecture Note for MBG 404
Jens Allmer
Lecture Note 7 for MBG 404
Data mining
Data mining concerns itself with similarity among data. Clustering, a term often used in
bioinformatics, is one major field of data mining; others are association analysis and
classification. Classification will be explained first, using a decision tree as an example;
next week clustering will be the topic.
Classification refers to assigning a class to a row in a dataset. This is usually done by learning
from examples: given some measurements associated with a known class, rules can be learned
and used to predict the class membership of new measurements (Figure 1).
Figure 1: General strategy used in classification. From Tan, Steinbach, Kumar, Introduction to Data Mining.
The abstract strategy in classification, as depicted in Figure 1, depends on a good dataset in
which the data are associated with known class memberships (here class Yes or No). This dataset
is used to establish rules that, given the values of the attributes, describe the class
membership of each row (tuple). Once the rules have been established, new data of unknown class
can be evaluated and assigned to a class (Apply Model). A decision tree can be used as a means
of deciding to which class an unknown record belongs. Other ways to do this include support
vector machines (SVMs) and artificial neural networks, among many other algorithms, with SVMs
often among the most successful in practice.
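The learn-then-apply strategy of Figure 1 can be sketched in a few lines of Python. Here a simple nearest-neighbour rule stands in for the induction algorithm; the training records and attribute values are illustrative, not taken from the figure:

```python
import math

# Labelled training set: (attributes, class) pairs.
# Attributes here: (taxable income in thousands, refund as 0/1). Values are made up.
training = [
    ((125.0, 1), "No"),
    ((100.0, 0), "No"),
    ((70.0, 0),  "No"),
    ((120.0, 1), "No"),
    ((95.0, 0),  "Yes"),
    ((85.0, 0),  "Yes"),
]

def distance(a, b):
    """Euclidean distance between two attribute tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_class(record):
    """Apply the 'model' (here: the training set itself) to a new record:
    return the class of the most similar known example."""
    _, label = min(training, key=lambda example: distance(example[0], record))
    return label

print(predict_class((92.0, 0)))  # closest training record is (95.0, 0) -> "Yes"
```

A decision tree or an SVM would replace the nearest-neighbour rule here, but the workflow is the same: build a model from labelled data, then apply it to records of unknown class.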
Decision Tree
A decision tree gives step-by-step instructions for assigning a class based on the
attributes of the data (Figure 2). Each attribute is analyzed and either leads to a new
instruction using another attribute or to a decision.
Figure 2: A decision tree with three instructions (yellow boxes) and four decision points (blue boxes). From
Tan, Steinbach, Kumar, Introduction to Data Mining.
The decision tree in Figure 2 can be understood as follows. First the attribute Refund is
evaluated; if it contains the value Yes, a decision can be reached, which is No in this case.
If the value is No, the next attribute needs to be evaluated. This categorical attribute (MarSt)
is arbitrarily split into the two paths Single/Divorced and Married, which in the case of
Married leads to the decision No. If the value in MarSt is Single or Divorced, however,
another attribute (TaxInc) needs to be tested. This continuous attribute is split into the two
paths less than 80000 and more than 80000. At this point a decision must be reached since
there are no more attributes in the data.
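The tree of Figure 2 translates directly into nested conditionals. A sketch in Python, with attribute names following the figure (which branch of TaxInc yields which label follows the original textbook figure and is an assumption here):

```python
def classify_record(refund, marital_status, taxable_income):
    """Walk the decision tree of Figure 2 and return the class label."""
    if refund == "Yes":
        return "No"                    # first decision point: Refund = Yes -> No
    # Refund = No: test the categorical attribute MarSt
    if marital_status == "Married":
        return "No"                    # Married -> No
    # Single or Divorced: test the continuous attribute TaxInc
    if taxable_income < 80000:
        return "No"                    # less than 80000 -> No
    return "Yes"                       # more than 80000 -> Yes

print(classify_record("No", "Single", 95000))  # -> "Yes"
```

Each if-statement corresponds to one yellow instruction box, and each return to one blue decision point.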
How can such a decision tree be built?
For each candidate attribute, the split that can be achieved using that attribute is
calculated. From this it can be learned how informative an attribute is: the better the
attribute separates the data into the correct classes, the more informative it is. The most
informative attributes are used first in a decision tree. An example calculation can be seen
in Figure 3, which also shows the calculation for splitting data at decision points. Entropy
can be calculated for this and must be calculated for all attributes of the dataset. Then the
one with the lowest entropy can be picked as the first splitter (here Refund).
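The entropy of a split can be computed as follows; the class counts below are illustrative and not those of Figure 3:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts,
    e.g. a perfectly mixed split [5, 5] has entropy 1.0 bit."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def split_entropy(partitions):
    """Weighted average entropy of the partitions produced by splitting
    on an attribute; lower means the attribute is more informative."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

# Example: 10 records, 3 Yes / 7 No. Splitting on a binary attribute gives
# one branch with [0 Yes, 3 No] and another with [3 Yes, 4 No].
print(entropy([3, 7]))                  # entropy before the split
print(split_entropy([[0, 3], [3, 4]]))  # weighted entropy after the split (lower)
```

Computing this weighted entropy for every attribute and picking the attribute with the lowest value gives the first splitter, as described above.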
Figure 3: Splitting the Refund attribute. From Tan, Steinbach, Kumar, Introduction to Data Mining.
The important message to keep in mind is that, with a set of measurements for which the
outcome is known, classification can be used to build a model, which then enables the
automatic evaluation of new measurements.
Clustering, as opposed to classification, is independent of any previous class assignment in
the training dataset. Therefore clustering and similar methods are part of unsupervised
learning, whereas classification is most often characterized as supervised learning.
Clustering is concerned with the similarity among data. For each data item in the dataset,
distances to the other items are calculated under some measure. Then clusters are assigned by
minimizing the distances among data items within a cluster and maximizing the distances to
other clusters.
The model can be used to ‘classify’ unknown data using the learned clusters. The new data
item is assigned membership to its closest cluster.
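Assigning a new item to its closest cluster can be sketched as follows; the cluster centroids are illustrative placeholders for clusters learned in an earlier clustering run:

```python
import math

# Centroids of clusters found by a previous clustering run (illustrative values).
centroids = {"A": (1.0, 1.0), "B": (8.0, 9.0)}

def nearest_cluster(point):
    """Return the name of the cluster whose centroid is closest to `point`
    under the Euclidean distance."""
    def dist(name):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(point, centroids[name])))
    return min(centroids, key=dist)

print(nearest_cluster((2.0, 1.5)))  # -> "A"
```

This mirrors the sentence above: the learned clusters act as the model, and membership of a new item is decided purely by distance.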