Lecture Note for MBG 404
Jens Allmer

Lecture Note 7 for MBG 404: Data Mining

Data mining concerns itself with similarity among data. Clustering, a term often used in bioinformatics, is one major field in data mining; others are association analysis and classification. Classification will be explained first, using a decision tree as an example, and clustering will be the topic next week.

Classification

Classification refers to assigning a class to a row in a dataset. This is usually done by learning from examples: given measurements associated with a known class, a model can be built and used to predict the class membership of new measurements (Figure 1).

Figure 1: General strategy used in classification. From Tan, Steinbach, Kumar, Introduction to Data Mining.

The abstract strategy depicted in Figure 1 depends on a good dataset in which each row (tuple) is associated with a known class membership (here, class Yes or No). This dataset is used to establish rules that describe, given the values in the attributes, the class membership of a row. Once the rules have been established, new data of unknown class can be evaluated and assigned to a class (Apply Model). A decision tree is one means of deciding to which class an unknown record belongs; support vector machines (SVMs) and artificial neural networks are among the many other algorithms, with SVMs often being the most successful.

Decision Tree

A decision tree gives step-by-step instructions for how a class can be assigned based on the attributes of the data (Figure 2). Each attribute is analyzed and either leads to a new instruction using another attribute or to a decision.

Figure 2: A decision tree with three instructions (yellow boxes) and four decision points (blue boxes). From Tan, Steinbach, Kumar, Introduction to Data Mining.

The decision tree in Figure 2 can be understood as follows.
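In code, such a tree corresponds to nested conditions. The sketch below is one minimal Python rendering; the attribute names (Refund, MarSt, TaxInc) and the 80K threshold come from the figure, while the dictionary record structure is a hypothetical choice for illustration:

```python
def classify(record):
    """Walk the decision tree of Figure 2 for one record (a dict)."""
    if record["Refund"] == "Yes":
        return "No"                      # Refund = Yes leads directly to No
    if record["MarSt"] == "Married":
        return "No"                      # married non-refunders: No
    # Single or Divorced: test the continuous attribute next.
    if record["TaxInc"] < 80_000:
        return "No"
    return "Yes"

print(classify({"Refund": "No", "MarSt": "Single", "TaxInc": 95_000}))  # Yes
```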
First, the attribute Refund is evaluated; if it contains the value Yes, a decision can be reached, which is No in this case. If the value is No, the next attribute needs to be evaluated. This categorical attribute (MarSt) is arbitrarily split into the two paths Single/Divorced and Married, of which Married leads to the decision No. If the value in MarSt is Single or Divorced, however, another attribute (TaxInc) needs to be tested. This continuous attribute is split into the two paths less than 80,000 and more than 80,000. At this point a decision must be reached, since there are no more attributes in the data.

How can such a decision tree be built? For each candidate attribute, the split that can be achieved using that attribute is calculated, which shows how informative the attribute is: the better an attribute separates the data into the correct classes, the more informative it is. The most informative attributes are used first in the decision tree. An example calculation can be seen in Figure 3, which also shows the calculation for splitting data at decision points. Entropy can be calculated for this purpose and must be calculated for all attributes of the dataset; the attribute with the lowest entropy after splitting is then picked as the first splitter (here, Refund).

Figure 3: Splitting the Refund attribute. From Tan, Steinbach, Kumar, Introduction to Data Mining.

The important message to keep in mind is that, with a set of measurements for which the outcome is known, classification can be used to build a model which then enables the automatic evaluation of new measurements.

Clustering

Clustering, as opposed to classification, is independent of any previous class assignment in the training dataset. Therefore clustering and similar methods are part of unsupervised learning, whereas classification is most often characterized as supervised learning. Clustering is concerned with the similarity among data.
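Returning briefly to how the tree above is built: the entropy comparison that picks the first splitter can be sketched as follows. The records are made up for illustration (they are not the numbers from Figure 3), and splitting the computation into `entropy` and `split_entropy` is just one way to organize it:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def split_entropy(rows, attribute):
    """Weighted average entropy after splitting the rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row["class"])
    total = len(rows)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# Hypothetical records loosely following the Refund example.
rows = [
    {"Refund": "Yes", "MarSt": "Single",   "class": "No"},
    {"Refund": "Yes", "MarSt": "Married",  "class": "No"},
    {"Refund": "No",  "MarSt": "Single",   "class": "Yes"},
    {"Refund": "No",  "MarSt": "Married",  "class": "Yes"},
    {"Refund": "No",  "MarSt": "Divorced", "class": "Yes"},
]

# The attribute with the lowest entropy after splitting is used first.
best = min(["Refund", "MarSt"], key=lambda a: split_entropy(rows, a))
print(best)  # Refund splits these toy rows perfectly, so it wins
```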
For each pair of data items in the dataset, a distance is calculated according to some measure. Clusters are then assigned by minimizing the distances among data items within a cluster and maximizing the distances to other clusters. The resulting model can be used to 'classify' unknown data using the learned clusters: a new data item is assigned membership to its closest cluster.
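That final assignment step can be sketched as below. The centroids stand in for clusters already learned from the data (e.g. by k-means); the cluster names and coordinates are invented for illustration:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical cluster centres, as they might come out of a clustering run.
centroids = {"cluster_a": (1.0, 1.0), "cluster_b": (8.0, 9.0)}

def assign(point):
    """Return the name of the cluster whose centre is closest to the point."""
    return min(centroids, key=lambda name: dist(point, centroids[name]))

print(assign((2.0, 1.5)))  # closest to (1.0, 1.0) -> cluster_a
```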