University of Massachusetts Dartmouth
MULTIVARIATE DATA MINING USING
INDEXED K-MEANS.
A Thesis in
Computer Engineering
by
TR Satish Kumaar
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Master of Science
University of Massachusetts, Dartmouth
January 2003
Copyright 2003 by TR Satish Kumaar
I grant the University of Massachusetts Dartmouth the non-exclusive right to use the work
for the purpose of making single copies of the work available to the public on a not-for-profit basis if the University's circulating copy is lost or destroyed.
____________________________________
TR Satish Kumaar
Date: ______________________________
We approve the thesis of TR Satish Kumaar:

Paul J. Fortier, Professor of Electrical and Computer Engineering, Thesis Advisor
Hong Liu, Professor of Electrical and Computer Engineering, Graduate Committee
Howard E. Michel, Assistant Professor of Electrical and Computer Engineering, Graduate Committee
Dr. Dayalan P. Kasilingam, Associate Professor, Graduate Program Director, Department of Electrical and Computer Engineering
Antonio H. Costa, Professor, Chairperson, Department of Electrical and Computer Engineering
Farhad Azadivar, Dean, College of Engineering
Richard J. Panofsky, Associate Vice Chancellor for Academic Affairs and Graduate Studies
ABSTRACT
Multivariate data mining using indexed k-means
By TR Satish Kumaar
Raw information grows at an ever-increasing rate, dictating a need for tools that turn such data into useful information and knowledge; this is where data mining comes into play. The knowledge gained can be used in applications ranging from business management, production control, and market analysis to engineering design and science exploration. There are many approaches to knowledge discovery, including association rules, decision trees, k-nearest neighbor, classification, clustering, and genetic algorithms.
The focus of this thesis is to mine a multivariate dataset using indexing within a k-means clustering algorithm to discover rules, and to compare the results with those of the ordinary k-means method in order to analyze and test them for accuracy and importance. The thesis tested whether a clustering method based on an indexed k-means algorithm is feasible for a multivariate dataset with static variables, and whether better information can be obtained with this process than with the regular k-means method. The indexed k-means method gives more precise and useful information than the regular k-means method; questions that are not answered by the k-means method are answered by the indexed k-means method. For example, this method can state that "in the month of January one can get Fish 120 at (42.25, 70.25) with a market value of 60.75 and a landed weight of 26.25 kg," whereas the regular k-means method produced no cluster for the month of January. The indexed k-means method also has an important practical implication: it reduces the computational power needed to cluster a huge dataset by replacing it with smaller datasets, without losing precious knowledge in the data, while giving more useful and precise information than the regular k-means method.
TABLE OF CONTENTS
ACKNOWLEDGEMENT ............................................................................. v
LIST OF FIGURES .......................................................................................vi
LIST OF TABLES ...................................................................................... viii
CHAPTER 1 INTRODUCTION.................................................................. 1
1.1 Emergence of data mining ................................................................................... 1
1.2 What is data mining? ............................................................................................. 1
1.2.1 Architecture of data mining ...................................................................... 3
1.2.2 Data mining versus Query tool ............................................................... 4
1.2.3 Data mining Functionalities ...................................................................... 4
1.2.4 Classification of data mining ..................................................................... 7
1.2.5 Practical Problems of data mining ........................................................... 8
1.2.6 Data mining issues and ethics ................................................................... 9
1.3 What is multivariate data mining? .................................................................... 11
1.3.1 When is multivariate analysis used? ....................................................... 12
1.4 What is K-mean mining?.................................................................................... 13
1.4.1 How does k-mean work? ......................................................................... 13
1.4.2 Why k-means is not enough? .................................................................. 13
1.5 Motivation ............................................................................................................. 13
1.5.1 How can we achieve indexed k-mean method? .................................. 15
1.6 Research contribution ......................................................................................... 15
1.6.1 Assumptions............................................................................................... 15
1.7 Thesis organization ............................................................................................. 16
CHAPTER 2 RELATED WORK ................................................................ 17
2.1 Data mining approach ........................................................................................ 17
2.2 Mining complex data in large data and information repositories ............... 19
2.3 Clustering Analysis .............................................................................................. 20
2.3.1 Partitioning methods ................................................................................ 24
2.3.1.1 k-means algorithm .......................................................................... 25
2.3.1.2 K-medoids method ........................................................................ 30
2.3.2 Hierarchical methods................................................................................ 33
2.3.3 Density-based methods ............................................................................ 34
2.3.4 Grid-based methods ................................................................................. 34
2.3.5 Model-based methods .............................................................................. 35
2.3.6 EM algorithm ............................................................................................. 35
2.4 Indexed based algorithms .................................................................................. 38
2.5 Which techniques to use for which tasks ........................................................ 39
2.6 Multidimensional data model ............................................................................ 41
CHAPTER 3 ALGORITHM AND DATASET .......................................... 43
3.1 Different forms of knowledge .......................................................................... 43
3.2 Getting started...................................................................................................... 45
3.3 KDD Process ....................................................................................................... 46
3.3.1 Data selection ............................................................................................. 46
3.3.2 Cleaning....................................................................................................... 47
3.3.3 Enrichment................................................................................................. 48
3.3.4 Coding ......................................................................................................... 49
3.3.5 Data mining ................................................................................................ 49
3.3.6 Reporting .................................................................................................... 50
3.4 Data Sources ......................................................................................................... 50
3.4.1 Data modeling............................................................................................ 51
3.4.2 Preprocessing ............................................................................................. 53
3.4.3 Data cleaning .............................................................................................. 54
3.5 Algorithm for indexed k-means ........................................................................ 55
3.6 Discovery of interesting patterns ...................................................... 58
3.6.1 Interestingness measure ........................................................................... 58
3.7 Presentations and visualization of discovered patterns ................................ 60
3.8 Implementation Tools and software ................................................................ 61
CHAPTER 4 RESULTS AND ANALYSIS ................................................ 62
4.1 Experimental Results .......................................................................................... 62
4.1.1 Output for indexed k-means ................................................................... 62
4.1.2 Output for k-means .................................................................................. 71
4.2 Analyses ................................................................................................................. 72
4.2.1 Interpreting the patterns found ..............................................................73
4.2.2 Testing and Performance .......................................................................74
4.2.3 Comparison between k-means and indexed k-means ........................75
4.3 Discussion ............................................................................................................. 76
CHAPTER 5 FUTURE WORK................................................................... 77
5.1 Conclusion ............................................................................................................ 78
5.2 Future Work ......................................................................................................... 78
5.3 Research Directions ............................................................................................ 79
APPENDIX A SOURCE CODE ................................................................. 81
BIBLIOGRAPHY ......................................................................................... 94
ACKNOWLEDGMENTS
The author wishes to express appreciation to:
My supervisor, Dr. Paul J. Fortier, whose ideas and comments made this thesis possible
The members of my committee, for their time and interest
My parents, for their unconditional love and support
LIST OF FIGURES
Number
Page
Fig 2.1: Flowchart for K-means Algorithm.................................................................. 29
Fig 3.1: Flowchart for Indexed k-means Algorithm ................................................... 57
Fig 4.1.1: 3-Dimensional Figure of Landed_kg, Percentage of Occurrence and
Month .................................................................................................................................. 63
Fig 4.1.2: 3-Dimensional Figure of Latitude Degree, Longitude Degree and Fish
ID ......................................................................................................................................... 63
Fig 4.1.3: 3-Dimensional Figure of Latitude Degree, Longitude Degree and
Landed Weight of fish ...................................................................................................... 64
Fig 4.1.4: 3-Dimensional Figure of Latitude Degree, Longitude Degree and
Market Values of Fish ...................................................................................................... 64
Fig 4.1.5: 3-Dimensional Figure of Latitude Degree, Longitude Degree and
Month .................................................................................................................................. 65
Fig 4.1.6: 3-Dimensional Figure of Market Values, Fish ID and Percentage of
Occurrence of Fish ........................................................................................................... 65
Fig 4.1.7: 3-Dimensional Figure of Market values, Landed Kg and Fish ID ........ 66
Fig 4.1.8: 3-Dimensional Figure of Market values, Landed Kg and Month .......... 66
Fig 4.1.9: 3-Dimensional Figure of Market values, Landed Kg and Percentage of
Occurrence of Fish ........................................................................................................... 67
Fig 4.1.10: 3-Dimensional Figure of Market values, Month and Percentage of
Occurrence of Fish ........................................................................................................... 67
Fig 4.1.11: 3-Dimensional Figure of Month, Landed KG and Fish Id................... 68
Fig 4.1.12: 3-Dimensional Figure of Month, Market Values and Fish Id .............. 68
Fig 4.1.13: 3-Dimensional Figure of Percentage, Month and Fish Id ..................... 69
Fig 4.1.14: 3-Dimensional Figure of Latitude degree, Longitude degree and
Percentage of Occurrence ............................................................................................... 69
Fig 4.1.15: 3-Dimensional Figure of Fish Id, Landed KG and Percentage of
Occurrence ......................................................................................................................... 70
LIST OF TABLES
Number
Page
Table 2.1: Techniques and Tasks. .................................................................................. 39
Table 4.1: Output of K-means........................................................................................ 71
Table 4.2: Analysis of Indexed K-means ...................................................................... 72
Chapter 1
INTRODUCTION
1.1 Emergence of data mining
In one of his short stories, The Library of Babel, the Argentine writer Jorge
Luis Borges describes an infinite library consisting of an endless network of
rooms with bookshelves. Although most of the books have no meaning and carry
unintelligible titles like 'Axaxaxas mlo', people wander through the library until they die,
and scholars develop wild hypotheses that somewhere in the library there must be
a central catalog; or that all the books that one could possibly think of must be
somewhere in the library. None of these hypotheses can be verified because the
library contains an infinite amount of data but no information [1]. The library of
Babel may be interpreted as an interesting but cruel metaphor for the situation in
which modern humans find themselves: we live in an expanding universe in
which there is too much data (a growth driven by the mechanical production of texts) and too little information. The development of
new techniques to find the required information from huge amounts of data is
one of the main challenges for software developers today [1].
1.2 What is data mining?
Against this background, a great interest is being shown in the new field of 'data
mining' or KDD (knowledge discovery in databases). KDD is like mining, where
enormous quantities of debris have to be removed before diamonds or gold can
be found. Similarly, with a computer, one can automatically find the one 'information diamond' among the tons of data debris in one's database.
It was proposed at the first international KDD conference in Montreal in 1995 that the term 'KDD' be employed to describe the whole multi-disciplinary process of extracting knowledge from data, where knowledge here means relationships and patterns between data elements, and that the term 'data mining' be reserved for the discovery stage of the KDD process [1].
Knowledge discovery as a process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining)
5. Data mining (the process of applying intelligent methods to extract data patterns)
6. Pattern evaluation (to identify the interesting patterns representing knowledge)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
1.2.1 Architecture of data mining
The architecture of a typical data mining system has the following components:
- Database, data warehouse, or other information repository
- Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request
- Knowledge base: the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns
- Data mining engine: a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis
- Pattern evaluation module: employs interestingness measures and interacts with the data mining modules in order to focus the search towards interesting patterns
- Graphical user interface: communicates between users and the data mining system, allowing the user to interact with the system by specifying a query or task, to supply information that helps focus the search, and to visualize the discovered patterns in different forms
1.2.2 Data mining versus query tools
Query tools and data mining tools are complementary. Normal queries can answer questions like "who bought which product on which date?", while data mining tools can answer questions like "what are the most important trends in customer behavior?", which are much harder to answer using SQL [1]. Of course, these questions could be answered with SQL by a process of trial and error, but it could take days or months to find an optimal segmentation of a large database, whereas a machine-learning algorithm can find the answer automatically in a much shorter time. Once the data mining tool has found a segmentation, you can use your query environment again to query and analyze the profiles found.
One could say that if you know exactly what you are looking for, use SQL; but if you know only vaguely what you are looking for, turn to data mining [1].
1.2.3 Data mining Functionalities
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.
Data mining functionalities, and the patterns they can discover, are as follows:
(a) Concept/class description (characterization and discrimination): Data can be associated with classes or concepts. Descriptions of a class or concept in summarized, concise, and yet precise terms are called class/concept descriptions [2]. These descriptions can be derived via (1) data characterization, (2) data discrimination, or (3) both data characterization and discrimination.
(b) Data characterization is a summarization of the general characteristics or features of a target class of data. The output of data characterization can be presented in various forms of charts and tables. The resulting descriptions can also be presented as generalized relations or in rule form (characteristic rules).
(c) Data discrimination is a comparison of the general features of the target class with those of one or more contrasting classes. Discrimination descriptions include comparative measures that help distinguish between the target and contrasting classes; when expressed in rule form they are referred to as discriminant rules.
(d) Association analysis: The discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. It is widely used for market basket or transaction data analysis.
(e) Classification and prediction: The process of finding a set of models that describe and distinguish data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model can be presented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks. Classification can also be used for predicting the class label of data objects.
(f) Cluster analysis: Analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.
(g) Outlier analysis: A database may contain data objects that do not comply with the general behavior or model of the data; these data objects are outliers. Most data mining methods discard outliers as noise or exceptions, but in some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
(h) Evolution analysis: Describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
1.2.4 Classification of data mining
Diverse disciplines contribute to data mining; hence data mining research is expected to generate a large variety of data mining systems. A clear classification is therefore needed to help users identify the systems that best match their needs. Data mining systems can be categorized according to various criteria, as follows.
(a) Classification according to the kinds of databases mined: systems can be classified according to different criteria such as the data models, the types of data, or the applications involved.
(b) Classification according to the kinds of knowledge mined: systems can be categorized based on the kind of knowledge mined, i.e., the data mining functionalities.
(c) Classification according to the kinds of techniques utilized: systems can be described according to the degree of user interaction (e.g., autonomous, interactive exploratory, or query-driven systems) or by the methods of data analysis employed (e.g., database- or data-warehouse-oriented techniques).
(d) Classification according to the applications adapted: different applications such as finance, telecommunications, DNA analysis, stock markets, and e-mail require the integration of application-specific methods.
1.2.5 Practical problems of data mining
A lot of data mining projects get bogged down in a forest of problems such as:
- Lack of long-term vision: "what do we want from our files in the future?"
- Not all files are up to date: data vary greatly in quality
- Struggles between departments: they may not want to give up their data
- Poor cooperation from the electronic data processing department: "just give us the queries and we will find the information you want"
- Legal and privacy restrictions: some data cannot be used for legal reasons
- Files that are hard to connect for technical reasons: there is a discrepancy between a hierarchical and a relational database, or data models are not up to date
- Timing problems: files can be compiled centrally, but with a six-month delay
- Interpretation problems: the meaning or usage of the data is unknown
1.2.6 Data mining issues and ethics
The usage of data, particularly data about people has serious ethical implications,
and practitioners of data mining techniques must act responsibly by making
themselves aware of the ethical issues that surround their particular application.
When applied to people, data mining is frequently used to determine who gets the
loan, the special offer, and so on. Certain kinds of discrimination, such as racial, sexual,
or religious discrimination, are not only unethical but also illegal. However, the situation
is complex because it depends on the application. Using such information for
medical diagnosis is certainly ethical, but using the same information when
mining loan payment behavior is not. Even when sensitive information is
discarded, there is a risk that models will be built that rely on variables that can
be shown to substitute for racial or sexual characteristics. For example, people
frequently live in areas that are associated with particular ethnic identities, and so
using an area code in a data mining study runs the risk of building models that are
based on race even though racial information has been explicitly excluded from
the data.
1.3 What is multivariate data mining?
Multivariate data can be defined as a set of entities E, where the ith element of E
consists of a vector with 'n' variables. Each variable may be independent or
interdependent with one or more of the other variables.
An n-dimensional dataset E comprises elements Ei = (xi1, xi2, ..., xin). Each observation xij may be independent of or interdependent on one or more of the other observations. Observations may be discrete or continuous in nature, or may take on nominal values.
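For concreteness, a single element Ei of such a dataset could be represented in code as follows. This is only an illustrative sketch (the class is not part of the thesis software); all variables are stored as numeric values, with nominal variables assumed to have been coded numerically beforehand.

```java
/**
 * A minimal representation of one element Ei = (xi1, xi2, ..., xin) of an
 * n-dimensional multivariate dataset.
 */
public class Observation {
    private final double[] values;          // xi1 ... xin

    public Observation(double... values) {
        this.values = values.clone();
    }

    public int dimension() {
        return values.length;
    }

    public double get(int j) {              // j = 0 ... n-1
        return values[j];
    }
}
```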
Multivariate data is difficult to visualize effectively because of:
- Dimensional constraints: it is difficult to visualize data in more than three dimensions
- Size of the data set and occlusion: data patterns are difficult to find
- Saturation: data visualization is difficult
- Scarcity: too few data points to find patterns
Examples of the types of multivariate data:
- Data with a physical interpretation, such as geographical data
- A sequence of time-varying information, such as stock prices
Multivariate data analysis can be used for any table of data; even a table with a few rows and many columns can be converted into a few meaningful plots that display the real information in the data in a way that is easy to understand.
Typical applications:
- Quality control and quality optimization (food, beverages, paints, drugs)
- Process optimization and process control
- Development and optimization of measurement methods
- Prospecting for oil, ore, water, minerals, etc.
- Classification of bacteria, viruses, tissues, and other medical specimens
- Analysis of economic and administrative tables
- Design of new drugs
1.3.1 When is multivariate analysis used?
"Variate" refers to variables, and "multi" means several or many. Multivariate analysis is appropriate whenever the data consist of two or more variables observed for a number of individuals. The result is often called a "data set." It is customary to envision a data set as being composed of rows and columns. The rows pertain to each observation, such as each person or each completed questionnaire in a large survey. The columns pertain to each variable, such as a response or an observed characteristic for each person.
Rows: records; individuals; cases; respondents; subjects; patients; etc.
Columns: fields; variables; characteristics; responses; etc.
Data sets can be immense; a single study may have a sample size of 1,000 respondents, each answering 100 questions. Here the data set would be 1,000 by 100, or 100,000 cells of data. Hence, the need for summarization is evident. Simple univariate or bivariate statistics are not enough: if an average were computed for each variable, 100 means would result, and if all pairwise correlations were computed, there would be close to 5,000 separate values.
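These counts follow directly from the dimensions of the survey; as a small worked calculation (not part of the original text):

1000 \times 100 = 100{,}000 \text{ cells}, \qquad 100 \text{ means}, \qquad \binom{100}{2} = \frac{100 \cdot 99}{2} = 4950 \approx 5000 \text{ pairwise correlations}.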
Cluster analysis might yield five clusters. Multiple regression could identify six significant predictor variables. Multiple discriminant analysis perhaps would find seven significant variables, and so on. It should be evident that parsimony can be achieved by using multivariate techniques when analyzing most data sets. Another reason for using multivariate techniques is that they automatically assess the significance of all linear combinations of the observed variables.
1.4 What is k-means mining?
The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's center of gravity.
1.4.1 How does k-means work?
The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean (typically, the squared-error criterion is used). The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges.
1.4.2 Why k-means is not enough?
The k-means method can be applied only when the mean of a cluster is defined.
This may not be the case in some applications, such as when data with categorical
attributes are involved. The necessity for users to specify k, the number of
clusters, in advance can be seen as a disadvantage. The K-means method is not
suitable for discovering clusters with nonconvex shapes or clusters of very
different size. Moreover, it is sensitive to noise and outlier data points since a
small number of such data can substantially influence the mean value.
1.5 Motivation
To get a sense of how adding multivariate data mining can enrich a pattern sequence, let us look at the same areas in which general sequential patterns are useful. Quality control is one of the areas in which multivariate data mining is used.
For example, eight different properties are measured on products as part of quality control before delivery. You have a table with these eight values measured on one hundred and fourteen product samples from the last year. The quality manager wants to answer questions like: Are there any trends? Are the eight properties related to each other, and if so, how? Was there a difference when the new production process was started six months ago? Is there any relation between the product quality and the values of the sixteen process variables? Can we improve the process? Do we have to measure all eight properties to guarantee good products?
In effect, this is achieved by using data mining algorithms such as k-means clustering. K-means clustering consists of distance calculations between cluster centroids and patterns. As the number of patterns and the number of centroids increase, the time needed to complete the computations increases. This computational load requires high-performance computers and/or algorithmic improvements.
My research proposes a method that combines the k-means algorithm with indexes. This provides a better-localized result, with less computation and without any loss of information. In the above example, we can index the dataset by product and then run the k-means algorithm over each product to find meaningful information. The thesis investigated whether an indexed k-means method is possible for a multivariate dataset with static variables, whether better knowledge can be acquired using this process than with the regular k-means method, and whether the algorithm holds up in practice.
1.5.1 How can we achieve the indexed k-means method?
The dataset was indexed using a static variable with a small number of discrete values; for each value of that static variable, the corresponding subset of the dataset was extracted and the regular k-means algorithm (implemented in Java) was run over it to find the information in the dataset. For comparison, the whole dataset was taken without indexing and the k-means algorithm was run over it.
1.6 Research Contribution
This thesis addresses a problem that has not been looked at before, namely how to combine the k-means algorithm with indexing. The thesis proposes to use this indexed k-means algorithm and to compare the information gathered with that obtained by the k-means method.
1.6.1 Assumptions
1. We assume that knowledge is stored in only 5 attributes or dimensions, even though the algorithm supports more dimensions. With more dimensions, more computation is required.
2. We also assume that within each unordered dimension, only one value may be present in a database record. For example, if the additional dimension refers to the fish ID, then there cannot be two different fish IDs associated with a record.
3. Noise and outlier data points have been discarded in the preprocessing of the dataset, since a small number of such points can substantially influence the mean value. These data points may or may not contain useful information.
4. Datasets from a particular year have been taken to reduce the computations. However, some knowledge may have been lost in that process.
1.7 Thesis organization
This thesis is organized as follows. In Chapter 2, related work is discussed, including other approaches for finding knowledge and research done in the area of multivariate data analysis. Included here is an in-depth discussion of the two approaches, k-means clustering and index-based data mining, on which our proposed indexed k-means algorithm is based. Chapter 3 explains how these two approaches are integrated to form the new algorithm, and also presents the comparison algorithm, k-means. Chapter 4 shows the results of the knowledge analysis and possible optimizations. Chapter 5 concludes with a look at the future directions of this research.
Chapter 2
RELATED WORK
2.1 Data mining approaches
Data mining is a young interdisciplinary field, drawing from areas such as
database systems, data warehousing, statistics, machine learning, data
visualization, information retrieval, and high performance computing [4]. Other
contributing areas include neural networks, pattern recognition, spatial data
analysis, image databases, signal processing, probabilistic graph theory, and
inductive logic programming. Data mining needs the integration of approaches
from multiple disciplines [4].
Large sets of data analysis methods have been developed in statistics. Machine learning has also contributed substantially to classification and induction problems. Neural networks have shown their effectiveness in classification, prediction, and clustering analysis tasks. However, with increasingly large amounts of data stored in databases for data mining, these methods face challenges of efficiency and scalability. Efficient data structures, indexing, and data-accessing techniques developed in database research contribute to high-performance data mining. Many existing data analysis methods need to be re-examined, and set-oriented, scalable algorithms should be developed for effective data mining [4].
Another difference between traditional data analysis and data mining is that
traditional data analysis is assumption-driven in the sense that a hypothesis is
formed and validated against the data, whereas data mining in contrast is
discovery-driven in the sense that patterns are automatically extracted from data,
which requires substantial search efforts [4]. Therefore, high performance
computing will play an important role in data mining. Parallel, distributed, and
incremental data mining methods should be developed, and parallel computer
architectures and other high performance computing techniques should also be
explored in data mining.
Human eyes identify patterns and regularities in data sets or data mining results.
Data and knowledge visualization is an effective approach for the presentation of
data and knowledge, exploratory data analysis, and interactive data mining.
Data mining in data warehouses is one step beyond on-line analytical processing (OLAP) of data warehouse data [3]. By integrating OLAP and data cube technologies, an on-line analytical mining mechanism contributes to interactive mining of multiple abstraction spaces of data cubes.
2.2 Mining complex data in large data and information repositories [4]
Data mining is not confined to relational, transactional, and data warehouse data.
There are high demands for mining spatial, text, multimedia and time-series data,
and mining complex, heterogeneous, semi-structured and unstructured data,
including the web-based information repositories [5,6].
Complex data may require advanced data mining techniques. For example, for
object-oriented and object-relational databases, object-cube based generalization
techniques can be developed for handling complex structured objects, methods,
class/subclass hierarchies, etc. Mining can then be performed on the multidimensional abstraction spaces provided by object-cubes.
A spatial database stores spatial data, which represent points, lines, and regions, as well as non-spatial data, which represent other properties of spatial objects and their non-spatial relationships. A spatial data cube can be constructed that consists of both spatial and non-spatial dimensions and/or measures. Since a spatial measure may represent a group of aggregations that can produce a great number of aggregated spatial objects, it is impossible to pre-compute and store all such spatial aggregations. Therefore, selective materialization of aggregated spatial objects is a good tradeoff between storage space and online computation time [4].
Spatial data mining can be performed in a spatial data cube as well as directly in a spatial database. A multi-tier computation technique can be adopted in spatial data mining to reduce spatial computation. For example, when mining spatial association rules, one can first apply rough spatial computations, such as the minimal bounding rectangle method, to filter out most of the sets of spatial objects (e.g., those that are not spatially close enough), and then apply relatively costly, refined spatial computations only to the set of promising candidates.
Text analysis methods and content-based image retrieval techniques play an
important role in mining text and multimedia data, respectively. These techniques
can be integrated with data cube and data mining techniques for effective mining
of such types of data.
It is challenging to mine knowledge from the World-Wide-Web because of the
huge amount of unstructured and semi-structured data. However, Web access
patterns can be mined from the preprocessed and cleaned Web log records; hot
Web sites can be identified based on their access frequencies and the number of
links pointing to the corresponding sites.
2.3 Clustering Analysis
Cluster analysis identifies clusters embedded in the data, where a cluster is a collection of data objects that are "similar" to one another. Similarity can be expressed by distance functions, specified by users or experts. A good clustering method produces high-quality clusters, ensuring that the inter-cluster similarity is low and the intra-cluster similarity is high. For example, one may cluster the houses in an area according to their house category and geographical location.
Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples.
Conceptual clustering groups objects to form a class, described by a concept.
This differs from conventional clustering, which measures similarity based on
geometric distance. Conceptual clustering has two functions: (1) discovers the
appropriate classes; (2) forms descriptions for each class, as in classification. The
guideline of striving for high intraclass and low interclass similarity still applies.
Data mining research has been focused on high quality and scalable clustering
methods for large databases and multidimensional data warehouses.
An example of clustering is what most people perform when they do laundry: grouping permanent press, dry cleaning, whites, and brightly colored clothes, which is useful because the items in each group have similar characteristics. It turns out that these clusters share important common attributes in the way they behave when washed. Clustering appears straightforward but is, of course, difficult to do well, and clusters are often more dynamic than such fixed groupings.
An intuition for the nearest-neighbor prediction algorithm comes from looking at the people in a neighborhood: it may be noticed that, in general, they all have somewhat similar incomes, although there may still be a wide variety of incomes even among your closest neighbors. The nearest-neighbor prediction algorithm works in much the same way, except that nearness in a database may be defined by a variety of factors; it also performs quite well in terms of automation, because many of the algorithms are robust with respect to dirty and missing data.
The nearest neighbor prediction algorithm simply stated is as follows: "Objects
that are 'near' each other will also have similar prediction values. Thus, if you
know the prediction value of one of the objects, you can predict it from its
nearest neighbors [7]."
The typical requirements of clustering in data mining are:
- Scalability: clustering on a sample of a given large data set may lead to biased results, so highly scalable clustering algorithms are needed
- Ability to deal with different types of attributes: many algorithms are designed to cluster interval-based (numerical) data, but applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of data types
- Discovery of clusters with arbitrary shape: clusters can be of any shape, so it is important to develop algorithms that can detect clusters of arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters: the clustering results can be sensitive to input parameters
- Ability to deal with noise: some clustering algorithms are sensitive to missing, unknown, outlier, or erroneous data, which can lead to clusters of poor quality
- Insensitivity to the order of input records: some clustering algorithms are sensitive to the order of the input data
- High dimensionality: it is challenging to cluster data objects in high-dimensional space, especially considering that such data can be very sparse and highly skewed
- Constraint-based clustering: applications may need to perform clustering under various kinds of constraints; a challenging task is to find groups of data with good clustering behavior that satisfy specified constraints
- Interpretability and usability: clustering needs to be tied to specific semantic interpretations and applications, and it is important to study how an application goal may influence the selection of clustering methods
There are many clustering techniques, organized into the following categories: partitioning, hierarchical, density-based, grid-based, and model-based methods. Clustering can also be used for outlier detection.
2.3.1 Partitioning methods
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object; (2) each object must belong to exactly one group.
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
To achieve global optimality, partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of two popular heuristic methods: (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster; (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partitioning-based methods need to be extended.
2.3.1.1 k-means algorithm
Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, often called a similarity function, such as distance, so that the objects within a cluster are "similar," whereas the objects of different clusters are "dissimilar" in terms of the database attributes.
Algorithm: k-means, partitioning based on the mean value of the objects in the cluster.
Input: the number of clusters k and a database containing n objects.
Output: a set of k clusters that minimizes the squared-error criterion.
Method:
(1) Arbitrarily choose k objects as the initial cluster centers
(2) Repeat
(3) (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
(4) Update the cluster means, i.e., calculate the mean value of the objects for each cluster
(5) Until no change occurs
The k-means algorithm takes the input parameter, k and partitions a set of n
objects into k clusters so that it results in high intracluster and low intercluster
similarity. Cluster similarity is measured in regard to the mean value of the objects
in a cluster, which can be viewed as the cluster's center of gravity.
"How does the k-means algorithm work?" The k-means algorithm proceeds as
follows. First, it randomly selects k of the objects, each of which initially
represents a cluster mean or center. For each of the remaining objects, an object
is assigned to the cluster to which it is the most similar, based on the distance
between the object and the cluster mean. It then computes the new mean for
each cluster. This process iterates until the criterion function converges. Typically,
the squared-error criterion is used, which is defined as
E = Eki=1 ki=1 pCi|p-mi|2,
Where E is the sum of square-error for all the objects in the database, p is the
point in space representing a given object, and mi is the mean of cluster Ci (both
p and mi are multidimensional). This criterion tries to make the resulting k
clusters as compact and as separate as possible.
The algorithm attempts to determine k partitions that minimize the squared-error
function. It works well when the clusters are compact clouds that are rather well
separated from one another. The method is relatively scalable and efficient in
processing large data sets because the computational complexity of the algorithm
is O (nkt), where n is the total number of objects, k is the number of clusters, and
t is the number of iterations. Normally, k <<n and t<<n. The method often
terminates at a local optimum.
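To make the loop concrete, the following is a minimal, self-contained Java sketch of the procedure just described. It is illustrative only and is not the thesis implementation of Appendix A; squared Euclidean distance is assumed as the dissimilarity measure, and an empty cluster simply keeps its previous mean.

```java
import java.util.Random;

public class KMeansSketch {

    /** Clusters n points into k groups; returns the cluster index assigned to each point. */
    public static int[] cluster(double[][] points, int k, Random rnd) {
        int n = points.length, d = points[0].length;
        double[][] means = new double[k][];
        // (1) Arbitrarily choose k objects as the initial cluster centers.
        for (int c = 0; c < k; c++) {
            means[c] = points[rnd.nextInt(n)].clone();
        }
        int[] assignment = new int[n];
        boolean changed = true;
        while (changed) {                       // (2) repeat ... (5) until no change occurs
            changed = false;
            // (3) (Re)assign each object to the cluster whose mean is nearest.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(points[i], means[c]) < dist2(points[i], means[best])) best = c;
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // (4) Update the cluster means.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < d; j++) means[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return assignment;
    }

    /** Squared Euclidean distance |p - m|^2, the per-object term of the criterion E. */
    static double dist2(double[] p, double[] m) {
        double s = 0.0;
        for (int j = 0; j < p.length; j++) {
            double diff = p[j] - m[j];
            s += diff * diff;
        }
        return s;
    }
}
```

Each iteration performs O(nk) distance computations, consistent with the overall O(nkt) complexity noted above.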
The k-means method, however, can be applied only when the mean of a cluster is
defined. This may not be the case in some applications, such as when data with
categorical attributes are involved. The necessity for users to specify k (number of
clusters) in advance can be seen as a disadvantage. The k-means method is not
suitable for discovering clusters with non-convex shapes or clusters of very
different size. Moreover, it is sensitive to noise and outlier data points since a
small number of such data can substantially influence the mean value.
There are a few variants of the k-means method. These differ in the selection of
the initial k means, the calculation of dissimilarity, and strategies for calculating
cluster means. An interesting strategy that often yields good results is to first
apply a hierarchical agglomeration algorithm to determine the number of clusters,
find initial clusters, and then use iterative relocation to improve them.
Another variant to k-means is the k-modes method, which extends the k-means
paradigm to cluster categorical data by replacing the means of clusters with
modes, using new dissimilarity measures to deal with categorical objects and a
frequency-based method to update the modes of clusters. The k-means and k-modes methods can be integrated to cluster data with mixed numeric and categorical values, resulting in the k-prototypes method.
"How can we make the k-means algorithm more scalable?" A recent effort on
scaling the k-means algorithm is based on the idea of identifying three kinds of
regions in data: regions that are compressible, regions that must be maintained in
main memory, and regions that are discardable. An object is compressible if it is
not discardable but belongs to a tight sub cluster. A data structure known as a
clustering feature is used to summarize objects that have been discarded or
compressed. If an object is neither discardable nor compressible, then it should
be retained in main memory. To achieve scalability, the iterative clustering
algorithm only includes the clustering features of the compressible objects and
the objects that must be retained in main memory, thereby turning a secondary-memory-based algorithm into a main-memory-based algorithm.
[Flowchart: Start → initialize the centers of the k clusters → evaluate the cluster assignment of the vectors → compute new cluster centers with respect to the new cluster assignment → evaluate the new cluster assignments → if the new assignments differ from the previous ones, repeat; otherwise stop.]
Fig 2.1 Flowchart for k-means Algorithm
2.3.1.2 K-medoids method
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data. "How might the algorithm be modified to diminish such sensitivity?" Instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster. Thus the partitioning method can still be performed, based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. This forms the basis of the k-medoids method.
The strategy of the k-medoids clustering algorithm is to find k clusters in n objects by first arbitrarily choosing a representative object (the medoid) for each cluster. Each remaining object is clustered with the medoid to which it is the most similar. The strategy then iteratively replaces one of the medoids by one of the non-medoids as long as the quality of the resulting clustering is improved. This quality is estimated using a cost function that measures the average dissimilarity between an object and the medoid of its cluster. To determine whether a non-medoid object, o_random, is a good replacement for a current medoid, o_j, the following four cases are examined for each of the non-medoid objects p.
Case 1: p currently belongs to medoid o_j. If o_j is replaced by o_random as a medoid and p is closest to one of the other medoids o_i, i ≠ j, then p is reassigned to o_i.
Case 2: p currently belongs to medoid o_j. If o_j is replaced by o_random as a medoid and p is closest to o_random, then p is reassigned to o_random.
Case 3: p currently belongs to medoid o_i, i ≠ j. If o_j is replaced by o_random as a medoid and p is still closest to o_i, then the assignment does not change.
Case 4: p currently belongs to medoid o_i, i ≠ j. If o_j is replaced by o_random as a medoid and p is closest to o_random, then p is reassigned to o_random.
Each time a reassignment occurs, the difference in squared error E contributes to the cost function, which calculates the change in squared-error value when a current medoid is replaced by a non-medoid object. The total cost of swapping is the sum of the costs incurred by all non-medoid objects. If the total cost is negative, then o_j is replaced with o_random, since the actual squared error would be reduced. If the total cost is positive, the current medoid o_j is considered acceptable and nothing is changed.
Algorithm: k-medoids, partitioning based on medoid or central objects.
Input: the number of clusters k and a database containing n objects.
Output: a set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
Method:
(1) Arbitrarily choose k objects as the initial medoids
(2) Repeat
(3) Assign each remaining object to the cluster with the nearest medoid
(4) Randomly select a non-medoid object, o_random
(5) Compute the total cost, S, of swapping o_j with o_random
(6) If S < 0 then swap o_j with o_random to form the new set of k medoids
(7) Until no change occurs
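As an illustration of step (5), the sketch below computes the total swapping cost S for one candidate replacement by summing, over all objects, the change in dissimilarity to the nearest medoid caused by the swap; this covers the four reassignment cases described above. It is a schematic Java sketch rather than the PAM implementation, with squared Euclidean distance assumed as the dissimilarity measure and hypothetical class and method names.

```java
public class KMedoidsSwap {

    /**
     * Total cost S of replacing the medoid at position j with a candidate
     * non-medoid object.  A negative S means the swap improves the clustering
     * and should be accepted.
     */
    public static double swapCost(double[][] objects, double[][] medoids, int j, double[] candidate) {
        double total = 0.0;
        for (double[] p : objects) {
            // Dissimilarity to the nearest medoid before the swap ...
            double before = Double.POSITIVE_INFINITY;
            for (double[] m : medoids) {
                before = Math.min(before, dissimilarity(p, m));
            }
            // ... and after medoid j has been replaced by the candidate.
            double after = dissimilarity(p, candidate);
            for (int i = 0; i < medoids.length; i++) {
                if (i != j) after = Math.min(after, dissimilarity(p, medoids[i]));
            }
            total += after - before;
        }
        return total;
    }

    /** Squared Euclidean distance, used here as the dissimilarity measure. */
    static double dissimilarity(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }
}
```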
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced. It attempts to determine k partitions for n objects. After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids. All of the possible pairs of objects are analyzed, where one object in each pair is considered a medoid and the other is not. The quality of the resulting clustering is calculated for each such combination. A medoid, o_j, is replaced with the object causing the greatest reduction in squared error. The set of best objects for each cluster in one iteration forms the medoids for the next iteration. For large values of n and k, such computation becomes very costly.
"Which method is more robust-k-means or k-medoids?" The k-medoids method
is more robust than k-means in the presence of noise and outliers because a
32
medoid is less influenced by outliers or other extreme values than a mean.
However, its processing is more costly than the k-means method. Both methods
require the user to specify k, the number of clusters.
2.3.2 Hierarchical methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group and successively merges the objects or groups that are close to one another. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster; in each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs, since the method does not have to worry about a combinatorial number of different choices. However, such techniques cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, as in CURE and Chameleon; (2) integrate hierarchical agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm and then refining the result using iterative relocation, as in BIRCH.
2.3.3 Density-based methods
Most partitioning methods cluster objects based on the distance between objects.
Such methods can find only spherical-shaped clusters and encounter difficulty in
discovering clusters of arbitrary shapes. Clustering methods have been developed
based on the notion of density. The general idea is to continue growing the given
cluster as long as the density (number of objects) in the "neighborhood" exceeds
some threshold; that is, for each object within a given cluster, the neighborhood
of a given radius will contain at least a minimum number of points. Such a
method can be used to filter out noise and discover clusters of arbitrary shape.
DBSCAN is a typical density-based method that grows clusters according to
a density threshold. OPTICS is a density-based method that computes an
augmented clustering ordering for automatic and interactive cluster analysis.
2.3.4 Grid-based methods
Grid-based methods quantize the object space into a finite number of cells that
form a grid structure. All of the clustering operations are performed on the grid
structure (quantized space). The main advantage of this approach is its fast
processing time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized space.
STING is a typical example of a grid-based method. CLIQUE and Wave-Cluster
are two clustering algorithms that are both grid and density based.
2.3.5 Model-based methods:
Model-based methods hypothesize a model for each of the clusters and find the
best fit of the data to the given model. A model-based algorithm may locate
clusters by constructing a density function that reflects the spatial distribution of
the data points. This also leads to a way of automatically determining the number
of clusters based on standard statistics, taking "noise" or outliers into account,
and thus yields robust clustering methods.
Some clustering algorithms integrate the ideas of several clustering methods, so
that it is sometimes difficult to classify a given algorithm as uniquely belonging to
only one clustering method category. Furthermore, some applications may have
clustering criteria that require the integration of several clustering techniques.
2.3.6 EM algorithm
The EM (Expectation Maximization) algorithm extends the k-means paradigm in
a different way. Instead of assigning each object to a dedicated cluster, it assigns
each object to a cluster according to a weight representing the probability of
membership. In other words, there are no strict boundaries between clusters.
Therefore, new means are computed based on weighted measures.
In this setting, we know neither the distribution that each training instance came
from nor the parameters of the mixture model. So we adopt the same procedure
used for the k-means clustering algorithm, and iterate: guess initial values for the
parameters (for a two-cluster, one-attribute mixture these are the two means, the
two standard deviations, and the mixing probability, five parameters in all), use
them to calculate the cluster probabilities for each instance, use these
probabilities to re-estimate the parameters, and repeat. This is the EM algorithm.
The first step, the calculation of the cluster probabilities (which are the
"expected" class values), is the "expectation"; the second, the calculation of the
distribution parameters, is the "maximization" of the likelihood of the
distributions given the data.
Adjustments must be made to the parameter estimation equations to account for
the fact that it is only cluster probabilities, not the clusters that are known for
each instance. These probabilities act like weights. If wi is the probability that
instance i belongs to cluster A, then the mean μA and variance σA² (or standard
deviation σA) of cluster A are

μA = (w1x1 + w2x2 + ... + wnxn) / (w1 + w2 + ... + wn)

σA² = (w1(x1 − μA)² + w2(x2 − μA)² + ... + wn(xn − μA)²) / (w1 + w2 + ... + wn)
where the xi are all of the instances, not just those belonging to cluster A. Now
consider how to terminate the iteration. The k-means algorithm stops when the
classes of the instances do not change from one iteration to the next, which means
that a "fixed point" has been reached. The EM algorithm converges toward a fixed
point but never actually gets there. Despite that, we can see how close it is by
calculating the overall likelihood that the data came from these distributions,
given the values of the parameters. This overall likelihood is obtained by multiplying the
probabilities of the individual instances i:

∏i (pA Pr[xi | A] + pB Pr[xi | B])
The probabilities given cluster A or B are determined from the normal
distribution function f(x; μ, σ). This overall likelihood is a measure of the
"goodness" of the clustering, and it increases at each iteration of the EM
algorithm. Again, there is a technical difficulty with equating the probability of a
particular value of x with f(x; μ, σ); in this case the effect does not disappear,
because no probability normalization operation is applied. The upshot is that the
likelihood expression is not a probability and does not necessarily lie between
zero and one; nevertheless, its magnitude still reflects the quality of the clustering.
In practice, logarithms are used: the implementation sums the logs of the
individual components, avoiding all the multiplications. The overall conclusion
still holds: you should iterate until the increase in log-likelihood becomes negligible.
For example, a practical implementation might iterate until the difference
between successive values of the log-likelihood is less than 10^-10 for ten
successive iterations. The log-likelihood may increase very sharply over the first
few iterations and then converge quickly to a point that is virtually stationary.
Although the EM algorithm is guaranteed to converge to a maximum, this is a
local maximum and may not necessarily be the same as the global maximum. For
a better chance of obtaining the global maximum, the whole procedure should be
repeated several times, with different initial guesses for the parameter values. The
overall log-likelihood figure can then be used to compare the different final
configurations obtained: just choose the largest of the local maxima.
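The following sketch shows one EM iteration for the two-cluster, one-attribute mixture described above (a hedged illustration: the variable names and the normal-density helper are assumptions, not code from this thesis). The E-step computes the weight wi = Pr[A | xi]; the M-step recomputes the weighted means, standard deviations, and the mixing probability; the log-likelihood is accumulated for the termination test.

class EmSketch {
    // One EM iteration for a two-component, one-dimensional Gaussian mixture.
    // params = { muA, sdA, muB, sdB, pA }; returns the log-likelihood under the
    // parameters that were in effect when the iteration started.
    static double emStep(double[] x, double[] params) {
        double muA = params[0], sdA = params[1], muB = params[2], sdB = params[3], pA = params[4];
        int n = x.length;
        double[] w = new double[n];        // w[i] = Pr[cluster A | x[i]]  (the "expectation" step)
        double logLik = 0.0;
        for (int i = 0; i < n; i++) {
            double a = pA * density(x[i], muA, sdA);
            double b = (1.0 - pA) * density(x[i], muB, sdB);
            w[i] = a / (a + b);
            logLik += Math.log(a + b);     // sum logs instead of multiplying probabilities
        }
        // "Maximization" step: weighted re-estimates of the parameters.
        double sumW = 0.0, meanA = 0.0, meanB = 0.0;
        for (int i = 0; i < n; i++) { sumW += w[i]; meanA += w[i] * x[i]; meanB += (1 - w[i]) * x[i]; }
        meanA /= sumW;
        meanB /= (n - sumW);
        double varA = 0.0, varB = 0.0;
        for (int i = 0; i < n; i++) {
            varA += w[i] * (x[i] - meanA) * (x[i] - meanA);
            varB += (1 - w[i]) * (x[i] - meanB) * (x[i] - meanB);
        }
        params[0] = meanA; params[1] = Math.sqrt(varA / sumW);
        params[2] = meanB; params[3] = Math.sqrt(varB / (n - sumW));
        params[4] = sumW / n;
        return logLik;                     // iterate until the increase becomes negligible
    }

    static double density(double x, double mu, double sd) {
        double z = (x - mu) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }
}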
2.4 Index-based algorithms:
Given a data set, the index-based algorithm uses multidimensional indexing
structures, such as R-trees or k-d trees, to search for neighbors of each object o
within radius d around that object. Let M be the maximum number of objects
within the d-neighborhood of an outlier. Therefore, once M + 1 neighbors of
object o are found, it is clear that o is not an outlier. This algorithm has a worst-case
complexity of O(k·n²), where k is the dimensionality and n is the number of
objects in the data set. The index-based algorithm scales well as k increases.
However, this complexity evaluation takes only the search time into account even
though the task of building an index, in itself, can be computationally intensive.
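A naive sketch of the neighbor-counting test described above (an illustration only; a real index-based method would answer the range query with an R-tree or k-d tree rather than a linear scan): the search for an object's d-neighborhood stops as soon as M + 1 neighbors have been found, since the object can then no longer be an outlier.

class OutlierSketch {
    // Returns true when object o has at most M other objects within radius d,
    // i.e. o is a distance-based outlier. The scan stops early at M + 1 neighbors.
    static boolean isOutlier(double[][] data, int o, double d, int M) {
        int neighbors = 0;
        for (int j = 0; j < data.length; j++) {
            if (j == o) continue;
            if (distance(data[o], data[j]) <= d) {
                neighbors++;
                if (neighbors > M) return false;   // M + 1 neighbors found: not an outlier
            }
        }
        return true;
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }
}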
2.5 Which Techniques to Use for Which Tasks [8]

Table 2.1 maps data mining techniques to the tasks for which each is suited. The
techniques compared are standard statistics, market basket analysis, memory-based
reasoning, genetic algorithms, cluster detection, link analysis, decision trees, and
neural networks; the tasks are classification, estimation, prediction, affinity
grouping, clustering, and description.

Table 2.1 Techniques and Tasks.
The choice of data mining technique to apply at a given point in the cycle
depends on the particular data mining task to be accomplished and on the data
available for analysis. The approach to selecting a data mining technique has two
steps:
• Translate the business problem to be addressed into a series of data mining tasks.
• Understand the nature of the available data in terms of the content and types of
the data fields and the structure of the relationships between records.
The data mining approach is mostly influenced by the following data characteristics:
• A preponderance of categorical variables
• A preponderance of numeric variables
• A large number of fields (independent variables) per record
• Multiple target fields (dependent variables)
• Variable-length records
• Time-ordered data
• Free-text data
2.6 Multidimensional Data Model
The multidimensional data model exists in the form of a data cube that allows data
to be modeled and viewed in multiple dimensions; it is defined by dimensions and facts.
Dimensions are the perspectives or entities according to which an organization
wants to keep records, like time, item, branch, and location in a sales store. Each
dimension may have a table associated with it, called a dimension table, which
further describes the dimension.
A multidimensional data model is typically organized around a central theme, like
sales, for instance. A fact table represents this theme, where facts are numerical
measures. The fact table contains the names of the facts, or measures, as well as
keys to each of the related dimension tables. Multidimensional models exist in the
form of a star schema, a snowflake schema, or a fact constellation schema.
Star Schema: The most common modeling paradigm is the star schema, in
which the data warehouse contains (1) a large central table (fact table) containing
the bulk of the data, with no redundancy, and (2) a set of smaller attendant
dimension tables, one for each dimension. The schema graph resembles a starburst,
with the dimension tables displayed in a radial pattern around the central fact
table.
Snowflake schema: The snowflake schema is a variant of the star schema
model, where some dimension tables are normalized, thereby splitting the data
into additional tables. The resulting schema graph forms a snowflake shape. The
difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space, because a
large dimension table can become enormous when the dimensional structure is
included as columns. However, the space saved is negligible in comparison to the
magnitude of the fact table, and the snowflake structure reduces the effectiveness of
browsing since more joins are needed to execute a query. Consequently, system
performance may be adversely impacted. Hence, the snowflake schema is
not as popular as the star schema in data warehousing design.
Fact constellation: Sophisticated applications may require multiple fact tables to
share dimension tables. This kind of schema can be viewed as a collection of
stars, hence is called a galaxy schema or a fact constellation.
A data warehouse collects information about subjects that span the entire
organization, such as customers, items, sales, assets, and personnel, and thus its
scope is enterprise-wide. For data warehouses, the fact constellation schema is
commonly used, since it can model multiple, interrelated subjects. A data mart, on
the other hand, is a departmental subset of the data warehouse that focuses on
selected subjects, and thus its scope is department-wide. For data marts, the star
or snowflake schemas are commonly used, since both are geared towards modeling
single subjects, although the star schema is more popular and efficient.
Chapter 3
ALGORITHM AND DATASET
This chapter discusses the design and implementation of indexed k-means
clustering on the fisheries database. For this purpose, the k-means clustering
technique is implemented. First, the dataset on which the clustering algorithm
operates is studied; second, the implementation of the indexed k-means and
k-means algorithms is discussed.
3.1: Different forms of knowledge:
The key issue in KDD is to realize that there is more information hidden in the
data than you are able to distinguish at first sight. In data mining we have four
different types of knowledge that can be extracted from the data:
1. Shallow knowledge: information that can be easily retrieved from databases
using a query tool such as Structured Query Language (SQL).
2. Multi-dimensional knowledge: information that can be analyzed using online
analytical processing (OLAP) tools. With OLAP tools you have the ability to
rapidly explore all sorts of clusterings and different orderings of the data, but it is
important to realize that most of the things you can do with an OLAP tool can
also be done using SQL. The advantage of OLAP tools is that they are optimized
for this kind of search and analysis; however, OLAP is not as powerful as data
mining, since it cannot search for optimal solutions.
3. Hidden knowledge: information that can be found relatively easily by using
pattern recognition or machine-learning algorithms. Again, one could use SQL to
find these patterns, but this would probably prove extremely time-consuming. A
pattern recognition algorithm can find regularities in a database in minutes or at
most a couple of hours, whereas it could take months to achieve the same result
using SQL.
4. Deep knowledge: information that is stored in the database but can only be
located if there is a clue that indicates where to look. Hidden knowledge lies in a
search space that a search algorithm can explore effectively; deep knowledge lies
in a search space containing only a tiny local optimum, with no indication of
where it sits, so a search algorithm could roam around indefinitely without
achieving any significant result. An example is encrypted information stored in a
database: it is almost impossible to decipher an encrypted message without the
key, which indicates that, for the present at any rate, there is a limit to what one
can learn.
3.2 Getting started:
The starting point for any data mining activity is the formation of a specific
information requirement related to a specific action, i.e., what do we want to
know and what do we want to do with this knowledge? Data mining is pointless
unless the finding of the knowledge is followed up with the appropriate actions.
A data-mining environment can be realized on many different levels using several
different techniques; the following list gives an indication of the steps that should
be taken to start a KDD project:
1. Make a list of requirements. For what purpose would a KDD environment be
realized? What are the criteria of success? How will success be measured?
2. Make an overview of existing hardware and software: networks, databases,
applications, servers, and so on.
3. Evaluate the quality of the available data. For what purpose was it collected?
4. Make an inventory of the available databases, both internal and external.
5. Is a data warehouse in existence? What kind of data is available? Can we zoom
in on details of operational data?
6. Formulate the knowledge that the organization needs, both now and in the
future, in order to be able to function optimally.
7. Identify the groups of knowledge workers or decision makers who are to apply
the results. What kinds of decisions will they need to take? Which patterns are
useful to them and which are not, both now and in the future?
8. Analyze whether the knowledge found can actually be used by the
organization. It is useless to distill client profiles from mailing files if, for technical
reasons, the mailing department cannot handle the selections found.
9. List the processes and transformations these databases have to go through
before they can be used.
3.3 KDD process:
The knowledge discovery process consists of six stages: data selection, cleaning,
enrichment, coding, data mining, and reporting.
3.3.1 Data Selection:
Once you have formulated your information requirements, the next logical step is
to collect and select the data you need. In most cases, this data will be stored in
operational databases used by the information systems in the organization.
However, gathering this information in a centralized database is not always an
easy task since it may involve low-level conversion of data, such as from flat file
to relational tables. The operational data used in different parts of the
organization varies in quality. Some databases are updated on a day-to-day basis;
others may contain information that dates back several years.
Therefore a data warehouse is an important aspect of the data mining process.
Although it is not essential to have a good data warehouse in operation to set up
a KDD activity, it is very helpful. A data warehouse presents a stable and reliable
environment in which to collect operational data.
3.3.2 Cleaning:
Once data is collected, the next stage is cleaning. Because the amount of pollution
in a dataset is not easily detectable, it is a good idea to examine the data first in
order to obtain a feeling for the possibilities, which is difficult with a large dataset.
When databases are very large, it is advisable to select some random samples and
analyze them to get a rough idea of what one can expect. For example, in an
organization the date of birth of a person may be stored correctly while the age
field is not. Before a data mining operation, one has to clean the data as much as
possible, and in most cases this can be done automatically. It is not realistic,
however, to expect to remove all the pollution in advance, since some anomalies in
the data will only be discovered during the data mining process itself. Checking
domain consistency needs to be carried out by programs that have deep semantic
knowledge of the attributes being checked. Most forms of pollution are produced
by the way the data is gathered in the field; removing this kind of pollution will
almost always involve re-engineering the business process.
3.3.3 Enrichment:
Once the data is cleaned, the next step is to enrich it. Additional databases may be
available on a commercial basis; these can provide information on a variety of
subjects, including demographic data such as the average prices of houses and
cars, the types of insurance people hold, and so on.
Matching the information from bought-in databases with the company's own
database can be difficult. A well-known problem is the reconstruction of family
relationships in databases: a company may buy a database containing
demographic data on people living in certain areas, but this information is of value
only if the family relationships between the individuals in the database can be
reconstructed. In a relational environment, this information can simply be joined
with the original data.
3.3.4 Coding:
By means of selection and projection in SQL, the data can be manipulated to obtain
a clean target table. Sometimes pollution in the data can be removed simply by
filtering out the polluted records. Suppose that some information concerning car
or house ownership is missing for some individuals in the database; if this lack of
information is distributed randomly over the database, removing those records
will not affect the type of clusters formed, so this can be done safely. On the
other hand, it is possible that some causal connection exists between the lack of
certain information and a certain type of customer, especially in situations where
fraud could be involved, such as in insurance records: some customers might
deliberately have given wrong information in order to obtain insurance coverage
for which they would not otherwise be eligible. Obviously, in such cases, removing
information will certainly affect the type of patterns found.
3.3.5 Data mining:
Data mining has three main task areas: knowledge engineering, classification, and
problem solving. There is no single best machine-learning or pattern recognition
technique; different tasks presuppose different kinds of techniques. A KDD
environment therefore supports these different types of techniques; such an
environment is called hybrid. The selection of a data mining algorithm depends
on the quality of the input and of the output, as well as on performance. The
efficiency of an algorithm lies both in its learning stage and in the actual
application stage.
3.3.6 Reporting:
The reporting stage combines two different functions:
• Analyzing the results of the pattern recognition algorithms
• Applying the results of the pattern recognition algorithms to data
The purpose is not only to inspect what has been learned, but also to apply the
classification and segmentation information that has been gathered. In many
cases, reporting can be done using traditional query tools for databases.
Nowadays, however, various new data visualization techniques are emerging,
ranging from simple scatter diagrams showing different clusters in a two-dimensional
way to complex interactive environments that enable us to fly over
landscapes containing information about data sets.
3.4 Data Sources:
One of the key steps in Knowledge Discovery in Databases [9] is to create a
suitable target data set for the data mining tasks. Data is stored in the Computer
Engineering Department (UMass Dartmouth) fisheries database; this in turn was
collected from NOAA. The database has the information about the whole
country regarding the available fish types, dates, location of fishing trips, along
with the quantities and market values of the fish caught.
The tables trip, subtrip, and subtrip_fish were identified as being of interest and
were examined. From these tables, attributes relevant to the data mining process
were extracted: fish_id, Landed_kg, Mkt_values, Lat_deg, Long_deg, and date.
After these fields were obtained from the tables, a data source containing all this
information was created as a view (containing 2656731 records). To reduce the
computation required for the clustering process, the dataset for the year 1989
(containing 275258 records) was extracted, while the other years' records were
ignored. For the analysis, data in the area around the Gulf of Maine were
considered; the dataset was therefore reduced to the Gulf of Maine by limiting
lat_deg and long_deg (41.5 to 45 N, 71 W to 65 W), which left 152606 records.
3.4.1 Data Modeling
A model is a description of the original historical database from which it was
built; this can be successfully applied to new data in order to make predictions
about missing values or to make statements about expected values [7]. A data
model captures a pattern in the database that describes an important aspect of the
data. It does not describe the entire database. In the hypothesis-testing style of
data mining, a mental model is what one starts with, but there is still a step that
must be taken before this mental model can be put to the test. This is where the
skills of a good analyst, who understands the available data and is well versed in
the design of decision-support database queries, statistical packages and data
mining tools, come in.
The thesis uses a data model that is multidimensional in character and, at the
same time, gives a better understanding of the dataset, which can be used for
practical purposes. Hence, the fields fish_id, landed_kg, Mkt_values, Lat_deg,
Long_deg and date were selected. This is a multidimensional dataset, but it can be
broken into the following domains.
Domain A: Fish ID - static - This provides the information regarding the types
of fish that can be harvested at some point or time depending on other domains.
Domain B: landed_kg, Mkt_values - Variables - This gives a measure to quantify
the output, and can calculate the profitability of the results.
Domain C: Lat_deg, Long_deg - static - geographical - This domain supplies the
locations for the whole dataset, hence providing a better understanding of the
fisheries.
Domain D: Month - static - partial inference of environment - This domain
captures an environmental factor that has to be taken into consideration for the
data mining task. Knowing the month of the year can help determine whether the
climate will be hot or cold, which is generally taken into consideration by
fishermen.
Domains A, C and D have indexing capabilities. Domain D was chosen, as it is
the most useful and relevant: indexing is practical for Domain D because it has
only 12 distinct values, while the other domains have many more; using them as
the index may not result in a realistic output.
The thesis models the data with rules of the form: IF month = 1 THEN cluster
(Lat_deg, Long_deg, fish_Id, Mkt_values, Landed_kg). One possible answer
would be: for the month of January, at a particular point (Lat_deg, Long_deg),
there is a good chance of finding a particular fish_id with a certain landed_kg and
market value. By knowing the month of the year, fishermen can go to that
particular spot and find this fish with the indicated landed weight and market
value.
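As a hedged sketch of how such a month-indexed subset could be pulled from the view before clustering (the attribute names follow the fields listed above, but the month column name, the query text, and the helper method are assumptions, not the thesis code):

import java.sql.*;

class IndexedExtractSketch {
    // Fetch the records for one index value (e.g. month = 1 for January) so that
    // they can be clustered separately. The column trip_month is hypothetical.
    static java.util.List<double[]> fetchMonth(Connection conn, int month) throws SQLException {
        String sql = "select lat_deg, long_deg, fish_id, mkt_values, landed_kg "
                   + "from thesis_data where trip_month = ?";
        java.util.List<double[]> rows = new java.util.ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, month);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new double[] { rs.getDouble(1), rs.getDouble(2),
                                            rs.getDouble(3), rs.getDouble(4), rs.getDouble(5) });
                }
            }
        }
        return rows;   // this subset is then clustered with an ordinary k-means pass
    }
}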
3.4.2 Preprocessing
There are a number of data preprocessing techniques. While data cleaning can be
applied to remove noise and correct inconsistencies in the data, data integration
merges data from multiple sources into a coherent data store, such as a data
warehouse or a data cube. Data transformations, such as normalization, may be
applied to the data. Data reduction can reduce the data size by aggregating,
eliminating redundant features, or clustering, for instance. These data processing
techniques, when applied prior to mining, can substantially improve the overall
quality of the patterns mined and/or the time required for the actual mining.
Manual segregation of data based on indices is required at this stage to allow
processing of data on specific indices.
3.4.3 Data Cleaning:
Missing values: The database did not have many missing values, hence not much
cleaning was necessary in the dataset.
Noisy data: Noise is a random error or variance in a measured variable. Given the
numeric attributes mkt_values and landed_kg, how can the data be smoothed to
remove the noise? Normalization was applied to these attributes so that their
values fall within a small specified range, such as 0.0 to 1.0. The number of
records in the different ranges was then counted; the major portion of the data
was found to lie between 0.0 and 0.01, and the very small number of records in
the range (0.01, 1) was deleted. The ranges were reduced in this way to find a
best fit, and the remaining outlying records were deleted. This decreased the
number of records for the data mining task to 122083 and reduced the date
variable to a month variable.
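A small sketch of the kind of min-max scaling described above (an illustration only; the method names and the cut-off handling are assumptions, not the thesis code):

class NormalizeSketch {
    // Min-max normalization of one numeric attribute (e.g. mkt_values or landed_kg)
    // into the range [0.0, 1.0].
    static double[] minMax(double[] values) {
        double min = values[0], max = values[0];
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] scaled = new double[values.length];
        double range = max - min;
        for (int i = 0; i < values.length; i++) {
            scaled[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    // Count how many scaled values fall below a cut-off such as 0.01; the small
    // number of records above the cut-off were treated as outlying and removed.
    static int countBelow(double[] scaled, double cutoff) {
        int count = 0;
        for (double v : scaled) if (v < cutoff) count++;
        return count;
    }
}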
3.5 Algorithm for indexed k-means
The algorithm is an integration of indexing and k-means clustering. It can also be
called an associated clustering technique, for the result is of the kind: if month =
February, then one gets a cluster of information.
Definition:
The method initially takes the dataset for one value of the indexed column and a
number of components of the population equal to the final required number of
clusters for that indexed value. In this step, the initial cluster centers are chosen so
that the points are mutually farthest apart. Next, the method examines each
component in the population and assigns it to one of the clusters depending on
the minimum distance. The centroid's position is recalculated every time a
component is added to the cluster, and this continues until all the components are
grouped into the final required number of clusters, grouped by indexed field.
Input: The number of clusters ki, where i is the indexed field (month), and a
database containing n objects.
Output: A set of ki clusters that minimizes the squared-error criterion
Algorithm:
(1) Divide the dataset into multiple datasets in such a way that each dataset has a
common domain value (index).
(2) Arbitrarily choose ki objects as initial cluster centers.
(3) (Re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster.
(4) Update the cluster means, i.e., calculate the mean value of the objects for each
cluster.
(5) Repeat steps (3) and (4) until no change.
(6) Repeat steps (2) through (5) for each unique index field.
[Flowchart: select an index column for clustering; extract the dataset for the
indexed column; initialize the centers of the k clusters; evaluate the cluster
assignment of the vectors; compute new cluster centers with respect to the new
cluster assignment; evaluate the new cluster assignments; if the new assignments
differ from the previous ones, repeat; otherwise continue with the next index
field, stopping when no index fields remain.]
Fig. 3.1 Flowchart for Indexed K-means Algorithm
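The following sketch (a simplified illustration, not the program in Appendix A) expresses the algorithm and flowchart above in code: the records are partitioned by the index value, and an ordinary k-means pass is run on each partition.

import java.util.*;

class IndexedKMeansSketch {
    // Indexed k-means: partition the records by the index attribute (here, month),
    // then run ordinary k-means on each partition separately.
    static Map<Integer, double[][]> indexedKMeans(Map<Integer, List<double[]>> byMonth, int k) {
        Map<Integer, double[][]> centersPerMonth = new TreeMap<>();
        for (Map.Entry<Integer, List<double[]>> e : byMonth.entrySet()) {
            centersPerMonth.put(e.getKey(), kMeans(e.getValue(), k));
        }
        return centersPerMonth;
    }

    // Plain k-means on one subset: assign each object to the nearest center,
    // recompute the centers as cluster means, and repeat until nothing changes.
    static double[][] kMeans(List<double[]> data, int k) {
        int dim = data.get(0).length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = data.get(c).clone();   // arbitrary initial centers
        int[] assign = new int[data.size()];
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < data.size(); i++) {                     // assignment step
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = distance(data.get(i), centers[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            double[][] sums = new double[k][dim];                       // update step
            int[] counts = new int[k];
            for (int i = 0; i < data.size(); i++) {
                counts[assign[i]]++;
                for (int j = 0; j < dim; j++) sums[assign[i]][j] += data.get(i)[j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;                           // an empty cluster keeps its old center
                for (int j = 0; j < dim; j++) centers[c][j] = sums[c][j] / counts[c];
            }
        }
        return centers;
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }
}

The design point is that each monthly subset is clustered independently, so a month with relatively few records still produces its own ki clusters instead of being absorbed into clusters dominated by busier months.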
3.6 Discovery of Interesting patterns:
A data mining system can uncover thousands of patterns. Many of the patterns
discovered may be uninteresting to the given user, representing common
knowledge or lacking novelty. Several challenges remain regarding the
development of techniques to assess the interestingness of discovered patterns,
particularly with regards to subjective measures that estimate the value of patterns
with respect to a given user class, based on user beliefs or expectations. The use
of interestingness measures to guide the discovery process and reduce the search
space is another active area of research.
3.6.1 Interestingness measures:
Although specification of the task-relevant data and of the kind of knowledge to
be mined may substantially reduce the number of patterns generated, a data
mining process may still generate a large number of patterns. Typically, only a
small fraction of these patterns will actually be of interest to the given user. Thus,
users need a way to further confine the number of uninteresting patterns produced
by the process. This can be achieved by specifying interestingness measures that
estimate the simplicity, certainty, utility and novelty of patterns.
Some objective measures of pattern interestingness are based on the structure of
patterns and the statistics underlying them. In general, each measure is associated
with a threshold that can be controlled by the user. Rules that do not meet the
threshold are considered uninteresting, and hence are not presented to the user as
knowledge.
Simplicity: A factor contributing to the interestingness of a pattern is the
pattern’s overall simplicity for human comprehension. Objective measures of
pattern simplicity can be viewed as functions of the pattern structure, defined in
terms of the pattern size in bits or the number of attributes or operators
appearing in the pattern.
Rule length, for instance, is a simplicity measure. For rules expressed in
conjunctive normal form (i.e., as a set of conjunctive predicates), rule length is
typically defined as the number of conjuncts in the rule.
Certainty: Each discovered pattern should have a measure of certainty associated
with it that assesses the validity or “trustworthiness” of the pattern. A certainty
measure for clustering rules is the percentage of occurrence. Given a task-relevant
set of data tuples, the confidence of the cluster is defined as

Confidence(cluster) = (number of tuples containing the cluster) / (total number of tuples)
Utility: The potential usefulness of a pattern is a factor defining its
interestingness. It can be estimated by a utility function, such as support. The
support of a clustering pattern refers to the percentage of task-relevant data
tuples for which the pattern is true.
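As a small hypothetical illustration of these measures (the membership-by-radius test below is an assumption, not a definition from the thesis), the same fraction can be read as the confidence of a clustering rule or, over the task-relevant tuples, as its support:

class MeasureSketch {
    // Fraction of the task-relevant tuples that fall inside a discovered cluster,
    // taking a tuple as "inside" when it lies within a given radius of the center.
    static double confidence(double[][] tuples, double[] center, double radius) {
        int matching = 0;
        for (double[] t : tuples) {
            if (distance(t, center) <= radius) matching++;
        }
        return (double) matching / tuples.length;   // tuples containing the cluster / total tuples
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}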
Novelty: Novel patterns are those that contribute new information or increased
performance to the given pattern set. For example, a data exception may be
considered novel in that it differs from the data expected based on a statistical
model or user beliefs. Another strategy for detecting novelty is to remove
redundant patterns. If a discovered rule can be implied by another rule that is
already in the knowledge base or in the derived rule set, then either rule should be
reexamined in order to remove the potential redundancy.
3.7 Presentations and Visualization of Discovered Patterns
For data mining to be effective, data mining systems should be able to display the
discovered patterns in multiple forms, such as rules, tables, cross tabulations, pie
or bar charts, decision trees, cubes, or other visual representations. Allowing the
visualization of discovered patterns in various forms can help users with different
backgrounds to identify patterns of interest and to interact or guide the system in
further discovery. A user should be able to specify the forms of presentation to
be used for displaying the discovered patterns.
The use of concept hierarchies plays an important role in aiding the user to
visualize the discovered patterns. Mining with concept hierarchies allows the
representation of discovered knowledge in high-level concepts, which may be
more understandable to users than rules expressed in terms of primitive or raw
data, such as functional or multivalued dependency rules, or integrity constraints.
Furthermore, data mining systems should employ concept hierarchies to
implement drill-down and roll-up operations, so that users may inspect
discovered patterns at multiple levels of abstraction. In addition, pivoting, slicing
and dicing operations aid the user in viewing generalized data and knowledge
from different perspectives. A data mining system should provide such
interactive operations for any dimension, as well as for individual values of each
dimension.
3.8 Implementation Tools and software
For a successful data mining implementation, a variety of tools were used to aid
the thesis investigation. Java was used as the programming language to implement
the k-means algorithm. The datasets were residing on the Oracle 9i database. MS
Access was used as the database for a small test dataset, to test the Java program.
Open Database Connectivity (ODBC) data sources were employed to access data
from a variety of database management systems (Oracle). JDBC was utilized to
access the database from the Java programming language. SQL was used to
communicate with the database. Toad was used for data extraction and for
executing queries. WEKA and SPSS were applied for cross-verifying the results
obtained by the clustering program. Chartist-Pro was employed to generate
flowcharts. Matlab was used to create a 3-dimensional graph of the output.
Windows 2000 Server and Sun Solaris 5.8 were used as the operating systems.
Experiments were finally run on a standard PC with an x86-based processor.
Chapter 4
RESULTS AND ANALYSES
This chapter discusses the results of the indexed k-means and k-means clustering
techniques that were applied to the fisheries’ database. The results were plotted
in a graph. The patterns were studied and analyzed.
4.1 Experimental Results:
4.1.1 Output for Indexed k-means:
The indexed k-means clustering technique was applied to the fisheries database,
and the outputs were tabulated and plotted in 3-dimensional scatter plots as
shown below. The output had more than 3 dimensions; since it is not possible to
show a graphical output for more than 3 dimensions, the outputs were plotted as
a combination of different graphs, each plot giving information according to the
user's requirements.
4.1.2 Output for k-means:
K-means algorithm clustering techniques were used on the fisheries’ database
without indexing. The outputs were tabulated as shown below.
Fish ID   Lat Deg   Long Deg   Market Values   Landed KG   Month
147       43.0833   69.75      223.565         79.536      9
120       41.9167   67.5833    432.724         174.549     10
96        43.25     70.25      49.234          28.895      9
124       42.25     70.25      64.289          27.025      6
12        43.4167   70.0833    90.674          30.761      12
81        43.4167   68.5833    234.942         125.793     1
153       41.75     69.75      281.423         137.360     3
512       42.5833   70.4167    71.812          50.767      5
81        42.75     70.25      107.275         62.107      10
81        43.25     68.75      288.038         154.691     7
81        42.25     70.25      155.564         117.831     6
123       42.25     70.25      91.283          44.449      5
122       43.25     70.0833    394.811         99.386      3
124       43.25     69.25      396.156         140.946     11
147       41.5833   67.0833    484.324         179.784     1
153       42.4167   70.25      100.355         49.554      1
12        42.25     70.25      59.297          23.191      1
123       41.75     70.25      184.124         81.466      6
269       43.4167   69.9167    151.868         91.582      11
120       41.75     69.75      105.3           38.068      2
122       43.75     68.25      233.755         75.082      10
124       43.75     69.75      163.612         69.004      1
120       42.25     70.25      46.595          19.190      3
269       41.75     69.75      108.031         60.321      4
122       43.25     70.25      203.809         70.699      1
4.2 Analyses:
The indexed k-means algorithm was implemented on the fisheries department
dataset, and the following results were observed, arranged by month: the fishes
found, the yield points, the most probable area, the most probable catch, and the
most profitable catch.
January
Fishes: 81, 123, 120, 12 (Atlantic cod, Yellowtail Flounder, Winter Flounder, Angler)
Yield points: (42.25, 70.25), (43.25, 70.25)
Area: 41.5833 to 43.4167 & 66.9167 to 70.25
Most probable: 120 - Winter Flounder at (42.25, 70.25) with a market value of 60.75 and landed weight of 26.25 kg
Profitability: 122 - Witch Flounder at (43.0833, 68.0833) with a market value to landed weight ratio of 791.27/159.43

February
Fishes: 124, 81, 12, 120 (Plaice Flounder, Atlantic cod, Angler, Winter Flounder)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.5833 to 43.25 & 67.0833 to 70.25
Most probable: 124 - Plaice Flounder at (43.25, 70.25) with a market value of 69.25 and landed weight of 24.33 kg
Profitability: 122 - Witch Flounder at (42.5833, 70.0833) with a market value to landed weight ratio of 484.8/110.43

March
Fishes: 120, 81, 512, 122, 124 (Winter Flounder, Atlantic cod, Wolfish, Witch Flounder, Plaice Flounder)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.75 to 43.75 & 67.25 to 70.75
Most probable: 512 - Wolfish at (43.25, 70.25) with a market value of 21.63 and landed weight of around 10.36 kg
Profitability: 122 - Witch Flounder at (43.75, 67.75) with a market value to landed weight ratio of 575/122.5

April
Fishes: 122, 81, 124, 120, 269 (Witch Flounder, Atlantic cod, Flounder, Winter Flounder, Pollock)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.75 to 44.25 & 67.75 to 70.25
Most probable: 81 - Atlantic cod at (42.25, 70.25) with a market value of 37.36 and landed weight of around 29.42 kg
Profitability: 122 - Witch Flounder at (43.25, 70.25) with a market value to landed weight ratio of 168.07/50.93

May
Fishes: 122, 81, 120 (Witch Flounder, Atlantic cod, Winter Flounder)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.58 to 43.25 & 69.25 to 70.75
Most probable: 120 - Winter Flounder at (42.25, 70.25) with a market value of 17.21 and landed weight of 9.47 kg
Profitability: 12 - Angler at (42.25, 70.25) with a market value to landed weight ratio of 46.6/16.5

June
Fishes: 81, 122, 123 (Atlantic cod, Witch Flounder, Yellowtail Flounder)
Yield points: (43.25, 70.25), (41.75, 69.75), (42.25, 70.25)
Area: 41.75 to 43.75 & 67.0833 to 70.25
Most probable: 81 - Atlantic cod at (41.75, 69.75) with a market value of 57.8 and landed weight of around 38.78 kg
Profitability: 122 - Witch Flounder at (43.25, 70.25) with a market value to landed weight ratio of 308.05/84.7

July
Fishes: 120, 122, 81 (Winter Flounder, Witch Flounder, Atlantic cod)
Yield points: (43.25, 70.25), (41.75, 69.75), (42.25, 70.25)
Area: 41.75 to 43.25 & 67.25 to 70.75
Most probable: 122 - Witch Flounder at (43.25, 70.25) with a market value of 135.27 and landed weight of around 47.22 kg
Profitability: 122 - Witch Flounder at (43.0833, 69.75) with a market value to landed weight ratio of 149.7/35.56

August
Fishes: 81, 122, 124 (Atlantic cod, Witch Flounder, Flounder)
Yield points: (41.75, 69.75), (42.25, 70.25)
Area: 41.5833 to 44.25 & 68.25 to 70.4167
Most probable: 122 - Witch Flounder at (42.25, 70.25) with a market value of 32.78 and landed weight of around 10.65 kg
Profitability: 122 - Witch Flounder at (43.25, 69.4167) with a market value to landed weight ratio of 591.44/158.72

September
Fishes: 81, 122, 124 (Atlantic cod, Witch Flounder, Flounder)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.5833 to 43.75 & 68.4167 to 70.75
Most probable: 512 - Wolfish at (41.75, 69.75) with a market value of 81.54 and landed weight of around 38.14 kg
Profitability: 12 - Angler at (43.25, 70.25) with a market value to landed weight ratio of 141.61/32.3

October
Fishes: 12, 81, 96, 269 (Angler, Atlantic cod, Cusk, Pollock)
Yield points: (42.25, 69.75), (43.25, 70.25), (42.25, 70.25)
Area: 41.75 to 43.75 & 68.25 to 70.75
Most probable: 12 - Angler at (42.25, 70.25) with a market value of 45.2 and landed weight of around 16.5 kg
Profitability: 122 - Witch Flounder at (42.25, 68.25) with a market value to landed weight ratio of 424.24/117.43

November
Fishes: 12, 81, 120 (Angler, Atlantic cod, Winter Flounder)
Yield points: (43.25, 70.25), (42.25, 69.75), (41.75, 70.25), (42.25, 70.25)
Area: 41.5833 to 43.25 & 66.75 to 70.4167
Most probable: 81 - Atlantic cod at (42.25, 70.25) and (43.25, 70.25) with market values of 25.6 and 115.6 and landed weights of 14.8 kg and 60.6 kg
Profitability: 122 - Witch Flounder at (43.4167, 70.0833) with a market value to landed weight ratio of 77.84/21

December
Fishes: 81, 120, 123, 124 (Atlantic cod, Winter Flounder, Yellowtail Flounder, Flounder)
Yield points: (43.25, 70.25), (41.75, 70.25), (42.25, 70.25)
Area: 41.75 to 43.4167 & 67.25 to 70.25
Most probable: 81 - Atlantic cod at (41.75, 70.25) with a market value of 103.16 and landed weight of around 50.82 kg
Profitability: 122 - Witch Flounder at (42.75, 69.75) with a market value to landed weight ratio of 155/37
4.2.1 Interpreting the patterns found:
What do these patterns mean? They mean that in a given month, at a particular
point or points, a particular fish is likely to be found; the profitability of the catch
can also be inferred from the results. The indexed k-means method proves to be
very informative for making decisions on concentrated clusters, as well as for
bringing out inconspicuous patterns that are not discerned easily by other
clustering techniques, and it gives an all-encompassing picture of the dataset. For
each month, the latitude and longitude coordinates provide an indication of what
kind of fish is available during that month of the year. More specifically, the
month of January has very few records in the dataset; even so, it is possible to see
which types of fish can be found at which points. This would have been left out
by the simple k-means method, which depends on the available quantities for
each month and hence may not form a cluster for that month at all. The result
helps in deciding when and where to fish. For example, for the month of January,
the Atlantic cod, Yellowtail Flounder, Winter Flounder and Angler will mostly be
found at the points (42.25, 70.25) and (43.25, 70.25), and the most probable area
to fish will be the area covered by the latitude degrees 41.5833 to 43.4167 and the
longitude degrees 66.9167 to 70.25. We also found that Winter Flounder is the
most probable fish to be found at (42.25, 70.25), with a market value of 60.75 and
a landed weight of 26.25. One can also infer that the most profitable fish will be
the Witch Flounder found at (43.0833, 68.0833), with a market value to weight
ratio of 791.27/159.43.
The month of the year as well as the latitude and longitude degrees are physically
static and are known by the user. The other variables and the relations shown in
the results tell the user when and where to fish.
4.2.2 Testing and Limitations:
The confidence level of the results can be discovered by issuing a simple query
against the whole dataset and finding the probability of occurrence. The thesis
found that on average 15% of the archived records matched the results found,
which gives the support and the confidence of the patterns. Sometimes the
knowledge gained by the algorithm had no matching records in the archive,
which would seem to mean that the knowledge gained for that instance is of no
consequence; after careful analysis, however, the knowledge gained in such an
instance turns out to be true deep knowledge. For example, in the month of
January the most profitable fish is the Witch Flounder found at (43.0833,
68.0833). When tested against the archived records, the analysis revealed that this
result holds good for the years 1985 and 1989, yet it could not be matched with
the records for the years 1983, 1984, 1986, 1987 or 1988. After careful analysis,
the conclusion reached was that no fishing was done at that point in that month
during those years; without this knowledge, a profitable opportunity would have
been missed.
The limitation of the results is that the knowledge may depend on more
parameters than were modeled. Physical parameters might affect the results, such
as the lack of fishing at a spot in a particular year, the type of vessels used for
fishing, fishing technology, environmental effects such as El Niño, or man-made
interferences; these limit the performance of the knowledge gained.
4.2.3 Comparison between k-means and indexed k-means:
Computation is reduced for indexed k-means because the dataset is broken into
smaller datasets for the knowledge discovery process without loss of information,
whereas knowledge is lost by the plain k-means method: as in the case stated
above, a column could not be isolated according to the needs of the application.
When k-means was asked what knowledge is available in the given dataset for the
month of August, no answer was available; the indexed k-means method, however,
was able to answer the question. Both methods require the user to specify the
initial number of clusters, and both are unsuitable for discovering clusters with
nonconvex shapes or clusters of very different sizes. Indexed k-means is less
sensitive to noise and outlier data points than k-means, because erroneous data
may be eliminated, as it is a more focused study than regular k-means.
4.3 Discussion
The nearest-neighbor instance-based learning is simple and often works very well.
Instance-based learning is time consuming for a dataset of realistic size because
the entire training data must be scanned to classify each test instance.
Another problem with instance-based methods is that the database can easily
become corrupted by noisy exemplars. One solution is to adopt the k-nearest
neighbor strategy. However, computation time inevitably increases. Another way
of proofing the database against noise is to choose the exemplars that are added
to it selectively and judiciously.
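A minimal sketch of the k-nearest-neighbor strategy mentioned above (assumed data structures; not code from this thesis): the k closest training instances vote, so a single noisy exemplar is less likely to decide the class.

import java.util.*;

class KnnSketch {
    // Classify a query instance by majority vote among its k nearest training instances.
    static int classify(double[][] train, int[] labels, double[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(train[i], query)));
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < train.length; i++) {
            votes.merge(labels[order[i]], 1, Integer::sum);   // one vote per neighbor
        }
        int best = -1, bestVotes = -1;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) { bestVotes = e.getValue(); best = e.getKey(); }
        }
        return best;
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}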
The nearest-neighbor method originated many decades ago; statisticians analyzed
k-nearest-neighbor schemes in the early 1950s. If the number of training instances
is large, it makes intuitive sense to use more than a single nearest neighbor, but
clearly this is dangerous if there are not many instances. It can be shown that
when k and the number n of instances both become infinite in such a way that
k/n → 0, the probability of error approaches the theoretical minimum for the
dataset. The nearest-neighbor method was adopted as a classification scheme in
the early 1960s and has been widely used in the field of pattern recognition for
over three decades.
Chapter 5
FUTURE WORK
There have been many data mining systems developed in recent years. This trend
of research and development on data mining is expected to flourish because the
huge amounts of data that have been collected in databases and the necessity of
understanding and making good use of such data in decision making have served
as the driving forces in data mining.
The diversity of data, data mining tasks, and data mining approaches poses many
challenging research issues on data mining. The design of data mining languages,
the development of efficient and effective data mining methods and systems, the
construction of interactive and integrated data mining environments, and the
application of data mining techniques to solving large application problems are
important tasks for data mining researchers and for data mining system and
application developers.
Moreover, with the fast computerization of society, the social impact of data
mining should not be under-estimated. When a large amount of interrelated data
is effectively analyzed from a different perspective, it can pose threats to the goal
of protecting data security and guarding against the invasion of privacy. It is a
challenging task to develop effective techniques for preventing the disclosure of
sensitive information in data mining, especially as the use of data mining systems
is rapidly increasing in domains ranging from business analysis, customer analysis,
to medicine and government.
5.1 Conclusion:
A very simple and elegant method for conducting more informative clustering has
been discussed in this research. In addition, a descriptive method of performing
the indexed k-means algorithm has been provided, and its advantages over other
clustering techniques have been discussed. A lot of research remains in the area
of automating the indexing in a manner that is less time-consuming.
5.2 Future Work
The diversity of data, data mining tasks, and data mining approaches poses many
challenging research issues in data mining. The design and development of
efficient data mining methods and systems and the construction of interactive
and integrated data mining techniques to solve large application problems are
important tasks for data mining researchers.
Web mining will become one of the most important and flourishing subfields of
data mining. The future of data mining lies in application exploration, where it
spreads to different application areas, resulting in the development of more
application-specific data mining systems. Scalable data mining methods will be
needed to handle huge amounts of data efficiently and interactively, and data
mining will be further integrated with database, data warehouse and Web
database systems. Standardization of a data mining language will facilitate the
systematic development of data mining solutions, and the study and development
of visual data mining will facilitate the adoption of data mining as a tool for data
analysis. Methods to mine complex types of data, such as multimedia and
time-series data, where research is aimed towards the integration of data mining
methods with existing data analysis techniques, are also required. Above all, data
mining should ensure privacy protection and information security while
facilitating proper information access and mining.
5.3 Research Directions
In this thesis, a lot of research remains in the area of automating the indexing of
the clusters so as to reduce the time involved in the process. The indexed
approach can be extended to other data mining clustering techniques: the same
principle can be used in different algorithms, such as the EM algorithm or
k-modes, to arrive at more localized results than the regular clustering methods.
It can also be used in application-specific data mining systems. For example,
medical expert systems give an opinion about what kind of sickness a person has;
by using this indexed method, we can come to closer solutions. When a patient
goes to a dentist for tooth pain, the doctor examines him and inspects the visible
symptoms to check for cavities; this becomes a static variable. But there can be
many types of cavities, so the doctor may ask more questions leading to a general
idea of the reason for the pain, assuming that the doctor does not have a new
x-ray at the initial consultation. In machine terms, instead of clustering the whole
dataset and finding a variable that explains the reason behind the pain, the index
field (the static variable) can be given, and the clusters found within that field
(i.e., tooth pain).
Appendix A
SOURCE CODE
Source Code for k-means:
import java.sql.*;
import java.io.*;
import java.util.*;
import java.lang.Math;
class fish1{
/* initializing static variables*/
static int k=0; static double min; static double a[][] = new double[122130][5];
static double clus[][] = new double[24][5]; // arrays sized for up to 24 clusters (k = 24 in main)
static double group[][]= new double[24][];
static double comp[][] = new double[24][];
static double b[][] = new double[12050][5];
static double c[][] = new double[12050][];
static double cl[][] = new double[12050][];
static double dis[] = new double[24];
static double count[] = new double[24];
static String newval[] = new String[24];
static String old[] = new String[24];
static Vector vec=null;
/* Method to get data values from database. */
public static Vector getdb()
{
Connection conn = null;
ResultSet rset= null;
Vector v= null;
ResultSetMetaData rsetmd;
String a[]= new String[5];
int nCols=0;
int cc=0;
try{
Class.forName ("sun.jdbc.odbc.JdbcOdbcDriver");
System.out.println ("I am inside class");
conn =DriverManager.getConnection("jdbc:odbc:testoracle","g_str","ora4st");
System.out.println ("connection created");
Statement stmt = conn.createStatement();
String query = "select fish_id, lat_deg, long_deg, mkt_values, landed_kg from
thesis_data"; // for indexing we use where statement.
rset = stmt.executeQuery(query);
rsetmd = rset.getMetaData();
v = new Vector();
while (rset.next ())
{
a[0]= rset.getString(1);
a[1]= rset.getString(2);
a[2]= rset.getString(3);
a[3]= rset.getString(4);
a[4]= rset.getString(5);
String x = a[0]+","+a[1]+","+a[2]+","+a[3]+","+a[4];
System.out.println("rows values"+x);
v.add(x);
cc++;
}//while
rset.close();
stmt.close();
conn.close();
}//try
catch(Exception e){e.printStackTrace();}
System.out.println("rows count"+ cc);
return v;
}//closing method getdb.
/*/method called from main method which call getdb to retrieval data values
and stores in double array b */
public static double[][] dataval()
{
int i=0;
int d;
String s[] = new String[12050];
vec =fish1.getdb();
for(d=0;d<vec.size();d++)
{ s[d]=(String)vec.get(d);}
for(int a=0;a<vec.size();a++)
{
int t=0;
StringTokenizer st = new StringTokenizer((String)vec.get(a),",");
while (st.hasMoreTokens()){
b[a][t] = Double.parseDouble(st.nextToken());
t++;
}//while
}//for
return b;
}
/* method called from main to initialize clusters.*/
public static double[][] getInitClus(double c[][],int i)
{
for(int s=0;s<i;s++)
{
cl[s] = c[s];
}
return cl;
}
/* main method */
public static void main (String args[]){
int z;
// k's value represents & should be changed for the no. of clusters.
int k=24;
int x=k;
System.out.println("inside main");
//calling dataval method to init data values to array a.
a=fish1.dataval();
//calling getinit to initialize clusters.
clus=fish1.getInitClus(a,k);
int nrows = vec.size();
do{
//assigning old values to new values
System.out.println(" i am an old value");
for(int r=0;r<k;r++)
{
comp[r]=clus[r];
old[r]=comp[r][0]+","+comp[r][1]+","+comp[r][2]+","+ comp[r][3] +","+comp[r][4];
System.out.println(old[r]);
}
/*initialize count with k ;used calculate the mean of the
grouped values. use 5 */
for (int f=0;f<k;f++){ count[f]=1.0;}
/* while condition loop to group the minimum distance values */
while( x < nrows)
{
// calculating distance use k 5;
for(int w=0;w<k;w++)
{
dis[w]=Math.sqrt(Math.pow((clus[w][0] - a[x][0]),2.0)
+ Math.pow( (clus[w][1] - a[x][1]),2.0)
+ Math.pow( (clus[w][2] - a[x][2]),2.0)
+ Math.pow( (clus[w][3] - a[x][3]),2.0)
+ Math.pow( (clus[w][4] - a[x][4]),2.0) );
}//for
// getting minimum distance value
double w = dis[0];
for (int i=1;i<k;i++) //use k 6
{ if (w<dis[i]) min = w ;
else min=dis[i];
w = min;
}//for
/*grouping adding cluster values with other related values
based on minimum distance values*/
z=0;
for(int c=0;c<k;c++)//use k 7
{ if(dis[c] == w)
{
count[c]++;
z=c;
group[c]=clus[c];//use k 8
}//if
}//for
group[z][0]=(group[z][0] + a[x][0]);
group[z][1]=(group[z][1] + a[x][1]);
group[z][2]=(group[z][2] + a[x][2]);
group[z][3]=(group[z][3] + a[x][3]);
group[z][4]=(group[z][4] + a[x][4]);
// incrementing for the next value to be added
x++;
}//while
// System.out.println(" i am after old[r]");
/*Getting mean and comparing the old values and new values in string */
for(int q=0;q<k;q++)// use k 11
{
for(int y=0;y<5;y++)
{
clus[q][y]=(group[q][y]/count[q]);
}
newval[q]=clus[q][0]+","+clus[q][1]+","+clus[q][2]+","+clus[q][3]+","+clus[q][4];
// System.out.println(" new values"+ newval[q]);
}//for
//string comparing
System.out.println("checking while at end of the grouping");
System .out.println("no, not yet.........");
x=k;//use k 12
}while( !(old[0].equals(newval[0])) & !(old[1].equals(newval[1])) &
!(old[2].equals(newval[2])) & !(old[3].equals(newval[3])) &
!(old[4].equals(newval[4])) & !(old[5].equals(newval[5])) &
!(old[6].equals(newval[6])) & !(old[7].equals(newval[7])) &
!(old[8].equals(newval[8])) & !(old[9].equals(newval[9])) &
!(old[10].equals(newval[10])) & !(old[11].equals(newval[11])) &
!(old[12].equals(newval[12])) & !(old[13].equals(newval[13])) &
!(old[14].equals(newval[14])) & !(old[15].equals(newval[15])) &
!(old[16].equals(newval[16])) & !(old[17].equals(newval[17])) &
!(old[18].equals(newval[18])) & !(old[19].equals(newval[19])) &
!(old[20].equals(newval[20])) & !(old[21].equals(newval[21])) &
!(old[22].equals(newval[22])) & !(old[23].equals(newval[23])) );
System.out.println("checking while at end of the grouping");
System.out.println("yep..i am done buddy ,here is your best values");
//printing the final values of clusters
for (int i=0;i<k;i++){ System.out.println(newval[i]); }//for
}//end of main method
}//class
References:
[1] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. From Data Mining to
Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data
Mining, G. Piatetsky-Shapiro and J.Frawley, editors, AAAI Press, Menlo Park, CA,
1996.
[2] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. The KDD Process for
Extracting Useful Knowledge from Volumes of Data. In Communications of the
ACM – Data Mining and Knowledge Discovery in Databases, pages 27-34, 1996.
[3] Web surpasses one billion documents.
http://www.inktomi.com/new/press/billion.html.
[4] Cooley, R., Mobasher, B., and Srivastava, J. Web mining: Information and
patterns discovery on the World Wide Web. In Proceedings of the ninth IEEE
International Conference on Tools with Artificial Intelligence, pages 558–567, Newport
Beach, CA, 1997.
[5] Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K.,
and
Slattery, S. Learning to extract symbolic knowledge from the World Wide Web.
In Proceedings of the fifteenth National Conference on Artificial Intelligence, pages 509–516,
Madison, WI, 1998.
[6] Chakrabarti, S., Dom, B. E., Gibson, D., Kleinberg, J., Kumar, R., Raghavan,
P., Rajagopalan, S., and Tomkins, A. S. Mining the link structure of the world
wide web. IEEE Computer, 32(8): 60–67, 1999.
[7] Borges, J. and Levene, M. Data mining of user navigation patterns. In Masand,
B. and Spliliopoulou, M., editors, Web Usage Mining, To appear in Lecture Notes
in Artificial Intelligence (LNAI 1836). Springer Verlag, Berlin, 2000.
[1] Pieter Adriaans and Dolf Zantinge. Data Mining. Addison-Wesley, 1996.
[2] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2000.
[3] S. Chaudhuri and U. Dayal. An Overview of Data Warehousing and OLAP
Technology. ACM SIGMOD Record, 26(1):65-74, 1997.
[4] J. Han. "Data Mining", in J. Urban and P. Dasgupta (eds.), Encyclopedia of
Distributed Computing, Kluwer Academic Publishers, 1999.
[5] M. S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database
Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.
[6] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.).
Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[7] Alex Berson, Stephen Smith, and Kurt Thearling. Building Data Mining
Applications for CRM.
[8] Michael J. A. Berry and Gordon Linoff. Data Mining Techniques for Marketing, Sales,
and Customer Support, p. 415. John Wiley & Sons, Inc., 1997. ISBN 0-471-17980-9.
[9] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge
Discovery: An Overview. In Proceedings of ACM KDD, 1994.