Arisekola Akanbi
EVALUATION OF DATA MINING
Thesis
CENTRAL OSTROBOTHNIA UNIVERSITY OF APPLIED SCIENCES
Degree Programme in Information Technology
May 2011

Thesis Abstract
Department: Technology and Business
Date: May 2011
Author: Arisekola Akanbi
Degree Programme: Degree Programme in Information Technology
Thesis Topic: Evaluation of data mining
Instructor: Kauko Kolehmainen
Supervisor: Kauko Kolehmainen
Pages: 63 + APPENDIX

Developments in networking, processor and storage technologies have led to an increase in the amount of data flowing into organizations and to the creation of mega databases and data warehouses to handle the bulk of transactional data in digital form. This has created an emphatic need to develop processes and tools to explicitly analyze such data, so as to extract valuable trends and correlations and generate interesting information that will yield knowledge from the data.

Data mining is the technology that meets the challenge of extracting knowledge from these vast data burdens. It provides a user-oriented approach to novel, hidden patterns in data. Important disciplines ranging from machine learning, information retrieval and statistics to artificial intelligence have shaped the development of data mining. Based on the geometric increase in data flow, we expect ever more advanced and sophisticated information to be hidden in datasets.

The goal of the thesis was to evaluate data mining in theory and in practice. An overview of database systems, data warehousing, data mining goals, applications and algorithms was carried out. It also involved reviewing data mining tools. Microsoft SQL Server 2008, in conjunction with the Excel 2007 data mining add-ins, was used to demonstrate data mining tasks in practice, using data samples from the Microsoft AdventureWorks database and the Weka repository. In conclusion, the results of the tasks using the Microsoft Excel data mining add-ins revealed how reliable, easy and efficient data mining can be.
Key words: Data, data mining, information and knowledge

TABLE OF CONTENTS
1 INTRODUCTION
2 DATABASE SYSTEMS
2.1 Databases
2.2 Relationship between data mining and data warehousing
2.3 Data warehousing
3 DATA MINING
3.1 Brief history and evolution
3.2 Knowledge discovery in databases
3.3 Knowledge discovery process models
3.4 The need for data mining
3.5 Data mining goals
3.6 Applications of data mining
3.6.1 Marketing
3.6.2 Supply chain visibility
3.6.3 Geospatial decision making
3.6.4 Biomedicine and science application
3.6.5 Manufacturing
3.6.6 Telecommunications and control
4 DISCOVERED KNOWLEDGE
4.1 Association rules
4.1.1 Association rules on transactional data
4.1.2 Multilevel association rules
4.2 Classification
4.2.1 Decision tree
4.3 Clustering
4.4 Data mining algorithms
4.4.1 Naïve Bayes algorithm
4.4.2 Apriori algorithm
4.4.3 Sampling algorithm
4.4.4 Frequent-pattern tree algorithm
4.4.5 Partition algorithm
4.4.6 Regression
4.4.7 Neural networks
4.4.8 Genetic algorithm
5 APPLIED DATA MINING
5.1 Data mining environment
5.2 Installing the SQL server
5.3 Data mining add-ins for Microsoft Office 2007
5.4 Installing the add-ins for Excel 2007
5.5 Connecting to the analysis service
5.6 Effect of the add-ins
5.6.1 Analyze key influencers
5.6.2 Detect categories
5.6.3 Fill from example tool
5.6.4 Forecast tool
5.6.5 Highlight exceptions tool
5.6.6 Scenario analysis tool
5.6.7 Prediction calculator
5.6.8 Shopping basket analysis
6 ANALYSIS SCENARIO AND RESULT
7 CONCLUSION
REFERENCES
APPENDIX

1 INTRODUCTION

It is an established fact that we are in an information technology driven society, where knowledge is an invaluable asset to any individual, organization or government. Companies are supplied with huge amounts of data on a daily basis, and they need to refine these data so as to get the most important and useful information into their data warehouses. The need for a technology to help solve this quest for information has been on the research and development front for several years now.
Data mining is a new technology which can be used to extract valuable information from the data warehouses and databases of companies and governments. It involves the extraction of hidden information from huge datasets. It helps in detecting anomalies in data and predicting future patterns and behaviour in a highly efficient way. Data mining is implemented using tools, and the automated analysis provided by these tools goes beyond evaluation of a dataset to providing tangible clues that human experts would not have been able to detect, because they have never experienced or expected such patterns. Applying data mining makes it easier for companies and governments to derive quality decisions from available data, decisions which would otherwise have taken much longer based on human expertise alone. Data mining techniques can be applied in a wide range of organizations, as long as they collect data, and there is a variety of data mining software available on the market today to help companies tackle decision-making problems and thereby overcome competition from other companies in the same business.

The goal of this thesis work is to evaluate data mining in theory and in practice, so that the thesis could also be used for academic purposes. It gives an overview of database systems and data warehousing, provides insight into data mining as a field, and tries out some data mining tools used to accomplish the process. Achieving this objective involves reviewing the main algorithms employed by most data mining tools and carrying out some scenario analyses to demonstrate the process using one or more data mining tools. The tool used in this work is Microsoft SQL Server 2008, in conjunction with the Excel 2007 data mining add-ins; however, this tool only uses data that has already been collected and prepared, because it basically models and analyzes ready data. The other data mining tools tried were IBM Intelligent Miner, Tanagra data miner, and Weka.

The contents of this work start with an overview of database systems and databases, these being the root technology that led, by evolution, to data mining. Then there is a brief literature review on data warehousing and its relation to data mining, since all useful data collected by organizations are kept there before they can be subjected to any further mining or analysis prior to decision making. There is an overview of data mining as a field, its evolution, what motivated its coming into existence, data mining objectives and the process of knowledge discovery in databases. The knowledge discovered is then placed into classes based on the method used and the outcome. The main algorithms employed during the data mining process are also analyzed, some with example citations and graphs to illustrate the algorithm. Some of the numerous possible applications of data mining are also discussed. Applied data mining using the Microsoft Office Excel 2007 data mining add-ins is also discussed, ranging from the SQL server and the add-ins to the effect of the add-ins in the form of tools and algorithms employed during analysis of ready data. For emphasis, the objective of this work is to evaluate data mining in theory and practice.

2 DATABASE SYSTEMS

In this chapter, an overview of database systems, their evolution, databases, data warehousing, and the relationship between data warehousing and data mining will be given.
Database understanding would be incomplete without some knowledge of the major aspects which constitute the building and framework of database systems, and these fields include structured query language (SQL), extensible markup language (XML), relational database concepts, object-oriented concepts, clients and servers, security, unified modeling language (UML), data warehousing, data mining and emerging applications. The relational database idea was put forward to separate the storage of data from its conceptual depiction, and hence provide a logical foundation for content storage. The birth of object-oriented programming languages brought about the idea of object-oriented databases, even though they are not as popular as relational databases today. (Elmasri & Navathe 2007, 23.)

Adding and retrieving information from databases is fundamentally achieved by the use of SQL, while interchanging data on the web was made possible and enhanced by publishing languages like Hypertext Markup Language (HTML) and XML. The data are kept on web servers so that other users can have access to them, and subsequent development and research into database management led to the realization of data warehousing and data mining, even though both had been applied before the names were coined. (Elmasri & Navathe 2007, 23.)

XML is a standard proposed by the World Wide Web Consortium. It helps users to describe and store structured or semi-structured data and to exchange data in a platform- and tool-independent way. It helps to implement and standardize communication between knowledge discovery and database systems. Predictive Model Markup Language (PMML), a standard based on XML, has also been identified to enhance interoperability among different mining tools and to achieve integration with other applications ranging from database systems to decision support systems. (Cios, Pedrycz, Swiniarski & Kurgan 2007, 20.) The Data Mining Group (DMG) developed PMML to represent analytic models. It is supported by leading business intelligence and analytics vendors like IBM, SAS, MicroStrategy, Oracle and SAP. (Webopedia 2011.)

Database systems can be classified as OLTP (on-line transaction processing) systems and decision support systems, such as warehouses, on-line analytical processing (OLAP) and mining. Archives of data from OLTP feed decision support systems, which have the aim of learning from past instances. OLTP involves many short, update-intensive commands and is the main function of relational database management systems. (Mitra & Acharya 2006, 24.)

2.1 Databases

A database is a well structured aggregation of data that are associated in a meaningful way and that can be accessed in various logical ways by several users. Database systems are systems in which the translation and storage of data are of paramount value. The requirement for data to remain usable by many users, optimally and over several years, characterizes database systems. (Sumathi & Esakkirajan 2007, 2.) A database is sometimes abbreviated as db. It is a collection of organized data arranged in a way that a computer program can quickly and easily select required parts of the data. It can be viewed as an electronic filing system. (Webopedia 2011.) A traditional database is organized into fields, records and files, where a field is a single piece of information, a record is a complete set of fields, and a file is a collection of records. A database management system is needed to access data or information from a database. (Webopedia 2011.)
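As a minimal, hedged illustration of these concepts (fields, records, files and SQL access through a database management system), the following sketch uses Python's built-in sqlite3 module; the table and column names are invented for illustration only.

```python
import sqlite3

# In-memory database standing in for a simple electronic filing system.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A "file" (table) of customer "records", each made up of "fields".
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Anna", "Kokkola"), (2, "Ben", "Helsinki"), (3, "Carl", "Kokkola")],
)

# SQL is used to retrieve exactly the required part of the data.
cur.execute("SELECT name FROM customer WHERE city = ?", ("Kokkola",))
print(cur.fetchall())  # [('Anna',), ('Carl',)]
conn.close()
```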
The graph 1 below is a simple representation of how information technology and databases have evolved over time towards data mining.

GRAPH 1. Evolution of database technology (adapted from Han, Kamber & Pei 2005, 2). The graph outlines the following stages:
· Data collection and database creation (1960s and earlier): primitive file processing.
· Database management systems (1970s to early 1980s): hierarchical and network database systems; relational database systems; data modeling tools such as entity-relationship models; indexing and access methods such as B-trees and hashing; query languages such as SQL; user interfaces, forms and reports; query processing and query optimization; transactions, concurrency control and recovery; on-line transaction processing (OLTP).
· Advanced database systems (mid 1980s to present): advanced data models such as extended relational and object-relational models; advanced applications such as spatial, temporal, multimedia, active, stream and sensor, scientific and engineering, and knowledge-based applications.
· Web-based databases (1990s to present): XML-based database systems; integration with information retrieval; data and information integration.
· Advanced data analysis: data warehousing and data mining (late 1980s to present): data warehousing and knowledge discovery (generalization, classification, association, clustering, frequent pattern and structured pattern analysis, outlier analysis, trend and deviation analysis); advanced data mining applications (stream data mining, bio-data mining, time series analysis, text mining, web mining, intrusion detection); data mining and society (privacy-preserving data mining).
· New generation of integrated data and information systems (present to future).

2.2 Relationship between data mining and data warehousing

There has been an explosive growth in database technology and in the amount of data collected. Developments in data collection methods, bar code usage and the computerization of transactions have provided us with enormous amounts of data. The huge size of the data and the great computation involved in knowledge discovery hamper the ability to analyze readily available data in order to extract more intelligent and useful information, while data mining is all about enhancing decision making and predictions, or the process of data-driven extraction of not so obvious but useful information from large databases. (Sumathi & Sivanandam 2006, 5.)

Today's competitive marketplace challenges even the most successful companies to protect and retain their customer base, manage supplier partnerships and control costs while at the same time increasing their revenue. These and other information management processes are only achievable if the information managers have accurate and reliable access to the data, which prompts the need for the creation of data warehouses. (Sumathi & Sivanandam 2006, 6.)

Interestingly, data warehousing provides online analytical processing tools for interactive analysis of multidimensional data of varied granularities, which enhances data mining, and mining functions such as prediction, classification and association can be integrated with OLAP operations, thus enhancing the mining of knowledge and buttressing the fact that data warehouses have become an increasingly important platform for data analysis and data mining. (Sumathi & Sivanandam 2006, 6.)

2.3 Data warehousing

A data warehouse is an enabled relational database system designed to support very large databases at a significantly higher level of performance and manageability. It is an environment and not a product. (Han et al. 2001, 39.)
It is an architectural construct of information that is difficult to reach or present in traditional operational data stores. A data warehouse is also referred to as a subject-oriented, integrated, time-variant and non-volatile collection of data which supports the management decision-making process. Subject-oriented means that all relevant data pertaining to a subject are collected and stored as a single set in a useful format. (Han et al. 2001, 40.)

Integrated relates to the fact that data are stored in a globally accepted style with consistent naming conventions, measurements, encoding structures and physical features, even when the underlying operational systems store the data differently. Non-volatile simply implies that data in a data warehouse are in a read-only state, hence they can always be found and accessed in the warehouse. Time-variant denotes the period over which the data have been available, because such data are usually kept over the long term. (Han et al. 2001, 41.)

The process of constructing and using data warehouses is called data warehousing. Data warehouses comprise consolidated data from several sources, augmented with summary information and covering a long time period. They are much larger than other kinds of databases, having sizes ranging from several gigabytes to terabytes. Typical workloads involve ad hoc, fairly complex queries, and fast response times are important. (Ramakrishnan & Gehrke 2003, 679.)

OLAP is a basic function of a data warehouse system. It focuses on data analysis and decision making based on the content of the data warehouse, and it is subject-oriented, implying that it is organized around a certain main subject. It is built by integrating multiple, heterogeneous data sources like flat files, on-line transaction records and relational databases in a given format. (Mitra & Acharya 2006, 24.)

Data cleaning, integration and consolidation techniques are often employed to ensure consistency in nomenclature, encoding structures, attribute measures and more among different data sources, and they can be viewed as an important preprocessing step for data mining. Data warehouses primarily provide information from a historical perspective. (Mitra & Acharya 2006, 26.) Every key structure in the data warehouse contains some element of time, explicitly or implicitly, even though the key of operational data may not contain the time element. Data warehouses neither need to be updated operationally nor require transaction processing, recovery and concurrency control systems; all they need is the initial loading of the data and its access. (Mitra & Acharya 2006, 25.)

3 DATA MINING

The term data mining could be perceived to have derived its name from the similarities between searching for valuable information in a large database and mining a mountain for a vein of valuable ore; both processes require either sifting through an immense amount of material or intelligently probing it to find where the value resides. (Lew & Mauch 2006, 16.)

Data mining, also termed knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing proactive, knowledge-driven decisions to be made. They scour databases to find hidden patterns and predictive information that experts may miss because it lies outside their expectations. (Lew & Mauch 2006, 16-17.)
The Gartner Group refers to data mining as "the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques". (Larose 2005, 2.)

3.1 Brief history and evolution

Management Information Systems (MIS) in the 1960s and Decision Support Systems (DSS) during the 1970s did a lot by providing large amounts of data for processing and execution, but unfortunately what was obtained from the systems was more mere data than information, and that was not enough to enhance business activities and decisions. (Mueller & Lemke 2003, 5.) A summary of the trend of data mining evolution from the data collection stage to the data mining stage can be seen in the table below.

TABLE 1. Steps in the evolution of data mining (adapted from Thearling 2010)

Data collection (1960s)
· Business question: "What was my total revenue in the last five years?"
· Enabling technologies: computers, tapes, disks
· Characteristics: retrospective, static data delivery

Data access (1980s)
· Business question: "What were unit sales in New England last March?"
· Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
· Characteristics: retrospective, dynamic data delivery at record level

Data warehousing and decision support (1990s)
· Business question: "What were unit sales in New England last March? Drill down to Boston."
· Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
· Characteristics: retrospective, dynamic data delivery at multiple levels

Data mining (emerging today)
· Business question: "What is likely to happen to Boston unit sales next month?"
· Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
· Characteristics: prospective, proactive information delivery

Data mining can be viewed as having evolved from fields with a long history, even though the term itself was only introduced in the 1990s. It can be traced back to major families of origin which include classical statistics, artificial intelligence, database systems, pattern recognition and machine learning. Basically, without statistics there would be no data mining, because statistics is the bedrock upon which most data mining techniques are based. (Data mining software 2011.)

Classical statistics studies data and data relationships, hence it plays a significant role at the heart of today's data mining tools. Artificial intelligence, on the other hand, was built upon heuristics as opposed to classical statistics; it applies human thought-like processing to statistical problem domains. Machine learning can be perceived as a merger of artificial intelligence and advanced statistical analysis, since it lets computer programs learn about the data they process. (Data mining software 2011.) It prompts them to make decisions based on the data studied, using statistics and advanced artificial intelligence heuristics and algorithms. Traditional query and report tools have been used to describe and extract what is in a database. The user forms a hypothesis about a relationship and verifies it or discounts it with a series of queries against the data. (Data mining software 2011.)

3.2 Knowledge discovery in databases

Data would not fulfill its potential if it were not processed into information, after which knowledge can be gained from it through further processing.
Knowledge discovery involves a process that yields new knowledge. It gives in detail the sequence of steps (data mining included) that ought to be adhered to in order to discover knowledge in data, where each step is achieved using some software tools. It is a nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data in databases. It involves several steps, and each attempts to discover some knowledge using some knowledge discovery method. (Han et al. 2005, 10.)

It covers the whole knowledge exploration process, ranging from data storage and access, analysis of large data sets using efficient and scalable algorithms, results interpretation and visualization, to human-machine interaction modeling and support. (Han et al. 2005, 11.) A unique feature of the model is the definition of the input and output states of the data, because the output of a step is used as the input of the subsequent step in the process, and the final output is the discovered knowledge, portrayed in terms of patterns, rules, classifications, associations, trends and statistical analysis. (Han et al. 2005, 11.)

3.3 Knowledge discovery process models

The knowledge discovery process has been placed into two main models, called the Fayyad et al. (academic) model and the Anand and Buchner (industrial) model. The Fayyad model is represented below.

GRAPH 2. Knowledge discovery process (adapted from Fayyad, Piatetsky & Smyth 1996). The graph shows the sequence Data, Target Data, Preprocessed Data, Transformed Data, Patterns and Knowledge, linked by the steps Selection, Preprocessing, Transformation, Data Mining and Interpretation/Evaluation.

Developing and understanding the application domain entails learning the useful beforehand knowledge and the aims of the end user for the discovered knowledge. The next phase is creating a target data set, which involves querying the existing data to select the desired subset by choosing subsets of attributes and data points to be used for the task. (Han et al. 2005, 14.)

Data cleaning and preprocessing entails eradicating outliers, handling noise and missing values in data, and accounting for time sequence information and known changes. It leads to data reduction and projection, which consists of finding valuable attributes by utilizing dimension reduction and transformation methods, and discovering invariant representations of the data. (Han et al. 2005, 15.)

Choosing the data mining task involves matching the relevant prior knowledge and the objectives of the user with a specific data mining method. Choosing the data mining algorithm basically involves selecting methods to search for patterns in the data and deciding which models and parameters of the method are appropriate. (Han et al. 2005, 15-16.)

Data mining, the next phase, involves pattern generation in a particular representational form, such as classification rules, decision trees, etc. Consolidation of discovered knowledge involves incorporating the discovered knowledge into the performance system, documenting and reporting. (Han et al. 2005, 15-16.)

The industrial model, also tagged the CRISP-DM knowledge discovery process, is summarized in the graphs below.

GRAPH 3. Phases of the CRISP-DM process model (adapted from CRISP 2011)

1. Business understanding
· Determination of business objectives
· Assessment of the situation
· Determination of data mining goals
· Generation of project plan
2. Data understanding
· Collection of initial data
· Description of data
· Exploration of data
· Verification of data quality
3. Data preparation
· Selection of data
· Cleansing of data
· Construction of data
· Integration of data
· Formatting of data
4. Modeling
· Selection of modeling technique
· Generation of test design
· Creation of models
· Assessment of generated models
5. Evaluation
· Evaluation of results
· Process review
· Determination of next steps
6. Deployment
· Plan deployment
· Plan monitoring and maintenance
· Generation of final report
· Review of the process

GRAPH 4. Details of the CRISP-DM process model (adapted from CRISP 2011)

3.4 The need for data mining

The achievement of the digital revolution and the escalation of the internet have brought about a great amount of multi-dimensional data in virtually all human endeavors, and the data types range from text, image, audio, speech, hypertext and graphics to video, thus providing organizations with abundant data. However, all this data might not be useful if it does not provide the tangible, unique information that can be utilized in solving a problem. The quest to generate information from existing data prompted the need for data mining. (Mitra & Acharya 2003, 2.)

The impressive development in data mining can be attributed to the conglomeration of several factors, such as the explosive growth in data collection, the storing of data in warehouses (thus enhancing the accessibility and reliability of databases), the increased access to data from the internet and intranets, the quest to raise market share globally, the evolution of mining software, and the awesome development in computing power and storage capacity. (Larose 2005, 4.)

3.5 Data mining goals

Data mining is basically done with the aim of achieving certain objectives, which range from classification, prediction and identification to optimization.
· Classification: this involves allocating data into classes or categories as a result of combining criteria.
· Prediction: mining in this instance helps to single out features in the data and their tendencies over time.
· Identification: trends or patterns in certain data can help identify the existence of items, events or actions in a given scenario or case.
· Optimization: mining also facilitates the optimization of the use of scarce resources, in turn maximizing output variables within constraint conditions. (Elmasri & Navathe 2007, 947.)

3.6 Applications of data mining

The traditional approach to data analysis for decision making used to involve furnishing domain experts with statistical modeling techniques so as to develop hand-crafted solutions for specific problems, but the influx of mega data having millions of rows and columns, the spontaneous construction and deployment of data-driven analytics, and the demand by users for easily readable and understandable results have prompted the inevitable need for data mining. (Sumathi & Sivanandam 2006, 166.)

Data mining technologies are deployed in several decision-making scenarios in organizations. Their importance cannot be over-emphasized, as they are applicable in several fields, some of which are discussed below.

3.6.1 Marketing

This involves the analysis of customer behavior and purchasing patterns, the determination of market strategies ranging from advertising to location, targeted mailing, segmentation of customers, products and stores, and catalog design and advertisement strategy. (Elmasri & Navathe 2007, 970.)
3.6.2 Supply chain visibility

Companies have automated portions of their supply chain, enabling the collection of significant data about inventory, supplier performance, logistics of materials and finished goods, material expenditures, and the accuracy of plans for order delivery. Data mining application also spans price optimization and work force analysis in organizations. (Sumathi & Sivanandam 2006, 169.)

3.6.3 Geospatial decision making

In climate data and earth ecosystem scenarios, the automatic extraction and analysis of interesting patterns involves modeling ecological data and designing efficient algorithms for finding spatiotemporal patterns in the form of tele-connection patterns, or recurring and persistent climate patterns. This operation is usually carried out using the clustering technique, which divides the data into meaningful groups, helping to automate the discovery of tele-connections. (Sumathi & Sivanandam 2006, 174.)

3.6.4 Biomedicine and science application

Biology used to be a field dominated by an attitude of formulate hypothesis, conduct experiment, evaluate results, but under the impact of data mining it has evolved into a field with a big-science attitude involving collecting and storing data, mining for new hypotheses, and confirming them with data or supplemental experiments. (Sumathi & Sivanandam 2006, 170.) It also includes the discovery of patterns in radiological images, the analysis of microarray (gene-chip) experimental data to cluster genes and to relate them to symptoms or diseases, and the analysis of the side effects and effectiveness of certain drugs. (Sumathi & Sivanandam 2006, 171.)

3.6.5 Manufacturing

The application in this aspect relates to optimizing the resources used in the optimal design of manufacturing processes, and product design based on customers' feedback. (Elmasri & Navathe 2007, 970.)

3.6.6 Telecommunications and control

Data mining is applied to the vastly available high volume of data consisting of call records and other telecommunication-related data, which in turn is applied to toll-fraud detection, consumer marketing and improving services. (Sumathi & Sivanandam 2006, 178.) Data mining is also applied in security operations and services, information analysis and delivery, text and web mining, banking and commercial applications as well as insurance. (Han et al. 2005, 456.)

4 DISCOVERED KNOWLEDGE

Data can only be useful when it is converted into information, and it becomes paramount when some knowledge is gained from the generated information, as this is the most vital phase of data handling in any setup that deals with decision making. The knowledge obtained can be inductive or deductive, where deductive knowledge deduces new information by applying pre-specified logical rules to some data. Inductive knowledge is the form of knowledge referred to where data mining is concerned, as it discovers new rules and patterns from the given data. (Elmasri & Navathe 2007, 948.)

The knowledge acquired from data mining is classified in the forms below, though knowledge can also result from a combination of any of them:
· Association rules: these correlate the presence of a set of items with a range of values for another set of variables.
· Classification hierarchies: these aim at progressing from an existing set of transactions or actions to generate a hierarchy of classes.
· Sequential patterns: these seek some form of sequence in events or activities.
· Patterns within time series: this involves detecting similarities within positions of time series data, meaning sets of data obtained at regular intervals.
· Clustering: this relates to the segmentation of a given collection of items or actions into sets of similar elements. (Elmasri & Navathe 2007, 949.)

4.1 Association rules

This deals with unsupervised data, as it finds interesting associations, dependencies and relationships in vast data item sets. These items are kept as transactions that can be created by an external process or fetched from data warehouses or relational databases. Due to the scalable nature of association rule algorithms and the ever-increasing size of accumulating data, the use of association rules for knowledge extraction is somewhat inevitable, as discovering interesting associations gives a source of information that is used for making decisions. (Han et al. 2005, 256.)

Association rules are applied in areas such as market basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, data processing, genomics, etc. Market basket analysis is the most intuitive application of association rules, as it strives to analyze customer tendencies by finding associations between items purchased by customers. (Han et al. 2005, 264-270.)

An example of the application of association rules is shown in the graph below, where a sales transaction table is used to identify items which are often bought together, so as to be able to make some cross-selling decisions during discount sales periods. (MacLennan, Tang & Crivat 2009, 7.) It simply shows that customers who buy milk also buy cheese, customers who buy cheese could also buy wine, customers buy either Coke or Pepsi together with juice, and the same applies to customers buying beer or wine, cake or donuts.

GRAPH 4. Items association (adapted from MacLennan et al. 2009, 7). The graph links items such as milk, cake, beer, cheese, wine, Coke, Pepsi, juice, donut and beef according to how often they are bought together.

4.1.1 Association rules on transactional data

Items are denoted as Boolean variables while a collection is denoted as a Boolean vector, and the vector is analyzed to determine which variables are frequently taken together by different users, or in other words are associated with each other. These co-occurrences are represented in association rules written as:

LHS => RHS [support, confidence]

The left-hand side (LHS) implies the right-hand side (RHS) with a given value of support and confidence. Support and confidence are used to determine the quality of a given rule in terms of its usefulness (strength) and certainty. Support denotes how many instances (transactions) from the data set used to generate the rule include the items from both the left-hand side and the right-hand side, while confidence expresses how many of the instances that include the items from the left-hand side also include the items from the right-hand side. Measured values are given as percentages. (Han et al. 2005, 290-291.)

An association rule is interesting if it satisfies minimum values of confidence and support, which are stipulated by the user. Association rules are derived when data describe events that occur at the same time or in close proximity. The main kinds of association rules are single-dimensional and multidimensional, and both kinds can be grouped as either Boolean or quantitative. (Han et al. 2005, 292.) The Boolean case relates to the presence or absence of an event or item, while the quantitative case considers values which are partitioned into item intervals.
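As a small worked example of these two measures (a hedged sketch with made-up transactions, not data from the thesis), the rule {milk} => {cheese} can be scored as follows:

```python
transactions = [
    {"milk", "cheese", "bread"},
    {"milk", "cheese"},
    {"milk", "juice"},
    {"cheese", "wine"},
]

lhs, rhs = {"milk"}, {"cheese"}
n = len(transactions)

# Support: share of all transactions that contain both the LHS and RHS items.
both = sum(1 for t in transactions if lhs <= t and rhs <= t)
support = both / n

# Confidence: share of the LHS transactions that also contain the RHS items.
lhs_count = sum(1 for t in transactions if lhs <= t)
confidence = both / lhs_count

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# prints: support = 50%, confidence = 67%
```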
Basically, the data to be used in any mining applying association rules ought to be given in a transactional form, hence each transaction should have a transaction identification (ID) and consistent information about all items. (Han et al. 2005, 292.)

4.1.2 Multilevel association rules

Finding association rules at low levels in cases where items form a hierarchy can be difficult. Such association rules may instead be found at higher levels, existing as established knowledge. Multilevel association rules are created by performing a top-down, iterative deepening search, in essence by first finding strong rules at the high levels of the hierarchy before searching for lower-level, weaker rules. (Han et al. 2005, 244.)

The main methods for multilevel association rules fall into the following classes. The uniform support based method involves the same minimum support threshold being applied to create rules at all levels of abstraction. The reduced support based method primarily fixes the shortcomings of the uniform support method, since every level of abstraction is supplied with its own minimum support threshold, and the lower the level, the smaller the threshold. (Han et al. 2005, 244-245.) The level-by-level independent method involves a breadth-first check being carried out, hence every node in the hierarchy is checked, regardless of the frequency of its parent node. The level cross-filtering by single item method involves checking items at a given level only if their parent at the previous level is frequent. The level cross-filtering by k-itemset method is a method in which a k-itemset at a given level is checked only if its parent k-itemset at the previous level is frequent. (Han et al. 2005, 246.)

4.2 Classification

Classification is the process of learning a model which describes different classes of data, where the classes are usually predetermined. It is also known as supervised learning, as it involves building a model which can be used to classify new data. The process begins with already classified data, usually called the training set, and each training record contains an attribute called the class label. (Elmasri & Navathe 2007, 961.) It essentially is the act of splitting up objects so that each one is assigned to one of a number of mutually exhaustive and exclusive categories known as classes. (Bramer 2007, 23.)

Classification basically involves the process of finding a set of models or functions which describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose label is unknown. The classification models may operate on rules involving decision trees, neural networks, mathematical rules or formulae, or probability. (Han et al. 2005, 280.) A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch stands for an outcome of the test, and the tree leaves represent classes or class distributions. (Han et al. 2005, 282.) The approach to classification that uses probability theory to find the most likely classification is known as Naïve Bayes. (Bramer 2007, 24.)

4.2.1 Decision tree

A decision tree is a flow-chart-like tree figure used to represent information in given data, where each internal node represents a test on an attribute, every branch denotes an outcome of the test, and the leaf nodes represent classes. The uppermost node in a tree is the root node. (Han et al. 2005, 284-286.)
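As a hedged, minimal sketch of how such a tree can be induced from labeled records, the following uses scikit-learn (assumed to be available) on the magazine-subscription data presented next, with the car type simplified to a sedan indicator:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Records from TABLE 2 below: [age, is_sedan (1 = Sedan, 0 = Sports/Truck),
# number of children]; the labels are the Subscription column.
X = [[23, 1, 0], [31, 0, 1], [36, 1, 1], [25, 0, 2], [30, 0, 0], [36, 1, 0],
     [25, 1, 0], [36, 0, 1], [30, 1, 2], [31, 1, 1], [45, 1, 1]]
y = ["Yes", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each internal node tests an attribute, each branch is a test outcome,
# and each leaf holds a class, as described above.
print(export_text(tree, feature_names=["age", "is_sedan", "children"]))
print(tree.predict([[28, 1, 0]]))  # predicted class for a new record
```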
A typical decision tree is shown below, where the tree represents magazine subscriptions by people, with information on their age, car type, number of children and subscription status. Internal nodes are denoted by rectangles, leaf nodes by ovals. The table below is sample data for the subscription.

TABLE 2. Sample training data for magazine subscription (adapted from Ye 2003, 5)

ID  Age  Car     Children  Subscription
1   23   Sedan   0         Yes
2   31   Sports  1         No
3   36   Sedan   1         No
4   25   Truck   2         No
5   30   Sports  0         No
6   36   Sedan   0         No
7   25   Sedan   0         Yes
8   36   Truck   1         No
9   30   Sedan   2         Yes
10  31   Sedan   1         Yes
11  45   Sedan   1         Yes

GRAPH 5. Decision tree for magazine subscription (adapted from Ye 2003, 6). The root node tests Age (<=30 or >30), the internal nodes test Car Type and Number of Children, and the leaf nodes give the Yes/No subscription class.

4.3 Clustering

Clustering basically aims to place objects into groups such that records in a group are similar to each other and dissimilar to records in other groups, and these groups are said to be disjoint. Clustering analysis is also called segmentation or taxonomy analysis, as it tries to identify homogeneous subclasses of cases in a given population. (Elmasri & Navathe 2007, 964.) Some of the approaches to cluster analysis are discussed below.

Hierarchical clustering permits users to choose a definition of distance, then select a linking method for forming clusters, after which the number of clusters that best suits the data is estimated. This approach to clustering creates a representation of clusters in icicle plots and dendrograms. A dendrogram is defined as a binary tree with a distinguished root that has all the data items at its leaves. (Cios et al. 2007, 260.)

K-means clustering simply requires the user to indicate the number of clusters in advance, after which the algorithm estimates how to assign cases to the k clusters. K-means clustering is less computer-intensive, hence it is often preferred when data sets are large, say over a thousand cases, and it creates a table showing the mean-square error, called an ANOVA table. (Garson 2010.)

Two-step clustering generates pre-clusters and then clusters the pre-clusters using hierarchical methods. This approach handles very high volumes of data, and it has the largest array of output options, including variable plots. (Garson 2010.)

The graph 6 below is a diagrammatic representation of clustering, where people are placed into clusters based on their income levels and age.

GRAPH 6. Output of clustering people on income basis (adapted from MacLennan et al. 2009, 7). The graph plots age against income, with the cases grouped into three clusters.

4.4 Data mining algorithms

Data mining algorithms are the mechanisms which create the data mining model, which is the main phase of the data mining process. In the subsequent subsections the algorithms will be discussed.

4.4.1 Naïve Bayes algorithm

Collecting frequent item sets involves considering all possible item sets, computing their support, and checking whether it is higher than the minimum support threshold. The naïve algorithm involves searching a high number of item sets while scanning many transactions each time, which makes the number of tests that need to be conducted exponentially high, thus causing problems and excessive time consumption. Due to this shortcoming of the algorithm, there was a need for the birth of other, more efficient algorithms. (Han et al. 2005, 296.)
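To make the exponential growth referred to above concrete: for a set of n distinct items there are 2^n - 1 possible non-empty item-sets, so even a modest n = 100 items already yields about 1.27 x 10^30 candidate item-sets, far too many to count support for directly. This is the motivation for the Apriori, sampling, frequent-pattern tree and partition algorithms discussed in the following subsections.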
An example for demonstrating the algorithm involves having several tiny particles of two colors, red and green, as shown in the graph below. The particles are classified as either red or green. Classifying new cases and telling which class they belong to can easily be done based on the graph.

GRAPH 7. Naïve Bayes classifier (adapted from Statsoft 2011)

It is obvious that the green particles are twice as many as the red particles, hence on handling a new case which has not been seen before, it is twice as likely that the particle will belong to the green group rather than the red group. (Statsoft 2011.)

4.4.2 Apriori algorithm

This algorithm applies prior knowledge of an important attribute of frequent item-sets. The Apriori property of any item-set declares that all non-empty subsets of a frequent item-set have to be frequent; hence where a given item-set is not frequent (if it does not meet the minimum support threshold), then all supersets of this item-set will also not be frequent, since a superset cannot occur more frequently than the original item-set. (Cios et al. 2007, 295.)

4.4.3 Sampling algorithm

This algorithm is basically about taking a small sample of the main database of transactions, then establishing the frequent item-sets from the sample. Where such frequent item-sets form a superset of the frequent item-sets of the whole database, one can determine the real frequent item-sets by scrutinizing the remainder of the database in order to determine the exact support values of the superset item-sets. This is basically a form of the Apriori algorithm, though with a lowered minimum support. (Elmasri & Navathe 2007, 952.)

Second scans of the database are usually required because of cases of missed item-sets, and determining whether there were any missed item-sets gave room for the idea of the negative border, which, in relation to a set of frequent item-sets S and a set of items I, is the set of minimal item-sets contained in the power set of I but not in S. In a nutshell, the negative border of a set of frequent item-sets consists of the closest item-sets that could possibly be frequent. (Elmasri & Navathe 2007, 952.)

Consider an example having a set of items I = {A, B, C, D, E} and let the combined frequent item-sets of size 1 to 3 be S = {{A}, {B}, {C}, {D}, {AB}, {AC}, {BC}, {AD}, {CD}, {ABC}}. Here the negative border is {{E}, {BD}, {ACD}}. The set {E} is the only 1-item-set not contained in S, {BD} is the only 2-item-set not in S whose 1-item-set subsets all are in S, and {ACD} is the only 3-item-set whose 2-item-set subsets are all in S. The negative border is important, since it is necessary to determine the support for the item-sets in the negative border to ensure that no large item-sets are missed. (Elmasri & Navathe 2007, 953.)

4.4.4 Frequent-pattern tree algorithm

This algorithm also came into being due to the fact that the Apriori algorithm involves creating and testing a huge number of candidate item-sets. This algorithm eliminates the creation of such large candidate item-sets. A compressed sample of the database is first created, based on the frequent-pattern tree; this tree keeps useful item-set information and gives an avenue for the efficient finding of frequent item-sets. The main mining process is divided into smaller tasks, each of which operates on a conditional frequent-pattern tree, which is a branch of the main tree. (Elmasri & Navathe 2007, 955.)
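To make the candidate-generation idea behind the Apriori property more concrete, the following is a minimal, hedged sketch of level-wise frequent item-set search in Python; the transactions and the minimum support threshold are invented for illustration only:

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C", "D"}, {"B", "C"}]
min_support = 2  # absolute count, chosen arbitrarily for the example

def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent, size = [], 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count support and keep only candidates that meet the threshold.
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support]
        frequent.extend(level)
        size += 1
        # Apriori property: a (k+1)-item-set can only be frequent if all of
        # its k-item subsets are frequent, so candidates are built from them.
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == size
                           and all(frozenset(s) in level
                                   for s in combinations(a | b, size - 1))})
    return frequent

print(frequent_itemsets(transactions, min_support))
# e.g. [{'A'}, {'B'}, {'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}]
```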
4.4.5 Partition algorithm

The partition algorithm operates by splitting the database into non-overlapping subsets, which are treated as separate databases, and all large item-sets for a partition, called local frequent item-sets, are created in one pass, after which the Apriori algorithm can be applied efficiently on each partition if it fits into primary memory. Partitions are chosen such that each partition can be accommodated in main memory, hence each is read only once. (Elmasri & Navathe 2007, 957.)

The main shortcoming of this algorithm is that the minimum support for each partition depends on the size of the partition rather than on the size of the main database for large item-sets. After the first scan, a union of the frequent item-sets from every partition is taken, forming the global candidate frequent item-sets for the whole database. The global candidate large item-sets found in the first scan are confirmed in the second scan, when their support is measured for the whole database. This algorithm can be implemented in a parallel or distributed manner for enhanced performance. (Elmasri & Navathe 2007, 957.)

4.4.6 Regression

Regression is a special application of the classification rule. If a classification rule is regarded as a function over the variables that maps these variables into a target class variable, the rule is called a regression rule. A common application of regression occurs when, in place of mapping a tuple of data from a relation to a specific class, the value of a variable is predicted based on the tuple itself. (Elmasri & Navathe 2007, 967.)

Regression involves smoothing data by fitting the data to a function. It can be linear or multiple; linear regression involves finding the best line to fit two variables so that one can be used to predict the other, while multiple regression deals with more than two variables. (Han et al. 2005, 321.) For example, where there is a single categorical predictor such as female or male, a legitimate regression analysis has been undertaken if one compares two income histograms, one for men and one for women. (Berk 2003.)

4.4.7 Neural networks

This is a technique derived from artificial intelligence which uses generalized regression and provides an iterative method to carry it out. It operates using a curve-fitting approach to infer a function from a given sample. It is a learning approach which uses a test sample for initial learning and inference. Neural networks are placed into two classes, namely supervised and unsupervised networks. Adaptive methods that try to reduce the output error are supervised learning methods, while unsupervised learning methods develop internal representations without sample outputs. (Elmasri & Navathe 2007, 968-969.) Neural networks can be used where some information is known and one would like to infer some unknown information. An example is stock market prediction, where last week's and today's stock prices are known, but one wants to know tomorrow's stock prices. (Statsoft 2011.)

4.4.8 Genetic algorithm

Genetic algorithms are a class of randomized search procedures capable of adaptive and robust search over a large range of search space topologies. They were developed by John Holland in the 1960s. A genetic algorithm applies the techniques of evolution, depending on the optimization of functions, in artificial intelligence to generate a solution: it develops a population of possible solutions to a problem domain, takes the better solutions and recombines them to create a new generation of solutions, and lastly uses the new solutions to replace the poorer of the original ones, after which the whole cycle is run again. (Hill 2011.)
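The generate-evaluate-recombine cycle just described can be sketched in a few lines of Python; this is a hedged toy example that maximizes a trivial fitness function over bit strings, and every name and parameter in it is illustrative rather than taken from the sources cited above:

```python
import random

random.seed(0)

def fitness(bits):
    # Toy objective: prefer solutions with many 1-bits.
    return sum(bits)

def evolve(pop_size=20, length=16, generations=30):
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half of the current generation as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)   # single-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:           # occasional mutation
                i = random.randrange(length)
                child[i] ^= 1
            children.append(child)
        # The new solutions replace the poorer of the originals.
        population = parents + children
    return max(population, key=fitness)

print(fitness(evolve()))  # best fitness found, at most 16
```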
It applies the techniques of evolution, dependent on optimization of functions in artificial intelligence to generate some solution, by simply developing a sample of possible solutions to some problem domain, then taking out solutions that are better and gathering them together to create a new domain of solutions, and lastly using the new solutions to replace the poorer of the original, after which the whole cycle is done again (Hill 2011.) 31 The solutions generated by genetic algorithms are differentiated from that of other techniques because genetic algorithms use a set of solution during each generation instead of a single solution. The memory of the search done is represented by the set of solutions at disposal for a generation. It finds near optimal balance between knowledge gain and exploitation by manipulating encoded solutions. Genetic algorithm is a randomized algorithm unlike other algorithms, and its ability to solve problems in parallel makes it powerful in data mining (Elmasri &Navathe 2007, 969.) 32 5 APPLIED DATA MINING The process of evaluating data mining could only be complete after a practical demonstration is done. In the process of trying to carry out a practical mining task, several other mining devices were used in the course of this project, and they range from the Ibm intelligent miner, Estard miner, SQL server warehouse, and SQL server using Microsoft Excel 2007, but due to logistic, and administrative limitations in using most of the mining devices, I chose to use the SQL server cum Microsoft Excel 2007 for the mining task, since it provides a trial version which grants much access with lesser administrative requirements. Data Acquisition Application Data Preparation Validation Modeling GRAPH 8. Data mining process (adapted from MacLennan et al. 2009, 188) The graph above simply shows the steps involved in a simple data mining process. However for the purpose of this thesis work, the data acquisition and preparation phases were skipped since a ready data was gotten from a repository. 5.1 Data mining environment In an attempt to demonstrate the data mining process, the mining software that was chosen is the Microsoft excel 2007 which is been used in conjunction with 33 Microsoft SQL server. The exact SQL server that was used is the Microsoft SQL server 2008, though other earlier versions exist too. The server is Microsoft‟s enterprise-class database solution. It consists of four components namely the: database engine, analysis services, integration services and the reporting services, and these four components work together to create business intelligence (Ganas 2009, 2.) The database engine basically facilitates the storage of data in tables and allows users to analyze the given data using commands in SQL language. Considering the database engine from the business intelligence perspective, its primary function is the storage of collected data, and it has the capacity to store hundreds of gigabytes of data which are also be termed as “data warehouse” or “data mart”. (Ganas 2009, 2.) The Analysis Services part of the SQL server is responsible for the analysis of data using Online Analytical Processing (OLAP) cubes and the data mining algorithms. The cube is fundamentally a pre-meditated pivot table. It is located in the server and it stores the raw data, along with pre-calculated data, in a multidimensional format. The data in an OLAP cube could be accessed using Excel pivot table. 
OLAP cubes are valuable since they make it easy and convenient for users to handle and analyze extremely large amounts of data. (Ganas 2009, 2.) The other component of SQL Server is Integration Services, which basically extracts, transforms and loads data. Its primary purpose is to transfer data between different storage formats, as in an instance where it pulls data out of an Excel file and uploads it into an SQL Server table. It is also a data cleaning tool, which makes it really relevant, since dirty data can make it difficult to develop valid statistical models. One of the cleaning techniques included in Integration Services is fuzzy logic, which is applied to clean data by isolating and removing questionable values. (Ganas 2009, 2-3.)

Reporting Services is the fourth component of SQL Server; it is a web-based reporting tool. Basically it creates a web page where users can see reports that were generated using the data in the SQL Server tables. It also includes a web application known as Report Builder, which allows users to create ad hoc reports without knowing the SQL language. This facility enhances the easy transmission of business intelligence to several users. (Ganas 2009, 3.)

5.2 Installing the SQL server

After adhering to the installation requirements, the installation wizard was used, as it allows the user to specify which features are to be installed, the installation location, and the administrator privileges. The wizard is also used to grant users access to the components of the server. (MacLennan et al. 2009, 16.)

5.3 Data mining add-ins for Microsoft Office 2007

The data mining add-ins for Microsoft Office 2007 enable the exploitation of the full potential of SQL Server, hence an instance of the add-ins has to be installed on a machine which already has SQL Server and all its components installed. The add-ins comprise three parts, namely the Table Analysis Tools, the Data Mining Client, and the Data Mining Templates for Visio. They also include the Server Configuration Utility, which handles the details of configuring the add-ins and the connection to Analysis Services. (MacLennan et al. 2009, 16.) The Data Mining Client for Excel 2007 allows users to build complex models in Excel while processing those models on a high-performance server running Analysis Services, and this reduces the time and effort required to extract and analyze information from ordinary raw data using the most powerful algorithms available. (Ganas 2009, 3.)

5.4 Installing the add-ins for Excel 2007

The component of the add-ins which is of utmost importance in this case of data mining is the Data Mining Client (which is basically installed on a computer), as it acts as a link between the user and the SQL server running Analysis Services. The client architecture allows multiple users, each working on his own computer, to benefit from the power of a single Analysis Services server. Installing the add-ins on a computer and specifying the name of the server running Analysis Services is relatively straightforward; a wizard guides users through the installation process, provided that the user has administrative rights to all aspects of the server and the add-ins. (MacLennan et al. 2009, 20.)

5.5 Connecting to the analysis service

On clicking the Analyze menu, there is also the option of connecting to an Analysis Services instance, which must be done to facilitate a successful evaluation of the given data.
The connection button was labelled <No connection>. On clicking it, the Analysis Services connection dialog box appeared, requesting a server name and an instance name, and the given server name and instance name were entered. The graph below shows an instance of the connection phase. By default, it is recommended that the Windows authentication button is used in connecting to Analysis Services, because the service supports only Windows-authenticated clients. (MacLennan et al. 2009, 21.)

GRAPH 9. Connecting to analysis service

5.6 Effect of the add-ins

Upon completion of the installation of the add-ins on the computer, the data mining ribbon can readily be seen in the menu of a launched Microsoft Office Excel 2007 spreadsheet. The icons on the ribbon are arranged in a logical and organized fashion that mimics the order of tasks in a typical data mining process. They include data preparation, which comprises explore data, clean data and partition data, and which basically acquires and prepares data for analysis. (Mann 2007, 29.) There is also the data modeling functionality, which offers a choice of algorithms to use in the data mining. The accuracy and validation option aids in testing and validating the mining model against real data before deploying the model into production. The model usage option allows one to query and browse the Analysis Services server for existing mining models. The management option enables the management of mining models, such as renaming, deleting, clearing, reprocessing, exporting or importing models. (Mann 2007, 29.)

The other icon on the worksheet is the connection tab, which facilitates the connection to a server; it was discussed in detail in the previous subsection. Once a connection has been successfully made to the analysis service, if some data is added to the spreadsheet and the whole or part of the data is selected and formatted as a table using the "format as table" option from the menu, the data adopts the new selected format and the "Table Tools" option appears as shown in the graph below; it consists of the Analyze and Design options on the menu. (Mann 2007, 30.) On clicking the Analyze button, eight functionalities of the analysis services appear, which make it possible to perform tasks without any need to know anything about the underlying data mining algorithms. They include:

· Analyze Key Influencers
· Detect Categories
· Fill From Example
· Forecast
· Highlight Exceptions
· Scenario Analysis
· Prediction Calculator
· Shopping Basket Analysis

GRAPH 10. Formatting an Excel sheet

The other option, the Design tab, provides the options to remove duplicates, convert to range and summarize with a pivot table; it also provides the option of exporting the formatted data.

GRAPH 11. The analysis ribbons

5.6.1 Analyze key influencers

The analyze key influencers tool, when applied, brings out the relationship between all the other chosen columns of a given table and one specified column, and then produces a report showing in detail which of the columns have a major influence on the specified column and how that influence manifests itself. Its implementation generates a temporary mining model using the Naïve Bayes algorithm. (MacLennan et al. 2009, 22).
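As an illustration of the underlying idea (a minimal sketch with invented column names, not the add-in's actual implementation), the following Python fragment estimates, for each value of one input column, how strongly that value is associated with a chosen state of the target column by comparing conditional relative frequencies; the add-in derives comparable influence scores from its Naïve Bayes model.

from collections import Counter

# Toy data: each row is (income, kids, purchases); the column names are invented.
rows = [
    ("high", 0, "weekly"), ("high", 2, "weekly"), ("low", 1, "monthly"),
    ("low", 3, "monthly"), ("high", 1, "weekly"), ("low", 0, "monthly"),
]

def influence_scores(rows, column_index, target_index, target_value):
    """Relative frequency of the target state for each value of one input column."""
    totals, hits = Counter(), Counter()
    for row in rows:
        value = row[column_index]
        totals[value] += 1
        if row[target_index] == target_value:
            hits[value] += 1
    return {value: hits[value] / totals[value] for value in totals}

# How strongly does each income level favour weekly purchases?
print(influence_scores(rows, column_index=0, target_index=2, target_value="weekly"))

An input value whose relative frequency for the target state is far above or below the overall average can be read as a key influencer of that state.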
Take for instance a table having columns such as annual income, geographic location, number of kids and purchases: on applying the tool and selecting the purchases column, it is possible to determine whether it is income that plays the major role in the purchases an individual makes, or one of the other columns. Upon specifying the target column, the analyze key influencers button is clicked, and a dialog box pops up requesting the selection of the column to be analyzed for key factors; there is also an option to restrict the other columns that may be used in the analysis. (MacLennan et al 2009, 23). Once the selection has been made, the run button is clicked and within seconds a report is displayed, such as the one in the graph below.

GRAPH 12. Key influencer report

The graph above demonstrates that the income column has the greatest influence on the weekly purchases, as indicated by the favors column of the report, although the number of kids also has considerable influence in a few instances. (MacLennan et al. 2009, 23).

5.6.2 Detect categories

Handling a huge amount of data can be cumbersome, hence it is more convenient to regroup the data into smaller categories, so that the elements in each category are similar enough to be treated as one group. The tool applies the clustering algorithm, thus making data analysis convenient and easy. It detects groups in the given data after analyzing it, places the rows into groups based on their similarities, and also reports the details which prompted the creation of each category. (MacLennan et al. 2009, 29.)

Applying the detect categories functionality of the table analysis tools simply involves selecting and formatting the given data, after which the analyze ribbon is displayed; on clicking the detect categories button, a dialog box pops up. The box displays the option of selecting the columns from the data that the user would like to analyze, and the user can do so by checking or un-checking any column of his choice. (MacLennan et al. 2009, 29.) There is also the option of the tool appending the detected category column to the original Excel table, which is checked by default, and thirdly the option of selecting the number of categories, which is set to auto-detect by default. (MacLennan et al. 2009, 29-30.)

On clicking run in the dialog box, the process completes within a few seconds and the results are displayed. The displayed result, called the categories report, has three parts, the first showing the created categories and the number of rows in each, as shown in the graph below.

GRAPH 13. Categories created in category report

The second part of the category report shows the features of each category ranked by their relevance and importance. It is a table with four columns: the first column holds the category name, the second the feature taken from the original column name, the third the value, and the fourth the relative importance, showing how significant the feature is to the created category. (MacLennan et al. 2009, 31.) A simple graph below shows how the category characteristics look; a minimal sketch of the clustering idea on which the tool is based is also given after the graph.

GRAPH 14. Category report characteristics
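The sketch below is only an illustration of the clustering idea, using scikit-learn's KMeans on a few invented numeric rows; the add-in itself builds its categories with the clustering algorithm on the Analysis Services server.

import numpy as np
from sklearn.cluster import KMeans

# Toy numeric attributes for a handful of customers (invented values):
# columns: annual income (in thousands), number of children, commute distance (miles).
customers = np.array([
    [30.0, 2, 1.0], [32.0, 3, 2.0], [90.0, 0, 10.0],
    [95.0, 1, 12.0], [28.0, 2, 1.5], [88.0, 0, 9.0],
])

# Group the rows into two categories of similar customers.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(customers)
print(labels)                  # category index assigned to each row
print(model.cluster_centers_)  # typical attribute values of each category

Rows that receive the same label form one category, and the cluster centres hint at the characteristics that distinguish the categories, much like the second part of the categories report.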
The third part of the category report is the category profiles. It appears as bar charts, showing the distribution of each of the original data characteristics over all the generated categories. Each of the bars in the chart contains more than one color, the segments denoting the proportion of rows with a given value in the category. The color legend on the right-hand side shows clearly what proportion of the feature falls into each category. The generated categories can be renamed as the user wishes. (MacLennan et al 2009, 33.)

5.6.3 Fill from example tool

This data mining tool has an auto-fill capability, in the sense that it is able to learn from any given example data and automatically generate subsequent values based on the trends and relationships in the examples. It operates only on columns of the Excel spreadsheet, provided at least a couple of example values have been given in the target column. The reliability of the result of this tool depends mainly on the amount of sample data given in the target column; hence, the more sample values there are, the more reliable the tool's result, and vice versa. (MacLennan et al. 2009, 35.)

On selecting and formatting the given data, the table analysis tool option appeared as expected and the fill from example tool button was clicked. A dialog box came up, showing the option of which column to select for the sample data task; the tool usually suggests a likely column, but it is still possible to choose a different target column from the suggested one. There is also the option of choosing which columns to use for the analysis in conjunction with the specified column. On clicking the run button, the process is completed, a pattern report for the specified column is generated on a new Excel worksheet, and a new column is appended at the end of the original sheet, showing the original and newly generated values of the completed column. (Brennan 2011.) The generated report has four columns showing the original column names, their values, whether they favorably impact the target column or not, and, in the last column, the relative effect illustrated with horizontal bars.

GRAPH 15. Completed table after fill from example process

The results of the fill from example process can be refined by carrying out the process again if the displayed result is not close to the expected one, and the refining can be repeated several times until the desired result is obtained. (MacLennan et al. 2009, 39.)

5.6.4 Forecast tool

The forecast tool is able to recognize the patterns that operate in a given series and extrapolate them, producing forecasts for the subsequent evolution of the series. The main patterns discovered in the analysis include trends (the behavior of the series' evolution), periodicity (the consistency of event intervals) and cross-correlations (the relationship between values in different series). (MacLennan et al. 2009, 40.)

Once the data has been formatted and the forecast button is clicked, the forecast dialog box appears, displaying the option for selecting the columns to use in prediction; there is also the option to specify the number of time units to be forecast. The other tabs on the dialog box are the time stamp and the periodicity drop-down boxes. On clicking the run button, the tool applies the algorithm and within seconds the forecast values can be seen, highlighted at the bottom of the columns in the table. There is also a graph which shows the original series being analyzed as solid lines, while the forecast evolution is represented by broken lines. (MacLennan et al. 2009, 43.) A minimal sketch of this kind of trend extrapolation is given below.
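The fragment below is only a sketch with invented values: it fits a simple linear trend in Python and extrapolates it five time units ahead. The actual tool also models periodicity and cross-correlations, which this sketch ignores.

import numpy as np

# A short monthly series (invented values), e.g. units sold per month.
series = np.array([120.0, 132.0, 141.0, 155.0, 162.0, 174.0])
months = np.arange(len(series))

# Fit a linear trend and extrapolate five further time units,
# the same number of units forecast in the example that follows.
slope, intercept = np.polyfit(months, series, 1)
future = np.arange(len(series), len(series) + 5)
forecast = slope * future + intercept
print(forecast.round(1))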
In the example of the graph below, five time units were requested, so the forecasting generated five new rows for each column.

GRAPH 16. Result generated after forecasting

5.6.5 Highlight exceptions tool

This tool detects anomalies in given data. Any row in the data table that does not follow the pattern of the other rows is highlighted. Such discrepancies can result from mistakes made during data entry or by the Excel AutoFill feature. There can also be instances of correct data values which, because they do not match the general pattern, are flagged as anomalies and are therefore of much interest. The tool is a good cleaning tool, since it is able to detect and replace such anomalies. (Ganas 2009, 8.)

On selecting and formatting the data in Excel, the analysis tools options are displayed, and on clicking the highlight exceptions tool, a dialog box is shown providing the option of selecting the columns to be used in the analysis; columns having unique values, such as an ID column, are usually unchecked by default. Once the run button is clicked, the tool processes the data, after which it highlights each anomalous row in a different color; it also generates a new Excel sheet showing a report with the details of the anomalies. (MacLennan et al. 2009, 45.) The graph below shows an example of a highlight exceptions process, in which a male skilled manual worker, having just one child and a commute distance of one mile, is flagged as an anomaly.

GRAPH 17. Report of exceptions from a highlight exception process

5.6.6 Scenario analysis tool

This tool is basically applied in the sensitivity analysis of simulations. The "goal seek" and "what-if" options of the tool demonstrate how model results behave in response to changes in the input data. The goal seek option shows how some or all of the input data would need to be modified in order to attain a certain expected result; for example, an insurance company may need to know what annual income a customer should have in order to be considered a trustworthy customer. It is similar to asking what the value in column A should be so that column B would have the value C. (Ganas 2009, 7-8.) The what-if option of the tool helps the user to see what the model result would be upon altering one of the input values; hence it shows the effect of an input variable change on the outcome. (Ganas 2009, 8.)

The two options of scenario analysis can be performed either on a single row of a table or on the whole table. The main principle involved is showing the effect that one or more other columns have on a target column. This option of the table analysis tools is different because, unlike the others, it does not present its result on a separate spreadsheet; rather, the result is displayed in the same dialog box where the user indicated the target row and the modifier. (MacLennan et al. 2009, 59.)

After the table had been selected and formatted and the table analysis tools were displayed, the target row was selected and the scenario analysis tool arrow was clicked, showing the two options of goal seek and what-if. Once the goal seek option was clicked, a dialog box popped up, showing options for indicating the target column, the desired state and the column to be modified. Once the three decisions were made and the run button clicked, the dialog box showed a message of either success or failure, the likely value, and whether the confidence was good or low. (MacLennan et al. 2009, 59.) A minimal sketch of the goal-seek idea is given below.
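The following Python fragment is a sketch of the goal-seek idea only: the scoring rule and the numbers are invented stand-ins, whereas the add-in searches against its own mining model. It looks for the largest value of the modified column for which the rule still predicts the desired outcome.

def predicts_purchase(income, children, commute_distance):
    # Invented stand-in for a trained mining model.
    score = 0.02 * income - 0.5 * children - 0.3 * commute_distance
    return score > 0.0

income, children = 60.0, 1

# Goal seek as a brute-force search: scan commute distances from 20.0 down to 0.0 miles.
for commute_distance in [d / 2 for d in range(40, -1, -1)]:
    if predicts_purchase(income, children, commute_distance):
        print("goal reached at commute distance", commute_distance)
        break
else:
    print("no value of the modified column reaches the goal")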
The graph below shows an outcome of the scenario analysis tool used on some given data, where the target column was the "purchased bike" column, the required condition chosen was yes, and the modified column was the commute distance, which was over 10 miles but after the goal-seeking process was changed to a range of 0 to 1. (MacLennan et al. 2009, 60.)

GRAPH 18. Goal seek scenario analysis

The what-if option of the scenario analysis is similar in its operation, except that, on clicking the option from the analysis tool menu, the displayed dialog box has the option of choosing a scenario, that is which column to make changes to and from what value; then there is the "what happens to" option for choosing which column the effect should be observed on; and the third point is where the user indicates whether the task is for a single row or for the whole table. (MacLennan et al. 2009, 60.) Once the options have been made and the run button is clicked, the result appears in the same dialog box. The same task can be carried out on the whole table by simply indicating the whole-table option and the target column; the result will then be a column at the right end of the table, showing the outcome for every row. (MacLennan et al. 2009, 60.)

The graph below shows an example of a what-if task, with the children column as the input variable, altered from 0 to 1, and the purchased bike column as the target. The report showed success, implying that on changing the number of children an individual has from 0 to 1, he or she becomes more likely to purchase a bike.

GRAPH 19. "What-if" scenario analysis

5.6.7 Prediction calculator

This tool is an example of an end-user tool that integrates data mining technology. (Ganas 2009, 9). It is an easy and convenient device for making predictions, and it does not need to be connected to a server in order to function. It uses the logistic regression algorithm, and it helps in determining the best possible conditions in order to minimize or eliminate the costs arising from wrong predictions while maximizing the benefits associated with correct predictions. It operates on a binary target, that is, a column with only two possible states. (MacLennan et al. 2009, 63.) It is similar to the key influencer tool, except that it considers and displays a value depicting the impact of every column, whether weak or strong, on the target column, after which the total effect is obtained by summing up the values assigned to the columns. (MacLennan et al. 2009, 64.)

Once the data table has been formatted, the prediction calculator option can be seen on the analyze tools ribbon. On clicking the button, a dialog box appeared, showing the option of indicating the target column; the next option was to indicate whether the target is a range of values (for a continuous column) or an exact value such as yes or no, as in the sample data used in this work. (MacLennan et al. 2009, 64.) The option selected was "yes", as shown in the graph below, so as to see the effect of the other columns on the purchased bike column. The other option in the dialog box was that of choosing the columns to be used in the analysis, other than those automatically chosen by the tool. The box also presents the choice of reports to be produced, ranging from the operational calculator and the prediction calculator to the print-ready report. Once all inputs were selected, the run button was clicked, and reports were generated in three new spreadsheets. (MacLennan et al. 2009, 65.)

GRAPH 20. Prediction calculator dialog box

A minimal sketch of the scoring idea behind the tool is given below.
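The fragment below is only a sketch of that scoring idea in Python, with invented data: a logistic regression model is fitted and its coefficients are read as additive points per attribute, with the decision made by comparing the total score against a threshold. The actual report rescales and presents its scores differently.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: [age, children, commute miles] -> purchased bike (1/0).
X = np.array([[25, 0, 1], [40, 2, 10], [35, 1, 2], [50, 3, 15], [30, 0, 3], [45, 2, 12]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Read the coefficients as additive "points" per attribute, and compare the
# total score of a new person against the decision threshold of 0.
points = model.coef_[0]
person = np.array([28, 1, 2])
total = float(points @ person + model.intercept_[0])
print("score", round(total, 2), "->", "purchase" if total > 0 else "no purchase")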
Some sections of the generated prediction calculator spreadsheet can be modified to achieve certain results. The outcomes of the prediction calculator are categorized into four classes, namely the false positive cost, the false negative cost, the true positive cost and the true negative cost, as shown in the graph below.

GRAPH 21. Prediction calculator report

The prediction calculator report, being one of the generated outcomes of the prediction process, is really important, since it contains three columns showing the original column names, the highest attainable value and their relative impact on the target column. It is in this report that a threshold is generated to guide the user as to the expected reliability of a prediction. The values of the original columns can be modified to attain a certain goal, but if the total of the attribute values does not reach the given threshold, then the prediction is not reliable. (MacLennan et al. 2009, 66.)

5.6.8 Shopping basket analysis

This analysis uses the association rules algorithm. Although its name implies that it is applied to goods and services, it can also be applied in the medical field, where such analyses are used to identify people who are likely to have undiagnosed health problems. Insurance companies also apply the algorithm in certain situations. The application of the association algorithm generates an if-then statement coupled with some degree of accuracy, for instance: if a customer buys items x and y, then M percent of the time he will also buy item z. (Ganas 2009, 9.)

On formatting the given data, the shopping basket analysis button was clicked and a dialog box appeared, with the options of choosing the transaction ID from a drop-down box, the item selection and, optionally, the item value. Once the input variables were right, the run button was clicked and the analysis was done within a few seconds. The results of the process are reports generated in two Excel spreadsheets, namely the shopping basket bundled items and the shopping basket recommendations. The former ranks the items that were purchased together most often, showing the frequency of occurrence, the average value and the overall value of all transactions involving each item bundle. (MacLennan et al. 2009, 76.) The item bundles are presented in rows, as shown in the graph below, where the combination of road bikes and helmets occurred most often, as depicted by the horizontal bar.

GRAPH 22. Shopping basket bundled items report

However, making reasonable use of the analysis also depends on the generated shopping basket recommendation table, since it displays the best possible items to combine with selected items, based on their linked sales and the overall value of those linked sales, as shown in the graph below.

GRAPH 23. Shopping basket recommendation

A minimal sketch of the support and confidence measures behind such rules is given below.
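The following Python fragment is a sketch only, with invented baskets: it counts how often pairs of items occur together and derives the support and confidence of simple "if X then Y" rules, which are the measures an association rules algorithm builds on.

from itertools import combinations
from collections import Counter

# Invented transactions: each set is one shopping basket.
baskets = [
    {"road bike", "helmet"}, {"road bike", "helmet", "bottle"},
    {"mountain bike", "helmet"}, {"road bike", "bottle"},
    {"road bike", "helmet"},
]

pair_counts, item_counts = Counter(), Counter()
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# "If a customer buys X, then C per cent of the time he will also buy Y."
n = len(baskets)
for (x, y), count in pair_counts.items():
    support = count / n
    confidence = count / item_counts[x]
    print(f"{x} -> {y}: support {support:.2f}, confidence {confidence:.2f}")

Rules with high confidence and sufficient support are the ones worth turning into recommendations of the kind shown in the report above.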
6 ANALYSIS SCENARIO AND RESULT

The data chosen to further demonstrate the role of data mining in decision making with the Microsoft Excel data mining add-ins was obtained from the machine learning repository. (Asuncion & Newman 2007.) The main analysis tools available in Microsoft Excel were applied to the data, which had fifteen attributes as columns and several hundred rows. The results obtained are shown in the following graphs.

GRAPH 24. Output of the key influencers tool on the data, showing marital status as having a positive relative impact on a person belonging to the "less" class in the data

GRAPH 25. Result of key influencers on "class", showing a marital status of "never married" positively favoring a person belonging to the "less" class

GRAPH 26. Output of the detect categories tool, showing education as having the highest relative importance for the class of an individual

GRAPH 27. Bar chart showing the categories created after analysis of the people in the data, placing them into the "more" and "less" classes, represented by red and blue as indicated on the right-hand side of the charts

GRAPH 28. Report showing that most exceptions in the data occur in the marital status column

GRAPH 29. Exceptions in the data being marked out

GRAPH 30. Output of the fill from example task, indicated in the rightmost column, showing the classes to which certain people would belong based on their other attributes

GRAPH 31. Output of the forecasting of age, education and hours per week of people, indicated in the lower five rows of the table

GRAPH 32. (cont.) Forecast of age, education level and hours per week of persons from the given data after a given time period, based on their previous hours per week of work and other attributes

GRAPH 33. Report of the prediction calculator taking "workclass" as the target column to see the relative impact of the other columns on it

GRAPH 34. Table showing the threshold of the prediction calculator, indicating that if the sum of the relative impacts for a chosen person is less than 47, the prediction is false

7 CONCLUSION

Data mining is a technology which discovers and extracts hidden trends and patterns from huge amounts of data. The goal of the thesis was to evaluate data mining in theory and in practice. Achieving this goal involved reviewing the stages of the process and the algorithms employed, carrying out a practical task and analyzing the results. The several algorithms employed in discovering useful and relevant knowledge from large data sources are basically simple and efficient. Most of the data mining software available on the market employs most or almost all of the relevant algorithms, and even though the tools use different instructions in handling and manipulating them, the results of employing the same algorithm in different data mining tools on similar data are often similar.

It is necessary to note that, owing to the choice of analysis tool used in this work, the early stages of data mining, which involve data collection, data preparation, data selection and data transformation, were not carried out, since the repository source of the data used had already undergone those processes. However, the most relevant phase of data mining is the modeling phase, and that is what the chosen tool basically does. Attempts to collect real data from businesses and people based on certain attributes proved abortive, as people were reluctant to provide some of the information for personal reasons. I also tried taking data from the Finnish statistics website, but on trying to build a model that could be easily and successfully analyzed with the Microsoft Excel analysis tools, the results generated were almost meaningless, since most of the data was time related and there were only insignificant changes in the statistics, which does not favor good data modeling.
On successfully using the repository data with the Microsoft Excel analysis tools, several attempts to use the same data with some other data mining tools were futile, as each of the mining tools has its own specification for the format in which data must be for a successful mining or analysis task; owing to time constraints, only a little acquaintance was made with such tools. However, based on the Microsoft Excel data mining add-ins used in this work, it can clearly be seen that if an organization can successfully collect and prepare its usually large data in the right format to be analyzed with this tool, the outcome of the analysis is clear and easy to understand, hence aiding any decisions that may be made. I am highly convinced that the efficiency and accuracy of data mining as a process, with its tools and algorithms, surpasses unaided human reasoning or skill, and that any decision made from the output of a data mining task has a high likelihood of success.

REFERENCES

Andy, P. 2011. Available: http://www.the-datamine.com/bin/view/Software/AllDataMiningSoftware. Accessed 30 March 2011.

Asuncion, A. & Newman, D. 2007. UCI machine learning repository. Available: http://archive.ics.uci.edu/ml/. Accessed 20 April 2011.

Berk, R. 2003. Data mining within a regression framework. Available: http://preprints.stat.ucla.edu/371/regmine.pdf. Accessed 02 April 2011.

Bramer, M. 2007. Principles of data mining. London: Springer.

Brennan, M. 2011. Data mining add-ins for Office 2007. Video. Available: http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx. Accessed 08 April 2011.

Cios, K. J., Pedrycz, W., Swiniarski, R. W. & Kurgan, L. A. 2007. Data mining: A knowledge discovery approach. New York: Springer.

CRISP 2011. CRISP (Cross Industry Standard Process for Data Mining). Available: http://www.crisp-dm.org/CRISPWP-0800.pdf. Accessed 20 April 2011.

Data mining software 2011. A brief history of data mining. Available: http://www.data-mining-software.com/data_mining_history.htm. Accessed 21 February 2011.

Elmasri, R. & Navathe, S. 2007. Fundamentals of database systems. New York: Addison Wesley.

Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. 1996. Advances in knowledge discovery and data mining. Menlo Park, California: American Association for Artificial Intelligence (AAAI) Press.

Ganas, S. 2009. Data mining with predictive modeling with Excel 2007. Available: http://www.casact.org/pubs/forum/10spforum/Ganas.pdf. Accessed 15 March 2011.

Garcia, M. E. B. 2006. Mining your business in retail with IBM DB2 Intelligent Miner. Available: http://www.ibm.com/developerworks/data/library/tutorials/iminer/iminer.html. Accessed 15 February 2011.

Garson, D. 2011. Cluster analysis. Available: http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm. Accessed 17 March 2011.

Han, J., Kamber, M. & Pei, J. 2001. Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.

Han, J., Kamber, M. & Pei, J. 2005. Data mining: Concepts and techniques. 2nd ed. San Francisco: Morgan Kaufmann.

Hand, D., Mannila, H. & Smyth, P. 2001. Principles of data mining. Cambridge, Massachusetts: MIT Press.

Hill, T. 2011. Genetic algorithms. Available: http://www.pcai.com/web/ai_info/genetic_algorithms.html. Accessed 18 March 2011.

Inmon, W. H. 2005. Building the data warehouse. Indianapolis: Wiley Publishing, Inc.

Larose, T. D. 2004. Discovering knowledge in data: Introduction to data mining. New Jersey: John Wiley & Sons Inc.

Lew, A. & Mauch, H. 2010.
Dynamic programming: A computational tool (Studies in Computational Intelligence). Berlin: Springer.

MacLennan, J., Tang, Z. & Crivat, B. 2009. Data mining with Microsoft SQL Server 2008. Indianapolis: Wiley Publishing Inc.

Mann, A. T. 2007. Microsoft Office 2007 system business intelligence integration. New York: Mann Publishing.

Mitra, S. & Acharya, T. 2003. Data mining: Multimedia, soft computing and bioinformatics. New Jersey: John Wiley & Sons, Inc.

Mueller, J. A. & Lemke, F. 1999. Self-organizing data mining: An intelligent approach to extract knowledge from data. Berlin: Dresden.

Pal, N. & Jain, L. 2005. Advanced techniques in knowledge discovery and data mining. London: Springer.

Ramakrishnan, R. & Gehrke, J. 2003. Database management systems. New York: McGraw-Hill.

Sumathi, S. & Esakkirajan, S. 2007. Fundamentals of relational database management systems. New York: Springer.

Sumathi, S. & Sivanandam, S. N. 2006. Introduction to data mining and its application. New York: Springer.

Thearling, K. 2010. Introduction to data mining. Available: http://www.thearling.com/text/dmwhite/dmwhite.htm. Accessed 20 April 2011.

Webopedia 2011. Database. Available: http://www.webopedia.com/TERM/D/database.html. Accessed 21 April 2011.

Witten, I. H. & Frank, E. 2005. Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann.

Ye, N. (ed.) 2003. Handbook of data mining. New Jersey: Lawrence Erlbaum Associates.

APPENDIX 1

OTHER DATA MINING TOOLS

IBM DB2

Data mining with the IBM Intelligent Miner approaches the task as a sequence of basic steps: problem definition, data exploration, data preparation, modeling, evaluation and deployment. In the data exploration phase, the data is selected. The data tables and views were taken into account and collected. The sample data used related to transactions and purchases, and since the aim was to determine customer behavior, it was necessary to identify which tables or views in the database contained all the information needed; that data was used to generate models in the subsequent phase. (Garcia 2003, 10.)

Demonstrating the data mining process with the IBM Intelligent Miner required the installation and configuration of the IBM DB2 InfoSphere Warehouse software, containing the Intelligent Miner, scoring and visualization components. The software contains the server that houses the database which is to be analyzed. (Garcia 2003, 3.) Once the software had been installed, the data had to be uploaded to the database, and that was done using the console window of the installed application, shown in the graph below. In this graph, the name of the database being connected to is retail, and the server is DB2/NT 9.7.2.

Console showing connection to the database

After the connection had been confirmed to be successful, the control centre of the installed IBM DB2 was used to view the tables in the database. The graphs below show views of the relevant tables in the database.

A retail table view from the DB2 database

Articles tables from the DB2 control centre

View during the DB2 data preparation phase

During the modeling phase, the association rule model was built and the support, confidence and lift were specified. The next phase was the evaluation phase, in which the Intelligent Miner visualization application was used to look at the results and to evaluate whether the model was good or not. (Garcia 2003, 20).
The graph below shows the Intelligent Miner visualization connection to the DB2 database.

IBM DB2 Intelligent Miner connection interface

Upon launching the application, the name of the database had to be supplied, after which the connection was tested by clicking the connect button, which requested a user ID and password. Owing to password and user limits in the trial copies of the database and mining software, the data mining process using the DB2 Intelligent Miner was not completed.

Tanagra data mining tool

Basic view of the user interface of the Tanagra mining tool

View of the Tanagra tool during data loading

Weka data mining tool

Weka startup environment

Weka tool interface on loading data

View of the Weka tool, showing file selection options

Weka tool analysis display

Weka analysis output

Weka cluster analysis output

RapidMiner interface screenshots