Data Mining Algorithms and Tools

Kgosi Tshetlhoyagae
BSc in Information Systems
Session (2001/2002)

Project Summary

Most organisations today produce an electronic record of every transaction they are involved in. In large organisations, this results in millions of records being produced every day. Many organisations are now going online to join the e-business bandwagon, and this will result in huge amounts of data being accumulated as the Internet connects many sources of data. The accumulated data is very important in today’s competitive world: it is used to gain an edge over competitors through a process called data mining, which can be described as the extraction of useful information from large databases. Because data mining is a new area, many sophisticated algorithms and tools have been developed for it. This project explores the current data mining algorithms and tools in an attempt to find out which algorithms are best suited to which problems and how the tools compare. In addition to the research into algorithms and techniques, a practical data mining exercise was carried out on datasets accumulated by an online retailer.

Acknowledgements

Many people were involved in the completion of the project and the production of this report. I take this opportunity to show my greatest gratitude and to say that without your contribution this project would not have been completed. Thank you!

Prof. Martin Berzins, my project supervisor, for his support, advice, guidance, encouragement and feedback, which was always valuable.
Dr Stuart Robert, my project assessor, for his feedback on the mid-year report and progress meeting, which was very constructive and helped me to do some tasks that I thought I couldn’t do.
Prof. Ross Quinlan, for offering me an evaluation licence to use the See5 evaluation software in my data mining exercise.
Dr Liu Bing, for giving me permission to download the CBA software, which was used for evaluation.
The KDDCUP website administrators, for giving me access to their datasets, which were used in this project.
My friends I Khan and T Mosweu, for always encouraging me to go on.
S Khupe, for asking for the second evaluation licence.
Finally, I would like to dedicate this project report to my wife Boi, for always being there for me when my stress level was above the limit.

Table of Contents

1.0 Problem Definition
1.1 The Problem
1.2 Project Aim
1.3 Project Approach
1.4 Project Objectives
1.5 Project Requirements
2.0 Introduction
2.1 What is data mining?
2.2 Data mining styles
3.0 Data Mining Techniques
3.1 Automatic Cluster Detection
3.2 Decision Trees
3.3 Neural Networks
3.4 Market Basket Analysis
3.5 Memory-Based Reasoning
3.6 Link Analysis
3.7 Genetic Algorithms
3.8 Online Analytic Processing
3.9 Regression algorithms
3.10 Data Visualization
4.0 Data Mining Algorithms
4.1 C4.5
4.2 CART
4.3 Apriori Algorithms
4.4 K-Means
4.5 CURE
4.6 CHAID
4.7 ID3
4.8 SLIQ
4.9 C5.0/See5
4.10 CBA
5.0 Evaluation of Data Mining Techniques and Algorithms
5.1 Evaluation of Data Mining Techniques
5.2 Evaluation of Data Mining Algorithms
6.0 Potential applications
7.0 Case Study
7.1 Overview
7.2 Extracting information from the datasets
7.3 Evaluation and Interpretation of results
8.0 Current Issues in data mining
8.1 Individual privacy
8.2 Data integrity
8.3 Data Size
8.4 Data Noise
8.5 Technical issues
8.6 Cost issues
9.0 Future Requirements
Reference List
Appendix A - Personal Evaluation
Appendix B - Project Plan and Schedule
Appendix C - Project specification
Appendix D – Emails received
Appendix E – Sample of the Customers.data file records
Appendix F – Sample of the Customers.names file attributes
Appendix G – Answers to questions asked
Appendix H – Project Log

Chapter 1 Problem Definition

1.1 The Problem

Data mining is a fast-growing area of interest, and many sophisticated tools have therefore been developed to perform it. The area has gained recognition as an important tool for both business and industry, and the information gleaned from huge databases of past transactions allows more focussed decision-making. For example, in retailing the information allows customer relationship management to be targeted to individuals.
As more tools are developed, most organisations are also trying to find out how they can exploit this area and use the data they have been accumulating for years to improve the services they provide, increase business opportunities or simply understand customers’ behaviour. This project intends to answer most of the questions that organisations following this new area might be asking themselves about the data mining tools and algorithms that are around today. The full specification of this project is in Appendix C.

1.2 Project Aim

The aim of this project was to provide an analysis of some of the data mining tools and techniques currently in use. The aim was to be met by following the project approach discussed below.

1.3 Project Approach

A theoretical and practical approach was taken to tackle the project. The theoretical phase involved intensive exploration of the data mining literature and research on the Internet. This part took more time, as there was more information related to data mining on the Internet. Books and journals were also searched for any information they could provide. The practical phase of the project took the form of a case study, which involved extracting patterns from a dataset using a decision tree technique. The datasets used were accumulated by an on-line retailer, and a data mining tool called See5 was used to find the patterns in them. The practical approach took the form of directed data mining, whereby the miner knows what result he/she is looking for from the data mining exercise.

1.4 Project Objectives

The objectives of the project are as follows;
• To make a general introduction to data mining.
• To examine which data mining algorithms are most effective for mining data that has been acquired through website visits.
• To apply a data mining technique to analyse and determine useful classifications within the dataset.
• To extend the range of questions that are relevant to the datasets.
• To discuss current issues in data mining and future requirements.

1.5 Project Requirements

The project’s minimum requirements are to meet the above-mentioned objectives, and it is hoped that the quality of the report will achieve a mark above a minimum pass.

Chapter 2 Introduction to Data Mining

2.1 What is Data Mining?

“Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases. Data mining has captured the imagination of the business and academic worlds, in fact 80% of the Fortune 500 companies are currently involved in a data mining pilot project or have already deployed one or more data mining production systems” [1].

Data mining draws on a wide range of disciplines, as quoted above, and many definitions have therefore been offered to explain what it is. The definitions made so far are essentially the same, and data mining can be summarised as the process of exploring and analysing large quantities of data stored in databases in order to find useful correlations or meaningful patterns. The data to be analysed can be collected from a number of sources, such as web sites, electronic point of sale systems and data warehouses.

Data mining can help online retailers to;
• Increase page views per session.
• Increase shopping basket contents per checkout.
• Increase the number of referred customers/visitors.
• Retain their old customers.
• Promote their brands.
• Increase revenue and reduce costs.

Data mining has a life cycle, which is made up of the following four main business processes;
1. Identifying the business problem.
2. Transforming data into actionable results.
3. Acting on the results.
4. Measuring the results.
The above business processes are achieved through the following data mining stages. A diagram (fig 2.0), adapted from [5], shows how the process is carried out.
• Selection - selecting data according to some criteria.
• Pre-processing/Data Cleansing - eliminating errors that may be in the data.
• Transformation – making the data useable and navigable by adding overlays such as demographics.
• Data mining – the actual extraction of meaningful patterns and correlations from the selected data.
• Interpretation and Evaluation - this is where the identified patterns are interpreted into knowledge.

Fig 2.0

2.2 Data Mining Styles.

Directed data mining.
The directed style of data mining is used when we know what we are looking for and can direct the data mining effort towards getting the most accurate result possible. This style takes a hypothesis from a user and tests its validity against the data. The user is responsible for formulating the hypothesis and issuing the query on the data. The style takes the form of a predictive model, because it makes predictions about the unknown based on the known, and the user always knows the format of the output. A predictive model can answer questions such as;
- What is the right medical treatment, based on past experience?
- Which customers are likely to leave in the next six months?
The goal in prediction is to learn from the past, and to learn in such a way that the knowledge can be applied to the future. The problem with directed data mining is that it does not create any new information in the retrieval process; it only returns records that verify or negate the hypothesis.

Undirected data mining.
Undirected data mining finds patterns in the data and leaves it to the user to determine whether or not these patterns are important. The data is sifted in search of frequently occurring patterns, trends and generalisations, without intervention from the user.
The data is searched with no hypothesis in mind other than for the system to group the facts according to the common characteristics found. An example of undirected data mining would be a bank database that is mined to discover how many groups of customers to target in a mailing campaign.

Chapter 3 Data Mining Techniques

Data mining software analyses relationships and patterns in stored data based on open-ended user queries. The software extracts meaningful new information from data by performing any one of the following activities.
1. Classification: This is where stored data is used to locate data in pre-determined groups. It consists of examining the features of a record and assigning it a class. For example, an online retailer could mine customer purchase data to determine when customers visit the website and what they buy. The information could be used to increase traffic by having promotions or specials on some goods.
2. Clustering: This is where the data items are grouped according to logical relationships or, in the online retailing example, customer preferences. It segments a group of records into a number of more similar subgroups. The records are grouped together on the basis of self-similarity. Unlike classification, clustering does not rely on predefined classes. For example, data can be mined to identify market segments in a town.
3. Association: Here data is mined to identify associations; this is sometimes called affinity grouping. The idea is to determine what things go together. “Association rules are really no different from classification rules except that they can predict any attribute, not just the class” [2]. Affinity grouping can be used by on-line retailers to plan how to display items on their web pages. Through associations they can know what to display with what. For example, men’s shirts may be displayed on the same page as ties.
4. Sequential Patterns: This is where data is mined to anticipate behaviour patterns and trends. For example, a retailer could predict the likelihood of a sleeping bag being purchased based on a customer’s purchase of a backpack and hiking shoes.
5. Estimation: Here the given input data is used to come up with a value for some unknown continuous variable. For example, if a person’s income is not known, an estimator can identify other variables that correlate well with income, such as location (residential address), car preference and job title, then find other people with the same traits and use them to estimate the income along with a confidence value. It is used a lot in neural networks, discussed later in this report.
6. Prediction: This is where the data items are classified according to some predicted future behaviour. Prediction guesses a future value, such as the probability of a person passing a module he/she has not yet taken, or the probability of a customer leaving within 3 months.

Of all the activities discussed above, classification is the only example of directed data mining, as its goal is to use the available data to build a model that describes one particular variable of interest in terms of the rest of the available data. The other activities are examples of undirected data mining, as their goal is to establish some relationship among all the variables. No variable is singled out as the target.

To perform the above activities, a data mining technique such as Market Basket Analysis, Memory-Based Reasoning, Link Analysis, Cluster Detection, Decision Trees or Artificial Neural Networks has to be used. Data mining techniques are conceptual approaches to extracting information from data. The technique used is usually determined by the goals of the data mining exercise and the types of data involved. For example, the goal of predictive data mining is to automate a decision-making process by creating a model capable of making a prediction, while the goal of descriptive data mining is to gain an increased understanding of what is happening inside the data.
The types of the data to be used also determine the technique to be used, e.g. numeric data will require an algorithm that produces numeric output. The techniques for data mining are discussed below and, as the project did not evaluate each technique by using it in practice, the strengths and weaknesses of the techniques have been adopted from [6].

3.1 Automatic Cluster Detection.

The technique uses mathematics to find clusters in data. For example, divisive methods consider all records to be part of one big cluster and break the cluster up until each record has a cluster to itself, while agglomerative methods start with each record occupying its own cluster and combine the clusters until there is one big cluster containing all the records; this is the reverse of the divisive methods. Automatic cluster detection is used for undirected data mining and can therefore be applied without prior knowledge of the structure to be discovered.

The strengths of automatic cluster detection [6] are that it works with categorical, numeric and textual data, and that it is easy to apply, as it requires little preparation of the input data. The weaknesses of the technique are that the clusters generated are not guaranteed to have any practical value; it is up to the miner to interpret the clusters. The fact that the miner does not know what he/she is looking for makes it difficult for him/her to recognise it when he/she finds it. It is also not easy to choose the right weights and measures, and the technique is sensitive to initial parameters: in k-means, for instance, the initial value chosen for k is important because it determines the number of clusters.

Automatic cluster detection is useful when there are competing patterns in the data that make it hard to spot any single pattern; creating clusters of similar records reduces the complexity within each cluster.
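The agglomerative method described above can be illustrated with a minimal sketch. It is not taken from [6]: the function name, the one-dimensional data and the use of the distance between the closest members of two clusters (single linkage) as the merge criterion are all assumptions made for illustration. Merging stops at a requested number of clusters rather than continuing to a single cluster.

```python
def agglomerative(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    # Start with every record occupying a cluster of its own.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumed measure: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Combine the closest pair into one cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical customer ages, grouped into three clusters.
ages = [18, 19, 21, 40, 42, 65]
result = agglomerative(ages, 3)
print(result)
```

Calling the function with a target of 1 reproduces the end state described in the text, one big cluster containing all the records.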
Automatic cluster detection is used with “large complex data sets with many variables and a lot of internal structure” [6].

3.2 Decision Trees.

When a decision tree technique is applied to data, each record flows through the tree along a path determined by a series of tests until the record reaches a leaf (terminal node) of the tree and is given a class label. Each branch of a decision tree is a test on a single variable that cuts the space into two or more pieces. For example, the School of Computing might want to know how many students have received a 2:1 degree since 1990. Getting an answer using a decision tree might involve asking a series of yes-or-no questions such as graduate year > 1990, Department = Computing, degree classification = 2:1.

There are two types of decision trees;
Classification trees – these trees label records and assign them to the proper class. They can also provide the confidence that the classification is correct.
Regression trees - these trees estimate the value of a target variable that takes on numeric values.

Decision trees are built by recursive partitioning, which is an iterative process of splitting the data up into partitions. The initial split produces two nodes, each of which is then split in the same way as the root node. When no split can be found that increases the purity of a given node, that node is called a leaf node. A process called pruning, which removes leaves and branches from the decision tree, can vastly improve its performance.

A decision tree technique is used in directed data mining, where the miner knows the fields being targeted and what to expect as the results. Decision trees do not discover rules that involve a relationship between variables, so it is the miner’s responsibility to add derived variables to express relationships that are likely to be useful.
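One step of the recursive partitioning described above can be sketched as follows. The text does not say how purity is measured, so this sketch assumes Gini impurity, one commonly used measure. The function names and the small dataset are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Find the best test 'value <= threshold' on a single variable.

    Returns (threshold, weighted_impurity), or None when no split is
    purer than the unsplit node, in which case the node is a leaf.
    """
    parent = gini(labels)
    best = None
    # Splitting at the largest value would leave one side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < parent and (best is None or score < best[1]):
            best = (t, score)
    return best


# Hypothetical records: amount spent, and whether the customer bought again.
spend = [10, 20, 30, 80, 90, 100]
bought = ["no", "no", "no", "yes", "yes", "yes"]
split = best_split(spend, bought)
print(split)
```

A full tree builder would apply `best_split` again to each of the two partitions, stopping at nodes for which it returns None.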
The strengths of the technique [6] are said to be its ability to generate understandable rules that can be translated into English, the way it performs classification without too much computation, and the fact that it provides a clear indication of which fields are important for classification; these fields are usually placed at the root node of the tree. Its weaknesses are said to be that it is not good for estimation tasks, where the goal is to predict the value of a continuous variable, and that it is not good for time-series data.

Decision tree methods are good when the tree’s task is the classification of records or the prediction of outcomes. Decision trees should be used when the goal of data mining is to assign each record to one of a few categories, or to generate rules that can be easily understood and translated into natural language.

3.3 Neural Networks.

Neural networks have an input layer and an output layer, and each of the inputs gets its own network node. The actual values of the input variables are not fed into the input layer, only some transformation of them. Each input node is connected to the hidden layer with a weight (wi), which is a coefficient. In the hidden layer the weighted inputs are combined using a combination function and then passed to a transfer function, the result of which is the output of the unit. The combination and transfer functions together make up the layer’s activation function, and the resulting value from the output node’s activation function is some transformation of the actual output. Refer to figure 3.1 below, which was adapted from [4] and edited to show how customers may be evaluated for credit risk.

fig 3.1

Training a neural network is the process of setting the weights on the inputs of each of the units in such a way that the network does the best job of predicting the target variable.
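The combination and transfer functions described above can be sketched for a small network with one hidden layer. This is an illustration under stated assumptions, not the network of figure 3.1. The combination function is assumed to be a weighted sum, the transfer function is assumed to be the logistic (sigmoid) function, and all weights and input values are made up. Training, that is choosing the weights, is not shown.

```python
import math


def combine(inputs, weights):
    """Combination function (assumed): weighted sum of the inputs."""
    return sum(x * w for x, w in zip(inputs, weights))


def transfer(z):
    """Transfer function (assumed): logistic curve, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def unit_output(inputs, weights):
    """Activation of one unit: combination followed by transfer."""
    return transfer(combine(inputs, weights))


def forward(inputs, hidden_weights, output_weights):
    """Feed the inputs through the hidden layer, then the output unit."""
    hidden = [unit_output(inputs, w) for w in hidden_weights]
    return unit_output(hidden, output_weights)


# Hypothetical customer attributes, already transformed to the range 0..1.
customer = [0.5, 0.2, 0.9]
# Made-up weights: two hidden units with three inputs each, one output unit.
hidden_weights = [[0.4, -0.6, 0.2], [-0.3, 0.8, 0.5]]
output_weights = [1.2, -0.7]

score = forward(customer, hidden_weights, output_weights)
print(score)
```

The printed score is itself a transformed value, so it would need to be mapped back to a meaningful scale, for example a credit risk rating, before being used.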
Neural networks can produce good predictions but are not easy to use, mainly because of the data preparation required to get good results. The results are also difficult to understand, because neural networks do not produce rules. Neural networks are good for most classification and prediction tasks when the results of the model are more important than understanding how the model works. The technique does not work well when there are many input features, as a large number of features makes it more difficult to find patterns.

3.4 Market Basket Analysis (MBA)

When an ordinary person goes to Tesco to do his/her grocery shopping, he/she might not be aware that he/she is leaving behind data that might be used to plan the layout of the shop. Finding groups of items that occur together in a transaction is called Market Basket Analysis. The technique builds rules that recognise products that are bought together in a transaction. MBA is an undirected style of data mining and is usually used on problems where the goal is to know which items form clusters. It has found much use in the retailing industry, where the results are used in store layouts, product promotions etc. MBA is good at analysing point-of-sale transactions. For example, having determined which products are likely to be bought together, the store manager might decide to place the products next to each other, or to offer promotions on some products while raising prices on their basket associates.

The strengths of MBA are that it produces clear and understandable results, uses simple computations and works on variable-length data. Its weaknesses are that it discounts exceptional items in the basket and that it requires more computation as the size of the problem grows.
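A rule of the kind MBA builds can be scored with two standard association-rule measures, support and confidence. These measures are not named in the text above and are introduced here only for illustration. The baskets and function names in this sketch are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in 'items'."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent (rule: antecedent -> consequent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # The rule never applies in this data.
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical point-of-sale transactions of variable length.
baskets = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt"},
    {"tie"},
]

print(support(baskets, {"shirt", "tie"}))       # bought together in half the baskets
print(confidence(baskets, {"shirt"}, {"tie"}))  # share of shirt buyers who also buy a tie
```

Counting every candidate group of items in this way is what makes the computation grow with the size of the problem, as noted among the weaknesses above.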
3.5 Memory-Based Reasoning (MBR)

The School of Computing may predict the number of first-class degrees it will award in an academic year based on its finalists’ first-semester examination results and the previous year’s results. In making such a prediction it would be applying a technique called memory-based reasoning, which uses known instances to make predictions about unknown instances. The technique assigns each new case to the class to which most of its neighbours belong, by employing the k-nearest neighbour algorithm, and can run on any source of data. Memory-based reasoning is a directed data mining technique for classification and prediction and has been used successfully in fraud detection and in medical treatment, where patients may be prescribed medicines based on other patients’ records. MBR’s results are easy to understand. The technique can work on almost any number of attributes and can be applied to non-relational data. Its main weaknesses are that it requires a large amount of storage for the training dataset, and that the choice of combination function and number of neighbours can influence the results.

3.6 Link Analysis

Link Analysis is a technique that follows relationships between records to develop patterns in them. When buying goods on the Internet you usually leave your personal details at the checkout, for delivery of and payment for the goods. Information such as the delivery address is sometimes used to target you for marketing purposes. For example, receiving discount offers from an on-line retail shop from which you once bought something is common these days. Link Analysis has been used successfully in telephone call pattern analysis, whereby a telephone number is selected, the numbers it dials are observed, and a link is made between it and each number it dials. The tools used in the technique do not find any patterns themselves but assist people in discovering knowledge.

Link analysis has the following strengths;
• Good for linked data.
• Aids knowledge discovery by direct visualization of the links.
And the following weaknesses;
• Not applicable to most types of data.
• Its implementations in relational databases are not that efficient.

3.7 Genetic Algorithms

According to [6], genetic algorithms apply the mechanics of genetics and natural selection to a search for the optimal set of parameters that describe a predictive function. The technique uses the selection, crossover and mutation operators to evolve successive generations of solutions. The search for the optimal solution is similar to the evolution of a population of organisms, where each organism is represented by a set of its chromosomes. This evolution is driven by three mechanisms: selection of the strongest - those sets of chromosomes that characterise the most optimal solutions; crossbreeding - production of new organisms by mixing the sets of chromosomes of the parents; and mutation - accidental changes of genes in some organisms of the population. After a number of new generations have been built with the help of these mechanisms, one obtains a solution that cannot be improved any further. This solution is taken as the final one. As the generations evolve, only the most predictive survive, until the functions converge on an optimal solution. Genetic algorithms are a directed style of data mining and can be used to improve the Memory-Based Reasoning and Neural Network techniques. At present, genetic algorithms should be considered more as an instrument for scientific research than as a tool for generic practical data analysis.

The strengths of the technique are as follows;
• It produces explainable results.
• The results from using the technique are easy to apply.
• It is able to handle a wide range of data types.
• It is applicable to optimisation problems.
The weaknesses of the technique are that;
• Many problems are difficult to encode.
• It does not guarantee optimality.
• It is computationally expensive.
• Only a specialist can develop a criterion for the chromosome selection and formulate the problem effectively.

3.8 Online Analytic Processing (OLAP).

Online Analytic Processing is not an actual data mining technique but a presentation tool that enables manual knowledge discovery; it depends on human intelligence for discovering the knowledge. OLAP tools are client-server tools that have an advanced interface connected to an efficient representation of the data called a cube. The cube allows users to slice and dice the data in any way they like. For example, by using OLAP tools a retailer might discover that certain products sell better during a particular period of the year. This might lead to an investigation using the Market Basket Analysis technique to find other items purchased with that item.

The strengths of OLAP have been said to be the following [6];
• It is a powerful visualization tool.
• It provides fast, interactive response times.
• It is good for analysing time series.
• It can also be used to find clusters and outliers.
The weaknesses are as follows;
• It does not handle continuous variables well.
• Setting up a cube can be difficult.
• Cubes can easily become out-of-date.

3.9 Regression Algorithms.

Regression algorithms are based on searching for a dependence of the target variable on other variables in the form of a function of some predetermined form. For example, in the group attribute accounting method, a dependence is sought in the form of polynomials. Such methods should provide solutions with greater statistical significance than neural networks do. The formula obtained, a polynomial, is in principle more suitable for analysis and interpretation. This method therefore has a better chance of providing reliable solutions in applications such as financial markets or medical diagnostics.

3.10 Data Visualization.

Data visualization techniques are good at manipulating sampled and computed data for comprehensive display.
The goal of visualization is to bring the user a deeper understanding of the data, as well as of the underlying physical laws and properties. Such visualization may be used to enlighten a physicist on the complex interaction between electrons, to guide a medical practitioner during surgery, or to view the surface of a planet that has never been seen by human eyes. Visualizations can be static or in motion, and can provide visual explanations of algorithms or general information.

Chapter 4 Data Mining Algorithms

Data mining algorithms are the step-by-step details of a particular way of implementing a data mining technique.

4.1 C4.5

This algorithm generates a classification decision tree for a dataset by recursive partitioning of the data. The tree is grown using a depth-first strategy. C4.5 considers all the possible tests that can split the dataset and selects the test that gives the best information gain. For each discrete attribute, one test with as many outcomes as the attribute has distinct values is considered; for each continuous attribute, binary tests involving every distinct value of the attribute are considered. All files read and written by C4.5 are of the form filestem.ext, where filestem is the file name stem that identifies the induction task and ext is an extension that defines the type of file.

The C4.5 algorithm can generate trees in two ways. In iterative mode, the program starts with a randomly selected subset of the data and generates a trial decision tree, adds some of the misclassified objects to the subset, and continues until the trial decision tree correctly classifies all objects not in the subset. In batch mode, which is usually the default, the program generates a single tree using all the available data. The trees generated are saved as filestem.unpruned and, after each is generated, it is pruned in an attempt to simplify it.

4.2 CART (Classification and Regression Trees).
CART is a data mining algorithm that automatically searches for important patterns and relationships and rapidly uncovers hidden structures. The discovered knowledge is used to generate accurate and reliable predictive models for applications such as detecting credit-card fraud, profiling customers and targeting direct mailings. CART is more accurate at classifying new data than conventional stepwise procedures such as linear regression and logistic regression. The algorithm includes reliable estimates of error rates and is robust to outliers. There is no need to transform independent variables. Both categorical and continuous target variables are supported: classification trees predict the values of categorical variables, and regression trees predict the values of continuous variables. CART’s binary decision trees are more careful with data and detect more structure before too little data is left for learning. CART also has embedded test disciplines that ensure that the patterns found hold up when applied to new data. The algorithm handles missing values in the database by substituting “surrogate splitters”, which are backup rules that closely mimic the action of the primary splitting rules. CART accommodates situations in which some misclassifications are more serious than others by enabling users to specify penalties for misclassifying certain data. See figure 4.0 below, adapted from [4].

fig 4.0

4.3 Apriori Algorithm.

The Apriori algorithm is an association rule algorithm that was developed for mining large transaction databases. The algorithm uses itemsets, which are non-empty sets of items. Apriori works by making multiple passes over a database. In the first pass, it counts item occurrences to determine the frequent itemsets with one item. A subsequent pass, say pass x, consists of two phases. First, the set of all frequent (x – 1)-itemsets found in the (x – 1)th pass is used to generate the candidate itemsets Cx, using the Apriorigen() function.
In the second phase, the algorithm scans the database and, for each transaction, determines which of the candidates in Cx are contained in the transaction using a hash-tree data structure, and increments the count of those candidates. The Apriori algorithm has several variants, such as AprioriAll and AprioriSome.

4.4 K-Means Algorithm

K-Means is a clustering algorithm that works best when the input data is numeric. The algorithm divides the dataset into a predetermined number of clusters. The k in the name of the algorithm is the number of clusters, while “means” refers to the average location of all the members of a particular cluster. The original choice of a value for k determines the number of clusters that will be found. Each data point is assigned to the cluster with the nearest mean; the algorithm then computes the new mean for each cluster, and this is iterated until a criterion function converges. The algorithm is not applicable to categorical data and is very sensitive to outliers.

4.5 Clustering Using Representatives (CURE)

CURE is a clustering algorithm that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size. The algorithm achieves this by representing each cluster by a fixed number of points, which are generated by selecting well-scattered points from the cluster and then shrinking them towards the centre of the cluster by a specified shrinkage factor. Having multiple representative points enables clusters of unusual shapes to be represented better. The clusters with the closest pair of representative points are chosen to be merged; the distance between two clusters is defined as the minimum distance between any pair of points in their representative sets. The algorithm handles limited main memory by sampling.

4.6 Chi-Squared Automatic Interaction Detection (CHAID)

Chi-Squared Automatic Interaction Detection is an algorithm that builds decision trees by detecting statistical relationships between variables.
The algorithm is widely used because it is supplied as part of statistics packages such as SPSS and SAS. CHAID works only with categorical variables and attempts to stop growing the tree before overfitting occurs; this distinguishes it from other decision tree algorithms, as it does away with pruning. The algorithm grows the tree until no more splits are available that lead to statistically significant differences in the classification. The cut-off value used affects the size of the tree and its value as a classifier. Fig 4.1 below, adapted from [4], shows the CHAID algorithm in use.

Fig 4.1

4.7 ID3

The ID3 algorithm is a decision tree-building algorithm that determines the classification of items by testing the values of their properties. The algorithm builds the tree in a top-down manner, starting from a set of items and a specification of properties. At each node of the tree, a property is tested and the results are used to partition the item set. The process is repeated recursively until the set in a given sub-tree is consistent with respect to the classification criteria, i.e. it contains only items belonging to the same category. This then becomes a leaf node. At each node, the property to test is chosen using an information criterion that seeks to maximise information gain, that is, to minimise entropy.

4.8 Supervised Learning In Quest (SLIQ)

Supervised Learning In Quest is a decision tree classifier designed to classify large training data. The algorithm uses a pre-sorting technique in the tree-growth phase, which avoids costly sorting at each node. SLIQ keeps a separate sorted list for each continuous attribute and a separate list called the class list. An entry in the class list corresponds to a data item and holds a class label and the name of the node the item belongs to in the decision tree, while an entry in a sorted attribute list holds an attribute value and the index of the data item in the class list. SLIQ grows the decision tree in a breadth-first fashion.
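The two list structures just described can be sketched in Python as below. This is only an illustration of the data layout under stated assumptions, not SLIQ's actual implementation: the names ClassListEntry and build_sliq_lists, the root node name "N1" and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassListEntry:
    label: str  # class label of the data item
    node: str   # name of the tree node the item currently belongs to


def build_sliq_lists(records, labels, continuous_attrs):
    """Build the class list and one pre-sorted list per continuous attribute."""
    # Every data item starts at the root node (named "N1" here by assumption).
    class_list = [ClassListEntry(label=y, node="N1") for y in labels]

    # Each attribute list holds (attribute value, index into class_list)
    # and is sorted once by value, before any tree growth takes place.
    attribute_lists = {
        attr: sorted((rec[attr], i) for i, rec in enumerate(records))
        for attr in continuous_attrs
    }
    return class_list, attribute_lists


# Hypothetical customer records with two continuous attributes.
records = [
    {"age": 34, "spend": 20.0},
    {"age": 22, "spend": 75.5},
    {"age": 51, "spend": 12.0},
]
labels = ["True", "False", "True"]
class_list, attribute_lists = build_sliq_lists(records, labels, ["age", "spend"])
```

Because each attribute-list entry carries an index into the class list, a scan of one sorted list can look up the class label and current node of every item without re-sorting the data at each node.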
For each attribute, it scans the corresponding sorted list and calculates entropy values for each distinct value at all the nodes on the frontier of the decision tree. After the entropy values have been found for each attribute, one attribute is chosen for a split at each node on the current frontier, and the nodes are expanded to form a new frontier. One or more scans of the sorted attribute lists are then performed to update the class list for the new nodes.

4.9 C5.0/See5

C5.0 is an upgraded version of C4.5, which was discussed earlier in this report; it therefore works like C4.5 but offers additional features. The algorithm supports boosting with any number of trials; this may slow the algorithm down but can produce more accurate results. The algorithm allows a separate cost to be defined for each predicted/actual class pair. The cost option enables the algorithm to construct classifiers that minimise expected misclassification costs rather than error rates. “C5.0 is 214 times faster than C4.5 on the coding data, uses less than 10 % of the memory and produces a more accurate rule set” [11]. A Windows version, See5, is also available; it has a user-friendly graphical interface and is easy to use. For example, the cross-reference window makes classifiers more understandable by linking cases to the relevant parts of the classifier. See5 is the tool that was used to carry out the data mining exercise for this project.

4.10 Classification Based on Association (CBA)

Classification Based on Association is a data mining algorithm that integrates classification and association rule mining into one algorithm, which is more powerful and produces accurate classifiers for prediction. The algorithm can be used for mining various forms of association rules, and for text classification or categorization.
CBA builds accurate classifiers from relational data, where each record is described by a fixed number of attributes, and also from transactional data, where each data record has a variable number of items, for example the items bought by a customer on an online retail website. More on this algorithm will be discussed later in the report.

Chapter 5 Evaluation of Data Mining Techniques and Algorithms

5.1 Evaluation of Data Mining Techniques

“Averaging an algorithm's performance over all target concepts, assuming they are all equally likely, would be like averaging a car's performance over all possible terrain types, assuming they are all equally likely. This assumption is clearly wrong in practice; for a given domain, it is clear that not all concepts are equally probable. In medical domains, many measurements (attributes) that doctors have developed over the years tend to be independent: if the attributes are highly correlated, only one attribute will be chosen. In such domains, a certain class of learning algorithms might outperform others. For example, Naive-Bayes seems to be a good performer in medical domains (Kononenko 1993). Quinlan (1994) identifies families of parallel and sequential domains and claims that neural-networks are likely to perform well in parallel domains, while decision-tree algorithms are likely to perform well in sequential domains. Therefore, although a single induction algorithm cannot build the most accurate classifiers in all situations, some algorithms will perform better in specific domains” [13].

The quotation above agrees with what was noted previously in the report: the choice of a technique depends on a number of factors, such as its suitability for certain input data types, the transparency of the mining output, tolerance of missing variable values, the level of accuracy possible and the ability to handle large volumes of data.
After studying the many books, technical papers and white papers written on data mining, the author came up with the evaluation framework below to evaluate the techniques that have been discussed in this report. The criterion used for evaluating the techniques is simple and is related to the datasets used in the case study of the report. This part of the project was very difficult, as the author had to fill in the table based on what had been read and understood rather than on practical experience, which was gained only with decision trees. The results in the table below are therefore merely the author’s opinion and are open to criticism.

Technique Evaluation table

Technique                     How it works                          Can it be used with transactional data
                                                                    accumulated by online retail websites?
Decision Trees                Classification, Prediction            Yes
Artificial Neural Networks    Estimating, Prediction, Clustering    Yes
Memory-Based Reasoning        Classification, Prediction            Yes
On-Line Analytic Processing   Summarization, Presentation           Yes
Link Analysis                 Classification, Sequential patterns   Yes
Genetic Algorithms            Clustering                            No
Automatic Cluster Detection   Clustering                            Yes
Market Basket Analysis        Association, Prediction               Yes
Data Visualization            Deviation                             No
Regression                    Deviation                             No

Conclusion

From the table above it can be observed that the techniques that will work well with data accumulated by online retailers are the ones that do classification, prediction, association, clustering and estimation. This is because online retailers are mostly interested in customer relationship management, and therefore need to cluster their markets, associate the products that customers buy, predict their product sales and estimate their profits. The knowledge gained from this analysis can help them to satisfy and retain their customers.

5.2 Evaluation of Data Mining Algorithms
To evaluate the algorithms studied, the same evaluation framework was used as for the techniques. Again, the results are based on the author’s understanding and the literature explored. Hands-on experience was gained with See5 only, which is the tool that was used in the case study of the report. The table below shows how the algorithms fared.

Algorithm     How it works                         Can it be used with transactional data
                                                   accumulated by online retail websites?
C4.5          Classification                       Yes
CART          Prediction, Classification           Yes
Apriori       Sequential patterns, Associations    Yes
K-Means       Clustering                           Yes
ID3           Classification                       Yes
CURE          Clustering                           Yes
CHAID         Classification                       Yes
C5.0 / See5   Classification                       Yes
CBA           Classification, Association          Yes
SLIQ          Classification                       Yes

Conclusion

Surprisingly, when filling in the table it was noticed that all the algorithms chosen for investigation are able to work with transactional data accumulated by an online retailer, provided it has been pre-processed and well cleaned. The conclusion drawn from the exploration of data mining algorithms is that most of the algorithms can be used with any data that is suitable and in the correct format.

Chapter 6 Potential Applications

“Data mining is used by companies with a strong consumer focus- retail, financial, communication and marketing organisations. It enables these companies to determine relationships among internal factors such as price, product positioning, or staff skills, and external factors such as economic indicators, competition, and customer demographics. It enables them to determine the impact on sales, customer satisfaction, and corporate profits. Data mining also enables the companies to drill down into summary information to view detail transactional data.” [3]

As the quote above explains, data mining can be used in different sectors. It can be applied wherever large quantities of data are collected.
The following are some of the areas where it is applied.

• Retail / Marketing: Data mining can be used in retailing and marketing to identify customers’ buying patterns, predict the response to mailing campaigns, perform basket analysis and find associations among customer demographic characteristics.
• Banking: Banks can use data mining to identify loyal customers, predict which customers are likely to change their credit card affiliation and detect patterns of fraudulent credit card use.
• Insurance and Health care: Data mining can be used to analyse claims, predict which customers will buy new policies, identify fraudulent behaviour and identify patterns of risky customers.
• Transportation: Data mining can be used to determine distribution schedules among outlets and to analyse loading patterns.
• Medicine: Data mining can be used to identify successful medical therapies for different illnesses and to characterise patient behaviour in order to predict office visits.
• Education: Student databases can be mined to find correlations or patterns that lead to good grades; universities can then act on such information and improve their performance.
• Customer Relationship Management: Data mining can be used to allow companies to learn from their customers’ behaviour, that is, what customers prefer. This knowledge can enable them to satisfy and retain their customers.

The application of data mining in customer relationship management is proving to be one of the factors that make data mining popular, as every company is striving to woo new customers and retain its existing ones. Most online retailers perform data mining for customer relationship management and believe that through it they increase revenue and satisfy customers.

Chapter 7 Case Study

7.1 Overview

During the first project meeting with the project supervisor, it became clear that he did not want the project to be a research-only project.
He wanted an actual data mining exercise to be done. The supervisor asked the author to enquire whether there were any tools on the School of Computing machines that he could use, whether there were any datasets and whether any modules in the university taught data mining. The questions were forwarded to the project initiator by email and he responded with answers to most of them (see Appendix D for the reply). There were no datasets available, so the project initiator suggested getting some datasets from the KDDCUP website, and this was done within the same week. The author enrolled at the website and was given a UserID and password to download the data.

The datasets provided were accumulated by an online retailer called Gazelle.com, which deals in legwear and legcare products. Gazelle.com’s goal was to retain and attract more customers, even if it meant losing money in the short term. They therefore ran many promotions that were relevant for mining, since these affect traffic to the website, the type of customers, etc. There were 3 datasets, one for each question asked in the KDDCUP 2000 competition. The dataset for question 1 was a transactional one and the biggest of the 3; it was chosen because the supervisor had hinted that a bigger dataset would be ideal. The dataset had about 30,000 records and was about 297MB.

When downloading the datasets from [10], it was discovered that they were in extended C5.0 format, with .names and .data files, which meant that it would be easy to use them with the C5.0 software tool. A search for the software was done on the Internet and a free download of the demo was found at [11]. It was realised that C5.0 had a Windows version called See5, which was also available for downloading. The Windows version was preferred over the Unix version, C5.0, because of its graphical interface, and it was thought that it would be easy to use and learn within a short time.
It was also chosen because it could be used at home, as the author did not have Unix installed on his home computer. The tutorials for See5 were also downloaded from the same website and the data mining lessons were started. During familiarisation with the demo, it was realised that it read only 400 records of the dataset. This was communicated to the supervisor and it was agreed that, for the data mining phase of the project plan, an evaluation package of the software would have to be acquired to enable the mining exercise.

In February 2002, an evaluation licence was requested from [11] and a licence was offered by Ross Quinlan (see his reply in Appendix D), but it was valid for only 10 days, so a lot had to be done in ten days. The software was installed and the data mining started. The .data file seemed too big for the tool, as it appeared to hang while constructing the decision trees. The ten days were running out, it was obvious that the problem was the size of the .data file, and the author did not know how to split it. It was later decided that a smaller dataset be obtained from [10]. This dataset was much smaller, about 4.49MB, and had 1781 records. When this dataset was used, the program started working well.

During the progress meeting, the results of the data mining were shown to the Assessor and Supervisor and the reason for using a smaller dataset was explained. The Assessor suggested that the author should use a perl command to split the bigger dataset into training and test data or, alternatively, use cross-validation to reduce the errors in the small dataset. It was obvious after that meeting that the whole exercise had to be done again, but by then the evaluation licence had expired. The author asked for another licence but Ross Quinlan refused, saying that he had already been given one (see Appendix D). A friend of the author at Leeds Metropolitan University was asked to request a licence, which she did.
The only task left was to split the dataset, so the perl command random_select.pl was run and the dataset was split into two files: customers.data with 19999 records and customers.test with 10088 records. See Appendices E and F for the sample customers.data and customers.names files.

7.2 Extracting information from the datasets

The files were then loaded into the See5 program and the following questions were asked by targeting specific attributes. The author has tried to justify why each question was selected and how it can help Gazelle.com towards its goal of attracting and retaining customers. Ten-fold cross-validation was used for all the questions to help obtain more reliable error rates. The decision trees for the questions can be seen in Appendix G.

Question 1. Which website refers most customers to the website?

Most organisations advertise by having links to their website placed as click-throughs on other organisations’ websites. This service is usually paid for, so it is very important to know which websites are referring more customers and which are not, so that the organisation concerned can withdraw its adverts from the websites that are not productive and seek others. It is hoped that this question will be useful for online retailers. The targeted attribute is Session First Referrer Top 5. Below is the evaluation on training and test data.
    Evaluation on training data (19998 cases):

        Decision Tree
      ----------------
      Size      Errors
       460  3909(19.5%)   <<

      (a) (b) (c) (d) (e) (f)    <-classified as
      50 2 129                   (a): class http://www.mycoupons.com
      4 209 11 13 26 1413        (b): class http://www.fashionmall.com
      4 9 1233 8 10 406          (c): class http://www.gazelle.com
      13 239 41 1147             (d): class http://stores.shopnow.com
      7 1 43 386 299             (e): class http://www.winnie-cooper.com
      22 36 105 58 102 13972     (f): class Other

    *** line 10089 of `Customers.test': unexpected end of file

    Evaluation on test data (10088 cases):

        Decision Tree
      ----------------
      Size      Errors
       460  2407(23.9%)   <<

      Values listed by "classified as" column:
      (a)        11 2 12
      (b)        42 4 11 11 54
      (c)        1 9 623 1 5 105
      (d), (e)   17 4 41 25 79 20 1 24 142 121
      (f)        72 752 254 616 207 6822

      (a): class http://www.mycoupons.com
      (b): class http://www.fashionmall.com
      (c): class http://www.gazelle.com
      (d): class http://stores.shopnow.com
      (e): class http://www.winnie-cooper.com
      (f): class Other

    Time: 77.6 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0     464       24.5%
      1     505       25.4%
      2     419       23.5%
      3     455       24.1%
      4     537       24.6%
      5     557       24.7%
      6     496       24.3%
      7     627       24.8%
      8     499       23.6%
      9     465       23.3%
    Mean   502.4      24.3%
    SE      18.8       0.2%

      (a) (b) (c) (d) (e) (f)    <-classified as
      29 1 2 149                 (a): class http://www.mycoupons.com
      3 149 11 21 39 1453        (b): class http://www.fashionmall.com
      6 17 1094 7 11 535         (c): class http://www.gazelle.com
      21 2 150 63 1204           (d): class http://stores.shopnow.com
      19 7 67 249 394            (e): class http://www.winnie-cooper.com
      34 157 226 193 214 13471   (f): class Other

    Time: 378.3 secs

Question 2. How do most of our customers find out about us?

This question is related to the one above, as it is also about marketing. It seeks to find out which advertising is most effective for the organisation.
The results of the question might lead to concentrating on the advertising medium that brings in more customers, or to more attention being paid to the less productive advertising media. The targeted attribute is HowDidYouFindUs. Below is the evaluation on training data.

    Evaluation on training data (58 cases):

        Decision Tree
      ----------------
      Size      Errors
         8    14(24.1%)   <<

      (a) (b) (c) (d) (e) (f)    <-classified as
      5                          (a): class Web/Banner Ad
      1                          (b): class News Story
      2                          (c): class Search Engine
      1 34                       (d): class Friend/Co-worker
                                 (e): class Magazine Ad
      2 9 4                      (f): class Other

    *** ignoring cases with bad or unknown class
    *** line 10089 of `Customers.test': unexpected end of file

    Evaluation on test data (18 cases):

        Decision Tree
      ----------------
      Size      Errors
         8    12(66.7%)   <<

      (a) (b) (c) (d) (e) (f)    <-classified as
                                 (a): class Web/Banner Ad
      1                          (b): class News Story
      2                          (c): class Search Engine
      5 3                        (d): class Friend/Co-worker
                                 (e): class Magazine Ad
      1 5 1                      (f): class Other

    Time: 17.3 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0       5       80.0%
      1      12       80.0%
      2       8       33.3%
      3       7       50.0%
      4       5       66.7%
      5       7       50.0%
      6      11       50.0%
      7       9       33.3%
      8      11       50.0%
      9       6       50.0%
    Mean     8.1      54.3%
    SE       0.8       5.2%

      (a) (b) (c) (d) (e) (f)    <-classified as
      1 3 1                      (a): class Web/Banner Ad
      1                          (b): class News Story
      2                          (c): class Search Engine
      4 1 24 6                   (d): class Friend/Co-worker
                                 (e): class Magazine Ad
      2 11 2                     (f): class Other

    Time: 10.4 secs

Question 3. What gender are most of our customers?

Knowing the gender of your customers is very important in a retail shop that deals in clothing. The answer to the question will enable the marketing team to advise the stores department to stock goods that might be preferred by the dominant gender, for example more female clothing if most of the customers are female. This is, in effect, segmenting the market into gender groups to enable different offers.
The targeted attribute is Gender.

    Evaluation on training data (19998 cases):

        Decision Tree
      ----------------
      Size      Errors
        39     6( 0.0%)   <<

      (a) (b) (c)      <-classified as
      157              (a): class Female
      4 46             (b): class Male
      1 1 19789        (c): class NULL

    Evaluation on test data (10088 cases):

        Decision Tree
      ----------------
      Size      Errors
        39    20( 0.2%)   <<

      (a) (b) (c)      <-classified as
      66 8 4           (a): class Female
      6 9              (b): class Male
      1 1 9993         (c): class NULL

    Time: 31.4 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0      34        0.3%
      1      31        0.4%
      2      30        0.3%
      3      32        0.3%
      4      38        0.4%
      5      37        0.3%
      6      34        0.3%
      7      35        0.4%
      8      32        0.3%
      9      35        0.4%
    Mean    33.8       0.3%
    SE       0.8       0.0%

      (a) (b) (c)      <-classified as
      124 16 17        (a): class Female
      19 28 3          (b): class Male
      4 9 19778        (c): class NULL

    Time: 93.2 secs

Question 4. How many customers wish to receive mail from our company?

This question will reveal the number of customers who wish to receive electronic mail from the organisation. It will also give an idea of those who do not wish to receive electronic mail. The answer may lead to an investigation into why some customers do not wish to receive mail, in the hope of convincing them to receive it. The answer will also help in target marketing, whereby only customers who wish to receive mail will be sent offers by electronic mail. The targeted attribute is SendEmail.
    Evaluation on training data (19998 cases):

        Decision Tree
      ----------------
      Size      Errors
        10    17( 0.1%)   <<

      (a) (b) (c)      <-classified as
      19596 1          (a): class NULL
      10 61 1          (b): class True
      3 2 324          (c): class False

    Evaluation on test data (10088 cases):

        Decision Tree
      ----------------
      Size      Errors
        10    16( 0.2%)   <<

      (a) (b) (c)      <-classified as
      9909             (a): class NULL
      6 22 1           (b): class True
      3 6 141          (c): class False

    Time: 20.2 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0       9        0.1%
      1      14        0.2%
      2       9        0.3%
      3      10        0.1%
      4      13        0.2%
      5      12        0.2%
      6       9        0.2%
      7      15        0.2%
      8       8        0.2%
      9       9        0.3%
    Mean    10.8       0.2%
    SE       0.8       0.0%

      (a) (b) (c)      <-classified as
      19595 2          (a): class NULL
      6 56 10          (b): class True
      13 316           (c): class False

    Time: 87.3 secs

Question 5. How many of our customers are Premium Card holders?

This question will help with creating a segment of customers who have Premium cards. This group of people is important as they have high credits and spending power. Satisfying such a group is very important so it will be good for any retail outlet to know if such a group exists. The targeted attribute is Premium Card Holder.

    Evaluation on training data (19998 cases):

        Decision Tree
      ----------------
      Size      Errors
        12    42( 0.2%)   <<

      (a) (b) (c)      <-classified as
      19697            (a): class NULL
      29 36            (b): class True
      6 230            (c): class False

    Evaluation on test data (10088 cases):

        Decision Tree
      ----------------
      Size      Errors
        12    30( 0.3%)   <<

      (a) (b) (c)      <-classified as
      9966             (a): class NULL
      7 18             (b): class True
      12 85            (c): class False

    Time: 19.3 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0       7        0.3%
      1      21        0.5%
      2      18        0.5%
      3      13        0.3%
      4      17        0.4%
      5       6        0.3%
      6      13        0.6%
      7       7        0.3%
      8      26        0.5%
      9      11        0.2%
    Mean    13.9       0.4%
    SE       2.1       0.0%

      (a) (b) (c)      <-classified as
      19697            (a): class NULL
      16 49            (b): class True
      27 209           (c): class False

    Time: 49.8 secs

Question 6.
Who is the Email provider of most of our customers?

The answer to this question will reveal which Email provider most customers use. The provider may then be targeted for marketing purposes, such as doing more advertising with them or paying for a click-through link on their website. The targeted attribute is Email.

    Evaluation on training data (19998 cases):

        Decision Tree
      ----------------
      Size      Errors
        42    48( 0.2%)   <<

      (a) (b) (c) (d) (e) (f) (g) (h) (i)    <-classified as
                                             (a): class MIL
      56 30                                  (b): class NET
                                             (c): class GOV
      4 276 1 1                              (d): class COM
      6 5                                    (e): class EDU
      2                                      (f): class ORG
      1 1 15                                 (g): class Gazelle
      1 1 2                                  (h): class Other
      19596                                  (i): class NULL

    *** line 10089 of `Customers.test': unexpected end of file

    Evaluation on test data (10088 cases):

        Decision Tree
      ----------------
      Size      Errors
        42    57( 0.6%)   <<

      (a) (b) (c) (d) (e) (f) (g) (h) (i)    <-classified as
      1                                      (a): class MIL
      7 22                                   (b): class NET
                                             (c): class GOV
      17 113 2 1                             (d): class COM
      6 1                                    (e): class EDU
      1 1                                    (f): class ORG
      1 3                                    (g): class Gazelle
      1 3                                    (h): class Other
      9908                                   (i): class NULL

    Time: 20.6 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0      41        0.6%
      1      38        0.7%
      2      53        0.6%
      3      35        0.7%
      4      36        0.6%
      5      45        1.0%
      6      46        0.6%
      7      25        0.7%
      8      31        0.6%
      9      35        0.6%
    Mean    38.5       0.7%
    SE       2.5       0.0%

      (a) (b) (c) (d) (e) (f) (g) (h) (i)    <-classified as
      24 61 1 36 3 1 235 6 1 9 2 4 2 2 6 8
                                             (a): class MIL
                                             (b): class NET
                                             (c): class GOV
      1                                      (d): class COM
                                             (e): class EDU
                                             (f): class ORG
                                             (g): class Gazelle
                                             (h): class Other
      19596                                  (i): class NULL

    Time: 49.9 secs

Question 7. How many of our customers are mail order buyers?
The question seeks to find out whether some of the customers buying from the website have bought goods by mail order in the past. Such customers will usually have no problem with buying from the Internet and are therefore potentially loyal customers who need to be retained. The targeted attribute is Mail Order Buyer.

    Evaluation on training data (19998 cases):

        Decision Tree
      ----------------
      Size      Errors
         7    31( 0.2%)   <<

      (a) (b) (c)      <-classified as
      19697            (a): class NULL
      186 1            (b): class True
      30 84            (c): class False

    Evaluation on test data (10088 cases):

        Decision Tree
      ----------------
      Size      Errors
         7    15( 0.1%)   <<

      (a) (b) (c)      <-classified as
      9966             (a): class NULL
      75 3             (b): class True
      12 32            (c): class False

    Time: 21.0 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0       8        0.2%
      1       6        0.4%
      2       7        0.3%
      3       7        0.1%
      4       5        0.2%
      5       3        0.2%
      6       7        0.1%
      7       5        0.2%
      8       7        0.1%
      9       3        0.3%
    Mean     5.8       0.2%
    SE       0.6       0.0%

      (a) (b) (c)      <-classified as
      19697            (a): class NULL
      186 1            (b): class True
      37 77            (c): class False

    Time: 47.0 secs

Question 8. How many children do most of our customers have?

The answer will classify customers by their number of children. Those who have many children may then be targeted for family purchasing, thereby increasing the overall number of customers. The targeted attribute is NumberOfChildren.
    Evaluation on training data (58 cases):

        Decision Tree
      ----------------
      Size      Errors
         5     7(12.1%)   <<

      (a) (b) (c) (d) (e)    <-classified as
      44                     (a): class 0
      2                      (b): class 1
      4 4                    (c): class 2
      1                      (d): class 3
      3                      (e): class 4 or more

    *** ignoring cases with bad or unknown class

    Evaluation on test data (18 cases):

        Decision Tree
      ----------------
      Size      Errors
         5     3(16.7%)   <<

      (a) (b) (c) (d) (e)    <-classified as
      11                     (a): class 0
                             (b): class 1
      2 3                    (c): class 2
                             (d): class 3
      1 1                    (e): class 4 or more

    Time: 15.1 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0       7        0.0%
      1       6       20.0%
      2       5       33.3%
      3       4       33.3%
      4       6       33.3%
      5       6       16.7%
      6       3       50.0%
      7       4       50.0%
      8       4       16.7%
      9       5       16.7%
    Mean     5.0      27.0%
    SE       0.4       5.0%

      (a) (b) (c) (d) (e)    <-classified as
      40 1 1 2               (a): class 0
      2                      (b): class 1
      7 1                    (c): class 2
      1                      (d): class 3
      2 1                    (e): class 4 or more

    Time: 10.1 secs

Question 9. How many of our customers respond to our marketing mails?

The answer to the question above will be used for marketing purposes. The answer will show which of the customers are “loyal” and respond to direct marketing mails. These customers will need to be retained and they may be enticed by offering them discounts. The targeted attribute is Mail Responder.
    Evaluation on training data (19998 cases):

        Decision Tree
      ----------------
      Size      Errors
        16     9( 0.0%)   <<

      (a) (b) (c)      <-classified as
      19697            (a): class NULL
      220 5            (b): class True
      4 72             (c): class False

    Evaluation on test data (10088 cases):

        Decision Tree
      ----------------
      Size      Errors
        16    11( 0.1%)   <<

      (a) (b) (c)      <-classified as
      9966             (a): class NULL
      85 6             (b): class True
      5 26             (c): class False

    Time: 20.4 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0      17        0.2%
      1      13        0.3%
      2      15        0.2%
      3      19        0.1%
      4       8        0.3%
      5      10        0.2%
      6      15        0.3%
      7      16        0.1%
      8      16        0.1%
      9      17        0.1%
    Mean    14.6       0.2%
    SE       1.1       0.0%

      (a) (b) (c)      <-classified as
      19697            (a): class NULL
      203 22           (b): class True
      13 63            (c): class False

    Time: 44.4 secs

Question 10. What percentage of our customers come back to our website?

The answer to this question will show the percentage of customers who come back to the website after having purchased something in the past. Such customers need to be retained. The targeted attribute is Retail Activity.

    Evaluation on training data (19998 cases):

        Decision Tree
      ----------------
      Size      Errors
         7    14( 0.1%)   <<

      (a) (b) (c)      <-classified as
      19697            (a): class NULL
      92 7             (b): class True
      7 195            (c): class False

    Evaluation on test data (10088 cases):

        Decision Tree
      ----------------
      Size      Errors
         7    14( 0.1%)   <<

      (a) (b) (c)      <-classified as
      9966             (a): class NULL
      34 6             (b): class True
      8 74             (c): class False

    Time: 19.9 secs

Results with Cross-validation of 10 Folds

    [ Summary ]

    Fold    Decision Tree
    ----   ----------------
           Size      Errors
      0       9        0.2%
      1       7        0.2%
      2       7        0.1%
      3       5        0.1%
      4       6        0.0%
      5      11        0.3%
      6       6        0.1%
      7       6        0.0%
      8       5        0.1%
      9       5        0.3%
    Mean     6.7       0.1%
    SE       0.6       0.0%

      (a) (b) (c)      <-classified as
      19697            (a): class NULL
      88 11            (b): class True
      12 190           (c): class False

    Time: 45.1 secs

7.3 Evaluation and Interpretation of results

See5

See5 is a Windows version of C5.0, which is an upgraded version of C4.5.
It therefore constructs decision trees and classifies cases according to the targeted attribute. After the files have been located in the program, the screen looks like the one below, which shows Customers.names, customers.data and customers.test.

The main window of See5 has six buttons on its toolbar. From left to right, they are:

Locate Data: Invokes a browser to find the files for your application, or to change the current application.
Construct Classifier: Selects the type of classifier to be constructed and sets other options such as cross-validation.
Stop: Interrupts the classifier generating process.
Review Output: Re-displays the output from the last classifier construction.
Use Classifier: Interactively applies the current classifier to one or more cases.
Cross-Reference: Shows how cases in training or test data relate to parts of a classifier.

All these functions can also be initiated from the File menu.

Constructing Classifiers

Once the names, data and test files have been set up, everything is ready to use See5. The first step is to locate the data using the Locate Data button on the toolbar. There are several options that affect the type of classifier that See5 produces and the way that it is constructed. The Construct Classifier button on the toolbar displays a dialog box that sets out these classifier construction options, as shown below.

From the screenshot above it can be seen that a cross-validation of ten folds was chosen when performing the mining exercise. If the rulesets option is ticked, See5 converts the tree into collections of rules called rulesets. For this exercise See5 was invoked with default values, and it constructed decision trees with a cross-validation of 10 folds and a pruning confidence of 25%, which is the default in See5.

One weakness of See5 seen during the exercise was that it does not work well with attributes that have continuous values when such an attribute is chosen as the target.
This is the message that was received when trying to target the Login Failure Count attribute: “*** line 305 of `Customers.names': target attribute `Login Failure Count' must be specified by a list of discrete values”.

Classification Based on Associations (CBA)

It was intended that CBA would be used to evaluate the results of See5, as it was said to be able to read the .names and .data files. CBA is also said to be able to reduce the error rate to a lower level than See5. An academic version of CBA was requested from the National University of Singapore [12], and Liu Bing, one of the authors of CBA, granted me permission to download the program [See Appendix D].

When the author tried installing the program, it came up with the error message “Setup has detected that unInstallShield is in use. Please close unInstall and restart setup”. This message was reported to support, who replied that because the program was updating the registry it could not be installed on the School of Computing computers [See Appendix D]. The reply from support was discussed with my supervisor and was also forwarded to him. I then decided to install the program on my home computer.

This is CBA’s main interface. All the main functions can be seen from this interface. The top four selectors are for different mining tasks. The flow of CBA data mining is a top-down process. Whenever you have data to mine, you should check the data consistency by using the data cleaner. If there are any continuous attributes, you should pass them to the data discretizer for discretization. You could also try to use feature selection to reduce the number of attributes.

After the customers.data file was loaded, a message asked the author to run the discretizer to discretize the data. See screenshot below. The OK button was clicked.
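To make the idea of discretization concrete, the following is a minimal sketch of equal-width binning, one common way of turning a continuous attribute into a small set of discrete values. It is not CBA's discretizer, whose method is not described here; the function name `equal_width_bins`, the choice of three bins and the sample login failure counts are all assumptions made purely for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index 0..n_bins-1 using equal-width bins.

    Illustrative only: this is an assumed, generic technique, not the
    algorithm used by the CBA discretizer.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        labels.append(min(idx, n_bins - 1))
    return labels


# Hypothetical login failure counts (made-up values, not taken from the dataset).
login_failures = [0, 1, 1, 2, 5, 9]
bins = equal_width_bins(login_failures, 3)
print(bins)
```

Once a continuous attribute has been replaced by bin labels like these, it can be declared as a list of discrete values in a names file, which is what a classifier that requires a discrete target expects.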
When the discretizer button was clicked, the following screen showed up with the error, and the author could not go any further: once the error message appeared everything froze and the computer had to be restarted. This was tried a couple of times, but it got stuck at the same place each time, so the author gave up.

Results Interpretation

From the results for question 1 it can be seen that using training data and test data files reduced the errors. The errors are higher in the training data and lower in the test data. This shows that by the time the program does the tests it has already been trained. Cross-validation further reduces the errors by validating each fold. Gazelle.com is the main referrer, with 1233 cases in the training data and 623 in the test data. The cross-validation results show it with 1094 cases.

Question 2 shows that most customers find out about the retailer through a Friend/Co-worker, while question 3 shows that most customers are female. The question about whether customers want to be sent mail shows that only a few are really interested, and this will need to be investigated further. The segmentation question about customers who hold premium cards shows that only 36 cases are true. The email provider question shows that .com providers are the main ones, followed by Mil. Question 7 asked about customers who have bought goods through mail order, and the results show that only a handful seem to have done so before.

The number of children question showed that most of the customers did not have any children, while the one about customers who respond to the shop’s mailing showed that 305 customers do respond. The last question was about customers who have bought more than once on the website, and the targeted attribute was Retail Activity. The results for this question show that only about 126 customers have done so, which is not a good figure considering that there are about 30 000 records in the files.
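The Mean and SE rows in the cross-validation summaries can be checked by hand. The sketch below uses the ten fold values from the summary whose class labels run from "class 0" to "class 4 or more" (tree sizes 7, 6, 5, ... and error rates 0.0%, 20.0%, 33.3%, ...). It assumes that SE is the sample standard deviation of the fold values divided by the square root of the number of folds; this is an assumption about how the figures were produced, not a documented See5 formula, although it does reproduce the reported values.

```python
import math
import statistics

# Per-fold values copied from the 10-fold cross-validation summary.
fold_sizes = [7, 6, 5, 4, 6, 6, 3, 4, 4, 5]
fold_errors = [0.0, 20.0, 33.3, 33.3, 33.3, 16.7, 50.0, 50.0, 16.7, 16.7]


def mean_and_se(xs):
    """Return (mean, standard error), with SE assumed to be stdev / sqrt(n)."""
    mean = statistics.mean(xs)
    se = statistics.stdev(xs) / math.sqrt(len(xs))
    return mean, se


size_mean, size_se = mean_and_se(fold_sizes)
err_mean, err_se = mean_and_se(fold_errors)

print(f"Size:   mean {size_mean:.1f}, SE {size_se:.1f}")
print(f"Errors: mean {err_mean:.1f}%, SE {err_se:.1f}%")
```

Rounded to one decimal place, these agree with the summary: a mean tree size of 5.0 with SE 0.4, and a mean error rate of 27.0% with SE 5.0%. With only 58 cases split over ten folds, each fold holds five or six cases, which is why the per-fold error rates jump in steps such as 16.7%, 20.0% and 33.3%.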
Chapter 8 Current Issues in Data Mining

8.1 Individual privacy

One of the key issues raised by data mining technology is not a business or technological one, but rather an ethical or social one. It is the issue of individual privacy. Data mining makes it possible to analyse routine business transactions and glean a significant amount of information about individuals' buying habits and preferences. In most cases this is done without the individual’s consent, and people effectively have no choice, as it is the technology that enables this intrusion into their privacy.

8.2 Data integrity

Another issue is that of data integrity. Clearly, data analysis can only be as good as the data that is being analysed. A key implementation challenge is integrating conflicting or redundant data from different sources; this can arise when creating the Organisation’s data warehouse. For example, a bank may maintain credit card accounts on several different databases. The addresses (or even the names) of a single cardholder may be different in each. So the integrity of data in data warehouses is not guaranteed.

8.3 Data Size Issues

Most of the existing data mining techniques fail because of the size of the data. New techniques have to be developed which will be able to work with large and heterogeneous databases, as the Internet connects many sources of data. Any algorithm that is proposed for mining data will have to account for out-of-core data structures, that is, data too large to fit in main memory. Most of the existing algorithms have not addressed this issue.

8.4 Noisy Data Issues

The other issue is that of noise: most of the algorithms assume the data to be free of noise. As a result, the most time-consuming part of solving problems becomes data pre-processing. Data formatting and cleaning is time-consuming and can be frustrating when working with large datasets. The concept of noisy data can be understood through the example of mining logs. A real-life scenario is one in which information is to be mined from web logs.
A user may have gone to a web site by mistake, through an incorrect URL or an incorrect button press. In such a case, this information is useless if we are trying to deduce the sequence in which the user accessed the web pages. The logs may contain many such data items. These data items constitute data noise. A database may consist of up to 30-40% such noisy data, and pre-processing this data may take more time than the actual algorithm execution.

8.5 Technical Issues

There is a technical issue, which is whether it is better to set up a relational database structure or a multidimensional one. In a relational structure, data is stored in tables, permitting ad hoc queries. In a multidimensional structure, sets of cubes are arranged in arrays, with subsets created according to category. While multidimensional structures facilitate multidimensional data mining, relational structures have thus far performed better in client/server environments. And, with the explosion of the Internet, the world is becoming one big client/server environment.

8.6 Cost Issues

While system hardware costs have dropped dramatically within the past 10 years, data mining and data warehousing tend to be self-reinforcing. The more powerful the data mining queries, the greater the value of the information being gleaned from the data, and the greater the pressure to increase the amount of data being collected and maintained, which in turn increases the pressure for faster, more powerful data mining queries. This increases the pressure to acquire larger and faster systems, which are more expensive.

Chapter 9 Future Requirements

The electronic monitoring of our lives will increase in the future as more Organisations join the e-business wagon and more data is accumulated for data mining. The author sees data mining being the main decision-making tool in the near future, with almost every big Organisation being involved in it.
The future will see the development of efficient and scalable data mining tools that are able to extract information effectively from huge amounts of data. The issue of data size will also be addressed by parallel and incremental algorithms that divide data into partitions that can be processed in parallel. The new algorithms developed will have to deliver acceptable performance on large volumes of data regardless of the computing platform. They will also need to be scalable to take advantage of parallel computing and to cope with long computation times, such as those experienced with neural networks. The algorithms associated with probabilistic learning will also need to be improved drastically.

The future might see the standardisation of the data mining methodology, where each tool might have to provide all the stages such as Selection, Data cleaning, Transformation, Data mining, Interpretation and Evaluation. This will greatly help, as once one has used one tool it will be easier to use a different one because of the standardisation. Finally, newly developed tools will also need to be easy to use, as data mining will eventually be accessible to people who know very little about computers and may have no time to learn a complicated tool.

Reference List

[1] Peter Cabena, et al. Discovering Data Mining from Concepts to Implementation. Prentice Hall, 1998.
[2] Ian H. Witten & Eibe Frank. Data Mining. Morgan Kaufmann Publishers, 2000.
[3] Bill Palace (1996). Data Mining: What is Data Mining?, URL: http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm [04th April 2002]
[4] Clementine, URL: http://www.spss.com/datamine/techniques.htm [08th April 2002]
[5] Alan Rea. Data Mining, URL: http://www.pcc.qub.ac.uk/tec/courses/datamining/stu_notes/dm_book_2.html [10th April 2002]
[6] Michael J. A. Berry & Gordon S. Linoff. Data Mining Techniques for Marketing, Sales, and Customer Support. Wiley Computer Publishing, 1997.
[7] Steven Brett. Data Mining in the School Information System (SIS). School of Computer Studies, 1998/1999.
[8] Michael J. A. Berry & Gordon S. Linoff. Mastering Data Mining: The Art and Science of Customer Relationship Management. Wiley Computer Publishing, 2000.
[9] Jiawei Han & Micheline Kamber. Data Mining Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[10] KDDCUP website, URL: http://www.ecn.purdue.edu/KDDCUP [15th March 2002]
[11] RuleQuest website, URL: http://www.rulequest.com [12th April 2002]
[12] National University of Singapore website, URL: http://www.comp.nus.edu.sg/~dm2 [24th April 2002]
[13] Karuna P. Joshi. Analysis of Data Mining Algorithms, URL: http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm [15th April 2002]
[14] Heikki Mannila, Hannu Toivonen & Inkeri Verkamo (1994). Efficient Algorithms for Discovering Association Rules, Journal of Knowledge Discovery in Databases, pp. 181-192.
[15] Adam Rombel (2001). CRM shifts to Data Mining to keep customers, Global Finance journal, 15(11): p. 97.
[16] Two Crows Corporation. Introduction to Data Mining and Knowledge Discovery, URL: http://www.twocrows.com [25th April 2002]

Appendix A Personal Evaluation

This was a totally new area for me. I have learnt a lot from the project and will recommend that the company I work for start mining its data when I get back home. When the project started there were some problems, such as acquiring datasets and tools, but once these were sorted out everything started going well. I have learned many skills from doing this project, both personal and technical.

On the personal development side I have really learnt a lot, as it was the first time I was engaged in a project of this magnitude on my own. The project helped me with applying my personal goal setting and time management. My personal goal was to learn this new area and do some practical exercises on it within the duration of the project.
I think I have achieved this goal, as I can now confidently engage in any data mining discussion. Time management was the order of the day throughout this project. I tried by all means to keep everything on schedule, though I did experience some glitches along the way.

Other skills learnt were communication and being organised. Communication between supervisor and student is very important, and I think we did not have any problems in our case. I also learned to apply my decision-making skills: there were times when the next meeting with the supervisor seemed too far away and I had to make decisions and brief him later. This really broadened my mind.

Patience and toughness were also learned in this project, as there were times when I thought I would break down, especially when I found out that the software I needed to evaluate the tool I was using could not be installed on the School of Computing computers. This was a really devastating time for me and I do not know how I managed to pull through. But through this experience I learnt that things sometimes do not go as expected.

The project involved evaluating the techniques and algorithms. This gave me some knowledge of how to do an evaluation, and I think that the next time I do one it will be much better than what I have managed in this project. The most important skill learned overall was that of conducting research, especially on the Internet. Data mining being a new field means that there is a lot of information on the Internet, some of which is incorrect, so one has to sift through this mass of information to get what is relevant and correct.

In conclusion, I would say I am very proud of what I have achieved, and I still think the project was an eye-opener.
Appendix B Project Plan and Schedule

Task Name                                     Duration   Start Date     Finish Date
Feasibility Study                             15 days    Mon 15/10/01   Fri 02/11/01
Problem Definition                            11 days    Mon 15/10/01   Mon 29/10/01
Background Reading                            11 days    Mon 15/10/01   Mon 29/10/01
Minimum Requirements                          4 days     Tue 30/10/01   Fri 02/11/01
Resource Gathering                            10 days    Mon 05/11/01   Fri 16/11/01
Data Acquisition                              10 days    Mon 05/11/01   Fri 16/11/01
Tool Acquisition                              10 days    Mon 05/11/01   Fri 16/11/01
Algorithms and Techniques Research            28 days    Mon 05/11/01   Wed 12/12/01
Analysis                                      23 days    Mon 19/11/01   Wed 19/12/01
Data analysis                                 23 days    Mon 19/11/01   Wed 19/12/01
Tool Familiarization                          23 days    Mon 19/11/01   Wed 19/12/01
Design                                        21 days    Wed 21/11/01   Wed 19/12/01
Generate Questions/Hypothesis                 21 days    Wed 21/11/01   Wed 19/12/01
Data Mining                                   20 days    Mon 04/02/02   Fri 01/03/02
Extraction of Information from Datasets       20 days    Mon 04/02/02   Fri 01/03/02
Results Evaluation                            10 days    Mon 04/03/02   Fri 15/03/02
Results Interpretation                        10 days    Mon 04/03/02   Fri 15/03/02
Research on Issues and Future Requirements    10 days    Mon 18/03/02   Fri 29/03/02
Project Evaluation                            5 days     Mon 01/04/02   Fri 05/04/02
Report Write-up                               22 days    Mon 01/04/02   Tue 30/04/02

Appendix C Project Specification

Title of project: Data Mining Algorithms and Tools
Code of project: SAR05
Supervisor: Stuart Roberts
Area of interest: Knowledge discovery; Data Mining
Appropriate for degree programmes: IS, CT, CS
Prerequisites: database
Multiple projects can be considered: Yes

Further information: This is an exploratory project. Data mining is a fast growing area of interest and many sophisticated tools are becoming available. Which algorithms are best to use for solving what problems? How do different tools compare? By taking a case study approach (case study data are available), and through research papers, the aim of this project would be to answer some of these questions. The School has both commercial and experimental data mining software which could be used in the evaluation.
Appendix D Emails Received

Email from Stuart Roberts

Date: Tue, 16 Oct 2001 11:09:03 +0100
From: S A Roberts <[email protected]>
To: K Tshetlhoyagae <[email protected]>
Subject: Re: Project questions

At 10:24 AM 10/16/01 +0100, you wrote:
>Morning,
>
>I have been allocated the Data mining Algorithms and Tools project, and
>I met my supervisor(Martin Berzins) yesterday and he wanted me to find
>out answers to the following questions.
>
>Whether there is available data to mine?

There are test data sets available from the KDD data mining web site, but of course it depends what your objectives are as to what data are suitable. see for example: http://www.ecn.purdue.edu/KDDCUP/
Eric Atwell also has a data mining project with some associated data, but he may have someone else doing that project.

>Whether Data mining is taught in the University?

There is an introduction in DB32

>What software is available for data mining in the university?

SQL Server 2000 Data analysis services
Weka data mining tools on W2000 m/cs
C4.5 decision tree algorithm on linux

It's probably worth pointing out that you don't necessarily have to stick with this project if there is something else that you and your supervisor can agree on. If you can see clearly what you would like to achieve using data mining tools then the project should be possible, but the published specification almost certainly needs some extra thinking to turn it into a well-defined set of objectives.

Hope this helps.
Stuart

Emails from Ross Quinlan

Date: Wed, 6 Feb 2002 19:50:38 -0800 (PST)
From: Ross Quinlan <[email protected]>
To: [email protected]
Subject: See5

Thank you for your interest in RuleQuest data mining tools.
Your ten-day evaluation licence ID is 3893 e694 97eb ee2e#

Instructions for downloading the system appear at http://www.rulequest.com/Install/

If you have any problems installing the system, please contact [email protected] with the following information:
* Windows: the machine name and user name shown on the message indicating a licence problem
* Unix: the output from the commands "hostname" and "who am i"

Regards,
Ross Quinlan

Subject: Request for evaluation licence
To: [email protected]
Date sent: Wed, 22 Mar 2002 02:02:11 -0800 (PST)
From: [email protected] (Ross Quinlan)
Send reply to: [email protected]

I regret that a licence ID cannot be issued. Only one evaluation licence is issued to an organisation, and a licence has previously been issued to [email protected].

Regards,
Ross Quinlan

Email from Liu Bing

Date: Thu, 7 Mar 2002 18:22:49 +0800 (GMT-8)
From: Liu Bing <[email protected]>
To: K Tshetlhoyagae <[email protected]>
Cc: ma yiming <[email protected]>
Subject: Re: CBA Academic Full version

Hi,
You can download it from the following: http://www.comp.nus.edu.sg/~dm2/CBAFull.zip
Cheers

Liu, Bing (Dr.)
Department of Computer Science
School of Computing
National University of Singapore
3 Science Drive 2
Singapore 117543
Office: S17, 05-17
Tel: (65) 8746736
Fax: (65) 7794580
Email: [email protected]
Web: http://www.comp.nus.edu.sg/~liub

Email from Support

Date: Fri, 8 Mar 2002 14:32:47 GMT
From: Pritpal Rehal via RT <[email protected]>
To: [email protected]
Subject: [SUPPORT #7733] (cts) Unable to install
Status: resolved
Requestors: [email protected]

as i said before ...you cant install the software if it trying to access windows cfg/registry so dont install it - it will not let you!!!!
prit 56 Appendix E Sample of the Customers.data file records (Please note that this file has the same structure with Customers.test) ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-22,17\:23\:20,2000-0222,17\:23\:20,?,177035,64671,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 98; DigExt),1,47,NULL,http\://womencentral\.msn\.com/women/today/default\.asp,main/home\.jhtml,1389,Tuesd ay,17,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0222,17\:23\:20,47.0,47,47,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,NULL,main/home\.jhtml,Tuesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N ULL,/Content/templates/main,17,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/home\.jhtml,Other,(14 \.\.\. 
17] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-28,01\:32\:44,2000-0228,01\:32\:44,?,281054,100697,?,Mozilla/4\.0 (compatible; MSIE 5\.0; MSNIA; AOL 5\.0; Windows 98; DigExt),1,47,NULL,http\://search\.yahoo\.com/search?p=thong&hc=2&hs=50&h=s&b=19,main/home\.jhtml ,1389,Monday,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0228,01\:32\:44,47.0,47,47,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,NULL,main/home\.jhtml,Monday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N ULL,/Content/templates/main,1,0.0,?,AOL,AOL Windows 4\.0,AOL,main/home\.jhtml,main/home\.jhtml,Other,(\.\.\. 
5] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-21,11\:04\:07,2000-0222,17\:10\:47,?,152324,64560,?,Mozilla/4\.0 (compatible; MSIE 5\.01; Windows NT 5\.0),5,312,NULL,NULL,main/home\.jhtml,1389,Tuesday,17,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False, 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0222,17\:10\:47,312.0,312,312,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,NULL,main/home\.jhtml,Tuesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder ,NULL,/Content/templates/main,17,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/home\.jhtml,Other,(14 \.\.\. 
17] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-02,14\:20\:46,2000-0302,14\:20\:46,?,386018,139339,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 98; DigExt),1,62,NULL,NULL,main/home\.jhtml,1389,Thursday,14,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fal se,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0302,14\:20\:46,62.0,62,62,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0 57 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,NULL,main/home\.jhtml,Thursday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder, NULL,/Content/templates/main,14,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/home\.jhtml,Other,(12 \.\.\. 
14] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-20,11\:14\:43,2000-0320,11\:14\:43,?,870200,311626,?,Mozilla/4\.0 (compatible; MSIE 4\.01; Windows NT),1,79,NULL,NULL,main/home\.jhtml,1389,Monday,11,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0, 0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0320,11\:14\:43,79.0,79,79,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,NULL,main/home\.jhtml,Monday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N ULL,/Content/templates/main,11,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/home\.jhtml,Other,(9 \.\.\. 
12] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-18,20\:14\:19,2000-0318,20\:14\:19,?,824654,295799,?,Mozilla/4\.0 (compatible; MSIE 5\.0; AOL 5\.0; Windows 98; DigExt),1,47,NULL,NULL,main/home\.jhtml,1389,Saturday,20,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fals e,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0318,20\:14\:19,47.0,47,47,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,NULL,main/home\.jhtml,Saturday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N ULL,/Content/templates/main,20,0.0,?,AOL,AOL Windows 4\.0,AOL,main/home\.jhtml,main/home\.jhtml,Other,(17 \.\.\. 
22] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-15,10\:02\:38,2000-0216,16\:37\:39,?,60257,29520,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 98; DigExt),3,375,NULL,NULL,main/home\.jhtml,1389,Wednesday,16,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0216,16\:37\:39,375.0,375,375,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,NULL,main/home\.jhtml,Wednesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandO rder,NULL,/Content/templates/main,16,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/home\.jhtml,Other,(14 \.\.\. 
17] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-20,18\:20\:34,2000-0320,18\:20\:34,?,882455,316085,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 95; DigExt),1,62,NULL,http\://www\.winniecooper\.com/cpagesal/gazelle\.html,main/home\.jhtml,1389,Monday,18,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,False,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,2000-0320,18\:20\:58,1976.0,3952,3890,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,2,0,0,0,0,1,0,0,1,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,FOLDER%3C%3Efolder_id=8703&ASSORTMENT%3C%3East_id=8683&bmUID=953 58 605234769&WebLogicSession=ONbccjVbove2FgbgHHm2178E1woDG11D3qeF36XzpiKsx10g5o1ljtmhJT mjcsoPPFJoNqUbjW83\|2030677379036461453/174787178/5/7005/7005/7002/7002/1,main/departments\.jhtml,Monday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL, NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/Departments,/Assortmen ts/Main/Departments/A1_Hosiery,/Content/templates/main,18,24.0,24.0,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/departments\.jhtml,http\://www\.winniecooper\.com,(17 \.\.\. 
22] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-09,19\:23\:00,2000-0313,06\:38\:38,?,590681,239520,?,Mozilla/4\.5 [en] (Win98; I),7,156,FOLDER%3C%3Efolder_id=45687&ASSORTMENT%3C%3East_id=8687&bmUID=9529390801 69&WebLogicSession=OMyySOJrRKcOI2jtPoy9t2HduN6YovoQMxgLRICUpIKxIcZ84JWOcJjnjbRHhXK cByjgJT1KrtU3\|7918625116450430875/174787179/5/7005/7005/7002/7002/1&asstListPath=/Assortments/Main/B,http\://www\.gazelle\.com/main/home\.jhtml,main/assortment2\.jhtml,4 5891,Monday,6,3,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,2000-0313,06\:40\:25,213.66666666666666,641,32,3.0,3,?,?,?,?,?,?,?,0.5,0.5,?,?,?,?,1.0,1,?,9.75,9.75,?,10.0,10.0,?,3.5 ,3.5,?,7.0,7.0,?,3.0,3,?,2,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,PRODUCT%3C%3Eprd_id=33449&FOLDER% 3C%3Efolder_id=45687&ASSORTMENT%3C%3East_id=8687&bmUID=952958374403,main/shopping_ca rt\.jhtml,Monday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/Brands,/Assortments/Main/Brands/Legwear ,/Content/templates/main,6,107.0,53.5,Netscape,Netscape 4\.5,Netscape,Other,Other,http\://www\.gazelle\.com,(5 \.\.\. 
9] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-05,15\:32\:26,2000-0305,15\:32\:26,?,472304,169949,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 95; DigExt),1,63,NULL,NULL,main/home\.jhtml,1389,Sunday,15,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False ,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0305,15\:32\:26,63.0,63,63,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,NULL,main/home\.jhtml,Sunday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,NU LL,/Content/templates/main,15,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/home\.jhtml,Other,(14 \.\.\. 
17] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-22,08\:49\:15,2000-0322,08\:49\:15,?,923984,330841,?,Mozilla/4\.0 (compatible; MSIE 5\.5; Windows 98),1,31,ASSORTMENT%3C%3East_id=8687&bmUID=952973502998&WebLogicSession=OM04vug11m 0U1WHs6JkTU0QcaDjjvc1ej1tS1Gwy5ZIVPeQmfik3U3EAEGH2pbAwnM1NJiwCsmU3\|7918625116450 430875/174787179/5/7005/7005/7002/7002/1,http\://beauty\.about\.com/style/beauty/gi/dynamic/offsite\.htm?site=http\://www\.gazelle\.com/main/freegif t\.jhtml%3FASSORTMENT%253C%253East%5Fid=8687%0D%0A%26bmUID=952973502998%26WebLo gicSession=OM04vug11m0U1WHs6JkTU0QcaDjjvc1ej1tS1Gwy%0D%0A5ZIVPeQmfik3U3EAEGH2,mai n/freegift\.jhtml,21189,Wednesday,8,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,0,1,0,0,0,0 ,0,0,2000-0322,08\:49\:15,31.0,31,31,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,ASSORTMENT%3C%3East_id=8687&bmUID=952973502998&WebLogicSession=OM04vug1 1m0U1WHs6JkTU0QcaDjjvc1ej1tS1Gwy5ZIVPeQmfik3U3EAEGH2pbAwnM1NJiwCsmU3\|79186251164 59 50430875/174787179/5/7005/7005/7002/7002/1,main/freegift\.jhtml,Wednesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/UniqueBoutiques,NULL,/ Content/templates/main,8,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,Other,Other,Other,(5 \.\.\. 
9] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-24,06\:07\:42,2000-0224,06\:07\:42,?,207398,75201,?,Mozilla/4\.0 (compatible; MSIE 5\.0; AOL 5\.0; Windows 98; DigExt),1,47,NULL,NULL,main/home\.jhtml,1389,Thursday,6,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fals e,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,2000-0224,06\:08\:23,55.0,110,63,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,2,0,0,0,0,1,0,1,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,FOLDER%3C%3Efolder_id=8739&ASSORTMENT%3C%3East_id=8687&bmUID=951401262 672&WebLogicSession=OLU7LpR58bM4A0tHfwfg7KR8MdIBTFhzww0dwx93jIcekkzxkyO2KH0hKvKd mBXwj0YcjDZwhRQ3\|-5383655529336981119/174787178/5/7005/7005/7002/7002/1,main/boutique\.jhtml,Thursday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/UniqueBoutiques,/Assortme nts/Main/UniqueBoutiques/06_dance,/Content/templates/main,6,41.0,41.0,AOL,AOL Windows 4\.0,AOL,main/home\.jhtml,main/boutique\.jhtml,Other,(5 \.\.\. 
9] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-23,19\:28\:38,2000-0323,19\:28\:38,?,960503,343960,?,Mozilla/5\.0 (compatible; MSIE 5\.0),1,140,ASSORTMENT%3C%3East_id=8687&bmUID=953865993027&WebLogicSession=ONrXCKIa c2iUoBBtLUWBNh217V51ad5WwqxYoj2w7HrdR1IPWpG4rL2lf9z7A2mS4XIoJRIQFQQ3\|7863941279941471645/174787179/5/7005/7005/7002/7002/1&ls=Family,http\://www\.gazelle\.com/main/home\.jhtml,main/lifestyles\.jhtml,8171,Thursday,19,1,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2000-0323,19\:28\:38,140.0,140,140,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,0,0,0,0,0,0,0,1 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,ASSORTMENT%3C%3East_id=8687&bmUID=953865993027&WebLogicSession=ONrXC KIac2iUoBBtLUWBNh217V51ad5WwqxYoj2w7HrdR1IPWpG4rL2lf9z7A2mS4XIoJRIQFQQ3\|7863941279941471645/174787179/5/7005/7005/7002/7002/1&ls=Family,main/lifestyles\.jhtml,Thursday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/UniqueBoutiqu es,NULL,/Content/templates/main,19,0.0,?,Internet Explorer,Internet Explorer 5\.0,Internet Explorer,Other,Other,http\://www\.gazelle\.com,(17 \.\.\. 
22] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-03,20\:30\:18,2000-0303,20\:30\:18,?,424859,153368,?,Mozilla/4\.0 (compatible; MSIE 4\.5; Mac_PowerPC),1,47,NULL,http\://www\.flamingoworld\.com/,main/home\.jhtml,1389,Friday,20,1,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0303,20\:30\:18,47.0,47,47,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,NULL,main/home\.jhtml,Friday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL, NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,NU LL,/Content/templates/main,20,0.0,?,Internet Explorer,Internet Explorer Mac 4\.0,Internet Explorer,main/home\.jhtml,main/home\.jhtml,Other,(17 \.\.\. 
22] ?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?, 60 NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?, ?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-28,18\:03\:25,2000-0328,18\:03\:25,?,1074530,384447,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows NT; DigExt),1,62,NULL,http\://stores\.shopnow\.com/cgibin/visitstore\.cgi?lid=9432355,main/home\.jhtml,1389,Tuesday,18,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, False,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,2000-0328,18\:04\:22,93.66666666666667,281,188,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,3,0,0,0,0 ,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,FOLDER%3C%3Efolder_id=8735&ASSORTMENT%3C%3East_id=8687&b mUID=954295443812,main/boutique\.jhtml,Tuesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/Unique Boutiques,/Assortments/Main/UniqueBoutiques/05_maternity,/Content/templates/main,18,57.0,28.5,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/boutique\.jhtml,http\://stores\.shopnow\.com,(17 \.\.\. 
22]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-18,10\:20\:22,2000-03-18,10\:20\:22,?,814958,292427,?,Mozilla/4\.0 (compatible; MSIE 5\.01; Windows 98),1,31,NULL,http\://chain\.station\.sony\.com/chain/reaction/html/interstitial0\.html,main/home\.jhtml,1389,Saturday,10,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-03-18,10\:20\:22,31.0,31,31,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,NULL,main/home\.jhtml,Saturday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,NULL,/Content/templates/main,10,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet Explorer,main/home\.jhtml,main/home\.jhtml,Other,(9 \.\.\. 12]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-01,14\:54\:12,2000-02-01,14\:54\:12,?,3848,1691,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 98;

The File has been cut for convenience

Appendix F
Sample of the Customers.names file attributes

| The content of this file and the matching data file are confidential
| to Blue Martini Software and Gazelle.com
| Right to use is restricted by the NDA agreement for the KDD-CUP 2000
| For questions, e-mail [email protected]
|
| C5.0 names file
|
Retail Activity.
| target column
WhichDoYouWearMostFrequent: trouser socks,hosiery,casual socks,athletic socks.
YourFavoriteLegcareBrand: eShave,Lucky Chick,VeinGard,Tweezerman,Swiss Balance,Ovacion,Covermark,Ya Babi,Medea,Dr\. Perricone,TendSkin,IColoniali,Dr\. Schon,Nature Made,Bio Depiless,Conair,Natural Line,Kleb Sole,Living Earth,Cellasene,DailyHerbs,Kneipp,Juvena,Epilady.
Registration Gender: Female,Male.
NumberOfChildren: [ordered] 0, 1, 2, 3, 4 or more.
DoYouPurchaseForOthers: NULL,False.
HowDoYouDressForWork: business dress,business casual,comfortable /athletic,very casual.
HowManyPairsDoYouPurchase: [ordered] 1 to 5, 6 to 10, 11 to 15, 15 or more.
YourFavoriteLegwearBrand: Ellen Tracy,Oroblu,Belly Basics,American Essentials,Danskin,Hot Sox,Donna Karan,Evan Picone,Givenchy,Greg Norman,DKNY,Hanes,Round the Clock,Falke,Berkshire.
WhoMakesPurchasesForYou: parent,spouse,siblings,friend.
NumberOfAdults: [ordered] 0, 1, 2, 3 or more.
HowDidYouHearAboutUs: in the news,e-mail,print ad,direct mail,other,friend / family.
Company: COMPANY 2605,COMPANY 416,COMPANY 2975,COMPANY 3051,COMPANY 780,COMPANY 3055,COMPANY 1608,COMPANY 3059,COMPANY 2575,COMPANY 2330,COMPANY 1169,COMPANY 1207,COMPANY 1209,COMPANY 2982,COMPANY 1571,COMPANY 1733,COMPANY 1737,COMPANY 1455,COMPANY 796,COMPANY 673,COMPANY 713,COMPANY 395,COMPANY 2183,COMPANY 2870,COMPANY 3070,COMPANY 1587,COMPANY 1626,COMPANY 842,COMPANY 2479,COMPANY 445,COMPANY 2235,COMPANY 2882,COMPANY 2922,COMPANY 3089,COMPANY 613,COMPANY 452,COMPANY 615,COMPANY 617,COMPANY 1085,COMPANY 1086,COMPANY 2092,COMPANY 503,COMPANY 1932,COMPANY 2258,COMPANY 1935,COMPANY 1658,COMPANY 594,COMPANY 1258,COMPANY 1099,COMPANY 2262,COMPANY 1944,COMPANY 519,COMPANY 920,COMPANY 3032,COMPANY 3036,COMPANY 2678,COMPANY 482,COMPANY 2273,COMPANY 1792,COMPANY 402,COMPANY 409,NULL,COMPANY 2963,COMPANY 2682,COMPANY 1273,COMPANY 933.
SendEmail: NULL,True,False.
HowOftenDoYouPurchase: [ordered] once a year, every 6 months, each week.
HowDidYouFindUs: Web/Banner Ad,News Story,Search Engine,Friend/Co-worker,Magazine Ad,Other.
City: ignore.
Country: United States,NULL.
US State: IA,ID,IL,IN,AK,AL,AR,RI,AZ,SC,CA,KS,CO,KY,CT,LA,TN,DC,DE,TX,MA,MD,ME,MI,UT,MN,MO,MS,MT,VA,NC,ND,NE,NH,NJ,VT,NM,FL,NV,WA,NY,WI,OH,GA,OK,WV,WY,OR,PA,HI.
Account Creation Date: date.
Account Creation Date_Time: time.
Year of Birth: continuous.
Email: NET,GOV,COM,EDU,ORG,Gazelle,Other,NULL.
Login Failure Count: continuous.
Customer ID: continuous.
Truck Owner: NULL,True,False.
RV Owner: NULL,True,False.
Motorcycle Owner: NULL,True,False.
Value Of All Vehicles: continuous.
Age: continuous.
Other Indiv\. Age: continuous.
Marital Status: Inferred Married,Single,Inferred Single,Married,NULL.
Working Woman: NULL,True,False.
Mail Responder: NULL,True,False.
Bank Card Holder: NULL,True,False.
Gas Card Holder: NULL,True,False.
Upscale Card Holder: NULL,True,False.
Unknown Card Type: NULL,True,False.
TE Card Holder: NULL,True,False.
Premium Card Holder: NULL,True,False.
Presence Of Children: NULL,True,False.
Number Of Adults: continuous.
Estimated Income Code: [ordered] Under $15;000, $15;000-$19;999, $20;000-$29;999, $30;000-$39;999, $40;000-$49;999, $50;000-$74;999, $75;000-$99;999, $100;000-$124;999, $125;000 OR MORE.
Home Market Value: [ordered] $1;000-$24;999, $25;000-$49;999, $50;000-$74;999, $75;000-$99;999, $100;000-$124;999, $125;000-$149;999, $150;000-$174;999, $175;000-$199;999, $200;000-$224;999, $225;000-$249;999, $250;000-$274;999, $275;000-$299;999, $300;000-$349;999, $350;000-$399;999, $400;000-$449;999, $450;000-$499;999, $500;000-$774;999, $775;000-$999;999, $1;000;000+.
New Car Buyer: NULL,True.
Vehicle Lifestyle: FULL SIZE (STANDARD/LUXURY),IMPORT (STANDARD/ECONOMY),SPECIALTY (MIDSIZE/SMALL),PERSONAL LUXURY CAR,STATION WAGON,REGULAR (MIDSIZE/SMALL),TRUCK OR UTILITY VEHICLE,NULL.
Property Type: apartment(5+ units),2-4 unit(duplex;triplex;quad),mobile_home,misc\. residential (condo store/flat),single family dwelling,NULL,condo.
Loan To Value Percent: [ordered] 0% (NO LOANS), 01-49%, 50-59%, 60-69%, 70-74%, 75-79%, 80-84%, 85-89%, 90-94%, 95-99%, 100-99%.
Presence Of Pool: NULL,True,False.
Year House Was Built: continuous.
Own Or Rent Home: Owner,NULL,Renter.
Length Of Residence: continuous.
Mail Order Buyer: NULL,True,False.
Year Home Was Bought: continuous.
Home Purchase Date: continuous.
Number Of Vehicles: continuous.
DMA No Mail Solicitation Flag: NULL,True.
DMA No Phone Solicitation Flag: NULL,True.
CRA Income Classification: continuous.
New Bank Card: NULL,True,False.
Number Of Credit Lines: continuous.
Speciality Store Retail: NULL,True,False.
Oil Retail Activity: NULL,True,False.
Bank Retail Activity: NULL,True,False.
Finance Retail Activity: NULL,True,False.
Miscellaneous Retail Activity: NULL,True,False.
Upscale Retail: NULL,True,False.
Upscale Speciality Retail: NULL,True,False.
Retail Activity: NULL,True,False.
Last Retail Date: date.
Last Retail Date_Time: time.
Dwelling Size: [ordered] SINGLE HOUSEHOLD, 2 HOUSEHOLDS, 3 HOUSEHOLDS, 4 HOUSEHOLDS, 5 HOUSEHOLDS, 6 HOUSEHOLDS, 7 HOUSEHOLDS, 8 HOUSEHOLDS, 9 HOUSEHOLDS, 10-19 HOUSEHOLDS, 20-29 HOUSEHOLDS, 30-39 HOUSEHOLDS, 40-49 HOUSEHOLDS, 50-99 HOUSEHOLDS, 100+ HOUSEHOLDS.
Dataquick Market Code: continuous.
BrandName Last: BB,Silk Reflections,DAN,ELT,Absolutely Ultra Sheer,ORO,AME,DKNY,Hanes Too,EVP,NM,HOSO,NULL.
StockType Last: Seasonal 1*,Seasonal 1,Seasonal 2,Replenishable,NULL.
Look Last: Sheer,NULL,Ultra Sheer.
BasicOrFashion Last: Fashion,NULL,Basic.
HasDressingRoom Last: NULL,True,False.
Texture Last: Flat,Textured,NULL.
ToeFeature Last: RT,SF,NULL.
Material Last: Nylon,Cotton,Lycra,NULL.
WaistControl Last: CT,NULL.
Collection Last: Specialty Items,Childrens Dance,Oroblu Fashion Line,Conversationals,Hanes Plus Collection,Conversational Classics,Athletics,Men's Essential Sport,Occasions Collection,Pregnancy Survival Kit,DKNY Basic Trouser Socks,Men's Patterns and Textures,Beyond Bare Collection,Spring/Summer 2000,Teddy Hose,NULL.
Audience Last: Men,Children,Women,NULL.
Pattern Last: Conversational,Solid,NULL.
Product Level 1 Path Last: /Products/Legwear,/Products/LegCare,NULL.
Product Level 2 Path Last: /Products/LegCare/SwissBalance,/Products/Legwear/Danskin,/Products/LegCare/VeinGard,/Products/Legwear/EllenTracy,/Products/Legwear/Oroblu,/Products/Legwear/NicoleMiller,/Products/Legwear/DKNY,/Products/Legwear/HotSox,/Products/Legwear/Hanes,/Products/Legwear/BellyBasics,/Products/Legwear/AmericanEssentials,/Products/Legwear/EvanPicone,NULL.
Assortment Level 2 Path Last: /Assortments/Main/UniqueBoutiques,/Assortments/Main/Brands,/Assortments/Main/SaleAssortments,/Assortments/Main/Welcome,/Assortments/Main/BrandOrder,/Assortments/Main/Departments,/Assortments/Main/LifeStyles,/Assortments/Main/Seasonal,NULL.
Assortment Level 3 Path Last: /Assortments/Main/Departments/A2_Socks,/Assortments/Main/UniqueBoutiques/05_maternity,/Assortments/Main/Welcome/Legwear,/Assortments/Main/Brands/LegCare,/Assortments/Main/UniqueBoutiques/04_evenings,/Assortments/Main/UniqueBoutiques/08_Seasonal,/Assortments/Main/LifeStyles/family,/Assortments/Main/UniqueBoutiques/02_men,/Assortments/Main/UniqueBoutiques/07_gifts,/Assortments/Main/Departments/A1_Hosiery,/Assortments/Main/Seasonal/Spring2000,/Assortments/Main/LifeStyles/InStyle,/Assortments/Main/UniqueBoutiques/03_kids,/Assortments/Main/UniqueBoutiques/01_PlusSizes,/Assortments/Main/UniqueBoutiques/06_dance,/Assortments/Main/Departments/A4_Legcare,/Assortments/Main/LifeStyles/LegCare,/Assortments/Main/LifeStyles/Work,/Assortments/Main/Departments/A3_Bodywear,/Assortments/Main/Welcome/Legcare,/Assortments/Main/SaleAssortments/WinterSale2000,/Assortments/Main/Brands/Legwear,/Assortments/Main/LifeStyles/Sport,NULL.
Content Level 2 Path Last: /Content/templates/articles,/Content/templates/Replenish,/Content/templates/checkout,/Content/templates/main,/Content/templates/products,/Content/templates/account,NULL.
Session Last Request Hour Of Day: continuous.
Session Request Count Average: continuous.
Session Request Count Sum: continuous.
Session Time Elapsed Average: continuous.
Session Time Elapsed Sum: continuous.
Average Time Each Page View: continuous.
Num Sessions: continuous.
First Session First Referrer: ignore.
First Session First Request Day of Week: Thursday,Saturday,Tuesday,Sunday,Wednesday,Friday,Monday.
First Session Browser Family: WebTV,Netscape,AOL,Internet Explorer,Other.
First Session Browser: Internet Explorer Windows 4\.0,WebTV 1\.2,AOL Windows 2\.0,Internet Explorer Mac 4\.0,Netscape 4\.51,Internet Explorer MSN 4\.0,Netscape Mac 4\.51,AOL Mac 4\.0,Netscape 4\.61,Internet Explorer+ 4\.0,Netscape Mac 4\.61,Netscape 4\.71,Netscape 4\.72,Netscape Mac 4\.72,Internet Explorer++ 4\.0,AOL Mac 3\.0,Internet Explorer Windows 2\.0,Netscape 4\.02,Netscape 4\.03,Netscape 4\.04,Netscape 4\.5,Netscape 4\.05,Netscape 4\.6,Netscape 4\.06,Netscape 4\.7,Netscape 4\.07,Netscape 4\.08,AOL Windows 4\.0,Netscape Mac 4\.04,Netscape Mac 4\.05,Netscape Mac 4\.06,Netscape Mac 4\.08,AOL Mac 2\.0,Netscape Mac 4\.5,Netscape Mac 4\.6,Netscape Mac 4\.7,Other.
First Session Browser Family Top 3: Netscape,AOL,Internet Explorer,Other.
First Session First Template Top 5: main/departments\.jhtml,main/vendor\.jhtml,main/boutique\.jhtml,products/productDetailLegwear\.jhtml,main/home\.jhtml,Other.
First Session First Referrer Top 5: http\://www\.mycoupons\.com,http\://www\.fashionmall\.com,http\://www\.gazelle\.com,http\://stores\.shopnow\.com,Other.
Session First Processing Time: continuous.
Session First Request Hour of Day: continuous.
Last Session Average Time Per Page View: continuous.
Session First Query String: ignore.
Session First Referrer: ignore.
Session First Template: main/departments\.jhtml,main/freegift\.jhtml,main/replenishment\.jhtml,main/legcare_vendor\.jhtml,main/vendor\.jhtml,main/registration_shipaddress\.jhtml,main/login2\.jhtml,main/vendor2\.jhtml,main/registration\.jhtml,main/assortment\.jhtml,.jhtml,account/your_account\.jhtml,articles/new_shipping\.jhtml,main/welcome\.jhtml,main/boutique\.jhtml,main/shopping_cart\.jhtml,main/assortment2\.jhtml,products/productDetailLegwear\.jhtml,main/home\.jhtml,articles/dpt_about\.jhtml,main/leg_news_healthwellness\.jhtml.
Session Browser Family: WebTV,Netscape,AOL,Internet Explorer,Other.
Session Browser: Internet Explorer Windows 4\.0,WebTV 1\.2,AOL Windows 2\.0,Internet Explorer Mac 4\.0,Netscape 4\.51,Internet Explorer MSN 4\.0,Netscape Mac 4\.51,AOL Mac 4\.0,Netscape 4\.61,Internet Explorer+ 4\.0,Netscape Mac 4\.61,Netscape 4\.71,Netscape 4\.72,Netscape Mac 4\.72,Internet Explorer++ 4\.0,AOL Mac 3\.0,Internet Explorer Windows 2\.0,Netscape 4\.02,Netscape 4\.03,Netscape 4\.04,Netscape 4\.5,Netscape 4\.05,Netscape 4\.6,Netscape 4\.06,Netscape 4\.7,Netscape 4\.07,Netscape 4\.08,AOL Windows 4\.0,Netscape Mac 4\.04,Netscape Mac 4\.05,Netscape Mac 4\.06,Netscape Mac 4\.08,AOL Mac 2\.0,Netscape Mac 4\.5,Netscape Mac
Session Browser Family Top 3: Netscape,AOL,Internet Explorer,Other.
Session First Template Top 5: main/departments\.jhtml,main/vendor\.jhtml,main/boutique\.jhtml,products/productDetailLegwear\.jhtml,main/home\.jhtml,Other.
Session First Referrer Top 5: http\://www\.mycoupons\.com,http\://www\.fashionmall\.com,http\://www\.gazelle\.com,http\://stores\.shopnow\.com,Other.
Session First Request Hour of Day Bin: [ordered] (\.\.\. 5], (5 \.\.\. 9], (9 \.\.\. 12], (12 \.\.\. 14], (14 \.\.\. 17], (17 \.\.\. 22], (22 \.\.\.).
Last Session User Agent: ignore.
First Session First Request Date: date.
First Session First Request Date_Time: time.
Last Session Last Request Date: date.
Last Session Last Request Date_Time: time.
Last Session Visit Count: continuous.
Session Last Template Top 5: main/departments\.jhtml,main/vendor\.jhtml,main/boutique\.jhtml,products/productDetailLegwear\.jhtml,main/home\.jhtml,Other.

The File has been cut for convenience

Appendix G
Questions/Hypotheses for Data Mining Project

Question 1. Which website refers the most customers to our website?
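Question 1 is at heart a frequency question, so a plain tally of the class attribute is a useful cross-check on the See5 output that follows. The sketch below is not part of the See5 run and is a different technique (counting, not tree induction). It assumes that rows use the backslash escaping visible in the sample data (`\:`, `\.`), that fields are separated by unescaped commas, and that `?` and `NULL` mark missing values. The three sample rows and the column index are hypothetical stand-ins, not taken from Customers.data.

```python
import re
from collections import Counter

# Split on commas that are NOT preceded by a backslash (assumed escaping rule).
_FIELD_SEP = re.compile(r"(?<!\\),")


def split_row(line):
    """Split one data row into fields and strip the backslash escapes."""
    fields = _FIELD_SEP.split(line.rstrip("\n"))
    return [re.sub(r"\\(.)", r"\1", field).strip() for field in fields]


def count_column(rows, index, missing=("?", "NULL")):
    """Tally the values of one column, skipping missing-value markers."""
    counts = Counter()
    for line in rows:
        fields = split_row(line)
        if index < len(fields) and fields[index] not in missing:
            counts[fields[index]] += 1
    return counts


# Hypothetical three-column rows; the real file has 296 attributes per row.
sample = [
    r"Monday,Netscape,http\://www\.gazelle\.com",
    r"Sunday,Internet Explorer,Other",
    r"Friday,AOL,http\://www\.gazelle\.com",
]

referrers = count_column(sample, index=2)  # index 2 is an assumed position
for value, n in referrers.most_common():
    print(value, n)
```

On the real file the index would have to be looked up from the position of `Session First Referrer Top 5` in Customers.names.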
See5 [Release 1.15]     Tue Mar 26 13:26:38 2002

    Options:
        Cross-validate using 10 folds

Class specified by attribute `Session First Referrer Top 5'

*** line 19999 of `Customers.data': unexpected end of file

Read 19998 cases (296 attributes) from Customers.data

[ Fold 0 ]

Decision tree:

Num BrandOrder Assortment Views <= 0:
:...Num main/vendor2 Template Views > 0:
:   :...Num HotSox Product Views > 0: http://www.mycoupons.com (2/1)
:   :   Num HotSox Product Views <= 0:
:   :   :...Session ID <= 286453:
:   :       :...Num Danskin Product Views <= 0: Other (31/12)
:   :       :   Num Danskin Product Views > 0: http://www.gazelle.com (3)
:   :       Session ID > 286453:
:   :       :...Num Solid Pattern Views > 1: Other (2)
:   :           Num Solid Pattern Views <= 1: [S1]
:   Num main/vendor2 Template Views <= 0:
:   :...Session Browser Family in Google,AltaVista,Novell Border Manager,link-check,EmailSiphon,Java,Genie,Lycos,Lotus Notes,InfoSeek,Northern Light,Cute FTP,Mozilla: Other (0)
:       Session Browser Family = Teleport Pro: http://www.gazelle.com (39)
:       Session Browser Family = PerMan Surfer: Other (6)
:       Session Browser Family = WebTV: http://www.gazelle.com (4/1)
:       Session Browser Family = WebTrends: Other (175)
:       Session Browser Family = Enfish Tracker: Other (2)
:       Session Browser Family = ergyBot: Other (4)
:       Session Browser Family = Lesszilla: Other (2)
:       Session Browser Family = Unknown: Other (140)
:       Session Browser Family = Nitro e-mail collector: Other (764)
:       Session Browser Family = AOL: [S2]
:       Session Browser Family = Other:
:       :...Texture Last = Textured: Other (0)
:       :   Texture Last = Flat: http://www.mycoupons.com (2/1)
:       :   Texture Last = NULL:
:       :   :...Num Nylon Product Views > 0: http://www.gazelle.com (3)
:       :       Num Nylon Product Views <= 0: [S3]
:       Session Browser Family = Netscape:
:       :...Num articles/dpt_about_mgmtteam Template Views > 0: Other (22)
:       :   Num articles/dpt_about_mgmtteam Template Views <= 0:
:       :   :...Num main/freegift Template Views > 0: [S4]
:       :       Num main/freegift Template Views <= 0:
:       :       :...Num EvanPicone Product Views > 0: [S5]
:       :           Num EvanPicone Product Views <= 0:
:       :           :...Num HotSox Product Views > 0: Other (3/1)
:       :               Num HotSox Product Views <= 0:
:       :               :...Num main/login2 Template Views > 0: [S6]
:       :                   Num main/login2 Template Views <= 0: [S7]
:       Session Browser Family = Internet Explorer:
:       :...Session First Request Day of Week in [Monday-Wednesday]:
:           :...Session Visit Count <= 1:
:           :   :...Num WDCS Category Views > 0: [S8]
:           :   :   Num WDCS Category Views <= 0:
:           :   :   :...Num Brands Assortment Views > 1: [S9]
:           :   :       Num Brands Assortment Views <= 1:
:           :   :       :...Num main/lifestyles Template Views > 0: [S10]
:           :   :           Num main/lifestyles Template Views <= 0:
:           :   :           :...Cookie First Visit Date <= 2000/02/29:
:           :   :               :...Session First Content ID > 11337: Other (11)
:           :   :               :   Session First Content ID <= 11337: [S11]
:           :   :               Cookie First Visit Date > 2000/02/29: [S12]
:           :   Session Visit Count > 1: [S13]
:           Session First Request Day of Week in [Thursday-Sunday]: [S14]
Num BrandOrder Assortment Views > 0:
:...Session Cookie ID > 847463:
    :...Session First Request Date > 2000/03/26:
    :   :...Session Browser Family in Google,Teleport Pro,AltaVista,PerMan Surfer,link-check,EmailSiphon,Java,Genie,WebTrends,Enfish Tracker,ergyBot,Lotus Notes,InfoSeek,Northern Light,Cute FTP: Other (0)
    :   :   Session Browser Family = Novell Border Manager: Other (2)
    :   :   Session Browser Family = Lycos: Other (1)
    :   :   Session Browser Family = Lesszilla: Other (1)
    :   :   Session Browser Family = Netscape: Other (265/10)
    :   :   Session Browser Family = Unknown: Other (28)
    :   :   Session Browser Family = Nitro e-mail collector: Other (1)
    :   :   Session Browser Family = AOL: Other (97/23)
    :   :   Session Browser Family = Mozilla: Other (8)
    :   :   Session Browser Family = Other: Other (73/4)
    :   :   Session Browser Family = WebTV: [S15]
    :   :   Session Browser Family = Internet Explorer:
    :   :   :...Session Visit Count > 16: http://stores.shopnow.com (13/2)
    :   :       Session Visit Count <= 16:
    :   :       :...Num BrandOrder Assortment Views > 2: [S16]
    :   :           Num BrandOrder Assortment Views <= 2:
    :   :           :...Num products Template Views > 0: Other (61/12)
    :   :               Num products Template Views <= 0:
    :   :               :...Session First Request Date > 2000/03/28: [S17]
    :   :                   Session First Request Date <= 2000/03/28:
    :   :                   :...Cookie First Visit Date_Time <= 00:31:02: [S18]
    :   :                       Cookie First Visit Date_Time > 00:31:02:
    :   :                       :...Session Visit Count > 2: Other (5)
    :   :                           Session Visit Count <= 2:
    :   :                           :...Session Request Count > 2: Other (42/6)
    :   :                               Session Request Count <= 2: [S19]
    :   Session First Request Date <= 2000/03/26:
    :   :...Num articles Template Views > 0:
    :       :...Num Hanes Product Views <= 0: Other (44)
    :       :   Num Hanes Product Views > 0: http://www.fashionmall.com (2/1)
    :       Num articles Template Views <= 0: [S20]
    Session Cookie ID <= 847463:
    :...Session ID > 142138:

Tree has been cut for convenience

Question 2. How do most of our customers find out about us?

See5 [Release 1.15]     Tue Mar 26 13:35:19 2002

    Options:
        Cross-validate using 10 folds

Class specified by attribute `HowDidYouFindUs'

*** ignoring cases with bad or unknown class
*** line 19999 of `Customers.data': unexpected end of file

Read 58 cases (296 attributes) from Customers.data

[ Fold 0 ]

Decision tree:

Num main/freegift Template Views > 0:
:...Num main/freegift Template Views <= 1: Web/Banner Ad (5/1)
:   Num main/freegift Template Views > 1: Friend/Co-worker (2)
Num main/freegift Template Views <= 0:
:...Num LEO Category Views > 0: Web/Banner Ad (3/2)
    Num LEO Category Views <= 0:
    :...Num Oroblu Product Views <= 0: Friend/Co-worker (40/10)
        Num Oroblu Product Views > 0: Other (3)

Evaluation on hold-out data (5 cases):

    Decision Tree
  ----------------
  Size      Errors
     5    4(80.0%)   <<

[ Fold 1 ]

Decision tree:

Num LEO Category Views > 0: Web/Banner Ad (3/2)
Num LEO Category Views <= 0:
:...Num main/freegift Template Views > 0:
    :...Session Visit Count <= 2: Web/Banner Ad (5/1)
    :   Session Visit Count > 2: Friend/Co-worker (3)
    Num main/freegift Template Views <= 0:
    :...Num Oroblu Product Views > 1: Other (2)
        Num Oroblu Product Views <= 1:
        :...Email in MIL,GOV,ORG,NULL: Friend/Co-worker (0)
            Email = NET: Friend/Co-worker (5)
            Email = EDU: Search Engine (1)
            Email = Gazelle: Friend/Co-worker (9)
            Email = Other: Friend/Co-worker (1)
            Email = COM:
            :...Num CT Waist Control Views > 0: Other (3/1)
                Num CT Waist Control Views <= 0:
                :...Num HotSox Product Views > 0: Other (2)
                    Num HotSox Product Views <= 0:
                    :...Session Time Elapsed <= 44: Other (4)
                        Session Time Elapsed > 44: Friend/Co-worker (15/1)

Evaluation on hold-out data (5 cases):

    Decision Tree
  ----------------
  Size      Errors
    12    4(80.0%)   <<

[ Fold 2 ]

Decision tree:

Num main/freegift Template Views > 0:
:...Session Visit Count <= 2: Web/Banner Ad (5/1)
:   Session Visit Count > 2: Friend/Co-worker (3)
Num main/freegift Template Views <= 0:
:...Session Last Request Processing Time > 6813: Other (4/1)
    Session Last Request Processing Time <= 6813:
    :...Login Failure Count > 4: Other (3.4/0.3)
        Login Failure Count <= 4:
        :...Minority Census Tract = True: Other (1)
            Minority Census Tract = False: Friend/Co-worker (25.7/3.9)
            Minority Census Tract = NULL:
            :...NumberOfChildren = 0: Friend/Co-worker (7.8/0.9)
                NumberOfChildren in [1-4 or more]: Other (2)

Evaluation on hold-out data (6 cases):

    Decision Tree
  ----------------
  Size      Errors
     8    2(33.3%)   <<

[ Fold 3 ]

Decision tree:

Num LEO Category Views > 0: Web/Banner Ad (2/1)
Num LEO Category Views <= 0:
:...Num main/freegift Template Views > 0:
    :...Session Visit Count <= 2: Web/Banner Ad (5/1)
    :   Session Visit Count > 2: Friend/Co-worker (3)
    Num main/freegift Template Views <= 0:
    :...Num Oroblu Product Views > 1: Other (2)
        Num Oroblu Product Views <= 1:
        :...Session Last Request Processing Time > 6813: Search Engine (2/1)
            Session Last Request Processing Time <= 6813:
            :...Login Failure Count <= 4: Friend/Co-worker (33.1/4.9)
                Login Failure Count > 4: Other (4.9/0.8)

Evaluation on hold-out data (6 cases):

    Decision Tree
  ----------------
  Size      Errors
     7    3(50.0%)   <<

[ Fold 4 ]
Decision tree:

Num main/freegift Template Views > 0:
:...New Car Buyer = NULL: Web/Banner Ad (5/1)
:   New Car Buyer = True: Friend/Co-worker (2)
Num main/freegift Template Views <= 0:
:...Num LEO Category Views > 0: Web/Banner Ad (3/2)
    Num LEO Category Views <= 0:
    :...Num Oroblu Product Views <= 1: Friend/Co-worker (40/10)
        Num Oroblu Product Views > 1: Other (2)

Evaluation on hold-out data (6 cases):

    Decision Tree
  ----------------
  Size      Errors
     5    4(66.7%)   <<

Tree has been cut for convenience

Question 3. What gender are most of our customers?

See5 [Release 1.15]     Tue Mar 26 13:38:12 2002

    Options:
        Cross-validate using 10 folds

Class specified by attribute `Gender'

*** line 19999 of `Customers.data': unexpected end of file

Read 19998 cases (296 attributes) from Customers.data

[ Fold 0 ]

Decision tree:

Other Indiv. Gender = Male:
:...Num MDS Category Views <= 0:
:   :...Session Cookie ID <= 737171: Female (99/5)
:   :   Session Cookie ID > 737171:
:   :   :...Available Home Equity in [EQUITY $1-$4;999-EQUITY $75;000-$99;999]: Female (3)
:   :       Available Home Equity in [EQUITY $100;000-$149;999-EQUITY $2;000;000 AND OVER]: Male (2)
:   Num MDS Category Views > 0:
:   :...Session Browser Family Top 3 = Netscape: Female (0)
:       Session Browser Family Top 3 = AOL: Female (2)
:       Session Browser Family Top 3 = Other: Female (1)
:       Session Browser Family Top 3 = Internet Explorer:
:       :...Num EllenTracy Product Views <= 0: Male (3)
:           Num EllenTracy Product Views > 0: NULL (2)
Other Indiv. Gender = Female:
:...Session Visit Count > 5: Female (6)
:   Session Visit Count <= 5:
:   :...Num Fashion Product Views > 1: Female (2)
:       Num Fashion Product Views <= 1:
:       :...Number Of Adults <= 1: Female (2)
:           Number Of Adults > 1:
:           :...Cookie First Visit Date <= 2000/03/13: Male (15)
:               Cookie First Visit Date > 2000/03/13: Female (2)
Other Indiv. Gender = NULL:
:...Occupation in SELF EMPLOYED MANAGEMENT,MILITARY,SELF EMPLOYED PROF/TECH,RETIRED,SELF EMPLOYED SALES/MARKETING,SELF EMPLOYED,SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,SELF EMPLOYED HOMEMAKER: NULL (0)
    Occupation = SALES/SERVICE: Female (1)
    Occupation = HOUSEWIFE: Female (1)
    Occupation = CRAFTSMAN/BLUE COLLAR: Female (1)
    Occupation = STUDENT: Female (3/1)
    Occupation = ADMINISTRATIVE/MANAGERIAL: Male (6)
    Occupation = OTHER: Female (1)
    Occupation = CLERICAL/WHITE COLLAR: Female (2)
    Occupation = PROFESSIONAL/TECHNICAL:
    :...Working Woman = NULL: Female (0)
    :   Working Woman = True: Female (7)
    :   Working Woman = False: Male (2)
    Occupation = NULL:
    :...Household Status = NAME APPEARING ON INPUT IS INDIVIDUAL 2: NULL (15)
        Household Status = NULL: NULL (17768)
        Household Status = NAME APPEARING ON INPUT IS INDIVIDUAL 1:
        :...Account Creation Date_Time > 19:55:49: Male (7)
            Account Creation Date_Time <= 19:55:49: [S1]

SubTree [S1]

Property Type in apartment(5+ units),2-4 unit(duplex;triplex;quad),mobile_home,misc. residential (condo store/flat): NULL (0)
Property Type = condo: Male (1)
Property Type = single family dwelling:
:...Number Of Adults <= 1: Female (5/1)
:   Number Of Adults > 1: NULL (4)
Property Type = NULL:
:...Num AmericanEssentials Product Views > 1: Male (2)
    Num AmericanEssentials Product Views <= 1: [S2]

SubTree [S2]

Session First Referrer Top 5 in http://www.fashionmall.com,http://stores.shopnow.com,http://www.winnie-cooper.com: NULL (0)
Session First Referrer Top 5 = http://www.mycoupons.com: Female (2)
Session First Referrer Top 5 = http://www.gazelle.com:
:...Estimated Income Code in [Under $15;000-$30;000-$39;999]: Male (2/1)
:   Estimated Income Code in [$40;000-$49;999-$125;000 OR MORE]: Female (2)
Session First Referrer Top 5 = Other:
:...Num main/login2 Template Views <= 0: NULL (15)
    Num main/login2 Template Views > 0:
    :...New Car Buyer = NULL: Female (7)
        New Car Buyer = True: NULL (6)

Evaluation on hold-out data (1999 cases):

    Decision Tree
  ----------------
  Size      Errors
    34    5( 0.3%)   <<

[ Fold 1 ]

Decision tree:

Other Indiv. Gender = Female:
:...Marital Status in Inferred Married,Inferred Single: Male (0)
:   Marital Status = Married: Male (13/1)
:   Marital Status = NULL: Female (9)
:   Marital Status = Single:
:   :...Num EllenTracy Product Views <= 0: Male (3)
:       Num EllenTracy Product Views > 0: Female (2)
Other Indiv. Gender = Male:
:...Num MDS Category Views > 0:
:   :...Account Creation Date > 2000/03/09: Female (2)
:   :   Account Creation Date <= 2000/03/09:
:   :   :...Num EllenTracy Product Views <= 0: Male (3)
:   :       Num EllenTracy Product Views > 0: NULL (2)
:   Num MDS Category Views <= 0:
:   :...DoYouPurchaseForOthers = NULL:
:       :...Speciality Store Retail = NULL: Female (0)
:       :   Speciality Store Retail = True: Male (3)
:       :   Speciality Store Retail = False: Female (4)
:       DoYouPurchaseForOthers = False:
:       :...Account Creation Date <= 2000/03/14: Female (92/3)
:           Account Creation Date > 2000/03/14:
:           :...Other Indiv. Age <= 28: Male (2)
:               Other Indiv. Age > 28: Female (4)
Other Indiv. Gender = NULL:
:...Occupation in SELF EMPLOYED MANAGEMENT,MILITARY,SELF EMPLOYED PROF/TECH,SELF EMPLOYED SALES/MARKETING,SELF EMPLOYED,SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,SELF EMPLOYED HOMEMAKER: NULL (0)
    Occupation = SALES/SERVICE: Female (1)
    Occupation = HOUSEWIFE: Female (2)
    Occupation = CRAFTSMAN/BLUE COLLAR: Female (1)
    Occupation = STUDENT: Female (3/1)
    Occupation = ADMINISTRATIVE/MANAGERIAL: Male (5)
    Occupation = RETIRED: Female (1)
    Occupation = OTHER: Female (1)
    Occupation = CLERICAL/WHITE COLLAR: Female (2)
    Occupation = PROFESSIONAL/TECHNICAL:
    :...Working Woman = NULL: Female (0)
    :   Working Woman = True: Female (7)
    :   Working Woman = False: Male (2)
    Occupation = NULL:
    :...Household Status = NAME APPEARING ON INPUT IS INDIVIDUAL 2: NULL (16)
        Household Status = NULL: NULL (17769)

Tree has been cut for convenience

Question 4. How many customers wish to receive mail from our company?

See5 [Release 1.15]     Tue Mar 26 13:49:10 2002

    Options:
        Cross-validate using 10 folds

Class specified by attribute `SendEmail'

*** line 19999 of `Customers.data': unexpected end of file

Read 19998 cases (296 attributes) from Customers.data

[ Fold 0 ]

Decision tree:

DoYouPurchaseForOthers = NULL:
:...Email in MIL,GOV,ORG: True (0)
:   Email = NET: False (6/1)
:   Email = COM: True (33/8)
:   Email = EDU: True (1)
:   Email = Gazelle: True (13/2)
:   Email = Other: True (2)
:   Email = NULL: NULL (17637)
DoYouPurchaseForOthers = False:
:...Account Creation Date <= 2000/03/15: False (278)
    Account Creation Date > 2000/03/15:
    :...Num Reinforced Toe Views <= 1: True (26/2)
        Num Reinforced Toe Views > 1: False (3/1)

Evaluation on hold-out data (1999 cases):

    Decision Tree
  ----------------
  Size      Errors
     9    1( 0.1%)   <<

[ Fold 1 ]

Decision tree:

DoYouPurchaseForOthers = False:
:...Account Creation Date <= 2000/03/15: False (279)
:   Account Creation Date > 2000/03/15:
:   :...Num products Template Views <= 7: True (24)
:       Num products Template Views > 7: False
(4/1) DoYouPurchaseForOthers = NULL: :...Email in MIL,GOV,ORG: True (0) Email = NET: False (7/1) Email = EDU: True (1) Email = Other: True (2) Email = NULL: NULL (17638) Email = Gazelle: :...Num BrandOrder Assortment Views <= 0: False (3/1) : Num BrandOrder Assortment Views > 0: True (9) Email = COM: :...Bank Card Holder = False: False (2) Bank Card Holder = NULL: :...Session Start Login Count <= 1: True (5) : Session Start Login Count > 1: False (2) Bank Card Holder = True: :...Year Of Structure <= 1954: False (3/1) Year Of Structure > 1954: True (20) 72 Evaluation on hold-out data (1999 cases): Decision Tree ---------------Size Errors 14 4( 0.2%) << [ Fold 2 ] Decision tree: DoYouPurchaseForOthers = False: :...Account Creation Date <= 2000/03/15: False (279) : Account Creation Date > 2000/03/15: True (29/2) DoYouPurchaseForOthers = NULL: :...DMA No Mail Solicitation Flag = NULL: NULL (17649/12) DMA No Mail Solicitation Flag = True: :...Own Or Rent Home = Renter: True (0) Own Or Rent Home = NULL: :...Num main/replenishment Template Views <= 1: False (8) : Num main/replenishment Template Views > 1: True (2) Own Or Rent Home = Owner: :...Num checkout Template Views > 7: False (2) Num checkout Template Views <= 7: :...Year Of Structure > 1927: True (25/1) Year Of Structure <= 1927: :...Account Creation Date <= 2000/02/09: True (2) Account Creation Date > 2000/02/09: False (2) Evaluation on hold-out data (2000 cases): Decision Tree ---------------Size Errors 9 5( 0.3%) << [ Fold 3 ] Decision tree: DoYouPurchaseForOthers = False: :...Account Creation Date <= 2000/03/15: False (279) : Account Creation Date > 2000/03/15: : :...Num Reinforced Toe Views <= 2: True (27/2) : Num Reinforced Toe Views > 2: False (3/1) DoYouPurchaseForOthers = NULL: :...DMA No Mail Solicitation Flag = NULL: NULL (17648/12) DMA No Mail Solicitation Flag = True: :...Num Danskin Product Views > 1: False (2) Num Danskin Product Views <= 1: :...Own Or Rent Home = Renter: True (0) Own Or Rent Home = 
NULL: :...Account Creation Date <= 2000/01/30: True (3) : Account Creation Date > 2000/01/30: False (6) Own Or Rent Home = Owner: :...Year Of Structure > 1954: True (25/1) Year Of Structure <= 1954: :...Estimated Income Code in [Under $15;000-$75;000-$99;999]: False (3) Estimated Income Code in [$100;000-$124;999-$125;000 OR MORE]: True (2) Evaluation on hold-out data (2000 cases): Decision Tree ---------------Size Errors 10 1( 0.1%) << [ Fold 4 ] Decision tree: DoYouPurchaseForOthers = False: Tree has been cut for convenience 73 Question 5. How many of our customers are Premium Card holders? See5 [Release 1.15] Tue Mar 26 13:51:51 2002 Options: Cross-validate using 10 folds Class specified by attribute `Premium Card Holder' *** line 19999 of `Customers.data': unexpected end of file Read 19998 cases (296 attributes) from Customers.data [ Fold 0 ] Decision tree: TE Card Holder = NULL: NULL (17728) TE Card Holder = True: :...Session First Request Day of Week in [Monday-Tuesday]: False (6) : Session First Request Day of Week in [Wednesday-Sunday]: : :...Own Or Rent Home = NULL: True (0) : Own Or Rent Home = Owner: True (22/5) : Own Or Rent Home = Renter: False (2) TE Card Holder = False: :...Num Seasonal 1 Stock Views <= 4: False (232/37) Num Seasonal 1 Stock Views > 4: :...Truck Owner = NULL: True (0) Truck Owner = True: False (3) Truck Owner = False: True (6/1) Evaluation on hold-out data (1999 cases): Decision Tree ---------------Size Errors 7 6( 0.3%) << [ Fold 1 ] Decision tree: TE Card Holder = NULL: NULL (17728) TE Card Holder = True: :...Session First Request Day of Week in [Monday-Tuesday]: False (6) : Session First Request Day of Week in [Wednesday-Sunday]: : :...RV Owner = NULL: True (0) : RV Owner = True: False (2) : RV Owner = False: True (18/3) TE Card Holder = False: :...Bank Card Holder = NULL: False (0) Bank Card Holder = False: False (33) Bank Card Holder = True: :...Num main/lifestyles Template Views > 0: False (12) Num main/lifestyles Template 
Views <= 0: :...Num Replenishment Stock Views > 0: False (13) Num Replenishment Stock Views <= 0: :...Gas Card Holder = NULL: False (0) Gas Card Holder = False: False (26/1) Gas Card Holder = True: :...Finance Retail Activity = NULL: False (0) Finance Retail Activity = True: [S1] Finance Retail Activity = False: [S2] SubTree [S1] Session First Request Day of Week in [Monday-Wednesday]: True (5/1) Session First Request Day of Week in [Thursday-Sunday]: False (5/1) SubTree [S2] 74 Property Type in apartment(5+ units),2-4 unit(duplex;triplex;quad), : mobile_home, : misc. residential (condo store/flat): False (0) Property Type = condo: True (4/1) Property Type = single family dwelling: :...Num main/login2 Template Views <= 1: False (40/5) : Num main/login2 Template Views > 1: True (4/1) Property Type = NULL: :...Num Ultra Sheer Look Product Views > 0: True (4/1) Num Ultra Sheer Look Product Views <= 0: :...RV Owner = NULL: False (0) RV Owner = True: False (14/1) RV Owner = False: :...Number Of Adults > 3: :...Age <= 26: False (3) : Age > 26: True (8/1) Number Of Adults <= 3: :...Num PH Category Views <= 0: :...Num Departments Assortment Views <= 15: False (61/7) : Num Departments Assortment Views > 15: True (5/1) Num PH Category Views > 0: :...Session Last Request Processing Time <= 532: True (5) Session Last Request Processing Time > 532: False (3) Evaluation on hold-out data (1999 cases): Decision Tree ---------------Size Errors 21 9( 0.5%) << [ Fold 2 ] Decision tree: TE Card Holder = NULL: NULL (17728) TE Card Holder = True: :...Session First Request Day of Week in [Monday-Tuesday]: False (5) : Session First Request Day of Week in [Wednesday-Sunday]: : :...Num Welcome Assortment Views > 0: False (2) : Num Welcome Assortment Views <= 0: : :...Own Or Rent Home = NULL: True (0) : Own Or Rent Home = Owner: True (17/2) : Own Or Rent Home = Renter: False (2) TE Card Holder = False: :...Bank Card Holder = NULL: False (0) Bank Card Holder = False: False (35) Bank Card 
Holder = True: :...Num main/lifestyles Template Views > 0: False (14) Num main/lifestyles Template Views <= 0: :...Num Replenishment Stock Views > 0: False (14) Num Replenishment Stock Views <= 0: :...Num Hanes Product Views > 0: :...Session Browser Family Top 3 = Other: True (0) : Session Browser Family Top 3 = Netscape: True (1) : Session Browser Family Top 3 = AOL: False (5) : Session Browser Family Top 3 = Internet Explorer: True (8/1) Num Hanes Product Views <= 0: :...Account Creation Date <= 2000/03/08: [S1] Account Creation Date > 2000/03/08: :...Working Woman = NULL: False (0) Working Woman = True: True (6/2) Working Woman = False: :...Num Luxury Product Views > 0: True (2) Num Luxury Product Views <= 0: :...Num Tube Package Views <= 0: False (28/8) Num Tube Package Views > 0: True (2) Tree has been cut for convenience 75 Question 6. Who is the Email provider of most of our customers? See5 [Release 1.15] Tue Mar 26 13:03:09 2002 Options: Cross-validate using 10 folds Class specified by attribute `Email' *** line 19999 of `Customers.data': unexpected end of file Read 19998 cases (296 attributes) from Customers.data [ Fold 0 ] Decision tree: DoYouPurchaseForOthers = NULL: :...SendEmail = NULL: NULL (17638/1) : SendEmail = False: [S1] : SendEmail = True: : :...Num Fashion 1 Stock Views > 0: Other (2) : Num Fashion 1 Stock Views <= 0: : :...Num Luxury Product Views > 0: NET (2/1) : Num Luxury Product Views <= 0: : :...Customer ID <= 236: Gazelle (6) : Customer ID > 236: : :...RV Owner = True: EDU (1) : RV Owner = NULL: : :...Session Start Login Count <= 1: COM (4) : : Session Start Login Count > 1: Gazelle (2) : RV Owner = False: : :...Num main/freegift Template Views <= 0: COM (18/1) : Num main/freegift Template Views > 0: : :...Account Creation Date <= 2000/02/03: Gazelle (2) : Account Creation Date > 2000/02/03: COM (2) DoYouPurchaseForOthers = False: :...Num Tube Package Views > 0: :...Customer ID <= 12146: Gazelle (2/1) : Customer ID > 12146: COM (5) Num 
Tube Package Views <= 0: :...Request Processing Time Sum <= 62: :...Num main/login2 Template Views <= 0: NET (8/1) : Num main/login2 Template Views > 0: COM (2/1) Request Processing Time Sum > 62: :...Session Browser Family in Google,Teleport Pro,AltaVista, : PerMan Surfer,Novell Border Manager, : link-check,EmailSiphon,Java,Genie, : WebTrends,Enfish Tracker,Lycos,ergyBot, : Lotus Notes,Lesszilla,InfoSeek, : Northern Light,Unknown,Cute FTP, : Nitro e-mail collector, : Mozilla: COM (0) Session Browser Family = WebTV: NET (1) Session Browser Family = AOL: COM (63) Session Browser Family = Netscape: :...Num UniqueBoutiques Assortment Views <= 29: COM (39/14) : Num UniqueBoutiques Assortment Views > 29: EDU (2) Session Browser Family = Other: :...Session Start Login Count > 1: NET (2) : Session Start Login Count <= 1: : :...New Car Buyer = NULL: COM (7) : New Car Buyer = True: : :...Length Of Residence <= 9: NET (4) : Length Of Residence > 9: COM (3) Session Browser Family = Internet Explorer: :...Upscale Retail = True: COM (4/2) Upscale Retail = NULL: :...Cookie First Visit Date_Time <= 16:10:16: COM (31/3) : Cookie First Visit Date_Time > 16:10:16: : :...Num PH Category Views <= 0: NET (8/1) 76 : Num PH Category Views > 0: COM (3/2) Upscale Retail = False: :...Miscellaneous Retail Activity = NULL: COM (0) Miscellaneous Retail Activity = True: :...Session Request Count <= 13: NET (3) : Session Request Count > 13: COM (2/1) Miscellaneous Retail Activity = False: :...Num main/freegift Template Views > 1: NET (2) Num main/freegift Template Views <= 1: :...Num NicoleMiller Product Views > 2: NET (4/1) Num NicoleMiller Product Views <= 2: :...Num Liquid Product Views > 0: NET (4/1) Num Liquid Product Views <= 0: :...Speciality Store Retail = NULL: COM (0) Speciality Store Retail = True: COM (12) Speciality Store Retail = False: :...BasicOrFashion Last = Fashion: COM (0) BasicOrFashion Last = Basic: COM (6) BasicOrFashion Last = NULL: [S2] SubTree [S1] Session First 
Referrer Top 5 in http://www.mycoupons.com, : http://www.fashionmall.com, : http://stores.shopnow.com, : http://www.winnie-cooper.com: NET (0) Session First Referrer Top 5 = http://www.gazelle.com: Gazelle (2) Session First Referrer Top 5 = Other: :...Marital Status = Single: NET (0) Marital Status = Inferred Married: COM (1) Marital Status = Inferred Single: NET (6) Marital Status = Married: COM (1) Marital Status = NULL: COM (3) SubTree [S2] Num main/vendor Template Views <= 0: COM (83/21) Num main/vendor Template Views > 0: :...Estimated Income Code in [Under $15;000-$40;000-$49;999]: COM (4) Estimated Income Code in [$50;000-$74;999-$125;000 OR MORE]: NET (5) Evaluation on hold-out data (1999 cases): Decision Tree ---------------Size Errors 41 11( 0.6%) << [ Fold 1 ] Decision tree: DoYouPurchaseForOthers = NULL: :...SendEmail = NULL: NULL (17638/1) : SendEmail = False: [S1] : SendEmail = True: : :...Num Fashion 1 Stock Views > 0: Other (2) : Num Fashion 1 Stock Views <= 0: : :...Customer ID <= 328: Gazelle (6) : Customer ID > 328: : :...Num BrandOrder Assortment Views <= 1: COM (28/5) : Num BrandOrder Assortment Views > 1: Gazelle (2) DoYouPurchaseForOthers = False: :...Session Browser Family in Google,Teleport Pro,AltaVista,PerMan Surfer, : Novell Border Manager,link-check,EmailSiphon, : Java,Genie,WebTrends,Enfish Tracker,Lycos, : ergyBot,Lotus Notes,Lesszilla,InfoSeek, : Northern Light,Unknown,Cute FTP, : Nitro e-mail collector,Mozilla: COM (0) Tree has been cut for convenience 77 Question 7. How many of our customers are mail order buyers? 
See5 [Release 1.15]    Tue Mar 26 13:07:39 2002

    Options:
        Cross-validate using 10 folds

Class specified by attribute `Mail Order Buyer'

*** line 19999 of `Customers.data': unexpected end of file

Read 19998 cases (296 attributes) from Customers.data

[ Fold 0 ]

Decision tree:

Mail Responder = NULL: NULL (17728)
Mail Responder = False: False (68)
Mail Responder = True:
:...Num Opaque Look Product Views > 0:
    :...UnitsPerInnerBox Average <= 5.25: True (4)
    :   UnitsPerInnerBox Average > 5.25: False (3)
    Num Opaque Look Product Views <= 0:
    :...Unknown Card Type = NULL: True (0)
        Unknown Card Type = True: True (84/7)
        Unknown Card Type = False:
        :...Session Visit Count <= 13: True (105/19)
            Session Visit Count > 13:
            :...Account Creation Date <= 2000/01/30: True (2)
                Account Creation Date > 2000/01/30: False (5)

Evaluation on hold-out data (1999 cases):

        Decision Tree
      ---------------
      Size      Errors

         8    4( 0.2%)   <<

[ Fold 1 ]

Decision tree:

Mail Responder = NULL: NULL (17728)
Mail Responder = False: False (70)
Mail Responder = True:
:...Num Opaque Look Product Views <= 0:
    :...Session Start Login Count <= 6: True (189/25)
    :   Session Start Login Count > 6: False (5/1)
    Num Opaque Look Product Views > 0:
    :...UnitsPerInnerBox Average <= 5.25: True (4)
        UnitsPerInnerBox Average > 5.25: False (3)

Evaluation on hold-out data (1999 cases):

        Decision Tree
      ---------------
      Size      Errors

         6    7( 0.4%)   <<

[ Fold 2 ]

Decision tree:

Mail Responder = NULL: NULL (17728)
Mail Responder = False: False (69)
Mail Responder = True:
:...Num Opaque Look Product Views > 0:
    :...UnitsPerInnerBox Average <= 5.25: True (3)
    :   UnitsPerInnerBox Average > 5.25: False (3)
    Num Opaque Look Product Views <= 0:
    :...Login Failure Count <= 4: True (185/24.8)
        Login Failure Count > 4:
        :...Session Visit Count <= 12: True (5.9/1.1)
            Session Visit Count > 12: False (4.1)

Evaluation on hold-out data (2000 cases):

        Decision Tree
      ---------------
      Size      Errors

         7    5( 0.3%)   <<

[ Fold 3 ]

Decision tree:

Mail Responder = NULL: NULL (17727)
Mail Responder = False: False (67)
Mail Responder = True:
:...Unknown Card Type = NULL: True (0)
    Unknown Card Type = True: True (82/8)
    Unknown Card Type = False:
    :...Num Opaque Look Product Views > 0: False (4/1)
        Num Opaque Look Product Views <= 0:
        :...Session Visit Count <= 13: True (112/21)
            Session Visit Count > 13:
            :...Account Creation Date <= 2000/01/30: True (2)
                Account Creation Date > 2000/01/30: False (4)

Evaluation on hold-out data (2000 cases):

        Decision Tree
      ---------------
      Size      Errors

         7    1( 0.1%)   <<

[ Fold 4 ]

Decision tree:

Mail Responder = NULL: NULL (17727)
Mail Responder = False: False (68)
Mail Responder = True:
:...Num Opaque Look Product Views <= 0: True (196/32)
    Num Opaque Look Product Views > 0:
    :...UnitsPerInnerBox Average <= 5.25: True (4)
        UnitsPerInnerBox Average > 5.25: False (3)

Evaluation on hold-out data (2000 cases):

        Decision Tree
      ---------------
      Size      Errors

         5    3( 0.1%)   <<

[ Fold 5 ]

Decision tree:

Mail Responder = NULL: NULL (17727)
Mail Responder = True: True (202/34)
Mail Responder = False: False (69)

Tree has been cut for convenience

Question 8. Find out how many children do most of our customers have?
See5 [Release 1.15]    Tue Mar 26 13:10:46 2002

    Options:
        Cross-validate using 10 folds

Class specified by attribute `NumberOfChildren'

*** ignoring cases with bad or unknown class
*** line 19999 of `Customers.data': unexpected end of file

Read 58 cases (296 attributes) from Customers.data

[ Fold 0 ]

Decision tree:

Vehicle Lifestyle in IMPORT (STANDARD/ECONOMY),REGULAR (MIDSIZE/SMALL),
:                    TRUCK OR UTILITY VEHICLE: 0 (0)
Vehicle Lifestyle = FULL SIZE (STANDARD/LUXURY): 0 (1)
Vehicle Lifestyle = SPECIALTY (MIDSIZE/SMALL): 4 or more (3)
Vehicle Lifestyle = PERSONAL LUXURY CAR: 0 (1)
Vehicle Lifestyle = STATION WAGON: 1 (1)
Vehicle Lifestyle = NULL:
:...Num LifeStyles Assortment Views > 0: 2 (2/1)
    Num LifeStyles Assortment Views <= 0:
    :...Num Tube Package Views <= 0: 0 (43/6)
        Num Tube Package Views > 0: 2 (2)

Evaluation on hold-out data (5 cases):

        Decision Tree
      ---------------
      Size      Errors

         7    0( 0.0%)   <<

[ Fold 1 ]

Decision tree:

Other Indiv. Occupation in SALES/SERVICE,SELF EMPLOYED MANAGEMENT,
:                          SELF EMPLOYED RETIRED,MILITARY,HOUSEWIFE,
:                          CRAFTSMAN/BLUE COLLAR,RELIGIOUS,STUDENT,
:                          SELF EMPLOYED PROF/TECH,RETIRED,SELF EMPLOYED,
:                          SELF EMPLOYED SALES/MARKETING,
:                          SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,
:                          CLERICAL/WHITE COLLAR: 0 (0)
Other Indiv. Occupation = ADMINISTRATIVE/MANAGERIAL: 2 (3)
Other Indiv. Occupation = PROFESSIONAL/TECHNICAL:
:...NumberOfAdults in [0-1]: 0 (2)
:   NumberOfAdults in [2-3 or more]: 4 or more (3)
Other Indiv. Occupation = NULL:
:...Num main/freegift Template Views > 1: 2 (2)
    Num main/freegift Template Views <= 1:
    :...Num main/boutique Template Views <= 2: 0 (41/4)
        Num main/boutique Template Views > 2: 1 (2/1)

Evaluation on hold-out data (5 cases):

        Decision Tree
      ---------------
      Size      Errors

         6    1(20.0%)   <<

[ Fold 2 ]

Decision tree:

Session First Request Day of Week = Sunday: 4 or more (2)
Session First Request Day of Week in [Monday-Saturday]:
:...Num main/departments Template Views > 3: 2 (4/1)
    Num main/departments Template Views <= 3:
    :...Account Creation Date_Time <= 20:28:38: 0 (42/3)
        Account Creation Date_Time > 20:28:38:
        :...Account Creation Date_Time <= 21:54:16: 1 (2)
            Account Creation Date_Time > 21:54:16: 0 (2/1)

Evaluation on hold-out data (6 cases):

        Decision Tree
      ---------------
      Size      Errors

         5    2(33.3%)   <<

[ Fold 3 ]

Decision tree:

Session First Request Day of Week = Sunday: 4 or more (2)
Session First Request Day of Week in [Monday-Saturday]:
:...Num LifeStyles Assortment Views > 0: 2 (2/1)
    Num LifeStyles Assortment Views <= 0:
    :...Num Tube Package Views <= 0: 0 (46/6)
        Num Tube Package Views > 0: 2 (2)

Evaluation on hold-out data (6 cases):

        Decision Tree
      ---------------
      Size      Errors

         4    2(33.3%)   <<

[ Fold 4 ]

Decision tree:

Session Last Request Processing Time > 20093: 2 (2/1)
Session Last Request Processing Time <= 20093:
:...Other Indiv. Occupation in SALES/SERVICE,SELF EMPLOYED MANAGEMENT,
    :                          SELF EMPLOYED RETIRED,MILITARY,HOUSEWIFE,
    :                          CRAFTSMAN/BLUE COLLAR,RELIGIOUS,STUDENT,
    :                          SELF EMPLOYED PROF/TECH,RETIRED,SELF EMPLOYED,
    :                          SELF EMPLOYED SALES/MARKETING,
    :                          SELF EMPLOYED BLUE COLLAR,
    :                          SELF EMPLOYED CLERICAL,
    :                          CLERICAL/WHITE COLLAR: 0 (0)
    Other Indiv. Occupation = PROFESSIONAL/TECHNICAL: 0 (3/1)
    Other Indiv. Occupation = ADMINISTRATIVE/MANAGERIAL: 2 (3)
    Other Indiv. Occupation = NULL:
    :...Account Creation Date_Time <= 20:28:38: 0 (40/3)
        Account Creation Date_Time > 20:28:38:
        :...Account Creation Date_Time <= 21:54:16: 1 (2)
            Account Creation Date_Time > 21:54:16: 0 (2/1)

Evaluation on hold-out data (6 cases):

        Decision Tree
      ---------------
      Size      Errors

         6    2(33.3%)   <<

Tree has been cut for convenience

Question 9. How many of our customers respond to our marketing mails?

See5 [Release 1.15]    Tue Mar 26 13:13:34 2002

    Options:
        Cross-validate using 10 folds

Class specified by attribute `Mail Responder'

*** line 19999 of `Customers.data': unexpected end of file

Read 19998 cases (296 attributes) from Customers.data

[ Fold 0 ]

Decision tree:

Mail Order Buyer = NULL: NULL (17728)
Mail Order Buyer = True: True (167)
Mail Order Buyer = False:
:...Truck Owner = NULL: False (0)
    Truck Owner = True: True (9/1)
    Truck Owner = False: [S1]

SubTree [S1]

Other Indiv. Occupation in SALES/SERVICE,SELF EMPLOYED MANAGEMENT,
:                          SELF EMPLOYED RETIRED,MILITARY,HOUSEWIFE,RELIGIOUS,
:                          STUDENT,SELF EMPLOYED PROF/TECH,RETIRED,
:                          SELF EMPLOYED,SELF EMPLOYED SALES/MARKETING,
:                          SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,
:                          CLERICAL/WHITE COLLAR: False (0)
Other Indiv. Occupation = PROFESSIONAL/TECHNICAL: True (5)
Other Indiv. Occupation = CRAFTSMAN/BLUE COLLAR: True (1)
Other Indiv. Occupation = ADMINISTRATIVE/MANAGERIAL: False (1)
Other Indiv. Occupation = NULL:
:...Occupation in SELF EMPLOYED MANAGEMENT,MILITARY,CRAFTSMAN/BLUE COLLAR,
    :             SELF EMPLOYED PROF/TECH,RETIRED,
    :             SELF EMPLOYED SALES/MARKETING,SELF EMPLOYED,OTHER,
    :             SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,
    :             SELF EMPLOYED HOMEMAKER: False (0)
    Occupation = SALES/SERVICE: True (1)
    Occupation = HOUSEWIFE: True (1)
    Occupation = STUDENT: True (2/1)
    Occupation = ADMINISTRATIVE/MANAGERIAL: False (6/1)
    Occupation = CLERICAL/WHITE COLLAR: True (3)
    Occupation = PROFESSIONAL/TECHNICAL:
    :...Number Of Adults <= 1: True (5/1)
    :   Number Of Adults > 1: False (3/1)
    Occupation = NULL:
    :...Premium Card Holder = NULL: False (0)
        Premium Card Holder = False: False (54/3)
        Premium Card Holder = True:
        :...Dwelling Unit Size = NULL: False (0)
            Dwelling Unit Size = MULTI FAMILY DWELLING UNIT: False (5)
            Dwelling Unit Size = SINGLE FAMILY DWELLING UNIT:
            :...Length Of Residence <= 4: False (2)
                Length Of Residence > 4: True (6)

Evaluation on hold-out data (1999 cases):

        Decision Tree
      ---------------
      Size      Errors

        17    4( 0.2%)   <<

[ Fold 1 ]

Decision tree:

Mail Order Buyer = NULL: NULL (17728)
Mail Order Buyer = True: True (171)
Mail Order Buyer = False:
:...RV Owner = NULL: False (0)
    RV Owner = True: True (3)
    RV Owner = False:
    :...Texture Last = Flat: True (3)
        Texture Last = Textured: True (2)
        Texture Last = NULL:
        :...Presence Of Children = NULL: False (0)
            Presence Of Children = True:
            :...Num Gift Sets & Special Items Views > 0: False (2)
            :   Num Gift Sets & Special Items Views <= 0:
            :   :...Customer ID <= 9064: False (4)
            :       Customer ID > 9064: [S1]
            Presence Of Children = False:
            :...Num main/departments Template Views > 4: True (3)
                Num main/departments Template Views <= 4:
                :...Login Failure Count <= 6: False (63.6/4.7)
                    Login Failure Count > 6:
                    :...Session Start Login Count <= 2: False (2.4/0.3)
                        Session Start Login Count > 2: True (2)

SubTree [S1]

Available Home Equity in [EQUITY $1-$4;999-EQUITY $10;000-$19;9999]: False (2.1/0.1)
Available Home Equity in [EQUITY $20;000-$29;000-EQUITY $2;000;000 AND OVER]: True (12.9)

Evaluation on hold-out data (1999 cases):

        Decision Tree
      ---------------
      Size      Errors

        13    5( 0.3%)   <<

[ Fold 2 ]

Decision tree:

Mail Order Buyer = NULL: NULL (17728)
Mail Order Buyer = True: True (167)
Mail Order Buyer = False:
:...RV Owner = NULL: False (0)
    RV Owner = True: True (5)
    RV Owner = False:
    :...TE Card Holder = NULL: False (0)
        TE Card Holder = True: True (2)
        TE Card Holder = False:
        :...Occupation in SELF EMPLOYED MANAGEMENT,MILITARY,
            :             CRAFTSMAN/BLUE COLLAR,SELF EMPLOYED PROF/TECH,
            :             RETIRED,SELF EMPLOYED SALES/MARKETING,SELF EMPLOYED,
            :             OTHER,SELF EMPLOYED BLUE COLLAR,
            :             SELF EMPLOYED CLERICAL,
            :             SELF EMPLOYED HOMEMAKER: False (0)
            Occupation = SALES/SERVICE: True (1)
            Occupation = PROFESSIONAL/TECHNICAL: True (7/2)
            Occupation = HOUSEWIFE: True (2)
            Occupation = STUDENT: True (2/1)
            Occupation = ADMINISTRATIVE/MANAGERIAL: False (4)
            Occupation = CLERICAL/WHITE COLLAR: True (3)
            Occupation = NULL:
            :...Dwelling Unit Size = MULTI FAMILY DWELLING UNIT:

Tree has been cut for convenience

Question 10. What percentage of our customers come back to our website?
See5 [Release 1.15]    Tue Mar 26 13:16:07 2002

    Options:
        Cross-validate using 10 folds

Class specified by attribute `Retail Activity'

*** line 19999 of `Customers.data': unexpected end of file

Read 19998 cases (296 attributes) from Customers.data

[ Fold 0 ]

Decision tree:

Unknown Card Type = NULL: NULL (17728)
Unknown Card Type = False: False (177/6)
Unknown Card Type = True:
:...Upscale Retail = NULL: True (0)
    Upscale Retail = True:
    :...Truck Owner = NULL: True (0)
    :   Truck Owner = True: False (4)
    :   Truck Owner = False: True (8/1)
    Upscale Retail = False:
    :...Num LifeStyles Assortment Views > 1:
        :...Cookie First Visit Date_Time <= 16:44:31: False (2)
        :   Cookie First Visit Date_Time > 16:44:31: True (2)
        Num LifeStyles Assortment Views <= 1:
        :...Num WDCS Category Views <= 4: True (71/2)
            Num WDCS Category Views > 4:
            :...Year Of Structure <= 1991: True (5)
                Year Of Structure > 1991: False (2)

Evaluation on hold-out data (1999 cases):

        Decision Tree
      ---------------
      Size      Errors

         9    3( 0.2%)   <<

[ Fold 1 ]

Decision tree:

Unknown Card Type = NULL: NULL (17728)
Unknown Card Type = False: False (177/6)
Unknown Card Type = True:
:...Num Rayon Product Views > 0:
    :...Account Creation Date_Time <= 17:18:40: False (2)
    :   Account Creation Date_Time > 17:18:40: True (2)
    Num Rayon Product Views <= 0:
    :...Num LifeStyles Assortment Views <= 1: True (85/6)
        Num LifeStyles Assortment Views > 1:
        :...Num main/home Template Views <= 2: False (2)
            Num main/home Template Views > 2: True (3)

Evaluation on hold-out data (1999 cases):

        Decision Tree
      ---------------
      Size      Errors

         7    3( 0.2%)   <<

[ Fold 2 ]

Decision tree:

Unknown Card Type = NULL: NULL (17728)
Unknown Card Type = False: False (177/7)
Unknown Card Type = True:
:...Num Rayon Product Views > 0:
    :...Account Creation Date_Time <= 17:21:14: False (2)
    :   Account Creation Date_Time > 17:21:14: True (2)
    Num Rayon Product Views <= 0:
    :...Upscale Retail = NULL: True (0)
        Upscale Retail = False: True (77/5)
        Upscale Retail = True:
        :...Bank Retail Activity = NULL: True (0)
            Bank Retail Activity = True: True (9/1)
            Bank Retail Activity = False: False (3)

Evaluation on hold-out data (2000 cases):

        Decision Tree
      ---------------
      Size      Errors

         7    1( 0.1%)   <<

[ Fold 3 ]

Decision tree:

Unknown Card Type = NULL: NULL (17727)
Unknown Card Type = False: False (179/7)
Unknown Card Type = True:
:...Upscale Retail = NULL: True (0)
    Upscale Retail = False: True (80/6)
    Upscale Retail = True:
    :...Bank Retail Activity = NULL: True (0)
        Bank Retail Activity = True: True (9/1)
        Bank Retail Activity = False: False (3)

Evaluation on hold-out data (2000 cases):

        Decision Tree
      ---------------
      Size      Errors

         5    2( 0.1%)   <<

[ Fold 4 ]

Decision tree:

Unknown Card Type = NULL: NULL (17727)
Unknown Card Type = False: False (177/7)
Unknown Card Type = True:
:...Upscale Retail = NULL: True (0)
    Upscale Retail = False: True (83/7)
    Upscale Retail = True:
    :...Bank Retail Activity = NULL: True (0)
        Bank Retail Activity = False: False (3)
        Bank Retail Activity = True:
        :...Truck Owner = NULL: True (0)
            Truck Owner = True: False (3/1)
            Truck Owner = False: True (5)

Evaluation on hold-out data (2000 cases):

        Decision Tree
      ---------------
      Size      Errors

         6    0( 0.0%)   <<

Tree has been cut for convenience

Appendix H Project Log

Activity Procedures and Timetable Meeting

Met Prof Berzins, my project supervisor, who told me that I had been allocated a project on Data Mining. He asked me to do some reading and to find out some information before deciding what I should do.
Wrote an email to Stuart Robert, the project initiator, asking him the questions for which I had to find answers.
Forwarded the answers I got from Stuart to Prof Berzins.
Attended the Questionnaire briefing.
Met with my supervisor and discussed the minimum requirements for my project. He was worried that I might be doing something that had already been done on my case study. The problem is that we do not have datasets to test on, so I have to rely on the KDDCUP datasets from the Internet to do some mining, and those datasets are specific to their questions.
Submitted the aim and minimum requirements of my project.
Met my supervisor to update him about what I had submitted. Asked him to back me up on my disk space requirements.
Sent an email to support requesting some disk space for downloading the datasets and the See5 software.
Was granted the space and started downloading the datasets and software.
Updated my supervisor about what I had done. He asked me to think about what the answers to my question would teach me.
Started writing my mid-year report.
Submitted my report to my supervisor to have a look at it before I finally submitted it.
Submitted my report.
Received my marked mid-year report from my supervisor.
Tried doing the data mining exercise using the See5 demo that I downloaded from the Web, but it could only read 400 records.
Filled in a form requesting the See5 evaluation licence from RuleQuest.
Received an email from Ross Quinlan with the evaluation licence and installed the software.
Started doing the mining exercise with the evaluation software.
Started writing my draft chapter and table of contents.
Received an email from Liu Bing granting me permission to download CBA, which is to be used to validate the results from the See5 program.
Sent an email to support reporting an error I was experiencing when trying to install the CBA program.
Received an email from support telling me that I should not install the CBA software. Forwarded the email to my supervisor.
Submitted the table of contents and draft chapter to my supervisor.
Gave a presentation on what I had done during the progress meeting with my supervisor and assessor. Got good advice from the assessor on how to split the dataset.
Received another evaluation licence from a friend and started doing the mining exercise again.
Started writing up the project report.
Finished writing the project report.

Date & Time

12.10.2001, 3pm
15.10.2001, 2pm
16.10.2001, 10:25am
16.10.2001, 13:15
26.10.01, 3pm
29.10.01, 2pm
05/11/01
06/11/01
07/11/01
19/11/01
21/11/01
08/12/01
13/12/01
28/01/02
04/02/02
05/02/02
06/02/02
07/02/02
01/03/02
07/03/02
07/03/02
08/03/02
11/03/02
20/03/02
26/03/02
01/04/02
25/04/02