Data Mining Algorithms and Tools
Kgosi Tshetlhoyagae
BSc in Information Systems
Session (2001/2002)
Project Summary
Most organisations today produce an electronic record of every transaction they are involved in. In large
organisations, this results in millions of records being produced every day. Many organisations are now
going online to join the e-business bandwagon, which will result in huge amounts of data being accumulated as
the Internet connects many sources of data.
The accumulated data is very important in today’s competitive world. It is used to gain a competitive edge
through a process called data mining, which can be described as the extraction of useful information from
large databases.
Data mining is a new area that has seen many sophisticated algorithms and tools being developed. This project
explores the current data mining algorithms and tools in an attempt to find out which algorithms are best
suited to which problems and how the tools compare. In addition to the research carried out into algorithms
and techniques, a practical data mining exercise was done on datasets accumulated by an online retailer.
Acknowledgements
There are many people who were involved in the completion of the project and the production of this report. I
take this opportunity to express my deepest gratitude and to say that without your contribution this project
would not have been completed. Thank you!
Prof. Martin Berzins, my project supervisor, for his support, advice, guidance, encouragement and feedback,
which was always valuable.
Dr Stuart Robert, my project assessor, for his feedback on the mid-year report and progress meeting, which
was very constructive and helped me to complete some tasks that I thought I could not do.
Prof. Ross Quinlan, for offering me an evaluation licence to use the See5 evaluation software in my data
mining exercise.
Dr Liu Bing, for giving me permission to download the CBA software, which was used for evaluation.
The KDDCUP website administrators, for giving me access to their datasets, which were used in this project.
My friends I Khan and T Mosweu, for always encouraging me to go on, and S Khupe, for asking for the second
evaluation licence.
Finally, I would like to dedicate this project report to my wife Boi, for always being there for me when my
stress level was above the limit.
Table of Contents
1.0 Problem Definition
1.1 The Problem
1.2 Project Aim
1.3 Project Approach
1.4 Project Objectives
1.5 Project Requirements
2.0 Introduction
2.1 What is data mining?
2.2 Data mining styles
3.0 Data Mining Techniques
3.1 Automatic Cluster Detection
3.2 Decision Trees
3.3 Neural Networks
3.4 Market Basket Analysis
3.5 Memory-Based Reasoning
3.6 Link Analysis
3.7 Genetic Algorithms
3.8 Online Analytic Processing
3.9 Regression algorithms
3.10 Data Visualization
4.0 Data Mining Algorithms
4.1 C4.5
4.2 CART
4.3 Apriori Algorithms
4.4 K-Means
4.5 CURE
4.6 CHAID
4.7 ID3
4.8 SLIQ
4.9 C5.0/See5
4.10 CBA
5.0 Evaluation of Data Mining Techniques and Algorithms
5.1 Evaluation of Data Mining Techniques
5.2 Evaluation of Data Mining Algorithms
6.0 Potential applications
7.0 Case Study
7.1 Overview
7.2 Extracting information from the datasets
7.3 Evaluation and Interpretation of results
8.0 Current Issues in data mining
8.1 Individual privacy
8.2 Data integrity
8.3 Data Size
8.4 Data Noise
8.5 Technical issues
8.6 Cost issues
9.0 Future Requirements
Reference List
Appendix A - Personal Evaluation
Appendix B - Project Plan and Schedule
Appendix C - Project specification
Appendix D - Emails received
Appendix E - Sample of the Customers.data file records
Appendix F - Sample of the Customers.names file attributes
Appendix G - Answers to questions asked
Appendix H - Project Log
Chapter 1
Problem Definition
1.1 The Problem
Data mining is a fast-growing area of interest, and many sophisticated tools have been developed to perform
it. It has gained recognition as important to both business and industry, because the information gleaned
from huge databases of past transactions allows more focused decision-making. For example, in retailing the
information allows customer relationship management to be targeted at individuals.
As more tools are developed, most organisations are trying to find out how they can exploit this area and use
the data they have been accumulating for years to improve the services they provide, increase business
opportunities or simply understand their customers’ behaviour. This project intends to answer most of the
questions that organisations following this new area might be asking about the data mining tools and
algorithms available today. The full specification of this project is in Appendix C.
1.2 Project Aim
The aim of this project was to provide an analysis of some of the data mining tools and techniques that are
currently in use. The aim was to be met by following the project approach discussed below.
1.3 Project Approach
Both a theoretical and a practical approach were taken to tackle the project. The theoretical phase involved
intensive exploration of the data mining literature and research on the Internet. This part took more time
because more of the information related to data mining was found on the Internet. Books and journals were
also searched for any information they could provide.
The practical phase of the project took the form of a case study, which involved extracting patterns from
datasets using a decision tree technique. The datasets used had been accumulated by an on-line retailer, and
a data mining tool called See5 was used to find patterns in them. The practical phase was a form of directed
data mining, whereby the miner knows what result he/she is looking for from the data mining exercise.
1.4 Project Objectives
The objectives of the project are as follows:
• To give a general introduction to data mining.
• To examine which data mining algorithms are most effective for mining data that has been acquired
through website visits.
• To apply a data mining technique to analyse and determine useful classifications within the dataset.
• To extend the range of questions that are relevant to the datasets.
• To discuss current issues in data mining and future requirements.
1.5 Project Requirements
The project’s minimum requirements are to meet the above-mentioned objectives, and it is hoped that the
quality of the report will earn a mark above a minimum pass.
Chapter 2
Introduction to Data Mining
2.1 What is Data Mining?
“Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern
recognition, statistics, databases, and visualization to address the issue of information extraction from large
databases. Data mining has captured the imagination of the business and academic worlds, in fact 80% of the
Fortune 500 companies are currently involved in a data mining pilot project or have already deployed one or
more data mining production systems” [1]. As the quotation indicates, data mining draws on a wide range of
disciplines and has therefore been defined in many ways. The definitions are essentially equivalent, and data
mining can be summarised as the process of exploring and analysing large quantities of data stored in
databases in order to find useful correlations or meaningful patterns.
The data to be analysed can be collected from a number of sources such as web sites, electronic point of sale
systems and data warehouses. Data mining can help online retailers to:
• Increase page views per session.
• Increase shopping basket contents per checkout.
• Increase the number of referred customers/visitors.
• Retain their old customers.
• Promote their brands.
• Increase revenue and reduce costs.
Data mining has a life cycle, which is made up of the following four main business processes:
1. Identifying the business problem.
2. Transforming data into actionable results.
3. Acting on the results.
4. Measuring the results.
The above business processes are achieved through the following data mining stages, which are illustrated in
fig 2.0 (adapted from [5]).
• Selection - selecting data according to some criteria.
• Pre-processing/Data cleansing - eliminating errors that may be present in the data.
• Transformation - making the data usable and navigable by adding overlays such as demographics.
• Data mining - the actual extraction of meaningful patterns and correlations from the selected data.
• Interpretation and evaluation - interpreting the identified patterns into knowledge.
Fig 2.0
2.2 Data Mining Styles
Directed data mining.
The directed style of data mining is used when we know what we are looking for and can direct the data mining
effort towards getting the most accurate result possible. This style takes a hypothesis from a user and tests
its validity against the data. The user is responsible for formulating the hypothesis and issuing the query on
the data. The style takes the form of a predictive model because it makes predictions about the unknown
based on the known, and the user always knows the format of the output. A predictive model can answer
questions such as:
- What is the right medical treatment, based on past experience?
- Which customers are likely to leave in the next six months?
The goal in prediction is to learn from the past in such a way that the knowledge can be applied to the
future. The limitation of directed data mining is that it does not create any new information in the
retrieval process; it only returns records that verify or negate the hypothesis.
Undirected data mining.
Undirected data mining finds patterns in the data and leaves it to the user to determine whether or not these
patterns are important. The data is sifted in search of frequently occurring patterns, trends and
generalisations without intervention from the user. The data is searched with no hypothesis in mind other
than for the system to group the facts according to the common characteristics found. An example of
undirected data mining would be a bank database that is mined to discover how many groups of customers to
target in a mailing campaign.
Chapter 3
Data Mining Techniques
Data mining software analyses relationships and patterns in stored data based on open-ended user queries. The
software extracts meaningful new information from data by performing any one of the following activities.
1. Classification: This is where stored data is used to locate data in pre-determined groups. It consists of
examining the features of a record and assigning it a class. For example, an online retailer could
mine customer purchase data to determine when customers visit the website and what they buy. The
information could be used to increase traffic by having promotions or specials on some goods.
2. Clustering: This is where data items are grouped according to logical relationships or, in the
online retailing example, customer preferences. It segments a group of records into a number of more
homogeneous subgroups. The records are grouped together on the basis of self-similarity.
Unlike classification, clustering does not rely on predefined classes. For example, data can be mined
to identify market segments in a town.
3. Association: Here data is mined to identify associations; this is sometimes called affinity grouping.
The idea is to determine what things go together. “Association rules are really no different from
classification rules except that they can predict any attribute, not just the class” [2]. Affinity grouping
can be used by on-line retailers to plan how to display items on their web pages. Through
associations they can know what to display with what. For example, men’s shirts may be displayed on
the same page as ties.
4. Sequential patterns: This is where data is mined to anticipate behaviour patterns and trends. For
example, a retailer could predict the likelihood of a sleeping bag being purchased based on a
customer’s purchase of a backpack and hiking shoes.
5. Estimation: Here the given input data is used to come up with a value for some unknown continuous
variable. For example, if a person’s income is not known, an estimator can identify other variables that
correlate well with income, such as location (residential address), car preference and job title. It then
finds other people with the same traits and uses them to estimate the income, together with a confidence
value. Estimation is used extensively with neural networks, which are discussed later in this report.
6. Prediction: This is where data items are classified according to some predicted future behaviour.
Prediction guesses a future value, such as the probability of a person passing a module he/she has not
yet taken, or the probability of a customer leaving within 3 months.
Of the activities discussed above, classification, estimation and prediction are examples of directed data
mining, as their goal is to use the available data to build a model that describes one particular variable of
interest in terms of the rest of the available data. The remaining activities are examples of undirected data
mining, as their goal is to establish some relationship among all the variables; no variable is singled out as
the target.
To perform the above activities, a data mining technique such as market basket analysis, memory-based
reasoning, link analysis, cluster detection, decision trees or artificial neural networks has to be used.
Data mining techniques are conceptual approaches to extracting information from data. The technique used is
usually determined by the goals of the data mining exercise and the types of the data involved. For example,
the goal of predictive data mining is to automate a decision-making process by creating a model capable of
making a prediction, while the goal of descriptive data mining is to gain an increased understanding of what
is happening inside the data. The data types also determine the technique to be used; for example, numeric
data will require an algorithm that produces numeric output. The techniques are discussed below. As the
project did not evaluate each technique by using it in practice, the strengths and weaknesses of the
techniques have been adopted from [6].
3.1 Automatic Cluster Detection
The technique uses mathematics to find clusters in data. For example, divisive methods consider all records
to be part of one big cluster and keep breaking clusters up until each record has a cluster to itself.
Agglomerative methods do the reverse: they start with each record occupying its own cluster and combine
clusters until there is one big cluster containing all the records.
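The agglomerative approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are single numbers (hypothetical customer ages), the distance between two clusters is taken as the distance between their closest members (single linkage), and merging stops at a chosen number of clusters rather than running all the way down to one. None of these choices comes from [6].

```python
def agglomerate(points, target_clusters=1):
    """Repeatedly merge the two closest clusters until target_clusters remain."""
    # Start with every record occupying a cluster of its own.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumed): distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append(sorted(merged))
        # Delete the higher index first so the lower index stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters, merges


# Hypothetical customer ages used purely for illustration.
ages = [21, 23, 40, 42, 70]
clusters, merges = agglomerate(ages, target_clusters=2)
print(clusters)
print(merges)
```

The list of merges records the order in which clusters were combined, which is the information a miner would inspect when deciding how many clusters are meaningful.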
Automatic cluster detection is used for undirected data mining and can therefore be applied without prior
knowledge of the structure to be discovered. The strengths of automatic cluster detection [6] are that it
works with categorical, numeric and textual data and that it is easy to apply, as it requires little
preparation of the input data. Its weaknesses are that the clusters generated are not guaranteed to have any
practical value; it is up to the miner to interpret them. The fact that the miner does not know what he/she
is looking for makes it difficult to recognise it when he/she finds it. It is also not easy to choose the
right weights and measures, and the technique is sensitive to its initial parameters: in k-means, for
example, the initial value of k is important because it determines the number of clusters.
Automatic cluster detection is useful when there are competing patterns in the data that make it hard to spot
any single pattern; creating clusters of similar records reduces the complexity within each cluster.
Automatic cluster detection is used with “large complex data sets with many variables and a lot of internal
structure” [6].
3.2 Decision Trees
When a decision tree technique is applied to data, each record flows through the tree along a path determined
by a series of tests until the record reaches a leaf (terminal node) of the tree, where it is given a class
label. Each branch point of a decision tree is a test on a single variable that cuts the space into two or
more pieces. For example, the School of Computing might want to know how many students have received a 2:1
degree since 1990. Getting an answer from a decision tree might involve asking a series of yes/no questions
such as graduate year > 1990, department = Computing and degree classification = 2:1.
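The way a record flows through a series of tests to a leaf can be illustrated with the following minimal Python sketch. The tree structure, the field names and the leaf labels ("count" and "ignore") are hypothetical and were chosen only to mirror the School of Computing example; the tree is written by hand here, whereas a data mining tool would induce it from data.

```python
# Internal nodes hold a test on a single field; leaves hold a class label.
TREE = {
    "test": lambda r: r["graduate_year"] > 1990,
    "yes": {
        "test": lambda r: r["department"] == "Computing",
        "yes": {
            "test": lambda r: r["degree"] == "2:1",
            "yes": {"label": "count"},
            "no": {"label": "ignore"},
        },
        "no": {"label": "ignore"},
    },
    "no": {"label": "ignore"},
}


def classify(record, node=TREE):
    """Follow the yes/no branches until a leaf is reached and return its label."""
    while "label" not in node:
        node = node["yes"] if node["test"](record) else node["no"]
    return node["label"]


# Hypothetical student records.
students = [
    {"graduate_year": 1995, "department": "Computing", "degree": "2:1"},
    {"graduate_year": 1989, "department": "Computing", "degree": "2:1"},
    {"graduate_year": 1998, "department": "Physics", "degree": "2:1"},
    {"graduate_year": 2000, "department": "Computing", "degree": "1st"},
]
labels = [classify(s) for s in students]
print(labels.count("count"))
```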
There are two types of decision trees:
Classification trees - these trees label records and assign them to the proper class. They can also provide
the confidence that the classification is correct.
Regression trees - these trees estimate the value of a target variable that takes on numeric values.
Decision trees are built by recursive partitioning, which is an iterative process of splitting the data up
into partitions. The initial split produces two or more nodes, each of which is then split in the same way as
the root node. When no split can be found that increases the purity of a given node, that node becomes a leaf
node. A process called pruning, which removes leaves and branches from the decision tree, can vastly improve
its performance. The decision tree technique is used in directed data mining, where the miner knows the
fields being targeted and what to expect as results.
Decision trees do not discover rules that involve a relationship between variables, so it is the miner’s
responsibility to add derived variables that express relationships likely to be useful. The strengths of the
technique [6] are said to be its ability to generate understandable rules that can be translated into
English, its ability to perform classification without much computation, and the fact that it gives a clear
indication of which fields are important for classification; these fields are usually placed at the root
node of the tree. Its weaknesses are said to be that it is not good for estimation tasks, where the goal is
to predict the value of a continuous variable, and that it is not good for time-series data.
Decision tree methods are good when the task is classification of records or prediction of outcomes. They
should be used when the goal of data mining is to assign each record to one of a few categories, or to
generate rules that can be easily understood and translated into natural language.
3.3 Neural Networks
Neural networks have an input layer and an output layer, and each of the inputs gets its own network node.
The actual values of the input variables are not fed into the input layer directly; only some transformation
of them is. Each input node is connected to the hidden layer by a weight (wi), which is a coefficient. In
each unit of the hidden layer the weighted inputs are combined using a combination function and then passed
to a transfer function, the result of which is the output of that unit and is passed on towards the output
of the network. The combination and transfer functions together make up the unit’s activation function. The
value produced by the output node’s activation function is likewise some transformation of the actual
output. Figure 3.1 below, which was adapted from [4] and edited, shows how customers may be evaluated for
credit risk.
fig 3.1
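The combination and transfer functions can be made concrete with a small Python sketch of a single forward pass. The following are assumptions rather than details taken from [4] or [6]: the combination function is a weighted sum plus a bias term, the transfer function is the logistic (sigmoid) function, the inputs have already been scaled to the range 0 to 1, and all weights and applicant values are invented for illustration.

```python
import math


def sigmoid(x):
    """Transfer function (assumed): squashes any value into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def unit_output(inputs, weights, bias):
    """Activation of one unit: combination function followed by transfer function."""
    # Combination function (assumed): weighted sum of the inputs plus a bias.
    combined = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(combined)


def forward(inputs, hidden_layer, output_unit):
    """Pass scaled inputs through one hidden layer to a single output node."""
    hidden = [unit_output(inputs, w, b) for w, b in hidden_layer]
    weights, bias = output_unit
    return unit_output(hidden, weights, bias)


# Hypothetical applicant: scaled income, scaled debt, scaled years at address.
applicant = [0.6, 0.2, 0.9]
# Hypothetical weights; in practice these are set by training.
hidden_layer = [([0.5, -1.0, 0.8], 0.0), ([-0.3, 0.9, 0.1], 0.1)]
output_unit = ([1.2, -0.7], -0.2)

score = forward(applicant, hidden_layer, output_unit)
print(round(score, 3))
```

The score is a value between 0 and 1 and would have to be transformed back into a meaningful quantity, such as a credit risk category, which is the "transformation of the actual output" mentioned above.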
Training a neural network is the process of setting the weights on the inputs of each of the units in such a
way that the network does the best job of predicting the target variable. Neural networks can produce good
predictions but are not easy to use, mainly because of the data preparation required to get good results. The
results are also difficult to understand because neural networks do not produce rules.
Neural networks are good for most classification and prediction tasks when the results of the model are more
important than understanding how the model works. The technique does not work well when there are many
input features, as a large number of features makes it more difficult to find patterns.
3.4 Market Basket Analysis (MBA)
When an ordinary person goes to Tesco to buy groceries, he/she might not be aware of leaving behind data
that might be used to plan the layout of the shop. Finding groups of items that occur together in a
transaction is called market basket analysis. The technique builds rules that recognise products that are
bought together in a transaction.
MBA is an undirected style of data mining and is usually used on problems where the goal is to know which
items form clusters. Its results have found much use in the retailing industry, where they are used in
store layouts, product promotions etc.
MBA is good at analysing point-of-sale transactions. For example, having determined which products are
likely to be bought together, the store manager might decide to place the products next to each other, or to
offer promotions on some products while raising prices on their basket associates.
The strengths of MBA are that it produces clear and understandable results, uses simple computations and
works on variable-length data. Its weaknesses are that it discounts exceptional items in the basket and that
it requires more computation as the size of the problem grows.
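As an illustration of how rules about products bought together can be derived, the following Python sketch counts how often pairs of items share a basket and computes the two measures commonly attached to a rule of the form "if A then B". The measures named support and confidence are standard in association rule mining but are not defined in this report, and the baskets are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions; each basket is a set of items.
baskets = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt", "socks"},
    {"tie", "belt"},
]

# How many baskets contain each item, and each pair of items.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def rule_stats(antecedent, consequent):
    """Return (support, confidence) for the rule 'antecedent -> consequent'."""
    pair = tuple(sorted((antecedent, consequent)))
    # Support: fraction of all baskets that contain both items.
    support = pair_counts[pair] / len(baskets)
    # Confidence: of the baskets with the antecedent, the fraction with both.
    confidence = pair_counts[pair] / item_counts[antecedent]
    return support, confidence


support, confidence = rule_stats("shirt", "tie")
print(support, round(confidence, 2))
```

Counting every pair is simple for a handful of items, but the number of combinations grows quickly as more items and larger groups are considered, which is the computational weakness noted above.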
3.5 Memory-Based Reasoning (MBR)
The School of Computing may predict the number of first-class degrees it will award in an academic year
based on the first-semester examination results of its finalists and the past year’s results. In making
such a prediction it would be applying a technique called memory-based reasoning, which uses known
instances to make predictions about unknown instances.
The technique assigns a new case to the class to which most of its neighbours belong by employing the
k-nearest neighbour algorithm, and it can run on any source of data. Memory-based reasoning is a directed
data mining technique for classification and prediction. It has been used successfully in fraud detection
and in medical treatment, where patients may be prescribed medicines based on other patients’ records.
MBR’s results are easy to understand. The technique can work with almost any number of attributes and can be
applied to non-relational data. Its main weaknesses are that it requires a large amount of storage for the
training dataset, and that the choice of combination function and number of neighbours can influence the
results.
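A minimal sketch of the k-nearest neighbour idea is given below in Python. It assumes numeric attributes, squared Euclidean distance as the distance measure and a simple majority vote as the combination function; the student records are hypothetical and only echo the School of Computing example.

```python
from collections import Counter


def knn_classify(known, new_case, k=3):
    """Assign new_case to the class held by most of its k nearest known cases.

    known is a list of (features, label) pairs.
    """
    def distance(a, b):
        # Squared Euclidean distance (assumed distance measure).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(known, key=lambda item: distance(item[0], new_case))[:k]
    # Combination function (assumed): simple majority vote among the neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical past finalists: (semester-one average, attendance %) -> outcome.
past_students = [
    ((72, 95), "first"),
    ((75, 90), "first"),
    ((68, 85), "first"),
    ((55, 70), "other"),
    ((58, 60), "other"),
    ((50, 80), "other"),
]

prediction = knn_classify(past_students, (70, 88), k=3)
print(prediction)
```

Changing k or the distance measure can change the answer for cases that lie between groups, which is why the number of neighbours and the combination function are listed as influences on the results.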
3.6 Link Analysis
Link analysis is a technique that follows relationships between records to develop patterns in them. When
buying goods on the Internet, you usually leave your personal details at the checkout for delivery of and
payment for the goods. Information such as the delivery address is sometimes used to target you for
marketing purposes. For example, receiving discount offers from an on-line retail shop from which you once
bought something is common these days.
Link analysis has been used successfully in telephone call pattern analysis, whereby a telephone number is
selected, the numbers it dials are observed, and a link is made between it and each number it dials. The
tools used in the technique do not find patterns themselves but assist people in discovering knowledge.
Link analysis has the following strengths:
• Good for linked data.
• Aids knowledge discovery by direct visualization of the links.
And the following weaknesses:
• Not applicable to most types of data.
• Its implementations in relational databases are not very efficient.
3.7 Genetic Algorithms
According to [6], genetic algorithms apply the mechanics of genetics and natural selection to a search for
the optimal set of parameters that describe a predictive function. The technique uses selection, crossover
and mutation operators to evolve successive generations of solutions.
The search for the optimal solution is similar to the evolution of a population of organisms, where each
organism is represented by a set of chromosomes. This evolution is driven by three mechanisms: selection of
the strongest, that is, those sets of chromosomes that characterise the best solutions; cross-breeding, the
production of new organisms by mixing the chromosomes of parent organisms; and mutation, accidental changes
of genes in some organisms of the population. After a number of new generations have been built with the
help of these mechanisms, one obtains a solution that cannot be improved any further. This solution is taken
as the final one.
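The three mechanisms can be seen at work in the following toy Python sketch. It is not a data mining application: the chromosome is a list of bits and the fitness function simply counts the 1 bits (an assumed toy objective, often called OneMax), whereas in practice the fitness would measure how well a set of parameters predicts the target. The population size, mutation rate, single-point crossover and the rule that the best individual always survives unchanged are all illustrative assumptions.

```python
import random


def fitness(chromosome):
    """Toy objective (assumed): the number of 1 bits in the chromosome."""
    return sum(chromosome)


def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    population = [
        [rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)
    ]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Selection: only the stronger half of the population may breed.
        parents = population[: pop_size // 2]
        # The best individual is carried over unchanged (elitism, assumed).
        next_generation = [population[0][:]]
        while len(next_generation) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: the child takes a prefix from one parent, the rest
            # from the other.
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]
            # Mutation: each gene flips with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_generation.append(child)
        population = next_generation
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(history[0], history[-1])
```

Because the best individual always survives, the best fitness never decreases from one generation to the next, but nothing in the procedure guarantees that the global optimum is reached.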
As the generations evolve, only the most predictive solutions survive, until the functions converge on an
optimal solution. Genetic algorithms are a directed style of data mining and can be used to improve
memory-based reasoning and neural network techniques. At present, however, genetic algorithms should be
considered more as an instrument for scientific research than as a tool for generic practical data analysis.
The strengths of the technique are as follows:
• It produces explainable results.
• The results of using the technique are easy to apply.
• It is able to handle a wide range of data types.
• It is applicable to optimisation problems.
The weaknesses of the technique are as follows:
• Many problems are difficult to encode.
• It does not guarantee optimality.
• It is computationally expensive.
• Only a specialist can develop a criterion for chromosome selection and formulate the problem
effectively.
3.8 Online Analytic Processing (OLAP)
Online analytic processing is not an actual data mining technique but a presentation tool that enables
manual knowledge discovery; it depends on human intelligence to discover the knowledge.
OLAP tools are client-server tools that have an advanced interface connected to an efficient representation
of the data called a cube. The cube allows users to slice and dice the data in any way they like. For
example, by using OLAP tools a retailer might discover that certain products sell better during a particular
period of the year. This might lead to an investigation using the market basket analysis technique to find
other items purchased with those products.
The strengths of OLAP are said to be the following [6]:
• It is a powerful visualization tool.
• It provides fast, interactive response times.
• It is good for analysing time series.
• It can also be used to find clusters and outliers.
The weaknesses are as follows:
• It does not handle continuous variables well.
• Setting up a cube can be difficult.
• Cubes can easily become out of date.
3.9 Regression Algorithms
Regression algorithms search for a dependence of the target variable on other variables in the form of a
function of some predetermined form. For example, in a group attribute accounting method, the dependence is
sought in the form of polynomials. Such methods should provide solutions with greater statistical
significance than neural networks do. The formula obtained, a polynomial, is in principle more suitable for
analysis and interpretation. This method therefore has a better chance of providing reliable solutions in
applications such as financial markets or medical diagnostics.
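The simplest case of searching for a dependence of a predetermined form is fitting a polynomial of degree one, that is, a straight line, by ordinary least squares. The Python sketch below shows only this simplest case and is not the group attribute accounting method mentioned above; the advertising and sales figures are invented.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(spend, sales)
print(round(slope, 2), round(intercept, 2))
```

The resulting formula can be read directly: each extra unit of spend is associated with an increase in sales equal to the slope, which is the kind of interpretability the text attributes to regression.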
3.10 Data Visualization
Data visualization techniques are good at manipulating sampled and computed data for comprehensive
display. The goal of visualization is to give the user a deeper understanding of the data, as well as of the
underlying physical laws and properties. Such visualization may be used to enlighten a physicist on the
complex interaction between electrons, to guide a medical practitioner during surgery, or to view the
surface of a planet that has never been seen by human eyes.
Visualizations can be static or in motion, and can provide visual explanations of algorithms or of general
information.
Chapter 4
Data Mining Algorithms
Data mining algorithms are the step-by-step details of a particular way of implementing a data mining
technique.
4.1 C4.5
This algorithm generates a classification decision tree for a dataset by recursive partitioning of the data.
The tree is grown using a depth-first strategy. C4.5 considers all the possible tests that can split the
dataset and selects the test that gives the best information gain. For each discrete attribute, one test is
considered, with as many outcomes as the attribute has distinct values. For each continuous attribute,
binary tests involving every distinct value of the attribute are considered.
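The selection of a test by information gain can be sketched as follows in Python. The sketch follows the description above and handles discrete attributes only; it is not Quinlan's implementation, and C4.5 itself also offers the related gain ratio criterion. The attribute names and visit records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(records, attribute, target):
    """Reduction in entropy of target obtained by splitting on attribute."""
    before = entropy([r[target] for r in records])
    # One outcome per distinct value of the discrete attribute.
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    after = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return before - after


# Hypothetical website visits.
visits = [
    {"referred": "yes", "returning": "yes", "bought": "yes"},
    {"referred": "yes", "returning": "no", "bought": "yes"},
    {"referred": "no", "returning": "yes", "bought": "no"},
    {"referred": "no", "returning": "no", "bought": "no"},
]

gains = {a: information_gain(visits, a, "bought") for a in ("referred", "returning")}
best = max(gains, key=gains.get)
print(best, gains)
```

In this invented data the attribute "referred" separates buyers from non-buyers perfectly and so has the highest gain, while "returning" tells us nothing and has a gain of zero.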
All files read and written by C4.5 are of the form filestem.ext, where filestem is the file name stem that
identifies the induction task and ext is an extension that defines the type of file. The C4.5 algorithm can
generate trees in two ways. In iterative mode, the program starts with a randomly selected subset of the
data and generates a trial decision tree, adds some of the misclassified objects to the subset and continues
until the trial decision tree correctly classifies all objects not in the subset. In batch mode, which is
usually the default, the program generates a single tree using all the available data. The trees generated
are saved as filestem.unpruned and, after each tree is generated, it is pruned in an attempt to simplify it.
4.2
CART (Classification and Regression Trees).
Cart is a data mining algorithm that automatically searches for important patterns and relationships and
rapidly uncovers hidden structures. The discovered knowledge is used to generate accurate and reliable
predictive models for applications such as credit-card fraud, profiling customers and targeting direct mailings.
Cart is more accurate for classifying new data than conventional stepwise procedures like linear regression
and logistic regression. The algorithm includes reliable estimates of error rates and is robust to outliners.
There is no need to transform independent variables. Using categorical or continuous variables can be
achieved by classification trees, which predict values of categorical variables or by regression trees, which
predict values of continuous variables.
CART’s binary decision trees use the data sparingly, so more structure can be detected before too little data
is left for learning. CART also has embedded test disciplines that check that the patterns found hold up when
applied to new data. The algorithm handles missing values in the database by substituting “surrogate
splitters”, which are backup rules that closely mimic the action of the primary splitting rules. CART also
allows users to specify penalties for misclassifying certain data, for situations in which some
misclassifications matter more than others. See figure 4.0 below, adapted from [4].
fig 4.0
4.3 Apriori Algorithm
The Apriori algorithm is an association rule algorithm that was developed for mining large transaction
databases. The algorithm works with itemsets, which are non-empty sets of items.
Apriori works by making multiple passes over a database. In the first pass, it counts item occurrences to
determine the frequent itemsets with one item. A subsequent pass, say pass x, consists of two phases. In the
first phase, the set of all frequent (x – 1)-itemsets found in pass (x – 1) is used to generate the candidate
itemsets Cx, using the apriori-gen function.
In the second phase the algorithm scans the database and, for each transaction, determines which of the
candidates in Cx are contained in the transaction using a hash-tree data structure and increments the count of
those candidates. Several variants of the algorithm exist, such as AprioriAll and AprioriSome.
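The two phases of each pass can be sketched in Python as follows. This is a simplified illustration rather than the published implementation: candidates are counted with a plain scan instead of a hash tree, and the function name, the minimum support value and the toy transactions are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    x = 2
    while frequent:
        previous = set(frequent)
        # Phase 1: join frequent (x-1)-itemsets, then prune any candidate
        # that has an infrequent (x-1)-subset.
        candidates = {a | b for a in previous for b in previous if len(a | b) == x}
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, x - 1))}
        # Phase 2: scan the database and count the candidates in each transaction.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        x += 1
    return result

# Hypothetical baskets from a legwear retailer.
baskets = [{"socks", "tights"}, {"socks", "tights", "razor"}, {"socks", "razor"}, {"tights"}]
frequent_itemsets = apriori(baskets, min_support=2)
```

With a minimum support of two transactions, the pairs {socks, tights} and {socks, razor} are frequent, while the three-item candidate is pruned because {tights, razor} is not frequent.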
4.4 K-Means Algorithm
K-Means is a clustering algorithm that works best when the input data is numeric. The algorithm divides the
dataset into a predetermined number of clusters. The k in the name refers to this number of clusters, while
the means part refers to the average location of all the members of a particular cluster. The value chosen
for k therefore determines the number of clusters that will be found. Starting from k initial means, the
algorithm assigns each record to the cluster with the nearest mean and then computes the new mean of each
cluster; this exercise is iterated until a criterion function converges. The algorithm is not applicable to
categorical data and is very sensitive to outliers.
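The iteration can be sketched as below. This is a minimal illustration under stated assumptions: the first k points are taken as the initial means, squared Euclidean distance is used, and the loop stops when the means no longer change. The function name and the toy points are hypothetical.

```python
def k_means(points, k, max_iter=100):
    """Cluster numeric points (tuples) into k clusters; returns (means, clusters)."""
    means = [tuple(p) for p in points[:k]]  # assumption: first k points seed the means
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, means[i])))
            clusters[nearest].append(p)
        # Update step: recompute each mean as the average location of its members.
        new_means = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else means[i]
            for i, members in enumerate(clusters)
        ]
        if new_means == means:  # converged: the means no longer move
            break
        means = new_means
    return means, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0)]
means, clusters = k_means(points, k=2)
```

A single extreme point would drag the mean of its cluster towards itself, which is one way to see why the algorithm is sensitive to outliers.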
4.5 Clustering Using Representatives (CURE)
CURE is a clustering algorithm that is more robust to outliers and identifies clusters having non-spherical
shapes and wide variances in size. The algorithm achieves this by representing each cluster by a fixed number
of points that are generated by selecting well-scattered points from the cluster and then shrinking them
towards the centre of the cluster by a specified shrinkage factor.
Multiple representative points enable clusters of unusual shapes to be represented better. The clusters with
the closest pair of representative points are chosen to be merged, and the distance between two clusters is
defined to be the minimum distance between any pair of points in their representative sets. The algorithm
copes with limited main memory by clustering a random sample of the data.
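The representative points and the cluster distance can be sketched as follows. The sketch assumes a farthest-first rule for picking well-scattered points and that the points of a cluster are distinct; the function names, the shrinkage value and the toy data are illustrative assumptions, not details taken from the CURE paper.

```python
import math

def centroid(points):
    """Mean location of a list of points (tuples of equal length)."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def representatives(points, count, shrink):
    """Pick up to `count` well-scattered points and shrink them towards the centre."""
    centre = centroid(points)
    # Farthest-first selection: start with the point farthest from the centre,
    # then repeatedly add the point farthest from those already chosen.
    chosen = [max(points, key=lambda p: math.dist(p, centre))]
    while len(chosen) < min(count, len(points)):
        remaining = [p for p in points if p not in chosen]
        chosen.append(max(remaining, key=lambda p: min(math.dist(p, c) for c in chosen)))
    # Move each chosen point a fraction `shrink` of the way towards the centre.
    return [tuple(x + shrink * (m - x) for x, m in zip(p, centre)) for p in chosen]

def cluster_distance(reps_a, reps_b):
    """Minimum distance between any pair of representative points."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)

square = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
reps = representatives(square, count=2, shrink=0.5)
gap = cluster_distance([(1.0, 1.0)], [(4.0, 5.0)])
```

Shrinking pulls the representatives away from the cluster boundary, which is what dampens the effect of outliers when clusters are compared for merging.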
4.6 Chi-squared Automatic Interaction Detection (CHAID)
Chi-squared Automatic Interaction Detection is an algorithm that builds decision trees by detecting
statistical relationships between variables. The algorithm is widely used because it is supplied as part of
statistics packages such as SPSS and SAS.
CHAID works only with categorical variables and attempts to stop growing the tree before overfitting occurs.
This makes it different from other decision tree algorithms, as it does away with pruning. The algorithm
grows the tree until no more splits are available that lead to statistically significant differences in the
classification. The cut-off value used affects the size of the tree and its value as a classifier. The diagram
(fig 4.1) below, adapted from [4], shows the CHAID algorithm in use.
Fig 4.1
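The statistical test behind a CHAID split can be illustrated with the Pearson chi-squared statistic of a contingency table, where rows are the categories of a candidate predictor and columns are the classes. This is a minimal sketch with hypothetical counts; it computes only the statistic, and comparing it with a cut-off (for example 3.841 for one degree of freedom at the 5% level) is left to the reader.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = two categories of a predictor, columns = two classes.
no_relationship = chi_squared([[10, 10], [10, 10]])
strong_relationship = chi_squared([[20, 0], [0, 20]])
```

A predictor whose table gives a statistic below the cut-off would not be used for a split, which is how the tree stops growing without a separate pruning phase.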
4.7 ID3
The ID3 algorithm is a decision tree-building algorithm that determines the classification of items by
testing the values of their properties. The algorithm builds the tree in a top-down manner, starting from a set
of items and a specification of properties.
At each node of the tree, a property is tested and the results are used to partition the item set. The process
is repeated recursively until the set in a given sub-tree is consistent with respect to the classification
criteria, i.e. it contains only items belonging to the same category. This set then becomes a leaf node. At
each node the property to test is chosen using an information criterion that seeks to maximise information
gain, that is, to minimise the entropy that remains after the split.
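The top-down recursion can be sketched in Python as follows. This is an illustrative sketch, not Quinlan's code: items are represented as dictionaries, a tree is either a class label (leaf) or a pair of the tested property and its branches, and the property names and toy items are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(items, labels, properties):
    """Return a class label (leaf) or (property, {value: subtree})."""
    if len(set(labels)) == 1:          # all items in the same category: leaf node
        return labels[0]
    if not properties:                 # nothing left to test: use the majority class
        return Counter(labels).most_common(1)[0][0]

    def remaining_entropy(prop):
        parts = {}
        for item, label in zip(items, labels):
            parts.setdefault(item[prop], []).append(label)
        return sum(len(p) / len(labels) * entropy(p) for p in parts.values())

    # Lowest remaining entropy is the same as highest information gain.
    best = min(properties, key=remaining_entropy)
    rest = [p for p in properties if p != best]
    branches = {}
    for value in {item[best] for item in items}:
        subset = [(i, l) for i, l in zip(items, labels) if i[best] == value]
        sub_items, sub_labels = zip(*subset)
        branches[value] = id3(list(sub_items), list(sub_labels), rest)
    return (best, branches)

items = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
tree = id3(items, ["play", "play", "stay", "stay"], ["outlook", "windy"])
```

Here the property outlook leaves no entropy after the split, so it is tested at the root and both branches become leaves.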
4.8 Supervised Learning In Quest (SLIQ)
Supervised Learning In Quest is a decision tree classifier designed to classify large training datasets. The
algorithm uses a pre-sorting technique in the tree-growth phase, which avoids costly sorting at each node.
SLIQ keeps a separate sorted list for each continuous attribute and a separate list called the class list.
An entry in the class list corresponds to a data item, and holds a class label and the name of the node the
item belongs to in the decision tree, while an entry in a sorted attribute list holds an attribute value and
the index of the data item in the class list.
SLIQ grows the decision tree in a breadth-first fashion. For each attribute it scans the corresponding sorted
list and calculates the entropy of each distinct value for all the nodes on the frontier of the decision tree
(the current leaf nodes). After the entropy values have been found for each attribute, one attribute is chosen
for the split at each node on the current frontier, and the nodes are expanded to form a new frontier. Then
one or more scans of the sorted attribute lists are performed to update the class list for the new nodes.
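The two kinds of list can be sketched as follows. This sketch makes simplifying assumptions that are not in the SLIQ description above: every record sits in a single root node, only numeric attributes are handled, and the record layout and function names are hypothetical.

```python
import math
from collections import Counter

def build_lists(records, attributes, class_key):
    """Build the class list and one pre-sorted (value, class-list index) list per attribute."""
    class_list = [{"label": r[class_key], "node": "root"} for r in records]
    attribute_lists = {
        a: sorted((r[a], index) for index, r in enumerate(records)) for a in attributes
    }
    return class_list, attribute_lists

def entropy(counts):
    """Entropy (in bits) of a class histogram; empty classes are ignored."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def best_split(attribute_list, class_list):
    """Single scan of a pre-sorted list; returns (weighted entropy, threshold)."""
    above = Counter(entry["label"] for entry in class_list)
    below = Counter()
    n = len(class_list)
    best = None
    for position, (value, index) in enumerate(attribute_list[:-1]):
        label = class_list[index]["label"]   # the class is looked up through the index
        below[label] += 1
        above[label] -= 1
        if value == attribute_list[position + 1][0]:
            continue                         # only split between distinct values
        n_below = position + 1
        score = n_below / n * entropy(below) + (n - n_below) / n * entropy(above)
        if best is None or score < best[0]:
            best = (score, value)
    return best

records = [
    {"age": 30, "cls": "B"}, {"age": 20, "cls": "A"},
    {"age": 40, "cls": "B"}, {"age": 25, "cls": "A"},
]
class_list, attribute_lists = build_lists(records, ["age"], "cls")
score, threshold = best_split(attribute_lists["age"], class_list)
```

Because the attribute list is sorted once up front, each candidate threshold is evaluated by updating two class histograms rather than by re-sorting the data at the node.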
4.9 C5.0/See5
C5.0 is an upgraded version of C4.5, which was discussed earlier in this report. It therefore works just like
C4.5 but offers more features. The algorithm supports boosting with any number of trials; this may slow down
the algorithm but produces more accurate results.
The algorithm allows a separate cost to be defined for each predicted/actual class pair. The cost option
enables the algorithm to construct classifiers that minimise expected misclassification costs rather than
error rates. “C5.0 is 214 times faster than C4.5 on the coding data, uses less than 10 % of the memory and
produces a more accurate rule set”. [11]
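The idea of minimising expected cost rather than error rate can be illustrated with a small Python sketch. It does not reproduce how C5.0 uses costs internally; the function name, the class probabilities and the cost values are assumptions chosen for illustration.

```python
def min_cost_class(probabilities, costs):
    """Pick the predicted class with the lowest expected misclassification cost.

    probabilities: {actual_class: probability}
    costs: {(predicted, actual): cost}; unlisted errors cost 1, correct predictions cost 0.
    """
    def expected_cost(predicted):
        return sum(
            p * costs.get((predicted, actual), 0.0 if predicted == actual else 1.0)
            for actual, p in probabilities.items()
        )
    return min(probabilities, key=expected_cost)

# Hypothetical leaf: 20% of the cases are fraudulent.
leaf = {"fraud": 0.2, "ok": 0.8}
by_error_rate = min_cost_class(leaf, {})
by_cost = min_cost_class(leaf, {("ok", "fraud"): 10.0})
```

With equal costs the majority class is predicted, but once predicting "ok" for a fraudulent case is made ten times more expensive the cheaper prediction becomes "fraud".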
The algorithm is also available in a Windows version, See5, which has a user-friendly graphical interface and
is easy to use. For example, the cross-reference window makes classifiers more understandable by linking
cases to the relevant parts of the classifier. See5 is the tool that was used to carry out the data mining
exercise for this project.
4.10 Classification Based on Association (CBA)
Classification Based on Association is a data mining algorithm that integrates classification and
association rules into one algorithm, which is more powerful and produces accurate classifiers for prediction.
The algorithm can be used for mining various forms of association rules, and for text classification or
categorization.
CBA builds accurate classifiers from relational data, where each record is described by a fixed number of
attributes, and also from transactional data, where each data record has a variable number of items, for
example the items bought by a customer on an online retail website. More on this algorithm will be discussed
later in the report.
Chapter 5
Evaluation of Data Mining Techniques and Algorithms
5.1 Evaluation of Data Mining Techniques
“Averaging an algorithm's performance over all target concepts, assuming they are all equally likely, would
be like averaging a car's performance over all possible terrain types, assuming they are all equally likely. This
assumption is clearly wrong in practice; for a given domain, it is clear that not all concepts are equally
probable. In medical domains, many measurements (attributes) that doctors have developed over the years
tend to be independent: if the attributes are highly correlated, only one attribute will be chosen. In such
domains, a certain class of learning algorithms might outperform others. For example, Naive-Bayes seems to
be a good performer in medical domains (Kononenko 1993). Quinlan (1994) identifies families of parallel and
sequential domains and claims that neural-networks are likely to perform well in parallel domains, while
decision-tree algorithms are likely to perform well in sequential domains. Therefore, although a single
induction algorithm cannot build the most accurate classifiers in all situations, some algorithms will perform
better in specific domains” [13].
The quotation above agrees with what has been noted previously in the report: the choice of a technique
depends on a number of factors, such as its suitability for certain input data types, the transparency of the
mining output, its tolerance of missing variable values, the level of accuracy possible and its ability to
handle large volumes of data. After studying the vast resources of books, technical papers and white papers
written on data mining, the author came up with the evaluation framework below to evaluate the techniques
that have been discussed in this report. The criterion used for evaluating the techniques is simple and is
related to the datasets that were used in the case study of the report. This part of the project was very
difficult, as the author had to fill in the table based on what had been read and understood rather than on
practical experience, which was gained only with decision trees. The results in the table are therefore
merely the author’s opinion and are open to criticism.
Technique Evaluation table.

Technique                      How it works                          Can it be used with transactional
                                                                     data accumulated by an online
                                                                     retail website?
---------------------------    ----------------------------------    ---------------------------------
Decision Trees                 Classification, Prediction            Yes
Artificial Neural Networks     Estimating, Prediction, Clustering    Yes
Memory-Based Reasoning         Classification, Prediction            Yes
On-Line Analytic Processing    Summarization, Presentation           Yes
Link Analysis                  Classification, Sequential patterns   Yes
Genetic Algorithms             Clustering                            No
Automatic Cluster Detection    Clustering                            Yes
Market Basket Analysis         Association, Prediction               Yes
Data Visualization             Deviation                             No
Regression                     Deviation                             No
Conclusion
From the table above it can be observed that the techniques that will work well with data accumulated by
online retailers are those that perform classification, prediction, association, clustering and estimation.
This is because online retailers are mostly interested in customer relationship management, and therefore
need to cluster their markets, find associations among the products their customers buy, predict their
product sales and estimate their profits. The knowledge gained from this analysis can help them to satisfy
and retain their customers.
5.2 Evaluation of Data Mining Algorithms
In order to evaluate the algorithms studied, the same evaluation framework as the one used to evaluate the
techniques was used. Again the results were based on the author’s understanding and the explored literature.
Hands-on experience was gained with See5 only, which is the tool that was used in the case study of the
report. The table below shows how they fared.
Algorithm      How it works                         Can it be used with transactional
                                                    data accumulated by an online
                                                    retail website?
-----------    ---------------------------------    ---------------------------------
C4.5           Classification                       Yes
CART           Prediction, Classification           Yes
Apriori        Sequential patterns, Associations    Yes
K-Means        Clustering                           Yes
ID3            Classification                       Yes
CURE           Clustering                           Yes
CHAID          Classification                       Yes
C5.0 / See5    Classification                       Yes
CBA            Classification, Association          Yes
SLIQ           Classification                       Yes
Conclusion.
Surprisingly, when filling in the table it was noticed that all the algorithms that were chosen for
investigation are able to work with transactional data accumulated by an online retailer, provided it has been
pre-processed and well cleaned. The conclusion drawn from the exploration of data mining algorithms is that
most of the algorithms can be used with any data that is suitable and in the correct format.
Chapter 6
Potential Applications
“Data mining is used by companies with a strong consumer focus- retail, financial, communication and
marketing organisations. It enables these companies to determine relationships among internal factors such as
price, product positioning, or staff skills, and external factors such as economic indicators, competition, and
customer demographics. It enables them to determine the impact on sales, customer satisfaction, and
corporate profits. Data mining also enables the companies to drill down into summary information to view
detail transactional data.” [3]
As the quote above explains, data mining can be used in different sectors. It can be applied wherever large
quantities of data are collected. The following are some of the areas where it is applied.
• Retail / Marketing:
  Data mining can be used in retailing and marketing to identify customers’ buying patterns, predict the
  response to mailing campaigns, perform basket analysis and find associations among customer
  demographic characteristics.
• Banking:
  Banks can use data mining to identify loyal customers, predict which customers are likely to change
  their credit card affiliation and detect patterns of fraudulent credit card use.
• Insurance and Health care:
  Data mining can be used to analyse claims, predict which customers will buy new policies, identify
  fraudulent behaviour and identify patterns of risky customers.
• Transportation:
  Data mining can be used in determining the distribution schedules among outlets and analysing
  loading patterns.
• Medicine:
  Data mining can be used to identify successful medical therapies for different illnesses and to
  characterise patient behaviour in order to predict office visits.
• Education:
  Student databases can be mined to determine any correlations or patterns that lead to good grades;
  universities can then act on such information to improve their performance.
• Customer Relationship Management:
  Data mining can be used to allow companies to learn from their customers’ behaviour, that is, what
  customers prefer. This knowledge can enable them to satisfy and retain their customers. The
  application of data mining in customer relationship management is proving to be one of the factors
  that make data mining popular, as every company is striving to woo new customers and retain the old
  ones. Most online retailers perform data mining for customer relationship management and believe
  that through it they increase revenue and satisfy customers.
Chapter 7
Case Study
7.1 Overview
During the first project meeting it became clear that the project supervisor did not want the project to be a
research-only project; he wanted an actual data mining exercise to be done. The supervisor asked the author to
find out whether any data mining tools were available on the School of Computing machines, whether there were
any datasets, and whether any modules in the university taught data mining.
The questions were forwarded to the project initiator by email and he responded with answers to most of them
(see Appendix D for the reply). There were no datasets available, so the project initiator suggested getting
some from the KDDCUP website, and this was done within the same week. The author registered at the website
and was given a UserID and password to download the data.
The datasets provided were accumulated by an online retailer called Gazelle.com, which deals in legwear and
legcare products. Gazelle.com’s goal was to retain and attract more customers, even if it meant losing money
in the short term. It therefore ran many promotions, which are relevant for mining since they affect traffic
to the website, the type of customers etc. There were 3 datasets, each corresponding to a question asked in
the KDDCUP 2000 competition.
The dataset for question 1 was a transactional one and the biggest of the 3; it was chosen because the
supervisor had hinted that a bigger dataset would be ideal. The dataset had about 30,000 records and was
about 297MB. When the datasets were downloaded from [10], it was discovered that they were in extended C5.0
format, with .names and .data files, which meant that it would be easy to use them with the C5.0 software tool.
A search for the software was done on the Internet and a free download of the demo was found at [11]. It was
found that C5.0 had a Windows version called See5, which was also available for download. The Windows version
was preferred over the Unix version, C5.0, because of its graphical interface, and it was thought that it
would be easy to learn and use within a short time. It was also chosen because it could be used at home, as
the author did not have Unix installed on his home computer.
The tutorials for See5 were also downloaded from the same website and the data mining lessons were started.
During the familiarisation with the demo it was found that it only read 400 records of the dataset. This was
communicated to the supervisor and it was agreed that, during the data mining phase of the project plan, an
evaluation version of the software would have to be acquired to enable the mining exercise.
In February 2002, an evaluation licence was requested from [11] and a licence was offered by Ross Quinlan
(see his reply in Appendix D), but it was only valid for 10 days, so a lot had to be done in that time. The
software was installed and the data mining started. The .data file seemed too big for the tool, as the
program appeared to hang while constructing the decision trees. The ten days were running out, it was evident
that the problem was the size of the .data file, and the author did not know how to split it. It was later
decided that a smaller dataset be obtained from [10]. This dataset was much smaller, about 4.49MB, and had
1781 records. When this dataset was used the program started working well.
During the progress meeting the results of the data mining were shown to the Assessor and Supervisor and the
reason for using a smaller dataset was explained. The Assessor suggested that the author should use a Perl
script to split the bigger dataset into training and test data, or alternatively use cross-validation to
reduce the errors on the small dataset. It was clear after that meeting that the whole exercise had to be done
again, but by then the evaluation licence had expired. The author asked for another licence but Ross Quinlan
declined, as one had already been issued (see Appendix D). A friend of the author at Leeds Metropolitan
University was asked to get a licence, which she did. The only task left was to split the dataset, so the Perl
script random_select.pl was run and the dataset was split into two files: customers.data with 19999 records
and customers.test with 10088 records. See Appendix E and F for the sample customers.data and customers.names
files.
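The script random_select.pl is not reproduced in this report. A comparable random split could be sketched in Python as below; the function name, the fixed seed and the placeholder cases are assumptions for illustration and say nothing about how the Perl script works.

```python
import random

def split_cases(lines, n_train, seed=0):
    """Shuffle the cases and return (training cases, test cases)."""
    rng = random.Random(seed)   # fixed seed so that the split can be repeated
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder cases standing in for the 30087 lines of the original .data file.
cases = [f"case-{i}" for i in range(30087)]
train, test = split_cases(cases, n_train=19999)
```

Each case lands in exactly one of the two files, so the test file contains only cases that the classifier has not seen during training.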
7.2 Extracting information from the datasets
The files were then loaded into the See5 program and the following questions were asked by targeting specific
attributes. The author has tried to justify why each question was selected and how it can help Gazelle.com
with its goal of attracting and retaining customers. A 10-fold cross-validation has been used for all the
questions to obtain a more reliable estimate of the error rates. The decision trees for the questions can be
seen in Appendix G.
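The principle of 10-fold cross-validation can be sketched as follows: the cases are divided into ten blocks, and each block in turn is held out for testing while a tree is built from the other nine. The sketch assigns cases to folds by position, which is an assumption made for simplicity and not necessarily how See5 forms its blocks.

```python
def k_fold_indices(n_cases, folds=10):
    """Yield (training indices, test indices) for each fold."""
    for fold in range(folds):
        test = [i for i in range(n_cases) if i % folds == fold]
        train = [i for i in range(n_cases) if i % folds != fold]
        yield train, test

# 58 cases, as in the smaller evaluations reported for some of the questions.
splits = list(k_fold_indices(58, folds=10))
held_out_sizes = [len(test) for _, test in splits]
```

With only 58 cases each held-out block has five or six cases, so a single misclassification changes the error rate of a fold by roughly 17 to 20 percentage points.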
Question 1.
Which website refers the most customers to the Gazelle.com website?
Many Organisations advertise by placing links to their website as click-throughs on other Organisations’
websites. This service is usually paid for, so it is very important to know which websites are referring more
customers and which ones are not, so that the Organisation concerned can withdraw its adverts from the
websites that are not productive and seek others. It is hoped that this question will be useful for on-line
retailers. The targeted attribute is Session First Referrer Top 5. Below is the evaluation on training and
test data.
Evaluation on training data (19998 cases):
    Decision Tree
  ----------------
  Size      Errors
   460 3909(19.5%)   <<
(a)
(b) (c) (d) (e) (f) <-classified as
------- ---- ---- ---- ---50
2
129 (a): class http://www.mycoupons.com
4
209
11 13 26 1413 (b): class http://www.fashionmall.com
4
9 1233
8 10 406 (c): class http://www.gazelle.com
13
239 41 1147 (d): class http://stores.shopnow.com
7
1 43 386
299 (e): class http://www.winnie-cooper.com
22
36 105 58 102 13972 (f): class Other
*** line 10089 of `Customers.test': unexpected end of file
Evaluation on test data (10088 cases):
Decision Tree
---------------Size
Errors
460
(a)
---11
2
12
(b)
---42
4
11
11
54
2407(23.9%) <<
(c)
---1
9
623
1
5
105
(d)
----
(e)
----
17
4
41
25
79
20
1
24
142
121
(f) <-classified as
---72 (a): class http://www.mycoupons.com
752 (b): class http://www.fashionmall.com
254 (c): class http://www.gazelle.com
616 (d): class http://stores.shopnow.com
207 (e): class http://www.winnie-cooper.com
6822 (f): class Other
Time: 77.6 secs
Results with Cross-validation of 10 Folds
[ Summary ]

Fold        Decision Tree
----      ----------------
          Size      Errors
   0       464       24.5%
   1       505       25.4%
   2       419       23.5%
   3       455       24.1%
   4       537       24.6%
   5       557       24.7%
   6       496       24.3%
   7       627       24.8%
   8       499       23.6%
   9       465       23.3%
Mean     502.4       24.3%
SE        18.8        0.2%
(a) (b) (c) (d)
---- ---- ---- ---29
1
2
3 149 11 21
6 17 1094 7
21
2 150
19
7 67
34 157 226 193
(e) (f) <-classified as
---- ---149
(a): class http://www.mycoupons.com
39 1453
(b): class http://www.fashionmall.com
11 535
(c): class http://www.gazelle.com
63 1204
(d): class http://stores.shopnow.com
249 394
(e): class http://www.winnie-cooper.com
214 13471
(f): class Other
Time: 378.3 secs
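The Mean and SE figures of the cross-validation summary for Question 1 can be checked with a short calculation. The sketch assumes that SE is the sample standard deviation of the ten fold values divided by the square root of the number of folds; under that assumption the reported values are reproduced.

```python
import math

def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)   # sample variance
    return mean, math.sqrt(variance / n)

# Per-fold figures from the Question 1 cross-validation summary.
error_rates = [24.5, 25.4, 23.5, 24.1, 24.6, 24.7, 24.3, 24.8, 23.6, 23.3]
tree_sizes = [464, 505, 419, 455, 537, 557, 496, 627, 499, 465]

error_mean, error_se = mean_and_standard_error(error_rates)
size_mean, size_se = mean_and_standard_error(tree_sizes)
```

Rounded to one decimal place this gives a mean error of 24.3% with SE 0.2%, and a mean tree size of 502.4 with SE 18.8.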
Question 2.
How do most of our customers find out about us?
This question is related to the one above, as it is also about marketing. It seeks to find out which form of
advertising is most effective for the Organisation. The results might lead to concentrating on the advertising
medium that brings in the most customers, or to more attention being paid to the less productive advertising
media. The targeted attribute is HowDidYouFindUs. Below is the evaluation on training and test data.
Evaluation on training data (58 cases):
    Decision Tree
  ----------------
  Size      Errors
     8   14(24.1%)   <<
(a) (b) (c) (d) (e) (f) <-classified as
---- ---- ---- ---- ---- ---5
(a): class Web/Banner Ad
1
(b): class News Story
2
(c): class Search Engine
1
34
(d): class Friend/Co-worker
(e): class Magazine Ad
2
9
4 (f): class Other
*** ignoring cases with bad or unknown class
*** line 10089 of `Customers.test': unexpected end of file
Evaluation on test data (18 cases):
    Decision Tree
  ----------------
  Size      Errors
     8   12(66.7%)   <<
(a) (b) (c) (d) (e) (f) <-classified as
---- ---- ---- ---- ---- ---(a): class Web/Banner Ad
1
(b): class News Story
2
(c): class Search Engine
5
3 (d): class Friend/Co-worker
(e): class Magazine Ad
1
5
1 (f): class Other
Time: 17.3 secs
Results with Cross-validation of 10 Folds
[ Summary ]

Fold        Decision Tree
----      ----------------
          Size      Errors
   0         5       80.0%
   1        12       80.0%
   2         8       33.3%
   3         7       50.0%
   4         5       66.7%
   5         7       50.0%
   6        11       50.0%
   7         9       33.3%
   8        11       50.0%
   9         6       50.0%
Mean       8.1       54.3%
SE         0.8        5.2%
(a) (b) (c) (d) (e) (f)
---- ---- ---- ---- ---- ---1
3
1
1
2
4
1 24
6
2
11
2
<-classified as
(a): class Web/Banner Ad
(b): class News Story
(c): class Search Engine
(d): class Friend/Co-worker
(e): class Magazine Ad
(f): class Other
Time: 10.4 secs
Question 3.
What gender are most of our customers?
Knowing the gender of your customers is very important in a retail shop that deals in clothing. The answer to
the question will allow the marketing team to advise the stores department to stock goods that are likely to
be preferred by the dominant gender, for example more female clothing if most of the customers are female.
This amounts to segmenting the market into gender groups to enable different offers.
The targeted attribute is Gender.
Evaluation on training data (19998 cases):
    Decision Tree
  ----------------
  Size      Errors
    39    6( 0.0%)   <<

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
   157                (a): class Female
     4    46          (b): class Male
     1     1 19789    (c): class NULL
Evaluation on test data (10088 cases):
    Decision Tree
  ----------------
  Size      Errors
    39   20( 0.2%)   <<

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
    66     8     4    (a): class Female
     6     9          (b): class Male
     1     1  9993    (c): class NULL
Time: 31.4 secs
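The error count reported by See5 can be read off a confusion matrix: the diagonal holds the correctly classified cases and everything else is an error. The sketch below checks this for the Question 3 test data; the function name is an assumption, and the blank cell of the matrix is taken to be zero.

```python
def error_summary(matrix):
    """Return (number of errors, error rate) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    errors = total - correct
    return errors, errors / total

# Rows are the actual classes Female, Male and NULL; columns are the predicted classes.
gender_test = [
    [66, 8, 4],
    [6, 9, 0],
    [1, 1, 9993],
]
errors, rate = error_summary(gender_test)
```

This gives 20 errors out of 10088 cases, that is 0.2%, in agreement with the summary line of the test evaluation.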
Results with Cross-validation of 10 Folds
[ Summary ]

Fold        Decision Tree
----      ----------------
          Size      Errors
   0        34        0.3%
   1        31        0.4%
   2        30        0.3%
   3        32        0.3%
   4        38        0.4%
   5        37        0.3%
   6        34        0.3%
   7        35        0.4%
   8        32        0.3%
   9        35        0.4%
Mean      33.8        0.3%
SE         0.8        0.0%

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
   124    16    17    (a): class Female
    19    28     3    (b): class Male
     4     9 19778    (c): class NULL
Time: 93.2 secs
Question 4.
How many customers wish to receive mail from our company?
This question will reveal the number of customers who wish to receive electronic mail from the Organisation.
It will also give an idea of how many do not wish to receive electronic mail. The answer may lead to an
investigation into why some customers do not wish to receive mail, in the hope of convincing them to receive
it. The answer will also help in target marketing, whereby only customers who wish to receive mail will be
sent offers by electronic mail. The targeted attribute is SendEmail.
Evaluation on training data (19998 cases):
    Decision Tree
  ----------------
  Size      Errors
    10   17( 0.1%)   <<
(a) (b) (c) <-classified as
---- ---- ---19596 1
(a): class NULL
10 61
1 (b): class True
3 2 324 (c): class False
Evaluation on test data (10088 cases):
    Decision Tree
  ----------------
  Size      Errors
    10   16( 0.2%)   <<

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
  9909                (a): class NULL
     6    22     1    (b): class True
     3     6   141    (c): class False
Time: 20.2 secs
Results with Cross-validation of 10 Folds
[ Summary ]

Fold        Decision Tree
----      ----------------
          Size      Errors
   0         9        0.1%
   1        14        0.2%
   2         9        0.3%
   3        10        0.1%
   4        13        0.2%
   5        12        0.2%
   6         9        0.2%
   7        15        0.2%
   8         8        0.2%
   9         9        0.3%
Mean      10.8        0.2%
SE         0.8        0.0%
(a) (b) (c)
---- ---- ---19595 2
6 56 10
13 316
<-classified as
(a): class NULL
(b): class True
(c): class False
Time: 87.3 secs
Question 5.
How many of our customers are Premium Card holders?
This question will help with creating a segment of customers who hold Premium cards. This group of people is
important as they have high credit limits and spending power. Satisfying such a group is very important, so
it is useful for any retail outlet to know whether such a group exists. The targeted attribute is Premium
Card Holder.
Evaluation on training data (19998 cases):
    Decision Tree
  ----------------
  Size      Errors
    12   42( 0.2%)   <<
(a) (b) (c) <-classified as
---- ---- ---19697
(a): class NULL
29 36 (b): class True
6 230 (c): class False
Evaluation on test data (10088 cases):
    Decision Tree
  ----------------
  Size      Errors
    12   30( 0.3%)   <<

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
  9966                (a): class NULL
           7    18    (b): class True
          12    85    (c): class False
Time: 19.3 secs
Results with Cross-validation of 10 Folds
[ Summary ]

Fold        Decision Tree
----      ----------------
          Size      Errors
   0         7        0.3%
   1        21        0.5%
   2        18        0.5%
   3        13        0.3%
   4        17        0.4%
   5         6        0.3%
   6        13        0.6%
   7         7        0.3%
   8        26        0.5%
   9        11        0.2%
Mean      13.9        0.4%
SE         2.1        0.0%
(a) (b) (c) <-classified as
---- ---- ---19697
(a): class NULL
16 49 (b): class True
27 209 (c): class False
Time: 49.8 secs
Question 6.
Who is the Email provider of most of our customers?
The answer to this question will reveal which Email provider most customers use. The provider may then be
targeted for marketing purposes, such as doing more advertising with them or paying for a click-through link
on their website. The targeted attribute is Email.
Evaluation on training data (19998 cases):
    Decision Tree
  ----------------
  Size      Errors
    42   48( 0.2%)   <<
(a) (b) (c) (d) (e) (f) (g) (h) (i) <-classified as
---- ---- ---- ---- ---- ---- ---- ---- ---(a): class MIL
56
30
(b): class NET
(c): class GOV
4
276
1
1
(d): class COM
6
5
(e): class EDU
2
(f): class ORG
1
1
15
(g): class Gazelle
1
1 2
(h): class Other
19596
(i): class NULL
*** line 10089 of `Customers.test': unexpected end of file
Evaluation on test data (10088 cases):
    Decision Tree
  ----------------
  Size      Errors
    42   57( 0.6%)   <<
(a) (b) (c) (d) (e) (f) (g) (h) (i) <-classified as
---- ---- ---- ---- ---- ---- ---- ---- ---1
(a): class MIL
7
22
(b): class NET
(c): class GOV
17
113
2
1 (d): class COM
6
1
(e): class EDU
1
1
(f): class ORG
1
3
(g): class Gazelle
1
3
(h): class Other
9908 (i): class NULL
Time: 20.6 secs
Results with Cross-validation of 10 Folds
[ Summary ]

Fold        Decision Tree
----      ----------------
          Size      Errors
   0        41        0.6%
   1        38        0.7%
   2        53        0.6%
   3        35        0.7%
   4        36        0.6%
   5        45        1.0%
   6        46        0.6%
   7        25        0.7%
   8        31        0.6%
   9        35        0.6%
Mean      38.5        0.7%
SE         2.5        0.0%
(a) (b) (c) (d) (e) (f) (g) (h) (i)
---- ---- ---- ---- ---- ---- ---- ---- ---24
61
1
36
3
1
235
6
1
9
2
4
2
2
6
8
<-classified as
(a): class MIL
(b): class NET
(c): class GOV
1 (d): class COM
(e): class EDU
(f): class ORG
(g): class Gazelle
(h): class Other
19596 (i): class NULL
Time: 49.9 secs
Question 7.
How many of our customers are mail order buyers?
The question seeks to find out whether some of the customers buying from the website have bought goods by
mail order in the past. Such customers will usually have no problem with buying from the Internet and are
therefore potential loyal customers who need to be retained. The targeted attribute is Mail Order Buyer.
Evaluation on training data (19998 cases):
    Decision Tree
  ----------------
  Size      Errors
     7   31( 0.2%)   <<

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
 19697                (a): class NULL
         186     1    (b): class True
          30    84    (c): class False
Evaluation on test data (10088 cases):
    Decision Tree
  ----------------
  Size      Errors
     7   15( 0.1%)   <<

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
  9966                (a): class NULL
          75     3    (b): class True
          12    32    (c): class False
Time: 21.0 secs
Results with Cross-validation of 10 Folds
[ Summary ]

Fold        Decision Tree
----      ----------------
          Size      Errors
   0         8        0.2%
   1         6        0.4%
   2         7        0.3%
   3         7        0.1%
   4         5        0.2%
   5         3        0.2%
   6         7        0.1%
   7         5        0.2%
   8         7        0.1%
   9         3        0.3%
Mean       5.8        0.2%
SE         0.6        0.0%
(a) (b) (c) <-classified as
---- ---- ---19697
(a): class NULL
186 1 (b): class True
37 77 (c): class False
Time: 47.0 secs
Question 8.
How many children do most of our customers have?
The answer will classify customers by number of their children. Those who have many children may then be
targeted for family purchasing, therefore increasing the number of overall customers. The targeted attribute is
NumberOfChildren.
Evaluation on training data (58 cases):
    Decision Tree
  ----------------
  Size      Errors
     5    7(12.1%)   <<
(a) (b) (c) (d) (e) <-classified as
---- ---- ---- ---- ---44
(a): class 0
2
(b): class 1
4
4
(c): class 2
1
(d): class 3
3 (e): class 4 or more
*** ignoring cases with bad or unknown class
Evaluation on test data (18 cases):
    Decision Tree
  ----------------
  Size      Errors
     5    3(16.7%)   <<
(a) (b) (c) (d) (e) <-classified as
---- ---- ---- ---- ---11
(a): class 0
(b): class 1
2
3
(c): class 2
(d): class 3
1
1 (e): class 4 or more
Time: 15.1 secs
Results with Cross-validation of 10 Folds
[ Summary ]

Fold        Decision Tree
----      ----------------
          Size      Errors
   0         7        0.0%
   1         6       20.0%
   2         5       33.3%
   3         4       33.3%
   4         6       33.3%
   5         6       16.7%
   6         3       50.0%
   7         4       50.0%
   8         4       16.7%
   9         5       16.7%
Mean       5.0       27.0%
SE         0.4        5.0%
(a) (b) (c) (d) (e) <-classified as
---- ---- ---- ---- ---40 1 1
2 (a): class 0
2
(b): class 1
7
1
(c): class 2
1
(d): class 3
2
1 (e): class 4 or more
Time: 10.1 secs
Question 9.
How many of our customers respond to our marketing mails?
The answer to the question above will be used for marketing purposes. The answer will show which of the
customers are “loyal” and respond to direct marketing mails. These customers will need to be retained and
they may be enticed by offering them discounts. The targeted attribute is Mail Responder.
Evaluation on training data (19998 cases):
    Decision Tree
  ----------------
  Size      Errors
    16    9( 0.0%)   <<

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
 19697                (a): class NULL
         220     5    (b): class True
           4    72    (c): class False
Evaluation on test data (10088 cases):
    Decision Tree
  ----------------
  Size      Errors
    16   11( 0.1%)   <<

   (a)   (b)   (c)    <-classified as
  ----  ----  ----
  9966                (a): class NULL
          85     6    (b): class True
           5    26    (c): class False
Time: 20.4 secs
Results with Cross-validation of 10 Folds
[ Summary ]
Fold     Decision Tree
----    ----------------
        Size      Errors
   0      17        0.2%
   1      13        0.3%
   2      15        0.2%
   3      19        0.1%
   4       8        0.3%
   5      10        0.2%
   6      15        0.3%
   7      16        0.1%
   8      16        0.1%
   9      17        0.1%
Mean    14.6        0.2%
SE       1.1        0.0%
   (a)   (b)   (c)    <-classified as
  ----  ----  ----
 19697                (a): class NULL
         203    22    (b): class True
          13    63    (c): class False
Time: 44.4 secs
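The error counts reported by See5 follow directly from the confusion matrix: every case off the main diagonal is a misclassification. The sketch below illustrates this with the training-data evaluation for Question 9 above. The placement of each count in a particular cell is an assumption read from the listing; it is consistent with the reported total of 9 errors in 19998 cases.

```python
# Rows are actual classes, columns are predicted classes, in the order
# NULL, True, False. The cell positions are an assumption read from the
# Question 9 training output.
classes = ["NULL", "True", "False"]
matrix = [
    [19697, 0, 0],
    [0, 220, 5],
    [0, 4, 72],
]


def error_summary(confusion):
    """Return (total cases, misclassified cases, error rate in percent)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    errors = total - correct
    return total, errors, 100.0 * errors / total


total, errors, rate = error_summary(matrix)
print(f"{errors} errors in {total} cases ({rate:.1f}%)")
```

Because 19697 of the cases belong to class NULL and are all classified correctly, the overall error rate rounds to 0.0% even though 9 of the 301 True and False cases are misclassified.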
Question 10.
What percentage of our customers come back to our website?
The answer to this question will show the percentage of customers who come back to the website after
having purchased something in the past. This kind of customer needs to be retained. The targeted attribute is
Retail Activity.
Evaluation on training data (19998 cases):
Decision Tree
----------------
Size      Errors
   7   14( 0.1%)   <<
   (a)   (b)   (c)    <-classified as
  ----  ----  ----
 19697                (a): class NULL
          92     7    (b): class True
           7   195    (c): class False
Evaluation on test data (10088 cases):
Decision Tree
----------------
Size      Errors
   7   14( 0.1%)   <<
   (a)   (b)   (c)    <-classified as
  ----  ----  ----
  9966                (a): class NULL
          34     6    (b): class True
           8    74    (c): class False
Time: 19.9 secs
Results with Cross-validation of 10 Folds
[ Summary ]
Fold     Decision Tree
----    ----------------
        Size      Errors
   0       9        0.2%
   1       7        0.2%
   2       7        0.1%
   3       5        0.1%
   4       6        0.0%
   5      11        0.3%
   6       6        0.1%
   7       6        0.0%
   8       5        0.1%
   9       5        0.3%
Mean     6.7        0.1%
SE       0.6        0.0%
   (a)   (b)   (c)    <-classified as
  ----  ----  ----
 19697                (a): class NULL
          88    11    (b): class True
          12   190    (c): class False
Time: 45.1 secs
7.3 Evaluation and Interpretation of Results
See5
See5 is the Windows version of C5.0, which is the successor to C4.5. It constructs decision trees and
classifies cases according to the targeted attribute. Once the files (Customers.names, Customers.data and
Customers.test) have been located, the main window looks like the screenshot below.
The main window of See5 has six buttons on its toolbar. From left to right, they are:
Locate Data: Invokes a browser to find the files for your application, or to change the current
application.
Construct Classifier: Selects the type of classifier to be constructed and sets other options such as
cross-validation.
Stop: Interrupts the classifier generating process.
Review Output: Re-displays the output from the last classifier construction.
Use Classifier: Interactively applies the current classifier to one or more cases.
Cross-Reference: Shows how cases in training or test data relate to parts of a classifier.
All these functions can also be initiated from the File menu.
Constructing Classifiers
Once the names, data and test files have been set up, See5 is ready to use. The first step is to
locate the data using the Locate Data button on the toolbar. Several options affect the type of
classifier that See5 produces and the way it is constructed. The Construct Classifier button on the toolbar
displays a dialog box that sets out these classifier construction options, as shown below.
The screenshot above shows that a cross-validation of ten folds was chosen for the mining exercise.
If the rulesets option is ticked, See5 converts the tree into collections of rules called rulesets.
For this exercise See5 was invoked with default values, and it constructed decision trees with a
cross-validation of 10 folds and a pruning confidence of 25%, which is the default in See5.
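The idea behind a 10-fold cross-validation can be sketched in a few lines: the cases are divided into ten blocks, and each block in turn is held out for testing while a classifier is built from the other nine. The function names below are hypothetical, and the sketch shuffles the cases at random without balancing the classes across folds, so it illustrates the principle rather than See5's own procedure.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Split case indices 0..n_cases-1 into k folds of near-equal size."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_cases, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_cases, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# 58 cases, as in the NumberOfChildren training data.
for train, test in cross_validation_splits(58, k=10):
    print(len(train), len(test))
```

Every case is used for testing exactly once, so the error rate averaged over the folds is estimated from cases that the corresponding classifier never saw during construction.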
One weakness of See5 observed during the exercise is that it did not accept an attribute with continuous
values as the target: the class has to be given as a list of discrete values. This is the message that was
received when trying to target the Login Failure Count attribute: “*** line 305 of `Customers.names': target
attribute `Login Failure Count' must be specified by a list of discrete values”.
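One way round this restriction is to discretise the continuous attribute into a small number of named ranges before using it as the target. The sketch below is a minimal illustration only; the cut points and labels are hypothetical and were not used in the project.

```python
def discretise(value, cut_points, labels):
    """Return the label of the first bin whose upper bound is >= value.

    cut_points holds the inclusive upper bounds of all bins except the last,
    so labels must have one more entry than cut_points.
    """
    for upper, label in zip(cut_points, labels):
        if value <= upper:
            return label
    return labels[-1]


# Hypothetical bins for a login failure count (illustrative values only).
CUT_POINTS = [0, 2]
LABELS = ["none", "few", "many"]

counts = [0, 1, 2, 3, 7]
print([discretise(c, CUT_POINTS, LABELS) for c in counts])
# prints ['none', 'few', 'few', 'many', 'many']
```

The labels would then be listed as the discrete values of the target attribute in the names file.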
Classification Based on Associations (CBA)
It was intended that CBA would be used to evaluate the results of See5, as it is said to be able to read the
.names and .data files. CBA is also said to be able to reduce the error rate to a lower level than See5. An
academic version of CBA was requested from the National University of Singapore [12], and Liu Bing, one of
the authors of CBA, granted the author permission to download the program [See Appendix D]. When the author
tried installing the program, it came up with the error message “Setup has detected that unInstallShield is
in use. Please close unInstall and restart setup”. This message was reported to support, who replied that the
program was updating the registry and so could not be installed on the School of Computing computers [See
Appendix D]. The reply from support was forwarded to the project supervisor and discussed with him. The
author then decided to install the program on his home computer.
This is CBA’s main interface. All the main functions can be reached from this interface. The top four
selectors are for different mining tasks. Data mining in CBA is a top-down process. Whenever you have
data to mine, you should check its consistency by using the data cleaner. If there are any continuous
attributes, you should pass them to the data discretizer for discretization. You could also use feature
selection to reduce the number of attributes.
After the Customers.data file was loaded, a message asked the author to run the discretizer to discretize
the data (see screenshot below). The OK button was clicked.
When the discretizer button was clicked, the following screen appeared with an error and the author could
not go any further: once this error message showed, everything froze and the computer had to be
restarted. This was tried a couple of times, but it got stuck at the same place each time, so the author gave up.
Results Interpretation
The results for question 1 show that using separate training and test data files reduced the errors: the
errors are higher on the training data and lower on the test data. This shows that by the time the
program ran the tests it had already been trained. Cross-validating further reduces the errors by validating
each fold. Gazelle.com is the main referrer, with 1233 cases in the training data and 623 in the test data. The
cross-validation results show it with 1094 cases.
Question 2 shows that most customers find out about the retailer through a Friend/Co-worker, while question 3
shows that most customers are female. The question about whether the customers want to be sent
mail shows that only a few are really interested, and this will need to be investigated further.
The segmentation question about customers who hold premium cards shows that only 36 cases are true. The
email provider question shows that the .com providers are the main ones, followed by Mil. Question 7 asked
about customers who have bought goods through mail order, and the results show that only a handful
have done so before.
The number of children question showed that most of the customers do not have any children, while the one
about customers who respond to the shop’s mailing showed that 305 customers do respond. The last question
was about customers who have bought more than once on the website, and the targeted attribute was Retail
Activity. The results for this question show that only about 126 customers have done so, which is not a good
figure considering that there are about 30 000 records in the files.
Chapter 8
Current Issues in Data Mining
8.1 Individual privacy
One of the key issues raised by data mining technology is not a business or technological one, but rather an
ethical or social one: the issue of individual privacy. Data mining makes it possible to analyse routine
business transactions and glean a significant amount of information about individuals’ buying habits and
preferences. In most cases this is done without the individual’s consent, and people effectively have no
choice, as it is the technology itself that enables this intrusion into their privacy.
8.2 Data integrity
Another issue is that of data integrity. Clearly, data analysis can only be as good as the data that is being
analysed. A key implementation challenge is integrating conflicting or redundant data from different sources,
which typically arises when creating the Organisation’s data warehouse. For example, a bank may maintain
credit card accounts on several different databases. The addresses (or even the names) of a single cardholder
may be different in each. So the integrity of data in data warehouses is not guaranteed.
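The cardholder example can be made concrete with a small check that flags customers whose details disagree between sources. This is a minimal sketch: the database names, field names and records are all hypothetical, and a real integration exercise would need much more careful matching of names and addresses.

```python
from collections import defaultdict


def normalise(value):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def find_conflicts(sources, key, field):
    """Return {key value: set of distinct normalised values} where sources disagree."""
    seen = defaultdict(set)
    for records in sources.values():
        for record in records:
            seen[record[key]].add(normalise(record[field]))
    return {k: values for k, values in seen.items() if len(values) > 1}


# Hypothetical extracts from two of the bank's card databases.
sources = {
    "cards_a": [{"holder_id": "C001", "address": "12 High Street, Leeds"},
                {"holder_id": "C002", "address": "4 Park Lane, York"}],
    "cards_b": [{"holder_id": "C001", "address": "12 High St, Leeds"},
                {"holder_id": "C002", "address": "4 Park  Lane, York"}],
}

conflicts = find_conflicts(sources, key="holder_id", field="address")
print(conflicts)
```

Only holder C001 is reported: the two spellings of the street cannot be reconciled by simple normalisation, whereas the extra space in the second record for C002 can.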
8.3 Data Size Issues
Most of the existing data mining techniques fail when the data becomes too large, for example when it no
longer fits in main memory. New techniques have to be developed that can work with large and heterogeneous
databases, as the Internet connects many sources of data. Any algorithm that is proposed for mining data will
have to account for out-of-core data structures. Most of the existing algorithms have not addressed this issue.
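The simplest out-of-core technique is to stream the records and keep only a small summary in memory. The sketch below counts the values of one column while reading one record at a time; the function name is hypothetical, and an in-memory text buffer stands in for a large file so that the example is self-contained.

```python
import csv
import io
from collections import Counter


def count_column(lines, column):
    """Count the values in one column, reading a single record at a time.

    `lines` can be any iterable of text lines, such as an open file, so the
    whole dataset never has to be held in memory at once.
    """
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) > column:  # skip blank or short records
            counts[row[column]] += 1
    return counts


# Stand-in for a large data file (illustrative records only).
sample = io.StringIO("1,True\n2,False\n3,True\n")
print(count_column(sample, 1))
```

Memory use depends on the number of distinct values in the column rather than on the number of records, which is what allows the same code to run over files far larger than main memory.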
8.4 Noisy Data Issues
The other issue is that of noise: most of the algorithms assume the data to be free of noise. As a result, the
most time-consuming part of solving problems becomes data pre-processing. Data formatting and cleaning is
time-consuming and can be frustrating when working with large datasets. The concept of noisy data can be
understood through the real-life example of mining information from web logs. A user may have gone to a web
site by mistake, through an incorrect URL or an incorrect button press. In such a case, this information is
useless if we are trying to deduce the sequence in which the user accessed the web pages. The logs may contain
many such data items, and these constitute data noise. A database may contain up to 30-40% such noisy data,
and pre-processing this data may take more time than the actual execution of the algorithm.
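A pre-processing step for the web log example might look like the sketch below. The record layout, the status check and the three-second threshold are all assumptions chosen for illustration; deciding what really counts as a mistaken visit depends on the site and the question being asked.

```python
def remove_noise(hits, min_seconds=3):
    """Keep only successful page views on which the user stayed at least min_seconds."""
    return [h for h in hits if h["status"] == 200 and h["seconds"] >= min_seconds]


def page_sequence(hits):
    """Return the ordered list of pages visited in a session."""
    return [h["url"] for h in hits]


# Hypothetical session extracted from a web log.
session = [
    {"url": "/home", "status": 200, "seconds": 40},
    {"url": "/hom", "status": 404, "seconds": 1},       # mistyped URL
    {"url": "/careers", "status": 200, "seconds": 1},   # wrong button, left at once
    {"url": "/departments", "status": 200, "seconds": 25},
]

print(page_sequence(remove_noise(session)))
# prints ['/home', '/departments']
```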
8.5 Technical Issues
One technical issue is whether it is better to set up a relational database structure or a
multidimensional one. In a relational structure, data is stored in tables, permitting ad hoc queries. In a
multidimensional structure, sets of cubes are arranged in arrays, with subsets created according to category.
While multidimensional structures facilitate multidimensional data mining, relational structures have so far
performed better in client/server environments. And, with the explosion of the Internet, the world is becoming
one big client/server environment.
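The difference between the two structures can be illustrated with a toy example. In the sketch below the same hypothetical sales records are held first as rows that can be filtered by any ad hoc condition, and then as a cube of totals indexed by its dimensions, which can be rolled up by category. The data, dimension names and function names are invented for illustration.

```python
from collections import defaultdict

# Relational view: one row per sale (month, department, browser, amount).
sales = [
    ("2000-02", "Hosiery", "AOL", 24.0),
    ("2000-02", "Legwear", "Netscape", 107.0),
    ("2000-03", "Hosiery", "AOL", 10.0),
    ("2000-03", "Hosiery", "Netscape", 7.0),
]


def query(rows, predicate):
    """Ad hoc query: return the rows that satisfy an arbitrary condition."""
    return [row for row in rows if predicate(row)]


def build_cube(rows):
    """Multidimensional view: total amount for each combination of dimensions."""
    cube = defaultdict(float)
    for month, department, browser, amount in rows:
        cube[(month, department, browser)] += amount
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate the cube over every dimension whose index is not in `keep`."""
    totals = defaultdict(float)
    for key, amount in cube.items():
        totals[tuple(key[i] for i in keep)] += amount
    return dict(totals)


cube = build_cube(sales)
by_department = roll_up(cube, keep=[1])
print(by_department)
print(query(sales, lambda row: row[3] > 20))
```

The cube answers questions along its predefined dimensions very cheaply, while the row form can answer questions that nobody anticipated when the cube was designed.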
8.6 Cost Issues
While system hardware costs have dropped dramatically within the past 10 years, data mining and data
warehousing tend to be self-reinforcing. The more powerful the data mining queries, the greater the value of
the information being gleaned from the data, and the greater the pressure to increase the amount of data being
collected and maintained, which increases the pressure for faster, more powerful data mining queries. This
increases pressure for acquiring larger and faster systems, which are more expensive.
Chapter 9
Future Requirements
The electronic monitoring of our lives will increase in the future as more Organisations join the e-business
wagon, and more data will be accumulated for data mining. The author sees data mining becoming the main
decision-making tool in the near future, with almost every big Organisation involved in it.
The future will see the development of efficient and scalable data mining tools that are able to extract
information effectively from huge amounts of data. The issue of data size will also be addressed by parallel
and incremental algorithms that divide data into partitions that can be processed in parallel. The new
algorithms will have to deliver acceptable performance on large volumes of data regardless of the computing
platform. They will also need to be scalable, to take advantage of parallel computing and to cope with long
computation times such as those experienced with neural networks. The algorithms associated with
probabilistic learning will also need to be improved drastically.
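The partitioning idea can be sketched with a simple counting task: the data is divided into partitions, each partition is summarised independently, and the partial results are merged. The function names are hypothetical. Threads are used here only to keep the example short and self-contained; for CPU-bound work in Python a process pool would be the more usual choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Divide the items into n_parts partitions of near-equal size."""
    return [items[i::n_parts] for i in range(n_parts)]


def count_partition(part):
    """Summarise a single partition independently of the others."""
    return Counter(part)


def parallel_counts(items, n_parts=4):
    """Count item frequencies by processing partitions concurrently and merging."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_partition, partition(items, n_parts)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Illustrative class labels only.
labels = ["True", "False", "NULL", "NULL", "True", "NULL"]
print(parallel_counts(labels, n_parts=3))
```

The same merge step also supports incremental processing: when a new partition of data arrives, only that partition has to be counted and added to the running total.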
The future might see the standardisation of the data mining methodology, where each tool would have to provide
all the stages: Selection, Data cleaning, Transformation, Data mining, Interpretation and Evaluation.
This would greatly help, because once one has used one tool it would be easier to use a different one.
Finally, newly developed tools will also need to be easy to use, as data mining will eventually be accessible
to people who know very little about computers and may have no time to learn a complicated tool.
Reference List
[1] Cabena, Peter, et al. Discovering Data Mining from Concepts to Implementation. Prentice Hall, 1998.
[2] Witten, Ian H. & Frank, Eibe. Data Mining. Morgan Kaufmann Publishers, 2000.
[3] Palace, Bill (1996). Data Mining: What is Data Mining?,
URL: http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
[4th April 2002]
[4] Clementine,
URL: http://www.spss.com/datamine/techniques.htm [8th April 2002]
[5] Rea, Alan. Data Mining,
URL: http://www.pcc.qub.ac.uk/tec/courses/datamining/stu_notes/dm_book_2.html
[10th April 2002]
[6] Berry, Michael J. A. & Linoff, Gordon S. Data Mining Techniques for Marketing, Sales, and
Customer Support. Wiley Computer Publishing, 1997.
[7] Brett, Steven. Data Mining in the School Information System (SIS). School of Computer Studies,
1998/1999.
[8] Berry, Michael J. A. & Linoff, Gordon S. Mastering Data Mining: The Art and Science of Customer
Relationship Management. Wiley Computer Publishing, 2000.
[9] Han, Jiawei & Kamber, Micheline. Data Mining Concepts and Techniques. Morgan Kaufmann
Publishers, 2001.
[10] KDDCUP website,
URL: http://www.ecn.purdue.edu/KDDCUP [15th March 2002]
[11] RuleQuest website,
URL: http://www.rulequest.com [12th April 2002]
[12] National University of Singapore website,
URL: http://www.comp.nus.edu.sg/~dm2 [24th April 2002]
[13] Joshi, Karuna P. Analysis of Data Mining Algorithms,
URL: http://userpages.umbc.edu/~kjoshi1/data-mine/proj_rpt.htm [15th April 2002]
[14] Mannila, Heikki, Toivonen, Hannu & Verkamo, A. Inkeri (1994). Efficient Algorithms for Discovering
Association Rules, Journal of Knowledge discovery in databases, pp. 181-192.
[15] Rombel, Adam (2001). CRM shifts to Data Mining to keep customers, Global Finance journal,
15(11): p.97.
[16] Two Crows Corporation. Introduction to Data Mining and Knowledge Discovery,
URL: http://www.twocrows.com [25th April 2002]
Appendix A
Personal Evaluation
This was a totally new area for me. I have learnt a lot from the project and will recommend that the company I
work for starts mining its data when I get back home. When the project started there were some problems,
such as acquiring datasets and tools, but once these were sorted out everything started going well. I have
learned many skills from doing this project, both personal and technical.
On the personal development side I have really learnt a lot, as it was the first time I was engaged in a project
of this magnitude on my own. The project helped me to practise personal goal setting and time management.
My personal goal was to learn this new area and do some practical exercises in it within the duration of the
project. I think I have achieved this goal, as I can now confidently engage in any data mining discussion.
Time management was the order of the day throughout this project. I tried by all means to keep everything on
schedule, though I did experience some glitches along the way. Other skills learnt were communication and
being organised. Communication is very important between the supervisor and student and I think we didn’t
have any problem in our case.
The other thing I learned was to apply my decision-making skills: there were times when the next
meeting with the supervisor seemed too far away and I had to make decisions and brief him later. This really
broadened my mind. Patience and toughness are some of the things learned in this project, as there were
times when I thought I would break down, especially when I found out that the software I
needed to evaluate the tool I was using could not be installed on the School of Computing computers. This
was a really devastating time for me and I don’t know how I managed to pull through. But through this
experience I learnt that things sometimes don’t go as expected.
The project involved evaluating the techniques and algorithms. This gave me some knowledge of doing
evaluation, and I think that the next time I do it, it will be much better than what I have managed in this
project. The most important overall skill learned was that of conducting research, especially on the Internet.
Data mining being a new field means that there is a lot of information on the Internet, some of which is
incorrect, so one has to sift through this mass of information to get what is relevant and correct.
In conclusion, I would say I am very proud of what I have achieved, and I still think the project
was an eye-opener.
Appendix B
Project Plan and Schedule
Task Name                                    Duration   Start Date     Finish Date
Feasibility Study                            15 days    Mon 15/10/01   Fri 02/11/01
Problem Definition                           11 days    Mon 15/10/01   Mon 29/10/01
Background Reading                           11 days    Mon 15/10/01   Mon 29/10/01
Minimum Requirements                         4 days     Tue 30/10/01   Fri 02/11/01
Resource Gathering                           10 days    Mon 05/11/01   Fri 16/11/01
Data Acquisition                             10 days    Mon 05/11/01   Fri 16/11/01
Tool Acquisition                             10 days    Mon 05/11/01   Fri 16/11/01
Algorithms and Techniques Research           28 days    Mon 05/11/01   Wed 12/12/01
Analysis                                     23 days    Mon 19/11/01   Wed 19/12/01
Data analysis                                23 days    Mon 19/11/01   Wed 19/12/01
Tool Familiarization                         23 days    Mon 19/11/01   Wed 19/12/01
Design                                       21 days    Wed 21/11/01   Wed 19/12/01
Generate Questions/Hypothesis                21 days    Wed 21/11/01   Wed 19/12/01
Data Mining                                  20 days    Mon 04/02/02   Fri 01/03/02
Extraction of Information from Datasets      20 days    Mon 04/02/02   Fri 01/03/02
Results Evaluation                           10 days    Mon 04/03/02   Fri 15/03/02
Results Interpretation                       10 days    Mon 04/03/02   Fri 15/03/02
Research on Issues and Future Requirements   10 days    Mon 18/03/02   Fri 29/03/02
Project Evaluation                           5 days     Mon 01/04/02   Fri 05/04/02
Report Write-up                              22 days    Mon 01/04/02   Tue 30/04/02
Appendix C
Project Specification
Title of project: Data Mining Algorithms and Tools
Code of project: SAR05
Supervisor: Stuart Roberts
Area of interest: Knowledge discovery; Data Mining
Appropriate for degree programmes: IS, CT, CS
Prerequisites : database
Multiple projects can be considered: Yes
Further information:
This is an exploratory project. Data mining is a fast growing area of
interest and many sophisticated tools are becoming available. Which
algorithms are best to use for solving what problems? How do different
tools compare? By taking a case study approach (case study data are
available), and through research papers, the aim of this project would be to
answer some of these questions. The School has both commercial and
experimental data mining software which could be used in the evaluation.
Appendix D
Emails Received
Email from Stuart Roberts
Date: Tue, 16 Oct 2001 11:09:03 +0100
From: S A Roberts <[email protected]>
To: K Tshetlhoyagae <[email protected]>
Subject: Re: Project questions
At 10:24 AM 10/16/01 +0100, you wrote:
>Morning,
>
>I have been allocated the Data mining Algorithms and Tools project, and
>I met my supervisor(Martin Berzins) yesterday and he wanted me to find
>out answers to the following questions.
>
>Whether there is available data to mine?
There are test data sets available from the KDD data mining web site, but
of course it depends what your objectives are as to what data are suitable.
see for example: http://www.ecn.purdue.edu/KDDCUP/
Eric Atwell also has a data mining project with some associated data, but
he may have someone else doing that project.
>Whether Data mining is taught in the University?
There is an introduction in DB32
>What software is available for data mining in the university?
SQL Server 2000 Data analysis services
Weka data mining tools on W2000 m/cs
C4.5 decision tree algorithm on linux
It's probably worth pointing out that you don't necessarily have to stick
with this project if there is something else that you and your supervisor
can agree on. If you can see clearly what you would like to achieve using
data mining tools then the project should be possible, but the published
specification almost certainly needs some extra thinking to turn it into a
well-defined set of objectives.
Hope this helps.
Stuart
Emails from Ross Quinlan
Date: Wed, 6 Feb 2002 19:50:38 -0800 (PST)
From: Ross Quinlan <[email protected]>
To: [email protected]
Subject: See5
Thank you for your interest in RuleQuest data mining tools.
Your ten-day evaluation licence ID is
3893 e694 97eb ee2e#
Instructions for downloading the system appear at
http://www.rulequest.com/Install/
If you have any problems installing the system, please contact
[email protected] with the following information:
* Windows: the machine name and user name shown
on the message indicating a licence problem
* Unix: the output from the commands "hostname"
and "who am i"
Regards,
Ross Quinlan
Subject: Request for evaluation licence
To: [email protected]
Date sent: Wed, 22 Mar 2002 02:02:11 -0800 (PST)
From: [email protected] (Ross Quinlan)
Send reply to: [email protected]
I regret that a licence ID cannot be issued. Only one evaluation
licence is issued to an organisation, and a licence has previously
been issued to [email protected]
Regards,
Ross Quinlan
Email from Liu Bing
Date: Thu, 7 Mar 2002 18:22:49 +0800 (GMT-8)
From: Liu Bing <[email protected]>
To: K Tshetlhoyagae <[email protected]>
Cc: ma yiming <[email protected]>
Subject: Re: CBA Academic Full version
Hi,
You can download it from the following:
http://www.comp.nus.edu.sg/~dm2/CBAFull.zip
Cheers
Liu, Bing (Dr.)
Department of Computer Science      Office: S17, 05-17
School of Computing                 Tel: (65) 8746736
National University of Singapore    Fax: (65) 7794580
3 Science Drive 2                   Email: [email protected]
Singapore 117543                    Web: http://www.comp.nus.edu.sg/~liub
Email from Support
Date: Fri, 8 Mar 2002 14:32:47 GMT
From: Pritpal Rehal via RT <[email protected]>
To: [email protected]
Subject: [SUPPORT #7733] (cts) Unable to install
Status: resolved
Requestors: [email protected]
as i said before ...you cant install the software
if it trying to access windows cfg/registry
so dont install it - it will not let you!!!!
prit
Appendix E
Sample of the Customers.data file records
(Please note that this file has the same structure with Customers.test)
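The records in this sample are comma-separated, use `?` for unknown values, and put a backslash before characters such as `:` and `.` that would otherwise have a special meaning. The sketch below shows one way such a record could be split into fields. It is an illustration written for this appendix rather than part of See5, the function names are hypothetical, and it assumes that each record has first been rejoined onto a single line.

```python
def split_record(line):
    """Split a record on unescaped commas and drop the escaping backslashes."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            current.append(ch)   # keep the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True       # the next character is not a separator
        elif ch == ",":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_value(field):
    """Map the unknown-value marker to None and leave other values as text."""
    return None if field == "?" else field


# A shortened record in the same style as the sample (illustrative only).
sample = r"?,NULL,2000-02-22,17\:23\:20,main/home\.jhtml,True"
print([parse_value(f) for f in split_record(sample)])
# prints [None, 'NULL', '2000-02-22', '17:23:20', 'main/home.jhtml', 'True']
```

NULL is kept as an ordinary value here because the evaluation output treats it as a class in its own right.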
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-22,17\:23\:20,2000-0222,17\:23\:20,?,177035,64671,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 98;
DigExt),1,47,NULL,http\://womencentral\.msn\.com/women/today/default\.asp,main/home\.jhtml,1389,Tuesd
ay,17,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0222,17\:23\:20,47.0,47,47,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,NULL,main/home\.jhtml,Tuesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N
ULL,/Content/templates/main,17,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,main/home\.jhtml,main/home\.jhtml,Other,(14 \.\.\. 17]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-28,01\:32\:44,2000-0228,01\:32\:44,?,281054,100697,?,Mozilla/4\.0 (compatible; MSIE 5\.0; MSNIA; AOL 5\.0; Windows 98;
DigExt),1,47,NULL,http\://search\.yahoo\.com/search?p=thong&hc=2&hs=50&h=s&b=19,main/home\.jhtml
,1389,Monday,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0228,01\:32\:44,47.0,47,47,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,NULL,main/home\.jhtml,Monday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N
ULL,/Content/templates/main,1,0.0,?,AOL,AOL Windows
4\.0,AOL,main/home\.jhtml,main/home\.jhtml,Other,(\.\.\. 5]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-21,11\:04\:07,2000-0222,17\:10\:47,?,152324,64560,?,Mozilla/4\.0 (compatible; MSIE 5\.01; Windows NT
5\.0),5,312,NULL,NULL,main/home\.jhtml,1389,Tuesday,17,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0222,17\:10\:47,312.0,312,312,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,NULL,main/home\.jhtml,Tuesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder
,NULL,/Content/templates/main,17,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,main/home\.jhtml,main/home\.jhtml,Other,(14 \.\.\. 17]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-02,14\:20\:46,2000-0302,14\:20\:46,?,386018,139339,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 98;
DigExt),1,62,NULL,NULL,main/home\.jhtml,1389,Thursday,14,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fal
se,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0302,14\:20\:46,62.0,62,62,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,NULL,main/home\.jhtml,Thursday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU
LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,
NULL,/Content/templates/main,14,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,main/home\.jhtml,main/home\.jhtml,Other,(12 \.\.\. 14]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-20,11\:14\:43,2000-0320,11\:14\:43,?,870200,311626,?,Mozilla/4\.0 (compatible; MSIE 4\.01; Windows
NT),1,79,NULL,NULL,main/home\.jhtml,1389,Monday,11,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0320,11\:14\:43,79.0,79,79,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,NULL,main/home\.jhtml,Monday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N
ULL,/Content/templates/main,11,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,main/home\.jhtml,main/home\.jhtml,Other,(9 \.\.\. 12]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-18,20\:14\:19,2000-0318,20\:14\:19,?,824654,295799,?,Mozilla/4\.0 (compatible; MSIE 5\.0; AOL 5\.0; Windows 98;
DigExt),1,47,NULL,NULL,main/home\.jhtml,1389,Saturday,20,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fals
e,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0318,20\:14\:19,47.0,47,47,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,NULL,main/home\.jhtml,Saturday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N
ULL,/Content/templates/main,20,0.0,?,AOL,AOL Windows
4\.0,AOL,main/home\.jhtml,main/home\.jhtml,Other,(17 \.\.\. 22]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-15,10\:02\:38,2000-0216,16\:37\:39,?,60257,29520,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 98;
DigExt),3,375,NULL,NULL,main/home\.jhtml,1389,Wednesday,16,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0216,16\:37\:39,375.0,375,375,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,NULL,main/home\.jhtml,Wednesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandO
rder,NULL,/Content/templates/main,16,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,main/home\.jhtml,main/home\.jhtml,Other,(14 \.\.\. 17]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-20,18\:20\:34,2000-0320,18\:20\:34,?,882455,316085,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 95;
DigExt),1,62,NULL,http\://www\.winniecooper\.com/cpagesal/gazelle\.html,main/home\.jhtml,1389,Monday,18,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,False,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,2000-0320,18\:20\:58,1976.0,3952,3890,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,2,0,0,0,0,1,0,0,1,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,FOLDER%3C%3Efolder_id=8703&ASSORTMENT%3C%3East_id=8683&bmUID=953
605234769&WebLogicSession=ONbccjVbove2FgbgHHm2178E1woDG11D3qeF36XzpiKsx10g5o1ljtmhJT
mjcsoPPFJoNqUbjW83\|2030677379036461453/174787178/5/7005/7005/7002/7002/1,main/departments\.jhtml,Monday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/Departments,/Assortmen
ts/Main/Departments/A1_Hosiery,/Content/templates/main,18,24.0,24.0,Internet Explorer,Internet Explorer
Windows 4\.0,Internet Explorer,main/home\.jhtml,main/departments\.jhtml,http\://www\.winniecooper\.com,(17 \.\.\. 22]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-09,19\:23\:00,2000-0313,06\:38\:38,?,590681,239520,?,Mozilla/4\.5 [en] (Win98;
I),7,156,FOLDER%3C%3Efolder_id=45687&ASSORTMENT%3C%3East_id=8687&bmUID=9529390801
69&WebLogicSession=OMyySOJrRKcOI2jtPoy9t2HduN6YovoQMxgLRICUpIKxIcZ84JWOcJjnjbRHhXK
cByjgJT1KrtU3\|7918625116450430875/174787179/5/7005/7005/7002/7002/1&asstListPath=/Assortments/Main/B,http\://www\.gazelle\.com/main/home\.jhtml,main/assortment2\.jhtml,4
5891,Monday,6,3,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,2000-0313,06\:40\:25,213.66666666666666,641,32,3.0,3,?,?,?,?,?,?,?,0.5,0.5,?,?,?,?,1.0,1,?,9.75,9.75,?,10.0,10.0,?,3.5
,3.5,?,7.0,7.0,?,3.0,3,?,2,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,PRODUCT%3C%3Eprd_id=33449&FOLDER%
3C%3Efolder_id=45687&ASSORTMENT%3C%3East_id=8687&bmUID=952958374403,main/shopping_ca
rt\.jhtml,Monday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/Brands,/Assortments/Main/Brands/Legwear
,/Content/templates/main,6,107.0,53.5,Netscape,Netscape
4\.5,Netscape,Other,Other,http\://www\.gazelle\.com,(5 \.\.\. 9]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-05,15\:32\:26,2000-0305,15\:32\:26,?,472304,169949,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 95;
DigExt),1,63,NULL,NULL,main/home\.jhtml,1389,Sunday,15,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0305,15\:32\:26,63.0,63,63,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,NULL,main/home\.jhtml,Sunday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,NU
LL,/Content/templates/main,15,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,main/home\.jhtml,main/home\.jhtml,Other,(14 \.\.\. 17]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-22,08\:49\:15,2000-0322,08\:49\:15,?,923984,330841,?,Mozilla/4\.0 (compatible; MSIE 5\.5; Windows
98),1,31,ASSORTMENT%3C%3East_id=8687&bmUID=952973502998&WebLogicSession=OM04vug11m
0U1WHs6JkTU0QcaDjjvc1ej1tS1Gwy5ZIVPeQmfik3U3EAEGH2pbAwnM1NJiwCsmU3\|7918625116450
430875/174787179/5/7005/7005/7002/7002/1,http\://beauty\.about\.com/style/beauty/gi/dynamic/offsite\.htm?site=http\://www\.gazelle\.com/main/freegif
t\.jhtml%3FASSORTMENT%253C%253East%5Fid=8687%0D%0A%26bmUID=952973502998%26WebLo
gicSession=OM04vug11m0U1WHs6JkTU0QcaDjjvc1ej1tS1Gwy%0D%0A5ZIVPeQmfik3U3EAEGH2,mai
n/freegift\.jhtml,21189,Wednesday,8,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,0,1,0,0,0,0
,0,0,2000-0322,08\:49\:15,31.0,31,31,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,ASSORTMENT%3C%3East_id=8687&bmUID=952973502998&WebLogicSession=OM04vug1
1m0U1WHs6JkTU0QcaDjjvc1ej1tS1Gwy5ZIVPeQmfik3U3EAEGH2pbAwnM1NJiwCsmU3\|79186251164
50430875/174787179/5/7005/7005/7002/7002/1,main/freegift\.jhtml,Wednesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/UniqueBoutiques,NULL,/
Content/templates/main,8,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,Other,Other,Other,(5 \.\.\. 9]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-24,06\:07\:42,2000-0224,06\:07\:42,?,207398,75201,?,Mozilla/4\.0 (compatible; MSIE 5\.0; AOL 5\.0; Windows 98;
DigExt),1,47,NULL,NULL,main/home\.jhtml,1389,Thursday,6,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fals
e,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,2000-0224,06\:08\:23,55.0,110,63,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,2,0,0,0,0,1,0,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,FOLDER%3C%3Efolder_id=8739&ASSORTMENT%3C%3East_id=8687&bmUID=951401262
672&WebLogicSession=OLU7LpR58bM4A0tHfwfg7KR8MdIBTFhzww0dwx93jIcekkzxkyO2KH0hKvKd
mBXwj0YcjDZwhRQ3\|-5383655529336981119/174787178/5/7005/7005/7002/7002/1,main/boutique\.jhtml,Thursday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NU
LL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/UniqueBoutiques,/Assortme
nts/Main/UniqueBoutiques/06_dance,/Content/templates/main,6,41.0,41.0,AOL,AOL Windows
4\.0,AOL,main/home\.jhtml,main/boutique\.jhtml,Other,(5 \.\.\. 9]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-23,19\:28\:38,2000-0323,19\:28\:38,?,960503,343960,?,Mozilla/5\.0 (compatible; MSIE
5\.0),1,140,ASSORTMENT%3C%3East_id=8687&bmUID=953865993027&WebLogicSession=ONrXCKIa
c2iUoBBtLUWBNh217V51ad5WwqxYoj2w7HrdR1IPWpG4rL2lf9z7A2mS4XIoJRIQFQQ3\|7863941279941471645/174787179/5/7005/7005/7002/7002/1&ls=Family,http\://www\.gazelle\.com/main/home\.jhtml,main/lifestyles\.jhtml,8171,Thursday,19,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2000-0323,19\:28\:38,140.0,140,140,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,0,0,0,0,0,0,0,1
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,ASSORTMENT%3C%3East_id=8687&bmUID=953865993027&WebLogicSession=ONrXC
KIac2iUoBBtLUWBNh217V51ad5WwqxYoj2w7HrdR1IPWpG4rL2lf9z7A2mS4XIoJRIQFQQ3\|7863941279941471645/174787179/5/7005/7005/7002/7002/1&ls=Family,main/lifestyles\.jhtml,Thursday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/UniqueBoutiqu
es,NULL,/Content/templates/main,19,0.0,?,Internet Explorer,Internet Explorer 5\.0,Internet
Explorer,Other,Other,http\://www\.gazelle\.com,(17 \.\.\. 22]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-03,20\:30\:18,2000-0303,20\:30\:18,?,424859,153368,?,Mozilla/4\.0 (compatible; MSIE 4\.5;
Mac_PowerPC),1,47,NULL,http\://www\.flamingoworld\.com/,main/home\.jhtml,1389,Friday,20,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0303,20\:30\:18,47.0,47,47,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,NULL,main/home\.jhtml,Friday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,NU
LL,/Content/templates/main,20,0.0,?,Internet Explorer,Internet Explorer Mac 4\.0,Internet
Explorer,main/home\.jhtml,main/home\.jhtml,Other,(17 \.\.\. 22]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-28,18\:03\:25,2000-0328,18\:03\:25,?,1074530,384447,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows NT;
DigExt),1,62,NULL,http\://stores\.shopnow\.com/cgibin/visitstore\.cgi?lid=9432355,main/home\.jhtml,1389,Tuesday,18,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
False,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,2000-0328,18\:04\:22,93.66666666666667,281,188,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,3,0,0,0,0
,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,FOLDER%3C%3Efolder_id=8735&ASSORTMENT%3C%3East_id=8687&b
mUID=954295443812,main/boutique\.jhtml,Tuesday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/Unique
Boutiques,/Assortments/Main/UniqueBoutiques/05_maternity,/Content/templates/main,18,57.0,28.5,Internet
Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,main/home\.jhtml,main/boutique\.jhtml,http\://stores\.shopnow\.com,(17 \.\.\. 22]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-03-18,10\:20\:22,2000-0318,10\:20\:22,?,814958,292427,?,Mozilla/4\.0 (compatible; MSIE 5\.01; Windows
98),1,31,NULL,http\://chain\.station\.sony\.com/chain/reaction/html/interstitial0\.html,main/home\.jhtml,1389
,Saturday,10,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2000-0318,10\:20\:22,31.0,31,31,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,NULL,main/home\.jhtml,Saturday,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NUL
L,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,/Assortments/Main/BrandOrder,N
ULL,/Content/templates/main,10,0.0,?,Internet Explorer,Internet Explorer Windows 4\.0,Internet
Explorer,main/home\.jhtml,main/home\.jhtml,Other,(9 \.\.\. 12]
?,?,?,?,NULL,?,?,?,?,?,?,NULL,NULL,?,?,NULL,NULL,NULL,?,?,?,NULL,?,?,NULL,NULL,NULL,?,?,?,N
ULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,NULL,NULL,NULL,?,NULL,?,
NULL,?,NULL,?,?,?,NULL,NULL,?,NULL,?,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,?,?,?,
?,?,?,?,?,NULL,?,NULL,?,NULL,?,?,?,NULL,NULL,NULL,NULL,2000-02-01,14\:54\:12,2000-0201,14\:54\:12,?,3848,1691,?,Mozilla/4\.0 (compatible; MSIE 5\.0; Windows 98;
The file has been cut for convenience.
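The sample above wraps each record across several printed lines and escapes periods and colons with a backslash. As a rough illustration only (it was not part of the original project), the Python sketch below splits one such record into fields once its wrapped lines have been joined back together. It assumes that commas separate fields, that a backslash escapes the character after it, and that "?" marks an unknown value while NULL is an ordinary value, as the names file in Appendix F suggests. The function name parse_record is hypothetical.

```python
import re

# Split on commas that are not preceded by a backslash escape.
_FIELD_SEP = re.compile(r"(?<!\\),")
# Matches a backslash followed by any character, e.g. "\." or "\:".
_ESCAPE = re.compile(r"\\(.)")


def parse_record(line):
    """Split one logical .data record (already joined into one line) into fields."""
    fields = []
    for raw in _FIELD_SEP.split(line.strip()):
        value = _ESCAPE.sub(r"\1", raw.strip())  # drop the escaping backslash
        # Assumption: "?" means unknown; the string NULL is kept as a normal value.
        fields.append(None if value == "?" else value)
    return fields


if __name__ == "__main__":
    sample = r"?,NULL,2000-03-20,18\:20\:34,main/home\.jhtml,(14 \.\.\. 17]"
    print(parse_record(sample))
```

The sketch does not handle an escaped backslash immediately before a comma, which does not occur in the sample shown.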
Appendix F
Sample of the Customers.names file attributes
| The content of this file and the matching data file are confidential
| to Blue Martini Software and Gazelle.com
| Right to use is restricted by the NDA agreement for the KDD-CUP 2000
| For questions, e-mail [email protected]
|
| C5.0 names file
|
Retail Activity. | target column
WhichDoYouWearMostFrequent: trouser socks,hosiery,casual socks,athletic socks.
YourFavoriteLegcareBrand: eShave,Lucky Chick,VeinGard,Tweezerman,Swiss
Balance,Ovacion,Covermark,Ya Babi,Medea,Dr\. Perricone,TendSkin,IColoniali,Dr\. Schon,Nature
Made,Bio Depiless,Conair,Natural Line,Kleb Sole,Living
Earth,Cellasene,DailyHerbs,Kneipp,Juvena,Epilady.
Registration Gender: Female,Male.
NumberOfChildren: [ordered] 0, 1, 2, 3, 4 or more.
DoYouPurchaseForOthers: NULL,False.
HowDoYouDressForWork: business dress,business casual,comfortable /athletic,very casual.
HowManyPairsDoYouPurchase: [ordered] 1 to 5, 6 to 10, 11 to 15, 15 or more.
YourFavoriteLegwearBrand: Ellen Tracy,Oroblu,Belly Basics,American Essentials,Danskin,Hot Sox,Donna
Karan,Evan Picone,Givenchy,Greg Norman,DKNY,Hanes,Round the Clock,Falke,Berkshire.
WhoMakesPurchasesForYou: parent,spouse,siblings,friend.
NumberOfAdults: [ordered] 0, 1, 2, 3 or more.
HowDidYouHearAboutUs: in the news,e-mail,print ad,direct mail,other,friend / family.
Company: COMPANY 2605,COMPANY 416,COMPANY 2975,COMPANY 3051,COMPANY
780,COMPANY 3055,COMPANY 1608,COMPANY 3059,COMPANY 2575,COMPANY
2330,COMPANY 1169,COMPANY 1207,COMPANY 1209,COMPANY 2982,COMPANY
1571,COMPANY 1733,COMPANY 1737,COMPANY 1455,COMPANY 796,COMPANY 673,COMPANY
713,COMPANY 395,COMPANY 2183,COMPANY 2870,COMPANY 3070,COMPANY 1587,COMPANY
1626,COMPANY 842,COMPANY 2479,COMPANY 445,COMPANY 2235,COMPANY 2882,COMPANY
2922,COMPANY 3089,COMPANY 613,COMPANY 452,COMPANY 615,COMPANY 617,COMPANY
1085,COMPANY 1086,COMPANY 2092,COMPANY 503,COMPANY 1932,COMPANY
2258,COMPANY 1935,COMPANY 1658,COMPANY 594,COMPANY 1258,COMPANY
1099,COMPANY 2262,COMPANY 1944,COMPANY 519,COMPANY 920,COMPANY 3032,COMPANY
3036,COMPANY 2678,COMPANY 482,COMPANY 2273,COMPANY 1792,COMPANY 402,COMPANY
409,NULL,COMPANY 2963,COMPANY 2682,COMPANY 1273,COMPANY 933.
SendEmail: NULL,True,False.
HowOftenDoYouPurchase: [ordered] once a year, every 6 months, each week.
HowDidYouFindUs: Web/Banner Ad,News Story,Search Engine,Friend/Co-worker,Magazine Ad,Other.
City: ignore.
Country: United States,NULL.
US State:
IA,ID,IL,IN,AK,AL,AR,RI,AZ,SC,CA,KS,CO,KY,CT,LA,TN,DC,DE,TX,MA,MD,ME,MI,UT,MN,MO,MS
,MT,VA,NC,ND,NE,NH,NJ,VT,NM,FL,NV,WA,NY,WI,OH,GA,OK,WV,WY,OR,PA,HI.
Account Creation Date: date.
Account Creation Date_Time: time.
Year of Birth: continuous.
Email: NET,GOV,COM,EDU,ORG,Gazelle,Other,NULL.
Login Failure Count: continuous.
Customer ID: continuous.
Truck Owner: NULL,True,False.
RV Owner: NULL,True,False.
Motorcycle Owner: NULL,True,False.
Value Of All Vehicles: continuous.
Age: continuous.
Other Indiv\. Age: continuous.
Marital Status: Inferred Married,Single,Inferred Single,Married,NULL.
Working Woman: NULL,True,False.
Mail Responder: NULL,True,False.
Bank Card Holder: NULL,True,False.
Gas Card Holder: NULL,True,False.
Upscale Card Holder: NULL,True,False.
Unknown Card Type: NULL,True,False.
TE Card Holder: NULL,True,False.
Premium Card Holder: NULL,True,False.
Presence Of Children: NULL,True,False.
Number Of Adults: continuous.
Estimated Income Code: [ordered] Under $15;000, $15;000-$19;999, $20;000-$29;999, $30;000-$39;999,
$40;000-$49;999, $50;000-$74;999, $75;000-$99;999, $100;000-$124;999, $125;000 OR MORE.
Home Market Value: [ordered] $1;000-$24;999, $25;000-$49;999, $50;000-$74;999, $75;000-$99;999,
$100;000-$124;999, $125;000-$149;999, $150;000-$174;999, $175;000-$199;999, $200;000-$224;999,
$225;000-$249;999, $250;000-$274;999, $275;000-$299;999, $300;000-$349;999, $350;000-$399;999,
$400;000-$449;999, $450;000-$499;999, $500;000-$774;999, $775;000-$999;999, $1;000;000+.
New Car Buyer: NULL,True.
Vehicle Lifestyle: FULL SIZE (STANDARD/LUXURY),IMPORT
(STANDARD/ECONOMY),SPECIALTY (MIDSIZE/SMALL),PERSONAL LUXURY CAR,STATION
WAGON,REGULAR (MIDSIZE/SMALL),TRUCK OR UTILITY VEHICLE,NULL.
Property Type: apartment(5+ units),2-4 unit(duplex;triplex;quad),mobile_home,misc\. residential (condo
store/flat),single family dwelling,NULL,condo.
Loan To Value Percent: [ordered] 0% (NO LOANS), 01-49%, 50-59%, 60-69%, 70-74%, 75-79%, 80-84%,
85-89%, 90-94%, 95-99%, 100-99%.
Presence Of Pool: NULL,True,False.
Year House Was Built: continuous.
Own Or Rent Home: Owner,NULL,Renter.
Length Of Residence: continuous.
Mail Order Buyer: NULL,True,False.
Year Home Was Bought: continuous.
Home Purchase Date: continuous.
Number Of Vehicles: continuous.
DMA No Mail Solicitation Flag: NULL,True.
DMA No Phone Solicitation Flag: NULL,True.
CRA Income Classification: continuous.
New Bank Card: NULL,True,False.
Number Of Credit Lines: continuous.
Speciality Store Retail: NULL,True,False.
Oil Retail Activity: NULL,True,False.
Bank Retail Activity: NULL,True,False.
Finance Retail Activity: NULL,True,False.
Miscellaneous Retail Activity: NULL,True,False.
Upscale Retail: NULL,True,False.
Upscale Speciality Retail: NULL,True,False.
Retail Activity: NULL,True,False.
Last Retail Date: date.
Last Retail Date_Time: time.
Dwelling Size: [ordered] SINGLE HOUSEHOLD, 2 HOUSEHOLDS, 3 HOUSEHOLDS, 4 HOUSEHOLDS,
5 HOUSEHOLDS, 6 HOUSEHOLDS, 7 HOUSEHOLDS, 8 HOUSEHOLDS, 9 HOUSEHOLDS, 10-19
HOUSEHOLDS, 20-29 HOUSEHOLDS, 30-39 HOUSEHOLDS, 40-49 HOUSEHOLDS, 50-99
HOUSEHOLDS, 100+ HOUSEHOLDS.
Dataquick Market Code: continuous.
BrandName Last: BB,Silk Reflections,DAN,ELT,Absolutely Ultra Sheer,ORO,AME,DKNY,Hanes
Too,EVP,NM,HOSO,NULL.
StockType Last: Seasonal 1*,Seasonal 1,Seasonal 2,Replenishable,NULL.
Look Last: Sheer,NULL,Ultra Sheer.
BasicOrFashion Last: Fashion,NULL,Basic.
HasDressingRoom Last: NULL,True,False.
Texture Last: Flat,Textured,NULL.
ToeFeature Last: RT,SF,NULL.
Material Last: Nylon,Cotton,Lycra,NULL.
WaistControl Last: CT,NULL.
Collection Last: Specialty Items,Childrens Dance,Oroblu Fashion Line,Conversationals,Hanes Plus
Collection,Conversational Classics,Athletics,Men's Essential Sport,Occasions Collection,Pregnancy Survival
Kit,DKNY Basic Trouser Socks,Men's Patterns and Textures,Beyond Bare Collection,Spring/Summer
2000,Teddy Hose,NULL.
Audience Last: Men,Children,Women,NULL.
Pattern Last: Conversational,Solid,NULL.
Product Level 1 Path Last: /Products/Legwear,/Products/LegCare,NULL.
Product Level 2 Path Last:
/Products/LegCare/SwissBalance,/Products/Legwear/Danskin,/Products/LegCare/VeinGard,/Products/Legwe
ar/EllenTracy,/Products/Legwear/Oroblu,/Products/Legwear/NicoleMiller,/Products/Legwear/DKNY,/Produc
ts/Legwear/HotSox,/Products/Legwear/Hanes,/Products/Legwear/BellyBasics,/Products/Legwear/AmericanE
ssentials,/Products/Legwear/EvanPicone,NULL.
Assortment Level 2 Path Last:
/Assortments/Main/UniqueBoutiques,/Assortments/Main/Brands,/Assortments/Main/SaleAssortments,/Assort
ments/Main/Welcome,/Assortments/Main/BrandOrder,/Assortments/Main/Departments,/Assortments/Main/L
ifeStyles,/Assortments/Main/Seasonal,NULL.
Assortment Level 3 Path Last:
/Assortments/Main/Departments/A2_Socks,/Assortments/Main/UniqueBoutiques/05_maternity,/Assortments/
Main/Welcome/Legwear,/Assortments/Main/Brands/LegCare,/Assortments/Main/UniqueBoutiques/04_eveni
ngs,/Assortments/Main/UniqueBoutiques/08_Seasonal,/Assortments/Main/LifeStyles/family,/Assortments/M
ain/UniqueBoutiques/02_men,/Assortments/Main/UniqueBoutiques/07_gifts,/Assortments/Main/Departments
/A1_Hosiery,/Assortments/Main/Seasonal/Spring2000,/Assortments/Main/LifeStyles/InStyle,/Assortments/M
ain/UniqueBoutiques/03_kids,/Assortments/Main/UniqueBoutiques/01_PlusSizes,/Assortments/Main/Unique
Boutiques/06_dance,/Assortments/Main/Departments/A4_Legcare,/Assortments/Main/LifeStyles/LegCare,/A
ssortments/Main/LifeStyles/Work,/Assortments/Main/Departments/A3_Bodywear,/Assortments/Main/Welco
me/Legcare,/Assortments/Main/SaleAssortments/WinterSale2000,/Assortments/Main/Brands/Legwear,/Assor
tments/Main/LifeStyles/Sport,NULL.
Content Level 2 Path Last:
/Content/templates/articles,/Content/templates/Replenish,/Content/templates/checkout,/Content/templates/mai
n,/Content/templates/products,/Content/templates/account,NULL.
Session Last Request Hour Of Day: continuous.
Session Request Count Average: continuous.
Session Request Count Sum: continuous.
Session Time Elapsed Average: continuous.
Session Time Elapsed Sum: continuous.
Average Time Each Page View: continuous.
Num Sessions: continuous.
First Session First Referrer: ignore.
First Session First Request Day of Week: Thursday,Saturday,Tuesday,Sunday,Wednesday,Friday,Monday.
First Session Browser Family: WebTV,Netscape,AOL,Internet Explorer,Other.
First Session Browser: Internet Explorer Windows 4\.0,WebTV 1\.2,AOL Windows 2\.0,Internet Explorer
Mac 4\.0,Netscape 4\.51,Internet Explorer MSN 4\.0,Netscape Mac 4\.51,AOL Mac 4\.0,Netscape
4\.61,Internet Explorer+ 4\.0,Netscape Mac 4\.61,Netscape 4\.71,Netscape 4\.72,Netscape Mac 4\.72,Internet
Explorer++ 4\.0,AOL Mac 3\.0,Internet Explorer Windows 2\.0,Netscape 4\.02,Netscape 4\.03,Netscape
4\.04,Netscape 4\.5,Netscape 4\.05,Netscape 4\.6,Netscape 4\.06,Netscape 4\.7,Netscape 4\.07,Netscape
4\.08,AOL Windows 4\.0,Netscape Mac 4\.04,Netscape Mac 4\.05,Netscape Mac 4\.06,Netscape Mac
4\.08,AOL Mac 2\.0,Netscape Mac 4\.5,Netscape Mac 4\.6,Netscape Mac 4\.7,Other.
First Session Browser Family Top 3: Netscape,AOL,Internet Explorer,Other.
First Session First Template Top 5:
main/departments\.jhtml,main/vendor\.jhtml,main/boutique\.jhtml,products/productDetailLegwear\.jhtml,mai
n/home\.jhtml,Other.
First Session First Referrer Top 5:
http\://www\.mycoupons\.com,http\://www\.fashionmall\.com,http\://www\.gazelle\.com,http\://stores\.shopno
w\.com,Other.
Session First Processing Time: continuous.
Session First Request Hour of Day: continuous.
Last Session Average Time Per Page View: continuous.
Session First Query String: ignore.
Session First Referrer: ignore.
Session First Template:
main/departments\.jhtml,main/freegift\.jhtml,main/replenishment\.jhtml,main/legcare_vendor\.jhtml,main/ven
dor\.jhtml,main/registration_shipaddress\.jhtml,main/login2\.jhtml,main/vendor2\.jhtml,main/registration\.jht
ml,main/assortment\.jhtml,.jhtml,account/your_account\.jhtml,articles/new_shipping\.jhtml,main/welcome\.jh
tml,main/boutique\.jhtml,main/shopping_cart\.jhtml,main/assortment2\.jhtml,products/productDetailLegwear\
.jhtml,main/home\.jhtml,articles/dpt_about\.jhtml,main/leg_news_healthwellness\.jhtml.
Session Browser Family: WebTV,Netscape,AOL,Internet Explorer,Other.
Session Browser: Internet Explorer Windows 4\.0,WebTV 1\.2,AOL Windows 2\.0,Internet Explorer Mac
4\.0,Netscape 4\.51,Internet Explorer MSN 4\.0,Netscape Mac 4\.51,AOL Mac 4\.0,Netscape 4\.61,Internet
Explorer+ 4\.0,Netscape Mac 4\.61,Netscape 4\.71,Netscape 4\.72,Netscape Mac 4\.72,Internet Explorer++
4\.0,AOL Mac 3\.0,Internet Explorer Windows 2\.0,Netscape 4\.02,Netscape 4\.03,Netscape 4\.04,Netscape
4\.5,Netscape 4\.05,Netscape 4\.6,Netscape 4\.06,Netscape 4\.7,Netscape 4\.07,Netscape 4\.08,AOL Windows
4\.0,Netscape Mac 4\.04,Netscape Mac 4\.05,Netscape Mac 4\.06,Netscape Mac 4\.08,AOL Mac
2\.0,Netscape Mac 4\.5,Netscape Mac
Session Browser Family Top 3: Netscape,AOL,Internet Explorer,Other.
Session First Template Top 5:
main/departments\.jhtml,main/vendor\.jhtml,main/boutique\.jhtml,products/productDetailLegwear\.jhtml,mai
n/home\.jhtml,Other.
Session First Referrer Top 5:
http\://www\.mycoupons\.com,http\://www\.fashionmall\.com,http\://www\.gazelle\.com,http\://stores\.shopno
w\.com,Other.
Session First Request Hour of Day Bin: [ordered] (\.\.\. 5], (5 \.\.\. 9], (9 \.\.\. 12], (12 \.\.\. 14], (14 \.\.\. 17],
(17 \.\.\. 22], (22 \.\.\.).
Last Session User Agent: ignore.
First Session First Request Date: date.
First Session First Request Date_Time: time.
Last Session Last Request Date: date.
Last Session Last Request Date_Time: time.
Last Session Visit Count: continuous.
Session Last Template Top 5:
main/departments\.jhtml,main/vendor\.jhtml,main/boutique\.jhtml,products/productDetailLegwear\.jhtml,mai
n/home\.jhtml,Other.
The file has been cut for convenience.
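Each entry in the names file above has the form "attribute name: specification." where the specification is either a keyword (continuous, date, time, ignore) or a comma-separated list of values, optionally marked [ordered]. The following Python sketch shows one way such an entry could be read. It is an assumption-laden illustration rather than the parser See5 itself uses: the function name parse_names_entry is hypothetical, an entry wrapped over several printed lines must be joined into one string first, and comment lines starting with "|" and the target declaration on the first line are not handled.

```python
import re

_KEYWORDS = ("continuous", "date", "time", "ignore")


def _unescape(text):
    """Remove the backslash from escaped characters such as '\\.' and '\\:'."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_names_entry(entry):
    """Return (attribute_name, kind, values) for one 'name: spec.' entry."""
    # Split on the first colon that is not escaped with a backslash.
    name, spec = re.split(r"(?<!\\):", entry.strip(), maxsplit=1)
    # Drop the terminating period unless it is an escaped one.
    spec = re.sub(r"(?<!\\)\.$", "", spec.strip())
    name = _unescape(name.strip())
    if spec in _KEYWORDS:
        return name, spec, []
    kind = "discrete"
    if spec.startswith("[ordered]"):
        kind = "ordered"
        spec = spec[len("[ordered]"):]
    values = [_unescape(v.strip()) for v in re.split(r"(?<!\\),", spec)]
    return name, kind, values


if __name__ == "__main__":
    print(parse_names_entry("NumberOfChildren: [ordered] 0, 1, 2, 3, 4 or more."))
    print(parse_names_entry(r"Other Indiv\. Age: continuous."))
```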
Appendix G
Questions/Hypotheses for Data Mining Project
Question 1.
Which website refers the most customers to our website?
See5 [Release 1.15] Tue Mar 26 13:26:38 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `Session First Referrer Top 5'
*** line 19999 of `Customers.data': unexpected end of file
Read 19998 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
Num BrandOrder Assortment Views <= 0:
:...Num main/vendor2 Template Views > 0:
:   :...Num HotSox Product Views > 0: http://www.mycoupons.com (2/1)
:   :   Num HotSox Product Views <= 0:
:   :   :...Session ID <= 286453:
:   :       :...Num Danskin Product Views <= 0: Other (31/12)
:   :       :   Num Danskin Product Views > 0: http://www.gazelle.com (3)
:   :       Session ID > 286453:
:   :       :...Num Solid Pattern Views > 1: Other (2)
:   :           Num Solid Pattern Views <= 1: [S1]
:   Num main/vendor2 Template Views <= 0:
:   :...Session Browser Family in Google,AltaVista,Novell Border Manager,
:       :                         link-check,EmailSiphon,Java,Genie,Lycos,
:       :                         Lotus Notes,InfoSeek,Northern Light,
:       :                         Cute FTP,Mozilla: Other (0)
:       Session Browser Family = Teleport Pro: http://www.gazelle.com (39)
:       Session Browser Family = PerMan Surfer: Other (6)
:       Session Browser Family = WebTV: http://www.gazelle.com (4/1)
:       Session Browser Family = WebTrends: Other (175)
:       Session Browser Family = Enfish Tracker: Other (2)
:       Session Browser Family = ergyBot: Other (4)
:       Session Browser Family = Lesszilla: Other (2)
:       Session Browser Family = Unknown: Other (140)
:       Session Browser Family = Nitro e-mail collector: Other (764)
:       Session Browser Family = AOL: [S2]
:       Session Browser Family = Other:
:       :...Texture Last = Textured: Other (0)
:       :   Texture Last = Flat: http://www.mycoupons.com (2/1)
:       :   Texture Last = NULL:
:       :   :...Num Nylon Product Views > 0: http://www.gazelle.com (3)
:       :       Num Nylon Product Views <= 0: [S3]
:       Session Browser Family = Netscape:
:       :...Num articles/dpt_about_mgmtteam Template Views > 0: Other (22)
:       :   Num articles/dpt_about_mgmtteam Template Views <= 0:
:       :   :...Num main/freegift Template Views > 0: [S4]
:       :       Num main/freegift Template Views <= 0:
:       :       :...Num EvanPicone Product Views > 0: [S5]
:       :           Num EvanPicone Product Views <= 0:
:       :           :...Num HotSox Product Views > 0: Other (3/1)
:       :               Num HotSox Product Views <= 0:
:       :               :...Num main/login2 Template Views > 0: [S6]
:       :                   Num main/login2 Template Views <= 0: [S7]
:       Session Browser Family = Internet Explorer:
:       :...Session First Request Day of Week in [Monday-Wednesday]:
:           :...Session Visit Count <= 1:
:           :   :...Num WDCS Category Views > 0: [S8]
:           :   :   Num WDCS Category Views <= 0:
:           :   :   :...Num Brands Assortment Views > 1: [S9]
:           :   :       Num Brands Assortment Views <= 1:
:           :   :       :...Num main/lifestyles Template Views > 0: [S10]
:           :   :           Num main/lifestyles Template Views <= 0:
:           :   :           :...Cookie First Visit Date <= 2000/02/29:
:           :   :               :...Session First Content ID > 11337: Other (11)
:           :   :               :   Session First Content ID <= 11337: [S11]
:           :   :               Cookie First Visit Date > 2000/02/29: [S12]
:           :   Session Visit Count > 1: [S13]
:           Session First Request Day of Week in [Thursday-Sunday]: [S14]
Num BrandOrder Assortment Views > 0:
:...Session Cookie ID > 847463:
    :...Session First Request Date > 2000/03/26:
    :   :...Session Browser Family in Google,Teleport Pro,AltaVista,
    :   :   :                         PerMan Surfer,link-check,EmailSiphon,
    :   :   :                         Java,Genie,WebTrends,Enfish Tracker,
    :   :   :                         ergyBot,Lotus Notes,InfoSeek,
    :   :   :                         Northern Light,
    :   :   :                         Cute FTP: Other (0)
    :   :   Session Browser Family = Novell Border Manager: Other (2)
    :   :   Session Browser Family = Lycos: Other (1)
    :   :   Session Browser Family = Lesszilla: Other (1)
    :   :   Session Browser Family = Netscape: Other (265/10)
    :   :   Session Browser Family = Unknown: Other (28)
    :   :   Session Browser Family = Nitro e-mail collector: Other (1)
    :   :   Session Browser Family = AOL: Other (97/23)
    :   :   Session Browser Family = Mozilla: Other (8)
    :   :   Session Browser Family = Other: Other (73/4)
    :   :   Session Browser Family = WebTV: [S15]
    :   :   Session Browser Family = Internet Explorer:
    :   :   :...Session Visit Count > 16: http://stores.shopnow.com (13/2)
    :   :       Session Visit Count <= 16:
    :   :       :...Num BrandOrder Assortment Views > 2: [S16]
    :   :           Num BrandOrder Assortment Views <= 2:
    :   :           :...Num products Template Views > 0: Other (61/12)
    :   :               Num products Template Views <= 0:
    :   :               :...Session First Request Date > 2000/03/28: [S17]
    :   :                   Session First Request Date <= 2000/03/28:
    :   :                   :...Cookie First Visit Date_Time <= 00:31:02: [S18]
    :   :                       Cookie First Visit Date_Time > 00:31:02:
    :   :                       :...Session Visit Count > 2: Other (5)
    :   :                           Session Visit Count <= 2:
    :   :                           :...Session Request Count > 2: Other (42/6)
    :   :                               Session Request Count <= 2: [S19]
    :   Session First Request Date <= 2000/03/26:
    :   :...Num articles Template Views > 0:
    :       :...Num Hanes Product Views <= 0: Other (44)
    :       :   Num Hanes Product Views > 0: http://www.fashionmall.com (2/1)
    :       Num articles Template Views <= 0: [S20]
    Session Cookie ID <= 847463:
    :...Session ID > 142138:
Tree has been cut for convenience
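Every leaf in the tree above ends with a count such as (31/12) or (39). In See5 output the first number is normally read as the number of training cases that reach the leaf, and the second, when present, as the number of those cases the leaf misclassifies. The Python sketch below pulls the predicted class and both counts out of a printed leaf line. It is only an illustration: the names parse_leaf and leaf_accuracy are hypothetical, and lines that are internal tests or subtree references such as [S1] are skipped.

```python
import re

# Matches a trailing "(n)" or "(n/m)" count, allowing fractional case counts.
_LEAF_COUNTS = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def parse_leaf(line):
    """Return (predicted_class, cases, errors) for a leaf line, else None."""
    match = _LEAF_COUNTS.search(line)
    if match is None:
        return None  # internal test or a [S1]-style subtree reference
    head = line[:match.start()].rstrip()
    # The class label follows the last ": " separator; URLs contain "://" not ": ".
    label = head.rsplit(": ", 1)[-1]
    cases = float(match.group(1))
    errors = float(match.group(2)) if match.group(2) else 0.0
    return label, cases, errors


def leaf_accuracy(cases, errors):
    """Fraction of training cases at the leaf that it classifies correctly."""
    if cases == 0:
        return None  # no training cases reached this leaf
    return (cases - errors) / cases


if __name__ == "__main__":
    leaf = parse_leaf("Num Danskin Product Views <= 0: Other (31/12)")
    print(leaf, leaf_accuracy(leaf[1], leaf[2]))
```

Read this way, the leaf Other (31/12) covers 31 training cases, of which 19 are classified correctly.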
Question 2.
How do most of our customers find out about us?
See5 [Release 1.15] Tue Mar 26 13:35:19 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `HowDidYouFindUs'
*** ignoring cases with bad or unknown class
*** line 19999 of `Customers.data': unexpected end of file
Read 58 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
Num main/freegift Template Views > 0:
:...Num main/freegift Template Views <= 1: Web/Banner Ad (5/1)
:   Num main/freegift Template Views > 1: Friend/Co-worker (2)
Num main/freegift Template Views <= 0:
:...Num LEO Category Views > 0: Web/Banner Ad (3/2)
    Num LEO Category Views <= 0:
    :...Num Oroblu Product Views <= 0: Friend/Co-worker (40/10)
        Num Oroblu Product Views > 0: Other (3)
Evaluation on hold-out data (5 cases):
        Decision Tree
      ----------------
      Size      Errors
         5    4(80.0%)   <<
[ Fold 1 ]
Decision tree:
Num LEO Category Views > 0: Web/Banner Ad (3/2)
Num LEO Category Views <= 0:
:...Num main/freegift Template Views > 0:
    :...Session Visit Count <= 2: Web/Banner Ad (5/1)
    :   Session Visit Count > 2: Friend/Co-worker (3)
    Num main/freegift Template Views <= 0:
    :...Num Oroblu Product Views > 1: Other (2)
        Num Oroblu Product Views <= 1:
        :...Email in MIL,GOV,ORG,NULL: Friend/Co-worker (0)
            Email = NET: Friend/Co-worker (5)
            Email = EDU: Search Engine (1)
            Email = Gazelle: Friend/Co-worker (9)
            Email = Other: Friend/Co-worker (1)
            Email = COM:
            :...Num CT Waist Control Views > 0: Other (3/1)
                Num CT Waist Control Views <= 0:
                :...Num HotSox Product Views > 0: Other (2)
                    Num HotSox Product Views <= 0:
                    :...Session Time Elapsed <= 44: Other (4)
                        Session Time Elapsed > 44: Friend/Co-worker (15/1)
Evaluation on hold-out data (5 cases):
        Decision Tree
      ----------------
      Size      Errors
        12    4(80.0%)   <<
[ Fold 2 ]
Decision tree:
Num main/freegift Template Views > 0:
:...Session Visit Count <= 2: Web/Banner Ad (5/1)
:   Session Visit Count > 2: Friend/Co-worker (3)
Num main/freegift Template Views <= 0:
:...Session Last Request Processing Time > 6813: Other (4/1)
    Session Last Request Processing Time <= 6813:
    :...Login Failure Count > 4: Other (3.4/0.3)
        Login Failure Count <= 4:
        :...Minority Census Tract = True: Other (1)
            Minority Census Tract = False: Friend/Co-worker (25.7/3.9)
            Minority Census Tract = NULL:
            :...NumberOfChildren = 0: Friend/Co-worker (7.8/0.9)
                NumberOfChildren in [1-4 or more]: Other (2)
Evaluation on hold-out data (6 cases):
        Decision Tree
      ----------------
      Size      Errors
         8    2(33.3%)   <<
[ Fold 3 ]
Decision tree:
Num LEO Category Views > 0: Web/Banner Ad (2/1)
Num LEO Category Views <= 0:
:...Num main/freegift Template Views > 0:
    :...Session Visit Count <= 2: Web/Banner Ad (5/1)
    :   Session Visit Count > 2: Friend/Co-worker (3)
    Num main/freegift Template Views <= 0:
    :...Num Oroblu Product Views > 1: Other (2)
        Num Oroblu Product Views <= 1:
        :...Session Last Request Processing Time > 6813: Search Engine (2/1)
            Session Last Request Processing Time <= 6813:
            :...Login Failure Count <= 4: Friend/Co-worker (33.1/4.9)
                Login Failure Count > 4: Other (4.9/0.8)
Evaluation on hold-out data (6 cases):
        Decision Tree
      ----------------
      Size      Errors
         7    3(50.0%)   <<
[ Fold 4 ]
Decision tree:
Num main/freegift Template Views > 0:
:...New Car Buyer = NULL: Web/Banner Ad (5/1)
: New Car Buyer = True: Friend/Co-worker (2)
Num main/freegift Template Views <= 0:
:...Num LEO Category Views > 0: Web/Banner Ad (3/2)
Num LEO Category Views <= 0:
:...Num Oroblu Product Views <= 1: Friend/Co-worker (40/10)
Num Oroblu Product Views > 1: Other (2)
Evaluation on hold-out data (6 cases):
Decision Tree
----------------
Size      Errors
5         4(66.7%)   <<
Tree has been cut for convenience
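
The per-fold figures above can be pooled into a single error rate. The following is a minimal Python sketch, not part of the See5 output; it assumes only the four hold-out evaluations printed above (5, 6, 6 and 6 cases), so the result is not the full 10-fold figure. The function name `pooled_error_rate` is a hypothetical helper introduced for illustration.

```python
# (hold-out cases, errors) copied from the four evaluations printed above.
# Assumption: the remaining folds were cut from the listing and are not included.
folds = [(5, 4), (6, 2), (6, 3), (6, 4)]


def pooled_error_rate(folds):
    """Return total errors divided by total hold-out cases."""
    total_cases = sum(cases for cases, _ in folds)
    total_errors = sum(errors for _, errors in folds)
    return total_errors / total_cases


rate = pooled_error_rate(folds)
print(f"{rate:.1%}")  # 13 errors out of 23 cases
```

Pooling the counts, rather than averaging the four percentages, weights each fold by the number of cases it holds out.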
Question 3.
What gender are most of our customers?
See5 [Release 1.15] Tue Mar 26 13:38:12 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `Gender'
*** line 19999 of `Customers.data': unexpected end of file
Read 19998 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
Other Indiv. Gender = Male:
:...Num MDS Category Views <= 0:
: :...Session Cookie ID <= 737171: Female (99/5)
: : Session Cookie ID > 737171:
: : :...Available Home Equity in [EQUITY $1-$4;999-EQUITY $75;000-$99;999]: Female (3)
: : Available Home Equity in [EQUITY $100;000-$149;999-EQUITY $2;000;000 AND OVER]: Male (2)
: Num MDS Category Views > 0:
: :...Session Browser Family Top 3 = Netscape: Female (0)
: Session Browser Family Top 3 = AOL: Female (2)
: Session Browser Family Top 3 = Other: Female (1)
: Session Browser Family Top 3 = Internet Explorer:
: :...Num EllenTracy Product Views <= 0: Male (3)
: Num EllenTracy Product Views > 0: NULL (2)
Other Indiv. Gender = Female:
:...Session Visit Count > 5: Female (6)
: Session Visit Count <= 5:
: :...Num Fashion Product Views > 1: Female (2)
: Num Fashion Product Views <= 1:
: :...Number Of Adults <= 1: Female (2)
: Number Of Adults > 1:
: :...Cookie First Visit Date <= 2000/03/13: Male (15)
: Cookie First Visit Date > 2000/03/13: Female (2)
Other Indiv. Gender = NULL:
:...Occupation in SELF EMPLOYED MANAGEMENT,MILITARY,SELF EMPLOYED PROF/TECH,
: RETIRED,SELF EMPLOYED SALES/MARKETING,SELF EMPLOYED,
: SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,
: SELF EMPLOYED HOMEMAKER: NULL (0)
Occupation = SALES/SERVICE: Female (1)
Occupation = HOUSEWIFE: Female (1)
Occupation = CRAFTSMAN/BLUE COLLAR: Female (1)
Occupation = STUDENT: Female (3/1)
Occupation = ADMINISTRATIVE/MANAGERIAL: Male (6)
Occupation = OTHER: Female (1)
Occupation = CLERICAL/WHITE COLLAR: Female (2)
Occupation = PROFESSIONAL/TECHNICAL:
:...Working Woman = NULL: Female (0)
: Working Woman = True: Female (7)
: Working Woman = False: Male (2)
Occupation = NULL:
:...Household Status = NAME APPEARING ON INPUT IS INDIVIDUAL 2: NULL (15)
Household Status = NULL: NULL (17768)
Household Status = NAME APPEARING ON INPUT IS INDIVIDUAL 1:
:...Account Creation Date_Time > 19:55:49: Male (7)
Account Creation Date_Time <= 19:55:49: [S1]
SubTree [S1]
Property Type in apartment(5+ units),2-4 unit(duplex;triplex;quad),
: mobile_home,
: misc. residential (condo store/flat): NULL (0)
Property Type = condo: Male (1)
Property Type = single family dwelling:
:...Number Of Adults <= 1: Female (5/1)
: Number Of Adults > 1: NULL (4)
Property Type = NULL:
:...Num AmericanEssentials Product Views > 1: Male (2)
Num AmericanEssentials Product Views <= 1: [S2]
SubTree [S2]
Session First Referrer Top 5 in http://www.fashionmall.com,
: http://stores.shopnow.com,
: http://www.winnie-cooper.com: NULL (0)
Session First Referrer Top 5 = http://www.mycoupons.com: Female (2)
Session First Referrer Top 5 = http://www.gazelle.com:
:...Estimated Income Code in [Under $15;000-$30;000-$39;999]: Male (2/1)
: Estimated Income Code in [$40;000-$49;999-$125;000 OR MORE]: Female (2)
Session First Referrer Top 5 = Other:
:...Num main/login2 Template Views <= 0: NULL (15)
Num main/login2 Template Views > 0:
:...New Car Buyer = NULL: Female (7)
New Car Buyer = True: NULL (6)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
34        5( 0.3%)   <<
[ Fold 1 ]
Decision tree:
Other Indiv. Gender = Female:
:...Marital Status in Inferred Married,Inferred Single: Male (0)
: Marital Status = Married: Male (13/1)
: Marital Status = NULL: Female (9)
: Marital Status = Single:
: :...Num EllenTracy Product Views <= 0: Male (3)
: Num EllenTracy Product Views > 0: Female (2)
Other Indiv. Gender = Male:
:...Num MDS Category Views > 0:
: :...Account Creation Date > 2000/03/09: Female (2)
: : Account Creation Date <= 2000/03/09:
: : :...Num EllenTracy Product Views <= 0: Male (3)
: : Num EllenTracy Product Views > 0: NULL (2)
: Num MDS Category Views <= 0:
: :...DoYouPurchaseForOthers = NULL:
: :...Speciality Store Retail = NULL: Female (0)
: : Speciality Store Retail = True: Male (3)
: : Speciality Store Retail = False: Female (4)
: DoYouPurchaseForOthers = False:
: :...Account Creation Date <= 2000/03/14: Female (92/3)
: Account Creation Date > 2000/03/14:
: :...Other Indiv. Age <= 28: Male (2)
: Other Indiv. Age > 28: Female (4)
Other Indiv. Gender = NULL:
:...Occupation in SELF EMPLOYED MANAGEMENT,MILITARY,SELF EMPLOYED PROF/TECH,
: SELF EMPLOYED SALES/MARKETING,SELF EMPLOYED,
: SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,
: SELF EMPLOYED HOMEMAKER: NULL (0)
Occupation = SALES/SERVICE: Female (1)
Occupation = HOUSEWIFE: Female (2)
Occupation = CRAFTSMAN/BLUE COLLAR: Female (1)
Occupation = STUDENT: Female (3/1)
Occupation = ADMINISTRATIVE/MANAGERIAL: Male (5)
Occupation = RETIRED: Female (1)
Occupation = OTHER: Female (1)
Occupation = CLERICAL/WHITE COLLAR: Female (2)
Occupation = PROFESSIONAL/TECHNICAL:
:...Working Woman = NULL: Female (0)
: Working Woman = True: Female (7)
: Working Woman = False: Male (2)
Occupation = NULL:
:...Household Status = NAME APPEARING ON INPUT IS INDIVIDUAL 2: NULL (16)
Household Status = NULL: NULL (17769)
Tree has been cut for convenience
Question 4.
How many customers wish to receive mail from our company?
See5 [Release 1.15] Tue Mar 26 13:49:10 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `SendEmail'
*** line 19999 of `Customers.data': unexpected end of file
Read 19998 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
DoYouPurchaseForOthers = NULL:
:...Email in MIL,GOV,ORG: True (0)
: Email = NET: False (6/1)
: Email = COM: True (33/8)
: Email = EDU: True (1)
: Email = Gazelle: True (13/2)
: Email = Other: True (2)
: Email = NULL: NULL (17637)
DoYouPurchaseForOthers = False:
:...Account Creation Date <= 2000/03/15: False (278)
Account Creation Date > 2000/03/15:
:...Num Reinforced Toe Views <= 1: True (26/2)
Num Reinforced Toe Views > 1: False (3/1)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
9         1( 0.1%)   <<
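
In See5 output, a leaf annotated (N) or (N/E) covers N training cases, of which E are misclassified; fractional counts arise when cases with unknown attribute values are split across branches. As a hedged illustration, not part of the original run, the Python sketch below sums these counts for the Fold 0 tree above. The leaf lines are copied by hand with the branch bars removed, and the helper `leaf_counts` is a hypothetical name.

```python
import re

# Leaf lines copied from the Fold 0 tree above (branch bars removed by hand).
LEAVES = """\
Email in MIL,GOV,ORG: True (0)
Email = NET: False (6/1)
Email = COM: True (33/8)
Email = EDU: True (1)
Email = Gazelle: True (13/2)
Email = Other: True (2)
Email = NULL: NULL (17637)
Account Creation Date <= 2000/03/15: False (278)
Num Reinforced Toe Views <= 1: True (26/2)
Num Reinforced Toe Views > 1: False (3/1)
"""

# Matches a trailing "(N)" or "(N/E)", allowing fractional counts such as 3.4/0.3.
LEAF = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def leaf_counts(line):
    """Return (cases, errors) for a leaf line, or None if the line is not a leaf."""
    match = LEAF.search(line)
    if match is None:
        return None
    cases = float(match.group(1))
    errors = float(match.group(2)) if match.group(2) else 0.0
    return cases, errors


counts = [c for c in map(leaf_counts, LEAVES.splitlines()) if c is not None]
total_cases = sum(n for n, _ in counts)
total_errors = sum(e for _, e in counts)
print(total_cases, total_errors)
```

The leaf counts add up to 17999 training cases, which matches the 19998 cases read minus the 1999 held out for this fold, with 14 of them misclassified on the training data.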
[ Fold 1 ]
Decision tree:
DoYouPurchaseForOthers = False:
:...Account Creation Date <= 2000/03/15: False (279)
: Account Creation Date > 2000/03/15:
: :...Num products Template Views <= 7: True (24)
: Num products Template Views > 7: False (4/1)
DoYouPurchaseForOthers = NULL:
:...Email in MIL,GOV,ORG: True (0)
Email = NET: False (7/1)
Email = EDU: True (1)
Email = Other: True (2)
Email = NULL: NULL (17638)
Email = Gazelle:
:...Num BrandOrder Assortment Views <= 0: False (3/1)
: Num BrandOrder Assortment Views > 0: True (9)
Email = COM:
:...Bank Card Holder = False: False (2)
Bank Card Holder = NULL:
:...Session Start Login Count <= 1: True (5)
: Session Start Login Count > 1: False (2)
Bank Card Holder = True:
:...Year Of Structure <= 1954: False (3/1)
Year Of Structure > 1954: True (20)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
14        4( 0.2%)   <<
[ Fold 2 ]
Decision tree:
DoYouPurchaseForOthers = False:
:...Account Creation Date <= 2000/03/15: False (279)
: Account Creation Date > 2000/03/15: True (29/2)
DoYouPurchaseForOthers = NULL:
:...DMA No Mail Solicitation Flag = NULL: NULL (17649/12)
DMA No Mail Solicitation Flag = True:
:...Own Or Rent Home = Renter: True (0)
Own Or Rent Home = NULL:
:...Num main/replenishment Template Views <= 1: False (8)
: Num main/replenishment Template Views > 1: True (2)
Own Or Rent Home = Owner:
:...Num checkout Template Views > 7: False (2)
Num checkout Template Views <= 7:
:...Year Of Structure > 1927: True (25/1)
Year Of Structure <= 1927:
:...Account Creation Date <= 2000/02/09: True (2)
Account Creation Date > 2000/02/09: False (2)
Evaluation on hold-out data (2000 cases):
Decision Tree
----------------
Size      Errors
9         5( 0.3%)   <<
[ Fold 3 ]
Decision tree:
DoYouPurchaseForOthers = False:
:...Account Creation Date <= 2000/03/15: False (279)
: Account Creation Date > 2000/03/15:
: :...Num Reinforced Toe Views <= 2: True (27/2)
: Num Reinforced Toe Views > 2: False (3/1)
DoYouPurchaseForOthers = NULL:
:...DMA No Mail Solicitation Flag = NULL: NULL (17648/12)
DMA No Mail Solicitation Flag = True:
:...Num Danskin Product Views > 1: False (2)
Num Danskin Product Views <= 1:
:...Own Or Rent Home = Renter: True (0)
Own Or Rent Home = NULL:
:...Account Creation Date <= 2000/01/30: True (3)
: Account Creation Date > 2000/01/30: False (6)
Own Or Rent Home = Owner:
:...Year Of Structure > 1954: True (25/1)
Year Of Structure <= 1954:
:...Estimated Income Code in [Under $15;000-$75;000-$99;999]: False (3)
Estimated Income Code in [$100;000-$124;999-$125;000 OR MORE]: True (2)
Evaluation on hold-out data (2000 cases):
Decision Tree
----------------
Size      Errors
10        1( 0.1%)   <<
[ Fold 4 ]
Decision tree:
DoYouPurchaseForOthers = False:
Tree has been cut for convenience
Question 5.
How many of our customers are Premium Card holders?
See5 [Release 1.15] Tue Mar 26 13:51:51 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `Premium Card Holder'
*** line 19999 of `Customers.data': unexpected end of file
Read 19998 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
TE Card Holder = NULL: NULL (17728)
TE Card Holder = True:
:...Session First Request Day of Week in [Monday-Tuesday]: False (6)
: Session First Request Day of Week in [Wednesday-Sunday]:
: :...Own Or Rent Home = NULL: True (0)
: Own Or Rent Home = Owner: True (22/5)
: Own Or Rent Home = Renter: False (2)
TE Card Holder = False:
:...Num Seasonal 1 Stock Views <= 4: False (232/37)
Num Seasonal 1 Stock Views > 4:
:...Truck Owner = NULL: True (0)
Truck Owner = True: False (3)
Truck Owner = False: True (6/1)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
7         6( 0.3%)   <<
[ Fold 1 ]
Decision tree:
TE Card Holder = NULL: NULL (17728)
TE Card Holder = True:
:...Session First Request Day of Week in [Monday-Tuesday]: False (6)
: Session First Request Day of Week in [Wednesday-Sunday]:
: :...RV Owner = NULL: True (0)
: RV Owner = True: False (2)
: RV Owner = False: True (18/3)
TE Card Holder = False:
:...Bank Card Holder = NULL: False (0)
Bank Card Holder = False: False (33)
Bank Card Holder = True:
:...Num main/lifestyles Template Views > 0: False (12)
Num main/lifestyles Template Views <= 0:
:...Num Replenishment Stock Views > 0: False (13)
Num Replenishment Stock Views <= 0:
:...Gas Card Holder = NULL: False (0)
Gas Card Holder = False: False (26/1)
Gas Card Holder = True:
:...Finance Retail Activity = NULL: False (0)
Finance Retail Activity = True: [S1]
Finance Retail Activity = False: [S2]
SubTree [S1]
Session First Request Day of Week in [Monday-Wednesday]: True (5/1)
Session First Request Day of Week in [Thursday-Sunday]: False (5/1)
SubTree [S2]
Property Type in apartment(5+ units),2-4 unit(duplex;triplex;quad),
: mobile_home,
: misc. residential (condo store/flat): False (0)
Property Type = condo: True (4/1)
Property Type = single family dwelling:
:...Num main/login2 Template Views <= 1: False (40/5)
: Num main/login2 Template Views > 1: True (4/1)
Property Type = NULL:
:...Num Ultra Sheer Look Product Views > 0: True (4/1)
Num Ultra Sheer Look Product Views <= 0:
:...RV Owner = NULL: False (0)
RV Owner = True: False (14/1)
RV Owner = False:
:...Number Of Adults > 3:
:...Age <= 26: False (3)
: Age > 26: True (8/1)
Number Of Adults <= 3:
:...Num PH Category Views <= 0:
:...Num Departments Assortment Views <= 15: False (61/7)
: Num Departments Assortment Views > 15: True (5/1)
Num PH Category Views > 0:
:...Session Last Request Processing Time <= 532: True (5)
Session Last Request Processing Time > 532: False (3)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
21        9( 0.5%)   <<
[ Fold 2 ]
Decision tree:
TE Card Holder = NULL: NULL (17728)
TE Card Holder = True:
:...Session First Request Day of Week in [Monday-Tuesday]: False (5)
: Session First Request Day of Week in [Wednesday-Sunday]:
: :...Num Welcome Assortment Views > 0: False (2)
: Num Welcome Assortment Views <= 0:
: :...Own Or Rent Home = NULL: True (0)
: Own Or Rent Home = Owner: True (17/2)
: Own Or Rent Home = Renter: False (2)
TE Card Holder = False:
:...Bank Card Holder = NULL: False (0)
Bank Card Holder = False: False (35)
Bank Card Holder = True:
:...Num main/lifestyles Template Views > 0: False (14)
Num main/lifestyles Template Views <= 0:
:...Num Replenishment Stock Views > 0: False (14)
Num Replenishment Stock Views <= 0:
:...Num Hanes Product Views > 0:
:...Session Browser Family Top 3 = Other: True (0)
: Session Browser Family Top 3 = Netscape: True (1)
: Session Browser Family Top 3 = AOL: False (5)
: Session Browser Family Top 3 = Internet Explorer: True (8/1)
Num Hanes Product Views <= 0:
:...Account Creation Date <= 2000/03/08: [S1]
Account Creation Date > 2000/03/08:
:...Working Woman = NULL: False (0)
Working Woman = True: True (6/2)
Working Woman = False:
:...Num Luxury Product Views > 0: True (2)
Num Luxury Product Views <= 0:
:...Num Tube Package Views <= 0: False (28/8)
Num Tube Package Views > 0: True (2)
Tree has been cut for convenience
Question 6.
Who is the Email provider of most of our customers?
See5 [Release 1.15] Tue Mar 26 13:03:09 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `Email'
*** line 19999 of `Customers.data': unexpected end of file
Read 19998 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
DoYouPurchaseForOthers = NULL:
:...SendEmail = NULL: NULL (17638/1)
: SendEmail = False: [S1]
: SendEmail = True:
: :...Num Fashion 1 Stock Views > 0: Other (2)
: Num Fashion 1 Stock Views <= 0:
: :...Num Luxury Product Views > 0: NET (2/1)
: Num Luxury Product Views <= 0:
: :...Customer ID <= 236: Gazelle (6)
: Customer ID > 236:
: :...RV Owner = True: EDU (1)
: RV Owner = NULL:
: :...Session Start Login Count <= 1: COM (4)
: : Session Start Login Count > 1: Gazelle (2)
: RV Owner = False:
: :...Num main/freegift Template Views <= 0: COM (18/1)
: Num main/freegift Template Views > 0:
: :...Account Creation Date <= 2000/02/03: Gazelle (2)
: Account Creation Date > 2000/02/03: COM (2)
DoYouPurchaseForOthers = False:
:...Num Tube Package Views > 0:
:...Customer ID <= 12146: Gazelle (2/1)
: Customer ID > 12146: COM (5)
Num Tube Package Views <= 0:
:...Request Processing Time Sum <= 62:
:...Num main/login2 Template Views <= 0: NET (8/1)
: Num main/login2 Template Views > 0: COM (2/1)
Request Processing Time Sum > 62:
:...Session Browser Family in Google,Teleport Pro,AltaVista,
: PerMan Surfer,Novell Border Manager,
: link-check,EmailSiphon,Java,Genie,
: WebTrends,Enfish Tracker,Lycos,ergyBot,
: Lotus Notes,Lesszilla,InfoSeek,
: Northern Light,Unknown,Cute FTP,
: Nitro e-mail collector,
: Mozilla: COM (0)
Session Browser Family = WebTV: NET (1)
Session Browser Family = AOL: COM (63)
Session Browser Family = Netscape:
:...Num UniqueBoutiques Assortment Views <= 29: COM (39/14)
: Num UniqueBoutiques Assortment Views > 29: EDU (2)
Session Browser Family = Other:
:...Session Start Login Count > 1: NET (2)
: Session Start Login Count <= 1:
: :...New Car Buyer = NULL: COM (7)
: New Car Buyer = True:
: :...Length Of Residence <= 9: NET (4)
: Length Of Residence > 9: COM (3)
Session Browser Family = Internet Explorer:
:...Upscale Retail = True: COM (4/2)
Upscale Retail = NULL:
:...Cookie First Visit Date_Time <= 16:10:16: COM (31/3)
: Cookie First Visit Date_Time > 16:10:16:
: :...Num PH Category Views <= 0: NET (8/1)
: Num PH Category Views > 0: COM (3/2)
Upscale Retail = False:
:...Miscellaneous Retail Activity = NULL: COM (0)
Miscellaneous Retail Activity = True:
:...Session Request Count <= 13: NET (3)
: Session Request Count > 13: COM (2/1)
Miscellaneous Retail Activity = False:
:...Num main/freegift Template Views > 1: NET (2)
Num main/freegift Template Views <= 1:
:...Num NicoleMiller Product Views > 2: NET (4/1)
Num NicoleMiller Product Views <= 2:
:...Num Liquid Product Views > 0: NET (4/1)
Num Liquid Product Views <= 0:
:...Speciality Store Retail = NULL: COM (0)
Speciality Store Retail = True: COM (12)
Speciality Store Retail = False:
:...BasicOrFashion Last = Fashion: COM (0)
BasicOrFashion Last = Basic: COM (6)
BasicOrFashion Last = NULL: [S2]
SubTree [S1]
Session First Referrer Top 5 in http://www.mycoupons.com,
: http://www.fashionmall.com,
: http://stores.shopnow.com,
: http://www.winnie-cooper.com: NET (0)
Session First Referrer Top 5 = http://www.gazelle.com: Gazelle (2)
Session First Referrer Top 5 = Other:
:...Marital Status = Single: NET (0)
Marital Status = Inferred Married: COM (1)
Marital Status = Inferred Single: NET (6)
Marital Status = Married: COM (1)
Marital Status = NULL: COM (3)
SubTree [S2]
Num main/vendor Template Views <= 0: COM (83/21)
Num main/vendor Template Views > 0:
:...Estimated Income Code in [Under $15;000-$40;000-$49;999]: COM (4)
Estimated Income Code in [$50;000-$74;999-$125;000 OR MORE]: NET (5)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
41        11( 0.6%)   <<
[ Fold 1 ]
Decision tree:
DoYouPurchaseForOthers = NULL:
:...SendEmail = NULL: NULL (17638/1)
: SendEmail = False: [S1]
: SendEmail = True:
: :...Num Fashion 1 Stock Views > 0: Other (2)
: Num Fashion 1 Stock Views <= 0:
: :...Customer ID <= 328: Gazelle (6)
: Customer ID > 328:
: :...Num BrandOrder Assortment Views <= 1: COM (28/5)
: Num BrandOrder Assortment Views > 1: Gazelle (2)
DoYouPurchaseForOthers = False:
:...Session Browser Family in Google,Teleport Pro,AltaVista,PerMan Surfer,
: Novell Border Manager,link-check,EmailSiphon,
: Java,Genie,WebTrends,Enfish Tracker,Lycos,
: ergyBot,Lotus Notes,Lesszilla,InfoSeek,
: Northern Light,Unknown,Cute FTP,
: Nitro e-mail collector,Mozilla: COM (0)
Tree has been cut for convenience
Question 7.
How many of our customers are mail order buyers?
See5 [Release 1.15] Tue Mar 26 13:07:39 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `Mail Order Buyer'
*** line 19999 of `Customers.data': unexpected end of file
Read 19998 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
Mail Responder = NULL: NULL (17728)
Mail Responder = False: False (68)
Mail Responder = True:
:...Num Opaque Look Product Views > 0:
:...UnitsPerInnerBox Average <= 5.25: True (4)
: UnitsPerInnerBox Average > 5.25: False (3)
Num Opaque Look Product Views <= 0:
:...Unknown Card Type = NULL: True (0)
Unknown Card Type = True: True (84/7)
Unknown Card Type = False:
:...Session Visit Count <= 13: True (105/19)
Session Visit Count > 13:
:...Account Creation Date <= 2000/01/30: True (2)
Account Creation Date > 2000/01/30: False (5)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
8         4( 0.2%)   <<
[ Fold 1 ]
Decision tree:
Mail Responder = NULL: NULL (17728)
Mail Responder = False: False (70)
Mail Responder = True:
:...Num Opaque Look Product Views <= 0:
:...Session Start Login Count <= 6: True (189/25)
: Session Start Login Count > 6: False (5/1)
Num Opaque Look Product Views > 0:
:...UnitsPerInnerBox Average <= 5.25: True (4)
UnitsPerInnerBox Average > 5.25: False (3)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
6         7( 0.4%)   <<
[ Fold 2 ]
Decision tree:
Mail Responder = NULL: NULL (17728)
Mail Responder = False: False (69)
Mail Responder = True:
:...Num Opaque Look Product Views > 0:
:...UnitsPerInnerBox Average <= 5.25: True (3)
: UnitsPerInnerBox Average > 5.25: False (3)
Num Opaque Look Product Views <= 0:
:...Login Failure Count <= 4: True (185/24.8)
Login Failure Count > 4:
:...Session Visit Count <= 12: True (5.9/1.1)
Session Visit Count > 12: False (4.1)
Evaluation on hold-out data (2000 cases):
Decision Tree
----------------
Size      Errors
7         5( 0.3%)   <<
[ Fold 3 ]
Decision tree:
Mail Responder = NULL: NULL (17727)
Mail Responder = False: False (67)
Mail Responder = True:
:...Unknown Card Type = NULL: True (0)
Unknown Card Type = True: True (82/8)
Unknown Card Type = False:
:...Num Opaque Look Product Views > 0: False (4/1)
Num Opaque Look Product Views <= 0:
:...Session Visit Count <= 13: True (112/21)
Session Visit Count > 13:
:...Account Creation Date <= 2000/01/30: True (2)
Account Creation Date > 2000/01/30: False (4)
Evaluation on hold-out data (2000 cases):
Decision Tree
----------------
Size      Errors
7         1( 0.1%)   <<
[ Fold 4 ]
Decision tree:
Mail Responder = NULL: NULL (17727)
Mail Responder = False: False (68)
Mail Responder = True:
:...Num Opaque Look Product Views <= 0: True (196/32)
Num Opaque Look Product Views > 0:
:...UnitsPerInnerBox Average <= 5.25: True (4)
UnitsPerInnerBox Average > 5.25: False (3)
Evaluation on hold-out data (2000 cases):
Decision Tree
----------------
Size      Errors
5         3( 0.1%)   <<
[ Fold 5 ]
Decision tree:
Mail Responder = NULL: NULL (17727)
Mail Responder = True: True (202/34)
Mail Responder = False: False (69)
Tree has been cut for convenience
Question 8.
How many children do most of our customers have?
See5 [Release 1.15] Tue Mar 26 13:10:46 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `NumberOfChildren'
*** ignoring cases with bad or unknown class
*** line 19999 of `Customers.data': unexpected end of file
Read 58 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
Vehicle Lifestyle in IMPORT (STANDARD/ECONOMY),REGULAR (MIDSIZE/SMALL),
: TRUCK OR UTILITY VEHICLE: 0 (0)
Vehicle Lifestyle = FULL SIZE (STANDARD/LUXURY): 0 (1)
Vehicle Lifestyle = SPECIALTY (MIDSIZE/SMALL): 4 or more (3)
Vehicle Lifestyle = PERSONAL LUXURY CAR: 0 (1)
Vehicle Lifestyle = STATION WAGON: 1 (1)
Vehicle Lifestyle = NULL:
:...Num LifeStyles Assortment Views > 0: 2 (2/1)
Num LifeStyles Assortment Views <= 0:
:...Num Tube Package Views <= 0: 0 (43/6)
Num Tube Package Views > 0: 2 (2)
Evaluation on hold-out data (5 cases):
Decision Tree
----------------
Size      Errors
7         0( 0.0%)   <<
[ Fold 1 ]
Decision tree:
Other Indiv. Occupation in SALES/SERVICE,SELF EMPLOYED MANAGEMENT,
: SELF EMPLOYED RETIRED,MILITARY,HOUSEWIFE,
: CRAFTSMAN/BLUE COLLAR,RELIGIOUS,STUDENT,
: SELF EMPLOYED PROF/TECH,RETIRED,SELF EMPLOYED,
: SELF EMPLOYED SALES/MARKETING,
: SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,
: CLERICAL/WHITE COLLAR: 0 (0)
Other Indiv. Occupation = ADMINISTRATIVE/MANAGERIAL: 2 (3)
Other Indiv. Occupation = PROFESSIONAL/TECHNICAL:
:...NumberOfAdults in [0-1]: 0 (2)
: NumberOfAdults in [2-3 or more]: 4 or more (3)
Other Indiv. Occupation = NULL:
:...Num main/freegift Template Views > 1: 2 (2)
Num main/freegift Template Views <= 1:
:...Num main/boutique Template Views <= 2: 0 (41/4)
Num main/boutique Template Views > 2: 1 (2/1)
Evaluation on hold-out data (5 cases):
Decision Tree
----------------
Size      Errors
6         1(20.0%)   <<
[ Fold 2 ]
Decision tree:
Session First Request Day of Week = Sunday: 4 or more (2)
Session First Request Day of Week in [Monday-Saturday]:
:...Num main/departments Template Views > 3: 2 (4/1)
Num main/departments Template Views <= 3:
:...Account Creation Date_Time <= 20:28:38: 0 (42/3)
Account Creation Date_Time > 20:28:38:
:...Account Creation Date_Time <= 21:54:16: 1 (2)
Account Creation Date_Time > 21:54:16: 0 (2/1)
Evaluation on hold-out data (6 cases):
Decision Tree
----------------
Size      Errors
5         2(33.3%)   <<
[ Fold 3 ]
Decision tree:
Session First Request Day of Week = Sunday: 4 or more (2)
Session First Request Day of Week in [Monday-Saturday]:
:...Num LifeStyles Assortment Views > 0: 2 (2/1)
Num LifeStyles Assortment Views <= 0:
:...Num Tube Package Views <= 0: 0 (46/6)
Num Tube Package Views > 0: 2 (2)
Evaluation on hold-out data (6 cases):
Decision Tree
----------------
Size      Errors
4         2(33.3%)   <<
[ Fold 4 ]
Decision tree:
Session Last Request Processing Time > 20093: 2 (2/1)
Session Last Request Processing Time <= 20093:
:...Other Indiv. Occupation in SALES/SERVICE,SELF EMPLOYED MANAGEMENT,
: SELF EMPLOYED RETIRED,MILITARY,HOUSEWIFE,
: CRAFTSMAN/BLUE COLLAR,RELIGIOUS,STUDENT,
: SELF EMPLOYED PROF/TECH,RETIRED,SELF EMPLOYED,
: SELF EMPLOYED SALES/MARKETING,
: SELF EMPLOYED BLUE COLLAR,
: SELF EMPLOYED CLERICAL,
: CLERICAL/WHITE COLLAR: 0 (0)
Other Indiv. Occupation = PROFESSIONAL/TECHNICAL: 0 (3/1)
Other Indiv. Occupation = ADMINISTRATIVE/MANAGERIAL: 2 (3)
Other Indiv. Occupation = NULL:
:...Account Creation Date_Time <= 20:28:38: 0 (40/3)
Account Creation Date_Time > 20:28:38:
:...Account Creation Date_Time <= 21:54:16: 1 (2)
Account Creation Date_Time > 21:54:16: 0 (2/1)
Evaluation on hold-out data (6 cases):
Decision Tree
----------------
Size      Errors
6         2(33.3%)   <<
Tree has been cut for convenience
Question 9.
How many of our customers respond to our marketing mails?
See5 [Release 1.15] Tue Mar 26 13:13:34 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `Mail Responder'
*** line 19999 of `Customers.data': unexpected end of file
Read 19998 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
Mail Order Buyer = NULL: NULL (17728)
Mail Order Buyer = True: True (167)
Mail Order Buyer = False:
:...Truck Owner = NULL: False (0)
Truck Owner = True: True (9/1)
Truck Owner = False: [S1]
SubTree [S1]
Other Indiv. Occupation in SALES/SERVICE,SELF EMPLOYED MANAGEMENT,
: SELF EMPLOYED RETIRED,MILITARY,HOUSEWIFE,RELIGIOUS,
: STUDENT,SELF EMPLOYED PROF/TECH,RETIRED,
: SELF EMPLOYED,SELF EMPLOYED SALES/MARKETING,
: SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,
: CLERICAL/WHITE COLLAR: False (0)
Other Indiv. Occupation = PROFESSIONAL/TECHNICAL: True (5)
Other Indiv. Occupation = CRAFTSMAN/BLUE COLLAR: True (1)
Other Indiv. Occupation = ADMINISTRATIVE/MANAGERIAL: False (1)
Other Indiv. Occupation = NULL:
:...Occupation in SELF EMPLOYED MANAGEMENT,MILITARY,CRAFTSMAN/BLUE COLLAR,
: SELF EMPLOYED PROF/TECH,RETIRED,
: SELF EMPLOYED SALES/MARKETING,SELF EMPLOYED,OTHER,
: SELF EMPLOYED BLUE COLLAR,SELF EMPLOYED CLERICAL,
: SELF EMPLOYED HOMEMAKER: False (0)
Occupation = SALES/SERVICE: True (1)
Occupation = HOUSEWIFE: True (1)
Occupation = STUDENT: True (2/1)
Occupation = ADMINISTRATIVE/MANAGERIAL: False (6/1)
Occupation = CLERICAL/WHITE COLLAR: True (3)
Occupation = PROFESSIONAL/TECHNICAL:
:...Number Of Adults <= 1: True (5/1)
: Number Of Adults > 1: False (3/1)
Occupation = NULL:
:...Premium Card Holder = NULL: False (0)
Premium Card Holder = False: False (54/3)
Premium Card Holder = True:
:...Dwelling Unit Size = NULL: False (0)
Dwelling Unit Size = MULTI FAMILY DWELLING UNIT: False (5)
Dwelling Unit Size = SINGLE FAMILY DWELLING UNIT:
:...Length Of Residence <= 4: False (2)
Length Of Residence > 4: True (6)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
17        4( 0.2%)   <<
[ Fold 1 ]
Decision tree:
Mail Order Buyer = NULL: NULL (17728)
Mail Order Buyer = True: True (171)
Mail Order Buyer = False:
:...RV Owner = NULL: False (0)
RV Owner = True: True (3)
RV Owner = False:
:...Texture Last = Flat: True (3)
Texture Last = Textured: True (2)
Texture Last = NULL:
:...Presence Of Children = NULL: False (0)
Presence Of Children = True:
:...Num Gift Sets & Special Items Views > 0: False (2)
: Num Gift Sets & Special Items Views <= 0:
: :...Customer ID <= 9064: False (4)
: Customer ID > 9064: [S1]
Presence Of Children = False:
:...Num main/departments Template Views > 4: True (3)
Num main/departments Template Views <= 4:
:...Login Failure Count <= 6: False (63.6/4.7)
Login Failure Count > 6:
:...Session Start Login Count <= 2: False (2.4/0.3)
Session Start Login Count > 2: True (2)
SubTree [S1]
Available Home Equity in [EQUITY $1-$4;999-EQUITY $10;000-$19;9999]: False (2.1/0.1)
Available Home Equity in [EQUITY $20;000-$29;000-EQUITY $2;000;000 AND OVER]: True (12.9)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
13        5( 0.3%)   <<
[ Fold 2 ]
Decision tree:
Mail Order Buyer = NULL: NULL (17728)
Mail Order Buyer = True: True (167)
Mail Order Buyer = False:
:...RV Owner = NULL: False (0)
RV Owner = True: True (5)
RV Owner = False:
:...TE Card Holder = NULL: False (0)
TE Card Holder = True: True (2)
TE Card Holder = False:
:...Occupation in SELF EMPLOYED MANAGEMENT,MILITARY,
: CRAFTSMAN/BLUE COLLAR,SELF EMPLOYED PROF/TECH,
: RETIRED,SELF EMPLOYED SALES/MARKETING,SELF EMPLOYED,
: OTHER,SELF EMPLOYED BLUE COLLAR,
: SELF EMPLOYED CLERICAL,
: SELF EMPLOYED HOMEMAKER: False (0)
Occupation = SALES/SERVICE: True (1)
Occupation = PROFESSIONAL/TECHNICAL: True (7/2)
Occupation = HOUSEWIFE: True (2)
Occupation = STUDENT: True (2/1)
Occupation = ADMINISTRATIVE/MANAGERIAL: False (4)
Occupation = CLERICAL/WHITE COLLAR: True (3)
Occupation = NULL:
:...Dwelling Unit Size = MULTI FAMILY DWELLING UNIT:
Tree has been cut for convenience
Question 10.
What percentage of our customers come back to our website?
See5 [Release 1.15] Tue Mar 26 13:16:07 2002
Options:
Cross-validate using 10 folds
Class specified by attribute `Retail Activity'
*** line 19999 of `Customers.data': unexpected end of file
Read 19998 cases (296 attributes) from Customers.data
[ Fold 0 ]
Decision tree:
Unknown Card Type = NULL: NULL (17728)
Unknown Card Type = False: False (177/6)
Unknown Card Type = True:
:...Upscale Retail = NULL: True (0)
Upscale Retail = True:
:...Truck Owner = NULL: True (0)
: Truck Owner = True: False (4)
: Truck Owner = False: True (8/1)
Upscale Retail = False:
:...Num LifeStyles Assortment Views > 1:
:...Cookie First Visit Date_Time <= 16:44:31: False (2)
: Cookie First Visit Date_Time > 16:44:31: True (2)
Num LifeStyles Assortment Views <= 1:
:...Num WDCS Category Views <= 4: True (71/2)
Num WDCS Category Views > 4:
:...Year Of Structure <= 1991: True (5)
Year Of Structure > 1991: False (2)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
9         3( 0.2%)   <<
[ Fold 1 ]
Decision tree:
Unknown Card Type = NULL: NULL (17728)
Unknown Card Type = False: False (177/6)
Unknown Card Type = True:
:...Num Rayon Product Views > 0:
:...Account Creation Date_Time <= 17:18:40: False (2)
: Account Creation Date_Time > 17:18:40: True (2)
Num Rayon Product Views <= 0:
:...Num LifeStyles Assortment Views <= 1: True (85/6)
Num LifeStyles Assortment Views > 1:
:...Num main/home Template Views <= 2: False (2)
Num main/home Template Views > 2: True (3)
Evaluation on hold-out data (1999 cases):
Decision Tree
----------------
Size      Errors
7         3( 0.2%)   <<
[ Fold 2 ]
Decision tree:
Unknown Card Type = NULL: NULL (17728)
Unknown Card Type = False: False (177/7)
Unknown Card Type = True:
:...Num Rayon Product Views > 0:
:...Account Creation Date_Time <= 17:21:14: False (2)
: Account Creation Date_Time > 17:21:14: True (2)
Num Rayon Product Views <= 0:
:...Upscale Retail = NULL: True (0)
Upscale Retail = False: True (77/5)
Upscale Retail = True:
:...Bank Retail Activity = NULL: True (0)
Bank Retail Activity = True: True (9/1)
Bank Retail Activity = False: False (3)
Evaluation on hold-out data (2000 cases):
Decision Tree
----------------
Size      Errors
7         1( 0.1%)   <<
[ Fold 3 ]
Decision tree:
Unknown Card Type = NULL: NULL (17727)
Unknown Card Type = False: False (179/7)
Unknown Card Type = True:
:...Upscale Retail = NULL: True (0)
Upscale Retail = False: True (80/6)
Upscale Retail = True:
:...Bank Retail Activity = NULL: True (0)
Bank Retail Activity = True: True (9/1)
Bank Retail Activity = False: False (3)
Evaluation on hold-out data (2000 cases):
Decision Tree
----------------
Size      Errors
5         2( 0.1%)   <<
[ Fold 4 ]
Decision tree:
Unknown Card Type = NULL: NULL (17727)
Unknown Card Type = False: False (177/7)
Unknown Card Type = True:
:...Upscale Retail = NULL: True (0)
    Upscale Retail = False: True (83/7)
    Upscale Retail = True:
    :...Bank Retail Activity = NULL: True (0)
        Bank Retail Activity = False: False (3)
        Bank Retail Activity = True:
        :...Truck Owner = NULL: True (0)
            Truck Owner = True: False (3/1)
            Truck Owner = False: True (5)
Evaluation on hold-out data (2000 cases):
    Decision Tree
  ----------------
  Size      Errors
     6    0( 0.0%)   <<
Tree has been cut for convenience
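The percentage printed beside each error count is the number of misclassified hold-out cases divided by the number of hold-out cases in that fold. The sketch below recomputes it for Folds 1 to 4 only, since the rest of the output is cut, and pools those four folds into a single figure. The names FOLDS, error_percent and pooled_error_percent are hypothetical and are not part of See5.

```python
# (hold-out cases, errors) copied from the evaluation lines of Folds 1-4.
FOLDS = {1: (1999, 3), 2: (2000, 1), 3: (2000, 2), 4: (2000, 0)}


def error_percent(cases, errors):
    """Error rate of one fold as a percentage of its hold-out cases."""
    return 100 * errors / cases


def pooled_error_percent(folds):
    """Total errors over total hold-out cases, as a percentage."""
    total_cases = sum(cases for cases, _ in folds.values())
    total_errors = sum(errors for _, errors in folds.values())
    return 100 * total_errors / total_cases


if __name__ == "__main__":
    for fold, (cases, errors) in FOLDS.items():
        rate = error_percent(cases, errors)
        print(f"Fold {fold}: {errors} errors in {cases} cases = {rate:.1f}%")
    print(f"Folds 1-4 pooled: {pooled_error_percent(FOLDS):.2f}%")
```

Rounded to one decimal place, the per-fold figures agree with the 0.2%, 0.1%, 0.1% and 0.0% reported by See5. Pooled over these four folds, 6 errors in 7999 hold-out cases is about 0.08%.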
Appendix H
Project Log
Activity
Procedures and Timetable Meeting
Met Prof Berzins, my project supervisor, who told me that I had been
allocated a project on Data Mining. He asked me to do some reading and
to find out some information before deciding what I should do.
Wrote an email to Stuart Robert, the project initiator, asking him the
questions that I had to find answers to.
Forwarded the answers I got from Stuart to Prof Berzins.
Attended the Questionnaire briefing
Met with my supervisor and discussed the minimum requirements for my
project. He was worried that I might be doing something that had already
been done on my case study. We did not have datasets to run tests on, so
I had to rely on the KDDCUP datasets from the Internet for the mining,
and those datasets are specific to the competition's questions.
Submitted the aim and minimum requirements of my project.
Met my supervisor to update him on what I had submitted. Asked him
to back me up on my disk space requirements.
Sent an email to support requesting disk space for downloading the
datasets and the See5 software.
Was granted the space and started downloading the datasets and software.
Updated my supervisor on what I had done. He asked me to think
about what the answers to my question would teach me.
Started writing my Mid-year report
Gave my report to my supervisor to look over before I finally
submitted it.
Submitted my report
Received my marked mid-year report from my Supervisor
Tried doing the data mining exercise using the See5 Demo that I
downloaded from the Web but it could only read 400 records.
Filled in a form requesting the See5 evaluation licence from RuleQuest
Received an Email from Ross Quinlan with the evaluation licence and
installed the Software.
Started doing the mining exercise with the Evaluation Software
Started writing my Draft chapter and table of contents
Received an email from Liu Bing granting me permission to download
CBA, which is to be used to validate the results from the See5 program
Sent an Email to support reporting an error I was experiencing when trying
to install the CBA program
Received an Email from Support telling me that I should not install the
CBA software. Forwarded the Email to my supervisor
Submitted the Table of contents and draft chapter to my Supervisor
Gave a presentation on what I had done at the Progress Meeting
with my supervisor and assessor. Got good advice from the assessor on how to
split the dataset.
Received another Evaluation licence from a friend and started doing the
mining exercise again.
Started writing up the project report
Finished writing the project report
Date & Time
12/10/01, 3pm
15/10/01, 2pm
16/10/01, 10:25am
16/10/01, 1:15pm
26/10/01, 3pm
29/10/01, 2pm
05/11/01
06/11/01
07/11/01
19/11/01
21/11/01
08/12/01
13/12/01
28/01/02
04/02/02
05/02/02
06/02/02
07/02/02
01/03/02
07/03/02
07/03/02
08/03/02
11/03/02
20/03/02
26/03/02
01/04/02
25/04/02