Pariseau Correlation in Business Intelligence

William K. Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
[email protected]

Gregory H. Wettstein
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
[email protected]
ABSTRACT
In this paper, we describe a new correlation measure used in a reliable, high-performance classification system that combines Nearest Neighbor Vote based classification with a vertical data structure (Predicate Tree or P-tree1) for efficient vertical processing. The method was used to produce a solution to the now-famous Netflix problem. On the Probe Dataset provided by Netflix, the method produces a 19% improvement in Root Mean Square Error of rating prediction over the company's current algorithm, CineMatch, which predicts the rating a renter will give to a movie based on the entire history of previous ratings. A solution based on this correlation measure and algorithm has been submitted.
Keywords
Pariseau Correlation, Nearest-Neighbor, Predicate-Tree, Classification
1. INTRODUCTION
Business Intelligence (BI) problems are very
interesting data mining problems. Typical BI training data
sets are large and extremely unbalanced among classes.
The Business Intelligence problem is a classification problem. Given a history of customer events (purchases or rentals) and satisfaction levels (ratings), one classifies potential future customer events according to their most likely satisfaction level. This information is then used to suggest to customers what item they might like to purchase or rent next. Clearly this is a central need for any Internet-based company, or indeed for any company seeking to do targeted advertising (all companies?). The classification training set, or Business Intelligence Data Set (BIDS), usually consists of the past history of customer satisfaction rating events (the rating of a product purchased or rented; in the Netflix case, a 1-5 rating of a rented movie). Thus, at its simplest, the training set looks something like BIDS(customerID, productID, rating), where the features are customerID and productID and the class label is rating. Very often there are other features, such as DateOfEvent. Of course there are many other single-entity features that might be useful to the classification process, such as Customer features (name, address, ethnicity, age, ...) and Product features (type, DateOfCreation, Creator, color, weight, ...). Together these form a Data Warehouse in the Star Model, in which the central fact file is BIDS and the Dimension star points are the Customer and Product feature files. In this paper we will keep it simple and work only with BIDS. In the more complex case, the Dimension files are usually joined into the central fact file to form the Classification Training Set.
1 United States Patent Number US 6,941,303 B2, Sept. 6, 2005.
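For concreteness, the following minimal sketch (Python, with hypothetical illustrative records; not taken from the actual Netflix data) shows a BIDS fact table with Customer and Product dimension files joined into it to form the classification training set.

# Hypothetical sketch of a BIDS star schema: the central fact table
# BIDS(customerID, productID, rating) plus Customer and Product dimensions.
# Names and sample values are illustrative only.

bids = [  # central fact table: one row per rating event
    {"customerID": 101, "productID": 7, "rating": 4},
    {"customerID": 102, "productID": 7, "rating": 2},
    {"customerID": 101, "productID": 9, "rating": 5},
]

customers = {101: {"age": 34}, 102: {"age": 51}}           # Customer dimension
products  = {7: {"type": "drama"}, 9: {"type": "comedy"}}  # Product dimension

def build_training_set(bids, customers, products):
    """Join dimension attributes into the fact table to form the
    classification training set (class label = rating)."""
    rows = []
    for r in bids:
        row = dict(r)
        row.update(customers[r["customerID"]])
        row.update(products[r["productID"]])
        rows.append(row)
    return rows

training_set = build_training_set(bids, customers, products)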
Throughout this paper, it is assumed that the data is in
vertical format. Horizontal format is the standard in the
industry and is ubiquitous. However, it is the authors'
contention (proved valid by the success of the method)
that vertical formatting is superior for many data mining
applications, whereas for standard data processing,
horizontal formatting is still best.
Horizontal Data formatting simply means that the data
on an entity type is collected into horizontal records (of
fields), one for each entity instance (e.g., employee
records, one for each employee). In order to process
horizontal data, scans down the collection (file) of these
horizontal records are required.
Of course, these horizontal files can be indexed, providing efficient alternate entry points, but scans are usually still necessary (full inversion of the file would eliminate the need for scans, but it is extremely expensive to build and maintain, and nearly impossible for large, volatile horizontal files).
Vertical Data formatting simply means that the data on
an entity is collected into vertical slices by feature or by
bit position of feature (or by some other vertical slicing of
a coding scheme applied to a feature or features of the entity, such as value slicing, in which each individual value in a feature column is bitmapped).
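The following minimal sketch (Python, with illustrative values; not the authors' implementation) shows bit-position slicing of one feature column and how a value-level question reduces to Boolean operations on the slices.

# Minimal sketch of vertical (bit-position) slicing of one feature column.
# Assumes 3-bit values (e.g., Netflix-style ratings 1-5); illustrative only.

ratings = [5, 3, 1, 4, 2]          # one horizontal feature column
NUM_BITS = 3

# bit_slices[b][i] is bit b (from most significant) of ratings[i]
bit_slices = [
    [(value >> (NUM_BITS - 1 - b)) & 1 for value in ratings]
    for b in range(NUM_BITS)
]
# ratings 5,3,1,4,2 -> slices: [1,0,0,1,0], [0,1,0,0,1], [1,1,1,0,0]

# A vertical query such as "how many ratings equal 5 (binary 101)?" reduces
# to AND-ing bit slices (or their complements) and counting 1 bits:
mask = [b2 & (1 - b1) & b0 for b2, b1, b0 in zip(*bit_slices)]
count_of_fives = sum(mask)   # -> 1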
In this BIDS classifier, variant forms of Nearest Neighbor Vote based classification were combined with a vertical data structure (Predicate Tree or P-tree) for efficient processing. Even for small data sets, a horizontal approach would require a large number of training database scans to arrive at an acceptable solution. The use of a vertical data structure (P-tree) provides acceptable computation times (as opposed to scanning horizontal record data sets). Each component of the proposed solution is discussed in the subsequent sections of this paper.
2. BACKGROUND
2.1 Similarity and Relevance Analysis
Classification models that consider all attributes
equally, such as the classical K-Nearest-Neighbor
classification (KNN), work well when all attributes are
similar in their relevance to the classification task [14].
This is, however, often not the case. The problem is
particularly pronounced for situations with a large number
of attributes in which some are known to be irrelevant.
Many solutions have been proposed that weight
dimensions according to their relevance to the
classification problem. The weighting can be derived as
part of the algorithm [5]. In an alternative strategy, the attribute dimensions are scaled using an evolutionary algorithm to optimize the classification accuracy of a separate algorithm, such as KNN [21].
In an effort to reduce the number of classification
voters and to reduce the dimension of the vector space
over which these class votes are calculated, judicious
customer (also called “user” in this paper) and item (also
called “movie” in this paper) selection proved extremely
important. That is to say, if one lets all neighboring users vote over all neighboring movies, the resulting predicted class (rating) turns out to be grossly in error. It is also important to note that the vertical representation facilitates efficient attribute relevance analysis in situations with a large number of attributes. The relevance of a particular attribute with respect to the class attribute can be assessed by reading only those two data columns, rather than reading all of the long rows of data as in a horizontal approach.
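As a hedged illustration of this column-pair computation, the sketch below scores one attribute's relevance to the class label while touching only two vertical columns; a Pearson-style correlation is used as a stand-in, since the exact relevance measure is not specified here.

# Illustrative sketch: scoring one attribute's relevance to the class label
# by reading only two vertical columns. Pearson correlation is used here as
# a stand-in relevance measure; the paper's actual measure may differ.
from math import sqrt

def relevance(attribute_column, class_column):
    n = len(attribute_column)
    mean_a = sum(attribute_column) / n
    mean_c = sum(class_column) / n
    cov = sum((a - mean_a) * (c - mean_c)
              for a, c in zip(attribute_column, class_column))
    var_a = sum((a - mean_a) ** 2 for a in attribute_column)
    var_c = sum((c - mean_c) ** 2 for c in class_column)
    if var_a == 0 or var_c == 0:
        return 0.0
    return abs(cov / sqrt(var_a * var_c))

# Only the attribute's column and the rating (class) column are read; no
# horizontal rows containing all other attributes are scanned. Values are
# illustrative.
age_column    = [23, 45, 31, 52, 29]   # one attribute column
rating_column = [ 1,  5,  2,  5,  1]   # class-label column
print(relevance(age_column, rating_column))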
2.2 P-tree Data Structure
The BIDS data set was structured into vertical Predicate-trees, or P-trees. P-trees are lossless, compressed, data-mining-ready vertical data structures used for fast computation of counts and for masking specific phenomena. This vertical representation consists of set structures representing the data column-by-column rather than row-by-row (as in horizontal relational data). Predicate-trees are one choice of vertical data representation, and can be used for data mining instead of the more common sets of relational records. This data structure has been
successfully applied in data mining applications ranging
from Classification and Clustering with K-Nearest
Neighbor, to Classification with Decision Tree Induction,
to Association Rule Mining [12][7][19][1][22][2][13][18]
[24][6]. A basic P-tree represents one attribute bit that is
reorganized into a tree structure by recursive sub-division,
while recording the predicate truth value for each division.
Each level of the tree contains truth-bits that represent
sub-trees and can then be used for phenomena masking
and fast computation of counts. This construction is
continued recursively down each tree path until downward
closure is reached. E.g., if the predicate is “purely 1 bits”,
downward closure is reached when purity is reached
(either purely 1 bits or purely 0 bits). In this case, a tree
branch is terminated when a sub-division is reached that is
entirely pure (which may or may not be at the leaf level).
These basic P-trees and their complements are combined using Boolean algebra operators such as AND (&), OR (|), and NOT (') to produce mask P-trees for individual values, individual tuples, value intervals, tuple rectangles, or any other attribute pattern [3]. The root count of any P-tree will
indicate the occurrence count of that pattern. The P-tree
data structure provides a structure for counting patterns in
an efficient, highly scalable manner.
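The following is a minimal sketch of this construction (binary halving is used for simplicity, and it is not the patented P-tree implementation): a basic P-tree is built over one bit slice with the "purely 1 bits" predicate, terminating branches at pure sub-divisions, and the root count is obtained from the tree.

# Minimal sketch of a basic P-tree over one bit slice (predicate: "purely 1
# bits"). Recursive sub-division stops early when a segment is pure (all 1s
# or all 0s). Illustration only, not the patented P-tree implementation.

class PTreeNode:
    def __init__(self, bits):
        self.length = len(bits)
        ones = sum(bits)
        if ones == 0 or ones == self.length:
            # pure segment: terminate this branch (may be above leaf level)
            self.pure_value = bits[0]
            self.children = None
        else:
            self.pure_value = None
            mid = self.length // 2
            self.children = (PTreeNode(bits[:mid]), PTreeNode(bits[mid:]))

    def root_count(self):
        """Number of 1 bits represented by this (sub)tree."""
        if self.children is None:
            return self.pure_value * self.length
        return sum(child.root_count() for child in self.children)

# Example: a bit slice for one attribute bit position (illustrative data).
bit_slice = [1, 1, 1, 1, 0, 1, 0, 0]
tree = PTreeNode(bit_slice)
print(tree.root_count())   # -> 4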
2.3 Pariseau Correlation Based Nearest
Neighbor Pruning
The Pariseau Correlation for each pair of data points in a data set can be computed rapidly using vertical P-trees [2]. For example, for any given pair of users (u, v), the Pariseau Correlation (also called Perpendicular Correlation), or PC(u, v), is

PC(u,v) = \left[ \sqrt{ \frac{(v - u - \bar{v} + \bar{u}) \circ (v - u - \bar{v} + \bar{u})}{a + nb} } \right]^{2}

where \circ is the inner product of the ratings of users u and v across the n selected movies, and a and b are tunable parameters. This approach avoids the requirement of computing all Euclidean distances between the sample and all the other points in the training data in arriving at the
Nearest Neighbor set. But more importantly, it produces a
select set of neighbors which, when allowed to vote for
the most likely rating, produces very low error. Later in
this paper, an explanation will be given of the motivation
for this correlation measure and a pictorial description of
it will also be provided.
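For reference, the sketch below evaluates PC(u, v) directly from two users' rating vectors over their n common movies. It assumes that ubar and vbar denote the users' mean ratings over those movies and that a and b are the tunable parameters above; the vertical P-tree evaluation, which obtains the same inner products from root counts, is not shown.

# Sketch of PC(u, v) evaluated directly from two users' rating vectors over
# the n movies they have in common. Assumes ubar and vbar are the users'
# mean ratings over those movies; a and b are the tunable parameters.
from math import sqrt

def pariseau_correlation(u, v, a=1.0, b=1.0):
    n = len(u)                       # number of selected movies
    ubar = sum(u) / n
    vbar = sum(v) / n
    d = [(vi - ui) - (vbar - ubar) for ui, vi in zip(u, v)]
    inner = sum(di * di for di in d)         # (v-u-vbar+ubar) o (v-u-vbar+ubar)
    return sqrt(inner / (a + n * b)) ** 2    # squared, as in the formula above

# Illustrative ratings of two users on five common movies (values made up).
u = [5, 3, 4, 2, 4]
v = [4, 3, 5, 1, 4]
print(pariseau_correlation(u, v, a=1.0, b=0.5))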
3. CLASSIFICATION
3.1 Closed Nearest Neighbor Classification
Most of the classification algorithms applicable to
pattern recognition problems are based on the k-nearest
neighbor (KNN) rule [24]. Each training data vector is labeled by the class it belongs to and is treated as a reference vector. During classification, the k nearest reference vectors to the unknown (query) vector X are found, and the class of vector X is determined by a 'majority rule' vote. Restricting the Nearest Neighbor set to exactly k can lead to
wrong classifications if there are many ties for the kth
neighbor. In a classical KNN approach the nearest
neighborhood is not closed due to algorithmic
complexities of finding all neighbors within a given
distance using one horizontal pass of the dataset (it takes
two passes, one to produce exactly k nearest neighbors
and therefore the distance of the kth nearest neighbor, ε,
then another pass to find all additional neighbors also at a
distance of ε from the unclassified sample).
Using P-tree vertical structures, however, all these kth
nearest neighbor ties can be included at no additional
processing cost using the ‘closed k Nearest Neighbor’
algorithm [12]. Rather than allowing a uniform vote for each neighbor, Gaussian-like weighted votes were used. The Gaussian weighting is based on the distance from the unclassified sample to the voting training point: a training sample closer to the test sample has a higher influence on the final classification outcome than a sample that is further away. The vote contribution is further adjusted by a weighting derived from the Pariseau Correlations of both the movies (relative to the movie whose rating is to be predicted) and the users (relative to the user whose rating is to be predicted), calculated as above.
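A hedged sketch of this voting step is given below: each neighbor in the closed nearest-neighbor set votes for its rating with a Gaussian weight on its distance to the sample, scaled by a correlation-derived weight. The sigma parameter and the multiplicative use of the correlation weight are illustrative assumptions, not the authors' exact tuning.

# Sketch of Gaussian-weighted nearest-neighbor voting over a closed neighbor
# set. 'neighbors' is a list of (distance, rating, correlation_weight)
# tuples; sigma and the multiplicative correlation_weight are illustrative.
from collections import defaultdict
from math import exp

def weighted_vote(neighbors, sigma=1.0):
    votes = defaultdict(float)
    for distance, rating, correlation_weight in neighbors:
        gaussian = exp(-(distance ** 2) / (2 * sigma ** 2))
        votes[rating] += gaussian * correlation_weight
    # predicted class (rating) is the one with the largest weighted vote total
    return max(votes, key=votes.get)

# Illustrative closed neighbor set (ties at the k-th distance are all kept).
neighbors = [
    (0.5, 4, 0.9),
    (0.7, 4, 0.8),
    (0.7, 3, 0.6),   # tie at the same distance is included, not dropped
    (1.2, 5, 0.4),
]
print(weighted_vote(neighbors, sigma=1.0))   # -> 4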
4. PARAMETER OPTIMIZATION &
CLASSIFICATION RESULTS
The multiple parameters involved in the classification
algorithms were optimized via a hill-climbing algorithm.
The method has some similarity to the winning method applied by one of the authors to the Association for Computing Machinery Knowledge Discovery and Data Mining Cup in 2006 (KDDCup06) [16], in which the Negative Predictive Value, NPV, was calculated as TN/(TN+FN). That fitness function encourages solutions with a high TN count, provided that NPV stays within a threshold value. Although the task-specified threshold for NPV was 1.0, with the very low number of negative cases it was hard to expect multiple solutions that met the actual NPV threshold while also maintaining a high TN level. A genetic algorithm (GA) was used, in which collections of quality solutions in each generation potentially influence the set of solutions in the next generation.
Since the training data set in the KDDCup case was small, patient-level bootstrapping was used to validate solutions. Bootstrapping here means taking out one sample from the training set for testing and repeating until all samples have been used for testing. In this specific task, which involves multiple candidates for the same patient, we removed all candidates for a given patient from the training set when one of them was used as a test sample. The overall approach of the solution is summarized in Figure 1. Attribute relevance analysis is carried out on the training data to identify the top M attributes. Those attributes are subsequently used to optimize the parameter space with a standard Genetic Algorithm (GA) [9]. The set of weights on the parameters represented the solution space that was encoded for the GA. Multiple iterations with feedback from the solution evaluation are used to arrive at the best solution. Finally, the optimized parameters are used in the combined classification algorithm to classify the unknown samples.
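A minimal sketch of such a parameter search appears below, with a simple mutation-based hill climb standing in for the full GA and an NPV-style fitness in the KDDCup06 spirit; the encoding, step size, and iteration budget are illustrative assumptions rather than the authors' actual configuration.

# Sketch of a simple hill-climbing / mutation-based parameter search for the
# classifier's weight vector. Fitness function, step size, and iteration
# budget are illustrative assumptions, not the authors' configuration.
import random

def npv_style_fitness(tn, fn, npv_threshold=1.0):
    """Example fitness in the KDDCup06 spirit: reward a high TN count only
    when NPV = TN / (TN + FN) meets the threshold."""
    npv = tn / (tn + fn) if (tn + fn) else 0.0
    return tn if npv >= npv_threshold else 0.0

def hill_climb(evaluate, num_params, iterations=1000, step=0.1):
    """evaluate(weights) -> fitness; higher is better."""
    weights = [random.random() for _ in range(num_params)]
    best = evaluate(weights)
    for _ in range(iterations):
        candidate = [max(0.0, w + random.uniform(-step, step)) for w in weights]
        score = evaluate(candidate)
        if score >= best:            # accept equal or better solutions
            weights, best = candidate, score
    return weights, best

# Illustrative use with a toy evaluation function standing in for running the
# full Nearest Neighbor Vote classifier on held-out data.
toy_evaluate = lambda w: -sum((x - 0.5) ** 2 for x in w)
best_weights, best_score = hill_climb(toy_evaluate, num_params=4)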
Figure 1. Overview of the Classification Approach. (Diagram: training data feeds user and movie relevance analysis, which selects the top-K Pariseau-correlated users and movies; these parameters are heuristically optimized by hill climbing on accuracy; the resulting Nearest Neighbor Vote classification algorithm is applied to the test data to produce the classified results.)
Attributes with zero weights at the end of the optimization run were omitted. The parameters optimized with the training data were able to achieve over 19% improvement over the industry-leading Netflix CineMatch algorithm. For a given training data set, once the parameter values are optimized they can be reused on unknown test data until the training data set changes.
The solution quality of this approach is evident from its performance against the industry-leading algorithm, Netflix's CineMatch. This 19% improvement compares with the maximum improvement achieved by any of the other nearly 30,000 Netflix Prize competitors (approximately 8.7% as of November 19, 2007), according to the Netflix Leader Board. To further demonstrate the quality of the proposed solution, we applied it to the well-known Fisher's Iris data set [26]. In the linearly separable (easy) classification case, our approach scores very high, as expected. In the linearly non-separable (hard) classification case, it also scores very high.
5. CONCLUSIONS AND FUTURE WORK
In this paper we describe a successful approach to Business Intelligence classification. The approach takes into consideration typical Business Intelligence training data set (BIDS) characteristics, such as large and extremely unbalanced classes, noisy class labels, and a large number of potential descriptive features, many of them irrelevant or redundant. The approach also takes into consideration the requirement for extremely high performance thresholds in proposed systems in order to achieve acceptance. The approach described in this paper was successfully used to submit an entry for the Netflix Problem as described on the company website, netflix.com.
In this classifier, a variant form of Nearest Neighbor
Vote based classification was combined with a vertical
data structure (Predicate Tree or P-tree) for efficient
processing. The use of a vertical data structure (P-tree)
provided acceptable computation times and very high
accuracy for this approach.
Further research and testing is ongoing to produce a
classifier system that can be used to solve any Business
Intelligence or related problem in day-to-day practice.
6. REFERENCES
[1] Abidin, T., and Perrizo, W., SMART-TV: A Fast and Scalable Nearest Neighbor Based Classifier for Data Mining. Proceedings of the 21st ACM Symposium on Applied Computing, Dijon, France, April 2006.
[2] Abidin, T., Perera, A., Serazi, M., and Perrizo, W., Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets. CATA-2005, New Orleans, 2005.
[3] Ding, Q., Khan, M., Roy, A., and Perrizo, W., The P-tree Algebra. Proceedings of the ACM Symposium on Applied Computing, pp. 426-431, 2002.
[4] Bandyopadhyay, S., and Murthy, C.A., Pattern Classification Using Genetic Algorithms. Pattern Recognition Letters, Vol. 16 (1995) 801-808.
[5] Cost, S., and Salzberg, S., A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning, 10, 57-78, 1993.
[6] DataSURG, P-tree Application Programming Interface Documentation, North Dakota State University. http://midas.cs.ndsu.nodak.edu/~datasurg/ptree/
[7] Ding, Q., Ding, Q., and Perrizo, W., ARM on RSI Using P-trees. Pacific-Asia KDD Conference, pp. 66-79, Taipei, May 2002.
[8] Duch, W., Grudziński, K., and Diercksen, G., Neural Minimal Distance Methods. World Congress of Computational Intelligence, May 1998, Anchorage, Alaska, IJCNN'98 Proceedings, pp. 1299-1304.
[9] Goldberg, D.E., Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley, 1989.
[10] Guerra-Salcedo, C., and Whitley, D., Feature Selection Mechanisms for Ensemble Creation: A Genetic Search Perspective. Data Mining with Evolutionary Algorithms: Research Directions - Papers from the AAAI Workshop, 13-17, Technical Report WS-99-06, AAAI Press, 1999.
[11] Jain, A.K., and Zongker, D., Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, February 1997.
[12] Khan, M., Ding, Q., and Perrizo, W., k-Nearest Neighbor Classification on Spatial Data Streams Using P-trees. Advances in KDD, Springer Lecture Notes in Artificial Intelligence, LNAI 2336, 2002, pp. 517-528.
[13] Khan, M., Ding, Q., and Perrizo, W., K-Nearest Neighbor Classification of Spatial Data Streams Using P-trees. Proceedings of the PAKDD, pp. 517-528, 2002.
[14] Krishnaiah, P.R., and Kanal, L.N., Handbook of Statistics 2: Classification, Pattern Recognition and Reduction of Dimensionality. North Holland, Amsterdam, 1982.
[15] Kuncheva, L.I., and Jain, L.C., Designing Classifier Fusion Systems by Genetic Algorithms. IEEE Transactions on Evolutionary Computation, Vol. 33 (2000) 351-373.
[16] Lane, T., ACM Knowledge Discovery and Data Mining Cup 2006. http://www.kdd2006.com/kddcup.html
[17] Martin-Bautista, M.J., and Vila, M.A., A Survey of Genetic Feature Selection in Mining Issues. Proceedings of the Congress on Evolutionary Computation (CEC-99), Washington D.C., July 1999, 1314-1321.
[18] Perera, A., Abidin, T., Serazi, M., Hamer, G., and Perrizo, W., Vertical Set Square Distance Based Clustering without Prior Knowledge of K. 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE'05), Toronto, Canada, 2005.
[19] Perera, A., Denton, A., Kotala, P., Jockheck, W., Valdivia, W., and Perrizo, W., P-tree Classification of Yeast Gene Deletion Data. SIGKDD Explorations, Volume 4, Issue 2, December 2002.
[20] Perera, A., and Perrizo, W., Vertical K-Median Clustering. Proceedings of the 21st International Conference on Computers and Their Applications (CATA-06), March 23-25, 2006, Seattle, Washington, USA.
[21] Punch, W.F., Goodman, E.D., Pei, M., Chia-Shun, L., Hovland, P., and Enbody, R., Further Research on Feature Selection and Classification Using Genetic Algorithms. Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 557-564, San Mateo, CA, 1993.
[22] Rahal, I., and Perrizo, W., An Optimized Approach for KNN Text Categorization Using P-trees. Proceedings of the ACM Symposium on Applied Computing, pp. 613-617, 2004.
[23] Raymer, M.L., Punch, W.F., Goodman, E.D., Kuhn, L.A., and Jain, A.K., Dimensionality Reduction Using Genetic Algorithms. IEEE Transactions on Evolutionary Computation, Vol. 4 (2000) 164-171.
[24] Serazi, M., Perera, A., Ding, Q., Malakhov, V., Rahal, I., Pan, F., Ren, D., Wu, W., and Perrizo, W., DataMIME. ACM SIGMOD, Paris, France, June 2004.
[25] Vafaie, H., and De Jong, K., Robust Feature Selection Algorithms. Proceedings of the IEEE International Conference on Tools with AI, Boston, Mass., USA, November 1993, 356-363.
[26] Fisher, R.A., The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7, pp. 179-188, 1936.