Social Media, Anonymity, and Fraud:
HP Forest Node in SAS® Enterprise Miner™
Taylor K. Larkin, The University of Alabama, Tuscaloosa, Alabama
Denise J. McManus, The University of Alabama, Tuscaloosa, Alabama
ABSTRACT
With an ever-increasing flow of information from the Web, droves of data are constantly being introduced into the
world. While the internet provides people with an easy avenue to purchase their favorite products or participate in
social media, the anonymity of users can make the internet a breeding ground for fraud, distribution of illegal material,
or communication between terrorist cells. In response, areas such as author identification through online writeprinting
are increasing in popularity. Much like a human fingerprint, writeprinting quantifies the characteristics of a person’s
writing style in ways such as the frequency of certain words, the lengths of sentences and their structures, and the
punctuation patterns. Translating a person's posts into quantifiable inputs makes it possible to apply
advanced machine learning tools to predict the identity of malicious users. To demonstrate prediction
on this type of problem, the Amazon Commerce Reviews dataset from the UCI Machine Learning Repository is used.
This dataset consists of 1,500 observations, representing 30 product reviews from 50 authors, and 10,000 numeric
writeprint variables. Given that the target variable has 50 classes, only select models are adequate for this
classification task. Fortunately, with the help of the HP Forest node in SAS® Enterprise Miner™, we can apply the
popular Random Forest model very efficiently, even in a setting where the number of predictor variables is much
larger than the observation count. Results show that the HP Forest node produces the best results compared to the
HP Neural, MBR, and Decision Tree nodes.
INTRODUCTION
In today’s society, the internet plays a pivotal role in everyday life. It has a variety of uses, whether it be for
communicating with business leaders across the world or something as mundane as paying bills. The ease of use
encourages individuals to be more involved in online shopping and social media. Despite the wealth of information the
internet presents, it is no stranger to misuse. Specifically in the communications realm, malicious users can spread
illegal materials, such as pirated software and child pornography, via online messaging. Moreover, terrorist
organizations remain active in this type of transmission (Li, Zheng, & Chen, 2006). Therefore, it is desirable to
demystify online anonymity for the purpose of identifying cyber criminals and terrorist cells. This can be accomplished
empirically through analyzing an individual’s writeprint.
Much like a fingerprint, a writeprint characterizes the writing style by identifying specific attributes. Examples of this
include (Liu, Liu, Sun, & Liu, 2011):
• Word and sentence length
• Vocabulary richness
• Occurrence of special characters and punctuation
• Character n-gram frequencies
• Types of words used and misspellings
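As a rough illustration of how attributes like those above can be quantified, the sketch below computes a handful of simple stylometric features from a passage of text. These five features are a tiny, hypothetical subset; the actual dataset used later contains 10,000 writeprint variables.

```python
import re

def writeprint_features(text):
    """Compute a few simple stylometric features from a passage of text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        # Lexical features: average word length and average sentence length
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sent_len": n_words / len(sentences),
        # Vocabulary richness: distinct words over total words
        "type_token_ratio": len(set(w.lower() for w in words)) / n_words,
        # Punctuation usage: commas and exclamation marks per word
        "comma_rate": text.count(",") / n_words,
        "exclaim_rate": text.count("!") / n_words,
    }

feats = writeprint_features("I love this product! Truly, it works. I use it daily.")
```

A full writeprint system would add many more features of this kind (character n-grams, misspelling counts, and so on), but each is ultimately reduced to a numeric column in the same way.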
The goal of this author identification system is to “identify a set of features that remain relatively constant among a
number of writings by a particular author” (Li, Zheng, & Chen, 2006, p. 78). Due to the variety of possible
characteristics to include in an individual’s writeprint, these datasets can quickly become high dimensional (i.e. where
the number of predictor variables is much larger than the number of observations). Thus, in order to perform
classification tasks, powerful algorithms such as Random Forests (RFs) are necessary to handle the nature of these
datasets. These models can be executed efficiently through SAS® products such as Enterprise Miner™.
This paper first introduces the concept of RFs and its implementation, the HP Forest node, in SAS® Enterprise
Miner™ 13.1. Next, it demonstrates the HP Forest node in action on a real, high-dimensional writeprint dataset
along with some other model nodes for comparison. Finally, it discusses the predictive performances and
computational expenses of each node within Enterprise Miner™ on the dataset.
WHAT ARE RANDOM FORESTS?
Classification and Regression Trees (CART) (Breiman, Friedman, Olshen, & Stone, 1984) are a very popular
technique for analyzing data due to their practical interpretations. They can identify predictive splits on predictor
variables in a dataset using binary recursive partitioning. In deciding these partitions, maximizing the information gain
according to the Gini impurity criterion is typically used to determine which predictor variables to use. Some
advantages to using these tree models are that they assume no distribution, produce interpretable logic rules, can
handle varying scales and types of data as well as missing values, and are able to capture complex interactions
(Lewis, 2000).
However, these tree models can be highly variable, which can induce over-fitting (Hastie, Tibshirani, & Friedman,
2009). In other words, small changes in the data can drastically change the tree structure since it is a hierarchical
process. To remedy this issue, Breiman (1996) suggests it is advantageous to perform bagging, or bootstrap
aggregation, in conjunction with CART. That is, train many trees independently on bootstrap samples (sampling with
replacement) of the same size as the training data and average the individual predictions. To improve upon this
procedure, Breiman (2001) introduces randomness to posit the idea of a RF. RFs grow a plethora of trees that
choose a random subset of predictor variables from all those available at each node in each tree to search for the
best split. This randomization promotes a forest of diverse tree models, which is necessary for accurate predictions
(Breiman, 2001). Besides increasing performance, another advantage to these models is the ability to effectively
estimate the error rate simultaneously during model training through the use of out-of-bag (OOB) observations. OOB
observations are comprised from the observations not used in the sampling for generating each tree’s training data.
Running the OOB observations through each constructed tree and aggregating the error rate across all the trees in
the forest yields very similar error to that from cross-validation (Hastie, Tibshirani, & Friedman, 2009). Although RFs
are typically viewed as a “black box” approach, variable importance rankings can be extracted. However, the
variable selection procedure in RFs can be biased when the predictor variables differ in their number of categories
or scale of measurement, and the bootstrap aggregation process itself can affect the reliability of these variable
importance estimates (Strobl, Boulesteix, Zeileis, & Hothorn, 2007); thus, subsampling (sampling without
replacement) may be preferred in scenarios where inference is important.
Nevertheless, RFs have become incredibly popular in both academic literature and in practice because they inherit
many of the benefits of decision trees while boosting predictive performance.
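The bagging, random-subset, and OOB ideas described above can be sketched with scikit-learn's Random Forest implementation. This is a stand-in for illustration, not the SAS implementation, and the data here is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 300 observations, 50 predictors, 3 classes
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees grown on bootstrap samples
    max_features="sqrt",   # random subset of predictors tried at each split
    oob_score=True,        # estimate accuracy from out-of-bag observations
    random_state=0,
).fit(X, y)

# The OOB error is computed during training, with no separate validation set,
# and behaves similarly to a cross-validation estimate
oob_error = 1 - forest.oob_score_

# Variable importance rankings can still be extracted from the "black box"
importances = forest.feature_importances_
```

Note that these importances are the Gini-based kind and are subject to the biases discussed above when predictors differ in scale or number of categories.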
METHODS
To construct RFs via SAS®, the HP Forest node within Enterprise Miner™ can be used. This implementation
improves upon the original RF algorithm in a couple of ways. First, the training data for each tree is subsampled as
opposed to being bootstrapped. Second, instead of searching for the best splitting rule in the subset of randomly
selected predictor variables, the HP Forest node preselects the predictor variable in the subset with the greatest
association with the target variable, as determined by a statistical test, and then finds that variable's best
splitting rule (Maldonado, Dean, Czika, & Haller, 2014). This helps reduce the chance of producing a spurious split
and decreases the biases introduced by an exhaustive search over the Gini information gain, thus leading to more
reliable predictions on new data and more reliable variable importance rankings. The method of using these tests of association for
determining splitting is similar to the RF variant proposed by Strobl, Boulesteix, Zeileis, and Hothorn (2007) which can
produce unbiased variable selection in each tree of the forest.
The three main parameters to tune in HP Forest are highlighted in red in Figure 1 with their default values. The
Maximum Number of Trees, Number of vars to consider in split search, and Proportion of obs in each sample
correspond to the number of trees constructed in the forest, the number of predictor variables to be randomly
selected at each node, and the number of observations dedicated to each tree’s training data given by the
subsampling, respectively. A numerical value for the Number of vars to consider in split search is not listed
because it varies for each dataset. As is typical for RFs, the square root of the number of predictor variables is a
suitable default. Tuning this parameter controls the amount of correlation between the trees. Smaller values yield less
correlation in the trees; on the other hand, if only a few predictor variables are informative for the target, then this
value should be larger to increase the chances of these being selected. Unlike Maximum Number of Trees, which has
little impact on predictions once a sufficient number of trees is grown, this parameter will likely lead to
differences in predictive performance.
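The effect of this parameter can be explored directly against the OOB error. The sketch below, again a scikit-learn stand-in on synthetic data rather than Enterprise Miner™, compares the sqrt(p) default for mtry against a larger value:

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only a few of the 100 predictors are informative
X, y = make_classification(n_samples=300, n_features=100, n_informative=5,
                           random_state=1)

p = X.shape[1]
default_mtry = int(math.sqrt(p))  # the usual sqrt(p) default

# Compare the default mtry with a larger value using the OOB error;
# with few informative predictors, a larger mtry raises the chance that
# an informative variable is in the subset considered at each split
oob_errors = {}
for mtry in (default_mtry, 4 * default_mtry):
    rf = RandomForestClassifier(n_estimators=300, max_features=mtry,
                                oob_score=True, random_state=1).fit(X, y)
    oob_errors[mtry] = 1 - rf.oob_score_
```

Because the OOB error comes free with training, this kind of comparison does not require a held-out tuning set.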
Figure 1. Main Sections of HP Forest Node's Properties Panel
DATA AND EXPERIMENTAL PROCEDURE
To investigate the performance of the HP Forest node for author identification tasks, the writeprint dataset, Amazon
Commerce Reviews, is used, which can be found on the UCI Machine Learning Repository (Lichman, 2013). This dataset
represents 30 product reviews from 50 active users. Accompanying these observations, 10,000 writeprint variables
are constructed from the text of the product reviews (see (Liu, Liu, Sun, & Liu, 2011) for more detail). Given the large
number of target classes (in this case 50), only certain models can be utilized for prediction.
Figure 2 shows the Enterprise Miner™ diagram constructed. Along with the HP Forest node, the HP Neural, MBR
(Memory Based Reasoning), and Decision Tree nodes are also executed for comparison. These are all left at their
default settings. Three versions of the HP Forest node are investigated. Let ntree denote Maximum Number of Trees
and mtry denote Number of vars to consider in split search. The three versions include:
1. ntree = 50; mtry = 100 (default)
2. ntree = 500; mtry = 100
3. ntree = 5,000; mtry = 1,000
Figure 2. Enterprise Miner Diagram for Experimentation
To estimate the misclassification error for the comparison models, a random, stratified split is conducted with 70%
allocated for training (1,018 observations) and 30% for testing. Capitalizing on the OOB feature for RFs, the Number
of obs in each sample parameter is set to 1,018 so that each tree in the forest is trained on the same number of
observations as the comparison models.
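A stratified split of this kind can be sketched as follows, where the feature matrix is random stand-in data rather than the actual 10,000 writeprint variables:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the writeprint matrix: 50 authors (classes)
# with 30 reviews each, and far fewer columns than the real 10,000
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 100))
y = np.repeat(np.arange(50), 30)

# Stratification keeps the author proportions equal in train and test,
# so each author contributes 21 of 30 reviews to training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=0)
```

Note that an exact 70% stratified split of this 1,500-row stand-in yields 1,050 training rows; the precise count depends on how the splitting routine rounds within strata.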
RESULTS
Table 1 displays the misclassification rates for each model tested. Each of the RF models delivers a lower error
compared to the comparison models. Increasing ntree boosts performance at the default level of mtry; however,
drastically increasing both ntree and mtry yields little change in error relative to the default setting on this
dataset, at a much larger computational expense. Note that the processing time for a normal run of the Decision
Tree node is only a few seconds less than the default HP Forest node, which constructs 50 trees. Since the trees are
built independently of one another in an RF, this model can be run in parallel using multiple cores, thanks to the
SAS® High-Performance (HP) Data Mining framework within Enterprise Miner™ (Maldonado, Dean, Czika, & Haller,
2014).
Model           Misclassification Rate   Run Duration (Minutes:Seconds)
HP Forest 1     0.693                    2:51
HP Forest 2     0.483                    3:58
HP Forest 3     0.649                    40:38
HP Neural       0.923                    2:08
MBR             0.751                    1:01
Decision Tree   0.865                    2:36
Table 1. Predictive Performance and Computational Expense.
Figure 3 plots the change in misclassification rate as more trees are added to the best model. The decreases in
misclassification rate begin to level off at around 100 trees, indicating that it may not be necessary to construct so
many trees for this dataset. As expected, the misclassification rate given by the OOB observations is less optimistic
than when the RF is trained on all the data. This reinforces the necessity to always test predictive models on
independently held-out samples of data. Fortunately for RFs, this can be measured at the same time as model
training.
Figure 3. Misclassification Rate across Varying Numbers of Trees for Best Performing HP Forest Node.
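A curve of this kind can be reproduced outside Enterprise Miner™ by growing a forest incrementally and recording the OOB error after each batch of trees. The sketch below uses scikit-learn's warm_start mechanism on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           random_state=2)

# warm_start=True keeps the trees already grown and adds new ones on each
# fit, so the OOB error can be recorded at increasing forest sizes
rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=2)
oob_curve = []
for n in (25, 50, 100, 200):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    oob_curve.append((n, 1 - rf.oob_score_))
```

Plotting oob_curve typically shows the same shape as Figure 3: steep early gains that flatten once the forest is large enough.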
CONCLUSION
Since their introduction, RFs have become an increasingly popular data mining tool for prediction and inference. In
this study, the RF implementation in SAS® Enterprise Miner™, HP Forest, is used to predict the authorship of various
Amazon users based on their online writeprint from their product reviews. Because of the large number of target
classes, only certain models are adequate for this classification task. The HP Forest node outperforms the other
comparison models tested, with little increase in computational expense thanks to the high-performance environment
within Enterprise Miner™. As the internet becomes more prevalent in society, it becomes all the more important to
utilize the best analytical tools to prevent the distribution of illegal content and the communication of malicious
entities. Given the high-dimensional nature of these online writeprint datasets, RFs can be a data-driven solution to
help authorities identify authorship in online settings.
REFERENCES
• Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
• Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
• Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks.
• Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. New York: Springer.
• Lewis, R. J. (2000, May). An introduction to classification and regression tree (CART) analysis. In Annual meeting of the Society for Academic Emergency Medicine in San Francisco, California (pp. 1-14).
• Li, J., Zheng, R., & Chen, H. (2006). From fingerprint to writeprint. Communications of the ACM, 49(4), 76-82.
• Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
• Liu, S., Liu, Z., Sun, J., & Liu, L. (2011). Application of synergetic neural network in online writeprint identification. International Journal of Digital Content Technology and its Applications, 5(3), 126-135.
• Maldonado, M., Dean, J., Czika, W., & Haller, S. (2014). Leveraging ensemble models in SAS® Enterprise Miner™. In Proceedings of the SAS Global Forum 2014 Conference. Cary, NC: SAS Institute Inc.
• Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 1.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Name: Taylor Larkin
Enterprise: The University of Alabama
Address: Box 870226
City, State ZIP: Tuscaloosa, AL 35487
E-mail: [email protected]
Name: Denise McManus
Enterprise: The University of Alabama
Address: Box 870226
City, State ZIP: Tuscaloosa, AL 35487
Work Phone: 205-348-7571
E-mail: [email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.