Vol. 6, No. 9, September 2015
ISSN 2079-8407
Journal of Emerging Trends in Computing and Information Sciences
©2009-2015 CIS Journal. All rights reserved.
http://www.cisjournal.org
Assessing Real World Applications of Data Mining With SAS
Enterprise Miner (EM): A Technical Report for Teaching the
Big Data Generation
1 Chon Abraham (Corresponding Author), 2 Margaret Poston
1 Associate Professor, Mason School of Business, College of William and Mary, 101 Ukrop Way, Williamsburg, US
2 Graduate Student, College of William and Mary, US
1 [email protected]
ABSTRACT
The explosive growth in data collection across different industries has made it necessary to have processes that can infer useful information from this data in a limited amount of time. This report describes data mining techniques and illustrates the practicality of data mining using SAS Enterprise Miner (EM), a leading data analytics software application, through two separate data sets and case studies provided by SAS as part of a teaching series. As academic institutions continue to revamp and develop curricula to meet the challenges of educating the "big data" generation, an assessment and overview of leading industry tools helps educators contextualize problems for students. This paper is an attempt to do so.
Keywords: Big data, data mining, SAS Enterprise Miner, knowledge discovery in databases
1. INTRODUCTION
On a daily basis, a tremendous amount of data is collected and subsequently circulated. The amassment of large datasets has led to the fields of "Big Data" and "Big Data Analytics," where rapid analysis is made possible with the use of computational techniques [Chiang and Storey 2012]. Websites like Google collect and store information about web searches. Credit card companies store information about who uses credit cards; stores, both grocery and retail, collect information about what a person is purchasing, how much, and how often. The number of web pages indexed by Google [2015] exceeded 60 trillion this year, twice the 30 trillion unique URLs reported in 2013 [Koetsier 2013] and a major leap from the 1 trillion reported in 2008 [Fan and Bifet 2013]. Storing, say, 30 trillion unique web pages would require 100 million gigabytes (100,000 terabytes) of space. The images shared daily on the photo-sharing website Flickr would require 3.6 terabytes of storage space per day. On a daily basis 2.5 quintillion bytes of data are generated, and 90% of the data that exists in the world has been generated over the last two years [Wu et al. 2015]. The list of who is collecting data and what type of information is being collected is vast and growing daily.
Much of the information is collected to provide
insight into a specific question or concern. Data mining
is used to extract important descriptive and predictive
information from these warehouses by utilizing certain
tools and techniques.
However, a massive shortage of professionals who are skilled in data mining is changing the landscape of educational institutions [Noyes 2014]. Educators need insight into industry-grade tools and offerings for supporting classroom instruction, especially in the context of business. This paper is an attempt to provide an overview of SAS EM as a leading tool available to educators for teaching data mining. Cases and data are provided by the SAS educator teaching series; thus, the case description, overview of the data, and output could be replicated or published elsewhere by educators using the software. The novelty of what appears here is the additional insight provided from a graduate-student perspective and the contextualization of the impact of data mining applications in industry.
There are hundreds of software applications that use specific types of algorithms in order to extract information, predictions, and decisions from these large datasets. Examples of these tools are SAS Enterprise Miner, IBM Intelligent Miner, R, Unica Pattern Recognition Workbench (PRW), Watson Analytics, IBM SPSS Modeler, GhostMiner, SAP, SGI MineSet, Oracle Darwin, Angoss KnowledgeSEEKER, Weka, RapidMiner, and several others [Mikut and Reischl 2011]. However, SAS EM was selected because of its leading status and an academic alliance that makes the technology accessible for classroom use.
In the following sections, we describe what
data mining is and how it finds applications in the real
world. We also briefly describe the classification of data
mining techniques and algorithms. The application and
usage of SAS Enterprise Miner in different business
segments is discussed and two case studies are used to
illustrate the practical benefits of SAS Enterprise Miner
in analyzing datasets. Lastly, the issues and challenges faced by the data mining industry that are applicable for discussion in an analytics curriculum are considered.
2. FUNDAMENTALS OF DATA MINING
As the name indicates, data mining is the
process of retrieving useful information from large
amounts of data. The process of data mining helps in
recognizing patterns and trends in the data. When the data
becomes too big for manual analysis, automated mining comes into play [Kriegel et al. 2007]. Richard Watson [2013] describes data mining as
``the search for relationships and global patterns that exist
in large databases but are hidden in vast amounts of data''.
Data mining is often considered synonymous with Knowledge Discovery in Databases (KDD), where it is referred to as "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [Fayyad et al. 1996]. The process of KDD is broken into five phases: selection, preprocessing, transformation, data mining, and interpretation. A variation of these five steps is used by many of the data mining tools and techniques now popular in the market.
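The five KDD phases can be pictured as a chain of small functions over raw records. The sketch below is purely illustrative (the record fields, values, and function names are invented for the example, not drawn from any particular tool):

```python
# Illustrative sketch of the five KDD phases as composable functions.
# All field names and values are hypothetical.

def select(records, fields):
    """Selection: keep only the fields relevant to the question."""
    return [{f: r[f] for f in fields} for r in records]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: derive a per-record feature (here, total spend)."""
    for r in records:
        r["total"] = r["price"] * r["qty"]
    return records

def mine(records):
    """Data mining: a trivial descriptive step -- average total spend."""
    return sum(r["total"] for r in records) / len(records)

def interpret(pattern):
    """Interpretation: turn the mined pattern into a readable statement."""
    return f"average basket value: {pattern:.2f}"

raw = [
    {"price": 2.0, "qty": 3, "store": "A"},
    {"price": 5.0, "qty": None, "store": "B"},
    {"price": 4.0, "qty": 1, "store": "A"},
]
result = interpret(mine(transform(preprocess(select(raw, ["price", "qty"])))))
print(result)  # averages the two complete records: "average basket value: 5.00"
```

Real tools iterate over these phases rather than running them once, but the linear flow captures the basic idea.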
2.1 Practical Applications of Data Mining Across
Industries
The edge of data mining over the traditional
statistical approaches is in the amount of data this
relatively new branch of computer science can handle.
While statistical tests may be able to give a significant result with just 100 entries, data mining techniques may require millions or billions of records to decipher a useful pattern [Matignon and SAS Institute 2007].
Nearly all sectors and industries have benefited
from the use of this technology. Its role in scientific and engineering applications has long been established; data mining has proven its worth in the fields of biology, chemistry, physics, remote sensing, and astronomy [Grossman 2001].
In healthcare and medicine, mining of clinical records, biomedical research, and other experimental results has been used to make policy changes and has helped in defining scientific hypotheses for further analysis [Yoo et al. 2012].
Text mining of product reviews, such as those
available on Amazon.com, has also been found useful in
analyzing user preferences and in predicting future
changes in sales [Archak et al. 2011]. Additionally, data
mining techniques have also been applied in predicting
sales of movies by mining their online reviews [Yu et al.
2012].
Data mining also finds its applications in the
field of education where an explosive growth in
educational data has led to the use of mining techniques
in order to make managerial decisions [Kumar and
Chadha 2011]. In the field of higher education, data
science has been used to improve students’ learning
activities and course development [Kumar and Chadha
2011].
2.2 Data Driven Decision Making In Marketing
There are numerous applications of data-driven
decision making that have been used in industry for various purposes. The most common is the use of data science by online advertising agencies to serve targeted ads. These companies handle billions of ad
impressions in a day and process these to make decisions
in milliseconds.
In the past decade, the most effective and
revolutionizing use of web mining has been carried out by
the online retailer Amazon. By introducing cookie
tracking to monitor the user’s browsing habits, Amazon
has been able to match up product ads to the person’s
liking and preferences[Broderick and Grinberg 2013].
Even more recently, in 2013 Amazon made use of data
mining to choose which TV show to produce from a
group of 14 pilot episodes for its new video streaming
service.
The online retailer collected data from one
million viewers of the pilot episodes which included
specifics like, viewing patterns, comments on video,
ratings and number of shares to assess popularity of the
TV shows [Sharma 2013]. Based on the analysis of this
data, Amazon chose to produce Alpha House out of
thousands of show ideas. The TV series went on to get a
7.5/10 rating on IMDb and 8/10 on TV.com.
A classic example of using mining techniques to predict consumer behavior is Wal-Mart's clever use of them in 2004. A New York Times article explains that Wal-Mart used information uncovered by data mining to make inventory decisions prior to the landfall of Hurricane Frances in 2004. Data mining revealed that Pop-Tarts and beer were among the top sellers prior to a hurricane's arrival. The company later indicated that those items sold out more rapidly than usual [Hays 2004; Provost and Fawcett 2013].
The strategy of database marketing, as used by Wal-Mart, has been applied by other retailers, like Macy's, which tracks credit card usage to collect data.
The collected information is used to develop offers of
special discounts benefitting frequent shoppers. Target
and Wawa are among the stores that use credit cards to
collect data about their customers. Companies even share
or sell customer information.
Airlines, car rental
companies, and hotel chains, for instance, allow the use
of member numbers for discounts between one another.
Companies share information about when customers
travel, the duration of the travel, and the type of travel.
These companies can use this information to create travel
packages for a specific type of customer or individual.
Casinos are using data mining in order to ensure
patron loyalty. Harrah’s, one of the largest and most
successful casino chains, has continued to be successful
as a result of their loyalty rewards program. This
program allows customers to identify themselves at
stations all over the casino in order to earn points. These
points “can bring them a stream of ‘comps’ - small
complimentary gifts, such as meals and free hotel rooms”
[Schofield 2004].
2.3 Classification of Data Mining Algorithms
There are two basic categories of data mining
algorithms: one is descriptive which is also called
unsupervised learning and the other is predictive that is
also referred to as supervised learning[Ye 2013].
Descriptive or unsupervised methods work by measuring
similarity between objects to establish relationships
whereas predictive/supervised methods first infer
prediction rules which are then applied on unclassified
data. Clustering, association analysis, sequence discovery
and summarization are types of descriptive methods
whereas classification, regression, time series analysis,
and prediction methods are part of predictive
algorithms[Anderson 2012]. These algorithms are used in
conjunction with statistical methods and visualization
techniques. Some of the common methods for this
purpose are decision trees, genetic algorithms, k-means
clustering and regression techniques.
A brief overview of some of these algorithms is given
below:
a. Association Analysis

An association analysis "identifies affinities existing among the collection of items in a given set of records". Association analysis is often referred to as market basket analysis or affinity analysis. Association rules take the form Set A → Set B; that is, the presence of the items in Set A implies that the items in Set B are also in the transaction.
In order to determine the strength of an association, the support and the confidence of each rule are determined. The support of a rule is the probability that the items in the two sets (on each side of the rule) occur together. Equation 1 shows how the support of A → B is calculated:

support(A → B) = (number of transactions containing every item in A and B) / (total number of transactions)

Equation 1: Support Calculation
While the confidence of A  B is the
probability of a transaction containing the items in set B
given that it contains the items in set A. Equation 2
shows how the confidence of A  B is calculated.
numberof transactio
ns containingeveryitemin A andB
thetransactio
n containstheitemsin A
Equation 2: Confidence Calculation
One important thing to note is that cause and
effect is not implied by high levels of support and
confidence. In fact, there could potentially be no
correlation between the two sets of interest. Also, the
term confidence does not maintain the same meaning as
in statistics.
Other measurements of the strength of an association are the expected confidence and the lift. The expected confidence of a rule treats each side of the rule as an independent event. Consequently, the expected confidence of the rule A → B is calculated by dividing the number of transactions that include Set B by the total number of transactions. The lift of A → B is a measure of association between A and B: a lift greater than 1 indicates a positive correlation, a lift less than 1 indicates a negative correlation, and a lift equal to 1 indicates no correlation. The lift is calculated by Equation 3:

lift(A → B) = (confidence of the rule) / (expected confidence of the rule)

Equation 3: Lift of a Rule
An association analysis is used to determine if
the purchase of an item implies that another item will also
be purchased. Suppose there are four baskets with three items in each, as in Table 1.

Table 1: Items in grocery baskets

Basket 1: A, B, C
Basket 2: B, C, D
Basket 3: A, C, D
Basket 4: A, D, E
A few rules and their strength measures derived from the baskets are illustrated in Table 2. Notice that the support of a rule and the lift of a rule are symmetric: the support of Rule 1 and the support of Rule 2 are equal. The confidence, however, is not symmetric.

Table 2: Association rules

Rule                       Support  Confidence  Expected Confidence  Lift
Rule 1: Item A → Item E    1/4      1/3         1/4                  4/3
Rule 2: Item E → Item A    1/4      1/1         3/4                  4/3
Rule 3: Item A → Item C    2/4      2/3         3/4                  8/9
Rule 4: Item B → Item C    2/4      2/2         3/4                  4/3
Rule 5: Item D → Item E    1/4      1/3         1/4                  4/3
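The arithmetic in Table 2 can be checked mechanically. The sketch below (plain Python, not SAS EM output) recomputes support, confidence, and lift for some of the rules from the four baskets in Table 1, using exact fractions:

```python
# Recompute the association-rule metrics for the Table 1 baskets.
from fractions import Fraction

baskets = [{"A", "B", "C"}, {"B", "C", "D"}, {"A", "C", "D"}, {"A", "D", "E"}]
n = len(baskets)

def support(lhs, rhs):
    """P(transaction contains every item in lhs and rhs) -- Equation 1."""
    both = sum(1 for b in baskets if lhs | rhs <= b)
    return Fraction(both, n)

def confidence(lhs, rhs):
    """P(rhs items present | lhs items present) -- Equation 2."""
    both = sum(1 for b in baskets if lhs | rhs <= b)
    with_lhs = sum(1 for b in baskets if lhs <= b)
    return Fraction(both, with_lhs)

def lift(lhs, rhs):
    """confidence divided by expected confidence -- Equation 3."""
    expected = Fraction(sum(1 for b in baskets if rhs <= b), n)
    return confidence(lhs, rhs) / expected

# Rule 3: Item A → Item C
print(support({"A"}, {"C"}))     # 1/2 (the 2/4 in Table 2)
print(confidence({"A"}, {"C"}))  # 2/3
print(lift({"A"}, {"C"}))        # 8/9
```

Running the same three functions over the other rules reproduces every entry in Table 2, including the asymmetry of confidence between Rules 1 and 2.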
b. Sequential Patterns

Identifying sequential patterns involves detecting "frequently occurring sequences from given records". A sequence analysis is much like an association analysis but includes a time dimension. A sequence
analysis is performed if a company wants to know the
order in which a customer bought items. For instance,
customers might buy gloves, hats, and scarves on separate
trips to the store. Sequential patterns will tell the store if
there is a particular order in which customers make their
purchases.
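At its simplest, the glove/hat/scarf question reduces to counting, per customer, how often one item is bought on an earlier trip than another. The sketch below uses invented trip records and is not SAS EM's sequence algorithm:

```python
# Minimal sequence-counting sketch; the (customer, trip, item) data is hypothetical.
from collections import Counter
from itertools import combinations

trips = [
    ("c1", 1, "gloves"), ("c1", 2, "hat"), ("c1", 3, "scarf"),
    ("c2", 1, "gloves"), ("c2", 2, "hat"),
    ("c3", 1, "hat"),    ("c3", 2, "gloves"),
]

# Group each customer's purchases in trip order.
by_customer = {}
for cust, trip, item in sorted(trips):
    by_customer.setdefault(cust, []).append(item)

# Count ordered pairs (earlier item, later item) across customers.
pair_counts = Counter()
for seq in by_customer.values():
    for a, b in combinations(seq, 2):  # combinations preserves trip order
        pair_counts[(a, b)] += 1

print(pair_counts[("gloves", "hat")])  # 2 customers bought gloves before a hat
print(pair_counts[("hat", "gloves")])  # 1 customer bought them in reverse order
```

Unlike the association counts above, the pair counts here are order-sensitive, which is exactly the time dimension a sequence analysis adds.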
c. Classifying

Classifying involves separating "predefined classes…into mutually exclusive groups". Classifications are done before an analysis is performed. For instance, if a regression were to be performed on data representing factors that influence college tuition rates, some classifications that could be made are based on the state in which a college is located and whether the school is private or public. Stores might classify customers
as frequent shoppers, occasional shoppers, or infrequent
shoppers. A frequent shopper might shop in a store once
a week; an occasional shopper might shop in a store once
a month; and an infrequent shopper might shop in the
store once a year. In a regression analysis, these variables
could be used as binary indicator variables.
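A minimal sketch of that binary indicator encoding, with hypothetical column names:

```python
# Encode the shopper-frequency classes as binary indicator (dummy) variables.
# Class and column names are hypothetical.

classes = ["frequent", "occasional", "infrequent"]

def to_indicators(shopper_class):
    """One binary column per class; exactly one is 1 for each shopper."""
    return {f"is_{c}": int(shopper_class == c) for c in classes}

print(to_indicators("occasional"))
# {'is_frequent': 0, 'is_occasional': 1, 'is_infrequent': 0}
```

In an actual regression, one indicator is typically dropped as the baseline category to avoid perfect collinearity among the columns.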
d. Clustering
Clustering is similar to classifying in that it
involves separating the data into classes. Clustering
identifies unknown classes, rather than predetermined
classes, within the data. A clustering analysis can also be
referred to as unsupervised classification or segmenting.
It can be useful in creating marketing strategies if it can uncover groups with unique profiles. It can also be used as a tool in developing predictive models. There are a number of methods for clustering, such as the k-means clustering algorithm.
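The assign-and-re-average loop at the heart of k-means can be sketched in a few lines. This one-dimensional toy is for intuition only and differs from the implementation in any production clustering node:

```python
# Bare-bones 1-D k-means sketch: alternate assignment and mean-update steps.

def kmeans_1d(points, centers, iters=20):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups, around 1 and around 10.
print(kmeans_1d([0.9, 1.1, 1.0, 9.8, 10.2, 10.0], [0.0, 5.0]))
```

The "unknown classes" the section describes are exactly these clusters: nothing in the input labels the two groups, yet the centers settle near 1 and 10.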
e. Predictive

Prediction does what the name implies; it predicts a "future value of a variable". Predictive models must:

 provide a rule to transform a measurement into a prediction;
 have a means of choosing useful inputs from a potentially vast number of candidates; and
 be able to adjust their complexity to compensate for noisy training data.

Predictive modeling is done using a variety of statistical techniques, including decision trees, neural networks, and regression.
3. DATA MINING WITH SAS
ENTERPRISE MINER (EM)
SAS EM is a leading platform used for mining data, especially in the advanced and predictive analytics market segment [Gartner 2015]. The software uses a modification of the KDD process in which there are five SEMMA steps: sample, explore, modify, model, and assess [Abell 2014; Al Ghoson 2010].
SAS Enterprise Miner has consistently been
ranked among the top ten most popular data analytics
tools for the past several years. The results of the 16th annual KDnuggets Software Poll (2015) ranked SAS Enterprise Miner as the ninth most popular tool used for real projects, taking 11.3% of the 2,900 votes [Piatetsky 2015].
SAS Enterprise Miner has its most solid standing in the advanced analytics market segment, where it dominates with 36.2% market share, according to IDC's 2012 report [Vesset et al. 2013]. SAS was again ranked as the top supplier in the advanced and predictive analytics market with 35.4% market share in IDC's report based on a 2013 survey [Vesset et al. 2014]. Among business analytics vendors, SAS has consistently ranked as the fifth-largest based on revenue generated in 2012 and 2013 [Vesset et al. 2013; Vesset et al. 2014]. In 2013 it controlled 6.3% of the market, behind Oracle, SAP, IBM, and Microsoft, whereas in 2012 its revenue made up 6.9% of the business analytics industry.
In the industry SAS Enterprise Miner has been
used for various applications of customer relationship
management (CRM) and is widely used in a variety of
commercial applications [Farooqi and Raza 2012]. SAS has also been used in mining customer value via direct marketing strategies [Wang et al. 2005]; here, predictive or incremental response models are applied to a particular group of customers to maximize the profit return while minimizing cost [Lee et al. 2013].
SAS has also found its usage in several fields,
for instance in the healthcare market, insurance company
Highmark was able to build a fraud detection system
based on real-time or near real-time analysis. The same
company also used SAS to construct a decision tree
model using patient symptoms, health history and
demographics that helped in maximizing revenues at the
company. Another health services company, Healthways, used predictive models in SAS Enterprise Miner to minimize healthcare costs [Yoo et al. 2012].
As mentioned earlier, SAS approaches data
mining in a five step SEMMA process. Given a large
dataset, the first step is to sample. During this step, a
subset of the dataset is extracted. This subset must be
“small enough to process” efficiently, yet large enough to
contain significant data. If the subset of data is too large,
the obvious issues of lack computer memory and lengthy
processing time are encountered. On the other hand, if
the subset is too small, the risk becomes that there is not
enough data to accurately identify relationships and
patterns. The procedure might identify a pattern that is
only relevant to the small subset of data selected rather
than the large subset.
Once an appropriate subset is selected, the next
step in the process is to explore the data. This entails
performing a preliminary investigation to identify trends
or relationships that might be of significance and worth
studying in more detail. The intention is that one
becomes familiar with the dataset.
The data is modified next. Modifying data
consists of “creating, selecting, and transforming the
variables”. These modifications are made with the
intention of building a model.
Modeling is the fourth step of this process. This
step entails taking the variables from the modification
step and applying “analytical tools” to ascertain a
relationship which accurately predicts a desired result.
No model will ever be perfect, but the goal is to create a
model which is as good as possible.
The final step is to assess the model. In this
stage, the model is evaluated to determine its relevance.
This step can involve comparing multiple models to
identify the best model for a specific situation.
SAS Enterprise Miner utilizes this process to
organize available tools and functions of the software.
SAS utilizes the functions and statistical techniques in
order to perform analysis on vast quantities of data.
4. CASE STUDIES
The following section provides two data mining
case studies using SAS Enterprise Miner that are offered
to educators free of charge along with access to the SAS
EM platform. The case study overviews and datasets are
provided by SAS EM as part of the teaching series to
demonstrate capabilities of the software [Matignon and
SAS Institute 2007]. Thus, the description of the case study, data sets, and output are freely used and can appear in any number of resources published by those employing the platform (e.g., http://mis.aug.edu/drjmatls/JMP%20Training%20Folders/Adv%20Analytics%20AAEM71%20July%202012/CA/Case%20Studies.doc). The first study, which is rather
simplistic, involves the usage of web site services by a
radio station’s listeners. The second study is about
enrollment management at a college.
Table 4 shows that the ID, or URL, has 1,586,124 observations corresponding to unique web users. The target, or web service, has only 8 different levels, associated with the 8 web services provided to the listeners by the radio station.
Table 3: Variables in web site case study

Name     Model Role  Measurement Level  Description
ID       ID          Nominal            URL (with anonymous ID numbers)
TARGET   Target      Nominal            Web service selected

Table 4: Summary of variables in web site case

Variable Levels Summary

Variable  Role    Number of Levels
ID        ID      1,586,124
TARGET    TARGET  8
Plots displaying the frequency of the target (or web service) are also generated using the Stat Explore node. Figure 2 is the bar graph of the usage of services, while Figure 3 presents the same information in the form of a pie chart.
Case No 1: Web Site Usage Associations

A radio station collected data about the usage of its website by its listeners. The website provides a number of services, and the radio station would like to know if any unusual patterns exist in the combinations of services selected by its web users.

To begin the analysis using SAS Enterprise Miner, the diagram is created and the data source is defined. The next step is to begin exploring the data to become familiar with the database as well as to collect initial statistics. This is done by adding the Stat Explore node to the diagram and running the path. The diagram is displayed in Figure 1.
Figure 2: Bar graph of usage of web site services
Figure 1: Initial diagram
As seen in Table 3, there are only two variables,
the URL and the web service that is selected.
Figure 3: Pie graph of usage of web services
Both figures identify the eight levels of the
target, corresponding to the eight services offered by the
website, and the relative frequency of each in a sample of
10,000 observations.
These services include news streams, archives, music streams, simulcast, external referrers, podcast, website, and live streams. Unfortunately, there is no more information regarding what each of the services entails. The 10,000 observations were randomly sampled from the database.
Figure 5: Statistics plot of web site services usage
The use of the website is often followed by the use of the podcast service. Of the 10,000 observations, 40.44% use the website and 31.38% use the podcast service. The least utilized services are the external referrers and the live streams: external referrers appear in only 1.25% of the observations, and live streams, the least used, in only 1.15%.
Figure 6 is a plot of the rules based on which
side of the rule the items belong. The plot also uses the
colors of each point to display the confidence of the rules.
The rules with the highest level of confidence belong to
the column with red points (the third column from the
right). The only item with 100% confidence is that use of
the live stream implies the use of the Website.
After exploring the data, the Association tool is used to identify possible associations within the database. The diagram is updated with the addition of the Association node, as seen in Figure 4.
Figure 6: Rules matrix
Figure 4: Final diagram for web site services
To obtain association rules, the default settings
of the Association node were altered to allow the number
of items to process to be 3 million and the minimum
support percentage to be 1. The results of running the
path are 3 plots and a list of associations. The list of
associations can be found in Appendix A.
The statistics line plot in Figure 7 plots the lift, expected confidence, confidence, and support for each of the rules in order by the rule index. This graph indicates that the rules are indexed in descending order according to the lift. For the rule with 100% confidence identified in the rules matrix, this plot reveals more information: the index of the rule is 34, the lift of the rule is 1.74, the expected confidence of the rule is 57.52, and the support of the rule is 2.15.
The statistics plot of the website services usage
in Figure 5 plots the support on the Y-axis against the
confidence level on the X-axis. Notice that for the
associations with three and four relations, the support
level always falls below 5%. Also note that the
associations with the highest support only involved two
relations.
Figure 7: Statistics line plot
All of the information in the plots can be found
in the output results in Appendix A. However, the plots provide visual depictions that better identify trends and important rules. The link graph in Figure 8 displays the associations between all of the services. The nodes are sized according to the count,
the thickness of the links indicates the level of
confidence. Notice the node corresponding to both the
website and podcast is not connected to any of the other
nodes.
In addition to the information discovered about
the rule that the use of the live stream tool implies the use
of the website, interesting conclusions drawn from this
data are:
 External referrers also pointed to the archives
with 98% confidence. This is also the case for
External referrers and users of website pointing
to the archives;
 Those who used live streams, podcasts, news services, or the simulcast tools were not as likely to go to the Web site; and,
 Those who used the simulcast service were three
times as likely to also use the news services.
Figure 8: Linked graph of website services
Case No 2: Enrollment Management
“In the fall of 2004, the administration of a large
private university requested that the Office of Enrollment
Management and the Office of Institutional Research
work together to help identify prospective students who
would most likely to enroll as new freshmen in the fall
2005 semester. The administration stated several goals for
this project: increase new freshman enrollment, increase
diversity, and increase SAT scores of entering students.
Historically, inquiries numbered about 90,000+ students,
and the university enrolled from 2400 to 2800 new
freshmen each fall semester.”
4.1 Initial Observations
The description of the data can be found in
Appendix B. The data set contains variables that
described demographics of enrolled students, financials,
correspondence, interests, and visits to the campus. There
are a number of variables that were rejected from use in the model. Some were rejected because too many data points were missing.
described academic interests and high school code were
replaced with interval variables. The academic codes
were replaced by variables that described the rate at
which the code was used over 5 years. The school code
was replaced by the enrollment rate from each school
over 5 years. The variables that described race and sex
were rejected because they are not admissible factors for
the decision process.
4.2 Descriptive Statistics
The next step in the analysis is to explore the
data and to collect initial statistics. This is done by
adding the Stat Explore node to the diagram and
displaying the results, as in the radio website case study.
Table 5 displays the class variable summary
statistics. Notice that the territory variable, which
describes the recruitment area, is the only variable
missing an observation. Moreover, the variables that
have a mode of zero also have a very high percentage of
zeros ranging from 51.01% to 96.61%.
Table 5: Class input variables summary statistics for enrollment management

Class Variable Summary Statistics

Variable            Role    Numcat  NMiss  Mode  Mode Pct  Mode2  Mode2 Pct
CAMPUS_VISIT        INPUT   3       0      0     96.61     1      3.31
INSTATE             INPUT   2       0      Y     62.04     N      37.96
REFERRAL_CNTCTS     INPUT   6       0      0     96.46     1      3.21
SOLICITED_CNTCTS    INPUT   8       0      0     52.45     1      41.60
TERRITORY           INPUT   12      1      2     15.98     5      15.34
TRAVEL_INIT_CNTCTS  INPUT   7       0      0     67.00     1      29.90
INTEREST            INPUT   4       0      0     95.01     1      4.62
MAILQ               INPUT   5       0      5     69.33     2      12.80
PREMIERE            INPUT   2       0      0     97.11     1      2.89
STUEMAIL            INPUT   2       0      0     51.01     1      48.99
ENROLL              TARGET  2       0      0     96.86     1      3.14
Table 6 shows the distribution of the class target, which is enrollment. Enrollment is an indicator variable such that a 1 implies a successful enrollment and a 0 is a failure to enroll. Successful enrollment occurs in only 3.135% of cases.

Table 6: Distribution of class target and segment variables

Variable  Role    Formatted Value  Frequency  Percent
ENROLL    TARGET  0                88,614     96.8650
ENROLL    TARGET  1                2,868      3.1350
494 Vol. 6, No. 9, September 2015
ISSN 2079-8407
Journal of Emerging Trends in Computing and Information Sciences
©2009-2015 CIS Journal. All rights reserved.
http://www.cisjournal.org
Table 7 displays the summary statistics for the interval variables. Notice that the average income and the distance variables are missing a number of data points. The distributions of these variables are plotted in Figure 9. Each of the eight plots indicates skewness in the distribution, though the INIT_SPAN input is not as skewed as the other interval variables. The appearance of skewness suggests that a transformation might be necessary in order to obtain an accurate regression model.
Table 7: Interval variable summary statistics
Figure 9: Distribution of interval variables for enrollment
management
4.3 Sampling
Since the chance of enrollment is so small, a stratified sample is taken using the Sample tool. A Sample node is added to the diagram. (The final diagram can be found in Appendix C.) The sample is created such that every case that enrolled in the school is forced into the sample. Then, for each of the 2,868 enrollments, 7 other observations are included in the sample, resulting in a sample size of 22,944. Running the Sample node and exploring the results shows that this increases the percentage of enrolled cases to 12.5%. The results are displayed in Table 8.
Table 8: Summary statistics for sample in enrollment management case

Summary Statistics for Class Targets

Data=DATA
Variable  Numeric Value  Formatted Value  Frequency  Percent
Enroll    0              0                  88614    96.8650
Enroll    1              1                   2868     3.1350

Data=SAMPLE
Variable  Numeric Value  Formatted Value  Frequency  Percent
Enroll    0              0                  20076    87.5
Enroll    1              1                   2868    12.5
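The sampling arithmetic behind Table 8 can be reproduced directly: keep every enrollment and draw seven non-enrollments per enrollment. A Python sketch under those assumptions (the record lists are stand-ins for the actual observations):

```python
# Reproduce the Sample node's arithmetic: all 2,868 enrollments are
# forced in, plus 7 sampled non-enrollments per enrollment.
import random

def stratified_sample(events, nonevents, multiple=7, seed=0):
    rng = random.Random(seed)
    return events + rng.sample(nonevents, multiple * len(events))

events = list(range(2868))      # stand-ins for the enrolled cases
nonevents = list(range(88614))  # stand-ins for the non-enrolled cases
sample = stratified_sample(events, nonevents)
print(len(sample))                                # 22944
print(round(100 * len(events) / len(sample), 1))  # 12.5 (percent events)
```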
4.4 Decision Process
Recall that the administration wants to be able to identify students who are most likely to enroll at the college. A good candidate will have an above-average probability of enrollment. In order to incorporate this into the model, a Decision node is used. The Decision node allows the prior enrollment probability of 3% to be used and the central decision rule to be created. The central decision rule is a matrix whose trace is the inverse of the prior probabilities. The matrix is used to predict enrollment when the estimated probability is greater than 3% (the prior probability); otherwise, the applicant is considered unlikely to enroll.
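The effect of the central decision rule can be sketched numerically. With the inverse priors on the diagonal of the profit matrix, the expected-profit comparison reduces to thresholding the estimated probability at the prior. This is an illustrative Python sketch, not the SAS node itself; the prior of 3.135% is taken from Table 6.

```python
# Central decision rule sketch: profit 1/prior for a correct "enroll"
# decision and 1/(1 - prior) for a correct "not enroll" decision, so
# "enroll" wins exactly when the estimated probability exceeds the prior.
PRIOR = 0.03135  # prior probability of enrollment from Table 6

def decide(p_hat, prior=PRIOR):
    """Return 1 (predict enroll) when expected profit favors it."""
    profit_enroll = p_hat * (1.0 / prior)
    profit_not = (1.0 - p_hat) * (1.0 / (1.0 - prior))
    return 1 if profit_enroll > profit_not else 0

print(decide(0.05))  # 1: above the 3% prior, flagged as likely to enroll
print(decide(0.01))  # 0: below the prior, considered unlikely to enroll
```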
4.5 Prediction Model (All Cases)
After the Decision node was set up to determine the central decision rule, the diagram was completed in order to perform a stepwise regression on the given data. The data is partitioned using a Data Partition node that allocates 60% of the data for training and 40% for validating the model. An Impute node is used next to fill in the missing data; it uses the tree method for both class and interval variables. Also, under the indicator properties, unique missing-indicator variables are created for every imputed variable, and these binary indicators are then used as inputs. A stepwise regression was run with the entry and stay significance levels at the default setting of 0.05. The variables selected in the stepwise regression are then used by the Neural Network and Instate Regression nodes.
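The missing-indicator idea can be illustrated compactly. In the sketch below, mean imputation stands in for the Impute node's tree-based method, which is more involved; the point is the paired imputed column and binary indicator that becomes an extra input.

```python
# Imputation with a paired missing-value indicator (mean imputation is a
# stand-in for Enterprise Miner's tree method).
def impute_with_indicator(values):
    present = [v for v in values if v is not None]
    fill = sum(present) / len(present)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator

print(impute_with_indicator([10.0, None, 14.0]))
# ([10.0, 12.0, 14.0], [0, 1, 0])
```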
Table 9 displays the results from the Stepwise Regression node. The stepwise procedure fits a logistic regression with a logit link. Both the count of self-initiated contacts and the high school's enrollment rate are extremely important to the model, with p-values near zero. The student e-mail indicator, though included, is not significantly different from zero, with a large p-value of 0.6482.
Table 9: Stepwise regression results

Analysis of Maximum Likelihood Estimates

Parameter         DF  Estimate  Std Error  Wald Chi-Sq  Pr>ChiSq  Std Estimate  Exp(Est)
INTERCEPT          1  -12.1953   16.8329        0.52     0.4688                    0.000
SELF_INIT_CNTCTS   1    0.7069    0.0195     1315.95     <.0001       0.8446       2.028
HSCRAT             1   16.5157    0.7711      458.71     <.0001       0.7267     999.000
STUEMAIL 0         1   -7.6794   16.8321        0.21     0.6482                    0.000

Odds Ratio Estimates

Effect             Point Estimate
SELF_INIT_CNTCTS            2.028
HSCRAT                    999.000
STUEMAIL (0 vs 1)          <0.001
The regression equation for the stepwise node is:

    log( p̂ / (1 − p̂) ) = −12.20 + 0.71 X1 + 16.52 X2 − 7.68 X3,

where X1 denotes the number of self-initiated contacts, X2 denotes the rate of enrollment from the high school over the past 5 years, and X3 denotes whether an e-mail address is supplied.
The odds ratio expresses the multiplicative change in the odds of enrollment for a unit change in a particular input. Consequently, if the count of self-initiated contacts increases by one, the odds of that student enrolling roughly double. Notice that the odds ratios are unusual for the high school enrollment rate and the student e-mail. This is likely the result of a strong association within each of those inputs. For example, of the students who enrolled, the majority provided an e-mail address. Also, for some high schools, all of the students who applied enrolled, or all of the students who applied chose not to enroll at this particular school. A neural network node is also processed using the variables selected by the stepwise regression.
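The coefficient interpretation can be checked by hand. A short Python sketch using the rounded estimates from the stepwise equation; the input values are hypothetical, and the exp(coefficient) identity for the odds ratio is standard.

```python
import math

# Rounded estimates from the stepwise model equation.
b0, b1, b2, b3 = -12.20, 0.71, 16.52, -7.68

def p_enroll(x1, x2, x3):
    """Logistic response: invert log(p/(1-p)) = b0 + b1*x1 + b2*x2 + b3*x3."""
    z = b0 + b1 * x1 + b2 * x2 + b3 * x3
    return 1.0 / (1.0 + math.exp(-z))

# Odds ratio for a one-unit change is exp(coefficient):
print(round(math.exp(b1), 2))  # 2.03: one extra self-initiated contact
                               # roughly doubles the odds of enrolling

# More contacts -> higher estimated probability, holding the rest fixed
print(p_enroll(5, 0.5, 0) > p_enroll(2, 0.5, 0))  # True
```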
Next, the INSTATE input is added to the regression model using the regression node named Instate Regression. This node uses the variables selected in the stepwise regression and forces the INSTATE input to be included as well. The variable is included because it is thought that a student's enrollment decision depends on whether the student is in-state or out-of-state. The results of this regression are displayed in Table 10.
Notice that whether a student is in-state or out-of-state is significant, with a p-value near zero. Again, the student e-mail is not significant, since its p-value of 0.6606 is larger than the significance level of 0.05. Because the student e-mail coefficient is not statistically different from zero, the regression function for the instate regression is:
    log( p̂ / (1 − p̂) ) = −12.05 − 0.41 X1 + 0.69 X2 + 16.23 X3,

where X1 denotes in-state or out-of-state status, X2 denotes the number of self-initiated contacts, and X3 denotes the rate of enrollment from the high school over the past 5 years. Notice that the coefficient estimates for all of the previously included variables decreased with the addition of the in-state/out-of-state status.
The initial splits in the decision tree provide insight into the predictions. Figure 10 shows that those who made fewer than four self-initiated contacts are unlikely to enroll, and students who made fewer than three self-initiated contacts almost never enroll. This can be seen because, for both the training and validation data, in the branch with fewer than 2.5 self-initiated contacts the probability of a 0 (not enrolling in the school) is 99.6% and 99.7%, respectively.
Table 10: Instate regression results

Analysis of Maximum Likelihood Estimates

Parameter         DF  Estimate  Std Error  Wald Chi-Sq  Pr>ChiSq  Std Estimate  Exp(Est)
INTERCEPT          1  -12.0541   16.7449        0.52     0.4716                    0.000
INSTATE N          1   -0.4145    0.0577       51.67     <.0001                    0.661
SELF_INIT_CNTCTS   1    0.6889    0.0196     1233.22     <.0001       0.8231       1.992
HSCRAT             1   16.2327    0.7553      461.95     <.0001       0.7142     999.000
STUEMAIL 0         1   -7.3528   16.7443        0.19     0.6606                    0.001

Odds Ratio Estimates

Effect             Point Estimate
INSTATE (N vs Y)            0.437
SELF_INIT_CNTCTS            1.992
HSCRAT                    999.000
STUEMAIL (0 vs 1)          <0.001
Figure 10: Decision tree (initial splits)
Figure 11 indicates that the optimal number of leaves is 18, because the average profit levels off at that amount; nothing is gained by adding more leaves.
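The pruning logic behind Figure 11 can be sketched as choosing the smallest subtree whose validation profit is within tolerance of the best. The profit values below are invented for illustration, except that the curve levels off at 18 leaves as reported in the figure.

```python
# Illustrative profit-by-leaves curve; numbers are invented apart from
# the plateau at 18 leaves shown in Figure 11.
profits = {2: 1.50, 6: 1.72, 10: 1.84, 14: 1.87, 18: 1.881, 22: 1.881, 26: 1.881}

def optimal_leaves(profit_by_leaves, tol=1e-3):
    """Smallest subtree whose validation profit is within tol of the best."""
    best = max(profit_by_leaves.values())
    return min(k for k, v in profit_by_leaves.items() if best - v <= tol)

print(optimal_leaves(profits))  # 18: adding leaves past this gains nothing
```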
Figure 11: Average profit
4.6 Comparison of Models
The neural network, the regression including the instate variable, and the tree were all analyzed separately in the previous section. A node aptly named Model Comparison is used in the diagram to compare all of the models created. Using the results from running this node in Table 11, the neural network is found to produce the highest validation profit for enrollment while also having the lowest validation average squared error.
Table 11: Model comparison output

Fit Statistics (model selection based on _VAPROF_)

Selected  Model   Valid: Avg       Valid: Avg  Valid: Misclass  Valid: ROC  Valid: KS
Model     Node    Profit (Enroll)  Sq Error    Rate             Index       Statistic
Y         Neural  1.88576          0.036167    0.07551          0.98120     0.88705
          Reg2    1.86236          0.041097    0.07736          0.97691     0.86352
          Tree    1.88127          0.040314    0.12497          0.96481     0.88220
The receiver operating characteristic (ROC) chart plots the trade-off between sensitivity and the false positive fraction across all selected cutoffs, providing a measure of the predictive accuracy of a logistic model. All of the models appear strong, since their curves lie far from the diagonal line. The neural model appears to perform slightly better than the other models, as its curve lies outside the other curves. This supports the previous finding that the neural network has the lowest average squared error.
Table 11 also indicates that the selected model has a validation ROC index, or rank decision statistic, of 98%. This implies that the separation between enrollments and failures to enroll is close to perfect. The decision tree gives some clarification as to why the ROC index is so high: it indicates that the number of self-initiated contacts with the school is key to whether or not the student decides to enroll. Recall that if a student initiated two or fewer contacts, it is almost guaranteed that he will not enroll at the school.
Figure 12: ROC chart
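The ROC index reported by the Model Comparison node is equivalent to the probability that a randomly chosen enrollment receives a higher predicted score than a randomly chosen non-enrollment. A brute-force Python sketch on toy scores (not the case-study output):

```python
# Brute-force ROC index: fraction of (event, nonevent) score pairs the
# model ranks correctly, counting ties as half.
def roc_index(event_scores, nonevent_scores):
    wins = ties = 0
    for e in event_scores:
        for n in nonevent_scores:
            if e > n:
                wins += 1
            elif e == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(event_scores) * len(nonevent_scores))

print(roc_index([0.9, 0.8, 0.4], [0.3, 0.2, 0.1]))  # 1.0: perfect separation
print(roc_index([0.9, 0.2], [0.3, 0.1]))            # 0.75
```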
The goal of this case study is to identify the students most likely to enroll at the college. This information can then be used by the admissions office to aid in deciding which students to offer acceptance. The results indicate that the largest factor is whether or not the student contacts the school, and how many times he does so. If the student contacts the university on his own more than three times, he is likely to enroll in the college. If, on the other hand, the student contacts the university fewer than three times, it is almost certain that he will decline acceptance to the university.
4.7 Issues and Challenges In Data Mining
Protecting the privacy of those whose information is being collected is the most widely discussed ethical issue in data mining. Automated data mining frequently raises issues of privacy, security, and governance [Sharmaa et al. 2013]. Data that originates from online platforms such as Gmail, Facebook, Twitter, and Amazon is stored and analyzed by companies, which then use it to engineer targeted ads and make recommendations. Wherever web mining has been applied, the issue of protecting privacy has been broached [Berendt 2012].
In this regard, the initial outcry over Facebook's tracking of its users' behavior is worth mentioning. In 2012, the social networking company bought data from the data mining firm Datalogix. With this data on 70 million U.S. households, Facebook was able to predict whether a user bought an item after seeing it marketed on Facebook. The targeted ads that followed the company's venture into data mining were protested by users; as a result, Facebook included an option to opt out of this service [Steel and Dembosky 2012].
Another controversial use of data mining is getting too personal with customers' online or in-store behavior. In a presentation given in 2010, Andrew Pole, a predictive analytics expert at Target, said that there is immense potential in focusing ads and offers on families that are expecting a child, and that predicting a pregnancy would be a very lucrative sales opportunity. When this presentation was dug up two years later, stories appeared in the New York Times [Duhigg 2012] and subsequently in Forbes in 2012 [Hill 2012], and went viral within days. These reports criticized the lengths to which retail stores would go to offer targeted packages and coupons [Siegel 2013].
There are obvious benefits for companies and
the government for using data mining techniques. Retail
stores use the information to improve the efficiency in
their operations and are able to cater to the demands of
consumers more effectively. Organizations also use this
strategy to personalize a customer’s experience, as seen
with Harrah’s casinos. These uses and outcomes appear
to be positive for both the corporation and the consumer.
While the business is becoming more efficient in
operations, consumers reap the benefit of enjoyable
experiences, discounts, and special offers.
Another criticism of using data mining to detect
terrorism concerns the accuracy of analysis results. Since
the amount of information is so immense and “terrorism
is so rare,” the occurrence of “false positives are
inevitable and often more common than truly accurate
results” [Stannard 2006]. This implies that accusing an
innocent person is unavoidable. The severity of the
consequences for terrorists is not something that an
innocent person should be forced to endure.
Government’s unconstrained access to general
public’s personal data, that includes telephone
conversations, messages, credit card usage, and driving
records allows it the opportunity to mine and analyze the
information as it pleases. In the best-case scenario, the
authorities are able to use the acquired knowledge in
preventing criminal activity and terrorist plots. On the
other hand, the government’s use of data mining is
viewed by some as an invasion of privacy and a violation
of basic civil liberties. This objection is due in large part
because people suspected as terrorists, any person with a
connection to a suspect, or a randomly selected person
could be investigated in this manner. Law abiding
people, however, do not want their telephones tapped,
their credit card information examined, or the driving
records probed without their knowledge and good reason.
The collection and subsequent analysis of the data is seen
has an infringement upon one’s privacy.
The most recent controversy in this regard was the revelation of the National Security Agency's (NSA) surveillance program by Edward Snowden, a former employee of the organization [Greenwald 2013]. Snowden leaked surprising facts about the depth of the NSA's intrusion into personal communications; these revelations began in June 2013 and continued through the year. The story revealed that the NSA had records of millions of phone calls from the telecom company Verizon and also had access to the servers of large technology companies such as Apple, Facebook, Google, Microsoft, Skype, Yahoo, and YouTube [Lyon 2014]. Critics of the government's unbridled access to personal data argue that in states of emergency the government has the right to suspend some rights, but that this level of blanket surveillance is unacceptable. While some experts are of the opinion that new laws need to be formulated to provide checks and balances in such instances of big data monitoring, others have said that the NSA's PRISM initiative is completely against the Fourth Amendment and has no allowance in the current legal framework [Park and Wang 2013].
In an essay published in the Stanford Law Review in 2013, the authors emphasize the need for attention to both the positive and negative aspects of big data and government surveillance. It is a tremendous challenge to balance the risk and reward between invasions of privacy and crucial uses of personal data mining [Polonetsky and Tene 2013].
4.8 Future Prospects
A step ahead in big data analytics would be the complete automation of data mining technologies. A combination of cognitive computing and big data would truly transform the present potential of data mining applications, as it would supersede the need for human supervision and speed up the process of inferring decisions and predictions from large datasets [Uddin 2013]. Cognitive computing mirrors the human thought process by building systems that use pattern recognition, data mining, and natural language processing. Successful applications of cognitive computing include IBM's Watson, Google's DeepMind, and Qualcomm's Zeroth Platform [Delgado 2015]. Machine learning is now deemed the next generation of data mining capabilities to enable cognitive computing.
5. CONCLUSION
Data mining has proven to have immense applications in marketing, education, and scientific research. A number of tools are able to apply mining techniques; however, SAS Enterprise Miner has maintained its dominance in the industry over the past decade because of its robust statistical capabilities. In the two case studies presented here, the tool successfully elicited patterns of visitor behavior based on usage of services at the radio station website, and identified the most crucial indicator of a student's inclination to enroll at the institution in the second dataset. SAS Enterprise Miner is a user-friendly platform that can be adjusted to an organization's or project's requirements with ease and therefore has vast applications at both the commercial and institutional level. Thus, the platform is a viable tool for incorporation into curricula for educating the big data generation.
REFERENCES
[1] Martha Abell. 2014. First Steps in Data Mining with SAS Enterprise Miner. CreateSpace Independent Publishing Platform, USA.

[2] Abdullah M. Al Ghoson. 2010. Decision Tree Induction & Clustering Techniques in SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner – A Comparative Analysis. Intl J Manag & Inf Syst 14, 57-70.

[3] Russell K. Anderson. 2012. Prediction Algorithms for Data Mining. John Wiley & Sons, Ltd, Chichester, UK.

[4] Nikolay Archak, Anindya Ghose and Panagiotis G. Ipeirotis. 2011. Deriving the Pricing Power of Product Features by Mining Consumer Reviews. Management Science 57, 8, 1485-1509.

[5] Bettina Berendt. 2012. More than modelling and hiding: towards a comprehensive view of Web mining and privacy. Data Min Knowl Disc 24, 697-737.

[6] Ryan Broderick and Emanuella Grinberg. 2013. 10 ways you give up data without knowing it. Retrieved May 15, 2015 from http://edition.cnn.com/2013/06/13/living/buzzfeed-data-mining/

[7] Roger H. L. Chiang and Veda C. Storey. 2012. Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly 36, 4, 1165-1188.

[8] Rick Delgado. 2015. Cognitive Computing: Solving the Big Data Problem? Retrieved July 1, 2015 from http://www.kdnuggets.com/2015/06/cognitive-computing-solving-big-data-problem.html

[9] Charles Duhigg. 2012. How companies learn your secrets. Retrieved May 20, 2015 from http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0

[10] Wei Fan and Albert Bifet. 2013. Mining big data: current status, and forecast to the future. SIGKDD Explor. Newsl. 14, 2, 1-5.

[11] Md Rashid Farooqi and Khalid Raza. 2012. A Comprehensive Study of CRM through Data Mining Techniques. In Proc of the Nat Conf (NCCIST 2011). New Delhi.

[12] Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth. 1996. Knowledge discovery and data mining: Towards a unifying framework. In Proc of the 2nd ACM Int Conf on Knowledge Discovery and Data Mining (KDD). Portland, OR, 82-88.

[13] Gartner. 2015. Magic Quadrant for Advanced Analytics Platforms. http://www.gartner.com/technology/reprints.do?id=1-2A881DN&ct=150219&st=sb

[14] Google. 2015. How search works, from algorithms to answers. Retrieved April 28, 2015 from http://www.google.com/insidesearch/howsearchworks/thestory/

[15] Glenn Greenwald. 2013. NSA Prism program taps in to user data of Apple, Google and others. In The Guardian.

[16] Robert Grossman. 2001. Data mining for scientific and engineering applications. Kluwer Academic, Boston, Mass.

[17] Constance L. Hays. 2004. What Wal-Mart Knows About Customers' Habits. Retrieved April 15, 2015 from http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html

[18] Kashmir Hill. 2012. How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did. Retrieved May 20, 2015 from http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

[19] John Koetsier. 2013. How Google searches 30 trillion web pages, 100 billion times a month. Retrieved June 3, 2015 from http://venturebeat.com/2013/03/01/how-google-searches-30-trillion-web-pages-100-billion-times-a-month/

[20] Hans-Peter Kriegel, Karsten M. Borgwardt, Peer Kroger, Alexey Pryakhin, Matthias Schubert and Arthur Zimek. 2007. Future trends in data mining. Data Min Knowl Disc 15, 1, 87-97.

[21] Varun Kumar and Anupama Chadha. 2011. An Empirical Study of the Applications of Data Mining Techniques in Higher Education. Intl J of Adv Comp Sci and Appl 2, 3, 80-84.

[22] Taiyeong Lee, Ruiwen Zhang, Xiangxiang Meng and Laura Ryan. 2013. Incremental Response Modeling Using SAS Enterprise Miner. In Proceedings of the SAS Global Forum (2013). SAS Institute Inc.

[23] David Lyon. 2014. Surveillance, Snowden, and Big Data: Capacities, consequences, critique. Big Data & Society (July-Sep), 1-13.

[24] Randall Matignon and SAS Institute. 2007. Data mining using SAS Enterprise Miner. Wiley-Interscience, Hoboken, N.J.

[25] Ralf Mikut and Markus Reischl. 2011. Data mining tools. Wiley Interdiscip Rev 1, 5, 431-443.

[26] Katherine Noyes. 2014. Educating the "Big Data" Generation. Retrieved from http://fortune.com/2014/05/27/educating-the-big-data-generation/

[27] Chanmin Park and Taehyung Wang. 2013. Big Data and NSA Surveillance -- Survey of Technology and Legal Issues. IEEE Int Symp on Multimedia (ISM), 516-517.

[28] Gregory Piatetsky. 2015. R leads RapidMiner, Python catches up, Big Data tools grow, Spark ignites. Retrieved June 30, 2015 from http://www.kdnuggets.com/2015/05/poll-r-rapidminer-python-big-data-spark.html

[29] Jules Polonetsky and Omer Tene. 2013. Privacy and big data, making ends meet. The Stanford Law Review 66, 25.

[30] Foster Provost and Tom Fawcett. 2013. Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data 1, 51-59.

[31] Jack Schofield. 2004. Casino Rewards Total Loyalty. Retrieved April 25, 2015 from http://www.theguardian.com/technology/2004/jan/15/onlinesupplement

[32] Amol Sharma. 2013. Amazon Mines Its Data Trove to Bet on TV's Next Hit. Retrieved May 15, 2015 from http://www.wsj.com/articles/SB10001424052702304200804579163861637839706

[33] Bhoj Raj Sharmaa, Daljeet Kaura and Manju. 2013. A Review on Data Mining: Its Challenges, Issues and Applications. Intl J of Curr Eng and Tech 3, 2, 695-700.

[34] Eric Siegel. 2013. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. John Wiley & Sons, Hoboken, New Jersey.

[35] Matthew B. Stannard. 2006. U.S. Phone-Call Database Ignites Privacy Uproar / Data Mining. Retrieved May 15, 2015 from http://www.sfgate.com/news/article/U-S-PHONE-CALL-DATABASE-IGNITES-PRIVACY-UPROAR-2535457.php

[36] Emily Steel and April Dembosky. 2012. Facebook raises fears with ad tracking. Retrieved May 17, 2015 from http://edition.cnn.com/2012/09/23/business/facebook-datalogix/

[37] Niaz Uddin. 2013. James Kobielus: Big data, cognitive computing and future of product. Retrieved July 2, 2015 from http://etalks.me/james-kobielus-big-data-cognitive-computing-and-future-of-product/

[38] Dan Vesset, Brian Mcdonough, David Schubmehl and Mark Wardely. 2013. IDC Business Analytics Software 2013-2017 Forecast and 2012 Vendor Shares.

[39] Dan Vesset, Brian Mcdonough, David Schubmehl, Alys Woodward, Mary Wardley and Carl W. Olofson. 2014. Worldwide Business Analytics Software 2014-2018 Forecast and 2013 Vendor Shares.

[40] Ke Wang, Senqiang Zhou, Qiang Yang and Jack Man Shun Yeung. 2005. Mining Customer Value: From Association Rules to Direct Marketing. Data Min Knowl Disc 11, 57-79.

[41] Richard Thomas Watson. 2013. Data management: databases and organizations. John Wiley & Sons, New York.

[42] Xindong Wu, Xingquan Zhu, Gong-Qing Wu and Wei Ding. 2015. Data Mining With Big Data. IEEE Trans. Knowledge and Data Eng 26, 1, 97-107.

[43] Nong Ye. 2013. Data Mining: Theories, Algorithms, and Examples. CRC Press, Hoboken.

[44] Illhoi Yoo, Patricia Alafaireet, Miroslav Marinov, Keila Pena-Hernandez, Rajhita Gopidi, Jia-Fu Chang and Lei Hua. 2012. Data mining in healthcare and biomedicine: a survey of the literature. J Med Syst 36, 4, 2431-2448.

[45] Xiaohui Yu, Yang Liu, Xiangji Huang and Aijun An. 2012. Mining Online Reviews for Predicting Sales Performance: A Case Study in the Movie Domain. IEEE Trans. Knowledge and Data Eng 24, 4, 720-734.
APPENDICES
Website
RULE1 WEBSITE & EXTREF ==> ARCHIVE
RULE2 ARCHIVE ==> WEBSITE & EXTREF
RULE3 EXTREF ==> ARCHIVE
RULE4 ARCHIVE ==> EXTREF
RULE5 EXTREF ==> WEBSITE & ARCHIVE
RULE6 WEBSITE & ARCHIVE ==> EXTREF
RULE7 WEBSITE & SIMULCAST ==> PODCAST & MUSICSTREAM
RULE8 PODCAST & MUSICSTREAM ==> WEBSITE & SIMULCAST
RULE9 SIMULCAST & PODCAST ==> WEBSITE & MUSICSTREAM
RULE10 WEBSITE & MUSICSTREAM ==> SIMULCAST & PODCAST
RULE11 NEWS & MUSICSTREAM ==> SIMULCAST
RULE12 WEBSITE & NEWS ==> SIMULCAST
RULE13 WEBSITE & PODCAST & MUSICSTREAM ==> SIMULCAST
RULE14 SIMULCAST & MUSICSTREAM ==> NEWS
RULE15 NEWS ==> SIMULCAST & MUSICSTREAM
RULE16 PODCAST & MUSICSTREAM ==> SIMULCAST
RULE17 WEBSITE & SIMULCAST & PODCAST ==> MUSICSTREAM
RULE18 SIMULCAST & PODCAST ==> MUSICSTREAM
RULE19 WEBSITE & NEWS ==> MUSICSTREAM
RULE20 SIMULCAST & NEWS ==> MUSICSTREAM
RULE21 MUSICSTREAM ==> WEBSITE & SIMULCAST
RULE22 WEBSITE & SIMULCAST ==> MUSICSTREAM
RULE23 SIMULCAST ==> NEWS
RULE24 NEWS ==> SIMULCAST
RULE25 SIMULCAST ==> WEBSITE & MUSICSTREAM
RULE26 WEBSITE & MUSICSTREAM ==> SIMULCAST
RULE27 SIMULCAST ==> MUSICSTREAM
RULE28 MUSICSTREAM ==> SIMULCAST
RULE29 WEBSITE & SIMULCAST ==> ARCHIVE
RULE30 ARCHIVE ==> WEBSITE & SIMULCAST
RULE31 WEBSITE & SIMULCAST ==> NEWS
RULE32 WEBSITE & MUSICSTREAM ==> ARCHIVE
RULE33 ARCHIVE ==> WEBSITE & MUSICSTREAM
RULE34 LIVESTREAM ==> WEBSITE
RULE35 SIMULCAST & ARCHIVE ==> WEBSITE
RULE36 MUSICSTREAM & ARCHIVE ==> WEBSITE
RULE37 PODCAST & ARCHIVE ==> WEBSITE
RULE38 NEWS ==> MUSICSTREAM
RULE39 MUSICSTREAM ==> NEWS
RULE40 WEBSITE ==> ARCHIVE
RULE41 ARCHIVE ==> WEBSITE
RULE42 WEBSITE & MUSICSTREAM ==> NEWS
RULE43 SIMULCAST & PODCAST & MUSICSTREAM ==> WEBSITE
RULE44 EXTREF & ARCHIVE ==> WEBSITE
RULE45 EXTREF ==> WEBSITE
RULE46 SIMULCAST & MUSICSTREAM ==> WEBSITE & PODCAST
RULE47 PODCAST & MUSICSTREAM ==> WEBSITE
RULE48 SIMULCAST & PODCAST ==> WEBSITE
RULE49 PODCAST & NEWS ==> WEBSITE
RULE50 WEBSITE & ARCHIVE ==> MUSICSTREAM
RULE51 WEBSITE & ARCHIVE ==> SIMULCAST
RULE52 ARCHIVE ==> MUSICSTREAM
RULE53 ARCHIVE ==> SIMULCAST
RULE54 ARCHIVE ==> WEBSITE & PODCAST
RULE55 WEBSITE & NEWS ==> PODCAST
RULE56 WEBSITE & SIMULCAST & MUSICSTREAM ==> PODCAST
RULE57 SIMULCAST & MUSICSTREAM ==> WEBSITE
RULE58 SIMULCAST ==> WEBSITE & PODCAST
RULE59 MUSICSTREAM ==> WEBSITE & PODCAST
RULE60 MUSICSTREAM ==> WEBSITE
RULE61 SIMULCAST ==> WEBSITE
RULE62 NEWS & MUSICSTREAM ==> WEBSITE
RULE63 WEBSITE & SIMULCAST ==> PODCAST
RULE64 WEBSITE & MUSICSTREAM ==> PODCAST
RULE65 WEBSITE ==> PODCAST
RULE66 PODCAST ==> WEBSITE
RULE67 SIMULCAST & MUSICSTREAM ==> PODCAST
RULE68 SIMULCAST & NEWS ==> WEBSITE
RULE69 SIMULCAST ==> PODCAST
RULE70 WEBSITE & ARCHIVE ==> PODCAST
RULE71 ARCHIVE ==> PODCAST
RULE72 MUSICSTREAM ==> PODCAST
RULE73 NEWS ==> WEBSITE
RULE74 NEWS ==> PODCAST
Enrollment management data
Model diagram
Fit statistics

Fit Statistics (model selection based on _VAPROF_)

Selected  Model   Valid: Avg       Valid: Avg  Valid: Misclass  Valid: ROC  Valid: KS
Model     Node    Profit (Enroll)  Sq Error    Rate             Index       Statistic
Y         Neural  1.88576          0.036167    0.07551          0.98120     0.88705
          Reg2    1.86236          0.041097    0.07736          0.97691     0.86352
          Tree    1.88127          0.040314    0.12497          0.96481     0.88220