NESUG 2009
Applications Big & Small
An application of SAS® Text Miner to the imputation of missing key
player description in a customer database
Maria Skaletsky, Bentley University
Lisa Chang, Liberty Mutual
James Lally, EMC Corporation
Dominique Haughton, Bentley University
ABSTRACT
In this paper we demonstrate an application of SAS Text Miner to the imputation of the missing key player
description field in a customer database. Textual data collected by marketing departments often remain
unused because they are difficult to process automatically. SAS Text Miner provides a solution to this
problem and makes processing of textual data manageable. In this project, missing textual
customer information in a database collected by the marketing department of a large provider of information
infrastructure was successfully imputed. The project demonstrates that incorporating SAS Text
Miner into the processing of customer data could eliminate the need for manual treatment of text data and
increase the usefulness of the customer database.
BACKGROUND AND A BRIEF REVIEW OF THE LITERATURE
Text mining is a very useful technique whenever large amounts of text are accumulated and need to be
analyzed.
Henning and Lin, in the paper “A SAS® Text Mining Approach to Predicting the Resolvability of Disputes
between eBay's Sellers and Buyers”, demonstrated the power of using textual data in predicting the
resolvability of conflicts between buyers and sellers on eBay. They compared the results of predictive models
that included sellers' prices only to those of models that included textual data from buyers' and sellers'
comments describing their disputes. Models with the textual data proved superior to models that
took into consideration prices alone. They also compared models that included manual coding of textual
data to models where textual data were analyzed using SAS Text Miner. They found that the use of the
software improved the predictive power of the models (Henning and Lin 2008).
Christiana Paetrou, in the paper “Use of Text Mining to Predict Patient Compliance”, describes the use of
SAS Text Miner to cluster codes that define the procedures performed during dental visits. These codes
were grouped into clusters using SAS Text Miner, and the clusters were then ranked according to the
level of complexity of procedures in the clusters. Assigning patients to clusters allowed for a further
analysis of patients' compliance, which was the goal of the project (Paetrou 2008).
Jeske and Liu (2007) propose an approach to classifying text data as well as tracking the occurrence of a
theme in the classified documents, and apply their method to the study of a Federal Aviation
Administration (FAA) aviation safety report repository. They find that the method is successful and
suggest further refinements for future work.
Many companies collect textual data that require significant manual effort to analyze. This paper
describes an application of SAS Text Miner to the imputation of missing textual data in a database
supplied by a large company that provides information infrastructure to clients. This company's products
include customized software, hardware, and training, among others.
SAS TEXT MINER METHODOLOGY
We briefly explain the methodology behind the Text Miner interface.
First, the text miner builds a matrix in which the documents are the columns and the terms are the rows.
Terms are all the words contained in all of the documents, except for the words included in the
“exclusion” dictionary. A default exclusion dictionary is included with the software and
can be modified by the researcher. Similar words can be treated as the same term; e.g., “manager” and
“managers” are considered the same term.
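The construction of the term-by-document matrix can be illustrated with a minimal Python sketch. It is not SAS Text Miner's implementation: the stop list, the crude plural-stripping rule standing in for stemming, and the sample job titles are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical exclusion dictionary; the default list shipped with the software is much longer.
STOP_WORDS = {"of", "the", "and", "for"}

def normalize(token):
    """Crude stand-in for stemming: lowercase, trim punctuation, drop a trailing plural 's'."""
    token = token.lower().strip(".,;:/()")
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

def term_document_matrix(documents):
    """Return (terms, matrix) with one row per term and one column per document."""
    counts = []
    for doc in documents:
        tokens = (normalize(t) for t in doc.split())
        counts.append(Counter(t for t in tokens if t and t not in STOP_WORDS))
    terms = sorted(set().union(*counts)) if counts else []
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

# Made-up job titles used only to exercise the function.
titles = ["Director of IT", "IT Managers", "Manager of Finance"]
terms, matrix = term_document_matrix(titles)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Here “Managers” and “Manager” collapse into a single row, and “of” never enters the matrix because it is on the exclusion list.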
Since the resulting matrix includes every term encountered in any of the documents, it is very large and
would be difficult to work with. SAS Text Miner reduces the size of the matrix by using the singular value
decomposition method.
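To make the reduction step concrete, the sketch below approximates the k leading singular triplets of a small term-by-document matrix and represents each document by k coordinates instead of one count per term. The paper does not describe the numerical routine SAS Text Miner uses; power iteration with deflation is substituted here purely because it fits in a few lines of standard-library Python, and the example matrix is made up.

```python
import math
import random

def matvec(a, v):
    """Multiply matrix `a` (a list of rows) by vector `v`."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def truncated_svd(a, k, iterations=500, seed=0):
    """Approximate the k leading singular triplets (sigma, u, v) of `a`
    (terms x documents) by power iteration with deflation."""
    rng = random.Random(seed)
    residual = [list(row) for row in a]
    triplets = []
    for _component in range(k):
        transposed = [list(col) for col in zip(*residual)]
        # Random start vector over documents, so it is not orthogonal to the target.
        v = [rng.random() + 0.1 for _ in transposed]
        for _step in range(iterations):
            w = matvec(transposed, matvec(residual, v))  # w = A^T (A v)
            length = norm(w)
            if length == 0.0:
                break
            v = [x / length for x in w]
        av = matvec(residual, v)
        sigma = norm(av)
        if sigma < 1e-12:  # nothing left to extract
            break
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
        # Deflate: remove the rank-one component just found.
        residual = [[residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
                    for i in range(len(u))]
    return triplets

def document_coordinates(triplets):
    """Document j is mapped to (sigma_1 * v_1[j], ..., sigma_k * v_k[j])."""
    n_docs = len(triplets[0][2])
    return [[s * v[j] for s, _, v in triplets] for j in range(n_docs)]

# Made-up 4-term x 3-document count matrix.
A = [[1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
svd2 = truncated_svd(A, k=2)
coords = document_coordinates(svd2)
print([round(s, 4) for s, _, _ in svd2])
```

With k = 2 each of the three documents is described by two numbers rather than four term counts; on a real corpus the saving is from tens of thousands of terms down to a few dozen dimensions.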
After the data reduction is completed, the documents can be clustered based on the SVD results. Each
document is assigned a cluster membership and the most frequent terms are listed for each cluster. There
are two clustering methods available in Text Miner: expectation maximization (EM) and hierarchical
clustering. The researcher selects the SVD resolution (low, medium or high), which determines the number
of SVD dimensions extracted. The researcher can also set the number of clusters and the number
of terms displayed to describe each cluster. The resulting clusters can be displayed as a
dendrogram or a tree view and can be saved as a table that contains the terms describing each cluster.
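As an illustration of clustering documents on their reduced coordinates, the following sketch implements a simple average-linkage agglomerative (hierarchical) procedure. It is an assumption-laden stand-in: it is not the EM algorithm used for the results in this paper, it does not reproduce Text Miner's hierarchical option, and the sample points are invented.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Invented two-dimensional document coordinates forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
labels = agglomerative(points, n_clusters=2)
print(labels)
```

The number of clusters is supplied by the caller, mirroring the option described above; the pairwise search is quadratic per merge, so this form is only suitable for small illustrative inputs.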
DATA
Data in the customer database consist of information provided by customers during various marketing
contact points; the database is later used to classify prospects for targeted marketing messaging and
activities.
One of the data items collected from customers is their job titles. This information is entered into the
database; on the basis of the job title, the company uses an in-house procedure to create a text variable
referred to as the Key Player Description. For example, a job title such as “Director of IT” would give rise
to a Key Player Description of “IT Operations”. The only information used to create the Key Player
Description text variable is the job title field provided by customers. The problem is that in some cases
the company's in-house procedure was not able to obtain the Key Player Description, which is therefore
missing for these cases. The database contains over a million records, so any manual analysis of it to
replace those missing Key Player Descriptions would be very time consuming and thus
practically impossible.
PROBLEM
There are several problems with the job title variable which can lead to a failure to obtain a corresponding
Key Player Description. Some customers do not enter their information at all and in this case no
imputation is possible. Some customers enter incorrect information, e.g. “God”; for these cases, the Key
Player Description cannot be imputed either. However, some customers provided a correct job title, and
yet the in-house procedure failed to obtain any Key Player Description. The objective of this project is to
replace the missing Key Player Descriptions for these cases by using the SAS Text Miner tool.
APPROACH AND RESULTS
To make the dataset more manageable we randomly selected 10,000 cases for analysis. Then all cases
where the job description field was missing were removed. We then created a new variable by
concatenating two fields, Job Title and Key Player Description, and ran the text miner on this variable.
The idea is that this combined variable should be able to provide more information to the text miner than
the Job Title alone would and should lead to clusters of “Job Titles/ Key Player Description” where it is
clear what the Key Player Description should be for this cluster. However, the results of clustering this
new variable were not entirely satisfactory. One of the clusters created by the text miner included terms that
were not informative, such as “unclassified”, “none”, “no”, “title”, “provided”. All of the cases that had a Key
Player Description of “Unclassified” were put into this cluster, which did not solve the imputation problem.
Figure 1. Text mining of the Job Title/ Key Player Description concatenated variable
To correct this issue we went back to the list of terms extracted from the data by the text miner and set
the status of all the terms included in this “Unclassified” cluster as inactive. By changing their status we
excluded them from clustering. This forced all cases with missing Key Player Descriptions to be put into
meaningful clusters, which was the objective of this project.
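The effect of marking terms inactive can be sketched as a filter applied to the concatenated Job Title / Key Player Description text before it reaches the term matrix. The inactive terms below are the ones named in this paper; the function names and the tokenization rule are hypothetical and do not correspond to Text Miner settings.

```python
# Terms from the uninformative cluster reported in the paper.
INACTIVE_TERMS = {"unclassified", "none", "no", "title", "provided"}

def active_tokens(text, inactive=INACTIVE_TERMS):
    """Lowercase and tokenize `text`, dropping terms flagged as inactive."""
    tokens = (t.strip(".,;:/()").lower() for t in text.split())
    return [t for t in tokens if t and t not in inactive]

def combine(job_title, key_player_description):
    """Concatenate the two fields, then keep only the active terms."""
    return active_tokens(f"{job_title} {key_player_description or ''}")

print(combine("Database Administrator", "Unclassified"))
print(combine("Director of IT", "IT Operations"))
```

Once “unclassified” is filtered out, a record such as “Database Administrator / Unclassified” is represented only by its job-title terms, so it can no longer be pulled into a cluster defined by the placeholder words.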
Text Miner identified eight clusters (Figure 2). Each cluster is identified by its 20 most relevant keywords.
Figure 2. Clusters identified by SAS Text Miner (EM algorithm)
These clusters can be defined in the following way. Cluster 1 is the Administrator/Developer (App, DBA,
Network) /IT Department cluster. Cluster 2 primarily includes job descriptions related to Pharmaceuticals,
Life Sciences and Healthcare. Cluster 3 can be defined as the Finance cluster. Cluster 4 is the Functional
area/Support cluster. Cluster 5 includes Engineer/Architect and IT Infrastructure. Cluster 6 is the Middle
Management cluster. Cluster 7 can be defined as the Administration cluster and Cluster 8 as the Upper
Management cluster.
To ensure that these clusters are in line with the Key Player Descriptions originally assigned to the data by
the company, we ran cross tabulations between cluster membership and the original Key Player
Description variable. If the Text Miner classification is accurate, all or most cases with the same Key
Player Description should be placed into the same cluster. Partial results of these cross tabulations are
provided in Figure 3. This figure shows that 29 cases out of 30 with the Key Player Description “Application
Administrator” were placed into cluster 1. The results for all other Key Player Descriptions were satisfactory
as well. Therefore, we concluded that the clustering results were quite coherent.
Figure 3. Crosstabs between cluster memberships and Key Player Descriptions
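The cross-tabulation check can be sketched in a few lines of Python. The data below are made up to mimic the 29-out-of-30 pattern reported for “Application Administrator”; the second description, its cluster number, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def crosstab(descriptions, clusters):
    """Count cases for each (Key Player Description, cluster) pair."""
    table = defaultdict(Counter)
    for description, cluster in zip(descriptions, clusters):
        table[description][cluster] += 1
    return table

def dominant_share(table):
    """For each description, return (most common cluster, share of its cases there)."""
    result = {}
    for description, counts in table.items():
        cluster, n = counts.most_common(1)[0]
        result[description] = (cluster, n / sum(counts.values()))
    return result

# Made-up labels: 29 of 30 'Application Administrator' cases fall in cluster 1.
descriptions = ["Application Administrator"] * 30 + ["Finance Example"] * 10
clusters = [1] * 29 + [5] + [3] * 10
shares = dominant_share(crosstab(descriptions, clusters))
for description, (cluster, share) in shares.items():
    print(f"{description}: cluster {cluster}, share {share:.2f}")
```

A share close to 1 for every description indicates that cases with the same original Key Player Description land in the same cluster, which is the coherence criterion applied above.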
After the clustering was complete we separated cases that originally had Job Titles but had missing Key
Player Descriptions into a separate small dataset (166 cases out of 10,000). This file was examined by
database managers at the company to ensure that the classification into clusters was meaningful to them.
The feedback from these managers was highly positive and raised their interest in more research using
SAS Text Miner. Cluster definitions are shown in Figure 4.
LIMITATIONS AND FUTURE RESEARCH
In some cases, terms do not contribute meaningful information to the clusters. Further term
“cleaning” might improve the accuracy of the imputation of the Key Player Description. For example, Cluster 8
may need to be further refined, possibly by excluding some of the terms defining it.
In the future we would like to apply the imputation of the missing field to the complete database.
Another possible extension would be to validate the imputation on a separate validation sample (one that is
not used for clustering).
REFERENCES
Henning, K. and Z. Lin (2008). A SAS® Text Mining Approach to Predicting the Resolvability of Disputes
between eBay's Sellers and Buyers. SAS Global Forum 2008.
Jeske, D. and R. Liu (2007). Mining and Tracking Massive Text Data: Classification, Construction of
Tracking Statistics, and Inference under Misclassification. Technometrics, 49(2), 116-128.
Paetrou, C. (2008). Use of Text Mining to Predict Patient Compliance. SAS Global Forum 2008.
ACKNOWLEDGMENTS
SAS is a Registered Trademark of the SAS Institute, Inc., of Cary, North Carolina.
CONTACT INFORMATION
Please contact the authors with any comments or questions:
Maria Skaletsky
[email protected]
Dominique Haughton
[email protected]