
NESUG 2009
Applications Big & Small

An application of SAS® Text Miner to the imputation of missing key player description in a customer database

Maria Skaletsky, Bentley University
Lisa Chang, Liberty Mutual
James Lally, EMC Corporation
Dominique Haughton, Bentley University

ABSTRACT

In this paper we demonstrate an application of SAS Text Miner to the imputation of the missing key player description field in a customer database. Textual data collected by marketing departments often remain unused because they are difficult to process automatically. SAS Text Miner provides a solution to this problem and makes the processing of textual data manageable. As a result of this project, missing textual customer information in a database collected by the marketing department of a large provider of information infrastructure was successfully imputed. This project demonstrates that incorporating SAS Text Miner into the processing of customer data could eliminate the need for manual treatment of text data and increase the usefulness of the customer database.

BACKGROUND AND A BRIEF REVIEW OF THE LITERATURE

Text mining is a very useful technique whenever large amounts of text are accumulated and need to be analyzed. Henning and Lin, in the paper 'A SAS® Text Mining Approach to Predicting the Resolvability of Disputes between eBay's Sellers and Buyers', demonstrated the power of using textual data in predicting the resolvability of conflicts between buyers and sellers on eBay. They compared the results of predictive models that included sellers' prices only to those of models that included textual data from buyers' and sellers' comments describing their disputes. Models with the textual data proved to be superior to the models that took into consideration prices alone. They also compared models that included manual coding of textual data to models where textual data were analyzed using SAS Text Miner.
They found that the use of the software improved the predictive power of the models (Henning and Lin 2008).

Christiana Paetrou, in the paper 'Use of Text Mining to Predict Patient Compliance', describes the use of SAS Text Miner to cluster codes that define the procedures performed during dental visits. These codes were grouped into clusters using SAS Text Miner, and the clusters were then ranked according to the level of complexity of the procedures in the clusters. Assigning patients to clusters allowed for a further analysis of patients' compliance, which was the goal of the project (Paetrou 2008).

Jeske and Liu (2007) propose an approach to classifying text data as well as tracking the occurrence of a theme in the classified documents, and apply their method to the study of a Federal Aviation Administration (FAA) aviation safety report repository. They find that the method is successful and suggest further refinements for future work.

Many companies collect textual data that require significant manual effort to be analyzed. This paper describes an application of SAS Text Miner to the imputation of missing textual data in a database provided by a large company that provides information infrastructure to clients. This company's products include customized software, hardware, training, etc.

SAS TEXT MINER METHODOLOGY

We will briefly explain the methodology behind the Text Miner interface. First, the text miner builds a matrix where the documents are the columns and the terms are the rows. Terms are all the words contained in all of the documents, except for the words included in the 'exclusion' dictionary. The default exclusion dictionary is automatically included within the software and can be modified by the researcher. Similar words can be grouped into the same term, e.g. 'manager' and 'managers' are considered as the same term.
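To make the term-by-document matrix concrete, the following is a minimal sketch in Python, not the SAS Text Miner implementation. The small EXCLUDED set is a hypothetical stand-in for the exclusion dictionary, the plural-folding rule in normalize is a deliberately crude assumption for illustration, and the three job titles are made up.

```python
import re
from collections import Counter

# Hypothetical stand-in for the default exclusion dictionary (an assumption).
EXCLUDED = {"of", "the", "and", "for", "a", "an"}


def normalize(word):
    """Crude plural folding so that 'managers' and 'manager' share one term."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop excluded words, fold plurals."""
    words = re.findall(r"[a-z]+", text.lower())
    return [normalize(w) for w in words if w not in EXCLUDED]


def term_document_matrix(documents):
    """Return (terms, matrix): one row per term, one column per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


# Made-up job titles, used only to illustrate the layout of the matrix.
titles = ["Director of IT", "IT Managers", "Senior IT Manager"]
terms, matrix = term_document_matrix(titles)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Each printed row is a term and each position in the row is a document, so 'manager' receives a count in both the second and third titles even though one of them uses the plural form. With real data the matrix has one row for every distinct term in the collection, which is why it quickly becomes very large.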
Since the resulting matrix includes every term encountered in any of the documents, it is very large and would be difficult to work with. SAS Text Miner reduces the size of the matrix by using the singular value decomposition (SVD) method. After the data reduction is completed, the documents can be clustered based on the SVD results. Each document is assigned a cluster membership and the most frequent terms are listed for each cluster. There are two clustering methods available in Text Miner: expectation maximization (EM) and hierarchical clustering. The researcher selects the SVD resolution (low, medium or high), which determines the number of SVD dimensions extracted. The researcher also has the option to determine the number of clusters and the number of terms displayed to describe each of the clusters. The resulting clusters can be displayed as a dendrogram or a tree view and can be saved as a table that contains the terms describing each cluster.

DATA

Data in the customer database consist of information provided by customers during various marketing contact points; the database is later used to classify prospects for targeted marketing messaging and activities. One of the data items collected from customers is their job title. This information is entered into the database; on the basis of the job title, the company uses an in-house procedure to create a text variable referred to as the Key Player Description. For example, a job title such as "Director of IT" would give rise to a Key Player Description of "IT Operations". The only information used to create the Key Player Description text variable is the job title field provided by customers. The problem is that in some cases the company's in-house procedure was not able to obtain the Key Player Description, which is therefore missing for these cases.
The database contains over a million records, so any manual attempt to replace those missing Key Player Descriptions would be very time consuming and thus practically impossible.

PROBLEM

There are several problems with the job title variable which can lead to a failure to obtain a corresponding Key Player Description. Some customers do not enter their information at all, and in this case no imputation is possible. Some customers enter incorrect information, e.g. "God"; for these cases, the Key Player Description cannot be imputed either. However, some customers provided a correct job title, and yet the in-house procedure failed to obtain any Key Player Description. The objective of this project is to replace the missing Key Player Descriptions for these cases by using the SAS Text Miner tool.

APPROACH AND RESULTS

To make the dataset more manageable we randomly selected 10,000 cases for analysis. Then all cases where the job description field was missing were removed. We decided to create a new variable that was a concatenation of two fields, Job Title and Key Player Description, and to run the text miner on this variable. The idea is that this combined variable should provide more information to the text miner than the Job Title alone would, and should lead to clusters of "Job Title/Key Player Description" where it is clear what the Key Player Description should be for each cluster.

However, the results of clustering this new variable were not entirely satisfying. One of the clusters created by the text miner included terms that were not informative, such as 'unclassified', 'none', 'no', 'title', 'provided'. All of the cases that had a Key Player Description of 'Unclassified' were put into this cluster, which did not solve the imputation problem.

Figure 1. Text mining of the Job Title/Key Player Description concatenated variable

To correct this issue we went back to the list of terms extracted from the data by the text miner and set the status of all the terms included in this 'Unclassified' cluster to inactive. By changing their status we excluded them from clustering. This forced all cases with missing Key Player Descriptions to be put into meaningful clusters, which was the objective of this project. Text Miner identified eight clusters (Figure 2). Each cluster is identified by its 20 most relevant key words.

Figure 2. Clusters identified by SAS Text Miner (EM algorithm)

These clusters can be defined in the following way. Cluster 1 is the Administrator/Developer (App, DBA, Network)/IT Department cluster. Cluster 2 primarily includes job descriptions related to Pharmaceuticals, Life Sciences and Healthcare. Cluster 3 can be defined as the Finance cluster. Cluster 4 is the Functional area/Support cluster. Cluster 5 includes Engineer/Architect and IT Infrastructure. Cluster 6 is the Middle Management cluster. Cluster 7 can be defined as the Administration cluster and Cluster 8 as the Upper Management cluster.

To ensure that these clusters are in line with the Key Player Descriptions originally assigned to the data by the company, we ran cross tabulations between cluster membership and the original Key Player Description variable. If the Text Miner classification is accurate, all or most cases with the same Key Player Description should be placed into the same cluster. Partial results of these cross tabulations are provided in Figure 3. This figure shows that 29 cases out of 30 with the Key Player Description 'Application Administrator' were placed into Cluster 1. The results for all other Key Player Descriptions were satisfying as well. Therefore, we concluded that the clustering results were quite coherent.

Figure 3. Crosstabs between Cluster memberships and Key Job Descriptions

After the clustering was complete, we separated cases that originally had Job Titles but had missing Key Player Descriptions into a separate small dataset (166 cases out of 10,000). This file was examined by database managers at the company to ensure that the classification into clusters was meaningful to them. The feedback from these managers was highly positive and raised their interest in further research using SAS Text Miner. Cluster definitions are shown in Figure 4.

LIMITATIONS AND FUTURE RESEARCH

In some cases, some of the terms do not provide meaningful information to clusters. Further term 'cleaning' might improve the accuracy of the imputation of the Key Player Description. For example, Cluster 8 may need to be further refined, possibly by excluding some of the terms defining it. In the future we would like to extend the imputation of the missing field to the complete database. Another possible extension would be to validate the imputation on a separate validation sample (which is not used for clustering).

REFERENCES

Henning, K. and Z. Lin (2008). A SAS® Text Mining Approach to Predicting the Resolvability of Disputes between eBay's Sellers and Buyers. SAS Global Forum 2008.

Jeske, D. and R. Liu (2007). Mining and Tracking Massive Text Data: Classification, Constructions of Tracking Statistics and Inference under Misclassification. Technometrics, 49(2), 116-128.

Paetrou, C. (2008). Use of Text Mining to Predict Patient Compliance. SAS Global Forum 2008.

ACKNOWLEDGMENTS

SAS is a registered trademark of the SAS Institute, Inc., of Cary, North Carolina.

CONTACT INFORMATION

Please contact the authors with any comments or questions:

Maria Skaletsky [email protected]
Dominique Haughton [email protected]
