Service Learning Outcomes in an Undergraduate Data Mining Course

Terry Letsche
Department of Mathematics, Computer Science, and Physics
Wartburg College
Waverly, IA 50677
[email protected]

Abstract

Data mining is becoming increasingly important as the amount of data being stored increases, and data miners are highly sought after in the job market. This paper describes a pilot offering of a data mining course at a small, four-year liberal arts college as part of a mixed Computer Science and Computer Information Systems curriculum. Several curricular choices are examined, including text and software selection and the design of a final project to assess student performance. The final project offered three alternatives: students could implement or extend a data mining algorithm in the language of their choice, prepare a major paper on the impact of data mining on their discipline, or find an outside project (with instructor approval) to analyze using the data mining techniques developed in the course. The role of service learning as an intentional curricular choice is also examined, and two student projects are discussed.

1 Introduction

Data mining is gaining exposure as a way to make sense of the increasing amounts of data being collected and stored. The federal government recently announced that new software, ADVISE, would be used in homeland security to link and cross-match material between websites, government records, and personal data.[1] It has also been argued, however, that the business case for federal data mining efforts has not been made.[2] Netflix recently announced a contest to improve the accuracy of its selection prediction algorithms using data mining.[3,4] Data mining can even be applied to the NBA draft process![5] Data miners are highly sought after in the job market, and data mining stands at the confluence of a number of different disciplines.
Wartburg College is a four-year liberal arts college located in Waverly, Iowa. The college offers majors not only in computer science (CS) but also in computer information systems (CIS), a hybrid of computer science and business administration that places a heavier emphasis on computing than a typical management information systems major. The initial offering of a data mining course occurred in the winter term of 2006 as a rotating special topic, offered every three years. Since it would be some time until the course could be offered again, the only prerequisite was CS1 or its equivalent. Also, since students from many disciplines take CS1 as part of their major requirements (engineering science, for example), the course had to be accessible to a wide variety of interests and computing backgrounds.

When designing the course, a number of previously published works were examined. For example, Musicant's [6] approach of basing an undergraduate data mining course around research papers would be unworkable in an environment where the students had such widely varying backgrounds. Additional sources were examined [7,8], but the closest match to what was envisioned was Dian Lopez's course, described in [9]. The University of Minnesota, Morris approach of creating an undergraduate-level course that not only surveyed the breadth of the data mining field but also allowed students to formulate their own research problem was extremely appealing.

This paper begins with a brief discussion of the texts that were examined, followed by the software choice and a description of the final project. Two student projects are discussed, followed by student reactions and thoughts on future data mining course offerings.

2 Book Selection

The text for the course was selected with two primary criteria in mind. First, it had to be accessible to the students while still being rigorous.
One of the goals of the course was to introduce students to several of the key algorithms in data mining, so a more computational approach that also used many examples was natural. Second, it would be a plus if the text were linked to data mining software that could be used by the class, since a second course goal was experience with a data mining tool. Seven texts were considered.

2.1 Data Mining: A Tutorial-Based Primer by Roiger and Geatz

Roiger and Geatz [10] begin by covering data mining fundamentals with brief overviews of the data mining process, classification, prediction, clustering, and market basket analysis, then spend more time in later chapters on specific algorithms for decision trees, association rules, the K-Means algorithm, and genetic algorithms. The book comes with a 180-day trial version of iData Analyzer, an Excel data mining add-in; both the book and the software are used extensively in the development of data mining processes and techniques in later chapters. The book also has more advanced chapters, on neural networks, statistical analysis techniques, time-series analysis, rule-based systems, and fuzzy reasoning, that could be used as separate topics. The book comes with several data sets that are used as examples and has questions at the end of chapters. This book met the criteria above in that the core of the content was accessible to almost any student, with opportunities for further exploration in the advanced topics for more advanced students.

2.2 Principles of Data Mining by Hand et al.

Hand et al. [11] focus on bridging the ideas of statistical modeling and computational data mining algorithms. The book is split into three sections. In the first section, a foundational tutorial of data mining principles is presented at an intuitive level.
The second section builds on this, covering algorithms for trees, classification and regression rule sets, association rules, statistical models, and neural networks, among others. The final section shows how the preceding material can be used together to solve real-world data mining problems. This book is marketed toward senior-level undergraduate or beginning graduate students. As such, the level of statistics, in particular, seemed daunting for an offering with a mix of potentially first- through fourth-year students of varying quantitative experience. The book lacks exercises but has extensive "Further Reading" sections at the end of each chapter.

2.3 Data Mining by Adriaans and Zantinge

Adriaans and Zantinge [12] is a concise, management-level overview of data mining that could be used as a springboard for a class made up of this overview and additional readings from journals or other books on topics of interest to the instructor. The book gives overviews of the data mining and knowledge discovery process, with more extensive treatment of three real-life applications. The book itself is a great resource as an introductory overview, but its lack of algorithmic depth, examples, and exercises caused this book to be placed further down the list.

2.4 Data Mining: Introductory and Advanced Topics by Dunham

Dunham [13] is also a rather concise book, split into three sections. In the first section, data mining tasks such as classification, regression, prediction, and clustering are described and the core topics are motivated. The second part devotes more thorough coverage to classification, clustering, and association rules, using pseudocode and numerous examples to describe these techniques. The book concludes with additional advanced topics, including web, spatial, and temporal mining techniques and algorithms. Each chapter concludes with exercises.
However, the book is geared to advanced undergraduates and beginning graduate students who have completed at least an introductory database course. The book has an extensive appendix that surveys numerous data mining software packages, although the book itself does not adopt any particular software for its examples.

2.5 Data Mining Techniques: For Marketing, Sales, and Customer Support by Berry and Linoff

Berry and Linoff [14] approach data mining from the standpoint of the business practitioner. After a lengthy motivation, seven data mining techniques are discussed: cluster detection, memory-based reasoning, market basket analysis, genetic algorithms, link analysis, decision trees, and neural networks. One of the book's strong points is its use of case studies. Its focus on the business rather than the computational end of the spectrum, plus its lack of exercises, caused this book not to be considered.

2.6 Data Mining: Concepts and Techniques by Han and Kamber

Han and Kamber [15] have the reputation of being the gold standard against which other data mining books are judged. It is a comprehensive book; its broad coverage and in-depth development of algorithms make it an excellent resource for the instructor. It begins with an overview of data mining, progressing to a description of the data mining process, including data preprocessing and transformation. Since the authors take a database view of data mining, there are two chapters on data warehousing and data cubes that could be omitted without affecting the flow of the course. The book has an extensive treatment of association, correlation, classification, prediction, and clustering algorithms, with additional material highlighted in the bibliographic notes at the end of each chapter. The book also contains a number of advanced chapters on time-series, social network, and spatial data mining, as well as a concluding chapter on trends in data mining and a number of case studies.
Although this book is database-oriented, the information presented does not rely on a database interpretation. Each chapter concludes with exercises, and the publisher makes extensive instructor resources available. The book does not focus on any particular software package.

2.7 Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank

Witten and Frank [16] have written a data mining textbook for use with the Java-based Waikato Environment for Knowledge Analysis, or Weka. The objective of the book is to introduce the tools and techniques for machine learning that are used in data mining. The book's approach lands it somewhere between the practical approach of many business-oriented data mining books and the more theoretical approach used in other textbooks. The book is presented in two sections. In the first section, data mining is introduced and the various algorithms are presented. Later chapters delve into algorithmic detail for each of the main families of algorithms (decision trees, classification rules, linear models, instance-based learning, numeric prediction, clustering, and Bayesian networks), with a subsequent chapter on common transformations of the input and output and a concluding chapter on extensions and applications of data mining. The second part is devoted to the Weka software itself, from an introduction and tutorial to information on how the reader can extend or implement additional algorithms using the Weka framework. While the book itself does not have exercises, the authors have made exercise sets and other instructor materials available through the publisher.

2.8 And the winner is…

The KDnuggets website recently polled data mining practitioners on which book they preferred as an introduction or textbook to the field.[17] Twenty-three percent of the respondents in the unscientific poll chose Han and Kamber, eighteen percent chose Witten and Frank, and seventeen percent chose Hand et al.
For this course, the choice ultimately came down to Han and Kamber or Witten and Frank, with Witten and Frank getting the nod due to the Weka software. The software runs on virtually any platform, giving the students a state-of-the-art testbed for doing real data mining work, and it is used in all of the examples in the book. Obviously, this list is hardly exhaustive; there are many other data mining books and textbooks available that would suit many course needs.

3 Software Selection

Software selection was concurrent with book selection. The principal benefit of Weka was not only its platform independence but also its cost: free. There are a number of data mining companies that make trial versions of their software available for academic use for free or at a reduced cost, such as IBM's Intelligent Miner.[18] The only other software seriously considered was YALE (Yet Another Learning Environment)[19], developed at the University of Dortmund. Like Weka, YALE is written in Java and uses a building-block paradigm to allow rapid prototyping of data mining algorithms. It also has a GUI and incorporates Weka's machine learning library. YALE additionally supports external modules that can be treated as plugins to incorporate additional functionality.

Once the Witten and Frank textbook was selected, it seemed natural to use the Weka software. Each student was invited to download a version for their own computer (Windows or Mac), and a version was installed in the Linux lab adjacent to the classroom. The Windows version comes with its own JRE, while the Mac and Linux users downloaded only the Weka jar file. In the Linux lab, IBM's free JRE was used since it included just-in-time (JIT) compilation, which can significantly reduce processing time on large datasets.
4 Course Projects

The course followed the Witten and Frank book's order of topics, with additional examples done in class using active and collaborative learning. An invaluable resource was the sample data mining curriculum created by Dr. Gregory Piatetsky-Shapiro and Prof. Gary Parker, located at [20]. A semester-long project was assigned and timed to follow the content of the text. This project mirrored the efforts to use microarray data to classify leukemia as acute myeloid leukemia or acute lymphoblastic leukemia, as reported in Golub [21] and Piatetsky-Shapiro [22]. In the first part of the project, the microarray data has been narrowed down to fifty relevant genes. The students use this data and a number of suggested classification algorithms to determine which gene or genes best predict the two classes of leukemia. In the second phase, the original data set is used and students perform three tasks: separating training from test data sets, performing a variety of data cleaning techniques on both data sets, and lastly, using the cleaned training data to build models with a variety of algorithms, which are then evaluated against the test data set. Phase 3 covers feature set reduction using a number of different methods. In phase 4, the reduced feature set is used to isolate gene U82759, the human homeodomain protein HoxA9 mRNA.

For the class's final project, students were given the option of three different projects, all of which were to be presented to the class. In the first, students could code a data mining algorithm in the language of their choice, or extend an existing data mining algorithm and demonstrate its effectiveness. The intent of this option was to appeal to the computer science majors who had learned Python in CS1 and Java in CS2. The simplest data mining algorithms could easily be coded in Python, while extensions to an existing algorithm could be done in Weka using Java.
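To illustrate how small such an implementation can be, the following is a minimal sketch (not actual student code from the course) of the 1R classifier covered in Witten and Frank, applied to an invented weather-style data set:

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Pick the single attribute whose value -> majority-class rule
    misclassifies the fewest training rows (the 1R method)."""
    best = None
    for a in range(len(rows[0])):
        by_value = defaultdict(Counter)          # attribute value -> class counts
        for row, label in zip(rows, labels):
            by_value[row[a]][label] += 1
        # Each value predicts its majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row[a]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best[1], best[2]                      # chosen attribute index and its rule

# Invented example: attribute 0 is outlook, attribute 1 is windy.
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["play", "play", "stay", "stay"]
attr, rule = one_r(rows, labels)
print(attr, rule)  # attribute 0 classifies this toy data perfectly
```

The attribute names and data here are illustrative only; the point is that a complete, working rule learner fits comfortably within a CS1 student's Python experience.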
The second option was to write a formal paper on data mining and some discipline-specific data mining issue, e.g., privacy or market basket analysis. This would be a major paper, with a minimum length of ten pages and an annotated bibliography. This option was suggested for those students who might not be as confident in their programming skills, or who might have an interest in some of the non-technical aspects of data mining.

The final option was to complete a research activity with "real data" from an outside source, spending a minimum of ten hours on the project. The presentation was required to include a high-level description of the data, what cleaning procedures needed to be done on the data, the results of mining, and a report on the activity to the "customer". The intent of this project was to pilot the use of service learning. Wartburg conducted a service learning faculty development event in the fall of 2005, hosted by Campus Compact.

Service learning consists of six key components. First, there should be curricular connections between the project itself and the learning process in the class, preferably building on existing disciplinary skills. Student voice is also important: students should have the opportunity to select, design, implement, and then evaluate the service learning activity. Reflection is a third aspect, where students are urged to think, talk, and optionally write about their service experience. Reciprocity is also important: for the partnership to be successful, both the student and the customer should not only contribute to the project but also benefit from it. Ideally, the project should address an authentic community need, and lastly, students should participate in some sort of assessment where constructive feedback and reflection provide insight into the reciprocal learning.
The goal of service learning in this context was to enrich the learning experience for the student, sustain interest in the topic, and provide expertise to community partners.[23,24]

4.1 Juvenile Court Services District 1

In the first case study, student A did a project for her mother, an employee of Juvenile Court Services District 1 (JCS). JCS was becoming more aware of research-based data and was looking for ways to examine its outcome data in order to provide better programming for "at risk" youth referred to JCS by the juvenile court and the Department of Human Services. Prior to this study, the state of Iowa visited with JCS to explain the importance of routinely examining the outcome data, but it is unclear whether this subsequently occurred. The goal of the project was to take the available data and identify areas where adjustments could be made in order to gain better outcomes, where the primary outcome goal was to decrease program recidivism.[25]

Student A began work with the data for the 4 Oaks day treatment program in Dubuque, Independence, and Black Hawk County, Iowa. Individuals were anonymized by replacing names with JIJI numbers, an internal identifier, and birthdates were replaced with current ages. The data also included the current criminal charge (none, simple misdemeanor, misdemeanor, or felony), as well as length of stay, anticipated discharge date, assigned risk factor (low, medium, high), age at admission, county of residence, and referral source. It was hoped to find relationships between risk, gender, age, and length of stay, or to be able to predict recidivism. However, the data was limited to 207 individuals, and the three programs had wildly different clientele. Various algorithms were used on the data, but predictive accuracy averaged around 60% when using a training set. The analysis of the data confirmed the belief that minority students were not succeeding in the traditional day treatment program.
As a result, JCS has added a culturally specific program, Dare to be King, for minority males. This program was implemented in July, and by this January it appears to have made an impact. A second finding was that there were not enough data points available, particularly at any individual program site. Programmatic data was combined over multiple sites for the study, and further analysis demonstrated that increased time in the program is not correlated with a higher individual success rate or with recidivism. As a result, a more structured research program has been established at the three sites, two in July of 2006 and one in September of 2006. Early results indicate that the improved programmatic follow-through has improved success rates. JCS also continues to use an existing risk/needs assessment tool, but has recently focused its efforts more narrowly on medium- to high-risk youth. JCS continues to collect data and assess its programs. JCS felt that the data mining effort was immensely useful and has an additional data mining project to be completed in the near future.

Student A reports that she feels the project was worthwhile: "The fact that I could help make programs better for kids who need them was really motivating. And knowing that the things I found would be used and helpful in 'real life' was very motivating as well. It did make me realize how important data collection is. I was able to give my mom (and thus her whole department) some advice as to how they could improve their data collection so that they can find some statistically significant results." [26]

4.2 Wartburg College Retention

Student B works in ITS at Wartburg as an application support specialist, primarily supporting administrative staff using the college's SQL Server-based administrative system. This database contains over 800 tables that hold data for all campus offices, e.g., Admissions, Financial Aid, Registrar, Controller, and Alumni/Development.
Student B was interested in applying data mining principles to study retention, with the goal of developing a model that could accurately predict students at high risk of not being retained. Retention is a measure of the academic progression of a group of students from one period of time to the next. Last year, Wartburg published an 84% retention rate across first- through third-year students. The office of enrollment management is charged with the recruitment and retention of students through ongoing analysis and academic support services. At Wartburg, Admissions, Financial Aid, Registrar, the Pathways Center, and ITS serve under the direction of the Vice President of Enrollment Management.[27] There is also a thirteen-member standing committee of the faculty that recommends policies and procedures to maximize student retention, as well as monitors overall retention trends.

Student B met with the director of Institutional Research, who shared the procedure he uses to prepare the annual retention study. Student B learned that on the tenth day of each fall term, a "frozen" copy of the database is created to allow the processing of the retention report, which is later broken down by gender, class (1Y, 2Y, 3Y), ethnicity, GPA, citizenship, and transfer status. The retention committee indicated that it was also interested in additional possible predictors of retention, namely whether the student was housed on campus, involved in activities, or admitted by committee (a deviation from normal admittance), as well as ACT scores and high school rank. A first cut at assembling the data for the study produced 1405 students, whereas the official retention study was based on 1418 students; 10 students were in student B's data that were not in the retention data, and 23 students were in the retention data that were not in the study data.
It was discovered that student B was including deceased students, those on church mission leave, and those who left on schedule for a cooperative degree program or who had graduated even though they had not started the year with fourth-year status. Once there was agreement between the two data sets, student B began by anonymizing the data and discretizing various features, including religion, home state, class code, ethnicity, gender, citizenship, entrance code (transfer), and majors. It was later decided that major code might be too restrictive, so student B replaced major codes with CIP (Classification of Instructional Programs) codes to indicate the department of the major.

Student B discovered that the best indicator of retention, across a variety of algorithms, was GPA. Some algorithms provided more information than others; for example, J48, the Weka algorithm that builds on Quinlan's C4.5,[28] indicated that students with a GPA less than 2.0 were much less likely to be retained. A surprising result was that within the group with GPA less than 2.0, religious preference can be viewed as a secondary predictor: students with an undeclared religious affiliation are much less likely to be retained within this group. Further analysis with other algorithms demonstrated high predictive accuracy with three primary attributes: incoming class code, GPA, and religion.

Student B reports: "What did I learn? I suspected all along that retention would be based on GPA since if the student's GPA is under 2.0, they are very close to being put on probation or suspension. Data clean-up is very tedious and time consuming. Domain knowledge is also very important when refining the data to be mined." [29]

Student B had a number of recommendations for enrollment management and the retention committee. ITS has recently installed a SAN (Storage Area Network) to contain the Jenzabar databases, allowing more years of historical retention data to be preserved.
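To make the J48 result concrete, the sketch below (using invented, illustrative values rather than Wartburg's actual retention data) computes the information gain that C4.5-style decision tree learners use to rank a candidate splitting attribute such as a discretized GPA band:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy of the labels minus the weighted entropy remaining
    after partitioning the rows by the attribute's values."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical discretized GPA bands vs. retained yes/no.
gpa_band = ["<2.0", "<2.0", ">=2.0", ">=2.0", ">=2.0", ">=2.0"]
retained = ["no",   "no",   "yes",   "yes",   "yes",   "no"]
print(round(info_gain(gpa_band, retained), 3))  # prints 0.459
```

C4.5 (and J48) evaluates this quantity, adjusted by gain ratio, for every candidate attribute at each node, which is why a strongly separating attribute such as a sub-2.0 GPA band surfaces at the top of the tree.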
It was also suggested that a disability identifier be added to the database so that those students can more easily be removed from the analysis. Thirdly, a new table should be created that matches major codes to CIP codes. Lastly, there should be increased effort put into retaining co-curricular transcript data showing students' activities, athletics, music, and other involvement. It is hoped that mining retention data could be an ongoing effort between enrollment management and ITS. The ability to store multiple years' worth of data makes it reasonable to assume that more highly predictive models could be developed.

5 Conclusion

Students found the inclusion of a service learning option to be a novel and exciting prospect. Students who performed a service learning project reported a greater sense of engagement with the course and the relevant material. Students were also enthusiastic about "making a difference". One anonymous student reported on their course evaluation, "I got to research something that really interested me in this field!" The only negative comment from students overall was that they felt there should have been a statistics prerequisite for the course, although a poll during the course showed that almost the entire class had already taken an algebra-based statistics course.

6 Acknowledgements

The author gratefully acknowledges Cassandra Frush, Susan Higdon, and Dr. Edith Waldstein, Vice President of Enrollment Management, for allowing me to share the results of their research.

7 References

[1] J. Yaukey, "Feds test new data mining program," USA Today, Washington, D.C., 2007, p. 3A.
[2] B. Worthen, "IT Versus Terror," CIO, vol. 19, no. 20, p. 34, August 1, 2006.
[3] http://www.netflixprize.com
[4] K. Greene, "The $1 Million Netflix Challenge," Technology Review, October 6, 2006, http://www.technologyreview.com/Biztech/17587/page1/
[5] P. Gearan, "Predicting NBA Draft Success and Failure through Historical Trends," Draft Express, June 21, 2006, http://www.draftexpress.com/viewarticle.php?a=1362
[6] D. R. Musicant, "A data mining course for computer science: primary sources and implementations," in Proceedings of the 37th SIGCSE Technical Symposium on Computer Science Education, Houston, Texas, USA: ACM Press, 2006.
[7] R. Connelly, "Introducing data mining," J. Comput. Small Coll., vol. 19, pp. 87-96, 2004.
[8] Y. Lu and J. Bettine, "Data mining: an experimental undergraduate course," J. Comput. Small Coll., vol. 18, pp. 81-86, 2003.
[9] D. Lopez and L. Ludwig, "Data mining at the undergraduate level," in Midwest Instruction and Computing Symposium, Cedar Falls, IA, 2001.
[10] R. J. Roiger and M. W. Geatz, Data Mining: A Tutorial-Based Primer. Boston: Addison Wesley, 2003.
[11] D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining. Cambridge, Massachusetts: The MIT Press, 2001.
[12] P. Adriaans and D. Zantinge, Data Mining. Harlow, England: Addison Wesley Longman Limited, 1996.
[13] M. H. Dunham, Data Mining: Introductory and Advanced Topics. Upper Saddle River, New Jersey: Pearson Education, Inc., 2003.
[14] M. J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: John Wiley & Sons, Inc., 1997.
[15] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2006.
[16] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
[17] http://www.kdnuggets.com/polls/2005/data_mining_textbooks.htm
[18] http://www.ibm.com/software/data/iminer/
[19] http://rapid-i.com/
[20] http://www.kdnuggets.com/data_mining_course/index.html
[21] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, pp. 531-537, October 15, 1999.
[22] G. Piatetsky-Shapiro, T. Khabaza, and S. Ramaswamy, "Capturing best practice for microarray gene expression data analysis," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C.: ACM Press, 2003.
[23] R. G. Bringle and J. A. Hatcher, "A Service-Learning Curriculum for Faculty," Michigan Journal of Community Service Learning, vol. 2, p. 112, 1995.
[24] "Introduction to Service-Learning Toolkit: Readings and Resources for Faculty," Campus Compact, 2nd edition, 2003.
[25] R. Frush, e-mail correspondence, 2007.
[26] C. Frush, e-mail correspondence, 2007.
[27] http://www.wartburg.edu/academics/enrollment.html
[28] R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[29] S. Higdon, e-mail correspondence, 2007.