CS831-001: Knowledge Discovery in Databases
Fall 2011 (201130)

Instructor
Robert J. Hilderman
Office: CW308.23, 3rd Floor, College West
Voice: (306) 585-4061
Fax: (306) 585-4745
e-Mail: [email protected]
WWW: http://www.cs.uregina.ca/~hilder

Classroom
Location: ED632, 6th Floor, Education Building
Time: TR 10:00 – 11:15 AM

Office Hours
Location: CW308.23, 3rd Floor, College West
Time: TR 1:30 – 3:00 PM (or by appointment)

Course Overview

This course will be a mix of lectures delivered by the instructor (based upon various core data mining topics), and a series of presentations delivered by the students (based upon research papers recently published in journals and conference proceedings). Students will be required to complete a few short assignments based upon the lecture material. Students will also be required to complete a term project based upon the research paper chosen for their presentation. There will be no exams.

The primary objectives of this course are two-fold: (1) for students to develop an understanding of many fundamental data mining techniques, and (2) for students to develop their research and technical skills by working independently on a significant data mining term project involving both software development and technical writing components.

Mark Distribution

Assignments (5) (written)                           15%
Presentation of one research paper (oral)           15%
Project proposal (written)                          15%
Final project report and appendices (written)       50%
Presentation of project results (oral)               5%
                                                  -------
                                                    100%

Note: At the instructor’s discretion, the final mark may be adjusted +/-5%.

Important Dates

Thursday, September 8, 2011: First Meeting

Thursday, October 6, 2011: Research Paper Selection Approved

Thursday, October 13, 2011 at 10:00 AM: Project Proposal Due
Note 1: Submit to your instructor in ED632.
Note 2: A late project proposal will be assessed a 25% penalty for each day that it is overdue.
Monday, December 5, 2011 at 2:30 PM: Final Project Report Due
Note 1: Submit to your instructor in CW308.23.
Note 2: A late final project report will be assessed a 10% penalty for each day that it is overdue.

Plagiarism

Please familiarize yourself with the following sections in the Graduate Studies and Research Calendar (http://www.uregina.ca/gradstudies/calendar):
Policies and Procedures of the University (particularly the section on Academic Conduct and Misconduct)
Program Requirements

General Information

All written material submitted for grading must be submitted on 8.5” by 11” paper, securely and humanely stapled in the top left corner.

The first page must show your name, your student number, the course number, a document description, the date submitted, and your e-mail address in the top left corner. An example is shown below:

Name: Joe Student
Student #: 123 456 789
Course #: CS831
Document: Assignment 1
Date Submitted: September 13, 2011
E-Mail: [email protected]

The following signed and dated pledge must also be shown at the bottom of the first page:

Pledge of Academic Integrity: I pledge that the contents of this document represent my own original work and that I am personally responsible for its creation and dissemination. Further, I am aware of the penalties for academic misconduct as described in the Graduate Studies and Research Calendar.

Signed: _________________________    Date: ________________________

Research Paper Presentation Requirements

Choosing a Paper

The research paper upon which your presentation and term project are based must be a recent publication (i.e., published after 2008, so anything published in 2009, 2010, and 2011 is eligible) and must be approved by the instructor.

The topic addressed in the research paper must be different and distinct from any studied in previous courses or theses.

The scope of your term project must include significant software development, data mining, and results evaluation components.
Research papers and topics will be approved on a first-come/first-served basis (i.e., if you happen to choose a paper/topic for which someone else has already received approval, you will need to choose another paper/topic).

Preparing the Presentation

Your presentation must receive approval from the instructor prior to being delivered. Consequently, a copy of your presentation must be submitted for inspection and approval by the instructor at least one week prior to the scheduled delivery date.

Your presentation must include a thorough discussion of the mathematical model, algorithm, implementation, complexity analysis, and/or experiments described in the research paper selected. You will be penalized for being inadequately prepared.

Your presentation must be no shorter than 30 minutes and no longer than 35 minutes. The remaining five to 10 minutes will be devoted to responding to questions from the audience. You will be penalized for a presentation that is too short or too long.

Your presentation should be developed in PowerPoint (or something equivalent to PowerPoint). A data projector will be available in the classroom.

Use large font sizes and do not clutter your slides with too many points. Use colour whenever appropriate, but be sure to use colours that are easily differentiated when projected. You will be penalized for slides containing details that cannot be easily read by the audience.

The liberal use of pertinent diagrams, figures, and graphs is strongly encouraged. However, photocopying from research papers, textbooks, or other technical material is seldom appropriate. And a blackboard- or whiteboard-based presentation is not acceptable. If you find yourself needing to draw other supplementary material during your presentation, it likely means that you were not adequately prepared. Also, be sure that pertinent details are obvious and easily read by the audience.

The liberal use of examples is strongly encouraged.
These should be examples that you derived yourself, not ones described in the research paper. When you introduce new terminology, provide a formal definition for a term, state a general condition/requirement, or state a theorem/axiom/principle/conjecture, it is often useful to provide an example, at an appropriate level of detail, describing the ideas in practical and concrete terms.

Try to structure an example so that it builds upon previous examples by using the same base data/scenarios/context. In this way, the size, scope, and complexity of your examples increase as your audience becomes familiar with your material. But remember, most people in the audience will not have the same comfort with, or understanding of, your topic as you do. Remember, your objective is not to baffle the audience, but to transmit some knowledge, even if for some it’s just at the most fundamental or rudimentary level.

The walk through and discussion of an algorithm, without the support of a detailed example demonstrating its operation, is not acceptable. Consequently, you should not waste time during your presentation by walking through an algorithm line-by-line. If an algorithm merits discussion, you should plan on a general overview sufficient to describe the significant characteristics/nuances that make it unique/novel. The remainder of your discussion should then focus on a detailed example describing the operation of the algorithm as it is intended to be used in practice, again highlighting the significant characteristics/nuances, as required.

Delivering the Presentation

Actually standing in front of an audience and knowing what to say is very different from going over your presentation in your mind while it is being prepared. If you have little or no public speaking experience, you may want to try rehearsing your presentation for time, content, and clarity. This could reveal weaknesses in how the presentation flows, deficiencies in the details, or errors.
Prior to your presentation, some students will be assigned to a panel whose job it is to ask pertinent questions following your presentation. The other students are also encouraged to ask questions, but if there are time constraints, the panel will have priority.

Face the audience and make eye contact with the audience. Address all your comments to the audience, not to the screen.

Speak loudly enough to be heard, and speak clearly. You will be penalized if you cannot be understood.

The audience may ask questions during the presentation, so you must understand and be able to explain all aspects of the selected paper. Any questions that you are not able to answer in class, you will have to respond to later, in writing, and submit to the instructor.

Project Proposal Requirements

The project proposal must contain the following sections (the minimum requirement):

Statement of Problem: Provide a statement of the problem addressed in the selected paper.
Examples: Provide example/s of the problem that will be solved and an approximate form of the proposed solution. These must be complete hand-derived examples.
Proposed Solution: Provide a detailed description of the approach that will be taken to solve the problem, a discussion of the relevant literature, and an overview of the details of your implementation.
Evaluation Criteria: Provide criteria that you will use to evaluate the success of the project.
Schedule: Provide a detailed timeline describing the tasks that need to be completed in order to complete the project on time.
References: Provide a complete, properly formatted list of the cited references.

The project proposal must be eight to 10 single-spaced typewritten pages in 12pt font. It must be typewritten and proofread so that it is relatively free of errors. Proper English grammar is required.

Final Project Report Requirements

The final project report should be modeled on a format that is similar to typical research papers that you have read.
It must contain the following sections (the minimum requirement):

Introduction: Provide some background on the problem addressed by the project, an overview of the proposed solution, and a description of the report document (i.e., the organization of the report).
Statement of Problem and Examples: This section can be adapted from the project proposal document.
Proposed Approach: Provide detailed descriptions of algorithms, data structures, and/or theoretical results.
Experimental Results: Provide a description of sample/typical experimental results, tabular/graphical comparisons of your results compared to other published results, and a summary of your results (a detailed description of your results will be in the appendix).
Comparison to Related Work: Provide a detailed analysis and discussion of your results in comparison to other related work.
Limitations/Extensions: Provide a description of the limitations of your solution and any possible extensions that may overcome these limitations.
Conclusions: Provide a summary of what was achieved, and in relation to the originally stated criteria for success from the project proposal, evaluate the success of the project.
References: Provide a complete, properly formatted list of the cited references.

The final project report must be 16 to 18 single-spaced typewritten pages in 12pt font. It must also be proofread so that it is relatively free of errors. Proper English grammar is required.

Appendices

The final project report will essentially be a summary of your research efforts. Most of the material that you generate will be attached to the final project report in the appendices. The appendices must contain the following sections (the minimum requirement):

User’s Manual: Provide a tutorial guide to running the software that you have developed.
Implementation Manual: Provide a description of your implementation, including important design decisions, data structures, algorithms, and compilation instructions (basically anything needed to understand your software and how to make it work).
Source Code Listing: Provide a complete listing of well-formatted, well-documented source code.
Experimental Results: Provide a complete listing of all experimental results (i.e., both raw data and summary data).

Presentation of Project Results

Project results presentations must be 10 minutes long. Follow the same guidelines as used for the research paper presentations.

Sources for Reference Materials

Books (many other books are available and those below may have newer editions)

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, 1996.
Berry, M.J.A. and Linoff, G.S., Mastering Data Mining: The Art and Science of Customer Relationship Management, Wiley, 2000.
Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
Hand, D., Mannila, H., and Smyth, P., Principles of Data Mining, The MIT Press, 2001.
Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.
Fayyad, U., Grinstein, G.G., and Wierse, A. (eds.), Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2002.
Thuraisingham, B., Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press, 1999.
Pyle, D., Data Preparation for Data Mining, Morgan Kaufmann, 1999.
Dunham, M.H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
Russell, S. and Norvig, P., Artificial Intelligence: A Modern Approach, Prentice Hall, 2003.
Mitchell, T.M., Machine Learning, McGraw-Hill, 1997.
Guillet, F. and Hamilton, H.J., Quality Measures in Data Mining, Springer, 2007.
Liu, B., Web Data Mining, Springer, 2007.
Wu, X. and Kumar, V., The Top Ten Algorithms in Data Mining, CRC Press, 2009.
Bramer, M., Principles of Data Mining, Springer, 2007.

Conference Proceedings (there are many others that have KDD tracks)

Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD)
Proceedings of the European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD)
Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD)
Proceedings of the Data Warehousing and Knowledge Discovery Conference (DaWaK)
Proceedings of the International Conference on Data Mining (ICDM)
Proceedings of the International Conference on Very Large Databases (VLDB)
Proceedings of the International Conference on Management of Data (SIGMOD)

Journals (these are just a few of many dealing with KDD)

IEEE Transactions on Knowledge and Data Engineering
Data Mining and Knowledge Discovery
Intelligent Data Analysis
Journal of Intelligent Information Systems
Knowledge and Information Systems
SIGKDD Explorations

Sources for Real World Datasets

To locate each of the sources shown below, use the terms given as keywords in a web search engine.

UCI KDD Database Repository
UCI Machine Learning Repository
DELVE
FEDSTATS
FIMI Repository
Financial Data Finder
Grain Market Research
Investor Links
MIT Cancer Genomics Gene Expression Datasets
MLnet
National Space Science Data Center
PubGene Gene Database
Stanford Microarray Database
STATLOG Project Datasets
United States Census Bureau
DataCrunch
Reuters-21578 Text Categorization Collection
UCR Time Series Archive
DataWeb
WHO Statistical Information System