CS831-001: Knowledge Discovery in Databases
Fall 2011 (201130)
Instructor
Robert J. Hilderman
Office: CW308.23, 3rd Floor, College West
Voice: (306) 585-4061
Fax: (306) 585-4745
e-Mail: [email protected]
WWW: http://www.cs.uregina.ca/~hilder
Classroom
Location: ED632, 6th Floor, Education Building
Time: TR 10:00 – 11:15 AM
Office Hours
Location: CW308.23, 3rd Floor, College West
Time: TR 1:30 – 3:00 PM (or by appointment)
Course Overview
This course will be a mix of lectures delivered by the instructor (based upon
various core data mining topics), and a series of presentations delivered by the
students (based upon research papers recently published in journals and
conference proceedings). Students will be required to complete a few short
assignments based upon the lecture material. Students will also be required to
complete a term project based upon the research paper chosen for their
presentation. There will be no exams.
The primary objectives of this course are two-fold: (1) for students to develop an
understanding of many fundamental data mining techniques, and (2) for students
to develop their research and technical skills by working independently on a
significant data mining term project involving both software development and
technical writing components.
Mark Distribution
Assignments (5) (written)                         15%
Presentation of one research paper (oral)         15%
Project proposal (written)                        15%
Final project report and appendices (written)     50%
Presentation of project results (oral)             5%
                                                -------
                                                 100%
Note: At the instructor’s discretion, the final mark may be adjusted +/-5%.
Important Dates
Thursday, September 8, 2011: First Meeting
Thursday, October 6, 2011: Research Paper Selection Approved
Thursday, October 13, 2011 at 10:00 AM: Project Proposal Due
    Note 1: Submit to your instructor in ED632.
    Note 2: A late project proposal will be assessed a 25% penalty for each day
    that it is overdue.
Monday, December 5, 2011 at 2:30 PM: Final Project Report Due
    Note 1: Submit to your instructor in CW308.23.
    Note 2: A late final project report will be assessed a 10% penalty for each day
    that it is overdue.
Plagiarism
Please familiarize yourself with the following sections in the Graduate Studies
and Research Calendar (http://www.uregina.ca/gradstudies/calendar):
    Policies and Procedures of the University (particularly the section on
    Academic Conduct and Misconduct)
    Program Requirements
General Information
All written material submitted for grading must be submitted on 8.5” by 11”
paper, securely and humanely stapled in the top left corner. The first page must
show your name, your student number, the course number, a document
description, the date submitted, and your e-mail address in the top left corner. An
example is shown below:
Name: Joe Student
Student #: 123 456 789
Course #: CS831
Document: Assignment 1
Date Submitted: September 13, 2011
E-Mail: [email protected]
The following signed and dated pledge must also be shown at the bottom of the
first page:
Pledge of Academic Integrity: I pledge that the
contents of this document represent my own original
work and that I am personally responsible for its
creation and dissemination. Further, I am aware of the
penalties for academic misconduct as described in the
Graduate Studies and Research Calendar.
Signed: _________________________
Date: _________________________
Research Paper Presentation Requirements
Choosing a Paper
The research paper upon which your presentation and term project are based
must be a recent publication (i.e., published after 2008, so anything published
in 2009, 2010, or 2011 is eligible) and must be approved by the instructor.

The topic addressed in the research paper must be different and distinct from
any studied in previous courses or theses.

The scope of your term project must include significant software
development, data mining, and results evaluation components.

Research papers and topics will be approved on a first-come/first-served basis
(i.e., if you happen to choose a paper/topic for which someone else has
already received approval, you will need to choose another paper/topic).
Preparing the Presentation
Your presentation must receive approval from the instructor prior to being
delivered. Consequently, a copy of your presentation must be submitted for
inspection and approval by the instructor at least one week prior to the
scheduled delivery date.

Your presentation must include a thorough discussion of the mathematical
model, algorithm, implementation, complexity analysis, and/or experiments
described in the research paper selected. You will be penalized for being
inadequately prepared.

Your presentation must be no shorter than 30 minutes and no longer than 35
minutes. The remaining five to 10 minutes will be devoted to responding to
questions from the audience. You will be penalized for a presentation that is
too short or too long.

Your presentation should be developed in PowerPoint (or something
equivalent to PowerPoint). A data projector will be available in the classroom.

Use large font sizes and do not clutter your slides with too many points. Use
colour whenever appropriate, but be sure to use colours that are easily
differentiated when projected. You will be penalized for slides containing
details that cannot be easily read by the audience.

The liberal use of pertinent diagrams, figures, and graphs is strongly
encouraged. However, photocopying from research papers, textbooks, or other
technical material is seldom appropriate, and a blackboard- or whiteboard-based
presentation is not acceptable. If you find yourself needing to draw
other supplementary material during your presentation, it likely means that
you were not adequately prepared. Also, be sure that pertinent details are
obvious and easily read by the audience.

The liberal use of examples is strongly encouraged. These should be examples
that you derived yourself, not ones described in the research paper. When you
introduce new terminology, provide a formal definition for a term, state a
general condition/requirement, or state a theorem/axiom/principle/conjecture,
it is often useful to provide an example, at an appropriate level of detail,
describing the ideas in practical and concrete terms. Try to structure an
example so that it builds upon previous examples by using the same base
data/scenarios/context. In this way, the size, scope, and complexity of your
examples increase as your audience becomes familiar with your material. But
remember, most people in the audience will not have the same comfort with,
or understanding of, your topic as you do. Your objective is not to
baffle the audience, but to transmit some knowledge, even if for some it's just
at the most fundamental or rudimentary level.

The walk-through and discussion of an algorithm, without the support of a
detailed example demonstrating its operation, is not acceptable. Consequently,
you should not waste time during your presentation by walking through an
algorithm line-by-line. If an algorithm merits discussion, you should plan on a
general overview sufficient to describe the significant characteristics/nuances
that make it unique/novel. The remainder of your discussion should then focus
on a detailed example describing the operation of the algorithm as it is
intended to be used in practice, again highlighting the significant
characteristics/nuances, as required.
Delivering the Presentation
Actually standing in front of an audience and knowing what to say is very
different from going over your presentation in your mind while it is being
prepared. If you have little or no public speaking experience, you may want to
try rehearsing your presentation for time, content, and clarity. This could
reveal weaknesses in how the presentation flows, deficiencies in the details, or
errors.

Prior to your presentation, some students will be assigned to a panel whose
job it is to ask pertinent questions following your presentation. The other
students are also encouraged to ask questions, but if there are time constraints,
the panel will have priority.

Face the audience and make eye contact with the audience. Address all your
comments to the audience, not to the screen. Speak loudly enough to be heard,
and speak clearly. You will be penalized if you cannot be understood.

The audience may ask questions during the presentation, so you must
understand and be able to explain all aspects of the selected paper. You must
respond later, in writing, to any questions that you are not able to answer in
class, and submit your responses to the instructor.
Project Proposal Requirements
The project proposal must contain the following sections (the minimum
requirement):

Statement of Problem: Provide a statement of the problem addressed in the
selected paper.

Examples: Provide examples of the problem that will be solved and an
approximate form of the proposed solution. These must be complete
hand-derived examples.

Proposed Solution: Provide a detailed description of the approach that will be
taken to solve the problem, a discussion of the relevant literature, and an
overview of the details of your implementation.

Evaluation Criteria: Provide criteria that you will use to evaluate the success
of the project.

Schedule: Provide a detailed timeline describing the tasks that need to be
completed in order to complete the project on time.

References: Provide a complete, properly formatted list of the cited
references.

The project proposal must be eight to 10 single-spaced typewritten pages in 12pt
font. It must also be proofread so that it is relatively free of errors.
Proper English grammar is required.
Final Project Report Requirements
The final project report should be modeled on a format that is similar to typical
research papers that you have read. It must contain the following sections (the
minimum requirement):

Introduction: Provide some background on the problem addressed by the
project, an overview of the proposed solution, and a description of the report
document (i.e., the organization of the report).

Statement of Problem and Examples: This section can be adapted from the
project proposal document.

Proposed Approach: Provide detailed descriptions of algorithms, data
structures, and/or theoretical results.

Experimental Results: Provide a description of sample/typical experimental
results, tabular/graphical comparisons of your results with other
published results, and a summary of your results (a detailed description of
your results will be in the appendix).

Comparison to Related Work: Provide a detailed analysis and discussion of
your results in comparison to other related work.

Limitations/Extensions: Provide a description of the limitations of your
solution and any possible extensions that may overcome these limitations.

Conclusions: Provide a summary of what was achieved, and in relation to the
originally stated criteria for success from the project proposal, evaluate the
success of the project.

References: Provide a complete, properly formatted list of the cited
references.

The final project report must be 16 to 18 single-spaced typewritten pages in 12pt
font. It must also be proofread so that it is relatively free of errors. Proper English
grammar is required.
Appendices
The final project report will essentially be a summary of your research efforts.
Most of the material that you generate will be attached to the final project report
in the appendices. The appendices must contain the following sections (the
minimum requirement):

User's Manual: Provide a tutorial guide to running the software that you have
developed.

Implementation Manual: Provide a description of your implementation,
including important design decisions, data structures, algorithms, and
compilation instructions (basically anything needed to understand your
software and how to make it work).

Source Code Listing: Provide a complete listing of well-formatted,
well-documented source code.

Experimental Results: Provide a complete listing of all experimental results
(i.e., both raw data and summary data).
Presentation of Project Results
Project results presentations must be 10 minutes long. Follow the same guidelines
as used for the research paper presentations.
Sources for Reference Materials
Books (many other books are available and those below may have newer editions)
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, 1996.
Berry, M.J.A. and Linoff, G.S., Mastering Data Mining: The Art and Science of Customer Relationship Management, Wiley, 2000.
Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
Hand, D., Mannila, H., and Smyth, P., Principles of Data Mining, The MIT Press, 2001.
Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001.
Fayyad, U., Grinstein, G.G., and Wierse, A. (eds.), Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2002.
Thuraisingham, B., Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press, 1999.
Pyle, D., Data Preparation for Data Mining, Morgan Kaufmann, 1999.
Dunham, M.H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
Russell, S. and Norvig, P., Artificial Intelligence: A Modern Approach, Prentice Hall, 2003.
Mitchell, T.M., Machine Learning, McGraw-Hill, 1997.
Guillet, F. and Hamilton, H.J., Quality Measures in Data Mining, Springer, 2007.
Liu, B., Web Data Mining, Springer, 2007.
Wu, X. and Kumar, V., The Top Ten Algorithms in Data Mining, CRC Press, 2009.
Bramer, M., Principles of Data Mining, Springer, 2007.
Conference Proceedings (there are many others that have KDD tracks)
Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD)
Proceedings of the European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD)
Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD)
Proceedings of the Data Warehousing and Knowledge Discovery Conference (DaWaK)
Proceedings of the International Conference on Data Mining (ICDM)
Proceedings of the International Conference on Very Large Databases (VLDB)
Proceedings of the International Conference on Management of Data (SIGMOD)
Journals (these are just a few of many dealing with KDD)
IEEE Transactions on Knowledge and Data Engineering
Data Mining and Knowledge Discovery
Intelligent Data Analysis
Journal of Intelligent Information Systems
Knowledge and Information Systems
SIGKDD Explorations
Sources for Real World Datasets
To locate each of the sources shown below, use the terms given as keywords in a
web search engine.
UCI KDD Database Repository
UCI Machine Learning Repository
DELVE
FEDSTATS
FIMI Repository
Financial Data Finder
Grain Market Research
Investor Links
MIT Cancer Genomics Gene Expression Datasets
MLnet
National Space Science Data Center
PubGene Gene Database
Stanford Microarray Database
STATLOG Project Datasets
United States Census Bureau
DataCrunch
Reuters-21578 Text Categorization Collection
UCR Time Series Archive
DataWeb
WHO Statistical Information System