Data Mining
Spring 2011
PACE UNIVERSITY
SEIDENBERG SCHOOL OF COMPUTER SCIENCE AND INFORMATION SYSTEMS
DEPARTMENT: Computer Science
SUBJECT CODE/ COURSE TITLE: CS 325/CIT 348 [Data Mining]
CLASS HOURS: 4 Hours per week
CREDITS: 4
PREREQUISITE:
TEXTBOOKS: Introduction to Data Mining [ISBN: 0321321367], P. Tan, M. Steinbach, & V. Kumar, Pearson Prentice Hall/ 2006
REFERENCE: Data Mining: Introductory and Advanced Topics [0130888923], M. Dunham/Pearson Prentice Hall/ 2003; Internet; Journals; & Magazines
SEMESTER: Spring 2011
Instructors: Dr. A. Joseph and Dr. J. Lawler
Course Description: This course will provide an overview of topics such as data mining and knowledge discovery; data
mining with structured and unstructured data; foundations of pattern clustering; clustering paradigms; clustering for data
mining; data mining using neural networks and genetic algorithms; fast discovery of association rules; applications of data
mining to pattern classification; and feature selection. The goal of this course is to introduce students to current machine
learning and related data mining methods. It is intended to provide enough background to allow students to apply machine
learning and data mining techniques to learning problems in a variety of application areas.
Textbook: Introduction to Data Mining/ Pearson Prentice Hall/ 2006
Prepared by: Dr. A. Joseph
PROFESSOR’S PROFILE
Professor: Dr. A. Joseph
Office: 163 Williams St., 2nd floor, Room 231
Telephone: 212 346 1492
Email: [email protected]
Office Hours: Monday (NYC) 9:00am – 2:00pm
COURSE PROFILE
EVALUATION AND ASSESSMENT
Grading Policy
Final examination: 35%
In-class examinations (six 20-minute exams): 30% [best 5 of 6]
Homework: 5%
Student participation and contribution: 15%
Coordinator: 5%
Journal: 10%
Project and project presentation: 15% (3% for presentation)
Extra credit assignment (Optional): 10% (Due week 12 and no later)
Note: Only for students who are otherwise fulfilling all the course requirements.
Final Grade Determination
90% -- 100%: A
85% -- 89%: B+
82% -- 84%: B
80% -- 81%: B-
75% -- 79%: C+
70% -- 74%: C
65% -- 69%: D+
60% -- 64%: D
Below 60%: F
Note: Grade is computed to the nearest whole number.
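As an illustration only, the grade determination above can be sketched in Python. The function name `letter_grade` is hypothetical, and the sketch assumes that "nearest whole number" means rounding halves up (for example, 89.5 becomes 90); the syllabus does not state the rounding rule for exact halves, so the instructor's actual practice may differ.

```python
import math

# Grade bands from the table above: (lowest whole-number percentage, letter).
GRADE_BANDS = [
    (90, "A"), (85, "B+"), (82, "B"), (80, "B-"),
    (75, "C+"), (70, "C"), (65, "D+"), (60, "D"),
]


def letter_grade(percent: float) -> str:
    """Map a final percentage to a letter grade (illustrative sketch)."""
    # Assumption: round to the nearest whole number with halves rounded up.
    # Python's built-in round() rounds halves to the nearest even integer,
    # so floor(x + 0.5) is used here instead.
    rounded = math.floor(percent + 0.5)
    for lower_bound, letter in GRADE_BANDS:
        if rounded >= lower_bound:
            return letter
    return "F"  # below 60%


if __name__ == "__main__":
    for score in (93.2, 89.5, 84.4, 59.4):
        print(score, letter_grade(score))
```

Under this assumed rounding rule, a final score of 89.5 falls in the A band, while 89.4 remains a B+.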
Learning Objectives and Outcomes
Students are expected to accomplish the following learning objectives and attain the corresponding outcomes by the end of
the course.
Objective #1
Students will develop an intimate understanding of data and their characteristics.
Outcomes
a. Demonstrate a clear understanding and knowledge of the complexity and possible solutions to the problem of data
collection and data organization capabilities and the available expertise to analyze the data.
b. Know when to determine and prepare data for quality analysis and its importance to informed decision making as
well as be able to identify and clearly explain at least six indicators of data quality.
c. Able to define and discuss a global definition of a data warehouse as well as know the categories of the data it
contains and the main transformation methods used to prepare them.
d. Understand and know different ways in which data are characterized as well as how to identify and preprocess them.
e. Able to demonstrate deep knowledge and understanding of data similarity and dissimilarity with regard to the
operations involved and data analysis.
Objective #2
Students will develop a sound knowledge and understanding of data preparation and exploration.
Outcomes
a. Able to demonstrate ability to analyze basic representations and characteristics of raw data, apply different
normalization techniques on numerical attributes, and recognize different techniques for data preparation.
b. Able to compare different methods for elimination of missing data as well as compare different methods for outlier
detection.
c. Can apply summary statistics such as mean, median, and standard deviation to capture important characteristics in
data sets.
d. Know and be able to explain the purpose and significance of data visualization as well as know the forms, representations,
and procedures of visualization techniques appropriate for a particular application.
e. Able to identify the differences in dimensionality reduction based on features and on reduction-of-values techniques as
well as clearly explain data reduction in the preprocessing phase.
f. Show unambiguous understanding of the basic principles of feature selection and feature composition tasks.
g. Demonstrate a clear understanding of the differences between decision tree and decision rule representation in a
classification model.
h. Able to identify the basic components of an artificial neural network and its properties and capabilities in such
learning tasks as classification and pattern association.
i. Able to describe the main steps of a genetic algorithm with an illustrative example.
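Outcomes (a) and (c) above refer to normalization of numerical attributes and to summary statistics. A minimal sketch of both, using only the Python standard library, might look like the following. The function names and the sample values are hypothetical, and the z-score version assumes the population standard deviation; the course materials may use a different convention.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Center values on the mean and scale by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up attribute values
    print("mean:", statistics.mean(sample))
    print("median:", statistics.median(sample))
    print("std dev:", statistics.pstdev(sample))
    print("min-max:", min_max_normalize(sample))
    print("z-score:", z_score_normalize(sample))
```

Min-max scaling keeps the shape of the distribution but is sensitive to outliers, whereas z-scores express each value as a number of standard deviations from the mean.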
Objective #3
Students will improve their team-building, social, organizational, and collaborative skills through assignments, team
activities, and projects; these are skills that they can further develop in other classes and in their professional careers.
Outcomes
a. Demonstrate an ability to work effectively in teams.
b. Demonstrate the ability for effective verbal and written communication.
c. Able to differentiate between the different types of learning teams and can clearly explain the stages of team
development and the characteristics of an effective team.
d. Know the importance of task, friendship, and interaction to a team’s performance.
e. Able to demonstrate a clear understanding of the role and significance of team norms, teamwork skills,
communication, leadership, decision making, and conflict management in the effective functioning of a team.
Objective #4
Students will develop foundational knowledge and understanding of the core concepts of data mining inherent in
classification, cluster analysis, and association analysis as well as examples of their applications.
Outcomes
a. Show clear understanding by being able to describe or discuss hierarchical (e.g., agglomerative), partitional (e.g., k-means), ROCK, and DBSCAN algorithms as well as their appropriateness to different data clustering applications.
b. Able to briefly describe supervised, unsupervised, and relative cluster evaluation measures as well as to compare
and contrast them.
c. Able to demonstrate, using illustrative examples, basic knowledge and understanding of statistical, distance, decision
tree, neural network, rule-based, and support vector machine algorithms in solving the classification problem.
d. Able to evaluate, compare, and contrast the performance of two or more classifier models using different techniques.
e. Demonstrate the ability to differentiate between and descriptively explain the different types of association analysis
algorithms such as Apriori, sampling, partitioning, parallel, distributed, and frequent pattern growth.
f. Able to compare and contrast qualitative and quantitative measures for evaluating the quality of association
patterns.
Objective #5
Students will acquire the knowledge, skills, and expertise needed to design and develop innovative and imitative
algorithms for competitive products, processes, or services in a technology-oriented enterprise related to financial and
health informatics.
Outcomes
a. Develop skills and expertise in applying the knowledge of classification, clustering, association algorithms to solve
problems relating to financial and health care services, processes, or products.
b. Demonstrate the needed know-how to design and develop or modify algorithms for specific data mining
applications in finance and health care.
Objective #6
Students will be provided with opportunities to increase their knowledge of and exposure to entrepreneurial skills
through course activities, assignments, and interactions with mentors.
Outcomes
a. Acquire entrepreneurial skills while interacting with financial, health care, and/or information technology experts for
at least 10 hours to determine and execute the project as measured by different reporting mechanisms.
Tentative Examination Schedule:
Course Section: CS 325/CIT 348 (CRN: 23191/23190)
In-class examination dates: 2/9, 2/23, 3/9, 3/30, 4/13, & 4/27
Project due date: April 13, 2011
Final examination date: May 5, 2011
Note 1: In general, the lessons will highlight inquiry-based lecture-discussion and may include storytelling. The central focus
of the course will be critical thinking and problem-solving. To get the most out of the course, each student is expected to
study the reading assignments and genuinely attempt each homework problem before coming to class. The idea is to come to
class ready with questions about and ideas relating to the course materials and associated problems.
Note 2: In the interest of learning, it is very important to come to class prepared to learn – do all required assignments.
Failure to do so could diminish your ability to get the most out of each lesson and the class. Remember that learning is action
oriented. That is, it is not enough to come to class to listen to what others have to say. You should come to class prepared to
become involved in all aspects of classroom activities because learning is an active process.
Note 3: It is very important that you read and familiarize yourself with the SCSIS Statement of Student Responsibilities (see
Blackboard).
TOPICS

Weeks: 1-2
Topics: Data: Types of data; data quality; data preprocessing; measures of similarity and dissimilarity; large data sets; and data warehouses.
Assignments: Readings: chapter 2. Problems: chapter 2/ 1, 3, 6, 7, 9, & 12.

Weeks: 3-4
Topics: Data Preparation and Exploration: Raw data representation, characteristics, and transformation; missing data; summary statistics; decision trees and rules; data reduction techniques; neural networks; genetic algorithms; and visualization.
Assignments: Reading: chapter 3 & handouts. Problems: chapter 3/ 1, 2, 4, 6, & 17.

Weeks: 5-7
Topics: Classification: Introduction; approach to solving a classification problem; decision tree induction; model overfitting; evaluating a classifier's performance; comparing classifiers; rule-based classifiers; nearest-neighbor classifiers; Bayesian classifiers; neural networks; and support vector machines.
Assignments: Reading: chapters 4 & 5. Problems: chapter 4/ 1; chapter 5/ 1.

Weeks: 8-10
Topics: Cluster Analysis: Introduction; K-means; agglomerative hierarchical clustering; DBSCAN; cluster evaluation; and clustering with categorical attributes.
Assignments: Reading: chapter 8. Problems: To be assigned.

Weeks: 11
Topics: Project Submission and Presentation

Weeks: 12-13
Topics: Association Analysis: problem definition; generation and compact representation of frequent itemsets; rule generation; algorithms (sampling, partitioning, parallel, distributed, & FP-growth algorithm); & measuring and evaluation of association patterns.
Assignments: Reading: chapter 6. Problems: chapter 6/ 1.

Weeks: 13
Topics: Review for Final Examination

Weeks: 14
Topics: Final Examination.
Note 1: This course is structured around freely formed small collaborative groups in a cooperative learning environment.
Students are encouraged to work together in their respective groups to form effective and productive teams that share the
learning experience within the context of the course, help each other with learning difficulties, spend time to get to know
each other, and spend time each week to discuss and help one another with the course work (content and assignments).
Each group member is responsible for the completion and submission of each assignment. Each group member will be
individually graded.
Note 2: During the first class session, student background information will be collected to get a sense of the diversity of
student educational background and an assessment test will be given to determine students’ knowledge of the subject.
Group project: Students in small groups of two to four will participate in a project or research and prepare a report that
involves the use of a low-level or high-level programming language. In this project, students will write a program to
determine the solution of a technical problem, and then demonstrate their knowledge and understanding of how the
program is processed in the typical digital computer system. Grades for the group project will be assigned to individual
students based on their involvement in the following items: programming, report writing, proofreading and correction of
program code and the written report, and combinations of the above.
Web support: This course is supported with most or all of the following Blackboard postings: lesson questions, lessons
(PowerPoint), instructions and guidelines pertaining to the course, computer architecture and related news, group and
class discussion boards, email correspondence about the course, homework solutions, examination grades, and
miscellaneous course related activities and information.
Supplementary materials: Handouts in class or web postings of current events and issues affecting computer
architecture. Some books that may be helpful for the course will be posted on Blackboard.
In-class group activity and participation: Students are encouraged to bring to class current newsworthy events in
computer organization/architecture and related news to share with the class. Students will inform the class of the news
events and their significance to computing. Devote 15-20 minutes to this activity.
The collaborative groups are designed to function outside of the classroom. Collaborative group activities will be
reinforced inside the class during the lessons. Student groups are encouraged to function cohesively and to participate in
class activities. Devote 30-45 minutes of each class period to collaborative group activities.
Students are strongly encouraged to download posted lessons from Blackboard, review them, and come to class able
to ask intelligent questions about the material in these lessons.
Every effort will be made to present each lesson using the storytelling format supported with subsequent
discussion and elaboration on the central points of the lesson.
The key elements of a story are the following: causality, conflict, complication, and character.
The following excerpts about collaborative learning are from research documents:

In the university environment, educational success and social adjustments depend primarily on the availability and
effectiveness of developmental academic support systems.

Most organized learning occurs in some kind of group. Group characteristics and group processes significantly
contribute to success or failure in the classroom and directly affect the quality and quantity of learning within the group.

Group work invariably produces tensions that are normally absent, unnoticed, or suppressed in traditional classes.
Students bring with them a variety of personality types, cognitive styles, expectations about their own role in the
classroom and their relationship to the teacher, peers, and the subject matter of the course.

Collaborative learning involves both management and decision-making skills to choose among competing needs. The
problems encountered with collaboration have management, political, competence, and ethical dimensions.

The two key underlying principles of the collaborative pedagogy are that active student involvement is a more powerful
learning tool than passive attendance and that students working in groups can make for more effective learning than
students acting alone. The favorable outcomes of collaborative learning include greater conceptual understanding, a
heightened ability to apply concepts, and improved attendance. Moreover, students becoming responsible for their own
learning is likely to increase their skills for coping with ambiguity, uncertainty, and continuous change, all of which are
characteristics of contemporary organizations.
Who creates a new activity in the face of risk and uncertainty for the purpose of achieving success and growth by identifying
opportunities and putting together the required resources to benefit from them?
Creativity is the ability to develop new ideas and to discover new ways of looking at problems and opportunities.
Innovation is the ability to apply creative solutions to those problems and opportunities to enhance or to enrich people’s
lives.
Each group may be viewed as a small business that is seeking creative and innovative ways to maximize its
product, academic outcome or average group grade. A satisfactory product is the break-even group average grade
of 85%. Groups getting average grades above 85% are profitable enterprises.