Download syllabus

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Data Mining and Data Warehousing
General Course Information:
Faculty:
Email:
Course title/number:
Credits:
Semester/Year:
Class meeting/location:
Class Location:
Office Hours
Sam Sultan, Adjunct Assistant Professor
[email protected]
Data Mining and Data Warehousing – MASY-GC3510-100
3 credits
Spring 2017
March 23, 2017 – May 2, 2017,
12 In-Person, Lab Sessions, Tuesday & Thursday at 6:00pm-9:00pm
Email me to request an appointment
Course Description:
In the competitive information age, data mining and data warehousing are increasingly essential in business
decision-making. Although outsourcing is available and in-house personnel may not be very proficient, many
organizations are already implementing and will continue to implement data mining and data warehousing
technologies internally because of information security concerns and the administrative convenience of in-sourcing.
This course teaches students concepts, methods and skills for mining data warehouses for the best competitive
business strategy. It also develops analytical thinking to identify the strategy. The course utilizes lectures, readings,
problems and case studies, and a data warehouse project. Upon the completion of this course, students will be able
to apply and use various methods and techniques for mining the warehouse and using the resultant information for
business decision-making.
Course Structure/Method:
This course will be delivered in-person, twice a week for 7 weeks. The class will encompass lectures, assignments,
examples and demos, midterm and final exams, and a team project. All class content and assignments will be made
available online via NYU Classes. Students should check the web site on a daily basis for any updates or
announcements.
Course Learning Outcomes:
By the end of this course, students will be able to:








Recite the purpose of an OLTP (transactional application/database)
Recite the purpose of an ODS (operational reporting database)
Recite the purpose of an DW or DM (data warehouse or data mart)
Evaluate proper techniques for data normalization and de-normalization
Distinguish between various database design models for OLTP, ODS, and DW
Create well-structured dimension tables
Create well-structured fact tables
Model STAR and Snowflake schemas






Demonstrate techniques for integrating various DM STAR schemas to create a data warehouse
Clearly articulate the various steps of an ETL process (extract, transform and load)
Recite the various issues of data transformation and integration
Define dimension hierarchies and aggregates facts
Recite the various techniques for data mining
Apply data mining tools to help segment/classify your data
Communication Policy:
Credit students must use their NYU email to communicate. NYU Classes course-mail supports student privacy
and FERPA guidelines. All email inquiries will be responded to within 24 hours during Monday through Friday
5pm. Email sent on Saturday or Sunday will not be responded to until Monday. I will respond to you using NYU
email.
Course Expectations:
Students are expected to participate in each class session by offering their understanding of the subject, sharing
ideas or discussing/commenting on another students comment. In addition, students must complete and
submit all assigned homework on time. Late submission of homework will either not be accepted, or will result
in a lower grade. Students are also expected to develop with and present a team project with other students,
as well as take and pass a midterm exam and a final exam.
See full detail of expectations under “Assessment Strategy” below. Further information about specific
assignments can also be found in the “Course Outline” section.
Attendance: Students are expected to attend all classes. Excused absences are granted in cases of
documented serious illness, family emergency, religious observance, or civic obligation. In the case of
religious observance or civic obligation, this should be reported in advance. Unexcused absences from
sessions may have a negative impact on a student’s final grade. Students are responsible for assignments
given during any absence. Each unexcused absence may result in a student’s grade being lowered by a letter
grade. A student who has three unexcused absences may earn a Fail grade
University Calendar Policy on Religious Holidays
https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/university-calendar-policy-on-religiousholidays.html
Class Participation: To receive full credit for the course, you should attend all classes since much of the
learning occurs during class lecture, presentation and class discussions. Please contact the instructor if you
anticipate missing any part of the class. Please arrive on time so as not to disturb the flow of the lecture.
Excessive lateness’s may result in lower overall grade.
Required Material:

The Data Warehouse Lifecycle Toolkit (2nd Edition) – Available through Amazon
o
o

The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd Edition) – Available
through Amazon
o
o

Authors - Kimball, Ross, Thornthwaite, Mundy & Becker
Publisher - Wiley
Authors - Ralph Kimball, Margy Ross
Publisher - Wiley
Instructor may also provide session by session content, which will be posted online.
Recommended Material

Building the Data Warehouse (4th edition)
o
Author - William Inmon
o
Publisher - Wiley
Assessment Strategy:
Contributing factors for determining your course grade include:





Class Participation - 15% (Attendance is a prerequisite to participation)
Homework - 15%
Team Project - 20% (10% for individual contribution to the project, 10% for overall group contribution)
Midterm Exam - 25%
Final Exam - 25%

Class Participation: To receive full credit for the course, you should attend all classes since much of the
learning occurs during class lecture, presentation and class discussions. Please contact the instructor if you
anticipate missing any part of the class. Participation grades will be based on:
o
o
o
o
Involvement in class discussions and activities
Participation which demonstrates integration of reading, class work, relevance and application.
Willingness to learn by accepting feedback, trying new skills and approaches, etc.
Quality/quantity of providing effective and balanced feedback.

Homework: Homework assignments must be submitted on time within 1 week of date assigned (unless
otherwise instructed). Late submission will severely impact your homework grade, or may not be accepted
altogether at instructor’s discretion. All homework pages must be stapled together (paper clips not
accepted).

Group/Team Project: There will be a group/team class project. The project will be a culmination of written,
visual and proper presentation skills. It will include the culmination of topics, concepts and competencies
learned in this class. The group project grade will be based on:
o Student level of participation in the team project.
o Student will be assessed both as an individual, and as part of the overall team
o Individual contribution will be assessed by identifying the components of the project student worked on
and contributed to the overall project (Example database creation, data preparation and load, etc.)
o Fulfilment of all requirements stated for the project

Midterm Exam: There will be a midterm exam. The exam will be an open book, open notes/internet style
exam. The exam will test the student's acquisition of topics, concepts and competencies learned in this class
up to mid-term.

Final Exam: There will be a final exam. The exam will be an open book, open notes/internet style exam. The
exam will test the student's acquisition of topics, concepts and competencies learned in this class. The final
exam will only cover material covered in the second half of the term.
NYUSPS Policies:
“NYUSPS policies regarding the Family Educational Rights and Privacy Act (FERPA),
Academic Integrity and Plagiarism, Students with Disabilities Statement, and Standards of
Classroom Behavior among others can be found on the NYU Classes Academic Policies tab
for all course sites as well as on the University and NYUSPS websites.
Every student is responsible for reading, understanding, and complying with all of these
policies.”
The full list of policies can be found at the web links below:


University: http://www.nyu.edu/about/policies-guidelines-compliance.html
NYUSPS: http://sps.nyu.edu/academics/academic-policies-and-procedures.html
TurnItIn:
TurnItIn is a plagiarism detection software used to verify academic originality. It is available only to degree courses
and students.
Some assignments in this course may be checked for plagiarism using TurnItIn.
School Grading Policies:
NYUSPS Graduate
http://sps.nyu.edu/academics/academic-policies-and-procedures/graduate-academic-policies-and-procedu
res.html#Grades
Course Outline:
Readings should be completed within the week.
Session 1, Week 1, March 23, 2017, Introduction to Data Warehousing












Introduction to Data Warehousing
Relationship of Data Mining and Data Warehousing
What is a Data Warehouse?
Data Warehousing ROI
DSS - Decision Support Systems
Operational vs. Analytical Systems
Evolution of DSS and Data Warehousing
OLTP - Online Transaction Processing
Characteristics of a Data Warehouse
What is a Data Mart? Creating a Data Mart
Data Comparison Chart
OLAP - Online Analytical Processing

Assignments (due one week from today)
 Reading: Chapter 1 (from both Data Warehouse Lifecycle Toolkit, and Building the Data Warehouse).
Skim thru glossary (of Data Warehouse Lifecycle Toolkit)
Session 2, Week 2, March 28, 2017, Planning and Building the Data Warehouse












Planning & Building the Data Warehouse
Sponsorship and Cost Justification
Project Prerequisites
Barriers, Challenges and Risks
Preparing for Implementation
Developing the Data Warehouse
SDLC Methodologies - Waterfall vs. RUP Approach
Planning & Project Management
Analysis
Logical & Physical Design
Implementation and Deployment
Operations

Assignments (due one week from today):
 Reading: Chapter 1, 2 (The Data Warehouse Lifecycle Toolkit)
Session 3, Week 2, March 30, 2017, Data Warehouse Design













Data Warehouse Design
Drivers for Multi-Dimensional Analysis
Limitations of Relational Models
The Data Cube
What is dimensional modeling?
Advantages of Dimensional Models
Logical and Physical Design
Data Normalization
Benefits and Drawbacks of Data Normalization
De-Normalizing of Data
Characteristics of a Data Warehouse
Subject Oriented, Integrated, Time Variant, Non-Volatile
The Star Schema

Assignments (due one week from today):
 Reading: Chapter 6 (The Data Warehouse Lifecycle Toolkit)
Session 4, Week 3, April 4, 2017, Data Warehouse Schemas










Data Warehouse Schemas
Dimensions and Dimension Tables
Facts and Fact Tables
The Star Schema
The Snowflake Schema
Degenerate and Junk Dimensions
The Data Warehouse Bus Architecture
Conformed Dimensions and Standard Facts
Data Granularity
Changing Dimensions

Assignments (due one week from today):
 Reading: Chapter 6 (The Data Warehouse Lifecycle Toolkit)
 Assignments: Create a logical database model using data normalization rules
Session 5, Week 3, April 6, 2017, Components of a Data Warehouse









Components of a Data Warehouse
Source Systems, Staging Area, Presentation, Access Tools
Building the Data Matrix
The Four Steps Process
Multiple Fact Tables in a single Data Mart
Chain, Heterogeneous, Transaction/Snapshot & Aggregate Facts
Fact and Dimension Table Detail
Identifying Source for each Fact & Dimension
Mapping from Source to Target

Assignments (due one week from today):
 Reading: Chapter 7, 4 (The Data Warehouse Lifecycle Toolkit)
 Assignments: Create SELECT statements that perform various table joins from our database
Session 6, Week 4, April 11, 2017, The ETL Process











The ETL Process
Extracting the Data into the Staging Area
The Challenge of Extracting from Disparate Platforms
Full vs. Incremental Extracts
Detecting Changes to Data
Transforming the Data
Complexity of Data Integration
Dealing with Missing & Dirty Data
Data Transformation Tasks
Loading the Data
Timing and Job Control of Data Loads

Assignments (due one week from today):
 Reading: Chapter 9 (The Data Warehouse Lifecycle Toolkit)
Session 7, Week 4, April 13, 2017, Midterm Exam, Aggregating Data







Aggregating Data
Goals and Risks of Data Aggregation
Deciding What to Aggregate
Data Sparsity
Design Requirement for Aggregates
The problem with Aggregates
Aggregate Navigators

Assignments (due one week from today):
 Reading: Chapter 8 p353-357 (The Data Warehouse Lifecycle Toolkit)
 Assignments: Create SELECT statements that perform data analytics against the data warehouse
using aggregate functions and the GROUP BY clause
 Midterm Exam
Session 8a, Week 5, April 18, 2017, Selecting the Business Subject



Selecting the Business Subject
Declaring the Grain
Choosing the Dimension









Identify the Fact
Avoiding Null Keys
Retail Market Basket Analysis
Additive and Semi-Additive Facts
The Value Chain Integrated Inventory Model
Order Management Data Marts
Date and Other Dimension Role Playing
Allocation to Lower Level Facts
Profit and Loss Data Marts

Assignments (due one week from today):
 Reading: Chapter 2, 3, 5 (The Data Warehouse Toolkit)
Session 8b, Week 5, April 18, 2017, CRM and Financial Data Warehouses











CRM Overview
Customer Dimension
Demographic Dimension Outriggers
Date Dimension Outriggers
Large Changing Customer Dimension
Mini-Dimensions
Commercial Customer Hierarchies
Fixed vs. Variable Level Hierarchies
General Ledger Accounting
OLAP role in G/L and Chart of Accounts
Time Stamped Employee Dimensions

Assignments (due one week from today):
 Reading: Chapter 6, 7, 8 (The Data Warehouse Toolkit)
 Assignments: Use the logical database model created in session 4 to create the physical tables
in your database
Session 9, Week 5, April 20, 2017, Clickstream and Web Based Data Warehouses











Clickstream and Web Based Data Warehouses
Overview of Web Based Interaction
Challenges of Tracking Data
Creating Persistent State on the Web
Techniques for Tracking States
Working with Cookies
User Registration
Web Server Log Files
Online Advertising
Online Page Tracking and Analytics
User Dimension and Page Hits Facts

Assignments (due one week from today):
 Reading: Chapter 15 (The Data Warehouse Toolkit)
Session 10, Week 6, April 25, 2017, Introduction to Data Mining




Data Mining
What is Data Mining Good For?
Statistics, Artificial Intelligence & Machine Learning
Data Mining Examples and Tools







Connection between Data Mining and Data Warehousing
Retrospective Reporting vs. Predictive
Data Mining Applications
Data Mining vs. Statistics vs. OLAP
Data Mining Statistical Techniques (Sampling, Regression & Decision Trees)
Clustering, Segmentation and Nearest Neighbor Techniques
Keys to commercial success of Data Mining

Assignments (due one week from today):
 Reading: Online web research and reading
Session 11, Week 6, April 27, 2017, Data Mining Techniques









Data Mining Techniques
Hands-on Presentation and Lab
Classification, Regression, Similarity Matching, Co-occurrence Grouping
Predictive Modeling
Clustering/Segmentation
Data Mining and Statistics Terminologies
Supervised vs. Unsupervised
Tree Induction
Entropy and Information Gain

Assignments:
 Reading: Online web research and reading
Session 12, Week 7, May 2, 2017, Final Exam and Team Project


Final Exam
Team Project Due