Data Mining and Data Warehousing

General Course Information:
Faculty: Sam Sultan, Adjunct Assistant Professor
Email: [email protected]
Course title/number: Data Mining and Data Warehousing – MASY-GC3510-100
Credits: 3 credits
Semester/Year: Spring 2017
Class meeting/location: March 23, 2017 – May 2, 2017, 12 In-Person, Lab Sessions, Tuesday & Thursday at 6:00pm-9:00pm
Class Location:
Office Hours: Email me to request an appointment

Course Description:
In the competitive information age, data mining and data warehousing are increasingly essential in business decision-making. Although outsourcing is available and in-house personnel may not be very proficient, many organizations are already implementing and will continue to implement data mining and data warehousing technologies internally because of information security concerns and the administrative convenience of in-sourcing. This course teaches students concepts, methods and skills for mining data warehouses for the best competitive business strategy. It also develops analytical thinking to identify the strategy. The course utilizes lectures, readings, problems and case studies, and a data warehouse project. Upon the completion of this course, students will be able to apply and use various methods and techniques for mining the warehouse and using the resultant information for business decision-making.

Course Structure/Method:
This course will be delivered in-person, twice a week for 7 weeks. The class will encompass lectures, assignments, examples and demos, midterm and final exams, and a team project. All class content and assignments will be made available online via NYU Classes. Students should check the web site on a daily basis for any updates or announcements.

Course Learning Outcomes:
By the end of this course, students will be able to:
o Recite the purpose of an OLTP (transactional application/database)
o Recite the purpose of an ODS (operational reporting database)
o Recite the purpose of a DW or DM (data warehouse or data mart)
o Evaluate proper techniques for data normalization and de-normalization
o Distinguish between various database design models for OLTP, ODS, and DW
o Create well-structured dimension tables
o Create well-structured fact tables
o Model STAR and Snowflake schemas
o Demonstrate techniques for integrating various DM STAR schemas to create a data warehouse
o Clearly articulate the various steps of an ETL process (extract, transform and load)
o Recite the various issues of data transformation and integration
o Define dimension hierarchies and aggregate facts
o Recite the various techniques for data mining
o Apply data mining tools to help segment/classify your data

Communication Policy:
Credit students must use their NYU email to communicate. NYU Classes course-mail supports student privacy and FERPA guidelines. Email inquiries sent between Monday and Friday at 5pm will be answered within 24 hours. Email sent on Saturday or Sunday will not be answered until Monday. I will respond to you using NYU email.

Course Expectations:
Students are expected to participate in each class session by offering their understanding of the subject, sharing ideas, or discussing/commenting on another student's comment. In addition, students must complete and submit all assigned homework on time. Late homework will either not be accepted or will receive a lower grade. Students are also expected to develop and present a team project with other students, and to take and pass a midterm exam and a final exam. See full detail of expectations under “Assessment Strategy” below. Further information about specific assignments can also be found in the “Course Outline” section.

Attendance:
Students are expected to attend all classes. Excused absences are granted in cases of documented serious illness, family emergency, religious observance, or civic obligation. Absences for religious observance or civic obligation should be reported in advance. Unexcused absences from sessions may have a negative impact on a student’s final grade. Students are responsible for assignments given during any absence. Each unexcused absence may result in a student’s grade being lowered by a letter grade. A student who has three unexcused absences may earn a Fail grade.

University Calendar Policy on Religious Holidays:
https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/university-calendar-policy-on-religiousholidays.html

Class Participation:
To receive full credit for the course, you should attend all classes since much of the learning occurs during class lecture, presentation and class discussions. Please contact the instructor if you anticipate missing any part of the class. Please arrive on time so as not to disturb the flow of the lecture. Excessive lateness may result in a lower overall grade.

Required Material:
The Data Warehouse Lifecycle Toolkit (2nd Edition) – Available through Amazon
  o Authors - Kimball, Ross, Thornthwaite, Mundy & Becker
  o Publisher - Wiley
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd Edition) – Available through Amazon
  o Authors - Ralph Kimball, Margy Ross
  o Publisher - Wiley
The instructor may also provide session-by-session content, which will be posted online.

Recommended Material:
Building the Data Warehouse (4th edition)
  o Author - William Inmon
  o Publisher - Wiley

Assessment Strategy:
Contributing factors for determining your course grade include:
o Class Participation - 15% (Attendance is a prerequisite to participation)
o Homework - 15%
o Team Project - 20% (10% for individual contribution to the project, 10% for overall group contribution)
o Midterm Exam - 25%
o Final Exam - 25%

Class Participation:
To receive full credit for the course, you should attend all classes since much of the learning occurs during class lecture, presentation and class discussions. Please contact the instructor if you anticipate missing any part of the class. Participation grades will be based on:
o Involvement in class discussions and activities.
o Participation which demonstrates integration of reading, class work, relevance and application.
o Willingness to learn by accepting feedback, trying new skills and approaches, etc.
o Quality/quantity of providing effective and balanced feedback.

Homework:
Homework assignments must be submitted on time, within 1 week of the date assigned (unless otherwise instructed). Late submission will severely impact your homework grade, or may not be accepted at all, at the instructor’s discretion. All homework pages must be stapled together (paper clips not accepted).

Group/Team Project:
There will be a group/team class project. The project will be a culmination of written, visual and proper presentation skills. It will draw together the topics, concepts and competencies learned in this class. The group project grade will be based on:
o Student level of participation in the team project.
o Each student will be assessed both as an individual and as part of the overall team.
o Individual contribution will be assessed by identifying the components of the project that the student worked on and contributed to the overall project (for example, database creation, data preparation and load, etc.).
o Fulfillment of all requirements stated for the project.

Midterm Exam:
There will be a midterm exam. The exam will be an open book, open notes/internet style exam. The exam will test the student's acquisition of topics, concepts and competencies learned in this class up to mid-term.

Final Exam:
There will be a final exam. The exam will be an open book, open notes/internet style exam. The exam will test the student's acquisition of topics, concepts and competencies learned in this class. The final exam will only cover material covered in the second half of the term.

NYUSPS Policies:
“NYUSPS policies regarding the Family Educational Rights and Privacy Act (FERPA), Academic Integrity and Plagiarism, Students with Disabilities Statement, and Standards of Classroom Behavior among others can be found on the NYU Classes Academic Policies tab for all course sites as well as on the University and NYUSPS websites. Every student is responsible for reading, understanding, and complying with all of these policies.”
The full list of policies can be found at the web links below:
o University: http://www.nyu.edu/about/policies-guidelines-compliance.html
o NYUSPS: http://sps.nyu.edu/academics/academic-policies-and-procedures.html

TurnItIn:
TurnItIn is plagiarism detection software used to verify academic originality. It is available only to degree courses and students. Some assignments in this course may be checked for plagiarism using TurnItIn.

School Grading Policies:
NYUSPS Graduate: http://sps.nyu.edu/academics/academic-policies-and-procedures/graduate-academic-policies-and-procedures.html#Grades

Course Outline:
Readings should be completed within the week.

Session 1, Week 1, March 23, 2017, Introduction to Data Warehousing
  Introduction to Data Warehousing
    o Relationship of Data Mining and Data Warehousing
    o What is a Data Warehouse?
    o Data Warehousing ROI
    o DSS - Decision Support Systems
    o Operational vs. Analytical Systems
    o Evolution of DSS and Data Warehousing
    o OLTP - Online Transaction Processing
    o Characteristics of a Data Warehouse
    o What is a Data Mart?
    o Creating a Data Mart
    o Data Comparison Chart
    o OLAP - Online Analytical Processing
  Assignments (due one week from today):
    o Reading: Chapter 1 (from both The Data Warehouse Lifecycle Toolkit and Building the Data Warehouse). Skim through the glossary (of The Data Warehouse Lifecycle Toolkit)

Session 2, Week 2, March 28, 2017, Planning and Building the Data Warehouse
  Planning & Building the Data Warehouse
    o Sponsorship and Cost Justification
    o Project Prerequisites
    o Barriers, Challenges and Risks
    o Preparing for Implementation
    o Developing the Data Warehouse
    o SDLC Methodologies - Waterfall vs. RUP Approach
    o Planning & Project Management
    o Analysis
    o Logical & Physical Design
    o Implementation and Deployment
    o Operations
  Assignments (due one week from today):
    o Reading: Chapters 1, 2 (The Data Warehouse Lifecycle Toolkit)

Session 3, Week 2, March 30, 2017, Data Warehouse Design
  Data Warehouse Design
    o Drivers for Multi-Dimensional Analysis
    o Limitations of Relational Models
    o The Data Cube
    o What is dimensional modeling?
    o Advantages of Dimensional Models
    o Logical and Physical Design
    o Data Normalization
    o Benefits and Drawbacks of Data Normalization
    o De-Normalizing of Data
    o Characteristics of a Data Warehouse
      Subject Oriented, Integrated, Time Variant, Non-Volatile
    o The Star Schema
  Assignments (due one week from today):
    o Reading: Chapter 6 (The Data Warehouse Lifecycle Toolkit)

Session 4, Week 3, April 4, 2017, Data Warehouse Schemas
  Data Warehouse Schemas
    o Dimensions and Dimension Tables
    o Facts and Fact Tables
    o The Star Schema
    o The Snowflake Schema
    o Degenerate and Junk Dimensions
    o The Data Warehouse Bus Architecture
    o Conformed Dimensions and Standard Facts
    o Data Granularity
    o Changing Dimensions
  Assignments (due one week from today):
    o Reading: Chapter 6 (The Data Warehouse Lifecycle Toolkit)
    o Assignments: Create a logical database model using data normalization rules

Session 5, Week 3, April 6, 2017, Components of a Data Warehouse
  Components of a Data Warehouse
    o Source Systems, Staging Area, Presentation, Access Tools
    o Building the Data Matrix
    o The Four Steps Process
    o Multiple Fact Tables in a single Data Mart
      Chain, Heterogeneous, Transaction/Snapshot & Aggregate Facts
    o Fact and Dimension Table Detail
    o Identifying Source for each Fact & Dimension
    o Mapping from Source to Target
  Assignments (due one week from today):
    o Reading: Chapters 7, 4 (The Data Warehouse Lifecycle Toolkit)
    o Assignments: Create SELECT statements that perform various table joins from our database

Session 6, Week 4, April 11, 2017, The ETL Process
  The ETL Process
    o Extracting the Data into the Staging Area
    o The Challenge of Extracting from Disparate Platforms
    o Full vs. Incremental Extracts
    o Detecting Changes to Data
    o Transforming the Data
    o Complexity of Data Integration
    o Dealing with Missing & Dirty Data
    o Data Transformation Tasks
    o Loading the Data
    o Timing and Job Control of Data Loads
  Assignments (due one week from today):
    o Reading: Chapter 9 (The Data Warehouse Lifecycle Toolkit)

Session 7, Week 4, April 13, 2017, Midterm Exam, Aggregating Data
  Aggregating Data
    o Goals and Risks of Data Aggregation
    o Deciding What to Aggregate
    o Data Sparsity
    o Design Requirement for Aggregates
    o The problem with Aggregates
    o Aggregate Navigators
  Assignments (due one week from today):
    o Reading: Chapter 8, p353-357 (The Data Warehouse Lifecycle Toolkit)
    o Assignments: Create SELECT statements that perform data analytics against the data warehouse using aggregate functions and the GROUP BY clause
  Midterm Exam

Session 8a, Week 5, April 18, 2017, Selecting the Business Subject
  Selecting the Business Subject
    o Declaring the Grain
    o Choosing the Dimension
    o Identify the Fact
    o Avoiding Null Keys
    o Retail Market Basket Analysis
    o Additive and Semi-Additive Facts
    o The Value Chain
    o Integrated Inventory Model
    o Order Management Data Marts
    o Date and Other Dimension Role Playing
    o Allocation to Lower Level Facts
    o Profit and Loss Data Marts
  Assignments (due one week from today):
    o Reading: Chapters 2, 3, 5 (The Data Warehouse Toolkit)

Session 8b, Week 5, April 18, 2017, CRM and Financial Data Warehouses
  CRM Overview
    o Customer Dimension
    o Demographic Dimension Outriggers
    o Date Dimension Outriggers
    o Large Changing Customer Dimension
    o Mini-Dimensions
    o Commercial Customer Hierarchies
    o Fixed vs. Variable Level Hierarchies
    o General Ledger Accounting
    o OLAP role in G/L and Chart of Accounts
    o Time Stamped Employee Dimensions
  Assignments (due one week from today):
    o Reading: Chapters 6, 7, 8 (The Data Warehouse Toolkit)
    o Assignments: Use the logical database model created in session 4 to create the physical tables in your database

Session 9, Week 5, April 20, 2017, Clickstream and Web Based Data Warehouses
  Clickstream and Web Based Data Warehouses
    o Overview of Web Based Interaction
    o Challenges of Tracking Data
    o Creating Persistent State on the Web
    o Techniques for Tracking States
    o Working with Cookies
    o User Registration
    o Web Server Log Files
    o Online Advertising
    o Online Page Tracking and Analytics
    o User Dimension and Page Hits Facts
  Assignments (due one week from today):
    o Reading: Chapter 15 (The Data Warehouse Toolkit)

Session 10, Week 6, April 25, 2017, Introduction to Data Mining
  Data Mining
    o What is Data Mining Good For?
    o Statistics, Artificial Intelligence & Machine Learning
    o Data Mining Examples and Tools
    o Connection between Data Mining and Data Warehousing
    o Retrospective Reporting vs. Predictive Data Mining Applications
    o Data Mining vs. Statistics vs. OLAP
    o Data Mining Statistical Techniques (Sampling, Regression & Decision Trees)
    o Clustering, Segmentation and Nearest Neighbor Techniques
    o Keys to commercial success of Data Mining
  Assignments (due one week from today):
    o Reading: Online web research and reading

Session 11, Week 6, April 27, 2017, Data Mining Techniques
  Data Mining Techniques
    o Hands-on Presentation and Lab
    o Classification, Regression, Similarity Matching, Co-occurrence Grouping
    o Predictive Modeling
    o Clustering/Segmentation
    o Data Mining and Statistics Terminologies
    o Supervised vs. Unsupervised
    o Tree Induction
    o Entropy and Information Gain
  Assignments:
    o Reading: Online web research and reading

Session 12, Week 7, May 2, 2017, Final Exam and Team Project
  o Final Exam
  o Team Project Due