IS 579— Business Intelligence and Data Mining
Instructor:    Dr. Debabrata (Deb) Dey
Office:        PCAR 514
Phone:         206–543–1855
Email:         [email protected]
WWW:           http://faculty.washington.edu/ddey/index.htm
Course WWW:    UW Canvas (http://canvas.uw.edu/)
Office hours:  Thursdays 4–6 PM
Text, Software, and Other Resources
Text: Data Mining: Practical Machine Learning Tools and Techniques, Witten, Frank, & Hall, Morgan Kaufmann, 3rd
Ed., 2011.
Software: Microsoft Excel and WEKA–Version 3.6.10 (http://www.cs.waikato.ac.nz/ml/weka/).
Links to web-based resources will sometimes be provided.
(There are many web sites that provide tutorials and other background materials for basic data mining
concepts. Unfortunately, not all of them are always correct. Since it is not possible for me to review every
possible site and check its correctness, I suggest that you restrict yourself to only those web resources that are
provided by me in class and at the Canvas site.)
Course Packet
For each topic discussed in this course, I have prepared an extensive set of PowerPoint slides. These slide-decks are all posted at the Canvas site. These presentations are not intended to substitute for regular reading
from the textbook; they merely summarize the topics, issues, and concepts discussed in class.
These slide-decks will save you time on note-taking in class, so that you can make better use of that
time by listening, asking questions, and participating in class.
In addition, all practice questions and solutions will be posted at the Canvas site.
(You do not have to buy any course packet from copy centers.)
Course Description and Objective
The objective of this course is to introduce students to the various techniques of data mining so that they can
identify problems and opportunities in their companies and apply these techniques. Special attention will
be given to existing real-world applications that make use of data mining techniques. Students are expected to
understand the basic concepts and their applicability, but are not expected to do programming or detailed
implementation.
Course Motivation
Over the years, organizations have accumulated a vast amount of data in their enterprise-wide information
systems. These data typically represent daily operations and transactions within a business context. It is easy
to see that all the business intelligence and rules are, in some way, embedded in these data. The question then
becomes one of how we can mine this vast amount of data in order to: (i) learn the embedded business
intelligence, and (ii) apply that intelligence to run a business in a more efficient and effective manner.
Over the last decade or so, data mining has become a very important part of intelligent business
practices, leading to higher revenues and lower costs, while maintaining or enhancing the quality of products
and services. These techniques can be applied in any one functional area or in a combination of areas. Examples of
such applications abound in the real world. Companies such as Blockbuster, Amazon, and American Express,
for example, are mining their transactions data to recommend appropriate products and services to their
customers. WalMart is using data mining techniques for more efficient logistics, supply chain management,
inventory control, and pricing. Many financial institutions are using data mining for loan processing, credit
rating, and target marketing. Companies are also using data mining techniques along with their online
presence for technical support and customized solution provision.
IS 579 Syllabus: Deb Dey
Classroom Expectations
Please bring copies of the posted slide-decks to every class. Please display your nametag at each session.
Please turn off (or put in the silent mode) your cell phone during class time.
We will be working with numbers in class to gain better understanding of the concepts; please bring your
calculator to every class. If you bring a laptop computer or a handheld device to class, please restrict its use
to class-related purposes only.
Class Participation
Interactive learning is not only fun, but is also very effective in grasping the material more quickly. In order to
promote a classroom environment conducive to interactive (often, also called active) learning, I will regularly
grade students based on their participation during class. Participation points are awarded for thoughtful,
pertinent questions, answers, and discussions; at the same time, frivolous questions may lead to a deduction
of participation points. Please place your nametag in front of you throughout the quarter to enable me to
grade you correctly; a missing nametag simply means no participation points.
Homework and Course Project
There are no graded homework assignments for this class. However, to provide regular practice, sets of
practice questions will be given out. Students are encouraged to work on these questions individually, and
consult the posted solutions. If required, direct feedback on your work will also be provided.
There is a team project for this course. Details for the project will be provided separately.
Exams and Pop Quizzes
There will be two exams in this course: an in-class Exam 1 and a take-home Exam 2. Both exams are open
book and notes. For Exam 1, a calculator will be required; a computer is not required. For Exam 2, you may
use a calculator or a computer. Exam 2 is the final exam and is cumulative. It should be submitted at the Canvas site on or
before the due date.
Makeup exams are usually not granted. In exceptional situations, a makeup oral/written exam can be
arranged for a student. Please consult me early in the quarter if you foresee conflicts with your work or work-related travel schedule.
I will be regularly giving out in-class quizzes to ensure that you keep up with the basic concepts. The
lowest-score quiz will be dropped in grade calculations.
Grading
Course Project         20%
Pop Quizzes            20%
Exam 1 (in-class)      20%
Exam 2 (take-home)     30%
Class Participation    10%
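As a rough illustration of how the grading weights combine, the Python sketch below computes an overall course percentage. It is only a sketch under stated assumptions: each component is taken to be a percentage from 0 to 100, and the quiz component is taken to be the plain average of the quiz scores after the lowest one is dropped. The function name and the sample scores are hypothetical and are not part of the official grading procedure.

```python
# Component weights as listed in the grading table (they sum to 1.0).
WEIGHTS = {
    "project": 0.20,
    "quizzes": 0.20,
    "exam1": 0.20,
    "exam2": 0.30,
    "participation": 0.10,
}


def course_percentage(project, quizzes, exam1, exam2, participation):
    """Return the weighted course percentage on a 0-100 scale.

    Assumptions (not stated in the syllabus): every score is a
    percentage from 0 to 100, and the quiz component is the plain
    average of the quiz scores left after dropping the lowest one.
    """
    if len(quizzes) < 2:
        raise ValueError("need at least two quiz scores to drop the lowest")
    kept = sorted(quizzes)[1:]  # drop the single lowest quiz score
    quiz_average = sum(kept) / len(kept)
    scores = {
        "project": project,
        "quizzes": quiz_average,
        "exam1": exam1,
        "exam2": exam2,
        "participation": participation,
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student.
overall = course_percentage(
    project=90, quizzes=[60, 80, 100], exam1=85, exam2=80, participation=100
)
print(f"Overall: {overall:.1f}%")
```

With these made-up scores, the lowest quiz (60) is dropped, the quiz average becomes 90, and the overall result is 87%.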
Graded Work, Feedback, and Solutions
Graded work (pop quizzes and midterm) and feedback (on submitted practice questions) will be returned
promptly to your pick-up folder; please check your pick-up folders regularly. Solutions to practice questions
will be posted at the Canvas site. Please check your answers against these solutions. If you have questions,
please bring them to my attention as soon as possible.
Tentative Schedule

Date(s)                                   Lecture Topics (in the order listed)
09/26/13, 10/03/13, 10/10/13, 10/17/13    Introduction; Database and Probability; Information Theory;
                                          Data Cleaning, Conversion, & Preparation; Basic Classification;
                                          Basic Classification; WEKA; Testing & Validation
10/24/13                                  Advanced Classification; Review for Exam 1
10/31/13                                  Exam 1
11/07/13, 11/14/13, 11/21/13              Association Rule Mining; Association Rule Mining;
                                          Numerical Prediction; Numerical Prediction;
                                          Other Classification Techniques; Implementation & Management;
                                          Review for Exam 2
11/25/13
11/28/13                                  No Class (Thanksgiving)
12/05/13                                  Advanced Topics (TBA)

Posted and Due (in the order listed): Practice Set 1; Practice Set 2; Practice Set 3; Practice Set 4;
Exam 1 (In-Class); Practice Set 5; Practice Set 6; Exam 2 (Take-Home); Class Project; Exam 2 (Take-Home)

Additional Tutorial and Review
Date        Time
10/10/13    9:00–10:00 PM
10/24/13    9:00–10:00 PM
11/07/13    9:00–10:00 PM
11/21/13    9:00–10:00 PM
Lecture Notes and Description of Topics
Chapter 1: Introduction
Topics: Basic concepts
Reading: Text Chapters 1 and 3
Chapter 2: Database and Probability Basics
Topics: Relational databases, SQL, Bayes’ rule, conditional independence, contingency tables
Chapter 3: Information Theory
Topics: Information theory, information gain, tree induction, gain ratio
Chapter 4: Data Cleaning, Conversion and Preparation
Topics: Noise, redundancy, (lack of) specificity, heterogeneity, attribute expansion, attribute consolidation,
input formatting, data partitioning
Reading: Text Chapters 2 and 7.2
Chapter 5: Basic Classification
Topics: Basic concepts of Naïve Bayesian and decision tree (ID3) classifiers
Reading: Text Chapters 4.2 and 4.3
Chapter 6: Testing and Validation
Topics: Training versus testing data, partitioning datasets, accuracy, stratified accuracy, confusion matrix,
relative information score (RIS), cost-based measure, lift ratio
Reading: Text Chapter 5
Chapter 7: Advanced Classification
Topics: Dealing with missing values and numeric features, decision tree (C4.5), pruning
Reading: Text Chapters 4.2, 4.3, and 6.1
Chapter 8: Association Rules
Topics: Market basket analysis, items and itemsets, support, confidence, mining basics, Apriori-Gen
algorithm, single item transactions
Reading: Text Chapters 4.5 and 6.3
Chapter 9: Numerical Prediction
Topics: Linear regression, multiple regression, performance measures, regression trees
Reading: Text Chapters 4.6 and 6.5
Chapter 10: Other Classification Techniques
Topics: Rule-based and case-based classification
Reading: Text Chapters 4.4, 4.7, and 6.4
Chapter 11: Implementation and Management
Topics: Data integration and preparation, data quality, choice of techniques, choice of software tools, build
vs. buy, insource vs. outsource, performance issues, testing, deployment, change management
Reading: To be announced