Syllabus - Brandeis University

BUS 212f (2) ANALYZING BIG DATA II
PRELIMINARY SYLLABUS—CASES SUBJECT TO CHANGE
Spring 2017—Tuesdays & Thursdays 5:00–6:20 pm
Sachar 116 (International Hall)
Prof. Robert Carver
781-775-5493 (mobile)
[email protected]
Office: Sachar 1C (far end of computer cluster)
Hours: Tuesdays, 3:00 – 4:45 and by appointment
TAs: Boxi Pang & Shiyu Wang
Overview
This is a two-credit module that is a continuation of BUS 211f. This module
provides theoretical and hands-on instruction in major elements of Big Data
analytics: management-oriented visualizations, data mining, selected machine
learning methods, and predictive modeling. Through the use of widely adopted
software tools, students will build models and execute analyses to address
current needs of selected Brandeis administrative offices as well as solve
problems presented in cases. Assignments and classroom time will be devoted
both to analysis of current developments in business analytics and to gaining
experience with current tools.
Required Readings
Provost, Foster & Fawcett, Tom. Data Science for Business: What You Need to
Know about Data Mining and Data-Analytic Thinking. (2013, Sebastopol, CA:
O’Reilly Media) ISBN 978-1449361327. Purchase at Bookstore or on-line.
There is a required on-line course pack available for purchase at the Harvard
Business Publishing website. A direct link is available on LATTE, and is
http://cb.hbsp.harvard.edu/cbmp/access/56407458
See last page of Syllabus for course pack contents.
Other readings as posted on LATTE site.
Recommended Readings
Berry, M. and Linoff, G. Data Mining Techniques for Marketing, Sales, and
Customer Relationship Management. 3rd ed. (2011, Wiley) available on-line
through LTS. Ebook ISBN 9781118087459.
Hastie, T., Tibshirani, R. and Friedman, J.H. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. (2001, Springer). Available in library
main stacks; pdf of new edition available for download at
http://www-stat.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
Prerequisites
BUS 211f or permission of instructor.
BUS 212 f(2) Spring 2017
Learning Goals and Objectives
Upon successful completion of this module, students will:
• Understand the challenges of performing a business needs assessment to
determine how analytics and visual displays can provide business value
• Be able to use training, validation, and test datasets to carry out data mining
analyses
• Use common techniques such as multiple regression, partition trees, and
k-means clustering to develop predictive models
• Apply best practices of predictive modeling to real and realistic business
problems
• Design informational graphics and displays grounded in concepts of
business needs and principles of human cognitive processes
Course Approach
Analysis of massive, real-time data is rapidly gaining prominence in
numerous industries, with applications ranging from fraud detection to
consumer behavior. As in the predecessor course (BUS 211f), BUS 212f uses
theory, cases, and hands-on analysis to approach course topics. In six short
weeks, we can only dive so deep; we aim for depth in a carefully selected list of
topics rather than breadth. Students should expect to grapple with complex
software-based analyses that do not lend themselves to quick, easy solutions.
Communications
We’ll make regular use of LATTE. All lecture notes, handouts, assignments,
and supporting materials will be available via LATTE, and any late-breaking
news will reach you via email. Please check your Brandeis email and the LATTE
site regularly to keep apprised of important course-related announcements.
Other Course Technology
All of the software we will use in this course can be accessed on the public
computer clusters at IBS and/or on your personal laptops. If you do use a
laptop, the class schedule below indicates dates when it will be useful to have it
with you.
As in BUS 211f, we will make use of proprietary and public-use databases
accessible through the World Wide Web. We’ll continue to use some of the tools
we adopted in that course as well as R for most of our analysis.
You should bring your laptop to each class; please use it judiciously.

• R: R is a free software environment for statistical computing and graphics,
and is widely used in both academia and industry. An advantage of R is that
it runs on both Windows and Mac OS. It ranked no. 1 in the KDnuggets 2013
poll on top languages for analytics, data mining, and data science. RStudio
is a popular, user-friendly environment for R.
R Software: http://www.r-project.org/index.html.
RStudio: http://www.rstudio.com/products/RStudio/#Desk
This term we expect to use RStudio Server, which allows us to access the software
through a web browser.
• GitHub is also a free environment that facilitates (a) collaborative work and
(b) version control for software projects that are under development. It is
very widely used by data scientists to manage and share their work.
Student Classroom Contributions
Class participation is important in this course both as a means of
developing understanding and as an indicator of student progress. Participation
can take many forms, and each student is expected to contribute actively, freely,
and effectively to the classroom experience by raising questions, demonstrating
preparedness and proficiency in the analysis of problems and cases, and
explaining the implications of particular analyses in context. Homework-based
discussion and presentations are an important part of participation. To this
end, regular class attendance is required, and students should use name
cards. We meet only twelve times, so absence can become a serious problem.
Even if you must arrive late or leave early, be here.
With assistance from the TAs, I will evaluate the quality of your
contributions in class each evening, as well as the quality of your contributions
via email, LATTE discussion, etc. These will all be factored together in
determining your ultimate Contributions grade (see below). In general, absence
from class reduces your contribution grade.
Written Assignments and Projects
Students will complete five analytic assignments during the course. Three
of these will be brief analyses, requiring both computer modeling and writing.
These may be completed with one or two partners, and each student should
expect to briefly discuss one of their work products in class.
The other two written assignments will be more substantial projects
requiring more significant time and analysis. The project assignments will be
prepared in teams of four students, and will include written and
computer-based elements. Owing to the size of the class, students will have only limited
opportunities to present parts of their projects orally in the course.
All assignments should be submitted via LATTE upload prior to the start of
class. Papers should be professional in appearance and use clear, grammatically
correct business English. Analytical work (graphs, tables, and other output)
should be incorporated seamlessly into the written document, showing readers
exactly and only what you want them to see.
Evaluation
Your final grade in the course will be computed using these weights:
Contributions to Class Discussions     20%
Brief analyses (3)                     40%
Projects (2)                           40%
TOTAL                                 100%
Please note!
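The weights above combine into the final grade as a weighted average. The following is a minimal sketch in Python (the course itself uses R). The function name and the sample scores are hypothetical, and it assumes each component is scored on a 0-100 scale, which this syllabus does not state.

```python
# Weights taken from the Evaluation section of this syllabus.
WEIGHTS = {
    "contributions": 0.20,   # Contributions to Class Discussions
    "brief_analyses": 0.40,  # Brief analyses (3)
    "projects": 0.40,        # Projects (2)
}


def final_grade(scores):
    """Return the weighted average of the component scores.

    `scores` maps each component name in WEIGHTS to a score on an
    assumed 0-100 scale (the scale is an assumption, not course policy).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {"contributions": 90, "brief_analyses": 85, "projects": 80}
print(round(final_grade(example), 2))
```

Because the brief analyses and the projects together carry 80% of the weight, a change in either of those components moves the final grade twice as much as the same change in the contributions score.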
Workload Expectations
Success in this two-credit course is based on the expectation that students will
spend a minimum of 9 hours of study time per week for six weeks in
preparation for class (readings, papers, discussion sections, preparation for
exams, etc.).
Academic Integrity
You are expected to be honest in all of your academic work. Please consult
Brandeis University Rights and Responsibilities for all policies and procedures
related to academic integrity. Students may be required to submit work to
TurnItIn.com software to verify originality. Allegations of academic
dishonesty will be forwarded to the Director of Academic Integrity. Sanctions
for academic dishonesty can include failing grades and/or suspension from the
university. Citation and research assistance can be found at LTS - Library
guides.
Disabilities
If you are a student with a documented disability on record at Brandeis
and wish to have a reasonable accommodation made for you in this class,
please see me immediately.
Study Groups
Working with one or two partners is an excellent way to gain understanding of
this subject. I encourage small groups to work on assignments, with a few
caveats:
• Be sure that you are neither carrying nor being carried by the group; each
member of the group is entitled to learn and expected to contribute.
• Except for the group project, each student is responsible for turning in
original memos and problem sets.
• Each group member retains the right to “go it alone.” Joining a group is not
a marriage. Similarly, teams are encouraged to dismiss underperforming
members.
Course Outline
Note: for each session, you should complete the assigned reading before coming to class. See list of
deliverables on next page; detailed assignments will be distributed in class each week, and all
assignments and handouts will also be available on our LATTE site. The abbreviation “P&F” refers
to the Provost and Fawcett book.
Session 1, Tu March 14: Support Business Intelligence
READINGS: Russom, Big Data Analytics (2013, on LATTE)
P&F, Chapters 1 & 2
a. Course introduction and objectives
b. Relationship of Business knowledge and Big Data Analytics
c. Data Mining Process (overview)
d. Review R & R Studio
Deliverable due by class time: (none)
Laptops helpful

Session 2, Th March 16
READINGS: Watson, “All about Analytics”
Leek & Peng, “What is the Question?”

Session 3, Tu March 21: Decision Trees & Logistic Regression
READINGS: P&F, Chap 3
Loh, “Classification and Regression Trees” (LATTE)
CASE READING: A Game of Two Halves: In-Play Betting in Football
a. Supervised Segmentation
b. Theory: Decision trees and concepts of Logistic Regression (simple/multinomial logistic)
Deliverable due by class time: Analysis 1 (R data analysis)

Session 4, Th March 23
READINGS: P&F, Chap 4
CASE READING: A Game of Two Halves: In-Play Betting in Football
a. Application: Game of Two Halves

Session 5, Tu March 28: Classification Models and Performance
READINGS: P&F, Chap 5
CASE READING: Framingham Heart Study
a. Classification models
b. Training & Validation
Deliverable due by class time: Analysis 2 (Game of Two Halves)
Laptops helpful

Session 6, Th March 30: Classification Model Performance
READINGS: (LATTE) O’Donnell, CJ, and Elosua, R. “Cardiac Risk Factors”
CASE READING: (continue Framingham)
a. Confusion Matrix to assess model performance
b. ROC curves

Session 7, Tu April 4: Clustering for Segmentation
READINGS: P&F, Chap 6
“Cluster Analysis for Segmentation”
Recommended: Hastie & Tibshirani (parts of 13 & 14; LATTE)
a. Project 1 Debriefing
b. Introduction to Clustering methods

Session 8, Th April 6: More about Clustering Methods and Model Performance
CASE READING: TBA
a. Comparing methods
b. Selecting a model

April 10-18: Passover/Spring Break, no classes
Deliverable: Project 1 (Wine Classification)

Session 9, Th April 20: Association Rules
READINGS: P&F, Chaps 7, 8
Unsupervised Data Mining: Association Rules/Market Basket Analysis

Session 10, Tu April 25: Text Mining
READINGS: P&F, Chap 10
CASE READING: “How a Math Genius Hacked OK Cupid…”
a. Text Mining basics
b. Word clouds in R

Session 11, Th April 27: From Word Clouds to Understanding
READINGS: tba
CASE READING: “How a Math Genius Hacked OK Cupid…”
• Debrief Text Mining assignment
• Sentiment analysis
Deliverable due by class time: Analysis 3 (OK Cupid explorations)

Session 12, Tu May 2: Review, Summary & Project Time
READINGS: P&F, Chaps 11 & 12
Project 2 instructions
• More on the Data Analytic Mindset
• Other application areas and challenges
• Developing models with Business Value
Deliverable due by class time: Brief project-2 discussion

Thursday May 11: Project 2 deadline
• Final project due before this date
• Graduating students are encouraged to submit early
Deliverable: Project 2

Brief Description of Assignments (complete assignment details to be distributed in class):
Analysis 1: Introduction to Modeling with R and R Studio
Analysis 2: Build a model to support In-Game Betting in Football (soccer)
Analysis 3: OK Cupid exploratory analysis
Project 1: Classifying Wines
Project 2: OKCupid sentiment analysis
Supplementary Readings and Cases (chronologically during course):
Those in bold-face are in the Harvard Business Publishing on-line course.
Russom P., (2011) “Big Data Analytics”, TDWI Best Practices Report
Watson, H. (2013) “All about Analytics” International Journal of Business Intelligence Research,
January-March, Vol. 4, No. 1.
Leek, J. and Peng, R. (2015) “What is the Question?” Sciencexpress. Published online 26
February: 10.1126/science.aaa6146.
Loh, Wei-Yin (2011) “Classification and Regression Trees” WIREs Data Mining and Knowledge
Discovery, Wiley.
Kumar, U., Sandeep, V. and Satyabala (2013) “A Game of Two Halves: In-Play Betting in
Football” (IMB-401). Indian Institute of Management–Bangalore.
O’Donnell, CJ, and Elosua, R. (2008) “Cardiovascular Risk Factors. Insights From Framingham
Heart Study.” Rev Esp Cardiol. 2008;61(3):299-310.
Venkatesan, Rajkumar (2014). “Cluster Analysis for Segmentation” (UV0745-PDF-ENG).
Darden School of Business.
Poulsen, Kevin (2014). “How a math genius hacked OKCUPID to find true love” (online):
https://www.wired.com/2014/01/how-to-hack-okcupid/
Rev. 02/2016