CAP4770/5771
Introduction to Data Science
Fall 2016
University of Florida, CISE Department
Dr. Daisy Zhe Wang
(Big) Data Science
Data science is the study of the generalizable extraction of
knowledge from data. It incorporates varying components
and builds on techniques and theories from many fields with
the goal of extracting meaning from data and creating data
products. [Wikipedia]
By 2018, the U.S. will face a shortage of 1.5 million analysts
with Big Data know-how. [McKinsey]
Peter Norvig, DJ Patil, Michael Franklin, O’Reilly books and
summit, Kaggle …
• Data -> Knowledge -> Action
• Extraction -> Learning -> Inference
Some recent ML competitions at https://www.kaggle.com/
Data Science – A Definition
Data Science is the science that uses
computer science, statistics, machine
learning, visualization and human-computer interaction to collect, clean,
integrate, analyze, visualize and interact with
data in order to create data products.
Goal of Data Science
Turn data into data products.
Data Science – A Visual Definition
Course Goals
• Teach state-of-the-art tools to do Data Science
• Data Collection, Storage, Manipulation, Querying and
Processing
• Data Analytics and Modeling
• Big Data and Parallel Processing
• Data Driven Prediction and Decision Making
• Data Science Applications
Vital Information
• Instructor: Prof. Daisy Zhe Wang
• Office: E456
• Class time: Mon/Wed/Fri 3-3:50pm (8th period)
• Office hours: Mon/Wed 4-5pm or by
appointment in E456/457
• TAs: Xiaofeng Zhou, Dihong Gong (office hours:
TBA)
• The course page and syllabus are already up; take
a look at the tentative course schedule:
https://ufl.instructure.com/courses/331578
(read announcements frequently!)
Course Formats
• Lectures
– 25 Lectures on 13 topics
• In-class Labs and Homework
– 1 bootcamp on Python basics (a.k.a. Lab 0)
– 7 Labs
• In-class Midterm (10/10)
– Review (10/7)
• Final Project (NIST DSE 2016)
– Traffic Domain: Maps, Sensors and Events
– Data Cleaning and Prediction
This Course will Teach the following techniques
through Lectures and Hands-on Labs
• Using Unix commands and Python scripts to perform
data collection and preparation.
• Using Python pandas for analytics on tabular
data.
• Using scikit-learn and NLTK to perform statistical
modeling/machine learning for structured and
unstructured data analysis.
• Using Map-Reduce to process data at scale over AWS
cloud services.
• Using visualizations, presentations and write-ups to tell
your data story.
This Course will NOT
• Teach basic programming (e.g.,
Java/Python) or data structures
• Teach some of the advanced topics in Data
Science (e.g., probabilistic graphical models)
• Attempt to improve existing Data Science
systems and algorithms
Other Data Science courses @ UF
• First of the three-course series in the Data Science
curriculum, followed by
• Projects in Data Science (CAP4773/CAP6779)
• Consider applying skills learned in this class to solve a
real-life/research problem/application
• Advanced Topics in Data Science (CAP 6769)
• Research Oriented: paper reading, presentation,
research projects
Pre-requisites
• Require
– Data Structures and Algorithms (COP3530)
– Or equivalent
• Prefer
– Information and Database Systems I (CIS4301)
– Statistics and Probabilities (STA 5325/5328)
– Programming experience with Java, SQL, R, Python
• Academic honesty
Course Outline
• Programming Tools: An introduction to the basic data science
techniques including programming in Python, SQL/SPARQL and
Map-Reduce for small and big data manipulation and analytics.
• Basic Components in Data Science Pipelines: Teach basic
techniques for data collection, data preparation, data querying, data
analytics including pattern mining, classification, clustering, data
visualization, and parallel computing platforms.
• Algorithmic and Data Analytics Tools: Teach advanced data
analytics techniques including NLP, knowledge extraction, graph
analytics, graph querying, knowledge bases and crowd sourcing.
• Data Science Applications: Introduce key application areas of
data science including business intelligence, social media,
biomedicine, and e-discovery.
• More details: https://ufl.instructure.com/courses/331578
Suggested Reading I
• Data Science from Scratch: First Principles with
Python, Joel Grus, O’Reilly Media Inc.,
https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X
• Python for Data Analysis, Wes McKinney, O’Reilly
Media Inc., https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793
• Mining of Massive Datasets, A. Rajaraman and J.D.
Ullman, Cambridge University Press, 2011. ISBN-10:
1107015359, ISBN-13: 978-1107015357,
http://www.mmds.org/ (public online access)
Suggested Readings II
• Natural Language Processing with Python
(NLTK Book):
http://www.nltk.org/book/ (public online access)
• Learning scikit-learn: Machine Learning in
Python, https://www.amazon.com/Learning-scikit-learn-Machine-Python/dp/1783281936
Further Reading:
• Doing Data Science, Cathy O'Neil and Rachel
Schutt, O’Reilly Media Inc.,
http://proquest.safaribooksonline.com/9781449363871
Textbooks and Software Required
• None – refer to recommended reading online and
class materials including lecture notes, labs,
homework
• We will be using Amazon Web Services (AWS) and
software that runs on top of AWS. Each student will
need to register an AWS free-tier account at
https://aws.amazon.com/free/ and a student account
with $40 in credits at
https://aws.amazon.com/education/awseducate/
• We may also have access to Google Cloud and
its credits
Attendance and Expectations
• We require class attendance and participation, since
most of the class material will be delivered in class.
• Moreover, in-class lab assignments will be conducted via Canvas to
test understanding of the material.
• Personal laptops are required in class for lab
participation and roll-call.
• Please submit your labs/homework/projects on time. Late submissions
incur a 20% deduction in your grade for that
lab/homework/project for each late day.
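The late policy above can be sketched as a small calculation. This is only an illustration under one stated assumption: the slide does not say whether the 20% applies to the earned score or to the maximum score, so the hypothetical helper `apply_late_penalty` below deducts 20% of the earned score per late day and never goes below zero. Check the syllabus for the actual rule.

```python
def apply_late_penalty(score: float, days_late: int, rate: float = 0.20) -> float:
    """Return the score after a per-day late deduction.

    Assumption (not stated on the slide): each late day removes `rate`
    of the originally earned score, and the result is floored at zero.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    remaining_fraction = max(0.0, 1.0 - rate * days_late)
    return score * remaining_fraction


if __name__ == "__main__":
    # A lab that earned 90 points and was submitted 2 days late.
    print(apply_late_penalty(90, 2))
```

Under this reading, a submission that is five or more days late receives no credit.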
Course Evaluation
How can I get an A?
• 1 In-class Midterm (20%)
• ~7 In-class Exercises (10%)
• 7 Labs/Homework (35%)
• Final Project (30%)
• In-class Attendance (5%)
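As a worked example of how these weights combine, here is a minimal sketch that computes an overall course percentage. The weights are the ones listed above; the dictionary keys and the function name `course_score` are illustrative, and each component is assumed to be expressed as a percentage from 0 to 100.

```python
# Weights taken from the Course Evaluation slide; they sum to 1.0.
WEIGHTS = {
    "midterm": 0.20,
    "in_class_exercises": 0.10,
    "labs_homework": 0.35,
    "final_project": 0.30,
    "attendance": 0.05,
}


def course_score(component_scores: dict) -> float:
    """Weighted average of component percentages (each assumed to be 0-100)."""
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "midterm": 80,
        "in_class_exercises": 90,
        "labs_homework": 95,
        "final_project": 85,
        "attendance": 100,
    }
    # 16 + 9 + 33.25 + 25.5 + 5
    print(round(course_score(example), 2))
```

Labs/homework and the final project together carry 65% of the grade, so they dominate the result.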
Computing Resources
• Amazon Web Services
– You need a credit card to create an AWS account
– $40-$100 in AWS credits per student will be provided
– Will be used for Lab/Homework 5, 7 & Final
Project
– Should be enough to complete the projects
– We may also have access to Google Cloud and
its credits
– Usage beyond the credit limit is at your own cost
• Please create an AWS Educate Student
account today
Individual Lab 0-5
• Lab 0 (Aug 26): Python Basics
• Lab 1 (Sep 2): Unix Data Preparation – 5%
• Lab 2 (Sep 12): Python/Pandas/Matplotlib
Exploratory Data Analysis – 5%
• Lab 3 (Sep 19): Python/Pandas Tabular data
Manipulation – 5%
• Lab 4 (Sep 26): Python/Scikit Machine
Learning – 5%
• Lab 5 (Oct 3): Map-Reduce Parallel Data
Processing – 5%
Lab 5
• Get started with AWS
• Finish AWS tutorials on AWS account and S3 setup,
create and run a job flow, command line tools, AWS
instance types and pricing, EMR, debugging, etc.
• Implementation of a well-defined algorithm over a given
dataset using Map-Reduce on AWS
• Evaluation: correctness, performance and selected code
review
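To give a feel for the Map-Reduce programming model before the lab, here is a minimal local sketch in plain Python. It is not the Lab 5 assignment and does not use AWS or EMR: word count is chosen only as a stand-in algorithm, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially on a list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "data science"]))
```

On a real cluster the map and reduce functions run in parallel on different machines and the framework performs the shuffle; here everything runs in one process so the data flow is easy to follow.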
Individual Lab 6 and onward
• Lab 6 (Oct 17): Python/NLTK Text Analytics –
5%
• Lab 7 (Oct 24): Map-Reduce Graph Analytics
– 5%
• 1 midterm (Oct 10), 1 review lecture (Oct 7)
– 20%
• Final project Oct 5 – Dec 7 (30%) with final
report due during exam period
Group Final Project (30%)
• Work in groups of 2 on 2
main tasks: data cleaning and prediction
• Datasets are provided with guidelines for
analytics
• Evaluation: result quality,
runtime efficiency, and a write-up on data
processing, analytics techniques applied
and data product results
In-class Lab assignment (10%)
• In the form of timed question
answering via Canvas
• Personal laptop needed to take in-class
labs/quizzes
• Grade based on correctness of answers
Grading
Roughly the boundaries will be:
• 90 -- 100 A
• 86 -- 89 B+
• 80 -- 85 B
• 76 -- 79 C+
• 70 -- 75 C
• 66 -- 69 D+
• 60 -- 65 D
• 0 -- 59 E
The boundaries for A-, B- and C- are specified in the
syllabus.
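The rough boundaries above can be read as a simple lookup. The sketch below is illustrative only: it covers just the letters listed on this slide (the minus grades are in the syllabus and are omitted here), and it assumes each listed lower bound is inclusive, so a fractional score such as 89.5 falls into the range below it. The name `letter_grade` is hypothetical.

```python
# (lower bound, letter) pairs from the Grading slide, highest first.
BOUNDARIES = [
    (90, "A"),
    (86, "B+"),
    (80, "B"),
    (76, "C+"),
    (70, "C"),
    (66, "D+"),
    (60, "D"),
    (0, "E"),
]


def letter_grade(score: float) -> str:
    """Map a 0-100 course score to a letter using inclusive lower bounds.

    Assumption: minus grades are ignored because their boundaries are
    only given in the syllabus, not on this slide.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower, letter in BOUNDARIES:
        if score >= lower:
            return letter
    raise AssertionError("unreachable: the last boundary is 0")


if __name__ == "__main__":
    for s in (95, 89, 85, 72, 59):
        print(s, letter_grade(s))
```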
Questions?
Next Lecture – Overview of
Data Science
Reading due before next class:
Chapter 1 DSS
If time permits: A taste of Data
Science Project
NIST Data Science 2015 Fall
Pre-Pilot Evaluation Plan
• National Institute of Standards and Technology
(NIST) http://www.nist.gov/
– The institute's official mission is to:
Promote U.S. innovation and industrial
competitiveness by advancing measurement science,
standards, and technology in ways that enhance
economic security and improve our quality of life.
• UF DSR has registered for all participants
through this class
– http://www.nist.gov/itl/iad/mig/dseval.cfm