March 15, 2017
Introduction to Data Science: Lecture 1
Dr. Amitai Armon
Administrative Details
• Course lecturer: Prof. Tova Milo
[email protected]
• Course teaching assistant: Slava Novgorodov
[email protected]
• Grade structure:
• 30% Exercises
• 70% Final Exam
• Course website: http://slavanov.com/teaching/ds1617b/
Course Topics
• This course will provide a practical introduction to machine learning and big data
• Main topics of the classes:
• Introduction to Machine Learning
• Data understanding and Data Preparation
• Feature Selection and Model Evaluation
• Supervised Modeling
• Unsupervised Modeling
• Deep Learning
• Introduction to Big Data
• Spark
• NoSQL databases
• Spark Streaming
Exercises
• There will be four exercises during the course
• The last exercise will be bigger
• Exercises will be in Python
• Submission is in pairs
• See the course website:
  http://slavanov.com/teaching/ds1617b/
Administrative Details
Questions?
Intel Advanced Analytics: A Little About Us
OUR MISSION
• Use data science for upgrading Intel’s operations
• Help Intel win the data-science market
[Diagram: Operational Excellence, Design, Manufacturing, Marketing & Sales, Technology Breakthrough, Deep Learning, Products, Health wearables platform]
Intel Advanced Analytics: A Little About Us
CONTRIBUTIONS TO THE DATA SCIENCE COMMUNITY
• Academic collaborations
• Industry collaborations
• Helping Intel VC investments
• Data Science Summit conferences in San Francisco and Jerusalem
Machine Learning is Everywhere…
• Handwriting Recognition
• Speech Recognition
• Automatic translation
• Credit-card fraud detection
• Image Classification
• Social Networks Analysis (community detection)
• Movie / product / article recommendations
• Autonomous cars
• …
Winning in Jeopardy
Winning Against Go Champion
Answering Visual Questions
Kan et al., 2015
Dialogue (“Turing Test”)
Google chatbot, 2015
What is Machine Learning?
Wikipedia:
Machine Learning, a branch of artificial intelligence, concerns
the construction and study of systems that can learn from data.
Tom Mitchell (1998): A computer program is said to learn from experience, with respect to some task and some performance measure, if its performance, as measured by that performance measure, improves with experience.
“Child Learning”
Action                      Reaction              Lesson
Touching hot stove          Aching hand           Do not touch again
Playing with toys           Fun                   Continue playing
Running into the road       Screaming parent      Don’t run into the road
Running in the house        Fun                   Run in the house
Eating chocolate            Fun                   Search for chocolate
Eating too much chocolate   Stomach ache          Don’t eat too much
Saying “bla bla”            No reaction           Try variations
Saying “daddy”              Overexcited parents   Do that again
Learning from Examples
What is “Dangerous”?
Learning from Examples
So are these items dangerous or not?
It is important to have enough diverse examples, not all of the same type
Typical Machine Learning Tasks
No two machine learning tasks are identical, but there are still common prototypes:
• Supervised Learning
– Learning from labeled examples (for which the answer is known)
• Unsupervised Learning
– Learning from unlabeled examples (for which the answer is unknown)
• Semi-supervised Learning
– Learning from both labeled and unlabeled examples
• Active Learning
– Learning while interactively querying for labels of examples
• Reinforcement Learning
– Learning by trial and feedback, like the “child learning” example
Typical Machine Learning Tasks
Supervised Learning
Estimate an unknown result, given explicit values of some
explaining variables (“features”).
Estimate it based on a set of observations for which both the
result and the explaining variables are known (“training set”).
This may be prediction (“it’s difficult to give
forecasts, especially about the future”) or
estimation.
Typical Machine Learning Tasks
Supervised Learning
Example 1: What will be the annual spend of my clients?
• The unknown result: the annual spend (this is a prediction)
• Explaining variables (“features”): Client’s details (e.g.,
domain, size, purchase history)
• Training set: The annual spend in past years, with respect to
the client’s data available so far (at the beginning of that year)
Typical Machine Learning Tasks
Supervised Learning
Example 2: What is the activity currently performed by a Parkinson’s
patient?
• The unknown result: the activity (this is not a prediction – the fact
exists, we simply don’t know it)
• Explaining variables: Various features that are extracted from
sensory data on the patient’s body (accelerometers, gyro, compass)
• Training set: features and the corresponding activity labeling (we
must have a labeled training set)
Typical Machine Learning Tasks
Supervised Learning
Two main tasks are considered in Supervised Learning:
• Regression: the unknown result is a numerical value (e.g.,
annual spend)
• Classification: the unknown result is a class relation (e.g., the
activity)
Regression and classification have different objective measures,
and often different algorithms.
Typical Machine Learning Tasks
Unsupervised Learning
Given explicit values of some variables (pre-defined set), extract
interesting patterns that appear in the data, or provide an
insightful representation of the data’s inherent distribution.
Typical Machine Learning Tasks
Unsupervised Learning
Example: Market Segmentation
• Input data: Clients information
• Objective: Identify what types of clients there are
– This objective is known as ‘Clustering’ or ‘Cluster Analysis‘
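As a rough illustration of cluster analysis (not part of the slides), here is a minimal sketch using scikit-learn's KMeans on made-up client data; the feature choice and the number of segments are assumptions.

```python
# Minimal market-segmentation sketch with k-means (data and settings are illustrative).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up client features: [annual_spend, number_of_orders]
clients = np.array([
    [1200, 10], [1500, 12], [300, 2], [250, 3],
    [5000, 40], [5200, 45], [400, 4], [1400, 11],
])

# Scale the features so both contribute comparably to the distance
X = StandardScaler().fit_transform(clients)

# Group the clients into 3 segments (the number of segments is a choice)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("segment of each client:", kmeans.labels_)
```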
Typical Machine Learning Tasks
Reinforcement Learning
Reinforcement Learning is learning how to best react to
situations through trial and error.
In some sense reinforcement learning is the first way of learning
we think of.
Example: TD Gammon
A Few Supervised Learning Approaches
Supervised Learning
X1       X2       X3       …   Xn-2       Xn-1       Xn       Y
x1,1     x2,1     x3,1     …   xn-2,1     xn-1,1     xn,1     y1
x1,2     x2,2     x3,2     …   xn-2,2     xn-1,2     xn,2     y2
…        …        …        …   …          …          …        …
x1,m-1   x2,m-1   x3,m-1   …   xn-2,m-1   xn-1,m-1   xn,m-1   ym-1
x1,m     x2,m     x3,m     …   xn-2,m     xn-1,m     xn,m     ym
• Uses a set of labeled examples with known answer (“training set”)
• Success is evaluated on a separate set of examples (“test set”).
Various success criteria may be considered:
• For classification: Accuracy, Recall, Precision…
• For regression: MSE, RMSE,…
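To make the workflow above concrete, here is a minimal sketch using scikit-learn on synthetic data; the dataset, the models (logistic and linear regression), and the split ratio are assumptions for illustration only.

```python
# Train/test evaluation sketch: classification and regression metrics on held-out data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, mean_squared_error)

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                                      # 200 examples, 3 features
y_cls = (X[:, 0] + X[:, 1] > 1).astype(int)               # class labels
y_reg = 2 * X[:, 0] + 3 * X[:, 1] + 0.1 * rng.randn(200)  # numeric target

# Classification: accuracy, precision, recall on the test set
Xtr, Xte, ytr, yte = train_test_split(X, y_cls, test_size=0.3, random_state=0)
pred = LogisticRegression().fit(Xtr, ytr).predict(Xte)
print("accuracy:", accuracy_score(yte, pred),
      "precision:", precision_score(yte, pred),
      "recall:", recall_score(yte, pred))

# Regression: MSE and RMSE on the test set
Xtr, Xte, ytr, yte = train_test_split(X, y_reg, test_size=0.3, random_state=0)
pred = LinearRegression().fit(Xtr, ytr).predict(Xte)
mse = mean_squared_error(yte, pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse))
```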
Lazy Learner: k-Nearest Neighbors
Identifying spam emails
[Scatter plot: emails by Email Length and New Recipients, a new point classified with k = 3]
• What should k be?
• Which distance measure should be used?
• Computation
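A minimal k-NN sketch for the spam example, with the two plotted features (email length, new recipients); the data values, k = 3, and the Euclidean distance are assumptions for illustration.

```python
# k-Nearest Neighbors sketch for spam detection (data values are made up).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Features: [email_length, new_recipients]; labels: 1 = spam, 0 = not spam
X = np.array([[500, 20], [450, 18], [600, 25],   # spam-like emails
              [120,  1], [200,  2], [150,  0]])  # normal emails
y = np.array([1, 1, 1, 0, 0, 0])

# k = 3 and Euclidean distance: exactly the two choices the slide asks about
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
print(knn.predict([[400, 15]]))   # label of a new email from its 3 nearest neighbors
```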
Linear Classifiers
[Scatter plots with axes X1 and X2, showing several possible separating lines]
How would you classify this data?
Any of these would be fine...
...but which is best?
Maximum Margin
[Scatter plot: emails by Email Length and New Recipients, with a linear boundary]
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.
Maximum Margin
[Scatter plot: the same data with the maximum-margin boundary]
The maximum margin linear classifier is the linear classifier with the maximum margin. It is found by the SVM algorithm (Support Vector Machine).
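A minimal maximum-margin sketch with scikit-learn's linear SVM on the same kind of made-up spam data; the data and the large C value (to approximate a hard margin) are assumptions.

```python
# Linear SVM (maximum-margin classifier) sketch; data values are made up.
import numpy as np
from sklearn.svm import SVC

X = np.array([[500, 20], [450, 18], [600, 25],   # spam-like emails
              [120,  1], [200,  2], [150,  0]])  # normal emails
y = np.array([1, 1, 1, 0, 0, 0])

# A linear kernel with a large C approximates the hard maximum-margin classifier
svm = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vectors:", svm.support_vectors_)         # points that define the margin
print("new email prediction:", svm.predict([[300, 10]]))
```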
Decision tree
• A flow-chart-like tree structure
• Each internal node denotes a test on one of the features
• Each branch represents an outcome of the test
• Leaf nodes represent class labels
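A minimal decision-tree sketch on the same made-up spam data; printing the learned rules shows the internal-node tests, the branches, and the leaf class labels listed above.

```python
# Decision tree sketch (data values are made up for illustration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[500, 20], [450, 18], [600, 25],
              [120,  1], [200,  2], [150,  0]])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = spam, 0 = not spam

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Each internal node tests one feature, each branch is an outcome, each leaf is a class
print(export_text(tree, feature_names=["email_length", "new_recipients"]))
```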
DEEP NEURAL NETWORKS
Bengio, 2009
Block Diagram of a Supervised Learning System
[Diagram: the training set and a hypothesis space feed the learning algorithm, which outputs a hypothesis h; testing h on the test set counts the cases where h(x) ≠ ct(x), giving the estimated error εg(h)]
Evaluating What’s Been Learned
1. Test set
2. Cross Validation
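A minimal cross-validation sketch using scikit-learn; the 5-fold split, the model, and the synthetic data are assumptions for illustration.

```python
# 5-fold cross-validation sketch: train on 4/5 of the data, test on the rest, five times.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 2] > 1).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores, "mean:", scores.mean())
```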
Confusion Matrix
              Classified as
Actual        Blue   Red
Blue          7      1
Red           0      5
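The same matrix can be computed from label lists; here is a small sketch with scikit-learn, where the Blue/Red counts are chosen to reproduce the table above.

```python
# Reproducing the Blue/Red confusion matrix above from example label lists.
from sklearn.metrics import confusion_matrix

actual    = ["Blue"] * 8 + ["Red"] * 5                  # 8 blue items, 5 red items
predicted = ["Blue"] * 7 + ["Red"] * 1 + ["Red"] * 5    # one blue item classified as red

# Rows = actual class, columns = predicted class
print(confusion_matrix(actual, predicted, labels=["Blue", "Red"]))
# [[7 1]
#  [0 5]]
```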
Regression Learning Example
Overfitting and Underfitting
Overfitting: the model learns the training set too well – it overfits the training set and therefore cannot generalize to new instances.
Underfitting: the model is too simple – both training and test errors are large.
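A minimal sketch of underfitting versus overfitting on a regression example, comparing polynomial degrees on noisy synthetic data; the data, the degrees, and the 50/50 split are assumptions.

```python
# Underfitting vs. overfitting: compare train and test error across model complexity.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.rand(40, 1), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.randn(40)   # noisy target

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):   # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(Xtr, ytr)
    print(f"degree {degree:2d}:",
          f"train MSE = {mean_squared_error(ytr, model.predict(Xtr)):.3f}",
          f"test MSE = {mean_squared_error(yte, model.predict(Xte)):.3f}")
```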
CRISP-DM Methodology
• CRISP-DM stands for Cross Industry Standard Process for
Data Mining
– Conceived in 1996-7 by SPSS, Teradata, Daimler, NCR and OHRA
– IBM is the primary corporation that embraced and incorporated it in
its SPSS Modeler product
– CRISP-DM defines a methodology for ML/DM projects
CRISP-DM Methodology
CRISP-DM breaks the process of data mining into six major
phases
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
The sequence of the phases is not strict and moving back and
forth between different phases may be required
Summary
We briefly discussed today:
• What is Machine Learning
• Typical Machine Learning tasks
• Supervised Learning:
– Learning means Generalization
– Overfitting and Underfitting
– Simple learning paradigms
– Training vs. Testing
– Classification and Regression
• CRISP-DM
Introduction to Data Science
Questions?
Thank you!