Download The Data Science Process

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
The Data Science Process
Polong Lin
Big Data University Leader & Data Scientist
IBM
[email protected]
“Every day,
we create 2.5 quintillion bytes of data —
so much that 90% of the data in the world today
has been created in the last two years alone.”
2
Datascience
Theinterestindatascience
• Solveproblemsandanswerquestionsusingdata
• Goaltoimprovefutureoutcomes
Whatisthedatascienceprocess?
3
CRISP-DMMethodologydiagram
Business
Understanding
Analytic
Approach
Data
Requirements
Feedback
Data Collection
Deployment
Data
Understanding
Evaluation
Modeling
Data
Preparation
CrossIndustryStandardProcessforDataMining
4
1.Businessunderstanding
Business
Understanding
Everyprojectbeginswithbusinessunderstanding.
•
•
•
•
Projectobjective?
Businesssponsorsplaythemostcriticalrole
Whatarewetryingtodo– whatisthegoal?
Howdoyoudefine“success”andhowcanyoumeasureit?
5
1.Businessunderstanding
Business
Understanding
Traffic:
Problem:Trafficcongestionwastestimeandmoney
Clearquestion:Howcanweoptimizetrafficlightdurationusingdataontraffic
patterns,weather,andpedestriantraffic?
Measurableoutcomes:
- %decreaseincommutetime
- %decreaseinlength/durationoftrafficjams
6
2.AnalyticApproach
Business
Understanding
Analytic
Approach
• Expressproblemincontextofstatisticalandmachinelearningtechniques
• Regression:
• “Predictingrevenueinthenextquarter?”
• Classification:
• “DoesthispatienthavecancerA,cancerB,oraretheyhealthy?”
• Clustering:
• “Aretheregroupsofusersthatseemtobehavesimilarlytoeachother?”
• Recommendation/Personalization:
• “HowcanItargetdiscountstospecificcustomers?”
• OutlierDetection
7
Statistical/machinelearningtechnique(s)
• Linearregression
• Logisticregression
• Clustering
•
•
•
K-means
Hierarchical
Density-based
• ClassificationTrees
• RandomForests
• Neuralnetworks
• Textmining(natural
languageprocessing)
• Principalcomponent
analysis
• SupportVectorMachines
• HiddenMarkovModels
• …
8
Datacompilation
• Thechosenanalyticapproachdeterminesthe
datarequirements.
• Content,formats,representations
Data
Requirements
• Initialdatacollection isperformed.
• AvailableData?
Data Collection
• Obtaindata?
• Revisedatarequirementsorcollectmoredata?
• Thendataunderstanding isgained.
Data
Understanding
• Initialinsightsaboutdata
• Descriptivestatisticsandvisualization
• Additionaldatacollectiontofillgaps,ifneeded
9
#1Whatcanyoutellmeaboutthisdata?
10
#2Whatcanyoutellmeaboutthisdata?
11
#3Whatcanyoutellmeaboutthisdata?
12
#4Whatcanyoutellmeaboutthisdata?
13
ImportanceofVisualization
Sameproperties:
mean(x)=9
mean(y)=7.5
y=3.00+0.500x
corr(x,y)=0.816
Anscombe's Quartet
14
CRISP-DMMethodologydiagram
Business
Understanding
Analytic
Approach
Data
Requirements
Feedback
Data Collection
Deployment
Data
Understanding
Evaluation
Modeling
Data
Preparation
15
Datapreparation
• Datapreparation encompassesallactivitiestoconstructandcleanthedataset.
• Datacleaning
• Arguablythemosttime-consumingstep
• Missingorinvalidvalues
• “80%oftheentireDSprocessisin
• Eliminatingduplicaterows
datacleaningandpreparation”
• Formattingproperly
•
•
•
•
Combiningmultipledatasources
Transformingdata
Featureengineering
Textanalysis
• Acceleratedatapreparationby
automatingcommonsteps
Data
Preparation
16
Modeling
• Modeling:
• Developingpredictiveordescriptivemodels
• Maytryusingmultiplealgorithms
• Highlyiterativeprocess
Evaluation
Modeling
Data
Preparation
17
Example:Clustering
K-means
Clustering
Groupsimilarcuisinestogether
intok numberofclusters.
Example:Clustering
k =3
K-means
Clustering
Groupsimilarcuisinestogether
intok numberofclusters.
Example:Clustering
[Age:18,Sex:M,BMI:23,Exercise:Frequent,Hobbies:Golf,…]
CLUSTERA
[Age:45,Sex:F,BMI:28,Exercise:Frequent,Hobbies:Baseball,…]
CLUSTERB
[Age:83,Sex:F,BMI:25,Exercise:Sedentary,Hobbies:Gymnastics,…] CLUSTERC
[Age:28,Sex:M,BMI:23,Exercise:Normal,Hobbies:Softball,…]
CLUSTERB
[Age:30,Sex:F,BMI:25,Exercise:Normal,Hobbies:Golf,…]
CLUSTERA
[Age:15,Sex:M,BMI:22,Exercise:Frequent,Hobbies:Golf,…]
CLUSTERA
Model
20
Example:Classification
[Age:32,Sex:M,BMI:23,Exercise:Frequent,… ,Condition:Disorder1]
[Age:45,Sex:F,BMI:28,Exercise:Frequent,… ,Condition:Healthy]
[Age:63,Sex:F,BMI:21,Exercise:Sedentary,… ,Condition:Disorder2]
Model
Disorder1
[Age:48,Sex:M,BMI:23,Exercise:Sedentary,… ,Condition:________]
21
Modelevaluation
• Modelevaluation isperformedduringmodeldevelopmentandbefore
modeldeployment.
• Understandthemodel’squality
• Ensurethatitproperlyaddressesthebusinessproblem
• Diagnosticmeasures
• Suitabletothemodelingtechniqueused
• Training/Testingset
• Refinemodelasneeded
• Statisticalsignificancetests
Evaluation
Modeling
22
Deploymentandfeedback
• Oncefinalized,themodelisdeployed intoaproductionenvironment.
• Maystartinalimited/testenvironment
BigDataUniversity:
• Involvesotherroles:
• Inactive->Active
• Solutionowner
• Marketing
Feedback
• Applicationdevelopers
• ITadministration
Deployment
• Getting Feedback:
• Howwelldidthemodelperform?
• Iterativeprocessformodelrefinementandredeployment
• A/Btesting
23
CRISP-DMMethodologydiagram
Business
Understanding
Analytic
Approach
Data
Requirements
Feedback
Data Collection
Deployment
Data
Understanding
Evaluation
Modeling
Data
Preparation
24
“All models are wrong but some are useful” – George Box, Statistician
25
Variable#503
Variable#503
Variable#503
26
Variable#503
Variable#503
Variable#503
27
Variable#503
Variable#503
Variable#503
28
29
LearningMoreAboutDataScience
Wherecanyoulearnmoreaboutdatascience?
30
31
32
33
BigDataUniversity.com
Freecourses!
• DataScience
• BigData
• DataEngineering
Earnbadges!
Learnanytime!
Foryourorganizations
• Wecancreatededicatedportals
foryouremployeestogainskillsin
datascience
34