Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The Data Science Process Polong Lin Big Data University Leader & Data Scientist IBM [email protected] “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” 2 Datascience Theinterestindatascience • Solveproblemsandanswerquestionsusingdata • Goaltoimprovefutureoutcomes Whatisthedatascienceprocess? 3 CRISP-DMMethodologydiagram Business Understanding Analytic Approach Data Requirements Feedback Data Collection Deployment Data Understanding Evaluation Modeling Data Preparation CrossIndustryStandardProcessforDataMining 4 1.Businessunderstanding Business Understanding Everyprojectbeginswithbusinessunderstanding. • • • • Projectobjective? Businesssponsorsplaythemostcriticalrole Whatarewetryingtodo– whatisthegoal? Howdoyoudefine“success”andhowcanyoumeasureit? 5 1.Businessunderstanding Business Understanding Traffic: Problem:Trafficcongestionwastestimeandmoney Clearquestion:Howcanweoptimizetrafficlightdurationusingdataontraffic patterns,weather,andpedestriantraffic? Measurableoutcomes: - %decreaseincommutetime - %decreaseinlength/durationoftrafficjams 6 2.AnalyticApproach Business Understanding Analytic Approach • Expressproblemincontextofstatisticalandmachinelearningtechniques • Regression: • “Predictingrevenueinthenextquarter?” • Classification: • “DoesthispatienthavecancerA,cancerB,oraretheyhealthy?” • Clustering: • “Aretheregroupsofusersthatseemtobehavesimilarlytoeachother?” • Recommendation/Personalization: • “HowcanItargetdiscountstospecificcustomers?” • OutlierDetection 7 Statistical/machinelearningtechnique(s) • Linearregression • Logisticregression • Clustering • • • K-means Hierarchical Density-based • ClassificationTrees • RandomForests • Neuralnetworks • Textmining(natural languageprocessing) • Principalcomponent analysis • SupportVectorMachines • HiddenMarkovModels • … 8 Datacompilation • Thechosenanalyticapproachdeterminesthe datarequirements. • Content,formats,representations Data Requirements • Initialdatacollection isperformed. • AvailableData? Data Collection • Obtaindata? • Revisedatarequirementsorcollectmoredata? • Thendataunderstanding isgained. Data Understanding • Initialinsightsaboutdata • Descriptivestatisticsandvisualization • Additionaldatacollectiontofillgaps,ifneeded 9 #1Whatcanyoutellmeaboutthisdata? 10 #2Whatcanyoutellmeaboutthisdata? 11 #3Whatcanyoutellmeaboutthisdata? 12 #4Whatcanyoutellmeaboutthisdata? 13 ImportanceofVisualization Sameproperties: mean(x)=9 mean(y)=7.5 y=3.00+0.500x corr(x,y)=0.816 Anscombe's Quartet 14 CRISP-DMMethodologydiagram Business Understanding Analytic Approach Data Requirements Feedback Data Collection Deployment Data Understanding Evaluation Modeling Data Preparation 15 Datapreparation • Datapreparation encompassesallactivitiestoconstructandcleanthedataset. • Datacleaning • Arguablythemosttime-consumingstep • Missingorinvalidvalues • “80%oftheentireDSprocessisin • Eliminatingduplicaterows datacleaningandpreparation” • Formattingproperly • • • • Combiningmultipledatasources Transformingdata Featureengineering Textanalysis • Acceleratedatapreparationby automatingcommonsteps Data Preparation 16 Modeling • Modeling: • Developingpredictiveordescriptivemodels • Maytryusingmultiplealgorithms • Highlyiterativeprocess Evaluation Modeling Data Preparation 17 Example:Clustering K-means Clustering Groupsimilarcuisinestogether intok numberofclusters. Example:Clustering k =3 K-means Clustering Groupsimilarcuisinestogether intok numberofclusters. Example:Clustering [Age:18,Sex:M,BMI:23,Exercise:Frequent,Hobbies:Golf,…] CLUSTERA [Age:45,Sex:F,BMI:28,Exercise:Frequent,Hobbies:Baseball,…] CLUSTERB [Age:83,Sex:F,BMI:25,Exercise:Sedentary,Hobbies:Gymnastics,…] CLUSTERC [Age:28,Sex:M,BMI:23,Exercise:Normal,Hobbies:Softball,…] CLUSTERB [Age:30,Sex:F,BMI:25,Exercise:Normal,Hobbies:Golf,…] CLUSTERA [Age:15,Sex:M,BMI:22,Exercise:Frequent,Hobbies:Golf,…] CLUSTERA Model 20 Example:Classification [Age:32,Sex:M,BMI:23,Exercise:Frequent,… ,Condition:Disorder1] [Age:45,Sex:F,BMI:28,Exercise:Frequent,… ,Condition:Healthy] [Age:63,Sex:F,BMI:21,Exercise:Sedentary,… ,Condition:Disorder2] Model Disorder1 [Age:48,Sex:M,BMI:23,Exercise:Sedentary,… ,Condition:________] 21 Modelevaluation • Modelevaluation isperformedduringmodeldevelopmentandbefore modeldeployment. • Understandthemodel’squality • Ensurethatitproperlyaddressesthebusinessproblem • Diagnosticmeasures • Suitabletothemodelingtechniqueused • Training/Testingset • Refinemodelasneeded • Statisticalsignificancetests Evaluation Modeling 22 Deploymentandfeedback • Oncefinalized,themodelisdeployed intoaproductionenvironment. • Maystartinalimited/testenvironment BigDataUniversity: • Involvesotherroles: • Inactive->Active • Solutionowner • Marketing Feedback • Applicationdevelopers • ITadministration Deployment • Getting Feedback: • Howwelldidthemodelperform? • Iterativeprocessformodelrefinementandredeployment • A/Btesting 23 CRISP-DMMethodologydiagram Business Understanding Analytic Approach Data Requirements Feedback Data Collection Deployment Data Understanding Evaluation Modeling Data Preparation 24 “All models are wrong but some are useful” – George Box, Statistician 25 Variable#503 Variable#503 Variable#503 26 Variable#503 Variable#503 Variable#503 27 Variable#503 Variable#503 Variable#503 28 29 LearningMoreAboutDataScience Wherecanyoulearnmoreaboutdatascience? 30 31 32 33 BigDataUniversity.com Freecourses! • DataScience • BigData • DataEngineering Earnbadges! Learnanytime! Foryourorganizations • Wecancreatededicatedportals foryouremployeestogainskillsin datascience 34