A Data Mining Workshop: Excavating Knowledge from Data
Introducing Data Mining using R

Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R

[email protected]
Data Scientist, Australian Taxation Office
Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia
http://datamining.togaware.com
Workshop notes: http://onepager.togaware.com
Copyright © 2014, [email protected], http://togaware.com

Installing R and Rattle
The first task is to install R. As free/libre open source software (FLOSS or FOSS), R and Rattle are available to all, with no limitations on our freedom to use and share the software, except to share and share alike.
- Visit CRAN at http://cran.rstudio.com
- Visit Rattle at http://rattle.togaware.com
- Linux: install the packages (Ubuntu is recommended):
  $ wajig install r-recommended r-cran-rattle
- Windows: download and install from CRAN
- MacOSX: download and install from CRAN
Why a Workshop on R? Why do Data Science with R?
- The most widely used Data Mining and Machine Learning package
- Machine Learning, Statistics, and Software Engineering and Programming with Data
- But not the nicest of languages for a Computer Scientist!
- Free (Libre) Open Source Statistical Software, providing:
  - all modern statistical approaches
  - many/most machine learning algorithms
  - the opportunity to readily add new algorithms
- That is important for us in the research community: it gets our algorithms out there and being used. Impact!
How Popular is R?
- Discussion List Traffic: monthly email traffic on each software's main discussion list. [Figure. Source: http://r4stats.com/articles/popularity/]
- Discussion Topics: number of discussions on popular Q&A forums in 2013, and number of R/SAS related posts to Stack Overflow by week (R versus SAS). [Figure. Source: http://r4stats.com/articles/popularity/]
- Professional Forums: number registered for the main discussion group for each software. [Figure. Source: http://r4stats.com/articles/popularity/]
- User Survey: Rexer Analytics Survey 2010 results for data mining/analytic tools. [Figure. Source: http://r4stats.com/articles/popularity/]
- Used in Analytics Competitions: software used in data analysis competitions in 2011. [Figure. Source: http://r4stats.com/articles/popularity/]

R: The Video
A 90 second promo from Revolution Analytics:
http://www.revolutionanalytics.com/what-is-open-source-r/
An Introduction to Data Mining

Data Mining: Big Data and Big Business
Data Mining is the application of:
- Machine Learning
- Statistics
- Software Engineering and Programming with Data
- Effective Communications and Intuition
It is a data driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner. It is applied to datasets that vary by Volume, Velocity, Variety, Value, and Veracity:
- to discover new knowledge
- to improve business outcomes
- to deliver better tailored services

Data Mining in Research
- Health Research: adverse reactions, using linked Pharmaceutical, General Practitioner, Hospital, and Pathology datasets.
- Astronomy: microlensing events in the Large Magellanic Cloud, among several million observed stars (out of 10 billion).
- Psychology: investigation of age-of-onset for Alzheimer's disease from 75 variables for 800 people.
- Social Sciences: survey evaluation; social network analysis, identifying key influencers.

Data Mining in Government
- Australian Taxation Office: Lodgment ($110M), Tax Havens ($150M), Tax Fraud ($250M).
- Immigration and Border Control: check passengers before boarding.
- Health and Human Services: doctor shoppers; over-servicing.
The Business of Data Mining
- SAS has annual revenues of $3B (2013)
- IBM bought SPSS for $1.2B (2009)
- Analytics is a >$100B business, and >$320B by 2020
- Amazon, eBay/PayPal, Google, Facebook, LinkedIn, ...
- Shortage of 180,000 data scientists in the US in 2018 (McKinsey)

Basic Tools: Data Mining Algorithms (R packages in parentheses)
- Cluster Analysis (kmeans, wskm)
- Association Analysis (arules)
- Linear Discriminant Analysis (lda)
- Logistic Regression (glm)
- Decision Trees (rpart, wsrpart)
- Random Forests (randomForest, wsrf)
- Boosted Stumps (ada)
- Neural Networks (nnet)
- Support Vector Machines (kernlab)
- ...
That's a lot of tools to learn in R! Many with different interfaces and options.

The Rattle Package for Data Mining: A GUI for Data Mining

Why a GUI?
- Statistics can be complex, and traps await
- There are so many tools in R to deliver insights
- Effective analyses should be scripted; scripting is also required for repeatability
- R is a language for programming with data
- How to remember how to do all of this in R?
- How to skill up 150 data analysts with Data Mining?

Users of Rattle
Today, Rattle is used worldwide in many industries:
- Health analytics
- Customer segmentation and marketing
- Fraud detection
- Government
It is used by:
- Universities, to teach Data Mining
- Research projects, for basic analyses
- Consultants and analytics teams across business
It is and will remain freely available: CRAN and http://rattle.togaware.com
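As a brief illustration of why those many interfaces become a burden (a sketch, not from the workshop notes), compare the call conventions of two of the modelling tools listed above on the same task:

```r
library(rattle)  # provides the weather dataset
library(rpart)

# Logistic regression via glm(): the model family is an argument.
m1 <- glm(RainTomorrow ~ Humidity3pm + Pressure3pm,
          data = weather, family = binomial)

# Decision tree via rpart(): a different function, a different
# "method" argument, and different predict() behaviour later on.
m2 <- rpart(RainTomorrow ~ Humidity3pm + Pressure3pm,
            data = weather, method = "class")
```

Rattle's GUI hides these differences while its Log tab still records the underlying calls.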
Setting Things Up: Installation
Rattle is built using R.
- Download and install R from cran.r-project.org
- Recommended: also install RStudio from www.rstudio.org
- Then start up RStudio and install Rattle:
  install.packages("rattle")
- Then we can start up Rattle:
  rattle()
Required packages are loaded as needed.

A Tour Thru Rattle
- Startup
- Loading Data
- Explore Distribution
- Explore Correlations
- Hierarchical Cluster
- Decision Tree
- Decision Tree Plot
- Random Forest
- Risk Chart
[Figure: risk chart for a Random Forest on weather.csv [test], target RainTomorrow: Performance (%) against Caseload (%), with risk score, lift, and precision curves; RainTomorrow (92%), Rain in MM (97%).]
Moving Into R: Programming with Data
Data Scientists are Programmers of Data. But...
- A GUI can only do so much
- R is a powerful statistical language
Data Scientists desire:
- Scripting
- Transparency
- Repeatability
- Sharing
From GUI to CLI: Rattle's Log Tab.

R Tool Suite: The Power of Free/Libre and Open Source Software Tools
- Ubuntu GNU/Linux operating system: feature rich toolkit, up-to-date, easy to install, FLOSS
- RStudio: easy to use integrated development environment, FLOSS (a powerful alternative is Emacs Speaks Statistics, FLOSS)
- R Statistical Software Language: extensive, powerful, thousands of contributors, FLOSS
- KnitR and LaTeX: produce beautiful, easily reproducible documents, FLOSS

Using Ubuntu
- A desktop operating system (GNU/Linux), replacing Windows and OSX.
- The GNU tool suite is based on Unix, a significant heritage: multiple specialised single-task tools working well together, compared to a single application trying to do it all.
- Powerful data processing from the command line: grep, awk, head, tail, wc, sed, perl, python, most, diff, make, paste, join, patch, ...

RStudio: The Default Three Panels
For interacting with R, start up RStudio from the Dash.
RStudio: With an R Script File in the Editor Panel

Simple Plots: Scatterplot R Code
Our first little bit of R code. Load a couple of packages into the R library:

library(rattle)   # Provides the weather dataset
library(ggplot2)  # Provides the qplot() function

Then produce a quick plot using qplot():

ds <- weather
qplot(MinTemp, MaxTemp, data=ds)

Your turn: give it a go.

[Figure: scatterplot of MaxTemp against MinTemp for the weather dataset, as drawn in RStudio.]

Basic R Commands

library(rattle)  # Load the weather dataset.
head(weather)    # First 6 observations of the dataset.

##         Date Location MinTemp MaxTemp Rainfall Evapora...
## 1 2007-11-01 Canberra     8.0    24.3      0.0 ...
## 2 2007-11-02 Canberra    14.0    26.9      3.6 ...
## 3 2007-11-03 Canberra    13.7    23.4      3.6 ...
....
str(weather)  # Structure of the variables in the dataset.

## 'data.frame': 366 obs. of 24 variables:
## $ Date     : Date, format: "2007-11-01" "2007-11-...
## $ Location : Factor w/ 46 levels "Adelaide","Alba...
## $ MinTemp  : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...
....

Your turn: try them out.

RStudio: Keyboard Shortcuts
These will become very useful!
Editor:
- Ctrl-Enter will send the line of code to the R console
- Ctrl-2 will move the cursor to the Console
- Ctrl-1 will move the cursor to the Editor
Console:
- UpArrow will cycle through previous commands
- Ctrl-UpArrow will search previous commands
- Tab will complete function names and list the arguments

Installing Packages
Missing packages: Tools → Install Packages... (for example, installing ggplot2 in RStudio).

Basic R Commands

summary(weather)  # Univariate summary of the variables.

##  Date              Location          MinTemp       ...
##  Min.   :2007-11-01  Canberra     :366  Min.   :-5.30 ...
##  1st Qu.:2008-01-31  Adelaide     :  0  1st Qu.: 2.30 ...
##  Median :2008-05-01  Albany       :  0  Median : 7.45 ...
##  Mean   :2008-05-01  Albury       :  0  Mean   : 7.27 ...
##  3rd Qu.:2008-07-31  AliceSprings :  0  3rd Qu.:12.50 ...
##  Max.   :2008-10-31  BadgerysCreek:  0  Max.   :20.90 ...
##                      (Other)      :  0                ...
##  Rainfall         Evaporation     Sunshine       WindGust...
##  Min.   : 0.00    Min.   : 0.20   Min.   : 0.00  NW  : ...
##  1st Qu.: 0.00    1st Qu.: 2.20   1st Qu.: 5.95  NNW : ...
##  Median : 0.00    Median : 4.20   Median : 8.60  E   : ...
##  Mean   : 1.43    Mean   : 4.52   Mean   : 7.91  WNW : ...
##  3rd Qu.: 0.20    3rd Qu.: 6.40   3rd Qu.:10.50  ENE : ...
....

Visualising Data: Visual Summaries, Add a Little Colour

qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)
[Figure: scatterplot of Pressure3pm against Humidity3pm, coloured by RainTomorrow (No/Yes).]

Visual Summaries: Careful with Categorics

qplot(WindGustDir, Pressure3pm, data=ds)

Visual Summaries: Add a Little Jitter

qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")
[Figure: Pressure3pm by WindGustDir, plain and jittered, with categories N, NNE, NE, ENE, E, ESE, SE, SSE, S, SSW, SW, WSW, W, WNW, NW, NNW, NA on the x-axis.]

Visual Summaries: And Some Colour

qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")

Getting Help: Precede a Command with ?
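For example (assuming the ggplot2 package is already loaded, as above):

```r
?qplot                 # open the help page for qplot()
help("qplot")          # the equivalent long form
help.search("jitter")  # search installed help pages by keyword
```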
Loading, Cleaning, Exploring Data in Rattle
(Workshop overview: section 3 of 8.)
- Loading Data
- Exploring Data
- Transform Data
- Test Data

Descriptive Data Mining
(Workshop overview: section 4 of 8.)

Cluster Analysis: What is Cluster Analysis?
Major Clustering Approaches:
- Partitioning algorithms (kmeans, pam, clara, fanny): construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated, starting with an initial (perhaps random) clustering.
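The partitioning approach can be sketched with kmeans(); the choice of variables and of k = 3 here are illustrative assumptions, not part of the workshop notes:

```r
library(rattle)  # provides the weather dataset

# Cluster on two numeric variables, dropping missing values.
ds <- na.omit(weather[, c("MinTemp", "MaxTemp")])

set.seed(42)
model <- kmeans(ds, centers = 3)  # a fixed number of clusters, k = 3

model$centers         # the cluster centres
table(model$cluster)  # how many observations fell in each cluster
```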
Further clustering approaches:
- Hierarchical algorithms (hclust, agnes, diana): create a hierarchical decomposition of the set of observations using some criterion.
- Density-based algorithms: based on connectivity and density functions.
- Grid-based algorithms: based on a multiple-level granularity structure.
- Model-based algorithms (mclust for a mixture of Gaussians): a model is hypothesized for each of the clusters, and the idea is to find the best fit of that model.

A cluster is a collection of observations:
- similar to one another within the same cluster
- dissimilar to the observations in other clusters
Cluster analysis groups a set of data observations into classes. Clustering is unsupervised classification: no predefined classes. This is descriptive data mining.
Typical applications:
- as a stand-alone tool to get insight into the data distribution
- as a preprocessing step for other algorithms
KMeans Clustering is an unsupervised learning algorithm: descriptive data mining.

Association Rule Mining
Identify items (patterns) that occur frequently together in a given set of data. Patterns = associations, correlations, causal structures (rules). Data = sets of items in:
- transactional databases
- relational databases
- complex information repositories
Rule: Body → Head [support, confidence]

Example Association Rules
- Friday ∩ Nappies → Beer [0.5%, 60%]
- Age ∈ [20, 30] ∩ Income ∈ [20K, 30K] → MP3Player [2%, 60%]
- Maths ∩ CS → HDinCS [1%, 75%]
- Gladiator ∩ Patriot → Sixth Sense [0.1%, 90%]
- Statins ∩ Peritonitis → Chronic Renal Failure [0.1%, 32%]
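A minimal sketch of mining such rules with the arules package; the transactions below are invented for illustration:

```r
library(arules)

# A toy transactional dataset, echoing the nappies/beer example.
txns <- as(list(
  c("Friday", "Nappies", "Beer"),
  c("Friday", "Nappies", "Beer", "Milk"),
  c("Nappies", "Milk"),
  c("Friday", "Beer")
), "transactions")

# Mine rules of the form Body -> Head [support, confidence].
rules <- apriori(txns, parameter = list(supp = 0.25, conf = 0.6))
inspect(rules)  # print each rule with its support and confidence
```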
Predictive Data Mining: Decision Trees
(Workshop overview: section 5 of 8.)

Predictive Modelling: Classification
- The goal of classification is to build models (sentences) in a knowledge representation (language) from examples of past decisions.
- The model is to be used on unseen cases to make decisions.
- Often referred to as supervised learning.
- Common approaches: decision trees; neural networks; logistic regression; support vector machines.

Language: Decision Trees
Knowledge representation: a flow-chart-like tree structure.
- Internal nodes denote a test on a variable
- Branches represent the outcomes of the test
- Leaf nodes represent class labels or class distributions
[Figure: a scatter of positive and negative examples split by Gender (Male/Female) and Age (< 42 / > 42), with the corresponding decision tree.]

Tree Construction: Divide and Conquer
Decision tree induction is an example of a recursive partitioning algorithm: divide and conquer.
- At the start, all the training examples are at the root
- Partition the examples recursively based on selected variables
Algorithm for Decision Tree Induction
- The tree is constructed top-down: recursive, divide-and-conquer
- Begin with all training examples at the root
- Data is partitioned recursively based on selected variables
- Select variables on the basis of a measure
- A greedy algorithm: it takes the best immediate (local) decision while building the overall model
Stop partitioning when?
- All samples for a given node belong to the same class
- There are no remaining variables for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left

Basic Motivation: Entropy
We are trying to predict output Y (e.g., Yes/No) from input X.
A random data set may have high entropy:
- Y is from a uniform distribution
- a frequency distribution would be flat!
- a sample will include uniformly random values of Y
A data set with low entropy:
- Y's distribution will be very skewed
- a frequency distribution will have a single peak
- a sample will predominantly contain just Yes or just No
Work towards reducing the amount of entropy in the data!
Variable Selection Measure: Entropy
Information gain (ID3/C4.5): select the variable with the highest information gain.
- Assume there are two classes: P and N
- Let the data S contain p elements of class P and n elements of class N
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as:

  I_E(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
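The formula can be written directly as a small R function (a hypothetical helper, not part of the workshop code):

```r
# I(p, n): information needed to classify a set containing
# p elements of class P and n elements of class N.
info <- function(p, n) {
  q <- p / (p + n)
  # Define 0 * log2(0) as 0 so that pure nodes have zero entropy.
  h <- function(x) ifelse(x == 0, 0, -x * log2(x))
  h(q) + h(1 - q)
}

info(5, 5)   # equally mixed classes: entropy is 1
info(10, 0)  # a pure node: entropy is 0
```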
Variable Selection Measure: Gini
The Gini index of impurity is a traditional statistical measure (CART):
- It measures how often a randomly chosen observation would be incorrectly classified if it were randomly classified in proportion to the actual classes.
- It is calculated as the sum of the probability of each observation being chosen times the probability of incorrect classification; equivalently, with p the proportion of positives:

  I_G(p, n) = 1 - (p^2 + (1 - p)^2)

- As with Entropy, the Gini measure is maximal when the classes are equally distributed and minimal when all observations are in one class or the other.
[Figure: the Info (entropy) and Gini measures plotted against the proportion of positives, 0.00 to 1.00.]

Information Gain
Now use variable A to partition S into v cells: {S1, S2, ..., Sv}. If Si contains pi examples of P and ni examples of N, the information now needed to classify objects in all subtrees Si is:

  E(A) = sum over i = 1..v of ((pi + ni) / (p + n)) * I(pi, ni)

So the information gained by branching on A is:

  Gain(A) = I(p, n) - E(A)

Choose the variable A which results in the greatest gain in information.

Building Decision Trees In Rattle
Startup Rattle:

library(rattle)
rattle()

Load the Example Weather Dataset: click on the Execute button and an example dataset is offered; click Yes to load the weather dataset. A summary of the weather dataset is displayed.
Model Tab: Decision Tree
- Click on the Model tab to display the modelling options.
- Decision Tree is the default model type: simply click Execute to build a tree predicting RainTomorrow.
- Click the Draw button to display the tree (Settings → Advanced Graphics).
Evaluate Decision Tree
- Click the Evaluate tab for options to evaluate model performance.
- Error Matrix: click Execute to display a simple error matrix; identify the True/False Positives/Negatives.
- ROC Curve: click the ROC type and then Execute.
- Risk Chart: click the Risk type and then Execute.
Score a Dataset
- Click the Score type to score a new dataset using the model.
Log of R Commands
- Click the Log tab for a history of all your interactions.
- Save the log contents as a script to repeat what we did.
- Here we see the call to rpart() to build the model.
Click on the Export button to save the script to file.

Building Decision Trees in Rattle: Help

Rattle provides some basic help via Help → Model → Tree; click Yes for the R help page.

Building Decision Trees in R: Weather Dataset Inputs

    ds <- weather
    head(ds, 4)
    ##         Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
    ## 1 2007-11-01 Canberra     8.0    24.3      0.0         3.4      6.3
    ## 2 2007-11-02 Canberra    14.0    26.9      3.6         4.4      9.7
    ## 3 2007-11-03 Canberra    13.7    23.4      3.6         5.8      3.3
    ## 4 2007-11-04 Canberra    13.3    15.5     39.8         7.2      9.1
    ....
    summary(ds[c(3:5, 23)])
    ##     MinTemp          MaxTemp        Rainfall        RISK_MM
    ##  Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.00
    ##  1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 0.00
    ##  Median : 7.45   Median :19.6   Median : 0.00   Median : 0.00
    ##  Mean   : 7.27   Mean   :20.6   Mean   : 1.43   Mean   : 1.43
    ....

Building Decision Trees in R: Weather Dataset Target

    target <- "RainTomorrow"
    summary(ds[target])
    ##  RainTomorrow
    ##  No :300
    ##  Yes: 66
    (form <- formula(paste(target, "~ .")))
    ## RainTomorrow ~ .
    (vars <- names(ds)[-c(1, 2, 23)])
    ##  [1] "MinTemp"       "MaxTemp"       "Rainfall"      "Evaporation"
    ##  [5] "Sunshine"      "WindGustDir"   "WindGustSpeed" "WindDir9am"
    ##  [9] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"
    ## [13] "Humidity3pm"   "Pressure9am"   "Pressure3pm"   "Cloud9am"
    ## [17] "Cloud3pm"      "Temp9am"       "Temp3pm"       "RainToday"
    ## [21] "RainTomorrow"

Building Decision Trees in R: Simple Train/Test Paradigm

    set.seed(1421)
    train <- sample(1:nrow(ds), 0.70*nrow(ds))   # Training dataset
    head(train)
    ## [1] 288 298 363 107  70 232
    length(train)
    ## [1] 256
    test <- setdiff(1:nrow(ds), train)           # Testing dataset
    length(test)
    ## [1] 110
Building Decision Trees in R: Display the Model

    model <- rpart(form, ds[train, vars])
    model
    ## n= 256
    ##
    ## node), split, n, loss, yval, (yprob)
    ##       * denotes terminal node
    ##
    ##  1) root 256 44 No (0.82812 0.17188)
    ##    2) Humidity3pm< 59.5 214 21 No (0.90187 0.09813)
    ##      4) WindGustSpeed< 64 204 14 No (0.93137 0.06863)
    ##        8) Cloud3pm< 6.5 163 5 No (0.96933 0.03067) *
    ##        9) Cloud3pm>=6.5 41 9 No (0.78049 0.21951)
    ##         18) Temp3pm< 26.1 34 4 No (0.88235 0.11765) *
    ##         19) Temp3pm>=26.1 7 2 Yes (0.28571 0.71429) *
    ....

Notice the legend, which helps to interpret the tree.

Building Decision Trees in R: Performance on the Test Dataset

The predict() function is used to score new data.

    head(predict(model, ds[test,], type="class"))
    ##  2  4  6  8 11 12
    ## No No No No No No
    ## Levels: No Yes
    table(predict(model, ds[test,], type="class"), ds[test, target])
    ##        No Yes
    ##   No   77  14
    ##   Yes  11   8

Building Decision Trees in R: Example Tree Plot using Rattle

[Figure: decision tree drawn with Rattle: the root splits on Humidity3pm < 60, the No branch then splits on WindGustSpeed < 64, Cloud3pm < 6.5, and Temp3pm < 26, and the Yes branch splits on Pressure3pm >= 1015, with class proportions shown in each node.]

Building Decision Trees in R: An R Scripting Hint

Notice the use of the variables ds, target, and vars. Change these variables and the remaining script is unchanged, which simplifies script writing and the reuse of scripts. For example:

    ds     <- iris
    target <- "Species"
    vars   <- names(ds)

Then repeat the rest of the script, without change.
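The error matrix above can be summarised with a few lines of base R; the counts below are copied from the weather test-set table (accuracy, precision, and recall are standard measures, computed here by hand rather than by Rattle):

```r
# Error matrix from the weather test set: rows are predictions,
# columns are actual classes (No/Yes for RainTomorrow).
cm <- matrix(c(77, 11, 14, 8), nrow=2,
             dimnames=list(predicted=c("No", "Yes"),
                           actual=c("No", "Yes")))

accuracy  <- sum(diag(cm)) / sum(cm)              # (77 + 8) / 110
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # predicted Yes that are Yes
recall    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # actual Yes that we caught

round(c(accuracy=accuracy, precision=precision, recall=recall), 3)
```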
Building Decision Trees in R: An R Scripting Hint (Unchanged Code)

This code remains the same to build the decision tree:

    form  <- formula(paste(target, "~ ."))
    train <- sample(1:nrow(ds), 0.70*nrow(ds))
    test  <- setdiff(1:nrow(ds), train)
    model <- rpart(form, ds[train, vars])
    model
    ## n= 105
    ##
    ## node), split, n, loss, yval, (yprob)
    ##       * denotes terminal node
    ##
    ## 1) root 105 69 setosa (0.34286 0.32381 0.33333)
    ##   2) Petal.Length< 2.6 36 0 setosa (1.00000 0.00000 0.00000) *
    ##   3) Petal.Length>=2.6 69 34 virginica (0.00000 0.49275 0.50725)
    ##     6) Petal.Length< 4.95 35 2 versicolor (0.00000 0.94286 0.05714) *
    ##     7) Petal.Length>=4.95 34 1 virginica (0.00000 0.02941 0.97059) *

Similarly for the predictions:

    head(predict(model, ds[test,], type="class"))
    ##      3      8      9     10     11     12
    ## setosa setosa setosa setosa setosa setosa
    ## Levels: setosa versicolor virginica
    table(predict(model, ds[test,], type="class"), ds[test, target])
    ##              setosa versicolor virginica
    ##   setosa         14          0         0
    ##   versicolor      0         15         4
    ##   virginica       0          1        11

Predictive Data Mining: Decision Trees (Summary)

Decision tree induction: a simple idea and a powerful learner, and the most widely deployed machine learning algorithm. Available in R through the rpart package; related packages include party, Cubist, C50, and RWeka (J48).
Building Multiple Models

The basic idea is that multiple models, like multiple experts, may produce better results when working together rather than in isolation. The general idea was developed in the Multiple Inductive Learning algorithm (Williams 1987); the ideas were developed (ACJ 1987, PhD 1990) in the context of observing that variable selection methods don't discriminate strongly, so build multiple decision trees and then combine them into a single model. Two approaches are covered, Boosting and Random Forests: the "best off-the-shelf" model builders (Leo Breiman). These are meta learners.

Boosting Algorithms

Basic idea: boost the observations that are "hard to model". Iteratively build weak models using a poor learner:

Build an initial model;
Identify the mis-classified cases in the training dataset;
Boost (over-represent) the training observations modelled incorrectly;
Build a new model on the boosted training dataset;
Repeat.

The result is an ensemble of weighted models.

Boosting Example: Error Rate and Variable Importance

    plot(m)
    varplot(m)

[Figure: training error decreases quickly and then flattens across the 50 iterations.]
[Figure: variable importance plot, with Temp3pm, Pressure9am, MinTemp, and Humidity3pm among the highest-scoring variables.]

These plots help us understand the knowledge captured by the model.
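The boosting loop above can be made concrete with a toy base-R implementation, using one-variable decision stumps as the weak learner. This sketch is our own, for illustration only; the ada package used in the examples boosts rpart trees with the same reweighting idea:

```r
# A toy AdaBoost in base R. Labels y are coded as +1/-1.

fit_stump <- function(x, y, w) {
  # Exhaustively try each threshold and orientation; keep the split
  # with the smallest weighted error.
  best <- list(err = Inf)
  for (s in sort(unique(x))) {
    for (dir in c(1, -1)) {            # dir flips which side predicts +1
      pred <- ifelse(dir * (x - s) > 0, 1, -1)
      err <- sum(w * (pred != y))
      if (err < best$err) best <- list(s = s, dir = dir, err = err)
    }
  }
  best
}

predict_stump <- function(st, x) ifelse(st$dir * (x - st$s) > 0, 1, -1)

adaboost <- function(x, y, rounds = 20) {
  n <- length(y)
  w <- rep(1/n, n)                     # start with uniform weights
  ensemble <- list()
  for (r in seq_len(rounds)) {
    st <- fit_stump(x, y, w)
    err <- max(st$err, 1e-10)
    if (err >= 0.5) break              # no better than chance: stop
    st$alpha <- 0.5 * log((1 - err) / err)
    # Boost (over-represent) the mis-classified observations.
    w <- w * exp(-st$alpha * y * predict_stump(st, x))
    w <- w / sum(w)
    ensemble[[length(ensemble) + 1]] <- st
  }
  ensemble
}

predict_adaboost <- function(ensemble, x) {
  score <- Reduce(`+`, lapply(ensemble,
                              function(st) st$alpha * predict_stump(st, x)))
  sign(score)                          # the weighted vote of the weak models
}

# Positives sit in the middle of the range, so no single stump can
# separate the classes, but the boosted ensemble can.
x <- 1:10
y <- ifelse(x >= 4 & x <= 7, 1, -1)
m <- adaboost(x, y)
mean(predict_adaboost(m, x) != y)      # training error of the ensemble
```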
Boosting Example: Performance

    predicted <- predict(m, weather[-train,], type="prob")[,2]
    actual <- weather[-train,]$RainTomorrow
    risks <- weather[-train,]$RISK_MM
    riskchart(predicted, actual, risks)

[Figure: risk chart for the boosted model, with recall (88%), risk (94%), and precision plotted against caseload.]

Boosting Example: Sample Trees

There are 50 trees in all; here are the first three:

    fancyRpartPlot(m$model$trees[[1]])
    fancyRpartPlot(m$model$trees[[2]])
    fancyRpartPlot(m$model$trees[[3]])

[Figure: three small decision trees, splitting on variables such as Cloud3pm, Pressure3pm, MaxTemp, Sunshine, and Humidity3pm.]

Boosting: Example Applications

1 ATO application: what life events affect compliance?
2 First application of the technology (1995): decision stumps such as Age > NN and change in marital status.
3 Boosted neural networks.
4 OCR using neural networks as base learners (Drucker, Schapire, Simard, 1993).

Boosting: Summary

Boosting is implemented in R in the ada library. The various boosting algorithms minimise different loss functions of the margin m: AdaBoost uses e^(-m), LogitBoost uses log(1 + e^(-m)), and Doom II uses 1 - tanh(m). AdaBoost tends to be sensitive to noise (addressed by BrownBoost), but tends not to overfit: as new models are added, the generalisation error tends to improve, and boosting can be proved to converge to a perfect model if the learners are always better than chance.

Random Forests

The original idea comes from Leo Breiman and Adele Cutler. Random forests are typically presented in the context of decision trees, but the idea carries over to other learners: Random Multinomial Logit, for example, uses multiple multinomial logit models. Build many decision trees (e.g., 500). For each tree:
Select a random subset of the training set (N);
At each node of the tree, choose the split from a different random subset of the variables (m << M);
Build the tree without pruning (i.e., let it overfit).

The name "Random Forests" is licensed to Salford Systems, hence the R package is called randomForest.

To classify a new entity, use every decision tree: each tree "votes" for the entity, and the decision with the largest number of votes wins. The proportion of votes is the resulting score.

Random Forests Example: RF on the Weather Data

    set.seed(42)
    (m <- randomForest(RainTomorrow ~ .,
                       weather[train, -c(1:2, 23)],
                       na.action=na.roughfix,
                       importance=TRUE))
    ## Call:
    ##  randomForest(formula=RainTomorrow ~ ., data=weath...
    ##                Type of random forest: classification
    ##                      Number of trees: 500
    ## No. of variables tried at each split: 4
    ##
    ##         OOB estimate of error rate: 13.67%
    ## Confusion matrix:
    ##      No Yes class.error
    ## No  211   4      0.0186
    ## Yes  31  10      0.7561

Random Forests Example: Error Rate

    plot(m)

[Figure: the OOB error rate decreases quickly then flattens over the 500 trees.]

Random Forests Example: Variable Importance and Sample Trees

These help us understand the knowledge captured. There are 500 trees in all; here are some rules from the first tree.
    varImpPlot(m, main="Variable Importance")

[Figure: variable importance dot plots (MeanDecreaseAccuracy and MeanDecreaseGini); Sunshine, Cloud3pm, and Pressure3pm rank near the top on both measures.]

    ## Random Forest Model 1
    ##
    ## ----------------------------------------------------...
    ## Tree 1 Rule 1 Node 30 Decision No
    ##
    ##  1: Evaporation <= 9
    ##  2: Humidity3pm <= 71
    ##  3: Cloud3pm <= 2.5
    ##  4: WindDir9am IN ("NNE")
    ##  5: Sunshine <= 10.25
    ##  6: Temp3pm <= 17.55
    ## ----------------------------------------------------...
    ## Tree 1 Rule 2 Node 31 Decision Yes
    ##
    ##  1: Evaporation <= 9
    ##  2: Humidity3pm <= 71
    ....

Random Forests Example: Performance

    predicted <- predict(m, weather[-train,], type="prob")[,2]
    actual <- weather[-train,]$RainTomorrow
    risks <- weather[-train,]$RISK_MM
    riskchart(predicted, actual, risks)

[Figure: risk chart for the random forest, with recall (92%), risk (97%), and precision plotted against caseload.]

Features of Random Forests (per Breiman):

Most accurate of current algorithms.
Runs efficiently on large data sets.
Can handle thousands of input variables.
Gives estimates of variable importance.
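The voting scheme described above can be sketched in base R. Each "tree" here is just a stump over a bootstrap sample and one randomly chosen variable, to keep the sketch short; randomForest() grows full unpruned trees, choosing from a random variable subset at every node. The helpers grow_tree() and score() are hypothetical names of our own:

```r
set.seed(42)

# Toy data: class is (noisily) determined by the sum of two inputs.
n <- 200
X <- data.frame(a = runif(n), b = runif(n))
y <- factor(ifelse(X$a + X$b + rnorm(n, sd = 0.1) > 1, "Yes", "No"))

grow_tree <- function(X, y) {
  idx <- sample(nrow(X), replace = TRUE)   # bootstrap the training set
  v <- sample(names(X), 1)                 # a random variable subset
  s <- median(X[idx, v])                   # a crude split point
  hi <- names(which.max(table(y[idx][X[idx, v] > s])))
  lo <- names(which.max(table(y[idx][X[idx, v] <= s])))
  list(var = v, split = s, hi = hi, lo = lo)
}

forest <- replicate(100, grow_tree(X, y), simplify = FALSE)

# Every tree votes; the proportion of "Yes" votes is the score.
score <- function(forest, newdata) {
  votes <- sapply(forest, function(tr)
    ifelse(newdata[[tr$var]] > tr$split, tr$hi, tr$lo))
  rowMeans(votes == "Yes")
}

s <- score(forest, X)
pred <- ifelse(s > 0.5, "Yes", "No")     # the majority vote wins
mean(pred == y)                          # well above chance
```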
Predictive Data Mining: Ensembles (Summary)

An ensemble is multiple models working together, often better than a single model: both the variance and the bias of the model are reduced. Ensembles are among the best available models today, accurate and robust, and are in daily use in very many areas of application.

Moving Into R: Programming with Data

Data scientists are programmers of data. A GUI can only do so much, and R is a powerful statistical language. Data scientists desire scripting, transparency, repeatability, and sharing. Rattle's Log tab provides the path from the GUI to the command line: it records the R commands behind each GUI action.

Moving Into R: Step 1, Load the Dataset

    dsname <- "weather"
    ds <- get(dsname)
    dim(ds)
    ## [1] 366  24
    names(ds)
    ##  [1] "Date"          "Location"      "MinTemp"       "...
    ##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "...
    ##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "...
    ## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "...
    ....

Moving Into R: Step 2, Observe the Data (Observations)

    head(ds)
    ##         Date Location MinTemp MaxTemp Rainfall Evapora...
    ## 1 2007-11-01 Canberra     8.0    24.3      0.0        ...
    ## 2 2007-11-02 Canberra    14.0    26.9      3.6        ...
    ## 3 2007-11-03 Canberra    13.7    23.4      3.6        ...
    ....
    tail(ds)
    ##           Date Location MinTemp MaxTemp Rainfall Evapo...
    ## 361 2008-10-26 Canberra     7.9    26.1        0      ...
    ## 362 2008-10-27 Canberra     9.0    30.7        0      ...
    ## 363 2008-10-28 Canberra     7.1    28.4        0      ...
    ....

Moving Into R: Step 2, Observe the Data (Structure)

    str(ds)
    ## 'data.frame': 366 obs. of 24 variables:
    ##  $ Date         : Date, format: "2007-11-01" "2007-11-...
    ##  $ Location     : Factor w/ 46 levels "Adelaide","Alba...
    ##  $ MinTemp      : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...
    ##  $ MaxTemp      : num 24.3 26.9 23.4 15.5 16.1 16.9 1...
    ##  $ Rainfall     : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16...
    ##  $ Evaporation  : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6...
    ##  $ Sunshine     : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4....
    ##  $ WindGustDir  : Ord.factor w/ 16 levels "N"<"NNE"<"N...
    ##  $ WindGustSpeed: num 30 39 85 54 50 44 43 41 48 31 ...
    ##  $ WindDir9am   : Ord.factor w/ 16 levels "N"<"NNE"<"N...
    ##  $ WindDir3pm   : Ord.factor w/ 16 levels "N"<"NNE"<"N...
    ....

Moving Into R: Step 2, Observe the Data (Summary)

    summary(ds)
    ##       Date               Location          MinTemp    ...
    ##  Min.   :2007-11-01   Canberra     :366   Min.   :-5.3...
    ##  1st Qu.:2008-01-31   Adelaide     :  0   1st Qu.: 2.3...
    ##  Median :2008-05-01   Albany       :  0   Median : 7.4...
    ##  Mean   :2008-05-01   Albury       :  0   Mean   : 7.2...
    ##  3rd Qu.:2008-07-31   AliceSprings :  0   3rd Qu.:12.5...
    ##  Max.   :2008-10-31   BadgerysCreek:  0   Max.   :20.9...
    ##                       (Other)      :  0               ...
    ....

Moving Into R: Step 2, Observe the Data (Variables)

    id     <- c("Date", "Location")
    target <- "RainTomorrow"
    risk   <- "RISK_MM"
    (ignore <- union(id, risk))
    ## [1] "Date"     "Location" "RISK_MM"
    (vars <- setdiff(names(ds), ignore))
    ##  [1] "MinTemp"       "MaxTemp"       "Rainfall"      "...
    ##  [5] "Sunshine"      "WindGustDir"   "WindGustSpeed" "...
    ##  [9] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "...
    ## [13] "Humidity3pm"   "Pressure9am"   "Pressure3pm"   "...
    ....

Moving Into R: Step 3, Clean the Data (Remove Missing)

    dim(ds)
    ## [1] 366  24
    sum(is.na(ds[vars]))
    ## [1] 47
    ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]
    dim(ds)
    ## [1] 328  24
    sum(is.na(ds[vars]))
    ## [1] 0

Moving Into R: Step 3, Clean the Data (Target as Categoric)

    summary(ds[target])
    ##   RainTomorrow
    ##  Min.   :0.000
    ##  1st Qu.:0.000
    ##  Median :0.000
    ##  Mean   :0.183
    ##  3rd Qu.:0.000
    ##  Max.   :1.000
    ds[target] <- as.factor(ds[[target]])
    levels(ds[[target]]) <- c("No", "Yes")
    summary(ds[target])
    ##  RainTomorrow
    ##  No :268
    ##  Yes: 60

[Figure: bar chart of the RainTomorrow counts, No versus Yes.]

Moving Into R: Step 4, Prepare for Modelling

    (form <- formula(paste(target, "~ .")))
    ## RainTomorrow ~ .
    (nobs <- nrow(ds))
    ## [1] 328

Moving Into R: Step 5, Build the Model (Random Forest)

    library(randomForest)
    model <- randomForest(form, ds[train, vars], na.action=na.omit)
    model
    ## Call:
    ##  randomForest(formula=form, data=ds[train, vars], ...
    ##                Type of random forest: classification
    ##                      Number of trees: 500
    ## No. of variables tried at each split: 4
    ....
The training and testing index vectors, created as part of Step 4:

    train <- sample(nobs, 0.70*nobs)
    length(train)
    ## [1] 229
    test <- setdiff(1:nobs, train)
    length(test)
    ## [1] 99

Moving Into R: Step 6, Evaluate the Model (Risk Chart)

    pr <- predict(model, ds[test,], type="prob")[,2]
    riskchart(pr, ds[test, target], ds[test, risk],
              title="Random Forest - Risk Chart",
              risk=risk, recall=target,
              thresholds=c(0.35, 0.15))

[Figure: "Random Forest - Risk Chart", with the RainTomorrow recall (98%), RISK_MM (97%), and precision curves plotted against caseload.]

Literate Data Mining in R: Why is Reproducibility Important?

Your Research Leader or Executive drops by and asks:

"Remember that research you did last year? I've heard there is an update on the data that you used. Can you add the new data in and repeat the same analysis?"

"Jo Bloggs did a great analysis of the company returns data just before she left. Can you get someone else to analyse the new data set using the same methods, and so produce an updated report that we can present to the Exec next week?"

Literate Data Mining: Overview

One document intermixes the analysis, the code, and the results, making authors productive with narrative and code in one place. Sweave (Leisch 2002) and now KnitR (Yihui 2011) embed R code into LaTeX documents for typesetting.

And one more request:

"The fraud case you provided an analysis of last year has finally reached the courts.
We need to ensure we have a clear trail of the data sources, the analyses performed, and the results obtained, to stand up in court. Could you document these please."

KnitR also supports publishing to the web.

Why Reproducible Data Mining?

Eliminate errors that occur when transcribing results into documents.
Automatically regenerate documents when code, data, or assumptions change.
Record the context for the analysis, and the decisions made about the type of analysis to perform, in the one place.
Document the processes to provide integrity for the conclusions of the analysis.
Share the approach with others for peer review and for learning from each other, engendering a continuous learning environment.

Prime Objective: Trustworthy Software

"Those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis, and by extension the software that produced it. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive."
John Chambers (2008), Software for Data Analysis: Programming with R

Beautiful Output by Default

The reader wants to read the document, and to do so easily. KnitR combined with LaTeX will intermix the analysis and its results, automatically generate graphics and tables, support reproducible and transparent analysis, and produce the best looking reports. Code highlighting is automatic, the default theme is carefully designed, many other themes are available, and R code is "properly" reformatted.
Analyses (graphs and tables) are automatically included.

Using RStudio

RStudio simplifies interaction with R, LaTeX, and KnitR: it executes R code one line at a time, formats LaTeX documents, provides spell checking, and compiles to PDF with a single click, with synchronised views.

Demonstrate: start up and explore RStudio.

Introducing LaTeX

A text markup language rather than a WYSIWYG. Based on TeX from 1977, which is very stable and powerful; LaTeX is an easier-to-use macro package built on TeX. It ensures a consistent style (layout, fonts, tables, maths, etc.) and provides automatic indexes, footnotes, and references. Documents are well structured and are clear text. It has a learning curve.

Basic LaTeX Usage

    \documentclass{article}
    \begin{document}
    ...
    \end{document}

Demonstrate: create a new Sweave document in RStudio.

Structures

    \documentclass{article}
    \begin{document}
    \section{Introduction}
    ...
    \subsection{Concepts}
    ...
    \end{document}

Formats

    \documentclass{article}
    \begin{document}
    \begin{itemize}
    \item ABC
    \item DEF
    \end{itemize}

    This is \textbf{bold} text or \textit{italic} text, ...
    \end{document}

RStudio Support for LaTeX

RStudio provides excellent support for working with LaTeX documents and helps to avoid having to know too much about LaTeX. Best illustrated through a demonstration: the Format menu (section commands, font commands, list commands, verbatim/block commands), the spell checker, and Compile PDF.

Demonstrate: start a new document, add contents, format to PDF.
Incorporating R Code

We insert R code in a chunk starting with <<>>= and terminate the chunk with @. Save the LaTeX file with the extension .Rnw. This chunk:

    <<simple_example>>=
    x <- sum(1:10)
    x
    @

produces:

    x <- sum(1:10)
    x
    ## [1] 55

Demonstrate: do this in RStudio.

Making You Look Good

KnitR reformats even badly laid out code. This chunk:

    <<format_example>>=
    for(i in 1:5){j<-cos(sin(i)*i^2)+3;print(j-5)}
    @

produces:

    for(i in 1:5) {
      j <- cos(sin(i)*i^2)+3
      print(j-5)
    }
    ## [1] -1.334
    ## [1] -2.88
    ## [1] -1.704
    ## [1] -1.103
    ....

R Within the Text

Include information about the data within the narrative. We can do that with \Sexpr{...}:

    Our dataset has \Sexpr{nrow(ds)} observations of \Sexpr{ncol(ds)} variables.

becomes "Our dataset has 82169 observations of 24 variables." Better still, \Sexpr{format(nrow(ds), big.mark=",")} gives "Our dataset has 82,169 observations of 24 variables."

A Simple Table

    library(xtable)
    obs <- sample(1:nrow(weatherAUS), 8)
    vars <- 2:6
    xtable(weatherAUS[obs, vars])

[Table: eight sampled rows of weatherAUS (Location, MinTemp, MaxTemp, Rainfall, Evaporation), typeset by xtable with the row names included.]

Table: Exclude Row Names

    print(xtable(weatherAUS[obs, vars]), include.rownames=FALSE)

[Table: the same rows without the row-name column.]

Table: Limit the Number of Digits

    print(xtable(weatherAUS[obs, vars], digits=1), include.rownames=FALSE)

[Table: the same rows with one decimal digit, e.g. Cairns 17.5, 27.3, 0.0, 4.8.]
Table: Tiny Font

    vars <- 2:8
    print(xtable(weatherAUS[obs, vars], digits=0),
          size="tiny", include.rownames=FALSE)

[Table: the rows extended with Sunshine and WindGustDir, rounded to whole numbers and typeset in a tiny font.]

Table: Column Alignment

    print(xtable(weatherAUS[obs, vars], digits=0, align="rlrrrrrr"),
          size="tiny")

[Table: the same rows with the Location column left aligned and the numeric columns right aligned.]

Plots

    library(ggplot2)
    cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
    ds <- subset(weatherAUS, Location %in% cities & !
    is.na(Temp3pm))
    g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))
    g <- g + geom_density(alpha=0.55)
    print(g)

[Figure: overlaid density curves of Temp3pm for Canberra, Darwin, Melbourne, and Sydney.]

Table: Caption

    print(xtable(weatherAUS[obs, vars], digits=1,
                 caption="This is the table caption."), size="tiny")

[Table: the same rows in a tiny font, with the caption "This is the table caption."]

Our First KnitR Document

Create a KnitR document with New → R Sweave.

Setup KnitR

We wish to use KnitR rather than the older Sweave processor. In RStudio we can configure the options to use knitr:

Select Tools → Options;
choose the Sweave group;
choose knitr for "Weave Rnw files using";
the remaining defaults should be okay;
click Apply and then OK.

Simple KnitR Document

Insert the following into your new KnitR document:

    \title{Sample KnitR Document}
    \author{Graham Williams}
    \maketitle

    \section*{My First Section}

    This is some text that is automatically typeset by the LaTeX
    processor to produce well formatted quality output as PDF.
Your turn: click Compile PDF to view the result, a typeset PDF document.

KnitR: Add R Commands

R code can be used to generate results into the document:

    <<echo=FALSE, message=FALSE>>=
    library(rattle)    # Provides the weather dataset
    library(ggplot2)   # Provides the qplot() function

    ds <- weather
    qplot(MinTemp, MaxTemp, data=ds)
    @

Your turn: click Compile PDF again; the resulting PDF now includes the plot.

LaTeX Basics Cheat Sheet

    \subsection*{...}     % Introduce a Sub Section
    \subsubsection*{...}  % Introduce a Sub Sub Section

    \textbf{...}          % Bold font
    \textit{...}          % Italic font

    \begin{itemize}
    \item ...
    \item ...
    \end{itemize}         % A bullet list

Plus an extensive collection of other markup and capabilities.

KnitR Basics Cheat Sheet

    echo=FALSE       # Do not display the R code
    eval=TRUE        # Evaluate the R code
    results="hide"   # Hide the results of the R commands

    fig.width=10     # Extend figure width from 7 to 10 inches
    fig.height=8     # Extend figure height from 7 to 8 inches

    out.width="0.8\\textwidth"    # Fit figure to 80% of the page width
    out.height="0.5\\textheight"  # Fit figure to 50% of the page height

Plus an extensive collection of other options.

Resources and References

OnePageR: http://onepager.togaware.com (tutorial notes)
Rattle: http://rattle.togaware.com
Guides: http://datamining.togaware.com
Practise: http://analystfirst.com
Book: Data Mining with Rattle and R
Chapter: Rattle and Other Tales
Paper: A Data Mining GUI for R (R Journal, Volume 1(2))
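The LaTeX and KnitR fragments above assemble into one minimal .Rnw document. This is a sketch only: the chunk label and options are illustrative, and it assumes the rattle and ggplot2 packages are installed:

```latex
\documentclass{article}

\begin{document}

\title{Sample KnitR Document}
\author{Graham Williams}
\maketitle

\section*{My First Section}

<<weather_plot, echo=FALSE, message=FALSE>>=
library(rattle)    # provides the weather dataset
library(ggplot2)   # provides qplot()
ds <- weather
qplot(MinTemp, MaxTemp, data=ds)
@

Our dataset has \Sexpr{nrow(ds)} observations
of \Sexpr{ncol(ds)} variables, plotted above.

\end{document}
```

Save it with the extension .Rnw, ensure "Weave Rnw files using" is set to knitr, and click Compile PDF in RStudio.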