Download Handouts for Workshop on Rattle and R

Workshop Overview

A Data Mining Workshop: Excavating Knowledge from Data
Introducing Data Mining using R

1. R: A Language for Data Mining
2. Data Mining, Rattle, and R
3. Loading, Cleaning, Exploring Data in Rattle
4. Descriptive Data Mining
5. Predictive Data Mining: Decision Trees
6. Predictive Data Mining: Ensembles
7. Moving into R and Scripting our Analyses
8. Literate Data Mining in R
[email protected]
Data Scientist, Australian Taxation Office
Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia
http://datamining.togaware.com

Visit http://onepager.togaware.com for the Workshop Notes.
R: A Language for Data Mining
What is R?

Installing R and Rattle

The first task is to install R.

As free/libre open source software (FLOSS or FOSS), R and Rattle are
available to all, with no limitations on our freedom to use and share
the software, except to share and share alike.

Visit CRAN at http://cran.rstudio.com
Visit Rattle at http://rattle.togaware.com

Linux: install packages (Ubuntu is recommended)

$ wajig install r-recommended r-cran-rattle

Windows: download and install from CRAN
MacOSX: download and install from CRAN
Why a Workshop on R?
Why do Data Science with R?

Most widely used Data Mining and Machine Learning package
  Machine Learning
  Statistics
  Software Engineering and Programming with Data
  But not the nicest of languages for a Computer Scientist!

Free (Libre) Open Source Statistical Software
  ... all modern statistical approaches
  ... many/most machine learning algorithms
  ... opportunity to readily add new algorithms

That is important for us in the research community:
Get our algorithms out there and being used—impact!!!

Popularity of R

How Popular is R? Discussion List Traffic
Monthly email traffic on software's main discussion list.
Source: http://r4stats.com/articles/popularity/

How Popular is R? Professional Forums
Registered for the main discussion group for each software.
Source: http://r4stats.com/articles/popularity/

How Popular is R? Discussion Topics
Number of discussions on popular Q&A forums, 2013.
Source: http://r4stats.com/articles/popularity/

How Popular is R? R versus SAS
Number of R/SAS related posts to Stack Overflow by week.
Source: http://r4stats.com/articles/popularity/

How Popular is R? Used in Analytics Competitions
Software used in data analysis competitions in 2011.
Source: http://r4stats.com/articles/popularity/

How Popular is R? User Survey
Rexer Analytics Survey 2010 results for data mining/analytic tools.
Source: http://r4stats.com/articles/popularity/
Data Mining, Rattle, and R

R — The Video

A 90 Second Promo from Revolution Analytics:
http://www.revolutionanalytics.com/what-is-open-source-r/

An Introduction to Data Mining

Big Data and Big Business: Data Mining

Data Mining is the application of
  Machine Learning
  Statistics
  Software Engineering and Programming with Data
  Effective Communications and Intuition
... to datasets that vary by
  Volume, Velocity, Variety, Value, Veracity
... to discover new knowledge
... to improve business outcomes
... to deliver better tailored services

A data driven analysis to uncover otherwise unknown but useful
patterns in large datasets, to discover new knowledge and to develop
predictive models, turning data and information into knowledge and
(one day perhaps) wisdom, in a timely manner.
Data Mining in Research

Health Research
  Adverse reactions using linked Pharmaceutical, General
  Practitioner, Hospital, Pathology datasets.
Astronomy
  Microlensing events in the Large Magellanic Cloud of several
  million observed stars (out of 10 billion).
Psychology
  Investigation of age-of-onset for Alzheimer's disease from 75
  variables for 800 people.
Social Sciences
  Survey evaluation. Social network analysis - identifying key
  influencers.

Data Mining in Government

Australian Taxation Office
  Lodgment ($110M)
  Tax Havens ($150M)
  Tax Fraud ($250M)
Immigration and Border Control
  Check passengers before boarding
Health and Human Services
  Doctor shoppers
  Over servicing
The Business of Data Mining

SAS has annual revenues of $3B (2013)
IBM bought SPSS for $1.2B (2009)
Analytics is a >$100B business and >$320B by 2020
Amazon, eBay/PayPal, Google, Facebook, LinkedIn, ...
Shortage of 180,000 data scientists in the US in 2018 (McKinsey) ...

Basic Tools: Data Mining Algorithms

Cluster Analysis (kmeans, wskm)
Association Analysis (arules)
Linear Discriminant Analysis (lda)
Logistic Regression (glm)
Decision Trees (rpart, wsrpart)
Random Forests (randomForest, wsrf)
Boosted Stumps (ada)
Neural Networks (nnet)
Support Vector Machines (kernlab)
...

That's a lot of tools to learn in R!
Many with different interfaces and options.
The Rattle Package for Data Mining: A GUI for Data Mining

Why a GUI?

Statistics can be complex and traps await
So many tools in R to deliver insights
Effective analyses should be scripted
Scripting also required for repeatability
R is a language for programming with data
How to remember how to do all of this in R?
How to skill up 150 data analysts with Data Mining?

Users of Rattle

Today, Rattle is used world wide in many industries
  Health analytics
  Customer segmentation and marketing
  Fraud detection
  Government
It is used by
  Universities to teach Data Mining
  Within research projects for basic analyses
  Consultants and Analytics Teams across business
It is and will remain freely available.
CRAN and http://rattle.togaware.com
The Rattle Package for Data Mining: Setting Things Up

Installation

Rattle is built using R.
Need to download and install R from cran.r-project.org
Recommend also install RStudio from www.rstudio.org
Then start up RStudio and install Rattle:

install.packages("rattle")

Then we can start up Rattle:

rattle()

Required packages are loaded as needed.

A Tour Thru Rattle

A Tour Thru Rattle: Startup
A Tour Thru Rattle: Loading Data
A Tour Thru Rattle: Explore Distribution
A Tour Thru Rattle: Explore Correlations
A Tour Thru Rattle: Hierarchical Cluster
A Tour Thru Rattle: Decision Tree
A Tour Thru Rattle: Decision Tree Plot
A Tour Thru Rattle: Random Forest

[Screenshots of the corresponding Rattle screens for each tour stop.]

A Tour Thru Rattle: Risk Chart

[Figure: risk chart for a Random Forest on weather.csv [test] with target
RainTomorrow, plotting Performance (%) against Caseload (%), with
Recall (RainTomorrow, 92%), Risk (Rain in MM, 97%), Precision and Lift
curves, and risk scores 0.1 to 0.9 marked along the top.]
Moving Into R: Programming with Data

Data Scientists are Programmers of Data

But...
  A GUI can only do so much
  R is a powerful statistical language

Data Scientists Desire...
  Scripting
  Transparency
  Repeatability
  Sharing

From GUI to CLI — Rattle's Log Tab

[Screenshot: the Rattle Log tab recording the R commands behind each GUI action.]
R Tool Suite: The Power of Free/Libre and Open Source Software

Tools
  Ubuntu GNU/Linux operating system
    Feature rich toolkit, up-to-date, easy to install, FLOSS
  RStudio
    Easy to use integrated development environment, FLOSS
    Powerful alternative is Emacs (Speaks Statistics), FLOSS
  R Statistical Software Language
    Extensive, powerful, thousands of contributors, FLOSS
  KnitR and LaTeX
    Produce beautiful documents, easily reproducible, FLOSS
Using Ubuntu

Desktop Operating System (GNU/Linux)
  Replacing Windows and OSX
The GNU Tool Suite based on Unix — significant heritage
  Multiple specialised single task tools, working well together
  Compared to a single application trying to do it all
Powerful data processing from the command line:
  grep, awk, head, tail, wc, sed, perl, python, most, diff, make,
  paste, join, patch, ...
For interacting with R — start up RStudio from the Dash

RStudio—The Default Three Panels

[Screenshot: RStudio's default panel layout.]

RStudio—With R Script File—Editor Panel

[Screenshot: RStudio with an R script open in the Editor panel.]
Simple Plots
Scatterplot—R Code
Our first little bit of R code:
Load a couple of packages into the R library
library(rattle) # Provides the weather dataset
library(ggplot2) # Provides the qplot() function
Then produce a quick plot using qplot()
ds <- weather
qplot(MinTemp, MaxTemp, data=ds)
Your turn: give it a go.
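For anyone on a more recent ggplot2, the same plot can also be written in
the ggplot() style; a minimal sketch, assuming rattle and ggplot2 are
installed:

library(rattle)    # Provides the weather dataset.
library(ggplot2)

ds <- weather
# Equivalent of qplot(MinTemp, MaxTemp, data=ds), written with ggplot().
ggplot(ds, aes(x=MinTemp, y=MaxTemp)) + geom_point()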
Introduction to R: Simple Plots

Scatterplot—Plot

[Figure: the resulting scatterplot of MaxTemp against MinTemp for the
weather dataset.]

Scatterplot—RStudio

[Screenshot: the same plot displayed in the RStudio Plots pane.]

Installing Packages

Missing Packages – Tools → Install Packages...

RStudio—Installing ggplot2

[Screenshots: installing a missing package, such as ggplot2, through the
RStudio Tools menu.]

Basic R Commands

library(rattle)   # Load the weather dataset.
head(weather)     # First 6 observations of the dataset.

##         Date Location MinTemp MaxTemp Rainfall Evapora...
## 1 2007-11-01 Canberra     8.0    24.3      0.0 ...
## 2 2007-11-02 Canberra    14.0    26.9      3.6 ...
## 3 2007-11-03 Canberra    13.7    23.4      3.6 ...
....

str(weather)      # Structure of the variables in the dataset.

## 'data.frame': 366 obs. of 24 variables:
##  $ Date    : Date, format: "2007-11-01" "2007-11-...
##  $ Location: Factor w/ 46 levels "Adelaide","Alba...
##  $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...
....

Your turn: try them out.

RStudio—Keyboard Shortcuts

Editor:
  Ctrl-Enter will send the line of code to the R console
  Ctrl-2 will move the cursor to the Console
Console:
  UpArrow will cycle through previous commands
  Ctrl-UpArrow will search previous commands
  Tab will complete function names and list the arguments
  Ctrl-1 will move the cursor to the Editor

These will become very useful!
Introduction to R: Basic R Commands

summary(weather)   # Univariate summary of the variables.

##       Date               Location         MinTemp      ...
##  Min.   :2007-11-01   Canberra     :366   Min.   :-5.30   ...
##  1st Qu.:2008-01-31   Adelaide     :  0   1st Qu.: 2.30   ...
##  Median :2008-05-01   Albany       :  0   Median : 7.45   ...
##  Mean   :2008-05-01   Albury       :  0   Mean   : 7.27   ...
##  3rd Qu.:2008-07-31   AliceSprings :  0   3rd Qu.:12.50   ...
##  Max.   :2008-10-31   BadgerysCreek:  0   Max.   :20.90   ...
##                       (Other)      :  0                   ...
##     Rainfall       Evaporation       Sunshine      WindGust...
##  Min.   : 0.00   Min.   : 0.20   Min.   : 0.00   NW : ...
##  1st Qu.: 0.00   1st Qu.: 2.20   1st Qu.: 5.95   NNW: ...
##  Median : 0.00   Median : 4.20   Median : 8.60   E  : ...
##  Mean   : 1.43   Mean   : 4.52   Mean   : 7.91   WNW: ...
##  3rd Qu.: 0.20   3rd Qu.: 6.40   3rd Qu.:10.50   ENE: ...
....

Introduction to R: Visualising Data

Visual Summaries—Add A Little Colour

qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)

[Figure: scatterplot of Pressure3pm against Humidity3pm with points
coloured by RainTomorrow (No/Yes).]
Visual Summaries—Careful with Categorics

qplot(WindGustDir, Pressure3pm, data=ds)

[Figure: Pressure3pm plotted against the categoric WindGustDir; the
points overplot along each category.]

Visual Summaries—Add A Little Jitter

qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")

[Figure: the same plot with jitter, spreading the points within each
WindGustDir category.]

Visual Summaries—And Some Colour

qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")

[Figure: the jittered plot with each WindGustDir category given its own
colour.]

Introduction to R: Help

Getting Help—Precede Command with ?
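A couple of concrete ways to ask for help on the functions used above; a
minimal sketch:

?qplot                      # Help page for qplot() (with ggplot2 loaded).
help("qplot")               # The same request, written out in full.
help.search("scatterplot")  # Search installed help pages for a topic.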
Loading, Cleaning, Exploring Data in Rattle

Loading Data
Exploring Data
Transform Data
Test Data

[Screenshots of the corresponding Rattle tabs.]

Descriptive Data Mining
Cluster Analysis

What is Cluster Analysis?

Cluster: a collection of observations
  Similar to one another within the same cluster
  Dissimilar to the observations in other clusters
Cluster analysis
  Grouping a set of data observations into classes
  Clustering is unsupervised classification: no predefined
  classes—descriptive data mining.
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms

Major Clustering Approaches

Partitioning algorithms (kmeans, pam, clara, fanny): Construct
various partitions and then evaluate them by some criterion. A
fixed number of clusters, k, is generated. Start with an initial
(perhaps random) cluster.

Hierarchical algorithms (hclust, agnes, diana): Create a
hierarchical decomposition of the set of observations using some
criterion.

Density-based algorithms: based on connectivity and density
functions.

Grid-based algorithms: based on a multiple-level granularity
structure.

Model-based algorithms (mclust for a mixture of Gaussians): A
model is hypothesized for each of the clusters and the idea is to
find the best fit of that model.

KMeans Clustering
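As a minimal illustration of the partitioning approach, a kmeans() sketch
on a few numeric columns of the weather dataset; the choice of variables
and k=3 here is ours, for illustration only:

library(rattle)   # Provides the weather dataset.

# A few numeric variables, dropping rows with missing values.
num <- na.omit(weather[, c("MinTemp", "MaxTemp", "Humidity3pm", "Pressure3pm")])

set.seed(42)
km <- kmeans(scale(num), centers=3)  # Partition the scaled data into 3 clusters.

table(km$cluster)   # Observations per cluster.
km$centers          # Cluster centres on the scaled variables.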
Association Rule Mining

An unsupervised learning algorithm—descriptive data mining.
Identify items (patterns) that occur frequently together in a
given set of data.
Patterns = associations, correlations, causal structures (Rules).
Data = sets of items in ...
  transactional databases
  relational databases
  complex information repositories
Rule: Body → Head [support, confidence]

Association Rules: Examples

Friday ∩ Nappies → Beer                              [0.5%, 60%]
Age ∈ [20, 30] ∩ Income ∈ [20K, 30K] → MP3Player     [2%, 60%]
Maths ∩ CS → HDinCS                                  [1%, 75%]
Gladiator ∩ Patriot → Sixth Sense                    [0.1%, 90%]
Statins ∩ Peritonitis → Chronic Renal Failure        [0.1%, 32%]
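A minimal association rule sketch using the arules package and its bundled
Groceries transactions; the support and confidence thresholds are
illustrative only:

library(arules)    # apriori() and transaction data structures.

data(Groceries)    # Example transaction data shipped with arules.

# Mine rules with at least 1% support and 50% confidence.
rules <- apriori(Groceries, parameter=list(supp=0.01, conf=0.5))

# The three rules with highest confidence: Body => Head [support, confidence].
inspect(head(sort(rules, by="confidence"), 3))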
Predictive Data Mining: Decision Trees

Predictive Modelling: Classification

Goal of classification is to build models (sentences) in a
knowledge representation (language) from examples of past
decisions.
The model is to be used on unseen cases to make decisions.
Often referred to as supervised learning.
Common approaches: decision trees; neural networks; logistic
regression; support vector machines.

Language: Decision Trees

Knowledge representation: a flow-chart-like tree structure
  Internal nodes denote a test on a variable
  Branch represents an outcome of the test
  Leaf nodes represent class labels or class distribution

[Figure: a toy set of positive and negative cases split first on Gender
(Male/Female) and then on Age (< 42 / > 42), with leaves labelled Y or N.]

Tree Construction: Divide and Conquer

Decision tree induction is an example of a recursive partitioning
algorithm: divide and conquer.
  At start, all the training examples are at the root
  Partition examples recursively based on selected variables
Algorithm for Decision Tree Induction

A greedy algorithm: takes the best immediate (local) decision
while building the overall model.
Tree constructed top-down, recursive, divide-and-conquer:
  Begin with all training examples at the root
  Data is partitioned recursively based on selected variables
  Select variables on the basis of a measure
Stop partitioning when?
  All samples for a given node belong to the same class
  There are no remaining variables for further partitioning –
  majority voting is employed for classifying the leaf
  There are no samples left

Basic Motivation: Entropy

We are trying to predict output Y (e.g., Yes/No) from input X.
A random data set may have high entropy:
  Y is from a uniform distribution
  a frequency distribution would be flat!
  a sample will include uniformly random values of Y
A data set with low entropy:
  Y's distribution will be very skewed
  a frequency distribution will have a single peak
  a sample will predominately contain just Yes or just No
Work towards reducing the amount of entropy in the data!

Variable Selection Measure: Entropy

Information gain (ID3/C4.5)
  Select the variable with the highest information gain
  Assume there are two classes: P and N
  Let the data S contain p elements of class P and n elements of
  class N
  The amount of information needed to decide if an arbitrary
  example in S belongs to P or N is defined as

    I_E(p, n) = - p/(p+n) * log2( p/(p+n) ) - n/(p+n) * log2( n/(p+n) )
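This measure is easy to compute directly; a small sketch (the helper name
info() is ours):

# Information (entropy) for a set with p positives and n negatives.
info <- function(p, n)
{
  P <- p / (p + n)
  N <- n / (p + n)
  - P * log2(P) - N * log2(N)
}

info(50, 50)   # Maximal: 1 bit for an even split.
info(90, 10)   # Much lower for a skewed split.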
Variable Selection Measure: Gini

Gini index of impurity – a traditional statistical measure – CART.
Measures how often a randomly chosen observation would be incorrectly
classified if it were randomly classified in proportion to the actual
classes.
Calculated as the sum of the probability of each observation
being chosen times the probability of incorrect classification,
equivalently:

    I_G(p, n) = 1 - (p^2 + (1 - p)^2)

As with Entropy, the Gini measure is maximal when the classes
are equally distributed and minimal when all observations are in
one class or the other.

[Figure: the Info (entropy) and Gini measures plotted against the
proportion of positives, from 0 to 1.]
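And the corresponding Gini calculation; a small sketch (helper name is
ours):

# Gini impurity, with p the proportion of positives in the node.
gini <- function(p) 1 - (p^2 + (1 - p)^2)

gini(0.5)   # Maximal impurity (0.5) when classes are equally distributed.
gini(0.9)   # Much lower when one class dominates.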
Information Gain

Now use variable A to partition S into v cells: {S_1, S_2, ..., S_v}.
If S_i contains p_i examples of P and n_i examples of N, the
information now needed to classify objects in all subtrees S_i is:

    E(A) = sum_{i=1..v} (p_i + n_i)/(p + n) * I(p_i, n_i)

So, the information gained by branching on A is:

    Gain(A) = I(p, n) - E(A)

So choose the variable A which results in the greatest gain in
information.
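Putting the formulas together for a toy split; a small sketch using the
info() helper defined earlier (the numbers are ours, for illustration):

# Parent node: 9 positives and 5 negatives.
I_parent <- info(9, 5)

# Variable A splits the node into two cells: (6, 2) and (3, 3).
E_A  <- (8/14) * info(6, 2) + (6/14) * info(3, 3)

Gain <- I_parent - E_A    # Gain(A) = I(p, n) - E(A)
Gain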
Building Decision Trees: In Rattle

Startup Rattle

library(rattle)
rattle()

Load Example Weather Dataset

Click on the Execute button and an example dataset is offered.
Click on Yes to load the weather dataset.

Summary of the Weather Dataset

A summary of the weather dataset is displayed.

Model Tab — Decision Tree

Click on the Model tab to display the modelling options.

Build Tree to Predict RainTomorrow

Decision Tree is the default model type—simply click Execute.

Decision Tree Predicting RainTomorrow

Click the Draw button to display a tree
(Settings → Advanced Graphics).

Evaluate Decision Tree

Click Evaluate tab—options to evaluate model performance.

Evaluate Decision Tree—Error Matrix

Click Execute to display simple error matrix.
Identify the True/False Positives/Negatives.

Decision Tree Risk Chart

Click the Risk type and then Execute.

Decision Tree ROC Curve

Click the ROC type and then Execute.

Score a Dataset

Click the Score type to score a new dataset using the model.

Log of R Commands

Click the Log tab for a history of all your interactions.
Save the log contents as a script to repeat what we did.

Log of R Commands—rpart()

Here we see the call to rpart() to build the model.
Click on the Export button to save the script to file.

Help → Model → Tree

Rattle provides some basic help—click Yes for R help.

Building Decision Trees: In R

Weather Dataset - Inputs
ds <- weather
head(ds, 4)
##         Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
## 1 2007-11-01 Canberra     8.0    24.3      0.0         3.4      6.3
## 2 2007-11-02 Canberra    14.0    26.9      3.6         4.4      9.7
## 3 2007-11-03 Canberra    13.7    23.4      3.6         5.8      3.3
## 4 2007-11-04 Canberra    13.3    15.5     39.8         7.2      9.1
....

summary(ds[c(3:5,23)])

##     MinTemp          MaxTemp        Rainfall        RISK_MM
##  Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.00
##  1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 0.00
##  Median : 7.45   Median :19.6   Median : 0.00   Median : 0.00
##  Mean   : 7.27   Mean   :20.6   Mean   : 1.43   Mean   : 1.43
....

Weather Dataset - Target

target <- "RainTomorrow"
summary(ds[target])

##  RainTomorrow
##  No :300
##  Yes: 66

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

(vars <- names(ds)[-c(1, 2, 23)])

##  [1] "MinTemp"       "MaxTemp"       "Rainfall"      "Evaporation"
##  [5] "Sunshine"      "WindGustDir"   "WindGustSpeed" "WindDir9am"
##  [9] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"
## [13] "Humidity3pm"   "Pressure9am"   "Pressure3pm"   "Cloud9am"
## [17] "Cloud3pm"      "Temp9am"       "Temp3pm"       "RainToday"
## [21] "RainTomorrow"

Simple Train/Test Paradigm

set.seed(1421)
train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))   # Training dataset
head(train)

## [1] 288 298 363 107  70 232

length(train)

## [1] 256

test <- setdiff(1:nrow(ds), train)              # Testing dataset
length(test)

## [1] 110
Display the Model

model <- rpart(form, ds[train, vars])
model

## n= 256
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 256 44 No (0.82812 0.17188)
##    2) Humidity3pm< 59.5 214 21 No (0.90187 0.09813)
##      4) WindGustSpeed< 64 204 14 No (0.93137 0.06863)
##        8) Cloud3pm< 6.5 163 5 No (0.96933 0.03067) *
##        9) Cloud3pm>=6.5 41 9 No (0.78049 0.21951)
##         18) Temp3pm< 26.1 34 4 No (0.88235 0.11765) *
##         19) Temp3pm>=26.1 7 2 Yes (0.28571 0.71429) *
....

Notice the legend to help interpret the tree.

Performance on Test Dataset

The predict() function is used to score new data.

head(predict(model, ds[test,], type="class"))

##  2  4  6  8 11 12
## No No No No No No
## Levels: No Yes

table(predict(model, ds[test,], type="class"), ds[test, target])

##
##        No Yes
##   No   77  14
##   Yes  11   8
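From the error matrix we can also read off overall accuracy and error; a
small sketch using the same objects as above:

cm <- table(predict(model, ds[test,], type="class"), ds[test, target])

sum(diag(cm)) / sum(cm)        # Overall accuracy on the test dataset.
1 - sum(diag(cm)) / sum(cm)    # Overall error rate.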
Example DTree Plot using Rattle

[Figure: the decision tree drawn by Rattle, with the root split on
Humidity3pm < 60, followed by splits on WindGustSpeed < 64,
Cloud3pm < 6.5, Temp3pm < 26 and Pressure3pm >= 1015; each node shows
the predicted class, the class probabilities, and the percentage of
observations reaching it.]

An R Scripting Hint

Notice the use of variables ds, target, vars.
Change these variables, and the remaining script is unchanged.
Simplifies script writing and reuse of scripts.

ds     <- iris
target <- "Species"
vars   <- names(ds)

Then repeat the rest of the script, without change.
An R Scripting Hint — Unchanged Code

This code remains the same to build the decision tree.

form  <- formula(paste(target, "~ ."))
train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))
test  <- setdiff(1:nrow(ds), train)
model <- rpart(form, ds[train, vars])
model

## n= 105
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
## 1) root 105 69 setosa (0.34286 0.32381 0.33333)
##   2) Petal.Length< 2.6 36 0 setosa (1.00000 0.00000 0.00000) *
##   3) Petal.Length>=2.6 69 34 virginica (0.00000 0.49275 0.50725)
##     6) Petal.Length< 4.95 35 2 versicolor (0.00000 0.94286 0.05714) *
##     7) Petal.Length>=4.95 34 1 virginica (0.00000 0.02941 0.97059) *

Similarly for the predictions.

head(predict(model, ds[test,], type="class"))

##      3      8      9     10     11     12
## setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

table(predict(model, ds[test,], type="class"), ds[test, target])

##
##              setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0         15         4
##   virginica       0          1        11
Decision Trees: Summary

Decision Tree Induction.
Most widely deployed machine learning algorithm.
Simple idea, powerful learner.
Available in R through the rpart package.
Related packages include party, Cubist, C50, RWeka (J48).

Predictive Data Mining: Ensembles

Building Multiple Models

General idea developed in the Multiple Inductive Learning algorithm
(Williams 1987).
Ideas were developed (ACJ 1987, PhD 1990) in the context of:
  observe that variable selection methods don't discriminate;
  so build multiple decision trees;
  then combine into a single model.
Basic idea is that multiple models, like multiple experts, may
produce better results when working together, rather than in
isolation.
Two approaches covered: Boosting and Random Forests.
Best off the shelf model builder. (Leo Breiman)
Meta learners.

Boosting Algorithms

Basic idea: boost observations that are "hard to model."
Algorithm: iteratively build weak models using a poor learner
(see the sketch following this list):
  Build an initial model;
  Identify mis-classified cases in the training dataset;
  Boost (over-represent) training observations modelled incorrectly;
  Build a new model on the boosted training dataset;
  Repeat.
The result is an ensemble of weighted models.
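A minimal boosting sketch using the ada package (the package the deck
later notes implements boosting in R); the split and settings here are
illustrative, and the example slides that follow assume a boosted model m
and a train index along these lines:

library(rattle)   # Provides the weather dataset.
library(ada)      # ada() builds an ensemble of boosted decision trees.

set.seed(42)
train <- sample(nrow(weather), 0.7*nrow(weather))

# Drop Date, Location and the RISK_MM output variable, as in the slides.
m <- ada(RainTomorrow ~ ., data=weather[train, -c(1, 2, 23)], iter=50)
m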
Boosting Example: Error Rate

Notice error rate decreases quickly then flattens.

plot(m)

[Figure: training error over iterations 1 to 50, falling steeply from
about 0.14 before levelling off below 0.06.]

Boosting Example: Variable Importance

Helps understand the knowledge captured.

varplot(m)

[Figure: variable importance plot, scores roughly 0.05 to 0.10, with
Temp3pm, Pressure9am, MinTemp and Humidity3pm at the top and RainToday
at the bottom.]
Boosting Example: Sample Trees

There are 50 trees in all. Here's the first 3.

fancyRpartPlot(m$model$trees[[1]])
fancyRpartPlot(m$model$trees[[2]])
fancyRpartPlot(m$model$trees[[3]])

[Figure: the first three boosted trees, splitting on variables such as
Cloud3pm, Pressure3pm, MaxTemp, Sunshine and Humidity3pm.]

Boosting Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]
actual    <- weather[-train,]$RainTomorrow
risks     <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)

[Figure: risk chart for the boosted model, Performance (%) against
Caseload (%), with Recall (88%), Risk (94%), Precision and Lift curves.]
Example Applications

Boosting is implemented in R in the ada library.

ATO Application: What life events affect compliance?
  First application of the technology — 1995
  Decision Stumps: Age > NN; Change in Marital Status
Boosted Neural Networks
  OCR using neural networks as base learners
  Drucker, Schapire, Simard, 1993

Boosting: Summary

AdaBoost uses exp(-m); LogitBoost uses log(1 + exp(-m)); Doom II
uses 1 - tanh(m).
AdaBoost tends to be sensitive to noise (addressed by BrownBoost).
AdaBoost tends not to overfit, and as new models are added,
generalisation error tends to improve.
Can be proved to converge to a perfect model if the learners are
always better than chance.
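To visualise those loss functions, a small base-R sketch:

# Margin-based losses for AdaBoost, LogitBoost and Doom II.
margin <- seq(-2, 2, length.out=200)

plot(margin, exp(-margin), type="l", xlab="margin m", ylab="loss")
lines(margin, log(1 + exp(-margin)), lty=2)
lines(margin, 1 - tanh(margin), lty=3)
legend("topright", legend=c("exp(-m)", "log(1+exp(-m))", "1-tanh(m)"), lty=1:3)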
Random Forests

Original idea from Leo Breiman and Adele Cutler.
The name is licensed to Salford Systems! Hence, the R package is
randomForest.
Typically presented in the context of decision trees.
Random Multinomial Logit uses multiple multinomial logit models.

The Algorithm

Build many decision trees (e.g., 500).
For each tree:
  Select a random subset of the training set (N);
  Choose different subsets of variables for each node of the
  decision tree (m << M);
  Build the tree without pruning (i.e., overfit).
Classify a new entity using every decision tree:
  Each tree "votes" for the entity.
  The decision with the largest number of votes wins!
  The proportion of votes is the resulting score.

Example: RF on Weather Data

set.seed(42)
(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
                   na.action=na.roughfix,
                   importance=TRUE))

## Call:
##  randomForest(formula=RainTomorrow ~ ., data=weath...
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##         OOB estimate of error rate: 13.67%
## Confusion matrix:
##      No Yes class.error
## No  211   4      0.0186
## Yes  31  10      0.7561

Example: Error Rate

Error rate decreases quickly then flattens over the 500 trees.

plot(m)

[Figure: error rate against the number of trees, 1 to 500.]
Example: Variable Importance

Helps understand the knowledge captured.

varImpPlot(m, main="Variable Importance")

[Figure: two panels, MeanDecreaseAccuracy and MeanDecreaseGini;
Sunshine, Cloud3pm, Pressure3pm and WindGustSpeed rank among the most
important variables, with Rainfall and RainToday the least.]

Example: Sample Trees

There are 500 trees in all. Here's some rules from the first tree.

## Random Forest Model 1
##
## -------------------------------------------------------...
## Tree 1 Rule 1 Node 30 Decision No
##
##  1: Evaporation <= 9
##  2: Humidity3pm <= 71
##  3: Cloud3pm <= 2.5
##  4: WindDir9am IN ("NNE")
##  5: Sunshine <= 10.25
##  6: Temp3pm <= 17.55
## -------------------------------------------------------...
## Tree 1 Rule 2 Node 31 Decision Yes
##
##  1: Evaporation <= 9
##  2: Humidity3pm <= 71
....
Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]
actual    <- weather[-train,]$RainTomorrow
risks     <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)

[Figure: risk chart for the random forest, Performance (%) against
Caseload (%), with Recall (92%), Risk (97%), Precision and Lift curves.]

Features of Random Forests: By Breiman

Most accurate of current algorithms.
Runs efficiently on large data sets.
Can handle thousands of input variables.
Gives estimates of variable importance.
Ensembles: Summary

Ensemble: Multiple models working together
Often better than a single model
Variance and bias of the model are reduced
The best available models today - accurate and robust
In daily use in very many areas of application

Moving into R and Scripting our Analyses

Data Scientists are Programmers of Data

But...
  A GUI can only do so much
  R is a powerful statistical language

Data Scientists Desire...
  Scripting
  Transparency
  Repeatability
  Sharing

From GUI to CLI — Rattle's Log Tab

[Screenshot: the Rattle Log tab with the R commands generated by the GUI.]
Programming with Data
Step 1: Load the Dataset
dsname <- "weather"
ds     <- get(dsname)
dim(ds)

## [1] 366  24

names(ds)

##  [1] "Date"          "Location"      "MinTemp"       "...
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "...
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "...
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "...
....
Step 2: Observe the Data — Observations

head(ds)

##         Date Location MinTemp MaxTemp Rainfall Evapora...
## 1 2007-11-01 Canberra     8.0    24.3      0.0 ...
## 2 2007-11-02 Canberra    14.0    26.9      3.6 ...
## 3 2007-11-03 Canberra    13.7    23.4      3.6 ...
....

tail(ds)

##           Date Location MinTemp MaxTemp Rainfall Evapo...
## 361 2008-10-26 Canberra     7.9    26.1        0 ...
## 362 2008-10-27 Canberra     9.0    30.7        0 ...
## 363 2008-10-28 Canberra     7.1    28.4        0 ...
....

Step 2: Observe the Data — Structure

str(ds)

## 'data.frame': 366 obs. of 24 variables:
##  $ Date         : Date, format: "2007-11-01" "2007-11-...
##  $ Location     : Factor w/ 46 levels "Adelaide","Alba...
##  $ MinTemp      : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...
##  $ MaxTemp      : num 24.3 26.9 23.4 15.5 16.1 16.9 1...
##  $ Rainfall     : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16...
##  $ Evaporation  : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6...
##  $ Sunshine     : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4....
##  $ WindGustDir  : Ord.factor w/ 16 levels "N"<"NNE"<"N...
##  $ WindGustSpeed: num 30 39 85 54 50 44 43 41 48 31 ...
##  $ WindDir9am   : Ord.factor w/ 16 levels "N"<"NNE"<"N...
##  $ WindDir3pm   : Ord.factor w/ 16 levels "N"<"NNE"<"N...
....

Step 2: Observe the Data — Summary

summary(ds)

##       Date               Location         MinTemp     ...
##  Min.   :2007-11-01   Canberra     :366   Min.   :-5.3...
##  1st Qu.:2008-01-31   Adelaide     :  0   1st Qu.: 2.3...
##  Median :2008-05-01   Albany       :  0   Median : 7.4...
##  Mean   :2008-05-01   Albury       :  0   Mean   : 7.2...
##  3rd Qu.:2008-07-31   AliceSprings :  0   3rd Qu.:12.5...
##  Max.   :2008-10-31   BadgerysCreek:  0   Max.   :20.9...
##                       (Other)      :  0               ...
##     Rainfall       Evaporation       Sunshine      Wind...
##  Min.   : 0.00   Min.   : 0.20   Min.   : 0.00   NW  ...
##  1st Qu.: 0.00   1st Qu.: 2.20   1st Qu.: 5.95   NNW ...
##  Median : 0.00   Median : 4.20   Median : 8.60   E   ...
....

Step 2: Observe the Data — Variables

id     <- c("Date", "Location")
target <- "RainTomorrow"
risk   <- "RISK_MM"
(ignore <- union(id, risk))

## [1] "Date"     "Location" "RISK_MM"

(vars <- setdiff(names(ds), ignore))

##  [1] "MinTemp"       "MaxTemp"       "Rainfall"      "...
##  [5] "Sunshine"      "WindGustDir"   "WindGustSpeed" "...
##  [9] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "...
## [13] "Humidity3pm"   "Pressure9am"   "Pressure3pm"   "...
....

Step 3: Clean the Data — Remove Missing
dim(ds)

## [1] 366  24

sum(is.na(ds[vars]))

## [1] 47

ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]

dim(ds)

## [1] 328  24

sum(is.na(ds[vars]))

## [1] 0
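An equivalent and arguably more readable way to drop the rows with missing
values; a sketch using the same ds and vars:

# Keep only observations with no missing values across the model variables.
ds <- ds[complete.cases(ds[vars]), ]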
Step 3: Clean the Data—Target as Categoric

summary(ds[target])

##   RainTomorrow
##  Min.   :0.000
##  1st Qu.:0.000
##  Median :0.000
##  Mean   :0.183
##  3rd Qu.:0.000
##  Max.   :1.000
....

ds[target] <- as.factor(ds[[target]])
summary(ds[target])

##  RainTomorrow
##  0:268
##  1: 60

levels(ds[[target]]) <- c("No", "Yes")

[Figure: bar chart of the RainTomorrow counts.]

Step 4: Prepare for Modelling
(nobs <- nrow(ds))

## [1] 328

train <- sample(nobs, 0.70*nobs)
length(train)

## [1] 229

test <- setdiff(1:nobs, train)
length(test)

## [1] 99

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

Step 5: Build the Model—Random Forest

library(randomForest)
model <- randomForest(form, ds[train, vars], na.action=na.omit)
model

##
## Call:
##  randomForest(formula=form, data=ds[train, vars], ...
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
....
Step 6: Evaluate the Model—Risk Chart

pr <- predict(model, ds[test,], type="prob")[,2]
riskchart(pr, ds[test, target], ds[test, risk],
          title="Random Forest - Risk Chart",
          risk=risk, recall=target, thresholds=c(0.35, 0.15))

[Figure: "Random Forest - Risk Chart", Performance (%) against
Caseload (%), with Recall (RainTomorrow, 98%), Risk (RISK_MM, 97%),
Precision and Lift curves.]

Literate Data Mining in R
Motivation: Why is Reproducibility Important?

Your Research Leader or Executive drops by and asks:

"Remember that research you did last year? I've heard there is
an update on the data that you used. Can you add the new data
in and repeat the same analysis?"

"Jo Bloggs did a great analysis of the company returns data just
before she left. Can you get someone else to analyse the new
data set using the same methods, and so produce an updated
report that we can present to the Exec next week?"

"The fraud case you provided an analysis of last year has finally
reached the courts. We need to ensure we have a clear trail of the
data sources, the analyses performed, and the results obtained,
to stand up in court. Could you document these please."

Literate Data Mining Overview

Authors productive with narrative and code in one document
One document to intermix the analysis, code, and results
Sweave (Leisch 2002) and now KnitR (Yihui 2011)
Embed R code into LaTeX documents for typesetting
KnitR also supports publishing to the web

Why Reproducible Data Mining?

Eliminate errors that occur when transcribing results into
documents.
Record the context for the analysis and decisions made about the
type of analysis to perform in the one place.
Document the processes to provide integrity for the conclusions
of the analysis.
Automatically regenerate documents when code, data, or
assumptions change.
Share approach with others for peer review and for learning from
each other—engender a continuous learning environment.

Prime Objective: Trustworthy Software

Those who receive the results of modern data analysis have
limited opportunity to verify the results by direct observation.
Users of the analysis have no option but to trust the analysis,
and by extension the software that produced it. This places an
obligation on all creators of software to program in such a way
that the computations can be understood and trusted. This
obligation I label the Prime Directive.

  John Chambers (2008)
  Software for Data Analysis: Programming with R

Beautiful Output by Default

The reader wants to read the document and easily do so!
KnitR combined with LaTeX will
  Intermix analysis and results of analysis
  Automatically generate graphics and tables
  Support reproducible and transparent analysis
  Produce the best looking reports.
Analyses (Graphs and Tables) automatically included.

Beautiful Output by KnitR

Code highlighting is done automatically
Default theme is carefully designed
Many other themes are available
R Code is "properly" reformatted
Using RStudio
Using RStudio

Simplified interaction with R, LaTeX, and KnitR
Executes R code one line at a time
Formats LaTeX documents and provides spell checking
A single click compile to PDF and synchronised views

Demonstrate: Start up and explore RStudio.

Basic LaTeX Markup
Introducing LaTeX

A text markup language rather than a WYSIWYG.
Based on TeX from 1977 — very stable and powerful.
LaTeX is an easier-to-use macro package built on TeX.
Ensures consistent style (layout, fonts, tables, maths, etc.)
Automatic indexes, footnotes and references.
Documents are well structured and are clear text.
Has a learning curve.
Basic LaTeX Markup
Basic LaTeX Usage

\documentclass{article}
\begin{document}
...
\end{document}

Demonstrate: Create a new Sweave document in RStudio

Basic LaTeX Markup
Structures

\documentclass{article}
\begin{document}
\section{Introduction}
...
\subsection{Concepts}
...
\end{document}
Basic LaTeX Markup
Formats

\documentclass{article}
\begin{document}

\begin{itemize}
\item ABC
\item DEF
\end{itemize}

This is \textbf{bold} text or \textit{italic} text, ...

\end{document}

Basic LaTeX Markup
RStudio Support for LaTeX

RStudio provides excellent support for working with LaTeX documents
Helps to avoid having to know too much about LaTeX
Best illustrated through a demonstration

Format menu
Section commands
Font commands
List commands
Verbatim/Block commands
Spell Checker
Compile PDF

Demonstrate: Start a new document, add contents, format to PDF.
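Returning to the list markup shown on the Formats slide above, the same begin/end pattern covers the other common list environments. A small sketch, not taken from the slides:

\begin{enumerate}     % A numbered list
\item First item
\item Second item
\end{enumerate}

\begin{description}   % A labelled list
\item[R] A language for data mining.
\item[Rattle] A GUI built on top of R.
\end{description}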
Incorporating R Code
Incorporating R Code

We insert R code in a Chunk starting with << >>=
We terminate the Chunk with @
Save the LaTeX file with extension .Rnw

This Chunk

<<simple_example>>=
x <- sum(1:10)
x
@

Produces

x <- sum(1:10)
x

## [1] 55

Demonstrate: Do this in RStudio

Incorporating R Code
Making You Look Good

<<format_example>>=
for(i in 1:5){j<-cos(sin(i)*i^2)+3;print(j-5)}
@

for(i in 1:5)
{
    j <- cos(sin(i)*i^2)+3
    print(j-5)
}

## [1] -1.334
## [1] -2.88
## [1] -1.704
## [1] -1.103
....
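To see where such chunks live, the following is a minimal sketch of a complete .Rnw file (the file name report.Rnw and the chunk label sum_example are illustrative). Everything outside the <<>>= and @ markers is ordinary LaTeX; everything inside is R that KnitR runs when the document is compiled.

\documentclass{article}
\begin{document}

\section*{A Tiny Example}

The chunk below is evaluated and both the code and its
output are typeset into the final PDF.

<<sum_example>>=
# Sum the integers 1 to 10.
x <- sum(1:10)
x
@

\end{document}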
Incorporating R Code
R Within the Text

Include information about data within the narrative.
We can do that with \Sexpr{...}.

Our dataset has \Sexpr{nrow(ds)} observations of
\Sexpr{ncol(ds)} variables.

Becomes

Our dataset has 82169 observations of 24 variables.

Better Still: \Sexpr{format(nrow(ds), big.mark=",")}

Our dataset has 82,169 observations of 24 variables.
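For those inline values to be computed, ds needs to be defined in an earlier chunk. A minimal sketch, assuming the dataset being reported on is weatherAUS from the rattle package (the chunk label is illustrative):

<<load_data, echo=FALSE, message=FALSE>>=
library(rattle)    # Provides the weatherAUS dataset.
ds <- weatherAUS   # The data frame referred to by the \Sexpr{} calls.
@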
Formatting Tables and Plots
A Simple Table

library(xtable)
obs <- sample(1:nrow(weatherAUS), 8)
vars <- 2:6
xtable(weatherAUS[obs, vars])

        Location      MinTemp  MaxTemp  Rainfall  Evaporation
 50959  Cairns          17.50    27.30      0.00         4.80
 26581  Canberra        13.20    31.30      0.00         6.60
  3947  Cobar           18.80    34.00      0.00         7.20
 73014  SalmonGums      10.60    33.50      0.00
 27770  Canberra         3.10    14.00      0.00         1.60
 67989  PerthAirport    13.30    24.00      0.00         5.80
 80467  Darwin          25.80    31.70      0.40         9.00
 33587  Ballarat         6.00    19.30      0.00
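When this sits inside a KnitR document rather than being typed at the console, the chunk needs the results="asis" option so that the LaTeX produced by xtable is passed straight through to the typesetter. A sketch, assuming weatherAUS has been made available by loading rattle as above:

<<simple_table, results="asis">>=
library(xtable)
obs <- sample(1:nrow(weatherAUS), 8)   # Eight random observations.
vars <- 2:6                            # Location through Evaporation.
print(xtable(weatherAUS[obs, vars]))   # Emit the LaTeX for the table.
@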
Formatting Tables and Plots
Table: Exclude Row Names

print(xtable(weatherAUS[obs, vars]),
      include.rownames=FALSE)

 Location      MinTemp  MaxTemp  Rainfall  Evaporation
 Cairns          17.50    27.30      0.00         4.80
 Canberra        13.20    31.30      0.00         6.60
 Cobar           18.80    34.00      0.00         7.20
 SalmonGums      10.60    33.50      0.00
 Canberra         3.10    14.00      0.00         1.60
 PerthAirport    13.30    24.00      0.00         5.80
 Darwin          25.80    31.70      0.40         9.00
 Ballarat         6.00    19.30      0.00

Formatting Tables and Plots
Table: Limit Number of Digits

print(xtable(weatherAUS[obs, vars],
             digits=1),
      include.rownames=FALSE)

 Location      MinTemp  MaxTemp  Rainfall  Evaporation
 Cairns           17.5     27.3       0.0          4.8
 Canberra         13.2     31.3       0.0          6.6
 Cobar            18.8     34.0       0.0          7.2
 SalmonGums       10.6     33.5       0.0
 Canberra          3.1     14.0       0.0          1.6
 PerthAirport     13.3     24.0       0.0          5.8
 Darwin           25.8     31.7       0.4          9.0
 Ballarat          6.0     19.3       0.0
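A related option, not shown on the slides: digits can also be given per column rather than as a single value. A small sketch, reusing obs and vars from above; the first entry is for the row names column, followed by one entry for each of the five selected columns:

<<digits_per_column, results="asis">>=
# 0 digits for the row names and Location, 1 digit for each numeric column.
print(xtable(weatherAUS[obs, vars], digits=c(0, 0, 1, 1, 1, 1)),
      include.rownames=FALSE)
@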
Formatting Tables and Plots
Table: Tiny Font

vars <- 2:8
print(xtable(weatherAUS[obs, vars],
             digits=0),
      size="tiny",
      include.rownames=FALSE)

 Location      MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir
 Cairns             18       27         0            5        10          SSE
 Canberra           13       31         0            7        12          WSW
 Cobar              19       34         0            7        11          ENE
 SalmonGums         11       34         0                                  SE
 Canberra            3       14         0            2         2            S
 PerthAirport       13       24         0            6         8            W
 Darwin             26       32         0            9         7          WNW
 Ballarat            6       19         0                                  SE

Formatting Tables and Plots
Table: Column Alignment

vars <- 2:8
print(xtable(weatherAUS[obs, vars],
             digits=0,
             align="rlrrrrrr"),
      size="tiny")

The align string gives one letter for the row names column followed by one per data column: here the row names are right aligned (r), Location is left aligned (l), and the remaining six columns are right aligned (r).

[Output: the same eight rows in a tiny font, this time keeping the row names (50959, 26581, 3947, 73014, 27770, 67989, 80467, 33587) and using the alignment above.]
Formatting Tables and Plots
Table: Caption

print(xtable(weatherAUS[obs, vars],
             digits=1,
             caption="This is the table caption."),
      size="tiny")

[Output: the same eight rows and seven columns, with row names and one decimal digit, typeset with the caption "Table: This is the table caption."]

Formatting Tables and Plots
Plots

library(ggplot2)
cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
ds <- subset(weatherAUS, Location %in% cities & ! is.na(Temp3pm))
g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))
g <- g + geom_density(alpha = 0.55)
print(g)

[Figure: density of Temp3pm for Canberra, Darwin, Melbourne and Sydney, drawn as overlaid semi-transparent filled curves coloured by Location.]
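In a KnitR document the plot above would sit in its own chunk, with chunk options controlling how the figure is sized and scaled on the page (these options are summarised in the cheat sheet later). A sketch, with the chunk label and the particular sizes chosen only for illustration:

<<temp3pm_density, echo=FALSE, message=FALSE, fig.width=10, fig.height=5, out.width="0.8\\textwidth">>=
library(rattle)     # Provides weatherAUS.
library(ggplot2)
cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
ds <- subset(weatherAUS, Location %in% cities & ! is.na(Temp3pm))
print(ggplot(ds, aes(Temp3pm, colour=Location, fill=Location)) +
      geom_density(alpha=0.55))
@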
Knitting
Our First KnitR Document

Create a KnitR Document: New→R Sweave
Our First KnitR Document
Setup KnitR
We wish to use KnitR rather than the older Sweave processor
In RStudio we can configure the options to use knitr:
Select Tools→Options
Choose the Sweave group
Choose knitr for Weave Rnw files using:
The remaining defaults should be okay
Click Apply and then OK
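Outside RStudio the same result can be had from the R console. A minimal sketch, assuming the document has been saved as report.Rnw in the working directory:

library(knitr)
knit2pdf("report.Rnw")   # Run the chunks, then compile the LaTeX to report.pdf.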
Knitting
Our First KnitR Document
Simple KnitR Document

Insert the following into your new KnitR document:

\title{Sample KnitR Document}
\author{Graham Williams}
\maketitle

\section*{My First Section}

This is some text that is automatically typeset
by the LaTeX processor to produce well formatted
quality output as PDF.

Your turn—Click Compile PDF to view the result.
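RStudio's R Sweave template normally supplies the surrounding document for you. If you are starting from a blank file, a minimal sketch of the complete .Rnw, wrapping the snippet above, might look like this:

\documentclass{article}

\begin{document}

\title{Sample KnitR Document}
\author{Graham Williams}
\maketitle

\section*{My First Section}

This is some text that is automatically typeset
by the LaTeX processor to produce well formatted
quality output as PDF.

\end{document}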
Knitting
Our First KnitR Document
Simple KnitR Document—Resulting PDF

[Screenshot: Result of Compile PDF]
Knitting
Including R Commands in KnitR
KnitR: Add R Commands

R code can be used to generate results directly into the document:

<<echo=FALSE, message=FALSE>>=
library(rattle)   # Provides the weather dataset
library(ggplot2)  # Provides the qplot() function
ds <- weather
qplot(MinTemp, MaxTemp, data=ds)
@

Your turn—Click Compile PDF to view the result.
Knitting
Including R Commands in KnitR
KnitR Document With R Code

[Screenshot: the KnitR document source with the R chunk added]

Knitting
Including R Commands in KnitR
Simple KnitR Document—PDF with Plot

[Screenshot: Result of Compile PDF]
Knitting
Basics Cheat Sheet
LaTeX Basics

\subsection*{...}      % Introduce a Sub Section
\subsubsection*{...}   % Introduce a Sub Sub Section

\textbf{...}           % Bold font
\textit{...}           % Italic font

\begin{itemize}        % A bullet list
\item ...
\item ...
\end{itemize}

Plus an extensive collection of other markup and capabilities.

Knitting
Basics Cheat Sheet
KnitR Basics

echo=FALSE                    # Do not display the R code
eval=TRUE                     # Evaluate the R code
results="hide"                # Hide the results of the R commands
fig.width=10                  # Extend figure width from 7 to 10 inches
fig.height=8                  # Extend figure height from 7 to 8 inches
out.width="0.8\\textwidth"    # Fit figure 80% page width
out.height="0.5\\textheight"  # Fit figure 50% page height

Plus an extensive collection of other options.
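These options are combined, comma separated, in the chunk header. A small sketch (the chunk label density_plot and the particular values are illustrative):

<<density_plot, echo=FALSE, results="hide", fig.width=10, fig.height=8, out.width="0.8\\textwidth">>=
# The code runs but is not shown and its printed output is hidden;
# the figure itself still appears, scaled to 80% of the text width.
plot(density(rnorm(1000)))
@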
Moving Into R
Workshop Overview

1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
Resources
Resources and References
OnePageR: http://onepager.togaware.com – Tutorial Notes
Rattle: http://rattle.togaware.com
Guides: http://datamining.togaware.com
Practise: http://analystfirst.com
Book: Data Mining using Rattle/R
Chapter: Rattle and Other Tales
Paper: A Data Mining GUI for R — R Journal, Volume 1(2)