Introduction to Predictive Analytics Tools: Weka, R
Predictive Analytics Center of Excellence
San Diego Supercomputer Center
University of California, San Diego
Available Data Mining Tools
COTS:
•  IBM Intelligent Miner
•  SAS Enterprise Miner
•  Oracle ODM
•  MicroStrategy
•  Microsoft DBMiner
•  Pentaho
•  MATLAB
•  Teradata
Open Source:
•  WEKA
•  KNIME
•  Orange
•  RapidMiner
•  NLTK
•  R
•  Rattle
Agenda
•  WEKA
   •  Intro and background
   •  Data Preparation
   •  Creating Models / Applying Algorithms
   •  Evaluating Results
•  R
   •  R Background
   •  R Basics
   •  Outline
   •  R-Studio Overview
   •  Hands On (homework)
WEKA
Download and Install WEKA
•  Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
7/1/14
What is WEKA?
•  Waikato Environment for Knowledge Analysis
   •  WEKA is a data mining/machine learning application developed by the Department of Computer Science, University of Waikato, New Zealand
   •  WEKA is open source software written in Java
   •  WEKA is a collection of machine learning algorithms and tools for data mining tasks
      •  data pre-processing, classification, regression, clustering, association, and visualization
   •  WEKA is well-suited for developing new machine learning schemes
•  The weka is also a bird found only in New Zealand.
Advantages of Weka
•  Free availability
   •  under the GNU General Public License
•  Portability
   •  fully implemented in the Java programming language and thus runs on almost any modern computing platform
   •  Windows, Mac OS X and Linux
•  Comprehensive collection of data preprocessing and modeling techniques
   •  Supports standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection
•  Easy-to-use GUI
•  Provides access to SQL databases
   •  uses Java Database Connectivity (JDBC) and can process the result returned by a database query
Disadvantages
•  Sequence modeling is not covered by the algorithms included in the Weka distribution
•  Not capable of multi-relational data mining
WEKA Walk Through: Main GUI
•  Three graphical user interfaces
   •  “The Explorer” (exploratory data analysis)
      •  pre-process data
      •  build “classifiers”
      •  cluster data
      •  find associations
      •  attribute selection
      •  data visualization
   •  “The Experimenter” (experimental environment)
      •  used to compare performance of different learning schemes
   •  “The KnowledgeFlow” (new process-model-inspired interface)
      •  Java-Beans-based interface for setting up and running machine learning experiments
•  Command line interface (“Simple CLI”)
More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
WEKA: Explorer: Preprocess
•  Importing data
   •  Data format
      •  Uses flat text files to describe the data
   •  Data can be imported from a file in various formats:
      •  ARFF, CSV, C4.5, binary
   •  Data can also be read from a URL or from an SQL database (using JDBC)
WEKA: ARFF file format
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
A more thorough description is available at http://www.cs.waikato.ac.nz/~ml/weka/arff.html
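The ARFF layout shown above (a header of @attribute declarations followed by comma-separated rows after @data, with "?" for missing values) can be parsed with a few lines of base R. The following is only a minimal sketch, assuming a simple dense file without quoted values; the helper name read_simple_arff is hypothetical, and in practice an existing reader such as read.arff() from the foreign package is likely the better choice.

```r
# Minimal ARFF reader (hypothetical helper; handles only simple dense files).
read_simple_arff <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines) & !startsWith(lines, "%")]   # drop blanks and comments
  data_start <- which(tolower(lines) == "@data")
  header <- lines[seq_len(data_start - 1)]
  attr_lines <- header[grepl("^@attribute", header, ignore.case = TRUE)]
  # The attribute name is the second whitespace-separated token.
  attr_names <- vapply(strsplit(attr_lines, "[[:space:]]+"), `[`, character(1), 2)
  body <- lines[(data_start + 1):length(lines)]
  # "?" marks a missing value in ARFF.
  read.csv(text = body, header = FALSE, col.names = attr_names,
           na.strings = "?", strip.white = TRUE, stringsAsFactors = TRUE)
}

# Write the example from the slide to a temporary file, then read it back.
arff_text <- c(
  "@relation heart-disease-simplified",
  "@attribute age numeric",
  "@attribute sex { female, male}",
  "@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}",
  "@attribute cholesterol numeric",
  "@attribute exercise_induced_angina { no, yes}",
  "@attribute class { present, not_present}",
  "@data",
  "63,male,typ_angina,233,no,not_present",
  "67,male,asympt,286,yes,present",
  "67,male,asympt,229,yes,present",
  "38,female,non_anginal,?,no,not_present"
)
arff_file <- tempfile(fileext = ".arff")
writeLines(arff_text, arff_file)

heart <- read_simple_arff(arff_file)
str(heart)
```

The missing cholesterol value in the last row comes back as NA, which is how R represents what ARFF writes as "?".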
WEKA: Explorer: Preprocess
•  Preprocessing data
   •  Visualization
   •  Filtering algorithms
      •  filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria
   •  Removing noisy data
   •  Adding additional attributes
   •  Removing attributes
WEKA: Explorer: Preprocess
•  Used to define filters to transform data
•  WEKA contains filters for:
   •  Discretization, normalization, resampling, attribute selection, transforming, combining attributes, etc.
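To make the filter ideas concrete, here is a small base R sketch of discretization, normalization, and instance removal. It illustrates the concepts only and is not how WEKA implements its filters; the cholesterol values are toy numbers.

```r
# Toy numeric attribute (illustrative values only).
cholesterol <- c(233, 286, 229, 250, 204)

# Discretization: turn the numeric attribute into three equal-width bins.
chol_bins <- cut(cholesterol, breaks = 3, labels = c("low", "medium", "high"))

# Normalization: rescale to the [0, 1] range (min-max scaling).
chol_norm <- (cholesterol - min(cholesterol)) /
  (max(cholesterol) - min(cholesterol))

# Removing instances by a criterion: keep only rows below a threshold.
patients <- data.frame(cholesterol, chol_bins, chol_norm)
kept <- subset(patients, cholesterol < 280)
kept
```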
Explorer: Visualize
•  Visualization is very useful in practice
   •  helps determine the difficulty of the learning problem
•  WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
•  Color-coded class values
•  “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
•  “Zoom-in” function
Explorer: Attribute Selection
•  Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
•  Attribute selection methods contain two parts:
   •  A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
   •  An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
•  Very flexible: WEKA allows (almost) arbitrary combinations of these two
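As an illustration of one evaluation method paired with the simplest search method, the sketch below scores nominal attributes by information gain and then ranks them. It is a base R sketch under assumptions, not WEKA's code: the function names entropy and info_gain and the toy data are hypothetical.

```r
# Shannon entropy (in bits) of a nominal vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of the class after splitting on one nominal attribute.
info_gain <- function(attribute, class) {
  weights <- table(attribute) / length(attribute)
  conditional <- sapply(split(class, attribute), entropy)
  entropy(class) - sum(weights[names(conditional)] * conditional)
}

# Toy data (hypothetical): 'angina' separates the classes perfectly,
# while 'sex' carries no information about the class.
toy <- data.frame(
  angina = c("yes", "yes", "no", "no", "yes", "no"),
  sex    = c("male", "female", "male", "female", "male", "male"),
  class  = c("present", "present", "absent", "absent", "present", "absent")
)

# Evaluation: score each attribute. Search: rank the attributes by score.
gains <- sapply(toy[c("angina", "sex")], info_gain, class = toy$class)
sort(gains, decreasing = TRUE)
```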
WEKA: Explorer: building “classifiers”
•  Classifiers in WEKA are models for predicting nominal or numeric quantities
•  Implemented learning schemes include:
   •  Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
•  “Meta”-classifiers include:
   •  Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …
WEKA: Explorer: building “clusterers”
•  WEKA contains “clusterers” for finding groups of similar instances in a dataset
•  Implemented schemes are:
   •  k-Means, EM, Cobweb, X-means, FarthestFirst
•  Clusters can be visualized and compared to “true” clusters (if given)
•  Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
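The idea of comparing found clusters with known classes can be tried quickly in R, which the second half of this deck covers. This sketch uses R's built-in kmeans() and the bundled iris data rather than WEKA, so treat it as an analogy to the Explorer's cluster panel.

```r
set.seed(42)                       # k-means starts from random centers
features <- iris[, 1:4]            # built-in data set; Species is held back
fit <- kmeans(features, centers = 3, nstart = 10)

# Compare the found clusters with the known ("true") classes.
table(cluster = fit$cluster, species = iris$Species)
```

Each row of the resulting table is a cluster and each column a species, so a cluster dominated by a single column matches that class well.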
Explorer: Finding associations
•  WEKA contains an implementation of the Apriori algorithm for learning association rules
   •  Works only with discrete data
•  Can identify statistical dependencies between groups of attributes:
   •  milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000)
•  Apriori can compute all rules that have a given minimum support and exceed a given confidence
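Support and confidence for a single rule such as "milk, butter ⇒ bread, eggs" can be computed by hand. The sketch below does this in base R on five made-up baskets; it shows only the two measures, not the Apriori search itself, and the helper name support_count is hypothetical.

```r
# Toy transactions (hypothetical); each row is a basket, TRUE = item present.
baskets <- data.frame(
  milk   = c(TRUE,  TRUE,  TRUE,  FALSE, TRUE),
  butter = c(TRUE,  TRUE,  FALSE, FALSE, TRUE),
  bread  = c(TRUE,  TRUE,  TRUE,  TRUE,  FALSE),
  eggs   = c(TRUE,  TRUE,  FALSE, TRUE,  FALSE)
)

# Count the baskets that contain every item in 'items'.
support_count <- function(data, items) {
  sum(rowSums(data[, items, drop = FALSE]) == length(items))
}

lhs <- c("milk", "butter")
rhs <- c("bread", "eggs")

support    <- support_count(baskets, c(lhs, rhs))     # baskets with all four items
confidence <- support / support_count(baskets, lhs)   # share of lhs baskets that also hold rhs
c(support = support, confidence = confidence)
```

Here support is an absolute count of baskets, matching the "support 2000" style used on the slide.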
References and Resources
•  References:
   •  WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
   •  WEKA Tutorial:
      •  Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.
      •  A presentation which explains how to use Weka for exploratory data mining.
   •  WEKA Data Mining Book:
      •  Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
   •  WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
R Environment: R Studio
Downloading R / R Studio
•  http://www.r-project.org/
•  http://www.rstudio.com/ide/download/
What is R?
•  An Environment
   •  R is an integrated suite of software facilities for data manipulation, calculation, and graphical display
      •  Effective data handling and storage
      •  Suite of operators for calculations on arrays
      •  Large, coherent, integrated collection of intermediate tools for data analysis
      •  Programming language, run-time environment
•  Derived from the S language, which was developed at Bell Labs
•  GNU open source software
   •  Under the terms of the Free Software Foundation's GNU General Public License
•  Open source implementation of the S language (as in S-Plus)
   •  Well-developed, simple and effective programming language
•  Highly extensible
R Features
•  Software package designed for data analysis and graphical representation
•  Interactive, but may also be used programmatically
•  Platform independence
   •  Compiles and runs on a wide variety of platforms: Unix-based, Windows and MacOS
•  Free, open source code
•  Engaged community
   •  over 4,200 user-contributed packages
•  Extendable
   •  User defined functions
   •  > 4000 packages available in the CRAN package repository
   •  Supports extensions / add-ons (e.g., rApache)
   •  Compatible with other languages (e.g., SQL, Perl, C)
•  Data Import
   •  Pre-processing data from different sources
•  Scalability
   •  Parallel R packages
R packages for DM
•  Clustering
•  Classification
•  Association Rules
•  Sequential patterns
•  Time Series
•  Statistics
•  Graphics
•  Data manipulation
Data Mining
•  linear models (lm)
•  generalized linear models (glm)
•  generalized additive models (gam)
•  linear mixed effects models (lme)
•  quantile regression (rq)
•  vector generalized additive models (vgam)
•  lasso, ridge, and elastic net models (glmnet)
•  non-linear models (nls)
•  non-linear mixed effects models (nlmer)
•  linear discriminant analysis (lda)
•  quadratic discriminant analysis (qda)
•  trees (tree)
•  random forests (randomForest)
•  support vector machines (svm)
•  neural networks (nnet)
•  k-nearest neighbors (knn)
•  kmeans
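Three entries from the list above (lm, glm, and kmeans) ship with a standard R installation, so they can be tried without installing anything. The sketch below uses the built-in mtcars data set; the choice of columns is arbitrary and only for illustration.

```r
# Linear model: fuel economy as a function of weight.
fit_lm <- lm(mpg ~ wt, data = mtcars)
coef(fit_lm)

# Generalized linear model: logistic regression for transmission type (am is 0/1).
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
probs <- predict(fit_glm, type = "response")   # fitted probabilities
head(probs)

# k-means clustering on two scaled columns.
set.seed(1)
km <- kmeans(scale(mtcars[, c("mpg", "wt")]), centers = 2)
table(km$cluster)
```

Most of the other functions in the list live in add-on packages that must be installed and loaded first.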
Big Data Options
•  lapply-based parallelism
   •  multicore library
   •  snow library
•  foreach-based parallelism
   •  doMC backend
   •  doSNOW backend
   •  doMPI backend
•  Map/Reduce- (Hadoop-) based parallelism
   •  Hadoop streaming with R mappers/reducers
   •  RHadoop (rmr, rhdfs, rhbase)
   •  RHIPE
•  Poor-man's Parallelism
   •  lots of Rs running
   •  lots of input files
•  Hands-off Parallelism
   •  OpenMP support compiled into R build
   •  Dangerous!
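The lapply-based style is the easiest one to try. The sketch below uses the parallel package, which ships with current R releases and took over much of the functionality of multicore and snow; the function slow_square and the worker count of 2 are arbitrary choices for illustration.

```r
library(parallel)   # bundled with R

slow_square <- function(x) {
  Sys.sleep(0.1)    # stand-in for real work
  x^2
}

inputs <- 1:8

# Serial baseline.
serial <- lapply(inputs, slow_square)

# snow-style parallelism: start worker processes, then use a parallel lapply.
cl <- makeCluster(2)
parallel_result <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

# Both approaches should return the same list.
identical(serial, parallel_result)
```

Starting workers has a cost of its own, so this style tends to pay off only when each task is reasonably expensive.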
R Considerations/Limitations
•  Command Line Interface
•  Performance
•  Memory Limits
   •  memory limits depend on the build (32-bit vs. 64-bit)
   •  limits for a 32-bit build of R on Windows also depend on the underlying OS version
•  Syntax “curiosities”
•  Learning curve
R-Studio Overview
•  http://www.rstudio.com/ide/download/
•  R-Studio is an integrated development environment to support R code.
•  R-Studio runs in two ways:
   •  Desktop version for Linux, Mac, Windows: single user, perfect for a laptop or desktop machine
   •  Server version for Linux: allows any number of remote users to run R-Studio within a web browser, and facilitates sharing of code and data among team members
General View of R-Studio
[Screenshot of the R-Studio window with labeled panes]
•  Editor window
•  Project window: currently loaded workspace and history
•  Multi-tab “pop-up” display: shows graphics, current directory, and loaded packages
•  Console: run R commands
The Fundamentals
•  Launch R
•  Quit R
   •  q()
•  Getting Help
   •  help(package_name) or ?(package_name) or help.start()
   •  example(package_name)
   •  ??(keyword)
   •  library(help="package_name")
The Basics
•  R environmental commands
   •  list objects
      •  ls()
      •  objects()
   •  list files in current directory
      •  list.files()
   •  show current directory
      •  getwd()
   •  set working directory
      •  setwd()
   •  remove objects
      •  rm()
•  Workspace versus console
   •  Clear workspace
      •  rm(list=ls())
   •  Clear console
      •  Ctrl+L
The Basics (Naming Variables)
•  Requirements
   •  Case sensitive; names must start with a letter or '.'
   •  Only letters, numbers, underscores and '.'s
•  Special keywords (reserved, cannot be used as names)
   •  break, else, FALSE, for, function, if, Inf, NA, NaN, next, repeat, return, TRUE, while
•  Names not limited in length
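One quick way to check the naming rules is to compare a candidate with what make.names() would turn it into: a syntactically valid name comes back unchanged. The candidate names below are made up for the test.

```r
candidates <- c("my.age", ".hidden", "age_2", "2nd_age", "my age", "if", "_tmp")

# A name is syntactically valid when make.names() leaves it unchanged.
is_valid <- candidates == make.names(candidates)
data.frame(candidates, is_valid)

# Names are case sensitive: these are two different objects.
Age <- 1
age <- 2
Age == age
```

The first three candidates pass; the others fail because they start with a digit or underscore, contain a space, or are a reserved keyword.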
The Basics
•  All entities in R are called “objects”
   •  arrays, vectors, matrices, functions, lists, data frames, factors
•  Expressions vs. assignments
   •  10+10
   •  my.age <- 23
   •  my.age < - 23 (note the added space: this compares my.age with -23 instead of assigning)
   •  age <- c(my.age, 14, 59, 32)
   •  my.age == 40
•  Data Types
   •  Numeric, Integer, Complex, Logical, Character
•  Function call
   > mean(weight)
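The difference between an expression, an assignment, and the accidental comparison caused by a stray space is easiest to see by running the lines. This is a small sketch that reuses the values from the slide.

```r
10 + 10                       # expression: evaluated and printed, not stored

my.age <- 23                  # assignment: stores 23, prints nothing
is_less <- my.age < - 23      # comparison: "is my.age less than -23?"
is_less                       # the comparison result; my.age is unchanged

age <- c(my.age, 14, 59, 32)  # c() combines values into a vector
my.age == 40                  # equality test, not assignment

class(age)                    # data type of the vector
mean(age)                     # function call on the vector
```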
Summary of Data Structures
                 Linear     Rectangular
Homogeneous      Vectors    Matrices
Heterogeneous    Lists      Data Frames
•  Vectors and matrices must contain a single data type
•  Character type will trump numeric: values will be forced into characters
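The four structures and the coercion rule can be checked directly in the console. A minimal sketch follows; the names "Ann" and "Bob" are placeholders.

```r
v <- c(1, 2, "three")          # mixed input: every element becomes character
class(v)

m <- matrix(1:6, nrow = 2)     # rectangular, one type for all cells
dim(m)

l <- list(1, "a", TRUE)        # linear, each element keeps its own type
sapply(l, class)

# Rectangular, one type per column.
df <- data.frame(age = c(23, 14), name = c("Ann", "Bob"))
sapply(df, class)
```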
The Basics (Functions)
•  Basic functions
   •  mean(age)
   •  sd(age)
   •  sqrt(var(age))
   •  TIP: to list all functions in the search path
      –  sapply(search(), ls, all.names = TRUE)
•  User Defined functions
   •  Score <- age * 10
•  Using the correct functions for the given data type
   •  apply() family
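A user-defined function wraps a calculation such as the score above so it can be reused. The sketch below is one possible form; the function name score, its factor argument, and the group_a/group_b columns are hypothetical choices for illustration.

```r
age <- c(23, 14, 59, 32)

# A user-defined function: name, arguments (one with a default), body.
score <- function(x, factor = 10) {
  x * factor
}
score(age)

# sd() and sqrt(var()) compute the same quantity.
all.equal(sd(age), sqrt(var(age)))

# apply() family: call a function on each element or each column.
sapply(age, function(a) a >= 18)
ages_by_group <- data.frame(group_a = age, group_b = age + 1)
group_means <- sapply(ages_by_group, mean)
group_means
```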
Function Components
writeLines(text="text", con = stdout(), sep = "\n", useBytes = FALSE)
•  function name: writeLines("146.6", "popRate.txt", sep = "\n")
•  parentheses: writeLines("146.6", "popRate.txt", sep = "\n")
•  commas: writeLines("146.6", "popRate.txt", sep = "\n")
•  first argument: writeLines("146.6", "popRate.txt", sep = "\n")
•  second argument: writeLines("146.6", "popRate.txt", sep = "\n")
•  optional argument: writeLines("146.6", "popRate.txt", "\n")
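Arguments can be matched by position or by name, which is why the optional sep argument can be written with or without its label. The sketch below writes to a temporary directory so nothing is overwritten; the second value "151.2" is made up.

```r
out_file <- file.path(tempdir(), "popRate.txt")   # temporary location

# Arguments matched by position: text first, then con.
writeLines("146.6", out_file)

# Arguments matched by name can appear in any order.
writeLines(sep = "\n", con = out_file, text = c("146.6", "151.2"))

written <- readLines(out_file)
written
```

The second call replaces the file written by the first, so the file ends up holding two lines.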
Importing Data/Exporting Data
•  Flat Files
   •  Import: > AHW <- read.csv("AHW_1.csv", header=TRUE)
      > weatherdata <- read.table(file="C:/work/DM1/weather.csv", header=TRUE, sep=",")
      > USTemps <- read.table(file=file.choose(), header=TRUE)
   •  Export: > write.csv(AHW, file="New_AHW.csv", row.names=FALSE)
•  Databases
   •  Import
      > connection <- dbConnect(driver, user, password, host, dbname)
      > AHW <- dbGetQuery(connection, "SELECT * FROM AHW")
   •  Export
      > connection <- dbConnect(driver, user, password, host, dbname)
      > dbWriteTable(connection, "AHW", AHW)
•  R objects
   •  Import: > load('AHW.Rdata')
   •  Export: > save(AHW, file="New_AHW.Rdata")
•  Web
   •  > connection <- url('http://pace.sdsc.edu/sites/default/bootcamp/images/AHW_1.csv')
   •  > AHW <- read.csv(connection, header=TRUE)
•  Plots
   •  > png(filename="C:/R/figure.png", height=295, width=300, bg="white")
   •  > pdf(file="C:/R/figure.pdf", height=3.5, width=5)
   •  > dev.off() # close the device (flushes output to png/pdf)
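A round trip through a flat file, an R object file, and a plot device can be tried without touching real data. This is a sketch under assumptions: the weather values are made up, and everything is written to a temporary directory instead of the paths on the slide.

```r
work_dir <- tempdir()

# Flat file: export, then import again.
csv_file <- file.path(work_dir, "weather.csv")
weather <- data.frame(day = 1:3, temp = c(18.5, 21.0, 19.2))   # made-up values
write.csv(weather, csv_file, row.names = FALSE)
weather_in <- read.csv(csv_file, header = TRUE)

# R objects: save() writes a binary file, load() restores the object by name.
rdata_file <- file.path(work_dir, "weather.Rdata")
save(weather, file = rdata_file)
rm(weather)
load(rdata_file)

# Plots: open a device, draw, then close it so the file is written.
pdf_file <- file.path(work_dir, "figure.pdf")
pdf(file = pdf_file, height = 3.5, width = 5)
plot(weather$day, weather$temp, type = "b")
dev.off()
file.exists(pdf_file)
```

Forgetting dev.off() is a common reason for an empty or unreadable output file, because the device holds the plot until it is closed.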
Loading dataset to R-Studio (Simple text file)
[Screenshot of the R-Studio import dialog with labeled areas]
•  Name of data frame to be created with imported data
•  Options for parsing the text data into fields and values
•  How data frame will look once the data are imported
Extending R
•  http://cran.r-project.org/web/packages/
•  Install a package
   •  from command line
      > install.packages('name_of_package')
   •  from GUI
      •  Packages & Data > Package Installer
•  Load Library (to use installed package)
   •  > library(name_of_package)
   •  Example
      > library(markdown)
•  Use Library Function
   •  > function_name(parameters)
   •  Example
      > markdownToHTML("example.md")
http://www.r-bloggers.com/dont-r-alone-a-guide-to-tools-for-collaboration-with-r/
More Information
•  The R Manuals
   •  http://www.stat.berkeley.edu/~spector/R.pdf
•  An Introduction to R
   •  http://cran.r-project.org/doc/manuals/R-intro.html
   •  http://tryr.codeschool.com/
•  Books
Other Resources
•  IRC: /server irc.freenode.net, then /join #R
The End