Introduction to Predictive Analytics Tools: Weka, R
Predictive Analytics Center of Excellence
San Diego Supercomputer Center
University of California, San Diego
Available Data Mining Tools
COTS (commercial off-the-shelf):
• IBM Intelligent Miner
• SAS Enterprise Miner
• Oracle ODM
• MicroStrategy
• Microsoft DBMiner
• Pentaho
• Matlab
• Teradata
Open Source:
• WEKA
• KNIME
• Orange
• RapidMiner
• NLTK
• R
• Rattle
Agenda
• WEKA
  • Intro and background
  • Data Preparation
  • Creating Models / Applying Algorithms
  • Evaluating Results
• R
  • R Background
  • R Basics
  • Outline
  • R-Studio Overview
  • Hands On (homework)
WEKA
Download and Install WEKA
• Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
7/1/14
What is WEKA?
• Waikato Environment for Knowledge Analysis
• WEKA is a data mining/machine learning application developed by the Department of Computer Science, University of Waikato, New Zealand
• WEKA is open source software written in Java
• WEKA is a collection of machine learning algorithms and tools for data mining tasks
  • data pre-processing, classification, regression, clustering, association, and visualization
• WEKA is well-suited for developing new machine learning schemes
• The weka is also a bird found only in New Zealand
Advantages of Weka
• Free availability
  • under the GNU General Public License
• Portability
  • fully implemented in the Java programming language and thus runs on almost any modern computing platform
  • Windows, Mac OS X and Linux
• Comprehensive collection of data preprocessing and modeling techniques
  • Supports standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection
• Easy-to-use GUI
• Provides access to SQL databases
  • uses Java Database Connectivity (JDBC) and can process the result returned by a database query
Disadvantages
• Sequence modeling is not covered by the algorithms included in the Weka distribution
• Not capable of multi-relational data mining
WEKA Walk Through: Main GUI
• Three graphical user interfaces
  • “The Explorer” (exploratory data analysis)
    • pre-process data
    • build “classifiers”
    • cluster data
    • find associations
    • attribute selection
    • data visualization
  • “The Experimenter” (experimental environment)
    • used to compare performance of different learning schemes
  • “The KnowledgeFlow” (new process-model-inspired interface)
    • Java-Beans-based interface for setting up and running machine learning experiments
• Command line interface (“Simple CLI”)
More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
WEKA: Explorer: Preprocess
• Importing data
  • Data format
    • Uses flat text files to describe the data
  • Data can be imported from a file in various formats:
    • ARFF, CSV, C4.5, binary
  • Data can also be read from a URL or from an SQL database (using JDBC)
WEKA: ARFF file format
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
A more thorough description is available here:
http://www.cs.waikato.ac.nz/~ml/weka/arff.html
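To make the format concrete, here is a minimal R sketch that writes the four sample instances to an ARFF file and reads them back. It assumes the foreign package (normally installed with R) for read.arff() and write.arff(); the data frame name heart and the temporary file path are placeholders, not part of the slides.

```r
library(foreign)  # provides read.arff() and write.arff()

# The four example instances from the slide; NA stands for the "?" entry.
heart <- data.frame(
  age = c(63, 67, 67, 38),
  sex = factor(c("male", "male", "male", "female"),
               levels = c("female", "male")),
  chest_pain_type = factor(
    c("typ_angina", "asympt", "asympt", "non_anginal"),
    levels = c("typ_angina", "asympt", "non_anginal", "atyp_angina")),
  cholesterol = c(233, 286, 229, NA),
  exercise_induced_angina = factor(c("no", "yes", "yes", "no"),
                                   levels = c("no", "yes")),
  class = factor(c("not_present", "present", "present", "not_present"),
                 levels = c("present", "not_present"))
)

arff_path <- tempfile(fileext = ".arff")   # hypothetical location
write.arff(heart, arff_path, relation = "heart-disease-simplified")

# Reading the file back turns "?" into NA again.
heart_back <- read.arff(arff_path)
str(heart_back)
```

Opening the temporary file in a text editor should show the same @relation, @attribute and @data sections as in the example above.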
WEKA: Explorer: Preprocess
• Preprocessing data
  • Visualization
  • Filtering algorithms
    • filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria
  • Removing noisy data
  • Adding additional attributes
  • Removing attributes
WEKA: Explorer: Preprocess
• Used to define filters to transform data
• WEKA contains filters for:
  • Discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.
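For readers who want to see what three of these filter types do to the data, the sketch below reproduces them in base R. It is only an analogue of the idea, not WEKA's own filter implementations, and the cholesterol values are made up for illustration.

```r
# Hypothetical numeric attribute (cholesterol readings).
cholesterol <- c(233, 286, 229, 250, 204, 236)

# Discretization: turn the numeric attribute into three equal-width bins.
chol_bins <- cut(cholesterol, breaks = 3, labels = c("low", "medium", "high"))

# Normalization: rescale the attribute to the [0, 1] range.
chol_norm <- (cholesterol - min(cholesterol)) /
  (max(cholesterol) - min(cholesterol))

# Resampling: draw a random subsample of the instances without replacement.
set.seed(1)
chol_sample <- sample(cholesterol, size = 3)

print(data.frame(cholesterol, chol_bins, chol_norm))
```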
Explorer: Visualize
• Visualization is very useful in practice
  • helps determine the difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
• Color-coded class values
• “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
• “Zoom-in” function
Explorer: Attribute Selection
• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:
  • A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  • An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of these two
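As one illustration of pairing an evaluation method with a search method, the following base R sketch scores attributes by information gain and then ranks them. It is a hand-written toy version, not WEKA's evaluator; the function names and the four-row data set are hypothetical.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a nominal attribute with respect to the class labels.
info_gain <- function(attribute, labels) {
  weights <- table(attribute) / length(attribute)
  conditional <- sapply(names(weights),
                        function(v) entropy(labels[attribute == v]))
  entropy(labels) - sum(weights * conditional)
}

# Toy data: 'angina' separates the classes perfectly, 'sex' not at all.
disease <- c("present", "present", "not_present", "not_present")
angina  <- c("yes", "yes", "no", "no")
sex     <- c("male", "female", "male", "female")

gains <- c(angina = info_gain(angina, disease),
           sex    = info_gain(sex, disease))

# "Ranking" search: order attributes by their score, best first.
print(sort(gains, decreasing = TRUE))
```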
WEKA: Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
  • Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
  • Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …
WEKA: Explorer: building “clusterers”
• WEKA contains “clusterers” for finding groups of similar instances in a dataset
• Implemented schemes are:
  • k-Means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to “true” clusters (if given)
• Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
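The idea of comparing found clusters to “true” clusters can be tried outside WEKA as well. The sketch below uses R's built-in kmeans() on the iris data set that ships with R; the choice of data set, seed and nstart value are assumptions made for this example.

```r
set.seed(42)                 # k-means starts from random centers
features <- iris[, 1:4]      # numeric attributes only
fit <- kmeans(features, centers = 3, nstart = 10)

# Cross-tabulate the found clusters against the known species labels.
print(table(cluster = fit$cluster, species = iris$Species))
```

Each row of the resulting table is one cluster; a cluster that matches a species well has most of its count in a single column.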
Explorer: Finding associations
• WEKA contains an implementation of the Apriori algorithm for learning association rules
  • Works only with discrete data
• Can identify statistical dependencies between groups of attributes:
  • milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000)
• Apriori can compute all rules that have a given minimum support and exceed a given confidence
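To show what the two numbers attached to a rule mean, this small base R sketch computes support (as a transaction count) and confidence for the rule milk, butter ⇒ bread, eggs. The five transactions are invented for the example, and this is a direct calculation for one rule, not the Apriori search itself.

```r
# Hypothetical market-basket data: one row per transaction, one column per item.
baskets <- data.frame(
  milk   = c(TRUE,  TRUE,  TRUE,  FALSE, TRUE),
  butter = c(TRUE,  TRUE,  FALSE, FALSE, TRUE),
  bread  = c(TRUE,  TRUE,  TRUE,  TRUE,  FALSE),
  eggs   = c(TRUE,  TRUE,  FALSE, TRUE,  FALSE)
)

lhs <- baskets$milk & baskets$butter   # antecedent: milk and butter
rhs <- baskets$bread & baskets$eggs    # consequent: bread and eggs

support    <- sum(lhs & rhs)           # transactions containing all four items
confidence <- support / sum(lhs)       # share of antecedent transactions that also match

cat("support:", support, "confidence:", round(confidence, 2), "\n")
```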
References and Resources
• References:
  • WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
  • WEKA Tutorial:
    • Machine Learning with WEKA: a presentation demonstrating all graphical user interfaces (GUI) in Weka
    • A presentation which explains how to use Weka for exploratory data mining
  • WEKA Data Mining Book:
    • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
  • WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
R Environment: R Studio
Downloading R / R Studio
• http://www.r-project.org/
• http://www.rstudio.com/ide/download/
What is R?
• An Environment
  • R is an integrated suite of software facilities for data manipulation, calculation and graphical display
    • Effective data handling and storage
    • Suite of operators for calculations on arrays
    • Large, coherent, integrated collection of intermediate tools for data analysis
    • Programming language, run time environment
• Based on the S language developed at Bell Labs; R itself was first written at the University of Auckland
• GNU open source software
  • Under the terms of the Free Software Foundation's GNU General Public License
• Open source implementation of the S language (S-Plus is the commercial implementation)
  • Well-developed, simple and effective programming language
• Highly extensible
R Features
• Software package designed for data analysis and graphical representation
• Interactive, but may also be used programmatically
• Platform independence
  • Compiles and runs on a wide variety of platforms: Unix-based, Windows and MacOS
• Free, open source code
• Engaged community
  • over 4,200 user-contributed packages
• Extendable
  • User-defined functions
  • > 4000 packages available in the CRAN package repository
  • Supports extensions / add-ons (e.g., rApache)
  • Compatible with other languages (e.g., SQL, perl, C)
  • Data import
    • Pre-processing data from different sources
• Scalability
  • Parallel R packages
R packages for DM
• Clustering
• Classification
• Association Rules
• Sequential patterns
• Time Series
• Statistics
• Graphics
• Data manipulation
Data Mining
• linear models (lm)
• generalized linear models (glm)
• generalized additive models (gam)
• linear mixed effects models (lme)
• quantile regression (rq)
• vector generalized additive models (vgam)
• lasso, ridge, and elastic net models (glmnet)
• non-linear models (nls)
• non-linear mixed effects models (nlmer)
• linear discriminant analysis (lda)
• quadratic discriminant analysis (qda)
• trees (tree)
• random forests (randomForest)
• support vector machines (svm)
• neural networks (nnet)
• k-nearest neighbors (knn)
• k-means clustering (kmeans)
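The first two entries in this list, lm and glm, are part of base R and follow the same formula-based calling pattern as most of the others. The sketch below fits both on the mtcars data set that ships with R; the choice of data set and predictors is an assumption made purely for illustration.

```r
# Linear model: fuel efficiency as a function of weight and horsepower.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit_lm))

# Generalized linear model: logistic regression for transmission type
# (am is coded 0 = automatic, 1 = manual).
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(fit_glm))

# Predicted probability of a manual transmission for a hypothetical car
# weighing 3000 lb (wt is recorded in units of 1000 lb).
p_manual <- predict(fit_glm, newdata = data.frame(wt = 3.0), type = "response")
print(p_manual)
```

Functions from add-on packages, such as randomForest or glmnet, need the package installed and loaded first.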
Big Data Options
• lapply-based parallelism
  • multicore library
  • snow library
• foreach-based parallelism
  • doMC backend
  • doSNOW backend
  • doMPI backend
• Map/Reduce- (Hadoop-) based parallelism
  • Hadoop streaming with R mappers/reducers
  • RHadoop (rmr, rhdfs, rhbase)
  • RHIPE
• Poor-man's parallelism
  • lots of Rs running
  • lots of input files
• Hands-off parallelism
  • OpenMP support compiled into R build
  • Dangerous!
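A minimal sketch of lapply-based parallelism follows. It uses the parallel package that ships with current R releases and builds on the multicore and snow work named above, rather than those two libraries directly; the worker count of two and the function slow_square are assumptions for the example.

```r
library(parallel)   # bundled with R

slow_square <- function(x) {
  Sys.sleep(0.1)    # stand-in for real work
  x^2
}

cl <- makeCluster(2)                        # two local worker processes
results <- parLapply(cl, 1:8, slow_square)  # same shape as lapply(1:8, slow_square)
stopCluster(cl)                             # always release the workers

print(unlist(results))
```

The call has the same structure as a plain lapply(), which is why this style is usually the easiest first step when a loop over independent inputs becomes too slow.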
R Considerations/Limitations
• Command line interface
• Performance
• Memory limits
  • memory limits depend on the build (32-bit vs. 64-bit)
  • the limit for a 32-bit build of R on Windows depends on the underlying OS version
• Syntax “curiosities”
• Learning curve
R-Studio Overview
• http://www.rstudio.com/ide/download/
• R-Studio is an integrated development environment for writing and running R code
• R-Studio runs in two ways:
  • Desktop version for Linux, Mac, Windows: single user, perfect for a laptop or desktop machine
  • Server version for Linux: allows a number of remote users to run R-Studio within a web browser, and facilitates sharing of code and data among team members
• General View of R-Studio
  • Editor window
  • Project window: currently loaded workspace and history
  • Multi-tab “pop-up” display: shows graphics, current directory and loaded packages
  • Console: run R commands
The Fundamentals
• Launch R
• Quit R
  • q()
• Getting Help
  • help(topic) or ?topic or help.start()
  • example(topic)
  • ??keyword
  • library(help = "package_name")
The Basics
• R environmental commands
  • list objects
    • ls()
    • objects()
  • list files in current directory
    • list.files()
  • show current directory
    • getwd()
  • set working directory
    • setwd()
  • remove objects
    • rm()
• Workspace versus console
  • Clear workspace
    • rm(list=ls())
  • Clear console
    • Ctrl+L
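The short session below exercises several of these commands in order. The object name demo_obj is a placeholder chosen for the example; the directory and file listing printed will depend on where R was started.

```r
demo_obj <- c(1, 2, 3)               # create an object in the workspace
before <- "demo_obj" %in% ls()       # ls() returns the names of current objects
print(before)

print(getwd())                       # current working directory
print(head(list.files()))            # first few files in that directory

rm(demo_obj)                         # remove a single object
after <- "demo_obj" %in% ls()
print(after)
```

Note that rm(list=ls()) removes every object at once, so it is best kept for the start of a fresh analysis.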
The Basics (Naming Variables)
• Requirements
  • Case sensitive; names must start with a letter or '.'
  • Only letters, numbers, underscores and '.'s
• Special keywords
  • break, else, FALSE, for, function, if, Inf, NA, NaN, next, repeat, return, TRUE, while
• Names not limited in length
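One quick way to test a candidate name against these rules is to compare it with the output of base R's make.names(), which returns its input unchanged only when the input is already a syntactically valid name. The helper is_valid_name and the list of candidates are made up for this sketch.

```r
# A name is valid if make.names() has nothing to repair.
is_valid_name <- function(x) x == make.names(x)

candidates <- c("my.age", "my_age2", ".hidden", "2fast", "else", "my-age")
print(data.frame(name = candidates, valid = is_valid_name(candidates)))
```

The last three candidates fail for different reasons: a leading digit, a reserved keyword, and a character outside the allowed set.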
The Basics
• All entities in R are called “objects”
  • arrays, vectors, matrices, functions, lists, data frames, factors
• Expressions vs. assignments
  • 10+10
  • my.age <- 23
  • my.age < - 23 (note the added space: this compares my.age with -23 instead of assigning)
  • age <- c(my.age, 14, 59, 32)
  • my.age == 40
• Data Types
  • Numeric, Integer, Complex, Logical, Character
• Function call
  > mean(weight)
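Running the lines above in order shows the difference between an assignment and a comparison. The values are the ones from the slide; since weight is not defined there, the final call uses the age vector instead.

```r
my.age <- 23                   # assignment: my.age now holds 23
print(my.age < - 23)           # comparison: is 23 less than -23?
print(my.age)                  # unchanged by the comparison

age <- c(my.age, 14, 59, 32)   # c() combines values into a vector
print(age == 40)               # element-wise test, one result per element
print(mean(age))               # function call on the vector
```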
Summary of Data Structures
                Linear     Rectangular
Homogeneous     Vectors    Matrices
Heterogeneous   Lists      Data Frames
• Vectors and matrices must contain a single data type
• Character type will trump numeric: values will be forced into characters
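The coercion rule is easy to check directly. In this sketch the same three values are stored in a vector, a list and a data frame; the variable names are arbitrary.

```r
v <- c(1, 2, "three")          # mixing numbers and text in a vector
print(class(v))                # the numbers have become character strings

l <- list(1, 2, "three")       # a list keeps each element's own type
print(sapply(l, class))

df <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
print(sapply(df, class))       # columns of a data frame may differ in type
```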
The Basics (Functions)
• Basic functions
  • mean(age)
  • sd(age)
  • sqrt(var(age))
  • TIP: to list all functions in the search path
    – sapply(search(), ls, all.names = TRUE)
• User-defined functions
  • Score <- age * 10
• Using the correct functions for the given data type
  • apply() family
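As a sketch of how these pieces fit together, the block below wraps the scoring expression in a user-defined function and then uses two members of the apply() family. The function name score, its multiplier argument and the sample values are assumptions for illustration.

```r
age <- c(23, 14, 59, 32)

# A user-defined function: multiply each value by a multiplier (default 10).
score <- function(x, multiplier = 10) {
  x * multiplier
}
print(score(age))

# sd() is the square root of var(), which is why the slide lists both.
print(all.equal(sd(age), sqrt(var(age))))

# apply(): run a function over the columns (MARGIN = 2) of a matrix.
m <- matrix(1:6, nrow = 2)
col_sums <- apply(m, 2, sum)
print(col_sums)

# sapply(): run a function over each element of a list.
means <- sapply(list(a = 1:3, b = 4:6), mean)
print(means)
```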
Function Components
Definition: writeLines(text, con = stdout(), sep = "\n", useBytes = FALSE)
Example call: writeLines("146.6", "popRate.txt", sep = "\n")
• function name: writeLines
• parentheses: enclose the argument list
• commas: separate the arguments
• first argument: "146.6" (text)
• second argument: "popRate.txt" (con)
• optional argument: sep = "\n", which may also be passed by position: writeLines("146.6", "popRate.txt", "\n")
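The example call can be run as-is; the sketch below only swaps the file name for a temporary file so nothing is left in the working directory. The variable out_file is a placeholder.

```r
out_file <- tempfile(fileext = ".txt")   # stand-in for "popRate.txt"

# Arguments matched by position (text, con) and by name (sep).
writeLines("146.6", out_file, sep = "\n")
written <- readLines(out_file)
print(written)

# Omitting con falls back to its default, stdout(), so the text is printed.
writeLines("146.6")
```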
Importing Data/Exporting Data
• Flat Files
  • Import:
    > AHW <- read.csv("AHW_1.csv", header=TRUE)
    > weatherdata <- read.table(file="C:/work/DM1/weather.csv", header=TRUE, sep=",")
    > USTemps <- read.table(file=file.choose(), header=TRUE)
  • Export: write.csv() or write.table()
• Databases
  • Import
    > connection <- dbConnect(driver, user, password, host, dbname)
    > AHW <- dbGetQuery(connection, "SELECT * FROM AHW")
  • Export
    > connection <- dbConnect(driver, user, password, host, dbname)
    > dbWriteTable(connection, "AHW", AHW)
• R objects
  • Import: > load("AHW.Rdata")
  • Export: > save(AHW, file="New_AHW.Rdata")
• Web
  > connection <- url("http://pace.sdsc.edu/sites/default/bootcamp/images/AHW_1.csv")
  > AHW <- read.csv(connection, header=TRUE)
• Plots
  > png(filename="C:/R/figure.png", height=295, width=300, bg="white")
  > pdf(file="C:/R/figure.pdf", height=3.5, width=5)
  > dev.off()  # turn off device driver (to flush output to png/pdf)
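The flat-file, R-object and plot steps can be tried without any external data. This sketch uses a small made-up data frame in place of the AHW data and temporary files in place of the paths on the slide; the column names are assumptions.

```r
# Small hypothetical data frame standing in for the AHW data.
AHW <- data.frame(age    = c(23, 14, 59),
                  height = c(170.5, 150.2, 165.0),
                  weight = c(70.1, 45.3, 62.8))

# Flat file: export to CSV, then import again.
csv_path <- tempfile(fileext = ".csv")
write.csv(AHW, csv_path, row.names = FALSE)
AHW_csv <- read.csv(csv_path, header = TRUE)

# R objects: save() stores the object under its name, load() restores it.
rdata_path <- tempfile(fileext = ".Rdata")
save(AHW, file = rdata_path)
rm(AHW)
load(rdata_path)

# Plots: open a file device, draw, then close it so the file is written.
pdf_path <- tempfile(fileext = ".pdf")
pdf(file = pdf_path, height = 3.5, width = 5)
plot(AHW$height, AHW$weight)
dev.off()

print(AHW_csv)
```

Forgetting dev.off() is a common reason for an empty or unreadable plot file, since the output is only flushed when the device is closed.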
• Loading dataset to R-Studio (Simple text file)
  • Name of data frame to be created with imported data
  • Options for parsing the text data into fields and values
  • How the data frame will look once the data are imported
Extending R
• http://cran.r-project.org/web/packages/
• Install a package
  • from command line
    > install.packages("name_of_package")
  • from GUI
    • Packages & Data > Package Installer
• Load Library (to use installed package)
  > library(name_of_package)
  • Example
    > library(markdown)
• Use Library Function
  > function_name(parameters)
  • Example
    > markdownToHTML("example.md")
http://www.r-bloggers.com/dont-r-alone-a-guide-to-tools-for-collaboration-with-r/
More Information
• The R Manuals
  • http://www.stat.berkeley.edu/~spector/R.pdf
• An Introduction to R
  • http://cran.r-project.org/doc/manuals/R-intro.html
  • http://tryr.codeschool.com/
• Books
Other Resources
• IRC: /server irc.freenode.net, then /join #R
The end