Introduction
Predictive Analytics Tools: Weka, R

Predictive Analytics Center of Excellence
San Diego Supercomputer Center
University of California, San Diego
7/1/14

Available Data Mining Tools
• COTS:
    • IBM Intelligent Miner
    • SAS Enterprise Miner
    • Oracle ODM
    • Microstrategy
    • Microsoft DBMiner
    • Pentaho
    • Matlab
    • Teradata
• Open Source:
    • WEKA
    • KNIME
    • Orange
    • RapidMiner
    • NLTK
    • R
    • Rattle

Agenda
• WEKA
    • Intro and background
    • Data Preparation
    • Creating Models / Applying Algorithms
    • Evaluating Results
• R
    • R Background
    • R Basics
    • Outline
    • R-Studio Overview
    • Hands On (homework)

WEKA

Download and Install WEKA
• Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html

What is WEKA?
• Waikato Environment for Knowledge Analysis
    • WEKA is a data mining/machine learning application developed by the Department of Computer Science, University of Waikato, New Zealand
    • WEKA is open source software written in Java
    • WEKA is a collection of machine learning algorithms and tools for data mining tasks
        • data pre-processing, classification, regression, clustering, association, and visualization
    • WEKA is well-suited for developing new machine learning schemes
• The weka is also a bird found only in New Zealand

Advantages of Weka
• Free availability
    • under the GNU General Public License
• Portability
    • fully implemented in the Java programming language and thus runs on almost any modern computing platform
        • Windows, Mac OS X and Linux
• Comprehensive collection of data preprocessing and modeling techniques
    • Supports standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection
• Easy to use GUI
• Provides access to SQL databases
    • using Java Database Connectivity (JDBC), and can process the result returned by a database query

Disadvantages
• Sequence modeling is not covered by the algorithms included in the Weka distribution
• Not capable of multi-relational data mining

WEKA Walk Through: Main GUI
• Three graphical user interfaces
    • "The Explorer" (exploratory data analysis)
        • pre-process data
        • build "classifiers"
        • cluster data
        • find associations
        • attribute selection
        • data visualization
    • "The Experimenter" (experimental environment)
        • used to compare performance of different learning schemes
    • "The KnowledgeFlow" (new process model inspired interface)
        • Java-Beans-based interface for setting up and running machine learning experiments
• Command line interface ("Simple CLI")
More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html

WEKA: Explorer: Preprocess
• Importing data
    • Data format
        • Uses flat text files to describe the data
    • Data can be imported from a file in various formats:
        • ARFF, CSV, C4.5, binary
    • Data can also be read from a URL or from an SQL database (using JDBC)

WEKA: ARFF file format
    @relation heart-disease-simplified

    @attribute age numeric
    @attribute sex { female, male}
    @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
    @attribute cholesterol numeric
    @attribute exercise_induced_angina { no, yes}
    @attribute class { present, not_present}

    @data
    63,male,typ_angina,233,no,not_present
    67,male,asympt,286,yes,present
    67,male,asympt,229,yes,present
    38,female,non_anginal,?,no,not_present
    ...

A more thorough description is available here: http://www.cs.waikato.ac.nz/~ml/weka/arff.html

WEKA: Explorer: Preprocess
• Preprocessing data
    • Visualization
    • Filtering algorithms
        • filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria
    • Removing noisy data
    • Adding additional attributes
    • Removing attributes

WEKA: Explorer: Preprocess
• Used to define filters to transform data
• WEKA contains filters for:
    • Discretization, normalization, resampling, attribute selection, transforming, combining attributes, etc.

Explorer: Visualize
• Visualization very useful in practice
    • helps determine the difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
• Color-coded class values
• "Jitter" option to deal with nominal attributes (and to detect "hidden" data points)
• "Zoom-in" function

Explorer: Attribute Selection
• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:
    • A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
    • An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of these two

WEKA: Explorer: building "classifiers"
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
    • Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes' nets, …
• "Meta"-classifiers include:
    • Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …

WEKA: Explorer: building "clusterers"
• WEKA contains "clusterers" for finding groups of similar instances in a dataset
• Implemented schemes are:
    • k-Means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to "true" clusters (if given)
• Evaluation based on log-likelihood if the clustering scheme produces a probability distribution

Explorer: Finding associations
• WEKA contains an implementation of the Apriori algorithm for learning association rules
    • Works only with discrete data
• Can identify statistical dependencies between groups of attributes:
    • milk, butter => bread, eggs (with confidence 0.9 and support 2000)
• Apriori can compute all rules that have a given minimum support and exceed a given confidence

References and Resources
• References:
    • WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
    • WEKA Tutorial:
        • Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.
        • A presentation which explains how to use Weka for exploratory data mining.
    • WEKA Data Mining Book:
        • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
    • WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page

R Environment: R Studio

Downloading R / R Studio
• http://www.r-project.org/
• http://www.rstudio.com/ide/download/

What is R?
• An Environment
    • R is an integrated suite of software facilities for data manipulation, calculation, and graphical display
        • Effective data handling and storage
        • Suite of operators for calculations on arrays
        • Large, coherent, integrated collection of intermediate tools for data analysis
        • Programming language, run time environment
• Based on the S language, which was developed at Bell Labs
• GNU open source software
    • Under the terms of the Free Software Foundation's GNU General Public License
• Open source implementation of the S language
    • Well-developed, simple and effective programming language
• Highly extensible

R Features
• Software package designed for data analysis and graphical representation
• Interactive, but may also be used programmatically
• Platform independence
    • Compiles and runs on a wide variety of platforms: Unix-based systems, Windows and MacOS
• Free, open source code
• Engaged community
    • over 4,200 user-contributed packages
• Extendable
    • User defined functions
    • > 4000 packages available in the CRAN package repository
    • Supports extensions / add-ons (e.g., rApache)
    • Compatible with other languages (e.g., SQL, perl, C)
• Data Import
    • Pre-processing data from different sources
• Scalability
    • Parallel R packages

R packages for DM
• Clustering
• Classification
• Association Rules
• Sequential patterns
• Time Series
• Statistics
• Graphics
• Data manipulation

Data Mining
• linear models (lm)
• generalized linear models (glm)
• generalized additive models (gam)
• linear mixed effects models (lme)
• quantile regression (rq, from the quantreg package)
• vector generalized additive models (vgam)
• lasso, ridge, and elastic net models (glmnet)
• non-linear models (nlm)
• non-linear mixed effects models (nlmer)
• linear discriminant analysis (lda)
• quadratic discriminant analysis (qda)
• trees (tree)
• random forests (randomForest)
• support vector machines (svm)
• neural networks (nnet)
• k-nearest neighbors (knn)
• kmeans

Big Data Options
• lapply-based parallelism
    • multicore library
    • snow library
• foreach-based parallelism
    • doMC backend
    • doSNOW backend
    • doMPI backend
• Map/Reduce- (Hadoop-) based parallelism
    • Hadoop streaming with R mappers/reducers
    • Rhadoop (rmr, rhdfs, rhbase)
    • RHIPE
• Poor-man's Parallelism
    • lots of Rs running
    • lots of input files
• Hands-off Parallelism
    • OpenMP support compiled into R build
    • Dangerous!

R Considerations/Limitations
• Command Line Interface
• Performance
• Memory Limits
    • memory limits depend on the build (32-bit vs. 64-bit)
    • the limit for a 32-bit build of R on Windows also depends on the underlying OS version
• Syntax "curiosities"
• Learning curve

R-Studio Overview
• http://www.rstudio.com/ide/download/
• R-Studio is an integrated development environment to support R code.
• R-Studio runs in two ways:
    • Desktop version for Linux, Mac, Windows: single user, perfect for a laptop or desktop machine
    • Server version for Linux: allows a number of remote users to run R-Studio within a web browser, and facilitates sharing of code and data among team members

General View of R-Studio
• Editor Window
• Project Window: currently loaded workspace and history
• "Pop-up": multi-tab display; shows graphics, current directory and loaded packages
• Console: run R commands

The Fundamentals
• Launch R
• Quit R
    • q()
• Getting Help
    • help(topic) or ?topic or help.start()
    • example(topic)
    • ??keyword
    • library(help = "package_name")

The Basics
• R environment commands
    • list objects
        • ls()
        • objects()
    • list files in current directory
        • list.files()
    • show current working directory
        • getwd()
    • set working directory
        • setwd()
    • remove objects
        • rm()
• Workspace versus console
    • Clear workspace
        • rm(list=ls())
    • Clear console
        • Ctrl+L

The Basics (Naming Variables)
• Requirements
    • Case sensitive; names must start with a letter or '.'
    • Only letters, numbers, underscores and '.'s
• Special keywords
    • break, else, FALSE, for, function, if, Inf, NA, NaN, next, repeat, return, TRUE, while
• Names not limited in length

The Basics
• All entities in R are called "objects"
    • arrays, vectors, matrices, functions, lists, data frames, factors
• Expressions vs. assignments
    • 10+10
    • my.age <- 23
    • my.age < - 23 (note the added space: this compares my.age with -23 instead of assigning)
    • age <- c(my.age, 14, 59, 32)
    • my.age == 40
• Data Types
    • Numeric, Integer, Complex, Logical, Character
• Function call
    • > mean(weight)

Summary of Data Structures
                     Linear     Rectangular
    Homogeneous      Vectors    Matrices
    Heterogeneous    Lists      Data Frames
• Vectors and matrices must contain a single data type
• Character type will trump numeric: values will be coerced into characters

The Basics (Functions)
• Basic functions
    • mean(age)
    • sd(age)
    • sqrt(var(age))
    • TIP: to list all functions in the search path
        – sapply(search(), ls, all.names = TRUE)
• User defined functions
    • Score <- function(age) age * 10
• Using the correct functions for the given data type
    • apply() family

Function Components
writeLines(text = "text", con = stdout(), sep = "\n", useBytes = FALSE)
• function name: writeLines("146.6", "popRate.txt", sep = "\n")
• parentheses: writeLines("146.6", "popRate.txt", sep = "\n")
• commas: writeLines("146.6", "popRate.txt", sep = "\n")
• first argument: writeLines("146.6", "popRate.txt", sep = "\n")
• second argument: writeLines("146.6", "popRate.txt", sep = "\n")
• optional argument: writeLines("146.6", "popRate.txt", "\n")

Importing Data/Exporting Data
• Flat Files
    • Import:
        > AHW <- read.csv("AHW_1.csv", header=TRUE)
        > weatherdata <- read.table(file="C:/work/DM1/weather.csv", header=TRUE, sep=",")
        > USTemps <- read.table(file=file.choose(), header=TRUE)
    • Export:
        > write.table(USTemps, file=file.choose(new=TRUE))
• Databases
    • Import
        > connection <- dbConnect(driver, user, password, host, dbname)
        > AHW <- dbGetQuery(connection, "SELECT * FROM AHW")
    • Export
        > connection <- dbConnect(driver, user, password, host, dbname)
        > dbWriteTable(connection, "AHW", AHW)
• R objects
    • Import: > load('AHW.Rdata')
    • Export: > save(AHW, file="New_AHW.Rdata")
• Web
    > connection <- url('http://pace.sdsc.edu/sites/default/bootcamp/images/AHW_1.csv')
    > AHW <- read.csv(connection, header=TRUE)
• Plots
    • png(filename="C:/R/figure.png", height=295, width=300, bg="white")
    • pdf(file="C:/R/figure.pdf", height=3.5, width=5)
    • dev.off() # turn off device driver (to flush output to png/pdf)

Loading dataset to R-Studio (Simple text file)
• Name of data frame to be created with imported data
• Options for parsing the text data into fields and values
• How data frame will look once the data are imported

Extending R
• http://cran.r-project.org/web/packages/
• Install a package
    • from command line
        > install.packages('name_of_package')
    • from GUI
        • Packages & Data > Package Installer
• Load Library (to use installed package)
    • > library(name_of_package)
    • Example: > library(markdown)
• Use Library Function
    • > function_name(parameters)
    • Example: > markdownToHTML("example.md")
http://www.r-bloggers.com/dont-r-alone-a-guide-to-tools-for-collaboration-with-r/

More Information
• The R Manuals
    • http://www.stat.berkeley.edu/~spector/R.pdf
• An Introduction to R
    • http://cran.r-project.org/doc/manuals/R-intro.html
    • http://tryr.codeschool.com/
• Books

Other Resources
• /server irc.freenode.net
• /join #R
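
Worked example: R basics recap

The R basics covered in the slides (assignment versus comparison, vectors and type coercion, basic and user-defined functions, flat-file import/export, and closing a plot device with dev.off()) can be tried in one short session. The following is a minimal sketch using base R only. The data values, the function name score_age, and the temporary file names are illustrative assumptions and do not come from the slides.

```r
# Recap sketch of the R basics; all data values and file names are made up.

# Assignment vs. comparison: the space in "< -" changes the meaning.
my.age <- 23                  # assigns 23 to my.age
is_younger <- my.age < -23    # compares my.age with -23

# Combine values into a numeric vector and summarise it.
age <- c(my.age, 14, 59, 32)
age_mean <- mean(age)         # arithmetic mean
age_sd <- sd(age)             # sample standard deviation
same_sd <- isTRUE(all.equal(age_sd, sqrt(var(age))))

# Vectors are homogeneous: adding a character coerces every element.
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)

# A user-defined function (hypothetical name, not from the slides).
score_age <- function(x, factor = 10) {
  x * factor
}
score <- score_age(age)

# Data frames are rectangular and may mix column types.
people <- data.frame(age = age, score = score)
col_means <- sapply(people, mean)   # apply() family: one mean per column

# Flat-file export and import, using a temporary directory.
csv_path <- file.path(tempdir(), "people.csv")
write.csv(people, csv_path, row.names = FALSE)
people2 <- read.csv(csv_path, header = TRUE)

# Plot to a PDF device; dev.off() flushes the output to the file.
pdf_path <- file.path(tempdir(), "age_vs_score.pdf")
pdf(file = pdf_path, height = 3.5, width = 5)
plot(people$age, people$score)
invisible(dev.off())
```

Printing is_younger, mixed_class, and col_means at the console shows the effect of each step. The files are written under tempdir() so the sketch does not depend on a particular working directory.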