An Introduction to R
Free software for repeatable statistics, visualisation and modeling
Dr Andy Pryke, The Data Mine Ltd
[email protected]

Outline
1. Overview
   What is R?
   When to use R?
   Wot no GUI?
   Help and Support
2. Examples
   Simple Commands
   Statistics
   Graphics
   Modeling and Mining
   SQL Database Interface
3. Going Forward
   Relevant Libraries
   Online Courses etc.

What is R?
• Open source, well supported, command line driven, statistics package
• 100s of extra “packages” available free
• Large number of users - particularly in bio-informatics and social science
• Good Design - John Chambers received the ACM 1998 Software System Award for “S” (the language R is based on)
  Dr. Chambers' work “will forever alter the way people analyze, visualize, and manipulate data…”

When Should I Use R?
• To do a full cycle of:
  – data import
  – data pre-processing
  – exploratory statistics and graphics
  – modeling and data mining
  – report production
  – integration into other systems
• Or any one of these steps - e.g. just to standardise pre-processing of data

Wot no GUI? or “The Advantages of Scripting”
• Repeatable
• Debug-able
• Documentable
• Build on previous work
• Automation
  – Report generation
  – Website or system integration
  – Links from Perl, Python, Java, C, TCP/IP…

Help and Support
Built in help/example system (e.g. type “?plot”)
Many tutorials available free
R-Help mailing list
  - Archived online
  - Key R developers respond
  - Contributors understand statistical concepts
Large User Community

Simple Commands
  1+1             2
  10*3            30
  c(1,2,3)        1 2 3
  c(1,2,3)*10     10 20 30
  x <- 5
  x*x             25
  exp(1)          2.718282
  q()
  Save workspace image?
  [y/n/c]: n

Simple Statistics
  colnames(iris)
  "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
  plot(iris$Sepal.Length, iris$Petal.Length)
  # Pearson Correlation
  cor(iris$Sepal.Length, iris$Petal.Length)
  0.8717538
  # Spearman Correlation
  cor(rank(iris$Sepal.Length), rank(iris$Petal.Length))
  0.8818981

Graphics
[Figures: scatterplot matrix "Edgar Anderson's Iris Data" (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width); plot of Hair, Eye and Sex categories shaded by Pearson residuals (p-value < 2.22e-16); pie chart "January Pie Sales" (Cherry, Blueberry, Apple, Vanilla Cream, Other, Boston Cream)]

Linear Models
  ## Scatterplot of Sepal and Petal Length
  plot(iris$Sepal.Length, iris$Petal.Length)
  ## Make a Model of Petals in terms of Sepals
  irisModel <- lm(iris$Petal.Length ~ iris$Sepal.Length)
  ## plot the model as a line
  abline(irisModel)
[Figure: scatterplot of iris$Petal.Length against sepal length with the fitted model line]

Classification Trees
  library("party")  # provides ctree()
  # Model Species
  irisct <- ctree(Species ~ .
  , data = iris)
  # Show the model tree
  plot(irisct)
  # Compare predictions
  table(predict(irisct), iris$Species)
[Figure: conditional inference tree splitting on Petal.Length (1.9), Petal.Width (1.7) and Petal.Length (4.8), each with p < 0.001; terminal nodes 2 (n = 50), 5 (n = 46), 6 (n = 8) and 7 (n = 46), each showing the proportion of every species]

SQL Interface
Connect to databases with ODBC
  library("RODBC")
  # Open a connection to an ODBC data source
  channel <- odbcConnect("PostgreSQL30w", case="postgresql")
  # Write the iris data frame to a database table
  sqlSave(channel, iris, tablename="iris")
  # Read it back with an SQL query
  myIris <- sqlQuery(channel, "select * from iris")
  # Close the connection when finished
  odbcClose(channel)

Data Mining Libraries (i)
randomForest
  – Random forests - Robust prediction
party
  – Conditional inference trees - Statistically principled
  – Model-based partitioning - Advanced regression
  – cForests - Random Forests with ctrees
e1071
  – Naïve Bayes, Support Vector Machines, Fuzzy Clustering and more...

Data Mining Libraries (ii)
nnet
  – Feed-forward Neural Networks
  – Multinomial Log-Linear Models
BayesTree
  – Bayesian Additive Regression Trees
gafit & rgenoud
  – Genetic Algorithm based optimisation
varSelRF
  – Variable selection using random forests

Data Mining Libraries (iii)
arules
  – Association Rules (links to ‘C’ code)
RWeka library
  – Access to the many data mining algorithms found in open source package “Weka”
dprep
  – Data pre-processing
  – You can easily write your own functions too.
Bioconductor
  – Multiple packages for analysis of genomic (and biological) data

Sources of Further Information
Download these slides + the examples & find links to online courses in R here:
http://www.andypryke.com/pub/R

Editors which Link to R
• Rgui (not really a GUI)
• Emacs (with “ESS” mode)
• RCmdr
• Tinn-R
• jgr - Ja
• SciViews
• and more...
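
Spearman Correlation: a shortcut
The Simple Statistics slide computes the Spearman correlation by ranking both columns by hand. A minimal sketch of the same calculation using the method argument of cor(); the variable names are illustrative and not from the original slides.

```r
# Built-in iris data set (ships with base R)
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Pearson correlation on the raw values
pearson <- cor(x, y)

# Spearman correlation two ways: ranking by hand, as on the slide,
# and via the method argument of cor()
spearman_by_rank   <- cor(rank(x), rank(y))
spearman_by_method <- cor(x, y, method = "spearman")

# Check that the two Spearman results agree
print(all.equal(spearman_by_rank, spearman_by_method))
```

The method argument keeps the script shorter and makes the intent explicit.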
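
Linear Models: using the model
The Linear Models slide fits petal length against sepal length. A hedged sketch of one way to reuse such a model for prediction: it assumes the formula is written with column names and a data argument (a variation on the slide's call, not the author's code), and the newFlowers values are made up for illustration.

```r
# Fit the same relationship, naming the columns and passing the data frame separately
irisModel <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Intercept and slope of the fitted line
print(coef(irisModel))

# Predict petal length for two hypothetical sepal lengths (illustrative values)
newFlowers <- data.frame(Sepal.Length = c(5.0, 6.5))
predicted <- predict(irisModel, newdata = newFlowers)
print(predicted)
```

Writing the formula with bare column names is what lets predict() match the Sepal.Length column in newFlowers.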