Parallel and Cluster Computing with R
Elizabeth Byerly
2016-09-27

Press "s" for presenter's notes.

Introduction

Making coffee
[diagram: the work of making delicious coffee from ground beans and boiling water]

Single thread
[diagram]

Parallel
[diagram]

Cluster
[diagram]

R is a single-threaded program. We will learn how to write parallel R programs.

Motivation
- A faster process means more iterations, testing, and experimentation
- Intractable problems become tractable

Outcomes
- Learn the jargon and fundamental concepts of parallel programming
- Identify parallelizable tasks
- Learn base R's parallel syntax
- Introduce cluster computing

Agenda
1. Parallel programming
2. Parallel programming in R
3. Cluster computing

Parallel Programming

Parallelizable problems
Work can be split into independent processes:
- Bootstrapping
- Random forests
- Tuning parameters
- Graphing
- Data cleaning
- ...

Process independence
- A process does not rely on another process's outputs
- Processes do not need to communicate state during execution

Many data, one task. One data, many tasks.

Parallel overhead
- Load balancing
- Communication speed

The Computer Model
[diagram, stepped through over several slides: four CPUs, memory (RAM), and a hard drive]

Parallel Programming in R

The parallel package
- Base R package
- Shared syntax with R's functional programming utilities (the apply functions)
- Systems for all basic parallel operations

Crash Course: lapply

"lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X." - CRAN

    example_list <- list(1:10, 11:100, rep(20, 5))
    example_vector <- c(1, 3, 16)

    lapply(example_list, identity)
    ## [[1]]
    ##  [1]  1  2  3  4  5  6  7  8  9 10
    ##
    ## [[2]]
    ##  [1]  11  12  13  14  15  16  17  18  19  20  21  22  23
    ## [14]  24  25  26  27  28  29  30  31  32  33  34  35  36
    ## [27]  37  38  39  40  41  42  43  44  45  46  47  48  49
    ## [40]  50  51  52  53  54  55  56  57  58  59  60  61  62
    ## [53]  63  64  65  66  67  68  69  70  71  72  73  74  75
    ## [66]  76  77  78  79  80  81  82  83  84  85  86  87  88
    ## [79]  89  90  91  92  93  94  95  96  97  98  99 100
    ##
    ## [[3]]
    ## [1] 20 20 20 20 20

    lapply(example_list, mean)
    ## [[1]]
    ## [1] 5.5
    ##
    ## [[2]]
    ## [1] 55.5
    ##
    ## [[3]]
    ## [1] 20

    lapply(example_vector, identity)
    ## [[1]]
    ## [1] 1
    ##
    ## [[2]]
    ## [1] 3
    ##
    ## [[3]]
    ## [1] 16

    lapply(example_vector, function(x) x * x)
    ## [[1]]
    ## [1] 1
    ##
    ## [[2]]
    ## [1] 9
    ##
    ## [[3]]
    ## [1] 256

    lapply(c(identity, mean, sum), function(current_func) {
        current_func(example_vector)
    })
    ## [[1]]
    ## [1]  1  3 16
    ##
    ## [[2]]
    ## [1] 6.666667
    ##
    ## [[3]]
    ## [1] 20
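A rough mental model (a sketch for reference, not how base R actually implements it): lapply(X, FUN) behaves like a for loop that fills a list one element at a time, and no iteration depends on any other. The helper name lapply_by_hand is made up for illustration.

    # Sketch: a hand-rolled, single-threaded stand-in for lapply(X, FUN).
    # Each element is computed independently of the others.
    lapply_by_hand <- function(X, FUN) {
        out <- vector("list", length(X))
        for (i in seq_along(X)) {
            out[[i]] <- FUN(X[[i]])
        }
        out
    }

    # For the examples above this should agree with lapply(), e.g.:
    # identical(lapply_by_hand(example_list, mean), lapply(example_list, mean))

That element-wise independence is exactly what the parallel apply functions exploit.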
Our First Parallel Program

Bootstrapping an estimate
1. Define a bootstrap for the iris dataset
2. Run the bootstrap function in a single thread
3. Configure a minimal R parallel computing environment
4. Run the bootstrap function in parallel

Our bootstrap function

    run_iris_boot <- function(...) {
        iris_boot_sample <- iris[sample(1:nrow(iris), replace = TRUE), ]
        lm(Sepal.Length ~ Sepal.Width + Petal.Length,
           data = iris_boot_sample)
    }

Run once

    run_iris_boot()
    ## Call:
    ## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length,
    ##     data = iris_boot_sample)
    ##
    ## Coefficients:
    ## (Intercept)   Sepal.Width  Petal.Length
    ##      2.0756        0.6162        0.4983

Single threaded

    set.seed(20160927)
    system.time(lapply(1:1000, run_iris_boot))
    ##    user  system elapsed
    ##    1.01    0.00    1.02

Parallel

    library(parallel)
    cores <- detectCores()
    cluster <- makeCluster(cores)
    clusterSetRNGStream(cluster, 20160927)
    system.time(parLapply(cluster, 1:1000, run_iris_boot))
    ##    user  system elapsed
    ##    0.19    0.03    0.77
    stopCluster(cluster)

Improvement

    0.77 / 1.02
    ## [1] 0.754902

Additional overhead

    system.time({
        library(parallel)
        cores <- detectCores()
        cluster <- makeCluster(cores)
        clusterSetRNGStream(cluster, 20160927)
        parLapply(cluster, 1:1000, run_iris_boot)
        stopCluster(cluster)
    })
    ##    user  system elapsed
    ##    0.17    0.05    1.35

    1.35 / 1.02
    ## [1] 1.323529
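The timings above discard the fitted models. As a sketch of a next step (not on the original slides; the object name boot_fits is assumed), capture the parLapply() result and summarize the bootstrap coefficients, for example with percentile intervals:

    # Sketch, assuming the parallel run was captured rather than only timed:
    #   boot_fits <- parLapply(cluster, 1:1000, run_iris_boot)

    # One row of coefficients per bootstrap replicate.
    boot_coefs <- do.call(rbind, lapply(boot_fits, coef))

    # Rough 95% percentile intervals for each coefficient.
    apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))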
Breaking Down the Example

    library(parallel)
    cores <- detectCores()
    cluster <- makeCluster(cores)
    clusterSetRNGStream(cluster, 20160927)
    parLapply(cluster, 1:1000, run_iris_boot)
    stopCluster(cluster)

[diagram, repeated as each line is highlighted: two CPUs, memory (RAM), and a hard drive]

Doing Something Useful: Tuning Parameters

Testing k-means groups
1. Generate and configure our cluster
2. Instruct our worker nodes to load a needed package
3. Run k-means against different potential group counts on our worker nodes
4. Return the results to our manager session
5. Summarize the fit for different groups in our manager session

    clusterEvalQ(cluster, library(MASS))
    test_centers <- 2:6
    node_results <- parSapply(cluster, test_centers, function(n_centers) {
        kmeans(anorexia[2:3], centers = n_centers, nstart = 200)$betweenss
    })
    final_results <- by(node_results, test_centers, median)

Doing Something Useful: Cleaning Data

Cleaning text data
1. Generate and configure our cluster
2. Export database information to our worker nodes
3. Instruct our worker nodes to create a connection to the database
4. Split the raw text file paths across our worker nodes and instruct them to read each text file, clean the data, and write it to the database

    clusterExport(cluster, c("db_user", "db_password", "db_host"))
    clusterEvalQ(cluster, {
        db_conn <- dbConnect(user = db_user, password = db_password,
                             host = db_host)
    })
    parLapply(cluster, raw_data_files, function(file_path) {
        df <- read.csv(file_path)
        df$y <- toupper(df$y)
        dbWriteTable(db_conn, df, )
    })

clusterExport() copies objects from the manager session to each worker node:

    db_user <- "elizabeth"
    db_user
    ## [1] "elizabeth"

    clusterEvalQ(cluster, db_user)
    ## Error in checkForRemoteErrors(lapply(cl, recvResult)) :
    ##   2 nodes produced errors; first error: object 'db_user' not found

    clusterExport(cluster, c("db_user"))
    clusterEvalQ(cluster, db_user)
    ## [[1]]
    ## [1] "elizabeth"
    ##
    ## [[2]]
    ## [1] "elizabeth"

Troubleshooting

Evaluating performance
When does it make sense to absorb the parallel overhead?
1. Many computations against the same data
2. The same computation against many data
3. No need to communicate mid-process
Check your assumptions using system.time().

Monitoring nodes
Error messages are typically obtuse:

    ## Error in checkForRemoteErrors(val) :
    ##   4 nodes produced errors; first error: 1

- Test your code in a single-threaded session
- Create log files for worker nodes
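One way to get those worker log files (not covered on the original slides; the file name here is arbitrary) is the outfile argument when the cluster is created, which redirects each worker's stdout and stderr:

    library(parallel)

    # outfile = "" leaves worker output unredirected (usually the launching
    # terminal); a file path collects it for later inspection.
    cluster <- makeCluster(detectCores(), outfile = "worker_nodes.log")

    invisible(parLapply(cluster, 1:10, function(i) {
        message("starting element ", i)  # ends up in worker_nodes.log
        sqrt(i)
    }))

    stopCluster(cluster)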
Random number generators

    clusterSetRNGStream(cluster, 20160927)

Two things to consider:
1. For L'Ecuyer to work, the same number of worker processes must be fed streams
2. We cannot reproduce the results in a single-threaded process

Cluster Computing

A cluster is a number of computers configured to work together across a network as a single system for some task.

For our purposes, a cluster is a set of computers running local R sessions that take commands from, and return outputs to, a manager session.

makePSOCKcluster()
- Given a number, it launches that many local Rscript sessions.
- Given a character vector, it treats each value as a network address and starts a remote Rscript session there.
- The manager R session tracks the network location and port of each Rscript worker node.
- Rscript worker nodes listen on a port for instructions and serialized data from the manager.

Computers and networks default to closed traffic for security.

Minimum networking knowledge
- SSH: secure communication across networks
- Firewalls: traffic control at network boundaries
- Ports: where computers listen for traffic from the network

Making an R Cluster

Steps
1. Launch two computers on a secure network
2. Install the necessary software on both computers
3. Share the private network IPs and SSH credentials across the two computers
4. Use makePSOCKcluster() and the private network IP addresses to create worker node R sessions

Launch two computers

Install the necessary software

    sudo apt-get install r-base-dev openssh-server

Share IPs and SSH credentials

Create a worker node R session

    library(parallel)
    cluster <- makePSOCKcluster(c("172.31.51.171", "localhost"))
    clusterEvalQ(cluster, {
        system("ifconfig eth0 | grep 'inet addr' | awk '{print $2}'")
    })
    ## addr:172.31.50.241
    ## addr:172.31.51.171

Conclusion

Further Reading
- Matloff - Parallel Computing for Data Science
- Bengtsson - The future package
- Wickham - Advanced R
- Eddelbuettel - CRAN Task View: High-Performance and Parallel Computing with R

Questions?
Find me on Twitter (@ByerlyElizabeth)
Find me on LinkedIn (Elizabeth Byerly)