Parallel and Cluster
Computing with R
Elizabeth Byerly
2016-09-27
Introduction
Making coffee
[Diagram: Ground Beans + Boiling Water → Work → Delicious Coffee]
Single thread
Parallel
Cluster
R is a single-threaded program.
We will learn how to write parallel R programs.
Motivation
A faster process means more iterations, more
testing, and more experimentation
Intractable problems become tractable
Outcomes
Learn jargon and fundamental concepts
of parallel programming
Identify parallelizable tasks
Learn base R's parallel syntax
Introduce cluster computing
Agenda
1. Parallel programming
2. Parallel programming in R
3. Cluster computing
Parallel Programming
Parallelizable problems
Work can be split into independent processes:
Bootstrapping
Random forests
Tuning parameters
Graphing
Data cleaning
...
Process independence
A process does not rely on another process's
outputs
Processes do not need to communicate state
during execution
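A hedged illustration of the distinction, using made-up data: the bootstrap replicates below depend only on the shared input, while the running sum needs each previous result.

x <- rnorm(50)
# Independent: each replicate uses only x, so the 100 tasks could run
# in any order, or in parallel.
boot_means <- lapply(1:100, function(i) mean(sample(x, replace = TRUE)))
# Dependent: each partial sum needs the one before it, so this chain
# cannot be split across workers.
partial_sums <- Reduce(`+`, x, accumulate = TRUE)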
Many data, one task.
One data, many tasks.
Parallel overhead
Load balancing: keeping every worker busy
Communication speed: the cost of moving data
to and from workers
The Computer Model
[Diagram, repeated across six slides: four CPUs connected to shared memory (RAM) and a hard drive, with different components highlighted in turn]
Parallel Programming in R
The parallel package
A base R package
Shares syntax with R's functional
programming utilities (the apply functions)
Provides all of the basic parallel operations
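A minimal sketch of that shared syntax, assuming a two-worker local cluster: parLapply mirrors lapply, and the fork-based mclapply mirrors it as well.

library(parallel)
lapply(1:4, sqrt)          # single-threaded
cl <- makeCluster(2)       # two local worker sessions
parLapply(cl, 1:4, sqrt)   # same call shape, split across the workers
stopCluster(cl)
# mclapply(1:4, sqrt, mc.cores = 2)  # fork-based alternative; not on Windows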
Crash Course: lapply
lapply
"returns a list of the same length as X, each
element of which is the result of applying FUN to
the corresponding element of X."
- R documentation
example_list <- list(1:10, 11:100, rep(20, 5))
example_vector <- c(1, 3, 16)
lapply(example_list, identity)
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 11 12 13 14 15 16 17 18 19 20 21 22 23
## [14] 24 25 26 27 28 29 30 31 32 33 34 35 36
## [27] 37 38 39 40 41 42 43 44 45 46 47 48 49
## [40] 50 51 52 53 54 55 56 57 58 59 60 61 62
## [53] 63 64 65 66 67 68 69 70 71 72 73 74 75
## [66] 76 77 78 79 80 81 82 83 84 85 86 87 88
## [79] 89 90 91 92 93 94 95 96 97 98 99 100
##
## [[3]]
## [1] 20 20 20 20 20
lapply(example_list, mean)
## [[1]]
## [1] 5.5
##
## [[2]]
## [1] 55.5
##
## [[3]]
## [1] 20
lapply(example_vector, identity)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 16
lapply(example_vector, function(x) x * x)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 9
##
## [[3]]
## [1] 256
lapply(c(identity, mean, sum), function(current_func) {
  current_func(example_vector)
})
## [[1]]
## [1]  1  3 16
##
## [[2]]
## [1] 6.666667
##
## [[3]]
## [1] 20
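One hedged aside before we parallelize: sapply is lapply plus result simplification, and parallel mirrors both (parLapply and parSapply, the latter used later in this talk).

sapply(example_vector, function(x) x * x)
## [1]   1   9 256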
Our First Parallel Program
Bootstrapping an estimate
1. Define a bootstrap for the iris dataset
2. Run the bootstrap function in a single thread
3. Configure a minimal R parallel computing
environment
4. Run the bootstrap function in parallel
Our bootstrap function
run_iris_boot <- function(...) {
  # the ignored ... lets lapply hand the function an iteration index
  iris_boot_sample <- iris[sample(1:nrow(iris), replace = TRUE), ]
  lm(Sepal.Length ~ Sepal.Width + Petal.Length,
     data = iris_boot_sample)
}
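A hedged usage sketch, not from the original slides: once many replicates are collected, the coefficient draws give bootstrap confidence intervals.

boots <- lapply(1:100, run_iris_boot)
coef_draws <- t(sapply(boots, coef))                     # one row per replicate
apply(coef_draws, 2, quantile, probs = c(0.025, 0.975))  # 95% intervals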
Run once
run_iris_boot()
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length,
##     data = iris_boot_sample)
##
## Coefficients:
##  (Intercept)   Sepal.Width  Petal.Length
##       2.0756        0.6162        0.4983
Single threaded
set.seed(20160927)
system.time(lapply(1:1000, run_iris_boot))
##    user  system elapsed
##    1.01    0.00    1.02
Parallel
library(parallel)
cores <- detectCores()
cluster <- makeCluster(cores)
clusterSetRNGStream(cluster, 20160927)
system.time(parLapply(cluster, 1:1000, run_iris_boot))
##    user  system elapsed
##    0.19    0.03    0.77
stopCluster(cluster)
Improvement
0.77 / 1.02
## [1] 0.754902
The parallel run took about 75% of the
single-threaded time.
Additional overhead
system.time({
  library(parallel)
  cores <- detectCores()
  cluster <- makeCluster(cores)
  clusterSetRNGStream(cluster, 20160927)
  parLapply(cluster, 1:1000, run_iris_boot)
  stopCluster(cluster)
})
##    user  system elapsed
##    0.17    0.05    1.35
1.35 / 1.02
## [1] 1.323529
Counting cluster setup and teardown, the parallel
version took about 32% longer than the
single-threaded run.
Breaking Down the Example
library(parallel)                          # load the base parallel package
cores <- detectCores()                     # count the available CPUs
cluster <- makeCluster(cores)              # launch one worker R session per CPU
clusterSetRNGStream(cluster, 20160927)     # give each worker its own reproducible RNG stream
parLapply(cluster, 1:1000, run_iris_boot)  # split the 1,000 replicates across the workers
stopCluster(cluster)                       # shut the worker sessions down
[Diagram, repeated between the code slides: the manager and worker R sessions each occupying a CPU, sharing memory (RAM) and the hard drive]
Doing Something Useful: Tuning Parameters
Testing k-means groups
1. Generate and configure our cluster
2. Instruct our worker nodes to load a needed
package
3. Run k-means against different potential group
counts on our worker nodes
4. Return the results to our manager session
5. Summarize the fit for different groups in our
manager session
clusterEvalQ(cluster, library(MASS))  # load MASS (for the anorexia data) on every worker
test_centers <- 2:6
node_results <- parSapply(cluster, test_centers, function(n_centers) {
  kmeans(anorexia[2:3], centers = n_centers, nstart = 200)$betweenss
})
final_results <- by(node_results, test_centers, median)
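A hedged follow-up sketch, assuming the objects above: plotting between-cluster sum of squares against the number of centers gives the usual elbow plot for choosing a group count.

plot(test_centers, unlist(final_results), type = "b",
     xlab = "Number of centers", ylab = "Between-cluster sum of squares")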
Doing Something Useful: Cleaning Data
Cleaning text data
1. Generate and configure our cluster
2. Export database information to our worker
nodes
3. Instruct our worker nodes to create a connection
to the database
4. Split the raw text file paths across our worker
nodes and instruct them to read each text file,
clean the data, and write it to the database
clusterExport(cluster, c("db_user", "db_password", "db_host"))
clusterEvalQ(cluster, {
  library(DBI)
  # Each worker opens its own connection; connections cannot be
  # serialized and shipped from the manager. The driver shown is
  # illustrative; any DBI backend works.
  db_conn <- dbConnect(RPostgreSQL::PostgreSQL(), user = db_user,
                       password = db_password, host = db_host)
})
parLapply(cluster, raw_data_files, function(file_path) {
  df <- read.csv(file_path)
  df$y <- toupper(df$y)
  dbWriteTable(db_conn, "clean_data", df, append = TRUE)  # table name is illustrative
})
db_user <- "elizabeth"
db_user
## [1] "elizabeth"
clusterEvalQ(cluster, db_user)
## Error in checkForRemoteErrors(lapply(cl, recvResult)) :
##   2 nodes produced errors; first error: object 'db_user' not found
clusterExport(cluster, c("db_user"))
clusterEvalQ(cluster, db_user)
## [[1]]
## [1] "elizabeth"
##
## [[2]]
## [1] "elizabeth"
Troubleshooting
Evaluating performance
When does it make sense to absorb the parallel
overhead?
1. Many computations against the same data
2. The same computation against many data
3. No need to communicate mid-process
Check your assumptions using system.time()
Monitoring nodes
Error messages are typically obtuse:
## Error in checkForRemoteErrors(val) :
##   4 nodes produced errors; first error: 1
Test your code in a single-threaded session
Create log files for worker nodes (see the
sketch below)
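A minimal logging sketch, assuming a local cluster: makeCluster's outfile argument redirects each worker's output to a file (or, with "", to the manager's console).

library(parallel)
cl <- makeCluster(4, outfile = "workers.log")  # workers' prints and errors land here
clusterEvalQ(cl, print(Sys.getpid()))          # each PID is also written to workers.log
stopCluster(cl)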
Random number generators
clusterSetRNGStream(cluster, 20160927)
Two things to consider:
1. For the L'Ecuyer streams to be reproducible,
the cluster must have the same number of
worker processes each run (see the sketch below)
2. We cannot reproduce the results in a
single-threaded process
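A hedged reproducibility check: two clusters with the same seed and the same worker count draw identical numbers.

library(parallel)
draw <- function(seed, workers) {
  cl <- makeCluster(workers)
  clusterSetRNGStream(cl, seed)
  out <- parLapply(cl, 1:workers, function(i) runif(3))
  stopCluster(cl)
  out
}
identical(draw(20160927, 2), draw(20160927, 2))  # TRUE: same seed, same worker count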
Cluster Computing
A cluster is a number of computers configured to
work together across a network as a single system
for some task.
For our purposes, a cluster is computers running
local R sessions that take commands and return
outputs to a manager session.
makePSOCKcluster()
Given a number, it launches that many local
Rscript sessions.
Given a character vector, it treats each value as
a network address and launches a remote
Rscript session there.
The manager R session tracks the network
location and port of each Rscript worker node.
Rscript worker nodes listen on a port for
instructions and serialized data from the manager.
Computers and networks default to closed traffic
for security.
Minimum networking knowledge
SSH, secure communication across networks
Firewalls, traffic control at network boundaries
Ports, computers listening for traffic from the
network
Making an R Cluster
Steps
1. Launch two computers on a secure network
2. Install the necessary software on both
computers
3. Share the private network IPs and SSH
credentials across the two computers
4. Use makePSOCKcluster() and the private network
IP addresses to create worker node R sessions
Launch two computers
Install the necessary software
sudo apt-get install r-base-dev openssh-server
Share IPs and SSH credentials
Create a worker node R session
Create a worker node R session
library(parallel)
cluster <- makePSOCKcluster(c("172.31.51.171", "localhost"))
clusterEvalQ(cluster, {
  system("ifconfig eth0 | grep 'inet addr' | awk '{print $2}'")
})
## addr:172.31.50.241
## addr:172.31.51.171
Conclusion
Further Reading
Matloff - Parallel Computing for Data Science
Bengtsson - The future package
Wickham - Advanced R
Eddelbuettel - High-Performance and Parallel
Computing with R Task View
Questions?
Find me on Twitter (@ByerlyElizabeth)
Find me on LinkedIn (Elizabeth Byerly)