Parallel and Cluster
Computing with R
Elizabeth Byerly
2016-09-27

1

Introduction
2

Making coffee
[Diagram: ground beans and boiling water, combined through work, become delicious coffee]
3

Single thread
4

Parallel
5

Cluster
6
R is a single-threaded program.
We will learn how to write parallel R programs.


7
Motivation
8
A faster process means more iterations,
more testing, and more experimentation

9
Intractable problems become tractable

10

Outcomes
11
Learn jargon and fundamental concepts
of parallel programming

12

Identify parallelizable tasks
13

Learn base R's parallel syntax
14

Introduce cluster computing
15

Agenda
1. Parallel programming
2. Parallel programming in R
3. Cluster computing
16

Parallel Programming
17

Parallelizable problems
Work can be split into independent processes:
Bootstrapping
Random forests
Tuning parameters
Graphing
Data cleaning
...
18

Process independence
A process does not rely on another process's
outputs
Processes do not need to communicate state
during execution
19

Many data, one task.
One data, many tasks.
20

Parallel overhead
Load balancing
Communication speed
21
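The independence idea above can be sketched in base R. This is an illustration, not code from the talk: the first task computes each element from its own input alone, so its iterations could run in any order on any worker; the second needs each previous result, so it cannot be split.

```r
# Illustrative sketch (base R only): an independent, parallelizable task.
# Each element depends only on its own input, so iterations could run
# in any order on any worker.
independent <- lapply(1:5, function(i) i^2)

# A dependent task: each partial sum needs the previous one, so the
# iterations cannot be split across workers.
dependent <- Reduce(`+`, 1:5, accumulate = TRUE)  # 1 3 6 10 15
```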

The Computer Model
22

[Diagram: four CPUs sharing memory (RAM) and a hard drive, shown across slides 23-28]
28

Parallel Programming in R
29

The parallel package
Base R package
Shared syntax with R's functional
programming utilities (the apply functions)
Systems for all basic parallel operations
30
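The shared syntax can be seen in a minimal sketch (mine, not from the slides): parLapply() takes a cluster as its first argument but otherwise mirrors lapply()'s X-and-FUN shape.

```r
# Sketch: lapply() and parLapply() share the same X-and-FUN interface.
library(parallel)

cluster <- makeCluster(2)
serial_result   <- lapply(1:4, function(x) x + 1)
parallel_result <- parLapply(cluster, 1:4, function(x) x + 1)
stopCluster(cluster)

identical(serial_result, parallel_result)  # TRUE: same results, different workers
```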

Crash Course: lapply
31
lapply
returns a list of the same length as X, each
element of which is the result of applying FUN to
the corresponding element of X.

- R documentation
32

example_list <- list(1:10, 11:100, rep(20, 5))
example_vector <- c(1, 3, 16)

33
lapply(example_list, identity)
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## [[2]]
##  [1] 11 12 13 14 15 16 17 18 19 20 21 22 23
## [14] 24 25 26 27 28 29 30 31 32 33 34 35 36
## [27] 37 38 39 40 41 42 43 44 45 46 47 48 49
## [40] 50 51 52 53 54 55 56 57 58 59 60 61 62
## [53] 63 64 65 66 67 68 69 70 71 72 73 74 75
## [66] 76 77 78 79 80 81 82 83 84 85 86 87 88
## [79] 89 90 91 92 93 94 95 96 97 98 99 100
##
## [[3]]
## [1] 20 20 20 20 20

34
lapply(example_list, mean)
## [[1]]
## [1] 5.5
##
## [[2]]
## [1] 55.5
##
## [[3]]
## [1] 20

35
lapply(example_vector, identity)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 16

36
lapply(example_vector, function(x) x * x)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 9
##
## [[3]]
## [1] 256

37
lapply(c(identity, mean, sum), function(current_func) {
current_func(example_vector)
})
## [[1]]
## [1]  1  3 16
##
## [[2]]
## [1] 6.666667
##
## [[3]]
## [1] 20

38
Our First Parallel Program
39
Bootstrapping an estimate
1. Define a bootstrap for the iris dataset
2. Run the bootstrap function in a single thread
3. Configure a minimal R parallel computing
environment
4. Run the bootstrap function in parallel

40

Our bootstrap function
run_iris_boot <- function(...) {
  iris_boot_sample <- iris[sample(1:nrow(iris), replace = TRUE), ]
  lm(Sepal.Length ~ Sepal.Width + Petal.Length,
     data = iris_boot_sample)
}
41

Run once
run_iris_boot()
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length,
##     data = iris_boot_sample)
##
## Coefficients:
##  (Intercept)   Sepal.Width  Petal.Length
##       2.0756        0.6162        0.4983
42

Single threaded
set.seed(20160927)
system.time(lapply(1:1000, run_iris_boot))
##    user  system elapsed
##    1.01    0.00    1.02
43

Parallel
library(parallel)
cores <- detectCores()
cluster <- makeCluster(cores)
clusterSetRNGStream(cluster, 20160927)
system.time(parLapply(cluster, 1:1000, run_iris_boot))
##    user  system elapsed
##    0.19    0.03    0.77
stopCluster(cluster)
44

Improvement
0.77 / 1.02
## [1] 0.754902
45

Additional overhead
system.time({
  library(parallel)
  cores <- detectCores()
  cluster <- makeCluster(cores)
  clusterSetRNGStream(cluster, 20160927)
  parLapply(cluster, 1:1000, run_iris_boot)
  stopCluster(cluster)
})
##    user  system elapsed
##    0.17    0.05    1.35
1.35 / 1.02
## [1] 1.323529
46
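The overhead point can be pushed further with a hedged illustration of my own (timings vary by machine): when the per-element task is nearly free, the cost of shipping inputs and outputs between sessions usually exceeds the computation saved.

```r
# Sketch: parallel overhead dominating a cheap task (timings vary by machine).
library(parallel)

cluster <- makeCluster(2)
serial_elapsed   <- system.time(lapply(1:1000, sqrt))["elapsed"]
parallel_elapsed <- system.time(parLapply(cluster, 1:1000, sqrt))["elapsed"]
stopCluster(cluster)

# sqrt is nearly free per element, so the parallel run is typically
# slower here: communication costs outweigh the work saved.
```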
Breaking Down the Example

47

library(parallel)
cores <- detectCores()
cluster <- makeCluster(cores)
clusterSetRNGStream(cluster, 20160927)
parLapply(cluster, 1:1000, run_iris_boot)
stopCluster(cluster)

48
[Diagram: two CPUs sharing memory (RAM) and a hard drive. Slides 49-56 alternate between the code above and this diagram, stepping through each line.]
56

Doing Something Useful: Tuning Parameters
57
Testing k-means groups
1. Generate and configure our cluster
2. Instruct our worker nodes to load a needed
package
3. Run k-means against different potential group
counts on our worker nodes
4. Return the results to our manager session
5. Summarize the fit for different groups in our
manager session

58

clusterEvalQ(cluster, library(MASS))
test_centers <- 2:6
node_results <- parSapply(cluster, test_centers, function(n_centers) {
  kmeans(anorexia[2:3], centers = n_centers, nstart = 200)$betweenss
})
final_results <- by(node_results, test_centers, median)

59
[Slides 60-63 step through the code above.]
63
Doing Something Useful: Cleaning Data
64
Cleaning text data
1. Generate and configure our cluster
2. Export database information to our worker
nodes
3. Instruct our worker nodes to create a connection
to the database
4. Split the raw text file paths across our worker
nodes and instruct them to read each text file,
clean the data, and write it to the database

65

clusterExport(cluster, c("db_user", "db_password", "db_host"))
clusterEvalQ(cluster, {
  db_conn <- dbConnect(user = db_user, password = db_password,
                       host = db_host)
})
parLapply(cluster, raw_data_files, function(file_path) {
  df <- read.csv(file_path)
  df$y <- toupper(df$y)
  dbWriteTable(db_conn, "clean_text", df)  # hypothetical table name
})

66
db_user <- "elizabeth"
db_user
## [1] "elizabeth"
clusterEvalQ(cluster, db_user)
## Error in checkForRemoteErrors(lapply(cl, recvResult)) :
##   2 nodes produced errors; first error: object 'db_user' not found
clusterExport(cluster, c("db_user"))
clusterEvalQ(cluster, db_user)
## [[1]]
## [1] "elizabeth"
##
## [[2]]
## [1] "elizabeth"

67
[Slides 68-71 step through the code shown on slide 66.]
71
Troubleshooting
72
Evaluating performance
When does it make sense to absorb the parallel
overhead?
1. Many computations against the same data
2. The same computation against many data
3. No need to communicate mid-process
Check your assumptions using system.time()

73
Monitoring nodes
Error messages are typically cryptic.

## Error in checkForRemoteErrors(val) :
##   4 nodes produced errors; first error: 1
Test your code in a single-threaded session
Create log files for worker nodes
74
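One way to create those log files is a sketch like the following, using makeCluster()'s outfile argument; the file name "workers.log" is my own arbitrary choice.

```r
# Sketch: capture worker-node output in a log file for debugging.
# "workers.log" is an arbitrary name; outfile = "" would instead echo
# worker output to the manager session's console.
library(parallel)

cluster <- makeCluster(2, outfile = "workers.log")
invisible(clusterEvalQ(cluster, message("worker ready, pid ", Sys.getpid())))
stopCluster(cluster)
```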
Random number generators
clusterSetRNGStream(cluster, 20160927)
Two things to consider:
1. For L'Ecuyer streams to reproduce results, the
same number of worker processes must be fed streams
2. We cannot reproduce the results in a single-threaded process

75
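A sketch of point 1: resetting the streams with the same seed on a cluster of the same size reproduces the draws exactly, which a single set.seed() in a serial session will not match.

```r
# Sketch: reproducible parallel random numbers via L'Ecuyer streams.
library(parallel)

cluster <- makeCluster(2)
clusterSetRNGStream(cluster, 20160927)
first <- unlist(parLapply(cluster, 1:4, function(i) rnorm(1)))
clusterSetRNGStream(cluster, 20160927)  # reset every worker's stream
second <- unlist(parLapply(cluster, 1:4, function(i) rnorm(1)))
stopCluster(cluster)

identical(first, second)  # TRUE: same streams, same draws
```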

Cluster Computing
76
A cluster is a number of computers configured to
work together across a network as a single system
for some task.

77
For our purposes, a cluster is computers running
local R sessions that take commands and return
outputs to a manager session.


78
makePSOCKcluster()
Given a number, make that many local Rscript
sessions.
Given a character vector, use each value as a
network address and instantiate a remote
Rscript session.
The manager R session tracks the network
location and port of each Rscript worker node.
Rscript worker nodes listen on a port for
instructions and serial data from the manager.
79
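The two spec forms described above can be sketched as follows; the remote host name is a made-up placeholder, and the remote form assumes SSH access is already configured.

```r
# Sketch: a numeric spec launches local Rscript workers; a character
# vector of addresses would launch remote workers over SSH instead.
library(parallel)

local_cluster <- makePSOCKcluster(2)  # two Rscript sessions on this machine
length(local_cluster)                 # one entry per worker node
stopCluster(local_cluster)

# Remote form (hypothetical host; requires SSH credentials and open ports):
# remote_cluster <- makePSOCKcluster(c("worker.example.com", "localhost"))
```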
For security, computers and networks block
traffic by default.


80
Minimum networking knowledge
SSH, secure communication across networks
Firewalls, traffic control at network boundaries
Ports, numbered endpoints where a computer
listens for network traffic
81

Making an R Cluster
82
Steps
1. Launch two computers on a secure network
2. Install the necessary software on both
computers
3. Share the private network IPs and SSH
credentials across the two computers
4. Use makePSOCKcluster() and the private network
IP addresses to create worker node R sessions
83

Launch two computers
84

Install the necessary software
sudo apt-get install r-base-dev openssh-server
85

Share IPs and SSH credentials
86

Create a worker node R session
87

Create a worker node R session
library(parallel)
cluster <- makePSOCKcluster(c("172.31.51.171", "localhost"))
clusterEvalQ(cluster, {
  system("ifconfig eth0 | grep 'inet addr' | awk '{print $2}'")
})
## addr:172.31.50.241
## addr:172.31.51.171
88

Conclusion
89

Further Reading
Matloff - Parallel Programming for Data
Science
Bengtsson - The future package
Wickham - Advanced R
Eddelbuettel - High-Performance and Parallel
Computing with R Task View
90

Questions?
Find me on Twitter (@ByerlyElizabeth)
Find me on LinkedIn (Elizabeth Byerly)
91