CS294-123: Scalable Machine Learning
John Canny
Spring 2016
CCN: 25819
Motivation
Convolutional NNs: AlexNet (2012), trained on 200 GB of ImageNet data.
Human performance: 5.1% error.
Recurrent Nets: LSTMs (1997).
The unreasonable effectiveness of Reinforcement Learning + Deep Learning
In 2013, DeepMind published a paper demonstrating superior
performance (better than a human expert) on six games on a virtual
Atari 2600 game console (Pong, Breakout, Space Invaders…).
Acquired by Google in 2014, conditioned on Google
creating an AI ethics panel.
Learning without a Teacher
Word2vec: semantic similarity of nearby words
Skip-thought: coherence of nearby sentences.
Distant supervision: Learn from context (e.g. captions) plus
other data sources.
…these approaches work on massive corpora
Motivation
Yahoo: 0.5 exabytes (10^18 bytes) of user data.
YouTube: 0.5 exabytes of mostly public data.
Amazon: 1 exabyte(?)+
Tesla, Google, BMW, Uber,…: many terabytes per vehicle/day.
~1 terabyte/day/drone.
Three Approaches to Performance
There are three approaches to improving performance for
machine learning:
1. Algorithmic and Statistical Efficiency: how many operations are
needed to reach a given level of accuracy?
2. System Efficiency: how much time (energy, etc.) does it take to
perform those operations?
3. Cluster parallelism.
Three Approaches to Performance
There are three approaches to improving performance for
machine learning:
1. Algorithmic and Statistical Efficiency: how many operations are
needed to reach a given level of accuracy? Free!

2. System Efficiency: how much time (energy, etc.) does it take to
perform those operations? Free!
3. Cluster parallelism. Expensive, but often exploits
data locality.
This Class
We will briefly cover these topics:
1. Algorithmic and Statistical Efficiency: how many operations are
needed to reach a given level of accuracy?
2. System Efficiency: how much time (energy, etc.) does it take to
perform those operations?
3. Cluster parallelism.
but there are hard limits to how far these can take us…
This Class
We will spend most of our time on:
1. Algorithmic and Statistical Efficiency: how many operations
are needed to reach a given level of accuracy?
We use “statistical efficiency” in a very broad sense, i.e. we are
free to decide what training data to use and how.
So we will cover recent work on active learning, attention, etc., and
explore “learning to learn”, e.g. using reinforcement learning
policies to tune hyperparameters.
Class Format
One meeting per week, Weds 9-10:30am in 320 Soda.
1-2 unit credit.
Typically 3 papers reviewed by 3 student discussants. You
must be a discussant for 1 unit credit.
For 2-unit credit, you should do a class project:
• Yes, it can be a team project.
• Yes, it can be part of your research or another class project.
• Yes, it can be a survey article.
Class Resources
Syllabus, readings etc. will be on bCourses. First link on my
home page: https://bcourses.berkeley.edu/courses/1413454
You should be able to see CS294-123 if you are registered. If
not, email me with your campus email address to get
registered manually.
Most of the site will be publicly readable.
Piazza?
Class Prerequisites and Content
This is primarily a machine learning course. You should have
had CS189 or CS281, or a graduate stat. class.
The content is still tentative and depends on student interest.
Readings will be posted at least a week ahead. Please
suggest other relevant or interesting material.
Class Outline
Questions?
A quick tour of hardware
A Tale of Two Architectures
Intel CPU
NVIDIA GPU
Memory Controller
ALU
ALU
ALU
ALU
Core
Core
Core
Core
ALU
ALU
ALU
ALU
L3 Cache
L2 Cache
ALU
ALU
Roofline Design (Williams, Waterman, Patterson, 2009)
• Roofline design establishes fundamental performance
limits for a computational kernel.
[Roofline plot: throughput (GFLOPS, log scale) vs. operational intensity
(flops/byte, log scale), with horizontal GPU and CPU ALU-throughput ceilings.]
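To make the roofline bound concrete, here is a minimal numeric sketch: a kernel's attainable throughput is capped both by the ALU peak and by (operational intensity × memory bandwidth). The peak and bandwidth figures in the example call are hypothetical placeholders, not numbers from the plot.

```python
# Minimal roofline-bound sketch (hypothetical machine numbers).
def roofline_gflops(intensity_flops_per_byte, peak_gflops, mem_bw_gb_per_s):
    """Attainable throughput = min(compute ceiling, intensity * memory bandwidth)."""
    return min(peak_gflops, intensity_flops_per_byte * mem_bw_gb_per_s)

# A kernel at 0.25 flops/byte on a machine with a 1000 GFLOPS peak and
# 200 GB/s of memory bandwidth is memory-bound at about 50 GFLOPS:
print(roofline_gflops(0.25, peak_gflops=1000, mem_bw_gb_per_s=200))
```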
CPU vs GPU Memory Hierarchy
Intel 8 core Sandy Bridge CPU
4kB registers:
5 TB/s
512K L1 Cache
1 TB/s
NVIDIA GK110 GPU
4 MB register file (!)
40 TB/s
1 MB Constant Mem
13 TB/s
1 MB Shared Mem
1 TB/s
2 MB L2 Cache
8 MB L3 Cache
500 GB/s
1.5 MB L2 Cache
40 GB/s
10s GB Main Memory
500 GB/s
350 GB/s
12 GB Main Memory
Roofline Design – Matrix kernels
• Dense matrix multiply
• Sparse matrix multiply
[Roofline plot as above, with dense matrix multiply and sparse matrix multiply
marked against the GPU and CPU ALU-throughput ceilings.]
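As a rough sketch of why these two kernels land where they do on the roofline, their operational intensities can be estimated as follows (assuming 4-byte floats and that each operand moves to or from memory once, i.e. ideal caching; the sizes in the example calls are made up):

```python
# Rough operational-intensity estimates under ideal-caching assumptions.
def dense_matmul_intensity(n):
    flops = 2 * n**3                 # n^3 multiply-adds for C = A @ B
    bytes_moved = 3 * n**2 * 4       # read A and B, write C, each once
    return flops / bytes_moved       # ~ n/6 flops/byte: compute-bound for large n

def spmv_intensity(nnz, n):
    flops = 2 * nnz                            # one multiply-add per stored nonzero
    bytes_moved = nnz * (4 + 4) + 2 * n * 4    # value + column index, plus x and y
    return flops / bytes_moved                 # well under 1 flop/byte: memory-bound

print(dense_matmul_intensity(4096))                   # ~680 flops/byte
print(spmv_intensity(nnz=10_000_000, n=1_000_000))    # ~0.2 flops/byte
```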
Convergence
• Current “compute” processors (NVIDIA GPUs, Intel Xeon
Phi) live on a PCI bus, separated from the CPU/main
memory.
• Intel’s next-generation processor “Knights Landing” puts the
compute coprocessor on the motherboard. This has
significant advantages for memory-bound algorithms.
How Much does Memory Matter?
Big Data about people (text, web, social media) follow
power law statistics:
[Log-log plot: feature frequency vs. feature rank, with Freq ∝ 1/rank.]
About half of the users/features are seen only once; the average user is
seen about log N times, where N is the dataset size.
How Big Should Models Be?
Big Data about people (text, web, social media) follow
power law statistics.
Total data volume is approx. N log N for N features.
So twice as much data ⇒ observations of about twice as
many users.
Even single observations improve predictions, so more
data (with proportionately bigger models) means more
accuracy and more revenue.
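A quick numerical check of the N log N claim (a sketch: assume the rarest of N features is seen once, so with Freq ∝ 1/rank feature r is seen about N/r times):

```python
import math

N = 1_000_000
total = sum(N / r for r in range(1, N + 1))   # total data volume
print(total, N * math.log(N))                 # ~1.44e7 vs ~1.38e7: approx. N log N
print(total / N, math.log(N))                 # average observations per feature ~ log N
```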
How Big Should Models Be?
But to benefit from more data, we need bigger models
(enough to model each user/feature).
Rough rule-of-thumb:
Dataset size should be 10x-100x model size.
Smaller models are probably underfitting.
Larger models are typically overfitting.
Cluster Computing
Increasingly, the main argument for cluster-based ML is to
support larger models through model parallelism.
Data parallelism: Data is split across nodes, each node
has a local copy of a shared model.
Model parallelism: Model is split across nodes. No-one
has a complete copy. Nodes must communicate to update
the model.
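For concreteness, a minimal single-process sketch of one data-parallel step, using a toy linear model and squared loss (both placeholders, not anything specified in the slides): each node computes a gradient on its own shard, and the gradients are combined so that every copy of the shared model stays identical.

```python
import numpy as np

def local_gradient(w, X, y):
    # Gradient of mean squared error for a linear model, on one data shard.
    return 2 * X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, shards, lr=0.01):
    grads = [local_gradient(w, X, y) for X, y in shards]  # one gradient per node
    g = np.mean(grads, axis=0)                            # combine (the allreduce below)
    return w - lr * g                                     # identical update on every copy
```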
Calibration: GPU Memory vs Cluster B/W
Bisection B/W is the critical parameter for inference on
large models.
• Memory bisection B/W for a current generation GPU:
• 350 GB/s
• Network bisection B/W for a first-generation* data center
with 20,000 nodes:
• 220 GB/s
Calibration: GPU Memory vs Cluster B/W
Traditional Data center topology (Google 2004)
• 512 racks of 40x 1 Gb/s hosts, 20,000 machines.
• Bisection bandwidth 20k / 10 / 8 = 220 GB/s
[Topology diagram: 4x cluster routers; 512x 1 Gb/s downlinks and 4x 10 Gb/s
sidelinks (10x oversubscription); top-of-rack switches with 4x 1 Gb/s uplinks;
… 504 more racks.]
Roofline Design for Communication
• Networks have both throughput and latency limits
[Plot: throughput (MB/s) vs. packet size (MB) for 40 Gbit InfiniBand and
10 Gbit Ethernet.]
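A simple model of the curves above (a sketch: each message pays a fixed latency plus size/bandwidth, so effective throughput climbs with packet size and flattens at the link limit; the latency and bandwidth values are hypothetical placeholders):

```python
def effective_throughput_MBps(packet_MB, latency_s=50e-6, link_MBps=1250.0):
    # Time per message = fixed latency + transmission time.
    return packet_MB / (latency_s + packet_MB / link_MBps)

for size_MB in [0.001, 0.01, 0.1, 1.0, 10.0]:
    print(size_MB, round(effective_throughput_MBps(size_MB), 1))
# Small packets are latency-dominated (~20 MB/s here); large packets approach 1250 MB/s.
```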
Synchronizing models: Allreduce
Every node j has (model) data Mj; we want to compute an aggregate (e.g. the
sum) M of all the data vectors and distribute it to all nodes. Each node calls
allreduce(Din, Dout, N)
Allreduce is the synchronous version of this process.
Parameter servers approximate this with an asynchronous
protocol.
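As a minimal single-process sketch of what the allreduce(Din, Dout, N) call computes (just the semantics, not a distributed implementation; the function name mirrors the slide's signature):

```python
import numpy as np

def allreduce(Din_per_node):
    """Din_per_node: list of N vectors, one per node.
    Returns the Dout every node ends up with: the elementwise sum over nodes."""
    total = np.sum(np.stack(Din_per_node), axis=0)
    return [total.copy() for _ in Din_per_node]

# Every node contributes its local data and gets back the same aggregate:
outs = allreduce([np.ones(4), 2 * np.ones(4), 3 * np.ones(4)])
assert all(np.allclose(o, [6, 6, 6, 6]) for o in outs)
```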
Why Allreduce?
• Speed: Minibatch SGD and related algorithms are the
fastest way to solve many of these problems. But multi-terabyte
dataset sizes make single-machine training slow (> 1 day).
• The ratio of computation to communication varies, but
often the network is the bottleneck.
• Therefore we want an Allreduce that consumes data at
roughly the full NIC bandwidth for the nodes.
Why Allreduce?
• Scale: Some models won’t fit on one machine (e.g.
Pagerank).
• Other models will fit but minibatch updates only touch a
small fraction of features (Click Prediction). Updating only
those is much faster than a full Allreduce.
• Sparse Allreduce: (next time) Each node j pushes
updates to a subset of features Uj and pulls a subset Vj
from a distributed model.
Lower Bounds for total message B/W.
• Intuitively, every node needs to send D values
somewhere, and receive D values for the final aggregate.
• The data from each node needs to be caught somewhere
if allreduce happens in the same network, and the final
aggregate message needs to originate from somewhere.
• This suggests a naïve lower bound of 2D values in or out
of each node, or 2DN for an N-node network.
• In fact more careful analysis shows the lower bound is
2D(N-1) for total bandwidth, and ~ 2D/B for latency:
Patarasuk et al., “Bandwidth optimal all-reduce algorithms for clusters of
workstations.”
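Written out as a per-node time bound (a sketch restating the counting argument above, with D values per node and per-node link bandwidth B):

```latex
% Every node must inject its D values and absorb D aggregated values;
% the careful analysis cited above tightens the total to 2D(N-1).
T_{\text{allreduce}} \;\ge\; \frac{2D(N-1)}{N\,B} \;\approx\; \frac{2D}{B}
\qquad \text{for } N \gg 1 .
```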
A B/W optimal network (dense data)
• This network uses 2D(N-1) total B/W:
• Every node sends all its data to a “server,” which sums
and returns the results:
[Diagram: all clients send their data to a single server, which sums it and
returns the result.]
But this network has 2D(N-1)/B latency because all the data
has to be processed by the server: suboptimal by a factor of N-1.
Assumption: “large messages”, i.e. time ∝ message size.
Parameter servers
• Potentially B/W optimal, but latency limited by net B/W of
servers:
[Diagram: clients connected to multiple parameter servers.]
This network has at least 2D(N-1)/(PB) latency with P
servers: suboptimal by a factor of (N-1)/P.
B/W and Latency-Optimal Networks
• Reduce-Scatter implementation (Hadoop, Spark,
GraphLab, OpenMPI,…).
• Divide the data on each node into N chunks (contiguous here).
• Each chunk is sent to a “home node” (gray in the figure), where the
chunks are summed and eventually returned.
• Latency = 2(N-1) [nrounds] × D/N [message size] / B [node B/W],
which is B/W-optimal.
Intermediate result is a single distributed copy of the reduction
B/W and Latency-Optimal Networks
• Ring topology.
• Since every message round is a circular permutation, can
move the buckets instead of the data.
• After the buckets have visited every node, every bucket is full;
we can unload the contents of each bucket on the current node.
• Latency = 2(N-1) [nrounds] × D/N [message size] / B [node B/W],
which is B/W-optimal.
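A single-process sketch of the ring algorithm described above (a reduce-scatter of D/N-sized chunks followed by an allgather, 2(N-1) rounds in total); this is an illustration of the idea, not any particular library's implementation:

```python
import numpy as np

def ring_allreduce(node_data):
    """node_data: list of N equal-length vectors, one per simulated node.
    Returns the per-node results; each should equal the elementwise sum."""
    N = len(node_data)
    chunks = [np.array_split(np.asarray(x, dtype=float), N) for x in node_data]

    # Reduce-scatter: in round r, node i sends chunk (i - r) mod N to node i+1,
    # which adds it into its own copy. Each message holds ~D/N values.
    for r in range(N - 1):
        msgs = [(i, (i - r) % N, chunks[i][(i - r) % N].copy()) for i in range(N)]
        for i, c, payload in msgs:
            chunks[(i + 1) % N][c] += payload
    # Now node i holds the fully summed chunk (i + 1) mod N.

    # Allgather: N-1 more rounds passing the completed chunks around the ring.
    for r in range(N - 1):
        msgs = [(i, (i + 1 - r) % N, chunks[i][(i + 1 - r) % N].copy()) for i in range(N)]
        for i, c, payload in msgs:
            chunks[(i + 1) % N][c] = payload

    return [np.concatenate(ch) for ch in chunks]

# Each node sends 2(N-1) messages of ~D/N values, i.e. ~2D(N-1)/N values total.
data = [np.random.randn(12) for _ in range(4)]
assert all(np.allclose(out, sum(data)) for out in ring_allreduce(data))
```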
B/W and Latency-Optimal Networks
• Butterfly topology.
• Progressive reduce/scatter on groups.
[Diagram: butterfly exchange, with messages of size D/2 in the first round
and D/4 in the second.]
Latency = 2(D/2 + D/4 + …)/B = 2D(N-1)/N/B (optimal)
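A quick numeric check of that geometric sum (N a power of two, with made-up values for D and B):

```python
import math

D, B, N = 1_000_000.0, 1e9, 64          # hypothetical volume, bandwidth, node count
rounds = int(math.log2(N))
geometric = 2 * sum(D / 2**k for k in range(1, rounds + 1)) / B
closed_form = 2 * D * (N - 1) / (N * B)
assert math.isclose(geometric, closed_form)
```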
Allreduce Network Comparison
Topology  | nrounds  | msg size   | nrounds with small msgs | bisection B/W | fault toler.
Shuffle   | 2(N-1)   | D/N        | 2(N-1)                  | ND/2          | good
Ring      | 2(N-1)   | D/N        | 2(N-1)                  | 2D            | poor
Butterfly | 2 log2 N | D/2,…,D/N  | 2                       | D log2 N      | fair/poor
• Some shuffle implementations (e.g. Dogwild, Baidu) are
fully asynchronous (N-node parameter servers).
• It’s possible to use unreliable communication (Dogwild)
with shuffle.
• All methods implement a reduce-scatter in the middle.
• Broadcast doesn’t help (all receivers fully occupied).
Roofline Design for Communication
• Networks have both throughput and latency limits
[Same plot as before: throughput (MB/s) vs. packet size (MB) for 40 Gbit
InfiniBand and 10 Gbit Ethernet.]
Roofline Design for Communication
Consequences:
• Designing with respect to packet size is critical for large networks.
• Simple MPI communication patterns aren’t optimal for
commodity networks in practice.
• Using multithreading (send 8 packets “at once”) hides
latency and improves throughput.
For Next Time
Distributed ML systems:
• Single-node parameter server + all-to-all reduce: Spark
• Vertex Model: GraphLab
• Sparse Allreduce: Kylix
Volunteers?