CS294-123: Scalable Machine Learning
John Canny
Spring 2016
CCN: 25819

Motivation
Convolutional NNs: AlexNet (2012): trained on 200 GB of ImageNet data. Human performance: 5.1% error.
Recurrent Nets: LSTMs (1997): The unreasonable effectiveness of …

Reinforcement Learning + Deep Learning
In 2013, DeepMind published a paper demonstrating superior performance on six games on a virtual Atari 2600 game console (Pong, Breakout, Space Invaders…), and better-than-human-expert play on three of them.
Acquired by Google in 2014, conditioned on Google creating an AI ethics panel.

Learning without a Teacher
Word2vec: semantic similarity of nearby words.
Skip-thought: coherence of nearby sentences.
Distant supervision: learn from context (e.g. captions) plus other data sources.
…these approaches work on massive corpora.

Motivation
Yahoo: 0.5 exabytes (1 exabyte = 10^18 bytes) of user data.
YouTube: 0.5 exabytes of mostly public data.
Amazon: 1 exabyte(?)
+ Tesla, Google, BMW, Uber,…: many terabytes per vehicle/day
=1 terabyte/day/drone

Three Approaches to Performance
There are three approaches to improving performance for machine learning:
1. Algorithmic and Statistical Efficiency: the operations needed to reach a given level of accuracy. Free!
2. System Efficiency: how much time (energy etc.) it takes to perform those operations. Free!
3. Cluster parallelism. Expensive, but often exploits data locality.

This Class
We will briefly cover these topics:
1. Algorithmic and Statistical Efficiency: the operations needed to reach a given level of accuracy.
2. System Efficiency: how much time (energy etc.) it takes to perform those operations.
3. Cluster parallelism.
but there are hard limits to how far these can take us…

This Class
We will spend most of our time on:
1. Algorithmic and Statistical Efficiency: the operations needed to reach a given level of accuracy.
We use “statistical efficiency” in a very broad sense, i.e. we are free to decide what training data to use and how. So we will cover recent work on active learning, attention etc., and explore “learning to learn”, e.g. using reinforcement learning policies to tune hyperparameters.

Class Format
One meeting per week, Weds 9-10:30am in 320 Soda. 1-2 unit credit.
Typically 3 papers reviewed by 3 student discussants. You must be a discussant for 1-unit credit.
For 2-unit credit, you should do a class project:
• Yes, it can be a team project.
• Yes, it can be part of your research or another class project.
• Yes, it can be a survey article.

Class Resources
Syllabus, readings etc. will be on bCourses. First link on my home page: https://bcourses.berkeley.edu/courses/1413454
You should be able to see CS294-123 if you are registered. If not, email me with your campus email address to get registered manually.
Most of the site will be publicly readable.
Piazza?

Class Prerequisites and Content
This is primarily a machine learning course. You should have had CS189 or CS281, or a graduate statistics class.
The content is still tentative and depends on student interest. Readings will be posted at least a week ahead. Please suggest other relevant or interesting material.

Class Outline

Questions?

A quick tour of hardware

A Tale of Two Architectures
[Figure: block diagrams of an Intel CPU and an NVIDIA GPU; labels include Memory Controller, Core, ALU, L2 Cache and L3 Cache.]

Roofline Design (Williams, Waterman, Patterson, 2009)
• Roofline design establishes fundamental performance limits for a computational kernel.
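To make the roofline idea concrete, here is a minimal Python sketch, assuming the usual formulation of the model: attainable throughput is the smaller of the compute ceiling and the memory ceiling (memory bandwidth times operational intensity). The function names and the machine numbers are illustrative assumptions, not measurements or values taken from the roofline paper.

```python
def roofline_gflops(peak_gflops, mem_bw_gbytes_per_s, intensity_flops_per_byte):
    """Attainable throughput under the roofline model.

    The kernel is limited either by the ALUs (peak_gflops) or by memory
    traffic (bandwidth * operational intensity), whichever is lower.
    """
    return min(peak_gflops, mem_bw_gbytes_per_s * intensity_flops_per_byte)


def ridge_point(peak_gflops, mem_bw_gbytes_per_s):
    """Operational intensity (flops/byte) at which a kernel stops being
    memory-bound and becomes compute-bound."""
    return peak_gflops / mem_bw_gbytes_per_s


# Hypothetical machine parameters, chosen only for illustration.
MACHINES = {
    "cpu": {"peak_gflops": 100.0, "mem_bw": 40.0},
    "gpu": {"peak_gflops": 1000.0, "mem_bw": 350.0},
}

if __name__ == "__main__":
    for name, m in MACHINES.items():
        print(name, "ridge point:", ridge_point(m["peak_gflops"], m["mem_bw"]))
        for intensity in (0.25, 1.0, 4.0, 16.0):
            attainable = roofline_gflops(m["peak_gflops"], m["mem_bw"], intensity)
            print("  intensity", intensity, "->", attainable, "gflops")
```

Kernels whose operational intensity lies left of the ridge point are memory-bound, so faster ALUs do not help them; only more bandwidth or better data reuse does.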
[Figure: roofline plot of throughput (gflops) against operational intensity (flops/byte), with ceilings for GPU ALU throughput and CPU ALU throughput.]

CPU vs GPU Memory Hierarchy
Intel 8 core Sandy Bridge CPU:
• 4 kB registers: 5 TB/s
• 512 kB L1 Cache: 1 TB/s
• 2 MB L2 Cache
• 8 MB L3 Cache: 500 GB/s
• 10s GB Main Memory: 40 GB/s
NVIDIA GK110 GPU:
• 4 MB register file (!): 40 TB/s
• 1 MB Constant Mem: 13 TB/s
• 1 MB Shared Mem: 1 TB/s
• 1.5 MB L2 Cache: 500 GB/s
• 12 GB Main Memory: 350 GB/s

Roofline Design: Matrix kernels
• Dense matrix multiply
• Sparse matrix multiply
[Figure: roofline plot of throughput (gflops) against operational intensity (flops/byte), with ceilings for GPU ALU throughput and CPU ALU throughput.]

Convergence
• Current “compute” processors (NVIDIA GPUs, Intel Xeon Phi) live on a PCI bus, separated from the CPU/main memory.
• Intel’s next-generation processor “Knights Landing” puts the compute coprocessor on the motherboard. This has significant advantages for memory-bound algorithms.

How Much does Memory Matter?
Big Data about people (text, web, social media) follow power law statistics: Freq ∝ 1/rank.
[Figure: feature frequency against feature rank.]
About half of users/features are seen only once. The average user is seen log N times, where N = dataset size.

How Big Should Models Be?
Big Data about people (text, web, social media) follow power law statistics. Total data volume is approx. N log N for N features, so twice as much data means observations of about twice as many users.
Even single observations improve predictions, so more data (with proportionately bigger models) means more accuracy and more revenue.

How Big Should Models Be?
But to benefit from more data, we need bigger models (enough to model each user/feature).
Rough rule-of-thumb: Dataset size should be 10x-100x model size. Smaller models are probably underfitting. Larger models are typically overfitting.

Cluster Computing
Increasingly, the main argument for cluster-based ML is to support larger models through model parallelism.
Data parallelism: Data is split across nodes; each node has a local copy of a shared model.
Model parallelism: Model is split across nodes. No one has a complete copy. Nodes must communicate to update the model.

Calibration: GPU Memory vs Cluster B/W
Bisection B/W is the critical parameter for inference on large models.
• Memory bisection B/W for a current-generation GPU: 350 GB/s
• Network bisection B/W for a first-generation data center with 20,000 nodes: 220 GB/s

Calibration: GPU Memory vs Cluster B/W
Traditional data center topology (Google 2004):
• 512 racks of 40x 1 Gb/s hosts, 20,000 machines.
• Bisection bandwidth 20k / 10 / 8 = 220 GB/s
[Figure: 4x cluster routers with 512x 1 Gb/s downlinks and 4x 10 Gb/s sidelinks; 10x oversubscription; top-of-rack switches with 4x 1 Gb/s uplinks; ... 504 more racks.]

Roofline Design for Communication
• Networks have both throughput and latency limits.
[Figure: throughput (Mbytes/sec) against packet size (MB) for 40 Gbit InfiniBand and 10 Gbit Ethernet.]

Synchronizing models: Allreduce
Every node j has (model) data M_j. We want to compute an aggregate (e.g. sum) of all data vectors M and distribute it to all nodes.
Each node calls allreduce(Din, Dout, N).
Allreduce is the synchronous version of this process. Parameter servers approximate this with an asynchronous protocol.

Why Allreduce?
• Speed: Minibatch SGD and related algorithms are the fastest way to solve many of these problems. But multi-terabyte dataset sizes make single-machine training slow (> 1 day).
• The ratio of computation to communication varies, but often the network is the bottleneck.
• Therefore we want an Allreduce that consumes data at roughly the full NIC bandwidth for the nodes.

Why Allreduce?
• Scale: Some models won’t fit on one machine (e.g. PageRank).
• Other models will fit, but minibatch updates only touch a small fraction of features (click prediction). Updating only those is much faster than a full Allreduce.
• Sparse Allreduce (next time): Each node j pushes updates to a subset of features U_j and pulls a subset V_j from a distributed model.

Lower Bounds for total message B/W
• Intuitively, every node needs to send D values somewhere, and receive D values for the final aggregate.
• The data from each node needs to be caught somewhere if allreduce happens in the same network, and the final aggregate messages need to originate from somewhere.
• This suggests a naïve lower bound of 2D values in or out of each node, or 2DN for an N-node network.
• In fact, more careful analysis shows the lower bound is 2D(N-1) for total bandwidth, and ~ 2D/B for latency: Patarasuk et al., “Bandwidth optimal all-reduce algorithms for clusters of workstations”.

A B/W optimal network (dense data)
• This network uses 2D(N-1) total B/W.
• Every node sends all its data to a “server,” which sums and returns the results.
[Figure: clients connected to a single server.]
But this network has 2D(N-1)/B latency because all the data has to be processed by the server. Suboptimal by N-1.
Assumption: “Large messages,” time ∝ message size.

Parameter servers
• Potentially B/W optimal, but latency limited by net B/W of servers.
[Figure: clients connected to several servers.]
This network has at least 2D(N-1)/(PB) latency with P servers, suboptimal by (N-1)/P.

B/W and Latency-Optimal Networks
• Reduce-Scatter implementation (Hadoop, Spark, GraphLab, OpenMPI,…).
• Divide data on each node into N chunks (contiguous here).
• Each chunk is sent to a “home node” (gray), where the chunks are summed and eventually returned.
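The reduce-scatter scheme just described can be sketched as a small single-process simulation. This is a minimal illustration under stated assumptions, not how Hadoop, Spark, GraphLab or OpenMPI implement it: nodes are plain Python lists, "sending" is a list copy, and the function name reduce_scatter_allreduce and the values_sent counter are hypothetical names introduced here.

```python
def reduce_scatter_allreduce(node_data):
    """Simulate a sum-allreduce over N nodes: reduce-scatter, then all-gather.

    node_data: list of N equal-length lists, one per simulated node.
    Returns (results, values_sent), where results[j] is the vector node j
    holds at the end and values_sent counts values crossing the network.
    """
    n = len(node_data)
    d = len(node_data[0])
    # Contiguous chunk boundaries; chunk k has node k as its home node.
    bounds = [(k * d) // n for k in range(n + 1)]
    values_sent = 0

    # Reduce-scatter: every node sends chunk k to home node k, which sums it.
    reduced = []
    for k in range(n):
        lo, hi = bounds[k], bounds[k + 1]
        chunk_sum = [0] * (hi - lo)
        for j in range(n):
            if j != k:
                values_sent += hi - lo  # node j -> home node k
            for i in range(lo, hi):
                chunk_sum[i - lo] += node_data[j][i]
        reduced.append(chunk_sum)

    # All-gather: each home node returns its summed chunk to every other node.
    results = []
    for j in range(n):
        full = []
        for k in range(n):
            if j != k:
                values_sent += len(reduced[k])  # home node k -> node j
            full.extend(reduced[k])
        results.append(full)
    return results, values_sent


if __name__ == "__main__":
    data = [
        [1, 2, 3, 4],
        [10, 20, 30, 40],
        [100, 200, 300, 400],
        [1000, 2000, 3000, 4000],
    ]
    results, sent = reduce_scatter_allreduce(data)
    print(results[0])  # [1111, 2222, 3333, 4444]
    print(sent)        # 24, i.e. 2 * D * (N - 1) with D = 4 and N = 4
```

In this simulation the total traffic comes out at 2D(N-1) values, matching the total-bandwidth lower bound quoted above, while each node only handles chunks of size D/N at a time.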
B/W and Latency-Optimal Networks
• Reduce-Scatter implementation (Hadoop, Spark, GraphLab, OpenMPI,…).
• Latency = 2(N-1) × (D/N) / B (optimal), where 2(N-1) is nrounds, D/N is the message size, and B is the B/W.
• Intermediate result is a single distributed copy of the reduction.

B/W and Latency-Optimal Networks
• Ring topology.
• Since every message round is a circular permutation, can move the buckets instead of the data.
• Every bucket is full now. We can unload the contents of each bucket on the current node.
• Latency = 2(N-1) × (D/N) / B (optimal), where 2(N-1) is nrounds, D/N is the message size, and B is the node B/W.

B/W and Latency-Optimal Networks
• Butterfly topology.
• Progressive reduce/scatter on groups.
[Figure: butterfly network stages with msg size D/2, then size D/4.]
Latency = 2(D/2 + D/4 + …)/B = 2D(N-1)/N/B (optimal)

Allreduce Network Comparison
Topology    nrounds     msg size      nrounds with small msgs   bisection B/W   fault toler.
Shuffle     2(N-1)      D/N           2(N-1)                    ND/2            good
Ring        2(N-1)      D/N           2(N-1)                    2D              poor
Butterfly   2 log2 N    D/2,…,D/N     2                         D log2 N        fair/poor
• Some shuffle implementations (e.g. Dogwild, Baidu) are fully asynchronous (N-node parameter servers).
• It’s possible to use unreliable communication (Dogwild) with shuffle.
• All methods implement a reduce-scatter in the middle.
• Broadcast doesn’t help (all receivers fully occupied).

Roofline Design for Communication
Consequences:
• Designing with respect to packet size is critical for large networks.
• Simple MPI communication patterns aren’t optimal for commodity networks in practice.
• Using multithreading (send 8 packets “at once”) hides latency and improves throughput.

For Next Time
Distributed ML systems:
• Single-node parameter server + all-to-all reduce: Spark
• Vertex Model: GraphLab
• Sparse Allreduce: Kylix
Volunteers?
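As a supplement to the ring topology covered earlier in the deck, the following is a minimal single-process sketch of a ring allreduce, assuming N simulated nodes held as Python lists and a sum reduction. The function name ring_allreduce and the chunking scheme are illustrative assumptions; a real implementation would overlap sends and receives over the network.

```python
def ring_allreduce(node_data):
    """Simulate a sum-allreduce on a ring of N nodes.

    node_data: list of N equal-length lists, one per simulated node.
    Returns (buffers, rounds): buffers[j] is node j's final vector and
    rounds is the number of message rounds used.
    """
    n = len(node_data)
    d = len(node_data[0])
    bounds = [(k * d) // n for k in range(n + 1)]  # contiguous chunks
    bufs = [list(v) for v in node_data]            # work on copies
    rounds = 0

    # Phase 1, reduce-scatter: in round s, node j passes chunk (j - s) mod N
    # to its right neighbour, which adds it into its own copy.
    for s in range(n - 1):
        outgoing = []
        for j in range(n):  # snapshot first: all sends in a round are simultaneous
            k = (j - s) % n
            outgoing.append((k, bufs[j][bounds[k]:bounds[k + 1]]))
        for j, (k, payload) in enumerate(outgoing):
            dst = (j + 1) % n
            for i, v in enumerate(payload):
                bufs[dst][bounds[k] + i] += v
        rounds += 1

    # Node j now holds the fully reduced chunk (j + 1) mod N.
    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for s in range(n - 1):
        outgoing = []
        for j in range(n):
            k = (j + 1 - s) % n
            outgoing.append((k, bufs[j][bounds[k]:bounds[k + 1]]))
        for j, (k, payload) in enumerate(outgoing):
            dst = (j + 1) % n
            bufs[dst][bounds[k]:bounds[k] + len(payload)] = payload
        rounds += 1

    return bufs, rounds


if __name__ == "__main__":
    data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    bufs, rounds = ring_allreduce(data)
    print(bufs[0])  # [111, 222, 333]
    print(rounds)   # 4, i.e. 2 * (N - 1) with N = 3
```

Each round moves one chunk of about D/N values per node, so the simulation uses 2(N-1) rounds of D/N-sized messages, in line with the ring latency expression given in the slides.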