Performance Evaluation of InfiniBand NFS/RDMA for Linux

Benjamin Allan, Helen Chen, Scott Cranford, Ron Minnich, Don Rudish, and Lee Ward
Sandia National Laboratories

This work was supported by the United States Department of Energy, Office of Defense Programs. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed-Martin Company, for the United States Department of Energy under contract DE-AC04-94-AL85000.

Talk Outline
• Read/write performance
• Application profile
• System profile
• Network profile
• Infinitely fast file and disk I/O
• Infinitely fast network

Sandia's Motivation for Looking at NFS/RDMA
• Why NFS/RDMA, and why is Sandia looking at it?
– Usable on HPC platforms
– A transparent solution for applications
– In the mainstream kernel
– Increased performance over plain NFS

Reads vs. Writes, TCP vs. RDMA
• NFS/TCP read/write ratio: 2:1
• NFS/RDMA read/write ratio: 5:1
• Previous work: http://www.chelsio.com/nfs_over_rdma.html

The FTQIO Application
• FTQ (fixed time quantum)
– Simply put: rather than doing a fixed unit of work and measuring how long that work took, FTQ measures the amount of work done in a fixed time quantum.
• FTQ I/O
– FTQ modified to measure file-system performance by writing data and recording statistics.
– A high-resolution benchmark.
– More info: http://rt.wiki.kernel.org/index.php/FTQ
• How it works
– One thread of the program writes blocks of allocated memory to disk.
– A second thread records the number of bytes written and, optionally, Supermon data (more on Supermon later).
• Basic operation (a sketch of this loop follows the Baseline Approach slide)
– The loop counts work done until it reaches a fixed end point in time.
– It then records the starting point of the loop and the amount of work that was done.

FTQIO Application Profile
[Plot: bytes written during a 27-second FTQIO run.]
• Red dots represent data written in a 440-microsecond interval.
• Every 440 microseconds, FTQIO counts how many bytes it wrote and plots the total.

Application Profile with VMM Data
[Plot: bytes written during a 27-second FTQIO run, overlaid with virtual memory statistics.]
• Red dots represent bytes recorded in a 440-microsecond interval.
• Blue dots represent the number of dirty pages.
• Purple dots represent the number of pages in the writeback queue.
• Black dots represent when the application goes to sleep.

Adding Bytes Transmitted from the IB Card
[Plot: the same 27-second FTQIO run, with InfiniBand transmit data added.]
• Red dots represent bytes recorded in a 440-microsecond interval.
• Blue dots represent the number of dirty pages.
• Purple dots represent the number of pages in the writeback queue.
• Black dots represent when the application goes to sleep.
• Green dots represent the number of bytes transmitted on the InfiniBand card.

Baseline Approach
[Diagram: the full NFS/RDMA write path. On the client, the application writes through the VFS and page cache to the NFS client, then through RPC and the TCP/IP or RDMA transport, across PCI to the HCA, and over the IB fabric. On the server, the path runs back up through the HCA, the transport, and RPC into the NFS server, file system, block layer, controller, and disk. Two "short circuit" patch points are marked: one in the client's RPC transport and one in the server's write path.]
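To make the FTQ loop described above concrete, here is a minimal user-space sketch of a fixed-time-quantum measurement loop, assuming only POSIX clock_gettime(). The 440-microsecond quantum matches the interval in the plots; everything else (the output file name, sample count, and block size) is an illustrative assumption, not FTQ/FTQIO's actual code.

/*
 * Minimal FTQ-style sketch: instead of timing a fixed amount of work,
 * each iteration does as much work as fits in a fixed 440 us quantum
 * and records how much was done.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define QUANTUM_NS  440000ULL          /* 440 us, as in the plots   */
#define NUM_SAMPLES 1000               /* illustrative sample count */
#define BLOCK_SIZE  4096               /* illustrative block size   */

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
        static char block[BLOCK_SIZE];
        static uint64_t start[NUM_SAMPLES], bytes[NUM_SAMPLES];
        FILE *out = fopen("ftq.dat", "w");   /* hypothetical output file */

        if (!out)
                return 1;
        memset(block, 0xaa, sizeof(block));

        for (int i = 0; i < NUM_SAMPLES; i++) {
                uint64_t t0 = now_ns();
                uint64_t deadline = t0 + QUANTUM_NS;
                uint64_t done = 0;

                /* Count work done until the fixed end point in time. */
                while (now_ns() < deadline) {
                        fwrite(block, 1, sizeof(block), out);
                        done += sizeof(block);
                }

                /* Record the quantum's start and the work completed:
                 * one (time, bytes) dot per quantum in the plots. */
                start[i] = t0;
                bytes[i] = done;
        }
        fclose(out);

        for (int i = 0; i < NUM_SAMPLES; i++)
                printf("%llu %llu\n", (unsigned long long)start[i],
                       (unsigned long long)bytes[i]);
        return 0;
}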
Look at the Code
• Where to look? The Linux Cross Reference: http://lxr.linux.no
• How to trace? Ftrace:
– Comes with the kernel; no userspace programs required.
– Controlled through debugfs.
– http://rt.wiki.kernel.org/index.php/Ftrace

A function_graph trace of a dd process looks like this (excerpt):

# tracer: function_graph
#
# CPU  TASK/PID        DURATION                  FUNCTION CALLS
# |     |    |           |   |                     |   |   |   |
 0)   dd-2280   |               |  schedule_tail() {
 0)   dd-2280   |   0.000 us    |    finish_task_switch();
 0)   dd-2280   |   0.000 us    |    __might_sleep();
 0)   dd-2280   |   0.000 us    |    _cond_resched();
 0)   dd-2280   |   0.000 us    |    __task_pid_nr_ns();
 0)   dd-2280   |   0.000 us    |    down_read_trylock();
 0)   dd-2280   |   0.000 us    |    __might_sleep();
 0)   dd-2280   |   0.000 us    |    _cond_resched();
 0)   dd-2280   |   0.000 us    |    find_vma();
 0)   dd-2280   |               |    handle_mm_fault() {
 0)   dd-2280   |               |      kmap_atomic() {
 0)   dd-2280   |               |        kmap_atomic_prot() {
 0)   dd-2280   |   0.000 us    |          page_address();
 0)   dd-2280   |   0.000 us    |        }
 0)   dd-2280   |   0.000 us    |      }
 0)   dd-2280   |   0.000 us    |      _spin_lock();
 0)   dd-2280   |               |      do_wp_page() {
 0)   dd-2280   |   0.000 us    |        vm_normal_page();
 0)   dd-2280   |   0.000 us    |        reuse_swap_page();
 0)   dd-2280   |               |        unlock_page() {
 0)   dd-2280   |   0.000 us    |          __wake_up_bit();
 0)   dd-2280   |   0.000 us    |        }
 0)   dd-2280   |               |        kunmap_atomic() {
 0)   dd-2280   |   0.000 us    |          arch_flush_lazy_mmu_mode();
 0)   dd-2280   |   0.000 us    |        }
 0)   dd-2280   |               |        anon_vma_prepare() {
 0)   dd-2280   |   0.000 us    |          __might_sleep();
 0)   dd-2280   |   0.000 us    |          _cond_resched();
 0)   dd-2280   |   0.000 us    |        }
 0)   dd-2280   |               |        __alloc_pages_internal() {
 0)   dd-2280   |   0.000 us    |          __might_sleep();
 0)   dd-2280   |   0.000 us    |          _cond_resched();
 0)   dd-2280   |               |          get_page_from_freelist() {
 0)   dd-2280   |   0.000 us    |            next_zones_zonelist();
 0)   dd-2280   |   0.000 us    |            next_zones_zonelist();
 0)   dd-2280   |   0.000 us    |            zone_watermark_ok();
 0)   dd-2280   |   0.000 us    |          }
 0)   dd-2280   |   0.000 us    |        }

Infinitely Fast File and Disk I/O
• When the NFS server wants to write to a file, claim success immediately: the flag test below returns before nfsd_write() is called, so nothing reaches the file system or disk. In fs/nfsd/nfs3proc.c:

/*
 * Write data to a file
 */
static __be32
nfsd3_proc_write(struct svc_rqst *rqstp, struct nfsd3_writeargs *argp,
		 struct nfsd3_writeres *resp)
{
	__be32	nfserr;

	if (foobar_flag != '0') {
		resp->count = argp->count;
		RETURN_STATUS(0);
	}

	fh_copy(&resp->fh, &argp->fh);
	resp->committed = argp->stable;
	nfserr = nfsd_write(rqstp, &resp->fh, NULL,
			    argp->offset, rqstp->rq_vec, argp->vlen,
			    argp->len, &resp->committed);
	resp->count = argp->count;
	RETURN_STATUS(nfserr);
}

• Measured write throughput (MB/s) by RPC payload size and application record size:

RPC Payload (bytes)   32-KB Record   512-KB Record   1-MB Record
32768                     285.60         283.40         281.60
65536                     377.00         350.50         293.00
131072                    387.50         363.50         306.00
262144                    401.40         335.80         305.00
524288                    425.00         376.50         312.50

[Diagram: the client/server stack from the Baseline Approach slide, with the short circuit applied in the NFS server's write path, ahead of the file system, block layer, and disk.]

Infinitely Fast Network
• Remove the RDMA transport from the NFS write path; nothing goes out on the network.
• The RPC transmit returns claiming that the transmit completed and already has the reply, telling the NFS client that the page was committed.
• Max throughput of 1.25 GB/sec: a factor-of-3 improvement over sending the data over the wire.
[Diagram: the client/server stack, with the short circuit applied in the client's RPC/RDMA transport, ahead of the HCA and IB fabric.]

Recap and Conclusion
[Diagram: the client/server stack annotated with measured rates: 377 MB/sec through the full RPC/RDMA write path, 1.25 GB/sec with the transport short-circuited, and 1.8 GB/sec across the PCI/HCA/IB fabric link.]
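Both "infinitely fast" experiments use the same trick: replace one layer with a stub that claims success, then measure the throughput ceiling imposed by the layers above it. The user-space sketch below illustrates that pattern; the names wire_send() and stub_send(), and the memcpy stand-in for a real transport, are assumptions for illustration, not the talk's kernel patches.

/*
 * Short-circuit methodology sketch: swap a "transport" for a stub
 * that claims success, and compare throughput with and without it.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

#define BLOCK (64 * 1024)
#define TOTAL (512ULL * 1024 * 1024)

typedef ssize_t (*send_fn)(const char *buf, size_t len);

/* Stand-in for a real transport: pay a per-byte copy cost. */
static ssize_t wire_send(const char *buf, size_t len)
{
        static char sink[BLOCK];
        memcpy(sink, buf, len <= sizeof(sink) ? len : sizeof(sink));
        return (ssize_t)len;
}

/* Short-circuited transport: claim the data was sent, touch nothing. */
static ssize_t stub_send(const char *buf, size_t len)
{
        (void)buf;
        return (ssize_t)len;
}

static double run(send_fn send)
{
        static char block[BLOCK];
        struct timespec a, b;
        uint64_t sent = 0;

        clock_gettime(CLOCK_MONOTONIC, &a);
        while (sent < TOTAL)
                sent += (uint64_t)send(block, sizeof(block));
        clock_gettime(CLOCK_MONOTONIC, &b);

        double secs = (double)(b.tv_sec - a.tv_sec) +
                      (double)(b.tv_nsec - a.tv_nsec) / 1e9;
        return (double)sent / secs / (1 << 20);   /* MB/s */
}

int main(void)
{
        /* The stub number is an upper bound, not a realistic rate;
         * the gap between the two shows what the stubbed layer costs. */
        printf("wire: %.1f MB/s\n", run(wire_send));
        printf("stub: %.1f MB/s\n", run(stub_send));
        return 0;
}

The gap between the two numbers bounds how much the stubbed layer can matter, which is how the talk separates the 377 MB/sec, 1.25 GB/sec, and 1.8 GB/sec figures in the recap.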