Storage Systems CSE 598d, Spring 2007
Lecture 15: Consistency Semantics, Introduction to Network-Attached Storage
March 27, 2007

Agenda
• Last class
  – Consistency models: brief overview
• Next
  – More details on consistency models
  – Network storage introduction
    • NAS vs. SAN
    • DAFS
    • Some relevant technology and systems innovations: FC, smart NICs, RDMA, …
  – A variety of topics on file systems (and other storage-related software)
    • Log-structured file systems
    • Databases and file systems compared
    • Mobile/poorly connected systems; highly distributed and P2P storage
    • NFS, the Google File System
    • Asynchronous I/O
    • Flash-based storage
    • Active disks, object-based storage devices (OSD)
    • Archival and secure storage
    • Storage virtualization and QoS
  – Reliability; (emerging) miniature storage devices

Problem Background and Definition
• Consistency issues were first studied in the context of shared-memory multiprocessors, and we will start our discussion in the same context
  – The ideas generalize to any distributed system with shared storage
• The memory consistency model (MCM) of an SMP provides a formal specification of how the memory system will appear to the programmer
  – It places restrictions on the values that can be returned by a read in a shared-memory program execution
  – An MCM is a contract between the memory and the programmer
• Why different models?
  – There are trade-offs between the "strictness" of the consistency guarantees, the implementation effort (hardware, compiler, programmer), and system performance

Atomic/Strict Consistency
• The most intuitive, naturally appealing model
• Any read of a memory location x returns the value stored by the most recent write to x
• Defined w.r.t. a "global" clock
  – That is the only way "most recent" can be defined unambiguously
• Uniprocessors typically observe such consistency
  – A programmer on a uniprocessor naturally assumes this behavior
  – E.g., as a programmer, one would not expect the code segment "A = 1; A = 2; print(A);" to print 1, or any value other than 2
  – The compiler and hardware may still re-order instructions to improve throughput: atomic consistency is preserved as long as data and control dependencies are respected
• Often used as the base model against which the MCMs we will see next are evaluated

Atomic/Strict Consistency (cont'd)
• What happens on a multiprocessor?
  – Even on the smallest and fastest multiprocessor, a global clock cannot be achieved, so achieving atomic consistency is not possible either
  – But this is not a hindrance: programmers manage quite well with something weaker than atomic consistency
  – What behavior do we expect when we program on a multiprocessor?
    • What we DO NOT expect: a global clock
    • What we DO expect: operations from a single process will execute sequentially
      – Again: "A = 1; A = 2; print(A)" should not print 1
    • We can then use critical sections/mutual exclusion to enforce the desired order among instructions coming from different processors
  – So we expect an MCM less strict than atomic consistency. What is this consistency model, what are its properties, and what does the hardware/software (compiler) have to do to provide it?

Sequential Consistency
• What we typically expect from a shared-memory multiprocessor system is captured by sequential consistency
  – Lamport [1979]: a multiprocessor is sequentially consistent if the result of any execution is the same as if
    • the operations of all the processors were executed in some sequential order (that is, memory accesses occur atomically w.r.t. other memory accesses), and
    • the operations of each individual processor appear in this sequence in the order specified by its program
  – Equivalently, any valid interleaving is acceptable, as long as all processes see the same ordering of memory references
  – Programmer's view: [Figure: processors P1, P2, …, Pn all accessing a single shared memory]
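Sequential consistency can be checked mechanically for small executions: enumerate every interleaving that preserves each processor's program order, and ask whether any of them is legal for a memory in which reads return the last value written. A minimal sketch (the operation encoding is invented for illustration), applied to the executions on the next slides:

```python
def interleavings(seqs):
    """All merges of the per-process op sequences that preserve program order."""
    live = [s for s in seqs if s]
    if not live:
        yield []
        return
    for i, s in enumerate(live):
        for tail in interleavings(live[:i] + [s[1:]] + live[i + 1:]):
            yield [s[0]] + tail

def legal(history):
    """Each read must return the most recent write (locations start at 0)."""
    mem = {}
    for op, loc, val in history:
        if op == 'W':
            mem[loc] = val
        elif mem.get(loc, 0) != val:
            return False
    return True

def sequentially_consistent(*procs):
    return any(legal(h) for h in interleavings(list(procs)))

# The execution from the next slide: SC even though not atomically consistent.
P1 = [('W', 'x', 1)]
P2 = [('W', 'y', 2)]
P3 = [('R', 'y', 2), ('R', 'x', 0), ('R', 'x', 1)]
print(sequentially_consistent(P1, P2, P3))        # True

# Add a reader that sees the writes in the opposite order: no longer SC.
P4 = [('R', 'x', 1), ('R', 'y', 0), ('R', 'y', 2)]
print(sequentially_consistent(P1, P2, P3, P4))    # False
```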
Example: Sequential Consistency

  P1: W(x)1
  P2:        W(y)2
  P3:               R(y)2  R(x)0  R(x)1

• Not atomically consistent, because:
  – R(y)2 by P3 reads a value that (on a global clock) has not been written yet
  – W(x)1 and W(y)2 appear commuted at P3
• But sequentially consistent
  – SC has no notion of a global clock
• What about adding a second reader that sees the same order?

  P1: W(x)1
  P2:        W(y)2
  P3:               R(y)2  R(x)0  R(x)1
  P4:               R(y)2  R(x)0  R(x)1

  – Still sequentially consistent: both readers observe W(y)2 before W(x)1
• And what about a second reader that sees the opposite order?

  P1: W(x)1
  P2:        W(y)2
  P3:               R(y)2  R(x)0  R(x)1
  P4:               R(x)1  R(y)0  R(y)2

  – Not sequentially consistent: P3 requires W(y)2 to precede W(x)1, while P4 requires W(x)1 to precede W(y)2

Causal Consistency
• Hutto and Ahamad, 1990
• Each operation is either "causally related to" or "concurrent with" another
  – When a processor performs a read followed later by a write, the two operations are causally related, because the value stored by the write may have depended on the result of the read
  – A read operation is causally related to the earlier write that stored the data retrieved by the read
  – Transitivity applies
  – Operations that are not causally related are said to be concurrent
• A memory is causally consistent if all processors agree on the order of causally related writes
  – Weaker than SC, which requires all writes to be seen in the same order

  P1: W(x)1                      W(x)3
  P2:        R(x)1  W(x)2
  P3:        R(x)1                      R(x)3  R(x)2
  P4:        R(x)1                      R(x)2  R(x)3

  – W(x)1 and W(x)2 are causally related (via P2's read); W(x)2 and W(x)3 are not causally related, so P3 and P4 may see them in different orders

Summary: Uniform MCMs
• Atomic consistency
• Sequential consistency
• Causal consistency
• Processor consistency
• Cache consistency
• PRAM consistency
• Slow memory

UNIX and Session Semantics
• UNIX file-sharing semantics on a uniprocessor system:
  – When a read follows a write, the read returns the value just written
  – When two writes happen in quick succession, followed by a read, the value read is the one stored by the last write
• Problematic for a distributed system
  – Theoretically achievable with a single file server and no client caching
• Session semantics
  – Writes are made visible to others only upon the closing of the file

Delta Consistency
• Any write will become visible within at most delta time units
  – Barring network latency
  – Meanwhile … all bets are off!
  – Push versus pull
  – Compare with sequential, causal, etc. in terms of the valid orderings of operations
• Related: mutual consistency with parameter delta
  – A given set of "objects" are within delta time units of each other at all times, as seen by a client
  – Note that it is OK to be stale with respect to the server by more than delta!
  – Generally, two parameters are specified:
    • delta1: freshness w.r.t. the server
    • delta2: mutual consistency of related objects
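A pull-based flavor of delta consistency fits in a few lines: a read may return a cached value, but never one more than delta seconds stale. This is a minimal sketch; the class name and the fetch callback are invented for illustration:

```python
import time

class DeltaCache:
    """Pull-based delta consistency: reads are at most `delta` seconds stale."""
    def __init__(self, fetch, delta):
        self.fetch, self.delta = fetch, delta   # fetch() reads from the server
        self.value, self.stamp = None, float('-inf')

    def read(self):
        if time.monotonic() - self.stamp > self.delta:
            self.value = self.fetch()           # refresh from the server
            self.stamp = time.monotonic()
        return self.value                       # possibly up to delta stale

cache = DeltaCache(fetch=lambda: "object contents", delta=5.0)
print(cache.read())   # first read always fetches; later reads may be served stale
```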
File System Consistency Semantics
• What is involved in providing these semantics?
• UNIX semantics: easy to implement on a uniprocessor
• Session semantics: session state at the server
• Delta consistency: timeouts, leases
• Meta-data consistency
  – Some techniques we have seen:
    • Journaling, LFS, meta-data journaling (ext3)
    • Synchronous writes
    • NVRAM: expensive, not always available
  – Disk-scheduler-enforced ordering
    • The file system passes sequencing restrictions to the disk scheduler
    • Problem: the disk scheduler cannot enforce an ordering among requests that are not yet visible to it
  – Soft updates
    • Dependency information is maintained for meta-data blocks in the write-back cache, at per-field and/or per-pointer granularity

Network-Attached Storage
• An introduction to the important ideas and technologies
• Lots of slides; we will cover some in class and post all of them on Angel
• Subsequent classes will cover some topics in depth

Direct-Attached Storage
• Problems/shortcomings in enterprise/commercial settings:
  – Sharing data is difficult
  – Programming and client access are inconvenient
  – Storage capacity is wasted
  – More?

"Remote" Storage
• Idea: separate storage from the clients and application servers, and locate it on the other side of a scalable networking infrastructure
  – We will soon see variants on this idea
• Advantages
  – Less wasted capacity: devices are pooled, consolidating unused capacity formerly spread over many directly-attached storage devices
  – Reduced time to deploy new storage
  – Backup made more convenient (application-server involvement removed)
  – Management simplified by centralizing storage under a consolidated manager interface
  – Availability (potentially) improved
    • Client software is designed to tolerate dynamic changes in network resources, but not changes to the local storage configuration while the client is operating
    • All software and hardware is specifically developed and tested to run together
• Disadvantages
  – Complexity; more expertise needed, implying more set-up and management cost

Network-Attached Storage (NAS)
• A file interface is exported to the rest of the network

Storage Area Network (SAN)
• A block interface is exported to the rest of the network

SAN versus NAS
[Figure: Source: Communications of the ACM, Vol. 43, No. 11, November 2000]

Differences between NAS and SAN
• NAS
  – TCP/IP or UDP/IP protocols over Ethernet networks
  – High-level requests and responses for files
  – NAS devices translate file requests into operations on disk blocks
  – Cheaper
• SAN
  – Fibre Channel and SCSI
  – More scalable
  – Clients translate file accesses into operations on specific disk blocks (data is accessed at the block level)
  – Expensive
  – Storage traffic is separated from general network traffic, which benefits both security and performance

NAS File Servers
• Pre-configured file servers, consisting of one or more internal servers with pre-configured capacity
• Run a stripped-down OS: any component not associated with file service is discarded, which makes it more efficient than a general-purpose OS
• Connected via Ethernet to the LAN
• Offer plug-and-play functionality
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS, iSCSI and InfiniBand, by Ulf Troppens, Rainer Erkens, and Wolfgang Müller

NAS Network Performance
• NAS devices and traditional network file systems use IP-based protocols over commodity NICs
• A consequence of this deployment is poor network performance. The main culprits often cited are:
  – Protocol processing in network stacks
  – Memory copying
  – Kernel overhead, including system calls and context switches
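To make these culprits concrete, here is a back-of-the-envelope CPU cost model: a fixed cost per packet (protocol code, buffer management, interrupts) plus a per-byte cost paid once per copy or checksum pass. All constants are invented for illustration; real values depend on the CPU, NIC, and MTU:

```python
def cpu_seconds(nbytes, mtu=1460, per_packet=10e-6, per_byte=1e-9, passes=3):
    """Toy model: per-packet overhead + per-byte cost for each data pass."""
    packets = -(-nbytes // mtu)               # ceiling division
    return packets * per_packet + nbytes * per_byte * passes

moved = 100 * 2**20                           # a 100 MB transfer
print(cpu_seconds(moved))                     # CPU time burned in the stack
print(cpu_seconds(moved, passes=1))           # what copy avoidance might save
```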
NAS Network Performance
[Figure: sources of TCP/IP overhead]

NAS Network Performance: Protocol Processing
• Data transmission involves the OS services for memory and process management, the TCP/IP protocol stack, and the network device with its device driver
• The per-packet costs include the overhead of executing the TCP/IP protocol code, allocating and releasing memory buffers, and taking device interrupts for packet arrival and transmit completion
• The per-byte costs include the overhead of moving data within the end-to-end system and of computing checksums to detect data corruption in the network

NAS Network Performance: Memory Copying
• Current data-transmission implementations require the same data to be copied at several stages
• An NFS client requesting data stored on a NAS server with an internal SCSI disk involves:
  – a hard disk to RAM transfer, over the SCSI, PCI, and system buses
  – a RAM to NIC transfer, over the system and PCI buses
• Traditional NFS further involves a transfer from application memory to the kernel buffer cache of the transmitting computer before forwarding to the network card

Accelerating Performance
• Two starting points for accelerating network file system performance:
  – The underlying communication protocol. TCP/IP was designed to provide a reliable framework for data exchange over an unreliable network; the TCP/IP stack is complex and CPU-intensive. Example alternative: VIA/RDMA
  – The network file system. Develop new network file systems that assume a reliable network connection; network file systems could then use thinner communication protocols. Example alternative: DAFS

Proposed Solutions
• TCP/IP offload engines (TOEs)
  – An increasing number of network adapters can compute the Internet checksum
  – Some adapters can now perform TCP or UDP protocol processing
• Copy avoidance
  – Several buffer-management schemes have been proposed to reduce or eliminate data copying

Proposed Solutions (cont'd)
• Fibre Channel
  – Reduces communication overhead by offloading transport processing to the NIC instead of using the host processor
  – Zero copying is facilitated by direct communication between host memory and the NIC device
• Direct-access transport
  – Requires NIC support for remote DMA
  – User-level networking: a user-mode process interacts directly with the NIC to send or receive messages, with minimal kernel intervention
  – A reliable message-transport network

Proposed Solutions: NIC Support Mechanism
• The NIC exposes an array of connection descriptors in the system's physical address space
• At connection-setup time, the network device driver maps a free descriptor into the user virtual address space
• This grants the user process direct and safe access to the NIC's buffers and registers
• This facilitates user-level networking and copy avoidance

Proposed Solutions: User-Level File System
• Kernel policies for file-system caching and prefetching do not favor some applications
• Migrating OS functions into user-level libraries gives user applications more control and specialization
• Clients run in user mode as libraries linked directly with applications, reducing the overhead of system calls
• Clients may evolve independently of the operating system
• Clients can also run on any OS, with no special kernel support except the NIC device driver
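Copy avoidance has a mainstream analogue on commodity operating systems: os.sendfile asks the kernel to move file bytes from the page cache to a socket without surfacing them in user space. A minimal sketch, assuming an already-accepted connection and a platform that supports sendfile (e.g., Linux); error handling and partial-send retries are simplified:

```python
import os
import socket

def serve_file(conn: socket.socket, path: str) -> None:
    """Send a whole file over a connected socket with no user-space data copy."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            # Kernel moves page-cache data directly to the socket buffers.
            sent += os.sendfile(conn.fileno(), f.fileno(), sent, size - sent)
```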
Virtual Interface and RDMA
• The Virtual Interface Architecture (VIA) facilitates fast and efficient data exchange between applications running on different machines
• VIA reduces complexity by allowing applications (VI consumers) to communicate directly with the network card (VI NIC) via common memory areas, bypassing the operating system
• The VI provider is the NIC together with its device driver
• RDMA is a communication model, supported on VIA, that allows applications to read and write memory areas of processes running on other computers

VI Architecture and RDMA
[Figure: Source: Troppens, Erkens, and Müller, Storage Networks Explained]

Remote DMA (RDMA): VIA Model
[Figure: the sender posts a send descriptor for a registered send buffer in its user address space and rings the send doorbell; the NIC (LANai, on Myrinet) moves the data as packets to the receiving NIC, which matches a posted receive descriptor and delivers into the registered receive buffer]

InfiniBand
• "Infinite Bandwidth"
• A switch-based I/O interconnect architecture with a low-pin-count serial physical layer
• The InfiniBand Architecture (IBA) defines a System Area Network (SAN)
  – An IBA SAN is a communications and management infrastructure for I/O and IPC
• IBA defines a switched communications fabric with high bandwidth and low latency
• Backed by top companies in the industry: Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun

Limits of the PCI Bus
• Peripheral Component Interconnect (PCI)
  – Introduced in 1992; has become the standard bus architecture for servers
  – PCI bus: 32-bit/33 MHz, later 64-bit/66 MHz
  – PCI-X: the latest version runs 64 bits wide at PCI-X 66, PCI-X 133, PCI-X 266, and PCI-X 533 [4.3 GB/s]
  – Other PCI concerns: bus sharing, bus speed, scalability, fault tolerance

PCI Express
• A high-speed point-to-point architecture that is essentially a serialized, packetized version of PCI
• A general-purpose serial I/O bus for chip-to-chip communication, USB 2.0 / IEEE 1394b interconnects, and high-end graphics (a viable AGP replacement)
• Bandwidth: 4 Gb/s full duplex per lane; up to 32 separate lanes, for 128 Gb/s
• Software-compatible with the PCI device-driver model
• Expected to coexist with, not displace, technologies like PCI-X for the foreseeable future

Benefits of IBA
• Bandwidth
• An open and industry-inclusive standard
• Improved connection flexibility and scalability
• Improved reliability
• Offloads communications processing from the OS and CPU
• Wide access to a variety of storage systems
• Simultaneous device communication
• Built-in security and quality of service
• Support for Internet Protocol version 6 (IPv6)
• Fewer and better-managed system interrupts
• Support for up to 64,000 addressable devices
• Support for copper cable and optical fiber
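Both PCI Express and InfiniBand serialize with 8b/10b encoding, so the usable data rate is 80% of the signaling rate. This one-liner reproduces the PCI Express figure above and the InfiniBand physical-layer table on the next slide:

```python
def data_gbps(signaling_gbps, lanes=1):
    """Usable data rate of an 8b/10b-encoded serial link, per direction."""
    return signaling_gbps * 0.8 * lanes

print(data_gbps(2.5))        # PCIe lane / InfiniBand 1x: 2.0 Gb/s per direction
print(data_gbps(2.5, 4))     # InfiniBand 4x:  8.0 Gb/s
print(data_gbps(2.5, 12))    # InfiniBand 12x: 24.0 Gb/s
```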
InfiniBand Components
• Host Channel Adapter (HCA): the interface to a host; supports all software verbs
• Target Channel Adapter (TCA): provides the connection from InfiniBand to an I/O device
• Switch: the fundamental component of an IB fabric; many HCAs and TCAs connect to it, and it handles the network traffic
• Router: forwards data packets from the local network to external subnets
• Subnet Manager: an application responsible for configuring the local subnet and ensuring its continued operation

An IBA SAN
[Figure]

InfiniBand Layers: Physical Layer

  Link   Pin count   Signaling rate   Data rate   Full-duplex data rate
  1x     4           2.5 Gb/s         2 Gb/s      4 Gb/s (500 MB/s)
  4x     16          10 Gb/s          8 Gb/s      16 Gb/s (2 GB/s)
  12x    48          30 Gb/s          24 Gb/s     48 Gb/s (6 GB/s)

InfiniBand Layers: Link Layer
• Central to IBA: covers packet layout, point-to-point link operation, switching within a local subnet, and data integrity
• Packets: data packets and management packets
• Switching: data forwarding within a local subnet
• QoS, supported by virtual lanes
  – A virtual lane is a unique logical communication link sharing a single physical link
  – Up to 15 virtual lanes per physical link (VL0–VL15)
  – Each packet is assigned a priority
• Credit-based flow control, used to manage data flow between two point-to-point links
• Integrity checks using CRCs

InfiniBand Layers (cont'd)
• Network layer
  – Responsible for routing packets from one subnet to another
  – The global route header (GRH) in a packet carries an IPv6 address for the packet's source and destination
• Transport layer
  – Handles in-order packet delivery, as well as partitioning, multiplexing, and the transport services that provide reliable connections

InfiniBand Architecture: The Queue Pair Abstraction
• Two queues of communication metadata (send and receive)
• Registered buffers to send from / receive into
Source: "Architectural Interactions of I/O Networks and Inter-networks", Philip Buonadonna, Intel Research & University of California, Berkeley

Direct Access File System (DAFS)
• A new network file system derived from NFS version 4
• Tailored to use remote DMA (RDMA), which requires the Virtual Interface (VI) framework
• Introduced to combine the low overhead of SAN products with the generality of NAS file servers
• Communication between a DAFS server and its clients is done through RDMA
• Client-side caching of locks for easier subsequent access to the same file
• Clients can be implemented as a shared library in user space, or in the kernel

DAFS Architecture
[Figure: Source: Troppens, Erkens, and Müller, Storage Networks Explained]

DAFS Protocol
• Defined as a set of request and response formats and their semantics
• Defines recommended procedural APIs for accessing DAFS services from a client program
• Assumes a reliable network transport and offers server-directed command flow
• Each operation is a separate request, but request chaining is also supported
• Defines features for session recovery, and locking primitives

Direct Access Data Transfer
• DAFS supports direct variants of data-transfer operations such as read, write, setattr, etc.
• Direct transfer operations move data to and from client-provided memory using RDMA read and write operations
• The client registers each memory region with the local kernel before requesting direct I/O on that region
• The API defines register and unregister primitives for memory-region management; register returns a region descriptor
• Registration issues a system call to pin the buffer region in physical memory, then loads page translations for the region into a lookup table on the NIC
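A hypothetical sketch of what register/unregister manage. All names here are invented; a real client would issue a system call to pin the pages and would program the NIC's translation table, while this sketch only records the association:

```python
import mmap

nic_table = {}        # stand-in for the NIC's address-lookup table
_next_desc = 0

def register(nbytes: int):
    """Pin-and-map sketch: returns (buffer, region descriptor)."""
    global _next_desc
    buf = mmap.mmap(-1, nbytes)       # page-aligned anonymous buffer
    _next_desc += 1
    # A real implementation would pin these pages (so DMA targets cannot be
    # paged out) and load their translations into the NIC here.
    nic_table[_next_desc] = buf
    return buf, _next_desc

def unregister(desc: int) -> None:
    # Drop the NIC translations and release (unpin) the pages.
    nic_table.pop(desc).close()

buf, desc = register(64 * 1024)       # desc plays the role of a region token
unregister(desc)
```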
DAFS: RDMA Operations
• RDMA operations for direct I/O are initiated by the server
• A client write request to the server includes a region token for the buffer containing the data
• The server issues an RDMA read to fetch the data from the client, and responds with a write-request response after the RDMA completes

DAFS: Asynchronous I/O and Prefetching
• DAFS supports a fully asynchronous API, which lets clients pipeline I/O operations and overlap them with application processing
• An event-notification mechanism delivers asynchronous completions; a client may create several completion groups
• DAFS can be implemented as a user library linked with applications, or within the kernel

DAFS and NFS Client Architectures
[Figure: Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html]

DAFS Server Design and Implementation
• The kernel server is designed as an event-driven state machine
• The main events triggering state transitions are recv_done, send_done, and bio_done
[Figure 1: an event-driven DAFS server. Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html]

DAFS Event Handlers
• Each network or disk event is associated with a handler routine
• recv_done: a client-initiated transfer is complete. Asserted by the NIC; initiates the processing of an incoming RPC request
• send_done: a server-initiated transfer is complete. The handler releases all the locks involved in the RDMA operation and returns an RPC response
• bio_done: a block I/O request to disk is complete. Raised by the disk controller; wakes up any thread blocked on a previous disk I/O

DAFS Server Design and Implementation (cont'd)
• The server performs disk I/O through the zero-copy buffer-cache interface, which supports locking pages and their mappings
• Buffers involved in RDMA must stay locked for the entire duration of the transfer
• Transfers are initiated by RPC handlers, and processing is asynchronous
• The kernel buffer-cache manager registers and de-registers buffer mappings with the NIC on the fly, as physical pages are added to or removed from the buffers
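A toy rendition of this event-driven state machine. The event names follow the slides; the queue and handler bodies are invented stand-ins for NIC and disk-controller interrupts:

```python
import queue

events = queue.Queue()

def recv_done(req):    # NIC: client-initiated transfer done -> process the RPC
    print("start RPC:", req)

def send_done(resp):   # NIC: server-initiated transfer done -> release RDMA locks
    print("unlock and complete:", resp)

def bio_done(blk):     # disk controller: block I/O done -> wake the blocked thread
    print("wake waiter on block", blk)

HANDLERS = {"recv_done": recv_done, "send_done": send_done, "bio_done": bio_done}

for ev in [("recv_done", "READ fh=7 off=0 len=8192"),
           ("bio_done", 42),
           ("send_done", "RDMA write for fh=7")]:
    events.put(ev)

while not events.empty():
    kind, arg = events.get()
    HANDLERS[kind](arg)    # dispatch each event to its handler routine
```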
DAFS Server Design and Implementation (cont'd)
• The server creates multiple kernel threads to provide I/O concurrency
• A single listener thread monitors for new transport connections; other worker threads handle data transfer
• An arriving message generates a recv_done interrupt, which is processed by a single handler for the completion group
• The handler queues the incoming RPC request and invokes a worker thread to start processing
• A worker thread locks all the necessary file pages in the buffer cache, creates RDMA descriptors, and issues RDMA operations
• After RDMA completion, a send_done signal initiates the cleanup and release of all resources associated with the completed operation

Communication Alternatives
[Figure: Source: Troppens, Erkens, and Müller, Storage Networks Explained]

Experimental Setup
[Figure: Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html]

Experimental Setup: System Configuration
• Pentium III 800 MHz clients and servers
• Server cache 1 GB; 133 MHz memory bus
• 9 GB disks, 10K RPM Seagate Cheetah; 64-bit/33 MHz PCI bus
• VI over a Giganet cLAN 1000 adapter (DAFS)
• UDP/IP over Gigabit Ethernet, Alteon Tigon-II adapters (NFS)

Experimental Setup (cont'd)
• The NFS block I/O transfer size is set at mount time
• Packets are sent as fragmented UDP packets
• Interrupt coalescing is set high on the Tigon-II; checksum offloading is enabled
• NFS-nocopy required modifying the Tigon-II firmware, the IP fragmentation code, the file-cache code, the VM system, and the Tigon-II driver, to support header splitting and page remapping

Experimental Results
• [Table: one-byte round-trip latency and bandwidth]
• The higher latency with the Tigon-II is due to the datapath crossing the kernel UDP/IP stack

Experimental Results: Bandwidth and Overhead
• The server is pre-warmed with a 768 MB dataset
• The experiments are designed to stress network data transfer, so client caching is not considered
• Sequential configuration: the DAFS client used the asynchronous I/O API; NFS had read-ahead enabled
• Random configuration: NFS was tuned for best-case performance at each request size by selecting a matching NFS transfer size
[Figures: bandwidth and overhead results]

Experimental Results: TPIE Merge
• The sequential record-merge program combines n sorted input files, of x y-byte records each, into a single sorted output file
• Depicts raw sequential I/O performance with varying amounts of processing
• Performance is limited by the client CPU
[Figures: TPIE merge results]

Experimental Results: PostMark
• A synthetic benchmark for measuring file-system performance over workloads composed of many short-lived, relatively small files
• Creates a pool of files with random sizes, then applies a sequence of file operations

Experimental Results: Berkeley DB
• A synthetic workload composed of read-only transactions, each processing one small record chosen at random from a B-tree
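For flavor, here is a PostMark-style workload in miniature, following the description above: create a pool of files of random sizes, then run a mix of reads and appends against randomly chosen files. File counts and sizes are arbitrary, not PostMark's defaults:

```python
import os
import random
import tempfile

random.seed(0)
workdir = tempfile.mkdtemp()
pool = []

for i in range(100):                                   # creation phase
    path = os.path.join(workdir, f"file{i}")
    with open(path, "wb") as f:
        f.write(os.urandom(random.randint(512, 8192)))
    pool.append(path)

for _ in range(1000):                                  # transaction phase
    path = random.choice(pool)
    if random.random() < 0.5:
        with open(path, "rb") as f:                    # whole-file read
            f.read()
    else:
        with open(path, "ab") as f:                    # small append
            f.write(os.urandom(1024))
```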
Disk Storage Interfaces
• Parallel ATA (IDE, E-IDE)
• Serial ATA (SATA)
• Small Computer System Interface (SCSI)
• Serial Attached SCSI (SAS)
• Fibre Channel (FC)
Source: "It's More Than the Interface", Gordy Lutz, Seagate, August 2002

Parallel ATA
• 16-bit bus: two bytes per bus transaction
• 40-pin connector
• Master/slave shared bus
• Bandwidth: 25 MHz strobe × 2 (double-data-rate clocking) × 16 bits per edge ÷ 8 bits per byte = 100 MB/s

Serial ATA (SATA)
• 7-pin connector
• Point-to-point connections, for dedicated bandwidth
• Bit-by-bit transmission: one signal path for data, the other for acknowledgement
• Bandwidth: 1500 MHz embedded clock × 1 bit per clock × 0.8 (8b/10b encoding) ÷ 8 bits per byte = 150 MB/s
• Roadmap: 2002 → 150 MB/s; 2004 → 300 MB/s; 2007 → 600 MB/s

8b/10b Encoding
• An IBM patent
• Used in SATA, SAS, FC, and InfiniBand
• Converts 8 bits of data into a 10-bit code
• Provides better synchronization than Manchester encoding

Small Computer Systems Interface (SCSI)
• SCSI serves the high-performance storage market
• SCSI-1 was proposed in 1986
• Parallel interface, 50/68 pins
• Maximum cabling distance is 12 meters; terminators required
• Bus width is 8 bits (narrow); 16 devices per bus
• The device with the highest priority wins the bus during arbitration

SCSI (cont'd)
• Peer-to-peer connection (channel)
• Hot repair not provided
• Multiple buses needed beyond 16 devices
• Low bandwidth; distance limitations

SCSI Roadmap
• Wide SCSI (16-bit bus)
• Fast SCSI (double data rate)

Serial Attached SCSI (SAS)
• ANSI standard in 2003
• Interoperable with SATA
• Full duplex; dual-port
• 128 devices; 10 meters

Dual Port
• ATA, SCSI, and SATA support a single port, so the controller is a single point of failure
• SAS and FC support dual ports

SAS Roadmap
http://www.scsita.org/aboutscsi/sas/SAS_roadmap2004.html

Fibre Channel (FC)
• Developed as a backbone technology for LANs
• The name is a misnomer: it runs on copper (a 4-wire cable) as well as optical fiber
• 10 km or less per link
• 126 devices per loop; no terminators
• Installed base of Fibre Channel devices*: $2.45 billion in FC HBAs and $5.4 billion in FC switches in 2005
  *Source: Gartner, Dec 13, 2001

FC (cont'd)
• Advantages
  – High bandwidth
  – Secure
  – Zero-copy send and receive
  – Low host CPU utilization
  – FCP (the Fibre Channel Protocol)
• Disadvantages
  – Not a wide-area network
  – Requires a separate physical network infrastructure
  – Expensive
  – Different management mechanisms
  – Interoperability issues between different vendors

Fibre Channel Topologies
[Figure: Source: Troppens, Erkens, and Müller, Storage Networks Explained]

Fibre Channel Ports
• N-Port: node port
• F-Port: fabric port
• L-Port: loop port; connects only to an arbitrated loop
• E-Port: expansion port; connects two switches
• G-Port: generic port
• B-Port: bridge port; bridges to other networks (IP, ATM, etc.)
• NL-Port: node loop port; can connect both to a fabric and to an arbitrated loop
• FL-Port: fabric loop port; lets a fabric connect to a loop
Source: Troppens, Erkens, and Müller, Storage Networks Explained

Arbitrated Loop in FC
[Figures: Source: Troppens, Erkens, and Müller, Storage Networks Explained]

Routing Mechanisms in Switches
• Store-and-forward routing
• Cut-through routing
Source: William James Dally and Brian Towles, Principles and Practices of Interconnection Networks, Chapter 13
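The bandwidth formulas quoted on the interface slides above (plus the Fibre Channel one derived on an upcoming slide) as executable arithmetic:

```python
# Parallel ATA: 25 MHz strobe, double data rate, 16-bit bus
print(25e6 * 2 * 16 / 8 / 1e6)                  # 100.0 MB/s

# Serial ATA: 1.5 GHz embedded clock, 8b/10b encoding
print(1.5e9 * 0.8 / 8 / 1e6)                    # 150.0 MB/s

# Fibre Channel: 1.0625 GHz line rate, 2048-byte payload per 2168-byte frame
print(1.0625e9 * 2048 / 2168 * 0.8 / 8 / 1e6)   # ~100.4 MB/s
```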
increased connectivity – Latency increases as the number of devices increases Fibre Channel Structure Fibre Channel Bandwidth • Clock rate is 1.0625GHz • 1.0625[Gbps] x 2048[payload]/2168[payload+overhead] x 0.8[8b10b]/8[bits] = 100.369 MB/s Cable types in FC FC Roadmap Product Naming Throughput (MB/s) T11 Spec Completed (Year) Market Availability (Year) 1GFC 200 1996 1997 2GFC 400 2000 2001 4GFC 800 2003 2005 8GFC 1,600 2006 2008 16GFC 3200 2009 2011 32GFC 6400 2012 Market Demand 64GFC 12800 2016 Market Demand 128GFC 25600 2020 Market Demand http://www.fibrechannel.org/OVERVIEW/Roadmap.html Interface Comparison Market Segments It’s more than interface, Seagate, 2003 Interface Trends - Previous It’s more than interface, Seagate, 2003 Interface Trends – Today and Tomorrow It’s more than interface, Seagate, 2003 IP Storage IP Storage (cont’d) • TCP/IP is used as a storage interconnect to transfer block level data. • IETF working group, the IP Storage (IPS) • iSCSI, iFCP, and FCIP protocols • Cheaper • Provides one technology for a client to connect to servers and storage devices • Increases operating distances • Improves availability of storage systems • Can utilize network management tools It’s more than interface, Seagate, 2003 iSCSI (Internet SCSI) • iSCSI is a Transport for SCSI Commands – – – – iSCSI is an End to End protocol iSCSI can be implemented on Desktops, Laptops and Servers iSCSI can be implemented with current TCP/IP Stacks iSCSI can be implemented completely in a HBA • Overcomes the distance limitation • Cost-effective Protocol Stack - iSCSI Packet and Bandwidth - iSCSI • iSCSI overhead: 78 Bytes – 14 (Ethernet) + 20 (IP) + 20 (TCP) + 4 (CRC) + 20 (Interframe Gap) – iSCSI header occurs 48 bytes per SCSI command • 1.25[Gbps] x 1460[payload]/1538[payload+overhead] x 0.8[8b10b]/8[bits] = 113.16 MB/s • Bi-Directional Payload Bandwidth: 220.31 MB/s Problems with iSCSI • Limited Performance because – Protocol overhead in TCP/IP – Interrupts are generated for each network packet – Extra copies when sending and receiving data iSCSI Adapter Implementations • Software approach – Show the best performance – This approach is very competitive due to fast modern CPUs • Hardware Approaches – Relatively slow CPU compared to host CPU – Development speed is also slower than that in host CPU – Performance improvement is limited without superior advances in embedded CPU – Can show performance improvement in highly-loaded systems Prasenjit Sarkar, Sandeep Utamchandani, Kaladhar Voruganti, Storage over IP: When Does Hardware Support help?, FAST 2003 iFCP (Internet Fiber Channel Protocol) • • • • • • • iFCP is a gateway-to-gateway protocol for the implementation of a fibre channel fabric over a TCP/IP transport Allow users to interconnect FC devices over a TCP/IP network at any distance Traffic between fibre channel devices is routed and switched by TCP/IP network iFCP maps each FC address to an IP address and each FC session to an TCP session FC messaging and routing services are terminated at the gateways so that are not merged Data backup and replication mFCP uses UDP/IP How does iFCP work? 
iFCP (Internet Fibre Channel Protocol)
• A gateway-to-gateway protocol for implementing a Fibre Channel fabric over a TCP/IP transport
• Allows users to interconnect FC devices over a TCP/IP network at any distance
• Traffic between Fibre Channel devices is routed and switched by the TCP/IP network
• iFCP maps each FC address to an IP address, and each FC session to a TCP session
• FC messaging and routing services are terminated at the gateways, so the connected fabrics are not merged
• Used for data backup and replication
• mFCP uses UDP/IP

How Does iFCP Work?; Types of iFCP Communication
[Figures]

FCIP (Fibre Channel over IP)
• A TCP/IP-based tunneling protocol that encapsulates Fibre Channel packets
• Allows users to interconnect FC devices over a TCP/IP network at any distance (same as iFCP)
• Merges the connected SANs into a single FC fabric
• Used for data backup and replication
• Gateways
  – interconnect Fibre Channel SANs to the IP network
  – set up connections between SANs, or between Fibre Channel devices and SANs
[Figure: FCIP]

Comparison between FCIP and iFCP
[Figure: Source: IP Storage Protocols: iSCSI, iFCP and FCIP]

RAS in InfiniBand
• Reliability
  – The basic InfiniBand link connection comprises only four signal wires
  – IBA accommodates multiple ports for each I/O unit
  – IBA provides multiple CRCs
• Availability
  – An IBA fabric is inherently redundant, with multiple paths to sources assuring data delivery
  – IBA allows the network to heal itself if a link fails or reports errors
  – IBA has a many-to-many server-to-I/O relationship
• Serviceability
  – Hot-pluggable

  Feature                   InfiniBand         Fibre Channel   1Gb & 10Gb Ethernet   PCI-X
  Bandwidth                 2.5, 10, 30 Gb/s   1, 2.1 Gb/s     1, 10 Gb/s            8.51 Gb/s
  Bandwidth, full duplex    5, 20, 60 Gb/s     2.1, 4.2 Gb/s   2, 20 Gb/s            N/A
  Pin count                 4, 16, 48          4               4/8                   90
  Max length, copper        250 / 125 m        13 m            100 m                 inches
  Max length, fiber         10 km              km              km                    N/A
  Partitioning              X                  X               X                     N/A
  Scalable link width       X                  N/A             N/A                   N/A
  Max payload               4 KB               2 KB            1.5 KB                no packets

A Classification of Storage Systems (warning: not comprehensive)
• Isolated
  – E.g., a laptop/PC with a local file system
  – We know how these work
  – File systems were first developed for centralized computer systems, as an OS facility providing a convenient programming interface to (disk) storage
  – They subsequently acquired features such as access control and file locking that made them useful for sharing data and programs
• Distributed
  – Why? Sharing, scalability, mobility, fault tolerance, …
  – A "basic" distributed file system
    • Gives the illusion of local storage, when the data is in fact spread across a network (usually a LAN), to clients running on multiple computers
    • Supports the sharing of information in the form of files, and of hardware resources in the form of persistent storage, throughout an intranet
  – Enhancements in various domains: "real-time" performance (multimedia), high failure resistance, high scalability (P2P), security, longevity (archival systems), mobility/disconnection, …
  – Remote objects to support distributed object-oriented programming

Storage Systems and Their Properties

  System                      Sharing   Persistence   Caching/replication   Consistency maintenance   Example
  Main memory                 No        No            No                    Strict one-copy           RAM
  File system                 No        Yes           No                    Strict one-copy           UNIX FS
  Distributed file system     Yes       Yes           Yes                   Yes (approx.)             NFS
  Web                         Yes       Yes           Yes                   Very approx./no           Web server
  Distributed shared memory   Yes       No            Yes                   Yes (approx.)             Ivy
  Remote objects (RMI/ORB)    Yes       No            No                    Strict one-copy           CORBA
  Persistent object store     Yes       Yes           No                    Strict one-copy           CORBA Persistent State Service
  P2P storage system          Yes       Yes           Yes                   Very approx.              OceanStore