Data Reservoir: Utilization of a Multi-Gigabit Backbone Network for Data-Intensive Research
Mary Inaba, Makoto Nakamura, Kei Hiraki (University of Tokyo), AWOCA 2003

Today's Topic
• A new infrastructure for data-intensive scientific research
• Problems of using the Internet

One day, I was surprised
A professor in the Dept. of Astronomy said: "The network is for e-mail and paper exchange. FedEx is for REAL data exchange." (They use DLT tapes and airplanes.)

Huge Data Producers
• AKEBONO satellite
• High-energy accelerators
• SUBARU telescope
• KAMIOKANDE (Nobel Prize)
• Radio telescope at NOBEYAMA
A lot of data suggests a lot of scientific truth, by computation. Now we can compute: this is data-intensive research.

Huge Data Transfer (inquiries to professors)
• Current state: data transfer by DLT tape, EVERY WEEK
• Expected data sizes in a few years:
  – 10 GB/day for satellite data
  – 50 GB/day for a high-energy accelerator
  – a 50 PB tape archive for earth simulation
• Observatories are shared by many researchers, hence the NEED to bring data back to the laboratory somehow. Does the network help?

Super-SINET backbone
• Started January 2002; a network for universities and institutes
• A combination of a 10 Gbps ordinary line and several 1 Gbps project lines (physics, genome, Grid, etc.), joined by optical cross-connects
• Sites include Hokkaido Univ., Tohoku Univ., KEK, Tsukuba Univ., Univ. of Tokyo, NAO, NII, Titech, ISAS, Waseda, Nagoya Univ., the Okazaki labs, Kyoto Univ., Doshisha Univ., Osaka Univ., and Kyushu Univ.

Currently
It is not easy to transfer HUGE data over long distances while fully utilizing the bandwidth: TCP/IP is what everyone uses, and for TCP/IP, latency is the problem; disk I/O speed (about 50 MB/sec) is a further limit.
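To see why latency rather than raw bandwidth is the obstacle, here is a minimal sketch of the window-size arithmetic that the TCP slides later make precise (Throughput = Window / RTT). The 1 Gbps line rate and the 52 ms and 200 ms round-trip times echo figures used elsewhere in this talk; the 64 KB default window is a typical value of the era, not a number from the slides.

```python
# Bandwidth-delay product: the TCP window needed to keep a long link full.

def window_needed(bandwidth_bps: float, rtt_s: float) -> float:
    """Window (bytes) such that window / RTT equals the link bandwidth."""
    return bandwidth_bps * rtt_s / 8

def throughput(window_bytes: float, rtt_s: float) -> float:
    """Throughput (bit/s) achievable with a fixed window over a given RTT."""
    return window_bytes * 8 / rtt_s

for label, rtt in [("1600 km loop, 52 ms RTT", 0.052),
                   ("US-Japan, 200 ms RTT", 0.200)]:
    need = window_needed(1e9, rtt)     # 1 Gbps line
    got = throughput(64 * 1024, rtt)   # classic 64 KB default window
    print(f"{label}: need a {need / 1e6:.1f} MB window; "
          f"a 64 KB window gives only {got / 1e6:.1f} Mbps")
```

A 1 Gbps US-Japan path needs a 25 MB window, while a default 64 KB window yields under 3 Mbps; this is the gap the Data Reservoir attacks with parallel, multi-stream transfer.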
Recall HISTORY: Infrastructure for Scientific Research Projects
Utilization of the computing systems of the time, from the birth of the electronic computer onward:
• Numerical computation ⇒ tables, equations (EDSAC)
• Supercomputing (vector) ⇒ simulation (CDC-6600, CRAY-1)
• Servers ⇒ databases, data mining, genome (SUN Fire 15000)
• Internet ⇒ information exchange, documentation (10G switches)
Scientific researchers always utilize the top-end systems.

Frontier of Information Processing
A new transition period in the balance of computing systems: very high-speed networks, large-scale disk storage, and cluster computers (CPU GFLOPS, memory GB, and Gbps paths to local and remote disks) combine into a new infrastructure for data-intensive research.

Research Projects with Data Reservoir
(Columns of the original table: name, project, domestic connection, overseas connection, current amount of traffic.)
• CERN LEP experiment: CERN and Brookhaven; currently 70 DAT tapes/month, 100 MB/sec expected for the CERN LHC, Brookhaven reachable at 100 Mbps; RCNP accelerator, 50 GB/day
• VLBI data: Nobeyama Radio Observatory; Max Planck observatory; 200 GB
• Sloan Digital Sky Survey: National Astronomical Observatory; Fermi Lab; survey data of 10 TB, exchanged with Fermi Lab
• Kazuo Makishima, satellite observation of the early universe: ISAS, Hiroshima Univ., Saitama Univ.; NASA and the European Space Agency; current satellite 1 GB/day
• Toshio Yamagata, simulation of global change: Frontier Research System for Global Change; no overseas connection; 10 TB per simulation, currently a data archive system with 50 PB
• Tomio Kobayashi, ATLAS experiment: KEK, Kyoto Univ., Univ. of Tsukuba; CERN (LHC)
• Takashi Onaka, infrared observation satellite IRIS: Nagoya Univ.; ESA receiving site (Sweden); 200 MB downlinks to be exchanged within minutes
• Jun'ichiro Makino, astronomical simulation by GRAPE-6: National Astronomical Observatory; Advanced Study at Princeton Univ. and the Museum of Natural History; maximum throughput 100 MB/s, 10 TB per simulation
• Hiroaki Aihara, KEK B-factory: KEK, Nagoya Univ.; Princeton Univ.
• Sadanori Okamura, SUBARU telescope: National Astronomical Observatory; Hawaii Observatory
• Hideyuki Sakai, high-energy polarimeter SMART: RIKEN, RCNP
• Yoshiaki Sobue, radio telescope (VLBI)
(Traffic figures from the last rows of the table: 100 MB/sec; raw data 600 GB/day; data exchange 10 GB/day to 100 GB/day; peak bandwidth 0.5 GB/sec, i.e. 4 Gbps.)

Basic Architecture
• Two Data Reservoirs (sets of cache disks) joined by a high-latency, very-high-bandwidth network
• Physically addressed, parallel, multi-stream transfer between the reservoirs
• Local file accesses at each end; distributed shared files (a DSM-like architecture)

Data-intensive scientific computation through SUPER-SINET
Sources such as the Nobeyama Radio Observatory (VLBI), nuclear experiments, the Belle experiments, the Digital Sky Survey, the X-ray astronomy satellite ASUKA, the SUBARU telescope, and CERN feed Data Reservoirs over the very-high-speed network; data analysis at the University of Tokyo then proceeds through local accesses.

Design Policy
• Use of the iSCSI protocol
• Local file accesses through the LAN; global disk transfer through the WAN
• Single file image; file-system transparency
• Modification of the disk handler under the VFS layer
• Direct access to the raw device for efficient data transfer
• Multi-level striping for scalability
(File-server stack: file system, md (RAID) driver, sd/sg/st SCSI drivers, iSCSI driver. Data-server stack: iSCSI daemon over the sg driver, mid- and low-level SCSI drivers, and the disks.)

File accesses on Data Reservoir
• User programs and scientific detectors talk to file servers; the first level of striping is across the file servers
• File servers reach disk servers by iSCSI through IP switches; the second level of striping is across the disk servers
• In the user's view, all of this is a single file system

Global Data Transfer
For wide-area transfer, the disk servers on both sides run iSCSI bulk transfer directly through the IP switches over the global network; the file servers are not on the data path.

Implementation (File Server)
Application → system call → NFS → EXT2 → Linux RAID → sd/sg drivers → iSCSI driver → TCP/UDP → IP → network.

Implementation (Disk Server)
Application layer: iSCSI daemon and data-stripe layer. Below it: dr/sg drivers and the iSCSI driver over TCP/IP toward the network, and the SCSI driver toward the disks.
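The two levels of striping above decide where every block of a shared file physically lives. The sketch below shows one plausible round-robin mapping from a logical file offset to a file server, a disk server, and a local offset; the server counts and stripe sizes are invented for illustration, since the slides do not give the real parameters.

```python
# Two-level striping sketch: level 1 spreads a file across file servers,
# level 2 spreads each file server's share across its disk servers.
# All four constants are hypothetical.

FILE_SERVERS = 4        # first-level stripe width
DISK_SERVERS = 4        # second-level stripe width per file server
L1_STRIPE = 1 << 20     # 1 MiB first-level stripe unit
L2_STRIPE = 64 << 10    # 64 KiB second-level stripe unit

def locate(offset):
    """Map a logical file offset to (file server, disk server, local offset)."""
    l1_unit = offset // L1_STRIPE
    fs = l1_unit % FILE_SERVERS
    fs_offset = (l1_unit // FILE_SERVERS) * L1_STRIPE + offset % L1_STRIPE
    l2_unit = fs_offset // L2_STRIPE
    ds = l2_unit % DISK_SERVERS
    local = (l2_unit // DISK_SERVERS) * L2_STRIPE + fs_offset % L2_STRIPE
    return fs, ds, local

for off in (0, 64 << 10, 1 << 20, (10 << 20) + 5):
    fs, ds, local = locate(off)
    print(f"offset {off:>9}: file server {fs}, disk server {ds}, local {local}")
```

Because every block has a fixed physical address of this kind, the bulk-transfer path can copy disk servers to disk servers directly, which is exactly what the global-transfer slide shows.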
Performance evaluation of Data Reservoir
1. Local experiment: 1 Gbps model (basic performance)
2. 40 km experiment: 1 Gbps model, Univ. of Tokyo ⇔ ISAS
3. 1600 km experiment: 1 Gbps model, 26 ms one-way latency (Tokyo ⇔ Kyoto ⇔ Osaka ⇔ Sendai ⇔ Tokyo) on a high-quality network (SUPER-SINET Grid project lines)
4. US-Japan experiments: 1 Gbps model, Univ. of Tokyo ⇔ Fujitsu Lab America (Maryland, USA) and Univ. of Tokyo ⇔ SCinet (Maryland, USA)
5. 10 Gbps experiments comparing five switch configurations:
   – Extreme Summit 7i, trunked 8 Gigabit Ethernets (here the 8 Gbps trunk itself is the bottleneck)
   – RiverStone RS16000, trunked 8 and 12 1000BASE-SX
   – Foundry BigIron, 10GBASE-LR modules
   – Extreme BlackDiamond, trunked 8 1000BASE-SX
   – Foundry BigIron, trunked 2 10GBASE-LR

Performance comparison to ftp (40 km)
• ftp: optimal performance (minimum disk-head movements); iSCSI: queued operation
• iSCSI transfer is 55% faster than ftp on a single TCP stream
(Charts: FTP 1 GB disk-to-disk file transfer, default vs. tuned, and average/max/min iSCSI disk-to-disk rates in MB/s for queue depths 1, 2, 4, 8, and 16.)

1600 km experiment system
• 870 Mbps file-transfer bandwidth
• Path, each hop a 1G Ether over Super-SINET: Univ. of Tokyo (Cisco 6509) → Kyoto Univ. (Extreme BlackDiamond) → Osaka Univ. (Cisco 3508) → Tohoku Univ. (jumper fiber) → Univ. of Tokyo (Extreme Summit 7i)

Network for the 1600 km experiments
• Grid project networks of SUPER-SINET; one-way latency 26 ms
• (Map: a GbE line through Univ. of Tokyo, Kyoto Univ., Osaka Univ., and Tohoku Univ. in Sendai; segments of roughly 250, 300, 550, and 1000 miles.)

Transfer speed in the 1600 km experiment
• Maximum bandwidth measured by SmartBits: 970 Mbps; header overheads are about 5%
• Transfer rate grows with the configuration (file servers × disk servers × disks per disk server): roughly 478 to 499 Mbps with one disk server (1×1×4 to 1×1×8), 700 to 737 Mbps with two, and 812 to 870 Mbps with four, topping out at 870 Mbps for 1×4×8

10 Gbps experiment
• 11.7 Gbps transfer bandwidth over a local connection of two 10 Gbps models
• 10GBASE-LR, or 8 to 12 trunked 1000BASE-SX
• 24 disk servers + 6 file servers: Dell 1650 (two 1.26 GHz Pentium III CPUs, 1 GB memory, ServerSet III HE-SL), NetGear GE NICs
• Switches: Extreme Summit 7i (trunking), Extreme BlackDiamond 6808, Foundry BigIron (10GBASE-LR), RiverStone RS-16000

Performance on the 10 Gbps model
• 300 GB file transfer over iSCSI streams; a 100 GB file transfers in 2 minutes
• About 5% header loss due to TCP/IP and iSCSI; about 7% performance loss due to trunking; uneven use of disk servers
(Chart: throughput in Gbps versus 4, 8, 16, and 24 disk servers.)

US-Japan experiments at the SC2002 Bandwidth Challenge
92% usage of the bandwidth using TCP/IP.

Brief explanation of TCP/IP

The user's view: TCP is a PIPE
A byte stream such as "abcde" goes into TCP and the same data comes out at the other end of the Internet, in the same order.

TCP's view
TCP checks whether all data has arrived, re-orders data when the arrival order is wrong, and asks the sender to re-send when data is missing.

Speed control: TCP's features
• Keep data until an acknowledgement arrives: a buffer (the window) holds data, and when an ACK comes back from the receiver, new data moves into the buffer.
• Congestion control without knowing the state of the routers: make the buffer (window) small when congestion is guessed to have occurred.

Window size and throughput
Roughly speaking, Throughput = Window Size / RTT, where RTT is the round-trip time. Hence a longer RTT needs a larger window size for the same throughput.

Congestion control: AIMD
Additive Increase, Multiplicative Decrease: accelerate gradually once congestion has passed; slow down rapidly when congestion is expected. The window doubles every round trip in the slow-start phase, then enters the AIMD phase.

Another problem
Call a network with long latency and wide bandwidth an LFN (Long Fat Pipe Network). An LFN needs a large window size, but since each increment is triggered by an ACK, the speed of window growth is also SLOW; LFNs suffer under AIMD.
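How slowly does AIMD refill an LFN? A back-of-the-envelope sketch, assuming the textbook policy of one extra segment per RTT after a halving; the 200 ms RTT matches the US-Japan analysis later in the talk, while the other figures are illustrative.

```python
# Time for standard AIMD to climb from W/2 back to the full window W,
# growing by one MSS-sized segment per RTT. Figures are illustrative
# except the 200 ms RTT quoted for the US-Japan path.

MSS = 1460  # bytes per segment, typical for Ethernet

def recovery_seconds(bandwidth_bps: float, rtt_s: float) -> float:
    full_window = bandwidth_bps * rtt_s / 8 / MSS   # segments in flight when full
    return (full_window / 2) * rtt_s                # W/2 RTTs of additive increase

for label, bw, rtt in [("Fast Ethernet, 200 ms RTT", 100e6, 0.200),
                       ("Gigabit Ethernet, 200 ms RTT", 1e9, 0.200),
                       ("Gigabit Ethernet, 52 ms RTT", 1e9, 0.052)]:
    print(f"{label}: about {recovery_seconds(bw, rtt):.0f} s per loss event")
```

On the gigabit US-Japan path a single loss costs nearly half an hour of sub-peak throughput, which is why the slow streams seen below never catch up within one transfer.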
Network environment
The bottleneck is about 600 Mbps; note that 600 Mbps < 1 Gbps.

92% using TCP/IP is good, but we still have a PROBLEM
Several streams only get work done after other streams finish.

Fastest and slowest streams in the worst case
(Plot: sequence number versus time.) The slowest stream is 3 times slower than the fastest, and even after the other streams finish, its throughput does not recover.

Hand-made tools
• DR Gigabit Network Analyzer: needs accurate timestamps (100 ns accuracy) and dumps full packets; on Gigabit Ethernet a packet is sent every 12 μs
• Comet Delay and Drop: a pseudo Long Fat Pipe Network (LFN)
Both tools are built on a programmable NIC (network interface card).

Unstable throughput
In long-distance data transfer we observed throughput anywhere from 8 Mbps to 120 Mbps when using a Gigabit Ethernet interface; Fast Ethernet is very stable.

Analysis of a single stream
(Plots, with 200 ms RTT: number of packets per msec over time for the Gigabit Ethernet interface and for Fast Ethernet.)

Gigabit Ethernet interface vs. Fast Ethernet interface
Even at the same "20 Mbps", the behavior of 20 Mbps on a Gigabit Ethernet interface and 20 Mbps on a Fast Ethernet interface is completely different. Gigabit Ethernet is very bursty; routers might not like this.

Two problems
• Once packets are sent in a burst, a router sometimes cannot bear it (an unlucky stream stays slow, a lucky stream runs fast), especially when the bottleneck is below a gigabit.
• More than 80% of the time, the sender does not send anything.

Problem of implementation
At 1 Gbps with 1500-byte Ethernet packets, one packet should be sent every 12 μs; the UNIX kernel timer, on the other hand, ticks every 10 ms.

IPG (Inter-Packet Gap)
• The transmitter is always on; when no packet is being sent, it emits the idle state.
• Each frame is followed by an IPG of at least 12 bytes (IEEE 802.3).
• The IPG is tunable through the e1000 driver, from 8 to 1023 bytes.

IPG tuning for short distance
                Fast Ethernet   Gigabit Ethernet
IPG 8 bytes     94.1 Mbps       941 Mbps
IPG 1023 bytes  56.7 Mbps       567 Mbps
With 1500-byte Ethernet frames, 1508 : 2523 is approximately 567 : 941, so these numbers work out theoretically (Gigabit Ethernet is already perfectly tuned for short-distance transfer); the arithmetic is worked through in the sketch at the end of these notes.

IPG tuning for long distance
(Plots: max, min, average, and standard deviation of throughput versus Fast Ethernet; some patterns of throughput change; detail of the slow-start phase; packet distribution.)

But...
These tunings are like ad-hoc patches. What is the essential problem?

One big problem
A good MODEL does not exist. Old models do not work well: queueing theory such as M/M/1 assumes a Poisson packet distribution, and experiment says that is not a good fit. Currently, simulation and runs on real networks are the only ways to check anything; there is no theoretical backing.

What is the difference from the telephone network? AUTONOMY.
For the telephone network, the telephone company knows, manages, and controls the whole network, so an end node does not have to do heavy jobs such as congestion control.

Current trend(?)
• Analyzing the NETWORK using game theory
• Nash equilibria
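As promised under "IPG tuning for short distance", here is a minimal sketch of the frame-to-gap arithmetic behind that table. Only the 1500-byte frame, the two IPG settings, and the four measured rates come from the slide; the proportional model is the slide's own 1508 : 2523 argument.

```python
# A frame occupies FRAME bytes of wire time and each gap IPG bytes, so
# throughput scales as FRAME / (FRAME + IPG). Predict the IPG=1023 rate
# from the IPG=8 measurement and compare with the slide's table.

FRAME = 1500  # bytes per Ethernet frame, as assumed on the slide

def scaled_rate(rate_at_ipg8_mbps: float, ipg_bytes: int) -> float:
    """Throughput predicted at a given IPG, scaled from the IPG=8 figure."""
    return rate_at_ipg8_mbps * (FRAME + 8) / (FRAME + ipg_bytes)

for nic, rate8, measured in [("Fast Ethernet", 94.1, 56.7),
                             ("Gigabit Ethernet", 941.0, 567.0)]:
    predicted = scaled_rate(rate8, 1023)
    print(f"{nic}: predicted {predicted:.1f} Mbps at IPG=1023, measured {measured}")
# Both predictions land within about 1% of the measurements: widening the
# gap paces the sender at wire level, with a precision the 10 ms kernel
# timer could never provide.
```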