AWOCA2003
Data Reservoir:
Utilization of Multi-Gigabit Backbone Network for Data-Intensive Research
Mary Inaba, Makoto Nakamura, Kei Hiraki
University of Tokyo
AWOCA 2003
Today’s Topic
• New infrastructure for data
intensive scientific research
• Problems of using the Internet
AWOCA2003
One day, I was surprised
One professor (Dept. of Astronomy) said:
"The network is for e-mail and paper exchange.
FEDEX is for REAL data exchange."
(They use DLT tapes and airplanes.)
AWOCA2003
Huge Data Producers
AKEBONO Satellite
High Energy Accelerator
SUBARU telescope
KAMIOKANDE (Nobel Prize)
Radio Telescope in NOBEYAMA
A lot of data suggests a lot of scientific truth, through computation.
Now, we can compute: Data Intensive Research
AWOCA2003
Huge Data Transfer (inquiry to professors)
Current state:
Data transfer by DLT, EVERY WEEK.
Expected data size in a few years:
10 GB/day for satellite data
50 GB/day for the high-energy accelerator
50 PB tape archive for Earth simulation
Observatories are shared by many researchers,
hence the NEED to bring the data to the lab, somehow.
Does Network help?
AWOCA2003
Super-SINET backbone
Started Jan. 2002
Network for universities and institutes
Combination of
a 10 Gbps ordinary line and
several 1 Gbps project lines
(physics, genome, Grid, etc.)
[Map: optical cross-connect backbone linking Hokkaido Univ., Tohoku Univ., KEK, Tsukuba Univ., Univ. of Tokyo, NAO, NII, Titech, ISAS, Waseda, Nagoya Univ., Okazaki Labs, Kyoto Univ., Doshisha Univ., Osaka Univ., and Kyushu Univ.]
AWOCA2003
Currently
It is not so easy to transfer HUGE data while fully utilizing the bandwidth over long distances,
because
TCP/IP is what is commonly used,
and for TCP/IP, latency is the problem.
Disk I/O speed (~50 MB/s) is another limit.
…
AWOCA2003
Recall HISTORY
Infrastructure for Scientific Research Projects
• Utilization of the computing systems of the time
– From the birth of the electronic computer
• Numerical computation ⇒ tables, equations (EDSAC)
• Supercomputing (vector) ⇒ simulation (CDC-6600, CRAY-1)
• Servers ⇒ databases, data mining, genome (Sun Fire 15000)
• Internet ⇒ information exchange, documentation (10G switch)
Scientific researchers always utilize top-end systems.
AWOCA2003
Frontier of Information Processing
New transition period -- balance of computing systems
– Very high-speed networks
– Large-scale disk storage
– Cluster computers
New infrastructure for Data Intensive Research
[Diagram: balance among CPU (GFLOPS), memory (GB), network interface (Gbps), local disks, and remote disks]
AWOCA2003
Research Projects with Data Reservoir
(Name / project / domestic connection / overseas connection / current amount of traffic)
• Kazuo Makishima: satellite observation of the early universe. Domestic: ISAS, Hiroshima Univ., Saitama Univ. Overseas: NASA, European Space Agency. Traffic: current satellite 1 GB/day.
• Toshio Yamagata: simulation of global change. Domestic: Frontier Research System for Global Change. Overseas: N/A. Traffic: 1 simulation 10 TB; currently a data archive system with 50 PB.
• Tomio Kobayashi: ATLAS experiment. Domestic: KEK, Kyoto Univ., Univ. of Tsukuba. Overseas: CERN. Traffic: CERN LHC 100 MB/s.
• Takashi Onaka: infrared observation satellite IRIS. Domestic: Nagoya Univ. Overseas: ESA receiving site (Sweden). Traffic: downlink 200 MB; data exchange within minutes.
• Jun'ichiro Makino: astronomical simulation by GRAPE-6. Domestic: National Astronomical Observatory. Overseas: Institute for Advanced Study (Princeton Univ.), Museum of Natural History. Traffic: maximum throughput 100 MB/s; 1 simulation 10 TB.
• Hiroaki Aihara: KEK B-factory. Domestic: KEK, Nagoya Univ. Overseas: Princeton Univ.
• Sadanori Okamura: SUBARU telescope / Sloan Digital Sky Survey. Domestic: National Astronomical Observatory. Overseas: Hawaii Observatory, Fermi Lab. Traffic: survey data 10 TB; data exchange with Fermi Lab.
• Hideyuki Sakai: high-energy polarimeter SMART. Domestic: RIKEN, RCNP. Traffic: RCNP accelerator 50 GB/day; 100 MB/s.
• Yoshiaki Sobue: radio telescope (VLBI). Domestic: Nobeyama Radio Observatory. Overseas: Max Planck Observatory. Traffic: VLBI data 200 GB.
Other figures on the slide: CERN LEP 70 DAT/month; Brookhaven 100 Mbps; raw data 600 GB/day; data exchange 10 GB/day; 100 GB/day; peak bandwidth 0.5 GB/s (4 Gbps).
AWOCA2003
Basic Architecture
[Diagram: two Data Reservoir sites with cache disks, connected by a high-latency, very-high-bandwidth network. Transfer between the cache disks is physically addressed, parallel, and multi-stream; users see only local file accesses to a distributed shared file (DSM-like architecture).]
AWOCA2003
Data-intensive scientific computation through SUPER-SINET
[Diagram: data sources such as the Nobeyama Radio Observatory (VLBI), nuclear experiments, the Belle experiments, a digital sky survey, the X-ray astronomy satellite ASUKA, the SUBARU telescope, and CERN, each feeding a Data Reservoir; the Data Reservoirs are linked over the very high-speed network, and local accesses serve data analysis at the University of Tokyo.]
AWOCA2003
Design Policy
• Modification of the disk handler under the VFS layer
• Direct access to the raw device for efficient data transfer
• Multi-level striping for scalability
• Use of the iSCSI protocol
• Local file accesses through the LAN
• Global disk transfer through the WAN
• Single file image
• File system transparency
[Diagram, file server side: Application / File System / md (RAID) driver / sd, sg, st / SCSI driver / iSCSI driver. Data server side: iSCSI daemon / sg / SCSI driver (mid) / SCSI driver (low) / Disks.]
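To make the multi-level striping idea concrete, here is a minimal sketch (not the actual Data Reservoir driver; the stripe sizes, server count, and disk count are assumed parameters) that maps a logical byte offset first to a disk server and then to a disk inside that server:

```python
# Minimal sketch of two-level striping (hypothetical parameters, not the
# actual driver).  Level 1 stripes the logical volume across disk servers;
# level 2 stripes each server's share across its local disks.

STRIPE1 = 1 << 20          # 1 MiB stripe unit across disk servers (assumed)
STRIPE2 = 64 << 10         # 64 KiB stripe unit across local disks (assumed)
N_SERVERS = 4              # number of disk servers (assumed)
N_DISKS = 8                # disks per disk server (assumed)

def locate(offset: int):
    """Map a logical byte offset to (server, disk, offset on that disk)."""
    # Level 1: which disk server owns this stripe unit?
    unit1 = offset // STRIPE1
    server = unit1 % N_SERVERS
    # Byte offset within the data stored by that server.
    server_off = (unit1 // N_SERVERS) * STRIPE1 + offset % STRIPE1
    # Level 2: which local disk inside that server?
    unit2 = server_off // STRIPE2
    disk = unit2 % N_DISKS
    disk_off = (unit2 // N_DISKS) * STRIPE2 + server_off % STRIPE2
    return server, disk, disk_off

if __name__ == "__main__":
    for off in (0, STRIPE2, STRIPE1, 5 * STRIPE1 + 3 * STRIPE2):
        print(off, "->", locate(off))
```

Because consecutive stripe units land on different servers and different disks, many servers and disks can be kept busy in parallel, which is what makes the design scale.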
AWOCA2003
File accesses on Data Reservoir
[Diagram: scientific detectors and user programs access four file servers (1st-level striping); the file servers access four disk servers by iSCSI through IP switches (2nd-level striping).]
AWOCA2003
File accesses on Data Reservoir: User's View
[Same diagram as the previous slide: file servers (1st-level striping), disk access by iSCSI through IP switches, disk servers (2nd-level striping).]
AWOCA2003
Global Data Transfer
[Diagram: scientific detectors, user programs, and file servers as before; data moves between the disk servers by iSCSI bulk transfer through the IP switches over the global network.]
AWOCA2003
Implementation (File Server)
File system stack: Application, system call, NFS, EXT2, Linux RAID
Drivers: sd driver, sg driver, iSCSI driver
Network stack: TCP/UDP, IP, network
AWOCA2003
Implementation (Disk Server)
Application layer: iSCSI daemon, data striping (via system calls)
Network stack: TCP, IP, network
Drivers: sd driver, sg driver, iSCSI driver, SCSI driver
Disks
AWOCA2003
Performance evaluation of Data Reservoir
1. Local experiment: 1 Gbps model (basic performance)
2. 40 km experiment: 1 Gbps model, U. of Tokyo ⇔ ISAS
3. 1600 km experiment: 1 Gbps model
• 26 ms latency (Tokyo ⇔ Kyoto ⇔ Osaka ⇔ Sendai ⇔ Tokyo)
• High-quality network (SUPER-SINET Grid project lines)
4. US-Japan experiments
1. 1 Gbps model
2. U. of Tokyo ⇔ Fujitsu Lab. America (Maryland, USA)
3. U. of Tokyo ⇔ SCinet (Maryland, USA)
5. 10 Gbps experiments comparing different switch configurations
1. Extreme Summit 7i, trunked 8 Gigabit Ethernets
2. RiverStone RS16000, trunked 8 and 12 1000BASE-SX
3. Foundry BigIron, 10GBASE-LR modules
4. Extreme BlackDiamond, trunked 8 1000BASE-SX
5. Foundry BigIron, trunked 2 10GBASE-LR
• With trunked 8 Gigabit Ethernets, the trunk itself is the bottleneck (8 Gbps)
AWOCA2003
Performance Comparison to ftp (40 km)
• ftp: optimal performance (minimum disk-head movements)
• iSCSI: queued operation
• iSCSI transfer is 55% faster than ftp on a single TCP stream
[Charts: FTP 1 GB file transfer (disk to disk), default vs. tuned, rate in MB/s per file; iSCSI transfer (disk to disk) at queue depths 1, 2, 4, 8, and 16, showing MAX, MIN, and AVERAGE rates in MB/s.]
AWOCA2003
1600 km experiment system
• 870 Mbps file-transfer bandwidth
Univ. of Tokyo (Cisco 6509)
↓ 1G Ether (Super-SINET)
Kyoto Univ. (Extreme BlackDiamond)
↓ 1G Ether (Super-SINET)
Osaka Univ. (Cisco 3508)
↓ 1G Ether (Super-SINET)
Tohoku Univ. (jumper fiber)
↓ 1G Ether (Super-SINET)
Univ. of Tokyo (Extreme Summit 7i)
AWOCA2003
Network for the 1600 km experiments
• Grid project networks of SUPER-SINET
• One-way latency 26 ms
[Map: IBM servers at Univ. of Tokyo, Kyoto Univ., Osaka Univ., and Tohoku Univ. (Sendai), connected by GbE over SUPER-SINET; legs of roughly 250, 300, and 550 miles, about 1000 miles of GbE line in total.]
AWOCA2003
Transfer speed in the 1600 km experiment
Maximum bandwidth measured by SmartBits = 970 Mbps
Overheads of headers ~ 5%
Transfer rate (Mbps) by system configuration (file servers * disk servers * disks/disk server):
1*4*8: 870, 1*4*(2+2): 828, 1*4*4: 812
1*2*8: 737, 1*2*(2+2): 700, 1*2*4: 707
1*1*8: 499, 1*1*(2+2): 478, 1*1*4: 493
AWOCA2003
10 Gbps experiment
11.7 Gbps transfer BW
• Local connection of two 10 Gbps models
• 10GBASE-LR or 8 to 12 1000BASE-SX
• 24 disk servers + 6 file servers
– Dell 1650, 2 × 1.26 GHz Pentium III, 1 GB memory, ServerSet III HE-SL
– NetGear GE NIC
– Extreme Summit 7i (trunking)
– Extreme BlackDiamond 6808
– Foundry BigIron (10GBASE-LR)
– RiverStone RS-16000
AWOCA2003
Performance on the 10 Gbps model
• 300 GB file transfer (iSCSI streams)
• 5% header loss due to TCP/IP and iSCSI
• 7% performance loss due to trunking
• Uneven use of disk servers
• 100 GB file transfer in 2 minutes
[Chart: throughput (Gbps, 0 to 8) vs. number of disk servers (4, 8, 16, 24)]
AWOCA2003
US-Japan Experiments at SC2002
Bandwidth Challenge
92% Usage of Bandwidth using TCP/IP
AWOCA2003
Brief Explanation of TCP/IP
AWOCA2003
User’s View
TCP is a PIPE
[Diagram: input data "abcde" goes into TCP, crosses the Internet as a byte stream, and the same data "abcde" comes out of TCP at the other end, in the same order.]
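A minimal sketch of this "pipe" view using ordinary sockets (loopback only, purely illustrative): whatever bytes go in at one end come out at the other end, in the same order, regardless of how they were chopped up on the way.

```python
# Minimal illustration of TCP's byte-stream ("pipe") abstraction on loopback.
import socket
import threading

def receiver(listener: socket.socket):
    conn, _ = listener.accept()
    data = b""
    while True:
        chunk = conn.recv(1024)
        if not chunk:              # sender closed: end of stream
            break
        data += chunk
    print("received:", data)       # same bytes, same order: b"abcde"
    conn.close()

listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]
t = threading.Thread(target=receiver, args=(listener,))
t.start()

sender = socket.create_connection(("127.0.0.1", port))
for b in b"abcde":                 # send one byte at a time
    sender.send(bytes([b]))
sender.close()
t.join()
listener.close()
```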
AWOCA2003
TCP’s View
[Diagram: the byte stream "abcde" crosses the Internet between two TCP endpoints.]
TCP checks whether all the data has arrived,
re-orders it when the arrival order is wrong,
and asks for a re-send when data is missing.
AWOCA2003
Speed Control
TCP's features
• Keep data until an acknowledgement arrives.
Use a buffer (window); when an ACK arrives from the receiver, new data is moved into the buffer.
• Speed control (congestion control) without knowing the state of the routers.
Make the buffer (window) small when congestion appears to have occurred.
AWOCA2003
Window Size and Throughput
Roughly speaking,
Throughput = Window Size / RTT
(RTT: round-trip time)
Hence, a longer RTT needs a larger window size for the same throughput.
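For concreteness, a small calculation based on this formula (the 26 ms RTT matches the 1600 km experiment; the 64 KiB default window is an assumed, typical value for the time):

```python
# Throughput ~= window / RTT, so the window needed for a target
# throughput is window = throughput * RTT (the bandwidth-delay product).
def window_bytes(throughput_bps: float, rtt_s: float) -> float:
    """Window size in bytes needed to sustain the given throughput."""
    return throughput_bps * rtt_s / 8

# A 64 KiB window over a 26 ms path caps throughput at about 20 Mbps:
print(64 * 1024 * 8 / 0.026 / 1e6, "Mbps")
# Filling 1 Gbps at 26 ms RTT needs a window of roughly 3.25 MB:
print(window_bytes(1e9, 0.026) / 1e6, "MB")
```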
AWOCA2003
Congestion Control
AIMD
Additive Increase, Multiplicative Decrease
Accelerate only gradually after congestion has occurred;
slow down rapidly when congestion is suspected.
[Diagram: window size vs. time; the window grows with every ACK, doubling each RTT, in the slow-start phase, then enters the AIMD phase.]
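A minimal sketch of the window evolution just described (slow start doubling per RTT, additive increase of one segment per RTT, halving on loss); the loss points are invented purely to show the sawtooth shape:

```python
# Toy AIMD trace: congestion window (in MSS) sampled once per RTT.
def aimd_trace(rtts: int, losses=(40, 80)):
    cwnd, ssthresh, trace = 1.0, 64.0, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in losses:
            ssthresh = max(cwnd / 2, 1.0)   # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double per RTT
        else:
            cwnd += 1.0                     # additive increase: +1 MSS per RTT
    return trace

print(aimd_trace(100)[:12])
```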
AWOCA2003
Another Problem
Denote a network with long latency and wide bandwidth an LFN (Long Fat Pipe Network).
An LFN needs a large window size,
but since each increment is triggered by an ACK,
the speed of the increase is also SLOW.
(LFNs suffer under AIMD.)
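To see why an LFN suffers, a back-of-the-envelope sketch: after one multiplicative decrease, the window must regrow by about half the bandwidth-delay product at roughly one segment per RTT (a 1460-byte MSS is assumed; 26 ms is the 1600 km RTT, 200 ms is a US-Japan-scale RTT):

```python
# Rough recovery time after one halving of the window on a long fat pipe:
# the window must regrow cwnd/2 segments at one MSS per RTT.
MSS = 1460                                           # bytes (assumed)

def recovery_time(bandwidth_bps: float, rtt_s: float) -> float:
    full_window = bandwidth_bps * rtt_s / 8 / MSS    # segments to fill the pipe
    return (full_window / 2) * rtt_s                 # seconds at +1 MSS per RTT

print(recovery_time(1e9, 0.026))   # 1 Gbps, 26 ms RTT: roughly 29 seconds
print(recovery_time(1e9, 0.200))   # 1 Gbps, 200 ms RTT: roughly 28 minutes
```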
AWOCA2003
Network Environment
The bottleneck (about 600 Mbps)
Note that 600 Mbps < 1 Gbps.
AWOCA2003
92% using TCP/IP is good,
but we still have a PROBLEM:
several streams only make progress
after the other streams finish.
AWOCA2003
Fastest and slowest stream in the worst case
[Chart: sequence number vs. time for the individual streams.]
The slowest stream is 3 times slower than the fastest.
Even after the other streams finished,
its throughput did not recover.
AWOCA2003
Hand-made Tools
• DR Gigabit Network Analyzer
– Needs accurate timestamps with 100 ns accuracy
– Dumps full packets
• Comet Delay and Drop
– A pseudo Long Fat Pipe Network (LFN)
– On Gigabit Ethernet, a packet is sent every 12 μsec
AWOCA2003
Programmable NIC (Network Interface Card)
AWOCA2003
DR Giga Analyzer
AWOCA2003
Comet Delay and Drop
AWOCA2003
Unstable Throughput
• In our long-distance data transfer experiments, throughput ranged from
8 Mbps to 120 Mbps
(when using a Gigabit Ethernet interface).
AWOCA2003
Fast Ethernet is very stable
AWOCA2003
Analysis of a single stream
[Chart: number of packets over time, with 200 msec RTT]
AWOCA2003
Packet Distribution
[Chart: number of packets per msec vs. time (sec)]
AWOCA2003
Packet Distribution of Fast Ethernet
[Chart: number of packets per msec vs. time (sec)]
AWOCA2003
Gigabit Ethernet interface vs. Fast Ethernet interface
Even at the same "20 Mbps",
the behavior of 20 Mbps on a Gigabit Ethernet interface
and 20 Mbps on a Fast Ethernet interface
is completely different.
Gigabit Ethernet is very bursty.
Routers might not like this.
AWOCA2003
2 problems
• Once packets are sent in a burst, the router sometimes cannot keep up
(unlucky streams become slow, lucky streams fast),
especially when the bottleneck is below Gigabit speed.
• More than 80% of the time, the sender does not send anything.
AWOCA2003
Problem of implementation
At 1 Gbps, assuming a 1500 B Ethernet packet,
one packet should be sent every 12 μsec.
On the other hand, the UNIX kernel timer tick is 10 msec.
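The arithmetic behind this mismatch, as a minimal sketch (1500-byte frames and a 10 ms tick as stated above):

```python
# At 1 Gbps a 1500-byte frame leaves every ~12 us, but a 10 ms kernel timer
# tick means software pacing at tick granularity would have to release
# hundreds of packets in one burst.
FRAME_BITS = 1500 * 8
LINE_RATE = 1e9              # bits per second
TICK = 10e-3                 # kernel timer resolution in seconds

per_packet = FRAME_BITS / LINE_RATE
print(per_packet * 1e6, "us between packets")        # ~12 us
print(TICK / per_packet, "packets per 10 ms tick")   # ~833 packets
```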
AWOCA2003
IPG (Inter-Packet Gap)
• The transmitter is always on;
• when no packet is being sent, it is in the idle state.
• Each frame is followed by an IPG of at least 12 bytes (IEEE 802.3) at the sender.
• Tunable via the e1000 driver (8 bytes to 1023 bytes).
AWOCA2003
IPG tuning for short distance
                 Fast Ethernet   Gigabit Ethernet
IPG 8 bytes      94.1 Mbps       941 Mbps
IPG 1023 bytes   56.7 Mbps       567 Mbps
Assuming a 1500-byte Ethernet frame,
1508 : 2523 is approximately 567 : 941,
so these numbers match the theory.
(Gigabit Ethernet is already perfectly tuned for short-distance data transfer.)
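A minimal sketch of the proportion used on this slide (1500-byte frames assumed; the 941 Mbps baseline is the measured rate at the 8-byte IPG): throughput scales with the fraction of wire time occupied by the frame itself.

```python
# rate(IPG) ~ rate(8) * (1500 + 8) / (1500 + IPG)
FRAME = 1500                       # Ethernet frame size in bytes (assumed)
BASE_IPG, BASE_RATE = 8, 941.0     # measured: 941 Mbps at an 8-byte IPG

def estimated_rate(ipg_bytes: int) -> float:
    """Estimated throughput in Mbps for a given inter-packet gap."""
    return BASE_RATE * (FRAME + BASE_IPG) / (FRAME + ipg_bytes)

print(estimated_rate(1023))        # ~562 Mbps, close to the measured 567 Mbps
```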
AWOCA2003
IPG tuning for Long Distance
AWOCA2003
MAX, MIN, Average, and Standard Deviation of Throughput
[Chart: FastEther]
AWOCA2003
Some patterns of
throughput change
AWOCA2003
Detail (Slow Start Phase)
AWOCA2003
Packet Distribution
AWOCA2003
But
• These are like ad-hoc patches.
What is the essential problem?
AWOCA2003
One big problem
• A good MODEL does not exist.
Old-style models do not work well,
such as
M/M/1 queueing theory and
Poisson packet distributions.
Experiments say they are not good.
Currently, simulation and using a real network
are the only ways to check.
(No theoretical background.)
AWOCA2003
What is the difference from the telephone network?
AUTONOMY
AWOCA2003
For the telephone network,
• the telephone company knows, manages, and controls the whole network;
• end nodes do not have to do heavy work,
such as congestion control.
AWOCA2003
Current Trend(?)
• Analyze NETWORK using Game Theory.
• Nash Equilibrium
AWOCA2003