Download Services in CINEMA

SIP Server Scalability
IRT Internal Seminar
Kundan Singh, Henning Schulzrinne and Jonathan Lennox
May 10, 2005

Agenda
- Why do we need scalability?
- Scaling the server
  - SIP express router (Iptel.org)
  - SIPd (Columbia University)
  - Threads/Processes/Events
- Scaling using load sharing
  - DNS-based, Identifier-based
  - Two stage architecture
- Conclusions
(27 slides)

Internet telephony (SIP: Session Initiation Protocol)
[Figure: SIP call setup between yahoo.com and example.com. Labels: [email protected], REGISTER, INVITE, 129.1.2.3, 192.1.2.4, DB, DNS.]

Scalability Requirements: Depends on role in the network architecture
[Figure: network architecture with a cybercafe, IP phones, PSTN phones, PBX, gateways (GW, MG, SIP/MGC, SIP/PSTN), ISP, IP network, carrier network, T1 PRI/BRI and PSTN. Sizes shown: edge ISP server 10,000 customers; enterprise server 1000 customers; carrier (3G) 10 million customers.]

Scalability Requirements: Depends on traffic type
- Registration (uniform)
  - Authentication, mobile users
- Call routing (Poisson)
  - stateful vs stateless proxy, redirect, programmable scripts
- Beyond telephony (Don’t know)
  - Instant message, presence (including sensors), device control
- Stateful calls (Poisson arrival, exponential call duration)
  - Firewall, conference, voicemail
- Transport type
  - UDP/TCP/TLS (cost of security)

SIPstone: SIP server performance metrics
- Steady state rate for successful registration, forwarding and unsuccessful call attempts, measured using 15 min test runs.
- Measure: #requests/s with given delay constraint.
- Performance = f(#user, #DNS, UDP/TCP, g(request), L), where g = type and arrival pdf (#request/s), L = logging?
- For register, outbound proxy, redirect, proxy480, proxy200.
- Parameters
  - Measurement interval, transaction response time, RPS (registers/s), CPS (calls/s), transaction failure probability < 5%
  - Delay budget: R1 < 500 ms, R2 < 2000 ms
- Shortcomings:
  - Does not consider forking, scripting, Via header, packet size, different call rates, SSL.
  - Is there a linear combination of results?
- Whitebox measurements: turnaround time
- Extend to SIMPLEstone
[Figure: test setup with Loader, Server, Handler and SQL database; message flows REGISTER / 200 OK (R1) and INVITE / 100 Trying / 180 Ringing / 200 OK / ACK / BYE / 200 OK (R2).]

SIP server: What happens inside a proxy?
[Figure: proxy processing flowchart. Stages: recvfrom or accept/recv; parse; Request or Response; match transaction (stateful) or stateless proxy; modify response; REGISTER: update DB; other: lookup DB; redirect/reject: build response; proxy: modify request, DNS; sendto, send or sendmsg. Legend: (blocking) I/O; critical section (lock); critical section (r/w lock).]

Lessons Learnt (sipd): In-memory database
- Call routing involves (≥ 1) contact lookups
  - 10 ms per query (approx)
- Cache (FastSQL)
  - Loading entire database is easy
  - Periodic refresh
  - Potentially useful for DNS lookups
[Figure: web config writes to the SQL database; the cache (< 1 ms) is filled by periodic refresh. Plot: turnaround time vs RPS, single CPU Sun Ultra10 [2002:Narayanan].]

Lessons Learnt (sipd): Thread-per-request does not scale
- One thread per message
  - Doesn’t scale
    - Too many threads over a short timescale
    - Stateless: 2-4 threads per transaction
    - Stateful: 30 s holding time
- Thread pool + queue
  - Thread overhead less; more useful processing
  - Pre-fork processes for SIP-CGI
  - Overload management
    - Graceful failure, drop requests over responses
  - Not enough if holding time is high
    - Each request holds (blocks) a thread
[Figure: incoming requests R1-4 served by thread per request vs a fixed number of threads. Plot: throughput vs load for "thread per request" and "thread pool with overload control".]

What is the best architecture?
- Event-based
  - Reactive system
- Process pool
  - Each pool process receives and processes to the end (SER)
- Thread pool
  1. Receive and hand-over to pool thread (sipd)
  2. Each pool thread receives and processes to the end
  3. Staged event-driven: each stage has a thread pool
[Figure: the proxy processing flowchart from "What happens inside a proxy?".]

Stateless proxy: UDP, no DNS, six messages per call
[Figure: the proxy processing flowchart from "What happens inside a proxy?".]

Stateless proxy: UDP, no DNS, six messages per call (results)
[Figure: bar chart over Architecture/Hardware (1xP/Linux, 4xP/Linux, 1xS/Solaris, 2xS/Solaris) with series Event, Th/msg, Th-pool1, Th-pool2, Proc-pool.]

Hardware (all figures in CPS):
  1xP/Linux:   1 PentiumIV 3GHz, 1GB, Linux2.4.20
  4xP/Linux:   4 pentium, 450MHz, 512 MB, Linux2.4.20
  1xS/Solaris: 1 ultraSparc-IIi, 300 MHz, 64MB, Solaris
  2xS/Solaris: 2 ultraSparc-II, 300 MHz, 256MB, Solaris

Architecture    1xP/Linux   4xP/Linux   1xS/Solaris   2xS/Solaris
Event-based     1650        370         150           190
Thread/msg      1400        TBD         100           TBD
Thread-pool1    1450        600 (?)     110           220 (?)
Thread-pool2    1600        1150 (?)    152           TBD
Process-pool    1700        1400        160           350

Stateful proxy: UDP, no DNS, eight messages per call
- Event-based
  - single thread: socket listener + scheduler/timer
- Thread-per-message
  - pool_schedule => pthread_create
- Thread-pool1 (sipd)
- Thread-pool2
  - N event-based threads
    - Each handles specific subset of requests (hash(call-id))
    - Receive & hand over to the correct thread
    - poll in multiple threads => bad on multi-CPU
- Process pool
  - Not finished yet

Stateful proxy: UDP, no DNS, eight messages per call (results)
[Figure: bar chart over Architecture/Hardware (1xP/Linux, 4xP/Linux, 1xS/Solaris, 2xS/Solaris) with series Event, Th/msg, Th-pool1, Th-pool2.]

Hardware (all figures in CPS):
  1xP/Linux:   1 PentiumIV 3GHz, 1GB, Linux2.4.20
  4xP/Linux:   4 pentium, 450MHz, 512 MB, Linux2.4.20
  1xS/Solaris: 1 ultraSparc-IIi, 360MHz, 256 MB, Solaris5.9
  2xS/Solaris: 2 ultraSparc-II, 300 MHz, 256 MB, Solaris5.8

Architecture    1xP/Linux   4xP/Linux    1xS/Solaris   2xS/Solaris
Event-based     1200        300          160           160
Thread/msg      650         175          90            120
Thread-pool1    950         340 (p=4)    120           120 (p=4)
Thread-pool2    1100        500 (p=4)    155           200 (p=4)
Process-pool    -           -            -             -

Lessons Learnt: What is the best architecture?
- Stateless
  - CPU is bottleneck
  - Memory is constant
  - Process pool is the best
  - Event-based not good for multi-CPU
  - Thread/msg and thread-pool similar
  - Thread-pool2 close to process-pool
- Stateful
  - Memory can become bottleneck
  - Thread-pool2 is good
    - But not N x CPU
    - Not good if P [?] CPU
  - Process pool may be better (?)
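The thread-pool2 design above routes each request to one of N event-based threads by hashing the Call-ID, so every message of a call reaches the same thread. The following is a minimal sketch of that dispatch step only. The CRC32 hash, the pool size, and the names `worker_for`, `dispatch` and `inboxes` are assumptions for illustration; the slides do not say which hash function sipd uses.

```python
import queue
import zlib

N_WORKERS = 4  # assumed pool size (the result tables report p=4)

# One inbox per event-based worker thread; the threads themselves are omitted.
inboxes = [queue.Queue() for _ in range(N_WORKERS)]


def worker_for(call_id: str, n: int = N_WORKERS) -> int:
    """Map a Call-ID to a worker index. CRC32 is an assumption, not sipd's hash."""
    return zlib.crc32(call_id.encode("utf-8")) % n


def dispatch(call_id: str, message: str) -> int:
    """Hand the message to the worker that owns this call; return its index."""
    idx = worker_for(call_id)
    inboxes[idx].put((call_id, message))
    return idx


if __name__ == "__main__":
    # All messages of one call land on the same worker.
    for msg in ("INVITE", "ACK", "BYE"):
        print(msg, "->", dispatch("a84b4c76e66710@pc33.example.com", msg))
```

Because the mapping depends only on the Call-ID, transaction state for a call never has to be shared between workers, which is the property the design relies on.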
Lessons Learnt (sipd): Avoid blocking function calls
- DNS
  - 10-25 ms (29 queries)
  - Cache
    - 110 to 900 CPS
    - Internal vs external; non-blocking
- Logger
  - Lazy logger as a separate thread
  - Logger: while (1) { lock; writeall; unlock; sleep; }
- Date formatter
  - Strftime() 10% REG processing
  - Update date variable every second
- random32()
  - Cache gethostid() - 37 µs

Lessons Learnt (sipd): Resource management
- Socket management
  - Problems: OS limit (1024), “liveness” detection, retransmission
  - One socket per transaction does not scale
    - Global socket if downstream server is alive, soft state - works for UDP
    - Hard for TCP/TLS - apply connection reuse
  - Socket buffer size
    - 64KB to 128KB; Tradeoff: memory per socket vs number of sockets
- Memory management
  - Problems: too many malloc/free, leaks
  - Memory pool
    - Transaction specific memory, free once; also, less memcpy
    - About 30% performance gain
      - Stateful: 650 to 800 CPS; Stateless: 900 to 1200 CPS

Stateless processing time (µs):

Message           INV   180   200   ACK   BYE   200   REG   200
W/o mempool       155   67    67    95    139   62    237   70
W/ mempool        111   49    48    64    106   41    202   48
Improvement (%)   28    27    28    33    24    34    15    31

Lessons Learnt (SER): Optimizations
- Reduce copying and string operations
  - Data lumps, counted strings (+5-10%)
- Reduce URI comparison to local
  - User part as a keyword, use r2 parameters
- Parser
  - Lazy parsing (2-6x), incremental parsing
  - 32-bit header parser (2-3.5x)
    - Use padding to align
    - Fast for general case (canonicalized)
  - Case compare
    - Hash-table, sixth bit
- Database
  - Cache is divided into domains for locking
[2003:Jan Janak] SIP proxy server effectiveness, Master’s thesis, Czech Technical University

Lessons Learnt (SER): Protocol bottlenecks and other scalability concerns
- Protocol bottlenecks
  - Parsing
    - Order of headers
    - Host names vs IP address
    - Line folding
    - Scattered headers (Via, Route)
  - Authentication
    - Reuse credentials in subsequent requests
  - TCP
    - Message length unknown until Content-Length
- Other scalability concerns
  - Configuration:
    - broken digest client, wrong password, wrong expires
  - Overuse of features
    - Use stateless instead of stateful if possible
    - Record route only when needed
    - Avoid outbound proxy if possible

Load Sharing: Distribute load among multiple servers
- Single server scalability
  - There is a maximum capacity limit
- Multiple servers
  - DNS-based
  - Identifier-based
  - Network address translation
  - Same IP address

Load Sharing (DNS-based): Redundant proxies and databases
- REGISTER: write to D1 & D2
- INVITE: read from D1 or D2
- Database write/synchronization traffic becomes bottleneck
[Figure: proxies P1, P2, P3 in front of databases D1, D2.]

Load Sharing (Identifier-based): Divide the user space
- Proxy and database on the same host
- First-stage proxy may get overloaded
  - Use many
- Hashing
  - Static vs dynamic
[Figure: user space split as P1/D1 a-h, P2/D2 i-q, P3/D3 r-z.]

Load Sharing: Comparison of the two designs
[Figure: DNS-based design (P1, P2, P3 with D1, D2) next to identifier-based design (P1 a-h, P2 i-q, P3 r-z, each with its own database). Labels: "High scale", "Low reliability".]
Total time per DB:
  DNS-based:        ((tr/D) + 1) TN = (A/D) + B
  Identifier-based: ((tr + 1)/D) TN = (A/D) + (B/D)
with A = trTN and B = TN, where
  D = number of database servers
  N = number of writes (REGISTER)
  r = #reads/#writes = (INV+REG)/REG
  T = write latency
  t = read latency/write latency

Scalability (and Reliability): Two stage architecture for CINEMA
[Figure: first-stage servers s1, s2, s3 and backup ex forward to second-stage groups: a*@example.com served by a1 (master) and a2 (slave); b*@example.com served by b1 (master) and b2 (slave). Labels: sip:[email protected], sip:[email protected].]

DNS SRV records:
  example.com
    _sip._udp  SRV 0 40 s1.example.com
               SRV 0 40 s2.example.com
               SRV 0 20 s3.example.com
               SRV 1 0  ex.backup.com
  a.example.com
    _sip._udp  SRV 0 0 a1.example.com
               SRV 1 0 a2.example.com
  b.example.com
    _sip._udp  SRV 0 0 b1.example.com
               SRV 1 0 b2.example.com

- Request-rate = f(#stateless, #groups)
- Bottleneck: CPU, memory, bandwidth?
Load Sharing: Result (UDP, stateless, no DNS, no mempool)

S   P   CPS
3   3   2800
2   3   2100
2   2   1800
1   2   1050
0   1   900

Lessons Learnt: Load sharing
- Non-uniform distribution
  - Identifier distribution (bad hash function)
  - Call distribution => dynamically adjust
- Stateless proxy
  - S=1050, P=900 CPS
  - S3P3 => 10 million BHCA (busy hour call attempts)
- Stateful proxy
  - S=800, P=650 CPS
- Registration (no auth)
  - S=2500, P=2400 RPS
  - S3P3 => 10 million subscribers (1 hour refresh)
- Memory pool and thread-pool2/event-based further increase the capacity (approx 1.8x)

Conclusions and future work
- Server scalability
  - Non-blocking, process/events/thread, resource management, optimizations
- Load sharing
  - DNS, Identifier, two-stage
- Current and future work:
  - Measure process pool performance for stateful
  - Optimize sipd
    - Use thread-pool2/event-based (?)
    - Memory - use counted strings; clean after 200 (?)
    - CPU - use hash tables
  - Presence, call stateful and TLS performance (Vishal and Eilon)

Backup slides

Telephone scalability (PSTN: Public Switched Telephone Network)
[Figure: PSTN hierarchy with signaling network (SS7) and “bearer” network.]
- local telephone switch (class 5 switch): 10,000 customers, 20,000 calls/hour
- regional telephone switch (class 4 switch): 100,000 customers, 150,000 calls/hour
- signaling router (STP): 1 million customers, 1.5 million calls/hour
- database (SCP) for freephone, calling card, …: 10 million customers, 2 million lookups/hour
- telephone switch (SSP)

SIP server: Comparison with HTTP server
- Signaling (vs data) bound
  - No File I/O (exception: scripts, logging)
  - No caching; DB read and write frequency are comparable
- Transactions
  - Stateful wait for response
- Depends on external entities
  - DNS, SQL database
- Transport
  - UDP in addition to TCP/TLS
- Goals
  - Carrier class scaling using commodity hardware
  - Try not to customize/recompile OS or implement (parts of) server in kernel (khttpd, AFPA)

Related work: Scalability for (web) servers
- Existing work
  - Connection dispatcher
  - Content/session-based redirection
  - DNS-based load sharing
- HTTP vs SIP
  - UDP+TCP, signaling not bandwidth intensive, no caching of response, read/write ratio is comparable for DB
- SIP scalability bottleneck
  - Signaling (chapter 4), real-time media data, gateway
  - 302 redirect to less loaded server, REFER session to another location, signal upstream to reduce

Related work: 3GPP (release 5)’s IP Multimedia core network Subsystem uses SIP
- Proxy-CSCF (call session control function)
  - First contact in visited network.
  - 911 lookup.
  - Dialplan.
- Interrogating-CSCF
  - First contact in operator’s network.
  - Locate S-CSCF for register
- Serving-CSCF
  - User policy and privileges, session control service
  - Registrar
- Connection to PSTN
  - MGCF and MGW

Server-based vs peer-to-peer
- Reliability, failover latency
  - Server-based: DNS-based. Depends on client retry timeout, DB replication latency, registration refresh interval
  - Peer-to-peer: DHT self organization and periodic registration refresh. Depends on client timeout, registration refresh interval.
- Scalability, number of users
  - Server-based: Depends on number of servers in the two stages.
  - Peer-to-peer: Depends on refresh rate, join/leave rate, uptime
- Call setup latency
  - Server-based: One or two steps.
  - Peer-to-peer: O(log(N)) steps.
- Security
  - Server-based: TLS, digest authentication, S/MIME
  - Peer-to-peer: Additionally needs a reputation system, working around spy nodes
- Maintenance, configuration
  - Server-based: Administrator: DNS, database, middle-box
  - Peer-to-peer: Automatic: one time bootstrap node addresses
- PSTN interoperability
  - Server-based: Gateways, TRIP, ENUM
  - Peer-to-peer: Interact with server-based infrastructure or co-locate peer node with the gateway

Comparison of sipd and SER
- sipd
  - Thread pool
  - Events (reactive system)
  - Memory pool
  - PentiumIV 3GHz, 1GB, 1200 CPS, 2400 RPS (no auth)
- SER
  - Process pool
  - Custom memory management
  - PentiumIII 850 MHz, 512 MB => 2000 CPS, 1800 RPS
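The CPS and RPS figures quoted throughout the deck can be related to subscriber-level numbers with simple arithmetic. The sketch below shows one way to do that conversion. The function names `bhca` and `required_rps` are hypothetical, and the assumption that registrations are spread evenly over the refresh interval is ours, not a claim from the slides.

```python
def bhca(cps: float) -> float:
    """Busy hour call attempts sustained at a given calls-per-second rate."""
    return cps * 3600


def required_rps(subscribers: int, refresh_s: float) -> float:
    """Registrations per second if every subscriber refreshes once per
    interval, assuming refreshes are spread uniformly (an assumption)."""
    return subscribers / refresh_s


if __name__ == "__main__":
    # S3P3 load-sharing result from the deck: 2800 CPS.
    print("S3P3 BHCA:", bhca(2800))
    # Single sipd server figure from the comparison slide: 1200 CPS.
    print("sipd BHCA:", bhca(1200))
    # 10 million subscribers with a 1 hour refresh.
    print("RPS needed:", round(required_rps(10_000_000, 3600), 1))
```

At 2800 CPS the first function gives just over 10 million call attempts per hour, which matches the deck's "S3P3 => 10 million BHCA" claim.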
