Scaling SIP Servers
Sankaran Narayanan
Joint work with CINEMA team
IRT Group Meeting, April 17, 2002

Agenda
  • Introduction
  • Issues in scaling
  • Facets of sipd architecture
  • Some results
  • Conclusion and Future Work

Introduction - SIP servers
  • SIP signaling: proxy, redirect
  • Proxies
    • Call routing by contact location
    • UDP/TCP/TLS
    • Stateful or stateless
    • Programmable scripts
  • User location: registrars
    • SQL database

What is scale?
  • Large call volumes, commodity hardware [Schu0012:Industrial]
  • Response times (mean, deviation), turnaround time
  • Goals
    • Delay budget [SIPstone]: R1 < 500 ms, R2 < 2 s
    • Class-5 switches handle > 750K BHCA
  [Figure: REGISTER / 200 OK and INVITE / 180 / 200 / ACK message flows, with the intervals R1 and R2 marked]

Limits to scaling
  • Not CPU bound
  • OS resource limits
  • Network I/O: blocking
    • Wait for responses
    • Latency: contact, DNS lookups
  • Open files (<= 1024 on Unix)
  • LWPs (Solaris) vs. user-kernel threads (Linux, Windows)
  • Try not to…
    • Customize and recompile the OS
    • Put (parts of) the server into the kernel (khttpd, AFPA, …)

The problem
  • Scaling CPU-bound jobs (throughput = 1/delay)
    • Hardware: CPU speed, RAM, …
    • Software: better OS, scheduler, …
    • Algorithm: optimize protocol processing
  • Blocking (network, disk I/O) is expensive
  • Hypothesis
    • I/O-bound → CPU-bound; reduce blocking
    • Optimized resource usage: stability at high loads

Facets of sipd architecture
  • Blocking
  • Process models
  • Socket management
  • Protocol processing

Blocking
  • Mutex, event (socket, timeout), fread
  • Queue builds up
  • Potentially high variability
  • Tandem queue system
  • Easy to fix
    • Non-blocking calls (event driven, later!)
    • Move queue to a different thread (lazy logger)
  • Logger { lock; write; unlock; }

Blocking (2)
  • Call routing involves (≥ 1) contact lookups
  • 10 ms per query (approx.)
  • Cache
    • Works well for sipd-style servers
    • Fetch-on-demand with replacement (harder)
    • Loading the entire database is easy
    • Needs refresh: servers are long-lived
  • Potentially useful for DNS SRV lookups (?)
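The "load the entire database, refresh periodically" option from the Blocking (2) slide can be sketched roughly as below. This is a minimal illustration, not sipd's implementation: the class name `ContactCache`, the `contacts(user, contact)` table layout, the 60-second default interval, and the use of SQLite as the SQL database are all assumptions made for the example.

```python
import sqlite3
import threading


class ContactCache:
    """In-memory copy of a contact table, reloaded in full on a timer.

    Hypothetical sketch: the table name, its columns and the refresh
    interval are illustrative assumptions, not taken from sipd.
    """

    def __init__(self, db_path, refresh_seconds=60.0):
        self._db_path = db_path
        self._refresh_seconds = refresh_seconds
        self._contacts = {}
        self._stop = threading.Event()
        self.refresh()  # initial full load before serving lookups

    def refresh(self):
        """Load the entire table, then swap it in with one assignment."""
        conn = sqlite3.connect(self._db_path)
        try:
            rows = conn.execute("SELECT user, contact FROM contacts").fetchall()
        finally:
            conn.close()
        fresh = {}
        for user, contact in rows:
            fresh.setdefault(user, []).append(contact)
        self._contacts = fresh  # readers see either the old or the new dict

    def lookup(self, user):
        """Return cached contacts for a user without touching the database."""
        return list(self._contacts.get(user, []))

    def start(self):
        """Run refresh() in a background thread until stop() is called."""
        def loop():
            # wait() returns False on timeout, True once stop() was called
            while not self._stop.wait(self._refresh_seconds):
                self.refresh()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

The request path only calls `lookup()`, so the per-query database round trip moves off the call-routing path and into the refresh thread. The trade-off is that a newly registered contact is invisible until the next refresh.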
  [Figure: the SQL database feeds the cache through a periodic refresh; cache lookups take < 1 ms]

REGISTER performance
  • Single-CPU Sun Ultra10
  • Response time is constant for the cache (FastSQL)

Process models (1)
  • One thread per request
  • Doesn't scale
    • Too many threads over a short timescale
    • Stateless proxy: 2-4 threads per transaction
  • High load affects throughput
  [Figure: incoming requests R1-4, each handled by its own thread; plot of throughput vs. load]

Process models (2)
  • Thread pool + queue
  • Less thread overhead; more useful processing
  • Overload management: drop requests over responses, drop tail
  • Not enough if holding time is high
    • Each request holds (blocks) a thread
  [Figure: incoming requests R1-4 queued for a fixed number of threads; plot of throughput vs. load]

Stateless proxy (Solaris)
  • Turnaround time is almost constant for the stateless proxy
  • The sudden increase in response time is a client problem
  • UDP losses on Ultra10 @ (120 * 6 * 500 * 8) bps

Stateless proxy (Linux)
  • Request turnaround time breaks down
  • Response turnaround time is constant
  • Effect of high holding times and thread scheduling
  • How to set the queue size: investigate?

Queue evolution for sipd
  • Number of requests (y-axis) waiting in the queue for a free thread on Solaris (left) and Linux (right) over a period of up-time (x-axis).

Process models (3)
  • Blocking thread model needs "too many" threads
    • Stateful transaction stays for 30 s
  • Return thread to the free pool instead of blocking
  • Event-driven architectures
    • State transition triggered by a global event scheduler
    • OnIncoming1xx(), OnInviteTimeout(), …
  • SIP-CGI: multiple pre-forked processes

Socket management
  • Problem: open sockets limit (1024), "liveness" detection, retransmission
  • One socket per transaction does not scale
  • Global socket if the downstream server is alive, soft state: works for UDP
  • Hard for TCP/TLS (connections)
  • Worse for Java servers: no select, poll

Optimizing protocol processing
  • Not too useful if the CPU is not the bottleneck
  • Text protocol: parsing and formatting overheads
  • Order of headers matters (Via)
  • Other optimizations (parse-on-demand, date formatting)
  • ...
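The parse-on-demand optimization listed under protocol processing can be illustrated with a small sketch. It is not sipd code: `LazySipMessage`, `parse_via` and the `via_parses` counter are hypothetical names, and the parser is deliberately simplified (it assumes no header folding, no compact header names, and one Via value per header line).

```python
class LazySipMessage:
    """Indexes header lines cheaply and defers structured parsing.

    Hypothetical sketch of parse-on-demand: only the Via header gets a
    structured parse, and only when vias() is first called.
    """

    def __init__(self, raw):
        head, _, self.body = raw.partition("\r\n\r\n")
        lines = head.split("\r\n")
        self.start_line = lines[0]
        self._raw = {}  # lowercase header name -> list of unparsed values
        for line in lines[1:]:
            name, sep, value = line.partition(":")
            if sep:
                self._raw.setdefault(name.strip().lower(), []).append(value.strip())
        self._vias = None    # filled in on first call to vias()
        self.via_parses = 0  # counts how often the Via parse actually ran

    def raw_header(self, name):
        """Return the unparsed values of a header (no structured parse)."""
        return self._raw.get(name.lower(), [])

    def vias(self):
        """Parse the Via headers on first use and memoize the result."""
        if self._vias is None:
            self.via_parses += 1
            self._vias = [parse_via(v) for v in self._raw.get("via", [])]
        return self._vias


def parse_via(value):
    """Split 'SIP/2.0/UDP host:port;param=value' into a small dict."""
    sent_protocol, _, rest = value.partition(" ")
    sent_by, *params = [part.strip() for part in rest.split(";")]
    parsed = {
        "transport": sent_protocol.split("/")[-1],
        "sent_by": sent_by,
        "params": {},
    }
    for param in params:
        key, _, val = param.partition("=")
        parsed["params"][key] = val
    return parsed
```

A proxy that only needs the topmost Via would call `vias()` once per message, and headers it never inspects stay as raw strings. As the slide notes, this kind of saving matters little when the CPU is not the bottleneck.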
Conclusion
  • Unlike web servers: can be stateful, less disk I/O, lesser impact of TCP stack/behavior, …
  • Pros: UDP, stateless routing, load-balancing using DNS, …
  • Challenges: scaling the state machine
  • Towards 2.5M BHCA (3600 messages/s)
    • Event-driven architecture (SEDA?)
    • Resource management (file limits, threads)
    • Tuning the operating system (scheduler, …)

Future work
  • Stateful proxy performance
  • Evaluate event-driven architecture
  • Effect of request forking (> 1 contacts) on server behavior
  • Programmable scripts
  • Queue management and overload control
  • Other types of servers (conference servers, media servers, etc.)

References
  • CINEMA web page. http://www.cs.columbia.edu/IRT/cinema
  • [Schu0012:Industrial] H. Schulzrinne. "Industrial strength internet telephony," Presentation at 6th SIP bakeoff, Dec. 2000.
  • [SIPstone] H. Schulzrinne et al. "SIPstone - Benchmarking SIP server performance," CS Technical report, Columbia University.