HTTP for DB Dummies
Steve Gribble
[email protected]

The Web
• HTTP 1.0 model (slowly fading out, replaced by HTTP 1.1):
[Diagram: Client (with cache) — TCP connection — Server; the client issues "GET /document.html" and the connection closes after the response]

Basics of HTTP

Structure of a Request
<METHOD> <URL> <HTTPVERSION>\r\n
<HEADERNAME>: <HEADERVAL>\r\n
<HEADERNAME>: <HEADERVAL>\r\n
…
\r\n
<DATA, IF POST>

Example:
GET /test/index.html?foo=bar+baz&name=steve HTTP/1.0\r\n
Connection: Keep-Alive\r\n
User-Agent: Mozilla/4.07 [en] (X11; I; Linux 2.0.36 i686)\r\n
Host: ninja.cs.berkeley.edu:5556\r\n
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*\r\n
Accept-Encoding: gzip\r\n
Accept-Language: en\r\n
Accept-Charset: iso-8859-1,*,utf-8\r\n
\r\n

Structure of a Response
<HTTPVERSION> <STATUS CODE> <MSG>\r\n
<HEADERNAME>: <HEADERVAL>\r\n
<HEADERNAME>: <HEADERVAL>\r\n
…
\r\n
<DATA, IF NECESSARY>

Example:
HTTP/1.0 200 OK
Server: Netscape-Enterprise/2.01
Date: Thu, 04 Feb 1999 00:28:19 GMT
Accept-ranges: bytes
Last-modified: Wed, 01 Jul 1998 17:07:38 GMT
Content-length: 1848
Content-type: text/html

TCP level analysis
[Diagram: packet timelines comparing HTTP 1.0 with FTP (second and later files)]

Interesting TCP gotchas
• Mandatory round trips
  – TCP three-way handshake
  – GET request, then data return
  – new connections for each inlined image (parallelize)
  – lots of extra SYN or SYN/ACK packets
• Slow-start penalties
  – can show this only affects fast networks, not modems
• Lots of TCP connections to the server
  – spatial/processing overhead in the server (TCP stack)
  – many protocol control block (PCB) TIME_WAIT entries
  – unfairness because congestion-control information is lost

Fix?
• Persistent HTTP
  – in HTTP/1.0, add the "Connection: Keep-Alive\r\n" header
  – in HTTP/1.1, P-HTTP is built in
• Does it help?
  – mostly for server-side reasons, not network efficiency
  – allows pipelining of multiple requests on one connection
• Does it hurt?
  – how does a client know when the document has been fully returned?
  – when does the connection get dropped?
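For the first question: on a keep-alive connection the client cannot treat connection close as the end-of-document marker, so it must count Content-Length bytes instead. A minimal sketch (the header parsing is deliberately simplistic):

```python
# On a persistent (keep-alive) connection, read exactly one response:
# read until the blank line ends the headers, find Content-Length,
# then read exactly that many body bytes, leaving the socket open.
def read_one_response(conn):
    raw = b""
    while b"\r\n\r\n" not in raw:           # read until end of headers
        raw += conn.recv(4096)
    head, body = raw.split(b"\r\n\r\n", 1)
    length = 0
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.lower() == b"content-length":
            length = int(value)
    while len(body) < length:               # read the remaining payload
        body += conn.recv(4096)
    return head, body[:length]
```

Here `conn` is any object with a `recv` method (a TCP socket in practice). The second question — when the connection actually gets dropped — is answered next.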
  • idle timeouts on the server side
  • the client drops connections
  • the server needs to reclaim resources

HTTP/1.0 Client Methods
• GET
  – fetch and return a document
  – the URL can be overloaded to submit form data
    • GET /foo/bar.html?x=bar&bam=baz
• POST
  – submit a form, and receive a response
• HEAD
  – like GET, but returns only the HTTP headers, not the data itself; useful for caching
• PUT, DELETE, LINK, UNLINK
  – not really used - big security issues if not careful

HTTP/1.0 Status Codes
• Family of codes, with 5 "types"
  – 1xx: informational
  – 2xx: successful, e.g. 200 OK
  – 3xx: redirection (gotcha: redirection loops?)
    • 301 Moved Permanently
    • 304 Not Modified
  – 4xx: client error
    • 400 Bad Request
    • 401 Unauthorized
    • 403 Forbidden
    • 404 Not Found
  – 5xx: server error
    • 501 Not Implemented
    • 503 Service Unavailable

HTTP/1.0 Headers (case insensitive)
• Allow - returned by server
  – Allow: GET, HEAD
  – never used in practice - clients know what they can do
• Authorization - sent by client
  – Authorization: <credentials>
  – "Basic Auth" is commonly used
  – <credentials> = Base64( username:password )
  – OK if inside an SSL connection (encrypted)
• Content-Encoding - sent by either
  – Content-Encoding: x-gzip
  – selects an encoding for the transport, not the content
  – sadly, no common support for encodings (Windows)

HTTP/1.0 Headers, continued
• Content-Length - sent by either
  – Content-Length: 56
  – how much payload is being sent?
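One place the length matters: a POST must declare the length of its form payload. A sketch (the path and field names are illustrative):

```python
from urllib.parse import urlencode

# A POST must carry a Content-Length so the server knows where the
# form data ends.  urlencode produces the x=bar&bam=baz encoding.
def build_post(path, fields):
    body = urlencode(fields).encode("ascii")
    head = (f"POST {path} HTTP/1.0\r\n"
            f"Content-Type: application/x-www-form-urlencoded\r\n"
            f"Content-Length: {len(body)}\r\n\r\n")
    return head.encode("ascii") + body

print(build_post("/foo/bar.html", {"x": "bar", "bam": "baz"}))
```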
  – necessary for persistent HTTP, or for POSTs
• Content-Type - sent by server
  – Content-Type: text/html
  – what MIME type the payload is
  – nasty one: multipart/mixed
• Date
  – Date: Tue, 15 Nov 1994 08:12:31 GMT
  – 3 accepted date formats (RFC 822, RFC 850, asctime())

HTTP/1.0 Headers, continued
• Expires - sent by server
  – Expires: Thu, 01 Dec 1994 16:00:00 GMT
  – primitive caching expiration date
  – cannot force clients to update their view; takes effect only on refresh
• From - sent by client
  – From: [email protected]
  – not really used
• If-Modified-Since - sent by client
  – If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
  – server returns the data if modified, else "304 Not Modified"

HTTP/1.0 Headers, continued
• Last-Modified - returned by server
  – Last-Modified: Sat, 29 Oct 1994 19:43:31 GMT
  – semantically imprecise - file modification? record timestamp? generation date, in case the file is dynamically generated?
  – used with If-Modified-Since and the HEAD method
• Location - returned by server
  – Location: http://www.cs.ubc.ca
  – used in case of 3xx redirections
• Pragma - sent by client or server
  – Pragma: no-cache
  – extensibility mechanism; no-cache is the only popularly used pragma, AFAIK

HTTP/1.0 Headers, continued
• Referer - sent by client
  – Referer: http://www.xxx-smut.com
  – specifies the address from which the request was generated
  – all sorts of privacy issues - must be careful with this
• Server - returned by server
  – Server: Netscape-Enterprise/2.01
  – identifies the server software. Why? (measurement…)
• User-Agent - sent by client
  – User-Agent: Mozilla/4.07 [en] (X11; I; Linux 2.0.36 i686)
  – identifies the client software
  – why? Optimize layout, send content based on client capability
  – hint: just pretend to be Netscape. MSIE does.
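The Last-Modified / If-Modified-Since pair above reduces, on the server side, to a timestamp comparison. A minimal sketch (the body bytes are a placeholder):

```python
from email.utils import parsedate_to_datetime

# Conditional GET: send the data only if it changed after the client's
# If-Modified-Since timestamp, otherwise a bodyless 304 Not Modified.
def conditional_get(if_modified_since, last_modified, body):
    since = parsedate_to_datetime(if_modified_since)
    modified = parsedate_to_datetime(last_modified)
    if modified <= since:
        return "HTTP/1.0 304 Not Modified", b""
    return "HTTP/1.0 200 OK", body

status, data = conditional_get("Sat, 29 Oct 1994 19:43:31 GMT",
                               "Sat, 29 Oct 1994 19:43:31 GMT",
                               b"<html>...</html>")
print(status)
```

Note that this only handles the RFC 822 date format; a real server must also accept RFC 850 and asctime() dates, as listed above.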
HTTP/1.0 Server headers
• WWW-Authenticate - sent by server
  – WWW-Authenticate: <challenge>
  – tells the client to resend the request with an Authorization: header
• Incrementally added hacks:
  – Accept: image/gif, image/jpeg, text/*, */*
  – Accept-Encoding: gzip
  – Accept-Language: en
  – Retry-After: (date) or (seconds)
  – [Set-]Cookie: Part_Number="Rocket_Launcher_0001"; Version="1"; Path="/acme"
  – Title: (title)

HTTP/1.1 Additions
• Lots of problems associated with HTTP/1.0:
  – the network problems we talked about before
  – very poor cache consistency models
  – difficulty implementing multi-homed servers
    • want 1 IP address with multiple DNS names - how?
  – hard to precalculate content lengths
  – connection dropped = lost data
    • no chunking
• HTTP/1.1 is a bloated spec that fixes these problems
  – introduces many complexities
  – no longer an easy protocol to implement

HTTP/1.1 - a Taste of the New
• Host: www.ninja.com
  – clients MUST send this - fixes the multi-homed problem
  – already in most 1.0 and 1.1 clients
• Range: bytes=300-304,601-993
  – useful for broken-connection recovery (like FTP recovery)
• Age: <seconds, date>
  – expiration from caches
• Etag: fa898a3e3
  – unique tag to identify a document (strong or weak forms)
• Cache-Control: <command>
  – marking documents as private (don't keep them in caches)
• "chunked" transfer encoding
  – segmenting of documents - no need to calculate the entire document length in advance; useful for dynamic query responses

Architectural Complexities

Caches
[Diagram: Client (with cache) — TCP — Server; the original web had a cache only at the client]
• Problem: no locality
  – non-local access patterns (trans-Atlantic access)
  – servers serving the same bytes millions of times to localized communities of users

Solution: Cache Hierarchy
[Diagram: Client (with cache) — Cache — hierarchy of Caches — Server]
• NLANR cache hierarchy most widely developed
  – informally uses the Squid cache
  – root servers squirt out 30 GB per day
  – anybody can join...

Gotchas
• Staleness
  – HTTP/1.1 cache consistency mechanisms mostly solve this
• Security
  – what happens if I infiltrate a cache?
  – servers/clients don't even know this is happening
  – e.g.: AOL used to have a very stale cache, but has since moved to Inktomi
• Ad clickthrough counts
  – how does Yahoo know how many times you accessed their pages, or more importantly, their ads?

CGI-BIN gateways
[Diagram: Client — httpd (with cache) — CGI code — File System; URLs flow in, data flows back]
• CGI = "Common Gateway Interface"
  – an interface that allows independent authors to develop code that interacts with web servers
  – dynamic content generation, especially from scripts
  – CGI programs typically execute in a separate process

CGI-BIN to DB gateways
[Diagram: Client — httpd (with cache) — CGI code — ODBC/JDBC — DB, alongside the File System]
• JDBC/ODBC gateways
  – single-node DB, often running on a remote host
  – usually long, blocking operations
  – nasty transactional issues - how does the client know whether an action succeeded or failed?
    • Datek/E*Trade troubles

cgi-bin security
• Lots of gotchas with CGI-BIN programs
  – buffer overflows (maximum length checks?)
  – shell metacharacter expansion
    • what happens if you put `cat /etc/passwd` in a form field?
  – sending mail, reading files
  – redirection - allows bypassing IP-address-based security

Multiple server support
• We've seen how a single IP address can serve multiple web sites with the HTTP/1.1 "Host:" field
  – what about having multiple physical hosts serve a single web site?
  – useful for scalability reasons
[Diagram: Client (with cache) — TCP — several Servers behind www.hotbot.com]

Solutions
• DNS round-robin
  – assign multiple IP addresses to a single domain name
  – clients select amongst them in order
  – shortcomings:
    • exposes individual nodes to clients
    • can't take into account machine capabilities (multiprocessors) or currently experienced load
• Front-end redirection
  – a single front-end node serves an HTTP redirect to a selected backend node
  – introduces an extra round trip; the front end is a single point of failure

More solutions
• IP-level multiplexing through a smart router
  – munge IP packets and send them to a selected host
  – Cisco, Sun, etc. make hardware to do this
    • Cisco LocalDirector
  – tricky state management issues, failure semantics
• "Smart Clients"
  – Netscape "Proxy Autoconfig" (PAC) mechanism
    • only useful if connecting via a proxy
    • JavaScript selects from amongst proxies
  – no HTTP protocol support for smart-client access to web servers

The "Real" Picture of the Web
[Diagram: Client (with cache) — cache/firewall — Redirector — multiple HTTP servers, each with a cache and CGI code — DB; e.g. www.nytimes.com]

Web Characteristics (UCB HIP trace)
• Web traffic circa 1997 is primarily:
  – GIF data
    • 27% of bytes transferred, 51% of files transferred
    • average size: 4.1 KB
  – JPEG data
    • 31% of bytes transferred, 16% of files transferred
    • average size: 12.8 KB
  – HTML data
    • 18% of bytes transferred, 22% of files transferred
    • average size: 5.6 KB
• File sizes, server latency, access patterns
  – all heavy-tailed: most are small, but some are very large
  – self-similarity everywhere - lots and lots of bursts

Server-Side Architecture

Goals of a server
• High-capacity web servers must do the following:
  – rapidly update the corpus of content served
  – be efficient
    • latency: serve content as quickly as possible
    • throughput: handle parallel requests from large numbers of clients
  – be extensible
    • data types
    • cgi-bin programs
    • server plug-ins
  – not crash
  – remain secure

High-level Architecture
[Diagram: a concurrency subsystem tying together the protocol parser, filesystem cache, network handler, plugin interface, CGI interface, reverse-DNS cache, and logging subsystem]

Concurrency
• How many simultaneously open connections must a server handle?
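A back-of-the-envelope answer comes from Little's law: open connections ≈ arrival rate × per-request latency. A sketch, assuming a 50 requests/second peak:

```python
import math

# Little's law: average concurrent connections =
#   arrival rate (requests/sec) * time each request stays open.
def concurrent_connections(requests_per_sec, latency_ms):
    return math.ceil(requests_per_sec * latency_ms / 1000)

for ms in (10, 50, 200, 5000):
    print(f"{ms} ms -> {concurrent_connections(50, ms)} connections")
```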
  – 1,000,000 hits per day
    • 12 hits per second on average
    • upwards of 50 hits per second at peak (bursts, diurnal cycle)
  – latency:
    • 10 milliseconds (out of memory) ==> 1 connection
    • 50 milliseconds (off of disk) ==> 3 connections
    • 200 milliseconds (CGI + disk) ==> 10 connections
    • 5 seconds (CGI-to-DB gateway) ==> 250 connections
• Depending on expected usage, very different concurrency models are needed

Strategies
• Single process, single thread, serialized
  – simplest implementation, worst performance
  – perfectly fine for low-traffic sites
• Multiple processes, one serialized thread per process
  – Apache web server model
  – expensive (context switching, process state, …)
• Multithreaded [and multiprocess]
  – complex synchronization primitives needed
  – thread creation/destruction vs. thread-pool management
• Event-driven, asynchronous I/O
  – eliminates context-switch overhead, better memory management
  – very complex and delicate program flow

Disk I/O
• File system overhead
  – file system buffer management is not optimal
  – don't need many of the file system facilities
    • modifying files, moving files, locking files, seeks…
• Alternatives:
  – interact with the disk directly
    • very fast, very complex
  – in-memory caching on top of the file system
    • works well given the high locality of server accesses
    • be careful not to suffer from double buffering
• Interaction: thread subsystem and disk
  – balanced system - enough threads to saturate disk I/O

Network I/O
• Typical server behaviour is rough on the network stack
  – multiple outstanding connections
  – very rapid TCP creation and teardown
  – often, a very slow last-hop network segment
• Redundant operations performed
  – checksum calculations, byte swapping, …
• Inefficiencies at the packet level
  – header, body, and FIN usually cost three separate round trips
• Poor network stack implementations
  – TIME_WAIT and IDLE PCB entries kept on a single linked list
  – Nagle's algorithm invoked when it shouldn't be

Inline scripting
• Technology: server-side includes (SSIs)
  – scripts embedded inside content, interpreted before being sent back to the client
  – dynamically computed content inside templates
    • authorization (cert lookup or authentication)
    • DB lookup (inventory lists, product prices, …)
• Challenges
  – similar to CGI:
    • security
    • efficiency (latency and throughput)

Cheetah (Exokernel)
• Direct access to hardware primitives
  – disk, network - eliminates costly OS generalizations
  – scatter/gather I/O primitives
  – allows common disk/network buffers (eliminates a copy)
• Compiler-assisted ILP
  – eliminates redundancies and staging inefficiencies
• HTTP-specialized network stack and file system
  – precomputed HTTP headers, minimal copies
  – minimize network packets (e.g. piggyback FINs with data)
  – precomputed TCP/IP checksums

Some Parting Thoughts

Other things to keep in mind
• There are non-humans on the web
  – spiders, crawlers, worms, etc. may behave badly
    • infinite FTP directory traps, request bursts, ...
• Netscape, MSIE, and Apache set de facto standards
  – their semantics may subtly differ from the standards
  – the error tolerance of popular clients/servers means that everybody must achieve the same level of tolerance
    • otherwise, you appear to be broken to users
    • e.g.: Netscape not parsing comments properly
• SSL/X.509
  – transport-level security: fixes up the Basic Auth problems
  – defeats caching and proxy mechanisms
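Looping back to the server concurrency strategies: the event-driven, asynchronous-I/O model can be sketched with Python's selectors module. This toy responder (the canned response is a placeholder) handles one request over a socketpair standing in for a TCP connection:

```python
import selectors
import socket

# A toy event-driven responder: one thread, a non-blocking socket, and
# a selector dispatching read-ready events - no per-connection thread.
RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok"

def serve_one(server_side):
    sel = selectors.DefaultSelector()
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ)
    buf = b""
    while True:
        for key, _ in sel.select(timeout=1):
            buf += key.fileobj.recv(4096)
            if buf.endswith(b"\r\n\r\n"):      # full request received
                key.fileobj.sendall(RESPONSE)
                sel.unregister(key.fileobj)
                return buf

# Exercise it: a socketpair stands in for an accepted TCP connection.
client, server = socket.socketpair()
client.sendall(b"GET / HTTP/1.0\r\n\r\n")
serve_one(server)
reply = client.recv(4096)
```

A real server would keep many sockets registered in the same selector and loop forever, which is exactly where the "very complex and delicate program flow" noted above comes from.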