
MT834 Unit 1
The Web and the Internet

Course team
Developer: Jenny Lim, Consultant
Designer: Chris Baker, OUHK
Coordinator: Dr Li Tak Sing, OUHK
Member: Dr Andrew Lui Kwok Fai, OUHK

External Course Assessor
Prof. Mingshu Li, Institute of Software, Chinese Academy of Sciences

Production
ETPU Publishing Team

Copyright © The Open University of Hong Kong, 2004. Reprinted 2008.
All rights reserved. No part of this material may be reproduced in any form by any means without permission in writing from the President, The Open University of Hong Kong.

The Open University of Hong Kong
30 Good Shepherd Street
Ho Man Tin, Kowloon
Hong Kong

Contents

Overview
Objectives
Introduction
What is the Web?
    Design and structure
    HTML
    URLs
    HTTP
The Internet
    Design and structure
    Communication protocols
    Internet Protocol (IP)
    Ports
    Transmission Control Protocol (TCP)
Web servers
    Role of browsers and servers
    Installing a local Web server
Summary
Suggested answers to self-tests
References

Overview

Welcome to Unit 1 of MT834 Web Server Technology. Please be reminded that you should have read the MT834 Course Guide by now. It’s also a good idea to browse the course website through the Open Learning Environment (OLE). The course website offers interesting information and activities associated with each unit. You will find that this course includes theoretical concepts as well as hands-on experimentation. As a graduate student, you are also encouraged to visit the University’s award-winning Electronic Library.

If you have done all these things, you are ready to get started with this first unit. I hope you are as excited as I am by the new era of the information age. Without a doubt, the Web, together with the Internet, plays a most important role in allowing information to be accessed easily and instantly.
To understand how the Web works, you first need a good understanding of the underlying network that serves as its delivery medium: the Internet. Here are the aspects of the Internet we shall examine in this unit:

• the physical network;
• design and structure of the Internet;
• Domain Name System (DNS); and
• Transmission Control Protocol/Internet Protocol (TCP/IP).

Next, we’ll discuss the technologies used to provide the World Wide Web service over the Internet:

• HyperText Transfer Protocol (HTTP);
• Uniform Resource Locator (URL);
• Web client and server software; and
• HyperText Markup Language (HTML) documents.

To learn these concepts, you will read selected online readings and conduct a series of hands-on exercises on your local computer. You will install a Web browser, create your own HTML document, and then install and configure an Apache Web server to serve the HTML document.

This first unit of MT834 Web Server Technology is expected to take you four weeks (or about 30 hours of study time) to complete. Please plan your time carefully. As you work through the unit, you will need to refer to online readings and activities on the MT834 Web Server Technology course website. You may begin Unit 1 now by reading the unit’s learning objectives.

Objectives

By the end of this unit, you should be able to:

1 Explain how the World Wide Web, the Domain Name System, FTP and other applications are made available over the medium of the Internet.
2 Discuss the design and structure of the Internet, and the use of TCP/IP as its transmission protocol.
3 Create basic HTML documents and serve these documents from a local Web server.
4 Describe the role of Web browsers and Web servers.

Introduction

As a distance education student, you will find that the World Wide Web is an integral part of your life.
In courses such as this one, you rely on it to communicate with tutors and fellow students, download study materials, and view online resources. This experience is bound to make you very familiar with the Web from an end-user’s point of view, that is, from the perspective of someone who requests information via a Web browser.

But what really goes on behind the scenes after you click on a hyperlink? How does your Web browser locate the document that you want, and how does this document get transmitted to your machine? Where are these documents stored and how are they organized?

As someone who is about to embark on a course in Web server technologies, you need a deeper understanding of the Web and the Internet than that of an experienced Web user. In this unit, we will take our first steps towards gaining this knowledge. We will start with an in-depth discussion of the design principles, technologies and protocols behind the Web and the Internet. These fundamental concepts will underlie the Web server technologies to be discussed throughout the course.

What is the Web?

Computers have been connected to the Internet since the 1970s, and data exchange between networked computers has been around for just as long. However, the launch of the World Wide Web in the early 1990s offered the prospect of something totally new. It allowed the entire Internet to be viewed as a single information space, where users accessing data could move seamlessly and transparently from machine to machine by following links. Before the Web, individual Internet computers had windowing systems and graphical capabilities, but networking applications such as email and FTP were still text-based. The ease, convenience and graphical nature of the Web have made it the ‘killer’ application of the Internet.

The Web, as it is commonly called, is a collection of interlinked information that is accessible through a worldwide network.
It is a digital ‘information space’ with a means for users to access and retrieve documents from it. Here are the components of the Web (also shown in figure 1.1 below) which make all this information storage, organization and retrieval possible:

1 Information in the form of multimedia documents. Multimedia means that these documents can be composed of text, images, animation, audio, video, and other types of content.
2 Computers where these documents are stored (known as servers or providers) and computers from which these documents are accessed (known as customers or clients).
3 A networking medium which connects these clients and servers so that data can travel between them, namely, the Internet. The Web was built to run ‘on top’ of the Internet, which exists independently of the Web.
4 Web client and server software which allows webpages on any Web server to be accessed from around the world.

From the user’s point of view, all that’s needed to get on the Web is a Web browser and an Internet connection. From the information provider’s point of view, what is needed is a Web server which is connected to the Internet.

Figure 1.1 Overview of the Web and its components (a Web browser and a Web server with documents, connected through the Internet)

For the rest of Unit 1, we will discuss the technologies, protocols and standards that are used by these different Web components.

Design and structure

Tim Berners-Lee proposed the World Wide Web in 1989 because he wanted a better way of sharing and retrieving information among the people who worked at CERN (the European Organization for Nuclear Research) in Geneva, Switzerland. When he was designing the Web, these were some of his goals:

• To allow access to different kinds of information stored on disparate computing platforms. Common protocols had to be used to provide a bridge between different computer operating systems and networks.
• To use hypertext, or nonlinear text, that allows related documents to be tied together via ‘active links’ which users can ‘follow’ by clicking on them. The Web browser then fetches and displays the document pointed to by the link.
• To decentralize control and access. In order to get on the Web, all that was needed was access to the networking medium (e.g., the Internet) and software to retrieve and view the documents on it (e.g., the browser). There was no central node or computer to which everything had to be connected.

The Web’s architecture follows a standard client-server model. In this model, a user relies on a program (the client) to connect to a remote machine (the server), where the requested resource is stored. Possible resources could be text files, multimedia documents or dynamically generated pages. Web clients, such as Internet Explorer and Firefox, know how to present data but do not need to know the details of how this data is stored or generated. Web servers, such as Apache and Internet Information Server (IIS), know how to extract data, but are ignorant of the details of how it will be presented to the user.

Tim Berners-Lee came up with three important new technologies for creating the Web:

1 HyperText Markup Language (HTML);
2 Uniform Resource Locators (URLs); and
3 HyperText Transfer Protocol (HTTP).

These were based on ideas which had emerged over the preceding decades. However, the technology needed to make hypertext systems a reality was only brought together in the early 1990s, with the birth of the Web. In the remaining part of this section, we will discuss these three technologies in more detail.

Now answer the following questions to test your understanding of the Web’s design and structure.

Self-test 1.1

1 The Web uses the client-server architecture model. What does this mean and how can clients and servers become part of the Web?
2 What is the relationship between the Web and the Internet?
HTML

HTML, or HyperText Markup Language, is the markup language used to create documents on the Web. A markup language allows authors to highlight portions of the content and assign meaning to these portions by tagging them.

HTML documents are plaintext or ASCII files that can be created using any text editor on any machine. Tim Berners-Lee chose the plaintext format because it could be understood by all computers, regardless of their operating system or hardware platform. This is an important factor behind the universality and accessibility of the Web.

HTML documents contain a combination of text and markup tags. The markup tags specify the logical structure and organization of a document, for example, which parts belong to the head and body of the page. Here are the basic tags which must be found in every HTML document.

<HTML>
<HEAD>
<TITLE> page title </TITLE>
</HEAD>
<BODY>
body of the document
</BODY>
</HTML>

Figure 1.2 Basic HTML document tags

Aside from these basic tags, there are other tags which can be used to mark up parts of the Web document body, such as paragraphs, headings, lists, quotes, definitions and citations. The interpretation of these marked elements is left to the browser. This choice was made because the same HTML document may be viewed by different browsers of varying abilities.

Here is a more detailed example of the elements that may be found in an HTML document.

HTML source

<html>
<head>
<TITLE>Learning HTML</TITLE>
</head>
<body>
<H1>HTML is Easy To Learn</H1>
<p>Welcome to the world of HTML. This is the first paragraph. <b>This text is bold.</b></p>
<p>And this is the second paragraph. <em>This text is emphasized.</em></p>
<p>Here are 3 reasons to learn HTML:</p>
<ul>
  <li>You can build your own home page.
  <li>You can start an online business.
  <li>You can share your photo albums.
</ul>
</body>
</html>

How it is displayed by my browser (Internet Explorer 5.5)

Figure 1.3 Elements that may be found in an HTML document

Aside from defining the structure of the document, markup tags can also be used to identify the hyperlinks within the page. Just as there are HTML tags for representing formatting directives, there are HTML tags called anchor tags for representing links, i.e., anchors, to other Web resources. When you point and click in an HTML document and a new document or other multimedia resource pops up, you are using HTML anchors and HTML’s hypertext capabilities.

The text The Open University of Hong Kong with a hypertext link to the OUHK’s homepage would be written like this:

<A HREF="http://www.ouhk.edu.hk">The Open University of Hong Kong</A>

When displayed by your Web browser, the anchor is highlighted and underlined. It would look like this:

The Open University of Hong Kong

We will discuss how a Web address, or Uniform Resource Locator (URL), is formed a little later in the section titled ‘URLs’.

Although you’ve now had a basic introduction to HTML, we’ve only covered a very small portion of its available tags. From its origins as a structural markup language, HTML has also grown to include formatting tags which describe how elements should appear (e.g., in what colours and sizes). If this is your first exposure to HTML, the following reading will give you a good foundation in the important tags that are needed to build webpages. The first item in the reading is a highly readable introduction to HTML. I’ve also provided a shorter item, item 2, that you can skim through if you want to refer to some basic guidelines for constructing a simple HTML document.

Reading 1.1

1 ‘A Beginner’s Guide to HTML’, section Markup Tags, http://www.put.com/HTMLPrimer.html#MT
2 ‘HTML for the conceptually challenged’, http://www.arachnoid.com/lutusp/html_tutor.html
When you have completed Reading 1.1 you should be able to construct an HTML document containing HTML tags for these elements:

1 bold text;
2 italicized text;
3 underlined text;
4 paragraph;
5 heading;
6 hypertext link or hyperlink;
7 inline image;
8 background graphic;
9 font colour; and
10 background colour.

Activity 1.1

The files needed for the following activity can be found on the OLE.

ABC Books has just decided to establish an online presence and they have given you a document containing the information they wish to appear on their homepage.

1 Please download the archive called abc_home.zip, which contains abc_home.doc and three graphic files.
2 Convert the information in the Word file into an HTML document called abc_home.html. Code this page by hand, using the tags you’ve learned in this section. The three graphic files are to be used in the HTML document.

Note: If you have difficulty with this or any of the other activities, please seek help on the MT834 discussion board, or contact your tutor.

URLs

In the previous section, you learned that HTML uses the anchor tag <A> to tell the Web browser where an information resource is located on the Internet. Uniform Resource Locators (URLs) are used within anchor tags to specify a unique online address for each resource, just as street addresses express the unique location of a place in our physical world.

From your travels on the Web you are probably familiar with the basic form of a URL, as follows:

http://hostname/path/filename.html

A Web URL includes the protocol name (http), followed by a colon and a double slash (://), followed by the Web server name or address, and then the location of the file on the server. If a filename is not specified, then the server will return the default homepage. Using the above example, you should recognize that the URL for the Open University of Hong Kong homepage would be http://www.ouhk.edu.hk.
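If you would like to see these URL parts separated by a program, the short Python sketch below may help. It relies on the standard urllib.parse module; the helper name describe_url is my own invention for illustration, and treating an empty path as "/" is an assumption that mirrors the "default homepage" rule described above.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a Web URL into the parts named in the text (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # the protocol name, e.g. "http"
        "host": parts.hostname,     # the Web server name or address
        "path": parts.path or "/",  # empty path: the server picks its default page
    }


if __name__ == "__main__":
    print(describe_url("http://www.ouhk.edu.hk"))
    print(describe_url("http://hostname/path/filename.html"))
```

Running the script prints one dictionary per URL, so you can compare the OUHK homepage address with the generic form shown earlier.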
The next reading gives you more details on how different kinds of URLs can be coded in HTML pages.

Reading 1.2

‘A Beginner’s Guide to HTML’, section Linking, http://www.put.com/HTMLPrimer.html#LI2

Now that you’ve got an idea of how webpages and hyperlinks are coded using HTML, let’s take a closer look at the software that is used to interpret and display HTML documents, namely, the Web browser.

Activity 1.2

This activity requires you to have two Web browsers installed on your machine so that you can contrast how each Web browser handles common browser functions. The browsers used for the screen shots in this activity are Internet Explorer 6 and Firefox. You may use any reasonably current version of these two browsers.

If you need to upgrade to a later version or you need to acquire new browsers, you can download:

• Firefox at http://www.mozilla.com/firefox/all
• Microsoft Internet Explorer at http://www.microsoft.com/windows/ie/downloads/default.mspx

Follow the instructions for downloading and installing a new version of each Web browser for your operating system.

Aside from displaying HTML documents, Web browsers also come with common menu-based functions. Functions are accessed via menu bars along the top of the browser window. An example of a common function would be configuring the browser to use certain fonts, text sizes or languages. Browsers from different vendors may label these menu options differently or place them in slightly different locations.

Let’s compare and contrast how Firefox and Internet Explorer handle some common functions.

1 Viewing or altering the configuration settings of the Web browser

• For Internet Explorer: Tools → Internet Options.
• For Firefox: Tools → Options.

Figure 1.4 Internet Options in Internet Explorer

You will see common configuration settings for the browser such as font colours and language preferences.
We will be altering some of these configuration settings in future units. Let’s change the default font colours.

Figure 1.5 Changing default colours in Internet Explorer

2 Viewing the properties of an image

Point each Web browser at the URL for the Open University of Hong Kong’s homepage at http://www.ouhk.edu.hk. Position your mouse over an image in the webpage, then right-click and select Properties.

Both browsers display the URL, the size and the dimensions of the image. IE also shows the creation date of the image. Here’s what the Properties window looks like after I right-click on an image from OUHK’s homepage.

Figure 1.6 The image Properties window

3 Viewing the source of the HTML document

Point each Web browser at the URL for the Open University of Hong Kong’s homepage at http://www.ouhk.edu.hk.

• For Internet Explorer: View → Source.
• For Firefox: View → Page Source.

You should be able to recognize some HTML tags as well as HTML hypertext anchor tags. You will also see a variety of HTML tags indicating some interactive technologies such as JavaScript or Java. Ignore these complex HTML tags and constructs for now; we will cover these concepts in later units.

Can you find the basic tags <HTML>, <HEAD>, <TITLE>, and <BODY> in the HTML source of the Open University of Hong Kong’s homepage?

MIME types

The HTML documents we’ve seen so far are all written and encoded using the ASCII or plaintext format. However, URLs not only point to HTML documents, but may also point to multimedia resources that are encoded in formats other than text. For example, images are usually in GIF, JPEG, or PNG format, while audio files might use MPEG, AU, or MP3.

It is impossible for Web browsers to have the built-in capability to render every media format that is available. So how does the browser know how to handle those formats for which it does not have a native rendering capability?
Multipurpose Internet Mail Extensions (MIME) types are the answer. MIME is an international standard that defines the rules for exchanging information that is not plain ASCII text. It lets the client know what type of file to expect and what software to use to interpret the file if the client cannot interpret this type of encoding itself. MIME also defines a standard set of names for the different data formats that could be transmitted over networks. The names come in two parts: the file type, followed by a slash (/), followed by a subtype.

The following table lists some of the better-known MIME file types and subtypes, together with helper applications that can be invoked by the client to display them.

Table 1.1 Better-known MIME file types and subtypes

MIME file type and subtype | File extensions | Description
application/msword | doc | Microsoft Word document
application/zip | zip | Compressed file that can be opened using PKZip, WinZip or other file compression software
image/jpeg | jpeg, jpg | JPEG image file, which can be opened natively by the browser and by graphics editors such as Adobe Photoshop, Macromedia Fireworks, etc.
image/gif | gif | GIF image file, which can be opened natively by the browser and by graphics editors such as Adobe Photoshop, Macromedia Fireworks, etc.
video/quicktime | mov, qt | QuickTime movie file
audio/midi | midi | LiveAudio

The Web server can be configured to send the correct MIME type to the Web browser along with the requested resource. The Web browser examines the MIME type and displays the resource if it has the native capability to do so. If not, it can be configured to launch the appropriate helper application that can handle this resource. This is illustrated in the next activity.

Activity 1.3

Let’s compare the approach used by Firefox and Internet Explorer in setting up and recognizing MIME types.
This activity also illustrates how different vendors implement the same features in their software in different ways, further reinforcing what you’ve seen in the previous activity.

1 Using both browsers, visit the OUHK website and click on the Students link. Then, choosing the Information tab, go to Prospectus and then select View online. This link will take you to the Adobe Acrobat document (PDF) that contains the University prospectus. How did both browsers handle this file type?

2 Now let’s see how Firefox and IE were configured to use Adobe Acrobat to render the PDF file.

For Firefox, the configuration screen can be accessed via these menu options: Tools → Options → Downloads → Plug-Ins.

Figure 1.7 Viewing MIME types in Firefox

For Internet Explorer, the MIME types are embedded in the Windows operating system. On XP, the MIME types can be found by clicking on: My Computer → Tools → Folder Options → File Types. The resulting display shows a list of all the registered MIME types on the system.

Figure 1.8 Viewing MIME types in Internet Explorer

HTTP

One of the most important concepts when learning how clients and servers communicate over a network is that of a protocol. Douglas Comer defines a protocol as ‘a formal description of message formats and the rules two or more machines must follow to exchange those messages’ (cited in Connected: An Internet Encyclopedia; Programmed Instruction Course, ‘Protocols’, http://freesoft.org/CIE/Course/Section1/3.htm).

Protocols usually exist in two forms. First, they exist in a textual form for humans to understand. Second, they exist as programming code for computers to execute.

Whenever a computer needs to send data to or receive data from another host, a protocol is needed to specify how every bit of every message should be written and interpreted. The protocol also describes how to handle error conditions. HyperText Transfer Protocol, or HTTP, is the protocol used by the World Wide Web service.
HTTP describes how Web clients make requests for information and how servers respond to these requests. Other services, such as email and FTP, also have their own protocol specifications.

HTTP is an openly published standard protocol, which allows browsers and servers written by different vendors to communicate with each other, as long as their software speaks and understands HTTP.

Like many Internet application protocols, HTTP uses the client-server model: an HTTP client opens a connection and sends a request message to an HTTP server; the server then returns a response message, usually containing the resource that was requested. In the simplest case (HTTP/1.0), the server closes the connection after delivering the response. HTTP is also a stateless protocol: the server does not retain any information about a client between transactions.

HTTP messages are text-based and can be made up of several lines. Figures 1.9 and 1.10 are sneak previews of what HTTP messages can look like. Figure 1.9 shows an HTTP request sent by a Web client. The first line describes what the client wants to do (e.g., get a document) and includes the location of the desired document. This can be followed by any number of optional headers. In figure 1.10, you see that a successful request returns a status code of 200 in the first line of the response. Other response headers may follow, and then the requested document itself.
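Because HTTP messages are plain text, their overall shape can also be explored with a few lines of Python. The sketch below is only an illustration and makes some assumptions: the helper names build_request and parse_response are hypothetical, the sample response is made up for the example, and no network connection is opened. It composes a request line with one header, then splits a response into its status code, headers and document body.

```python
def build_request(host, path="/index.html"):
    """Compose a minimal HTTP/1.0 GET request as text (illustrative helper)."""
    lines = [
        f"GET {path} HTTP/1.0",  # request line: method, document location, version
        f"Host: {host}",         # an optional header naming the server
        "",                      # empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")   # headers end at the first blank line
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


if __name__ == "__main__":
    print(build_request("www.ouhk.edu.hk"))
    # A made-up response, used only to exercise the parser.
    sample = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        "<html><body>Hello</body></html>"
    )
    code, headers, body = parse_response(sample)
    print(code, headers["content-type"], body)
```

The parser assumes a well-formed response with a reason phrase after the status code; a real client has to cope with many more cases, which is one reason browsers and libraries handle HTTP for you.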
GET /index.html HTTP/1.0
Host: www.ouhk.edu.hk
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Figure 1.9 Sample HTTP request sent by a Web client

HTTP/1.1 200 OK
Date: Fri, 10 Oct 2003 07:44:55 GMT
Server: Apache/1.3.22 (Unix) mod_jk/1.1.0 mod_ssl/2.8.5 OpenSSL/0.9.6b
Last-Modified: Wed, 18 Jun 2003 09:43:11 GMT
ETag: "250b81-5a62-3ef0342f"
Accept-Ranges: bytes
Content-Length: 23138
Connection: close
Content-Type: text/html

<html>
<head><title>Sample Response</title></head>
<body>Hello, world !</body>
</html>

Figure 1.10 Sample HTTP response sent by Web server

The rules for constructing valid HTTP messages will be explained in more detail in Unit 2. However, the next activity allows you to experiment with different requests and examine the HTTP messages that are generated.

Activity 1.4

In this activity, you view the HTTP messages exchanged by your Web browser and the Web server using the online HTTP viewer at http://www.rexswain.com/httpview.html.

1 Here are some pages to access: www.ouhk.edu.hk and www.yahoo.com.
2 For each HTTP request, identify where the URL is located.
3 For each HTTP response, identify where the status code and the HTML document are located.

You should now understand the basics of HTML, URLs, hypertext, and the multimedia nature of Web documents. Test your knowledge by completing the following self-test. Suggested answers are provided at the back of the unit.

Self-test 1.2

1 What three new technologies were created to build the Web? What does each of these technologies do?
2 What is hypertext? What are some of the characteristics of hypertext?
3 What is a URL? Give an example of a URL.
4 How do you represent a hypertext link in HTML?
5 How do you represent an inline image in HTML?
6 What HTML tags should always be included in an HTML document?

The Internet

The Internet is a network. More accurately, the Internet is a network made up of many connected and cooperating computer networks.
All these networks communicate using the same methods or protocols. These interconnected networks spanning the entire globe are called an internetwork, hence the term Internet.

You can visualize the Internet as a giant, global plumbing system, similar to the pipes that bring water to your home, except that it is used to carry digitized data. The Internet in itself is useless if there is no traffic travelling on it, and as you’ve seen previously, email and the Web are the most popular services which use the Internet as their transport medium.

The Internet was a product of the Cold War. It began as an effort to create a communication network that could withstand nuclear attacks. The idea behind the new network was that even if a section of it was destroyed, messages could still be delivered by redirecting them over sections that were still intact. None of the existing networks of the time could handle this requirement, including the telephone system, which was vulnerable to attacks on its switching stations.

Before the Internet, computers had to be directly connected to each other if they needed to communicate. Messages sent from host computer A to host computer B could only travel on a single, fixed route. There were no alternative paths that could be used in case this direct route was destroyed. Therefore, a new type of network was designed and built that could fulfill the requirements of a wartime network: one that we now call the Internet.

The following online reading outlines the history of the Internet and its unique design, which allows it to handle failure and re-route traffic around trouble spots.

Reading 1.3

Ruthfield, S (1995) ‘The Internet’s history and development: from wartime tool to the fish-cam’, ACM Crossroads, http://www.acm.org/crossroads/xrds2-1/inet-history.html.

Note: You do not have to remember the historical facts and organizations in the article such as RAND, MIT, UCLA, ARPANET and the Pentagon.
There are two important characteristics of the Internet which stand out in Reading 1.3.

First, the Internet is decentralized. There are no top-level computers or routes that can fail and stop the operation of the entire network. There can also be multiple, redundant routes between two computers, so if one route becomes unavailable, other routes can still be used.

Second, the Internet is a packet-switching network. Messages are broken into discrete units called ‘packets’, each of which contains the addresses of the source and destination hosts. These packets are routed separately through intermediate nodes, hopping from one to another until they reach their intended destination. If a packet fails to traverse the network via a particular node, it is simply resent, again and again if necessary, across another network path or ‘route’ until it reaches its destination. When all the packets have arrived at their destination, they are reassembled to form the original message. This design makes it possible for two computers to communicate with each other even if there is no direct, dedicated network cable running between them.

Figure 1.11 Packets travel from the source to the destination via intermediate gateways, or routers, in a packet-switching network such as the Internet (the diagram shows a WWW client and a WWW server linked through intermediate nodes labelled A to H)

Nobody owns or controls the Internet, although companies and even governments might be responsible for building and maintaining portions of it. Rather, any network that voluntarily implements the Internet’s standard protocols may participate in it. Many Internet providers not only adhere to these standards, but also open up their networks to data traffic from the general public. The voluntary interconnections and cooperation between these network providers make the global Internet possible.

Design and structure

The Internet is a network of networks. At the very lowest level these networks consist of a set of physical network hardware and low-level communication software.
This physical network layer is the lowest layer where a network connection takes place. Two physical nodes or computers on the network ‘connect’ to exchange messages in some form. A network connection can have a variety of physical forms, as shown in figure 1.12.

Figure 1.12 Physical connections to the Internet (the diagram shows connections via satellite and antenna, an Ethernet local area network, a cable modem, a 56K modem and telephone line, microwave and antenna, and an interactive set-top box on cable)

Data is transmitted over various carriers, such as telephone lines, cable TV wires, and satellite channels. When you dial in to your Internet Service Provider (ISP), your computer actually becomes a node on the ISP’s network, and from here, you gain access to the Internet.

Each network has its own low-level communication software (such as Ethernet, FDDI, X.25, IBM token ring, or ATM) so that the specialized network hardware components can ‘talk’ to one another. The Internet network software operates on top of these communication layers. The magic of the Internet is how these very different computer networks cooperate to form one internetwork: the Internet.

The next reading describes the networks and the interconnections between them which form the Internet’s infrastructure.

Reading 1.4

Howstuffworks, ‘How Internet infrastructure works’ (by Jeff Tyson), http://computer.howstuffworks.com/internet-infrastructure.htm.

Note: Please only read the first three sections: (1) ‘Introduction to how Internet infrastructure works’, (2) ‘A hierarchy of networks’, and (3) ‘Bridging the divide’.

Now complete the following self-test to assess your understanding of the Internet’s design and structure.

Self-test 1.3

1 Describe two design characteristics of the Internet that make it well suited for carrying wartime communications.
2 What is the relationship between the networks that make up the Internet?
Communication protocols

The Internet has the ability to provide a bridge between different computer operating systems and networks. This is why Tim Berners-Lee was interested in providing the World Wide Web service over the Internet.

It is relatively easy to add a new network to the Internet. The communication protocols used are built on standards that are open and publicly available. Owners of diverse physical network types or different computer operating systems can join the Internet simply by implementing or purchasing the appropriate protocols.

The task of transmitting data across the Internet is divided up among several network protocols. Protocol layering is a common technique for simplifying networking designs. The work is divided into functional layers, and separate protocols are assigned to perform each layer’s job. This approach leads to a set of simple protocols, each taking care of a few well-defined tasks. A layer communicates with the layer above and below it, but it is not aware of layers which are not directly adjacent to it.

The key protocols for the Internet are IP (Internet Protocol) and TCP (Transmission or Transport Control Protocol), with IP operating on the layer below TCP. The next reading describes how the Internet’s network model is organized as four layers of protocols.

Reading 1.5
Connected: An Internet Encyclopedia; Programmed Instruction Course, ‘DoD networking model’, http://freesoft.org/CIE/Course/Section1/5.htm.
Note: A link at the bottom of this reading takes you to a discussion on the topic of encapsulation. We refer to encapsulation a little later, so please take a quick look at this link also.

In the previous reading, the topmost layer of the networking model is where Internet services such as telnet, FTP, WWW and email (SMTP) operate. The lowest layer is responsible for passing data packets on to the physical network cabling media.
The popular protocols at the network access layer are PPP (Point-to-Point Protocol) for Internet connections over regular phone lines and the Ethernet protocol over Ethernet-based local area networks.

IP is concerned with the routing of data between sender and recipient. It does this by attaching a source and destination address to each packet, like an envelope.

TCP relies on IP to handle the details of getting data from one place to another. On top of this, TCP provides mechanisms for establishing connections between host computers, ensuring that data arrives in the correct sequence, and retransmitting packets that are not received correctly or promptly.

Depending on the literature, different names may be given to the layers in the network model shown in Reading 1.5. To avoid confusion, I’d like to present the model here again, along with other names that are commonly used for the various layers.

    Process or Application layer
    Host-to-host or Transport layer
    Internet layer
    Network access or Physical layer

Figure 1.13 Four-layer Internet network model

The way that the application, transport, Internet and physical layers work together is called encapsulation. The application produces some data, adds a header to it and hands off the result to the transport layer. The transport layer adds another header, and hands the result off to the Internet layer. It’s like putting a letter in an envelope, then putting that envelope in a bigger envelope, and so on. On the receiving end, the network software unpacks the envelopes one layer at a time until the original data is handed to the receiving application.

Now that you have a high-level idea of how TCP and IP are related to each other, let’s look at the details of how they work together.

Internet Protocol (IP)

IP manages the transfer of data across physically diverse networks. IP transfers data in pieces, called packets or datagrams.
Each packet is encapsulated within an envelope of data describing where the packet came from and where it wants to go. Packets are transferred from one network to another according to the rules of the IP protocol, until they arrive at their destination. Networks are fallible, though, so some packets may be lost, delayed, or garbled along the way.

Back in Reading 1.4 ‘How Internet infrastructure works’, you learned that routers are special gateway hosts which join different networks together. Routers are like traffic cops who stand at intersections on the Internet highway and decide whether a packet is intended for a host within their own network or needs to be routed to a different network. If the packet needs to be forwarded to another network, the router uses its own routing protocols to determine where to send it next.

IP specifies the formatting used to create packets and the addressing scheme which gives every computer on the Internet a unique address. A detailed discussion of the formatting specification of IP datagrams is beyond the scope of this course, but the next figure illustrates what an IP packet may look like once it is broken down into its component fields.

    VERSION | HEADER LENGTH | SERVICE TYPE | TOTAL LENGTH
    IDENTIFICATION | FLAGS | FRAGMENT OFFSET
    TIME TO LIVE | PROTOCOL | HEADER CHECKSUM
    SOURCE IP ADDRESS
    DESTINATION IP ADDRESS
    IP OPTIONS (IF ANY) | PADDING
    DATA / PAYLOAD ...

Figure 1.14 Format of an IP datagram, containing the header and data area
Source: Comer 1991, 92.

Some of the key fields in the IP packet are:
• Time-to-live (TTL): limits the number of routers that a packet may go through before reaching its destination. This prevents IP packets from traveling on the Internet forever.
• Protocol: lets the networking layer know what kind of transport layer protocol is in the data segment of the IP packet. Common transport layer protocols which use IP are TCP and User Datagram Protocol (UDP).
• Source and destination IP addresses: the IP addresses of the source and destination machines.
• Data/Payload: contains the data which needs to be transmitted to another computer. This data is passed down to IP by a transport protocol such as TCP or UDP, as indicated by the protocol field.

Activity 1.5
The Traceroute utility allows you to follow the path, or route, traveled by a packet through the network. We will use this command to view the number of hops that are traveled by a particular packet over the Internet.

1 If you are on a Windows machine, go to the MS-DOS command prompt and type the following command:

    tracert www.google.com

If you are on UNIX, type the following command:

    traceroute www.google.com

The output will display the intermediate routers that your packet has to go through in order to arrive at its destination (www.google.com). It also records the round-trip travel time for each router. My packet had to travel more than 15 hops to reach Google.

2 Now let’s view the route to a Hong Kong-based server, namely, OUHK. If you are on a Windows system, type this command:

    tracert www.ouhk.edu.hk

If you are on UNIX, type this command:

    traceroute www.ouhk.edu.hk

Here’s the output of traceroute against OUHK’s Web server.

    C:\>tracert www.ouhk.edu.hk

    Tracing route to sun17a.ouhk.edu.hk [202.40.157.186]
    over a maximum of 30 hops:

      1    30 ms    20 ms    20 ms  203.99.136.128
      2    10 ms    20 ms    20 ms  shkdtswh01r1.so-net.com.hk [203.99.143.65]
      3    20 ms    20 ms    20 ms  203.99.143.161
      4    20 ms    20 ms    20 ms  agc2-RGE.hkix.net [202.40.161.189]
      5    20 ms    20 ms    40 ms  fe1-0-100M.ar2.HKG1.gblx.net [203.192.134.162]
      6    20 ms    20 ms    30 ms  ip-203.192.137.234.gblx.net [203.192.137.234]
      7    20 ms    30 ms    20 ms  sun25.ouhk.edu.hk [202.40.157.7]
      8    20 ms    20 ms    30 ms  sun17a.ouhk.edu.hk [202.40.157.186]

    Trace complete.
Figure 1.15 Output of traceroute against OUHK’s Web server

My packets had to go through far fewer hops (only eight) to reach OUHK than to reach Google. This helps explain why a website should be hosted near its primary audience for better download times.

Traceroute is very useful for debugging problems within a network. If you are unable to reach a destination server or if response time is slow, you can use this utility to pinpoint problem areas and slow links.

IP addressing

We have learned that IP defines a system for assigning unique addresses to all devices connected to the Internet. This is analogous to the system used by Hong Kong’s postal service to locate residences and businesses through street names and numbers. The next reading shows you how IP addresses are formed.

Reading 1.6
Webopedia, ‘Understanding IP addressing’, http://www.webopedia.com/DidYouKnow/Internet/2002/IPaddressing.asp.

The reading describes how IP addresses are organized into two parts. The first part identifies the network and the last part identifies a specific host computer on that network. When data is routed on the Internet, the network portion of the IP address is used to locate the correct network. Once the data has arrived at the local network, the host portion of the IP address is used to identify the computer within this network for which the data is intended.

Let’s apply this concept to an example to see how IP addresses and routing work together. Consider this IP address:

    202.40.157.163

Reading this address from left to right, we move from a larger, more general area of the network (network 202) to a more specific individual host on a smaller network (163). Imagine this Internet address belongs to a host on the Open University of Hong Kong’s local area network.
A simplistic way of interpreting this address is:
• 202 is a network that covers all of Asia;
• 40 is a network for the city of Hong Kong;
• 157 is the network containing all the computers for The Open University of Hong Kong; and
• 163 is the individual host identifier of the computer on The Open University of Hong Kong’s network.

In terms of routing packets, the IP layer on the Asia network (202) only needs to know how to send packets to the city of Hong Kong network (40). It does not need to know anything about network 157 or host 163. The city of Hong Kong network (40) only needs to know how to route packets to the 157 network, and the 157 network only needs to know how to route packets to host number 163. The way that the networks are organized as a hierarchy limits the amount of knowledge that any one routing node must have about the entire system of networks.

There is a special IP address that we will be using in future exercises: 127.0.0.1. Network 127 is a specially designated network that is not owned by any official organization. Individual computer hosts assume ownership of the 127 network address to manage their network resources. 127.0.0.1 is the loopback address, a special address that computer hosts use to direct TCP/IP traffic back to themselves. The loopback address is useful for debugging and testing Internet services and we will use it in setting up our own Internet services.

Activity 1.6
In this activity, you will view the current IP address assigned to your machine. You must connect to the Internet first so that your computer becomes a host on the network.

Depending on your Internet connection, your ISP may assign you a static IP address or a dynamically assigned IP address. A static address means your machine will always be assigned the same IP address, while a dynamically assigned address will be chosen from the pool of available addresses at the time.
1 If you are working on a Windows machine, you can view your current IP address by typing IPCONFIG on the command line. If this does not work on your system, try WINIPCFG.
2 If you are on a UNIX machine, type ifconfig (on some systems, /sbin/ifconfig) to display the IP address of the machine.

Domain names

IP addresses such as 202.40.157.163 are difficult to memorize and discuss, unless you’re a computer ☺. An alternative way of addressing hosts, using alphabetic or word-based names called domain names, was therefore created. Examples of popular domains include amazon.com, yahoo.com, and google.com. OUHK’s Web server corresponds to the domain name ouhk.edu.hk.

Domain names are not just easier to remember; they also allow the underlying IP addresses to be changed without affecting the name by which the outside world knows them. You can think of the domain name as the pseudonym or alias; the real name of the host computer is still its IP address.

It is up to the Domain Name Service to translate a given domain name to its actual IP address. The next reading describes how this translation process is accomplished.

Reading 1.7
Howstuffworks, ‘How Internet infrastructure works’ (by Jeff Tyson), http://computer.howstuffworks.com/internet-infrastructure4.htm.
Note: Please only read the fifth section, ‘What’s in a name?’

Let’s conduct a few experiments to see how DNS maps domain names to Internet addresses.

Activity 1.7
In this activity, we will use the nslookup utility to translate domain names to their corresponding IP addresses. The same command can be used on both Windows and UNIX systems.

1 In the command line, type nslookup www.yahoo.com. Here’s the output on my system.
    C:\>nslookup www.yahoo.com
    Server:  ns1.so-net.com.hk
    Address:  203.99.142.8

    Non-authoritative answer:
    Name:    www.yahoo.akadns.net
    Addresses:  66.218.70.49, 66.218.71.86, 66.218.71.90, 66.218.71.95,
              66.218.71.80, 66.218.71.92, 66.218.71.91, 66.218.70.48
    Aliases:  www.yahoo.com

Figure 1.16 Output of nslookup www.yahoo.com

It turns out that there are multiple IP addresses which correspond to the domain name www.yahoo.com. This means that Yahoo uses more than one Web server to handle the high volume of incoming traffic on its website.

The nslookup command also returns the name and IP address of the domain name server (DNS server) that was used to do the translation. Can you tell what my DNS server is from the above display?

2 Instead of using the domain name www.yahoo.com in the URL, enter two of the numeric IP addresses above into your Web browser to access Yahoo’s website (e.g., http://66.218.71.80 and http://66.218.70.48). Both URLs should display Yahoo’s homepage, which shows that both addresses correspond to the same domain name.

3 Now find out the IP address for www.kfbg.org.hk (the website of Kadoorie Farm and Botanic Garden). In this case, the domain name maps to a single numeric address.

Ports

Ports are numbers assigned to software applications or services running on a computer. For example, a computer at an Internet Service Provider (ISP) might be running a Web server, an email (SMTP) server and an FTP server. How does a Web browser on a client PC say, ‘I want to speak to the Web server’, and an email client say, ‘I want to speak to the email server’?

In order to identify the desired service, the client also needs to know which port number on the remote server has been assigned to that service. Table 1.2 shows the list of established port numbers for common services.
For example, if a server machine is running a Web server and a File Transfer Protocol (FTP) server, the Web server would typically be available on port 80, and the FTP server would be available on port 21. Clients connect to a service at a specific IP address and on a specific port number. Once a client has connected to a service on a particular port, it accesses the service using the protocol for that service.

Table 1.2 Well-known port number assignments

    Protocol   Port     Description
    echo       7        Allows one machine to ‘echo’ back the input received from another machine
    FTP        20, 21   Allows files to be exchanged between machines
    telnet     23       Used to log in to remote machines
    SMTP       25       Sends email between machines
    HTTP       80       Used by Web browsers and servers
    POP3       110      Transfers emails stored on a host machine to a client machine

Some of these Internet services are described further in the next reading.

Reading 1.8
Web Developer’s Virtual Library, ‘Internet protocols’ (by Alan Richmond), http://www.wdvl.com/Internet/Protocols/.

As you can see, the Web is only one of many services that run over the Internet!

Transmission Control Protocol (TCP)

IP provides a best-effort service which strives to deliver packets to their destination but does not guarantee that the delivery will be successful. For an application such as the World Wide Web, IP’s best-effort level of service is not enough. Transmission (or Transport) Control Protocol (TCP) is needed in order to manage and provide a reliable connection between two computers.

You can visualize a TCP connection between two hosts as a pipeline with two endpoints. TCP segments or packets are put in one end and come out of the other end. Data can be exchanged in both directions at the same time.

[Figure: on one host an application writes data through a socket ‘door’ into the TCP send buffer; segments travel to the TCP receive buffer on the other host, where an application reads the data through its own socket ‘door’]
Figure 1.17 TCP is a connection-oriented protocol
Source: Kurose and Ross 2000.
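The pipeline picture of a TCP connection, and the idea that a client reaches a service through an IP address plus a port number, can be tried out with a short Python sketch. The sketch assumes a small made-up service that replies with the upper-case form of whatever it receives; it runs on the loopback address 127.0.0.1 and uses port 0 so that the operating system picks any free port, rather than one of the well-known ports in Table 1.2. The function name `echo_upper_once` is invented for this example.

```python
import socket
import threading


def echo_upper_once(server_sock):
    """Accept one connection, read a message and reply with it in upper case."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


# Bind to the loopback address; port 0 asks the operating system for a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

worker = threading.Thread(target=echo_upper_once, args=(server,))
worker.start()

# The client identifies the service by IP address and port number.
with socket.create_connection((host, port), timeout=5) as client:
    client.sendall(b"hello over tcp")   # data flows from client to server ...
    reply = client.recv(1024)           # ... and back again on the same connection

worker.join()
server.close()
print(reply)
```

Both ends read and write through the same connection, which is the two-way pipeline described above. The acknowledgements and retransmissions that make the connection reliable are handled by the operating system’s TCP software and are not visible in the program.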
TCP runs on top of IP and ensures that all IP packets which make up the same message are transmitted safely, completely and correctly to their destinations. TCP waits for the recipient to send back an acknowledgement message for each packet that has been sent before it sends out the next group of packets. If the recipient does not acknowledge receipt within a designated amount of time, the sending TCP will resend the packet. The recipient may also send back an acknowledgement message which asks the sender to retransmit the packet if data has been corrupted or was not received in the correct sequence.

The TCP protocol only runs in the source and destination host computers. Intermediate network elements such as routers and bridges merely forward IP packets without knowing which ones belong to the same message or are part of the same ‘connection’. This is protocol layering at work.

Having taken a brief look at the TCP connection, let’s examine the TCP segment structure. Similar to IP, the TCP segment consists of header fields and a data field.

    SOURCE PORT | DESTINATION PORT
    SEQUENCE NUMBER
    ACKNOWLEDGEMENT NUMBER
    HLEN | RESERVED | CODE BITS | WINDOW
    CHECKSUM | URGENT POINTER
    OPTIONS (IF ANY) | PADDING
    DATA / PAYLOAD ...

Figure 1.18 Format of a TCP segment with a TCP header followed by data
Source: Comer 1991, 183.

Some of the key fields are:
• Source and destination port numbers: used by the two communicating hosts.
• Sequence number: shows the position of this data segment within a group of segments. TCP breaks up a message into packets and then labels each packet with a sequence number. This allows the two communicating hosts to know which packets have been received and which have not. It is also used to determine the order in which packets should be reassembled.
• Acknowledgement number: returned to the sender to acknowledge receipt of a particular packet.
It also informs the sender of the sequence number of the next byte that the recipient expects from the sender.
• Checksum: used to verify that the data was not corrupted during the transmission.

At this point you should understand TCP/IP, IP addresses, ports and domain names. This should give you a basic understanding of how the Web depends on the Internet’s protocols, infrastructure and services. Do the following self-test to check your knowledge, and then check your answers against those at the end of the unit.

Self-test 1.4
1 What is a protocol? Why are protocols important to the Internet?
2 Name three Internet services that use TCP/IP and describe what each service does.
3 What is the role of IP within the TCP/IP suite of protocols?
4 What is the role of TCP within the TCP/IP suite of protocols?
5 What are some of the things that can go wrong when IP packets are transmitted over a network?

Web servers

We have seen that the Internet is built on open, public standards such as TCP/IP and that this openness permits a range of diverse networks to easily join the Internet. The Web follows the Internet tradition and is also built on an open communication standard: the HyperText Transfer Protocol (HTTP). Any program that implements the HTTP protocol can participate in the World Wide Web. The standards and protocols are general enough that Web browsers and Web servers are implemented on a wide variety of computers and written in a wide variety of computing languages: C, C++, Java, etc.

This section presents a conceptual overview of how browsers and servers communicate with each other, and how they perform their functions by making use of lower-level services such as TCP/IP and DNS.

Role of browsers and servers

A browser is an HTTP client because it sends requests to an HTTP server (Web server), which then sends responses back to the client.
The standard (and default) port on which HTTP servers listen for incoming requests is port 80.

[Figure: your machine running a Web browser connects to a server machine running a Web server and requests a page; the server sends back the requested page]
Figure 1.19 Web browser requesting documents from a Web server
Source: http://computer.howstuffworks.com.

Please note that there can be HTTP clients that are not Web browsers. Programs written in a language such as Java or C can also issue HTTP requests to a Web server, thereby acting as an HTTP client. For the purposes of MT834, however, we will deal mostly with Web browsers acting as Web clients.

Here are the basic steps that take place behind the scenes in order to satisfy a Web request:
1 The browser gets the server name and the filename (including the path) of the requested resource from the URL.
2 The browser asks a Domain Name Server to translate the server name www.ouhk.edu.hk into an IP address, which uniquely identifies the Web server on the Internet.
3 The browser connects to the server on port 80.
4 The browser sends a request to the server which is written according to the HTTP specification. The request will ask for the file http://www.ouhk.edu.hk/index.html.
5 The server sends the HTML text for the webpage to the browser using the HTTP protocol as well.
6 The browser interprets the HTML tags and displays the page on your screen.

Steps 2 to 5 require the use of TCP/IP’s services. Let’s tie everything together and see how these protocols are used in a typical Web surfing session. The protocols are shown using the Internet four-layer model.

    Host A                                            Host B
    Application layer (HTTP client)                   Application layer (HTTP server)
    Transport layer (TCP)                             Transport layer (TCP)
    Internet layer (IP)                               Internet layer (IP)
    Physical layer (Ethernet, FDDI, LocalTalk, etc.)  Physical layer (Ethernet, FDDI, LocalTalk, etc.)
    Cabling medium (twisted pair, fibre optic)

Figure 1.20 Web clients and servers use lower-level services in order to communicate over the network

Installing a local Web server

Let’s now demonstrate how the HTTP communication protocol works on your local computer by setting up a Web server and serving an HTML page to your Web browser.

In the next activity you will set up the Apache Web server, which is a cross-platform Web server (it compiles and runs on a variety of UNIX and Windows operating systems). However, Apache was originally written to run on UNIX servers, and new versions of Apache stabilize much faster on UNIX than on Windows. For the sake of reliability and security, it is recommended to run Apache on the UNIX operating system.

The activities and practical work for the rest of this unit will assume that you are using the Apache Web server on the Linux operating system. Linux is an operating system that was developed under the GNU General Public License and its source code is freely available to everyone. Linux is often considered an excellent, low-cost alternative to other more expensive operating systems.

Since there is no single company that controls Linux, several organizations and individuals have developed their own ‘versions’ of the Linux operating system, known as distributions. A Linux distribution is based on Linus Torvalds’s Linux kernel, which contains the core functions of the operating system.

We will assume that most of you are going to install Fedora Core 4 on a machine with an Intel-based CPU. You can buy Fedora from a computer store or a book shop, or you can download the software from many sites. One of them is an ftp site:

    ftp://ftp.cuhk.edu.hk/.1/Linux/distributions/fedora/core/4/i386/iso/

Note that we assume that you are using a 32-bit Intel processor. If not, you need to go to the directory for your CPU. Then, you need to download FC4-i386-discX.iso where X is 1 to 4.
If you have a DVD writer, you can just download the file FC4-i386-DVD.iso. Then use these .iso files to burn four CDs or one DVD. Note that these .iso files are CD images or DVD images, which means that you should use ‘burn image’ to burn them. If you are using Nero, the burn image option can be found at ‘Recorder’ → ‘burn image’.

Before you install Fedora, you need to think about where to install it. Linux can be installed on a dedicated system or in a dual-boot configuration, where it co-exists with another operating system such as Windows on the same machine. If you are planning to install Linux in a dual-boot configuration, make sure you do a complete backup of your existing system. There is always a possibility that you may lose all the data contained on your drive when you work with the hard disk partition table. Because of this, it’s usually recommended for beginners to install Linux on a dedicated machine.

Therefore, the best solution is to install it on a new computer. The second best is to install it on a new hard disk. If these options are not available to you, the next best solution is to find a partition in your hard disk to install it. Of course, the worst situation is that you cannot find anywhere in your computer to install it. If you can afford it and your computer has space, buy a hard disk to install Fedora.

The following reading is an installation guide for Fedora.

Reading 1.9
Fedora Core 4 Installation Guide, http://fedora.redhat.com/docs/fedora-install-guide-en/fc4/.

The disks with FC4-i386-disc1.iso or FC4-i386-DVD.iso are bootable. You should configure your PC to boot from the CD or DVD drive. After booting the disk, you can start the installation process. If you have enough disk space, I suggest you install all components of Fedora.

If you’re planning to connect your Linux box to the Internet, here are the different ways to do so:
• LAN;
• dialup; and
• broadband.
If you are using broadband, you need to know which technology is used by your ISP. There are two main technologies in use, namely PPPoE and DHCP. If you require a password to get connected, you are most likely dealing with PPPoE. Otherwise it will be DHCP. If you use a router to connect to your ISP, then you just need to connect the Linux box to the router.

The following tells you how to configure the network connection.

For Fedora: Select RedHat (small icon on the left bottom corner) → System settings → Network. You will be prompted for root’s password. If you are using LAN, then you should double click on the network connection to configure the IP address and the gateway. If you are using a dialup connection or PPPoE, you should click on the New button and then select modem connection or xDSL connection respectively. Then follow the instructions to complete the configuration. If you are using DHCP or a router, you should double click on the Ethernet connection and then configure the connection to use DHCP.

Apache should have been installed if you installed all components of Fedora. Browse the pages on it through the loopback address, 127.0.0.1. As you’ll recall from the section on IP addressing, this is a specially reserved address which directs TCP/IP traffic back to the local machine. The loopback address corresponds to the name alias localhost. Entering either http://127.0.0.1 or http://localhost in your browser will display the homepage of your local Web server.

If your machine is currently on the Internet, your ISP may also have dynamically assigned an IP address to your home computer. You cannot depend on this address to be the same every time you connect. The loopback address is the only fixed IP address that your Web server can rely on to be available session after session, and this is what we’ll use to test it.
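To see what happens when a browser talks to a Web server over the loopback address, here is a rough Python sketch that plays both roles on one machine. It rests on several assumptions: Python’s built-in http.server module stands in for Apache, the page content and the class name `OnePageHandler` are made up for the example, and the server listens on a free port chosen by the operating system rather than port 80, which normally requires administrator rights.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up page content, used only for this demonstration.
PAGE = b"<HTML><BODY>Hello from the local Web server</BODY></HTML>"


class OnePageHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet


# Listen on the loopback address; port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), OnePageHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Act as a very small browser: connect, send an HTTP request, read the reply.
request = b"GET /index.html HTTP/1.0\r\nHost: 127.0.0.1\r\n\r\n"
with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
    sock.sendall(request)
    response = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:          # the server closes the connection when it is done
            break
        response += chunk

server.shutdown()
server.server_close()

headers, _, body = response.partition(b"\r\n\r\n")
status_line = headers.split(b"\r\n")[0].decode("ascii")
print(status_line)
print(body.decode("ascii"))
```

The request never leaves the machine: the client connects to 127.0.0.1, the TCP/IP software loops the traffic back, and the server’s reply arrives over the same connection. A real browser would go on to interpret the HTML in the body and display the page.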
HTML documents served by your localhost Web server can only be retrieved by Web browsers running on the same machine. Web browsers on the Internet will not be able to communicate with your localhost Web server since they do not know your IP address.

If your computer is on a local area network and has a fixed IP address, you will use that IP address to configure your Web server and then the documents you place on it will become available to the entire LAN (and possibly, depending on your network set-up, the Internet).

If you are not familiar with Linux, the following website contains some tutorials.

Reading 1.10
Lancom Technologies, ‘Hello Linux!’, http://www.lancom-tech.com/hello-linux-crts.html.

Activity 1.8
1 Copy the homepage you created for ABC Books in Activity 1.1, called abc_home.html, into $(Apache_rootdir)/html where $(Apache_rootdir) is /var/www. You should also copy all necessary images into a separate folder, such as $(Apache_rootdir)/html/images. Ensure that the HTML document refers to the images with the correct pathname.
2 Start the Apache server by typing the following on the command line:

    /etc/rc.d/init.d/httpd start

You can stop or restart the server by replacing the word ‘start’ with ‘stop’ or ‘restart’ respectively.
3 Type the URL http://127.0.0.1/abc_home.html into your Web browser and see abc_home.html being served to your Web browser over the loopback network.
4 Type the URL http://localhost/abc_home.html into your Web browser and retrieve the document. This demonstrates that 127.0.0.1 is the Internet address that maps to the localhost computer name.

From this demonstration you should understand how the Web client and Web server communicate over the network when exchanging an HTML document. Do the following self-test to check your understanding of the Web’s system architecture.

Self-test 1.5
Describe the steps that take place so that your requested document can be fetched and displayed on your computer.
Summary

This unit has given you an overview of the components of the Web and answered the question ‘What is the Web?’ You have seen that the Web is one of the most popular applications using the Internet as its transmission medium. The three technologies that make up the Web are:
• the Web’s document language: the HyperText Markup Language (HTML);
• the Web’s system for addressing and locating documents: Uniform Resource Locators (URLs); and
• the communication language between the Web client and Web server: the HyperText Transfer Protocol (HTTP).

In this first unit we explored HTML and URLs and you created your own HTML document. HTTP will be discussed in more detail in Unit 2 Web servers and HTTP.

We also discussed the Internet: its characteristics, design features, and how it is related to the Web. The Internet is the underlying network that carries Web traffic consisting of HTTP request and response messages. This network uses a unique design meant to fulfill the requirements of a wartime network.

We also examined the TCP/IP network protocol used to transmit data over the Internet. Essentially, HTTP messages are broken down into packets, routed individually over the network, and then reassembled at their destination.

To appreciate how the Internet, Web browsers and Web servers interact, you installed and configured an Apache Web server to serve your HTML document.

The next unit in this course examines Web server software in detail and looks at the step-by-step mechanism by which a Web browser ‘talks’ to a Web server according to the HyperText Transfer Protocol (HTTP).

Suggested answers to self-tests

Self-test 1.1
1 In the client-server model, an end-user relies on a program (the client) to communicate with an application residing on a remote machine (the server), in order to retrieve the requested resource. The Web follows this model, splitting the system’s functions between two tiers: the client and the server.
From the client’s point of view, all that’s needed to get on the Web is a Web browser and an Internet connection. From the server’s point of view, what’s needed is a machine that is connected to the Internet, runs Web server software and hosts the required documents.

2 The Web uses the Internet as its underlying network medium. Data exchanged between Web clients and servers travel over the Internet. However, the Internet and the Web are not one and the same. Many other services aside from the Web run over the Internet.

Self-test 1.2

1 Three new technologies were created to build the Web:

• The HyperText Markup Language (HTML) defines how webpages and hypertext links are written.
• Uniform Resource Locators (URLs) define the Web’s system for addressing and locating documents. Hypertext links contain URLs.
• The HyperText Transfer Protocol (HTTP) is the communication language or protocol between the Web client (the Web browser) and the Web server. It describes how clients make requests for information and how servers respond to them.

2 Hypertext is a collection of documents that are associated through active links. When a user chooses a link, that link is followed, and the document the link points to is fetched and displayed. Hypertext is a non-linear text system that creates an ‘information space’.

3 A URL is the unique address of an Internet resource. For example, http://www.lycos.com is the URL for the Lycos search engine and ftp://ftp.ncsa.uiuc.edu is the URL for NCSA’s FTP site.

4 Hypertext links are represented as anchor tags in HTML.
An anchor tag takes this form in HTML:

<A HREF="http://www.ouhk.edu.hk">The Open University of Hong Kong</A>

5 An inline image takes this form in an HTML document:

<IMG SRC="picture.gif">

6 These HTML tags should always be in an HTML document:

<HTML></HTML>
<HEAD><TITLE></TITLE></HEAD>
<BODY></BODY>

Self-test 1.3

1 Two design characteristics help the Internet withstand catastrophic attacks in wartime. First, it is decentralized. Computers can communicate directly with each other without having to go through a central node. This ensures that communication can still take place even if some machines are destroyed. Second, the Internet is a packet-switching network which offers multiple, redundant routes between two endpoints. This ensures that packets can still be routed between two hosts even if certain sections of the network are rendered inoperable.

2 The Internet is a global network that is made up of many smaller networks. These smaller networks connect with each other to form bigger and higher-level networks. There is no overall controlling network. Instead, these higher-level networks connect to each other through Network Access Points (NAPs). The Internet is a collection of huge networks which implement the same protocols and agree to route data traffic to each other at these Network Access Points.

Self-test 1.4

1 A protocol is a set of conventions or rules specifying how each party should communicate. The details of Internet protocols are published as open, public-domain standards. A network joins the Internet if it follows the communication rules specified in the TCP/IP protocol. An individual host computer can join the Web if it follows the communication rules specified by HTTP.

2 These are a few Internet services that use TCP/IP:

• Domain Name Service (DNS) is an Internet service which maps Internet names to Internet addresses. Given an Internet domain name, DNS will return an Internet address.
Given an Internet address, DNS will return an Internet domain name.
• File Transfer Protocol (FTP) is an Internet service which allows users to copy files from one computer to another across the Internet.
• Telnet is an Internet service which enables a person to set up a connection, log in, and conduct an interactive session with a remote computer on the Internet.
• The World Wide Web is a hypertext-based Internet service. The WWW uses URLs as its addressing system and HTTP as its communication protocol.

3 Internet Protocol (IP) handles the addressing and coordinates the routing of packets across multiple Internet nodes.

4 TCP establishes a connection from one point on the Internet to another point on the Internet. Once the connection is established, TCP is responsible for breaking the data up into packets and ensuring the reliable transfer of the packets over the network. TCP is responsible for detecting and correcting errors in the data transfer process. IP routes the packets across the nodes of the Internet. IP is the packet mover of the Internet.

5 Packets can be lost or destroyed when network hardware fails, or when networks become too congested with traffic. Even when packets make it to their destination, they may be delivered out of order, after a long delay, or in duplicate. This is why a reliable delivery service such as TCP is needed on top of an unreliable, connectionless packet delivery service such as IP.

Self-test 1.5

1 The browser extracts the server name and the file name (including the path) from the URL.

2 The browser asks a Domain Name Server to translate the server name into a corresponding IP address. If the URL includes the numeric IP address instead of the server name, this step is not needed.

3 The browser establishes a connection to the server at its IP address on port 80.

4 The browser sends an HTTP request to the server. The request includes the path name and file name of the requested resource.
5 The server sends the HTML text for the webpage to the browser within an HTTP response message.

6 The browser receives the HTTP response over the network. It interprets the HTML tags and displays the page on the screen.

References

Andrews, J, Cutura, T, Hudson, K and Spivey, L A (2000) I-Net+ Guide to Internet Technologies, Boston: Course Technology.

Comer, D E (1991) Internetworking with TCP/IP: Volume I; Principles, Protocols, and Architecture, Upper Saddle River, NJ: Prentice Hall.

Connected: An Internet Encyclopedia; Programmed Instruction Course, http://freesoft.org/CIE/Course/index.htm.

Kurose, J F and Ross, K W (2000) ‘3.5 connection-oriented transport: TCP’, http://www-net.cs.umass.edu/kurose/transport/segment.html.

Wainwright, P (2002) Professional Apache 2.0, Wrox Press Ltd.

Web Developer’s Virtual Library, http://www.wdvl.com.

Yeager, N and McGrath, R E (1996) Web Server Technology: The Advanced Guide for World Wide Web Information Providers, San Francisco: Morgan Kaufmann.
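
The browser steps listed in Self-test 1.5 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not part of the course materials: the function names split_url, build_request and fetch are hypothetical, and HTTP/1.0 is assumed so that the server closes the connection once the response is complete. A real browser does far more, such as following redirects, caching, and rendering the HTML.

```python
import socket
from urllib.parse import urlsplit


def split_url(url):
    """Step 1: extract the server name, port and path from a URL."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # HTTP uses port 80 unless the URL says otherwise
    path = parts.path or "/"         # an empty path means the server's root document
    if parts.query:
        path += "?" + parts.query
    return host, port, path


def build_request(host, path):
    """Step 4: build a minimal HTTP/1.0 GET request (an assumed, simplified form)."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"


def fetch(url, timeout=10.0):
    """Steps 2 to 5: resolve the name, connect, send the request, read the response."""
    host, port, path = split_url(url)
    ip_address = socket.gethostbyname(host)                    # step 2: DNS lookup
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:  # step 3
        conn.sendall(build_request(host, path).encode("ascii"))  # step 4
        chunks = []
        while True:                                            # step 5: read until the server closes
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)                                    # raw response: headers plus HTML
```

For example, split_url("http://localhost:8080/index.html") yields the host localhost, port 8080 and path /index.html. Calling fetch on a URL served by your local Apache installation should return the raw HTTP response, headers included; step 6 (interpreting the HTML tags and displaying the page) is left to the browser.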
