COMPSCI 111 S1C - Lecture 8
New connections - New Media
(An introduction to data communications, networks and the Internet)

Introduction

Data communications began in 1940, when Dr George Stibitz sent data through the telephone system from Dartmouth College to New York. The value of transmitting data from one machine to another was quickly recognised, and data communications became a popular area of research among the military, business and academic communities. The importance of data communications continues to attract funding and research to this day.

Data transmission mediums

The path through which data is sent is known as a channel. Many different mediums form data channels. The most common is plain wire cable (twisted pair or coaxial), which uses electrical signals to carry data. Fibre optic cables consist of very thin glass filaments which carry digital signals generated by switching lasers on and off to form pulses of light. Fibre optic cables have many advantages over wire, the most important of which is speed of transmission. A single fibre optic cable can carry up to 7000 times as much data as regular wire cables. They are also less prone to interference (up to a million times lower rate of errors), safer, longer lasting, and much more difficult to wiretap (giving them a higher degree of security).

Microwaves are also used to transmit data via relay stations. These stations must be in line-of-sight of each other, and can transmit very large quantities of data very rapidly. Both cellular phones and communication satellites use this technology, which offers a cable-free medium of communication.

Figure 1: Data transmission usually occurs through a variety of different communication mediums.

Transfer rates

The greatest concern facing specialists in data communications is the speed at which data is transferred. As technology improves, researchers discover new techniques which can be used to transmit data faster than ever before. However, our expectations continue to rise, requiring an ever faster rate of data transfer (e.g. people were once happy to use Morse code to transfer messages via telegraph; now people want video conferencing). The transmission rate of data is often referred to as the baud rate (which is incorrect, but the baud rate and transmission rate are often approximately equal), and is measured in bits per second (bps). The transmission rate is dependent upon the range of frequencies which can be transmitted by a given medium (i.e. the bandwidth of the medium), and the speed at which the signal travels through the medium.

Technical details

The transmission of data can be realised in a variety of ways, and the best method is highly dependent upon the nature of the data to be transferred. Simplex channels allow data to be transmitted in one direction only (e.g. TV, radio). Half-duplex channels allow two-way communication, but in only one direction at a time (e.g. walkie-talkies - one end must stop transmitting before the other can begin). Full-duplex channels allow data to be transmitted in both directions at the same time (e.g. a telephone line - both people can speak and listen at the same time).

Figure 2: Data transmitted in serial.

Data can be transmitted in a parallel or serial fashion depending upon hardware constraints. Inside the computer, information is usually transmitted in parallel one word at a time (a word being the number of bytes which the computer treats as a single unit). However, most data channels require data to be transmitted in serial, resulting in a much slower rate of transfer.
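To make the arithmetic of transfer rates concrete, here is a minimal Python sketch (an illustration added to these notes; the rates and file size are examples, not figures from the lecture) of how long a serial transfer takes at a given bps:

    # Transfer time = total bits / transmission rate (bits per second).
    def transfer_time(size_bytes, rate_bps, bits_per_byte=8):
        """Seconds needed to send size_bytes serially at rate_bps."""
        return (size_bytes * bits_per_byte) / rate_bps

    size = 1_000_000  # a 1 MB file
    print(f"56 kbps modem: {transfer_time(size, 56_000):.0f} seconds")       # ~143 s
    print(f"100 Mbps LAN:  {transfer_time(size, 100_000_000):.3f} seconds")  # ~0.08 s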
Computers need to use accurate methods of data transfer, since even a single error could render the data useless. Two common methods are called asynchronous and synchronous.

Asynchronous transmission involves sending one byte of data at a time. In order to separate the bytes of data, start and stop bits are used to signal that a byte is being sent. Usually, 2 bits are sent as a signal that a byte is following, then the byte is sent, then 2 bits are sent to indicate that the byte has been transmitted.

Figure 3: Data transmitted in parallel.

Figure 4: Transmitting the byte 11100100 asynchronously requires sending 2 start bits (00), followed by the byte (11100100), followed by the stop bits (11).

Synchronous transmission involves sending many bytes gathered together in a group known as a packet. These bytes are sent without start or stop bits. Instead, the transmission is carefully timed, so that the receiver can distinguish each individual byte.

Protocols

Data communications is a complex issue, involving many possible ways of sending and receiving data, and it is important that both ends of a transmission use the same methods. Computer professionals have therefore standardised on particular communications techniques. These standard methods of communicating are known as protocols, and they define a set of rules and procedures indicating how to initiate and maintain communication. Low-level protocols define the electrical and physical standards to be observed, the bit-ordering, and the transmission, error detection and correction of the bit stream. High-level protocols deal with data formatting, including the syntax of messages, character sets, sequencing of messages, etc.

Communications Hardware

A typical data communications system will involve many terminals connected to a central server. A network which exists over a large area would typically use modems to create the connections between the terminals and the server. These common elements of a network warrant further discussion.

Any input/output device which is at the end (or terminal point) of a data communications system is known as a terminal. However, the word is usually used to refer to a computer which is connected to the communications system and includes a screen and keyboard for input and output. Today, many computers have "terminal" software which allows them to connect to other host computers.

A modem is a device which enables a computer to receive and transmit data over a telephone line. Traditionally, telephone lines have been analogue (i.e. they transmit signals which have a continuous range of values), whereas computers use digital signals (discrete values - either 0 or 1). A modem converts the computer's digital signal into analogue form which can be sent through a regular telephone line. A modem at the other end of the line converts the signal from analogue back into digital so that the remote computer can understand it.

Figure 5: Analog signals have a continuous range of values, while digital signals have a limited range of discrete values.
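The two ideas above - framing a byte with start and stop bits, and a modem turning bits into tones - can be sketched in a few lines of Python. This is an illustration added to the notes, not real modem firmware: the framing follows the 2-start/2-stop convention of Figure 4 (real hardware commonly uses a single start bit), and the tone frequencies are those of the old Bell 103 standard.

    import math

    START_BITS, STOP_BITS = [0, 0], [1, 1]    # the lecture's framing convention
    MARK_HZ, SPACE_HZ = 1270.0, 1070.0        # Bell 103 tones for 1 and 0
    SAMPLE_RATE, BIT_RATE = 8000, 300         # 300 bps, 8000 samples per second

    def frame_byte(byte):
        """Wrap one byte in start and stop bits for asynchronous transmission."""
        data_bits = [(byte >> i) & 1 for i in range(7, -1, -1)]   # MSB first
        return START_BITS + data_bits + STOP_BITS

    def modulate(bits):
        """Map each bit to a burst of the matching audio tone (FSK)."""
        samples, per_bit = [], SAMPLE_RATE // BIT_RATE
        for n, bit in enumerate(bits):
            freq = MARK_HZ if bit else SPACE_HZ
            samples += [math.sin(2 * math.pi * freq * (n * per_bit + i) / SAMPLE_RATE)
                        for i in range(per_bit)]
        return samples

    bits = frame_byte(0b11100100)   # [0,0, 1,1,1,0,0,1,0,0, 1,1] - 12 bits for 8
    signal = modulate(bits)         # the "analogue" waveform sent down the line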
Server

A server is a computer which understands a particular protocol, receiving and transmitting data according to that protocol. A typical data communications system may have many different servers, each of which is dedicated to a particular sort of task (and understands the corresponding protocol).

Traditional Networks

Early networks used a single computer to co-ordinate the transfer of information from one machine to another. This central computer became known as the host computer. The host computer was usually left switched on all the time, allowing other computers to send data at any time.

Figure 6: A central computer was used to co-ordinate communications between different machines.

Local Area Networks (LAN)

Most small businesses today have at least one computer. Those with more than a single computer are likely to have them connected together in order to share files and other resources such as printers and modems. A local area network (LAN) consists of two or more computers physically connected together (usually with twisted pair or coaxial cable) which are located in a limited geographic area (less than 1 km across).

Figure 7: A Local Area Network allows users to transfer files between one another and to share expensive resources (such as a printer, scanner or tape drive).

Wide Area Networks (WAN)

A network which spans a geographic area of greater than 1 km is usually known as a wide area network (WAN). Networks of this type often include satellite links, or operate through fast digital lines leased from a telecommunications company such as Telecom. The US National Science Foundation (NSF) backbone is one example of a WAN.

Figure 8: A Wide Area Network spans a distance of greater than 1 kilometre.

An intranet

Large organisations, such as the university, can have many LANs and WANs all interconnected to form an intranet. An intranet is the internal network of an organisation. This internal network may or may not be connected to the Internet, or perhaps to another company's intranet.

The Beginning of the Internet

During the 1950's, America was in the grip of "Cold War" paranoia. Fear of communism and the perceived threat of a nuclear strike from the Soviet Union caused an intense period of industrial research. The successful launch of Sputnik by the Soviet Union in 1957 provoked the formation of the Advanced Research Projects Agency (ARPA) within the Department of Defence (DoD).

A New Type of Network

Dr J. Licklider, who was head of ARPA in 1962, began to involve universities in the ongoing research, thus tapping the resources of the academic community. After the Cuban missile crisis in 1962, a nuclear attack seemed unavoidable. Investigation into the aftermath of a nuclear attack suggested that neither the long-distance telephone plant nor the military command and control network would survive such an attack. Even though most of the links would be operable, the centralised switching facilities would be destroyed, rendering the system useless.

During this time, Paul Baran of RAND (a research and development organisation) came up with a promising solution. Baran conceived of a network where each node would have equal status; be autonomous; and be capable of receiving, routing, and transmitting information. This system would take each message and break it down into pieces (called packets). Each of these packets would be individually addressed and allowed to travel through the network by whatever route was available. If parts of the network were destroyed, the address of the destination was sufficient for a node to send the packet via a different route.
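Baran's scheme is easy to sketch in code. The following Python fragment (an added illustration; the packet format is invented for the example) splits a message into addressed, numbered packets and reassembles them even when they arrive out of order:

    def packetise(message, destination, size=8):
        """Split a message into individually addressed, numbered packets."""
        return [{"to": destination, "seq": i, "data": message[i*size:(i+1)*size]}
                for i in range((len(message) + size - 1) // size)]

    def reassemble(packets):
        """Rebuild the message from packets, whatever order they arrived in."""
        return "".join(p["data"] for p in sorted(packets, key=lambda p: p["seq"]))

    packets = packetise("If parts of the network fail, reroute.", "node-42")
    packets.reverse()            # simulate packets taking different routes
    print(reassemble(packets))   # the original message is restored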
In the autumn of 1969, ARPA connected four sites together using methods developed by RAND. These sites were:

1. Stanford Research Institute (SRI)
2. University of California at Los Angeles (UCLA)
3. University of California at Santa Barbara (UCSB)
4. University of Utah

The Developing Network

The network was initially to be used primarily for long-distance computing. This would allow a user in a remote location to log into and use a distant machine as if they were actually there. Researchers could make use of the few supercomputers in the country without having to travel miles to try out their programs. Within 2 years, however, the main use of the network was no longer long-distance computing, but rather sending personal messages. Each user had their own account, and could be sent an individual message, which proved to be extremely popular among the growing number of users.

The decentralised design of ARPAnet allowed computers to be connected far more easily than in traditional networks. It was not necessary to register with a central authority, but merely to connect to the closest machine already connected to the network. This design allowed the network to expand quickly and easily.

TCP/IP

Once the network began to expand, it started to be connected to other existing networks. The original protocol for ARPAnet was designed for only a small number of machines. Robert Kahn saw a need for a protocol which would allow for unlimited expansion, and in 1972 he began to work on TCP/IP (Transmission Control Protocol/Internet Protocol). He assumed:

- No internal changes would have to be made to a network in order to connect it to the Internet.
- Communication would be on a best-effort basis, with any lost packets being retransmitted.
- There would be no global control of the network.

At around the same time, the Ethernet protocol was developed at Xerox PARC by Bob Metcalfe for use with local area networks. After the public release of Ethernet in 1973 it quickly became the standard protocol for LAN communications, and small independent networks flourished.

From ARPAnet to Internet

The term "Internet" was first used in a planning document in 1974, but the name ARPAnet remained in common use until the early 1980's. Around this time TCP/IP was integrated into the UNIX system, so that standard mainframes would come with networking software already installed. Networks of varying sizes continued to become more commonplace. On January 1st, 1983, ARPAnet changed protocol from its original limited version to the TCP/IP protocols we use today. The transition went surprisingly smoothly. This change to TCP/IP allowed the DoD to separate from ARPAnet and form Milnet (Military Network).

Internet today

When ARPAnet was a small network, it was possible to maintain a table of all the computers and their addresses. With the open-ended architecture of TCP/IP this was no longer feasible, and a system for maintaining addresses, the Domain Name System (DNS), was created. This system introduced a hierarchical scheme for addresses which operates in a similar way to regular postal addresses.
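The hierarchy is visible in any machine name today: each name reads from the most specific label to the most general, and the DNS translates it into the numeric address that packets actually carry. A small Python sketch (added here as an illustration; the host name is an example, and the lookup needs a live network connection):

    import socket

    name = "www.cs.auckland.ac.nz"       # specific -> general, left to right
    labels = name.split(".")
    for i in range(len(labels)):
        print(".".join(labels[i:]))      # www.cs.auckland.ac.nz ... ac.nz, nz

    print(socket.gethostbyname(name))    # the DNS returns the host's IP address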
The National Science Foundation (NSF) wanted a high-performance network to link supercomputers together, with the aim of providing researchers with available resources. During the 80's they created a high-speed network which ran across the US and became known as the backbone of the Internet. Universities, government departments and businesses were all encouraged by the high-speed link, and rapid growth of the Internet began. ARPAnet ceased in 1990, by which time the Internet was a well established system.

The creation of the WWW in 1991 allowed the general public to experience the Internet for the first time. Entertainment, advertising, and uninformed opinion quickly became commonplace. Experienced users of the Internet longed for the "good old days" when the Internet was a small closed community. The general public was able to enjoy another form of mass media, but one in which they had control. In 1995, the NSF created a new very high speed backbone which was reserved solely for academic research.

Internet Service Providers (ISP)

An Internet Service Provider (ISP) provides a way for an individual to access the Internet. The ISP usually has a host computer which maintains a permanent connection to other host computers and forms part of the Internet. The ISP will usually provide a client with an account on one of the host machines owned by the ISP. This account allows the ISP to keep a record of how much information each user transfers, and provides a location where email and other files may be stored.

Figure 10: A home user normally accesses the Internet through an Internet Service Provider.

Connecting to the Internet

In order to access the Internet, a user requires a computer, a modem, software that allows the computer to connect to a host, and an account with an ISP. The communication software (or dial-up software in Windows 95) tells the modem to dial the appropriate number and connect to the modem owned by the ISP. Once the modem has connected, the user must log into the account provided by the ISP (this is often achieved by scripts that run in the background). If the user has email waiting on the host machine, they may transfer it to their local machine to read it. Any time the user looks at any information on the Internet (WWW pages, Usenet news, email, IRC, etc.) the data must be transferred through the host computer, down the telephone line, through the modem and onto the local machine before it can be read.

Communication

The Internet provides a world-wide communication medium. It is possible for a user to communicate with a single individual or a small group, or to broadcast to a large audience. The most common methods of communication are electronic mail, Internet relay chat, and Usenet newsgroups.

Electronic mail

Email is a mechanism that allows a user to type a message and then send it through the Internet to a specified address. It is usually available with the lowest level of Internet access, and is considered by many to be the most essential aspect of Internet communication. The protocol by which email is transferred restricts the message to plain ASCII text. The development of MIME (Multipurpose Internet Mail Extensions) has allowed documents of any sort to be sent through email, greatly increasing its usefulness. More recently, email facilities have been incorporated into WWW browsers such as Netscape. These browsers support the use of HTML code within the mail message, allowing different fonts, sizes, and styles to be used within the document. Some browsers even support the inclusion of pictures within an email message (although such messages cannot be easily read by those whose email program does not support HTML).
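A MIME message of the kind these browsers produce can be built with Python's standard email library. This sketch is an added illustration (the addresses are placeholders): it creates a message with a plain ASCII body plus an HTML alternative, so that old text-only mail programs still have something to show.

    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Hello"
    msg.set_content("Hello in plain text.")                       # ASCII fallback
    msg.add_alternative("<p>Hello in <b>HTML</b>.</p>", subtype="html")

    print(msg)   # shows the multipart/alternative MIME structure on the wire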
Internet Culture

Email usually reaches its destination within seconds. This fast delivery time has directly influenced the culture of Internet users in a variety of ways. Replies are usually received quickly (compared with snail mail), giving email a conversational flavour. This has resulted in an informal style of writing in which even spelling mistakes are taken for granted. Chat groups have almost instant delivery of text to the group, requiring people to develop methods of carrying on written conversations in real time. To reduce the time taken to get the message across, shorthand notation developed for common phrases (e.g. FYI = for your information, TTYL = talk to you later, TNSTAAFL = there's no such thing as a free lunch). Emotion is difficult to express through pure text, yet it is important for conversations, and so emoticons (smileys) were developed to fulfil this role. These are usually used to represent sarcasm, humour, and basic emotions such as happy or sad.

:-)  happy
;-)  wink (sarcasm/humour)
:-(  sad

Mailing Lists

All email is handled purely by computer systems, so it is easy to automate the sending of email to many users. All that is needed is a list of addresses, and a computer program (called a mailer) can mail out a message to every address on the list. This feature is the essential idea behind mailing lists, which are lists of addresses of people interested in the same topic, together with a program that manages the list. Any mail which is sent to a specified address (that of the mailer) is automatically forwarded to each person whose address is on the list. Generally, mailing lists are for people who have a common interest in a particular topic (e.g. kite making). It is estimated that there are over 30,000 different mailing lists today. The largest mailing list currently is A-Word-A-Day, with over 100,000 subscribers.

Privacy

There is no such thing as private email. There are no national laws which protect email, and administrators have access to all email that is sent to or from their system. It has been estimated that 25% of US companies read their employees' email messages. There is no reason to be alarmed, however, since in practice it would take far too long to read everyone's email. Just don't send anything really important (i.e. documents relating to national security) via email. The issue of email privacy has become a common concern in the US, which is in the process of developing legally binding ethical standards for the computer industry.

Usenet

The UNIX User Network (Usenet) began in 1981 when the UNIX to UNIX Copy Protocol (UUCP) was created. This protocol simply copied files from one UNIX machine to another across a network. It was quickly seen as an ideal way to build a message system, where any messages would be copied to all the other machines running UUCP. A backbone of machines supporting UUCP was set up spanning the US. The messages were stored and forwarded on a purely cooperative basis, at some expense to those providing the service.
At this time, the groups fell into only two categories: mod groups were moderated, and net groups were everything else.

Great Renaming

In 1986, the newsgroups were becoming too numerous to manage easily, and so the Great (or Grand) Renaming began. This was a move to restructure the Usenet groups into the hierarchical structure we have today, and it took about a year to fully realise. The main groups became comp, misc, news, rec, sci, soc and talk. The renaming made it possible for administrators to copy only the first few groups (the important ones in their eyes), and not support the other groups. This possibility angered many, and one of the biggest flame wars of all time ensued. During this time, the proposed rec.sex and rec.drugs groups caused the backbone to break, as some of the administrators refused to support those groups. This in turn caused the creation of the anarchistic alt hierarchy (in which anyone can start up a group).

Usenet today

The amount of information posted to Usenet newsgroups is staggering. Approximately 100 MB of information (the size of Encyclopaedia Britannica) is posted each day, distributed between more than 25,000 different newsgroups. Many of these articles are kept for only a few days; others may be kept for weeks or months, depending upon which newsgroup the message appears in. The estimated number of readers world-wide is over 100 million.

Other Forums

The popularity of Internet browsers such as Internet Explorer has drawn the general public into the use of the WWW, but has not attracted as many into areas such as Usenet. Web-based forums are becoming increasingly common as a way of discussing special-interest topics. Using the web for forum discussions has the added advantage of keeping the topic of discussion in the same place (the same web site) as other information relating to the discussion. For example, a web site about the dangers of GE food can host pages informing the public of the danger as well as a discussion forum where questions can be raised and answered. The more frequent use of forums on web pages is beginning to give the web a greater sense of community and a more interactive feel.

Netiquette

With such a large audience able to publish (i.e. post messages), there is a need for conventions governing acceptable behaviour, commonly known as netiquette. Failing to follow these rules has no official or formal effect, but reflects the poor judgement of the person posting. You would be well advised to read an article on netiquette before communicating via email or posting to Usenet newsgroups.

Timeline of some Internet-related events

1957  USSR launches Sputnik. The US forms ARPA to establish a US lead in military technology.
1962  Paul Baran, RAND: "On Distributed Communications Networks".
1969  ARPANET commissioned by DoD for research into networking. First node-to-node message sent between UCLA and SRI (October).
1970  ARPANET hosts start using the Network Control Protocol (NCP).
1972  Ray Tomlinson (BBN) writes basic email message send and read software. Telnet specification.
1973  Bob Metcalfe's Harvard PhD thesis outlines the idea for Ethernet. File Transfer specification.
1974  Vint Cerf and Bob Kahn publish "A Protocol for Packet Network Intercommunication", which specified in detail the design of a Transmission Control Program (TCP).
1976  Queen Elizabeth II sends out an e-mail.
1982  ARPA establishes the Transmission Control Protocol (TCP) and Internet Protocol (IP) as the protocol suite, commonly known as TCP/IP, for ARPANET. This leads to one of the first definitions of an "internet" as a connected set of networks, specifically those using TCP/IP, and of the "Internet" as connected TCP/IP internets.
1983  Cutover from NCP to TCP/IP (1 January). ARPANET split into ARPANET and MILNET.
1984  Domain Name System (DNS) introduced. Number of hosts breaks 1,000.
1986  NSFNET created (backbone speed of 56 Kbps).
1987  Number of hosts breaks 10,000.
1988  Internet worm burrows through the Net, affecting ~6,000 of the 60,000 hosts. NSFNET backbone upgraded to T1 (1.544 Mbps).
1990  ARPANET ceases to exist.
1992  Number of hosts breaks 1,000,000. The term "surfing the Internet" is coined by Jean Armour Polly.
1994  Arizona law firm of Canter & Siegel "spams" the Internet with email advertising green card lottery services; Net citizens flame back.
1995  NSFNET reverts to a research network. Main US backbone traffic now routed through interconnected network providers.
1996  Hong Kong police disconnect all but one of the colony's Internet providers in search of a hacker; 10,000 people are left without Net access.
1998  The WWW browser war, fought primarily between Netscape and Microsoft, revolutionises the software industry, with new releases made quarterly.
1998  Dot Coms (Internet companies) become the hottest item on the stock exchange. Millions are made as many Internet stocks increase ten-fold.
2000  The game Everquest is released and becomes the first really popular massively multiplayer online roleplaying game. Players pay a subscription and create a character that they play in a huge 3D persistent world; it takes more than 8 hours of playing time simply to walk from one end of the world to the other. Everquest becomes commonly known in the Internet community as "EverCrack", one of the most addictive games people have played. Most players (a community of 300,000+) have logged over 1000 hours of play, and people are reported as losing jobs and having marriages break up due to playing Everquest.
2000  The downfall of the tech stocks. Having doubled every year for the past 3 years, the bubble bursts and most tech stocks plummet in value. Dot Coms are worst hit, many going bankrupt or reducing to a fraction of their original value. Huge layoffs in IT staff follow.
2001  Research shows that a significant portion of children who meet a stranger online (in chat rooms etc.) go on to meet the stranger face to face in an unsupervised setting.
2002  Everquest is blamed for the suicide of a teenage boy. Sony Online (publisher) and Verant Interactive (developer) of Everquest are sued in US courts. Warning labels on "addictive" games are a likely outcome.

Hypertext and the World Wide Web

Introduction

The WWW is a fairly recent phenomenon, yet the underlying structure has had a long history of development which is still underway today. The current structure of the WWW has serious flaws which reduce its effectiveness as a hypermedia system. The interface through which we access the WWW is fluid, always shifting under competitive pressure. Through all the changes one thing remains consistent: the public interest.

Hypertext

Vannevar Bush (science advisor to Roosevelt during WWII) proposed the Memex system in his 1945 article "As We May Think". This system was a conceptual machine which could create information trails: links between related texts and illustrations which could be used for reference.
Bush felt that such a machine would greatly increase learning, memory and knowledge. Inspired by Bush, Douglas Engelbart (working in 1963 at SRI) proposed a system which cross-referenced related documents across a network (he later invented the idea of the mouse and screen pointer for GUIs). Ted Nelson's work in 1960 was far more ambitious. His project, Xanadu, was designed as a document universe, where everything ever written would be stored and referenced with cross-links to related information. He coined the term "hypertext" in 1965. After much funding, the Xanadu project is still underway today, still under the leadership of Ted Nelson. His continuing advocacy of a 30-year-old project has resulted in some criticism of Nelson: "Xanadu, the grandest encyclopedic project of our era seemed not only a failure but an actual symptom of madness." - Wired magazine

Tim Berners-Lee began work on the WWW project at CERN in 1989. The European Particle Physics Laboratory had already helped shape the Internet by supporting the TCP/IP protocol. Tim Berners-Lee visualised a hypertext system which would encourage collaborative research. Contributors would have world-wide access through the Internet, and would be able to easily add to the database of knowledge and cross-reference other documents.

Development

The WWW was fully operational at CERN in 1991. At this time only text was used, and the only browser used a CLI. The WWW looked a lot like other aspects of the Internet (Usenet, e-mail, etc.). By 1992, the National Centre for Supercomputer Applications (NCSA) had released the Mosaic browser with a GUI interface. A year later (in 1993), GUI versions of the browser were released for home computers, in both PC and Macintosh versions. Since that time, the ease of use has allowed the general public to become involved.

The Underlying Structure

The protocol used by the WWW is the Hypertext Transfer Protocol (HTTP). The hosts which support HTTP are able to talk to each other and pass documents back and forth, forming the basis of the WWW. Note that the WWW is not the same as the Internet, but rather a collection of servers (a subset of Internet hosts) which support HTTP.

Client-Server Model

The WWW uses a client-server model as the basis of communication. In this model, the client (browser) runs on the local (or user's) machine, and the server runs on one of the host machines. The client is responsible for displaying the information in the documents, displaying and maintaining hypertext links in an intuitive manner, and negotiating formats of information with the server. The server must negotiate formats with the client, send information in the requested format, and manage the nodes of information on the host machine upon which it is running.

Structural Problems

The hypertext structure developed for the WWW has some major flaws. The most obvious of these is the problem of "dangling links". These occur when a user moves or deletes a document available on the WWW. All the other documents which contain hypertext references to the first document are left with links which refer to a page that no longer exists.

The lack of facilities for accounting inherent in the structure (due to its anonymous nature) makes it undesirable for publishers to publish professional-quality work on the WWW. Publishers are generally uninterested if there is no way to make a profit. This has further encouraged people to look to advertising in order to maintain web sites.
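A dangling link, as described above, can be detected mechanically by asking the server whether the page still exists. The sketch below is an added illustration using Python's standard library (the URLs are examples): a "404 Not Found" reply, or no reply at all, marks the link as dangling.

    import urllib.request
    import urllib.error

    def link_is_dangling(url):
        """Return True if the URL no longer leads to a document."""
        try:
            with urllib.request.urlopen(url, timeout=10):
                return False             # the server answered with a document
        except urllib.error.HTTPError as e:
            return e.code == 404         # "Not Found": the classic dangling link
        except urllib.error.URLError:
            return True                  # host gone or unreachable

    print(link_is_dangling("http://www.example.com/"))          # expected: False
    print(link_is_dangling("http://www.example.com/missing"))   # probably True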
Due to the decentralised structure of the WWW, finding relevant information can be difficult. There is no central database or index of information, and no way to categorise information based on quality. Search engines are fully automated, and so they tend to index information poorly.

Search Engines

Many attempts have been made to index the WWW. This is usually achieved through the use of an automated program which recursively accesses the pages on the WWW. For each page, the program will attempt to extract the most important information from the document and send it back to a central database. This database can be searched for key words, and will display a list of pages which contain those words. The searchable database is known as a search engine.

WWW Demographics

The WWW is still predominantly used by males (69%) rather than females (31%). The average age is 35 years, and has been steadily increasing over the past 5 years. Approximately half of users are married (46%) and one third single (37%). One in five users (20%) use the WWW for more than 20 hours per week, and almost a third spend 10-20 hours a week browsing. The most common use of the WWW is simply browsing (77%), followed by entertainment (64%), education (53%), work (51%) and shopping (19%). In recent surveys, the most common concern expressed by users is the issue of censorship (36%), followed by privacy issues (26%) and difficulty in navigation (14%). A telling statistic is that 37% claim to use the WWW instead of watching TV on a daily basis. Equally interesting is the response showing that users tend to spend as much time using e-mail as they do using the phone. (Statistics from GVU's 6th WWW User Survey.)

Navigating through Cyberspace

Imagine a library where the books are placed on the shelves in no particular order, and the only way to access them is through an index containing all the words which appear inside all the books. Imagine that this library also contains all the junk mail produced, and all the business documents produced by companies. What you have imagined is the World-Wide Web as it stands today.

Navigating through the mass of information on the WWW can be a demoralising and unsatisfactory experience. The information which is available is usually difficult to find, even if you know exactly what you are looking for. The most frustrating aspect of the WWW, and the thing which prevents most people from making effective use of it, is the lack of direction. It is far too easy to get lost in cyberspace, confused about where you have come from and where to go next. In order to master this new electronic environment you must learn how to use the tools for finding information on the WWW.

The most important aspect is to choose the right tool for the job. There are many specific indexes, or databases, which index specialised subjects (such as the Internet Movie Database). If you find any databases related to your own interests, then bookmark them, and build up a set of resources tailored to your tastes. If you are looking for general information about a topic, but don't have any specific queries (e.g. you are generally interested in watersports), then using one of the subject catalogs will help to guide you to appropriate pages. Only use search engines when you are searching for specific information.

Search Engines or Subject Catalogs?

A subject catalog (sometimes called a directory or guide) is a manually-created catalog of sites on the WWW.
This means that a person has created categories, and a team of people review each page; if they consider it to be a worthwhile resource, they will include a link to it in the appropriate place in the categorical index. This means that all the sites within a subject catalog have been screened for information content by a human, so you are unlikely to find personal home pages and the like. If the editors of the catalog do not like the content of a page, they will not add it to the directory. These directories usually index only a small portion of the WWW, but the pages which are listed are usually of high quality.

A search engine uses an automated program to index pages on the WWW in a single enormous database. So how does a computer program know what the content of a page is? Each search engine uses a different approach to this problem. Most of them examine the page and record keywords from it. They are likely to take account of the number of times a word appears, and of its position in the page. The title and headings of a page are often highly regarded by an indexing program as indicative of the content of the page.

Using the URL for searching

The "branding" of a web site is possibly its most important aspect. Each web site has a unique address. If this address is memorable, then you can find the site easily and reliably. Sites like "Yahoo!" (http://www.yahoo.com) or "Amazon" (http://www.amazon.com) have succeeded in this contest for a place in the consciousness of the casual Internet user. Understanding how web sites are named gives you a big advantage in searching for sites. Searching for a site by trying to guess the URL is a remarkably quick and easy way to get started. Try a few guesses to start with and see if any are useful. You can always resort to a search engine.

Example: If you are looking for tourist information about NZ, where would you look? Try www.tourist.co.nz, www.tourism.govt.nz, or www.touring.org.nz. If none of these are successful, then perhaps the NZ government site would have links to other tourist sites: try www.govt.nz. Maybe each local government body would have information about its area. If you find a site, but it's not quite what you want, then look for links to other related sites.

Summary:
1. Use specific databases if they exist (and if you can find them)
2. Use catalogs to find general subject information
3. Use search engines for specific queries

Simple Searching using Alta Vista

Always try a simple search using natural language first. Type a word, phrase or question (for example, "weather Boston" or "what is the weather in Boston?"), then click Search (or press the Enter key). If the information you want from this sort of query isn't on the first couple of pages, try adding a few more specific words (like "June", or "today", or "weather report").

Required and rejected terms

Often you will know a word that is guaranteed to appear in a document for which you are searching. If this is the case, require that the word appear in all of the results by attaching a "+" to the beginning of the word (for example, to find an article on pet care, you might try the query dog cat pet +care). You may also find that when you search on a vague topic, you get a very broad set of results. You can quickly reject results by adding a term that appears often in unwanted articles with a "-" before it (for example, to find a recipe for oatmeal raisin cookies without nuts, try: oatmeal raisin cookie +recipe -nut* -walnut*).
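The required/rejected filtering just described is straightforward to express in code. The sketch below is an added illustration (the pages are invented): it keeps only documents containing every "+" term and no "-" term, with a trailing * treated as a wildcard.

    import fnmatch

    def matches(term, words):
        """True if any word in the document matches the term ('*' is a wildcard)."""
        return any(fnmatch.fnmatch(w, term) for w in words)

    def filter_pages(pages, query):
        terms = query.split()
        kept = []
        for url, text in pages.items():
            words = set(text.lower().split())
            required_ok = all(matches(t[1:], words) for t in terms if t.startswith("+"))
            rejected = any(matches(t[1:], words) for t in terms if t.startswith("-"))
            if required_ok and not rejected:
                kept.append(url)
        return kept

    pages = {"a.html": "oatmeal raisin cookie recipe with walnuts",
             "b.html": "oatmeal raisin cookie recipe without nuts"}
    print(filter_pages(pages, "+recipe -walnut*"))    # ['b.html']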
Phrases

If you know that a certain phrase will appear on the page you are looking for, put the phrase in quotes (for example, try entering song lyrics such as "you ain't nothing but a hound dog"). Using quote marks ensures that the words appear in exactly the order you have specified; otherwise you will find all documents containing any of the words: dog, hound, nothing, ain't, you, a, but.

Case Sensitivity

Use only lower case unless you want your search to be case sensitive. If you search for Coffee, you'll get only documents that include that word with just that capitalisation. If you search for coffee, you'll get any page with that word.

Wildcards

Use an asterisk (*) to broaden your search. To find any words that start with gold, use gold* to find matches for gold, goldfinch, goldfinger, and golden. Use this if the word you are searching for could have different endings (for example, don't search for dog, search for dog* if it could be plural).

Special Functions for web searches using Alta Vista

AltaVista doesn't just search text. Here are the other ways you can search on the net:

Hypertext Links
You can find pages which contain a word or phrase within the text of a hyperlink by using anchor:text. For example, anchor:"Click here to visit AltaVista" would find pages with "Click here to visit AltaVista" as a link.

Destination URLs
You can find all pages which link to a destination URL. This may be useful if you wish to find all pages which have links to a page you are interested in (such as your own home page). Use link:URLtext to find all such pages. For example, link:altavista.digital.com finds all pages which link to the Alta Vista search engine.

Images
You can search for images by using image:text. For example, image:elvis looks for pages with images called elvis. Note that the search is based on the name of the image file, so use short names without spaces (since filenames are likely to be single words less than 8 characters long).

Titles
If you know the title of the page you are looking for (i.e. the name which usually appears in the title bar of the browser), then you could use title:text. For example, a search for title:Elvis would find pages with Elvis in the title.

Advanced Features of Alta Vista

When a general search has not been successful and you are looking for specific information, an advanced query may provide better results. Advanced search is for very specific queries, not for general searching. Almost everything you need to do can be done more quickly and with better results through the simple form. Remember, when you use the advanced search form, you control the ranking; if the ranking field is left blank, no ranking will be applied and the results will be in no particular order.

Boolean Operations

Note that the + and - operators do not work in an advanced search. You should use the Boolean keywords AND, OR, NOT, and the operator NEAR. Each of these operators has a shortened form: respectively &, |, !, and ~.

AND, &
Finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.

OR, |
Finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb. The found documents could contain both, but do not have to.

NOT, !
Excludes documents containing the specified word or phrase.
Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone; use it with another operator, like AND. For example, AltaVista does not accept Mary NOT lamb; instead, specify Mary AND NOT lamb.

NEAR, ~
Finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.

Ranking results

To rank matches, enter terms in the Ranking field; otherwise, the results will appear in no particular order. You can enter words that are part of your query, or enter new words as an additional way to refine your search. For example, you could further narrow a search for COBOL AND programming by entering advanced and experienced in the ranking field.

Using Search Engines

In order to find information quickly, you need to practise using a search engine until you are confident with it. Spend a little time using a few different search engines, then pick your favourite one and learn how to use its advanced features. Each search engine indexes pages in a slightly different way, and you can use different techniques to narrow your search with each of them. It is worthwhile finding out how the search engine you use actually indexes pages, since that will give you insight into the results of your searches (i.e. why certain pages are near the top of the list). The search engines which are most highly recommended are Google, Hot Bot, and Alta Vista, but each person is advised to try a range and select the one which they are most comfortable using.
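To tie the searching ideas together, here is a closing sketch (an added illustration with invented data, not any real engine's code) of the inverted index a search engine builds, with the Boolean operations above answered as simple set operations over it:

    pages = {
        "mary.html": "mary had a little lamb",
        "farm.html": "the farm had a lamb and a cow",
        "bio.html":  "mary curie biography",
    }

    # Build the inverted index: word -> set of pages containing that word.
    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)

    all_pages = set(pages)
    def docs(word):
        return index.get(word, set())

    print(docs("mary") & docs("lamb"))                # Mary AND lamb
    print(docs("mary") | docs("lamb"))                # Mary OR lamb
    print(docs("mary") & (all_pages - docs("lamb")))  # Mary AND NOT lamb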