Crawling and Intro to Python
Web Crawling
Summer Job Survey
• If you are a CSC major, please be sure to
complete the survey, available from the
department website.
• Our ABET accreditation visitors will be
interested.
Picking up where we left off last
week
• So far
– Reviewed the syllabus, general class organization and
expectations
– Talked a bit about the beginnings of the web
• You have now read Vannevar Bush’s As We May Think
– Response? Where was he right? Where was he wrong? What did
he not envision? Did you notice anything about the writing style?
– Web Characteristics
• Lack of structure, organization to the collection
• Basic client-server model; http
• Introduction to search
– Crawling – one essential step in applications that may
involve searching, information organization
• Requirements (Robust, Polite)
• Expectations (Distributed, Scalable, Efficient, Useful, Fresh,
Extensible)
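Basic Crawl Architecture
[Figure: basic crawl architecture. URLs leave the URL Frontier and pass through Fetch (which uses DNS and retrieves pages from the WWW), Parse, Content seen? (checked against Doc FP’s), URL filter (using robots filters), and Dup URL elim (checked against the URL set); surviving URLs return to the URL Frontier.]
Ref: Manning Introduction to Information Retrieval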
Crawler Architecture
• Modules:
– The URL frontier (the queue of URLs still to be
fetched, or fetched again)
– A DNS resolution module (The translation from
a URL to a web server to talk to)
– A fetch module (use http to retrieve the page)
– A parsing module to extract text and links from
the page
– A duplicate elimination module to recognize
links already seen
Ref: Manning Introduction to Information Retrieval
Crawling threads
• With so much space to explore, so many
pages to process, a crawler will often
consist of many threads, each of which
cycles through the same set of steps we
just saw. There may be multiple threads
on one processor or threads may be
distributed over many nodes in a
distributed system.
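A minimal sketch of the idea, written for Python 3 (the slides use Python 2): several worker threads each run the same fetch step on a different URL. The fake_fetch function and the example URLs are stand-ins invented for illustration; a real crawler would issue HTTP requests there.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    # Hypothetical stand-in for an HTTP fetch: returns a made-up page body.
    return "<html>content of %s</html>" % url

def crawl_in_threads(urls, num_threads=3):
    # Each worker thread runs the same step (fetch) on a different URL.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        pages = list(pool.map(fake_fetch, urls))
    # pool.map returns results in the same order as the input URLs.
    return dict(zip(urls, pages))

# One URL per host, so no single server is hit by several threads at once.
urls = ["http://www.example.com/", "http://www.example.org/",
        "http://www.example.net/"]
results = crawl_in_threads(urls)
print(len(results))
```

Distributing the same work over several machines follows the same pattern, with the thread pool replaced by separate crawler processes.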
Politeness
• Not optional.
• Explicit
– Specified by the web site owner
– What portions of the site may be crawled and what portions
may not be crawled
• robots.txt file
• Implicit
– If no restrictions are specified, still restrict how often you hit
a single site.
– You may have many URLs from the same site. Too much
traffic can interfere with the site’s operation. Crawler hits
arrive much faster than ordinary traffic and could overtax the
server, which in effect amounts to a denial of service attack.
Good web crawlers do not fetch multiple pages from the same
server at one time.
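One way to enforce implicit politeness is to remember when each host was last contacted and wait before contacting it again. The sketch below (Python 3) assumes a two-second gap, which is an arbitrary value chosen for illustration; the class and method names are also invented here. Times are passed in explicitly so the example runs without actually sleeping.

```python
from urllib.parse import urlparse

class PolitenessTracker:
    """Remembers when each host was last contacted."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay   # assumed gap between hits, in seconds
        self.last_hit = {}           # host -> time of the most recent fetch

    def wait_needed(self, url, now):
        """Return how many seconds to wait before fetching url."""
        host = urlparse(url).netloc
        if host not in self.last_hit:
            return 0.0
        elapsed = now - self.last_hit[host]
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, url, now):
        self.last_hit[urlparse(url).netloc] = now

tracker = PolitenessTracker(min_delay=2.0)
tracker.record_fetch("http://www.example.com/a.html", now=100.0)
same_host = tracker.wait_needed("http://www.example.com/b.html", now=100.5)
other_host = tracker.wait_needed("http://www.example.org/", now=100.5)
print(same_host, other_host)
```

In a real crawler, now would come from time.time() and the crawler would sleep (or work on another host) for the returned number of seconds.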
Robots.txt
• Protocol nearly as old as the web
• See www.robotstxt.org/robotstxt.html
File: URL/robots.txt
• Contains the access restrictions
– Example:
# All robots (spiders/crawlers)
User-agent: *
Disallow: /yoursite/temp/

# Robot named searchengine only: nothing disallowed
User-agent: searchengine
Disallow:
Source: www.robotstxt.org/wc/norobots.html
Another example
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
Processing robots.txt
• Each record:
– User-agent: comes first and identifies to whom the
instructions apply. * = everyone; otherwise, specific crawler
name
– Disallow: or Allow: lines follow; each provides a path to
exclude from or include in robot access.
• Once the robots.txt file is fetched from a site,
it does not have to be fetched every time you
return to the site.
– Just takes time, and uses up hits on the
server
– Cache the robots.txt file for repeated
reference
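Python 3 ships a robots.txt parser in its standard library (urllib.robotparser). The sketch below parses the rules from the example slide and keeps the parsed result in a dictionary so each site is processed only once. The crawler name "mycrawler", the host name, and the allowed() helper are assumptions made for this example; a real crawler would download the file from the site rather than use a string.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the example slide; normally they would be downloaded
# from the site's /robots.txt.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

robots_cache = {}   # host -> parsed robots.txt, so each site is parsed once

def allowed(host, path, agent="mycrawler"):
    # "mycrawler" is a made-up crawler name for this sketch.
    if host not in robots_cache:
        parser = RobotFileParser()
        parser.parse(rules.splitlines())   # a real crawler would fetch the file here
        robots_cache[host] = parser
    return robots_cache[host].can_fetch(agent, "http://" + host + path)

print(allowed("www.example.com", "/index.html"))
print(allowed("www.example.com", "/tmp/scratch.html"))
```

The second call reuses the cached parser, which is the "cache the robots.txt file" point from the slide.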
Robots <META> tag
• robots.txt provides information about
access to a directory.
• A given file may have an html meta tag
that directs robot behavior
• A responsible crawler will check for that
tag and obey its direction.
• Ex:
– <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
– OPTIONS: INDEX, NOINDEX, FOLLOW, NOFOLLOW
See http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2 and http://www.robotstxt.org/meta.html
Crawling
• Pick a URL from the frontier (which one?)
• Fetch the document at the URL
• Parse the fetched document
– Extract links from it to other docs (URLs)
• Check whether the document’s content has already been seen
– If not, add to indices
• For each extracted URL
– Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey
robots.txt, etc.)
– Check if it is already in the frontier (duplicate URL
elimination)
Ref: Manning Introduction to Information Retrieval
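The loop above can be sketched in a few lines of Python 3. To keep the example runnable without a network, the "web" here is a small made-up dictionary that maps each URL to the links on that page, so fetching and parsing collapse into one lookup. The content-seen check is left out, and all names (FAKE_WEB, passes_filter, crawl) are invented for the illustration.

```python
from collections import deque

# A tiny made-up "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://a.edu/": ["http://a.edu/x", "http://b.com/"],
    "http://a.edu/x": ["http://a.edu/", "http://c.edu/"],
    "http://c.edu/": [],
    "http://b.com/": ["http://a.edu/"],
}

def passes_filter(url):
    # Example URL filter from the slide: only crawl .edu hosts.
    host = url.split("/")[2]
    return host.endswith(".edu")

def crawl(seed):
    frontier = deque([seed])   # URLs still to be fetched
    seen = {seed}              # every URL ever placed on the frontier
    visited = []               # order in which pages were fetched
    while frontier:
        url = frontier.popleft()         # pick a URL from the frontier
        links = FAKE_WEB.get(url, [])    # "fetch" and "parse" in one step
        visited.append(url)
        for link in links:
            # URL filter first, then duplicate URL elimination.
            if passes_filter(link) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl("http://a.edu/")
print(order)
```

Using a queue gives a breadth-first crawl; "which one?" in the first step is exactly the choice of how the frontier is ordered.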
Basic Crawl Architecture
[Figure: basic crawl architecture diagram, repeated from the earlier slide.]
Ref: Manning Introduction to Information Retrieval
DNS – Domain Name System
• Internet service that resolves the host names in URLs into IP
addresses
• Distributed servers, some significant
latency possible
• OS implementations – DNS lookup is
blocking – only one outstanding request at
a time.
• Solutions
– DNS caching
– Batch DNS resolver – collects requests and
sends them out together
Ref: Manning Introduction to Information Retrieval
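DNS caching can be as simple as a dictionary in front of the resolver. In this Python 3 sketch the class name CachingResolver is invented, and a fake resolver is passed in so the example runs without network access; by default the class would call socket.gethostbyname, the standard blocking lookup.

```python
import socket

class CachingResolver:
    def __init__(self, resolve=socket.gethostbyname):
        self.resolve = resolve   # function that maps host name -> IP address
        self.cache = {}          # host name -> IP address already looked up
        self.lookups = 0         # number of real (uncached) lookups performed

    def lookup(self, host):
        if host not in self.cache:
            self.lookups += 1
            self.cache[host] = self.resolve(host)
        return self.cache[host]

def fake_resolve(host):
    # Stand-in resolver for the example; the address comes from a range
    # reserved for documentation.
    return "192.0.2.1"

resolver = CachingResolver(resolve=fake_resolve)
resolver.lookup("www.example.com")
resolver.lookup("www.example.com")   # answered from the cache
print(resolver.lookups)
```

A production cache would also expire entries, since the address behind a host name can change.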
Parsing
• Fetched page contains
– Embedded links to more pages
– Actual content for use in the application
• Extract the links
– Relative link? Expand (normalize)
– Seen before? Discard
– New?
• Meet criteria? Append to URL frontier
• Does not meet criteria? Discard
• Examine content
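Expanding relative links is handled by urljoin in the Python 3 standard library. The sketch below normalizes a few links found on a hypothetical page at www.example.com and drops the ones already seen; the URLs and the normalize() helper are made up for the example.

```python
from urllib.parse import urljoin, urldefrag

def normalize(base_url, link):
    # Resolve a relative link against the page it came from, then drop
    # any #fragment, which only names a spot inside the same page.
    absolute = urljoin(base_url, link)
    return urldefrag(absolute)[0]

base = "http://www.example.com/courses/index.html"
frontier = []
seen = set()
for link in ["syllabus.html", "../about.html", "syllabus.html#week2",
             "http://www.example.org/"]:
    url = normalize(base, link)
    if url not in seen:        # seen before? discard
        seen.add(url)
        frontier.append(url)   # new: append to the URL frontier
print(frontier)
```

The third link differs from the first only by its fragment, so after normalization it is recognized as a duplicate.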
Content
• Seen before?
– How to tell?
• Fingerprints, shingles
– Documents identical, or similar
– If already in the index, do not
process it again
Ref: Manning Introduction to Information Retrieval
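A rough illustration of both ideas in Python 3: a fingerprint (here a SHA-256 hash, chosen for the example) detects identical documents, while shingles (sets of k consecutive words) give a similarity score for near-duplicates. The function names, the choice of k = 3, and the use of the Jaccard coefficient as the score are assumptions of this sketch.

```python
import hashlib

def fingerprint(text):
    # Identical documents give identical fingerprints.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    # The set of all runs of k consecutive words in the text.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    # Jaccard coefficient: shared shingles divided by all shingles.
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "meet me at the coffee shop today"
doc2 = "meet me at the coffee shop tomorrow"
same = fingerprint(doc1) == fingerprint(doc2)
score = similarity(doc1, doc2)
print(same, score)
```

The two documents differ in one word, so their fingerprints differ, but they share four of six distinct shingles.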
Distributed crawler
• For big crawls,
– Many processes, each doing part of the job
• Possibly on different nodes
• Geographically distributed
– How to distribute
• Give each node a set of hosts to crawl
• Use a hashing function to partition the set of
hosts
– How do these nodes communicate?
• Need to have a common index
Ref: Manning Introduction to Information Retrieval
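Partitioning hosts with a hash function might look like the Python 3 sketch below. It hashes the host name rather than the whole URL, so every page of a site is assigned to the same node. The function name node_for and the choice of SHA-256 are assumptions; a stable hash is used because Python's built-in hash() of a string can differ from one run to the next.

```python
import hashlib
from urllib.parse import urlparse

def node_for(url, num_nodes):
    # Hash the host name, not the whole URL, so all pages of one site
    # land on the same crawler node.
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

urls = ["http://www.example.com/a", "http://www.example.com/b",
        "http://www.example.org/"]
for u in urls:
    print(u, "-> node", node_for(u, 4))
```

Keeping a whole host on one node also means the politeness delay for that host can be tracked locally.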
Communication between nodes
The output of the URL filter at each node is sent to the
Duplicate URL Eliminator at all nodes
[Figure: the basic crawl architecture with a Host splitter added between the URL filter and Dup URL elim. The host splitter sends URLs to other hosts, and Dup URL elim also receives URLs from other hosts.]
Ref: Manning Introduction to Information Retrieval
URL Frontier
• Two requirements
– Politeness: do not go too often to the same
site
– Freshness: keep pages up to date
• News sites, for example, change frequently
• Conflicts – The two requirements may be
directly in conflict with each other.
• Complication
– Fetching URLs embedded in a page will yield
many URLs located on the same server. Delay
fetching those.
Ref: Manning Introduction to Information Retrieval
More …
• We will examine these things more
completely. What will you actually do?
• Goal
– Write a simple crawler
• Not distributed, not multi-threaded
• Use a seed URL, connect with the server, fetch the
document, extract links, extract content
– Explore existing crawlers
• Evaluate their characteristics
• Learn to use one to do serious crawling
– Process the documents fetched to serve some
purpose. Create a web site for that purpose.
Ref: Manning Introduction to Information Retrieval
Processing the documents
• Create an index and store the
documents and the index so that
appropriate content can be found when
needed.
• Learn the fundamentals of information
retrieval as they apply to web services
A language suggestion
• Using the right language is often the key
to making a task reasonable, easy, or
very difficult
• There are languages designed and
optimized for text manipulation. Perl
and Python are examples.
• We will spend a bit of time learning the
fundamentals of Python. You may use
whatever language you wish for your
programming.
Introducing Python
• "Python is an open-source object-oriented
programming language that offers two to ten fold
programmer productivity increases over languages
like C, C++, Java, C#, Visual Basic (VB), and Perl."
– (http://pythoncard.sourceforge.net/what_is_python.html)
• See also: “Why Python” by Eric Raymond at
– http://www.linuxjournal.com/article/3882
• Interpreted language
• Widely used (including by Google)
– See http://googlestyleguide.googlecode.com/svn/trunk/pyguide.html
• for the Google Python Style Guide if interested
Starting Python
• See
– http://docs.python.org/tutorial/introduction.html
• Python is probably on your computer. If not,
please download it and install. Everything you
need is at http://python.org
• Python includes libraries and tools that will be
very useful for writing a web crawler.
Python - 1
$ python
Python 2.6.1 (r261:67515, Jun 24 2010,
21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or
"license" for more information.
>>> print "Hello world!"
Hello world!
>>>
Useful Python elements
• Sequences
– Lists, Tuples, Strings
• Numbers and numeric operations
• Control structures
• Useful modules for Web Application
development
Class: list
• Ordered collection of elements, mutable
• list() creates an empty list
• movies = list() makes an empty list with the name (identifier)
movies
• What can we do to a list? Some examples:
– append(x) – add an item, x, to the end of the list
– extend(L) – extend the list by appending all the items of list L
– insert(i,x) – insert item x at position i of the list
– remove(x) – remove the first item in the list with value = x (error if
none exist)
– pop(i) – return the item at location i (and remove it from the list). If
no index (i) is provided, remove and return the last item of the
list.
– index(x) – return the index value for the first occurrence of x
– count(x) – return the number of occurrences of x in the list
– sort() – sort the list, in place
– reverse() – reverse the order of the elements in the list
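A short run through these methods (Python 3 syntax; the movie titles are just sample data):

```python
movies = list()                        # empty list
movies.append("Alien")
movies.extend(["Brazil", "Casablanca"])
movies.insert(1, "Amadeus")            # Alien, Amadeus, Brazil, Casablanca
movies.append("Alien")                 # a second "Alien" at the end
first_alien = movies.index("Alien")    # position of the first "Alien"
alien_count = movies.count("Alien")    # how many times "Alien" occurs
movies.remove("Alien")                 # removes only the first "Alien"
last = movies.pop()                    # removes and returns the last item
movies.sort()                          # sorts in place
movies.reverse()                       # reverses in place
print(movies, last, first_alien, alien_count)
```

Note that sort() and reverse() change the list itself and return None, so writing movies = movies.sort() would lose the list.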
New lists
• from old lists
– places[0:3]
– places[1:4:2]
– places + otherplaces
• note places + "pub" vs places + ['pub']
– places * 2
• Creating a list
– range(5,100,25) -- how many entries?
Immutable objects
• Lists are mutable.
– Operations that can change a list –
• Name some –
• Two important types of objects are not mutable: str
and tuple
– tuple is like a list, but is not mutable
• A fixed sequence of arbitrary objects
• Defined with () instead of []
– grades = ("A", "A-", "B+", "B", "B-", "C+", "C")
– str (string) is a fixed sequence of characters
• Operations on lists that do not change the list can
be applied to tuple and to str also
• Operations that make changes must create a new
copy of the structure to hold the changed version
Strings
• Strings are specified using quotes –
single or double
– name1 = "Ella Lane"
– name2 = 'Tom Riley'
• If the string contains a quotation mark, it
must be distinct from the marks
denoting the string, or be escaped with a backslash:
– part1 = "Ella's toy"
– part2 = 'Tom\'s plane'
Methods
• In general, methods that do not change
the list are available to use with str and
tuple
• String methods
>>> message = "Meet me at the coffee shop. OK?"
>>> message.lower()
'meet me at the coffee shop. ok?'
>>> message.upper()
'MEET ME AT THE COFFEE SHOP. OK?'
Immutable, but…
• It is possible to create a new string with
the same name as a previous string.
This leaves the previous string without a
label.
>>> note="walk today"
>>> note
'walk today'
>>> note = "go shopping"
>>> note
'go shopping'
The original string is still
there, but cannot be
accessed because it no
longer has a label
Strings and Lists of Strings
• Extract individual words from a string
>>> words = message.split()
>>> words
['Meet', 'me', 'at', 'the', 'coffee', 'shop.', 'OK?']
Note that there are no
spaces in the words in
the list. The spaces
were used to separate
the words and are
dropped.
• OK to split on any token
>>> terms = "12098,scheduling,of,real,time,10,21,,real time,"
>>> terms
'12098,scheduling,of,real,time,10,21,,real time,'
>>> termslist=terms.split()
>>> termslist
['12098,scheduling,of,real,time,10,21,,real', 'time,']
>>> termslist=terms.split(',')
>>> termslist
['12098', 'scheduling', 'of', 'real', 'time', '10', '21', '', 'real time', '']
• Join the words in a list to form
a new string
words=['Meet','me','at','the','coffee','shop.','OK?']
wordstring = ""
for word in words:
    wordstring += word
print 'Words concatenated:', wordstring
print 'After using join: ',
wordstring = ' '.join(words)
print wordstring
Output:
Words concatenated: Meetmeatthecoffeeshop.OK?
After using join: Meet me at the coffee shop. OK?
String Methods
• Methods for strings, not lists:
– terms.isalpha()
– terms.isdigit()
– terms.isspace()
– terms.islower()
– terms.isupper()
– message.lower()
– message.upper()
– message.capitalize()
– message.center(80) (center in 80 places)
– message.ljust(80) (left justify in 80 places)
– message.rjust(80) (right justify in 80 places)
– message.strip() (remove left and right white spaces)
– message.strip(chars) (returns string with left and/or right chars
removed)
– startnote.replace("Please m","M")
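A few of these methods in action (Python 3 syntax; the width of 6 is chosen only to keep the output short). Each call returns a new string and leaves the original unchanged.

```python
message = "  Meet me at the coffee shop. OK?  "
trimmed = message.strip()        # whitespace removed from both ends
no_mark = trimmed.strip("?")     # strips the listed characters from the ends
centered = "hi".center(6)        # padded with spaces on both sides
left = "hi".ljust(6)             # padded on the right
right = "hi".rjust(6)            # padded on the left
startnote = "Please meet me"
shorter = startnote.replace("Please m", "M")

print(repr(trimmed))
print(repr(no_mark))
print(repr(centered), repr(left), repr(right))
print(shorter)
print("12098".isdigit(), "real time".isalpha())
```

The last line shows that isalpha() is false for "real time" because the space is not a letter.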
Spot check
• With a partner, do
– Create a list of at least five items
– Sort the list
– Print out the list in reverse order
– How few lines do you need?
Numeric types
• int – whole numbers, no decimal places
• float – decimal numbers, with decimal
place
• long – arbitrarily long ints. Python does
conversion when needed
• operations between same types gives result
of that type
• operations between int and float yields
float
>>> 3/2
1
>>> 3.//2.
1.0
>>> 3./2.
1.5
>>> 18%4
2
>>> 3/2.
1.5
>>> 18//4
4
Numeric operators
book slide
Casting
Convert from one type to another
>>> str(3.14159)
'3.14159'
>>> int(3.14159)
3
>>> round(3.14159)
3.0
>>> round(3.5)
4.0
>>> round(3.499999999999)
3.0
>>> num=3.789
>>> num
3.7890000000000001
>>> str(num)
'3.789'
>>> str(num+4)
'7.789'
>>> list(num)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'float' object is not iterable
>>> list(str(num))
['3', '.', '7', '8', '9']
>>> tuple(str(num))
('3', '.', '7', '8', '9')
Functions
• We have seen some of these before
book slide
Modules
• Collections of things that are very handy to have, but not as
universally needed as the built-in functions.
>>> from math import pi
>>> pi
3.1415926535897931
>>> import math
>>> math.sqrt(32)*10
56.568542494923804
>>>
• We will use modules specific to web application
development
• Once imported, use help(<module>) for full documentation
Common modules
book slide
Expressions
• Several part operations, including
operators and/or function calls
• Order of operations same as arithmetic
– Function evaluation
– Parentheses
– Exponentiation (right to left)
– Multiplication and Division (left to right)
– Addition and Subtraction (left to right)
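A few one-line checks of these rules (Python 3 syntax; the numbers are arbitrary):

```python
a = 2 + 3 * 4       # multiplication before addition
b = (2 + 3) * 4     # parentheses first
c = 2 ** 3 ** 2     # exponentiation groups right to left: 2 ** (3 ** 2)
d = 10 - 4 - 3      # subtraction groups left to right: (10 - 4) - 3
print(a, b, c, d)
```

When in doubt, adding parentheses makes the intended order explicit and costs nothing.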
book slide
Boolean
Values are False or True
X      Y      not X  X and Y  X or Y  X == Y  X != Y
False  False  True   False    False   True    False
False  True   True   False    True    False   True
True   False  False  False    True    False   True
True   True   False  True     True    True    False
Source code in file
• Avoid retyping each command each time
you run the program. Essential for nontrivial programs.
• Allows exactly the same program to be run
repeatedly -- still interpreted, but no
accidental changes
• Use print statement to output to display
• File has .py extension
• Run by typing python <filename>.py
python termread.py
Basic I/O
• print
– list of items separated by commas
– automatic newline at end
– forced newline: the character '\n'
• raw_input(<string prompt>)
– input from the keyboard
– input comes as a string. Cast it to make it into
some other type
• input(<prompt>)
– the typed text is evaluated as a Python expression, so a
typed number arrives as a numeric value, int or float
Case Study – Date conversion
months = ('Jan', 'Feb', 'Mar', 'Apr', 'May',
'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
date = raw_input('Enter date (mm-dd-yyyy) ')
pieces = date.split('-')
monthVal = months[int(pieces[0])-1]
print monthVal + ' ' + pieces[1] +', ' +pieces[2]
Try it – run it on your machine with a few dates
Spot check
• Again, split the class. Work in pairs
– First do this:
• Prompt for a name, first name first
• Print out the name as
– lastname, firstname
– Then do this:
• Find the number of occurrences of the string
named pattern in a string named statement
• (prompt for the string and the pattern, then do
the count and output the result)
Control structures
• Conditionals, repetition
• IF:
if <condition>:
    <instruction(s) to execute>
if <condition>:
    <instruction(s) to execute if true>
else:
    <instruction(s) to execute if false>
Nested if
Consider this example code:
values = range(27,150,4)
print values
x = int(raw_input("Enter x:"))   # raw_input returns a string; convert before comparing
if x in values:
    print "x in value set", x
else:
    if x > 50:
        print "new x greater than 50"
    else:
        print "new x not greater than 50"
Note that the required
indentation makes
python code very
readable
Shortened nested if
values = range(27,150,4)
print values
strx = raw_input("Enter x:")
x = int(strx)
if x in values:
    print "x in value set", x
elif x > 50:
    print "new x greater than 50"
else:
    print "new x not greater than 50"
Spot check
• Repeat the date example, but add a
check for valid entries
for
• iterative loop
• A loop variable takes on each value in a
specified sequence, executes the body
of the code with the current value,
repeats for each value.
for <variable> in <sequence>:
    <block of code to execute>
• Sequence may be a list, a range, a tuple,
a string
Iterating through a string
teststring = "When in the course of human events, it becomes necessary ..."
countc = 0
for c in teststring:
    if c == "a":
        countc += 1
print "Number of a's in the string: ", countc
Stepping through a list
cousins=["Mike", "Carol", "Frank", "Ann", "Jim", "Pat", "Franny",\
"Elizabeth", "Richard", "Sue"]
steps = range(len(cousins))
for step in steps:
    print cousins[step]+", ",
Output:
Mike, Carol, Frank, Ann, Jim, Pat, Franny, Elizabeth, Richard, Sue,
Exercise:
Get rid of that last comma.
Stepping through the list again
cousins=["Mike", "Carol", "Frank", "Ann", "Jim", "Pat", "Franny",\
"Elizabeth", "Richard", "Sue"]
for step in range(len(cousins)):
    print cousins[step]+", ",
• Output
Mike, Carol, Frank, Ann, Jim, Pat, Franny, Elizabeth, Richard, Sue,
While loop
while <condition>:
    body
Note that there is a required : after the
first line, but no punctuation after lines
in the body. The indentation shows
what belongs to the body.
Files
• Built-in file class
– Two ways to input a line from the file:
• line=source.readline()
• for line in source:
where line and source are local names.
Note – no explicit read in the for loop
filename=raw_input('File to read: ')
source = file(filename) #Access is read-only
for line in source:
    print line
Another iteration over file
a = open("numgone.txt", "r")
line = a.readline()
while line:
    print line[0]
    line = a.readline()   # Note that the content of line changes
                          # here, resetting the loop
a.close()
File I/O
• File object can be created with open() built-in
function
• File methods: (Selected)
– file.close()
– file.flush()
– file.read([size]) -- read at most size bytes from the file.
If size omitted, read to end of file
– file.readline([size]) – read one line of the file. Optional
size determines maximum number of bytes returned
and may produce an incomplete line.
– file.readlines() – read to EOF, returning a list of the file
lines
– file.write(str) – write str to file. May need flush or close
to complete the file write
– file.writelines(sequence) – writes a sequence, usually a list
of strings
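The following sketch (Python 3 syntax) writes a small file and reads it back using several of the methods listed. The file name notes.txt and the temporary directory are choices made for the example so that it does not touch any existing file.

```python
import os
import tempfile

# Scratch file in a fresh temporary directory, just for the demo.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

out = open(path, "w")
out.write("first line\n")
out.writelines(["second line\n", "third line\n"])   # no newlines are added for you
out.close()                      # close() flushes buffered output to disk

src = open(path)                 # default mode is read-only
first = src.readline()           # one line, including its newline
rest = src.readlines()           # the remaining lines, as a list
src.close()

print(repr(first))
print(rest)
```

Note that writelines() writes the strings exactly as given, so each one needs its own '\n' if it should end a line.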
Basics of HTML
• Web pages are coded with HTML
• Each browser has its own way of rendering the page’s HTML, so
the display can differ from browser to browser.
• Essentials – tags
<something> ….. </something>
• Essentials – links
<a href="url of the resource"> text to highlight </a>
• Tags, including link tags, may have other
parameters. All will start with < and end with >
• Sometimes, the closing tag may be missing
– <p> not always followed by a </p>
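As a preview of how a crawler uses these basics, the Python 3 standard library's HTMLParser can pull the href values out of link tags. The LinkExtractor class and the sample page below are made up for the illustration; note that the sample page leaves its <p> tags unclosed, as real pages often do, and the parser still copes.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = """<html><body>
<p>Course pages
<p>See the <a href="syllabus.html">syllabus</a> and
<a href="http://www.example.org/">a related site</a>.
</body></html>"""

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)
```

The first link is relative, so a crawler would still need to expand it against the page's own URL before fetching it.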
Next week
• Python code for retrieving web pages
(crawling)
• Demonstration of crawling visualization