CHAPTER
ONE
INTRODUCTION TO PYTHON
This section contains an introduction to Python.
Python has had a lot of good introductory material written for it, and the notes given here are just a bare bones
introduction targeting basic concepts needed in this course. It often pays when working on a programming language
to get several different takes on the material. In the interest of giving students a different perspective, and generally
broadening their knowledge, here are some of the best quick introductions to Python:
1. Older version of Python tutorial, better for beginners. Actually written by Guido van Rossum.
2. Current Python docs tutorial
3. How to think like a programmer
4. Google’s Python class
1.1 Python example
Note: Python and Ipython notebook versions of code (.py .ipynb).
This example illustrates some Python code:
>>> total_secs = 7684
>>> hours = total_secs // 3600
>>> secs_still_remaining = total_secs % 3600
>>> minutes = secs_still_remaining // 60
>>> secs_finally_remaining = secs_still_remaining % 60
>>> print "Hrs =", hours, "mins =", minutes, "secs =", secs_finally_remaining
Hrs = 2 mins = 8 secs = 4
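The same arithmetic can be packaged as a small function. This is only a sketch, written in Python 3 syntax (where print is a function), and the name split_seconds is made up for illustration. It relies on the built-in divmod, which returns the quotient and the remainder in one step.

```python
def split_seconds(total_secs):
    """Split a number of seconds into (hours, minutes, seconds)."""
    # divmod(a, b) returns the pair (a // b, a % b)
    hours, secs_still_remaining = divmod(total_secs, 3600)
    minutes, secs_finally_remaining = divmod(secs_still_remaining, 60)
    return hours, minutes, secs_finally_remaining

hours, minutes, secs = split_seconds(7684)
print("Hrs =", hours, "mins =", minutes, "secs =", secs)
# prints: Hrs = 2 mins = 8 secs = 4
```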
1.2 Your first Python program
Here is some Python code. You should download this code and put it in a file called hello.py. Throughout these
notes you will be given files to download. You should adopt the following practice. Create a directory (folder) in
which you place these files and always start up python from the commandline after connecting to that directory.
The file hello.py is a tiny Python program to check that Python is working. Try running this program from the
command line like this:
gawron$ python hello.py
(the gawron$ is my commandline prompt; the rest is what I typed on my keyboard, followed by <Enter>). The
program should print:
Hello World
Or if you type:
gawron$ python hello.py Alice
it should print:
Hello Alice
Or if you type:
gawron$ python hello.py stupid
it should print:
Hello stupid
If you have a text editor that you feel comfortable with (as discussed in Section text_editor_ide), try editing the
hello.py file, changing the ‘Hello’ to ‘Howdy’, and running it again.
Once you have that working, you’re ready for class – you can edit and run Python code; now you just need to learn
Python!
1.2.1 A little more discussion
The following function is what the file contains:
 1  import sys
 2
 3  def main():
 4      """
 5      Get the name from the command line, using 'World' as a fallback.
 6      """
 7
 8      if len(sys.argv) >= 2:
 9          name = sys.argv[1]
10      else:
11          name = 'World'
12      print 'Hello', name
This block of Python code is a function definition. A function is a program, and the way you define a function in
Python is to write:
def <function_name> ([Arguments]):
followed by a number of lines of indented legal Python code. The arguments are values that are passed into the
function that may affect what it does. In the case of main, there are no arguments, and so the arguments list is simply
().
To see how this function works in Python, let’s run Python interactively, so that we end up seeing the Python
prompt after the program is run. This is done by including a -i on the commandline as follows:
gawron$ python -i hello.py
We then get:
Hello World
>>>
We can run the function main again from inside Python, because the first thing Python did on startup was load up all
the definitions in the file hello.py. Then it executed main. We can tell it to execute main again as follows:
>>> main()
Hello World
And it still works. In general in this course, we’ll be executing Python code interactively by typing directly to the
Python prompt.
This simple function contains a lot of features of Python which we will be discussing in more detail in the next few
sections. Just as a small preview, we discuss some of them now:
1. Line 1 is an import statement. Python comes with a small amount of built in functionality, but most of what
you do when you program Python requires importing other files defining new functionality. These helper files
are called modules, and any Python distribution, even the bare bones standard distribution, comes with a large
number of them. The core set of modules that comes with a standard Python distribution is called the Python
standard library. The sys module imported in line 1 is one of these. It provides access to some variables
containing information gathered when python started up, including information about the machine Python is
running on and how Python was started. One of these is the variable sys.argv used in line 8.
2. The variable sys.argv is a list containing any arguments you provided when you called Python from the
commandline, including the name of the program you asked Python to run. So if you typed:
python hello.py
then sys.argv looks like this:
>>> sys.argv
['hello.py']
if you typed:
python hello.py Alice
then sys.argv looks like this:
>>> sys.argv
['hello.py', 'Alice']
3. Line 8 of the program is a test to see if the sys.argv list is long enough to contain a name; its length is 1
if no name was supplied on the commandline and at least 2 if a name was supplied. If the test is
passed (the length is 2 or more), then the variable name is set to the name supplied on the commandline (line 9). This is
because sys.argv[0] gives us the first thing on the list and sys.argv[1] gives us the second. Otherwise
(else), the name is set to be ‘World’ (line 11).
4. In line 12, the print command is called. This prints something to the screen so the user can see it. In this case,
the print command prints “Hello” followed by a space (signalled by the comma) followed by the value of
name.
Each of these points illustrates an important feature of Python. We’ll see them again in the upcoming sections.
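To experiment with the logic of lines 8 to 12 without going back to the commandline, one option is to pass the argument list in explicitly. The sketch below assumes Python 3 syntax; the function name greeting and its argv parameter are made up for illustration and are not part of hello.py.

```python
def greeting(argv):
    """Build the greeting from an argv-style list, using 'World' as a fallback."""
    if len(argv) >= 2:       # a name was supplied after the program name
        name = argv[1]
    else:                    # only the program name is present
        name = 'World'
    return 'Hello ' + name

print(greeting(['hello.py']))            # prints: Hello World
print(greeting(['hello.py', 'Alice']))   # prints: Hello Alice
```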
1.3 Python types
In this section, we discuss two basic kinds of objects in Python, numbers and strings. There are lots of other kinds of
objects in Python, but these are the two most important for the kinds of problems discussed in this course.
In addition, they provide a good starting point for understanding some of the other Python types.
1.3.1 Numbers
First, as our first Python session showed, there are numbers:
>>> X = 3
Python actually has several different number types. In many simple scripts, Python programmers do not actually have
to think about the different kinds of numbers (this is not true in every programming language!). Nevertheless, it is
helpful to understand the basic concept, and since we are going to have to understand how different data types work,
it helps to understand how the simplest kinds of type distinctions work, and some of the motivations behind them.
Figure Python number types shows the Python type tree for numbers.
Figure 1.1: Python number types
Let’s start with the distinction between integers and floats. For most purposes, you can simply think of this as a
distinction between the kinds of values you want to represent. For values that are exactly equal to integers (..., -2, -1,
0, 1, 2, ...), you use integers (Python type name int); for values that come in between, you use floats:
>>> type(1)
<type 'int'>
>>> type(1.2)
<type 'float'>
>>> X = 1
>>> type(X)
<type 'int'>
>>> X = 1.2
>>> type(X)
<type 'float'>
Now the real question is why bother to have this distinction at all? Why not just have a number type and leave it at
that? The answer in part is space. It takes a lot of information to represent values between 1 and 2 exactly. In fact, for
many values that come up in mathematics (the value of π, for example), it would necessarily take an infinite amount
of space. In a decimal representation, fractions like 1/3 are infinitely repeating decimals, and would also take an infinite
amount of space to represent exactly. Since numbers are represented as binary fractions in computer memory, a different set
of fractions comes out as infinitely repeating in computer memory (.1, for example) [1].
So what we do instead is set aside a standard amount of space for each floating point number we want to use, in fact
quite a lot of space, to allow for satisfactory precision in extended calculations. On the other hand, sometimes we
don’t want to use numbers for extended mathematical calculations of arbitrary precision. Sometimes we just want to
use them for counting. So when I use a particular variable to store the number of times I see the word ricochet, I know
that no matter how much data I’ve got, the number of times the word occurs can still be represented by an integer. So
for storing an integer we set aside another smaller amount of space, and just as there are floats I can’t represent in the
given amount of space, so there are also integers (big ones) I can’t represent in the agreed-upon amount of space. Now
if I really need more space, there is another BIGGER data type I can use for REALLY big integers (say I am counting
subatomic events), called a long (or long integer), and that too has its limits. When the absolute value of numbers gets
too big to represent in the amount of memory available, that’s called overflow.
[1] For an excellent discussion of floats in Python, see the Python tutorial page.
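A short experiment makes the point about binary fractions concrete. This is a sketch in Python 3 syntax; it uses the standard library modules decimal and fractions, which are not otherwise discussed in these notes.

```python
from decimal import Decimal
from fractions import Fraction

# 0.1, 0.2 and 0.3 are each stored as the nearest binary fraction,
# so the tiny rounding errors do not cancel out exactly.
print(0.1 + 0.2 == 0.3)      # prints: False

# Decimal(0.1) shows the value that is actually stored for 0.1.
print(Decimal(0.1))

# A Fraction keeps numerator and denominator as integers, so 1/3 is exact.
third = Fraction(1, 3)
print(third * 3 == 1)        # prints: True
```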
Finally, there is a distinct number type for complex numbers, which are really numbers with two number components:
>>> X = 3j+2
>>> type(X)
<type ’complex’>
These come up less often in social science settings, so we’ll pass over them quickly.
In sum, each of the number data types has its specific purposes, and its specific limits. In general each number type
has its maximum and minimum value; floats have maximum precision values, which means, for example, that certain
numbers are too close to 0 to represent. This problem is called underflow.
Most of these facts aren’t very important in social science computing, but it is important to understand that there ARE
different number types, and that they exist for very good reasons. As the domain of social science computing expands,
these kinds of distinctions become important to understand.
For example, since the advent of successful speech recognition systems in the 1980s, the branch of linguistics devoted
to computer processing of language has undergone a massive expansion and influx of new ideas. Statistical modeling
has become much more important. As a result computing the probabilities of very rare linguistic events has become
a practical necessity; in such computations, underflow problems often arise, and computational linguists have learned
how to write programs that deal with them.
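One common way of dealing with underflow, sketched here under the assumption that we only need to compare or accumulate probabilities, is to add logarithms instead of multiplying the probabilities themselves. The example is in Python 3 syntax and the probabilities are made up.

```python
import math

probs = [1e-5] * 100          # 100 hypothetical rare-event probabilities

# Multiplying directly: the running product becomes too close to 0 to represent.
product = 1.0
for p in probs:
    product *= p

# Adding logarithms: the running total stays comfortably within float range.
log_total = sum(math.log(p) for p in probs)

print(product)                # prints: 0.0
print(log_total)              # roughly -1151.29
```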
1.3.2 Strings
The other basic data type is strings:
>>> X = ’frog’
>>> type(X)
<type ’str’>
When we type in a word with quotes to the Python prompt, or when we write a program that reads in a file of ordinary
English text, generally the data type you get is strings. Unless you tell Python otherwise, the data type you get by
reading in a file is strings.
Much of this course will be concerned with dealing with string data, since a lot of data of interest to social scientists is
in string form.
The important thing to remember about strings is that when you want to explicitly reference a string value, you need
quotes, as in the example above with ‘frog’. Leaving out the quotes is an error:
>>> X = frog
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name ’frog’ is not defined
Python interprets this as a reference to a variable. The variable frog might refer to anything, an integer, a float, a file;
Python doesn’t know, and reports an error.
Python allows any character to occur in a string, including punctuation marks and spaces. So the following is fine:
>>> X = ’The big dog laughed.’
But how about the quotation mark character? Can that occur in a string? The answer is that it can, but you have to
wrap the string in a distinct kind of quotation mark. So both of the following are fine:
>>> X = "The big dog laughed and said, 'Hello, Jeremy.'"
>>> Y = 'The big dog laughed and said, "Hello, Jeremy."'
>>> X == Y
False
The convention is that the string expression has to start and end with the same kind of quotation mark. Any quotation
marks inside have to be different and are considered part of the strings being referred to, so X and Y differ in that X
contains two instances of ' and Y contains two instances of ". The quotation marks at the beginning and end of the
string are not considered part of the string; they are just delimiters, like parentheses in arithmetic, telling you where
the first and last character of the string are. So contrast the above examples with the following:
>>> X = "The big dog laughed."
>>> Y = 'The big dog laughed.'
>>> X == Y
True
Which quotation character you use as your delimiter doesn’t matter (as long as there are no quotation characters inside
the string).
Generally speaking, strings of more than one line require some special provisions. They should be begun and ended
with triple quotes:
>>> X = """
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
"""
>>> print X
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Note that the spaces included at the beginning of each line are part of the string. Such multiline strings serve an
important purpose in Python, since they are used for documentation.
Strings can also include special characters such as tabs. To place a tab in a string use the special \t symbol; to place
a line break in a string use the special \n symbol. Thus, to place a tab between ‘x’ and ‘y’, we write:
>>> Z = 'x\ty'
>>> print Z
x       y
And since \n produces a line break, the string X defined above, giving four lines of the Zen of Python, can also be
defined:
>>> X = "\nBeautiful is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.\n"
>>> print X
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Generally speaking there is little need for multiline strings with explicit \n, except for strings assembled from pieces
by a program. The triple-quoted form is preferred because it is more readable.
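As an illustration of a string assembled from pieces by a program, the sketch below (Python 3 syntax) glues a list of lines together with \n and checks that the result is the same string as the triple-quoted form. The variable names are made up, and the lines are written without leading spaces.

```python
lines = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
]

# "\n".join(lines) puts one line break between each pair of neighbouring lines.
assembled = "\n" + "\n".join(lines) + "\n"

triple_quoted = """
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
"""

print(assembled == triple_quoted)   # prints: True
```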
1.4 More Python types
We’ve learned about numbers and strings, but the world does not consist of numbers and strings alone, not even in
a programming language. First of all there is the world outside the program. There are different types for files (or,
more precisely, for the streams by which we communicate with them) and for the endpoints of our connections to the
outside world (for instance, in talking to another host on the internet or to a printer). But even before we get to the
outside world, there is the basic need to have data that is structured, in a way we now try to make clear.
Sometimes we need data that contains other data. In Python, this kind of datum is called a container.
Figure Python container type tree shows the basic Python container types. All serve different needs.
Figure 1.2: Python container type tree
One feature that all containers share is they support the in operator:
>>> x = ’abcde’
>>> ’c’ in x
True
So containers contain things and x in y checks to see if x is contained in y, and in this case, the character ‘c’ is contained
in the string ‘abcde’, so the answer is True. The tree above contains one non-container which is an important special
case, file-like objects. These are IO streams for communicating with files or with the user interactively. Technically,
file-like objects are not containers, but in many ways you can think of them that way (what they “contain” is lines).
They even support the in operator, though it works in a different way (see Files and file IO streams).
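The in operator can be tried on each of the container types introduced in the rest of this section. The following is a sketch in Python 3 syntax with made-up data; note that for strings in also matches substrings, and for dictionaries it checks the keys rather than the values.

```python
word = 'abcde'
numbers = [24, 3.14, 7]
pair = ('x', 'y')
ages = {'Alice': 31, 'Bob': 27}   # made-up data

print('c' in word)        # prints: True
print('cd' in word)       # prints: True  (strings also match substrings)
print(7 in numbers)       # prints: True
print('z' in pair)        # prints: False
print('Alice' in ages)    # prints: True  (dictionaries check their keys)
print(31 in ages)         # prints: False (values are not searched)
```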
Most important Python supertypes
• Mutable types
• Containers
• Iterables
For the authoritative definitions of each of these, see Python.org glossary. The term container is missing from this
glossary, though it is clearly a design concept in Python. For a clear discussion (for the geeky) of what a container
really is, see the documentation for the Python collections module.
Basic container types
1.4.1 Lists
Lists are sequences of things. What kinds of things? Well, anything that can be a thing in Python. That is one key fact
about lists. A list can be a higgledy-piggledy assortment. One list can contain a number, a string, and another list. You
should certainly use lists when you want to remember the order in which you saw a sequence of things. But there are
other reasons to use lists. They are in some ways the default container type.
Python uses square brackets to enclose lists and commas to separate the elements, so X, Y, and Z are all valid lists:
>>> X = [24, 3.14, 'w', [1,'a']] # List with 4 items
>>> Y = [100] # list with one item
>>> Z = [] # empty list
Note in particular that one of the elements of X is itself a list. More on lists containing lists below. The name list is
special in Python, because it refers to the type list:
>>> list
<type ’list’>
Creating lists
You can use the type name as a function to create lists. So consider the following:
>>> L = list(’hi!’)
Python interprets this as a request to make a list that contains the same things as the string hi!; that string contains 3
characters, so Python makes L be a list containing three characters:
>>> L
[’h’, ’i’, ’!’]
Use of the type as a function that creates instances of the type is a standard practice in Python, so the list function
can be fed any sequence as an argument, and it returns a list containing the same elements. A special case is calling
list with no arguments at all:
>>> M = list()
This returns a special list called the empty list, which is of length 0, and contains no elements at all:
>>> M
[]
This may seem useless, but it is a great convenience when programming that the result be something well defined
and legal when all the elements have been removed from a list.
Indexing lists, list slices
A list is indexed by numbers; items are referred to by their integer indices:
>>> X[0] # 1st element
24
>>> X[1] # 2nd element
3.14
>>> X[-1] # last element
[1, 'a']
Thus, indexing starts with 0. This means the highest index that retrieves anything is 1 less than the length of the list.
List X has length 4, so the following raises an IndexError:
>>> X[4] # Raises exception!
...
IndexError: list index out of range
Python also provides easy access to subsequences of a list. The following examples illustrate how to make such
references:
>>> X[0:2] # list of 1st and 2nd elements
[24, 3.14]
>>> X[:-1] # list excluding last element
[24, 3.14, 'w']
References to subsequences of a list are called slices.
Lists can be concatenated into longer lists:
>>> X + Y # Concatenation!
[24, 3.14, ’w’, [1,’a’], 100]
Lists allow value assignments, which change the value of a reference in place (in place assignment):
>>> X[2] = 5
>>> X
[24, 3.14, 5, [1,’a’]]
>>> X[0:2]
[24, 3.14]
>>> X[0:2] = [1,3]
>>> X
[1, 3, 5, [1,’a’]]
Only iterable values, such as lists, can be assigned to slices:
>>> X[0:2] = 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only assign an iterable
The error message here, “TypeError: can only assign an iterable”, refers to a general class containing containers called
iterables, the root of the tree in the section More Python types. We’ll talk more about iterables later. The
important point for now is that iterables include all containers, including lists, but not integers, so:
>>> X[0:2] = [1]
works, but
>>> X[0:2] = 1
does not. A list slice must be filled in by a list; any iterable can be easily turned into a list, but 1 cannot.
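The behaviour of slice assignment can be checked with a few lines. This sketch uses Python 3 syntax, starts from the value X had after the first in-place assignment above, and catches the error so that the script keeps running.

```python
X = [24, 3.14, 5, [1, 'a']]

X[0:2] = [1]          # a shorter list is fine: the slice shrinks
print(X)              # prints: [1, 5, [1, 'a']]

X[0:1] = 'hi'         # a string is iterable too: one element per character
print(X)              # prints: ['h', 'i', 5, [1, 'a']]

try:
    X[0:2] = 1        # an integer is not iterable
except TypeError as err:
    print('TypeError:', err)
```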
Learning how to read and interpret Python errors is an important part of being able to write simple programs in Python.
Properly understood, Python’s error-reporting will make it easier for you to fix errors.
Lists containing lists
We introduced the list structure by saying a list could contain anything. That includes a list. Lists of lists are useful
for many purposes. One of the most intuitive is to represent a table. Suppose we want to represent the following table
from some dataset in Python.
42      3.14    7
2       4       0
14      0       0
We can do that as follows:
>>> Table = [[42, 3.14,7],[2,4,0],[14,0,0]]
>>> Table[0]
[42, 3.14,7]
This means to retrieve the value 42, we can do:
>>> FirstRow = Table[0]
>>> FirstRow[0]
42
But Python allows any expression which has a list as its value to be followed by [index]. That includes the
expression Table[0]. It is much more convenient and Pythonic to do this in one step:
>>> Table[0][0]
42
So in general, thinking of a table as a list of rows, each of which is a list, we can access the element in row i, column
j, with Table[i][j].
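The row and column view of a list of lists can be sketched as follows, in Python 3 syntax. The list comprehension used to pull out a column has not been introduced in these notes yet; read it as "the first element of each row".

```python
Table = [[42, 3.14, 7], [2, 4, 0], [14, 0, 0]]

print(Table[1][1])                      # row 1, column 1; prints: 4

first_column = [row[0] for row in Table]
print(first_column)                     # prints: [42, 2, 14]

# Every cell can be reached as Table[i][j].
for i, row in enumerate(Table):
    for j, value in enumerate(row):
        assert value == Table[i][j]
```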
List methods
In all the following examples, L is a list.
L.insert(i, x)
Insert an item at a given position. The first argument is the index of the element before which to insert, so
L.insert(0, x) inserts at the front of the list, and L.insert(len(L), x) is equivalent to L.append(x).
L.append(x)
Add an item to the end of the list. Equivalent to L.insert(len(L), x).
L.index(x)
Return the index in the list of the first item whose value is x. It is an error if there is no such item.
L.remove(x)
Remove the first item from the list whose value is x. It is an error if there is no such item.
L.sort()
Sort the items of the list, in place. There is much more to be said about sorting, but the key point for now is that it changes the list.
L.reverse()
Reverse the elements of the list, in place.
L.count(x)
Return the number of times x appears in the list.
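The methods above can be tried out in sequence. This is a sketch in Python 3 syntax; the starting list is made up.

```python
L = ['b', 'c', 'a']

L.insert(0, 'c')      # insert before index 0
L.append('d')         # add at the end
print(L)              # prints: ['c', 'b', 'c', 'a', 'd']

print(L.index('c'))   # prints: 0  (first match only)
print(L.count('c'))   # prints: 2

L.remove('c')         # removes the first 'c' only
L.sort()              # sorts in place
print(L)              # prints: ['a', 'b', 'c', 'd']

L.reverse()           # reverses in place
print(L)              # prints: ['d', 'c', 'b', 'a']
```

Note that sort and reverse change L itself rather than building a new list.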
See Also:
For a nice overview of list operations, see Google’s tutorial. This is geared toward the more advanced student.
For a real programmer’s discussion of Python lists, addressing questions like what’s faster, assignment or insertion, or
what’s faster, insertion at the beginning, or insertion at the end, see Fredrik Lundh’s effbot.org discussion.
1.4.2 Strings
We have already been introduced to strings as a basic data type. Now we take a look at them again from a different point
of view. Strings are containers. This means you can look at their insides and do things like check whether the first
character is capitalized and whether the third character is “e”.
Indexing strings, string slices
To get at the inner components of strings Python uses the same syntax and operators as lists. The Pythonic conception
is that both lists and strings belong to a ‘super’ data type, sequences. Sequence types are containers that contain
elements in a particular order, so indexing by number makes sense for all sequences:
>>> X = 'dogs'
>>> X[0]
'd'
>>> X[1]
'o'
>>> X[-1]
's'
The following raises an IndexError, as it would with a 4-element list:
>>> X[4]
...
IndexError: string index out of range
Strings can also be one element long:
>>> Y = ’d’
Note: Unlike C, there is no special type for characters in Python. Characters are just one-element strings.
And they can be empty, just as lists can:
>>> Z = ’’
Python also provides easy access to subsequences of a string, just as it does for lists. The following examples illustrate
how to make such references:
>>> X[0:2] # string of 1st and 2nd characters
’do’
>>> X[:-1] # string excluding last character
’dog’
References to subsequences of a string are called slices.
Guido van Rossum says: “The best way to remember how slices work is to think of the indices as pointing between
characters, with the left edge of the first character numbered 0. Then the right edge of the last character of a string of
n characters has index n”:
 +---+---+---+---+---+
 | H | e | l | p | A |
 +---+---+---+---+---+
 0   1   2   3   4   5
-5  -4  -3  -2  -1
The first row of numbers gives the position of the indices 0...5 in the string; the second row gives the corresponding
negative indices. The slice from i to j consists of all characters between the edges labeled i and j, respectively.
For nonnegative indices, the length of a slice is the difference of the indices, if both are within bounds, e.g., the length
of word[1:3] is 2.
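The edge-numbering picture can be checked directly. In this sketch (Python 3 syntax) word is assumed to hold the string from the diagram.

```python
word = 'HelpA'

print(word[1:3])                      # characters between edges 1 and 3; prints: el
print(len(word[1:3]))                 # prints: 2  (3 - 1)
print(word[-2:])                      # from edge -2 to the end; prints: pA
print(word[:2] + word[2:] == word)    # cutting at any edge loses nothing; prints: True
```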
The built-in function len() returns the length of a string:
>>> s = ’supercalifragilisticexpialidocious’
>>> len(s)
34
Strings can also be concatenated into longer sequences, just as lists can:
>>> X + Y
’dogsd’
Using the name of the type as a function gives us a way of MAKING strings, just as it did with lists:
>>> One = str(1)
One is no longer an int!:
>>> One
’1’
>>> I = int(str(1))
>>> I
1
And as with lists, calling the type with no arguments produces the empty string:
>>> Empty = str()
>>> Empty
’’
There is one thing that can be done with lists that canNOT be done with strings. Assignment of values:
>>> 'spin'[2] = 'a'
...
TypeError: 'str' object does not support item assignment
This can be fixed by avoiding the assignment or making it on a mutable sequence, such as a list, which contains the
relevant information.
See Also:
Section Mutability.
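Both fixes can be sketched in a few lines of Python 3 syntax: either copy the characters into a list, change the list, and join the pieces back together, or build a new string from slices. The variable names are made up.

```python
word = 'spin'

letters = list(word)       # a mutable copy: ['s', 'p', 'i', 'n']
letters[2] = 'a'           # allowed, because lists support item assignment
changed = ''.join(letters)

print(changed)             # prints: span
print(word)                # prints: spin  (the original string is untouched)

also_changed = word[:2] + 'a' + word[3:]   # same result using slices
print(also_changed)        # prints: span
```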
String methods
In all the following examples, S is a string. This is just a sample. See the official Python docs for the complete list of
string methods. Or just type help(str) at the Python prompt!
S.capitalize()
Return a string just like S, except that it is capitalized. If S is already capitalized, the result is identical to
S.
S.count(x)
Return the number of times x appears in the string S.
S.index(x)
Return the index in S of the first substring that is identical to x. It is an error if there is no such item.
S.replace(x,y)
Return a string in which every instance of the substring x in S is replaced with y:
>>> X = ’abracadabra’
>>> X.replace(’dab’,’bad’)
’abracabadra’
>>> X.replace(’a’,’b’)
’bbrbcbdbbrb’
S.title()
Return a string just like S in which all words are capitalized:
>>> ’los anGeles’.title()
’Los Angeles’
S.istitle()
Return True if every word in S is capitalized. Otherwise, return False:
>>> ’los anGeles’.istitle()
False
>>> ’Los AnGeles’.istitle()
False
>>> ’Los Angeles’.istitle()
True
1.4.3 Tuples
Syntax
Python uses commas to create sequences of elements called tuples. The result is more readable if the tuple elements are
enclosed in parentheses:
>>> X = (24, 3.14, 'w', 7) # Tuple with 4 items
>>> Y = (100,) # tuple with one item
>>> Z = () # empty tuple
The following is not a tuple:
>>> Q = (7)
>>> Q == 7
True
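The difference between (7) and (7,) is easy to confirm. This sketch uses Python 3 syntax; the names R and S are made up for illustration.

```python
Q = (7)        # just the integer 7 in parentheses
R = (7,)       # the trailing comma makes a one-element tuple
S = 7,         # the parentheses are optional; the comma is what matters

print(type(Q).__name__)   # prints: int
print(type(R).__name__)   # prints: tuple
print(R == S)             # prints: True
print(len(R))             # prints: 1
```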
Making tuples
The tuple constructor function is the name of the type:
>>> L = tuple('hi!') # make a string a tuple
>>> L
('h', 'i', '!')
>>> M = tuple()
>>> M
()
Tuples are indexed and sliced with square brackets, just as lists and strings are:
>>> X[0] # 1st element
24
>>> X[1] # 2nd element
3.14
>>> X[-1] # last element
7
>>> X[0:2] # tuple of 1st and 2nd elements
(24, 3.14)
>>> X[:-1] # tuple excluding last element
(24, 3.14, 'w')
The following raises an IndexError:
>>> X[4] # Raises exception!
...
IndexError: tuple index out of range
Concatenation of tuples:
>>> X + Y
(24, 3.14, 'w', 7, 100)
1.4.4 Dictionaries
Dictionaries store mappings between many kinds of data types. Suppose we are studying the Enron email network,
keeping track of who emailed who. For each employee, we want to keep a list of who they emailed. This is the kind
of information that would be stored in a dictionary. It would look like this:
>>> enron_network
{'Mike Grigsby': ['Louise Kitchen', 'Scott Neal', 'Kenneth Lay', ... ],
 'Greg Whalley': ['Mike Grigsby', 'Louise Kitchen', ... ],
 ...
}
Each entry represents the email sent by one employee. Thus, the first entry tells us Mike Grigsby sent email to Louise
Kitchen, Scott Neal, Kenneth Lay, etcetera. Each employee is represented by a string that uniquely identifies them
(their names). Thus, this dictionary relates strings to lists of strings (Pythonistas say it maps strings to lists of strings).
The strings before the colons are the keys, and the lists after the colon are the values. Any immutable type can be the
key of a dictionary.
For our purposes we will most often use strings as the ‘keys’ of a dictionary. We will use all kinds of things as values.
>>> X = {’x’:42, ’y’: 3.14, ’z’:7}
>>> Y = {1:2, 3:4}
>>> Z = {}
You can also make a dictionary by calling dict with keyword assignments:
>>> L = dict(x=42,y=3.14,z=7)
The following is the way to do the same with integer keys.
>>> M = dict([[1, 2],[3, 4]])
Values are looked up by key, using square brackets:
>>> X['x']
42
>>> X['y']
3.14
>>> M[1]
2
The following raises a KeyError:
>>> X[’w’]
...
KeyError: ’w’
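When a key may be missing, there are ways to avoid the KeyError. The sketch below (Python 3 syntax) tests for the key with in, or uses the dictionary method get, which returns a default value instead of raising an error.

```python
X = {'x': 42, 'y': 3.14, 'z': 7}

print('w' in X)          # prints: False
print(X.get('w'))        # prints: None
print(X.get('w', 0))     # prints: 0  (a default of our choosing)
print(X.get('x', 0))     # prints: 42 (the default is ignored when the key exists)

try:
    X['w']
except KeyError as err:
    print('KeyError:', err)   # prints: KeyError: 'w'
```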
Dictionaries and lists can be mixed. We saw above that a list can be a value in a dictionary. Dictionaries can also be
elements in lists:
>>> Double = [{’x’:42,’y’: 3.14, ’z’:7},{1:2,3:4},{’w’:14}]
>>> Double[0]
{’y’: 3.1400000000000001, ’x’: 42, ’z’: 7}
This means to retrieve the value 42 for the key x, we can do:
>>> FirstDict = Double[0]
>>> FirstDict['x']
42
But Python allows any expression which has a Dictionary as its value to be followed by [key]. It is much more
convenient and Pythonic to do this in one step:
>>> Double[0][’x’]
42
From list to dictionary
It often happens that we have data in a format convenient for providing one kind of information, but we need it for a
slightly different purpose. Suppose we have the words of English arranged according to frequency rank, from most
frequent to least frequent. Such a list looks like this:
rank_list = [’the’, ’be’, ’of’, ’and’, ’a’, ’in’, ’to’, ’have’, ’it’,
’to’, ’for’, ’I’, ’that’, ’you’, ’he’, ’on’, ’with’, ’do’,
’at’, ’by’, ’not’, ’this’, ’but’, ’from’, ’they’, ’his’, ...]
This is convenient if you want to know things like what the 6th most frequent word of English is:
>>> rank_list[5] # 6th word on list has index 5
'in'
But suppose you want to go in the opposite direction? Given a word, you want to know what its frequency rank is.
Python provides a built-in method for finding the index of any item in a list, so we could equally well do:
>>> rank_list.index(’in’)
5
But this is actually a little inefficient: Python has to search through the list, comparing each item with ‘in’, until it
finds the right one. If Python starts from the beginning of the list each time, that will take quite a lot of comparisons
for low-ranked words.
What this example illustrates is a very basic but extremely important computational concept. You want your information to be indexed in a way convenient for the kind of questions you’re going to ask. A phone book is an extremely
useful format if you know a name and you need a phone number. Similarly, Webster’s Dictionary is wonderful if you
have a word and want to know what the meaning is. In both cases, alphabetically sorting the entries gives us an efficient
way of accessing the information. But the phone book is fairly inconvenient if you have a phone number and want to
know who has it, and Webster’s is quite hard to use if you have a meaning in mind and are hunting for the word to
express it, or even if you have a word, and are looking for its rhyme.
If you have a set of strings, and you want to get particular kinds of information about each, dictionaries are the way
to go: Dictionary look ups use an indexing scheme much like alphabetization to efficiently retrieve values for keys. In
the example of the rank list, if our main interest is in going from words to ranks, what we want to do is convert the list
into a dictionary. That looks like this:
rank_dict = dict()
for (i, word) in enumerate(rank_list):
    rank_dict[word] = (i+1)
Here is what enumerate does to a list:
>>> list(enumerate([’a’,’b’,’c’]))
[(0,’a’),(1,’b’),(2,’c’)]
Essentially, what enumerate does is pair each element in the list with its index, which
is exactly the information we want for our dictionary.
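Here is the conversion run on the first ten words of the rank list, as a sketch in Python 3 syntax. One detail is worth noticing: ‘to’ occurs twice in the list, and a dictionary holds only one value per key, so a plain assignment would leave ‘to’ with its later rank. The extra test with in is an addition made for this sketch; it keeps the first rank instead.

```python
rank_list = ['the', 'be', 'of', 'and', 'a', 'in', 'to', 'have', 'it', 'to']

rank_dict = dict()
for (i, word) in enumerate(rank_list):
    if word not in rank_dict:        # keep the first (best) rank of a repeated word
        rank_dict[word] = i + 1

print(rank_dict['in'])    # prints: 6
print(rank_dict['to'])    # prints: 7  (not 10)
print(len(rank_dict))     # prints: 9  (ten words, one of them repeated)
```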
Combining lists into a dictionary
It is frequently the case that the meaning of a particular data item is defined by its position in a sequence of things. The
i-th thing in that sequence is associated with a particular entity x_i and the i-th thing in some other sequence might also
be associated with x_i. For example, a company database might store information about employees in separate files, but
the i-th line of each file is reserved for information about the employee with id number i.
As a concrete example, suppose we downloaded information about word frequencies in the British National Corpus,
a very large sample of English, likely to give very reliable counts, and it came in two files, with contents that looked
like this:
File 1        File 2
a             2186369
abandon       4249
abbey         1110
ability       10468
able          30454
abnormal      809
abolish       1744
abolition     1154
abortion      1471
about         52561
about         144554
above         2139
above         10719
above         12889
abroad        3941
abruptly      1146
absence       5949
absent        1504
absolute      3489
absolutely    5782
absorb        2684
absorption    932
abstract      1605
That is, the i-th line in File 2 gives the frequency of the i-th English word in File 1. The easiest way to
read this data into Python produces two lists, word_list and freq_list. Now suppose we want to be able to go
conveniently from a word to its frequency. The right thing to do is to make a dictionary. Here is one way to do that,
using code like the code we’ve already seen:
freq_dict = dict()
for (i, word) in enumerate(word_list):
    freq_dict[word] = freq_list[i]
Look at this code and make sure you understand it. We use the enumerate function to give us the index i of each word
in the list as we see it, then associate the word with the i-th frequency in freq_list.
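The notes do not show how the two lists get read in, so here is one possible sketch. The io.StringIO objects are stand-ins for the two downloaded files (the real file names are not given here). With real files you would pass the result of open() instead.

```python
import io

# Stand-ins for File 1 and File 2; the first four rows are copied from above.
file1 = io.StringIO(u"a\nabandon\nabbey\nability\n")
file2 = io.StringIO(u"2186369\n4249\n1110\n10468\n")

# One entry per line. strip() removes the trailing newline from each word;
# int() ignores surrounding whitespace when it converts each count.
word_list = [line.strip() for line in file1]
freq_list = [int(line) for line in file2]

freq_dict = dict()
for (i, word) in enumerate(word_list):
    freq_dict[word] = freq_list[i]

print(freq_dict["abbey"])
```

Converting the counts to integers is a choice made for this sketch. Left as strings, they could not be added or compared numerically later.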
But Python provides a faster, much more Pythonic way to do this. As with list and str, the name of the Python
type dict is also a function for producing dictionaries. In fact, we used that convention to create an empty dictionary
in the code snippet above. But dict can do more than just create empty dictionaries. Given a list of pairs, it produces
a dictionary whose keys are the first members of the pairs and whose values are the corresponding second members:
>>> L = [('a',1), ('b',2), ('c',3)]
>>> dd = dict(L)
>>> dd
{'a':1, 'b':2, 'c':3}
Unfortunately, we have two lists, not one, and neither is a list of pairs. Not to worry. Python also provides an easy
way to create the list we want. The function is called zip, and it takes two lists of the same length and returns a list
of pairs (in Python 3 it returns an iterator, so wrap the call in list() to see the pairs):
>>> L_left = ['a','b','c']
>>> L_right = [1, 2, 3]
>>> zip(L_left, L_right)
[('a',1), ('b',2), ('c',3)]
Thus in this instance zip returns the same list L we saw above.
Given these features, it is quite simple to produce the frequency dictionary we want:
>>> freq_dict = dict(zip(word_list,freq_list))
This is a frequently used Python idiom which will come in handy.
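One thing to watch for when using this idiom: the sample data contains repeated words (about appears twice, above three times), and a dictionary holds only one value per key. The sketch below uses four rows copied from the two files to show what happens. The summing step at the end is only one possible treatment and is an assumption, since the notes do not say how repeated entries should be handled.

```python
# Four rows copied from the sample data, including the repeated "about".
word_list = ["abortion", "about", "about", "above"]
freq_list = [1471, 52561, 144554, 2139]

pairs = list(zip(word_list, freq_list))  # list() is only needed in Python 3
freq_dict = dict(pairs)

print(len(pairs))          # number of pairs that went in
print(len(freq_dict))      # number of distinct keys that came out
print(freq_dict["about"])  # the later pair has replaced the earlier one

# One alternative (an assumption, not the notes' method): add up the counts.
total_dict = dict()
for word, freq in zip(word_list, freq_list):
    total_dict[word] = total_dict.get(word, 0) + freq
print(total_dict["about"])
```

With dict(zip(...)) the last value seen for a key silently wins, so it is worth checking whether the keys are unique before relying on the result.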
A text-based example
Note: This section uses the data module example_string which can be found here. [Add ipython notebook]
We illustrate some code here for computing word counts in a string. This is included here as a very clear example of
when dictionaries should be used: We have a set of strings (words in our vocabulary) and information that we want to
keep about each string (the word’s count). We need to update that information in a loop (see Section Loops), as we
look at each word in the string:
from example_string import example_string

count_dict = dict()

for word in example_string.split():
    if word in count_dict:
        count_dict[word] += 1
    else:
        count_dict[word] = 1
And then we have:
>>> count_dict
{'all': 3,
 'consider': 2,
 'dance': 1,
 'better,': 1,
 'sakes,': 1,
 'pleasant,': 1,
 'four': 3,
 'go': 1,
 ...
 'Lizzy': 2,
 'Jane,': 1,
 ...
 'Kitty,': 2
}
Let’s see what’s going on in the code. [Explain]
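One way to follow the loop is to run it on a string short enough to check by hand. The string below is made up for illustration and stands in for example_string.

```python
example = "to be or not to be"    # made-up stand-in for example_string

count_dict = dict()
for word in example.split():      # split() breaks the string at whitespace
    if word in count_dict:
        count_dict[word] += 1     # seen before: add one to its count
    else:
        count_dict[word] = 1      # first sighting: start its count at 1

print(count_dict["to"])
print(sorted(count_dict.items()))

# split() does not remove punctuation, so "Jane," and "Jane" are different keys.
print("Jane, Jane".split())
```

The membership test is needed because count_dict[word] += 1 raises a KeyError for a word that is not yet in the dictionary. The last line shows why keys such as 'better,' and 'Jane,' appear in the output above with their commas attached.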
Dictionary methods
[some of the most important]
Python collections module
Introduce Counters and defaultdicts.
Run through the above example, simplifying through the use of Counters.
A Counter is a special kind of dictionary defined in the Python collections module:
>>> from collections import Counter
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})

>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
Counters can be initialized with any sequence, and they will count the token occurrences in that sequence. For example:
>>> c = Counter('gallahad')
>>> c
Counter({'a': 3, 'l': 2, 'h': 1, 'g': 1, 'd': 1})
They can also be initialized directly with count information from a dictionary:
>>> c = Counter({'red': 4, 'blue': 2})      # a new counter from a mapping
>>> c = Counter(cats=4, dogs=8)             # a new counter from keyword args
A Counter is a kind of dictionary, but it does not behave entirely like a standard dictionary. The count of a missing
element is 0; no KeyError occurs, as would be the case with a standard Python dictionary:
>>> c = Counter(['eggs', 'ham'])
>>> c['bacon']
0
See the Python docs for more features.
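As a sketch of how these tools simplify the word-count loop from the text-based example, the code below tallies the same words with a Counter and with a defaultdict. The short string is made up for illustration and stands in for example_string.

```python
from collections import Counter, defaultdict

example = "to be or not to be"    # made-up stand-in for example_string

# A Counter does the whole tally in one step.
counts = Counter(example.split())

# defaultdict(int) supplies 0 for a missing key, so no if/else is needed.
dd_counts = defaultdict(int)
for word in example.split():
    dd_counts[word] += 1

print(counts["to"])
print(dict(counts) == dict(dd_counts))
```

Both versions drop the explicit membership test. The Counter version is shorter still, and it brings methods such as most_common along with it.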
Now we said a Counter can be initialized directly with any sequence; the correct term is any iterable, roughly, anything
that can be iterated through with a for loop. But caution must be exercised when initializing counters this way, to
guarantee that the right things are being counted. For example, if what is desired is the word counts, it won’t work to
simply initialize a Counter with a file handle, even though a file handle is an iterable:
>>> from collections import Counter
>>> tr = Counter(open('../anatomy/pride_and_prejudice.txt','r'))
>>> len(tr)
11003
>>> tr.most_common(10)
[('\r\n', 2394), (' * * * * *\r\n', 6),
 ('them."\r\n', 3), ('it.\r\n', 3), ('them.\r\n', 3),
 ('family.\r\n', 2), ('do."\r\n', 2),
 ('between Mr. Darcy and herself.\r\n', 2),
 ('almost no restrictions whatsoever. You may copy it, give it away or\r\n', 2), ('together.\r\n', 2
What happened and why?
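A small experiment may help in working out the answer. The sketch below uses an io.StringIO object as a stand-in for an open text file (the three lines in it are made up), since both hand back one line at a time when iterated over.

```python
import io
from collections import Counter

# Stand-in for an open text file; the contents are made up for illustration.
handle = io.StringIO(u"to be or\nnot to be\nto be or\n")

# Iterating over a file-like object yields whole lines, newline included,
# so this Counter tallies lines rather than words.
line_counts = Counter(handle)
print(line_counts.most_common(1))

# To count words, split each line before handing the pieces to the Counter.
handle.seek(0)                    # rewind so the lines can be read again
word_counts = Counter(word for line in handle for word in line.split())
print(word_counts["to"])
```

The keys of line_counts end in a newline, just as the keys in the output above end in '\r\n'. In both cases the Counter was given lines, and only lines that repeat exactly get a count above 1.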