CHAPTER ONE

INTRODUCTION TO PYTHON

This section contains an introduction to Python. A lot of good introductory material has been written for Python, and the notes given here are just a bare-bones introduction targeting the basic concepts needed in this course. When learning a programming language, it often pays to get several different takes on the material. In the interest of giving students a different perspective, and of generally broadening their knowledge, here are some of the best quick introductions to Python:

1. Older version of the Python tutorial, better for beginners. Actually written by Guido van Rossum.
2. Current Python docs tutorial
3. How to think like a programmer
4. Google's Python class

1.1 Python example

Note: Python and IPython notebook versions of the code (.py, .ipynb).

This example illustrates some Python code:

    >>> total_secs = 7684
    >>> hours = total_secs // 3600
    >>> secs_still_remaining = total_secs % 3600
    >>> minutes = secs_still_remaining // 60
    >>> secs_finally_remaining = secs_still_remaining % 60
    >>> print "Hrs =", hours, "mins =", minutes, "secs =", secs_finally_remaining
    Hrs = 2 mins = 8 secs = 4

1.2 Your first Python program

Here is some Python code. You should download this code and put it in a file called hello.py. Throughout these notes you will be given files to download. You should adopt the following practice: create a directory (folder) in which you place these files, and always start up Python from the command line after connecting to that directory.

The file hello.py is a tiny Python program to check that Python is working. Try running this program from the command line like this:

    gawron$ python hello.py

(the gawron$ is my command-line prompt; the rest is what I typed on my keyboard, followed by <Enter>).
The program should print:

    Hello World

Or if you type:

    gawron$ python hello.py Alice

it should print:

    Hello Alice

Or if you type:

    gawron$ python hello.py stupid

it should print:

    Hello stupid

If you have a text editor that you feel comfortable with (as discussed in the section on text editors and IDEs), try editing the hello.py file, changing the 'Hello' to 'Howdy', and running it again. Once you have that working, you're ready for class: you can edit and run Python code; now you just need to learn Python!

1.2.1 A little more discussion

The following function is what the file contains:

     1  import sys
     2
     3  def main():
     4      """
     5      Get the name from the command line, using 'World' as a fallback.
     6      """
     7
     8      if len(sys.argv) >= 2:
     9          name = sys.argv[1]
    10      else:
    11          name = 'World'
    12      print 'Hello', name

This block of Python code is a function definition. A function is a program, and the way you define a function in Python is to write:

    def <function_name> ([Arguments]):

followed by a number of lines of indented legal Python code. The arguments are values that are passed into the function and that may affect what it does. In the case of main, there are no arguments, and so the argument list is simply ().

To see how this function works, let's run Python interactively, so that we end up seeing the Python prompt after the program is run. This is done by including a -i on the command line as follows:

    gawron$ python -i hello.py

We then get:

    Hello World
    >>>

We can run the function main again from inside Python, because the first thing Python did on startup was load all the definitions in the file hello.py. Then it executed main. We can tell it to execute main again as follows:

    >>> main()
    Hello World

And it still works. In general in this course, we'll be executing Python code interactively by typing directly at the Python prompt.

This simple function contains a lot of features of Python which we will be discussing in more detail in the next few sections.
Just as a small preview, we discuss some of them now:

1. Line 1 is an import statement. Python comes with a small amount of built-in functionality, but most of what you do when you program in Python requires importing other files that define new functionality. These helper files are called modules, and any Python distribution, even the bare-bones standard distribution, comes with a large number of them. The core set of modules that comes with a standard Python distribution is called the Python standard library. The sys module imported in line 1 is one of these. It provides access to some variables containing information gathered when Python started up, including information about the machine Python is running on and how Python was started. One of these is the variable sys.argv used in line 8.

2. The variable sys.argv is a list containing any arguments you provided when you called Python from the command line, including the name of the program you asked Python to run. So if you typed:

       python hello.py

   then sys.argv looks like this:

       >>> sys.argv
       ['hello.py']

   If you typed:

       python hello.py Alice

   then sys.argv looks like this:

       >>> sys.argv
       ['hello.py', 'Alice']

3. Line 8 of the program is a test to see if the sys.argv list is long enough to contain a name; its length is 1 if no name was supplied on the command line, and at least 2 if a name was supplied. If the test is passed (the length is at least 2), then the variable name is set to the name supplied on the command line (line 9). This works because sys.argv[0] gives us the first thing on the list and sys.argv[1] gives us the second. Otherwise (else), the name is set to 'World' (line 11).

4. In line 12, the print command is called. This prints something to the screen so the user can see it. In this case, the print command prints 'Hello' followed by a space (signalled by the comma) followed by the value of name.

Each of these points illustrates an important feature of Python.
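The length test on the argument list can be tried out without going back to the command line. The following is a minimal sketch, not the contents of hello.py: the helper name greeting is made up for illustration, and it takes the argument list as a parameter so that we can hand it lists that look like sys.argv. It is written so that it also runs under Python 3, where print is a function.

```python
import sys

def greeting(argv):
    """Build the greeting from an argv-style list, using 'World' as a fallback."""
    if len(argv) >= 2:       # argv[0] is the program name, argv[1] the first argument
        name = argv[1]
    else:
        name = 'World'
    return 'Hello ' + name

# The same two cases as the command-line runs, using hand-made lists.
print(greeting(['hello.py']))            # Hello World
print(greeting(['hello.py', 'Alice']))   # Hello Alice

# And with whatever this script itself was started with.
print(greeting(sys.argv))
```

Passing the list in as a parameter, rather than reading sys.argv inside the function, is only a convenience for experimenting at the prompt.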
We'll see these features again in the upcoming sections.

1.3 Python types

In this section, we discuss two basic kinds of objects in Python, numbers and strings. There are lots of other kinds of objects in Python, but these are the two most important for the kinds of problems discussed in this course. In addition, they provide a good starting point for understanding some of the other Python types.

1.3.1 Numbers

First, as our first Python session showed, there are numbers:

    >>> X = 3

Python actually has several different number types. In many simple scripts, Python programmers do not actually have to think about the different kinds of numbers (this is not true in every programming language!). Nevertheless, it is helpful to understand the basic concept. Since we are going to have to understand how different data types work, it helps to understand how the simplest kinds of type distinctions work, and some of the motivations behind them. Figure 1.1 shows the Python type tree for numbers.

Figure 1.1: Python number types

Let's start with the distinction between integers and floats. For most purposes, you can simply think of this as a distinction between the kinds of values you want to represent. For values that are exactly equal to integers (..., -2, -1, 0, 1, 2, ...), you use integers (Python type name int); for values that come in between, you use floats:

    >>> type(1)
    <type 'int'>
    >>> type(1.2)
    <type 'float'>
    >>> X = 1
    >>> type(X)
    <type 'int'>
    >>> X = 1.2
    >>> type(X)
    <type 'float'>

Now the real question is: why bother to have this distinction at all? Why not just have a number type and leave it at that? The answer in part is space. It takes a lot of information to represent values between 1 and 2 exactly. In fact, for many values that come up in mathematics (the value of π, for example), it would necessarily take an infinite amount of space.
In a decimal representation, fractions like 1/3 are infinitely repeating decimals, and would also take an infinite amount of space to represent exactly. Since numbers are represented as binary fractions in computer memory, a different set of fractions comes out as infinitely repeating in computer memory (.1, for example) [1]. So what we do instead is set aside a standard amount of space for each floating point number we want to use, in fact quite a lot of space, to allow for satisfactory precision in extended calculations.

[1] For an excellent discussion of floats in Python, see the Python tutorial page.

On the other hand, sometimes we don't want to use numbers for extended mathematical calculations of arbitrary precision. Sometimes we just want to use them for counting. So when I use a particular variable to store the number of times I see the word ricochet, I know that no matter how much data I've got, the number of times the word occurs can still be represented by an integer. So for storing an integer we set aside another, smaller amount of space, and just as there are floats I can't represent in the given amount of space, so there are also integers (big ones) I can't represent in the agreed-upon amount of space. Now if I really need more space, there is another BIGGER data type I can use for REALLY big integers (say I am counting subatomic events), called a long (or long integer), and that too has its limits, set by the memory available. When the absolute value of a number gets too big to represent in the amount of memory available, that's called overflow.

Finally, there is a distinct number type for complex numbers, which are really numbers with two number components:

    >>> X = 3j+2
    >>> type(X)
    <type 'complex'>

These come up less in social science settings, so we'll pass over them quickly.

In sum, each of the number data types has its specific purposes and its specific limits.
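The limits of the float type are easy to observe at the prompt. The following is a small sketch, written so that it also runs under Python 3; the variable names are made up for illustration. It shows a rounding error caused by the binary representation of .1, a product that is too big to represent, and a product that is too close to zero to represent.

```python
import math
import sys

# .1 has no exact binary representation, so small errors creep into sums.
total = 0.1 + 0.2
print(total == 0.3)                     # False
print(abs(total - 0.3) < 1e-9)          # True: compare with a tolerance instead

# Floats have a largest representable value; multiplying past it gives infinity.
biggest = sys.float_info.max
print(biggest * 10 == float('inf'))     # True

# The product of two very small numbers can collapse to exactly zero.
tiny = 1e-200
print(tiny * tiny == 0.0)               # True

# Adding logarithms instead of multiplying keeps the result representable.
log_product = math.log(tiny) + math.log(tiny)
print(log_product)
```

Working with sums of logarithms rather than products is one common way of handling very small values such as probabilities; it is shown here only as an illustration.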
In general each number type has its maximum and minimum value; floats also have a maximum precision, which means, for example, that certain numbers are too close to 0 to represent. This problem is called underflow. Most of these facts aren't very important in social science computing, but it is important to understand that there ARE different number types, and that they exist for very good reasons.

As the domain of social science computing expands, these kinds of distinctions become important to understand. For example, since the advent of successful speech recognition systems in the 1980s, the branch of linguistics devoted to computer processing of language has undergone a massive expansion and influx of new ideas. Statistical modeling has become much more important. As a result, computing the probabilities of very rare linguistic events has become a practical necessity; in such computations, underflow problems often arise, and computational linguists have learned how to write programs that deal with them.

1.3.2 Strings

The other basic data type is strings:

    >>> X = 'frog'
    >>> type(X)
    <type 'str'>

When we type a word in quotes at the Python prompt, or when we write a program that reads in a file of ordinary English text, the data type we get is generally a string. Unless you tell Python otherwise, the data type you get by reading in a file is strings. Much of this course will be concerned with string data, since a lot of data of interest to social scientists is in string form.

The important thing to remember about strings is that when you want to explicitly reference a string value, you need quotes, as in the example above with 'frog'. Leaving out the quotes is an error:

    >>> X = frog
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'frog' is not defined

Python interprets this as a reference to a variable.
The variable frog might refer to anything, an integer, a float, a file; Python doesn't know, and reports an error.

Python allows any character to occur in a string, including punctuation marks and spaces. So the following is fine:

    >>> X = 'The big dog laughed.'

But how about the quotation mark character? Can that occur in a string? The answer is that it can, but you have to wrap the string in a distinct kind of quotation mark. So both of the following are fine:

    >>> X = "The big dog laughed and said, 'Hello, Jeremy.'"
    >>> Y = 'The big dog laughed and said, "Hello, Jeremy."'
    >>> X == Y
    False

The convention is that the string expression has to start and end with the same kind of quotation mark. Any quotation marks inside have to be different and are considered part of the string being referred to, so X and Y differ in that X contains two instances of ' and Y contains two instances of ". The quotation marks at the beginning and end of the string are not considered part of the string; they are just delimiters, like parentheses in arithmetic, telling you where the first and last characters of the string are. So contrast the above examples with the following:

    >>> X = "The big dog laughed."
    >>> Y = 'The big dog laughed.'
    >>> X == Y
    True

Which quotation character you use as your delimiter doesn't matter (as long as there are no quotation characters inside the string).

Generally speaking, strings of more than one line require some special provisions. They should be begun and ended with triple quotes:

    >>> X = """
    ...     Beautiful is better than ugly.
    ...     Explicit is better than implicit.
    ...     Simple is better than complex.
    ...     Complex is better than complicated.
    ...     """
    >>> print X

        Beautiful is better than ugly.
        Explicit is better than implicit.
        Simple is better than complex.
        Complex is better than complicated.

Note that the spaces included at the beginning of each line are part of the string. Such multiline strings serve an important purpose in Python, since they are used for documentation.
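The points about delimiters and documentation can be checked directly. The following is a small sketch, written so that it also runs under Python 3; the function shout and the example sentences are made up for illustration. The triple-quoted string placed first in a function body becomes that function's documentation, which Python stores in the __doc__ attribute.

```python
# The delimiters are not part of the string: both spellings give the same value.
a = "The big dog laughed."
b = 'The big dog laughed.'
print(a == b)            # True
print(len(a))            # 20: the enclosing quotes are not counted

# Quotation marks inside the string are ordinary characters, and they do count.
c = "She said, 'Hello.'"
d = 'She said, "Hello."'
print(c == d)            # False

def shout(word):
    """Return word in upper case.

    This triple-quoted string is the function's documentation.
    """
    return word.upper()

print(shout('frog'))     # FROG
print(shout.__doc__)     # prints the documentation string
```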
Strings can also include special characters such as tabs. To place a tab in a string, use the special \t symbol; to place a line break in a string, use the special \n symbol. Thus, to place a tab between 'x' and 'y', we write:

    >>> Z = 'x\ty'
    >>> print Z
    x       y

And since \n produces a line break, the string X defined above, giving four lines of the Zen of Python, can also be defined:

    >>> X = "\n    Beautiful is better than ugly.\n    Explicit is better than implicit.\n    Simple is better than complex.\n    Complex is better than complicated.\n    "
    >>> print X

        Beautiful is better than ugly.
        Explicit is better than implicit.
        Simple is better than complex.
        Complex is better than complicated.

Generally speaking there is little need for multiline strings with explicit \n, except for strings assembled from pieces by a program. The triple-quoted form is preferred because it is more readable.

1.4 More Python types

We've learned about numbers and strings, but the world does not consist of numbers and strings alone, not even in a programming language. First of all there is the world outside the program. There are different types for files (or, more precisely, for the streams by which we communicate with them) and for the endpoints of our connections to the outside world (for instance, in talking to another host on the internet or to a printer). But even before we get to the outside world, there is the basic need to have data that is structured, in a way we now try to make clear.

Sometimes we need data that contains other data. In Python, this kind of datum is called a container. Figure 1.2 shows the basic Python container types. All serve different needs.

Figure 1.2: Python container type tree

One feature that all containers share is that they support the in operator:

    >>> x = 'abcde'
    >>> 'c' in x
    True

So containers contain things, and x in y checks to see if x is contained in y. In this case, the character 'c' is contained in the string 'abcde', so the answer is True.
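The same test works across the container types discussed in the rest of this section. The following sketch, written so that it also runs under Python 3, tries in on a string, a list, a tuple, and a dictionary; the variable names and the data in them are made up for illustration.

```python
letters = 'abcde'
print('c' in letters)        # True: a character in a string
print('cd' in letters)       # True: for strings, in also finds longer substrings

items = [24, 3.14, 'w']
print(3.14 in items)         # True: an element of a list

pair = (24, 3.14)
print('w' in pair)           # False: not an element of this tuple

ages = {'Alice': 31, 'Bob': 27}
print('Alice' in ages)       # True: for dictionaries, in tests the keys
print(31 in ages)            # False: the values are not searched
```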
The tree above contains one non-container which is an important special case: file-like objects. These are IO streams for communicating with files or with the user interactively. Technically, file-like objects are not containers, but in many ways you can think of them that way (what they "contain" is lines). They even support the in operator, though it works in a different way (see Files and file IO streams).

Most important Python supertypes

• Mutable types
• Containers
• Iterables

For the authoritative definitions of each of these, see the Python.org glossary. The term container is missing from this glossary, though it is clearly a design concept in Python. For a clear discussion (for the geeky) of what a container really is, see the documentation for the Python collections module.

Basic container types

1.4.1 Lists

Lists are sequences of things. What kinds of things? Well, anything that can be a thing in Python. That is one key fact about lists. A list can be a higgledy-piggledy assortment. One list can contain a number, a string, and another list. You should certainly use lists when you want to remember the order in which you saw a sequence of things. But there are other reasons to use lists. They are in some ways the default container type.

Python uses square brackets to enclose lists and commas to separate the elements, so X, Y, and Z are all valid lists:

    >>> X = [24, 3.14, 'w', [1,'a']]   # list with 4 items
    >>> Y = [100]                      # list with one item
    >>> Z = []                         # empty list

Note in particular that one of the elements of X is itself a list. More on lists containing lists below.

The name list is special in Python, because it refers to the type list:

    >>> list
    <type 'list'>

Creating lists

You can use the type name as a function to create lists.
So consider the following:

    >>> L = list('hi!')

Python interprets this as a request to make a list that contains the same things as the string 'hi!'; that string contains 3 characters, so Python makes L a list containing three characters:

    >>> L
    ['h', 'i', '!']

Using the type as a function that creates instances of the type is a standard practice in Python. The list function can be fed any sequence as an argument, and it returns a list containing the same elements. A special case is calling list with no arguments at all:

    >>> M = list()

This returns a special list called the empty list, which is of length 0 and contains no elements at all:

    >>> M
    []

This may seem useless, but it is a great convenience when programming that the result be something well defined and legal when all the elements have been removed from a list.

Indexing lists, list slices

A list is indexed by numbers; items are referred to by their integer indices:

    >>> X[0]     # 1st element
    24
    >>> X[1]     # 2nd element
    3.14
    >>> X[-1]    # last element
    [1, 'a']

Thus, indexing starts with 0. This means the highest index that retrieves anything is 1 less than the length of the list. List X has length 4, so the following raises an IndexError:

    >>> X[4]     # Raises exception!
    ...
    IndexError: list index out of range

Python also provides easy access to subsequences of a list. The following examples illustrate how to make such references:

    >>> X[0:2]   # list of 1st and 2nd elements
    [24, 3.14]
    >>> X[:-1]   # list excluding last element
    [24, 3.14, 'w']

References to subsequences of a list are called slices.

Lists can be concatenated into longer lists:

    >>> X + Y    # Concatenation!
    [24, 3.14, 'w', [1, 'a'], 100]

Lists allow value assignments, which change the value of a reference in place (in-place assignment):

    >>> X[2] = 5
    >>> X
    [24, 3.14, 5, [1, 'a']]
    >>> X[0:2]
    [24, 3.14]
    >>> X[0:2] = [1,3]
    >>> X
    [1, 3, 5, [1, 'a']]

Only iterable values, such as lists, can be assigned to slices:

    >>> X[0:2] = 1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can only assign an iterable

The error message here, "TypeError: can only assign an iterable", refers to a general class that includes the containers, called iterables, the root of the tree in the section on More Python types. We'll talk more about iterables later. The important point for now is that iterables include all containers, including lists, but not integers, so:

    >>> X[0:2] = [1]

works, but

    >>> X[0:2] = 1

does not. A list slice must be filled in by a list; any iterable can be easily turned into a list, but 1 cannot.

Learning how to read and interpret Python errors is an important part of being able to write simple programs in Python. Properly understood, Python's error reporting will make it easier for you to fix errors.

Lists containing lists

We introduced the list structure by saying a list could contain anything. That includes a list. Lists of lists are useful for many purposes. One of the most intuitive is to represent a table. Suppose we want to represent the following table from some dataset in Python.

    42    3.14    7
     2    4       0
    14    0       0

We can do that as follows:

    >>> Table = [[42, 3.14, 7], [2, 4, 0], [14, 0, 0]]
    >>> Table[0]
    [42, 3.14, 7]

This means that to retrieve the value 42, we can do:

    >>> FirstRow = Table[0]
    >>> FirstRow[0]
    42

But Python allows any expression which has a list as its value to be followed by [index]. That includes the expression Table[0]. It is much more convenient and Pythonic to do this in one step:

    >>> Table[0][0]
    42

So in general, thinking of a table as a list of rows, each of which is a list, we can access the element in row i, column j, with Table[i][j].
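The row-and-column view of a nested list can be exercised a little further. The following sketch, written so that it also runs under Python 3, uses lowercase variable names of its own (table, first_column, items), all made up for illustration. It reads one cell, collects a column, changes a cell in place, and shows that slice assignment accepts any iterable, not just a list.

```python
# A table as a list of rows; each row is itself a list.
table = [[42, 3.14, 7],
         [2, 4, 0],
         [14, 0, 0]]

print(table[0][1])           # row 0, column 1: 3.14

# Walk through the rows to pull out one column.
first_column = []
for row in table:
    first_column.append(row[0])
print(first_column)          # [42, 2, 14]

# In-place assignment reaches inside the nested list.
table[2][2] = 99
print(table[2])              # [14, 0, 99]

# Slice assignment takes any iterable; a string is read character by character.
items = [24, 3.14, 'w']
items[0:2] = 'hi'
print(items)                 # ['h', 'i', 'w']

# The replacement does not need to have the same length as the slice.
items[0:2] = [1]
print(items)                 # [1, 'w']
```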
List methods

In all the following examples, L is a list.

L.insert(i, x)
    Insert an item at a given position. The first argument is the index of the element before which to insert, so L.insert(0, x) inserts at the front of the list, and L.insert(len(L), x) is equivalent to L.append(x).

L.append(x)
    Add x at the end of the list. Equivalent to L.insert(len(L), x).

L.index(x)
    Return the index in the list of the first item whose value is x. It is an error if there is no such item.

L.remove(x)
    Remove the first item from the list whose value is x. It is an error if there is no such item.

L.sort()
    Sort the items of the list. There is much more to be said about this, but note that it changes the list.

L.reverse()
    Reverse the elements of the list, in place.

L.count(x)
    Return the number of times x appears in the list.

See Also: For a nice overview of list operations, see Google's tutorial. This is geared toward the more advanced student. For a real programmer's discussion of Python lists, addressing questions like what's faster, assignment or insertion, or what's faster, insertion at the beginning or insertion at the end, see Fredrik Lundh's effbot.org discussion.

1.4.2 Strings

We have already been introduced to strings as a basic data type. Now we take a look at them again from a different point of view. Strings are containers. This means you can look at their insides and do things like check whether the first character is capitalized and whether the third character is "e".

Indexing strings, string slices

To get at the inner components of strings, Python uses the same syntax and operators as for lists. The Pythonic conception is that both lists and strings belong to a 'super' data type, sequences. Sequence types are containers that contain elements in a particular order, so indexing by number makes sense for all sequences:

    >>> X = 'dogs'
    >>> X[0]
    'd'
    >>> X[1]
    'o'
    >>> X[-1]
    's'

The following raises an IndexError, as it would with a 4-element list:

    >>> X[4]
    ...
    IndexError: string index out of range

Strings can also be one element long:

    >>> Y = 'd'

Note: Unlike C, there is no special type for characters in Python. Characters are just one-element strings.

And strings can be empty, just as lists can:

    >>> Z = ''

Python also provides easy access to subsequences of a string, just as it does for lists. The following examples illustrate how to make such references:

    >>> X[0:2]   # string of 1st and 2nd characters
    'do'
    >>> X[:-1]   # string excluding last character
    'dog'

References to subsequences of a string are called slices. Guido van Rossum says: "The best way to remember how slices work is to think of the indices as pointing between characters, with the left edge of the first character numbered 0. Then the right edge of the last character of a string of n characters has index n":

     +---+---+---+---+---+
     | H | e | l | p | A |
     +---+---+---+---+---+
     0   1   2   3   4   5
    -5  -4  -3  -2  -1

The first row of numbers gives the position of the indices 0...5 in the string; the second row gives the corresponding negative indices. The slice from i to j consists of all characters between the edges labeled i and j, respectively. For nonnegative indices, the length of a slice is the difference of the indices, if both are within bounds; e.g., the length of word[1:3] is 2.

The built-in function len() returns the length of a string:

    >>> s = 'supercalifragilisticexpialidocious'
    >>> len(s)
    34

Strings can also be concatenated into longer sequences, just as lists can:

    >>> X + Y
    'dogsd'

Using the name of the type as a function gives us a way of MAKING strings, just as it did with lists:

    >>> One = str(1)

One is no longer an int!:

    >>> One
    '1'
    >>> I = int(str(1))
    >>> I
    1

And as with lists, calling the type with no arguments produces the empty string:

    >>> Empty = str()
    >>> Empty
    ''

There is one thing that can be done with lists that canNOT be done with strings: assignment of values.

    >>> 'spin'[2] = 'a'
    ...
    TypeError: 'str' object does not support item assignment

This can be fixed by avoiding the assignment, or by making it on a mutable sequence, such as a list, which contains the relevant information.

See Also: Section Mutability.

String methods

In all the following examples, S is a string. This is just a sample. See the official Python docs for the complete list of string methods. Or just type help(str) at the Python prompt!

S.capitalize()
    Return a string just like S, except that it is capitalized. If S is already capitalized, the result is identical to S.

S.count(x)
    Return the number of times x appears in the string S.

S.index(x)
    Return the index in S of the first substring identical to x. It is an error if there is no such substring.

S.replace(x, y)
    Return a string in which every instance of the substring x in S is replaced with y:

        >>> X = 'abracadabra'
        >>> X.replace('dab','bad')
        'abracabadra'
        >>> X.replace('a','b')
        'bbrbcbdbbrb'

S.title()
    Return a string just like S in which all words are capitalized:

        >>> 'los anGeles'.title()
        'Los Angeles'

S.istitle()
    Return True if every word in S is capitalized. Otherwise, return False:

        >>> 'los anGeles'.istitle()
        False
        >>> 'Los AnGeles'.istitle()
        False
        >>> 'Los Angeles'.istitle()
        True

1.4.3 Tuples

Syntax

Python uses commas to create sequences of elements called tuples.
The result is more readable if the tuple elements are enclosed in parentheses:

    >>> X = (24, 3.14, 'w', 7)   # tuple with 4 items
    >>> Y = (100,)               # tuple with one item
    >>> Z = ()                   # empty tuple

The following is not a tuple:

    >>> Q = (7)
    >>> Q == 7
    True

Making tuples

The tuple constructor function is the name of the type:

    >>> L = tuple('hi!')         # make a string a tuple
    >>> L
    ('h', 'i', '!')
    >>> M = tuple()
    >>> M
    ()

Tuples are indexed and sliced with square brackets, just as lists are:

    >>> X[0]     # 1st element
    24
    >>> X[1]     # 2nd element
    3.1400000000000001
    >>> X[-1]    # last element
    7
    >>> X[0:2]   # tuple of 1st and 2nd elements
    (24, 3.1400000000000001)
    >>> X[:-1]   # tuple excluding last element
    (24, 3.1400000000000001, 'w')

The following raises an IndexError:

    >>> X[4]     # Raises exception!
    ...
    IndexError: tuple index out of range

Concatenation of tuples:

    >>> X + Y
    (24, 3.1400000000000001, 'w', 7, 100)

1.4.4 Dictionaries

Dictionaries store mappings between many kinds of data types. Suppose we are studying the Enron email network, keeping track of who emailed whom. For each employee, we want to keep a list of the people they emailed. This is the kind of information that would be stored in a dictionary. It would look like this:

    >>> enron_network
    {'Mike Grigsby': ['Louise Kitchen', 'Scott Neal', 'Kenneth Lay', ... ],
     'Greg Whalley': ['Mike Grigsby', 'Louise Kitchen', ... ],
     ...
    }

Each entry represents the email sent by one employee. Thus, the first entry tells us Mike Grigsby sent email to Louise Kitchen, Scott Neal, Kenneth Lay, et cetera. Each employee is represented by a string that uniquely identifies them (their name). Thus, this dictionary relates strings to lists of strings (Pythonistas say it maps strings to lists of strings). The strings before the colons are the keys, and the lists after the colons are the values.

Any immutable type can be the key of a dictionary. For our purposes we will most often use strings as the keys of a dictionary. We will use all kinds of things as values.
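The restriction on keys can be seen by trying both kinds. The following sketch, written so that it also runs under Python 3, uses a made-up dictionary emails_sent whose keys are tuples pairing a sender with a recipient; the names and counts are invented for illustration. A tuple works as a key because it is immutable; a list does not.

```python
# Keys here are tuples of the form (sender, recipient); the counts are made up.
emails_sent = {
    ('Mike', 'Louise'): 3,
    ('Greg', 'Mike'): 1,
}
print(emails_sent[('Mike', 'Louise')])    # 3

# A list is mutable, so Python refuses to use it as a key.
try:
    emails_sent[['Mike', 'Scott']] = 2
    list_key_accepted = True
except TypeError as err:
    list_key_accepted = False
    print('TypeError: %s' % err)

print(len(emails_sent))                   # 2: the failed assignment added nothing
```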
Dictionaries are written with curly braces, with a colon between each key and its value:

    >>> X = {'x':42, 'y': 3.14, 'z':7}
    >>> Y = {1:2, 3:4}
    >>> Z = {}

The type name dict can also be used to make a dictionary:

    >>> L = dict(x=42,y=3.14,z=7)

The following is the way to do the same with integer keys:

    >>> M = dict([[1, 2],[3, 4]])

Values are retrieved by key:

    >>> X['x']
    42
    >>> X['y']
    3.1400000000000001
    >>> M[1]
    2

The following raises a KeyError:

    >>> X['w']
    ...
    KeyError: 'w'

Dictionaries and lists can be mixed. We saw above that a list can be a value in a dictionary. Dictionaries can also be elements in lists:

    >>> Double = [{'x':42,'y': 3.14, 'z':7},{1:2,3:4},{'w':14}]
    >>> Double[0]
    {'y': 3.1400000000000001, 'x': 42, 'z': 7}

This means that to retrieve the value 42 for the key x, we can do:

    >>> FirstDict = Double[0]
    >>> FirstDict['x']
    42

But Python allows any expression which has a dictionary as its value to be followed by [key]. It is much more convenient and Pythonic to do this in one step:

    >>> Double[0]['x']
    42

From list to dictionary

It often happens that we have data in a format convenient for providing one kind of information, but we need it for a slightly different purpose. Suppose we have the words of English arranged according to frequency rank, from most frequent to least frequent. Such a list looks like this:

    rank_list = ['the', 'be', 'of', 'and', 'a', 'in', 'to', 'have', 'it', 'to',
                 'for', 'I', 'that', 'you', 'he', 'on', 'with', 'do', 'at', 'by',
                 'not', 'this', 'but', 'from', 'they', 'his', ...]

This is convenient if you want to know things like what the 6th most frequent word of English is:

    >>> rank_list[5]   # 6th word on list has index 5
    'in'

But suppose you want to go in the opposite direction? Given a word, you want to know what its frequency rank is. Python provides a built-in method for finding the index of any item in a list, so we could equally well do:

    >>> rank_list.index('in')
    5

But this is actually a little inefficient: Python has to search through the list, comparing each item with 'in', until it finds the right one.
If Python starts from the beginning of the list each time, that will take quite a lot of comparisons for low-ranked words.

What this example illustrates is a very basic but extremely important computational concept. You want your information to be indexed in a way convenient for the kind of questions you're going to ask. A phone book is an extremely useful format if you know a name and you need a phone number. Similarly, Webster's Dictionary is wonderful if you have a word and want to know what the meaning is. In both cases, alphabetically sorting the entries gives us an efficient way of accessing the information. But the phone book is fairly inconvenient if you have a phone number and want to know who has it, and Webster's is quite hard to use if you have a meaning in mind and are hunting for the word to express it, or even if you have a word and are looking for its rhyme.

If you have a set of strings, and you want to get particular kinds of information about each, dictionaries are the way to go: dictionary lookups use an indexing scheme much like alphabetization to efficiently retrieve values for keys. In the example of the rank list, if our main interest is in going from words to ranks, what we want to do is convert the list into a dictionary. That looks like this:

    rank_dict = dict()
    for (i, word) in enumerate(rank_list):
        rank_dict[word] = (i+1)

Here is what enumerate returns for a list:

    >>> list(enumerate(['a','b','c']))
    [(0, 'a'), (1, 'b'), (2, 'c')]

Essentially, what enumerate does is pair each element in the list with its index, which is exactly the information we want for our dictionary.

Combining lists into a dictionary

It is frequently the case that the meaning of a particular data item is defined by its position in a sequence of things. The i-th thing in that sequence is associated with a particular entity x_i, and the i-th thing in some other sequence might also be associated with x_i.
For example, a company database might store information about employees in separate files, but the i-th line of each file is reserved for information about the employee with id number i.

As a concrete example, suppose we downloaded information about word frequencies in the British National Corpus, a very large sample of English, likely to give very reliable counts, and it came in two files, with contents that looked like this:

    File 1         File 2
    ----------     -------
    a              2186369
    abandon        4249
    abbey          1110
    ability        10468
    able           30454
    abnormal       809
    abolish        1744
    abolition      1154
    abortion       1471
    about          52561
    about          144554
    above          2139
    above          10719
    above          12889
    abroad         3941
    abruptly       1146
    absence        5949
    absent         1504
    absolute       3489
    absolutely     5782
    absorb         2684
    absorption     932
    abstract       1605

That is, the i-th line in File 2 gives the frequency of the i-th English word in File 1. The easiest way to read this data into Python produces two lists, word_list and freq_list. Now suppose we want to be able to go conveniently from a word to its frequency. The right thing to do is to make a dictionary. Here is one way to do that, using code like the code we've already seen:

    freq_dict = dict()
    for (i, word) in enumerate(word_list):
        freq_dict[word] = freq_list[i]

Look at this code and make sure you understand it. We use the enumerate function to give us the index of each word in the list as we see it, then associate the word with the i-th frequency in freq_list.

But Python provides a faster, much more Pythonic way to do this. As with list and str, the name of the Python type dict is also a function for producing dictionaries. In fact, we used that convention to create an empty dictionary in the code snippet above. But dict can do more than just create empty dictionaries.
Given a list of pairs, it produces a dictionary whose keys are the first members of the pairs and whose values are the corresponding second members:

    >>> L = [('a',1), ('b',2), ('c',3)]
    >>> dd = dict(L)
    >>> dd
    {'a': 1, 'b': 2, 'c': 3}

Unfortunately, we have two lists, not one, and neither is a list of pairs. Not to worry. Python also provides an easy way to create the list we want. The function is called zip; given two lists of the same length, it returns a list of pairs:

    >>> L_left = ['a','b','c']
    >>> L_right = [1, 2, 3]
    >>> zip(L_left, L_right)
    [('a', 1), ('b', 2), ('c', 3)]

Thus in this instance zip returns the same list L we saw above. (In Python 3, zip returns an iterator rather than a list, so you would write list(zip(L_left, L_right)) to see the pairs; passing the result straight to dict works the same either way.) Given these features, it is quite simple to produce the frequency dictionary we want:

    >>> freq_dict = dict(zip(word_list, freq_list))

This is a frequently used Python idiom which will come in handy.

A text-based example

Note: This section uses the data module example_string which can be found here. [Add ipython notebook]

We illustrate some code here for computing word counts in a string. This is included here as a very clear example of when dictionaries should be used: we have a set of strings (words in our vocabulary) and information that we want to keep about each string (the word’s count). We need to update that information in a loop (see Section Loops), as we look at each word in the string:

    from example_string import example_string

    count_dict = dict()

    for word in example_string.split():
        if word in count_dict:
            count_dict[word] += 1
        else:
            count_dict[word] = 1

And then we have:

    >>> count_dict
    {'all': 3,
     'consider': 2,
     'dance': 1,
     'better,': 1,
     'sakes,': 1,
     'pleasant,': 1,
     'four': 3,
     'go': 1,
     ...
     'Lizzy': 2,
     'Jane,': 1,
     ...
     'Kitty,': 2
    }

Let’s see what’s going on in the code. [Explain]

Dictionary methods

[some of the most important]

Python collections module

Introduce Counters and defaultdicts. Run through the above example, simplifying through the use of Counters.
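As a rough sketch of the simplification just mentioned, the word-count loop can be shortened with either class from the collections module. The string text below is a hypothetical stand-in for example_string (which is not reproduced in these notes), and the names count_dd and count_c are chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for example_string; any string of words will do.
text = "all four go all four dance all"

# Version 1: defaultdict(int) supplies 0 for a word it has not seen yet,
# so the if/else test from the original loop is no longer needed.
count_dd = defaultdict(int)
for word in text.split():
    count_dd[word] += 1

# Version 2: Counter tallies every item of the list in a single call.
count_c = Counter(text.split())

print(count_c.most_common(1))   # [('all', 3)]
```

Both versions end up holding the same counts as the explicit if/else loop would; the difference is only in how much bookkeeping you write yourself.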
A counter is a special kind of dictionary defined in the Python collections module:

    >>> from collections import Counter
    >>> # Tally occurrences of words in a list
    >>> cnt = Counter()
    >>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    ...     cnt[word] += 1
    >>> cnt
    Counter({'blue': 3, 'red': 2, 'green': 1})

    >>> # Find the ten most common words in Hamlet
    >>> import re
    >>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
    >>> Counter(words).most_common(10)
    [('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
     ('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

Counters can be initialized with any sequence, and they will count the token occurrences in that sequence. For example:

    >>> c = Counter('gallahad')
    >>> c
    Counter({'a': 3, 'l': 2, 'h': 1, 'g': 1, 'd': 1})

They can also be initialized directly with count information from a dictionary or from keyword arguments:

    >>> c = Counter({'red': 4, 'blue': 2})   # a new counter from a mapping
    >>> c = Counter(cats=4, dogs=8)          # a new counter from keyword args

A Counter is a kind of dictionary, but it does not behave entirely like a standard dictionary. The count of a missing element is 0; no KeyError is raised, as would be the case with a standard Python dictionary:

    >>> c = Counter(['eggs', 'ham'])
    >>> c['bacon']
    0

See the Python docs for more features.

Now we said a counter can be initialized directly with any sequence; the correct term is any iterable, roughly, anything that can be iterated through with a for loop. But caution must be exercised when initializing counters this way, to guarantee that the right things are being counted.
For example, if what is desired is the word counts, it won’t work to simply initialize a Counter with a file handle, even though a file handle is an iterable:

    >>> from collections import Counter
    >>> tr = Counter(open('../anatomy/pride_and_prejudice.txt','r'))
    >>> len(tr)
    11003
    >>> tr.most_common(10)
    [('\r\n', 2394),
     (' * * * * *\r\n', 6),
     ('them."\r\n', 3),
     ('it.\r\n', 3),
     ('them.\r\n', 3),
     ('family.\r\n', 2),
     ('do."\r\n', 2),
     ('between Mr. Darcy and herself.\r\n', 2),
     ('almost no restrictions whatsoever. You may copy it, give it away or\r\n', 2),
     ('together.\r\n', 2)]

What happened and why?
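One way to probe this question is the following minimal sketch. It assumes a tiny in-memory stand-in (fake_file, built with io.StringIO) in place of the novel's text file, so the contents and the variable names are hypothetical; the point is only to compare what a Counter sees when handed a file-like object directly with what it sees when each line is split into words first.

```python
import io
from collections import Counter

# Hypothetical two-line stand-in for the text file used above.
fake_file = io.StringIO(u"it is a truth\nit is universally acknowledged\n")

# Iterating over a file yields whole lines (newline included),
# so this Counter tallies lines, not words.
line_counts = Counter(fake_file)

# Rewind to the start, then split each line into words before counting.
fake_file.seek(0)
word_counts = Counter(word for line in fake_file for word in line.split())

print(len(line_counts))    # 2 distinct lines
print(word_counts["it"])   # 2 occurrences of the word "it"
```

The same generator expression should work with a real file handle in place of fake_file, though punctuation will still be attached to words unless it is stripped separately.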