Lecture 8: A cure for what ails you

When human beings acquired language, we learned not just how to listen but how to speak. When we gained literacy, we learned not just how to read but how to write. And as we move into an increasingly digital reality, we must learn not just how to use programs but how to make them. In the emerging, highly programmed landscape ahead, you will either create the software or you will be the software. It’s really that simple: Program, or be programmed. Choose the former, and you gain access to the control panel of civilization. Choose the latter, and it could be the last real choice you get to make.

Today
• Today we’ll start by having a look at functions -- Next time we will finish off the last dangling Python thread, user-defined objects
• We’ll then cover some coding strategies in Python to deal with uncertainties as well as a simple Python debugger to assess the operation of your running code
• We will then spend some time with MongoDB, an option we will revisit later in the quarter but one that some of you may be interested in for your project -- We’ll close with your next homework assignment!
Functions
• As you’ve probably seen with this homework, the more you “build” with Python, the more difficult it becomes to effectively create/maintain/share your code as one single piece (whether that be a script or a module)
• You are inevitably led to breaking up computations into smaller units; and, as with R, a function is Python’s way of letting you group statements that can be repeated elsewhere in your program
• With that in mind, you should anticipate that the specification for a function will need to define the group of statements you want to consider; specify the variables you want to involve in your computation; and return a result

Function definition
• As we saw last time, functions are created with the def statement; it defines another block of code that makes up the function’s body (either an indented set of statements or a simple statement after the colon)
• The def statement is a kind of assignment in that it associates the function’s name with an object of type function; the def statement can occur anywhere (well, anywhere a statement can) and the named functions are defined at “runtime” (meaning the function is created when you execute your code)
• Let’s see how this works...

>>> from random import normalvariate, lognormvariate
>>> def noise(x):                  # defining functions
...     return x+normalvariate(0,1)
...
>>> noise(3)
2.844815682100834
>>> type(noise)                    # noise points to an object of type 'function'
<type 'function'>
>>> clatter = noise
>>> clatter(2)
0.2781754850162974
>>> if wild:                       # wild: a boolean assumed to be set earlier in the session
...     def noise(x):
...         'log-normal noise'
...         return x+lognormvariate(0,1)
... else:
...     def noise(x):
...         'gaussian noise'
...         return x+normalvariate(0,1)
...
>>> noise(5)
5.899560478595187
>>> help(noise)

Identifying data to compute with
• So that example seemed pretty straightforward; we had a single argument and assigned it a value when we called the function
• But almost immediately subtleties arise, and we have to ask questions about how this assignment is done and how, in general, Python treats variables in the body of a function

Scoping rules (again)
• As we will see with R, when we use a name in a program (in an assignment or a simple expression, say), Python needs to associate that name with some object
• The visibility of a name within our program is determined by where it’s assigned (literally the location of the assignment statement in our code); this visibility is referred to as a name’s scope
• To talk about scope and the process that Python uses to associate names with objects, we need to revisit the concept of a namespace; but first some examples...

>>> from random import normalvariate, lognormvariate
>>>
>>> def noise(x):
...     z = 5                      # y,z are local variables; they exist within a
...     y = x+normalvariate(z,1)   # namespace created when noise is executed
...     return y
...
>>> noise(3)
8.069331171014415
>>> z                              # but we can’t find z outside of the body of noise
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'z' is not defined
>>>
>>> def noise(x):                  # let’s create another version of noise, this time
...     y = x+normalvariate(z,1)   # removing the definition of z
...     return y
...
>>> noise(3)                       # oops!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in noise
NameError: global name 'z' is not defined
>>> z = 10                         # now let’s create a variable named z
>>> noise(3)                       # where is python finding it?
13.726079223271851
>>> y = 1
>>> noise(3)
13.630053167817655
>>> y
1

Scoping rules
• When we first started writing Python code, all our variable assignments took place at the top level of a module*; that is, their names were part of the module’s namespace, or the “global scope” of the module, and we could refer to them simply
• Notice that this sense of “global” is really file-based; that is, when we write a module, we don’t have to worry about whether someone using our module has defined variables of the same name in their code
• With functions, we introduce a nested namespace (a nested scope) that localizes the names they use so that you can avoid similar kinds of clashes
* If it is typed in at the “>>>” prompt, you are in a module called __main__; otherwise the enclosing module is the file that contains your program

Scoping rules
• The execution of a function introduces a new namespace for the “local” variables of the function; all variable assignments in a function store their names in this local namespace
• When we look up a variable name (by referring to it in an expression in the function, say), Python first looks in this local namespace; if it can’t find an object of that name locally, it starts a search that moves out to (eventually) the global namespace and then the collection of built-in names
• During this lookup process, Python will return the object associated with the first instance of the name it finds; so local names take precedence over globals
• Also, the names associated with a function’s namespace are determined when its definition is executed; they are treated as locals everywhere in the function, not just after the statements where they are assigned...
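The lookup order described above can be sketched in a few lines. This is a minimal illustration written for Python 3 (unlike the Python 2.7 sessions in these slides); the function names and the results dictionary are made up for the example.

```python
import builtins

z = 10  # a global (module-level) name


def uses_global(x):
    # z is never assigned in this body, so the lookup falls through to the module
    return x + z


def shadows_global(x):
    z = 5  # the assignment makes z local to this call; the global z is untouched
    return x + z


def outer(x):
    z = 100  # lives in the namespace of the enclosing function

    def inner():
        # z is not local to inner, so Python finds it in outer before the global
        return x + z

    return inner()


def assigned_too_late(x):
    # z is assigned further down, so it is treated as local everywhere in this body
    try:
        y = x + z
    except UnboundLocalError as err:
        return "error: " + type(err).__name__
    z = 1
    return y


results = {
    "global": uses_global(3),         # 3 + 10
    "local": shadows_global(3),       # 3 + 5
    "enclosing": outer(3),            # 3 + 100
    "late": assigned_too_late(3),     # the local z has no value yet
    "builtin": len is builtins.len,   # names like len come from the built-in scope
}
print(results)
```

Note that the global z still equals 10 after all four calls; none of the functions rebinds it.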
>>> # built-in scope
>>> import __builtin__
>>> dir(__builtin__)
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BufferError', 'BytesWarning',
'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False',
'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning',
'IndentationError', 'IndexError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError',
'NameError', 'None', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError',
'PendingDeprecationWarning', 'ReferenceError', 'RuntimeError', 'RuntimeWarning', 'StandardError',
'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'True',
'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError',
'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError',
'_', '__debug__', '__doc__', '__import__', '__name__', '__package__', 'abs', 'all', 'any', 'apply',
'basestring', 'bin', 'bool', 'buffer', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'cmp',
'coerce', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate',
'eval', 'execfile', 'exit', 'file', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals',
'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'intern', 'isinstance', 'issubclass', 'iter',
'len', 'license', 'list', 'locals', 'long', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct',
'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'raw_input', 'reduce', 'reload', 'repr',
'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple',
'type', 'unichr', 'unicode', 'vars', 'xrange', 'zip']

>>> from random import normalvariate, lognormvariate
>>> y = 10
>>> def noise(x): print y         # y is not defined in the local namespace
...
>>> noise(3)
10
>>> def noise(x):
...     print y                    # y is defined in the local namespace, but only
...     y = x+normalvariate(0,1)   # assigned after the print statement, hence the error
...     return y
...
>>> noise(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in noise
UnboundLocalError: local variable 'y' referenced before assignment

Scoping rules
• Because def is a statement like any other, we can certainly define functions within other functions; in that case, our search for variables works its way out through the enclosing functions
• Lutz defines the LEGB rule for resolving a name: When a name is referenced, Python will look it up in the following order:
1. The Local (function) scope
2. The Enclosing function scope
3. The Global (module) scope, and
4. The Built-in scope

Passing arguments
• The names in your argument list become new local names (local variables) and arguments are passed to a function by assigning objects to local names; that means the variable names in the argument list are assigned references to the objects you specify (this is VERY different from what you will see in R)
• For immutable objects like numbers or strings, this is safe enough (remember, immutable objects cannot be changed in place, so the caller’s object is never altered); for mutable objects like lists and dictionaries, this can produce some unexpected consequences...

>>> from random import normalvariate, lognormvariate
>>> # now let's try passing a mutable object...
...
>>> def vnoise(x):                 # a vector version
...     y = [a+normalvariate(0,1) for a in x]
...     return y
...
>>>
>>> x = range(5)
>>> vnoise(x)
[-1.5723386658881426, 2.296002316306496, 3.2770955564939332, 4.131264879693449, 2.9934905053231757]
>>>
>>> def vnoise(x):
...     y = [a+normalvariate(0,1) for a in x]
...     x[1] = "YIKES"
...     return y
...
>>> vnoise(x)
[0.32935092382450726, 1.8960418070905316, 1.1232901548877434, 3.751933140620686, 4.609638038164722]
>>> x
[0, 'YIKES', 2, 3, 4]

Passing arguments
• With unexpected consequences like these, it’s important to adopt a good coding style; you probably want to avoid having functions change global variables or the mutable objects passed to them (people hate surprises) -- In coding parlance, things like “YIKES” are known as side effects and R, for example, strives to minimize these
• Remember, good coding practice is as much about readability and reliability as it is about efficiency...

Argument matching
• Often, we want to include default values for some of the arguments of our function; these can be convenient and might also serve as a kind of documentation
• In general, our function definition can include both non-keyword as well as keyword arguments, separated into two groups, with non-keyword coming first
• In R, we will see a detailed process whereby arguments are assigned values in a given function call -- Python takes a simpler, constrained approach, throwing an error if you break the rules
• When you call a function in Python, you first specify values for your non-keyword arguments, followed by some collection of your keyword arguments (in any order) -- Python also allows for the equivalent of ‘...’, but uses separate collections (a tuple and a dictionary) for the non-keyword and keyword arguments

>>> from random import normalvariate, lognormvariate
>>> def noise(x,mu=0,sig=1):
...     y = x+normalvariate(mu,sig)
...     return y
...
>>> noise(3,5,1)
7.266460438279932
>>> noise(x=3,sig=1,mu=5)
7.796645622344021
>>> noise(3,sig=2)
6.001828223964036
>>>
>>> def noise(x,*junk,**named_junk):
...     print "junk: ", junk
...     print "junk: ",type(junk)
...     print "junk: ",named_junk
...     print "junk: ",type(named_junk)
...     return normalvariate(0,1)
...
>>> noise(3,17,5,w="hi",z="low")
junk:  (17, 5)
junk:  <type 'tuple'>
junk:  {'z': 'low', 'w': 'hi'}
junk:  <type 'dict'>
-0.4917423386765775

# noise(3,5,1)           call by position
# noise(x=3,sig=1,mu=5)  call by name
# noise(3,sig=2)         using defaults
# the * catches unnamed things in a tuple
# the ** catches named things in a dictionary

Catching the result
• Finally, all of the functions we’ve defined today explicitly return something with a return statement -- When a return statement is not present, the function will instead return the value None after it has completed its computation
• Recall that None is an object with type NoneType and is treated as false in a Boolean test

Double duty
• With a simple device, we can have some code we have created act either as a module (sharing computations) or as a standalone program

cocteau@homework:~$ cat some_math.py
#!/usr/local/bin/python
def square(x): return x*x
if __name__ == '__main__':
    print "test: square(35) = ",square(35)
cocteau@homework:~$ python some_math.py
test: square(35) =  1225
cocteau@homework:~$ python
Python 2.7 (r27:82500, Oct 10 2010, 16:27:47)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import some_math
>>> some_math.square(5)
25

Behind the scenes
• While we’re here, it’s worth commenting on what Python is doing when you execute a file -- There are a few basic steps before it starts working on your tasks
1. Byte code compilation: Python translates your source code into another format known as byte code, a platform-independent set of instructions
2. PVM: Once byte compiled, your code is then handed over to the Python virtual machine (PVM), the Python runtime engine; this is where your code is actually executed

cocteau@homework:~$ cp /data/text_examples/some_math.py .
cocteau@homework:~$ ls some_math*
some_math.py
cocteau@homework:~$ python
Python 2.7 (r27:82500, Oct 10 2010, 16:27:47)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import some_math
>>> type(some_math.square)
<type 'function'>
>>> some_math.square(5)
25
>>> ^D
cocteau@homework:~$ ls -l some_math*
-rwxr-xr-x 1 cocteau cocteau 122 2010-10-20 19:31 some_math.py
-rw-r--r-- 1 cocteau cocteau 301 2010-10-20 19:32 some_math.pyc
cocteau@homework:~$ hexdump some_math.pyc
0000000 f303 0a0d 43ad 4cbf ...
...
000012d

Debugging
• Python has a simple facility (that is reminiscent of many similar tools for other languages) that helps you assess what is going on when your program runs -- By now you’ve had the experience of typing commands into the interactive shell, then collecting them into a file and executing a program
• Inevitably, as you work with that program, some program, you’ll come across conditions that cause your computations to fail in some way -- Even if you’re very careful, code that you take from others may not be crafted with the same level of caution
• Of course a simple approach to debugging is to just insert print statements everywhere in your code -- I’ll admit it, sometimes you’ll catch me doing this depending on the complexity of the task

Debugging
• Python’s built-in debugger, PDB, provides us a more formal technique for investigating a running program -- PDB creates (yet another) shell with its own commands for working with lines of code
• b (“break”) to set a breakpoint
• cl (“clear”) a breakpoint
• tbreak to set a one-time breakpoint
• ignore to specify that a certain breakpoint will be ignored the next k times, where k is specified in the command
• l (“list”) to list some lines of source code
• n (“next”) to step to the next line, not stopping in function code if the current line is a function call
• s (“step”) same as n, except that the function is entered in the case of a call
• c (“continue”) to continue until the next breakpoint
• w (“where”) to get a stack report
• u (“up”) to move up a level in the stack, e.g. to query a local variable there
• d (“down”) to move down a level in the stack
• r (“return”) continue execution until the current function returns
• j (“jump”) to jump to another line without the intervening code being executed
• h (“help”) to get (minimal) online help (e.g. h b to get help on the b command, and simply h to get a list of all commands); type h pdb to get a tutorial on PDB
• q (“quit”) to exit PDB

cocteau@homework:~$ cp /data/text_examples/debug_test.py .
cocteau@homework:~$ /usr/local/lib/python2.7/pdb.py debug_test.py
> /home/cocteau/test.py(5)<module>()
-> import re
(Pdb) l
  1     #!/usr/local/bin/python
  2
  3     #import pdb
  4
  5  -> import re
  6     from BeautifulSoup import BeautifulStoneSoup
  7
  8     recipe_file = "/data/text_examples/1985/01/02/186946.sgml"
  9     bs = BeautifulStoneSoup(open(recipe_file))
 10
 11     #pdb.set_trace()
(Pdb) n
> /home/cocteau/test.py(6)<module>()
-> from BeautifulSoup import BeautifulStoneSoup
(Pdb) n
> /home/cocteau/test.py(8)<module>()
-> recipe_file = "/data/text_examples/1985/01/02/186946.sgml"
(Pdb) n
> /home/cocteau/test.py(9)<module>()
-> bs = BeautifulStoneSoup(open(recipe_file))
(Pdb) p recipe_file
'/data/text_examples/1985/01/02/186946.sgml'
(Pdb) l
  4
  5     import re
  6     from BeautifulSoup import BeautifulStoneSoup
  7
  8     recipe_file = "/data/text_examples/1985/01/02/186946.sgml"
  9  -> bs = BeautifulStoneSoup(open(recipe_file))
 10
 11     #pdb.set_trace()
 12
 13     word_count = 0
 14
(Pdb) l 20
 15     for p in bs.findAll("p"):
 16
 17         line = p.getText()
 18         line = re.sub("\s+"," ",line)
 19         line = line.strip()
 20         line = line.lower()
 21
 22         for w in line.split(" "):
 23
 24             w = re.sub("\W","",w)
 25
(Pdb) b 24
Breakpoint 1 at /home/cocteau/test.py:24
(Pdb) c
> /home/cocteau/test.py(24)<module>()
-> w = re.sub("\W","",w)
(Pdb) l
 19         line = line.strip()
 20         line = line.lower()
 21
 22         for w in line.split(" "):
 23
 24 B->         w = re.sub("\W","",w)
 25
 26             if w:
 27
 28                 word_count += 1
 29                 print w
(Pdb) n
> /home/cocteau/test.py(26)<module>()
-> if w:
(Pdb) p w
u'while'
(Pdb) c
while
> /home/cocteau/test.py(24)<module>()
-> w = re.sub("\W","",w)
(Pdb) n
> /home/cocteau/test.py(26)<module>()
-> if w:
(Pdb) p w
u'there'
(Pdb) cl 1
Deleted breakpoint 1
(Pdb) !x = "arbitrary python statements prefaced by a !"
(Pdb) p x
(Pdb) h

Documented commands (type help <topic>):
========================================
EOF    bt         cont      enable  jump  pp       run      unt
a      c          continue  exit    l     q        s        until
alias  cl         d         h       list  quit     step     up
args   clear      debug     help    n     r        tbreak   w
b      commands   disable   ignore  next  restart  u        whatis
break  condition  down      j       p     return   unalias  where

Miscellaneous help topics:
==========================
exec  pdb

Undocumented commands:
======================
retval  rv

(Pdb)

Debugging
• When debugging, one often employs a strategy of divide and conquer -- That is, you first check to see if everything is OK in the first half of your program, and, if so, check the 3/4 point, otherwise check the 1/4 point
• In short, a debugging program won’t tell you what your bug is, but it can help you find out where it is
• There are various visual or GUI-based extensions to PDB and IPython has a very clean debugger built-in -- This kind of tool can help you scrape through a piece of code more efficiently than the old stand-by of inserting print commands everywhere

Programming defensively
• Python offers a simple construction that allows you to catch errors and handle them in your running code -- Not every error should cause your program to exit; some produce situations you can recover from
• For example, if we are pulling data from the web, we might occasionally encounter a network error -- In that case we might want to have our program respond by sleeping for a little while and trying the access again later
• The try/except/finally structure allows you to trap various kinds of “exceptions” that are raised while your code executes -- Let’s start with a simple arithmetic error

>>> x = 5
>>> y = 0
>>> try:
...     z = x/y
... except ZeroDivisionError:      # here we are looking for a particular exception
...     print "divide by zero"
...
divide by zero
>>> try:
...     x/y
... except ZeroDivisionError, e:   # here we are catching an “exception” object
...     z = e
...
>>> print z
integer division or modulo by zero
>>> type(z)
<type 'exceptions.ZeroDivisionError'>
>>> try:
...     x/y
... except:                        # catch any error
...     print 'a problem'
... else:
...     print 'it worked!'
...
a problem

Programming defensively
• You can handle exceptions in a nested way, testing for more specific errors first, and ending with the more general -- The finally clause provides code that’s executed no matter what happened inside the other blocks

try:
    block-1 ...
except Exception1:
    handler-1 ...
except Exception2:
    handler-2 ...
else:
    else-block
finally:
    final-block

• You are also able to raise exceptions in your code, allowing your modules to propagate exceptions so that your users can handle them as they see fit

Data formats
• XML is not the only data format out there, particularly with the advent of client-side tools like JavaScript (a language that runs in your browser and was originally meant to let programmers work with “pages” displayed by the Netscape Navigator; what kinds of objects might this language “expose”? What methods?)
• JSON (JavaScript Object Notation) is billed as a “light-weight data-interchange format that is easy for humans to read and write”; why might a program running in your browser need to send and receive data?
• As a format, JSON uses conventions that are familiar to users of languages like C or, as luck would have it, Python; here’s what you get when you request the Twitter public timeline “page” in JSON*
* http://twitter.com/statuses/public_timeline.json

% curl http://twitter.com/statuses/public_timeline.json > ptl.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29909  100 29909    0     0  73186      0 --:--:-- --:--:-- --:--:-- 91745
% head ptl.json
[{"place":null,"contributors":null,"coordinates":null,"truncated":false,
"in_reply_to_screen_name":null,"geo":null,"retweeted":false,
"source":"<a href=\"http://twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>",
"created_at":"Mon Oct 18 21:14:45 +0000 2010","in_reply_to_status_id":null,
"user":{"geo_enabled":false,"friends_count":0,"profile_text_color":"333333",
"description":null,"contributors_enabled":false,"profile_background_tile":false,
"favourites_count":0,"profile_link_color":"0084B4","listed_count":5,"verified":false,
"profile_sidebar_fill_color":"DDEEF6","url":null,"follow_request_sent":null,
"notifications":null,"time_zone":null,"lang":"en",
"created_at":"Tue Jun 08 11:52:29 +0000 2010","profile_sidebar_border_color":"C0DEED",
"profile_image_url":"http://s.twimg.com/a/1287010001/images/default_profile_2_normal.png",
"location":null,"protected":false,"profile_use_background_image":true,
"screen_name":"2webtraffic","name":"web traffic","show_all_inline_media":false,
"following":null,"profile_background_color":"C0DEED","followers_count":734,
"id":153380152,"statuses_count":14544,
"profile_background_image_url":"http://s.twimg.com/a/1287010001/images/themes/theme1/bg.png",
"utc_offset":null},"retweet_count":null,"favorited":false,"id":27769891000,
"in_reply_to_user_id":null,"text":"How To Get Traffic From Social Networks | Social
Marketing Tips: Social networking could be defined as an online ...
http://bit.ly/a3fGE3"},
{"place":null,"contributors":null,"coordinates":null,"truncated":false,
"in_reply_to_screen_name":null,"geo":null,"retweeted":false,"source":"web",
"created_at":"Mon Oct 18 21:14:43 +0000 2010","in_reply_to_status_id":null,
"user":{"statuses_count":

[
 {"place":null,
  "contributors":null,
  "coordinates":null,
  "truncated":false,
  "in_reply_to_screen_name":null,
  "geo":null,
  "retweeted":false,
  "source":"<a href=\"http://twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>",
  "created_at":"Mon Oct 18 21:14:45 +0000 2010",
  "in_reply_to_status_id":null,
  "user":{
     "geo_enabled":false,
     "friends_count":0,
     "profile_text_color":"333333",
     "description":null,
     "contributors_enabled":false,
     "profile_background_tile":false,
     "favourites_count":0,
     "profile_link_color":"0084B4",
     "listed_count":5,
     ...
     "utc_offset":null},
  "retweet_count":null,
  "favorited":false,
  "id":27769891000,
  "in_reply_to_user_id":null,
  "text":"How To Get Traffic From Social Networks | Social Marketing Tips: Social networking could be defined as an online ... http://bit.ly/a3fGE3"},
 ...
]

Look familiar?

Python - JSON
• As you might expect, a JSON object has a (relatively) direct translation into Python built-in types (numbers, strings, dictionaries, lists) -- For this reason, it is exceedingly popular as a tool for storing data
• As we will see in a later lecture, there are also very efficient databases for storing, indexing and retrieving JSON strings -- One such offering is MongoDB, something we’ll work with once our recipes are done
• How might this help us?

% curl http://twitter.com/statuses/public_timeline.json > ptl.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30438  100 30438    0     0  67911      0 --:--:-- --:--:-- --:--:-- 85500
% python
Python 2.7 (r27:82500, Oct 10 2010, 16:27:47)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> f = open("ptl.json")
>>> tweets = json.loads(f.readline())
>>> type(tweets)
<type 'list'>
>>> type(tweets[0])
<type 'dict'>
>>> tweets[0].keys()
['favorited', 'contributors', 'truncated', 'text', 'created_at', 'retweeted', 'coordinates',
'source', 'in_reply_to_status_id', 'in_reply_to_screen_name', 'user', 'place', 'retweet_count',
'geo', 'id', 'in_reply_to_user_id']
>>> tweets[0]['text']
u'Agora sim! Tudo bem, alvinegra? Mudou de foto, n\xe9? Gostei! ;) / @ManaPinheiro: @_OMaisQuerido Algu\xe9\xe9m ai?'
>>> original = json.dumps(tweets)   # convert it back to a string (and write to a file, say)

MongoDB
• Once created, JSON strings can be easily stored in one of several so-called NoSQL databases -- MongoDB is one example, and one that’s running on our homework machine
• The next few slides have instructions about how to make use of Mongo, but please contact me before you do -- Right now Mongo is running without authentication (without any notion of users) and it’s easy for someone to overwrite your work
• But if you want to take your homework assignments one step further, you can use Mongo to store recipes and issue simple searches...

>>> import pymongo, re
>>> rec1 = {"name":"venison and eggs",
...         "instructions":["mix well","bake","don't cut yourself"],
...         "ingredients":["3 eggs","some milk","venison!"]}
>>> rec2 = {"name":"venison and pasta",
...         "instructions":["chop","sift","chop again"],
...         "ingredients":["linguini","some milk","venison!"]}
>>> rec3 = {"name":"cheese and pasta",
...         "instructions":["stir","whisk"],
...         "ingredients":["linguini","american cheese"]}
>>> conn = pymongo.Connection()     # connect to the db
>>> type(conn)
<class 'pymongo.connection.Connection'>
>>> db = conn.mh_test               # create a new database
>>> recipes = db.fist_recipes       # create a collection in the database
>>> recipes.insert(rec1)
>>> recipes.insert(rec2)
>>> recipes.insert(rec3)
>>> # retrieving data from the db
>>> recipes.find_one()
{u'instructions': [u'mix well', u'bake', u"don't cut yourself"], u'_id': ObjectId('4cbf5ca51658f72264000000'), u'name': u'venison and eggs', u'ingredients': [u'3 eggs', u'some milk', u'venison!']}
>>> ven = re.compile(".*venison.*")
>>> for r in recipes.find({"name":ven}): print r
...
{u'instructions': [u'mix well', u'bake', u"don't cut yourself"], u'_id': ObjectId('4cbf5ca51658f72264000000'), u'name': u'venison and eggs', u'ingredients': [u'3 eggs', u'some milk', u'venison!']}
{u'instructions': [u'chop', u'sift', u'chop again'], u'_id': ObjectId('4cbf5ca51658f72264000001'), u'name': u'venison and pasta', u'ingredients': [u'linguini', u'some milk', u'venison!']}

Your next homework
• For Wednesday, I want you to read a chapter in Saltzer and Kaashoek on Systems and Complexity -- I’ll scan the chapter and put it up on our course Moodle page
• For our next assignment, we are going to build a system -- Specifically we are going to build something called Shazam, a music tagging system

Shazam
• The algorithm behind Shazam is fairly straightforward -- A time-frequency decomposition is performed examining which frequencies are dominant at which times in a song
• The peaks in this map form a kind of constellation -- Relationships between the individual elements are then encoded using something called geometric hashing (don’t worry about this yet)
• Given a sample of audio, the same process is repeated and a search is made to see if there are matching patterns of peaks...
The goal
• The goal of this assignment is to have you design a system -- Each group will implement an end-to-end system that takes in an audio file, computes the needed decompositions and hashes and performs the match
• Your job is to divide the tasks and design a way to hand data back and forth -- The emphasis is on cooperation, coordination, modularity
• For Wednesday, you are to read the book chapter, the original Shazam article and meet with your group to think about the basic system components -- You’ll then write up a short proposal for how work should proceed
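As a starting point for thinking about the "decomposition" component, here is a toy sketch of a time-frequency decomposition that records the dominant frequency in each short frame of a signal. It is written for Python 3 with only the standard library, uses a naive discrete Fourier transform, and keeps a single peak per frame; it is not the Shazam algorithm itself, and the function names, frame size and synthetic "song" are all assumptions made for illustration.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each DFT bin up to the Nyquist frequency (naive O(n^2) version)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(total))
    return mags


def dominant_frequencies(signal, sample_rate, frame_size):
    """For each non-overlapping frame, return (start_time_in_seconds, peak_frequency_in_hz)."""
    peaks = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        mags = dft_magnitudes(signal[start:start + frame_size])
        # Skip bin 0 (the constant offset) and pick the strongest remaining bin
        k = max(range(1, len(mags)), key=lambda i: mags[i])
        peaks.append((start / sample_rate, k * sample_rate / frame_size))
    return peaks


def tone(hz, n_samples, sample_rate):
    """A pure sine tone, standing in for real audio samples."""
    return [math.sin(2 * math.pi * hz * t / sample_rate) for t in range(n_samples)]


# Toy "song": half a second at 100 Hz followed by half a second at 250 Hz
RATE = 1000
song = tone(100, 500, RATE) + tone(250, 500, RATE)
constellation = dominant_frequencies(song, RATE, frame_size=100)

for start_time, freq in constellation:
    print(start_time, freq)
```

A real system would more likely use a fast Fourier transform, overlapping frames and several peaks per frame; how those peaks are then hashed and matched is left to your group's design.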