Lecture 8: A cure for what ails you
When human beings acquired language, we learned
not just how to listen but how to speak. When we
gained literacy, we learned not just how to read but
how to write. And as we move into an increasingly
digital reality, we must learn not just how to use
programs but how to make them.
In the emerging, highly programmed landscape
ahead, you will either create the software or you will
be the software. It’s really that simple: Program, or
be programmed. Choose the former, and you gain
access to the control panel of civilization. Choose the
latter, and it could be the last real choice you get to
make.
Today
•
Today we’ll start by having a look at functions -- Next time we will finish off the
last dangling Python thread, user-defined objects
•
We’ll then cover some coding strategies in Python to deal with uncertainties as
well as a simple Python debugger to assess the operation of your running code
•
We will then spend some time with MongoDB, an option we will revisit later in
the quarter but one that some of you may be interested in for your project --
We’ll close with your next homework assignment!
Functions
•
As you’ve probably seen with this homework, the more you “build” with Python,
the more difficult it becomes to effectively create/maintain/share your code as
one single piece (whether that be a script or a module)
•
You are inevitably led to breaking up computations into smaller units; and, as
with R, a function is Python’s way of letting you group statements that can
be repeated elsewhere in your program
•
With that in mind, you should anticipate that the specification for a function will
need to define the group of statements you want to consider; specify the
variables you want to involve in your computation; and return a result
Function definition
•
As we saw last time, functions are created with the def statement; it defines
another block of code that makes up the function’s body (either an
indented set of statements or a simple statement after the colon)
•
The def statement is a kind of assignment in that it associates the function’s
name with an object of type function; the def statement can occur anywhere
(well, anywhere a statement can) and the named functions are defined at
“runtime” (meaning the function is created when you execute your code)
•
Let’s see how this works...
>>> from random import normalvariate, lognormvariate
>>> def noise(x):                      # defining functions
...     return x+normalvariate(0,1)
...
>>> noise(3)
2.844815682100834
>>> type(noise)                        # noise points to an object of type 'function'
<type 'function'>
>>> clatter = noise
>>> clatter(2)
0.2781754850162974
>>> wild = True                        # assumed flag; the slide uses wild without defining it
>>> if wild:
...     def noise(x):
...         'log-normal noise'
...         return x+lognormvariate(0,1)
... else:
...     def noise(x):
...         'gaussian noise'
...         return x+normalvariate(0,1)
...
>>> noise(5)
5.899560478595187
>>> help(noise)
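The point that def is an ordinary statement, executed at runtime, can be sketched as follows. This is a minimal illustration written for Python 3 (the slides use Python 2.7); the helper make_noise and its wild argument are hypothetical names, and the generator is seeded only to make the run repeatable.

```python
import random

def make_noise(wild):
    # def runs like any other statement, so which body gets bound
    # to the name `noise` is decided when this branch executes
    if wild:
        def noise(x):
            'log-normal noise'
            return x + random.lognormvariate(0, 1)
    else:
        def noise(x):
            'gaussian noise'
            return x + random.normalvariate(0, 1)
    return noise

random.seed(0)
noise = make_noise(wild=True)
clatter = noise                  # a second name for the same function object
print(type(noise).__name__)      # function
print(noise.__doc__)             # log-normal noise
print(clatter is noise)          # True
```

Because clatter and noise name one object, rebinding noise later leaves clatter pointing at the earlier function.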
Identifying data to compute with
•
So that example seemed pretty straightforward; we had a single argument and
assigned it a value when we called the function
•
But almost immediately subtleties arise, and we have to ask questions about
how this assignment is done and how, in general, Python treats variables in
the body of a function
Scoping rules (again)
•
As we will see with R, when we use a name in a program (in an assignment
or a simple expression, say), Python needs to associate that name with
some object
•
The visibility of a name within our program is determined by where it’s assigned
(literally the location of the assignment statement in our code); this visibility is
referred to as a name’s scope
•
To talk about scope and the process that Python uses to associate names with
objects, we need to revisit the concept of a namespace; but first some
examples...
>>> from random import normalvariate, lognormvariate
>>>
>>> def noise(x):
...     z = 5                          # y,z are local variables; they exist within a
...     y = x+normalvariate(z,1)       # namespace created when noise is executed
...     return y
...
>>> noise(3)
8.069331171014415
>>> z                                  # but we can’t find z outside of the body of noise
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'z' is not defined
>>>
>>> def noise(x):                      # let’s create another version of noise, this time
...     y = x+normalvariate(z,1)       # removing the definition of z
...     return y
...
>>> noise(3)                           # oops!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in noise
NameError: global name 'z' is not defined
>>> z = 10                             # now let’s create a variable named z
>>> noise(3)                           # where is python finding it?
13.726079223271851
>>> y = 1
>>> noise(3)
13.630053167817655
>>> y
1
Scoping rules
•
When we first started writing Python code, all our variable assignments took
place at the top level of a module*; that is, their names were part of the
module’s namespace, or the “global scope” of the module and we could
refer to them simply
•
Notice that this sense of “global” is really file-based; that is, when we write
a module, we don’t have to worry about whether someone using our module
has defined variables of the same name in their code
•
With functions, we introduce a nested namespace (a nested scope) that
localizes the names they use so that you can avoid similar kinds of clashes
•
* If it is typed in at the “>>>” prompt, you are in a module called __main__;
otherwise the enclosing module is the file that contains your program
Scoping rules
•
The execution of a function introduces a new namespace for the “local”
variables of the function; all variable assignments in a function store their
names in this local namespace
•
When we look up a variable name (by referring to it in an expression in the
function, say), Python first looks in this local namespace; if it can’t find an
object of that name locally, it starts a search that moves out to (eventually)
the global namespace and then the collection of built-in names
•
During this lookup process, Python will return the object associated with the
first instance of the name it finds; so local names take precedence over
globals
•
Also, the names associated with a function’s namespace are determined when
its definition is executed; they are treated as locals everywhere in the
function, not just after the statements where they are assigned...
>>> # built-in scope
>>> import __builtin__
>>> dir(__builtin__)
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BufferError',
'BytesWarning', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception',
'False', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError',
'ImportWarning', 'IndentationError', 'IndexError', 'KeyError', 'KeyboardInterrupt',
'LookupError', 'MemoryError', 'NameError', 'None', 'NotImplemented', 'NotImplementedError',
'OSError', 'OverflowError', 'PendingDeprecationWarning', 'ReferenceError', 'RuntimeError',
'RuntimeWarning', 'StandardError', 'StopIteration', 'SyntaxError', 'SyntaxWarning',
'SystemError', 'SystemExit', 'TabError', 'True', 'TypeError', 'UnboundLocalError',
'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError',
'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError', '_',
'__debug__', '__doc__', '__import__', '__name__', '__package__', 'abs', 'all', 'any', 'apply',
'basestring', 'bin', 'bool', 'buffer', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod',
'cmp', 'coerce', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir',
'divmod', 'enumerate', 'eval', 'execfile', 'exit', 'file', 'filter', 'float', 'format',
'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int',
'intern', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'long',
'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print',
'property', 'quit', 'range', 'raw_input', 'reduce', 'reload', 'repr', 'reversed', 'round',
'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type',
'unichr', 'unicode', 'vars', 'xrange', 'zip']
>>> from random import normalvariate, lognormvariate
>>> y = 10
>>> def noise(x): print y              # y is not defined in the local namespace
...
>>> noise(3)
10
>>> def noise(x):
...     print y
...     y = x+normalvariate(0,1)
...     return y
...
>>> # y is defined in the local namespace, but only
>>> # assigned after the print statement, hence the error
>>> noise(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in noise
UnboundLocalError: local variable 'y' referenced before assignment
Scoping rules
•
Because def is a statement like any other, we can certainly define functions
within other functions; in that case, our search for variables works its way out
through the enclosing functions
•
Lutz defines the LEGB rule for resolving a name: When a name is referenced,
Python will look it up in the following order:
1. The Local (function) scope
2. The Enclosing function scope
3. The Global (module) scope, and
4. The Built-in scope
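The four steps of the LEGB rule can be sketched with one name, x, defined at several levels. This is an illustrative example written for Python 3; all function names are hypothetical.

```python
x = "global"                      # G: a module-level name

def outer():
    x = "enclosing"               # local to outer, enclosing for the inner functions

    def inner_local():
        x = "local"               # L: the assignment makes x local to inner_local
        return x

    def inner_enclosing():
        return x                  # E: no local x, so outer's x is found

    return inner_local(), inner_enclosing()

def use_global():
    return x                      # no local or enclosing x, so the module's x is found

def use_builtin():
    return len("abc")             # B: len is found only in the built-in scope

print(outer())         # ('local', 'enclosing')
print(use_global())    # global
print(use_builtin())   # 3
```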
Passing arguments
•
The names in your argument list become new local names (local variables) and
arguments are passed to a function by assigning objects to local names; that
means the variable names in the argument list are assigned references to
the objects you specify (this is VERY different from what you will see in R)
•
For immutable objects like numbers or strings, this is safe enough
(remember, an immutable object can’t be changed in place, so “changing” it
inside a function just points the local name at a new object); for mutable
objects like lists and dictionaries, this can produce
some unexpected consequences...
>>> from random import normalvariate, lognormvariate
>>> # now let's try passing a mutable object...
...
>>> def vnoise(x):                     # a vector version
...     y = [a+normalvariate(0,1) for a in x]
...     return y
...
>>>
>>> x = range(5)
>>> vnoise(x)
[-1.5723386658881426, 2.296002316306496, 3.2770955564939332, 4.131264879693449,
2.9934905053231757]
>>>
>>> def vnoise(x):
...     y = [a+normalvariate(0,1) for a in x]
...     x[1] = "YIKES"
...     return y
...
>>> vnoise(x)
[0.32935092382450726, 1.8960418070905316, 1.1232901548877434, 3.751933140620686,
4.609638038164722]
>>> x
[0, 'YIKES', 2, 3, 4]
Passing arguments
•
With unexpected consequences like these, it’s important to adopt a good
coding style; you probably want to avoid having functions change global
variables (people hate unexpected surprises) -- In coding parlance, things
like “YIKES” are known as side effects and R, for example, strives to
minimize these
•
Remember, good coding practice is as much about readability and reliability
as it is about efficiency...
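One common way to avoid side effects like “YIKES” is to work on a copy of the argument and return the result. The sketch below is written for Python 3, and the function names are hypothetical; it also shows why immutable arguments are not affected.

```python
def tag_in_place(x):
    x[1] = "YIKES"          # x names the caller's list, so the caller sees this

def tag_copy(x):
    y = list(x)             # shallow copy: a new list holding the same items
    y[1] = "YIKES"
    return y

def bump(n):
    n = n + 1               # rebinds the local name only; ints are immutable
    return n

data = [0, 1, 2, 3, 4]
tagged = tag_copy(data)
print(data)                 # [0, 1, 2, 3, 4]
tag_in_place(data)
print(data)                 # [0, 'YIKES', 2, 3, 4]

count = 10
bump(count)
print(count)                # 10
```

Note that list(x) copies only the outer list; nested mutable items would still be shared with the caller.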
Argument matching
•
Often, we want to include default values for some of the arguments of our
function; these can be convenient and might also serve a kind of
documentation service
•
In general, our function definition can include both non-keyword as well as
keyword arguments, separated into two groups, with non-keyword
coming first
•
In R, we will see a detailed process whereby arguments are assigned values
in a given function call -- Python takes a simpler, constrained approach,
throwing an error if you break the rules
•
When you call a function in Python, you first specify values for your
non-keyword arguments, followed by some collection of your keyword arguments (in
any order) -- Python also allows for the equivalent of ‘...’, but uses separate
containers for the extra non-keyword arguments (a tuple) and keyword arguments (a dictionary)
>>> from random import normalvariate, lognormvariate
>>> def noise(x,mu=0,sig=1):
...     y = x+normalvariate(mu,sig)
...     return y
...
>>> noise(3,5,1)                       # call by position
7.266460438279932
>>> noise(x=3,sig=1,mu=5)              # call by name
7.796645622344021
>>> noise(3,sig=2)                     # using defaults
6.001828223964036
>>>
>>> def noise(x,*junk,**named_junk):   # the ** catches named things in a dictionary
...     print "junk: ", junk           # the * catches unnamed things in a tuple
...     print "junk: ",type(junk)
...     print "junk: ",named_junk
...     print "junk: ",type(named_junk)
...     return normalvariate(0,1)
...
>>> noise(3,17,5,w="hi",z="low")
junk: (17, 5)
junk: <type 'tuple'>
junk: {'z': 'low', 'w': 'hi'}
junk: <type 'dict'>
-0.4917423386765775
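The “throwing an error if you break the rules” part can be sketched by making a few bad calls and catching what Python raises. This example is written for Python 3; the helper outcome is a hypothetical name, and the exact error messages vary between Python versions, so they are not reproduced here.

```python
from random import normalvariate

def noise(x, mu=0, sig=1):
    return x + normalvariate(mu, sig)

def outcome(call):
    # run a zero-argument callable and report how Python reacted
    try:
        call()
    except TypeError as e:
        return "TypeError: %s" % e
    return "ok"

print(outcome(lambda: noise(3, sig=2)))      # ok
print(outcome(lambda: noise()))              # x has no default and was not given
print(outcome(lambda: noise(3, x=4)))        # x is matched twice
print(outcome(lambda: noise(3, sigma=2)))    # sigma is not a parameter of noise

# putting a keyword argument before a positional one is rejected even
# earlier, when the source is compiled, so it is checked here with compile()
try:
    compile("noise(sig=2, 3)", "<example>", "eval")
    ordering = "ok"
except SyntaxError:
    ordering = "SyntaxError"
print(ordering)                              # SyntaxError
```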
Catching the result
•
Finally, all of the functions we’ve defined today explicitly return something with
a return statement -- When a return statement is not present, the function
will instead return the value None after it has completed its computation
•
Recall that None is an object with type NoneType and will evaluate to False
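A short sketch of a function with no return statement, written for Python 3; the function name shout is hypothetical.

```python
def shout(msg):
    print(msg.upper())            # does its work but has no return statement

result = shout("hello")           # prints HELLO
print(result is None)             # True
print(type(result).__name__)      # NoneType
print(bool(result))               # False

if not result:
    print("None is treated as false in a condition")
```

Since None is false in a condition, a test like `if result:` cannot tell a missing return from a returned 0 or empty list; comparing with `is None` is the more explicit check.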
Double duty
•
With a simple device, we can have some code we have created act either as a
module (sharing computations) or as a standalone program
•
cocteau@homework:~$ cat some_math.py
#!/usr/local/bin/python
def square(x): return x*x
if __name__ == '__main__':
    print "test: square(35) = ",square(35)
cocteau@homework:~$ python some_math.py
test: square(35) =  1225
cocteau@homework:~$ python
Python 2.7 (r27:82500, Oct 10 2010, 16:27:47)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import some_math
>>> some_math.square(5)
25
Behind the scenes
•
While we’re here, it’s worth commenting on what Python is doing when
you execute a file -- There are a few basic steps before it starts working on
your tasks
1.
Byte code compilation: Python translates your source code into another
format known as byte code, a platform independent translation to byte
code instructions
2.
PVM: Once byte compiled, your code is then handed over to the Python
virtual machine (PVM), the Python runtime engine; this is where your
code is actually executed
cocteau@homework:~$ cp /data/text_examples/some_math.py .
cocteau@homework:~$ ls some_math*
some_math.py
cocteau@homework:~$ python
Python 2.7 (r27:82500, Oct 10 2010, 16:27:47)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import some_math
>>> type(some_math.square)
<type 'function'>
>>> some_math.square(5)
25
>>> ^D
cocteau@homework:~$ ls -l some_math*
-rwxr-xr-x 1 cocteau cocteau 122 2010-10-20 19:31 some_math.py
-rw-r--r-- 1 cocteau cocteau 301 2010-10-20 19:32 some_math.pyc
cocteau@homework:~$ hexdump some_math.pyc
0000000 f303 0a0d 43ad 4cbf 0063 0000 0000 0000
0000010 0200 0000 4000 0000 7300 002b 0000 0064
0000020 8400 0000 005a 6500 0001 0164 6b00 0002
0000030 2772 6400 0002 6547 0000 0364 8300 0001
0000040 4847 006e 6400 0004 2853 0005 0000 0163
0000050 0000 0100 0000 0200 0000 4300 0000 7300
0000060 0008 0000 007c 7c00 0000 5314 0128 0000
0000070 4e00 0028 0000 2800 0001 0000 0174 0000
0000080 7800 0028 0000 2800 0000 0000 0c73 0000
0000090 7300 6d6f 5f65 616d 6874 702e 7479 0006
00000a0 0000 7173 6175 6572 0003 0000 0073 0000
00000b0 7400 0008 0000 5f5f 616d 6e69 5f5f 1373
00000c0 0000 7400 7365 3a74 7320 7571 7261 2865
00000d0 3533 2029 203d 2369 0000 4e00 0228 0000
00000e0 5200 0001 0000 0874 0000 5f00 6e5f 6d61
00000f0 5f65 285f 0000 0000 0028 0000 2800 0000
0000100 0000 0c73 0000 7300 6d6f 5f65 616d 6874
0000110 702e 7479 0008 0000 6d3c 646f 6c75 3e65
0000120 0003 0000 0473 0000 0900 0c02 0002
000012d
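The byte code can also be inspected from inside Python rather than with hexdump. The sketch below assumes Python 3 and uses the standard dis module; the instruction names it prints differ from one Python version to the next, so no listing is shown.

```python
import dis

def square(x):
    return x * x

# the compiled form of a function lives on its code object
code = square.__code__
print(type(code).__name__)            # code
print(len(code.co_code) > 0)          # True: the raw byte code instructions

# dis turns those bytes into readable instruction names for the PVM
names = [ins.opname for ins in dis.get_instructions(square)]
print(names)
```

Calling dis.dis(square) prints the same instructions as a formatted listing.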
Debugging
•
Python has a simple facility (that is reminiscent of many similar tools for other
languages) that helps you assess what is going on when your program runs
-- By now you’ve had the experience of typing commands into the interactive
shell, then collecting them into a file and executing a program
•
Inevitably, as you work with that program, some program, you’ll come across
conditions that cause your computations to fail in some way -- Even if you’re
very careful, code that you take from others may not be crafted with the same
level of caution
•
Of course a simple approach to debugging is to just insert print statements
everywhere in your code -- I’ll admit it, sometimes you’ll catch me doing this
depending on the complexity of the task
Debugging
•
Python’s built-in debugger, PDB, provides us with a more formal technique for
investigating a running program -- PDB creates (yet another) shell with its
own commands for working with lines of code
•
b (“break”) to set a breakpoint
•
cl (“clear”) a breakpoint
•
tbreak to set a one-time breakpoint
•
ignore to specify that a certain breakpoint will be ignored the next k times, where k is
specified in the command
•
l (“list”) to list some lines of source code
•
n (“next”) to step to the next line, not stopping in function code if the current line is a
function call
•
s (“step”) same as n, except that the function is entered in the case of a call
•
c (“continue”) to continue until the next break point
•
w (“where”) to get a stack report
•
u (“up”) to move up a level in the stack, e.g. to query a local variable there
•
d (“down”) to move down a level in the stack
•
r (“return”) continue execution until the current function returns
•
j (“jump”) to jump to another line without the intervening code being executed
•
h (“help”) to get (minimal) online help (e.g. h b to get help on the b command, and
simply h to get a list of all commands); type h pdb to get a tutorial on PDB
•
q (“quit”) to exit PDB
cocteau@homework:~$ cp /data/text_examples/debug_test.py .
cocteau@homework:~$ /usr/local/lib/python2.7/pdb.py debug_test.py
> /home/cocteau/test.py(5)<module>()
-> import re
(Pdb) l
  1     #!/usr/local/bin/python
  2
  3     #import pdb
  4
  5  -> import re
  6     from BeautifulSoup import BeautifulStoneSoup
  7
  8     recipe_file = "/data/text_examples/1985/01/02/186946.sgml"
  9     bs = BeautifulStoneSoup(open(recipe_file))
 10
 11     #pdb.set_trace()
(Pdb) n
> /home/cocteau/test.py(6)<module>()
-> from BeautifulSoup import BeautifulStoneSoup
(Pdb) n
> /home/cocteau/test.py(8)<module>()
-> recipe_file = "/data/text_examples/1985/01/02/186946.sgml"
(Pdb) n
> /home/cocteau/test.py(9)<module>()
-> bs = BeautifulStoneSoup(open(recipe_file))
(Pdb) p recipe_file
'/data/text_examples/1985/01/02/186946.sgml'
(Pdb) l
  4
  5     import re
  6     from BeautifulSoup import BeautifulStoneSoup
  7
  8     recipe_file = "/data/text_examples/1985/01/02/186946.sgml"
  9  -> bs = BeautifulStoneSoup(open(recipe_file))
 10
 11     #pdb.set_trace()
 12
 13     word_count = 0
 14
(Pdb) l 20
 15     for p in bs.findAll("p"):
 16
 17         line = p.getText()
 18         line = re.sub("\s+"," ",line)
 19         line = line.strip()
 20         line = line.lower()
 21
 22         for w in line.split(" "):
 23
 24             w = re.sub("\W","",w)
 25
(Pdb) b 24
Breakpoint 1 at /home/cocteau/test.py:24
(Pdb) c
> /home/cocteau/test.py(24)<module>()
-> w = re.sub("\W","",w)
(Pdb) l
 19         line = line.strip()
 20         line = line.lower()
 21
 22         for w in line.split(" "):
 23
 24 B->         w = re.sub("\W","",w)
 25
 26             if w:
 27
 28                 word_count += 1
 29                 print w
(Pdb) n
> /home/cocteau/test.py(26)<module>()
-> if w:
(Pdb) p w
u'while'
(Pdb) c
while
> /home/cocteau/test.py(24)<module>()
-> w = re.sub("\W","",w)
(Pdb) n
> /home/cocteau/test.py(26)<module>()
-> if w:
(Pdb) p w
u'there'
(Pdb) cl 1
Deleted breakpoint 1
(Pdb) !x = "arbitrary python statements prefaced by a !"
(Pdb) p x
(Pdb) h
Documented commands (type help <topic>):
========================================
EOF    bt         cont      enable  jump  pp       run      unt
a      c          continue  exit    l     q        s        until
alias  cl         d         h       list  quit     step     up
args   clear      debug     help    n     r        tbreak   w
b      commands   disable   ignore  next  restart  u        whatis
break  condition  down      j       p     return   unalias  where
Miscellaneous help topics:
==========================
exec  pdb
Undocumented commands:
======================
retval  rv
(Pdb)
Debugging
•
When debugging, one often employs a strategy of divide and conquer -- That
is, you first check to see if everything is OK in the first half of your program,
and, if so, check the 3/4 point, otherwise check the 1/4 point
•
In short, a debugging program won’t tell you what your bug is, but it can help
you find out where it is
•
There are various visual or GUI-based extensions to PDB and IPython has a
very clean debugger built-in -- This kind of tool can help you scrape through a
piece of code more efficiently than the old stand-by of inserting print commands
everywhere
Programming defensively
•
Python offers a simple construction that allows you to catch errors and handle
them in your running code -- Not every error should cause your program to exit;
many produce situations you can fix and move on from
•
For example, if we are pulling data from the web, we might occasionally
encounter a network error -- In that case we might want to have our program
respond by sleeping for a little while and trying the access again later
•
The try/except/finally structure allows you to trap various kinds of
“exceptions” that are raised while your code executes -- Let’s start with a
simple arithmetic error
>>> x = 5
>>> y = 0
>>> try:
...     z = x/y
... except ZeroDivisionError:          # here we are looking for a particular exception
...     print "divide by zero"
...
divide by zero
>>>
>>> try:
...     x/y
... except ZeroDivisionError, e:       # here we are catching an “exception” object
...     z = e
...
>>> print z
integer division or modulo by zero
>>> type(z)
<type 'exceptions.ZeroDivisionError'>
>>> try:
...     x/y
... except:                            # catch any error
...     print 'a problem'
... else:
...     print 'it worked!'
...
a problem
Programming defensively
•
You can handle exceptions in a nested way, testing for more specific errors first,
and ending with the more general -- The finally clause provides code that’s
executed no matter what happened inside the code blocks
try:
    block-1 ...
except Exception1:
    handler-1 ...
except Exception2:
    handler-2 ...
else:
    else-block
finally:
    final-block
•
You are also able to raise exceptions in your code, allowing your modules to
propagate exceptions so that your users can handle them as they see fit
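The network example mentioned earlier (sleep for a little while, then try again) can be sketched with the full try/except/else/finally structure. This is an illustration written for Python 3, not a recipe from the course: fetch_with_retry, FetchError and flaky are hypothetical names, and flaky merely stands in for a real network call.

```python
import time

class FetchError(Exception):
    """Hypothetical error raised once every attempt has failed."""

def fetch_with_retry(fetch, attempts=3, delay=0.01):
    # fetch is any zero-argument callable that may raise IOError
    for attempt in range(1, attempts + 1):
        try:
            result = fetch()
        except IOError as e:
            print("attempt %d failed: %s" % (attempt, e))
            time.sleep(delay)            # wait a little, then try again
        else:
            return result                # runs only when no exception was raised
        finally:
            print("attempt %d finished" % attempt)   # runs either way
    # raise our own exception so that callers can decide what to do
    raise FetchError("gave up after %d attempts" % attempts)

# a stand-in for a flaky network call: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("network unreachable")
    return "payload"

print(fetch_with_retry(flaky))           # last line printed: payload
```

Catching only IOError lets unrelated bugs, such as a NameError inside fetch, surface immediately instead of being retried.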
Data formats
•
XML is not the only data format out there, and the advent of client-side
tools like JavaScript brought another (JavaScript is a language that runs in
your browser and was originally
meant to let programmers work with “pages” displayed by the Netscape
Navigator; what kinds of objects might this language “expose”? What
methods?)
•
JSON (JavaScript Object Notation) is billed as a “light-weight data-interchange
format that is easy for humans to read and write”; why might a program running
in your browser need to send and receive data?
•
As a format, JSON uses conventions that are familiar to users of languages like
C or, as luck would have it, Python; here’s what you get when you request the
Twitter public timeline “page” in JSON*
•
* http://twitter.com/statuses/public_timeline.json
% curl http://twitter.com/statuses/public_timeline.json > ptl.json
  % Total    % Received % Xferd  Average Speed   Time     Time     Time  Current
                                 Dload  Upload   Total    Spent    Left  Speed
100 29909  100 29909    0     0  73186      0 --:--:-- --:--:-- --:--:-- 91745
head ptl.json
[{"place":null,"contributors":null,"coordinates":null,"truncated":false,"in_reply_
to_screen_name":null,"geo":null,"retweeted":false,"source":"<a href=\"http://
twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>","created_at":"Mon Oct 18
21:14:45 +0000 2010","in_reply_to_status_id":null,"user":
{"geo_enabled":false,"friends_count":
0,"profile_text_color":"333333","description":null,"contributors_enabled":false,"p
rofile_background_tile":false,"favourites_count":
0,"profile_link_color":"0084B4","listed_count":
5,"verified":false,"profile_sidebar_fill_color":"DDEEF6","url":null,"follow_reques
t_sent":null,"notifications":null,"time_zone":null,"lang":"en","created_at":"Tue
Jun 08 11:52:29 +0000
2010","profile_sidebar_border_color":"C0DEED","profile_image_url":"http://
s.twimg.com/a/1287010001/images/
default_profile_2_normal.png","location":null,"protected":false,"profile_use_backg
round_image":true,"screen_name":"2webtraffic","name":"web
traffic","show_all_inline_media":false,"following":null,"profile_background_color"
:"C0DEED","followers_count":734,"id":153380152,"statuses_count":
14544,"profile_background_image_url":"http://s.twimg.com/a/1287010001/images/
themes/theme1/
bg.png","utc_offset":null},"retweet_count":null,"favorited":false,"id":
27769891000,"in_reply_to_user_id":null,"text":"How To Get Traffic From Social
Networks | Social Marketing Tips: Social networking could be defined as an
online ... http://bit.ly/a3fGE3"},
{"place":null,"contributors":null,"coordinates":null,"truncated":false,"in_reply_t
o_screen_name":null,"geo":null,"retweeted":false,"source":"web","created_at":"Mon
Oct 18 21:14:43 +0000 2010","in_reply_to_status_id":null,"user":{"statuses_count":
[
{"place":null,
"contributors":null,
"coordinates":null,
"truncated":false,
"in_reply_to_screen_name":null,
"geo":null,
"retweeted":false,
"source":"<a href=\"http://twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>",
"created_at":"Mon Oct 18 21:14:45 +0000 2010",
"in_reply_to_status_id":null,
"user":{
"geo_enabled":false,
"friends_count":0,
"profile_text_color":"333333",
"description":null,
"contributors_enabled":false,
"profile_background_tile":false,
"favourites_count":0,
"profile_link_color":"0084B4",
"listed_count":5,
...
"utc_offset":null},
"retweet_count":null,
"favorited":false,
"id":27769891000,
"in_reply_to_user_id":null,
"text":"How To Get Traffic From Social Networks | Social Marketing Tips: Social networking
could be defined as an online ... http://bit.ly/a3fGE3"},
...
]
Look familiar?
Python - JSON
•
As you might expect, a JSON object has a (relatively) direct translation into
Python built-in types (numbers, strings, dictionaries, lists) -- For this reason, it is
exceedingly popular as a tool for storing data
•
As we will see in a later lecture, there are also very efficient databases for
storing, indexing and retrieving JSON strings -- One such offering is MongoDB,
something we’ll work with once our recipes are done
•
How might this help us?
% curl http://twitter.com/statuses/public_timeline.json > ptl.json
  % Total    % Received % Xferd  Average Speed   Time     Time     Time  Current
                                 Dload  Upload   Total    Spent    Left  Speed
100 30438  100 30438    0     0  67911      0 --:--:-- --:--:-- --:--:-- 85500
% python
Python 2.7 (r27:82500, Oct 10 2010, 16:27:47)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> f = open("ptl.json")
>>> tweets = json.loads(f.readline())
>>> type(tweets)
<type 'list'>
>>> type(tweets[0])
<type 'dict'>
>>> tweets[0].keys()
['favorited', 'contributors', 'truncated', 'text', 'created_at', 'retweeted', 'coordinates',
'source', 'in_reply_to_status_id', 'in_reply_to_screen_name', 'user', 'place', 'retweet_count',
'geo', 'id', 'in_reply_to_user_id']
>>> tweets[0]['text']
u'Agora sim! Tudo bem, alvinegra? Mudou de foto, n\xe9? Gostei! ;) / @ManaPinheiro: @_OMaisQuerido
Algu\xe9\xe9m ai?'
>>> original = json.dumps(tweets)      # convert it back to a string (and write to a file, say)
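The translation between JSON and Python built-in types can be sketched without the network. The string below is a tiny hand-written sample in the shape of a tweet, not real Twitter data, and the example is written for Python 3.

```python
import json

# hand-written sample, loosely shaped like one entry of the timeline
raw = '[{"id": 1, "text": "hello", "user": {"name": "a", "verified": false}, "geo": null}]'

tweets = json.loads(raw)                  # JSON text -> Python objects
print(type(tweets).__name__)              # list
print(type(tweets[0]).__name__)           # dict
print(tweets[0]["user"]["verified"])      # False  (JSON false -> Python False)
print(tweets[0]["geo"])                   # None   (JSON null  -> Python None)

original = json.dumps(tweets)             # Python objects -> JSON text
print(json.loads(original) == tweets)     # True
```

The round trip preserves the data, though not necessarily the exact characters of the original string (spacing and key order may differ).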
MongoDB
•
Once created, JSON strings can be easily stored in one of several so-called
NoSQL databases -- MongoDB is one example, and one that’s running on our
homework machine
•
The next few slides have instructions about how to make use of Mongo, but
please contact me before you do -- Right now Mongo is running without
authentication (without any notion of users) and it’s easy for someone to
overwrite your work
•
But if you want to take your homework assignments one step farther, you can
use Mongo to store recipes and issue simple searches...
>>> import pymongo, re
>>> rec1 = {"name":"venison and eggs",
...         "instructions":["mix well","bake","don't cut yourself"],
...         "ingredients":["3 eggs","some milk","venison!"]}
>>> rec2 = {"name":"venison and pasta",
...         "instructions":["chop","sift","chop again"],
...         "ingredients":["linguini","some milk","venison!"]}
>>> rec3 = {"name":"cheese and pasta",
...         "instructions":["stir","whisk"],
...         "ingredients":["linguini","american cheese"]}
>>> conn = pymongo.Connection()        # connect to the db
>>> type(conn)
<class 'pymongo.connection.Connection'>
>>> db = conn.mh_test                  # create a new database
>>> recipes = db.fist_recipes          # create a collection in the database
>>> recipes.insert(rec1)
>>> recipes.insert(rec2)
>>> recipes.insert(rec3)
>>> # retrieving data from the db
>>> recipes.find_one()
{u'instructions': [u'mix well', u'bake', u"don't cut yourself"],
u'_id': ObjectId('4cbf5ca51658f72264000000'), u'name': u'venison and eggs',
u'ingredients': [u'3 eggs', u'some milk', u'venison!']}
>>> ven = re.compile(".*venison.*")
>>> for r in recipes.find({"name":ven}): print r
...
{u'instructions': [u'mix well', u'bake', u"don't cut yourself"],
u'_id': ObjectId('4cbf5ca51658f72264000000'), u'name': u'venison and eggs',
u'ingredients': [u'3 eggs', u'some milk', u'venison!']}
{u'instructions': [u'chop', u'sift', u'chop again'],
u'_id': ObjectId('4cbf5ca51658f72264000001'), u'name': u'venison and pasta',
u'ingredients': [u'linguini', u'some milk', u'venison!']}
Your next homework
•
For Wednesday, I want you to read a chapter in Saltzer and Kaashoek on
Systems and Complexity -- I’ll scan the chapter and put it up on our course
Moodle page
•
For our next assignment, we are going to build a system -- Specifically we are
going to build something called Shazam, a music tagging system
Shazam
•
The algorithm behind Shazam is fairly straightforward -- A time-frequency
decomposition is performed examining which frequencies are dominant at
which times in a song
•
The peaks in this map form a kind of constellation -- Relationships between the
individual elements are then encoded using something called geometric
hashing (don’t worry about this yet)
•
Given a sample of audio, the same process is repeated and a search is made
to see if there are matching patterns of peaks...
The goal
•
The goal of this assignment is to have you design a system -- Each group will
implement an end-to-end system that takes in an audio file, computes the
needed decompositions and hashes and performs the match
•
Your job is to divide the tasks and design a way to hand data back and forth --
The emphasis is on cooperation, coordination, modularity
•
For Wednesday, you are to read the book chapter, the original Shazam article
and meet with your group to think about the basic system components -- You’ll
then write up a short proposal for how work should proceed