Download M02 Notes - The Huttenhower Lab

Monday #2
5m Announcements and questions
Don't forget to send your questions to the web site discussion forum
Problem Set #1 is due by end of today, Problem Set #2 by end of next Monday
Note that readings now begin from the Practical Computing book, linked from web site
Extensive notes also available for many topics, e.g. intro to Python and installation/setup
Also note that Python assignments have a particular, very specific submission format
Detailed on web site handout, will also be covered in lab
10m Background on genomic data manipulation: what is programming?
A program is a set of instructions that tells a computer how to transform some input into some output
Not terribly different from a recipe or a lab protocol
Instead of reagents, entities being manipulated are data, in many forms (numbers, tables, etc.)
You can think of a mathematical function like f(x) = 2x or g(x) = x² + 5 as a "program" that manipulates a value
Likewise a game is a program that translates your input data (mouse + keys) into outputs (shapes + values)
A program is a series of plain text instructions written in a specific programming language
Literally plain text: not a Word document, not a web page, just text
A defined language allows the computer to understand how it should move data in response to instructions
An interpreter is a special program whose input data is another program
That is, it knows how to execute a given set of instructions in order to transform input into output
All programs are "interpreters" to some degree, but dedicated interpreters are exactly that: dedicated
Reminds you that programs are data (and the interpreter itself is also data; freaky!)
Most programs, interpreters included, deal with two kinds of output:
Return values or results are the literal output of an instruction
They're communicated from the instruction to the computer/interpreter, not to you
Again much like the result of a mathematical function; produced, not displayed
Displayed outputs, also called side effects or printed outputs, are shown to you on the screen but are not passed back to the computer
Usually must be requested specifically, e.g. with a command like print
We'll talk more later about how computers display data as opposed to store it internally
To the computer, you can always replace or substitute an instruction with its return value
If I tell a computer to "add five to the result of multiplying two and three," it's equivalent to "add five to six"
And thus also equivalent to 11; this is true for arbitrarily complex instructions or expressions
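A minimal sketch of the difference between a return value and displayed output, and of substitution. The variable names (and the i prefix for integers) are only illustrative.

```python
# A return value is handed back to the interpreter; nothing appears on
# screen unless it is explicitly requested with print.
iResult = 5 + 2 * 3    # the instruction 2 * 3 is replaced by its value, 6
iSame = 5 + 6          # so this instruction is equivalent to the one above

print(iResult)         # displayed output: 11
print(iSame)           # displayed output: 11
```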
5m Programming semantics versus Python syntax
There are many different programming styles
In Python a program consists of lines of code, roughly one instruction per line
Instructions include keywords, operators, functions, data, and whitespace
Keywords are special instructions built into the language
Operators are special punctuation, also built into the language
Functions are complex sets of reusable instructions that perform a specific task; more on that later
Data are discrete values of several possible types, which we'll discuss next
Whitespace comprises the spaces, tabs, and newlines used to organize your code and make it readable
Python especially uses whitespace to organize instructions into lines and lines into blocks
A block is a group of related instructions that share data, context, and environment
Thus some whitespace is very important in Python, and some is ignored
Typically whitespace within a line or between lines is ignored
But whitespace indenting a line is critical for distinguishing blocks
Beware of tabs and spaces at the beginning of lines!
Capitalization and case are always important in Python
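A short sketch of which whitespace matters. The if/else keywords are covered later and are used here only to create indented blocks; all names are illustrative.

```python
iCount = 3
iTotal = iCount+1        # whitespace within a line is ignored...
iSame  =  iCount  +  1   # ...so this computes the same value

if iCount > 2:
    # These indented lines form one block, run only when the test is true
    strSize = "large"
    iCount = 0
else:
    # A second block, run only when the test is false
    strSize = "small"

# Case matters: strSize and strsize would be two different names
print(strSize)
```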
Honor thine errors
Programming languages are strict
When you break English rules, your high school teacher marks your paper in red ink
When you break Python rules, the computer laughs at you and won't do what you ask
Typos, incorrect capitalization/spelling, or misguided semantics will all produce errors instead of results
These can be inscrutable, but read them: they will always give some indication of what went wrong
And often that indication is exactly right, leading you to the specific line or word with a problem
But not always - think about and interpret Python's error statements
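As an illustration, a capitalization typo produces an error that names the offending word. This sketch uses try/except (not covered in these notes) only so that the error message can be captured and shown instead of stopping the program.

```python
strGene = "BRCA1"

try:
    # Typo: strgene (lower-case g) was never defined
    print(strgene)
except NameError as e:
    # The message names the exact word Python did not recognize
    strMessage = str(e)

print(strMessage)
```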
10m Data in Python
Three basic types and two collection types:
Basic or unit types: numbers, strings, and booleans
Collection types: lists (or arrays) and dictionaries (or hashes/hash tables)
Variables
Data are typically stored in variables, which are named storage bins for values
Think of a variable as a labeled bucket, or a tabbed index card
Of several operators that modify variables, assignment (=) is the most important
It would be lame if you instead had to type in all of your data by hand!
Variables provide a place to store non-static data, from files, runtime input, calculations, etc.
Variable buckets can hold one of two things:
A data value, being a specific number, string, truth value, etc.
Or a data reference, which indicates the storage location of one or more other values
Python treats almost all data except basic types as references; more on this later
I will typically name variables with a prefix that indicates what type of data they hold
Referred to as Hungarian notation for odd historical reasons
Watch for combinations of lower case letters prefixing variables as I introduce data types
None is a special value indicating the absence of any data
It has no "type" and can be thought of as an "uninitialized" or "NA" value
Typically stored in variables to indicate the absence of "real" data
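A small sketch of assignment and None. The prefixes (d for a real value, str for a string, a for a list) and the variable names are illustrative.

```python
dWeight = 61.5             # store a real value in the bucket named dWeight
strName = "sample_01"      # store a string in a second bucket
dWeight = dWeight + 1.0    # replace the stored value with a computed one

aReadings = None           # placeholder: no "real" data yet
fMissing = aReadings is None   # True, since the variable holds None

print(dWeight)             # 62.5
print(fMissing)            # True
```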
Numbers
Comprise integers and real values
Integers have no decimal and can be positive, negative, or zero - no surprises
Real values (also floating point or sometimes double precision values) are decimal numbers
These can take values from negative to positive infinity
And can be notated in decimal or scientific notation
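For instance, integers and real values might be written like this (names are illustrative):

```python
iCount = -42        # an integer: no decimal point
dSmall = 0.0015     # a real value in decimal notation
dSame = 1.5e-3      # the same value in scientific notation
dLarge = 6.02e23    # a very large real value

print(type(iCount))   # the int type
print(type(dSmall))   # the float type
```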
Strings
Textual data, consisting of zero or more characters written between quotes
Quotes can be either single apostrophes ' or double quotes " (not backticks `)
Unlike some languages, Python does not differentiate between single and double quotes
You can write special characters in strings using escape sequences that start with a backslash \
Think of these as secret codes that tell Python to include a non-literal character
"cat" means the literal characters c, a, t
"ca\t" means the literal characters c, a, followed by a tab
The most common escape sequences are:
\t for tab, \n for newline, \\ for backslash itself
These are very useful for nicely-formatted output, among other things
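A brief sketch of quotes and escape sequences; the example strings are made up for illustration.

```python
strPlain = "cat"          # three literal characters: c, a, t
strTab = "ca\t"           # also three characters: c, a, and a tab
fSameQuotes = ('cat' == "cat")   # single and double quotes are equivalent

strTable = "gene\tcount\nBRCA1\t12"   # tab-separated columns on two lines
strPath = "C:\\data"                  # \\ stores one literal backslash

print(len(strTab))    # 3
print(strTable)
print(strPath)        # C:\data
```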
Booleans
Also logical or binary or truth values, represent a two-valued categorical outcome
Written as True and False (capitalization critical)
These are the simplest values used to test for the truth or falsehood of a program's result
A computation can return True to indicate that a test of some input succeeded
Or that it's in some defined range, or matches some expected pattern
When using these values to run test instructions:
False, None, 0 (of any sort), the empty string, and any empty collection all count as false values
Anything else is true, not just True: 1, 10, 10.10, "ten", "0", etc.
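The built-in bool( ) function (not otherwise used in these notes) reports how Python would treat a value in a test, which makes these rules easy to check:

```python
# Values that count as false in a test
fZero = bool(0)
fEmptyString = bool("")
fEmptyList = bool([])
fNone = bool(None)

# Values that count as true, including the non-empty string "0"
fNumber = bool(10.10)
fText = bool("ten")
fZeroString = bool("0")

print(fZero, fEmptyString, fEmptyList, fNone)   # False False False False
print(fNumber, fText, fZeroString)              # True True True
```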
15m Collections: organized references to multiple values
Python treats each of these "unit" data types like a discrete object
They can be put into buckets (variables) and computed with
A variable thus always contains exactly one unit data type
Python refers to any data type containing multiple unit data types as a collection
This includes several different ways of organizing data
Collections are always stored as references
This means their containing variable does not contain the "value" of the whole collection
It instead serves as a signpost that "points" to the data in the collection
This won't matter until a bit later, but remember that only unit data types are stored as Python values
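A sketch of why the value/reference distinction eventually matters. It uses a list and the append function, both described later in these notes; names are illustrative.

```python
aFirst = ["a", "b"]
aSecond = aFirst        # copies the signpost, not the list itself
aSecond.append("c")     # changes the one list both names point to
print(aFirst)           # ['a', 'b', 'c']

iFirst = 1
iSecond = iFirst        # a unit value behaves like an independent copy
iSecond += 1
print(iFirst)           # 1
```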
Lists
Also arrays or vectors in other languages
A list is an ordered collection of zero or more values arranged in sequence
Very much like a mathematical vector
A variable containing a list thus acts like a sequence of individual variables
Each contained value is accessed by an integer index starting at zero
Lists are denoted by square brackets [] in Python, with individual elements separated by commas
Individual elements can be any Python data, either unit types or other collections
Thus in ["a", "b", "c"], the 0th element is "a", the 1st element is "b", and so forth
Individual elements are retrieved by accessing a list index by integer, also using brackets
["a", "b", "c"][1] returns "b", also aList[1] if aList = ["a", "b", "c"]
Every list has a length, which might be zero
The empty list, denoted [], has zero length
A list's length is an integer calculated with the len( aList ) function; more on that later
Note that accessing a list element beyond its length will produce an error
For example ["a", "b"][2] or aList[0] if aList = [] will both fail
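The list operations above, put together in one sketch (aNested and the other new names are illustrative):

```python
aList = ["a", "b", "c"]
strFirst = aList[0]       # "a": indices start at zero
strSecond = aList[1]      # "b"
iLength = len(aList)      # 3

aNested = [1, "two", [3.0, 4.0]]   # elements may be of any type
dInner = aNested[2][0]             # 3.0: index the inner list in turn

iEmpty = len([])          # 0: the empty list has zero length
# aList[3] would fail, since the valid indices are 0, 1, and 2
```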
Dictionaries
Also hashes or hash tables in other languages
A dictionary is an unordered collection of zero or more values arranged by keys
Very much like a real dictionary, in which key words are linked to definition values
Keys must be unique; values may or may not be
In other words, every key has at most one value
A dictionary is thus an unordered set of key/value pairs
Each contained value is accessed by an arbitrary valued index that can be any unit data type
Dictionary keys can be numbers or strings; they cannot be other collections
Dictionary values can be any data type at all, including other collections
Dictionaries are denoted by curly brackets {} in Python, with key/value pairs separated by commas
And key/value pairs themselves joined with a colon :
Thus {"a" : 0, "b" : [1], "c" : "two"} contains three key/value pairs of various types
Individual elements are retrieved by accessing a value by key, using square brackets
Thus {"a" : 0, "b":[1]}["a"] returns 0, also hashDict["a"] if hashDict = {etc.}
Every dictionary has a length, which is the number of key/value pairs it contains
Also equivalent to the number of unique keys, calculated using len( hashDict ) as above
The empty dictionary, denoted {}, has zero length
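The dictionary operations above as a short sketch. Assigning to an existing key (last step) is not described above but shows that each key holds only one value.

```python
hashDict = {"a": 0, "b": [1], "c": "two"}
iValue = hashDict["a"]     # 0: look up a value by its key
aValue = hashDict["b"]     # [1]: values can be collections
iLength = len(hashDict)    # 3 key/value pairs

hashDict["a"] = 5          # replaces the old value for key "a"
iAfter = len(hashDict)     # still 3, since keys are unique

iEmpty = len({})           # 0: the empty dictionary
```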
5m Nuts and bolts: software for Python programming
Carla covered installing Python on Macs and PCs during last week's lab
We will continue to get familiar with the environment over the next few weeks
You don't want to write Python using Notepad or Word!
jEdit (or a similar editor) is the easiest way to get up to speed for the homeworks
I recommend TextWrangler for Mac, Notepad++ for Windows
Instructions also available to install and use Eclipse + PyDev, a full-featured programming editor
20m Operators
Special operators
= assignment: variable = instruction stores the return value of instruction in variable
# comment: # text indicates that text is not part of the program and should be ignored by the interpreter
Note that this only applies to the text between the # and the end of the single current line
Triple-quoted strings ("""like this"""), often used as docstrings, can span several lines and provide a way of commenting out multiple lines in Python
() parentheses: group operators (and instructions) together, working very much as in mathematics
in inclusion: variable in collection returns True if variable is in collection, False otherwise
Numerical operators
+ - * / arithmetic: x + y returns the value of adding numbers x and y, other operators likewise
** power: x ** y returns the value of x raised to the power y
+= -= *= /= assignment: x += y is equivalent to x = x + y, other operators likewise
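A quick sketch of the numerical operators, written for Python 3. One caveat worth knowing: in Python 3, / between two integers gives a real value, whereas Python 2 discards the remainder.

```python
iX = 7
iY = 2

iSum = iX + iY           # 9
iProduct = iX * iY       # 14
iPower = iX ** iY        # 49
dQuotient = iX / iY      # 3.5 in Python 3

iX += iY                 # same as iX = iX + iY, so iX is now 9
iGrouped = (1 + 2) * 3   # 9: parentheses group as in mathematics
```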
Logical operators
== != equivalence: x == y returns True if x and y have the same values, False otherwise (!= reversed)
< > <= >= numerical comparison: x < y returns True if x has a lower value than y, other operators likewise
not inversion: not x returns True if x is false, False otherwise
and or logical composition: x and y returns True if x and y are both true, x or y likewise if at least one is true
Technically x and y and z returns the first false value in the sequence (or the last value if all are true), x or y or z the first true value (or the last value if all are false)
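A sketch of the logical operators. The last three lines show what and/or hand back when given non-boolean values: and stops at the first false value, or stops at the first true value.

```python
fEqual = (2 + 2 == 4)              # True
fBoth = (1 < 2) and (2 < 1)        # False: the second test fails
fEither = (1 < 2) or (2 < 1)       # True: the first test succeeds
fNot = not fBoth                   # True

strAnd = "a" and "" and "c"        # "": the first false value
strOr = "" or "b" or "c"           # "b": the first true value
iAllTrue = 1 and 2 and 3           # 3: all true, so the last value
```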
Sequence operators: lists or strings
+ concatenation: x + y returns the string/list of elements from x followed by elements from y
* repetition: x * y returns the string/list of x repeated y times, for y an integer
: slice: x[y:z] returns the substring/sublist of x beginning at index y up to (but not including) index z
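The sequence operators work the same way on strings and lists, as this sketch with made-up values suggests:

```python
strJoined = "gen" + "ome"               # "genome"
aJoined = [1, 2] + [3]                  # [1, 2, 3]

strRepeated = "AT" * 3                  # "ATATAT"
aRepeated = [0] * 4                     # [0, 0, 0, 0]

strSlice = "genome"[1:4]                # "eno": indices 1, 2, 3
aSlice = ["a", "b", "c", "d"][1:3]      # ["b", "c"]: index 3 is excluded

fInside = "b" in ["a", "b", "c"]        # True: the inclusion operator
```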
20m Primitives: built-in functions
Functions and function calls
Functions, also referred to as methods, are reusable blocks of instructions to perform a specific task
They can operate on zero or more input data values of arbitrary types
They can produce zero or one output, called the return value or result of the function
Functions store instructions, but they do not execute them until they're called
Calling a function tells it to run its instructions on a particular set of input values
These inputs are called the arguments that are passed to the function
A typical Python function call looks like this: function_name( arg1, arg2 )
A function call consists of the name, followed by zero or more arguments
The arguments are surrounded by parentheses and separated by commas
Some functions have restrictions on what types of arguments they understand, some don't
Just like operators: + makes sense for both numbers and strings, but ** only for numbers
Some functions produce side effects in addition to (or instead of) a return value
One common example is print( "example" ), which takes one string argument
It then displays it on the screen, but does not return it for further processing
A counterexample is abs( -2 ) → 2, which takes one numeric argument and returns its absolute value
Substitution works for arguments: any place you can pass a data value, you can pass a variable containing it
For example, dValue = -5.6, abs( dValue ) → 5.6
For example, adValues = [-1, 2], abs( adValues[1] ) → 2
And just like mathematical functions, substitutions work for functions as well (composition)
For example, round( abs( -1.8 ) ) → 2.0 (Python 3 returns the integer 2)
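The calls above, run as Python 3. The name noneResult is illustrative; it captures what print hands back to show that displaying a value is not the same as returning it.

```python
dValue = -5.6
dAbsolute = abs(dValue)          # 5.6: a variable passed as the argument

adValues = [-1, 2]
iSecond = abs(adValues[1])       # 2: a list element passed as the argument

iRounded = round(abs(-1.8))      # composition: abs runs first, then round

noneResult = print("example")    # displays example on the screen...
print(noneResult)                # ...but the return value is None
```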
Function targets and objects
Some functions also take a special argument called their target
This special argument is not included between parentheses, and is also the object of the function
Functions can be targeted to a particular object (which is just another data value) using the . operator
For example, the replace function targets a string and takes two additional string arguments
The first is a substring to search for, the second is a new substring to replace it with
Thus "test".replace( "es", "arge" ) → "target"
Just like normal arguments, substitution by variables (or other instructions) works for targets
For example, strValue = "fast", strValue.replace( "a", "ea" ) → "feast"
For example, "test".replace( "e", "a" ).replace( "st", "ste" ) → "taste"
For example, abs( -len( "a" + "bey".replace( "e", "" ) ) ) → 3
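These targeted calls can be checked directly. Note that replace returns a new string and leaves its target unchanged; the result names are illustrative.

```python
strTarget = "test".replace("es", "arge")                  # "target"

strValue = "fast"
strFeast = strValue.replace("a", "ea")                    # "feast"
# strValue itself is still "fast"

# Chained calls: the result of the first replace is the target of the second
strTaste = "test".replace("e", "a").replace("st", "ste")  # "taste"

# Nested calls are evaluated from the inside out: "by", "aby", 3, -3, 3
iNested = abs(-len("a" + "bey".replace("e", "")))
```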
Utility functions
print( x ) display: outputs x to the screen, returning nothing
len( x ) length: returns the number of elements in string, list, or dictionary x
str(x), float(x), int(x) type conversion: returns x coerced into the requested type
cmp( x, y ) comparison: returns -1 if x<y, 1 if x>y, and 0 otherwise (Python 2 only; cmp was removed in Python 3)
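A sketch of the utility functions in Python 3. Since cmp is not available there, the last line uses a common substitute idiom, (x > y) - (x < y), which is an addition to these notes rather than part of them.

```python
strCount = "12"
iCount = int(strCount)         # 12: string converted to integer
dCount = float(strCount)       # 12.0: string converted to real value
strNext = str(iCount + 1)      # "13": number converted to string
iTruncated = int(3.9)          # 3: int drops the fractional part

iLength = len("ACGT")          # 4 characters

# Stand-in for cmp( 2, 5 ): True counts as 1 and False as 0
iCompare = (2 > 5) - (2 < 5)   # -1, because 2 is less than 5
```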
10m More primitive functions
Numeric functions
abs( x ) absolute value: returns the absolute value of number x
round( x ) rounding: returns number x rounded to the nearest integer (as a real number in Python 2; Python 3 returns an integer)
min( x ), max( x ) minimum/maximum: returns smallest/largest value in list x
sum( x ) summation: returns the sum of elements in list x
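The numeric functions applied to a small made-up list:

```python
adValues = [3.5, -1.25, 2.0]

dMin = min(adValues)       # -1.25
dMax = max(adValues)       # 3.5
dSum = sum(adValues)       # 4.25
dAbs = abs(dMin)           # 1.25
iRounded = round(dSum)     # 4 (an integer in Python 3)
```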
String functions
x.strip( ) trimming: returns a copy of string x with all leading and trailing whitespace removed
x.find( y ) substring find: returns the index of substring y within x, or -1 if it's not present
x.replace( y, z ) replacement: returns a copy of string x with every occurrence of y replaced with z
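The string functions on a made-up sequence line. replace changes every occurrence by default; the optional third argument shown at the end limits how many are replaced.

```python
strLine = "  ACGTACGT\n"
strClean = strLine.strip()                # "ACGTACGT"

iFound = strClean.find("GT")              # 2: index of the first match
iMissing = strClean.find("N")             # -1: not present

strRNA = strClean.replace("T", "U")       # "ACGUACGU": both T's replaced
strOnce = strClean.replace("T", "U", 1)   # "ACGUACGT": only the first
```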
List functions
x.append( y ) element addition: extends list x to include element y, returning nothing
x.pop( ) element removal: removes and returns the last element of list x
x.sort( ) sorting: sorts the elements of list x in ascending order, returning nothing
sorted( x ) sorting: returns a copy of list x with elements in sorted ascending order
x.reverse( ) reversal: reverses the order of elements in list x, returning nothing
reversed( x ) reversal: returns an iterator over the elements of list x in reverse order; list( reversed( x ) ) gives a reversed copy
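A sketch contrasting the list functions that change a list in place (and return nothing) with those that return a new result. The ai prefix for a list of integers is illustrative.

```python
aiValues = [3, 1, 2]

aiSorted = sorted(aiValues)             # [1, 2, 3]; aiValues is unchanged
noneResult = aiValues.append(0)         # returns None; list is [3, 1, 2, 0]
iLast = aiValues.pop()                  # 0; list is back to [3, 1, 2]

aiValues.sort()                         # in place: [1, 2, 3]
aiValues.reverse()                      # in place: [3, 2, 1]

aiBackward = list(reversed(aiSorted))   # [3, 2, 1]; aiSorted is unchanged
```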
Dictionary functions
x.keys( ) key list: returns a list of all keys in dictionary x (in no particular order)
x.values( ) value list: returns a list of all values in dictionary x (in no particular order)
x.items( ) pair list: returns a list of all key/value pairs as length-two tuples (in no particular order)
In Python 3 these return view objects rather than lists; wrap them in list( ) to get a list
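The dictionary functions on a made-up table of counts. sorted( ) is applied here so that the results come back as lists in a predictable order.

```python
hashCounts = {"A": 3, "C": 1}

astrKeys = sorted(hashCounts.keys())      # ['A', 'C']
aiValues = sorted(hashCounts.values())    # [1, 3]
aPairs = sorted(hashCounts.items())       # [('A', 3), ('C', 1)]

strFirstKey = aPairs[0][0]                # "A": first element of the first pair
iFirstValue = aPairs[0][1]                # 3: second element of the first pair
```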
Reading
Python setup:
Haddock and Dunn, Chapter 1 p9-15, Chapter 4 p47-55, 59-62
Python data:
Haddock and Dunn, Chapter 7 p105-112, 118-120
Python operators:
Haddock and Dunn, Chapter 7 p112-115
Python keywords:
Haddock and Dunn, Chapter 7 p115-118, Chapter 9 p141-172