Download M02 Notes - The Huttenhower Lab

Monday #2

5m Announcements and questions
  Don't forget to send your questions to the web site discussion forum
  Problem Set #1 is due by end of today, Problem Set #2 by end of next Monday
  Note that readings now begin from the Practical Computing book, linked from web site
    Extensive notes also available for many topics, e.g. intro to Python and installation/setup
  Also note that Python assignments have a particular, very specific submission format
    Detailed on web site handout, will also be covered in lab

10m Background on genomic data manipulation: what is programming?
  A program is a set of instructions that tells a computer how to transform some input into some output
    Not terribly different from a recipe or a lab protocol
    Instead of reagents, entities being manipulated are data, in many forms (numbers, tables, etc.)
    You can think of a mathematical function like f(x) = 2x or g(x) = x^2 + 5 as a "program" that manipulates a value
    Likewise a game is a program that translates your input data (mouse + keys) into outputs (shapes + values)
  A program is a series of plain text instructions written in a specific programming language
    Literally plain text: not a Word document, not a web page, just text
    A defined language allows the computer to understand how it should move data in response to instructions
  An interpreter is a special program whose input data is another program
    That is, it knows how to execute a given set of instructions in order to transform input into output
    All programs are "interpreters" to some degree, but dedicated interpreters are exactly that: dedicated
    Reminds you that programs are data (and the interpreter itself is also data; freaky!)
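As a rough illustration of the "function as program" idea above, the two mathematical functions can be written as short Python programs. This is only a sketch, assuming Python 3; the names f and g simply mirror the formulas in the notes, and the def keyword used to define them is covered later in the course.

```python
# A program transforms input into output, much like a mathematical function.

def f(x):
    # f(x) = 2x
    return 2 * x

def g(x):
    # g(x) = x^2 + 5 (** is Python's power operator)
    return x ** 2 + 5

# Each call produces a return value; print is what displays it on screen.
print(f(3))
print(g(3))
```

Running the file through the Python interpreter is exactly the "program whose input is another program" situation: the text above is the interpreter's input data.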
  Most programs, interpreters included, deal with two kinds of output:
    Return values or results are the literal output of an instruction
      They're communicated from the instruction to the computer/interpreter, not to you
      Again much like the result of a mathematical function; produced, not displayed
    Displayed outputs, also side effects or printed outputs, are displayed on a screen but not to the computer
      Usually must be requested specifically, e.g. with a command like print
      We'll talk more later about how computers display data as opposed to store it internally
  To the computer, you can always replace or substitute an instruction with its return value
    If I tell a computer to "add five to the result of multiplying two and three," it's equivalent to "add five to six"
    And thus also equivalent to 11; this is true for arbitrarily complex instructions or expressions

5m Programming semantics versus Python syntax
  There are many different programming styles
  In Python a program consists of lines of code, roughly one instruction per line
  Instructions include keywords, operators, functions, data, and whitespace
    Keywords are special instructions built into the language
    Operators are special punctuation, also built into the language
    Functions are complex sets of reusable instructions that perform a specific task; more on that later
    Data are discrete values of several possible types, which we'll discuss next
    Whitespace comprises the spaces, tabs, and newlines used to organize your code and make it readable
  Python especially uses whitespace to organize instructions into lines and lines into blocks
    A block is a group of related instructions that share data, context, and environment
    Thus some whitespace is very important in Python, and some is ignored
      Typically whitespace within a line or between lines is ignored
      But whitespace indenting a line is critical for distinguishing blocks
      Beware of tabs and spaces at the beginning of lines!
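The substitution rule and the role of indentation described above can be sketched in a few lines. This is an illustrative example assuming Python 3; the variable names are made up, and the if keyword appears only so that there is a block to indent.

```python
# Substitution: an instruction can always be replaced by its return value.
product = 2 * 3            # "multiplying two and three" returns 6
total_a = 5 + (2 * 3)      # add five to the result of the multiplication
total_b = 5 + product      # equivalent: add five to six
total_c = 11               # equivalent again

# A return value is produced silently; nothing appears on screen
# until it is explicitly requested with print.
print(total_a)

# Indentation groups instructions into a block. Both indented lines
# below belong to the if statement; the final line does not.
if total_a == total_b == total_c:
    message = "all three forms agree"
    print(message)
print("done")
```

Removing the indentation from either line under the if, or mixing tabs with spaces there, is the kind of whitespace mistake that Python reports as an error.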
  Capitalization and case are always important in Python
  Honor thine errors
    Programming languages are strict
      When you break English rules, your high school teacher marks your paper in red ink
      When you break Python rules, the computer laughs at you and won't do what you ask
    Typos, incorrect capitalization/spelling, or misguided semantics will all produce errors instead of results
    These can be inscrutable, but read them: they will always give some indication of what went wrong
      And often that indication is exactly right, leading you to the specific line or word with a problem
      But not always - think and interpret Python's error statements

10m Data in Python
  Three basic types and two collection types:
    Basic or unit types: numbers, strings, and booleans
    Collection types: lists (or arrays) and dictionaries (or hashes/hash tables)
  Variables
    Data are typically stored in variables, which are named storage bins for values
      Think of a variable as a labeled bucket, or a tabbed index card
    Of several operators that modify variables, assignment (=) is the most important
      It would be lame if you instead had to type in all of your data by hand!
      Variables provide a place to store non-static data, from files, runtime input, calculations, etc.
    Variable buckets can hold one of two things:
      A data value, being a specific number, string, truth value, etc.
      Or a data reference, which indicates the storage location of one or more other values
      Python treats almost all data except basic types as references; more on this later
    I will typically name variables with a prefix that indicates what type of data they hold
      Referred to as Hungarian notation for odd historical reasons
      Watch for combinations of lower case letters prefixing variables as I introduce data types
    None is a special value indicating the absence of any data
      It has no "type" and can be thought of as an "uninitialized" or "NA" value
      Typically stored in variables to indicate the absence of "real" data
  Numbers
    Comprise integers and real values
    Integers have no decimal and can be positive, negative, or zero - no surprises
    Real values (also floating point or sometimes double precision values) are decimal numbers
      These can take values from negative to positive infinity
      And can be notated in decimal or scientific notation
  Strings
    Textual data, consisting of zero or more characters written between quotes
      Quotes can be either single apostrophes ' or double quotes " (not backticks `)
      Unlike some languages, Python does not differentiate between single and double quotes
    You can write special characters in strings using escape sequences that start with a backslash \
      Think of these as secret codes that tell Python to include a non-literal character
      "cat" means the literal characters c, a, t
      "ca\t" means the literal characters c, a, followed by a tab
      The most common escape sequences are: \t for tab, \n for newline, \\ for backslash itself
      These are very useful for nicely-formatted output, among other things
  Booleans
    Also logical or binary or truth values, represent a two-valued categorical outcome
    Written as True and False (capitalization critical)
    These are the simplest values used to test for the truth or falsehood of a program's result
      A computation can return True to indicate that a test of some input succeeded
      Or that it's in some defined range, or matches some expected pattern
    When using these values to run test instructions:
      False, None, 0 (of any sort), the empty string, and any empty collection all count as false values
      Anything else is true, not just True: 1, 10, 10.10, "ten", "0", etc.

15m Collections: organized references to multiple values
  Python treats each of these "unit" data types like a discrete object
    They can be put into buckets (variables) and computed with
    A variable thus always contains exactly one unit data type
  Python refers to any data type containing multiple unit data types as a collection
    This includes several different ways of organizing data
  Collections are always stored as references
    This means their containing variable does not contain the "value" of the whole collection
    It instead serves as a signpost that "points" to the data in the collection
    This won't matter until a bit later, but remember that only unit data types are stored as Python values
  Lists
    Also arrays or vectors in other languages
    A list is an ordered collection of zero or more values arranged in sequence
      Very much like a mathematical vector
      A variable containing a list thus acts like a sequence of individual variables
      Each contained value is accessed by an integer index starting at zero
    Lists are denoted by square brackets [] in Python, with individual elements separated by commas
      Individual elements can be any Python data, either unit types or other collections
      Thus in ["a", "b", "c"], the 0th element is "a", the 1st element is "b", and so forth
    Individual elements are retrieved by accessing a list index by integer, also using brackets
      ["a", "b", "c"][1] returns "b", also aList[1] if aList = ["a", "b", "c"]
    Every list has a length, which might be zero
      The empty list, denoted [], has zero length
      A list's length is an integer calculated with the len( aList ) function; more on that later
      Note that accessing a list element beyond its length will produce an error
      For example ["a", "b"][2] or aList[0] if aList = [] will both fail
  Dictionaries
    Also hashes or hash tables in other languages
    A dictionary is an unordered collection of zero or more values arranged by keys
      Very much like a real dictionary, in which key words are linked to definition values
      Keys must be unique; values may or may not be
      In other words, every key has at most one value
      A dictionary is thus an unordered set of key/value pairs
      Each contained value is accessed by an arbitrary valued index that can be any unit data type
    Dictionary keys can be numbers or strings; they cannot be other collections
      Dictionary values can be any data type at all, including other collections
    Dictionaries are denoted by curly brackets {} in Python, with key/value pairs separated by commas
      And key/value pairs themselves joined with a colon :
      Thus {"a" : 0, "b" : [1], "c" : "two"} contains three key/value pairs of various types
    Individual elements are retrieved by accessing a value by key, using square brackets
      Thus {"a" : 0, "b":[1]}["a"] returns 0, also hashDict["a"] if hashDict = {etc.}
    Every dictionary has a length, which is the number of key/value pairs it contains
      Also equivalent to the number of unique keys, calculated using len( hashDict ) as above
      The empty dictionary, denoted {}, has zero length

5m Nuts and bolts: software for Python programming
  Carla has covered installing Python on Macs and PCs during last week's lab
    Will continue to familiarize with the environment over the next few weeks
  You don't want to write Python using Notepad or Word!
    jEdit (or a similar editor) is the easiest way to get up to speed for the homeworks
      I recommend TextWrangler for Mac, Notepad++ for Windows
    Instructions also available to install and use Eclipse + PyDev, a full-featured programming editor

20m Operators
  Special operators
    = assignment: variable = instruction stores the return value of instruction in variable
    # comment: # text indicates that text is not part of the program and should be ignored by the interpreter
      Note that this only applies to the text between the # and the end of the single current line
      Docstrings (triple-quoted strings) are a special type of string that provides a way of commenting out multiple lines in Python
    () parenthesis: group operators (and instructions) together, working very much as in mathematics
    in inclusion: variable in collection returns True if variable is in collection, False otherwise
  Numerical operators
    + - * / arithmetic: x + y returns the value of adding numbers x and y, other operators likewise
    ** power: x ** y returns the value of x raised to the power y
    += -= *= /= assignment: x += y is equivalent to x = x + y, other operators likewise
  Logical operators
    == != equivalence: x == y returns True if x and y have the same values, False otherwise (!= reversed)
    < > <= >= numerical comparison: x < y returns True if x has a lower value than y, other operators likewise
    not inversion: not x returns True if x is false, False otherwise
    and or logical composition: x and y returns True if x and y are both true; x or y returns True if at least one is true
      Technically x and y and z returns the first false value in the sequence (or the last value if all are true)
      Likewise x or y or z returns the first true value (or the last value if none are true)
  Sequence operators: lists or strings
    + concatenation: x + y returns the string/list of elements from x followed by elements from y
    * repetition: x * y returns the string/list of x repeated y times, for y an integer
    : slice: x[y:z] returns the substring/sublist of x beginning at index y up to (but not including) index z

20m Primitives: built-in functions
  Functions and function calls
    Functions, also referred to as methods, are reusable blocks of instructions to perform a specific task
      They can operate on zero or more input data of arbitrary types
      They can produce zero or one output, called the return value or result of the function
    Functions store instructions, but they do not execute them until they're called
      Calling a function tells it to run its instructions on a particular set of input values
      These inputs are called the arguments that are passed to the function
    A typical Python function call looks like this: function_name( arg1, arg2 )
      A function call consists of the name, followed by zero or more arguments
      The arguments are surrounded by parentheses and separated by commas
    Some functions have restrictions on what types of arguments they understand, some don't
      Just like operators; + makes sense both for numbers and strings, but ** only for numbers
    Some functions produce side effects in addition to (or instead of) a return value
      One common example is print( "example" ), which inputs one string argument
      It then displays it on the screen, but does not return it for further processing
      A counterexample is abs( -2 ) → 2, which takes one numeric argument and returns its absolute value
    Substitution works for arguments: any place you can pass a data value, you can pass a variable containing it
      For example, dValue = -5.6, abs( dValue ) → 5.6
      For example, adValues = [-1, 2], abs( adValues[1] ) → 2
    And just like mathematical functions, substitutions work for functions as well (composition)
      For example, round( abs( -1.8 ) ) → 2.0
  Function targets and objects
    Some functions also take a special argument called their target
      This special argument is not included between parentheses, and is also the object of the function
      Functions can be targeted to a particular object (which is just another data value) using the . operator
    For example, the replace function targets a string and takes two additional string arguments
      The first is a substring to search for, the second is a new substring to replace it with
      Thus "test".replace( "es", "arge" ) → "target"
    Just like normal arguments, substitution by variables (or other instructions) works for targets
      For example, strValue = "fast", strValue.replace( "a", "ea" ) → "feast"
      For example, "test".replace( "e", "a" ).replace( "st", "ste" ) → "taste"
      For example, abs( -len( "a" + "bey".replace( "e", "" ) ) ) → 3
  Utility functions
    print( x ) display: outputs x to the screen, returning nothing
    len( x ) length: returns the number of elements in string, list, or dictionary x
    str(x), float(x), int(x) type conversion: returns x coerced into the requested type
    cmp( x, y ) comparison: returns -1 if x<y, 1 if x>y, and 0 otherwise

10m More primitive functions
  Numeric functions
    abs( x ) absolute value: returns the absolute value of number x
    round( x ) rounding: returns number x rounded to the nearest integer (but as a real number!)
    min( x ), max( x ) minimum/maximum: returns smallest/largest value in list x
    sum( x ) summation: returns the sum of elements in list x
  String functions
    x.strip( ) trimming: returns a copy of string x with all leading and trailing whitespace removed
    x.find( y ) substring find: returns the index of substring y within x, or -1 if it's not present
    x.replace( y, z ) replacement: returns a copy of string x with every occurrence of y replaced with z
  List functions
    x.append( y ) element addition: extends list x to include element y, returning nothing
    x.pop( ) element removal: removes and returns the last element of list x
    x.sort( ) sorting: sorts the elements of list x in ascending order, returning nothing
    sorted( x ) sorting: returns a copy of list x with elements in sorted ascending order
    x.reverse( ) reversal: reverses the order of elements in list x, returning nothing
    reversed( x ) reversal: returns the elements of list x in reverse order as an iterator; list( reversed( x ) ) gives a reversed copy
  Dictionary functions
    x.keys( ) key list: returns a list of all keys in dictionary x (in no particular order)
    x.values( ) value list: returns a list of all values in dictionary x (in no particular order)
    x.items( ) pair list: returns a list of all key/value pairs as length-two tuples (in no particular order)

Reading
  Python setup: Haddock and Dunn, Chapter 1 p9-15, Chapter 4 p47-55, 59-62
  Python data: Haddock and Dunn, Chapter 7 p105-112, 118-120
  Python operators: Haddock and Dunn, Chapter 7 p112-115
  Python keywords: Haddock and Dunn, Chapter 7 p115-118, Chapter 9 p141-172
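To tie together the data types, operators, and built-in functions listed in these notes, here is a short tour in runnable form. It is a sketch rather than course material: it assumes Python 3, the variable names are invented for illustration (loosely following the type-prefix habit mentioned earlier), and cmp is left out because it exists only in Python 2.

```python
# Unit types: numbers, strings, booleans, plus the special value None.
iCount = 3                    # integer
dRatio = 1.5e-2               # real value in scientific notation
strName = "ca\t"              # c, a, then a tab: three characters
fFound = True                 # boolean
nothing = None                # absence of data

# Truth testing: empty and zero values count as false; the string "0" does not.
aFalse = [bool(0), bool(""), bool([]), bool({}), bool(None)]
aTrue = [bool(1), bool("0"), bool("ten"), bool([0])]

# Lists: ordered, indexed from zero.
aList = ["a", "b", "c"]
strSecond = aList[1]          # the element at index 1
aSlice = aList[0:2]           # index 2 itself is not included
aJoined = aList + ["d"]       # concatenation builds a new list
aRepeated = ["x"] * 3         # repetition

# Lists are stored as references: both names point at the same list.
aAlias = aList
aAlias.append("z")            # aList now also ends in "z"
strLast = aList.pop()         # removes and returns that last element

# Dictionaries: values looked up by key.
hashDict = {"a": 0, "b": [1], "c": "two"}
iValue = hashDict["a"]
fHasKey = "b" in hashDict     # inclusion tests the keys
aKeys = sorted(hashDict.keys())   # sorted, since key order should not be relied on

# Operators.
iPower = 2 ** 10
iCount += 2                   # same as iCount = iCount + 2
strAnd = "" and "right"       # and stops at the first false value
strOr = "" or "right"         # or returns the first true value

# Functions with a target (the . operator) and function composition.
strTarget = "test".replace("es", "arge")
strAll = "banana".replace("a", "o")     # every occurrence is replaced
iFound = "banana".find("nan")           # index of the substring, or -1
strTrimmed = "  padded \n".strip()
iComposed = abs(-len("a" + "bey".replace("e", "")))

# sorted/reversed leave the original alone; sort/reverse change it in place.
aNumbers = [3, 1, 2]
aSortedCopy = sorted(aNumbers)
aReversedCopy = list(reversed(aNumbers))   # reversed() yields an iterator
aNumbers.sort()

print(strSecond, iValue, strTarget, iComposed, aSortedCopy)
```

The alias lines are the one place where the value/reference distinction becomes visible: appending through aAlias changes aList as well, because both variables are signposts to the same collection.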
