M02 Notes: Introduction to Python (abbreviated)

Monday #2
10m
Announcements and questions
No discussion forum activity - please send your questions there!
Still some students without website accounts; email Rui if missing
Problem Set #1 is due by end of tomorrow, Problem Set #2 by end of next Tuesday
Note that readings now begin from the Bioinformatics Programming book, linked from web site
Extensive notes also available for many topics, e.g. intro to Python and installation/setup
15m
Performance evaluation
We can (and should!) formalize these concepts of tests being "right" or "wrong"
How do you assess the accuracy of a hypothesis test with respect to a ground truth?
Gold standard: list of correct outcomes for a test, also standard or answer set or reference
Often categorical binary outcomes (0/1), also sometimes true numerical values
In the former (very common) case, answers are referred to as:
Positives: outcomes expected to be "true" or "1" in the gold standard, drawn from Ha distribution
Negatives: outcomes expected to be "false" or "0" in the standard, drawn from H0 distribution
Predictions or results are the outcomes of a test or inference process
Often whether a hypothesis test's statistic falls above or below a critical value
Example: is a p-value < 0.05?
For any test with a reference outcome in the gold standard, four results are possible:
The feature was drawn from H0 and your test accepts H0: true negative
The feature was drawn from H0 and your test rejects H0: false positive (also type I error)
The feature was drawn from Ha and your test accepts Ha: true positive
The feature was drawn from Ha and your test rejects Ha: false negative (also type II error)
                H0 True                    H0 False
H0 Accepted     True Negative              False Negative (Type II)
H0 Rejected     False Positive (Type I)    True Positive
The rates or probabilities with which incorrect outcomes occur indicate the performance or accuracy of a test
False positive rate = FPR = fraction of uninteresting (H0) features that the test nonetheless calls = FP/(FP+TN) = P(reject H0|H0)
False negative rate = FNR = fraction of interesting (Ha) features that the test misses = FN/(FN+TP) = P(accept H0|Ha)
What type of "performance" you care about depends a lot on how the test will be used
Are you running expensive validations in the lab? False positives can hurt!
Are you trying to detect dangerous contaminants? False negatives can hurt!
There are thus a slew of complementary performance measures based on different aspects of these quantities
Power: probability of detecting a true effect, P(reject H0|Ha)
Precision: probability a detected effect is true = TP/(TP+FP) = P(Ha|reject H0)
Recall: probability of detecting true effects = TP/P = TP/(TP+FN) = P(reject H0|Ha)
Also called true positive rate (TPR) or sensitivity
Specificity: probability a null (uninteresting) feature is correctly not detected = TN/(TN+FP) = P(accept H0|H0)
Also called true negative rate (TNR) and equivalent to 1 - false positive rate (FPR)
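The measures above can be sketched in a few lines of Python. This is only an illustration, assuming Python 3 and a made-up gold standard and prediction list (1 = positive, 0 = negative); all variable names are hypothetical.

```python
# Hypothetical gold standard and test predictions (1 = positive, 0 = negative)
aiGold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
aiPred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four possible outcomes
iTP = iFP = iTN = iFN = 0
for iGold, iPred in zip(aiGold, aiPred):
    if iGold and iPred:
        iTP += 1      # true positive
    elif iPred:
        iFP += 1      # false positive (type I error)
    elif iGold:
        iFN += 1      # false negative (type II error)
    else:
        iTN += 1      # true negative

dPrecision = iTP / (iTP + iFP)
dRecall = iTP / (iTP + iFN)         # also TPR, sensitivity, power
dSpecificity = iTN / (iTN + iFP)    # also TNR
dFPR = iFP / (iFP + iTN)            # equals 1 - specificity
dFNR = iFN / (iFN + iTP)            # equals 1 - recall

print(iTP, iFP, iTN, iFN)
print(dPrecision, dRecall, dSpecificity, dFPR, dFNR)
```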
Most hypothesis tests provide at least one tunable parameter that will trade off between aspects of performance
High-precision tests are typically low-recall (few false positives, more false negatives)
Highly sensitive tests are typically less specific (few false negatives, more false positives)
These tradeoffs are commonly visualized to provide a sense of prediction accuracy over a range of thresholds
Precision/recall plots: recall (x) vs. precision (y), upper right corner is "good"
Receiver Operating Characteristic (ROC) or sensitivity/specificity plots: FPR = 1-specificity (x) vs. TPR (y), upper left good
Both feature recall, but since precision ≠ specificity, can provide complementary views of the same data
Entire curves can be further reduced to summary statistics of overall prediction performance
Area Under the Curve (AUC): area under a ROC curve
Random = 0.5, Perfect = 1.0, Perfectly wrong = 0.0
Area under precision/recall curve (AUPRC): exactly what it sounds like
Not commonly used, since there's no baseline; "random" isn't fixed to a particular value
Computational predictions are very often evaluated using AUC, as are diagnostic tests and risk scores
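One way to see where a ROC curve and its AUC come from is to sweep a threshold over prediction scores by hand. The sketch below assumes Python 3, invented scores and labels, and no tied scores; real analyses would normally use an established library rather than this hand-rolled version.

```python
# Hypothetical scores (higher = more likely positive) and gold standard labels
adScores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
aiGold = [1, 1, 0, 1, 0, 1, 0, 0]

iPos = sum(aiGold)
iNeg = len(aiGold) - iPos

# Lower the threshold one score at a time, from strictest to most lenient,
# recording an (FPR, TPR) point after each step
aPoints = [(0.0, 0.0)]
iTP = iFP = 0
for dScore, iGold in sorted(zip(adScores, aiGold), reverse=True):
    if iGold:
        iTP += 1
    else:
        iFP += 1
    aPoints.append((iFP / iNeg, iTP / iPos))

# Trapezoid rule for the area under the (FPR, TPR) curve
dAUC = 0.0
for (dX0, dY0), (dX1, dY1) in zip(aPoints, aPoints[1:]):
    dAUC += (dX1 - dX0) * (dY0 + dY1) / 2

print(aPoints)
print(dAUC)
```

The same value can be read as the fraction of (positive, negative) pairs in which the positive receives the higher score.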
10m
Background on genomic data manipulation: what is programming?
A program is a set of instructions that tells a computer how to transform some input into some output
Not terribly different from a recipe or a lab protocol
Instead of reagents, entities being manipulated are data, in many forms (numbers, tables, etc.)
You can think of a mathematical function like f(x) = 2x or g(x) = x^2 + 5 as a "program" that manipulates a value
Likewise a game is a program that translates your input data (mouse + keys) into outputs (shapes + values)
A program is a series of plain text instructions written in a specific programming language
Literally plain text: not a Word document, not a web page, just text
A defined language allows the computer to understand how it should move data in response to instructions
An interpreter is a special program whose input data is another program
That is, it knows how to execute a given set of instructions in order to transform input into output
All programs are "interpreters" to some degree, but dedicated interpreters are exactly that: dedicated
Reminds you that programs are data (and the interpreter itself is also data; freaky!)
Most programs, interpreters included, deal with two kinds of output:
Return values or results are the literal output of an instruction
They're communicated from the instruction to the computer/interpreter, not to you
Again much like the result of a mathematical function; produced, not displayed
Displayed outputs, also side effects or printed outputs, are shown to you on a screen but not passed back to the computer
Usually must be requested specifically, e.g. with a command like print
We'll talk more later about how computers display data as opposed to store it internally
To the computer, you can always replace or substitute an instruction with its return value
If I tell a computer to "add five to the result of multiplying two and three," it's equivalent to "add five to six"
And thus also equivalent to 11; this is true for arbitrarily complex instructions or expressions
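The difference between a return value and displayed output, and the substitution rule, can be sketched as follows (assuming Python 3, where print is a function; the variable names are illustrative only).

```python
# The return value of an expression can be stored without displaying anything
iResult = 5 + (2 * 3)

# Substituting an instruction with its return value changes nothing
iSubstituted = 5 + 6

# Displaying a value must be requested explicitly
print(iResult)
print(iResult == iSubstituted)

# print displays text as a side effect, but its own return value is None
pReturned = print("hello")
```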
5m
Programming semantics versus Python syntax
There are many different programming styles
In Python a program consists of lines of code, roughly one instruction per line
Instructions include keywords, operators, functions, data, and whitespace
Keywords are special instructions built into the language
Operators are special punctuation, also built into the language
Functions are complex sets of reusable instructions that perform a specific task; more on that later
Data are discrete values of several possible types, which we'll discuss next
Whitespace comprises the spaces, tabs, and newlines used to organize your code and make it readable
Python especially uses whitespace to organize instructions into lines and lines into blocks
A block is a group of related instructions that share data, context, and environment
Thus some whitespace is very important in Python, and some is ignored
Typically whitespace within a line or between lines is ignored
But whitespace indenting a line is critical for distinguishing blocks
Beware of tabs and spaces at the beginning of lines!
Capitalization and case are always important in Python
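A small sketch of how indentation defines a block and how case distinguishes names (assuming Python 3; the if statement used here is covered under control flow later in these notes, and the names are made up):

```python
iCount = 3
if iCount > 2:
    # These indented lines form one block, run only if the test is true
    strSize = "big"
    iCount = iCount   +   1    # extra spaces within a line are ignored

# Back at the left margin: this line is outside the block and always runs
strDone = "done"

# Case matters: iCount and icount are two different variables
icount = 0
print(iCount, icount, strSize, strDone)
```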
Honor thine errors
Programming languages are strict
When you break English rules, your high school teacher marks your paper in red ink
When you break Python rules, the computer laughs at you and won't do what you ask
Typos, incorrect capitalization/spelling, or misguided semantics will all produce errors instead of results
These can be inscrutable, but read them: they will always give some indication of what went wrong
And often that indication is exactly right, leading you to the specific line or word with a problem
But not always - think and interpret Python's error statements
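As an illustration of reading an error, the sketch below makes a deliberate capitalization typo. It uses try/except, which these notes do not otherwise cover, only so that the example can capture the message instead of stopping; the variable names are hypothetical.

```python
iTotal = 10
try:
    print(itotal)    # typo: wrong capitalization, so this name does not exist
except NameError as e:
    # The message names the exact word Python could not find
    strMessage = str(e)
    print("Python says:", strMessage)
```

Without the try/except, Python would stop and print the same message along with the line number where the problem occurred.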
10m
Data in Python
Three basic types and two collection types:
Basic or unit types: numbers, strings, and booleans
Collection types: lists (or arrays) and dictionaries (or hashes/hash tables)
Variables
Data are typically stored in variables, which are named storage bins for values
Think of a variable as a labeled bucket, or a tabbed index card
Of several operators that modify variables, assignment (=) is the most important
It would be lame if you instead had to type in all of your data by hand!
Variables provide a place to store non-static data, from files, runtime input, calculations, etc.
Variable buckets can hold one of two things:
A data value, being a specific number, string, truth value, etc.
Or a data reference, which indicates the storage location of one or more other values
Python treats almost all data except basic types as references; more on this later
I will typically name variables with a prefix that indicates what type of data they hold
Referred to as Hungarian notation for odd historical reasons
Watch for combinations of lower case letters prefixing variables as I introduce data types
None is a special value indicating the absence of any data
It has no ordinary "type" (technically its own special NoneType) and can be thought of as an "uninitialized" or "NA" value
Typically stored in variables to indicate the absence of "real" data
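A minimal sketch of assignment and None, assuming Python 3. The type prefixes (i for integer, str for string, p for anything) are one possible reading of the naming convention described above, not necessarily the exact one used in class, and the values are invented.

```python
iReads = 42            # a variable holding an integer value
strGene = "geneA"      # a variable holding a string
iReads = iReads + 8    # assignment replaces the bucket's old contents
pResult = None         # no "real" data stored yet

# None can be tested for explicitly
if pResult is None:
    strStatus = "not computed"
else:
    strStatus = "computed"
print(iReads, strGene, strStatus)
```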
Numbers
Comprise integers and real values
Integers have no decimal and can be positive, negative, or zero - no surprises
Real values (also floating point or sometimes double precision values) are decimal numbers
These can take values from negative to positive infinity
And can be notated in decimal or scientific notation
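For instance (a sketch assuming Python 3; the d prefix for real values is an illustrative choice):

```python
iCount = -7            # integer: no decimal point
dFraction = 0.25       # real (floating point) value in decimal notation
dSmall = 2.5e-4        # real value in scientific notation, i.e. 0.00025
dInf = float("inf")    # positive infinity is itself a floating point value

print(type(iCount), type(dFraction))
print(dSmall == 0.00025)
print(dInf > 1e308)
```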
Strings
Textual data, consisting of zero or more characters written between quotes
Quotes can be either single apostrophes ' or double quotes " (not backticks `)
Unlike some languages, Python does not differentiate between single and double quotes
You can write special characters in strings using escape sequences that start with a backslash \
Think of these as secret codes that tell Python to include a non-literal character
"cat" means the literal characters c, a, t
"ca\t" means the literal characters c, a, followed by a tab
The most common escape sequences are:
\t for tab, \n for newline, \\ for backslash itself
These are very useful for nicely-formatted output, among other things
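The quoting and escape rules above can be checked directly (a sketch assuming Python 3; variable names are illustrative):

```python
strSingle = 'cat'
strDouble = "cat"
print(strSingle == strDouble)    # the quote style makes no difference

strTab = "ca\t"                  # c, a, then one tab character
print(len(strTab))               # the escape sequence counts as one character

strSlash = "a\\b"                # a, one backslash, b
print(strSlash)

# Tabs and newlines make tidy two-column output
print("gene\tcount\nabc\t12")
```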
Booleans
Also logical or binary or truth values, represent a two-valued categorical outcome
Written as True and False (capitalization critical)
These are the simplest values used to test for the truth or falsehood of a program's result
A computation can return True to indicate that a test of some input succeeded
Or that it's in some defined range, or matches some expected pattern
When using these values to run test instructions:
False, None, 0 (of any sort), the empty string, and any empty collection all count as false values
Anything else is true, not just True: 1, 10, 10.10, "ten", "0", etc.
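These truth rules can be verified by counting how many values in each group behave as expected (a sketch assuming Python 3; it uses lists and loops, which are covered later in these notes):

```python
aFalse = [False, None, 0, 0.0, "", [], {}]
aTrue = [True, 1, 10, 10.10, "ten", "0"]

# Every value in the first list should count as false
iFalseCount = 0
for pValue in aFalse:
    if not pValue:
        iFalseCount += 1

# Every value in the second list should count as true, including "0"
iTrueCount = 0
for pValue in aTrue:
    if pValue:
        iTrueCount += 1

print(iFalseCount, iTrueCount)
```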
10m
Collections: organized references to multiple values
Python treats each of these "unit" data types like a discrete object
They can be put into buckets (variables) and computed with
A variable thus always contains exactly one unit data type
Python refers to any data type containing multiple unit data types as a collection
This includes several different ways of organizing data
Collections are always stored as references
This means their containing variable does not contain the "value" of the whole collection
It instead serves as a signpost that "points" to the data in the collection
This won't matter until a bit later, but remember that only unit data types are stored as Python values
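A quick sketch of why the value/reference distinction matters, assuming Python 3 and using a list (introduced in the next section) as the collection:

```python
# Unit types behave like values: changing iB leaves iA alone
iA = 5
iB = iA
iB += 1

# Collections are references: both names point at the same list
aFirst = [1, 2, 3]
aSecond = aFirst      # copies the signpost, not the list
aSecond[0] = 99       # changes the single shared list

print(iA, iB)
print(aFirst, aSecond)
```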
Lists
Also arrays or vectors in other languages
A list is an ordered collection of zero or more values arranged in sequence
Very much like a mathematical vector
A variable containing a list thus acts like a sequence of individual variables
Each contained value is accessed by an integer index starting at zero
Lists are denoted by square brackets [] in Python, with individual elements separated by commas
Individual elements can be any Python data, either unit types or other collections
Thus in ["a", "b", "c"], the 0th element is "a", the 1st element is "b", and so forth
Individual elements are retrieved by accessing a list index by integer, also using brackets
["a", "b", "c"][1] returns "b", also aList[1] if aList = ["a", "b", "c"]
Every list has a length, which might be zero
The empty list, denoted [], has zero length
A list's length is an integer calculated with the len( aList ) function; more on that later
Note that accessing a list element beyond its length will produce an error
For example ["a", "b"][2] or aList[0] if aList = [] will both fail
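The list behaviors above, sketched in Python 3. The try/except is used only to capture the out-of-range error for display and is not otherwise covered in these notes.

```python
aList = ["a", "b", "c"]
print(aList[0], aList[1])         # indices start at zero
print(len(aList))

aNested = [1, "two", [3.0, 4]]    # elements can be any data, including lists
print(aNested[2][0])

print(len([]))                    # the empty list has zero length

try:
    aList[3]                      # one past the last valid index (2)
except IndexError as e:
    strError = str(e)
    print("Error:", strError)
```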
Dictionaries
Also hashes or hash tables in other languages
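A dictionary is an unordered collection of zero or more values arranged by keys (recent Python versions remember insertion order, but values are still looked up by key, never by position)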
Very much like a real dictionary, in which key words are linked to definition values
Keys must be unique; values may or may not be
In other words, every key has at most one value
A dictionary is thus an unordered set of key/value pairs
Each contained value is accessed by an arbitrary valued index that can be any unit data type
Dictionary keys can be numbers or strings; they cannot be other collections
Dictionary values can be any data type at all, including other collections
Dictionaries are denoted by curly brackets {} in Python, with key/value pairs separated by commas
And key/value pairs themselves joined with a colon :
Thus {"a" : 0, "b" : [1], "c" : "two"} contains three key/value pairs of various types
Individual elements are retrieved by accessing a value by key, using square brackets
Thus {"a" : 0, "b":[1]}["a"] returns 0, also hashDict["a"] if hashDict = {etc.}
Every dictionary has a length, which is the number of key/value pairs it contains
Also equivalent to the number of unique keys, calculated using len( hashDict ) as above
The empty dictionary, denoted {}, has zero length
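A sketch of the dictionary operations above in Python 3. Adding or replacing a pair by assigning to a key is standard Python but goes slightly beyond what the notes state.

```python
hashDict = {"a": 0, "b": [1], "c": "two"}
print(hashDict["a"])          # look up a value by key
print(len(hashDict))          # number of key/value pairs

hashDict["d"] = 3.0           # assigning to a new key adds a pair
hashDict["a"] = 10            # assigning to an existing key replaces its value
print(len(hashDict))

# For dictionaries, the in operator tests keys
print("b" in hashDict, "z" in hashDict)
print(len({}))                # the empty dictionary has zero length
```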
20m
Operators
Special operators
= assignment: variable = instruction stores the return value of instruction in variable
# comment: # text indicates that text is not part of the program and should be ignored by the interpreter
Note that this only applies to the text between the # and the end of the single current line
Fah will cover docstrings tomorrow, which are a way of commenting out multiple lines in Python
() parentheses: group operators (and instructions) together, working very much as in mathematics
in inclusion: variable in collection returns True if variable is in collection, False otherwise
Numerical operators
+ - * / arithmetic: x + y returns the value of adding numbers x and y, other operators likewise
** power: x ** y returns the value of x raised to the power y
+= -= *= /= assignment: x += y is equivalent to x = x + y, other operators likewise
Logical operators
== != equivalence: x == y returns True if x and y have the same values, False otherwise (!= reversed)
< > <= >= numerical comparison: x < y returns True if x has a lower value than y, other operators likewise
not inversion: not x returns True if x is false, False otherwise
and or logical composition: x and y returns a true value if x and y are both true; x or y likewise if at least one is true
Technically x and y and z returns the first false value of the sequence (or the last value if all are true), x or y or z the first true value (or the last value if all are false)
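A sketch of the numerical and logical operators, assuming Python 3 (where / on two integers returns a real value). Note in particular that and returns the first false operand, or the last operand if all are true, while or returns the first true operand, or the last if all are false.

```python
iX = 7
iY = 2
print(iX + iY, iX - iY, iX * iY, iX ** iY)
print(iX / iY)            # 3.5 in Python 3
iX += 3                   # same as iX = iX + 3, so iX is now 10
print(iX == 10, iX != 10, iX < iY, not iX < iY)

# and/or return one of their operands, not necessarily True or False
pAnd = 1 and "two" and 0      # first false value
pOr = 0 or "" or "three"      # first true value
pAllTrue = 1 and "two"        # all true: the last value
pAllFalse = 0 or ""           # all false: the last value
print(repr(pAnd), repr(pOr), repr(pAllTrue), repr(pAllFalse))
```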
Sequence operators: lists or strings
+ concatenation: x + y returns the string/list of elements from x followed by elements from y
* repetition: x * y returns the string/list of x repeated y times, for y an integer
: slice: x[y:z] returns the substring/sublist of x beginning at index y up to (but not including) index z
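The sequence operators behave the same way on strings and lists (a sketch assuming Python 3, with invented values):

```python
strSeq = "ACGT"
aSeq = [1, 2, 3, 4]

strJoined = strSeq + "AA"    # concatenation
aJoined = aSeq + [5]
strTwice = strSeq * 2        # repetition
strSlice = strSeq[1:3]       # indices 1 and 2, not 3
aSlice = aSeq[1:3]

print(strJoined, aJoined)
print(strTwice)
print(strSlice, aSlice)
print("C" in strSeq, 9 in aSeq)
```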
20m
Keywords and control flow
Special keywords
pass no-operation: does nothing, used only to prevent empty blocks (very rare)
del removal: del x[y] removes the value at index/key y from collection x
raise Exception: raise Exception( "text" ) immediately stops program execution with error "text"
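A sketch of the three special keywords in Python 3. The try/except wrapper is an addition for illustration only, so the example can show the error text; without it, raise would stop the program.

```python
aList = ["a", "b", "c"]
del aList[1]                     # removes "b" by index
hashDict = {"x": 1, "y": 2}
del hashDict["x"]                # removes the pair with key "x"
print(aList, hashDict)

if len(aList) > 5:
    pass                         # placeholder: a block cannot be left empty

try:
    raise Exception("something went wrong")
except Exception as e:
    strError = str(e)
    print("Stopped with:", strError)
```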
Control flow
if/elif/else conditionals: if a: b elif c: d else: e executes b if a is true, otherwise d if c is true, otherwise e
Can include zero or more (arbitrarily many) elif, executed in order and stopping as soon as one is true
Can include at most one (and possibly zero) else, only executed if no preceding condition succeeds
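Written out over several lines, with each branch as its own indented block, a conditional might look like this (a sketch assuming Python 3; the thresholds and labels are made up):

```python
dPValue = 0.03
if dPValue < 0.01:
    strCall = "strong"
elif dPValue < 0.05:
    # Tested only because the first condition was false
    strCall = "significant"
else:
    # Runs only if no condition above was true
    strCall = "not significant"
print(strCall)
```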
Loops
while loop: while x: y runs y if x is true, and repeats this until x is false
for/in loop: for x in y: z repeats z once for each value in collection y, setting x to each value sequentially
break instruction: when used in a loop, immediately exits the single innermost loop and continues below
continue instruction: in a loop, immediately exits the current innermost iteration and continues with the next
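Both loop types, along with break and continue, can be sketched as follows (assuming Python 3; values are arbitrary):

```python
# for/in: visit each value in turn
iSum = 0
for iValue in [1, 2, 3, 4, 5, 6]:
    if iValue == 2:
        continue      # skip the rest of this iteration
    if iValue == 5:
        break         # leave the loop entirely; 5 and 6 are never added
    iSum += iValue
print(iSum)           # 1 + 3 + 4

# while: repeat until the test becomes false (0 counts as false)
iCountdown = 3
aSeen = []
while iCountdown:
    aSeen = aSeen + [iCountdown]
    iCountdown -= 1
print(aSeen)
```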
5m
Nuts and bolts: software for Python programming
Rui will cover installing Python on Macs and PCs during tomorrow's lab
You don't want to write Python using Notepad or Word!
Rui will also show how to install and use Eclipse + PyDev, a full-featured programming editor
Lightweight (free) alternatives include TextWrangler for MacOS, Notepad++ for Windows, jEdit for either
Reading
Performance evaluation:
Pagano and Gauvreau, 6.4
Python data:
Model, Chapter 1 p1-9, 12-20
Python collections:
Model, Chapter 3 p47-72, 94-98
Python keywords:
Model, Chapter 4 p99-113