Monday #2

10m Announcements and questions
  No discussion forum activity - please send your questions there!
  Still some students without website accounts; email Rui if yours is missing
  Problem Set #1 is due by end of tomorrow, Problem Set #2 by end of next Tuesday
  Note that readings now begin from the Bioinformatics Programming book, linked from the web site
  Extensive notes also available for many topics, e.g. intro to Python and installation/setup

15m Performance evaluation
  We can (and should!) formalize these concepts of tests being "right" or "wrong"
  How do you assess the accuracy of a hypothesis test with respect to a ground truth?
  Gold standard: list of correct outcomes for a test, also standard or answer set or reference
    Often categorical binary outcomes (0/1), also sometimes true numerical values
    In the former (very common) case, answers are referred to as:
      Positives: outcomes expected to be "true" or "1" in the gold standard, drawn from the Ha distribution
      Negatives: outcomes expected to be "false" or "0" in the standard, drawn from the H0 distribution
  Predictions or results are the outcomes of a test or inference process
    Often whether or not a hypothesis test falls above or below a critical value
    Example: is a p-value < 0.05?
  For any test with a reference outcome in the gold standard, four results are possible:
    The feature was drawn from H0 and your test accepts H0: true negative
    The feature was drawn from H0 and your test rejects H0: false positive (also type I error)
    The feature was drawn from Ha and your test rejects H0 (accepts Ha): true positive
    The feature was drawn from Ha and your test accepts H0 (rejects Ha): false negative (also type II error)

                   H0 True                  H0 False
    H0 Accepted    True Negative            False Negative (Type II)
    H0 Rejected    False Positive (Type I)  True Positive

  The rates or probabilities with which incorrect outcomes occur indicate the performance or accuracy of a test
    False positive rate = FPR = fraction of successful tests for uninteresting features = P(reject H0|H0)
    False negative rate = FNR = fraction of failed tests for interesting features = P(accept H0|Ha)
  What type of "performance" you care about depends a lot on how the test is to be used
    Are you running expensive validations in the lab? False positives can hurt!
    Are you trying to detect dangerous contaminants? False negatives can hurt!
  There are thus a slew of complementary performance measures based on different aspects of these quantities
    Power: probability of detecting a true effect, P(reject H0|Ha)
    Precision: probability a detected effect is true = TP/(TP+FP) = P(Ha|reject H0)
    Recall: probability of detecting true effects = TP/P = TP/(TP+FN) = P(reject H0|Ha)
      Also called true positive rate (TPR) or sensitivity, and identical to power
    Specificity: probability an uninteresting feature is correctly left undetected = TN/N = TN/(TN+FP) = P(accept H0|H0)
      Also called true negative rate (TNR) and equivalent to 1 - false positive rate (FPR)
  Most hypothesis tests provide at least one tunable parameter that will trade off between aspects of performance
    High-precision tests are typically low-recall (few false positives, more false negatives)
    Highly sensitive tests are typically less specific (few false negatives, more false positives)
  These tradeoffs are commonly visualized to provide a sense of prediction accuracy over a range of thresholds
    Precision/recall plots: recall (x) vs. precision (y), upper right corner is "good"
    Receiver Operating Characteristic (ROC) or sensitivity/specificity plots: FPR = 1 - specificity (x) vs. TPR (y), upper left good
    Both feature recall, but since precision ≠ specificity, they can provide complementary views of the same data
  Entire curves can be further reduced to summary statistics of overall prediction performance
    Area Under the Curve (AUC): area under a ROC curve
      Random = 0.5, Perfect = 1.0, Perfectly wrong = 0.0
    Area under precision/recall curve (AUPRC): exactly what it sounds like
      Not commonly used, since there's no fixed baseline; "random" depends on the fraction of positives in the gold standard
  Computational predictions are very often evaluated using AUC, as are diagnostic tests and risk scores

10m Background on genomic data manipulation: what is programming?
  A program is a set of instructions that tells a computer how to transform some input into some output
    Not terribly different from a recipe or a lab protocol
    Instead of reagents, entities being manipulated are data, in many forms (numbers, tables, etc.)
    You can think of a mathematical function like f(x) = 2x or g(x) = x^2 + 5 as a "program" that manipulates a value
    Likewise a game is a program that translates your input data (mouse + keys) into outputs (shapes + values)
  A program is a series of plain text instructions written in a specific programming language
    Literally plain text: not a Word document, not a web page, just text
    A defined language allows the computer to understand how it should move data in response to instructions
  An interpreter is a special program whose input data is another program
    That is, it knows how to execute a given set of instructions in order to transform input into output
    All programs are "interpreters" to some degree, but dedicated interpreters are exactly that: dedicated
    Reminds you that programs are data (and the interpreter itself is also data; freaky!)
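As a rough illustration of the two ideas above (a function as a tiny program, and a program as plain-text data handed to an interpreter), here is a minimal Python sketch. The names source, namespace, and answer are made up for this example; exec is Python's built-in for running program text, and is used here only to make the point, not as a recommended everyday tool.

```python
# f and g are the two example functions from the notes, written as tiny programs:
# each transforms an input value into an output value.
def f(x):
    return 2 * x

def g(x):
    return x ** 2 + 5

# A program is plain text, so it can itself be stored as data.
# "source" is a hypothetical one-line program held in an ordinary string.
source = "answer = f(3) + g(2)"

# Hand the program text to the interpreter. The dictionary "namespace"
# (an assumption of this sketch) tells it what f and g mean and
# receives any variables the program creates.
namespace = {"f": f, "g": g}
exec(source, namespace)
answer = namespace["answer"]
```

After this runs, answer holds the result of f(3) + g(2), produced by a program that started life as a piece of text.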
  Most programs, interpreters included, deal with two kinds of output:
    Return values or results are the literal output of an instruction
      They're communicated from the instruction to the computer/interpreter, not to you
      Again much like the result of a mathematical function; produced, not displayed
    Displayed outputs, also side effects or printed outputs, are shown to you on a screen but not returned to the computer
      Usually must be requested specifically, e.g. with a command like print
      We'll talk more later about how computers display data as opposed to store it internally
  To the computer, you can always replace or substitute an instruction with its return value
    If I tell a computer to "add five to the result of multiplying two and three," it's equivalent to "add five to six"
    And thus also equivalent to 11; this is true for arbitrarily complex instructions or expressions

5m Programming semantics versus Python syntax
  There are many different programming styles
  In Python a program consists of lines of code, roughly one instruction per line
  Instructions include keywords, operators, functions, data, and whitespace
    Keywords are special instructions built into the language
    Operators are special punctuation, also built into the language
    Functions are complex sets of reusable instructions that perform a specific task; more on that later
    Data are discrete values of several possible types, which we'll discuss next
    Whitespace is the spaces, tabs, and newlines used to organize your code and make it readable
  Python especially uses whitespace to organize instructions into lines and lines into blocks
    A block is a group of related instructions that share data, context, and environment
    Thus some whitespace is very important in Python, and some is ignored
      Typically whitespace within a line or between lines is ignored
      But whitespace indenting a line is critical for distinguishing blocks
      Beware of mixing tabs and spaces at the beginning of lines!
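The difference between return values and displayed output, and the role of indentation in marking blocks, can be sketched in a few lines of Python. The variable names (product, total, shown, message) are invented for this illustration.

```python
# Return values are handed back to the interpreter; nothing appears on screen.
product = 2 * 3        # the instruction 2 * 3 is replaced by its return value
total = 5 + 2 * 3      # equivalent to 5 + 6, and thus to 11

# Displayed output must be requested explicitly with print.
print(total)

# print displays its text but hands nothing useful back: its return value is None.
shown = print("displayed, not returned")

# Indentation marks a block: both indented lines belong to the if.
# Extra spaces inside a line (as around == below) are ignored.
if total   ==   11:
    message = "substitution works"
    print(message)
```

Removing the indentation from either line under the if would change the program's meaning or produce an error, while the extra spaces around == make no difference.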
  Capitalization and case are always important in Python
  Honor thine errors
    Programming languages are strict
      When you break English rules, your high school teacher marks your paper in red ink
      When you break Python rules, the computer laughs at you and won't do what you ask
    Typos, incorrect capitalization/spelling, or misguided semantics will all produce errors instead of results
    These can be inscrutable, but read them: they will always give some indication of what went wrong
      And often that indication is exactly right, leading you to the specific line or word with a problem
      But not always - think and interpret Python's error statements

10m Data in Python
  Three basic types and two collection types:
    Basic or unit types: numbers, strings, and booleans
    Collection types: lists (or arrays) and dictionaries (or hashes/hash tables)
  Variables
    Data are typically stored in variables, which are named storage bins for values
      Think of a variable as a labeled bucket, or a tabbed index card
      Of several operators that modify variables, assignment (=) is the most important
    It would be lame if you instead had to type in all of your data by hand!
      Variables provide a place to store non-static data, from files, runtime input, calculations, etc.
    Variable buckets can hold one of two things:
      A data value, being a specific number, string, truth value, etc.
      Or a data reference, which indicates the storage location of one or more other values
      Python treats almost all data except basic types as references; more on this later
    I will typically name variables with a prefix that indicates what type of data they hold
      Referred to as Hungarian notation for odd historical reasons
      Watch for combinations of lower case letters prefixing variables as I introduce data types
    None is a special value indicating the absence of any data
      It belongs to none of the data types below and can be thought of as an "uninitialized" or "NA" value
      Typically stored in variables to indicate the absence of "real" data
  Numbers
    Comprise integers and real values
    Integers have no decimal point and can be positive, negative, or zero - no surprises
    Real values (also floating point or sometimes double precision values) are decimal numbers
      These can take values from negative to positive infinity
      And can be notated in decimal or scientific notation
  Strings
    Textual data, consisting of zero or more characters written between quotes
      Quotes can be either single apostrophes ' or double quotes " (not backticks `)
      Unlike some languages, Python does not differentiate between single and double quotes
    You can write special characters in strings using escape sequences that start with a backslash \
      Think of these as secret codes that tell Python to include a non-literal character
      "cat" means the literal characters c, a, t
      "ca\t" means the literal characters c, a, followed by a tab
      The most common escape sequences are: \t for tab, \n for newline, \\ for backslash itself
      These are very useful for nicely-formatted output, among other things
  Booleans
    Also logical or binary or truth values, represent a two-valued categorical outcome
    Written as True and False (capitalization critical)
    These are the simplest values used to test for the truth or falsehood of a program's result
      A computation can return True to indicate that a test of some input succeeded
      Or that it's in some defined range, or matches some expected pattern
    When using these values to run test instructions:
      False, None, 0 (of any sort), the empty string, and any empty collection all count as false values
      Anything else is true, not just True: 1, 10, 10.10, "ten", "0", etc.

10m Collections: organized references to multiple values
  Python treats each of these "unit" data types like a discrete object
    They can be put into buckets (variables) and computed with
    A variable thus always contains exactly one unit data type
  Python refers to any data type containing multiple unit data types as a collection
    This includes several different ways of organizing data
    Collections are always stored as references
      This means their containing variable does not contain the "value" of the whole collection
      It instead serves as a signpost that "points" to the data in the collection
      This won't matter until a bit later, but remember that only unit data types are stored as Python values
  Lists
    Also arrays or vectors in other languages
    A list is an ordered collection of zero or more values arranged in sequence
      Very much like a mathematical vector
      A variable containing a list thus acts like a sequence of individual variables
    Each contained value is accessed by an integer index starting at zero
    Lists are denoted by square brackets [] in Python, with individual elements separated by commas
      Individual elements can be any Python data, either unit types or other collections
      Thus in ["a", "b", "c"], the 0th element is "a", the 1st element is "b", and so forth
    Individual elements are retrieved by accessing a list index by integer, also using brackets
      ["a", "b", "c"][1] returns "b", also aList[1] if aList = ["a", "b", "c"]
    Every list has a length, which might be zero
      The empty list, denoted [], has zero length
      A list's length is an integer calculated with the len( aList ) function; more on that later
      Note that accessing a list element at or beyond its length will produce an error
      For example ["a", "b"][2] or aList[0] if aList = [] will both fail
  Dictionaries
    Also hashes or hash tables in other languages
    A dictionary is an unordered collection of zero or more values arranged by keys
      Very much like a real dictionary, in which key words are linked to definition values
      Keys must be unique; values may or may not be
      In other words, every key has at most one value
      A dictionary is thus an unordered set of key/value pairs
    Each contained value is accessed by an arbitrary valued index that can be any unit data type
      Dictionary keys can be numbers or strings; they cannot be other collections
      Dictionary values can be any data type at all, including other collections
    Dictionaries are denoted by curly brackets {} in Python, with key/value pairs separated by commas
      And key/value pairs themselves joined with a colon :
      Thus {"a" : 0, "b" : [1], "c" : "two"} contains three key/value pairs of various types
    Individual elements are retrieved by accessing a value by key, using square brackets
      Thus {"a" : 0, "b" : [1]}["a"] returns 0, also hashDict["a"] if hashDict = {"a" : 0, "b" : [1]}
    Every dictionary has a length, which is the number of key/value pairs it contains
      Also equivalent to the number of unique keys, calculated using len( hashDict ) as above
      The empty dictionary, denoted {}, has zero length

20m Operators
  Special operators
    = assignment: variable = instruction stores the return value of instruction in variable
    # comment: # text indicates that text is not part of the program and should be ignored by the interpreter
      Note that this only applies to the text between the # and the end of the single current line
      Fah will cover docstrings tomorrow, which are a way of commenting out multiple lines in Python
    () parentheses: group operators (and instructions) together, working very much as in mathematics
    in inclusion: variable in collection returns True if variable is in collection, False otherwise
      For a dictionary, this tests whether variable is one of its keys
  Numerical operators
    + - * / arithmetic: x + y returns the value of adding numbers x and y, other operators likewise
    ** power: x ** y returns the value of x raised to the power y
    += -= *= /= assignment: x += y is equivalent to x = x + y, other operators likewise
  Logical operators
    == != equivalence: x == y returns True if x and y have the same values, False otherwise (!= reversed)
    < > <= >= numerical comparison: x < y returns True if x has a lower value than y, other operators likewise
    not inversion: not x returns True if x is false, False otherwise
    and or logical composition: x and y is true if x and y are both true, x or y likewise if at least one is true
      Technically x and y and z returns the first false value of the sequence (or the last value if all are true)
      And x or y or z returns the first true value (or the last value if none are true)
  Sequence operators: lists or strings
    + concatenation: x + y returns the string/list of elements from x followed by elements from y
    * repetition: x * y returns the string/list of x repeated y times, for y an integer
    : slice: x[y:z] returns the substring/sublist of x beginning at index y up to (but not including) index z

20m Keywords and control flow
  Special keywords
    pass no-operation: does nothing, used only to prevent empty blocks (very rare)
    del removal: del x[y] removes the value at index/key y from collection x
    raise Exception: raise Exception( "text" ) immediately stops program execution with error "text"
  Control flow
    if/elif/else conditionals: if a: b elif c: d else: e executes b if a is true, otherwise d if c is true, otherwise e
      Can include zero or more (arbitrarily many) elif, tested in order and stopping as soon as one is true
      Can include at most one (and possibly zero) else, only executed if no preceding condition succeeds
  Loops
    while loop: while x: y runs y if x is true, and repeats this until x is false
    for/in loop: for x in y: z repeats z once for each value in collection y, setting x to each value sequentially
    break instruction: when used in a loop, immediately exits the single innermost loop and continues below
    continue instruction: in a loop, immediately exits the current innermost iteration and continues with the next

5m Nuts and bolts: software for Python programming
  Rui will cover installing Python on Macs and PCs during tomorrow's lab
  You don't want to write Python using Notepad or Word!
    Rui will also show how to install and use Eclipse + PyDev, a full-featured programming editor
    Lightweight (free) alternatives include TextWrangler for MacOS, Notepad++ for Windows, jEdit for either

Reading
  Performance evaluation: Pagano and Gauvreau, 6.4
  Python data: Model, Chapter 1 p1-9, 12-20
  Python collections: Model, Chapter 3 p47-72, 94-98
  Python keywords: Model, Chapter 4 p99-113
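Worked example: pulling today's pieces together

As a closing sketch, the lists, dictionaries, operators, and control flow from these notes are enough to compute the performance measures defined in the performance evaluation section. The gold standard and predictions below are made-up data for eight features (1 = positive, 0 = negative). The prefixes a (list) and hash (dictionary) follow the naming used in the notes; the d prefix for real values is an assumption of this example. It also assumes Python 3, where / always performs real-valued division.

```python
# Hypothetical gold standard and test predictions (1 = positive, 0 = negative).
aGold = [1, 1, 1, 0, 0, 0, 0, 1]
aPredicted = [1, 0, 1, 0, 1, 0, 0, 1]

# Count each of the four possible outcomes in a dictionary.
hashCounts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
i = 0
while i < len(aGold):
    if aGold[i] and aPredicted[i]:
        hashCounts["TP"] += 1   # drawn from Ha, H0 rejected
    elif aPredicted[i]:
        hashCounts["FP"] += 1   # drawn from H0, H0 rejected (type I error)
    elif aGold[i]:
        hashCounts["FN"] += 1   # drawn from Ha, H0 accepted (type II error)
    else:
        hashCounts["TN"] += 1   # drawn from H0, H0 accepted
    i += 1

# Performance measures, using the formulas from the notes.
dPrecision = hashCounts["TP"] / (hashCounts["TP"] + hashCounts["FP"])
dRecall = hashCounts["TP"] / (hashCounts["TP"] + hashCounts["FN"])
dSpecificity = hashCounts["TN"] / (hashCounts["TN"] + hashCounts["FP"])
dFPR = 1 - dSpecificity

print(hashCounts)
print(dPrecision, dRecall, dSpecificity, dFPR)
```

With this toy data there are three true positives, three true negatives, one false positive, and one false negative, so precision, recall, and specificity all come out to 0.75 and the false positive rate to 0.25.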