Monday #1

10m Introduction, overview, and questions

20m Thinking about data visually
- Visualizing data: bar charts, histograms, density plots, cumulative distributions, stripcharts, box plots, scatter plots

10m Why does quantitative data analysis matter?
- It takes five minutes in Excel to generate an incorrect biological conclusion (a Python sketch of this demonstration appears below):
  - Open Excel and create a table of random data with 15 columns and 5,000 rows
  - The top row is a measurement of bacterial cell survival "age"
  - The following 5,000 rows are measurements of bacterial whole-genome "expression"
  - Freeze these data (copy/paste by value) and create a new column correlating each "gene" with "age"
  - Sort by correlation and scatter plot "age" versus the most correlated "gene"
  - Does this "gene" control cell "longevity"?

10m Descriptive and inferential statistics
- Descriptive statistics summarize information that is already present in a data collection. These include:
  - Common visualizations like box plots, histograms, etc.
  - Common summary measures like averages, standard deviations, medians, and so forth
  - The computer science equivalent is data mining or search: Google
- Inferential statistics use a sample of data to make predictions about larger populations or about unobserved/future trends. These include:
  - Any measurements made in the presence of noise or variation
  - Generalizations from a sample to a population: confidence intervals, hypothesis tests, etc.
  - Comparisons between datasets: correlations, regression, etc.
  - The computer science equivalent is machine learning: classification or prediction
- Data are typically described in one of three forms: categorical, ordinal, or continuous
  - Categorical values take one of a discrete set of unordered values:
    - Binary or boolean values: a coin flip (H/T)
    - A tissue type: blood/skin/lung/GI/etc.
  - Ordinal values take one of a discrete set of ordered values:
    - Counts or rank orders
    - Often used for nonparametric methods (discussed later)
    - These are rarely analyzed differently from continuous values
  - Continuous values take one value from an ordered numerical scale:
    - Times, frequencies, ratios, percentages, abundances, etc.

30m Simple descriptive statistics
- A statistic is any single value that summarizes an entire dataset
- Parametric summary statistics: mean, standard deviation, and z-scores
  - Typically used to describe "well behaved" data that are approximately normally distributed, i.e. continuous, symmetric, and thin-tailed
  - How close "approximately" must be depends on the application: often quite flexible, but not always
  - Average = mean = μ = Σx/n
  - Standard deviation = σ = √variance = √(Σx²/n − μ²) (population)
    - Beware the difference between population and sample standard deviation: s = √(Σ(x − x̄)²/(n − 1)) = √((n/(n − 1))(Σx²/n − x̄²)) (sample)
    - The latter is slightly larger due to uncertainty in estimating the population mean from a sample
  - Data expressed as z-scores are relative to a dataset's mean and standard deviation: z = (x − μ)/σ
    - About 2/3 of normally distributed data will be within σ of μ, ~95% within 2σ, and ~99% within 3σ
    - Be careful when using sample σ to detect outliers: ~33%, 5%, or 1% of your data will always fall outside these ranges, outliers or not
- Nonparametric summary statistics: median, percentiles, quartiles, interquartile range, and outliers
  - Can be used to describe any data regardless of distribution; they consider rank order, not value
  - Median = m = x[|x|/2] = midpoint of the sorted dataset
  - Percentile = p(y) = x[y·|x|] = data point y% of the way "through" the sorted dataset
  - Quartiles = 25th, 50th, and 75th percentiles = {p(0.25), p(0.5), p(0.75)}; also often quintiles, deciles, etc.
  - Interquartile range = IQR = difference between the 75th and 25th percentiles = p(0.75) − p(0.25)
    - Thus exactly half of any dataset is within its IQR
  - Fences = bounds on the "usual" range of data
    - The upper fence lies above the 75th percentile p(0.75), the lower fence below the 25th percentile p(0.25)
    - Inner fences lie 1.5 interquartile ranges above/below; outer fences lie 3 IQRs above/below
    - These provide a good way to detect outliers: <LIF or >UIF is unusual, <LOF or >UOF is extremely unlikely
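As a concrete companion to the summary statistics above, here is a minimal sketch in Python (numpy assumed available; the data vector is invented for illustration) covering the population/sample standard deviation distinction, z-scores, and the IQR fences:

```python
# Minimal sketch of the summary statistics defined above (numpy assumed;
# the data vector is invented for illustration).
import numpy as np

x = np.array([4.1, 4.4, 4.8, 5.0, 5.2, 5.3, 5.9, 6.1, 6.4, 12.0])

# Parametric summaries
mean = x.mean()
sd_pop = x.std(ddof=0)     # population sigma: sqrt(sum((x - mean)^2) / n)
sd_samp = x.std(ddof=1)    # sample s: divides by n - 1, slightly larger
z = (x - mean) / sd_samp   # z-scores relative to mean and standard deviation

# Nonparametric summaries
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences: "unusual" beyond here
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # outer fences: "extreme" beyond here

print(f"mean={mean:.2f} sd(pop)={sd_pop:.2f} sd(sample)={sd_samp:.2f}")
print(f"median={median:.2f} IQR={iqr:.2f} inner={inner} outer={outer}")
print("beyond inner fences:", x[(x < inner[0]) | (x > inner[1])])
```

The single large value (12.0) inflates the mean and standard deviation but barely moves the median, and the fences flag it: a small illustration of why the nonparametric summaries are more robust.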
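And the "five minutes in Excel" demonstration from the start of this session, transplanted to Python as a rough sketch (numpy assumed; the 15 x 5,000 dimensions follow the notes). Every number is pure noise, yet the best of 5,000 random "genes" reliably shows a striking correlation with random "age":

```python
# Sketch of the "five minutes in Excel" demonstration from these notes,
# redone with numpy (assumed available). All data are pure noise.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 15, 5000

age = rng.random(n_samples)                    # random "age" per sample (top row)
expression = rng.random((n_genes, n_samples))  # random "expression", one row per "gene"

# Correlate each "gene" with "age" (Pearson), then take the best
corrs = np.array([np.corrcoef(gene, age)[0, 1] for gene in expression])
best = np.argmax(np.abs(corrs))

print(f"Best 'gene': #{best}, r = {corrs[best]:+.2f}")
# With 5,000 tries on only 15 random points, |r| around 0.7-0.8 is typical:
# an apparently strong "longevity gene" built entirely from noise.
```

The lesson is exactly the one in the notes: sorting thousands of random correlations guarantees an impressive-looking winner.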
30m Comparing data: Manhattan, Euclidean, Pearson, cosine, and Spearman correlations and distances
- It is useful to have summary statistics of paired data: measurements consisting of two correspondingly ordered vectors
- Motivations for different paired summary statistics are diverse:
  - Some describe properties of probability distributions, like correlation
  - Others describe properties of "real" space, like Euclidean distance
  - In this class we'll focus on simply cataloging and building an intuition for these (a code sketch of all five measures follows this section)
- Note that we'll avoid the term metric, which has a specific mathematical meaning; it is slightly safer to refer to similarity or dissimilarity measures
- Distances: larger means less similar
  - Euclidean = e = √(Σ(x − y)²), also the L2 norm
    - Straight-line distance, which amplifies outliers; in the range [0, ∞)
  - Manhattan = m = Σ|x − y|, also the L1 norm
    - Grid or absolute distance; in the range [0, ∞)
- Correlations: larger means more similar
  - Pearson = ρ = Σ(x − x̄)(y − ȳ)/√(Σ(x − x̄)²·Σ(y − ȳ)²)
    - Closely related to the Euclidean distance of z-scored data, i.e. data normalized by mean and standard deviation
    - Thus location and scale invariant, but assumes a normal distribution; in the range [−1, 1]
  - Cosine = c = Σxy/√(Σx²·Σy²)
    - Also called uncentered Pearson: normalized by standard deviation but not by mean
    - Thus scale but not location invariant; in the range [−1, 1]
  - Spearman = r = Pearson correlation of ranks (with ties given identical ranks)
    - Assesses two datasets' monotonicity
    - A nonparametric measure of similarity of trend, thus location and scale invariant; in the range [−1, 1]
- Beware of summary statistics: Anscombe's quartet
  - Constructed by hand in 1973 by Francis Anscombe at Yale as a demonstration
  - Four (x, y) datasets with:
    - Equal mean (9) and variance (11) of x, and equal mean (7.5) and variance (~4.1) of y
    - Identical Pearson correlation (0.816) and regression line (y = 3 + 0.5x)
    - But completely different relationships
  - Understand (and visualize) your data before summarizing them!
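A small sketch collecting the five paired measures just cataloged (numpy and scipy assumed available; the vectors x and y are invented). The final lines illustrate the invariance claims: shifting y by a constant leaves Pearson and Spearman untouched but changes cosine:

```python
# Sketch of the paired similarity/dissimilarity measures above
# (numpy and scipy assumed available; x and y are invented).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x, with noise

euclidean = np.sqrt(np.sum((x - y) ** 2))      # L2: amplifies outliers
manhattan = np.sum(np.abs(x - y))              # L1: grid distance
pearson = np.corrcoef(x, y)[0, 1]              # location and scale invariant
cosine = x @ y / np.sqrt((x @ x) * (y @ y))    # uncentered Pearson: scale invariant only
spearman, _ = stats.spearmanr(x, y)            # Pearson on ranks: monotonicity

print(f"Euclidean={euclidean:.2f}  Manhattan={manhattan:.2f}")
print(f"Pearson={pearson:.3f}  cosine={cosine:.3f}  Spearman={spearman:.3f}")

# Shifting y by a constant changes cosine but not Pearson or Spearman:
y2 = y + 100.0
print(f"after shift: Pearson={np.corrcoef(x, y2)[0, 1]:.3f}, "
      f"cosine={x @ y2 / np.sqrt((x @ x) * (y2 @ y2)):.3f}")
```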
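And to make Anscombe's point concrete, a sketch that recomputes the quartet's summary statistics (numpy assumed; the values are Anscombe's published 1973 data as commonly tabulated):

```python
# Sketch verifying Anscombe's quartet numerically (numpy assumed;
# values as published by Anscombe, 1973).
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8] * 7 + [19] + [8] * 3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    b, a = np.polyfit(x, y, 1)   # slope, intercept of least-squares line
    print(f"{name}: mean(x)={x.mean():.1f} var(x)={x.var(ddof=1):.1f} "
          f"mean(y)={y.mean():.2f} var(y)={y.var(ddof=1):.2f} "
          f"r={r:.3f} fit: y={a:.1f}+{b:.2f}x")
# All four print (nearly) identical summaries, yet scatter plots reveal
# a line, a curve, an outlier-driven line, and a single leverage point.
```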
The "probability" of either happening equals the summed "probability" of each happening independently These represent one set of three assumptions from which intuitive rules about probability can be derived 0≤P(E)≤1 for any event E P({})=0, i.e. every experiment must have some outcome P(E)≤P(F) if EF, i.e. the probability of more events happening must be at least as great as fewer events *** 15m Conditional probabilities and Bayes' theorem Conditional probability The probability of an event given that another event has already occurred The probability of event F given that the sample space S has been reduced to ES Notated P(F|E) and defined as P(EF)/P(E) Bayes' theorem P(F|E) = P(E|F)P(F)/P(E) True since P(F|E) = P(EF)/P(E) and P(EF) = P(FE) = P(E|F)P(F) Provides a means of calculating a conditional probability based on the inverse of its condition Typically described in terms of prior, posterior, and support P(F) is the prior probability of F occurring at all in the first place, "before" anything else P(F|E) is the posterior probability of F occurring "after" E has occurred P(E|F)/P(E) is the support E provides for F Some examples from poker Pick a card, any card! Probability of drawing a jack given that you've drawn a face card? P(jack|face) = P(jackface)/P(face) = P(jack)/P(face) = (4/52)/(12/52) = 1/3 P(jack|face) = P(face|jack)P(jack)/P(face) = 1*(4/52)/(12/52) = 1/3 Probability of drawing a face card given that you've drawn a jack? P(face|jack) = P(jackface)/P(jack) = P(jack)/P(jack) = 1 P(face|jack) = P(jack|face)P(face)/P(jack) = (1/3)(12/52)/(4/52) = 1 Pick three cards: probability of drawing (exactly) a pair of aces given that you've drawn (exactly) a pair? P(2A|pair) = P(2Apair)/P(pair) = P(2A)/P(pair) = ((4/52)(3/51)(48/50)/3!)/(1*(3/51)(48/50)/3!) = 1/13 P(2A|pair) = P(pair|2A)P(2A)/P(pair) = 1*((4/52)(3/51)(48/50)/3!)/(1*(3/51)(48/50)/3!) = 1/13 Reading Basic definitions: Pagano and Gauvreau, 2.1, 2.3-4 Mean, median, etc.: Pagano and Gauvreau, 3.1-2 Correlation: Pagano and Gauvreau, 17.1-3 Probability: Pagano and Gauvreau, 6.1-3 Arumugam Nature 2011, "Enterotypes of the human gut microbiome" Problem Set 01: Quantitative methods