M01 Notes

Monday #1
10m
Introduction, overview, and questions
20m
Thinking about data visually
Visualizing data: bar charts, histograms, density plots, cumulative distributions, stripcharts, box plots, scatter plots
10m
Why does quantitative data analysis matter?
It takes five minutes in Excel to generate an incorrect biological conclusion
Open Excel and create a table of random data with 15 columns and 5,000 rows
The top row is a measurement of bacterial cell survival "age"
The following 5,000 rows are measurements of bacterial whole-genome "expression"
Freeze these data (copy/paste by value) and create a new column correlating each "gene" with "age"
Sort by correlation and scatter plot "age" versus the most correlated "gene"
Does this "gene" control cell "longevity"? (A simulation of this pitfall follows below.)
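A minimal Python sketch of the same demonstration, with numpy standing in for Excel (variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_genes = 15, 5000

    # Random "age" for 15 samples and random "expression" for 5,000 "genes"
    age = rng.normal(size=n_samples)
    expression = rng.normal(size=(n_genes, n_samples))

    # Pearson correlation of each "gene" with "age"
    corrs = np.array([np.corrcoef(g, age)[0, 1] for g in expression])

    best = int(np.argmax(np.abs(corrs)))
    print(f"most correlated 'gene': #{best}, r = {corrs[best]:.2f}")
    # With 5,000 purely random "genes" and only 15 samples, the best |r| is
    # typically near 0.8, so its scatter plot against "age" looks convincing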
10m
Descriptive and inferential statistics
Descriptive statistics summarize information that is already present in a data collection
These include:
Common visualizations like box plots, histograms, etc.
Common summary measures like averages, standard deviations, medians, and so forth
The computer science equivalent is data mining or search: Google
Inferential statistics use a sample of data to make predictions about larger populations or about unobserved/future trends
These include:
Any measurements made in the presence of noise or variation
Generalizations from a sample to a population: confidence intervals, hypothesis tests, etc.
Comparisons made between datasets: correlations, regression, etc.
The computer science equivalent is machine learning: classification or prediction
Data are typically described in one of three forms: categorical, ordinal, or continuous
Categorical values take one of a discrete set of unordered values:
Binary or boolean values: a coin flip (H/T)
A tissue type: blood/skin/lung/GI/etc.
Ordinal values take one of a discrete set of ordered values:
Counts or rank orders
Often used for nonparametric methods (discussed later)
In practice these are rarely analyzed differently from continuous values
Continuous values take one value from an ordered numerical scale
Times, frequencies, ratios, percentages, abundances, etc.
30m
Simple descriptive statistics
A statistic is any single value that summarizes an entire dataset
Parametric summary statistics: mean, standard deviation, and z-scores
Typically used to describe "well behaved" data that are approximately normally distributed
This means continuous, symmetric, thin-tailed
How close "approximately" must be depends on the exact application - often quite flexible, but not always
Average = mean = μ = Σx/n
Standard deviation = σ = √variance = √(Σx²/n - μ²) (population)
Beware the difference between population and sample standard deviation
s = √(Σ(x - x̄)²/(n - 1)) or √((n/(n - 1))(Σx²/n - x̄²)) (sample)
The latter is slightly larger due to uncertainty in estimating the population mean from a sample
Data expressed as z-scores are relative to a dataset's mean and standard deviation: z = (x - μ)/σ
2/3 of normally distributed data will be within σ of μ, 95% within 2σ, 99% within 3σ
Be careful when using sample σ to detect outliers: 33%, 5%, or 1% of your data will always be excluded
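A sketch of these parametric summaries in Python (the toy data are illustrative, assuming numpy is available):

    import numpy as np

    x = np.array([4.1, 5.0, 5.5, 6.2, 7.3, 8.0, 9.4])  # toy data

    mu = x.mean()          # mean: sum(x) / n
    sigma = x.std(ddof=0)  # population SD: divide by n
    s = x.std(ddof=1)      # sample SD: divide by n - 1, slightly larger
    z = (x - mu) / sigma   # z-scores: data relative to mean and SD

    print(mu, sigma, s)    # s > sigma
    print(z)               # mean 0, SD 1 by construction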
Nonparametric summary statistics: median, percentiles, quartiles, interquartile range, and outliers
Can be used to describe any data regardless of distribution; consider rank order and not value
Median = m = x[|x|/2] = midpoint of dataset
Percentile = p(y) = x[y·|x|] = data point a fraction y of the way "through" the dataset
Quartiles = 25th, 50th, and 75th percentiles = {p(0.25), p(0.5), p(0.75)}, also often quintiles, deciles, etc.
Inter-quartile range = IQR = difference between 75th and 25th percentile = p(0.75) - p(0.25)
Thus exactly half of any dataset is within its IQR
Fences = bounds on "usual" range of data
Upper fence lies above the 75th percentile p(0.75), lower fence below 25th percentile p(0.25)
Inner fences are 1.5 inter-quartile ranges above/below, outer fences 3 IQRs
Provide a good way to detect outliers: <LIF or >UIF is unusual, <LOF or >UOF is extremely unlikely
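The nonparametric summaries and fences, sketched in Python on toy data with one planted outlier:

    import numpy as np

    x = np.array([2.0, 3.1, 3.5, 4.2, 4.8, 5.0, 5.9, 6.4, 7.7, 21.0])

    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1  # half of the data lie between q1 and q3

    # Inner fences sit 1.5 IQRs beyond the quartiles, outer fences 3 IQRs
    lif, uif = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lof, uof = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    print("beyond inner fences (unusual):", x[(x < lif) | (x > uif)])
    print("beyond outer fences (extreme):", x[(x < lof) | (x > uof)])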
30m
Comparing data: Manhattan, Euclidean, Pearson, Cosine, and Spearman correlations and distances
Useful to have summary statistics of paired data
Measurements consisting of two correspondingly ordered vectors
Motivations for different paired summary statistics are diverse
Some describe properties of probability distributions, like correlation
Others describe properties of "real" space, like Euclidean distance
In this class we'll focus on simply cataloging and having an intuition for these
Note that we'll avoid the term metric, which has a specific mathematical meaning
Slightly safer to refer to similarity or dissimilarity measures
Distances: larger means less similar
Euclidean = e = √(Σ(x - y)²), also the L2 norm
Straight line distance, which amplifies outliers; in the range [0, ∞)
Manhattan = m = Σ|x - y|, also the L1 norm
Grid or absolute distance; in the range [0, ∞)
Correlations: larger means more similar
Pearson = ρ = Σ(x - x̄)(y - ȳ)/√(Σ(x - x̄)² Σ(y - ȳ)²)
Closely related to the Euclidean distance of z-scored data, i.e. normalized by mean and standard deviation: e² = 2n(1 - ρ)
Thus location and scale invariant, but assumes normal distribution; in the range [-1, 1]
Cosine = c = Σxy/√(Σx² Σy²)
Also uncentered Pearson, normalized by standard deviation but not mean
Thus scale but not location invariant; in the range [-1, 1]
Spearman = r = Pearson correlation of ranks (with ties ranked identically)
Assesses two datasets' monotonicity
Nonparametric measure of similarity of trend, thus location and scale invariant; in the range [-1, 1]
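All five measures in a few lines of Python (a sketch, assuming numpy and scipy are available):

    import numpy as np
    from scipy import spatial, stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.4, 3.9, 5.5])

    euclidean = np.sqrt(np.sum((x - y) ** 2))   # L2 norm; amplifies outliers
    manhattan = np.sum(np.abs(x - y))           # L1 norm; grid distance
    pearson = stats.pearsonr(x, y)[0]           # location and scale invariant
    cosine = 1 - spatial.distance.cosine(x, y)  # scipy returns the distance, so flip it
    spearman = stats.spearmanr(x, y)[0]         # Pearson on ranks

    print(euclidean, manhattan, pearson, cosine, spearman)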
Beware of summary statistics: Anscombe's quartet
Manually constructed in 1973 by Francis Anscombe at Yale as a demonstration
Four datasets of (x, y) pairs with:
Equal mean (9) and variance (11) of x, equal mean (7.5) and variance (4.1) of y
Identical Pearson correlation (0.816) and regression (y = 3+0.5x)
But completely different relationships
Understand (and visualize) your data before summarizing them!
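A quick check in Python: seaborn bundles Anscombe's data (load_dataset fetches it from the seaborn-data repository, so this sketch assumes network access):

    import seaborn as sns

    df = sns.load_dataset("anscombe")  # columns: dataset (I-IV), x, y

    for name, g in df.groupby("dataset"):
        print(name, g.x.mean(), g.x.var(), g.y.mean(), g.y.var(), g.x.corr(g.y))

    # All four print (nearly) identical summaries; plotting, e.g. with
    # sns.lmplot(data=df, x="x", y="y", col="dataset"), shows how different they are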
20m
Probability: basic definitions
Statistics describe data; probability provides an underlying mathematical theory for manipulating them
Experiment: anything that produces a non-deterministic (random or stochastic) result
Coin flip, die roll, item count, concentration measurement, distance measurement...
Sample space: the set of all possible outcomes for a particular experiment, finite or infinite, discrete or continuous
{H, T}, {1, 2, 3, 4, 5, 6}, {0, 1, 2, 3, ...}, {0, 0.1, 0.001, 0.02, 3.14159, ...}
Event: any subset of a sample space
{}, {H}, {1, 3, 5}, {0, 1, 2}, [0, 3)
Probability: for an event E, the limit of n(E)/n as n grows large, at least if you're a frequentist (see the simulation after this list)
Thus many symbolic proofs of probability relationships are based on integrals or limit theory
(Kolmogorov) axioms: one definition of "probability" that matches reality
For any event E, P(E)≥0
"Probability" is a non-negative real number
For any sample space S, P(S)=1
The "probability" of all outcomes for an experiment must total 1
For disjoint events E∩F = {}, P(E∪F) = P(E) + P(F)
For two mutually exclusive events that share no outcomes...
The "probability" of either happening equals the summed "probability" of each happening independently
These represent one set of three assumptions from which intuitive rules about probability can be derived
0≤P(E)≤1 for any event E
P({})=0, i.e. every experiment must have some outcome
P(E) ≤ P(F) if E ⊆ F, i.e. an event containing more outcomes must be at least as probable as one containing fewer
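The frequentist limit n(E)/n is easy to watch converge in simulation (a sketch, with E = rolling an even number on a fair die):

    import numpy as np

    rng = np.random.default_rng(0)

    # E = "roll an even number" on a fair six-sided die, so P(E) = 1/2
    for n in (100, 10_000, 1_000_000):
        rolls = rng.integers(1, 7, size=n)  # uniform on {1, ..., 6}
        print(n, np.mean(rolls % 2 == 0))   # n(E)/n, approaching 0.5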
15m
Conditional probabilities and Bayes' theorem
Conditional probability
The probability of an event given that another event has already occurred
The probability of event F given that the sample space S has been reduced to E ⊆ S
Notated P(F|E) and defined as P(E∩F)/P(E)
Bayes' theorem
P(F|E) = P(E|F)P(F)/P(E)
True since P(F|E) = P(E∩F)/P(E) and P(E∩F) = P(F∩E) = P(E|F)P(F)
Provides a means of calculating a conditional probability based on the inverse of its condition
Typically described in terms of prior, posterior, and support
P(F) is the prior probability of F occurring at all in the first place, "before" anything else
P(F|E) is the posterior probability of F occurring "after" E has occurred
P(E|F)/P(E) is the support E provides for F
Some examples from poker
Pick a card, any card!
Probability of drawing a jack given that you've drawn a face card?
P(jack|face) = P(jack∩face)/P(face) = P(jack)/P(face) = (4/52)/(12/52) = 1/3
P(jack|face) = P(face|jack)P(jack)/P(face) = 1·(4/52)/(12/52) = 1/3
Probability of drawing a face card given that you've drawn a jack?
P(face|jack) = P(jack∩face)/P(jack) = P(jack)/P(jack) = 1
P(face|jack) = P(jack|face)P(face)/P(jack) = (1/3)(12/52)/(4/52) = 1
Pick three cards: probability of drawing (exactly) a pair of aces given that you've drawn (exactly) a pair?
P(2A|pair) = P(2A∩pair)/P(pair) = P(2A)/P(pair) = (3·(4/52)(3/51)(48/50))/(3·1·(3/51)(48/50)) = 1/13
P(2A|pair) = P(pair|2A)P(2A)/P(pair) = 1·(3·(4/52)(3/51)(48/50))/(3·1·(3/51)(48/50)) = 1/13
The factor of 3 counts the possible positions of the unpaired card; it cancels in the ratio
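The single-card examples, verified with exact rational arithmetic (a sketch using Python's standard fractions module):

    from fractions import Fraction as F

    p_jack, p_face = F(4, 52), F(12, 52)  # 4 jacks, 12 face cards in a deck

    # Direct: P(jack|face) = P(jack ∩ face)/P(face) = P(jack)/P(face)
    print(p_jack / p_face)           # 1/3

    # Bayes: P(jack|face) = P(face|jack)·P(jack)/P(face), with P(face|jack) = 1
    print(F(1) * p_jack / p_face)    # 1/3 again, as it must be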
Reading
Basic definitions: Pagano and Gauvreau, 2.1, 2.3-4
Mean, median, etc.: Pagano and Gauvreau, 3.1-2
Correlation:
Pagano and Gauvreau, 17.1-3
Probability:
Pagano and Gauvreau, 6.1-3
Arumugam et al., Nature 2011, "Enterotypes of the human gut microbiome"
Problem Set 01: Quantitative methods