Monday #1
  20m Introduction, overview, and questions
  20m Thinking about data visually
    Visualizing data: bar charts, histograms, density plots, cumulative distributions, strip charts, box plots, scatter plots
  5m Why does quantitative data analysis matter?
    It takes five minutes in Excel to generate an incorrect biological conclusion
      Open Excel and create a table of random data with 15 columns
      The top row is a measurement of bacterial cell survival "age"
      The following 5,000 rows are measurements of bacterial whole-genome "expression"
      Freeze these data (copy/paste by value) and create a new column correlating each "gene" with "age"
      Sort by correlation and scatter plot "age" versus the most correlated "gene"
      Does this "gene" control cell "longevity"?
  10m Descriptive and inferential statistics
    Descriptive statistics summarize information that is already present in a data collection
      These include:
        Common visualizations like box plots, histograms, etc.
        Common summary measures like averages, standard deviations, medians, and so forth
      The computer science equivalent is data mining or search (e.g. Google)
    Inferential statistics use a sample of data to make predictions about larger populations or about unobserved/future trends
      These include:
        Any measurements made in the presence of noise or variation
        Generalizations from a sample to a population: confidence intervals, hypothesis tests, etc.
        Comparisons made between datasets: correlations, regression, etc.
      The computer science equivalent is machine learning: classification or prediction
    Data are typically described in one of three forms: categorical, ordinal, or continuous
      Categorical values take one of a discrete set of unordered values:
        Binary or boolean values: a coin flip (H/T)
        A tissue type: blood/skin/lung/GI/etc.
      Ordinal values take one of a discrete set of ordered values:
        Counts or rank orders
        Often used for nonparametric methods (discussed later)
        These are rarely analyzed differently than continuous values
      Continuous values take one value from an ordered numerical scale
        Times, frequencies, ratios, percentages, abundances, etc.
  20m Simple descriptive statistics
    A statistic is any single value that summarizes an entire dataset
    Parametric summary statistics: mean, standard deviation, and z-scores
      Typically used to describe "well behaved" data that are approximately normally distributed
        This means continuous, symmetric, thin-tailed
        How close counts as "approximately" depends on the application: often quite flexible, but not always
      Average = mean = $\mu = \sum x / n$
      Standard deviation = $\sqrt{\text{variance}}$ = $\sigma = \sqrt{\sum x^2 / n - \mu^2}$ (population)
      Beware the difference between population and sample standard deviation
        $\sqrt{\sum (x - \mu)^2 / (n - 1)}$ or $\sqrt{(n / (n - 1)) (\sum x^2 / n - \mu^2)}$ (sample)
        The latter is slightly larger due to uncertainty in estimating the population mean from a sample
      Data expressed as z-scores are relative to a dataset's mean and standard deviation: $z = (x - \mu) / \sigma$
        About 2/3 of normally distributed data will be within $\sigma$ of $\mu$, 95% within $2\sigma$, and over 99% within $3\sigma$
        Be careful when using sample $\sigma$ to detect outliers: for normally distributed data, roughly 1/3, 5%, or under 1% of your data will always be excluded
    Nonparametric summary statistics: median, percentiles, quartiles, interquartile range, and outliers
      Can be used to describe any data regardless of distribution; they consider rank order and not value
      Median = $m = x[|x|/2]$ = midpoint of the sorted dataset
      Percentile = $p(y) = x[y|x|]$ = data point a fraction y of the way "through" the sorted dataset
      Quartiles = 25th, 50th, and 75th percentiles = $\{p(0.25), p(0.5), p(0.75)\}$; quintiles, deciles, etc. are also common
      Inter-quartile range = IQR = difference between 75th and 25th percentiles = $p(0.75) - p(0.25)$
        Thus half of any dataset lies within its inter-quartile range
      Fences = bounds on the "usual" range of data
        The upper fence lies above the 75th percentile $p(0.75)$, the lower fence below the 25th percentile $p(0.25)$
        Inner fences are 1.5 IQRs above/below the quartiles, outer fences 3 IQRs
        These provide a good way to detect outliers: a value below the lower inner fence (LIF) or above the upper inner fence (UIF) is unusual, and one below the lower outer fence (LOF) or above the upper outer fence (UOF) is extremely unlikely
  25m Comparing data: Manhattan, Euclidean, Pearson, cosine, and Spearman correlations and distances
    Useful to have summary statistics of paired data
      Measurements consisting of two correspondingly ordered vectors
    Motivations for different paired summary statistics are diverse
      Some describe properties of probability distributions, like correlation
      Others describe properties of "real" space, like Euclidean distance
      In this class we'll focus on simply cataloging these and building an intuition for them
      Note that we'll avoid the term metric, which has a specific mathematical meaning
      It is slightly safer to refer to similarity or dissimilarity measures
    Distances: larger means less similar
      Euclidean = $e = \sqrt{\sum (x - y)^2}$, also the L2 norm
        Straight line distance, which amplifies outliers; in the range $[0, \infty)$
      Manhattan = $m = \sum |x - y|$, also the L1 norm
        Grid or absolute distance; in the range $[0, \infty)$
    Correlations: larger means more similar
      Pearson = $\rho = \sum (x - \mu_x)(y - \mu_y) / \sqrt{\sum (x - \mu_x)^2 \sum (y - \mu_y)^2}$
        Closely related to the Euclidean distance of z-scored data, i.e. data normalized by mean and standard deviation: with population z-scores, $\sum (z_x - z_y)^2 = 2n(1 - \rho)$
        Thus location and scale invariant, but assumes normal distribution; in the range [-1, 1]
      Cosine = $c = \sum xy / \sqrt{\sum x^2 \sum y^2}$
        Also called uncentered Pearson: normalized by vector length (scale) but not centered on the mean
        Thus scale but not location invariant; in the range [-1, 1]
      Spearman = $r$ = Pearson correlation of ranks (with ties ranked identically)
        Assesses two datasets' monotonicity
        Nonparametric measure of similarity of trend, thus location and scale invariant; in the range [-1, 1]
    Beware of summary statistics: Anscombe's quartet
      Manually constructed in 1973 by Francis Anscombe at Yale as a demonstration
      Four paired datasets with:
        Equal mean (9) and variance (11) of x, mean (7.5) and variance (about 4.1) of y
        Identical Pearson correlation (0.816) and regression line ($y = 3 + 0.5x$)
        But completely different relationships
      Understand (and visualize) your data before summarizing them!
  Reading
    Basic definitions: Pagano and Gauvreau, 2.1, 2.3-4
    Mean, median, etc.: Pagano and Gauvreau, 3.1-2
    Correlation: Pagano and Gauvreau, 17.1-3
    Probability: Pagano and Gauvreau, 6.1-3
    Arumugam et al., Nature 2011, "Enterotypes of the human gut microbiome"
  Problem Set 01: Quantitative methods
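As a companion to the catalog of distances and correlations in this session, here is a minimal Python sketch of the five paired measures. It is an illustration rather than course-supplied code: the function names are hypothetical, only the standard library is used, and ties in the Spearman ranks are given their average rank, which is one common reading of "ranked identically" and an assumption here.

```python
import math


def euclidean(x, y):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation: centered cross-product over centered magnitudes."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def cosine(x, y):
    """Cosine similarity: like Pearson, but without subtracting the means."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def ranks(values):
    """1-based ranks; tied values share the average of their positions
    (an assumed tie-handling convention)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [1, 4, 9, 16, 25]  # monotone but nonlinear in x
    print("Euclidean:", euclidean(x, y))
    print("Manhattan:", manhattan(x, y))
    print("Pearson:  ", pearson(x, y))
    print("Cosine:   ", cosine(x, y))
    print("Spearman: ", spearman(x, y))
```

On this toy pair, Spearman is 1 because y increases monotonically with x, while Pearson is slightly below 1 because the relationship is not linear. Adding a constant to y leaves Pearson and Spearman unchanged but changes the cosine value, matching the invariance properties listed in the outline.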