Canadian Bioinformatics Workshops
www.bioinformatics.ca

Lecture 3
Univariate Analyses: Discrete Data
MBP1010 †
Dr. Paul C. Boutros
Winter 2014
DEPARTMENT OF MEDICAL BIOPHYSICS

† Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)

Lecture 3: Univariate Analyses II: Discrete Data bioinformatics.ca

How Will You Be Graded?
• 9% Participation: 1% per week
• 56% Assignments: 8 x 7% each
• 35% Final Examination: in-class
• Each individual will get their own, unique assignment
• Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run)
• Final Exam will include multiple-choice and written answers

Course Information Updates
• Website will have up to date information, lecture notes, sample source-code from class, etc.
• http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html
• Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT
• New TA (focusing on bioinformatics component) will be Irakli (Erik) Dzneladze
• Assignment #1 is released today, due on January 30
• Assignment #2 will be released on January 31, due Feb 7
• Updated course-schedule on website

House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions

Review From Lecture #1
Population vs. Sample
  All MBP Students = Population
  MBP Students in 1010 = Sample
How do you report statistical information?
  P-value, variance, effect-size, sample-size, test
Why don’t we use Excel/spreadsheets?
  Input errors, reproducibility, wrong results

Review From Lecture #2
Define discrete data
  Gaps on the number-line (continuous data has no gaps)
What is the central limit theorem?
  A random variable that is the sum of many small, independent random variables is approximately normally distributed
Theoretical vs. empirical quantiles
  Probability vs. percentage of values less than p
Components of a boxplot?
  25% - 1.5 IQR, 25%, 50%, 75%, 75% + 1.5 IQR

Boxplot
Descriptive statistics can be intuitively summarized in a Boxplot.
> boxplot(x)
[Figure: boxplot annotated with the 25% quantile, median, 75% quantile, IQR, and whiskers of 1.5 x IQR on each side]
Everything more than 1.5 x IQR above the 75% quantile or below the 25% quantile is considered an "outlier".
IQR = Interquartile Range = 75% quantile - 25% quantile

Review From Lecture #2
How can you interpret a QQ plot?
  Compares two samples or a sample and a distribution. Straight line indicates identity.
What is hypothesis testing?
  Confirmatory data-analysis; test null hypothesis
What is a p-value?
  Evidence against the null; the probability of seeing a value at least as extreme by chance alone if the null hypothesis is true (loosely, the probability of a false positive)

Review From Lecture #2
Parametric vs. non-parametric tests
  Parametric tests have distributional assumptions
What is the t-statistic?
  Signal:Noise ratio
Assumptions of the t-test?
  Data sampled from normal distribution; independence of replicates; independence of groups; homoscedasticity

Flow-Chart For Two-Sample Tests
Is Data Sampled From a Normally-Distributed Population?
  Yes: Equal Variance (F-Test)?
    Yes: Homoscedastic T-Test
    No: Heteroscedastic T-Test
  No: Sufficient n for CLT (>30)?
    Yes: proceed to Equal Variance (F-Test)?
    No: Wilcoxon U-Test

Topics For This Week
• Correlations
• ceRNAs
• Attendance
• Common discrete univariate analyses

Power, error rates and decision
Power calculation in R:
> power.t.test(n = 5, delta = 1, sd = 2, alternative = "two.sided", type = "one.sample")

     One-sample t test power calculation

              n = 5
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.1384528
    alternative = two.sided

Other tests are available; see ??power.

Power, error rates and decision
[Figure: overlapping null (μ0) and alternative (μ1) distributions; the shaded regions mark PR(False Negative) = PR(Type II error) and PR(False Positive) = PR(Type I error)]

Problem
When we measure more than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: there is e.g. a trend for one variable to increase as the other increases. Regression analyses assess the dependence.
Examples:
• Height vs. weight
• Gene dosage vs. expression level
• Survival analysis: probability of death vs. age

Correlation
When one variable depends on the other, the variables are to some degree correlated. (Note: correlation need not imply causality.)
In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance).
Pearson's coefficient of correlation values range from -1 to 1, with 0 indicating no correlation.

Pearson's Coefficient of Correlation
How to interpret the correlation coefficient: Explore varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.99
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.9999666

Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.8
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.9661111

Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.4
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.6652423

Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x <- rnorm(50)
> r <- 0.01
> y <- (r * x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.01232522

Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # periodic ...
> y <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.3438495

Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # polynomial ...
> y <- (r * x*x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] -0.5024503

Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # exponential
> y <- (r * exp(5*x)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.6334732

Pearson's Coefficient of Correlation
Non-linear relationships ...
> x <- runif(50,-1,1)
> r <- 0.9
> # circular ...
> a <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> b <- (r * sin(x*pi)) + ((1-r) * rnorm(50))
> plot(a,b); cor(a,b)
[1] 0.04531711

Correlation coefficient

Other Correlations
• There are many other types of correlations
• Spearman’s correlation
  • rho
• Kendall’s correlation
  • Tau
• Spearman is a Pearson on ranked values
• Spearman rho = 1 means a perfectly monotonic (increasing) relationship
• Pearson R = 1 means a perfectly linear (increasing) relationship

When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very) recent Nature paper.
January 9, 2014

Non-Small Cell Lung Cancer 101
Lung Cancer (15% 5-year survival)
  Non-Small Cell (80% of lung cancer)
    Adenocarcinomas
    Squamous Cell Carcinomas
    Large Cell (and others)
  Small Cell

Non-Small Cell Lung Cancer 102
Stage I: Local Tumour Only
Stage II: Local Lymph Nodes
Stage III: Distal Lymph Nodes
Stage IV: Metastasis
IA = small tumour; IB = large tumour

General Idea: HMGA2 is a ceRNA
What are ceRNAs?
Salmena et al. Cell 2011

Test Multiple Constructs for Activity

What Statistical Analysis Did They Do?
• No information given in main text!
• Figure legend says: “Values are technical triplicates, have been performed independently three times, and represent mean +/- standard deviation (s.d.) with propagated error.”
• In supplementary they say: “Unless otherwise specified, statistical significance was assessed by the Student’s t-test”
• So, what would you do differently?

Attendance Break

Let’s Go Back to Discrete vs. Continuous
• Definition?
• Let’s take a few examples of discrete univariate statistical analyses in biology and write them down here:
  • Cell counts
  • Embryo pigmentation yes/no with morpholino
  • SNP calling
  • Immunohistochemistry
  • Colony formations

Four Main Discrete Univariate Tests
• Hypergeometric test
  • Is a sample randomly selected from a fixed population?
• Proportion test
  • Are two proportions equivalent?
• Fisher’s Exact test
  • Are two binary classifications associated?
• (Pearson’s) Chi-Squared Test
  • Are paired observations on two variables independent?

Hypergeometric Test
• Is a sample randomly selected from a fixed population?
• Closer to discrete mathematics than statistics
• Technically: sampling without replacement
  P(x) = \frac{\binom{s}{x}\binom{N-s}{n-x}}{\binom{N}{n}}
  (a population of N containing s successes; a sample of n containing x successes)
• In R: ?phyper
• Classic example: marbles
• Less classic: poker
[Figure: marbles; 5/24 are yellow in the population, 1/6 sampled are yellow]

Hypergeometric Test: Biological Example
• Classic example in genomics: pathway analysis
  • I do a screen and identify n genes associated with something
  • Are those n genes biased towards a pathway?
  • Well a pathway contains m genes
  • So is n a random selection of m? Hypergeometric test!
• Similar example: drug screening
  • I test 1000 drugs to see which ones kill a cell-line
  • 100 of these are kinase inhibitors
  • 100 drugs kill my cell-line
  • 30 of these are kinase inhibitors
  • Did I find more kinase inhibitors than expected by chance?
  • Let’s do the calculation

Hypergeometric: Venn Diagram Overlap
Let’s pretend X and Y are sets of genes (or drugs, etc.) found in two separate experiments. We want to know, is there more overlap than expected by chance? To do this:
  Total Balls: total number of genes considered (but a gene must be analyzed in both experiments: exclude those studied in only one)
  Black Balls: all genes found in experiment X
  White Balls: all genes not found in experiment X
  Sample: all genes found in experiment Y

Proportion Test
• Are two proportions equivalent?
• Example: is the fraction of people who play hockey in MBP different from the fraction who play hockey in Mathematics?
  • Mathematics: 12/85
  • MBP: 24/135
• In R: prop.test
• Only useful for two-group studies

Proportion Test: Biological Example
• Does the frequency of TP53 mutations differ between prostate cancer patients who will suffer a recurrence and those who will not?
  • 12/150 patients whose tumours recur have mutated TP53
  • 50/921 patients whose tumours do not recur have mutated TP53
• P-value guesses?

Fisher’s Exact Test
• Are two binary categorizations associated?
• Based on a contingency table
  • What are these? Have we seen any before?
• In R: ?fisher.test
• Classic example: drinking tea
Dr. Muriel Bristol claimed to be able to taste whether tea or milk was added first to a cup. Dr. Ronald Fisher didn’t believe her.

          Milk   Tea
  Milk       4     0
  Tea        0     4

Fisher’s Exact Test: Biological Example
• You can use this any time you form a contingency table
• Any time you make predictions (biomarkers)
• Any time you compare two binary phenomena
• Examples?

(Pearson’s) Chi-Squared Test
• Are two variables independent?
• There are a lot of different chi-squared tests. Why?
  • Pearson
  • Yates
  • McNemar
  • Portmanteau test
• In R: ?chisq.test
• You can think of it as a multiple-category Fisher’s test
• The assumptions break down if <5 values in a cell

Chi-Squared Test: Biological Example
• Comparing sex across different tumour subtypes

                             Male   Female
  Adenocarcinoma              250      192
  Squamous Cell Carcinoma     202      261
  Small Cell Carcinoma         15        9
  Neuroendocrine               12       10

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)
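
Worked Examples in R (sketch)
The four discrete tests from this lecture can be tried directly in R. What follows is a minimal sketch, assuming the counts given on the slides (drug screen, hockey, TP53, tea tasting, tumour subtypes). The variable names are illustrative and not from the lecture, and the row/column orientation of the tea table is an assumption that does not change the p-value.

```r
# Sketch only: all variable names are hypothetical; counts are taken from the slides.

# 1. Hypergeometric test: drug screen
n_drugs   <- 1000  # drugs tested
n_kinase  <- 100   # kinase inhibitors among the drugs tested
n_hits    <- 100   # drugs that kill the cell-line
n_overlap <- 30    # hits that are kinase inhibitors
# phyper() returns P(X <= q), so P(X >= 30) is the upper tail at q = 29
p_hyper <- phyper(n_overlap - 1, n_kinase, n_drugs - n_kinase, n_hits,
                  lower.tail = FALSE)

# 2. Proportion test: hockey (Mathematics vs. MBP) and TP53 (recur vs. no recur)
p_hockey <- prop.test(x = c(12, 24), n = c(85, 135))$p.value
p_tp53   <- prop.test(x = c(12, 50), n = c(150, 921))$p.value

# 3. Fisher's exact test: tea tasting
# Assumed orientation: one margin is what was poured first, the other is the guess
tea <- matrix(c(4, 0, 0, 4), nrow = 2,
              dimnames = list(c("Milk", "Tea"), c("Milk", "Tea")))
p_tea <- fisher.test(tea)$p.value  # two-sided by default

# 4. Pearson's chi-squared test: sex across tumour subtypes
tumours <- matrix(c(250, 192,
                    202, 261,
                     15,   9,
                     12,  10),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("Adenocarcinoma", "Squamous Cell Carcinoma",
                                    "Small Cell Carcinoma", "Neuroendocrine"),
                                  c("Male", "Female")))
chi <- chisq.test(tumours)

# Collect the p-values in one named vector
print(c(hypergeometric = p_hyper, hockey = p_hockey, tp53 = p_tp53,
        tea = p_tea, chi_squared = chi$p.value))
```

The overlap is reduced by one before calling phyper() because the function reports the probability of seeing at most q successes. Whether a one-sided or two-sided alternative is appropriate depends on the question being asked, so the defaults used here are a starting point rather than a recommendation.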