Explicit Definition of Concept Hierarchies

[Figure: entity-relationship schema linking Disease, Gene Ontology, Patient, Anatomy Ontology, Gene Cluster, Gene Sequence, Array Probe, Clinical Sample, mRNA Expression, Project, Platform, Normalization, Measurement Unit, and Experiment through 1-to-n and n-to-n relationships.]

Sample Classification Hierarchy

[Figure: classification tree rooted at All_diseases, which splits into Tumor and Normal. Nodes shown include Adenocarcinoma, CNS_tumor, and Leukemia on the tumor side; Brain, Blood, Colon, and Breast on the normal side; and, lower down, Glioblastoma, ALL, AML, Colon tumor, and Breast tumor. The two lowest levels are Patients and Clinical Samples.]

Aggregate Functions
• Simple: sum, average, max, min, etc.
• Statistical: variance, standard deviation, t-statistic, F-statistic, etc.
• User-defined: e.g., for aggregation of Affymetrix gene expression data on the Measurement Unit dimension, we may define the following function:

$$\mathrm{Exp} = \begin{cases} \mathrm{Val} & \text{if } \mathrm{PA} = \text{'P' or 'M'} \\ 0 & \text{if } \mathrm{PA} = \text{'A'} \end{cases}$$

Here, Exp is the summarized gene expression; Val and PA are the numeric value and the PA call given by the Affymetrix platform, respectively.

Conventional OLAP Operations
• Roll-up: aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
• Drill-down: the reverse of roll-up, navigation from less detailed data to more detailed data.
• Slice: selection on one dimension of the given data cube, resulting in a subcube.
• Dice: defining a subcube by performing a selection on two or more dimensions.
• Pivot: a visualization operation that rotates the data axes to provide an alternative presentation.

t Test
• The t-test assesses whether the means of two groups are statistically different from each other.
• Given two groups of samples $X_1: \{n_1, \bar{x}_1, s_1^2\}$ and $X_2: \{n_2, \bar{x}_2, s_2^2\}$:
– $N$: the number of samples
– $\bar{X}$: the mean of the samples
– $S^2$: the variance of the samples, $S^2 = \dfrac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}$
– $N - 1$: the degrees of freedom (used in place of $N$ due to the bias of the sample)
• Assumption: the differences in the groups follow a normal distribution.
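As a quick numerical check of the N − 1 divisor in the variance formula above, the following minimal Python sketch computes the sample mean and variance by hand and compares the result with the standard library. The five values are the Disease 1 measurements of gene X used in the Compare example later in these slides; the variable names are illustrative only.

```python
from statistics import mean, variance

# Five expression values (Disease 1 measurements of gene X from the Compare example).
samples = [100, 90, 105, 83, 78]

n = len(samples)
x_bar = mean(samples)

# Sum of squared deviations from the sample mean.
sum_sq = sum((x - x_bar) ** 2 for x in samples)

# Dividing by N - 1 (the degrees of freedom) gives the unbiased sample variance;
# dividing by N gives a smaller, biased estimate.
s2_unbiased = sum_sq / (n - 1)
s2_biased = sum_sq / n

print("mean:", round(x_bar, 1))
print("variance with N - 1:", round(s2_unbiased, 1))
print("variance with N:", round(s2_biased, 2))
print("statistics.variance:", round(variance(samples), 1))
```

`statistics.variance` also divides by N − 1, so it should agree with the hand-computed unbiased value.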
Degrees of Freedom (df)

Idea: the number of observations that are free to vary after the sample mean has been calculated.

Example: Suppose the mean of 3 numbers is 8.0. Let X1 = 7 and X2 = 8. What is X3? If the mean of these three values is 8.0, then X3 must be 9 (i.e., X3 is not free to vary). Here, n = 3, so degrees of freedom = n − 1 = 3 − 1 = 2 (2 values can be any numbers, but the third is not free to vary for a given mean).

Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 8-6

Student t-distribution
• It is a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.

t Test
• Hypothesis: H0 (null hypothesis): µ1 = µ2; Hα: µ1 ≠ µ2
• Choose the significance level: α = 0.05 (the amount of uncertainty we are prepared to accept in the study).
• Test statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$$

• The t-value can be positive or negative (positive if the first mean is larger than the second and negative if it is smaller).
• Calculate the p-value corresponding to the t-value: look up a table.
• The t is a family of distributions, one for each number of degrees of freedom.

Student's t Distribution

Note: t → Z as n increases.

[Figure: density curves centred at 0 for the standard normal (t with df = ∞), t (df = 13), and t (df = 5).]

t-distributions are bell-shaped and symmetric, but have 'fatter' tails than the normal.

Selected t distribution values, with comparison to the Z value

Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z (∞ d.f.)
0.80               1.372         1.325         1.310         1.28
0.90               1.812         1.725         1.697         1.645
0.95               2.228         2.086         2.042         1.96
0.99               3.169         2.845         2.750         2.58

Note: t → Z as n increases.

Example of t distribution confidence interval

A random sample of n = 25 has $\bar{X} = 50$ and S = 8. Form a 95% confidence interval for μ.
– d.f. = n − 1 = 24, so $t_{\alpha/2} = t_{0.025} = 2.0639$
– The confidence interval is

$$\bar{X} \pm t_{\alpha/2}\frac{S}{\sqrt{n}} = 50 \pm (2.0639)\frac{8}{\sqrt{25}}$$

$$46.698 \le \mu \le 53.302$$

P-Value
• The p-value is the upper-tail (or lower-tail) area of the t curve.
• Steps to accept/reject the null hypothesis H0:
– Calculate the t-statistic
– Look up the table to find the p-value
– Given the significance level α, if the p-value is smaller than α, then reject H0; otherwise, accept H0

[Figure: the t-curve with 25 degrees of freedom; the tail area beyond the t-statistic value is the p-value.]

New OLAP Operation: Compare
• Compare two random variables by computing ratios, differences or t-statistics.
• Example:
– Question: Is gene X expressed differently between two groups?
– Solution: (1) Compute the mean and variance. (2) Compute t and p: t = 3.120, p = 0.013 (assuming unequal variances) or p = 0.007 (assuming equal variances).
– Answer: Yes (at the 5% significance level)

Different measurements of gene X

            Disease 1   Disease 2
            100         70
            90          72
            105         81
            83          74
            78          75
Mean        91.2        74.4
Variance    127.7       17.3
N           5           5

$$\bar{X}_1 = 91.2, \quad n_1 = 5, \quad \mathrm{Var}X_1 = \frac{\sum_{i=1}^{n_1}(X_{1i} - \bar{X}_1)^2}{n_1 - 1} = \frac{(100 - 91.2)^2 + \dots + (78 - 91.2)^2}{5 - 1} = 127.7$$

$$\bar{X}_2 = 74.4, \quad n_2 = 5, \quad \mathrm{Var}X_2 = \frac{\sum_{i=1}^{n_2}(X_{2i} - \bar{X}_2)^2}{n_2 - 1} = \frac{(70 - 74.4)^2 + \dots + (75 - 74.4)^2}{5 - 1} = 17.3$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\mathrm{Var}X_1}{n_1} + \dfrac{\mathrm{Var}X_2}{n_2}}} = \frac{91.2 - 74.4}{\sqrt{\dfrac{127.7}{5} + \dfrac{17.3}{5}}} = 3.120$$

Assuming unequal variances, the degrees of freedom are:

$$df = \frac{\left(\dfrac{\mathrm{Var}X_1}{n_1} + \dfrac{\mathrm{Var}X_2}{n_2}\right)^2}{\dfrac{(\mathrm{Var}X_1 / n_1)^2}{n_1 - 1} + \dfrac{(\mathrm{Var}X_2 / n_2)^2}{n_2 - 1}} = \frac{\left(\dfrac{127.7}{5} + \dfrac{17.3}{5}\right)^2}{\dfrac{(127.7 / 5)^2}{4} + \dfrac{(17.3 / 5)^2}{4}} = 5.06 \approx 5$$

p = 0.013 (calculated using Excel's TDIST function, one-tail)

[Figure: output from Excel.]

With the same means and variances, the pooled sample variance (assuming equal variances) is:

$$\mathrm{Var}X_{12} = \frac{(n_1 - 1)(\mathrm{Var}X_1) + (n_2 - 1)(\mathrm{Var}X_2)}{n_1 + n_2 - 2} = 72.5$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(\mathrm{Var}X_{12})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{91.2 - 74.4}{\sqrt{72.5\left(\dfrac{1}{5} + \dfrac{1}{5}\right)}} = 3.120$$

Degrees of freedom: $df = n_1 + n_2 - 2 = 5 + 5 - 2 = 8$

p = 0.007 (calculated using Excel's TDIST function, one-tail)

[Figure: output from Excel.]

New OLAP Operation: ANOVA
• Analysis of variance (ANOVA) tests if there are differences between any pair of variables.
• Example: Is there a significant difference between the expression of gene X in the various disease conditions?

Different measurements of gene X

            Disease 1   Disease 2   Disease 3
            100         70          95
            90          72          93
            105         81          79
            83          74          85
            78          75          90
mean        91.2        74.4        88.4
st dev      11.3        4.2         6.5

ANOVA
• ANalysis Of VAriance (ANOVA) is used to find significant genes in more than two conditions:

        Disease A         Disease B         Disease C
Gene    A1    A2    A3    B1    B2    B3    C1    C2    C3
g1      0.9   1.1   1.4   1.9   2.1   2.5   3.1   2.9   2.6
g2      4.2   3.9   3.5   5.1   4.6   4.3   1.8   2.4   1.5
g3      0.7   1.2   0.9   1.1   0.9   0.6   1.2   0.8   1.4
g4      2.0   1.2   1.7   4.0   3.2   2.8   6.3   5.7   5.1
∙∙∙     ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙

• For each gene, compute the F statistic.
• Calculate the p value for the F statistic.

One-way Analysis of Variance (ANOVA)
• Decide whether there are any differences between the values from k conditions (groups).
– H0: µ1 = µ2 = … = µk
– Hα: There is at least one pair of means that are different from each other.
• Assumptions:
– All k populations have the same variance.
– All k populations are normal.
• ANOVA can be applied to any number of samples. If there are only two groups, the ANOVA will provide the same results as a t-test.
• Problem with multiple t-tests: accumulated error may be large.
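The remark that a two-group ANOVA gives the same result as a t-test can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it reuses the Disease 1 and Disease 2 measurements of gene X, computes the equal-variance (pooled) t statistic, and computes the one-way ANOVA F statistic from the between-group and within-group sums of squares defined on the slides that follow. For two groups, F should equal t squared. The function names `pooled_t` and `anova_f` are assumptions made for this example.

```python
from statistics import mean, variance

# Gene X measurements from the Compare example.
disease_1 = [100, 90, 105, 83, 78]
disease_2 = [70, 72, 81, 74, 75]


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (pooled variance)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5


def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean squares."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


t = pooled_t(disease_1, disease_2)
f = anova_f([disease_1, disease_2])

print("t   =", round(t, 3))
print("t^2 =", round(t * t, 3))
print("F   =", round(f, 3))
```

The same `anova_f` helper accepts three or more groups, which is the case the multiple-t-test warning is about: one F test replaces several pairwise comparisons.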
Idea of ANOVA
• The measurements of each group vary around their mean (within-group variance).
• The means of each condition vary around an overall mean (inter-group variability).
• ANOVA studies the relationship between the inter-group and the within-group variance.

Number of groups: $k$; number of measurements for group $i$: $n_i$

Total number of measurements: $N = \sum_{i=1}^{k} n_i$

The overall mean: $\bar{X} = \dfrac{\sum_{i=1}^{k}\sum_{j=1}^{n_i} X_{ij}}{N}$

The mean for group $i$: $\bar{X}_i = \dfrac{\sum_{j=1}^{n_i} X_{ij}}{n_i}$

Sum of squares between groups: $SS_{Cond} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (\bar{X}_i - \bar{X})^2$

Sum of squares within groups: $SS_{Error} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2$

Degrees of freedom for the conditions: $k - 1$

Condition mean squares: $MS_{Cond} = \dfrac{SS_{Cond}}{k - 1}$

Degrees of freedom for the error: $N - k$

Error mean squares: $MS_{Error} = \dfrac{SS_{Error}}{N - k}$

F statistic: $F = \dfrac{MS_{Cond}}{MS_{Error}}$

Calculate $p$ from the F distribution with $v_1 = k - 1$ and $v_2 = N - k$.

Applied to the gene X example:

Number of diseases: $k = 3$; number of measurements for disease $i$: $n_i = 5$

Total number of measurements: $N = \sum_{i=1}^{k} n_i = 15$

The overall mean: $\bar{X} = \dfrac{\sum_{i=1}^{k}\sum_{j=1}^{n_i} X_{ij}}{N} = 84.67$

The mean for disease $i$: $\bar{X}_i = \dfrac{\sum_{j=1}^{n_i} X_{ij}}{n_i}$

Disease sum of squares: $SS_{Disease} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (\bar{X}_i - \bar{X})^2 = 810.13$

Error sum of squares: $SS_{Error} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 = 747.20$

Degrees of freedom for the diseases: $k - 1 = 3 - 1 = 2$

Disease mean squares: $MS_{Disease} = \dfrac{SS_{Disease}}{k - 1} = \dfrac{810.13}{2} = 405.06$

Degrees of freedom for the error: $N - k = 15 - 3 = 12$

Error mean squares: $MS_{Error} = \dfrac{SS_{Error}}{N - k} = \dfrac{747.20}{12} = 62.27$

F statistic: $F = \dfrac{MS_{Disease}}{MS_{Error}} = \dfrac{405.06}{62.27} = 6.50$

$p = 0.012$

[Figure: output from Excel (ANOVA, single factor).]

At the 5% significance level, gene X is expressed differently between some of the disease conditions (p = 0.012).

New OLAP Operation: Correlate
• Computing the Pearson correlation coefficient between two variables (e.g., between a clinical variable and a gene expression variable).
• Example: Is the gene expression correlated with the drug use?
$$\rho_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{(\mathrm{Var}X)(\mathrm{Var}Y)}}$$

Expression of gene X   Dosage of Drug Y
50                     15
205                    50
45                     0
83                     20
155                    40
78                     20

The Covariance
• The covariance measures the strength of the linear relationship between two numerical variables (X & Y).
• The sample covariance:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

• Only concerned with the strength of the relationship
• No causal effect is implied

Coefficient of Correlation
• Measures the relative strength of the linear relationship between two numerical variables
• Sample coefficient of correlation:

$$r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$$

where

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}, \quad S_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}}, \quad S_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n - 1}}$$

Pearson's Correlation Coefficient
• Given two groups of samples X = {x1, …, xn} and Y = {y1, …, yn}.
• Pearson's correlation coefficient r is given by

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Features of the Coefficient of Correlation
• The population coefficient of correlation is referred to as ρ.
• The sample coefficient of correlation is referred to as r.
• Both ρ and r have the following features:
– Unit free
– Range between –1 and 1
– The closer to –1, the stronger the negative linear relationship
– The closer to 1, the stronger the positive linear relationship
– The closer to 0, the weaker the linear relationship

Scatter Plots of Sample Data with Various Coefficients of Correlation

[Figure: five scatter plots of Y against X, with r = −1, r = −0.6, r = 0, r = +0.3, and r = +1.]

Calculation of the Correlation Coefficient

$$\bar{X} = 102.67, \quad \bar{Y} = 24.17, \quad n = 6$$

$$\mathrm{Var}X = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1} = \frac{(50 - 102.67)^2 + \dots + (78 - 102.67)^2}{6 - 1} = 4061.07$$

$$\mathrm{Var}Y = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n - 1} = \frac{(15 - 24.17)^2 + \dots + (20 - 24.17)^2}{6 - 1} = 324.17$$

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} = \frac{(50 - 102.67)(15 - 24.17) + \dots + (78 - 102.67)(20 - 24.17)}{6 - 1} = \frac{5533.33}{5} = 1106.67$$

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{(\mathrm{Var}X)(\mathrm{Var}Y)}} = \frac{1106.67}{\sqrt{4061.07 \times 324.17}} = \frac{1106.67}{1147.38} = 0.96$$

New OLAP Operation: Select
• Given a threshold, select the entries that meet the minimum requirement.
• Example: For a threshold of p < 0.05, gene 2 and gene 6 are selected.

Gene   p value
1      0.561
2      0.004
3      0.160
4      0.335
5      0.083
6      0.025
7      0.532
8      0.476

Discovery of Differentially Expressed Genes (1)

Roll-up the microarray data over the Measurement Unit dimension using the user-defined aggregate function.

[Figure: data cube with axes Gene (D13626, D13627, D13628, J04605, L37042, S78653, X60003, Z11518) and Sample (patient 1 to 7). Before the roll-up each cell holds a PA call and a Val; after the roll-up each cell holds a single expression value.]

Discovery of Differentially Expressed Genes (2)

Roll-up the data over the Clinical Sample dimension from the patient level to the disease level (or normal tissue level). After the roll-up, each cell contains the mean, the variance and the number of values aggregated.

[Figure: the Gene by Sample (patient 1 to 7) cube is rolled up to the disease level, giving a Gene by Sample (disease a to d) cube.]

Discovery of Differentially Expressed Genes (3)

Compare a particular disease type with its corresponding normal tissue type. Compute the t statistic and p value for each gene. Select the genes that have a p value less than a given threshold (e.g., p < 0.05).
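The Select step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the p values are the eight listed on the Select slide, keyed by gene number, and the helper name `select` and its `threshold` parameter are made up for this example.

```python
# p values per gene, as listed on the Select slide.
p_values = {
    1: 0.561,
    2: 0.004,
    3: 0.160,
    4: 0.335,
    5: 0.083,
    6: 0.025,
    7: 0.532,
    8: 0.476,
}


def select(p_by_gene, threshold=0.05):
    """Return the genes whose p value is strictly below the threshold."""
    return [gene for gene, p in p_by_gene.items() if p < threshold]


selected = select(p_values)
print(selected)  # [2, 6]
```

Lowering the threshold to 0.01 would keep only gene 2, which shows how sensitive the selected set is to the chosen cut-off.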
[Figure: the Gene by Sample (disease a to d) cube; comparing a with c yields one p value per gene (0.561, 0.004, 0.160, 0.335, 0.083, 0.025, 0.532, 0.476).]

Discovery of Informative Genes
• Roll-up the microarray data over the Measurement Unit dimension.
• Roll-up the data over the Clinical Sample dimension from the patient level to the disease type or normal tissue level.
• Slice the data for a particular disease type and its corresponding normal tissue type.
• t-test on each pair of the selected cells for each gene (p-values are computed and adjusted).
• p-select the genes that have p-values less than a given threshold.
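The five steps above can be strung together in a short, self-contained Python sketch. Everything here is an assumption made for illustration: the gene names, group labels, and records are hypothetical (the `gene_x` numbers reuse the Disease 1 and Disease 2 measurements from the Compare example, relabelled as disease and normal), the unequal-variance t-test is used, the one-tailed p value is obtained by numerically integrating the t density rather than by a table or Excel, and the p-value adjustment mentioned in the slide is omitted.

```python
import math
from statistics import mean, variance


def exp_value(val, pa):
    """User-defined aggregate from the slides: Val for 'P' or 'M' calls, 0 for 'A'."""
    return float(val) if pa in ("P", "M") else 0.0


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def one_tail_p(t, df, steps=2000):
    """Tail area beyond |t|, using Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - area * h / 3


def welch_t(cell_a, cell_b):
    """t statistic and degrees of freedom from two (mean, variance, n) cells."""
    (m1, v1, n1), (m2, v2, n2) = cell_a, cell_b
    se1, se2 = v1 / n1, v2 / n2
    t = (m1 - m2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical patient-level cells: (gene, group) -> list of (Val, PA call).
raw = {
    ("gene_x", "disease"): [(100, "P"), (90, "P"), (105, "P"), (83, "M"), (78, "P")],
    ("gene_x", "normal"): [(70, "P"), (72, "P"), (81, "P"), (74, "P"), (75, "P")],
    ("gene_x", "other"): [(95, "P"), (93, "P"), (79, "P"), (85, "P"), (90, "P")],
    ("gene_y", "disease"): [(50, "P"), (52, "P"), (49, "M"), (51, "P"), (12, "A")],
    ("gene_y", "normal"): [(51, "P"), (50, "P"), (52, "P"), (49, "P"), (50, "P")],
}

# (1) Roll up over the Measurement Unit dimension with the user-defined aggregate.
expression = {key: [exp_value(v, pa) for v, pa in cells] for key, cells in raw.items()}

# (2) Roll up from the patient level to the group level: (mean, variance, n) per cell.
cube = {key: (mean(vals), variance(vals), len(vals)) for key, vals in expression.items()}

# (3) Slice one disease type and its corresponding normal tissue type.
sliced = {key: cell for key, cell in cube.items() if key[1] in ("disease", "normal")}

# (4) t-test per gene on the pair of sliced cells.
p_values = {}
for gene in sorted({gene for gene, _ in sliced}):
    t, df = welch_t(sliced[(gene, "disease")], sliced[(gene, "normal")])
    p_values[gene] = one_tail_p(t, df)

# (5) Select the genes whose p value falls below the threshold.
selected = [gene for gene, p in p_values.items() if p < 0.05]
print(selected)
```

Storing (mean, variance, n) in each rolled-up cell is what lets the comparison run on the aggregated cube without going back to the patient-level data.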