* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Microarrays
History of genetic engineering wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Gene nomenclature wikipedia , lookup
Minimal genome wikipedia , lookup
Public health genomics wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Pathogenomics wikipedia , lookup
Gene desert wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome evolution wikipedia , lookup
Genome (book) wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Genomic imprinting wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Metagenomics wikipedia , lookup
Mir-92 microRNA precursor family wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Microevolution wikipedia , lookup
Ridge (biology) wikipedia , lookup
Designer baby wikipedia , lookup
Gene expression programming wikipedia , lookup
Molecular Inversion Probe wikipedia , lookup
Pabio590B – week 1 Microarrays Overview Design & hybridization Data analysis Overview Affix/synthesize probes of known sequence to chip Hybridize with labeled sample Quantify level of hybridization to each probe Normalization Statistics Clustering & more Experiments you might do Measure RNA expression Changes in gene expression over time / lifecycle Compare differences between tissues/cell types Comparisons between species/strains/conditions Whole genome transcript mapping (tiling arrays) Measure DNA content Presence or absence of region Copy number via Comparative Genomic Hybridization SNP Genotyping/Re-sequencing Other ChIP on chip arrays RIP on chip Microarray Design Affix/synthesize probes of known sequence to chip Hybridize with labeled sample Quantify level of hybridization to each probe Normalization Statistics Clustering & more RNA Expression Chip Designs Expression Array: - N number of probes per gene of interest - Trade-off between accuracy and number of features Tiling array: - Place probe of X nt every Y bases - Biased vs unbiased 20 nt 50 nt 50 nt 70 nt tiling window Probe considerations Number of probes per region of interest Specificity of probes Distance between probes (tiling) Mismatch probes (Affymetrix) Hybridization Affix/synthesize probes of known sequence to chip Hybridize with labeled sample Quantify level of hybridization to each probe Normalization Statistics Clustering & more Two-color vs One-color Two-color • Two samples one each slide • cy3 - green - 532nm • cy5 - red - 635nm One-color • One sample per slide • cy3 No significant difference in accuracy or reproducibility Designs for Two-color Array Experiment Replicates cy3 WT WT WT cy5 Mu Mu Mu Dye Swaps cy3 WT Mu WT Mu cy5 Mu WT Mu WT Biological Replicates cy3 WT1 WT2 WT3 cy5 Mu1 Mu2 Mu3 Common Reference cy3 ref ref ref ref ref ref cy5 A B C D E F Round Robin cy3 A B C D E F cy5 B C D E F A Data Normalization Affix/synthesize probes of known sequence to chip Hybridize with labeled sample Quantify level of hybridization to each probe Normalization Statistics Clustering & more Within-Array Normalization Cy3/Cy5 Lowess Normalization Signal intensity Before After Between-Array Normalization RNA Spike-in Random Probes Median Scaling Quantile Scaling Median and quantile normalization are predicated upon the arrays in question having the same distribution. That is to say, if you can safely assume that the bulk of genes have the same expression across the arrays, only then you can use those methods. Quantile Normalization Before After Statistical Analysis Affix/synthesize probes of known sequence to chip Hybridize with labeled sample Quantify level of hybridization to each probe Normalization Statistics Clustering & more Some Advice About Statistics Don’t get too hung up on p-values [or any other stat]. Ultimately what matters is biological relevance and external knowledge and other heterogeneous measures (related functions, pathways, other data types) that are not easily measured by statistics alone. P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. Statistical significance is not necessarily the same as biological relevance and vice-versa. John Quackenbush Probe Signal Is this gene differentially expressed between the two conditions? Sample A Sample B To rephrase the question Is the mean probe value different between Samples A & B • Null Hypothesis = H0 = means are the same • Alternate Hypothesis = Ha = means are different What affects our ability to test the hypothesis? Difference in means Number of sample points Standard deviations of sample The T-statistic Directly proportional to difference in means Inversely proportional to standard deviation Directly proportional to sample size The T-test calculates how likely the T-statistic is, given the null hypothesis that the means are actually the same. T-statistic and P-values P-values can be determined from theoretical distributions or permutation testing • Theoretical distributions rely on a set of assumptions that array experiments do not necessarily follow • Permutation tests do not rely on any assumptions Permutation Testing Gene A Gene B Permutation 2 Probe Signal Permutation 1 Probe Signal Probe Signal Original Group 1 Group 2 Group 1 Group 2 1) Permute n times by random shuffling 2) Calculate T-statistic for each permutation 3) Calculate probability of original T-statistic Interpreting P-values T-test tests the null hypothesis that sample means are equal Gene X has p-value of 5% from T-test 95% chance it is differentially expressed 5% chance that is NOT differentially expressed = False Positive Rate = 5% T-Test Refinements Equal vs unequal variance of samples Equal vs unequal sample size Dependant vs independent samples CAVEAT: As sample sizes get smaller, the validity of p-values calculated via permutation diminishes. Microarrays typically have few probes per gene, so sample size is smallish. Multiple Testing Problem If there is a 5% chance of false positives in one experiment, what happens when we are testing 10,000 genes. • The majority of those genes are not differentially expressed, but • a 5% p-value means we will have 500 falsepositives. Family-Wise Error Rate (FWER) FWER is the probability of making one or more false discoveries (type I errors) among all the hypotheses when performing multiple pair-wise tests. One comparison: FWER = p-value 10,000 comparisons: FWER ~ 1.0 That means that when making 10,000 comparisons you are sure to make at least one error. Bonferroni Correction What if you want to keep the FWER at 5% • 0.05 / 10,000 = 0.000005 = 5e-6 • Only those genes with T-test p-value of < 5xe-6 are called differentially expressed • Leads to experiment-wide of 0.05 The Standard Bonferroni correction is considered very conservative Adjusted Bonferroni Rank all genes by ascending order of p-value Assign gene with smallest p-value a corrected p-value of / N (0.5/10,000) Assign gene with second smallest p-value a corrected p-value of / N-1 Etc… The Adjusted Bonferroni correction is less conservative False Discovery Rate Measures the likely number of false positives amongst “discovered” genes Factors affecting FDR: • • • • Proportion of actual differentially expressed genes Distribution of the true differences Measurement variability Sample size Analysis of Variance (ANOVA) Microarray testing across ≥ 3 conditions Is a gene expressed equally across all conditions? F-ratio for given gene X: (variability within conditions) / (variability across conditions) Calculate p-value • Look up probability of F-ratio • Determine probability by permutation testing Significance Analysis of Microarrays (SAM) Gene-specific T-tests Computes statistic (dj) for each gene j • measures the relationship between gene expression and a response variable • describes and groups the data based on experimental conditions • uses non-parametric statistics • repeated permutations are used to determine FDR Accounts for correlations in genes and avoids parametric assumptions about the (normal vs non-normal) distribution of individual genes Clustering Affix/synthesize probes of known sequence to chip Hybridize with labeled sample Quantify level of hybridization to each probe Normalization Statistics Clustering & more Why do clustering? Identify groups of possibly co-regulated genes (e.g. so you can look for common sequence motifs) Identify typical temporal or spatial gene expression patterns (e.g. cell-cycle data) Arrange a set of genes in a linear order that is at least not totally meaningless Can also cluster experiments Quality control • detect bad/outlying experiments Identify or categorize classes of biological samples • sorting by tumor sub-type How you cluster? Define a distance measure Group genes (or experiments) based on that measure Objects are placed into groups. Objects within a group are more similar to each other than objects across groups. In some cases groups are hierarchically organized based on the intra-group similarity Distance Metrics Correlation Euclidean Correlation (X,Y) = 1 Distance (X,Y) = 4 Correlation (X,Z) = -1 Distance (X,Z) = 2.83 Correlation (X,W) = 1 Distance (X,W) = 1.41 Clustering considerations Correlation clustering • Direction only • ≥ 3 conditions Euclidean clustering • Magnitude & direction • ≥ 2 conditions Array data is noisy, so you probably need multiple data points per condition Clustering methods • Hierarchical • Partitional • Other Hierarchical clustering Agglomerative, bottom-up method Initial state - each item is a cluster Iterate - join two most similar cluster Stop - when number of clusters reaches user-defined value Linkage methods Ways to determine cluster similarity Single Link: Similarity of two most similar members Complete Link: Similarity of two most similar members Average Link: Average similarity of all members Comparing linkage methods Single Complete Average Partitional (K-means) clustering Divisive, top-down method Partition data into K random clusters Assign each point to nearest cluster Calculate centroid of each cluster GOTO step 2 Other methods Support Vector Machines (SVM) K-nearest Neighbor (KNN) Self Organizing Maps (SOM) Self Organizing Tree Algorithm (SOTA) Cluster Affinity Search Technique (CAST) QT Cluster (QTC) Discriminant Analysis Classifier (DAM) Principal Component Analysis (PCA) Etc. Warnings and Limitations Clusters are like statistics Ideally they mirror reality, but they should only be taken seriously in conjunction with confirmatory data from other sources. Clustering software clusters things If you tell it to find 4 clusters, it will find 4 clusters in anything! Garbage In, Garbage Out Clustering typically relies on a set of input parameters that can be hard to evaluate except for empirically evaluating the outputs for a given set of input parameters. Clusters Interpretation - EASE (Expression Analysis Systematic Explorer) Population Size: 40 genes Cluster size: 12 genes 10 genes, shown in green, have a common biological theme and 8 occur within the cluster Microarray Analysis Software TIGR MEV Limma SAM EDGE • • These software packages are free and open-source Each has different strengths/weaknesses and makes different assumptions about your data $$ Analysis Platforms Gene Sifter Rosetta Resolver Bio Discovery Microarray Data Sources Gene Expression Omnibus (NCBI) ArrayExpress (EBI) Stanford Microarray Database Yale Microarray Database Microarray Data Standards Microarray Gene Expression Data Society (MGED) • MIAME • MAGE - OM • MAGE ML RNA Abundance Database (RAD) • Integrating data from various types of expression experiments