LSM3241: Bioinformatics and Biocomputing
Lecture 8: Gene Expression Profiles and Microarray Data Analysis
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS

Biology and Cells
• All living organisms consist of cells (trillions of cells in a human; yeast has one cell).
• Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg).
• Each* cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.

Gene Expression
• Cells are different because of differential gene expression.
• About 40% of human genes are expressed at any one time.
• A gene is expressed by transcribing DNA into single-stranded mRNA.
• The mRNA is later translated into a protein.
• Microarrays measure the level of mRNA expression.

Overview of Molecular Biology
[Figure: cell, nucleus, chromosome, gene (DNA), gene (mRNA, single strand), cDNA, protein]

Gene Expression
• Genes control cell behavior by controlling which proteins are made by a cell.
• Housekeeping genes vs. cell/tissue-specific genes
• Regulation:
  – Transcriptional (promoters and enhancers)
  – Post-transcriptional (RNA splicing, stability, localization, small non-coding RNAs)
  – Translational (3' UTR repressors, poly-A tail)
  – Post-translational (protein modification: carbohydrates, lipids, phosphorylation, hydroxylation, methylation, precursor protein)

Gene Expression Measurement
• mRNA expression represents dynamic aspects of the cell.
• mRNA expression can be measured with the latest technology.
• mRNA is isolated and labeled with a fluorescent dye.
• The labeled mRNA is hybridized to the target; the level of hybridization corresponds to the light emission, which is measured with a laser scanner.

Traditional Methods
• Northern blotting
  – Single RNA isolated
  – Probed with labeled cDNA
• RT-PCR
  – Primers amplify specific cDNA transcripts

Microarray Technology
• Microarray:
  – New technology (first paper: 1995)
    • Allows the study of thousands of genes at the same time
  – Glass slide of DNA molecules
    • Molecule: string of bases (25 bp – 500 bp)
    • Uniquely identifies the gene or unit to be studied

Gene Expression Microarrays
The main types of gene expression microarrays:
• Short oligonucleotide arrays (Affymetrix)
• cDNA or spotted arrays (Brown/Botstein)
• Long oligonucleotide arrays (Agilent Inkjet)
• Fiber-optic arrays
• ...

Fabrication of Microarrays
• Size of a microscope slide
Images: http://www.affymetrix.com/

Differing Conditions
• Ultimate goal:
  – Understand the expression level of genes under different conditions
• Helps to:
  – Determine the genes involved in a disease
  – Identify pathways to a disease
  – Serve as a screening tool

Gene Conditions
• Cell types (brain vs. liver)
• Developmental (fetal vs. adult)
• Response to stimulus
• Gene activity (wild type vs. mutant)
• Disease states (healthy vs. diseased)

Expressed Genes
• Genes under a given condition
  – mRNA is extracted from cells
  – mRNA is labeled
  – Labeled mRNA is the mRNA present in a given condition
  – Labeled mRNA will hybridize (base pair) with the corresponding sequence on the slide

Two Different Types of Microarrays
• Custom spotted arrays (up to 20,000 sequences)
  – cDNA
  – Oligonucleotide
• High-density (up to 100,000 sequences) synthetic oligonucleotide arrays
  – Affymetrix (25 bases)
  [Figure: Affymetrix layout]

Custom Arrays
• Mostly cDNA arrays
• 2-dye (2-channel)
  – RNA from two sources (cDNA created)
    • Source 1: labeled with red dye
    • Source 2: labeled with green dye

Two Channel Microarrays
• Microarrays measure gene expression
• Two different samples:
  – Control (green label)
  – Sample (red label)
• Both are washed over the microarray
  – Hybridization occurs
  – Each spot is one of 4 colors

Microarray Technology
[Figure]

Microarray Image Analysis
• Microarrays detect gene interactions: 4 colors:
  – Green: high control
  – Red: high sample
  – Yellow: equal
  – Black: none
• The problem is to quantify the image signals

Single Color Microarrays
• Prefabricated
  – Affymetrix (25mers)
• Custom
  – cDNA (500 bases or so)
  – Spotted oligos (70-80 bases)

Microarray Animations
• Davidson College: http://www.bio.davidson.edu/courses/genomics/chip/chip.html
• Imagecyte: http://www.imagecyte.com/array2.html

Basic Idea of Microarray
• Construction
  – Place an array of probes on a microchip
    • A probe is (for example) an oligonucleotide ~25 bases long that characterizes a gene or genome
    • Each probe has many, many clones
    • The chip is about 2 cm by 2 cm
• Application principle
  – Put a (liquid) sample containing genes on the microarray, allow probe and gene sequences to hybridize, and wash away the rest
  – Analyze the hybridization pattern

Microarray Analysis
• Operation principle: samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization)
• A microarray may have 60K probes

Microarray Processing Sequence
[Figure]
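The four-color reading of a two-channel spot (green, red, yellow, black) can be sketched in a few lines of Python. This is only an illustration: the function name `spot_color`, the intensity scale, and both thresholds (`background`, `fold`) are assumptions made for the example, not values from the lecture, and real image-analysis software quantifies spots far more carefully.

```python
def spot_color(red, green, background=100.0, fold=2.0):
    """Classify one two-channel spot by its dominant dye.

    red, green : spot intensities (sample = red label, control = green label)
    background : hypothetical cutoff below which a channel counts as "no signal"
    fold       : hypothetical ratio above which one channel counts as dominant
    """
    if red < background and green < background:
        return "black"    # neither sample hybridized to this spot
    if red >= fold * green:
        return "red"      # expression is higher in the sample
    if green >= fold * red:
        return "green"    # expression is higher in the control
    return "yellow"       # roughly equal expression in both


# Made-up (red, green) intensities for four hypothetical spots.
spots = {
    "geneA": (5000, 400),
    "geneB": (300, 4200),
    "geneC": (2100, 1900),
    "geneD": (40, 55),
}
colors = {name: spot_color(r, g) for name, (r, g) in spots.items()}
print(colors)
```

Hard thresholds like these throw away most of the signal, which is why the analysis in practice works with a continuous quantity per spot rather than a color label.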

Gene Expression Data
Gene expression data on p genes for n samples (rows: genes; columns: mRNA samples)

Genes   sample1  sample2  sample3  sample4  sample5  …
1         0.46     0.30     0.80     1.51     0.90   ...
2        -0.10     0.49     0.24     0.06     0.46   ...
3         0.15     0.74     0.04     0.10     0.20   ...
4        -0.45    -1.03    -0.79    -0.56    -0.32   ...
5        -0.06     1.06     1.35     1.09    -1.09   ...

Gene expression level of gene i in mRNA sample j =
• Log(Red intensity / Green intensity) for two-color arrays
• Log(Avg. PM - Avg. MM) for Affymetrix arrays (PM: perfect match probes, MM: mismatch probes)

Some Possible Applications
• Sample from a specific organ to show which genes are expressed and responsible for a functionality
• Compare samples from healthy and sick hosts to find gene-disease connections
• Analyze samples to differentiate sick and healthy, disease subtypes, drug response groups
• Probe samples, including human pathogens, for disease detection

Huge Amount of Data from a Single Microarray
• If just two colors, the amount of data on an array with N probes is 2N
• Cannot analyze pixel by pixel
• Analyze by pattern – cluster analysis

Major Data Mining Techniques
• Link analysis
  – Associations discovery
  – Sequential pattern discovery
  – Similar time series discovery
• Predictive modeling
  – Classification (assigns genes into known classes)
  – Clustering (groups genes into unknown clusters)

Supervised vs. Unsupervised Learning
• Supervised: there is a teacher, class labels are known
  – Support vector machines
  – Backpropagation neural networks
• Unsupervised: no teacher, class labels are unknown
  – Clustering
  – Self-organizing maps

Cluster Analysis: Grouping Similarly Expressed Genes, Cell Samples, or Both
• Strengthens signal when averages are taken within clusters of genes (Eisen)
• Useful (essential?) when seeking new subclasses of cells, diseases, drug responses etc.
• Leads to readily interpreted figures

Some Clustering Methods and Software
• Partitioning: K-Means, K-Medoids, PAM, CLARA …
• Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK
• Density-based: CAST, DBSCAN, OPTICS, CLIQUE …
• Grid-based: STING, CLIQUE, WaveCluster …
• Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass …
• Two-way clustering
• Block clustering

Partitioning
[Figure]

Density-based Clustering
[Figure]

Hierarchical (used most often)
[Figure: agglomerative clustering merges the items a, b, c, d, e step by step (a,b; d,e; c,d,e; a,b,c,d,e) over steps 0 to 4; divisive clustering runs the same steps in reverse, 4 to 0]

Expression Vectors
Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
• Numeric vector: -0.8 1.5 1.8 0.5 -0.4 -1.3 0.8 1.5
• Line graph [Figure: the same values plotted against conditions 1 to 8, vertical axis from -2 to 2]
• Heat map [Figure: the same values as colors on a scale from -2 to 2]

Expression Vectors as Points in 'Expression Space'

      t1    t2    t3
G1   -0.8  -0.3  -0.7
G2   -0.4  -0.8  -0.7
G3   -0.6  -0.8  -0.4
G4    0.9   1.2   1.3
G5    1.3   0.9  -0.6

[Figure: the genes plotted as points on the axes Experiment 1, Experiment 2, Experiment 3; genes with similar expression lie close together]

Cluster Analysis
• Group a collection of objects into subsets or "clusters" such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.

How can we do this?
• What is closely related?
  – Distance or similarity metric
• What is close?
  – Clustering algorithm
• How do we minimize distance between objects in a group while maximizing distances between groups?
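One way to make "closely related" concrete is to compute pairwise distances between expression vectors. The sketch below is a minimal standard-library illustration, not the lecture's own code: it reuses the G1 to G4 time-series values that appear in this lecture's distance examples, and the function names (`euclidean`, `manhattan`, `correlation_distance`, `distance_matrix`) are chosen here for readability.

```python
import math

# Four time-series profiles (the G1-G4 values used in this lecture's examples).
profiles = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation_distance(x, y):
    """1 - Pearson correlation, so the result lies in [0, 2]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def distance_matrix(metric):
    """Full symmetric matrix of pairwise distances under the given metric."""
    names = list(profiles)
    return [[metric(profiles[a], profiles[b]) for b in names] for a in names]


for metric in (euclidean, manhattan, correlation_distance):
    print(metric.__name__)
    for row in distance_matrix(metric):
        print("  ", "  ".join(f"{d:5.2f}" for d in row))
```

The three metrics can rank the same pairs differently: G1 and G2 are far apart in Euclidean and Manhattan terms, yet their correlation distance is close to 1 because their profiles are nearly uncorrelated rather than opposed.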

Distance Metrics
[Figure: two points, (3.5, 4) and (5.5, 6), plotted on the axes Gene Expression 1 and Gene Expression 2]
• Euclidean distance measures average distance
• Manhattan (city block) measures average in each dimension
• Correlation measures difference with respect to linear trends

Clustering Time Series Data
• Measure gene expression on consecutive days
• Gene measurement matrix
  – G1 = [1.2 4.0 5.0 1.0]
  – G2 = [2.0 2.5 5.5 6.0]
  – G3 = [4.5 3.0 2.5 1.0]
  – G4 = [3.5 1.5 1.2 1.5]

Euclidean Distance

      G1    G2    G3    G4
G1    0     5.3   4.3   5.1
G2    5.3   0     6.4   6.5
G3    4.3   6.4   0     2.3
G4    5.1   6.5   2.3   0

• Distance is the square root of the sum of the squared differences between coordinates
  $d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}$
  $d_{12} = \sqrt{(1.2 - 2)^2 + (4 - 2.5)^2 + (5 - 5.5)^2 + (1 - 6)^2} \approx 5.3$

City Block or Manhattan Distance
• G1 = [1.2 4.0 5.0 1.0]
• G2 = [2.0 2.5 5.5 6.0]
• G3 = [4.5 3.0 2.5 1.0]
• G4 = [3.5 1.5 1.2 1.5]

      G1    G2    G3    G4
G1    0     7.8   6.8   9.1
G2    7.8   0     11    11.3
G3    6.8   11    0     4.3
G4    9.1   11.3  4.3   0

• Distance is the sum of the absolute differences between coordinates
  $d_{ij} = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$
  $d_{12} = |1.2 - 2| + |4 - 2.5| + |5 - 5.5| + |1 - 6| = 7.8$

Correlation Distance
• Pearson correlation measures the degree of linear relationship between variables, range [-1, 1]
• Distance is 1 - (Pearson correlation), range [0, 2]

      G1    G2    G3    G4
G1    0     .91   .98   1.6
G2    .91   0     1.9   1.7
G3    .98   1.9   0     .22
G4    1.6   1.7   .22   0

  $d_{ij} = 1 - \frac{\sum_{n=1}^{N} x_{in} x_{jn} - \frac{1}{N} \sum_{n=1}^{N} x_{in} \sum_{n=1}^{N} x_{jn}}{\sqrt{\left(\sum_{n=1}^{N} x_{in}^2 - \frac{1}{N}\left(\sum_{n=1}^{N} x_{in}\right)^2\right)\left(\sum_{n=1}^{N} x_{jn}^2 - \frac{1}{N}\left(\sum_{n=1}^{N} x_{jn}\right)^2\right)}}$

Similarity Measurements
• Pearson correlation
  Two profiles (vectors) $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$:
  $C_{pearson}(x, y) = \frac{\sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)}{\sqrt{\left[\sum_{i=1}^{N} (x_i - m_x)^2\right]\left[\sum_{i=1}^{N} (y_i - m_y)^2\right]}}$
  $m_x = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad m_y = \frac{1}{N} \sum_{n=1}^{N} y_n$
  $-1 \le C_{pearson}(x, y) \le +1$

Hierarchical Clustering
• IDEA: iteratively combines genes into groups based on similar patterns of observed expression
• By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
• Display the data as a heat map and dendrogram
• Cluster genes, samples or both

Hierarchical Clustering
[Figure: dendrogram; Venn diagram of clustered data]

Hierarchical Clustering
• Merging (agglomerative): start with every measurement as a separate cluster, then combine
• Splitting (divisive): make one large cluster, then split it up into smaller pieces
• What is the distance between two clusters?

Distance Between Clusters
• Single link: distance is the shortest distance from any member of one cluster to any member of the other cluster
• Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster
• Average: distance between the average of all points in each cluster
• Ward: merges the pair of clusters that gives the smallest increase in the within-cluster sum of squares

Hierarchical Clustering – Merging
• Euclidean distance
• Average linking
[Figure: dendrogram of the gene expression time series; vertical axis: distance between clusters when combined]

Manhattan Distance
• Average linking
[Figure: dendrogram of the gene expression time series; vertical axis: distance between clusters when combined]

Correlation Distance
[Figure]

Data Standardization
• Data points are normalized with respect to mean and variance, "sphering" the data:
  $\hat{x} = (x - \hat{\mu}) / \hat{\sigma}$
• After sphering, Euclidean and correlation distance are equivalent
• Standardization makes sense if you are not interested in the size of the effects, but in the effect itself
• Results are misleading for noisy data

Hierarchical Clustering: Single Linkage Example
Initial data items A, B, C, D and their distance matrix:

Dist   A    B    C    D
A           20   7    2
B                10   25
C                     3
D

Step 1: the smallest distance is 2 (A, D), so A and D are merged. Current clusters: AD, B, C.

Dist   AD   B    C
AD          20   3
B                10
C

Step 2: the smallest distance is 3 (AD, C), so AD and C are merged. Current clusters: ADC, B.

Dist   ADC  B
ADC         10
B

Step 3: ADC and B are merged at distance 10.
Final result: one cluster containing all items (dendrogram leaf order: A, D, C, B).

Hierarchical Clustering
[Figure sequence: Gene 1 to Gene 8 are merged step by step into a dendrogram; final leaf order: Gene 7, Gene 1, Gene 2, Gene 4, Gene 5, Gene 3, Gene 8, Gene 6]
[Figure: clustered heat map with a color scale labeled H to L]
[Figure: heat map with both genes and samples clustered]

The Leaf Ordering Problem:
• Find the 'optimal' layout of branches for a given dendrogram architecture
• $2^{N-1}$ possible orderings of the branches
• For a small microarray dataset of 500 genes, there are about $1.6 \times 10^{150}$ branch configurations

Hierarchical Clustering
• Pros:
  – Commonly used algorithm
  – Simple and quick to calculate
• Cons:
  – Real genes probably do not have a hierarchical organization

Using Hierarchical Clustering
1. Choose what samples and genes to use in your analysis
2. Choose similarity/distance metric
3. Choose clustering direction
4. Choose linkage method
5. Calculate the dendrogram
6. Choose height/number of clusters for interpretation
7. Assess results
8. Interpret cluster structure

Limitations
• Cluster analyses:
  – Usually outside the normal framework of statistical inference
  – Less appropriate when only a few genes are likely to change
  – Need lots of experiments
• Single gene tests:
  – May be too noisy in general to show much
  – May not reveal coordinated effects of positively correlated genes
  – Hard to relate to pathways

Useful Links
• Affymetrix: www.affymetrix.com
• Michael Eisen Lab at LBL (hierarchical clustering software "Cluster" and "Tree View" (Windows)): rana.lbl.gov/
• Review of Currently Available Microarray Software: www.the-scientist.com/yr2001/apr/profile1_010430.html
• ArrayExpress at the EBI: http://www.ebi.ac.uk/arrayexpress/
• Stanford MicroArray Database: http://genome-www5.stanford.edu/
• Yale Microarray Database: http://info.med.yale.edu/microarray/
• Microarray DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html
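The single-linkage merging walked through in this lecture (items A, B, C, D with pairwise distances 20, 7, 2, 10, 25, 3) can be reproduced with a short standard-library sketch. The function name `single_linkage` and the `frozenset`-keyed distance table are illustrative choices, not part of the lecture; this is a naive version meant for tiny inputs, whereas packages such as Eisen's Cluster handle real datasets.

```python
def single_linkage(dist):
    """Agglomerative clustering with single linkage.

    dist maps frozenset({a, b}) -> distance for every pair of items.
    Returns the merges in order as (cluster_1, cluster_2, merge_distance).
    """
    items = sorted({x for pair in dist for x in pair})
    clusters = [frozenset([x]) for x in items]  # every item starts alone
    merges = []

    def linkage(c1, c2):
        # Single link: shortest distance between any two members.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Pairwise distances from the lecture's A-D example.
pairwise = {
    frozenset("AB"): 20, frozenset("AC"): 7, frozenset("AD"): 2,
    frozenset("BC"): 10, frozenset("BD"): 25, frozenset("CD"): 3,
}
merges = single_linkage(pairwise)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Swapping `min` for `max` inside `linkage` would give complete linkage instead; the merge distances recorded in `merges` are the heights at which a dendrogram would join the branches.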