Microarray analysis
Quantitation of Gene Expression
Expression Data to Networks
Reading: Ch 16
BIO520 Bioinformatics
Jim Lund

Microarray data
• Image quantitation
• Normalization
• Find genes with significant expression differences
• Annotation
• Clustering, pattern analysis, network analysis

Sources of Non-Biological Variation
• Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation
• Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment ("channel" refers to a combination of a dye and a slide)
• Variation across replicate slides
• Variation across hybridization conditions
• Variation in scanning conditions
• Variation among technicians doing the lab work

Factors which impact on the signal level
• Amount of mRNA
• Labeling efficiencies
• Quality of the RNA
• Laser/dye combination
• Detection efficiency of photomultiplier or CCD

[Figure: panels labeled HeLa and HepG2]

M vs. A Plot
M = log(Red) - log(Green)
A = (log(Green) + log(Red)) / 2

[Figure: M vs. A plots of chip pairs, before normalization]
[Figure: M vs. A plots of chip pairs, after quantile normalization]

Types of normalization
• To total signal (linear normalization)
• LOESS (LOcally WEighted polynomial regreSSion)
• To “housekeeping genes”
• To genomic DNA spots (Research Genetics) or mixed cDNAs
• To internal spikes

Microarray analysis
• Data exploration: expression of gene X?
• Statistical analysis: which genes show large, reproducible changes?
• Clustering: grouping genes by expression pattern.
• Knowledge-based analysis: are amine synthesis genes involved in this experiment?

Fold change: the crudest method of finding differentially expressed genes
[Figure: HeLa vs. HepG2, with genes showing >2-fold expression change marked in both directions]

What do we mean by differentially expressed?
• Statistically, our gene is different from the other genes.
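The M and A definitions, total-signal (linear) normalization, and the 2-fold change cutoff from the preceding slides can be sketched as follows. This is only an illustration, not the course's own code: the slides do not state a log base, so base 2 is assumed here, and the function names and spot intensities are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired red/green intensities, using log base 2 (an assumption)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(g) + math.log2(r)) / 2 for r, g in zip(red, green)]
    return m, a

def normalize_to_total_signal(red, green):
    """Linear normalization: rescale the green channel so both channels have the same total signal."""
    scale = sum(red) / sum(green)
    return list(red), [g * scale for g in green]

def fold_changed(m_values, threshold=1.0):
    """Indices of spots with |M| above the threshold (1.0 in log2 units is a 2-fold change)."""
    return [i for i, m in enumerate(m_values) if abs(m) > threshold]

# Hypothetical intensities for four spots (not data from the slides).
red = [800.0, 200.0, 400.0, 100.0]
green = [100.0, 100.0, 200.0, 200.0]

red_n, green_n = normalize_to_total_signal(red, green)
m_norm, a_norm = ma_values(red_n, green_n)
changed = fold_changed(m_norm)
print("M after normalization:", [round(m, 2) for m in m_norm])
print("Spots with >2-fold change:", changed)
```

In this toy data the red channel is brighter overall, so a fold-change call made before normalization would be skewed toward "up" genes. Total-signal normalization only removes a global scale difference; intensity-dependent dye bias is what LOESS normalization on the M vs. A plot addresses.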
[Figure: distribution of measurements for the gene of interest compared with the distribution of average ratios for all genes; x-axis: log ratio; y-axes: probability of a given value of the ratio, number of genes]

Finding differentially expressed genes
What affects our certainty that a gene is up- or down-regulated?
• Number of sample points
• Difference in means
• Standard deviations of sample
[Figure: probe signal for Sample A and Sample B]

Practical views on statistics
• With appropriate biological replicates, it is possible to select statistically meaningful genes/patterns.
• Sensitivity and selectivity are inversely related, e.g. increased selection of true positives WILL result in more false positives and fewer false negatives.
• False negatives are lost opportunities; false positives cost money and waste time.
• Even when treated with conservative statistics, a typical set of experiments yields more genes/pathways/patterns than one can sensibly follow, so use conservative statistics to protect against false positives when designing follow-on experiments.

Statistical Tests
• Student’s t-test
  – Correct for multiple testing! (Holm-Bonferroni)
• False discovery rate
• Significance Analysis of Microarrays (SAM)
  – http://www-stat.stanford.edu/~tibs/SAM/
• ANOVA
• Principal components analysis
• Special methods for periodic patterns in data

Volcano plot: log(expr) vs p-value
[Figure: volcano plot; axes: p-value and log(fold change)]
[Figure: scatter plot showing genes with significant p-values]

Pattern finding
• In many cases, the patterns of differential expression are the target (as opposed to specific genes)
  – Clustering or other approaches for pattern identification: find genes which behave similarly across all experiments, or experiments which behave similarly across all genes
  – Classification: identify genes which best distinguish 2 or more classes
• The statistical reliability of the pattern or classifier is still an issue and similar considerations apply, e.g. cluster analysis of random noise will still produce clusters, and they will be meaningless.

What is clustering?
• Group similar objects together.
  – Genes with similar expression patterns.
• Objects in the same cluster (group) are more similar to each other than objects in different clusters.

Clustering
• What is clustering?
• Similarity/distance metrics
• Hierarchical clustering algorithms
  – Made popular by Stanford, i.e. [Eisen et al. 1998]
• K-means
  – Made popular by many groups, e.g. [Tavazoie et al. 1999]
• Self-organizing map (SOM)
  – Made popular by Whitehead, i.e. [Tamayo et al. 1999]

Typical Tools
• SAM (Significance Analysis of Microarrays), Stanford
• GeneSpring
• Affymetrix GeneChip Operating System (GCOS)
• Cluster/Treeview
• R statistics package microarray analysis libraries

How to define similarity?
[Figure: raw matrix (n genes by p experiments, with genes X and Y marked) converted to an n by n gene-by-gene similarity matrix]
• Similarity metric:
  – A measure of pairwise similarity or dissimilarity
  – Examples:
    • Correlation coefficient
    • Euclidean distance

Similarity metrics
• Euclidean distance (Euclidean clustering = magnitude & direction)

$$d(X, Y) = \sqrt{\sum_{j=1}^{p} (X[j] - Y[j])^2}$$

• Correlation coefficient (correlation clustering = direction)

$$r(X, Y) = \frac{\sum_{j=1}^{p} (X[j] - \bar{X})(Y[j] - \bar{Y})}{\sqrt{\sum_{j=1}^{p} (X[j] - \bar{X})^2 \, \sum_{j=1}^{p} (Y[j] - \bar{Y})^2}}, \quad \text{where } \bar{X} = \frac{1}{p} \sum_{j=1}^{p} X[j]$$

Sporulation example
[Figure: sporulation example]

Self-organizing maps (SOM) [Kohonen 1995]
• Basic idea:
  – Map high-dimensional data onto a 2D grid of nodes
  – Neighboring nodes are more similar than points far away

Self-organizing maps (SOM)
[Figure: SOM clusters]

Things learned from microarray gene expression experiments
• Pathways not known to be involved
  – Ontology?
• Novel genes involved in a known pathway
• “Like” and “unlike” tissues

Transcription Factors
Regulatory Networks
• Identify co-regulated genes
• Search for common motifs (transcription factor binding sites)
  – Evaluate known motifs/factors
  – Search for new ones
• Programs: MEME, etc.

mRNA-protein Correlation
• YPD: should have relevant data
  – Will yeast be typical?
• Electrophoresis 18:533
  – 23 proteins on 2D gels
  – r = 0.48 for mRNA vs. protein
• Post-transcriptional and post-translational regulation important!

Other microarray formats
• Single nucleotide polymorphism (SNP) chips
  – Oligos with each of 4 nt at each SNP
• Chromatin IP chips (ChIP:chip)
  – Determine transcription factor binding sites
  – Promoter DNA on the chip
• Alternative splicing chips
  – Long oligos, covering alternatively spliced exons, or all exons
• Genome tiling chips

ChIP:chip: Identification of Transcription Factor Binding Sites
• Cross-link transcription factors to DNA with formaldehyde.
• Pull out the transcription factor of interest via immunoprecipitation with an antibody, or by tagging the factor of interest with an isolatable epitope (e.g. GST fusion).
• Fractionate the DNA associated with the transcription factor, reverse the cross-links, label, and hybridize to an array of promoter DNA.
• Brown et al. (2001) Nature 409:533-8

ChIP:chip Analysis of TF Binding Sites
[Figure: ChIP:chip analysis of TF binding sites]

On to Proteomics
DNA → RNA → Protein
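Supplementary sketch for the "Statistical Tests" slide: the Holm-Bonferroni step-down correction applied to a set of per-gene p-values. This is a minimal illustration under stated assumptions, not the procedure used by SAM or any particular package; the function name, the alpha level, and the p-values are hypothetical, and the p-values are assumed to come from some earlier per-gene test such as a t-test.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    P-values are visited from smallest to largest. The k-th smallest
    (k = 0, 1, ...) is compared with alpha / (m - k); testing stops at the
    first p-value that fails, and all larger p-values are not rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break
    return rejected

# Hypothetical p-values for four genes (not data from the slides).
p_values = [0.01, 0.04, 0.03, 0.005]
uncorrected = [p <= 0.05 for p in p_values]
corrected = holm_bonferroni(p_values)
print("Uncorrected calls:", uncorrected)
print("Holm-Bonferroni calls:", corrected)
```

Without correction all four hypothetical genes pass at 0.05; after the step-down correction only the two smallest p-values survive. On a real array with thousands of genes the thresholds become far stricter, which is why the slide pairs the t-test with a multiple-testing correction.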
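Supplementary sketch for the "Similarity metrics" slide: Euclidean distance and the correlation coefficient computed for expression profiles across p experiments. The function names and the three profiles are hypothetical and chosen only to show why Euclidean distance responds to magnitude and direction while correlation responds to direction alone.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences across experiments."""
    return math.sqrt(sum((xj - yj) ** 2 for xj, yj in zip(x, y)))

def correlation(x, y):
    """Pearson correlation coefficient; undefined (division by zero) for a flat profile."""
    p = len(x)
    x_mean = sum(x) / p
    y_mean = sum(y) / p
    numerator = sum((xj - x_mean) * (yj - y_mean) for xj, yj in zip(x, y))
    denominator = math.sqrt(
        sum((xj - x_mean) ** 2 for xj in x) * sum((yj - y_mean) ** 2 for yj in y)
    )
    return numerator / denominator

# Hypothetical profiles over four experiments (not data from the slides).
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_x, ten times the magnitude
gene_z = [4.0, 3.0, 2.0, 1.0]      # opposite trend, same magnitude range as gene_x

for name, other in (("gene_y", gene_y), ("gene_z", gene_z)):
    print(name,
          "distance:", round(euclidean_distance(gene_x, other), 3),
          "correlation:", round(correlation(gene_x, other), 3))
```

By Euclidean distance gene_x is closer to gene_z, even though their trends are opposite; by correlation gene_x groups with gene_y. Which behavior is wanted depends on whether absolute expression level should matter when clustering.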