Lecture 6

Introduction to transcriptional networks
Microarray experiments
MA plots
Normalization of microarray data
Tests for differential expression of genes
Multiple testing and FDR
The concept of Line Graphs
Introduction to BLAST

Central dogma of molecular biology

Transcriptional networks
By the term transcriptional networks we generally mean gene regulatory networks.
Unlike protein-protein interaction networks, transcriptional networks are directed networks.

Transcriptional networks: Basic mechanism of gene regulation
Most genes are regulated at the transcription level, and it is assumed that 5-10% of protein-coding genes encode regulatory proteins.
Some regulatory proteins play a targeted role, i.e. they take part in the regulation of a few genes.
Some regulatory proteins play a more general role in initiating transcription (for example the eukaryotic transcription factors of type II, or the RNA polymerase itself, which is essential for the transcription of all genes).
Dedicated regulatory proteins are considered to be those that affect up to 5% of the genes of a genome. However, the boundary between generalist and dedicated regulatory proteins is blurred.

Transcriptional networks
Some experiments and methods used to generate data to determine regulatory relations:
1. Complementary DNA microarrays
2. Oligonucleotide chips
3. Reverse transcription polymerase chain reaction
4. Serial analysis of gene expression
5. Chromatin immunoprecipitation
6. Next generation sequencing
7. Bioinformatics, e.g. by way of identifying binding sites

Transcriptional Networks: Case study 1
An extended transcriptional regulatory network of Escherichia coli and analysis of its hierarchical structure and network motifs
Hong-Wu Ma, Bharani Kumar, Uta Ditges, Florian Gunzer, Jan Buer and An-Ping Zeng
Nucleic Acids Research, 2004, Vol. 32, No. 22, 6643–6649

This work combined data sets from 3 different sources:
1. RegulonDB (version 4.0, http://www.cifn.unam.mx/Computational_Genomics/regulondb/)
2. Ecocyc (version 8.0, www.ecocyc.org)
3. Shen-Orr, S.S., Milo, R., Mangan, S. and Alon, U. (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genet., 31, 64–68.

[Figure: Comparison of the TRN of E. coli from three different data sources. (A) Based on number of genes. (B) Based on number of regulatory interactions.]

A combined network that includes all the 2624 interactions from the three data sets has been produced.
In addition, this work extended the network by adding 23 additional genes and around 100 regulatory relationships through a literature survey.
The final TRN altogether includes 1278 genes and 2724 interactions.

This work discovered a hierarchical structure in the TRN. The hierarchical structure was identified in the following way:
(1) genes which do not code for transcription factors (TFs), or which code for a TF that only regulates its own expression (auto-regulatory loop), were assigned to layer 1 (the lowest layer);
(2) all the genes in layer 1 were then removed, and in the remaining network the TFs which do not regulate other genes were identified and the corresponding genes assigned to layer 2;
(3) step 2 was repeated, removing the nodes already assigned to a layer and identifying a new layer, until all the genes were assigned to layers.
As a result, a nine-layer hierarchical structure was uncovered.
[Figure source: BMC Bioinformatics 2004, 5:199, by related authors]

To calculate network motifs in the E. coli TRN, this work removed all the loops in the network (including the autoregulatory loops and the two-gene regulatory loops). The program Mfinder, developed by Kashtan et al., was then used to generate the motif profiles.
The first four types of feed-forward loops (FFLs) are the so-called coherent FFLs, in which the direct effect of the up regulator is consistent with its indirect effect through the mid regulator.
In contrast, the last four types of FFLs are incoherent, because the direct effect of the up regulator contradicts its indirect effect.

[Figure: (A) Gene gadA is regulated by six FFLs. (B) Gene lpd is regulated by five FFLs. (C) Gene slp is regulated by 17 regulators.]

DNA Microarray
[Figure: Typical microarray chip]
• Though most cells in an organism contain the same genes, not all of the genes are used in each cell.
• Some genes are turned on, or "expressed", when needed in particular types of cells.
• Microarray technology allows us to look at many genes at once and determine which are expressed in a particular cell type.

DNA Microarray
• DNA molecules representing many genes are placed in discrete spots on a microscope slide; these are called probes.
• Messenger RNA, the working copy of genes within cells, is purified from cells of a particular type.
• The RNA molecules are then "labeled" by attaching a fluorescent dye that allows us to see them under a microscope, and added to the DNA dots on the microarray.
• Due to a phenomenon termed base-pairing, RNA will stick to the probe corresponding to the gene it came from.

DNA Microarray
Usually a gene is interrogated by 11 to 20 probes, and usually each probe is a 25-mer sequence.
The probes are typically spaced widely along the sequence.
Sometimes probes are chosen closer to the 3' end of the sequence.
A probe that is exactly complementary to the sequence is called a perfect match (PM).
A mismatch (MM) probe differs from the PM probe only at the central position, where it is not complementary.
In theory, MM probes can be used to quantify and remove non-specific hybridization.
Source: PhD thesis by Benjamin Milo Bolstad, 2004, University of California, Berkeley

Sample preparation and hybridization
During the hybridization process, cRNA binds to the array.
Earlier chips had all the probes of a probeset located contiguously on the array. Such a layout may fall prey to spatial defects.
Newer chips have all the probes spread out across the array.
A PM and MM probe pair are always adjacent on the array.

Growth curve of bacteria
• Samples can be taken at different stages of the growth curve.
• One of them is considered the control and the others are considered targets.
• Samples can be taken before and after application of drugs.
• Samples can be taken under different experimental conditions, e.g. starvation of some metabolite.
• What types of samples should be used depends on the target of the experiment at hand.
DNA Microarray
• After washing away all of the unstuck RNA, the microarray can be observed under a microscope and it can be determined which RNA remains stuck to the DNA spots.
• Microarray technology can be used to learn which genes are expressed differently in a target sample compared to a control sample (e.g. diseased versus healthy tissues).
However, background correction and normalization are necessary before making useful decisions or conclusions.

MA plots
MA plots are typically used to compare two color channels, two arrays or two groups of arrays.
The vertical axis is the difference between the logarithms of the signals (the log ratio) and the horizontal axis is the average of the logarithms of the signals.
The M stands for minus and the A stands for add. MA is also a mnemonic for microarray.

$M_i = \log(X_{ij}) - \log(X_{ik}) = \log(X_{ij}/X_{ik})$   (log ratio)
$A_i = [\log(X_{ij}) + \log(X_{ik})]/2$   (average log intensity)

A typical MA plot
From the first plot (one array plotted against the other) we can see differences between the two arrays, but the non-linear trend is not apparent. This is because there are many more points at low intensities than at high intensities.
The MA plot allows us to assess the behavior across all intensities.

Normalization of microarray data
Normalization is the process of removing unwanted non-biological variation that might exist between chips in microarray experiments.
By removing the non-biological variation, normalization makes the biological variation more apparent.
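The M and A values defined in the MA plots section can be computed directly from two intensity vectors. The following is a minimal sketch: the function name `ma_values` is hypothetical, and base-2 logarithms are an assumption (the slides write only "log").

```python
import math

def ma_values(x_j, x_k):
    """Return (M, A) lists for two equally long lists of positive intensities.

    M_i = log2(x_j[i]) - log2(x_k[i])        (log ratio)
    A_i = (log2(x_j[i]) + log2(x_k[i])) / 2  (average log intensity)
    Base 2 is an assumption; the slides do not state the base.
    """
    m = [math.log2(a) - math.log2(b) for a, b in zip(x_j, x_k)]
    a = [(math.log2(a) + math.log2(b)) / 2 for a, b in zip(x_j, x_k)]
    return m, a

# Two toy "arrays" with two genes each (made-up intensities).
m_vals, a_vals = ma_values([8.0, 4.0], [2.0, 4.0])
```

Plotting `m_vals` against `a_vals` with any plotting tool gives the MA plot; a gene with equal intensity on both arrays has M = 0.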
Typical microarray data

          Array 1   Array 2   ...   Array j   ...   Array m
Gene 1    X11       X12       ...   X1j       ...   X1m
Gene 2    X21       X22       ...   X2j       ...   X2m
...
Gene i    Xi1       Xi2       ...   Xij       ...   Xim
...
Gene n    Xn1       Xn2       ...   Xnj       ...   Xnm
Mean      X̄1        X̄2        ...   X̄j        ...   X̄m
SD        σ1        σ2        ...   σj        ...   σm

Normalization within individual arrays
Using the column mean $\bar{X}_j$ and standard deviation $\sigma_j$ of array j from the table above:
Scaling:   $S_{ij} = X_{ij} - \bar{X}_j$
Centering: $C_{ij} = (X_{ij} - \bar{X}_j) / \sigma_j$

[Figure: Effect of scaling and centering normalization. Panels: Original data, Scaling, Centering.]

Normalization between a pair of arrays: Loess (Lowess) Normalization
Lowess normalization is applied separately to each experiment with two dyes.
This method can be used to normalize Cy5 and Cy3 channel intensities (usually one of them is the control and the other is the target) using MA plots.

2-channel data (C and T are the two channel intensities of a gene):

Gene       C        T
Gene i-1   C_{i-1}  T_{i-1}
Gene i     C_i      T_i
Gene i+1   C_{i+1}  T_{i+1}

$M_i = \log(T_i/C_i)$   (log ratio)
$A_i = [\log(T_i) + \log(C_i)]/2$   (average log intensity)

[Figure: MA plot with $A_i$ on the horizontal axis and $M_i$ on the vertical axis. Each point corresponds to a single gene.]

The MA plot shows some bias.
[Figure: the same MA plot with a typical regression line]

Usually several regression lines/polynomials are considered for different sections of the data.
The final result is a smooth curve providing a model for the data.
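A smooth curve of this kind can be illustrated with a small local regression. The sketch below is an assumption-laden toy, not the procedure of any particular package: it fits a tricube-weighted straight line around each point, omits the robustness iterations that full lowess implementations use, and the names `loess_fit`, `loess_normalize` and the parameter `frac` are hypothetical. Normalization subtracts the fitted curve from M.

```python
def loess_fit(a, m, frac=0.5):
    """Fitted M value at each A, from a locally weighted linear regression."""
    n = len(a)
    k = max(2, min(n, round(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalize(a, m, frac=0.5):
    """Subtract the fitted curve so that M is centred on zero at every A."""
    return [mi - fi for mi, fi in zip(m, loess_fit(a, m, frac))]

# Toy data: M depends only on intensity A, i.e. it is pure bias.
a_values = [float(i) for i in range(6, 16)]
m_values = [0.1 * x - 0.5 for x in a_values]
normalized = loess_normalize(a_values, m_values)
```

Because the toy bias is exactly linear in A, the local fit reproduces it and the normalized M values come out at zero up to rounding error; with real data, genuinely differentially expressed genes would remain as points away from zero.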
This model is then used to remove the bias of the data points.

[Figure: Bias reduction by lowess normalization]
[Figure: Unnormalized fold changes, and fold changes after Loess normalization]

Normalization across arrays
Here we discuss the following two normalization procedures, which are applicable to a number of arrays:
1. Quantile normalization
2. Baseline scaling normalization

Normalization across arrays: Quantile normalization
The quantile-quantile plot motivates the quantile normalization algorithm.
The goal of quantile normalization is to give the same empirical distribution to the intensities of each array.
If two data sets have the same distribution, then their quantile-quantile plot is a straight diagonal line with slope 1 and intercept 0.
Equivalently, projecting the data points of the quantile-quantile plot onto the 45-degree line gives the transformation that makes the distributions the same.

[Figure: Quantile normalization algorithm. Source: PhD thesis by Benjamin Milo Bolstad, 2004, University of California, Berkeley]

Quantile normalization: worked example

Original data X:

No.   Exp.1   Exp.2
1     1.6     1.2
2     0.6     2.8
3     1.8     1.8
4     0.8     3.8
5     0.4     0.8

1. Sort each column of X (values) to get X sort.
2. Take the means across rows of X sort.

No.   Exp.1   No.   Exp.2   Mean
5     0.4     5     0.8     0.6 = (0.4+0.8)/2
2     0.6     1     1.2     0.9
4     0.8     3     1.8     1.3
1     1.6     2     2.8     2.2
3     1.8     4     3.8     2.8

3. Assign this mean to each element in the row to get X' sort.

No.   Exp.1   No.   Exp.2
5     0.6     5     0.6
2     0.9     1     0.9
4     1.3     3     1.3
1     2.2     2     2.2
3     2.8     4     2.8

4. Get X normalized by rearranging each column of X' sort to have the same ordering as the original X.

No.   Exp.1   Exp.2
1     2.2     0.9
2     0.9     2.2
3     2.8     1.3
4     1.3     2.8
5     0.6     0.6

[Figure: Raw data, and data after quantile normalization]

Normalization across arrays: Baseline scaling method
In this method a baseline array is chosen and all the arrays are scaled to have the same mean intensity as this chosen array.
This is equivalent to selecting a baseline array and then fitting a linear regression line without intercept between the chosen array and every other array.

[Figure: Raw data, and data after baseline scaling normalization]

Tests for differential expression of genes
Let x1, ..., xn and y1, ..., ym be independent measurements of the same probe/gene across two conditions.
Whether the gene is differentially expressed between the two conditions can be determined using statistical tests.

Important issues of a test procedure are:
(a) Whether the distributional assumptions are valid
(b) Whether the replicates are independent of each other
(c) Whether the number of replicates is sufficient
(d) Whether outliers have been removed from the sample
Replicates from different experiments should not be mixed, since they have different characteristics and cannot be treated as independent replicates.

The most commonly used statistical tests are as follows:
(a) Student's t-test
(b) Welch's test
(c) Wilcoxon's rank sum test
(d) Permutation tests
The first two tests assume that the samples are taken from Gaussian distributed data, and the p-values are calculated from a probability distribution function.
The latter two are nonparametric, and the p-values are calculated using combinatorial arguments.
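To illustrate what "calculated using combinatorial arguments" means for a permutation test, here is a minimal sketch of an exact two-sided test on the difference of group means. The choice of test statistic and the function name `permutation_test` are assumptions made for illustration; the slides do not specify them. Enumerating every split is only feasible for small samples.

```python
from itertools import combinations

def permutation_test(x, y):
    """Exact two-sided permutation p-value for the difference of means.

    Every way of relabelling the pooled values into groups of the original
    sizes is enumerated; the p-value is the fraction of relabellings whose
    absolute mean difference is at least as large as the observed one.
    """
    pooled = list(x) + list(y)
    n, total = len(x), sum(pooled)
    observed = abs(sum(x) / n - sum(y) / len(y))
    hits = splits = 0
    for idx in combinations(range(len(pooled)), n):
        sum_x = sum(pooled[i] for i in idx)
        diff = abs(sum_x / n - (total - sum_x) / (len(pooled) - n))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        splits += 1
    return hits / splits

p_separated = permutation_test([1, 2, 3], [4, 5, 6])
p_identical = permutation_test([1, 2], [1, 2])
```

For the fully separated toy groups only 2 of the 20 possible relabellings are as extreme as the observed one, so the p-value is 0.1; with three replicates per group no smaller two-sided p-value is attainable, which is one reason the number of replicates matters.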
Student's t-test
Assumptions: both samples are taken from Gaussian distributions that have equal variances.
With sample means $\bar{x}$, $\bar{y}$ and sample variances $s_x^2$, $s_y^2$, the statistic in its standard form is
$t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}$, where $s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{m+n-2}$
Degrees of freedom: m+n-2

Welch's test is a variant of the t-test where t is calculated as follows:
$t = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}}$
Welch's test does not assume equal population variances.

The value of t is supposed to follow a t-distribution if the null hypothesis (equal population means) holds.
After calculating the value of t, we can determine the p-value from the t-distribution with the corresponding degrees of freedom.

Wilcoxon's rank sum test
Let x1, ..., xn and y1, ..., ym be independent measurements of the same probe/gene across two conditions.
Consider the combined set x1, ..., xn, y1, ..., ym.
The test statistic of the Wilcoxon test is
$T = \sum_{i=1}^{n} R(x_i)$
where $R(x_i)$ is the rank of $x_i$ in the combined series (rank 1 is the largest value, as in the example that follows).
Possible minimum value of T: $n(n+1)/2$
Possible maximum value of T: $n(2m+n+1)/2$
The minimum and maximum values of T occur if all X data are greater or smaller than the Y data respectively, i.e. if they are sampled from quite different distributions.
The expected value and variance of T under the null hypothesis are:
$E_{H_0}(T) = \frac{n(m+n+1)}{2}$,  $Var_{H_0}(T) = \frac{mn(m+n+1)}{12}$
Unusually low or high values of T compared to the expected value indicate that the null hypothesis should be rejected, i.e. the samples are not from the same population.
For larger samples, i.e. m+n > 25, T is approximately normally distributed:
$T \sim N\left(\frac{n(m+n+1)}{2}, \frac{mn(m+n+1)}{12}\right)$

Wilcoxon's rank sum test (Example)

X data       Y data       X&Y data   Rank
x1   7       y1   5       x4   9     1
x2   8       y2   6       x2   8     2
x3   5       y3   8       y3   8     3
x4   9       y4   4       x5   7     4
x5   7                    x1   7     5
                          y2   6     6
                          y1   5     7
                          x3   5     8
                          y4   4     9

n = 5, m = 4
T = R(x1) + R(x2) + R(x3) + R(x4) + R(x5) = 5 + 2 + 8 + 1 + 4 = 20
E_H0(T) = n(m+n+1)/2 = 5(4+5+1)/2 = 25
Var_H0(T) = mn(m+n+1)/12 = 5*4(4+5+1)/12 = 50/3 = 16.67
P-value = 0.1112 (from chart)

Multiple testing and FDR
The single gene analysis using statistical tests has a drawback. This arises from the fact that while analyzing microarray data we conduct thousands of tests in parallel.
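As an aside on the Wilcoxon rank sum example above, its arithmetic can be checked with a few lines of code. This is a sketch under stated assumptions: ranks are assigned in descending order as in the example, tied values receive the average of their ranks (the example breaks ties arbitrarily, which gives the same T here), the variance carries no tie correction, and the function name `rank_sum_statistic` is hypothetical.

```python
def rank_sum_statistic(x, y):
    """Return (T, E[T], Var[T]) for the rank sum of x in the pooled sample.

    Rank 1 is the largest pooled value; ties share the average rank.
    """
    combined = sorted(list(x) + list(y), reverse=True)

    def midrank(value):
        first = combined.index(value) + 1  # best rank held by this value
        count = combined.count(value)      # number of tied observations
        return first + (count - 1) / 2

    n, m = len(x), len(y)
    t = sum(midrank(v) for v in x)
    expected = n * (m + n + 1) / 2
    variance = m * n * (m + n + 1) / 12
    return t, expected, variance

x_data = [7, 8, 5, 9, 7]
y_data = [5, 6, 8, 4]
t_stat, t_mean, t_var = rank_sum_statistic(x_data, y_data)
```

With the example data this reproduces T = 20, an expected value of 25 and a variance of 50/3.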
Suppose we test 10000 genes at a significance level α = 0.05, i.e. a false positive rate of 5%.
This means we expect about 500 individual tests to be false positives, which is clearly unacceptable.
Therefore corrections for multiple testing are applied while analyzing microarray data.

Multiple testing and FDR
Let αg be the global significance level and αs the significance level at the single gene level.
In the case of a single gene, the probability of making a correct decision is
$1 - \alpha_s$
Therefore the probability of making a correct decision for all n genes (i.e. at the global level) is
$(1 - \alpha_s)^n$
Now the probability of drawing the wrong conclusion in at least one of the n tests is
$\alpha_g = 1 - (1 - \alpha_s)^n$
For example, if we have 100 different genes and αs = 0.05, the probability that we make at least 1 error is 0.994. This is very high. This probability is called the family-wise error rate (FWER).

Multiple testing and FDR
Using the binomial expansion we can write
$(1 - \alpha_s)^n \approx 1 - n\alpha_s$
Thus
$\alpha_g \approx n\alpha_s$
Therefore the Bonferroni correction of the single gene level is the global level divided by the number of tests:
$\alpha_s = \alpha_g / n$
For a FWER of 0.01 with n = 10000 genes, the p-value at the single gene level should therefore be $10^{-6}$.
Usually very few genes can meet this requirement.
Therefore we need to adjust the threshold p-value for the single gene case.

Multiple testing and FDR
A method for adjusting p-values is given in the following book:
Westfall P. H. and Young S. S. (1993) Resampling-based multiple testing: examples and methods for p-value adjustment, Wiley, New York

Multiple testing and FDR
An alternative to controlling the FWER is the computation of the false discovery rate (FDR).
The following papers discuss FDR:
Storey J. D. and Tibshirani R. (2003) Statistical significance for genomewide studies, PNAS 100, 9440-9445
Benjamini Y. and Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing, J Royal Statist Soc B 57, 289-300
Still, the practical use of multiple testing is not entirely clear.
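The quantities in this section are easy to compute, and the following sketch does so. The FWER and Bonferroni functions follow the formulas above; the last function is the standard step-up procedure from the Benjamini and Hochberg paper cited above, which the slides cite but do not spell out. All function names are hypothetical and the p-values are made up.

```python
def family_wise_error_rate(alpha_s, n):
    """Probability of at least one false positive in n independent tests."""
    return 1 - (1 - alpha_s) ** n

def bonferroni_threshold(alpha_g, n):
    """Single-gene significance level that keeps the FWER near alpha_g."""
    return alpha_g / n

def benjamini_hochberg(pvalues, q=0.05):
    """Indices of the hypotheses rejected at FDR level q (step-up procedure).

    Sort the p-values, find the largest rank k with p_(k) <= k * q / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])

fwer_100 = family_wise_error_rate(0.05, 100)
alpha_single = bonferroni_threshold(0.01, 10000)
rejected = benjamini_hochberg([0.02, 0.03, 0.035, 0.5], q=0.05)
```

In the toy p-value list the two smallest values miss their own thresholds (0.0125 and 0.025), but the third meets its threshold (0.0375), so all three are rejected; Bonferroni at the same level would require p ≤ 0.0125 and reject none of them, which shows why FDR control is the less conservative choice.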
However, it is clear that we need to adjust the p-value at the single gene level while testing many genes together.

Line Graphs
Given a graph G, its line graph L(G) is a graph such that each vertex of L(G) represents an edge of G, and two vertices of L(G) are adjacent if and only if their corresponding edges share a common endpoint ("are adjacent") in G.
[Figure panels: Graph G; Vertices in L(G) constructed from edges in G; Added edges in L(G); The line graph L(G). Source: http://en.wikipedia.org/wiki/Line_graph]

Line Graphs
RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs
By John W. Raymond, Eleanor J. Gardiner and Peter Willett
The Computer Journal, Vol. 45, No. 6, 2002
The above paper introduced a new graph similarity calculation procedure for comparing labeled graphs.
The chemical graphs G1 and G2 are shown in Figure a, and their respective line graphs are depicted in Figure b.

Line Graphs
Detection of Functional Modules From Protein Interaction Networks
By Jose B. Pereira-Leal, Anton J. Enright and Christos A. Ouzounis
PROTEINS: Structure, Function, and Bioinformatics 54:49–57 (2004)
A star is transformed into a clique.
For applying DPClus to a sparse graph, it is recommended that you first transform the graph to its corresponding line graph.
[Figure: Transforming a network of proteins to a network of interactions. a) Schematic representation illustrating a graph representation of protein interactions: nodes correspond to proteins and edges to interactions. b) Schematic representation illustrating the transformation of the protein graph connected by interactions to an interaction graph connected by proteins. Each node represents a binary interaction and edges represent shared proteins. Note that labels that are not shared correspond to terminal nodes in (a).]

What is BLAST?
BLAST is the Basic Local Alignment Search Tool.
BLAST finds regions of similarity between biological sequences.
The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.

Types of BLAST
There are now a handful of different BLAST programs available, which can be used depending on what one is attempting to do and what one is working with.
These programs vary in the query sequence input, the database being searched, and what is being compared.
The programs and their details are listed below:

Program   Query        Database
blastn    Nucleotide   Nucleotide
blastp    Protein      Protein
blastx    Nucleotide   Protein
tblastn   Protein      Nucleotide

BLAST in NCBI
Web BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
Select your sequence (FASTA format file). You can also paste the copied sequence directly into the Query box.
Select Database and Organism. In this case, Arabidopsis thaliana is selected.
Click "BLAST".

FASTA format file
In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
The format also allows for sequence names and comments to precede the sequences.
>TR1|c1_g1_i1 len=261 path=[239:0-260] [-1, 239, -2]
TTATTGAGAAAATTGCTGGTGTTACAATGCATGATGCACAACAGGCGCAAACGTGCAGAAAAGCATATGCACCT
AAAAAACACTGAGCAAATGGCATGCCAGAGTAGGTATAAATGGTCCGCTGTGGGGGCTGTTTGCAGGGCAAC
GATGATTGATGCACATGAACCAAAAATGCACAAGTATGCATACCGTACTTTTTGCACATATTGTAAAGAACGCAA
TGATTGCGCATACAGCAACGGCTTGGCACGTGGGGTCCAG
>TR2|c0_g1_i1 len=268 path=[491:0-267] [-1, 491, -2]
CAAAGCGTACGGAGGCAGAAGGTTTGTTCAATGTCGTGGGGGGGGGTCCCACACTCCCTCTTTTCAGACTGTG
GTAACTAAGGCAGGTTCGGTCGAAGCCGCAAGGGGGAGAATTTCCCTACCGCCCCCACCAGGAAATCGTCAC
CTCACAAATAGTGCAAGTCCACAGGAGAACTTCTGGTGGGATCAATAACTAAAAAAAAACCTCCTTGCACGTG
GTTTTGGGGGATTTGGACAATT

In each record, the line starting with ">" is the header line; the lines that follow it are the actual sequence in one-letter code.
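Reading such a file programmatically is straightforward. The sketch below is a minimal illustration, not a full FASTA parser: the function name `parse_fasta` is hypothetical, the record name is assumed to be the first whitespace-delimited token after ">", and the two short records are made up.

```python
def parse_fasta(text):
    """Map each record name to its sequence, joined across wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: keep the first token after ">" as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence line of the current record
    return {key: "".join(parts) for key, parts in records.items()}

example = """>seq1 len=8
ACGT
ACGT
>seq2 len=4
TTGA
"""
records = parse_fasta(example)
```

The wrapped sequence lines of each record are concatenated, so `records["seq1"]` holds the full 8-letter sequence regardless of how the file was line-wrapped.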