* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Slide 1
Long non-coding RNA wikipedia , lookup
Genome (book) wikipedia , lookup
Transposable element wikipedia , lookup
Human genetic variation wikipedia , lookup
Pathogenomics wikipedia , lookup
Designer baby wikipedia , lookup
History of genetic engineering wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Human–animal hybrid wikipedia , lookup
Metagenomics wikipedia , lookup
Sequence alignment wikipedia , lookup
Genome editing wikipedia , lookup
Multiple sequence alignment wikipedia , lookup
Smith–Waterman algorithm wikipedia , lookup
Genome evolution wikipedia , lookup
Non-coding DNA wikipedia , lookup
Human Genome Project wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Human genome wikipedia , lookup
HUMAN Modelling the Boundary of Highly Conserved Non-Coding DNA Klaudia Walter, Wally Gilks, Lorenz Wernisch 12th December 2006 Overview • Background – What are CNEs? – A+T nucleotide frequency in and around CNEs • Phylogenetic Model – What is a phylogenetic tree model? – Likelihood of a tree model – Likelihood of the scaling of a tree – Likelihood of CNE boundary – Variable CNE boundaries for each species Motivation • DNA sequences that are conserved between organisms are likely to have special functions. • The Fugu genome represents a good model to find conserved non-coding sequences (CNEs) in the human genome. • Are conserved regions different from their neighbouring sequences in the genome? • Is it possible to define CNE boundaries better than with pairwise sequence alignment of Fugu and human? What are CNEs? Multiple Alignment of Mouse, Rat, Human and Fugu Fugu Genome • Fugu genome contains only 400Mb. • Only an eighth of human genome. • Gene repertoire is similar to human. • Human and Fugu shared last common ancestor 450 million years ago. (Brenner et al, 1993; Aparicio et al, 2002) Conserved Non-coding Elements (CNE) • 1373 CNEs identified in human and Fugu • 93 - 740 bp long; 68 - 98% identical • Situated around developmental genes • Can act over 1 Mb distance, eg. Shh expression (Lettice et al, 2003; Nobrega et al, 2003; Kleinjan & van Heyningen, 2004) • Likely to be fundamental to vertebrate life (Dermitzakis et al, 2002, 2003; Margulies et al, 2003; Bejerano et al 2004a; Woolfe et al, 2005) Are vertebrate CNEs enhancers? element 1 element 4 Fugu / Mouse Fugu / Rat Fugu / Human element 8-10 Coding Exon Conserved Non-coding Sequence SOX21 gene element 19 element 5 Element 19 sox21 gene element 19 central nervous system eye forebrain (Woolfe et al, 2005; McEwen et al, 2006) Model of duplication of cis-element and target gene CNE Target (Vavouri et al, 2006; McEwen et al, 2006) A+T base frequency in CNEs Position Specific Base Composition Upstream flanking region Conserved non-coding ACTAGCCTCATCGTAGCGCAATTCTAGATGATAACA TACCGAGTTCGGTAGGAGCTTAGTATGAGCATAACG CGTGTGCTAGGTCACGGCGCAACATACTTATAGACT ACGCCCTTGCACGATCCGGATATCATAGTCTTACAA A = 0.00 C = 0.25 G = 0.50 T = 0.25 A = 0.50 C = 0.00 G = 0.25 T = 0.25 A+T relative frequency across CNE boundaries in Fugu and human (Walter et al, 2005) A+T relative frequency across 2000 genes in human chromosome 1 Genes were aligned at the start and the end. Distribution of Position Weight Matrix (PWM) Scores for CNEs and Random Sequences A position weight matrix (PWM) is constructed by dividing the nucleotide probabilities by expected background probabilities. n p (b, i ) S log 2 p(b) i 1 p(b,i) = probability of base b in position i p(b) = background probability of base b Scores for Fugu CNEs Scores for Human CNEs The sequence logo for the 100 top scoring CNEs. What do CNEs do? • Some CNEs enhance GFP (green fluorescent protein) expression in zebrafish embryos. • The function of CNEs is still unknown. • Necessary to do more lab experiments. • Are CNEs defined well enough for experiments? Conservation pattern across CNE boundaries 1373 Fugu-human CNE pairs plus 100bp flanking regions aligned using Needleman-Wunsch’s algorithm. A+T frequency in Fugu, Human, Worm and Fly (Glazov et al, 2005; Vavouri et al, 2006 (submitted)) Are CNE ends well defined? • Different parameter settings produce different alignments. • Even just different mismatch penalties change – the alignments – the A+T bias at the CNE boundaries A+T frequency for Fugu CNEs using pairwise alignments with Human Phylogenetic Model Multiple sequence alignment Human Mouse Chicken Xenopus Fugu 5’ flanking conserved ACAGTAT ACCGTAT AACGTAT CCACTAT CGACTTA ATCGTAAT ATCGTAAT ATCGTAAT ATCGTAAT ATCGTAAT 300 bp 100 bp boundary Phylogenetic tree model qAA qCA • Substitution rate matrix Q qGA – Continuous-time Markov process qTA • Tree topology • Branch lengths • Scaling of tree qAC qAG qCC qGC qTC qCG qGG qTG C H F M qAT qCT qGT qTT Matrix P(t) of substitution probabilities for branch length t (Qt )i P(t ) exp(Qt ) i! i 1 Q should be diagonalizable. If Q is not symmetric, we need to find the eigensystem of a symmetric matrix S related to Q and to convert results to the eigensystem of Q. Example: A, C, G, T aC bG a A d G Q bA d C c A eC f G c T eT f T Estimating A+T frequency around Fugu CNE boundary relative A+T frequency Phylogenetic tree with conserved and flanking scalings g Mouse Fugu Conserved scaling gC Xenopus Chicken Human Mouse Fugu Flanking scaling gF Xenopus Chicken Human scale What is the optimal scaling? flanking scaling gF conserved scaling gC boundary position Compute likelihood of scaling g 5’ flanking Human Mouse Chicken Xenopus Fugu ACA ACC AAC CCA CGA G G G C C conserved TATATCGTAAT TATATCGTAAT TATATCGTAAT TATATCGTAAT TTAATCGTAAT Felsenstein’s algorithm: P(s | T, g) Felsenstein’s algorithm “Pruning” algorithm by Felsenstein (1973, 1981) uses dynamic programming to calculate likelihood of a tree model P(S | T ). u Recursion: • If u is a leaf If xu = a, then P (Lu | a) = 1 Otherwise, P (Lu | a) = 0 a w v b c • Otherwise P (Lu | a) = P(b | a,t b v ) P (Lv | b) P (c | a, tw ) P (Lw | c ) c Likelihood of scaling g • Calculate likelihood P(S | T, g) of scaling vector g by summing over boundary b. • Assume evolutionary independence of each position i in the multiple alignment S. • P(S | T, g) is calculated by Felsenstein’s algorithm. N P (S | T, g ) P (S | T , g b ) P (g b ) b 1 Model with common scaling and individual boundaries Probability of scaling g given sequences S1, …, Sn n P ( g | S1,..., Sn ) P (S1,..., Sn | g ) P ( g ) P (Si | g ) P ( g ) i 1 Likelihood of scaling g over CNEs Hierarchical model for g (, ) (gC, gF)1 (gC, gF)2 (gC, gF)3 ..... (gC, gF)n S1 S2 S3 ..... Sn P (S1,S2 ,..., Sn | , ) P (S | , ) P (S | , ) P(S | g g C ,g F S C ,g F ) P ( gC , g F | , ) Multivariate log normal distribution for (gC, gF) gF gC gC Likelihood of boundary b • The likelihood of the boundary is computed by summing over scalings g. • b and g are independent. • Prior on g. P (S | b ) P (S | b, g ) P ( g ) g Likelihood of boundary b Boundary shifts for phylogenetic model density position Boundary shift 0 bp ≤ 20bp ≤ 50bp ≤ 100bp Cumulative frequency 12% 40% 61% 80% Relative conservation by position Model for variable boundary Branches 000000 0 11111111 000011 1 11111111 000011 1 11111111 000000 0 00111111 000000 0 00111111 000000 0 00111111 000000 1 11111111 000000 1 11111111 0 1 0 0 0 H Positions 1 1 M C 1 X F Transitions 1. 0000 0001 0010 0011 ......... 1111 2. 0000 0001 0010 0011 ......... 1111 3. 0000 0001 0010 0011 ......... 1111 ...... ...... ...... ...... ...... Variable boundary for CNE1031 Human Mouse Chicken Xenopus Fugu AGTAGTTTCC AGGAGCCTCT AGTAGTTTCC -GTTATATAC AATAGTTCCC ATGCCTGTCA ATGCCTGTCA ATGCCTGTCA ACGCCTGTCA ATGCCTGTCA 10 bp 10 bp Boundary shift = 154 bp Variable boundary for CNE1043 Human Mouse Chicken Xenopus Fugu TGATGTTGAA TGATGTGTAG TGACGTTCAG TGACACTCAA TGACGCGCAG TCATTTAAAA TCATTTAAAA TCAGTTAAAA TCATTTAAAT TCAGTTAAAT 10 bp 10 bp Boundary shift = 0 bp Variable boundary for CNE1037 Human Mouse Chicken Xenopus Fugu TA-GGCCATT TA-GGCCATT TA-GGCCATT AA-GACCATA TGTGGTAGGT CTGATTTGTA CTGATTTGTA CTGATTTGTA CTGATTTTTT CTGATTTGTA 10 bp 10 bp Boundary shift = 65 bp Conservation structure of CNEs Summary • Statistical models for CNE boundaries that incorporates phylogenetic information. • Aim is to define location of CNE boundaries more reliably than pairwise or multiple sequence alignments. Acknowledgments Greg Elgar (Queen Mary College, University of London) Irina Abnizova Gayle McEwen Krys Kelly Brian Tom (MRC Biostatistics Unit, Cambridge) Tanya Vavouri (QMUL & Sanger Institute, Hinxton) Adam Woolfe (NHGRI, National Institutes of Health, US) Yvonne Edwards (University College, University of London) Martin Goodson References • Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005, 3(1). • Walter K, Abnizova I, Elgar G, Gilks WR. Striking nucleotide frequency pattern at the borders of highly conserved vertebrate non-coding sequences. Trends Genet. 2005, 21(8):436-40. • Vavouri T, McEwen GK, Woolfe A, Gilks WR, Elgar G. Defining a genomic radius for long-range enhancer action: duplicated conserved non-coding elements hold the key. Trends Genet. 2006, 22(1):5-10. • McEwen GK, Woolfe A, Goode D, Vavouri T, Callaway H, Elgar G. Ancient duplicated conserved noncoding elements in vertebrates: a genomic and functional analysis. Genome Res. 2006,16(4):451-65. • Vavouri T, Walter K, Gilks WR, Lehner B, Elgar G. Parallel evolution of conserved noncoding elements that target a common set of developmental regulatory genes from worms to humans. Submitted 2006. Human CNE boundary MegaBLAST Phylogenetic position position A+T frequency Chicken CNE boundary MegaBLAST Phylogenetic position position A+T frequency Fugu CNE boundary MegaBLAST Phylogenetic position position A+T frequency From rate matrix Q to probability matrix P qAA qCA ' p p Q ( pA , pC , pG , pT ) qGA qTA qAC qAG qCC qGC qTC qCG qGG qTG qAT qCT qGT qTT pA' pAqAA pC qCA pGqGA pT qTA pA (qAC qAG qAT ) pC qCA pGqGA pT qTA P(t) of substitution probabilities (2) S diag( ) Q diag( (A , C , G , T ) 1/ 2 1/ 2 ) is symmetric with S and Q have the same eigenvalues (1, 2, 3, 4). exp(St ) V diag(exp( t )) (V )T exp(Qt ) diag( 1/ 2 )exp(St ) diag( 1/ 2 ) P (t ) exp(Qt )