Sequence analysis
How to locate rare or important subsequences.

Sequence Analysis Tasks
- Representing sequence features
- Finding sequence features using consensus sequences and frequency matrices

Sequence features
- Features following an exact pattern: restriction enzyme recognition sites
- Features with approximate patterns:
  - promoters
  - transcription initiation sites
  - transcription termination sites
  - polyadenylation sites
  - ribosome binding sites
  - protein features

Representing uncertainty in nucleotide sequences
We often want to represent uncertainty in a nucleotide sequence, i.e., that more than one base is “possible” at a given position:
- to express ambiguity during sequencing
- to express variation at a position in a gene during evolution
- to express the ability of an enzyme to tolerate more than one base at a given position of a recognition site

To do this for nucleotides, we use a set of single-character codes that represent all possible combinations of bases.
- This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code.
- Given the size of the amino acid “alphabet”, it is not practical to design a set of codes for ambiguity in protein sequences.

The I.U.B. Code
- A, C, G, T, U
- R = A, G (puRine)
- Y = C, T (pYrimidine)
- S = G, C (Strong hydrogen bonds)
- W = A, T (Weak hydrogen bonds)
- M = A, C (aMino group)
- K = G, T (Keto group)
- B = C, G, T (not A)
- D = A, G, T (not C)
- H = A, C, T (not G)
- V = A, C, G (not T/U)
- N = A, C, G, T/U (iNdeterminate); X or - are sometimes used

Definitions
- A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function.
- A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature.
- Consensus sequences are regular expressions.

Finding occurrences of consensus sequences
Example: recognition site for a restriction enzyme
- EcoRI recognizes GAATTC
- AccI recognizes GTMKAC

Basic Algorithm
1. Start with the first character of the sequence to be searched.
2. See if the enzyme site matches starting at that position.
3. Advance to the next character of the sequence to be searched.
4. Repeat the previous two steps until all positions have been tested.

Block Diagram for Search with a Consensus Sequence
- Inputs: consensus sequence (in IUB codes); sequence to be searched
- Process: search engine
- Output: list of positions where matches occur

Statistics of pattern appearance
- Goal: Determine the significance of observing a feature (pattern).
- Method: Estimate the probability that the pattern would occur randomly in a given sequence.
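The basic search algorithm for consensus sequences described above can be sketched in Python as follows. This is a minimal illustration rather than the search engine from the block diagram: the names `IUB` and `find_sites` are hypothetical, positions are reported 1-based, and U is simply treated as T (an assumption).

```python
# IUB codes from the table above; U is mapped to T here (an assumption).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_sites(consensus, sequence):
    """Return 1-based start positions where the consensus matches the sequence."""
    consensus = consensus.upper()
    sequence = sequence.upper().replace("U", "T")
    m = len(consensus)
    hits = []
    # Test every start position in turn, as in the basic algorithm.
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        # A position matches if the observed base is allowed by the IUB code.
        if all(base in IUB[code] for code, base in zip(consensus, window)):
            hits.append(start + 1)
    return hits


ecori_hits = find_sites("GAATTC", "AAGAATTCGG")
acci_hits = find_sites("GTMKAC", "CCGTATACGTCGACTT")
```

Here `ecori_hits` is `[3]`, and `acci_hits` is `[3, 9]` because the AccI pattern GTMKAC matches both GTATAC and GTCGAC.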
Three different methods
- Assume all nucleotides are equally frequent
- Use measured frequencies of each nucleotide (mononucleotide frequencies)
- Use measured frequencies with which a given nucleotide follows another (dinucleotide frequencies)

Determining mononucleotide frequencies
- Count how many times each nucleotide appears in the sequence
- Divide (normalize) by the total number of nucleotides
- Result: $f_A$ = mononucleotide frequency of A (frequency that A is observed)
- Define: $p_A$ = mononucleotide probability that a nucleotide will be an A; $p_A$ is assumed to equal $f_A$

Determining dinucleotide frequencies
- Make a 4 x 4 matrix, one element for each ordered pair of nucleotides
- Zero all elements
- Go through the sequence linearly, adding one to the matrix entry corresponding to the pair of sequence elements observed at that position
- Divide by the total number of dinucleotides
- Result: $f_{AC}$ = dinucleotide frequency of AC (frequency that AC is observed out of all dinucleotides)

Determining conditional dinucleotide probabilities
- Divide each dinucleotide frequency by the mononucleotide frequency of the first nucleotide
- Result: $p^*_{AC}$ = conditional dinucleotide probability of observing a C given an A, $p^*_{AC} = f_{AC} / f_A$

Illustration of probability calculation
What is the probability of observing the sequence feature ART, i.e., A followed by a purine (either A or G), followed by a T?
- Using equal mononucleotide frequencies: $p_A = p_C = p_G = p_T = 1/4$, so $p_{ART} = 1/4 \cdot (1/4 + 1/4) \cdot 1/4 = 1/32$
- Using observed mononucleotide frequencies: $p_{ART} = p_A (p_A + p_G) p_T$
- Using dinucleotide frequencies: $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$

Another illustration
What is $p_{ACT}$ in the sequence TTTAACTGGG?
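The frequency definitions above can be sketched in Python and applied to the example sequence TTTAACTGGG. This is only an illustration under stated assumptions: the function names are hypothetical, and the conditional probability $p^*_{xy}$ is computed as the number of xy pairs divided by the number of x's that have a successor, which is one way of reading $f_{xy}/f_x$ (it avoids the small mismatch between the number of nucleotides and the number of dinucleotides).

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def mono_freq(seq):
    """Mononucleotide frequencies f_x (used as probabilities p_x)."""
    counts = Counter(seq)
    return {x: counts[x] / len(seq) for x in NUCLEOTIDES}


def conditional_probs(seq):
    """p*_xy: among the x's that have a successor, the fraction followed by y."""
    pairs = Counter(zip(seq, seq[1:]))   # ordered dinucleotide counts
    first = Counter(seq[:-1])            # every nucleotide that starts a pair
    return {
        x + y: pairs[(x, y)] / first[x]
        for x in NUCLEOTIDES for y in NUCLEOTIDES if first[x]
    }


def pattern_prob(pattern, seq):
    """Probability of an exact pattern using dinucleotide statistics."""
    p = mono_freq(seq)
    cond = conditional_probs(seq)
    prob = p[pattern[0]]
    for a, b in zip(pattern, pattern[1:]):
        prob *= cond.get(a + b, 0.0)
    return prob


seq = "TTTAACTGGG"
p_act_dinuc = pattern_prob("ACT", seq)
p = mono_freq(seq)
p_act_mono = p["A"] * p["C"] * p["T"]
```

With dinucleotide statistics the estimate comes out at about 0.1, while the product of mononucleotide frequencies gives about 0.008.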
- $f_A = 2/10$, $f_C = 1/10$, so $p_A = 0.2$
- $f_{AC} = 1/10$, $f_{CT} = 1/10$
- $p^*_{AC} = 0.1/0.2 = 0.5$, $p^*_{CT} = 0.1/0.1 = 1$
- $p_{ACT} = p_A \, p^*_{AC} \, p^*_{CT} = 0.2 \cdot 0.5 \cdot 1 = 0.1$
- (This would have been $1/5 \cdot 1/10 \cdot 4/10 = 0.008$ using mononucleotide frequencies.)

Expected number and spacing
- Probabilities are per nucleotide.
- How do we calculate the number of expected features in a sequence of length L? Expected number (for large L) ≈ $Lp$
- How do we calculate the expected spacing between features? Expected spacing between ART features = $1/p_{ART}$

Renewals
- For greatest accuracy in calculating the spacing of features, we need to consider renewals of a feature (taking into account whether a feature can overlap with a neighboring copy of that feature).
- For example, what is the frequency of GCGC in ACTGCATGCGCGCATGCGCATATGACGA?
- We define a renewal as the end of a non-overlapping occurrence of the motif.
- For example, the renewals of GCGC in ACTGCATGCGCGCATGCGCATATGCGCGCGC are at positions 11, 19, 27, 31. The clump sizes are 2, 1, 2, 1.

Renewals and clump size
Let R be a general pattern, $R = (r_1, \dots, r_m)$, and denote its prefix and suffix of length $i$ by
$R_{(i)} = (r_1, \dots, r_i)$ and $R^{(i)} = (r_{m-i+1}, \dots, r_m)$.
The clump size is:
$$c = 1 + \sum_{i=1}^{m-1} p_{r_{i+1}} \cdots p_{r_m} \, \mathbf{1}_{R_{(i)} = R^{(i)}}$$

Clump frequency
Let us assume that the clumps are distributed randomly. Their frequency, and the interval between any two clumps, would be:
nc npr1 ... prm 1 m 1 i 1 1 1R ( i ) R (i ) pr1 ... pri

Statistical tests
To test whether the motif is over- or under-represented, or non-uniformly distributed, we must test the clump distribution.
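Returning to the renewal definition given above, the positions can be found with a simple left-to-right scan. The sketch below is an illustration only: the function name `renewals` is hypothetical, and it assumes that after each match the scan resumes immediately after the matched copy, so that counted occurrences never overlap.

```python
def renewals(motif, sequence):
    """1-based end positions of non-overlapping motif occurrences (left-to-right scan)."""
    m = len(motif)
    ends = []
    i = 0
    while i <= len(sequence) - m:
        if sequence[i:i + m] == motif:
            ends.append(i + m)   # 1-based position of the last matched nucleotide
            i += m               # skip past this copy so the next one cannot overlap it
        else:
            i += 1
    return ends


gcgc_renewals = renewals("GCGC", "ACTGCATGCGCGCATGCGCATATGCGCGCGC")
```

For the example sequence this gives `[11, 19, 27, 31]`, the renewal positions listed above.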
To test motif frequency, we can test whether the clump frequency has an average and variance of $n\lambda$, writing $\lambda$ for the expected clump frequency per nucleotide and $n$ for the sequence length.

To test the distribution of clumps, we can divide the entire sequence into $k$ subsequences of size $T$, with $m < T \ll 1/\lambda$, and test that $S$ has a $\chi^2$ distribution, where $T_i$ is the clump frequency in subsequence $i$ and $S$ is:
$$S = \sum_{i=1}^{k} \frac{(T_i - n\lambda/k)^2}{n\lambda/k}$$

Frequency of simple motifs

Statistics of AT- or GC-rich regions
- What is the probability of observing a “run” of the same nucleotide (e.g., 25 A’s)?
- Let $p_x$ be the mononucleotide probability of nucleotide x.
- The per-nucleotide probability of a run of N consecutive x’s is $p_x^N$.
- The probability of occurrence in a sequence of length L much longer than N is ≈ $L \, p_x^N$.

Statistics of AT- or GC-rich regions (continued)
- What if $j$ “mismatches” are allowed?
- Let $p_y$ be the probability of observing a different nucleotide (normally $p_y = 1 - p_x$).
- The probability of observing $n-j$ of nucleotide x and $j$ of nucleotide y in a region of length $n$ is
$$\binom{n}{j} p_x^{\,n-j} \, p_y^{\,j}, \qquad \binom{n}{j} = \frac{n!}{(n-j)! \, j!}$$
- As before, we can multiply by L to approximate the probability of observing that combination in a sequence of length L.
- Note that this is the probability of observing exactly $n-j$ matches and exactly $j$ mismatches. We may also wish to know the probability of finding at least $n-j$ matches, which requires summing the probability for $i = 0$ to $i = j$:
$$\sum_{i=0}^{j} \binom{n}{i} p_x^{\,n-i} \, p_y^{\,i}$$

Frequency matrices
- Goal: Describe a sequence feature (or motif) more quantitatively than is possible using consensus sequences.
- Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature.

Weight matrix
Probabilistic model: How likely is each letter at each motif position?
Position    1    2    3    4    5    6    7    8    9
A         .89  .02  .38  .34  .22  .27  .02  .03  .02
C         .04  .91  .20  .17  .28  .31  .30  .04  .02
G         .04  .05  .41  .18  .29  .16  .07  .92  .18
T         .03  .02  .01  .31  .21  .26  .61  .01  .78

Nomenclature
Weight matrices are also known as
- Position-specific scoring matrices
- Position-specific probability matrices
- Position-specific weight matrices

Scoring a motif model
A motif is interesting if it is very different from the background distribution.
[Figure: the weight matrix above, with columns annotated "less interesting" and "more interesting"]

Relative entropy
A motif is interesting if it is very different from the background distribution. Use relative entropy*:
$$\sum_{\text{position } i} \; \sum_{\text{letter } \beta} p_{i,\beta} \log \frac{p_{i,\beta}}{b_\beta}$$
- $p_{i,\beta}$ = probability of $\beta$ in matrix position $i$
- $b_\beta$ = background frequency of $\beta$ (in non-motif sequence)
* Relative entropy is sometimes called information content.

Scoring motif instances
A motif instance matches if it looks like it was generated by the weight matrix.
[Figure: the weight matrix above with the example instance "A C G G C G C C T"; annotations: "Not likely!", "Hard to tell", "Matches weight matrix"]

Log likelihood ratio
A motif instance matches if it looks like it was generated by the weight matrix. Use the log likelihood ratio:
$$\sum_{\text{position } i} \log \frac{p_{i,\beta_i}}{b_{\beta_i}}$$
- $\beta_i$: the character at position $i$ of the instance
- Measures how much more like the weight matrix than like the background the instance is.

Alternating approach
1. Guess an initial weight matrix.
2. Use the weight matrix to predict instances in the input sequences.
3. Use the instances to predict a weight matrix.
4. Repeat 2 & 3 until satisfied.

Examples:
- Gibbs sampler (Lawrence et al.)
- MEME (expectation max. / Bailey, Elkan)
- ANN-Spec (neural net / Workman, Stormo)

Expectation-maximization
    foreach subsequence of width W
        convert subsequence to a matrix
        do {
            re-estimate motif occurrences from matrix          (EM)
            re-estimate matrix model from motif occurrences    (EM)
        } until (matrix model stops changing)
    end
    select matrix with highest score

Sample DNA sequences
>ce1cg
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC
AAAAATGGAAGTCCACAGTCTTGACAG
>ara
GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG
AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG
CTATGCCATAGCATTTTTATCCATAAG
>bglr1
ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA
TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC
TGTGAGCATGGTCATATTTTTATCAAT
>crp
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC
AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC
ACATTACCGTGCAGTACAGTTGATAGC

Motif occurrences
>ce1cg
taatgtttgtgctggtttttgtggcatcgggcgagaata
gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC
aaaaatggaagtccacagtcttgacag
>ara
gacaaaaacgcgtaacaaaagtgtctataatcacggcag
aaaagtccacattgattaTTTGCACGGCGTCACactttg
ctatgccatagcatttttatccataag
>bglr1
acaaatcccaataacttaattattgggatttgttatata
taactttataaattcctaaaattacacaaagttaataac
TGTGAGCATGGTCATatttttatcaat
>crp
cacaaagcgaaagctatgctaaaacagtcaggatgctac
agtaatacattgatgtactgcatgtaTGCAAAGGACGTC
ACattaccgtgcagtacagttgatagc

Starting point
…gactgttttTTTGATCGTTTTCACaaaaatgg…

Motif letter    A     C     G     T
T            0.17  0.17  0.17  0.50
T            0.17  0.17  0.17  0.50
T            0.17  0.17  0.17  0.50
G            0.17  0.17  0.50  0.17
A            0.50  0.17  0.17  0.17
...

Re-estimating motif occurrences
Sequence: TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Each subsequence is scored against the starting matrix shown above:
Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...

Scoring each subsequence
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

Subsequence        Score
TGTGCTGGTTTTTGT    2.95
GTGCTGGTTTTTGTG    4.62
TGCTGGTTTTTGTGG    2.31
GCTGGTTTTTGTGGC    ...

Select from each sequence the subsequence with maximal score.
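The "score every window and keep the best one" step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring used to produce the numbers in the table: the function names are hypothetical, the starting matrix gives 0.5 to the seed letter and splits the rest evenly (1/6 each, which rounds to the 0.17 shown), the score is the plain sum of matrix entries as in the "Score = 0.50 + 0.17 + ..." line, and sequences are assumed to contain only A, C, G and T.

```python
ALPHABET = "ACGT"


def seed_matrix(seed, hit=0.5):
    """Starting matrix: `hit` for the seed letter, the remainder split evenly."""
    miss = (1.0 - hit) / (len(ALPHABET) - 1)
    return [{x: (hit if x == s else miss) for x in ALPHABET} for s in seed.upper()]


def score(window, matrix):
    """Sum of the matrix entries selected by the letters of the window."""
    return sum(column[letter] for column, letter in zip(matrix, window.upper()))


def best_window(sequence, matrix):
    """Return (score, 1-based start, subsequence) of the highest-scoring window."""
    width = len(matrix)
    candidates = [
        (score(sequence[i:i + width], matrix), i + 1, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]
    # max with a key returns the first window that reaches the top score
    return max(candidates, key=lambda c: c[0])


matrix = seed_matrix("TTTGATCGTTTTCAC")
best = best_window("GACTGTTTTTTTGATCGTTTTCACAAAAATGG", matrix)
```

In this toy run the best window is the seed itself (start position 10), since only an exact copy of the seed reaches the maximum score of 15 x 0.5. A log likelihood ratio could be used as the score instead of the plain sum.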
Re-estimating motif matrix
Occurrences:
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC

Counts (one digit per motif position):
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001

Adding pseudocounts
Counts + Pseudocounts:
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112

Converting to frequencies
Motif letter    A     C     G     T
T            0.13  0.13  0.13  0.63
T            0.13  0.13  0.38  0.38
T            0.13  0.25  0.13  0.50
G            0.25  0.13  0.50  0.13
A            0.50  0.25  0.13  0.13
...

Amino acid weight matrices
- A sequence logo is a scaled position-specific A.A. distribution.
- Scaling is by a measure of a position’s information content.

Sequence logos
- A visual representation of a position-specific distribution.
- Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions.
- Idea: the overall height at position $l$ is proportional to the information content ($2 - H_l$); the proportions of each nucleotide (or amino acid) are in relation to their observed frequency at that position, with the most frequent on top, the next most frequent below, etc.

Summary of motif detection

Block Diagram for Searching with a PSSM
- Inputs: PSSM; threshold; set of sequences to search
- Process: PSSM search
- Outputs: sequences that match above threshold; positions and scores of matches

Block Diagram for Searching for sequences related to a family with a PSSM
- Inputs to PSSM builder: set of aligned sequence features; expected frequencies of each sequence element
- Output of PSSM builder: PSSM
- Inputs to PSSM search: PSSM; threshold; set of sequences to search
- Outputs: sequences that match above threshold; positions and scores of matches

Consensus sequences vs. frequency matrices
Should I use a consensus sequence or a frequency matrix to describe my site?
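Going back to the re-estimation step (counts, pseudocounts, frequencies), the arithmetic can be sketched in a few lines. The function names are hypothetical, and the pseudocount of 1 per letter is inferred from the "Counts + Pseudocounts" table rather than stated explicitly in the slides.

```python
ALPHABET = "ACGT"


def count_matrix(occurrences):
    """Per-letter counts at each motif position of the aligned occurrences."""
    width = len(occurrences[0])
    return {
        x: [sum(1 for occ in occurrences if occ[pos] == x) for pos in range(width)]
        for x in ALPHABET
    }


def frequency_matrix(occurrences, pseudocount=1):
    """Add a pseudocount to every cell, then normalize each position to sum to 1."""
    counts = count_matrix(occurrences)
    total = len(occurrences) + pseudocount * len(ALPHABET)
    return {x: [(c + pseudocount) / total for c in counts[x]] for x in ALPHABET}


occurrences = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]
counts = count_matrix(occurrences)
freqs = frequency_matrix(occurrences)
```

For the four occurrences above, the first position has T counted 4 times, so its frequency is (4 + 1) / (4 + 4) = 0.625, shown rounded as 0.63 in the table.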
- If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence. Example: restriction enzyme recognition sites
- If some allowed characters are "better" than others, use a frequency matrix. Example: promoter sequences
- Advantages of consensus sequences: smaller description, quicker comparison
- Disadvantage: lose quantitative information on preferences at certain locations

Similarity Functions
Used to facilitate comparison of two sequence elements
- Logical valued (true or false, 1 or 0): test whether the first argument matches (or could match) the second argument
- Numerical valued: test the degree to which the first argument matches the second

Logical valued similarity functions
Let Search(I)=‘A’ and Sequence(J)=‘R’
- A function to test for an exact match: MatchExact(Search(I),Sequence(J)) would return FALSE since A is not R
- A function to test for the possibility of a match using IUB codes for incompletely specified bases: MatchWild(Search(I),Sequence(J)) would return TRUE since R can be either A or G

Numerical valued similarity functions
- The return value could be a probability (for DNA). Let Search(I) = 'A' and Sequence(J) = 'R'. SimilarNuc(Search(I),Sequence(J)) could return 0.5 since chances are 1 out of 2 that a purine is adenine.
- The return value could be a similarity (for protein). Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine). SimilarProt(Seq1(I),Seq2(J)) could return 0.8 since lysine is similar to arginine.
- Usually use integer values for efficiency.

Concluding Notes: Protein detection
Given a DNA or RNA sequence, find those regions that code for protein(s).

Direct approach

Genetic codes
- The set of tRNAs that an organism possesses defines its genetic code(s).
- The "universal" genetic code is common to most organisms.
- Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes.
- More than one tRNA may be present for a given codon, allowing more than one possible translation product.
- Differences in genetic codes occur mainly in start and stop codons.
- Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GTG, TTG, ATA, TTA, CTG)
- Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (TAG, TGA, TAA)

Reading Frames
- Since nucleotide sequences are “read” three bases at a time, there are three possible “frames” in which a given nucleotide sequence can be “read” (in the forward direction).
- Taking the complement of the sequence and reading in the reverse direction gives three more reading frames.

Example (*** marks a stop codon):
Sequence      TTC TCA TGT TTG ACA GCT
RF1           Phe Ser Cys Leu Thr Ala>
RF2            Ser His Val *** Gln Leu>
RF3             Leu Met Phe Asp Ser>
Complement    AAG AGT ACA AAC TGT CGA
RF4           <Glu *** Thr Gln Cys Ser
RF5            <Glu His Lys Val Ala
RF6             <Arg Met Asn Ser Leu

Reading frames (continued)
- To find which reading frame a region is in, take the nucleotide number of the lower bound of the region, divide by 3 and take the remainder (modulus 3): 1 = RF1, 2 = RF2, 0 = RF3.
- For reverse reading frames, take the nucleotide number of the upper bound of the region, subtract it from the total number of nucleotides, divide by 3 and take the remainder (modulus 3): 0 = RF4, 1 = RF5, 2 = RF6.
- This is because the convention MacVector uses is that RF4 starts with the last nucleotide and reads backwards.

Open Reading Frames (ORF)
- Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to the absence of stop codons)
- Prerequisite: A specific genetic code
- Definition: (start codon) (amino acid coding codon) x n (stop codon)
- Note: Not all ORFs are actually used

Block Diagram for Direct Search for ORFs
- Inputs: genetic code; both strands?; ends start/stop?; sequence to be searched
- Process: search engine
- Output: list of ORF positions

Statistical Approaches

Calculation Windows
- Many sequence analyses require calculating some statistic over a long sequence, looking for regions where the statistic is unusually high or low.
- To do this, we define a window size to be the width of the region over which each calculation is to be done.
- Example: %AT

Base Composition Bias
- For a protein with a roughly “normal” amino acid composition, the first 2 positions of all codons will be about 50% GC.
- If an organism has a high GC content overall, the third position of all codons must be mostly GC.
- Useful for prokaryotes
- Not useful for eukaryotes due to the large amount of noncoding DNA

Fickett’s statistic
- Also called TestCode analysis
- Looks for asymmetry of base composition
- Strong statistical basis for calculations
- Method: For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9... Calculate the statistic from the resulting three numbers.

Codon Bias (Codon Preference)
- Principle: Different levels of expression of different tRNAs for a given amino acid lead to pressure on coding regions to “conform” to the preferred codon usage. Non-coding regions, on the other hand, feel no selective pressure and can drift.
- Starting point: Table of observed codon frequencies in known genes from a given organism; it is best to use highly expressed genes.
- Method: Calculate the “coding potential” within a moving window for all three reading frames. Look for ORFs with high scores.
- Works best for prokaryotes or unicellular eukaryotes because, for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues; genes may have to be grouped into sets.
- Codon bias can also be used to estimate protein expression level.

Portion of D. melanogaster codon frequency table

Amino acid   Codon   Number   Freq/1000   Fraction
Gly          GGG        11       2.60       0.03
Gly          GGA        92      21.74       0.28
Gly          GGT        86      20.33       0.26
Gly          GGC       142      33.56       0.43
Glu          GAG       212      50.11       0.75
Glu          GAA        69      16.31       0.25

Comparison of glycine (Gly, G) codon frequencies

Codon   E. coli   D. melanogaster
GGG      0.02        0.03
GGA      0.00        0.28
GGT      0.59        0.26
GGC      0.38        0.43
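The "Fraction" column of the D. melanogaster table can be recomputed from the "Number" column: it is each codon's count divided by the total count for its amino acid. The sketch below uses only the six rows shown in the table; the names `ROWS` and `codon_fractions` are hypothetical. The Freq/1000 column is not recomputed, because the total number of codons in the underlying gene set is not given.

```python
from collections import defaultdict

# (amino acid, codon, number) rows copied from the D. melanogaster table above.
ROWS = [
    ("Gly", "GGG", 11),
    ("Gly", "GGA", 92),
    ("Gly", "GGT", 86),
    ("Gly", "GGC", 142),
    ("Glu", "GAG", 212),
    ("Glu", "GAA", 69),
]


def codon_fractions(rows):
    """Fraction of each codon among all codons for the same amino acid."""
    totals = defaultdict(int)
    for amino_acid, _codon, number in rows:
        totals[amino_acid] += number
    return {codon: number / totals[amino_acid] for amino_acid, codon, number in rows}


fractions = codon_fractions(ROWS)
rounded = {codon: round(value, 2) for codon, value in fractions.items()}
```

Rounded to two decimals, the results agree with the table, for example 142 / 331 ≈ 0.43 for GGC and 212 / 281 ≈ 0.75 for GAG.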