Computational Biology, Part 3
Representing and Finding Sequence Features using Frequency Matrices
Robert F. Murphy
Copyright 1996-2001. All rights reserved.

Sequence Analysis Tasks
* Calculating the probability of finding a region with a particular base composition

Statistics of AT- or GC-rich regions
* What is the probability of observing a “run” of the same nucleotide (e.g., 25 A’s)?
* Let $p_x$ be the mononucleotide probability of nucleotide x.
* The per-nucleotide probability of a run of N consecutive x’s is $p_x^N$.
* The probability of occurrence in a sequence of length L (with L much longer than N) is $\approx L \, p_x^N$.

Statistics of AT- or GC-rich regions (continued)
* What if J “mismatches” are allowed?
* Let $p_y$ be the probability of observing a different nucleotide (normally $p_y = 1 - p_x$).
* The probability of observing N-J of nucleotide x and J of nucleotide y in a region of length N is $p_x^{N-J} \, p_y^{J} \, C(N,J)$, where $C(N,J) = N! / ((N-J)! \, J!)$.

Statistics of AT- or GC-rich regions (continued)
* As before, we can multiply by L to approximate the probability of observing that combination in a sequence of length L.
* Note that this is the probability of observing exactly N-J matches and exactly J mismatches. We may also wish to know the probability of finding at least N-J matches, which requires summing the probability over mismatch counts I = 0 to I = J.
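The run and mismatch formulas above can be tried out numerically. The following is a minimal Python sketch, not part of the original slides; the function names are hypothetical, and it assumes, as the slides do, that positions are independent and that $p_y = 1 - p_x$.

```python
from math import comb


def run_probability(p_x: float, n: int) -> float:
    """Per-position probability of n consecutive x's: p_x ** n."""
    return p_x ** n


def exact_mismatch_probability(p_x: float, n: int, j: int) -> float:
    """Probability of exactly n-j x's and j non-x's in a window of length n.

    Assumes p_y = 1 - p_x, as in the slides.
    """
    p_y = 1.0 - p_x
    return comb(n, j) * p_x ** (n - j) * p_y ** j


def at_least_match_probability(p_x: float, n: int, j: int) -> float:
    """Probability of at least n-j matches: sum over i = 0..j mismatches."""
    return sum(exact_mismatch_probability(p_x, n, i) for i in range(j + 1))


def approx_in_sequence(per_window_probability: float, length: int) -> float:
    """Approximation from the slides: multiply by the sequence length L.

    This is only a rough estimate for rare events; the product can
    exceed 1 when the per-window probability is not small.
    """
    return length * per_window_probability


if __name__ == "__main__":
    # Example from the slides: a run of 25 A's, assuming p_A = 0.25.
    p_run = run_probability(0.25, 25)
    print(p_run, approx_in_sequence(p_run, 1_000_000))
    # Same window, but allowing up to 3 mismatches.
    print(at_least_match_probability(0.25, 25, 3))
```

The sequence length of one million and the limit of 3 mismatches in the usage lines are illustrative values, not figures from the slides.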
Statistics of AT- or GC-rich regions
(A4 Enriched seq prob demo)

Sequence Analysis Tasks
* Calculating the probability of finding a sequence pattern
* Calculating the probability of finding a region with a particular base composition
* Representing and finding sequence features/motifs using frequency matrices

Describing features using frequency matrices
* Goal: Describe a sequence feature (or motif) more quantitatively than is possible using consensus sequences
* Need to describe how often particular bases are found in particular positions in a sequence feature

Describing features using frequency matrices (continued)
* Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature

Frequency matrices (continued)
Three uses of frequency matrices:
* Describe a sequence feature
* Calculate probability of occurrence of feature in a random sequence
* Calculate degree of match between a new sequence and a feature

Interactive Demonstration
(A2 Frequency matrix demo)

Frequency Matrices, PSSMs, and Profiles
* A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores (e.g., by taking logs)
* PSSMs are also called Position Weight Matrices (PWMs) or Profiles

Finding occurrences of a sequence feature using a Profile
* As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches
* For each position, we calculate a score by “looking up” the value corresponding to the base at that position

Interactive Demonstration
(A10 Searching with Profile demo)

Block Diagram for Building a PSSM
* Inputs: Set of Aligned Sequence Features; Expected frequencies of each sequence element
* Process: PSSM builder
* Output: PSSM

Block Diagram for Searching with a PSSM
* Inputs: PSSM; Threshold; Set of Sequences to search
* Process: PSSM search
* Outputs: Sequences that match above threshold; Positions and scores of matches

Block Diagram for Searching for sequences related to a family with a PSSM
* Inputs to builder: Set of Aligned Sequence Features; Expected frequencies of each sequence element
* Process: PSSM builder, producing a PSSM
* Inputs to search: PSSM; Threshold; Set of Sequences to search
* Process: PSSM search
* Outputs: Sequences that match above threshold; Positions and scores of matches

Consensus sequences vs. frequency matrices
Should I use a consensus sequence or a frequency matrix to describe my site?
* If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence. Example: Restriction enzyme recognition sites
* If some allowed characters are "better" than others, use a frequency matrix. Example: Promoter sequences

Consensus sequences vs. frequency matrices (continued)
* Advantages of consensus sequences: smaller description, quicker comparison
* Disadvantage: lose quantitative information on preferences at certain locations

Summary, Part 3
* The probability of finding sequences enriched in one or more bases can be calculated using the probability of consecutive bases multiplied by the number of combinations allowed
* Complex sequence features can be described using frequency matrices
* Frequency matrices can be used for quantitative estimates of the degree to which a given sequence matches a feature
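The build-and-search pipeline from the block diagrams can be sketched in a few lines of Python. This is an illustrative sketch, not the author's implementation. The function names and the example sites are made up, the pseudocount is an added assumption that avoids taking the log of zero, and scoring with a base-2 log-odds ratio against the expected frequencies is one common way of "taking logs" rather than something the slides specify.

```python
from math import log2

ALPHABET = "ACGT"


def frequency_matrix(aligned_sites, alphabet=ALPHABET, pseudocount=1.0):
    """Return {base: [frequency at each position]} for equal-length sites.

    The pseudocount is an assumption of this sketch, not from the slides.
    """
    m = len(aligned_sites[0])
    if any(len(site) != m for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    total = len(aligned_sites) + pseudocount * len(alphabet)
    return {
        base: [
            (sum(site[pos] == base for site in aligned_sites) + pseudocount) / total
            for pos in range(m)
        ]
        for base in alphabet
    }


def build_pssm(freqs, expected):
    """PSSM builder: log2(observed / expected) for every matrix element."""
    return {
        base: [log2(f / expected[base]) for f in row]
        for base, row in freqs.items()
    }


def search_pssm(pssm, sequence, threshold):
    """PSSM search: score every window; return (position, score) hits.

    Characters outside the alphabet (e.g. N) are not handled here.
    """
    m = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        score = sum(pssm[base][pos] for pos, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned sites and uniform expected frequencies.
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
    expected = {base: 0.25 for base in ALPHABET}
    pssm = build_pssm(frequency_matrix(sites), expected)
    print(search_pssm(pssm, "GGGTATAATGGG", threshold=3.0))
```

Summing log scores across positions corresponds to multiplying the per-position frequency ratios, which is why a single threshold on the sum can separate strong matches from weak ones.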