Discovering gapped binding sites
Chengwei Lei
Dr. Jianhua Ruan
University of Texas at San Antonio
Department of Computer Science

Outline of Talk
• Motif Finding Background
• Gapped Motif Finding
  – Chen’s method
  – SPACE
• The PSO-motif algorithm
• Future Work

Introduction/Motivation
• Introduction: Identification of transcription factor binding sites is an important aspect of the analysis of genetic regulation. Many programs have been developed for discovering motifs.
• Motivation: Previous algorithms need too much memory or time to produce a result; this work seeks a new algorithm that finds the motif using less memory and less time.

What is motif finding
• Motif finding, the process of discovering a meaningful pattern (of nucleotides or amino acids) that is shared by two or more sequences, is an important part of the study of gene function.

Cells respond to environment
[Figure labels: Various external messages; Heat; Food Supply; Responds to environmental conditions]

Regulation of Genes
[Figure labels: Transcription Factor (TF) (Protein); RNA polymerase (Protein); DNA; Promoter; Gene]

Regulation of Genes
[Figure labels: Transcription Factor (TF) (Protein); RNA polymerase (Protein); DNA; Regulatory Element, TF binding site, TF binding motif, cis-regulatory motif (element); Gene]

Regulation of Genes
[Figure labels: Transcription Factor (Protein); RNA polymerase; DNA; Regulatory Element; Gene]

Regulation of Genes
[Figure labels: New protein; RNA polymerase; Transcription Factor; DNA; Regulatory Element; Gene]

Real example
[Figure]

Look Like
• I need a refrigerator, so I go to a refrigerator shop, I try to pick a very beautiful refrigerator from a lot of refrigerator(s). Finally I decide that I will buy a GE refrigerator.

Look Like
• I need a refrigeretor, so I go to a rafrigerator shop, I try to pick a very beautiful refragerator from a lot of refrigerater(s). Finally I decide that I will buy a GE refrigarator.

Mismatch
…TACGAT…
…TAAAAT…
…TATACT…
…GATAAT…
…TATAAT…
…TATGTT…
. . .
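The six example sites on the Mismatch slide can be summarized by a column-wise majority vote, and each site can then be compared against that summary by counting mismatches. The sketch below is only an illustration under that assumption; the function names `consensus` and `mismatches` are hypothetical and do not come from the talk.

```python
from collections import Counter

# The six example binding sites shown on the slide.
SITES = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATAAT", "TATGTT"]


def consensus(sites):
    """Return the most frequent base in each column of equal-length sites."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))


def mismatches(a, b):
    """Count positions where two equal-length strings differ (Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


cons = consensus(SITES)
dists = {site: mismatches(site, cons) for site in SITES}
print(cons)
print(dists)
```

Every site here stays within two mismatches of the majority string, which is the same intuition as the misspelled "refrigerator" variants: each copy is slightly wrong, but the shared pattern is still recognizable.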
Real example
Consensus: TATAAT
• …TACGAT…
• …TAAAAT…
• …TATACT…
• …GATAAT…
• …TATAAT…
• …TATGTT…

refrigerator
• refrigeretor
• rafrigerator
• refragerator
• refrigerater
• refrigarator

Gapped Motif
[Figure labels: New protein; RNA polymerase; Transcription Factor; DNA; Regulatory Element; Gene]
Gapped DNA binding?

Gapped Motif
• Together
• Separate

Together
[Figure labels: mutations; n=5; 5+3+5; L]
• Red+blue+green = 5/25 + 15/15 + 5/25 = 25/65
• Red+xxx+green = 5/25 + xxx + 5/25 = 10/50

Separate
[Figure labels: mutations; n=5; L]
• Red = 5/25
• Green = 5/25
• Pink = 4/25

What can we do with the gap?
• Chen’s method
• SPACE
• PSO

Chen’s method
• ChIP-chip experiment
  – Get a positive set Ga
  – Get a negative set G-a

Compact Blocks
• Patterns that are found in Ga with a proportion larger than a predefined value (25% by default) are included in the pattern list.

Compact Blocks
• Long enough patterns (3containing at least six nonwildcards) are taken as candidate motifs. Short patterns (2blocks of 3 or 4 bp) are filtered.

Hit/Seq ratio
• The sequences that match the pattern are called the supporting sequences of a pattern. It is possible that a pattern matches a sequence at more than one position.
• The Hit/Seq ratio of a pattern is the average number of occurrences of a pattern among its supporting sequences.

Block Filtering
• Filtered out if the Hit/Seq ratio is larger than 15
• A large Hit/Seq ratio implies that the compact blocks are frequently repeated in a single promoter region.
• In addition to the Hit/Seq ratio, they also use an upper threshold for f-a (the proportion of sequences with a pattern P in G-a) to eliminate repetitive elements present across different promoter sequences. A pattern is retained only if it satisfies: (less than 0.16)

Growing Gapped Motifs
• Growing gapped motifs is similar to growing compact motifs.

Pattern Ranking
• An identified pattern is filtered out before ranking if the Hit/Seq ratio is2, which is considered as a reasonable upper bound for selecting reliable patterns.
• Sd is the preferential occurrence of a pattern in Ga relative to G-a.
• Sp is a formula value.
• Sc is the conservation score.

Sd
• The proportions of sequences in Ga and G-a that contain a pattern P are denoted as fa and f-a. The one-tailed two-sample proportion test can be performed as follows:
• Patterns with a z score (Sd) smaller than z1–0.01 are treated as nonsignificant and are removed before the ranking process.

Sp

Sc
• Sc is the degree of evolutionary conservation among a set of orthologous sequences.
• (from Saccharomyces paradoxus, Saccharomyces kudriavzevii, Saccharomyces mikatae, and Saccharomyces bayanus)

Result

Key point
• Filter !!

SPACE
• Generation of motif candidates
  – Consider L=20
• Consider L=20, r=0.5, l=5, d=1 and q=4.

Refinding Motif
• GAAGAnnnnnnnTAGAAAnn is a spaced motif of five sequences.
• Motif Score(M) =
• +
• Let E(M, e) be the expected frequency of M with at most e mutations, based on a set of background sequences.

Why PSO method

Background
• Particle swarm optimization (PSO) is a population-based stochastic optimization technique inspired by the social behavior of bird flocking or fish schooling.
• PSO shares many similarities with evolutionary computation techniques such as Genetic Algorithms (GA), but it is simpler and faster than GA.
• It has been shown to be effective in optimizing difficult multidimensional problems in a variety of fields.
• PSO has been widely applied in ANN (Artificial Neural Network), nonlinear control, electromagnetics, antenna design, and bioinformatics.

Some key terms used to describe PSO
Agent (Particle): One single individual in the swarm
Position: An agent’s N-dimensional coordinates, which represent a solution to the problem
Swarm: The entire collection of agents
Fitness: A single number representing the goodness of a given solution
Pbest: The location in parameter space of the best fitness returned for a specific agent
Gbest: The location in parameter space of the best fitness returned for the entire swarm
V: The velocity of each agent

[Figure labels: gbest; Pbest2; Pbest1; Vn]
v_n = v_n + C_1 \cdot \mathrm{rand}() \cdot (p_{best,n} - x_n) + C_2 \cdot \mathrm{rand}() \cdot (g_{best,n} - x_n)
x_n = x_n + v_n
• One agent’s movement in the PSO algorithm.

Flow chart of the PSO algorithm
[Figure]
• In a typical PSO algorithm, one wishes to control the velocity so that at the beginning stage the particles can fly around quickly inside the search space, and when a particle approaches the optimal solution, it should slow down so it can converge quickly.

. . .
• …TACGATA…
• …TAAAAT…
• …TATACT…
• …GATAAT…
• …TATGAT…
• …TATGTT…
• One can achieve this if the fitness function is continuous, since the velocity is updated according to the distances between the current position and the positions of pbest and gbest.

How to solve
• Remap
• Redefine

Remap the neighborhood information
[Figure: positions 1, 2, …, N; A C G T T C C A T.............A C G T T C C T; mis is 6; mis is 1]

Redefine
[Figure labels: n=5; L]
• Green: Current
• Red: Gbest
• Pink: Pbest
• Blue: Random

Redefine
• Good for gapped motif finding.
  – Quick
  – Flexible
  – High sensitivity
  – High extensibility

Thank you!
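As a supplement to the PSO slides, the velocity and position update (each agent pulled toward its own Pbest and the swarm's Gbest) can be sketched for a continuous fitness function. This is a minimal illustration, not the PSO-motif algorithm from the talk: the function name `pso_minimize`, the velocity clamp `vmax`, the search bounds `lo`/`hi`, the choice of treating lower fitness as better, and the `sphere` test function are all assumptions made for the example.

```python
import random


def pso_minimize(fitness, dim, n_particles=20, iters=100,
                 c1=2.0, c2=2.0, vmax=1.0, lo=-5.0, hi=5.0, seed=0):
    """Basic PSO: v = v + c1*rand*(pbest - x) + c2*rand*(gbest - x); x = x + v."""
    rng = random.Random(seed)
    # Random initial positions, zero initial velocities.
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in x]
    pbest_fit = [fitness(p) for p in x]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]  # best fitness seen so far, one entry per iteration

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                v[i][d] += (c1 * rng.random() * (pbest[i][d] - x[i][d])
                            + c2 * rng.random() * (gbest[d] - x[i][d]))
                # Assumption: clamp the velocity so particles do not fly off.
                v[i][d] = max(-vmax, min(vmax, v[i][d]))
                x[i][d] += v[i][d]
            f = fitness(x[i])
            if f < pbest_fit[i]:  # assumption: lower fitness is better
                pbest[i], pbest_fit[i] = x[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = x[i][:], f
        history.append(gbest_fit)
    return gbest, gbest_fit, history


def sphere(p):
    """Toy continuous fitness: squared distance from the origin."""
    return sum(c * c for c in p)


best, best_fit, history = pso_minimize(sphere, dim=2)
print(best, best_fit)
```

Because the update depends on the distances to Pbest and Gbest, it only slows particles down near an optimum when nearby positions have similar fitness. Positions in a set of DNA sequences do not have that property by default, which is why the talk proposes remapping or redefining the neighborhood.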