Tutorial 5
Motif discovery

Multiple sequence alignments and motif discovery
• Motif discovery
  – MEME
  – MAST
  – TOMTOM
  – GOMO
  – PROSITE

Can we find motifs using multiple sequence alignment?
Motif: a widespread pattern with a biological significance

  ..YDEEGGDAEE..
  ..YDEEGGDAEE..
  ..YGEEGADYED..
  ..YDEEGADYEE..
  ..YNDEGDDYEE..
  ..YHDEGAADEE..

Residue frequency at each motif position (rows: residues, columns: positions 1 to 10):

       1    2    3    4    5    6    7    8    9    10
  A    0    0    0    0    0    3/6  1/6  2/6  0    0
  D    0    3/6  2/6  0    0    1/6  5/6  1/6  0    1/6
  E    0    0    4/6  1    0    0    0    0    1    5/6
  G    0    1/6  0    0    1    2/6  0    0    0    0
  H    0    1/6  0    0    0    0    0    0    0    0
  N    0    1/6  0    0    0    0    0    0    0    0
  Y    1    0    0    0    0    0    0    3/6  0    0

Can we find motifs using multiple sequence alignment (MSA)?
YES! / NO

Using MSA for motif discovery
• Works only if the sequences align nicely on their own
• For most motifs this is not the case!

ClustalW - Input
http://www.ebi.ac.uk/Tools/clustalw2/index.html
• Input sequences
• Scoring matrix
• Gap scoring
• Output format
• Email address

Muscle
http://www.ebi.ac.uk/Tools/muscle/index.html
• Input sequences
• Output format
• Email address

Motif search: from de-novo motifs to motif annotation
http://meme.sdsc.edu/
• Gapped motifs
• Large DNA data

MEME – Multiple EM* for Motif finding
http://meme.sdsc.edu/
• Motif discovery from unaligned sequences
  – Genomic or protein sequences
• Flexible model of motif presence (a motif can be absent from some sequences or appear several times in one sequence)
*EM = Expectation-maximization

MEME - Input
• Email address
• Input file (FASTA file)
• How many times in each sequence?
• Range of motif lengths
• How many motifs?
• How many sites?
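The residue frequency table for the six aligned sites shown earlier in this tutorial can be recomputed in a few lines. The following is a minimal sketch, not part of the original slides: the function name `position_frequency_matrix` and the variable `SITES` are hypothetical names chosen for illustration, and the code assumes the sites are already aligned and of equal length.

```python
from collections import Counter
from fractions import Fraction


def position_frequency_matrix(sites):
    """Return one {residue: frequency} dict per alignment column.

    Hypothetical helper for illustration; assumes `sites` are
    pre-aligned, gap-free strings of equal length.
    """
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    n = len(sites)
    matrix = []
    # zip(*sites) walks the alignment column by column.
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({res: Fraction(c, n) for res, c in counts.items()})
    return matrix


# The six motif occurrences from the tutorial.
SITES = [
    "YDEEGGDAEE",
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]

pfm = position_frequency_matrix(SITES)
for position, column in enumerate(pfm, start=1):
    cells = ", ".join(f"{res}={freq}" for res, freq in sorted(column.items()))
    print(f"{position:2d}: {cells}")
```

Fractions are reduced automatically, so 3/6 prints as 1/2. Each column sums to 1, which is a quick sanity check on any hand-built frequency table.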
MEME - Output
• Motif score
• Motif length
• Number of times the motif occurs

MEME - Output
Low uncertainty = High information content

MEME - Output
Multilevel consensus

Patterns can be presented as regular expressions
  [AG]-x-V-x(2)-{YW}
  []    - Either residue
  x     - Any residue
  x(2)  - Any residue in the next 2 positions
  {}    - Any residue except these
Examples: AYVACM, GGVGAA

MEME - Output
• Sequence names
• Position in sequence
• Strength of match
• Motif within sequence

MEME - Output
• Sequence names
• Overall strength of motif matches
• Motif location in the input sequence

What can we do with motifs?
• MAST - Search for them in non-annotated sequence databases (protein and DNA).
• TOMTOM - Find the protein that binds the DNA motif.
• GOMO - Find putative target genes (DNA) of motifs and analyze their associated annotation terms.
• PROSITE - Search for them in annotated protein sequence databases.

MAST
http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
• Searches for motifs (one or more) in sequence databases:
  – Like BLAST, but with motifs as input
  – Similar to iterations of PSI-BLAST
• Profile defines strength of match
  – Multiple motif matches per sequence
  – Combined E-value for all motifs
• MEME uses MAST to summarize results:
  – Each MEME result is accompanied by the MAST result of searching for the discovered motifs in the given sequences.

MAST - Input
• Email address
• Input file (motifs)
• Database

MAST - Output
• Input motifs
• Presence of the motifs in a given database

TOMTOM
http://meme.sdsc.edu/meme/doc/tomtom.html
• Searches one or more query DNA motifs against one or more databases of target motifs, and reports for each query a list of target motifs, ranked by p-value.
• The output contains results for each query, in the order in which the queries appear in the input file.
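The pattern notation introduced under the MEME output ([AG]-x-V-x(2)-{YW}) maps directly onto standard regular expressions. Below is a minimal sketch of that translation, assuming only the four element types listed in this tutorial: a literal residue, `x`, `[...]`, and `{...}`, each optionally followed by a fixed repeat count such as `(2)`. The function name `pattern_to_regex` is hypothetical, and the full PROSITE syntax has further elements that this sketch does not handle.

```python
import re

# One pattern element: [..], {..}, a single residue letter, or x,
# optionally followed by a fixed repeat count such as (2).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)\))?")


def pattern_to_regex(pattern):
    """Translate a dash-separated pattern into a Python regex string.

    Hypothetical helper for illustration; covers only the elements
    described in the tutorial.
    """
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            piece = "."                       # any residue (simplified: any character)
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            piece = core                      # [..] or a literal residue
        if repeat:
            piece += "{" + repeat + "}"
        parts.append(piece)
    return "".join(parts)


regex = pattern_to_regex("[AG]-x-V-x(2)-{YW}")
print(regex)
for candidate in ("AYVACM", "GGVGAA", "AYVACW"):
    print(candidate, bool(re.fullmatch(regex, candidate)))
```

The two example sequences from the slide match the pattern, while AYVACW does not because its last residue is one of the excluded residues.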
TOMTOM - Input
• Input motif
• Background frequencies
• Database

DNA IUPAC* code
  A --> adenosine
  C --> cytidine
  G --> guanine
  T --> thymidine
  M --> A C      (amino)
  S --> G C      (strong)
  W --> A T      (weak)
  R --> G A      (purine)
  Y --> T C      (pyrimidine)
  K --> G T      (keto)
  B --> G T C
  D --> G A T
  H --> A C T
  V --> G C A
  N --> A G C T  (any)
Example: YCAY = [TC]CA[TC]
*IUPAC = International Union of Pure and Applied Chemistry

TOMTOM - Output
• Input motif
• Matching motifs

TOMTOM – Output
Wrong input, OK results

JASPAR
• Profiles
  – Transcription factor binding sites
  – Multicellular eukaryotes
  – Derived from published collections of experiments
• Open data access
Each entry shows: name of gene/protein, organism, score, logo

GOMO
• GOMO takes DNA binding motifs to find putative target genes and analyze their associated GO terms. A list of significant GO terms that can be linked to the given motifs will be produced.
• GOMO returns a list of GO terms that are significantly associated with target genes of the motif.
• Gene Ontology provides a controlled vocabulary to describe gene and gene product attributes in any organism.

GOMO - Input
• Email address
• Input file (motifs)
• Database

GOMO - Output
• Input motifs
• GO annotation
  – MF - Molecular function
  – BP - Biological process
  – CC - Cellular component

Prosite
http://www.expasy.org/tools/scanprosite
ProSite is a database of protein domains and motifs that can be searched by either regular expression patterns or sequence profiles.

Prosite - Input
• Input motif (a regular expression)
• Database
• Filters

Prosite - Output
• Input motif
• Protein
• Location in the protein sequence
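The DNA IUPAC code table given for the TOMTOM input can be turned into a small lookup that expands a degenerate motif into a regular expression, reproducing the slide's example YCAY = [TC]CA[TC]. This is a minimal sketch under stated assumptions: the names `IUPAC_DNA` and `iupac_to_regex` are hypothetical, and the test sequence `ATCATTCACG` is made up for illustration rather than taken from the tutorial.

```python
import re

# Degenerate DNA codes as listed in the tutorial's IUPAC table.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "S": "GC", "W": "AT",
    "R": "GA", "Y": "TC", "K": "GT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def iupac_to_regex(motif):
    """Expand an IUPAC DNA motif into a regex string.

    Hypothetical helper for illustration; raises ValueError on
    symbols that are not in the table above.
    """
    parts = []
    for symbol in motif.upper():
        if symbol not in IUPAC_DNA:
            raise ValueError(f"not an IUPAC DNA code: {symbol!r}")
        bases = IUPAC_DNA[symbol]
        # Single bases stay literal; ambiguous codes become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


motif_regex = iupac_to_regex("YCAY")
print(motif_regex)

# Scan a made-up sequence for non-overlapping motif occurrences.
hits = [(m.start(), m.group()) for m in re.finditer(motif_regex, "ATCATTCACG")]
print(hits)
```

Note that `re.finditer` reports non-overlapping matches on one strand only; scanning the reverse complement or finding overlapping sites would need extra steps.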