Tutorial 5
Motif discovery
Multiple sequence alignments and motif discovery
• Motif discovery
– MEME
– MAST
– TOMTOM
– GOMO
– PROSITE
Can we find motifs using multiple sequence alignment?
Motif: a widespread pattern with biological significance
..YDEEGGDAEE..
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
Residue frequency at each motif position (columns 1-10) in the six sequences above:
      1    2    3    4    5    6    7    8    9    10
A     0    0    0    0    0    3/6  1/6  2/6  0    0
D     0    3/6  2/6  0    0    1/6  5/6  1/6  0    1/6
E     0    0    4/6  1    0    0    0    0    1    5/6
G     0    1/6  0    0    1    2/6  0    0    0    0
H     0    1/6  0    0    0    0    0    0    0    0
N     0    1/6  0    0    0    0    0    0    0    0
Y     1    0    0    0    0    0    0    3/6  0    0
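A frequency matrix like this one can be computed directly from the aligned motif occurrences. The following is a minimal sketch, not part of the original slides; the function name `frequency_matrix` is chosen here for illustration, and the six sequences are the ones listed in the motif example.

```python
from collections import Counter
from fractions import Fraction

# The six aligned motif occurrences from the slide.
SITES = [
    "YDEEGGDAEE",
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]

def frequency_matrix(sites):
    """Return one {residue: Fraction} dict per alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    return [
        {res: Fraction(count, n) for res, count in Counter(column).items()}
        for column in zip(*sites)  # zip(*sites) iterates over columns
    ]

matrix = frequency_matrix(SITES)
for position, column in enumerate(matrix, start=1):
    cells = ", ".join(f"{res}={freq}" for res, freq in sorted(column.items()))
    print(f"{position:2d}: {cells}")
```

Residues that never occur in a column are simply absent from that column's dictionary, which corresponds to a 0 in the table. Fractions are printed in lowest terms, so 3/6 appears as 1/2.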
Can we find motifs using multiple sequence alignment (MSA)?
YES!
NO
Using MSA for motif discovery
Works only if the motif occurrences already align well on their own
For most motifs this is not the case!
ClustalW - Input
http://www.ebi.ac.uk/Tools/clustalw2/index.html
Form fields: input sequences, scoring matrix, gap scoring, output format, email address
Muscle
http://www.ebi.ac.uk/Tools/muscle/index.html
Form fields: input sequences, output format, email address
Motif search: from de-novo motifs to motif annotation
(Overview figure; surviving labels: gapped motifs, large DNA data)
http://meme.sdsc.edu/
MEME – Multiple EM* for Motif Elicitation
http://meme.sdsc.edu/
• Motif discovery from unaligned sequences (genomic or protein)
• Flexible model of motif presence (a motif can be absent from some sequences or appear several times in one sequence)
*EM = Expectation-maximization
MEME - Input
Form fields:
• Email address
• Input file (FASTA file)
• How many times does the motif occur in each sequence?
• Range of motif lengths
• How many motifs?
• How many sites?
MEME - Output
• Motif score
• Motif length
• Number of times
• Low uncertainty = high information content
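The link between uncertainty and information content can be made concrete with a small calculation. The sketch below is an illustration rather than MEME's own computation: it assumes a uniform background over the 20 amino acids and applies no small-sample correction, and `column_information` is a name chosen here.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content of one alignment column, in bits.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies (uniform background assumed).
    """
    n = len(column)
    entropy = -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(column).values()
    )
    return math.log2(alphabet_size) - entropy

# Column 1 of the motif example is fully conserved (all Y),
# column 2 is variable (D, D, G, D, N, H).
conserved = column_information("YYYYYY")
variable = column_information("DDGDNH")
print(f"conserved column: {conserved:.2f} bits")
print(f"variable column:  {variable:.2f} bits")
```

A fully conserved column has zero entropy and therefore the maximum information content, while a column with several different residues scores lower.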
MEME - Output
Multilevel Consensus
Patterns can be presented as regular expressions
[AG]-x-V-x(2)-{YW}
[] - Any one of the listed residues
x - Any residue
x(2) - Any residue in each of the next 2 positions
{} - Any residue except those listed
Examples: AYVACM, GGVGAA
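To test sequences against such a pattern, it can be translated into an ordinary regular expression. The following sketch handles only the notation listed here (single residues, `x`, `[]`, `{}`, and repeat counts such as `(2)` or `(2,4)`); other parts of the PROSITE syntax, such as terminus anchors, are deliberately not supported. The function name `prosite_to_regex` is chosen for illustration.

```python
import re

# One pattern element: x, a residue, [..] or {..}, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-{YW}")
print(regex)
for candidate in ["AYVACM", "GGVGAA", "AYVACW"]:
    print(candidate, bool(re.fullmatch(regex, candidate)))
```

Note that `.` in a Python regex matches any character, not only amino acid letters, so the input sequences are assumed to be valid protein sequences.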
MEME - Output
Per-site listing: sequence names, position in sequence, strength of match, motif within sequence
Per-sequence summary: sequence names, overall strength of motif matches, motif location in the input sequence
What can we do with motifs?
• MAST - Search for them in non-annotated sequence databases (protein and DNA).
• TOMTOM - Find the proteins that bind the DNA motifs.
• GOMO - Find putative target genes (DNA) of motifs and analyze their associated annotation terms.
• PROSITE - Search for them in annotated protein sequence databases.
MAST
http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
• Searches for motifs (one or more) in sequence databases:
– Like BLAST, but with motifs as input
– Similar to iterations of PSI-BLAST
• Profile defines strength of match
– Multiple motif matches per sequence
– Combined E-value for all motifs
• MEME uses MAST to summarize results:
– Each MEME result is accompanied by the MAST result for searching the discovered motifs on the given sequences.
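How a profile can define the strength of a match is easiest to see in a toy scan. The sketch below is not MAST's actual scoring code: it builds a log-odds profile from the six motif occurrences of the earlier example, assuming a uniform amino acid background and a pseudocount of 1, and slides it along a made-up test sequence. All names (`log_odds_profile`, `best_match`, the test sequence) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

SITES = [
    "YDEEGGDAEE", "YDEEGGDAEE", "YGEEGADYED",
    "YDEEGADYEE", "YNDEGDDYEE", "YHDEGAADEE",
]

def log_odds_profile(sites, pseudocount=1.0):
    """Per-column log2(observed frequency / background) for every residue."""
    background = 1 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in ALPHABET
        })
    return profile

def best_match(profile, sequence):
    """Return (score, start) of the best-scoring window in the sequence.

    Assumes the sequence is at least as long as the profile and contains
    only the 20 standard residues.
    """
    width = len(profile)
    return max(
        (sum(col[res] for col, res in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    )

profile = log_odds_profile(SITES)
score, start = best_match(profile, "MKTYDEEGGDAEELLV")  # made-up sequence
print(f"best window starts at index {start} (0-based), score {score:.2f}")
```

Windows that resemble the motif columns accumulate positive log-odds, while unrelated windows stay near or below zero. A real search tool additionally converts such scores into p-values and E-values.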
MAST - Input
Form fields: email address, input file (motifs), database
MAST - Output
Presence of the input motifs in a given database
TOMTOM
http://meme.sdsc.edu/meme/doc/tomtom.html
• Searches one or more query DNA motifs against one or more databases of target motifs, and reports for each query a list of target motifs, ranked by p-value.
• The output contains results for each query, in the order that the queries appear in the input file.
TOMTOM - Input
Form fields: input motif, background frequencies, database
DNA IUPAC* code
A --> adenine
C --> cytosine
G --> guanine
T --> thymine
M --> A C (amino)
S --> G C (strong)
W --> A T (weak)
B --> G T C
R --> G A (purine)
Y --> T C (pyrimidine)
K --> G T (keto)
D --> G A T
H --> A C T
V --> G C A
N --> A G C T (any)
Example: YCAY = [TC]CA[TC]
*IUPAC = International Union of Pure and Applied Chemistry
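The YCAY example generalizes to any motif written in IUPAC codes. A minimal sketch of the expansion follows; the function name `iupac_to_regex` is chosen here for illustration, and the bases inside each bracket are listed alphabetically, so YCAY becomes `[CT]CA[CT]`, which is equivalent to `[TC]CA[TC]`.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def iupac_to_regex(motif):
    """Expand an IUPAC DNA motif into a regular expression string."""
    parts = []
    for code in motif.upper():
        if code not in IUPAC:
            raise ValueError(f"not an IUPAC nucleotide code: {code!r}")
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)

regex = iupac_to_regex("YCAY")
print(regex)
for site in ["TCAC", "CCAT", "ACAT"]:
    print(site, bool(re.fullmatch(regex, site)))
```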
TOMTOM - Output
Input motif and matching motifs
TOMTOM - Output
Wrong input, OK results
JASPAR
• Profiles
– Transcription factor binding sites
– Multicellular eukaryotes
– Derived from published collections of experiments
• Open data access
Result columns: name of gene/protein, organism, score, logo
GOMO
• GOMO takes DNA binding motifs, finds their putative target genes, and analyzes the GO terms associated with those genes.
• GOMO returns a list of GO terms that are significantly associated with the target genes of the motif.
• Gene Ontology (GO) provides a controlled vocabulary to describe gene and gene product attributes in any organism.
GOMO - Input
Form fields: email address, input file (motifs), database
GOMO - Output
Input motifs
GO annotation:
MF - Molecular function
BP - Biological process
CC - Cellular component
Prosite
http://www.expasy.org/tools/scanprosite
PROSITE is a database of protein domains and motifs that can be searched by either regular expression patterns or sequence profiles.
Prosite - Input
Form fields: input motif (a regular expression), database, filters
Prosite - Output
Protein, and location of the match in the protein sequence