Motif Search
What are Motifs
• Motif (dictionary) A recurrent thematic
element, a common theme
Find a common motif in the text
Find a short common motif in the text
Motifs in biological sequences
Sequence motifs are short common
sequences (length 4-20) that are highly
represented in the data
Challenges in biological sequences
Motifs are usually not exact words
How to represent non-exact motifs?
• Consensus string NTAHAWT
May allow “degenerate” symbols in string,
e.g., N = A/C/G/T; W = A/T; H=not G; S =
C/G; R = A/G; Y = T/C etc.
• Position Weight Matrix (PWM)
Probability for each base in each position:
     1    2    3    4    5    6
A   0.1  0.7  0.2  0.6  0.5  0.1
T   0.7  0.1  0.5  0.2  0.2  0.8
G   0.1  0.1  0.1  0.1  0.1  0.0
C   0.1  0.1  0.2  0.1  0.1  0.1
Motifs in biological sequences
What can we learn from these motifs?
– Regulatory motifs in DNA (transcription factor
binding sites)
– Functional sites in proteins (e.g., phosphorylation
sites)
DNA Regulatory Motifs
• Transcription Factors (TFs) are regulatory
proteins that bind to regulatory motifs near
the gene and act as a switch button (on/off)
[Figure: TF1 and TF2 bound to the TF1 motif and the TF2 motif near the transcription start site of Gene X]
– TF binding motifs are usually 6 – 20
nucleotides long
– Located near the target gene, mostly upstream of the
transcription start site
Can we find TF targets using a
bioinformatics approach?
p53 is a transcription factor
involved in most human cancers
We are interested in identifying the genes regulated by p53
Finding TF targets using a
bioinformatics approach?
Scenario 1 : Binding motif is known (easier case)
Scenario 2 : Binding motif is unknown (hard case)
Scenario 1 :
Binding motif is known
• Given a motif (e.g., consensus string, or
weight matrix), find the binding sites in an
input sequence
Given a consensus :
For each position l in the input sequence, check if
substring starting at position l matches the motif.
Example: find the consensus motif NTAHAWT in
the promoter of a gene
>promoter of gene A
ACGCGTATATTACGGGTACACCCTCCCAATT
ACTACTATAAATTCATACGGACTCAGACCTT
AAAA…….
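The consensus scan described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `find_consensus_sites` is hypothetical, the lookup table is the standard IUPAC nucleotide code, and only the first line of the example promoter is used.

```python
# Standard IUPAC degenerate nucleotide codes: symbol -> allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_consensus_sites(sequence, consensus):
    """Return 0-based start positions where the consensus matches (hypothetical helper)."""
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Every base in the window must be allowed by the symbol at that position.
        if all(base in IUPAC[symbol] for base, symbol in zip(window, consensus)):
            hits.append(start)
    return hits

# First line of the example promoter of gene A.
promoter = "ACGCGTATATTACGGGTACACCCTCCCAATT"
print(find_consensus_sites(promoter, "NTAHAWT"))
```

The scan only checks the given strand; searching the reverse complement would be a separate pass.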
Given a Position Weight Matrix
(PWM):
Starting from a set of aligned motifs
Seq 1  AAAAGCCC
Seq 2  CTAATCCA
Seq 3  CTAATCCC
Seq 4  CTAATCCC
Seq 5  GTAATCCC
Seq 6  CTAATCCC
Seq 7  CTAATCCC
Seq 8  CTAATCCC
Seq 9  TTAATCTG
Given a Position Weight Matrix
(PWM):
Counts of each base in each column:
     1    2    3    4    5    6    7    8
A    1    1    9    9    0    0    0    1
C    6    0    0    0    0    9    8    7
G    1    0    0    0    1    0    0    1
T    1    8    0    0    8    0    1    0
Probability of each base in each column:
     1    2    3    4    5    6    7    8
A   .11  .11   1    1    0    0    0   .11
C   .67   0    0    0    0    1   .89  .78
G   .11   0    0    0   .11   0    0   .11
T   .11  .89   0    0   .89   0   .11   0
W_{\beta k} = probability of base \beta in column k
• Given a string s of length l = 8
• s = s_1 s_2 … s_l
• Pr(s | W) = \prod_{k=1}^{l} W_{s_k k}
• Example:
Pr(CTAATCCG) = 0.67 x 0.89 x 1 x 1 x 0.89 x 1 x 0.89 x 0.11
Given a Position Weight Matrix
(PWM)
• Given sequence S (e.g., 1000 base-pairs long)
• For each substring s of S,
– Compute Pr(s|W)
– If Pr(s|W) > some threshold, call that a binding site
• In DNA sequences we need to search both strands
AGTTACACCA
TGGTGTAACT (reverse complement)
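The scan over both strands could look like the sketch below. The probabilities are the rounded column values of the example PWM; the function names and the threshold passed by the caller are illustrative assumptions, not values from the lecture.

```python
# Column probabilities of the example PWM (rounded to two decimals).
PWM = {
    "A": [0.11, 0.11, 1.0, 1.0, 0.0, 0.0, 0.0, 0.11],
    "C": [0.67, 0.0, 0.0, 0.0, 0.0, 1.0, 0.89, 0.78],
    "G": [0.11, 0.0, 0.0, 0.0, 0.11, 0.0, 0.0, 0.11],
    "T": [0.11, 0.89, 0.0, 0.0, 0.89, 0.0, 0.11, 0.0],
}
WIDTH = 8
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def probability(site):
    """Pr(site | W) as a product of per-column probabilities."""
    p = 1.0
    for k, base in enumerate(site):
        p *= PWM[base][k]
    return p

def scan_both_strands(seq, threshold):
    """Return (start, strand, probability) for every window above the threshold."""
    hits = []
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(text) - WIDTH + 1):
            p = probability(text[i:i + WIDTH])
            if p > threshold:
                # Report minus-strand hits in forward-strand coordinates.
                start = i if strand == "+" else len(seq) - (i + WIDTH)
                hits.append((start, strand, p))
    return hits

print(reverse_complement("AGTTACACCA"))
print(scan_both_strands("GGCTAATCCCGG", threshold=0.1))
```

In practice the threshold is usually chosen from a background model or as a log-odds score rather than a raw probability; the raw product is kept here to match the description above.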
Scenario 2 :
Binding motif is unknown
“Ab initio motif finding”
Ab initio motif finding:
Expectation Maximization
• Local search algorithm
– Start from a random PWM
– Move from one PWM to another so as to
improve the score that measures how well the motif fits
the sequence
– Keep doing this until no more improvement is
obtained: convergence to a local optimum
Expectation Maximization
• Let W be a PWM .
Let S be the input sequence .
• Imagine a process that randomly searches,
picks different strings matching W and
threads them together to a new PWM
Expectation Maximization
• Find W so as to maximize Pr(S|W)
• The “Expectation-Maximization” (EM)
algorithm iteratively finds a new motif W
that improves Pr(S|W)
Expectation Maximization
1. Start from a random motif (PWM)
2. Scan the sequence for good matches to the
current motif.
3. Build a new PWM out of these matches, and
make it the new motif
4. Repeat steps 2 and 3 until the motif no longer improves
The final PWM represents the motif that is
most enriched in the data
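The loop above can be sketched in a few lines of Python. This is a simplified "hard-assignment" variant, assumed here for brevity: it keeps only the single best match per sequence, whereas full EM weights every window by its probability. All function names, the pseudocount, and the iteration limit are illustrative assumptions, and the result depends on the random start (local optimum).

```python
import random

BASES = "ACGT"

def random_pwm(width, rng):
    """One random probability column per motif position."""
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.1 for _ in BASES]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(BASES, weights)})
    return pwm

def site_probability(site, pwm):
    p = 1.0
    for base, column in zip(site, pwm):
        p *= column[base]
    return p

def best_site(seq, pwm):
    """Best-scoring window of one sequence (assumes len(seq) >= motif width)."""
    width = len(pwm)
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    return max(windows, key=lambda w: site_probability(w, pwm))

def build_pwm(sites, pseudocount=0.5):
    """Rebuild the PWM from the chosen sites; pseudocounts avoid zero probabilities."""
    pwm = []
    for k in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[k]] += 1
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    return pwm

def find_motif(seqs, width, max_iter=50, seed=0):
    rng = random.Random(seed)
    pwm = random_pwm(width, rng)      # step 1: random start
    sites = None
    for _ in range(max_iter):
        new_sites = [best_site(s, pwm) for s in seqs]   # step 2: scan
        if new_sites == sites:        # no change: converged
            break
        sites = new_sites
        pwm = build_pwm(sites)        # step 3: new PWM from the matches
    return pwm, sites

pwm, sites = find_motif(["ACGTTGACAGG", "TTTGACATCC", "GGTGACAAAC", "CATGACATTT"], width=5)
print(sites)
```

Running the search from several seeds and keeping the best-scoring motif is a common way to reduce the dependence on the starting point.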
The PWM can be also represented as
a sequence logo
-A letter’s height indicates the information it contains
-The top letter at each position can be read to obtain the
consensus sequence (motif)
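For a DNA logo, the information content of a column is commonly computed as 2 bits minus the column's entropy, and each letter's height is its probability times that value. The sketch below assumes this common convention without any small-sample correction; the function names are hypothetical.

```python
import math

def column_information(column):
    """Information content in bits of one DNA column: 2 - Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

def letter_heights(column):
    """Height of each letter in the logo: probability times column information."""
    info = column_information(column)
    return {base: p * info for base, p in column.items()}

print(letter_heights({"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}))
```

A fully conserved column reaches the maximum of 2 bits, while a column with all four bases equally likely carries no information and appears flat in the logo.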
Are common motifs the right thing to search for?
Solutions:
- Search for motifs that are enriched
in one set but not in a random set
- Use experimental information to rank
the sequences according to their binding
affinity and search for enriched motifs at
the top of the list
Searching for enriched motifs in a ranked list
[Figure: sequences 1-4 ranked by binding affinity]
Hyper Geometric (HG)
Distribution test
k= number of motifs in the top of the list
m= number of sequences in the
top of the list
n= number of total motifs found
N= total number of sequences
The p-value reflects the surprise of seeing the observed density of motif occurrences at
the top of the list compared to the rest of the list.
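The formula itself is not preserved in this transcript, so the sketch below assumes the standard hypergeometric tail: the probability of seeing at least k motif occurrences among the top m sequences when n of the N sequences contain the motif. The function name is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def hypergeometric_tail(k, m, n, N):
    """P(X >= k) for X ~ Hypergeometric(N total, n with motif, m drawn as the top)."""
    total = comb(N, m)
    upper = min(m, n)
    # comb() returns 0 for impossible terms, so no extra bounds checks are needed.
    favourable = sum(comb(n, i) * comb(N - n, m - i) for i in range(k, upper + 1))
    return favourable / total

# Example: 10 sequences, 3 contain the motif, and all 3 are in the top 3.
print(hypergeometric_tail(k=3, m=3, n=3, N=10))
```

A small value means the motif is concentrated at the top of the ranked list more than expected by chance.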
Searching for enriched motifs in ranked list
Choosing the best way to cut the list (minimal HG score)
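Choosing the cut can be sketched as trying every cutoff m and keeping the one with the smallest HG score. The code assumes the standard hypergeometric tail and a hypothetical input format (one 0/1 flag per sequence, ordered from highest to lowest affinity). Note that the minimum over many cutoffs is not itself a p-value and needs a correction for multiple testing, which is omitted here.

```python
from math import comb

def hg_tail(k, m, n, N):
    """P(X >= k) under the hypergeometric distribution."""
    upper = min(m, n)
    return sum(comb(n, i) * comb(N - n, m - i) for i in range(k, upper + 1)) / comb(N, m)

def minimal_hg(has_motif):
    """Return (best score, best cutoff m) over all prefixes of the ranked list."""
    N = len(has_motif)
    n = sum(has_motif)
    best_score, best_cut = 1.0, 0
    k = 0
    for m in range(1, N + 1):
        k += has_motif[m - 1]          # motif occurrences among the top m
        score = hg_tail(k, m, n, N)
        if score < best_score:
            best_score, best_cut = score, m
    return best_score, best_cut

# Hypothetical ranked list: the three motif-containing sequences rank first.
print(minimal_hg([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]))
```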
Finding the p53 binding motif in a set of p53
target sequences which are ranked according
to binding affinity
>affinity = 5.962
ACAAAAGCGUGAACACUUCCACAUGAAAUUCGUUUUUUGUCCUUUUUUUUCUCUUCUUUUUCUCUCCUGUUUCU
>affinity = 5.937
AAUAAAAAUAGAUAUAAUAGAUGGCACCGCUCUUCACGCCCGAAAGUUGGACAUUUUAAAUUUUAAUUCUCAUGA
>affinity = 5.763
UCACACUUGAAUGUGCUGCACUUUACUAGAAGUUUCUUUUUCUUUUUUUAAAAAUAAAAAAAGAGGAGAAAAAUGC
>affinity = 5.498
GCUGGUGCAAGUUUCCGGUAAAAAUAAUGAUGUUCUAGUCAUUCAUAUAUACGAUACAAAAAUAACA
...
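A ranked input of this shape could be read with a small parser like the one below. It is only a sketch: it assumes each header has the form `>affinity = <number>` followed by a single sequence line, and the function name and the short sequences in the example are made up for illustration.

```python
import re

HEADER = re.compile(r">\s*affinity\s*=\s*([0-9.]+)")

def parse_ranked_sequences(text):
    """Return (affinity, sequence) pairs sorted from highest to lowest affinity."""
    records = []
    affinity = None
    for line in text.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            affinity = float(match.group(1))
        elif line and affinity is not None:
            records.append((affinity, line))
            affinity = None            # one sequence line per header is assumed
    records.sort(key=lambda record: record[0], reverse=True)
    return records

example = """>affinity = 5.1
ACGUACGU
> affinity = 5.9
UUUUACGA
"""
print(parse_ranked_sequences(example))
```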
http://drimust.technion.ac.il/
Protein Motifs
Protein motifs are usually 6-20 amino acids long and
can be represented as a consensus/profile:
P[ED]XK[RW][RK]X[ED]
or as PWM
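A consensus of this form maps almost directly onto a regular expression, since the bracketed sets are already valid character classes. The sketch below assumes that X means "any amino acid" and that X never appears inside brackets; the function name and the test peptide are hypothetical.

```python
import re

def consensus_to_regex(consensus):
    """Translate a protein consensus into a compiled regular expression."""
    # X stands for any residue; bracketed alternatives are kept as-is.
    return re.compile(consensus.replace("X", "."))

pattern = consensus_to_regex("P[ED]XK[RW][RK]X[ED]")
match = pattern.search("MAPEAKRRGDLL")   # made-up peptide for illustration
print(match.start() if match else None)
```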
Protein Domains
• In addition to short protein motifs, proteins are
characterized by domains.
• Domains are long motifs (30-100 aa) and are
considered as the building blocks of proteins
(evolutionary modules).
The zinc-finger domain
Some domains can be found in many proteins
with different functions:
….while other domains are only
found in proteins with a certain
function…..
MBD= Methylated DNA Binding Domain
Varieties of protein domains
Extending along the length of a protein
Occupying a subset of a protein sequence
Occurring one or more times
Pfam
> Database that contains a large collection of
multiple sequence alignments of protein domains
Based on
profile hidden Markov models (HMMs).
Compared to a PWM, a profile HMM also models
the transitions between adjacent columns (match,
insert, and delete states), so it can represent
insertions and deletions and is thus more powerful.
http://pfam.sanger.ac.uk/
Profile HMM (Hidden Markov Model)
can accurately represent a MSA
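[Figure: profile HMM for alignment columns 16-19, with match states M16-M19, insert states I16-I19 (emitting any residue X), and delete states D16-D19. From M16, 50% of the transitions go to M17 and 50% to D17; the remaining labeled transitions are 100%.]
Match-state emission probabilities:
M16: D 0.8, S 0.2
M17: P 0.4, R 0.6
M18: T 1.0
M19: R 0.4, S 0.6
Multiple sequence alignment (columns 16-19):
DRTR
DRTS
S--S
SPTR
DRTR
DPTS
D--S
D--S
D--S
D--R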
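The match-state emission probabilities and the 50% delete transition can be recovered from the alignment by simple counting, as in the sketch below. It uses the ten aligned rows with spacing removed and a hypothetical function name. Real profile HMM builders also use pseudocounts, sequence weighting, and insert states, none of which is modeled here.

```python
from collections import Counter

# The alignment of columns 16-19, one row per sequence ("-" marks a gap).
ALIGNMENT = [
    "DRTR", "DRTS", "S--S", "SPTR", "DRTR",
    "DPTS", "D--S", "D--S", "D--S", "D--R",
]

def column_profile(alignment, col):
    """Return (emission frequencies of non-gap residues, fraction of gaps) for a column."""
    residues = [row[col] for row in alignment if row[col] != "-"]
    counts = Counter(residues)
    emissions = {aa: c / len(residues) for aa, c in counts.items()}
    gap_fraction = 1 - len(residues) / len(alignment)
    return emissions, gap_fraction

for col in range(4):
    print(16 + col, column_profile(ALIGNMENT, col))
```

The gap fraction of the second column corresponds to the share of sequences that pass through the delete state instead of the match state.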