Motif Finding Workshop
Project
Chaim Linhart
January 2008
MF workshop 08 © Ron Shamir
Outline
1. Some background again…
2. The project
1. Background
Slides with Ron Shamir and Adi Akavia
Gene: from DNA to protein
DNA → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein
DNA
• DNA: a “string” over the alphabet of 4 bases
(nucleotides): { A, C, G, T }
• Resides in chromosomes
• Complementary strands: A-T ; C-G
Forward/sense strand (5’ to 3’):
AACTTGCG
Complementary/anti-sense strand, aligned under it (3’ to 5’):
TTGAACGC
Read 5’ to 3’, the anti-sense strand is the reverse complement: CGCAAGTT
• Directional: from 5’ to 3’:
5’ end (upstream) AACTTGCGATACTCCTA 3’ end (downstream)
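The strand relationship can be checked with a minimal Python sketch. The helper names `complement` and `reverse_complement` are illustrative and do not come from the slides; "N" is assumed to map to itself.

```python
# Hypothetical helpers (not from the slides): complement each base, then reverse.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def complement(seq: str) -> str:
    """Base-by-base complement, in the same left-to-right order."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """The opposite strand, read in its own 5' to 3' direction."""
    return complement(seq)[::-1]

forward = "AACTTGCG"
print(complement(forward))          # TTGAACGC
print(reverse_complement(forward))  # CGCAAGTT
```

A two-strand motif search can scan each sequence once as given and once after `reverse_complement`.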
Gene structure (eukaryotes)
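[Figure: the coding strand of the DNA, with the promoter upstream of the transcription start site (TSS), followed by exons and introns. Transcription (RNA polymerase) produces the pre-mRNA; splicing (spliceosome) removes the introns to give the mature mRNA, which consists of the 5’ UTR, the coding region between the start codon and the stop codon, and the 3’ UTR; translation (ribosome) of the coding region produces the protein.]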
Translation
• Codon - a triplet of bases, codes a specific amino
acid (except the stop codons); many-to-1 relation
• Stop codons - signal termination of the protein
synthesis process
Figure source: http://ntri.tamuk.edu/cell/ribosomes.html
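The many-to-1 codon mapping can be illustrated with a short Python sketch. It assumes the standard genetic code, written over the DNA alphabet (T instead of U) with bases ordered T, C, A, G. The names `CODON_TABLE` and `translate` are illustrative, and codons containing "N" are rendered as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # first base T
         "LLLLPPPPHHQQRRRR"   # first base C
         "IIIMTTTTNNKKSSRR"   # first base A
         "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(coding_seq: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(coding_seq) - 2, 3):
        amino = CODON_TABLE.get(coding_seq[i:i + 3].upper(), "X")  # "X": unknown codon
        if amino == "*":  # stop codon: terminate synthesis
            break
        protein.append(amino)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads ATG (M), GCC (A) and then the stop codon TAA.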
Genome sequences
• Many genomes have been sequenced,
including those of viruses, microbes, plants
and animals.
• Human:
– 23 pairs of chromosomes
– 3+ Gbps (bps = base pairs), only ~3% are genes
– ~25,000 genes
• Yeast:
– 16 chromosomes
– ~12 Mbps
– 6,500 genes
Regulation of Expression
• Each cell contains an identical copy of the
whole genome - but utilizes only a subset
of the genes to perform diverse, unique
tasks
• Most genes are highly regulated –
their expression is limited to specific
tissues, developmental stages,
physiological conditions
• Main regulatory mechanism –
transcriptional regulation
Transcriptional regulation
• Transcription is regulated primarily by transcription
factors (TFs) – proteins that bind to DNA
subsequences, called binding sites (BSs)
• TFBSs are located mainly (not always!) in the gene’s
promoter – the DNA sequence upstream of the gene’s
transcription start site (TSS)
• BSs of a particular TF share a common pattern, or
motif
• Some TFs operate together – TF modules
[Figure: a promoter drawn 5’ to 3’, with two TFs bound to their BSs upstream of the TSS, followed by the gene.]
TFBS motif models
• Consensus (“degenerate”) string: [A/C]ACT[C/G]T
Example binding sites that match it:
AACTGT
CACTGT
CACTCT
CACTGT
AACTGT
[Figure: promoters of gene 1 through gene 10]
• Statistical models…
• Motif logo representation
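A consensus string can be derived from aligned binding sites column by column. The Python sketch below uses the five example sites from this slide. The standard IUPAC ambiguity codes are an assumption here (the slide stacks the alternative letters instead), and the function names are illustrative.

```python
# IUPAC nucleotide ambiguity codes, keyed by the sorted set of allowed bases.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
         "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N"}
IUPAC_SETS = {code: set(bases) for bases, code in IUPAC.items()}

def consensus(sites):
    """Degenerate consensus of equal-length sites: one IUPAC code per column."""
    return "".join(IUPAC["".join(sorted(set(column)))] for column in zip(*sites))

def matches(consensus_string, site):
    """True if every base of the site is allowed at its position."""
    return len(consensus_string) == len(site) and all(
        base in IUPAC_SETS[code] for code, base in zip(consensus_string, site))

sites = ["AACTGT", "CACTGT", "CACTCT", "CACTGT", "AACTGT"]
print(consensus(sites))  # MACTST, i.e. [A/C]ACT[C/G]T
```

A consensus string discards how often each base occurs in a column, which is the information the statistical models and the logo representation retain.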
Human G2+M cell-cycle genes:
The CHR – NF-Y module
CDCA3 (trigger of mitotic entry 1)
CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8)
TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog)
CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGG
GCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4)
GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1)
AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2)
TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGC
CGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
TFs: NF-Y, CHR
BSs are short, non-specific, and hidden on both strands and at various locations along the promoters
The computational challenge
• Given a set of co-regulated genes
(e.g., from gene expression chips)
• Find a motif that is over-represented
(occurs unusually often) in their
promoters
• This may be the TF binding site motif
• Find TF modules – over-represented
motifs that tend to co-occur
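As a deliberately naive illustration of over-representation, the Python sketch below counts in how many promoters each k-mer occurs and compares the target set with a background set. The ratio with a pseudocount of 1 is an assumption made for this example and is not a score proposed in the slides. All names are illustrative, and only the given strand is scanned.

```python
from collections import Counter

def kmers_present(seq, k):
    """Set of k-mers occurring in seq, skipping windows with an unknown base N."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return {w for w in windows if "N" not in w}

def sequence_counts(seqs, k):
    """For each k-mer, the number of sequences that contain it at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update(kmers_present(seq, k))
    return counts

def enrichment(target, background, k):
    """Naive ratio of target frequency to background frequency (pseudocount 1)."""
    t, b = sequence_counts(target, k), sequence_counts(background, k)
    return {m: (t[m] / len(target)) / ((b[m] + 1) / (len(background) + 1)) for m in t}

target = ["AACTGTAA", "CCAACTGT", "AACTGTGG"]
background = ["GGGGGGGG", "AACTGTCC", "TTTTTTTT", "CCCCCCCC"]
scores = enrichment(target, background, 6)
print(max(scores, key=scores.get))  # AACTGT
```

A real tool would replace the ratio with a statistical score such as a p-value.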
The computational challenge (II)
• Motifs can also be found without a given
target-set – “genome-wide”
• Find a motif that is localized - occurs
more often near the TSS of genes
• Find a motif with a strand bias –
occurs more often on the genes’
coding strand
• Find TF modules with biases in their
order / orientation / distance
Motif finding algorithms
• More than 100 motif-finding algorithms exist
• Main differences between them:
– Type of analysis & input:
• Target-set vs. genome-wide
• Single vs. multi-species (conservation)
• Single motifs vs. modules
– Motif model
– Score for evaluating motif
– Motif search technique:
• Combinatorial (enumeration) vs.
Statistical optimization
Example - Amadeus
Over-represented motifs in the promoters of genes expressed in
the G2 and G2/M phases of the human cell cycle:
[Motif logos: CHR, NF-Y]
2. The project
General goals
• Develop software from A-Z:
– Design
– Implementation
– (Optimization)
– Execution & analysis of real data
• A taste of bioinformatics
• Have fun
• Get credit…
The computational task
• Given a set of DNA sequences
• Find “interesting” pairs of motifs:
– Order bias
– Other scores…
• Main challenges:
– Performance (time, memory)
– Output redundancy
Input
File with DNA sequences in “fasta” format:
>sequence-name1 <space> [header1]
ACCCGNNNNTCGGAAATGANN
CGGAGTAAAATATGCGAGCGT
>sequence-name2 <space> [header2]
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 <space> [header3]
agtttagactgctagctcgatcgcta
gcggatnggctannnnnatctag
Input (II)
• Ignore the header lines
• Sequence may span multiple lines or
one long line
• Sequence contains the characters
A,C,G,T,N in upper or lower case
• “N” means unknown or masked base
• Sample input files will be supplied
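The input rules above can be sketched as a small FASTA reader in Python. The function name `parse_fasta` is illustrative. Rejecting characters other than A, C, G, T, N and silently skipping any text before the first header are assumptions, since the slides do not say how malformed input should be handled.

```python
ALLOWED = set("ACGTN")

def parse_fasta(lines):
    """Return the sequences of a FASTA file as upper-case strings, in file order.

    Header lines only mark where a new sequence starts; their text is ignored.
    A sequence may span several lines, which are concatenated.
    """
    sequences = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            sequences.append([])  # start a new sequence
        elif sequences:           # text before the first header is skipped
            chunk = line.upper()
            unexpected = set(chunk) - ALLOWED
            if unexpected:
                raise ValueError(f"unexpected characters: {sorted(unexpected)}")
            sequences[-1].append(chunk)
    return ["".join(parts) for parts in sequences]
```

The function accepts any iterable of lines, so an open file handle can be passed directly, e.g. `with open(path) as handle: seqs = parse_fasta(handle)`.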
Input (III)
• Search parameters:
– Length of motifs (between 5 and 10)
– Min. + max. distance between the motifs:
ACGGATTGATNNNTGGATGCCAT (distance=9)
– Single vs. two strands search
– Min. number of occurrences (hits) of pair (don’t count overlaps, e.g. AAAAAA):
GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA (3 hits)
– Max. p-value
– Additional parameters…
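The following Python sketch counts ordered hits of a pair of strings under several assumptions that the slides leave open. Distance is taken to be the number of bases between the end of the first string and the start of the second. Overlapping occurrences of the same string are skipped by scanning left to right. Only the given strand is searched. All function names are illustrative.

```python
def occurrences(seq, motif):
    """Start positions of non-overlapping occurrences, scanning left to right."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + len(motif))  # jump past this occurrence
    return positions

def count_ordered_hits(seq, first, second, min_dist, max_dist):
    """Number of (first, second) occurrence pairs with first upstream of second
    and min_dist <= gap <= max_dist, where gap = bases between the two strings."""
    seq = seq.upper()
    second_starts = occurrences(seq, second)
    hits = 0
    for position in occurrences(seq, first):
        end = position + len(first)
        hits += sum(1 for q in second_starts if min_dist <= q - end <= max_dist)
    return hits

seq = "GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA"
print(count_ordered_hits(seq, "GGATT", "ATGCC", 0, 10))  # 2
print(count_ordered_hits(seq, "ATGCC", "GGATT", 0, 10))  # 1
```

Because the search strings contain no "N", windows covering an unknown base never match. This quadratic pairing is only meant to pin down the definitions; meeting the performance goal for all pairs of strings would need an indexed approach.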
Output
A. A list of the string pairs with the best order-bias score (smallest p-values):
Motif A   Motif B   A→B   B→A   p-value
ACGTT     GGATT      97    17   4.3E-15
ACGTT     GATTC      87    16   2.7E-13
TTAAC     CAGCC      31   114   1.2E-12
B. A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores
Part A: String pairs with
order bias
• nA = # of A→B ; nB = # of B→A
• WLOG, nA ≥ nB
• n = nA + nB
• H0 = random order: nA ~ B(n, 0.5)
• p-value = prob. of at least nA occurrences of A→B = tail of B(n, 0.5)
• Normal approximation (central limit thm.)
• Fix for multiple testing: ×2
$$\text{Binomial tail}(n, p, k) = \sum_{j=k}^{n} \binom{n}{j} p^{j} (1-p)^{n-j}$$
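The order-bias score can be sketched in Python directly from the definitions above. The function names are illustrative. Capping the doubled p-value at 1 and using a continuity correction in the normal approximation are assumptions not stated in the slides.

```python
from math import comb, erfc, sqrt

def binomial_tail(n, p, k):
    """P(X >= k) for X ~ B(n, p), summed exactly term by term."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def order_bias_pvalue(n_ab, n_ba):
    """Tail of B(n, 0.5) at the larger of the two counts, doubled (the x2 fix)."""
    n = n_ab + n_ba
    k = max(n_ab, n_ba)  # WLOG the more frequent order
    return min(1.0, 2 * binomial_tail(n, 0.5, k))

def normal_tail(n, p, k):
    """Normal approximation to P(X >= k), with a continuity correction."""
    z = (k - 0.5 - n * p) / sqrt(n * p * (1 - p))
    return 0.5 * erfc(z / sqrt(2))
```

The exact sum costs O(n) per pair and its floating-point terms underflow for very large n, which is where the normal approximation, or a log-space sum, becomes useful.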
Part B: Non-redundant list
of motif pairs
• Merge similar strings into a motif with a better score (motif = consensus):
String pair (p-value)
ACGTT , GGATT (4.3E-15)
ACGAT , GGATT (2.4E-11)
AGGAT , GGTTT (1.7E-5)
AGGTT , GGTTT (5.9E-5)
Motif pair (p-value)
[logo A] , [logo B] (8.1E-31)
• Don’t report similar motif pairs:
– Motifs that consist of similar strings
– Motif pairs that are small shifts of one another
– Palindromes
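One possible way to collect similar string pairs is a greedy pass over the pairs in order of increasing p-value. The Python sketch below is an assumption, not the method prescribed by the slides: similarity is measured by Hamming distance to the best-scoring (seed) pair of each cluster, and the threshold `max_mismatches` is a hypothetical parameter. It does not compute the merged motif's p-value, and it does not handle shifts or palindromes.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def merge_similar_pairs(scored_pairs, max_mismatches=1):
    """Greedily cluster (string_a, string_b, p_value) triples.

    Pairs are visited from best to worst p-value; a pair joins the first
    cluster whose seed is within max_mismatches on both strings, otherwise
    it seeds a new cluster.
    """
    clusters = []
    for a, b, p in sorted(scored_pairs, key=lambda triple: triple[2]):
        for cluster in clusters:
            seed_a, seed_b, _ = cluster[0]
            if hamming(a, seed_a) <= max_mismatches and hamming(b, seed_b) <= max_mismatches:
                cluster.append((a, b, p))
                break
        else:
            clusters.append([(a, b, p)])
    return clusters

pairs = [("ACGTT", "GGATT", 4.3e-15), ("ACGAT", "GGATT", 2.4e-11),
         ("AGGAT", "GGTTT", 1.7e-5), ("AGGTT", "GGTTT", 5.9e-5)]
print(len(merge_similar_pairs(pairs, max_mismatches=2)))  # 1
```

With `max_mismatches=2` the four example pairs fall into a single cluster, as on the slide; with the stricter threshold of 1, the pair AGGAT , GGTTT is left in a cluster of its own.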
Part B (cont.): Additional score
Option I: Co-occurrence rate
N = total # of sequences
sA = # of sequences that contain motif A
sB = # of sequences that contain motif B
sAB = # of sequences that contain motifs A and B
H0 = motifs occur independently and randomly
p-value = prob. of at least sAB joint occurrences, given the number of hits of each single motif
= tail of hypergeometric distribution
$$\text{HG tail}(N, s_A, s_B, s_{AB}) = \sum_{i=s_{AB}}^{\min(s_A, s_B)} \frac{\binom{s_B}{i} \binom{N - s_B}{s_A - i}}{\binom{N}{s_A}}$$
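The co-occurrence score can be computed exactly with integer binomial coefficients. This Python sketch follows the hypergeometric tail as defined above; the function name `hypergeom_tail` is illustrative.

```python
from math import comb

def hypergeom_tail(N, s_a, s_b, s_ab):
    """P(at least s_ab of the s_a sequences containing A also contain B),
    when s_b of the N sequences contain B and the motifs occur independently."""
    upper = min(s_a, s_b)
    favourable = sum(comb(s_b, i) * comb(N - s_b, s_a - i)
                     for i in range(s_ab, upper + 1))
    return favourable / comb(N, s_a)
```

For example, with N = 10, sA = 3 and sB = 4, the probability that all three sequences containing A also contain B is `hypergeom_tail(10, 3, 4, 3)`, which equals 4/120.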
Part B (cont.): Additional score
Option II: Distance bias
Is the distance between the two motifs uniform (H0),
or are there specific distances that are very common?
Option III: Gap variability
Are the sequences between the motifs conserved (H0),
or are they highly variable?
Other options??
Implementation
• Java (Eclipse) ; Linux
• GUI: Simple graphical user interface for
supplying the input parameters and
reporting the results
• Packages for motif logo and statistical
scores will be supplied
• Time performance will be measured only for
part A
• Reasonable documentation
• Separate packages for data-structures,
scores, GUI, I/O, etc.
Design document
• Due in 3 weeks (Feb 24)
• 3-5 pages (Word), Hebrew/English
• Briefly describe main goal, input and
output of program
• Describe main data structures,
algorithms, and scores for parts A+B
• Meet with me before submission
Fin