* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Slides: background and project plan
Extrachromosomal DNA wikipedia , lookup
Human genome wikipedia , lookup
Epigenomics wikipedia , lookup
Transposable element wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Zinc finger nuclease wikipedia , lookup
Genetic engineering wikipedia , lookup
Gene desert wikipedia , lookup
Genome (book) wikipedia , lookup
Metagenomics wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Gene expression programming wikipedia , lookup
Gene nomenclature wikipedia , lookup
Non-coding DNA wikipedia , lookup
Genome evolution wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Molecular Inversion Probe wikipedia , lookup
Epigenetics of human development wikipedia , lookup
History of genetic engineering wikipedia , lookup
Gene expression profiling wikipedia , lookup
Primary transcript wikipedia , lookup
Point mutation wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome editing wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Microevolution wikipedia , lookup
Designer baby wikipedia , lookup
Helitron (biology) wikipedia , lookup
Finding sequence motifs in PBM data Workshop Project Yaron Orenstein October 2010 MF workshop 10 © Yaron Orenstein 1 Outline 1. Some background again… 2. The project MF workshop 10 © Yaron Orenstein 2 1. Background Slides with Ron Shamir and Chaim Linhart MF workshop 10 © Yaron Orenstein 3 Gene: from DNA to protein DNA PremRNA transcription MF workshop 10 © Yaron Orenstein Mature mRNA splicing protein translation 4 DNA • DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T } • Resides in chromosomes • Complementary strands: A-T ; C-G Forward/sense strand: AACTTGCG Reverse-complement/anti-sense strand: TTGAACGC • Directional: from 5’ to 3’: (upstream) AACTTGCGATACTCCTA 5’ end MF workshop 10 © Yaron Orenstein (downstream) 3’ end 5 Gene structure (eukaryotes) Promoter DNA Coding strand Transcription start site (TSS) Transcription Pre-mRNA Exon Intron Splicing Mature mRNA 5’ UTR Start codon (RNA polymerase) Exon (spliceosome) 3’ UTR Coding region Translation MF workshop 10 © Yaron Orenstein Protein Stop codon (ribosome) 6 Translation • Codon - a triplet of bases, codes a specific amino acid (except the stop codons); many-to-1 relation • Stop codons - signal termination of the protein synthesis process MF workshop 10 © Yaron Orenstein http://ntri.tamuk.edu/cell/ribosomes.html 7 Genome sequences • Many genomes have been sequences, including those of viruses, microbes, plants and animals. • Human: – 23 pairs of chromosomes – 3+ Gbps (bps = base pairs) , only ~3% are genes – ~25,000 genes • Yeast: – 16 chromosomes – 20 Mbps – 6,500 genes MF workshop 10 © Yaron Orenstein 8 Regulation of Expression • Each cell contains an identical copy of the whole genome - but utilizes only a subset of the genes to perform diverse, unique tasks • Most genes are highly regulated – their expression is limited to specific tissues, developmental stages, physiological condition • Main regulatory mechanism – transcriptional regulation 9 MF workshop 10 © Yaron Orenstein Transcriptional regulation • Transcription is regulated primarily by transcription factors (TFs) – proteins that bind to DNA subsequences, called binding sites (BSs) • TFBSs are located mainly (not always!) in the gene’s promoter – the DNA sequence upstream the gene’s transcription start site (TSS) • BSs of a particular TF share a common pattern, or motif • Some TFs operate together – TF modules TF 5’ BS MF workshop 10 © Yaron Orenstein TF Gene BS TSS 3’ 10 TFBS motif models • Consensus (“degenerate”) string: A ACT CT C G AACTGT CACTGT CACTCT CACTGT AACTGT gene 1 gene 2 gene 3 gene 4 gene 5 gene 6 gene 7 gene 8 gene 9 gene 10 • Statistical models… • Motif logo representation MF workshop 10 © Yaron Orenstein 11 Human G2+M cell-cycle genes: The CHR – NF-Y module CDCA3 (trigger of mitotic entry 1) CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18 CDCA8 (cell division cycle associated 8) TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23 CDC2 (cell division control protein 2 homolog) CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGG GCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0 CDC42EP4 (cdc42 effector protein 4) GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110 CCNB1 (G2/mitotic-specific cyclin B1) AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45 CCNB2 (G2/mitotic-specific cyclin B2) TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGC CGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10 BS’s are short, non-specific, hiding in both strands and at various TFs: NF-Y , CHR 12 locations along the promoters MF workshop 10 © Yaron Orenstein Protein Binding Microarrays Berger et al, Nat. Biotech 2006 • Generate an array of double-stranded DNA with all possible k-mers • Detect TF binding to specific k-mers MF workshop 10 © Yaron Orenstein 13 13 PBM (2) MF workshop 10 © Yaron Orenstein 14 14 PBM - implementation • Use 60-mers (Agilent): 25nt constant primer + 35nt variable region • De Bruijn seq of all 10-mers (410 long) split into 35nt long fragments with 9nt overlap • ~40K probes • For each 8-mer, combine signals from all probes that contain it (or differ in 1nt) to obtain its binding score MF workshop 10 © Yaron Orenstein 15 15 The computational challenge • Input: PBM data (sequences and binding scores) of one TF. • Goal: Find a motif (PWM) that is the binding site of that TF. • Intuition: sequences that match the motif (on one of the two possible strands!) are expected to have high binding scores. MF workshop 10 © Yaron Orenstein 16 2. The project MF workshop 10 © Yaron Orenstein 17 General goals • Research - Learn about known solutions - Trial and error with training data • Develop software from A-Z: – Design – Implementation (Optimization) – Execution & analysis of test data • A taste of bioinformatics • Have fun • Get credit… MF workshop 10 © Yaron Orenstein 18 The computational task • Given a set of PBM data of different TFs. • Find the binding site motif in PWM format of each TF. • Main challenges: – Performance (time, memory) – Accuracy MF workshop 10 © Yaron Orenstein 19 Input File with 41,923 lines, each containing a probe sequence of length 35 and binding intensity. <sequence 35bp> \t <intensity> \n MF workshop 10 © Yaron Orenstein 20 Input (II) • For the training data, an additional PWM file will be supplied for each PBM data set. A: C: G: T: <freq1> <freq2> … <freq10> <freq1> … <freq10> … … • Separated by \t and \n. • All lines must contain same number of frequencies (10 is just an example). MF workshop 10 © Yaron Orenstein 21 Input (III) You will be given: 1. 10 training sets (PBM data + PWM) 2. 4 test sets (PBM data). You have to provide the PWM. 3. In the final project presentation, you will be given an online test set (PBM data) and your software will be applied to it. MF workshop 10 © Yaron Orenstein 22 Output 1. A PWM file describing the binding site found in the given PBM file. 2. The PWM in motif logo format (i.e. displayed on the screen). bits = 2 - entropy The file logo.zip contains a java package with the code that will easily display your motif. MF workshop 10 © Yaron Orenstein 23 Output (II) 3. Show graphically how well your motif predicts the binding intensity. • One example (note it’s not PWM): MF workshop 10 © Yaron Orenstein 24 Ranking 8-mers • One possible way to start: rank the 8mers in some way. Scores for example: 1. Signal average. 2. Signal median. • You can think of other scores that incorporate more information, e.g. position in probe sequence. • This is just an example. You can think of other ways to start. MF workshop 10 © Yaron Orenstein 25 Alignment procedure • Then, you can align the significant 8-mers. • You may take into account the relative score. • Don’t forget about the reverse complement! • Example: Cebpb TF MF workshop 10 © Yaron Orenstein 26 Enrichment scores • To test how good your motif is, you can use an enrichment score. • An enrichment score tests how good the motif distinguishes between highranking probes and the rest of the probes. MF workshop 10 © Yaron Orenstein 27 Hypergeometric probability drawn not drawn total white k m−k m black n−k N+k−n−m N−m total n N−n N MF workshop 10 © Yaron Orenstein 28 Hypergeometric enrichment score • Let B and T (T B) denote the BG and target sets, respectively, and let b and t denote the subset of probes from the BG and target set, respectively, that contain at least one occurrence of the motif. | b | | B | | b | min(|T |,|b|) i | T | i HG tail ( | B |,| T |,| b |,| t |) | B | i |t| | T | MF workshop 10 © Yaron Orenstein 29 Hypergeometric score (2) • The HG enrichment score computes the probability of observing at least |t| target sequences with a motif occurrence, under the null hypothesis that the probes in the target set were drawn randomly, independently, and without replacement from the BG set. • Code is provided in math.zip MF workshop 10 © Yaron Orenstein 30 Wilcoxon-Mann-Whitney (WMW) enrichment score • Foreground probes are all those containing a match, background are all the others. • B and F are the sizes of background and foreground, respectively. • ρB and ρF are the sums of the background and foreground ranks. • Read more in supplementary info (Berger06). 1 B F area BF B F MF workshop 10 © Yaron Orenstein 31 Deciding the length of the motif • Another challenge is to decide the length of the motif. • Most binding site are 6-12 bp long. • You should consider the information each position contains and decide on the length accordingly. MF workshop 10 © Yaron Orenstein 32 Scoring your PWM • One way to score your motif is by ranking the probe sequences according to a match score. • You may use the given code for match score. • Compare the ranking of the probes you got to the ranking according to binding intensities. There are different correlation score for that. MF workshop 10 © Yaron Orenstein 33 Match Score between PWMs • Already implemented for you: 1. Euclidian Distance: 2. Pearson Correlation Coefficient 3. KL Divergence MF workshop 10 © Yaron Orenstein 34 Implementation • Java (Eclipse) ; Linux (Other languages are possible, but will not participate in bonus). • Input: one single argument PBM filename • Output: PWM file, PWM presented in logo and graphical presentation of PWM matching distribution among probes. • Packages for motif logo and statistical scores will be supplied • Time performance will be measured • Reasonable documentation • Separate packages for data-structures, scores, GUI, I/O, etc. MF workshop 10 © Yaron Orenstein 35 Submission • Printed design document. • Printed code – for comments and remarks. • Printed results document – for each test set PWM logo + how good your result in terms of correlation to the probes ranks. • 4 PWM files, e.g. Test_1.pwm (submitted by email). • Executable for the online test. MF workshop 10 © Yaron Orenstein 36 Grade • 20% for the design • 30% for the implementation (20% for modularity, clarity, documentation, 10% for efficiency) • 30% for the performance and experimental results (20% for the accuracy on the 4 test queries and 10% for the accuracy on the online test query) • 20% for the final report and presentation • 10% bonus to the group with the most accurate results • 10% bonus for the group with the fastest implementation MF workshop 10 © Yaron Orenstein 37 Bonus grading • Accuracy will be determined using the provided code that compares two PWMs. • We will take the average of runs on several different PBM data sets. • Running time will be measured in java implementation, and the average will be taken. MF workshop 10 © Yaron Orenstein 38 Schedule 1.First progress report 23/11 2.Design document 21/12 3.Final presentation 16/2 • • • We shall meet with each group on each of these dates – mark your calendars! Schedule can be made earlier if you are ready. You are always welcome to meet us. Contact us by email. MF workshop 10 © Yaron Orenstein 39 Design document • Due in week 12 (21/12). • 3-5 pages (Word), Hebrew/English • Briefly describe main goal, input and output of program • Describe main data structures, algorithms, and scores. • Meet with me before submission. MF workshop 10 © Yaron Orenstein 40 Reference • Berger MF, Philippakis AA, Quershi AM, He FS, EstepIII PW, Bulyk ML. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature biotechnology. 2006;338:1429-1435. Very important! Read: the_brain.bwh.harvard.edu/UPBMseqn/suppl_methods.doc • Chen X, Hughes TR, Morris Q. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics. 2007 Jul 1;23(13):i72-79. MF workshop 10 © Yaron Orenstein 41 Fin MF workshop 10 © Yaron Orenstein 42