Introduction to Bioinformatics (236523)
HW 4 – Winter 2016

General Instructions:
Deadline: 7/1/15 23:55.
Submission according to published pairs only.
Submission is electronic only, through the course website.

Question 1: Motif representation, sensitivity and specificity

A researcher studies the binding sites of a transcription factor X. He conducts a ChIP-Seq experiment by binding X to the genome and sequencing the sequences to which the protein binds. In order to find the binding site motif, the researcher then runs MEME.

1. The following multiple sequence alignment shows the motif occurrences extracted from a subset of the ChIP-Seq sequences. Represent the motif in 3 different ways:
   a. A count matrix
   b. A probability matrix
   c. A consensus sequence

   AGGGCAGCTT
   ACGACTGCTG
   CTGGCTGCTA
   ATGACTGCTG
   AGGACTGCTC
   CCGGCAGCTG
   ATGGCTGCTC

2. The researcher then ran MAST on 50 sequences: 25 of them are known to be bound by protein X and 25 of them are known not to be bound by protein X. The following table contains the MAST results from scanning the 50 sequences with the motif found in step 1.
   a. What are the rates of false positives, false negatives, true positives and true negatives of MAST?
   b. Describe the quality of this MAST run in terms of sensitivity and specificity.

Sequences that bind protein X      Sequences that don't bind protein X
Sequence ID   MAST result          Sequence ID   MAST result
1             Motif found          26            Motif not found
2             Motif found          27            Motif found
3             Motif found          28            Motif not found
4             Motif found          29            Motif not found
5             Motif found          30            Motif not found
6             Motif found          31            Motif found
7             Motif found          32            Motif not found
8             Motif found          33            Motif found
9             Motif not found      34            Motif not found
10            Motif found          35            Motif not found
11            Motif found          36            Motif found
12            Motif found          37            Motif found
13            Motif found          38            Motif found
14            Motif found          39            Motif not found
15            Motif found          40            Motif not found
16            Motif found          41            Motif found
17            Motif found          42            Motif not found
18            Motif not found      43            Motif found
19            Motif found          44            Motif not found
20            Motif found          45            Motif not found
21            Motif found          46            Motif found
22            Motif found          47            Motif found
23            Motif found          48            Motif found
24            Motif found          49            Motif not found
25            Motif found          50            Motif not found

Question 2: RNA structure and function

The PUM2 protein is a human RNA binding protein. In order to identify the preferred binding motif of this protein on RNA, a researcher conducted a high-throughput RNA binding experiment (CLIP). The results of the experiment are given in the attached file pum2.fasta.txt. Sequences are in FASTA format and are ranked according to the binding score, from high to low.

1. Identify the preferred binding motif of PUM2 by running the file on DRIMUST http://drimust.technion.ac.il/ . Make sure that you are using the “Single strand” mode for RNA sequences. Provide the motif logo.

2. What is the significance of the identified motif? Look at the occurrences distribution by clicking on “view occurrences distribution” or by downloading the motif occurrences file, and explain why the motif is so highly significant.

3. In a different study it has been shown that PUM2 binds RNA only in loop regions (single stranded) of the folded RNA which contain the consensus motif.
   The researcher is studying 5 RNA sequences in the 3’ untranslated regions of different genes (sequences are found below) and is interested to learn which of the sequences bind the PUM2 protein.
   a. Use the RNAfold web server http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi to predict which one of the 5 sequences below is the most likely candidate to bind PUM2. Explain your results.
   b. What is the free energy of the RNA predicted to bind PUM2? Are the results you got expected for a binding site of PUM2? Explain.
      NOTE! Stable RNA structures of similar length usually have a free energy lower than -10 kcal/mol.
   c. What part of the identified PUM2 motif has the highest probability to be in a loop? Explain.
   d. It has been shown that in the presence of an RNA helicase (an enzyme that can unwind stem-loop RNA structures) PUM2 can also bind its preferred binding motif in partial stem regions. Based on this information, which other sequence (from the list below) would bind PUM2 in the presence of the RNA helicase?

   List of 5 sequences:
   Seq 1: 5’ CCGGCCAAAUAAAUGUCCCCAAAAGGCC 3’
   Seq 2: 5’ CCGGCGCACAUAAAUGUACAGUGCGGCC 3’
   Seq 3: 5’ CCGGUUAACGUUUUAUUAUACCCAGGCC 3’
   Seq 4: 5’ CCGGCCAAUGUAAAUACCCCAAAAGGCC 3’
   Seq 5: 5’ CCGGCGCACUGUAAAUAACAGUGCGGCC 3’

   Provide snapshots or PDF files of the predicted RNA fold for each sequence!

Question 3: Research Question

This is an open research question; you are requested only to write your research plan (as described below) and not to conduct the research!

As we discussed in class, it has been proposed that long non-coding RNAs (lncRNAs) can bind Transcription Factors (which usually bind double-stranded DNA) and compete with the natural promoter of a gene which is regulated by that Transcription Factor. This is another elegant way by which the cell can regulate gene expression. To date there are only very few known examples of lncRNAs that have been confirmed in the laboratory to bind a Transcription Factor.
In these cases it was found that the binding motif on the lncRNA is exactly the same as the binding motif on the promoter (DNA).

You are given ChIP-Seq data for 10 Transcription Factors and the sequences of 400 lncRNAs, each of length 1000 nucleotides. Your goal is to design a bioinformatics experiment to find the lncRNAs (among the 400) that can bind one or more of the Transcription Factors for which you have ChIP-Seq data.

In your answer you will have to define clearly the research steps you plan. Each step should include the question that is answered in that step and the bioinformatics tools that you will use to answer it. Please elaborate and explain each step: what is its aim, and how does it advance you to the next step? Also detail the expected output of your analysis. Note that this is an open question, and there may be many ways to approach it.

Several important clues to remember when answering the question:
1. In terms of the motif, “T” in DNA is completely equivalent to “U” in RNA.
2. Motifs in single-strand are completely different from motifs in double-strand.
3. RNA folding algorithms are most accurate for sequences of length 150-200 nucleotides.
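
To make the three clues of Question 3 concrete, here is a minimal Python sketch. It is only an illustration, not the expected research plan. The function names, the motif GATTC, the toy RNA sequence, and the window size and step (200 and 100) are all assumptions made for this example. Exact matching of a consensus string is also a simplification; a real analysis would more likely scan with a probability matrix.

```python
def dna_motif_to_rna(motif: str) -> str:
    """Clue 1: rewrite a DNA motif in the RNA alphabet (T -> U)."""
    return motif.upper().replace("T", "U")


def find_motif(rna: str, motif: str) -> list:
    """Clue 2: scan the given strand only; the reverse complement is NOT searched."""
    rna = rna.upper()
    hits = []
    start = rna.find(motif)
    while start != -1:
        hits.append(start)
        start = rna.find(motif, start + 1)  # allow overlapping occurrences
    return hits


def windows(seq: str, size: int = 200, step: int = 100) -> list:
    """Clue 3: cut a long sequence into overlapping windows short enough to fold.

    size and step are hypothetical values chosen for this sketch.
    """
    last_start = max(len(seq) - size, 0)
    return [(s, seq[s:s + size]) for s in range(0, last_start + 1, step)]


# Hypothetical motif and toy RNA sequence, for illustration only.
rna_motif = dna_motif_to_rna("GATTC")
print(rna_motif)
print(find_motif("AAGAUUCGGAUUC", rna_motif))
print(len(windows("A" * 1000)))
```

Each window returned by windows() is paired with its start position, so a motif hit can be mapped back to coordinates in the full-length lncRNA.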
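
As a reminder of the three motif representations requested in Question 1, part 1, the sketch below builds them in Python from a small made-up alignment (not the one in the question). The function names and the toy sequences are assumptions for illustration. No pseudocounts are added, and ties in the consensus are broken by alphabet order, which is an arbitrary choice.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_matrix(alignment):
    """Return one dict of base counts per alignment column."""
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        matrix.append({base: counts[base] for base in ALPHABET})
    return matrix


def probability_matrix(counts):
    """Divide each column of counts by the column total (no pseudocounts)."""
    probs = []
    for column in counts:
        total = sum(column.values())
        probs.append({base: n / total for base, n in column.items()})
    return probs


def consensus(counts):
    """Pick the most frequent base in each column; ties go to the first base in ALPHABET."""
    return "".join(max(ALPHABET, key=lambda b: column[b]) for column in counts)


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["ACGT", "ACGA", "TCGT", "ACCT"]
counts = count_matrix(toy_alignment)
probs = probability_matrix(counts)
print(counts[0])
print(probs[0])
print(consensus(counts))
```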
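
For Question 1, part 2, the standard definitions are sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), where a "positive" is a sequence in which MAST reports the motif. The Python sketch below applies these definitions to a small hypothetical set of results, not to the table in the question. The function names and the input lists are assumptions for illustration.

```python
def confusion_counts(bound_hits, unbound_hits):
    """Count TP, FN, FP, TN.

    bound_hits:   booleans for sequences known to bind the protein
    unbound_hits: booleans for sequences known not to bind the protein
    True means the motif was reported in that sequence.
    """
    tp = sum(bound_hits)
    fn = len(bound_hits) - tp
    fp = sum(unbound_hits)
    tn = len(unbound_hits) - fp
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    """Fraction of truly bound sequences in which the motif was found."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of truly unbound sequences in which the motif was not found."""
    return tn / (tn + fp)


# Hypothetical results for 4 bound and 4 unbound sequences.
bound = [True, True, True, False]
unbound = [False, True, False, False]
tp, fn, fp, tn = confusion_counts(bound, unbound)
print(tp, fn, fp, tn)
print(sensitivity(tp, fn), specificity(tn, fp))
```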