Introduction to Bioinformatics (236523)
HW 4 – Winter 2016
General Instructions:

Deadline: 7/1/15 23:55.

Submit in the published pairs only.

Submission is electronic only, via the course website.
Question 1: Motif representation, sensitivity and specificity
A researcher studies the binding sites of a transcription factor X. He conducts a ChIP-Seq
experiment, in which X is bound to the genome and the sequences to which the protein binds
are sequenced. To find the binding-site motif, the researcher then runs MEME.
1. The following multiple sequence alignment shows the motif instances extracted from a
subset of the ChIP-Seq sequences. Represent the motif in three different forms:
a. A count matrix
b. A probability matrix
c. A consensus sequence
AGGGCAGCTT
ACGACTGCTG
CTGGCTGCTA
ATGACTGCTG
AGGACTGCTC
CCGGCAGCTG
ATGGCTGCTC
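The three representations can be derived mechanically from the alignment. The following is a minimal sketch, assuming plain position-wise counting with no pseudocounts and breaking ties in alphabetical order; the function names are illustrative and do not come from MEME.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_matrix(sequences):
    """Return {base: [count in each alignment column]} for equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    columns = [Counter(column) for column in zip(*sequences)]
    return {base: [col[base] for col in columns] for base in ALPHABET}

def probability_matrix(counts, n_sequences):
    """Divide every count by the number of sequences (assumption: no pseudocounts)."""
    return {base: [c / n_sequences for c in row] for base, row in counts.items()}

def consensus(counts):
    """Pick the most frequent base per column; ties go to the first base in ALPHABET."""
    length = len(next(iter(counts.values())))
    return "".join(max(ALPHABET, key=lambda b: counts[b][i]) for i in range(length))

# The seven aligned motif instances listed above.
alignment = [
    "AGGGCAGCTT",
    "ACGACTGCTG",
    "CTGGCTGCTA",
    "ATGACTGCTG",
    "AGGACTGCTC",
    "CCGGCAGCTG",
    "ATGGCTGCTC",
]

counts = count_matrix(alignment)
probs = probability_matrix(counts, len(alignment))
motif_consensus = consensus(counts)

for base in ALPHABET:
    print(base, counts[base], [round(p, 2) for p in probs[base]])
print("consensus:", motif_consensus)
```

A consensus built this way discards the variability at each position, which is why the count and probability matrices are usually the more informative representations.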
2. The researcher then ran MAST on 50 sequences: 25 of them are known to be
bound by protein X and 25 of them are known not to be bound by protein X. The
following tables contain the MAST results from scanning these 50 sequences with the
motif found in step 1.
a. What are the false positive, false negative, true positive and true
negative rates of MAST?
b. Describe the quality of this MAST run in terms of sensitivity and specificity.
Sequences that bind protein X
Sequence ID    MAST result
1              Motif found
2              Motif found
3              Motif found
4              Motif found
5              Motif found
6              Motif found
7              Motif found
8              Motif found
9              Motif not found
10             Motif found
11             Motif found
12             Motif found
13             Motif found
14             Motif found
15             Motif found
16             Motif found
17             Motif found
18             Motif not found
19             Motif found
20             Motif found
21             Motif found
22             Motif found
23             Motif found
24             Motif found
25             Motif found
Sequences that don't bind protein X
Sequence ID    MAST result
26             Motif not found
27             Motif found
28             Motif not found
29             Motif not found
30             Motif not found
31             Motif found
32             Motif not found
33             Motif found
34             Motif not found
35             Motif not found
36             Motif found
37             Motif found
38             Motif found
39             Motif not found
40             Motif not found
41             Motif found
42             Motif not found
43             Motif found
44             Motif not found
45             Motif not found
46             Motif found
47             Motif found
48             Motif found
49             Motif not found
50             Motif not found
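The four outcome counts, and from them sensitivity and specificity, can be tallied directly from the two tables. The sketch below assumes the usual definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) and treats "Motif found" as a positive prediction; the variable names are illustrative.

```python
# Sequence IDs as given in the two tables above.
bound_ids = set(range(1, 26))      # known to bind protein X
unbound_ids = set(range(26, 51))   # known not to bind protein X

# Exceptions read off the tables: bound sequences where MAST missed the motif,
# and unbound sequences where MAST nevertheless reported it.
motif_not_found_in_bound = {9, 18}
motif_found_in_unbound = {27, 31, 33, 36, 37, 38, 41, 43, 46, 47, 48}

# Every sequence for which MAST reported "Motif found".
motif_found = (bound_ids - motif_not_found_in_bound) | motif_found_in_unbound

tp = len(motif_found & bound_ids)      # bound, motif found
fn = len(bound_ids - motif_found)      # bound, motif not found
fp = len(motif_found & unbound_ids)    # unbound, motif found
tn = len(unbound_ids - motif_found)    # unbound, motif not found

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```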
Question 2: RNA structure and function
The PUM2 protein is a human RNA-binding protein. In order to identify the preferred binding
motif of this protein on RNA, a researcher conducted a high-throughput RNA binding
experiment (CLIP). The results of the experiment are given in the attached file
pum2.fasta.txt. Sequences are in FASTA format and are ranked according to the binding
score, from high to low.
1. Identify the preferred binding motif of PUM2 by running the file on DRIMUST
http://drimust.technion.ac.il/ .
Make sure that you are using the “Single strand” mode for RNA sequences.
Provide the motif logo.
2. What is the significance of the identified motif? Look at the distribution of occurrences
by clicking on “view occurrences distribution” or by downloading the motif
occurrences file, and explain why the motif is so highly significant.
3. In a different study it has been shown that PUM2 binds RNA only in loop regions
(single stranded) of the folded RNA which contain the consensus motif. The
researcher is studying 5 RNA sequences in the 3’ untranslated regions of different
genes (sequences are found below) and is interested to learn which of the
sequences bind the PUM2 protein.
a. Use the RNAfold web server http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi to
predict which one of the 5 sequences below is the most likely candidate to
bind PUM2. Explain your results.
b. What is the free energy of the RNA predicted to bind PUM2? Are the results
you got expected for a binding site of PUM2? Explain. NOTE! Stable RNA
structures of similar length usually have a free energy lower than -10 kcal/mol.
c. What part of the identified PUM2 motif has the highest probability to be in a
loop? Explain.
d. It has been shown that in the presence of an RNA helicase (an enzyme that
can unwind stem-loop RNA structures) PUM2 can also bind its preferred
binding motif in partial stem regions. Based on this information, which other
sequence (from the list below) would bind PUM2 in the presence of the RNA
helicase?
List of 5 sequences:
Seq 1: 5’ CCGGCCAAAUAAAUGUCCCCAAAAGGCC 3’
Seq 2: 5’ CCGGCGCACAUAAAUGUACAGUGCGGCC 3’
Seq 3: 5’ CCGGUUAACGUUUUAUUAUACCCAGGCC 3’
Seq 4: 5’ CCGGCCAAUGUAAAUACCCCAAAAGGCC 3’
Seq 5: 5’ CCGGCGCACUGUAAAUAACAGUGCGGCC 3’
Provide snapshots or PDF files of the predicted RNA fold for each sequence!
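Once a fold has been predicted, checking whether a motif occurrence sits in a loop amounts to comparing its position with the dot-bracket string, where "." marks an unpaired base. The following is a minimal sketch under that assumption; the sequence, structure and motifs in the example are toy values invented for illustration and are not one of the five sequences or the PUM2 motif.

```python
def motif_occurrences(sequence, motif):
    """Return the 0-based start of every (possibly overlapping) motif occurrence."""
    hits = []
    pos = sequence.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = sequence.find(motif, pos + 1)
    return hits

def unpaired_fraction(structure, start, length):
    """Fraction of positions in structure[start:start+length] that are unpaired ('.')."""
    window = structure[start:start + length]
    return window.count(".") / length

def score_motif_sites(sequence, structure, motif):
    """Pair each motif occurrence with the fraction of it that lies in unpaired regions."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and dot-bracket structure must have equal length")
    return [
        (pos, unpaired_fraction(structure, pos, len(motif)))
        for pos in motif_occurrences(sequence, motif)
    ]

# Toy hairpin (hypothetical): a 4-bp stem closing a 4-nt loop.
toy_sequence = "GGGCAAAUGCCC"
toy_structure = "((((....))))"

print(score_motif_sites(toy_sequence, toy_structure, "AAAU"))  # motif entirely in the loop
print(score_motif_sites(toy_sequence, toy_structure, "GCAA"))  # motif half in the stem
```

A fraction of 1.0 means the whole occurrence is single-stranded in the predicted fold; lower values indicate that part of the motif is locked in a stem.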
Question 3: Research Question
This is an open research question; you are asked only to write your research plan (as
described below), not to conduct the research.
As we discussed in class, it has been proposed that long non-coding RNAs (lncRNAs) can bind
Transcription Factors (which usually bind double-stranded DNA) and compete with the natural
promoter of a gene that is regulated by that Transcription Factor. This is another elegant
way by which the cell can regulate gene expression. To date there are only very few known
examples of lncRNAs that have been confirmed in the laboratory to bind a Transcription Factor. In
these cases it was found that the binding motif on the lncRNA is exactly the same as the
binding motif on the promoter (DNA).
You are given ChIP-Seq data for 10 Transcription Factors and sequences of 400 lncRNAs, each
of length 1000 nucleotides. Your goal is to design a bioinformatics experiment to find the
lncRNAs (among the 400) that can bind one or more of the Transcription Factors for which you
have ChIP-Seq data.
In your answer you will have to define clearly the research steps you plan. Each step should
include the question that is answered in that step, and the bioinformatics tools that you will
use to answer it. Please elaborate and explain each step: what its aim is and how it
advances you to the next step. Also detail the expected output of your analysis.
Note that this is an open question, and there may be many ways you can approach it.
Several important clues to remember when answering the question:
1. In terms of the motif, “T” in DNA is completely equivalent to “U” in RNA
2. Motifs in single-stranded sequences are completely different from motifs in double-stranded sequences
3. RNA folding algorithms are most accurate for sequences of length 150-200 nucleotides.
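Clues 1 and 3 translate into two small preprocessing steps: rewriting a DNA motif in the RNA alphabet, and cutting each 1000-nucleotide lncRNA into pieces short enough to fold reliably. The sketch below is one possible way to do this, not a prescribed method; the 200-nt window, the 100-nt step and the example motif and sequence are assumptions chosen for illustration.

```python
def dna_motif_to_rna(motif):
    """Rewrite a DNA motif in the RNA alphabet (T becomes U), as stated in clue 1."""
    return motif.upper().replace("T", "U")

def sliding_windows(sequence, window=200, step=100):
    """Split a sequence into overlapping (start, subsequence) windows.

    The window and step sizes are illustrative defaults. A final window is
    added when needed so that the 3' end of the sequence is always covered.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(sequence) <= window:
        return [(0, sequence)]
    starts = list(range(0, len(sequence) - window + 1, step))
    if starts[-1] + window < len(sequence):
        starts.append(len(sequence) - window)
    return [(s, sequence[s:s + window]) for s in starts]

# Toy inputs (hypothetical): a made-up DNA motif and a synthetic 1000-nt RNA.
rna_motif = dna_motif_to_rna("ACGT")
toy_lncrna = "ACGU" * 250
windows = sliding_windows(toy_lncrna)

print(rna_motif)
print(len(windows), "windows; last starts at", windows[-1][0])
```

Overlapping windows reduce the chance that a motif is split across a window boundary, at the cost of folding each region more than once.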