Tasks Monday January 21st 2006
Goals:
- To work with public databases on the internet to find gene and protein
information.
- To use tools to analyse and compare DNA sequences
- To find homologous sequences in other organisms and to learn the concept of
orthologs and paralogs.
- To make a phylogenetic analysis using ClustalW
- To analyse genome sequences from multiple organisms using VISTA
We will make use of public DNA (NCBI and UCSC) and protein databases (EBI).
The underlying information in the various databases is mostly identical but
visualisation and search options as well as annotation may vary.
These tasks contain questions. Please answer them in groups of two and write a
short report.
Task 1: Homologs of the E. coli photolyase gene
The bacterium E. coli can repair UV-induced DNA damage. UV light can cause the
formation of cyclobutane thymine dimers. The enzyme photolyase can repair this
damage, but it needs visible light to be activated. The energy of a photon is
absorbed by the enzyme and used by FADH to free an electron needed for repair
of the DNA damage.
In this task you will search for the E. coli K12 photolyase gene and protein and
you will try to find and compare homologs in model organisms from other
'kingdoms'. You will collect information for these homologs (e.g. protein
size, protein domains present). Using this information, you will try to work out
how this gene may have evolved and how it arose in various organisms.
Find the amino acid sequence of the E. coli photolyase protein at NCBI.
Go to http://www.ncbi.nih.gov/ and search all databases for photolyase. Find
the protein sequence whose accession starts with NP_, which means that it is a
reference sequence for a given organism.
How many amino acids does the protein consist of?
In the pull-down menu 'display' you can select another view. Select 'FASTA' for
a short version of the sequence without extensive annotation. Copy the
sequence, including its description which is preceded by ">" and paste it on
your notepad. Save this file for later use.
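If you want to check the saved file outside the browser, the FASTA layout described above (a description line starting with ">" followed by sequence lines) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` and the record below are hypothetical illustrations, and the toy sequence is not the real photolyase sequence.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical toy record, NOT the real photolyase sequence.
EXAMPLE = """>toy_protein hypothetical example record
MTTHLVWFRQ
DLRLHDNLAL
"""

records = parse_fasta(EXAMPLE)
for description, sequence in records:
    print(description, len(sequence))
```

Replacing `EXAMPLE` with the contents of your own notepad file should report the number of amino acids for each sequence you saved.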
As depicted in the picture above, the protein contains two important activities.
We will now analyse the protein for known 'domains' residing in the protein
using the program "InterProScan" (http://www.ebi.ac.uk/InterProScan/). Copy
the protein sequence into this screen and start the analysis.
Which two large protein domains are found in the E. coli photolyase
protein?
What are the functions of these two domains?
To find homologs in other organisms you can choose to use the complete
protein or one of the two domains. We will first search for homologs in the
unicellular organism baker's yeast, Saccharomyces cerevisiae. Copy the E. coli
photolyase amino acid sequence and find homologs in yeast using Blast
(http://www.ncbi.nlm.nih.gov/BLAST/).
Which Blast program would be best suited for this task?
Paste the protein sequence in the 'search' window and limit your search to
"Saccharomyces cerevisiae" in the options panel. After you have started the Blast
comparison, a new screen will pop up, again showing the two domains present
in the photolyase protein. Hit the 'format' button to go to the results. Click on the
best hit, preferably again a 'NP_xxxxx' sequence and retrieve and copy this
sequence in FASTA format to your notepad with the E. coli sequence.
Blast also returns an alignment of the E. coli and yeast protein sequence, but
this is a local alignment that only shows those parts that match best. In this
case, homology information for the start and end of the protein is missing. To
make a global alignment, we will use the program Align
(http://www.ebi.ac.uk/emboss/align/index.html). The alignment method is
set to global alignment by default. Paste the E. coli and yeast protein
sequences in the two different input fields. Hit 'run' to see the aligned output.
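The difference between a local and a global alignment can be made concrete with a small dynamic-programming sketch. The code below is an illustration only: it uses a simple scoring scheme (match +1, mismatch -1, gap -1) that is an assumption of this example and not the scoring used by Blast or Align, and the two sequences are made-up toy strings.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Return the best alignment score of a and b.

    local=False: global alignment (Needleman-Wunsch), both sequences
                 must be aligned from end to end.
    local=True:  local alignment (Smith-Waterman), only the best
                 matching segment is scored.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Hypothetical toy sequences: a shared core with extra residues in one of them.
short_seq = "ACDEFG"
long_seq = "WWACDEFGYY"

local_score = alignment_score(short_seq, long_seq, local=True)
global_score = alignment_score(short_seq, long_seq, local=False)
print("local:", local_score, "global:", global_score)
```

The local score only rewards the shared core, while the global score also pays for the residues that have no partner in the other sequence. This is why the global alignment shows regions that the Blast output leaves out.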
What is the most striking difference between the two sequences?
This part of the protein may play a role in the subcellular targeting of the
protein. Go to the Saccharomyces Genome Database (SGD)
(http://www.yeastgenome.org/) and find the subcellular localisation of the
yeast photolyase protein by searching for "Phr1", which is the yeast name
for this protein.
What is the subcellular localisation of the protein? Could this be
expected? Assuming that the domain that is present in yeast and not
in E. coli is responsible for this localisation, why can E. coli do without
this segment?
Now go back and try to find other homologs of the E. coli photolyase in
Eukaryotes.
Include homologs from human, mouse and a plant, and from some organisms of your
own choice. Copy all sequences in FASTA format (preferably sequences with
NP_xxxxx names) to a new notepad file. Note that organisms may contain
multiple different homologs! Collect all of them.
Once you have a nice collection of sequences we will compare them with each
other using the multiple sequence alignment program ClustalW
(http://www.ebi.ac.uk/clustalw/). Read the Frequently Asked Questions for
more background on this tool. At the bottom of the page you will find the
'Upload a file' field. Select your saved notepad file and run the program.
Discuss your findings in your report.
You can improve your alignment by removing distantly related sequences.
Delete these sequences (e.g. E. coli) from your notepad file and reanalyse your
sequences.
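One way to decide which sequences are distantly related is to compute the percent identity of each aligned sequence to a reference. The sketch below assumes you have copied aligned sequences (with "-" for gaps) out of the ClustalW output. The sequence names, the toy alignment and the 50% cut-off are all hypothetical, and percent identity can be defined in several ways; here, columns with a gap in either sequence are ignored.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # ignore gapped columns
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical toy alignment, not real photolyase or cryptochrome sequences.
reference = "MKTAYIAKQR"
others = {
    "seq_close": "MKTAYIGKQR",
    "seq_distant": "M-SAWLG-HR",
}

identities = {name: percent_identity(reference, seq) for name, seq in others.items()}

THRESHOLD = 50.0  # arbitrary cut-off chosen for this illustration
kept = [name for name, pid in identities.items() if pid >= THRESHOLD]
print(identities, kept)
```

Sequences that fall below the cut-off are candidates for removal before you rerun the alignment.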
The human and mouse genome both contain two clear photolyase homologs:
cryptochrome 1 and 2.
Describe which genes are likely to be orthologs and which are
paralogs.
Task 2: Comparative genome analysis of the human cry2 locus
From Task 1 you have learnt that you can find protein sequences and identify
homologs in other organisms. However, sometimes the protein sequence is not
available for a given organism or it may be questionable if the gene structure
is properly predicted from the genome sequence. In this task, you will search
for homologous regions in mouse, rat, chimpanzee, fugu, etcetera using the
comparative genome browser VISTA
(http://genome.lbl.gov/vista/index.shtml). You will find various programs on
the VISTA home page for specific types of searches. Go to the VISTA Browser
(http://pipeline.lbl.gov/cgi-bin/gateway2) and search in the 'Human July 2003'
genome for the human photolyase homolog 'cry2' by entering this term in the
position field. You will now see a graphical view of the degree of conservation between
the homologous human and mouse genome sequences. Try to understand the
figure and colouring using the legend. Extend the comparison by adding more
organisms using the pull down menu on the left.
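A conservation curve of this kind can be thought of as percent identity measured in a window that slides along a pairwise alignment. The sketch below illustrates that idea only; it is not the VISTA implementation, and the function name, the window size of 5 and the toy sequences are assumptions made for this example.

```python
def window_identity(aligned_a, aligned_b, window=5):
    """Percent identity in each sliding window over a pairwise alignment.

    Columns containing a gap count as non-identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [
        1 if x == y and x != "-" else 0
        for x, y in zip(aligned_a, aligned_b)
    ]
    return [
        100.0 * sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


# Hypothetical toy alignment: conserved ends, diverged middle.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTTTTAC"

profile = window_identity(seq_a, seq_b, window=5)
print(profile)
```

Peaks in such a profile correspond to stretches where the two genomes are similar, and troughs to stretches that have diverged. With real genome alignments the window would be much larger than in this toy case.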
Which parts of the gene are clearly conserved in all organisms?
Which organisms are best suited for the identification of this kind of
conserved regions? Which are less suited? Explain why.
Which organism would be best suited for finding conserved and
potentially functional promoter elements that regulate the expression
of this gene?
Zoom out by clicking on the magnification icon with the '-' sign. You will now
see a larger genomic region. In the chimpanzee trace you will now see a large
gap.
What does this mean, and what process underlies it?
Task 3: Identification of functional genomic elements using phylogenetic shadowing
This morning you have read the paper by Boffelli et al. on phylogenetic
shadowing. This method is specifically suited for the identification of small
conserved elements in a genome or lineage-specific features.
For this task you will be using the sequences from 10 different primates
(FASTA format) from the course webpage.
Align these sequences using ClustalW. What can you conclude from
this alignment?
Clearly, another approach is needed to extract information from these
sequences. Use the eShadow (http://eshadow.dcode.org/) tool to analyse
these sequences. Play around with the window size settings to get a clear
view.
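To get a feeling for what the window size does, it can help to compute a crude variation profile yourself. The sketch below marks every column of a multiple alignment in which the sequences do not all agree and then reports the percentage of such columns per sliding window. It is a simplified illustration and not the method implemented in eShadow; the function name, the window size and the four toy sequences are hypothetical.

```python
def variation_profile(alignment, window=4):
    """Percentage of variable columns in each sliding window.

    A column is 'variable' if the aligned sequences do not all carry
    the same character (a gap counts as a character of its own).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    variable = [0 if len(set(column)) == 1 else 1 for column in zip(*alignment)]
    return [
        100.0 * sum(variable[i:i + window]) / window
        for i in range(length - window + 1)
    ]


# Hypothetical toy alignment of four closely related sequences.
toy_alignment = [
    "ACGTACGT",
    "ACGTACGA",
    "ACGTTCGT",
    "ACGTACCT",
]

variation = variation_profile(toy_alignment, window=4)
print(variation)
```

A small window gives a noisy profile and a large window smooths away short elements, which is why the setting needs some experimentation.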
What is shown in the graph? How many potentially functional
segments are present in this region, and what principle underlies
this hypothesis? What is the estimated size of each conserved region?
Now let's go back to the VISTA home page and see if we can retrieve the same
information using other genomes. Use the GenomeVISTA tool and use the
human sequence in your sequence list to search the genomic coordinates in
the human June 2004 genome assembly. Wait until your search is finished and
click the 'Vista browser' option. Add all available organisms for comparison.
What can you conclude? Which organisms could also be used to
identify these individual elements and which are not very informative?
What is the estimated size of each conserved region?
Close the VISTA browser window and select the 'VISTA track' option in the
search results window. You are now redirected to the UCSC genome browser,
which displays your results along with existing genome annotation. There is
another track with conservation information, showing the cumulative
conservation using information from 10 different organisms (this is not a
pairwise alignment, as you have seen thus far in VISTA, but a graphical
representation of a sort of ClustalW multiple alignment).
What would you conclude from the 10-way alignment? What is the
estimated size of each conserved region?
Under the graph you will find many tracks that can be displayed as well.
Set the sno/miRNA track to 'full'.
Which element(s) reside in the conserved regions? What are their
sizes?