CSE182 project
1 Validation of proteomics data by homology search
1.1 Background
Peptide mass spectrometry has become widely used for the identification of regions of the genome that
are translated into protein [1]. The evidence from mass spectrometry is a collection of peptides, as shown
in Figure 1.
In many cases, the identified region falls within a known gene and is consistent with the protein
produced by the gene. In other cases, the region represents a novel coding region. These novel regions
can help to improve the annotation of the genome by identifying refinements to known genes, as shown
in Figure 2, or completely novel genes. Notice that several overlapping or neighboring regions may
indicate the same type of refinement. In the analysis below, these will be treated as a single translated
region. Each region is also associated with a peptide sequence.
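The grouping of overlapping or neighboring regions into a single translated region can be sketched as an interval merge. This is only an illustration: the handout does not say how close two regions must be to count as neighbors, so the `max_gap` parameter, the `Region` tuple layout, and the choice to keep strands separate are all assumptions.

```python
from typing import List, Tuple

# Hypothetical layout: (chromosome, strand, start, end), 1-based inclusive.
Region = Tuple[str, str, int, int]


def merge_regions(regions: List[Region], max_gap: int = 0) -> List[Region]:
    """Merge regions on the same chromosome and strand that overlap,
    touch, or are separated by at most max_gap nucleotides."""
    merged: List[Region] = []
    # Sorting groups regions by chromosome, then strand, then start.
    for chrom, strand, start, end in sorted(regions):
        if merged:
            p_chrom, p_strand, p_start, p_end = merged[-1]
            same_track = (p_chrom, p_strand) == (chrom, strand)
            if same_track and start <= p_end + max_gap + 1:
                # Extend the previous region instead of starting a new one.
                merged[-1] = (chrom, strand, p_start, max(p_end, end))
                continue
        merged.append((chrom, strand, start, end))
    return merged
```

With `max_gap=0` only overlapping or directly adjacent regions are combined; a larger value also joins regions separated by a short gap.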
[Figure 1 diagram. Panel A, prokaryotic genes: an operon with promoter, operator, Gene 1, Gene 2, and Gene 3, between transcription initiation and transcription stop. Panel B, eukaryotic gene: enhancer, promoter, Exons 1-3 separated by introns, splice donor and acceptor sites, an alternate splice junction, and the transcription and translation initiation and stop sites.]
Figure 1: A: Typical gene structure of a prokaryotic organism. B: Typical gene structure of a eukaryotic organism.
The exons are transcribed into mRNA, and then translated into proteins. If the gene annotation were complete,
we would only identify the coding exon regions as translated by mass spectrometry.
However, for the purposes of gene annotation, we cannot tolerate many errors. The translated
regions indicated by mass spectrometry must be congruent with other sources of translational evidence
in order to be believed. The goal of this project is to validate the regions identified by mass spectrometry
using the homology search tool BLAST. The organism we are studying is the eukaryotic plant Zea
mays (maize, or corn).
[Figure 2 diagram. Labeled refinement types: novel exon, exon boundary, novel splice junction, translated UTR, and out-of-frame translation (+1, +2).]
Figure 2: Translated regions of the genome identified by mass spectrometry are shown in white below the genome.
Known genes are shown on the genome as grey boxes. All of the regions are inconsistent with a known gene and
suggest different types of gene model refinement are needed. The dotted lines indicate regions which were not
included in the known gene but appear to be translated according to the mass spectrometry information.
1.2 Problem Description
Input: The input will be a collection of ‘translated regions’ from the maize genome.
Output: The output is a table of comparative genomics evidence for each translated region being
correct.
1.3 Methods
1. Download the genome of Zea mays from maizesequence.org. Warning: this is a big genome.
2. Download the non-redundant protein database (nr) from GenBank. This is also a large database.
3. Download and install BLAST on a local computer.
4. Write a script to excise genomic regions around each translated region (100 nucleotides added to
both sides). This forms the set of genomic queries.
5. Run BLASTX of the genomic queries against the non-redundant (nr) protein database.
6. For each translated region, determine if there exists a close protein homolog. If a homolog exists,
then determine if peptides from the predicted region are supported by the alignment. Specifically,
(a) Use the BLAST homolog alignments to obtain the coding exons (with inexact boundaries), and
the frame in which they are translated.
(b) Scan the alignment against the homolog to look for any frame-shifts or stop codons, as these
indicate a pseudogene.
(c) Of the n peptides given to you as support for the translated region, identify the number that
are consistent with the target. Be careful, as some peptides cross splice junctions.
7. Next, take the translated protein sequence of the close homolog, and search the Z. mays genome
for paralogs. Report all of the coordinates of the paralogs on the Zea mays genome.
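Step 4 above could be sketched as follows. This is a minimal illustration, not a prescribed solution: it assumes the genome is already in memory as a dictionary from chromosome name to sequence string, and that region coordinates are 1-based and inclusive. The function names `excise_query` and `to_fasta` are made up for this example. For a genome as large as maize, an indexed FASTA reader would likely be a better fit than a plain dictionary.

```python
from typing import Dict, Tuple

FLANK = 100  # nucleotides added to each side of a region (step 4)


def excise_query(genome: Dict[str, str], chrom: str, start: int, end: int,
                 flank: int = FLANK) -> Tuple[int, int, str]:
    """Return (query_start, query_end, sequence) for one translated region.

    Coordinates are assumed to be 1-based and inclusive. The flank is
    clipped at the chromosome ends so the slice never runs off the sequence.
    """
    seq = genome[chrom]
    q_start = max(1, start - flank)
    q_end = min(len(seq), end + flank)
    # Convert the 1-based inclusive range to a 0-based Python slice.
    return q_start, q_end, seq[q_start - 1:q_end]


def to_fasta(chrom: str, q_start: int, q_end: int, seq: str,
             width: int = 60) -> str:
    """Format one query as a FASTA record named by its genomic coordinates."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{chrom}:{q_start}-{q_end}\n" + "\n".join(lines) + "\n"
```

Because BLASTX translates the nucleotide query in all six reading frames, the excised sequence does not need to be reverse-complemented for regions on the minus strand. Keeping the query start coordinate in the FASTA header makes it straightforward to map alignment positions back to the genome in step 6.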
The output of your method will be a table of translated regions. For each entry, you should have a
row with the following entries:
1. The translated region and its coordinates.
2. The set of peptides.
3. The GenBank ID of the best match, with E-value, P-value, and length of alignment.
4. The set of peptides consistent with the alignment.
5. Coordinates of paralogs, if any.
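One possible shape for a row of this table, together with a simple peptide-support check for entry 4, is sketched below. Everything here is an assumption made for illustration: the class `RegionRow`, its field names, the tab-separated layout, and the substring test. The check treats I and L as interchangeable because the two residues have the same mass, and it only finds peptides that lie within a single translated exon, so peptides crossing a splice junction need separate handling. The P-value is derived from the E-value with the standard relation P = 1 - exp(-E).

```python
import csv
import math
from dataclasses import dataclass, field
from typing import Iterable, List, TextIO


def normalize(peptide: str) -> str:
    """Map I to L so that the two isobaric residues compare as equal."""
    return peptide.upper().replace("I", "L")


def supported_peptides(peptides: Iterable[str],
                       translated_exons: Iterable[str]) -> List[str]:
    """Return the peptides found inside at least one translated exon."""
    targets = [normalize(t) for t in translated_exons]
    return [p for p in peptides if any(normalize(p) in t for t in targets)]


@dataclass
class RegionRow:
    region: str                 # e.g. "chr1:1200-1350"
    peptides: List[str]         # all peptides given for the region
    best_hit_id: str            # GenBank ID of the best BLASTX match
    evalue: float
    alignment_length: int
    supported: List[str]        # peptides consistent with the alignment
    paralogs: List[str] = field(default_factory=list)

    @property
    def pvalue(self) -> float:
        # P = 1 - exp(-E); expm1 keeps precision for very small E.
        return -math.expm1(-self.evalue)


HEADER = ["region", "peptides", "best_hit_id", "evalue", "pvalue",
          "alignment_length", "supported_peptides", "paralogs"]


def write_table(rows: Iterable[RegionRow], handle: TextIO) -> None:
    """Write one tab-separated line per translated region."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for r in rows:
        writer.writerow([
            r.region,
            ";".join(r.peptides) or "-",
            r.best_hit_id,
            f"{r.evalue:.2e}",
            f"{r.pvalue:.2e}",
            r.alignment_length,
            ";".join(r.supported) or "-",
            ";".join(r.paralogs) or "-",
        ])
```

A dash marks an empty list so that every line keeps the same number of columns.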
Contact Natalie Castellana ([email protected]) for the translated regions identified by mass spectrometry.
References
[1] N. E. Castellana, S. H. Payne, Z. Shen, M. Stanke, V. Bafna, and S. P. Briggs, Discovery and revision
of Arabidopsis genes by proteogenomics, Proc. Natl. Acad. Sci. U.S.A. 105 (2008), 21034–21038.