CSE182 project

1 Validation of proteomics data by homology search

1.1 Background

Peptide mass spectrometry has become widely used for the identification of regions of the genome which are translated into protein [1]. The evidence from mass spectrometry is a collection of peptides, as shown in Figure 1. In many cases, the identified region falls within a known gene and is consistent with the protein produced by the gene. In other cases, the region represents a novel coding region. These novel regions can help to improve the annotation of the genome by identifying refinements to known genes, as shown in Figure 2, or completely novel genes. Notice that several overlapping or neighboring regions may indicate the same type of refinement. In the analysis below, these will be treated as a single translated region. Each region is also associated with a peptide sequence.

[Figure 1 diagram. Panel A, prokaryotic genes: operon with promoter, operator, genes 1-3, transcription initiation and transcription stop. Panel B, eukaryotic gene: enhancer, promoter, exons 1-3, introns, splice donor, splice acceptor, alternate splice junction, transcription initiation and stop, translation initiation and stop.]

Figure 1: A: Typical gene structure of a prokaryotic organism. B: Typical gene structure of a eukaryotic organism. The exons are transcribed into mRNA, and then translated into proteins.

If the gene annotation were complete, mass spectrometry would identify only the coding exon regions as translated. However, for the purposes of gene annotation, we cannot tolerate many errors. The translated regions indicated by mass spectrometry must be congruent with other sources of translational evidence in order to be believed. The goal of this project is to validate the regions identified by mass spectrometry using the homology search tool, BLAST. The organism we are studying is the eukaryotic plant Zea mays, also known as corn or maize.
[Figure 2 diagram. Refinement types labeled: Novel Exon, Exon Boundary, Novel Splice Junction, Translated UTR, and Out of Frame (frames +1 and +2).]

Figure 2: Translated regions of the genome identified by mass spectrometry are shown in white below the genome. Known genes are shown on the genome as grey boxes. All of the regions are inconsistent with a known gene and suggest that different types of gene model refinement are needed. The dotted lines indicate regions which were not included in the known gene but appear to be translated according to the mass spectrometry information.

1.2 Problem Description

Input: The input will be a collection of 'translated regions' from the maize genome.

Output: The output is a table of comparative genomics evidence for each translated region being correct.

1.3 Methods

1. Download the genome of Zea mays from maizesequence.org. Warning: this is a big genome.
2. Download the non-redundant protein database (nr) from GenBank. This is also a large database.
3. Download and install BLAST on a local computer.
4. Write a script to excise genomic regions around each translated region (100 nucleotides added to both sides). This forms the set of genomic queries.
5. Run BLASTX of the genomic queries against the non-redundant (nr) protein database.
6. For each translated region, determine if there exists a close protein homolog. If a homolog exists, then determine if peptides from the predicted region are supported by the alignment. Specifically,
   (a) Use the BLAST homolog alignments to obtain the coding exons (with inexact boundaries), and the frame in which they are translated.
   (b) Scan the alignment against the homolog to look for any frame-shifts or stop codons, as these indicate a pseudogene.
   (c) Of the n peptides given to you as support for the translated region, count how many are consistent with the target. Be careful, as some peptides cross splice junctions.
7. Next, take the translated protein sequence of the close homolog, and search the Z. mays genome for paralogs.
Report all of the coordinates of the paralogs on the Zea mays genome.

The output of your method will be a table of translated regions, with one row per region. Each row should contain the following entries:

1. The translated region and its coordinates.
2. The set of peptides.
3. The GenBank ID of the best match, with E-value, P-value, and length of alignment.
4. The set of peptides consistent with the alignment.
5. Coordinates of paralogs, if any.

Contact Natalie Castellana ([email protected]) for the translated regions identified by mass spectrometry.

References

[1] N. E. Castellana, S. H. Payne, Z. Shen, M. Stanke, V. Bafna, and S. P. Briggs, Discovery and revision of Arabidopsis genes by proteogenomics, Proc. Natl. Acad. Sci. U.S.A. 105 (2008), 21034-21038.
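As an illustration of step 4 in the Methods (excising each translated region with 100 nucleotides added to both sides), here is a minimal Python sketch. It is not a prescribed implementation. It assumes that the chromosome sequence is already loaded as a string, that region coordinates are 1-based and inclusive, and that a region may lie on either strand. The function names excise_region, reverse_complement, and to_fasta are hypothetical helpers introduced only for this example.

```python
FLANK = 100  # nucleotides added to each side, as stated in step 4

# Translation table for complementing DNA (N is kept as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def excise_region(chrom_seq, start, end, strand="+", flank=FLANK):
    """Cut out a translated region plus flanking sequence.

    Assumption: start and end are 1-based, inclusive coordinates on the
    forward strand. The window is clamped to the chromosome ends.
    Returns (query_sequence, query_start, query_end).
    """
    if start < 1 or end > len(chrom_seq) or start > end:
        raise ValueError("region coordinates fall outside the chromosome")
    q_start = max(1, start - flank)
    q_end = min(len(chrom_seq), end + flank)
    seq = chrom_seq[q_start - 1:q_end]  # convert to 0-based slicing
    if strand == "-":
        seq = reverse_complement(seq)
    return seq, q_start, q_end


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy chromosome: the 'translated region' ATGGCC sits at 151..156.
    chrom = "A" * 150 + "ATGGCC" + "T" * 50
    query, q_start, q_end = excise_region(chrom, 151, 156)
    print(to_fasta("region1_%d_%d" % (q_start, q_end), query), end="")
```

Keeping the query start and end alongside each sequence makes it possible to map BLASTX hit positions back to genome coordinates later. Since BLASTX translates the query in all six frames, the strand handling here is optional and mainly keeps the excised sequence in the orientation of the reported region.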