Guidelines for Genome Annotation - Muktak

Aklujkar, Muktak 1-5

Guidelines for Genome Annotation

There are a lot of genomes for us to annotate, and everyone is encouraged to participate. This is a good way to get one more paper to your credit. However, the criteria for authorship should be the same as for any other paper: you should make an intellectual contribution and discover something.

If you are an authority on any particular group of proteins, please share your insights by e-mailing them as an Excel file with seven columns:
1. The date you submit the file
2. Your name
3. Genome name and draft date
4. Gene number
5. Suggested gene name (optional)
6. Description
7. Evidence

You don’t have to limit yourself to your area of expertise. Just pick a topic that interests you, and follow the clues. Anyone can do this sort of research, and the more you find out, the higher up you will move in the list of authors.

The purpose of these genome annotation projects is not just to corroborate or qualify what we have learned from Geobacter sulfurreducens, but to record new discoveries that may be important later on. It is a huge task to attach meaningful labels to the proteins of unknown function:
1. Are they conserved?
2. Are they specific to the Geobacteraceae?
3. With what other proteins are they expressed?

Make sure that the orf-finding software doesn’t miss any orfs that should be on the microarrays, especially those that look like other orfs in the same genome! Check everything that looks suspicious:
1. Long intergenic regions
2. Truncated/extended N-termini
3. Overlapping genes
4. Orfs split into pieces by frameshifts
5. Orfs within repetitive DNA
6. Etc.

Double-check all work.

Questions to ask when you look at a gene product annotation:
1. Is it similar to any other protein? Is the alignment full-length or partial? Are there big gaps? Do the homologs have longer or shorter N-termini or C-termini? How many homologs are in the same genome?
2. If a function has been predicted, is a domain associated with that function present in the orf? If so, is it the complete domain? If not, was the function predicted reasonably? For which homolog is there experimental evidence of function?
3. Are there repeats within the orf? Are they plausible protein repeats, or DNA repeats that extend into the noncoding regions on either side?
4. What is on either side of the orf? If it’s a toxin, is the antidote encoded next to it? If there’s a big intergenic region, what is in it - any hairpins or repeats? If the adjacent orf has the same annotation, is it a duplication or is it another piece of the same protein, split up by a frameshift? How well is the gene arrangement conserved?
5. Is the protein typical of the Geobacteraceae or of a very different group of organisms - e.g., Archaea or Eukarya? Are there multiple proteins with the same function but different lineages? Does the protein belong to a mobile genetic element such as a plasmid or prophage?
6. Is there anything suspicious about the protein sequence? Lots of prolines and arginines? Is there a more plausible orf in a different reading frame? On the opposite strand? Is the orf on the same strand as the adjacent orfs? If not, does it seem to interrupt an operon of gene products that work together?

Genome Annotation Tools

The place for you to start looking at our Geobacteraceae genomes is http://www.geobacter.org/refs/genomes, where you can browse from gene to gene, getting a feel for how little we know. Or, you can click on the "ORFS" button and search for a word in the gene description, such as "kinase," if you are interested in a particular sort of protein. The Geobacteraceae genomes encode a lot of cytochromes, transcriptional regulators, sensor kinases, chemotaxis proteins, and transposases, and a whole lot of proteins of unknown function (a.k.a. “proteins of imminent importance”). Maybe it would be better if you went to http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=protein and searched for something like “laccase” or “sortase” - whatever nifty protein you heard about at a seminar - and used the sequence as a query in the UMass BLAST tool.

For each genome, there is a list of contigs (fragments of the genome for which we are fairly certain of the sequence), and for each contig there is a map showing where genes have been predicted on one DNA strand (red) or the other (blue). Bacterial genomes don't waste a lot of space, so if the genes are few and far between, we should take a closer look: have we missed any genes, or is there something else in the DNA that is interesting? The maps all look circular, but unless the genome is finished, the contigs are in fact linear. (The G. metallireducens plasmid is the only small contig that is truly circular.) Sometimes, you will see two independent predictions of genes (by the automated annotation at Oak Ridge National Laboratory (ORNL) of the Joint Genome Institute (JGI), and by our collaborator Julia Krushkal) that may not agree. I want to get rid of as many incorrect gene predictions as possible.

Each predicted gene has its own page, where you can see its DNA and protein sequences and functional description (if any exists). These pages are NOT handmade; if something on the page doesn't make sense, it probably doesn't belong there! Click on "BLAST this sequence" to use the UMass BLAST tool to find similar genes in our other genomes. The ">" symbol is required at the start of the first line of the query (called the FASTA header) to signal that this line is the gene name, not part of the query sequence. The BLAST results (or "hits") will include a number called the "e-value" of each hit, which tells you "how many times you would expect to find this much similarity in a database of this size, just by chance."
An e-value less than 10^-5 is a decent indication that two sequences had a common origin somewhere, and the closer to zero the e-value is, the more certain you can be that the two sequences are variations on the same theme. Keep in mind, however, that BLAST is a "local alignment" tool that matches pieces of sequences; the two sequences may not be similar over their entire lengths. You may want to pull out the protein sequences and generate an alignment, or a phylogenetic tree showing their relationships, which you can do at http://clustalw.genome.jp (be sure to include a FASTA header before each protein sequence).

BLAST can also be done at the National Center for Biotechnology Information (NCBI) website - http://www.ncbi.nlm.nih.gov/BLAST - where you have the advantage of seeing what domains your protein contains, and what they might do. This site does not have several of the genomes-in-progress that you can find at http://genome.jgi-psf.org/mic_home.html - of particular interest to us are the genomes with names that begin with “Desulf...” because these are relatives of the Geobacteraceae.

Other places to find out more about a protein sequence are:
http://www.psort.org/psortb/index.html to predict where in the cell it goes
http://www.sbg.bio.ic.ac.uk/~phyre to predict how it folds (alpha-helices, beta-strands, coils)
http://www.cbs.dtu.dk/services/TMHMM to predict membrane protein topology (how many transmembrane segments and which way they go into the membrane)

The so-called "positive-inside rule" is that proteins thread in and out of the membrane so that most of the lysines and arginines in between the hydrophobic segments end up in the cytoplasm. Although lysine and arginine carry a positive charge, the rule may have nothing to do with that, because the other charged amino acids (histidine, aspartate, glutamate) don’t affect topology. Rather, it may be significant that lysine and arginine (and methionine) have long, slender side chains.
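If you want to see how the positive-inside rule plays out on a protein of your own, here is a minimal Python sketch. It assumes you already have the transmembrane segment coordinates (for example, copied by hand from a TMHMM prediction). The function name `predict_orientation` and the 0-based, end-exclusive coordinates are assumptions made for illustration, not part of any of the tools listed above. It only counts lysines and arginines in alternating loops and guesses that the side with more of them is cytoplasmic.

```python
def predict_orientation(protein, tm_segments):
    """Guess which side of the membrane the N-terminus is on.

    protein     -- amino acid sequence (one-letter codes)
    tm_segments -- list of (start, end) pairs, 0-based and end-exclusive,
                   sorted from N- to C-terminus (assumed to come from a
                   separate prediction; nothing is predicted here)
    """
    protein = protein.upper()
    # Boundaries of every loop, including the N- and C-terminal tails.
    bounds = [0] + [pos for seg in tm_segments for pos in seg] + [len(protein)]
    loops = [protein[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]

    def count_kr(loop):
        return loop.count("K") + loop.count("R")

    # Loops alternate sides: loops 0, 2, 4... share a side with the N-terminus.
    n_side = sum(count_kr(loop) for loop in loops[0::2])
    other_side = sum(count_kr(loop) for loop in loops[1::2])

    if n_side > other_side:
        return "N-terminus in cytoplasm"
    if other_side > n_side:
        return "N-terminus outside"
    return "ambiguous"


# Toy protein (made up): two hydrophobic segments with K/R-rich tails.
toy = "MKRK" + "L" * 20 + "DDE" + "A" * 20 + "RKKR"
print(predict_orientation(toy, [(4, 24), (27, 47)]))
```

The rule is usually applied to short loops, and this sketch counts every loop the same, so treat the answer as a hint rather than a prediction.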
You might also find some useful tools at http://molbiol-tools.ca

Back to our own website... You can use the Sequence Extractor tool from each gene's page to pull out the DNA sequence and adjust the numbers to include the sequences on either side of the gene. Not all genes start with an "ATG" codon; "GTG" is fairly common, and others such as "TTG" and "ATC" have also been observed in nature. However, there should be a "putative ribosome-binding sequence" similar to AGGAGGT on the 5' side of the start codon, separated from it by 4 to 11 bases. This sequence pairs with the 3' end of the 16S ribosomal RNA within the ribosome, positioning the messenger RNA so that the ribosome knows where the start codon is, and can start to translate protein.

If you copy the sequence into a program such as DNA Strider (available from me for Macintosh computers only) you can identify other features of the sequence, such as hairpins (a.k.a. stem-loops - palindromic DNA sequences such as GTGAATcatgttATTCAC in which potential base-pairing is shown in capital letters). A strong hairpin can terminate transcription at the end of an operon (a co-transcribed set of adjacent genes on the same strand), whereas the weak hairpin in the example above blocks an ATG start codon, so that the gene can only be translated if ribosomes translating the previous gene (ending with TGA) reach its end and disrupt the hairpin. Other things that you can compute with DNA Strider are the amino acid composition of a protein and its codon usage.

In genomes that are rich in G and C bases, like those of the Geobacteraceae, the three stop codons (TAA, TAG, TGA) are infrequent, and so you often find overlapping open reading frames of considerable length. Which one is really a gene?
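Two of the sequence checks described above, the ribosome-binding sequence upstream of a start codon and the hairpin, can be roughed out in a few lines of Python if DNA Strider is not at hand. This is only a sketch under stated assumptions: the function names, the mismatch allowance of 2, the loop length limits, and the reading of "separated by 4 to 11 bases" as the gap between the end of the AGGAGGT-like motif and the first base of the start codon are illustrative choices, not settings taken from DNA Strider or from our website.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_rbs(dna, start_pos, motif="AGGAGGT", min_gap=4, max_gap=11,
             max_mismatch=2):
    """List (gap, mismatches) for motif-like windows upstream of a start codon.

    start_pos is the 0-based index of the first base of the start codon.
    max_mismatch=2 is an arbitrary allowance chosen for this sketch.
    """
    dna = dna.upper()
    hits = []
    for gap in range(min_gap, max_gap + 1):
        end = start_pos - gap
        begin = end - len(motif)
        if begin < 0:
            break  # ran off the 5' end of the sequence
        window = dna[begin:end]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatch:
            hits.append((gap, mismatches))
    return hits


def find_hairpins(dna, stem_len=6, min_loop=3, max_loop=10):
    """List (stem_start, loop_len) where a stem pairs perfectly with a
    downstream copy of its reverse complement. Only perfect stems are found."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 2 * stem_len - min_loop + 1):
        partner = reverse_complement(dna[i:i + stem_len])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop
            if j + stem_len > len(dna):
                break
            if dna[j:j + stem_len] == partner:
                hits.append((i, loop))
    return hits


# The hairpin example from the text: a 6-base stem around a 6-base loop.
print(find_hairpins("GTGAATcatgttATTCAC"))

# A made-up upstream region: perfect motif, 6 bases, then an ATG.
toy = "CC" + "AGGAGGT" + "ACGTAC" + "ATGAAA"
print(find_rbs(toy, start_pos=15))
```

Real hairpins tolerate mismatches and G-T pairs in RNA, which this sketch ignores, so a negative result here does not rule out a stem-loop.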
You can make an educated guess about which overlapping open reading frame is a real gene by considering that the proline codons (CCA, CCC, CCG, and CCT) would be fairly common in open reading frames of a GC-rich genome that are not real genes, but most real proteins don’t contain a lot of proline (the protein would have trouble forming stable secondary structure, because the backbone nitrogen of proline can’t donate a hydrogen bond to anything). Likewise, the arginine codons (AGA, AGG, CGA, CGC, CGG, CGT) are common in noncoding open reading frames, but a protein bristling with positively charged arginines isn’t very likely to be useful.

The codon usage of a predicted protein can also be used to make an educated guess about how likely it is to be real. If you see a lot of the codons GGA (glycine), AGA and CGA (arginine) instead of the alternatives, you can imagine how hard it would be for the gene to survive when a single base mutation would introduce a TGA stop codon at any of these locations. Maybe it’s not a real protein.

The codon usage can be used to calculate the Codon Adaptation Index (CAI), which measures how well a predicted protein matches the codon usage of highly expressed proteins in the same species. A high CAI suggests that the protein is real and important; a low CAI could mean that the protein is expressed at a low level because too much of it would be bad for the cell.
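The codon-based checks above can be sketched in Python as follows. This is an illustration under assumptions, not the calculation DNA Strider performs: the function names are made up, the reference set of "highly expressed" genes is something you would have to supply yourself, and the pseudocount of 0.5 for codons never seen in the reference set is one common convention adopted here so that a single rare codon does not force the index to zero. The CAI is computed in the usual way, as the geometric mean over the codons of the gene of each codon's frequency relative to the most-used synonymous codon in the reference set, leaving out methionine, tryptophan, and stop codons.

```python
import math
from collections import Counter

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def codons_of(orf):
    """Split an in-frame DNA sequence into codons, dropping any partial one."""
    orf = orf.upper()
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def composition_flags(orf):
    """Fractions of proline codons, arginine codons, and codons that are one
    base change away from TGA (only GGA, AGA, CGA, as named in the text)."""
    cods = codons_of(orf)
    n = len(cods) or 1
    return {
        "proline": sum(CODON_TABLE.get(c) == "P" for c in cods) / n,
        "arginine": sum(CODON_TABLE.get(c) == "R" for c in cods) / n,
        "near_TGA": sum(c in {"GGA", "AGA", "CGA"} for c in cods) / n,
    }


def relative_adaptiveness(reference_orfs):
    """Weight of each codon = its count / count of the most-used synonym,
    taken over a user-supplied set of highly expressed genes."""
    counts = Counter(c for orf in reference_orfs for c in codons_of(orf))
    weights = {}
    for aa in set(AMINO) - {"*", "M", "W"}:
        family = [c for c in CODONS if CODON_TABLE[c] == aa]
        best = max(counts[c] for c in family)
        if best == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # Pseudocount of 0.5 for unseen codons (an assumption).
            weights[c] = max(counts[c], 0.5) / best
    return weights


def cai(orf, weights):
    """Geometric mean of the weights of the codons in orf."""
    logs = [math.log(weights[c]) for c in codons_of(orf) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (made up): GGC used three times, GGA once.
w = relative_adaptiveness(["GGCGGCGGCGGA"])
print(cai("GGCGGC", w), cai("GGAGGA", w))
print(composition_flags("CCGAGAGGATTT"))
```

With a real reference set, such as the ribosomal protein genes of the same genome, you would compare the CAI of a doubtful orf with the CAIs of genes you trust rather than read anything into the absolute number.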
