Using the Artemis sequence viewer and annotation tool
About this document
This document is very much “work in progress”, so if you have any comments or
suggestions, do not hesitate to contact me.
Installing the software
Artemis can be downloaded from http://www.sanger.ac.uk/Software/Artemis/ . This site
also contains installation instructions. The Artemis software can also be downloaded
from http://mycor.nancy.inra.fr/IMGC/LaccariaGenome/Annotation/download.html. You
can download Artemis documentation from the same address.
Getting the correct scaffolds
The available scaffolds are based on the 15 March 2005 assembly. Sequences are split into
segments of 1 Mb or smaller, with an overlap of 10 kb.
Sequences are named scaffold_XXX_YYY-ZZZ with XXX the number of the scaffold in
the 20050315 assembly and YYY-ZZZ the range of the sequence in the original scaffold.
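The naming scheme can be unpacked with a few lines of Python. This is only a sketch: it assumes that YYY and ZZZ are 1-based, inclusive positions in the original scaffold, and the function names are made up for this illustration. The segment name scaffold_4_990001-1500000 used in the second call is a hypothetical example, not necessarily a file that exists on the server.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    scaffold: int  # XXX: scaffold number in the 20050315 assembly
    start: int     # YYY: first base of the segment in the original scaffold
    end: int       # ZZZ: last base of the segment in the original scaffold


def parse_segment_name(name: str) -> Segment:
    """Split a name of the form scaffold_XXX_YYY-ZZZ into its three numbers."""
    match = re.fullmatch(r"scaffold_(\d+)_(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"not a scaffold segment name: {name!r}")
    scaffold, start, end = (int(group) for group in match.groups())
    return Segment(scaffold, start, end)


def to_scaffold_coordinate(name: str, position: int) -> int:
    """Convert a 1-based position inside a segment to the original scaffold.

    Assumption: segment coordinates are 1-based and inclusive.
    """
    segment = parse_segment_name(name)
    return segment.start + position - 1


print(parse_segment_name("scaffold_4_1-1000000"))
# Hypothetical second segment of the same scaffold:
print(to_scaffold_coordinate("scaffold_4_990001-1500000", 10))
```

For the first segment of a scaffold (YYY = 1) the segment coordinates and the scaffold coordinates are identical, which is the case for the sequence used in the example below.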
To determine the scaffolds that contain your gene of interest you can use the BLAST
server at http://mycor.nancy.inra.fr/IMGC/LaccariaGenome/Annotation/blastlaccaria.php .
You can download the scaffolds you want to work with from
http://mycor.nancy.inra.fr/IMGC/LaccariaGenome/Annotation/scaffold.php?start=0&search=
Getting additional data
I have run a tBLASTx of the Laccaria scaffolds against both Coprinus and
Cryptococcus. The results are formatted so that you can load them into Artemis as a
visual aid. Of course you can expect these files to include a lot of false positives, but I’ve
found them very helpful nonetheless. For each genome I have made 2 sets available: one
filtered to a BLAST e-value of 10^-10 or less and one filtered to an e-value of 10^-50 or less.
You can download these files from http:// .
The mapping of the EST data is formatted in the same way. These files can also be
downloaded from http:// .
Example
Let’s try to annotate the NADP-dependent glutamate dehydrogenase 2 in Laccaria using
the yeast sequence from SwissProt.
The yeast sequence looks like this:
>DHE5_YEAST (P39708) NADP-specific glutamate dehydrogenase 2 (EC
1.4.1.4) (NADP-GDH 2) (NADP-dependent glutamate dehydrogenase 2)
MTSEPEFQQAYDEIVSSVEDSKIFEKFPQYKKVLPIVSVPERIIQFRVTWENDNGEQEVA
QGYRVQFNSAKGPYKGGLRFHPSVNLSILKFLGFEQIFKNALTGLDMGGGKGGLCVDLKG
KSDNEIRRICYAFMRELSRHIGKDTDVPAGDIGVGGREIGYLFGAYRSYKNSWEGVLTGK
GLNWGGSLIRPEATGFGLVYYTQAMIDYATNGKESFEGKRVTISGSGNVAQYAALKVIEL
GGIVVSLSDSKGCIISETGITSEQIHDIASAKIRFKSLEEIVDEYSTFSESKMKYVAGAR
PWTHVSNVDIALPCATQNEVSGDEAKALVASGVKFVAEGANMGSTPEAISVFETARSTAT
NAKDAVWFGPPKAANLGGVAVSGLEMAQNSQKVTWTAERVDQELKKIMINCFNDCIQAAQ
EYSTEKNTNTLPSLVKGANIASFVMVADAMLDQGDVF
Using ungapped tBLASTn against the Laccaria assembly with the BLOSUM62 matrix
and an expect cutoff of 0.0001 gives us this BLAST report:
TBLASTN 2.2.8 [Jan-05-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= DHE5_YEAST (P39708) NADP-specific glutamate dehydrogenase 2 (EC
1.4.1.4) (NADP-GDH 2) (NADP-dependent glutamate dehydrogenase 2)
(457 letters)
Database: laccaria_genome
686 sequences; 65,096,429 total letters
Searching..done
                                                              Score    E
Sequences producing significant alignments:                   (bits) Value  N
scaffold_4_1-1000000                                            104  e-126  8
>scaffold_4_1-1000000
Length = 1000000
Score = 104 bits (224), Expect(8) = e-126
Identities = 39/57 (68%), Positives = 47/57 (82%)
Frame = +1
Query: 147    VPAGDIGVGGREIGYLFGAYRSYKNSWEGVLTGKGLNWGGSLIRPEATGFGLVYYTQ 203
              + AGDIG G REIGYLFGAY+  +N + G+LTGKGL WGGS IRPEATG+GL+YY +
Sbjct: 194707 IVAGDIGTGAREIGYLFGAYKKLQNEFVGMLTGKGLAWGGSFIRPEATGYGLIYYVE 194877
Score = 95.7 bits (204), Expect(8) = e-126
Identities = 36/61 (59%), Positives = 48/61 (78%)
Frame = +2
Query: 366    VWFGPPKAANLGGVAVSGLEMAQNSQKVTWTAERVDQELKKIMINCFNDCIQAAQEYSTE 425
              VW+ P KA+N GGVAVSGLEMAQNSQ++ WT ++VDQ+LKKIM  C+  C+ A  ++S E
Sbjct: 195500 VWYAPGKASNCGGVAVSGLEMAQNSQRLAWTTDQVDQKLKKIMAECYEICLSAGTKWSGE 195679
Query: 426    K 426
              +
Sbjct: 195680 E 195682
Score = 83.4 bits (177), Expect(8) = e-126
Identities = 34/38 (89%), Positives = 35/38 (92%)
Frame = +3
Query: 66     QFNSAKGPYKGGLRFHPSVNLSILKFLGFEQIFKNALT 103
              Q+NSA GPYKGGLR HPSVNLSILKFLGFEQ FKNALT
Sbjct: 194409 QYNSALGPYKGGLRLHPSVNLSILKFLGFEQTFKNALT 194522
Score = 79.8 bits (169), Expect(8) = e-126
Identities = 36/61 (59%), Positives = 45/61 (73%)
Frame = +3
Query: 221    VTISGSGNVAQYAALKVIELGGIVVSLSDSKGCIISETGITSEQIHDIASAKIRFKSLEE 280
              V ISGSGNVAQ+ ALKVIELG  V+SLSDSKG +I+E G T E I +I   K++  +LE
Sbjct: 194988 VAISGSGNVAQFTALKVIELGATVLSLSDSKGSLIAEKGYTKEFIKEIGQLKLKGGALES 195167
Query: 281    I 281
              +
Sbjct: 195168 L 195170
Score = 63.4 bits (133), Expect(8) = e-126
Identities = 27/46 (58%), Positives = 35/46 (76%)
Frame = +3
Query: 297    AGARPWTHVSNVDIALPCATQNEVSGDEAKALVASGVKFVAEGANM 342
              AG RPW+ +  V +ALP ATQNEVS  EA+ L+ +GV+ VAEG+NM
Sbjct: 195249 AGKRPWSLLPVVHVALPGATQNEVSKTEAEDLIKAGVRIVAEGSNM 195386
Score = 63.4 bits (133), Expect(8) = e-126
Identities = 22/39 (56%), Positives = 31/39 (79%)
Frame = +3
Query: 28     PQYKKVLPIVSVPERIIQFRVTWENDNGEQEVAQGYRVQ 66
              P Y+K L IV +PER++QFRV WE+D G+ +V +G+RVQ
Sbjct: 194232 PDYEKALEIVQIPERVLQFRVVWEDDQGKAQVNRGFRVQ 194348
Score = 51.1 bits (106), Expect(8) = e-126
Identities = 20/27 (74%), Positives = 22/27 (81%)
Frame = +3
Query: 122    SDNEIRRICYAFMRELSRHIGKDTDVP 148
              SD EIRR C +FM EL RHIG+DTDVP
Sbjct: 194577 SDGEIRRFCTSFMSELFRHIGQDTDVP 194657
Score = 42.5 bits (87), Expect(8) = e-126
Identities = 17/31 (54%), Positives = 21/31 (67%)
Frame = +2
Query: 425    EKNTNTLPSLVKGANIASFVMVADAMLDQGD 455
              E     LPSL+ GAN+A F+ VADAM + GD
Sbjct: 195680 EIKDGVLPSLLSGANVAGFIKVADAMREHGD 195772
Database: laccaria_genome
Posted date: May 4, 2005 3:15 PM
Number of letters in database: 65,096,429
Number of sequences in database: 686
Lambda     K      H
   0.315    0.133    0.380
Matrix: BLOSUM62
Number of Hits to DB: 28,044,928
Number of Sequences: 686
Number of extensions: 351513
Number of successful extensions: 43955
Number of sequences better than 1.0e-04: 2
length of query: 457
length of database: 21,698,809
effective HSP length: 55
effective length of query: 402
effective length of database: 21,661,079
effective search space: 8707753758
effective search space used: 8707753758
frameshift window, decay const: 50, 0.5
T: 13
A: 40
X1: 16 ( 7.3 bits)
X2: 32 (14.6 bits)
S1: 41 (21.6 bits)
S2: 96 (46.6 bits)
If we look at the first section of the BLAST report we see that our query sequence is 457
amino acids long and that we have 1 hit with a very good score, in scaffold_4_1-1000000.
We download this sequence and save it in the file scaffold_4_1-1000000.embl.
Doing the manual annotation
Before we can annotate the correct gene structure in Artemis it helps to know what kind of
introns we can expect in this genome. Information on this is available in the file Laccaria
bicolor introns.doc. This file should be present in the same location as the document
you are reading.
Start Artemis and load the file you just downloaded. This will give you an Artemis
window looking like this:
The Artemis window is composed of 3 main parts. The first (top) part gives you an
overview of the entire sequence. Using the slider at the bottom you can scroll through the
sequence. The slider on the right lets you zoom in and out on the sequence. In the
middle of this section you see the nucleotide numbering with one dark grey bar above
and one below. These bars will later contain a graphical representation of our annotation.
Above and below these there are also 3 light-grey bars each; they represent the translation
of the sequence in the 6 reading frames. The black lines in these bars indicate stop codons
at the corresponding positions.
The second (middle) part of the Artemis window shows a maximally zoomed-in view of
the sequence. The sequence and its reverse complement are shown in the middle and the
six-frame translation is given on the 6 bars above and below the sequence. Stop codons
are represented by the symbols +, * and #.
The third (bottom) part of the window lists the annotated features and is currently empty.
To start annotating our gene we have a look at the second part of the BLAST report. The
text Expect(8) = e-126 tells us that BLAST found a group of 8 consistently ordered HSPs in
the genome sequence. Taken together these HSPs get an e-value of 10^-126. If this gene is
completely covered by the BLAST hit, it can thus have a maximum of 8 exons. It might
have fewer: since we used ungapped BLAST, a real gap in the alignment would
have split a single exon into 2 HSPs. We browse through all HSPs in this group (all those
with the text “Expect(8) = e-126”; there is only 1 group in this example) and we find the 2
extreme HSPs:
Score = 63.4 bits (133), Expect(8) = e-126
Identities = 22/39 (56%), Positives = 31/39 (79%)
Frame = +3
Query: 28     PQYKKVLPIVSVPERIIQFRVTWENDNGEQEVAQGYRVQ 66
              P Y+K L IV +PER++QFRV WE+D G+ +V +G+RVQ
Sbjct: 194232 PDYEKALEIVQIPERVLQFRVVWEDDQGKAQVNRGFRVQ 194348
and
Score = 42.5 bits (87), Expect(8) = e-126
Identities = 17/31 (54%), Positives = 21/31 (67%)
Frame = +2
Query: 425    EKNTNTLPSLVKGANIASFVMVADAMLDQGD 455
              E     LPSL+ GAN+A F+ VADAM + GD
Sbjct: 195680 EIKDGVLPSLLSGANVAGFIKVADAMREHGD 195772
We noted earlier that the length of our query (yeast) sequence is 457, so it seems this last
HSP corresponds to the end of the last exon. The first HSP only starts at position 28 of
the yeast sequence so we might be missing the first exon.
We’ll first add all the BLAST HSPs to the sequence.
From the BLAST report we note the location of each HSP on the assembly. These are:
194707..194877
195500..195682
194409..194522
194988..195170
195249..195386
194232..194348
194577..194657
195680..195772
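If you prefer not to copy these coordinates by hand, they can be pulled out of the text report with a short script. The sketch below assumes the classic NCBI text layout shown above, in which every HSP starts with a “Score =” line and its “Sbjct:” lines carry the start and end coordinates; the function name is made up for this illustration.

```python
import re
from typing import List, Tuple

# Two HSPs copied from the report above; the first one wraps onto a second line.
REPORT_EXCERPT = """\
Score = 95.7 bits (204), Expect(8) = e-126
Sbjct: 195500 VWYAPGKASNCGGVAVSGLEMAQNSQRLAWTTDQVDQKLKKIMAECYEICLSAGTKWSGE 195679
Sbjct: 195680 E 195682
Score = 83.4 bits (177), Expect(8) = e-126
Sbjct: 194409 QYNSALGPYKGGLRLHPSVNLSILKFLGFEQTFKNALT 194522
"""

SBJCT_LINE = re.compile(r"^Sbjct:\s+(\d+)\s+\S+\s+(\d+)$")


def subject_ranges(report: str) -> List[Tuple[int, int]]:
    """Return one (start, end) scaffold range per HSP in a BLAST text report."""
    hsps: List[List[int]] = []
    for raw_line in report.splitlines():
        line = raw_line.strip()
        if line.startswith("Score ="):
            hsps.append([])  # every "Score =" line opens a new HSP
            continue
        match = SBJCT_LINE.match(line)
        if match and hsps:
            hsps[-1] += [int(match.group(1)), int(match.group(2))]
    # Wrapped HSPs contribute several Sbjct lines; keep the outermost positions.
    return [(min(coords), max(coords)) for coords in hsps if coords]


for start, end in subject_ranges(REPORT_EXCERPT):
    print(f"{start}..{end}")
```

Run on the full report, this should give the same 8 ranges as the list above.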
Enter each of these ranges into Artemis with the menu “Create -> New feature”. In the new
window, select BLASTCDS as “key” and fill in the correct coordinates as shown in this
figure.
Do the same for the remaining 7 features. Once you are a bit more familiar with this
process you will want to skip this first step, but for this tutorial I think it’s a good idea.
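One possible way to skip the manual step is to write the HSPs out as feature-table lines and load that file as an extra entry. The sketch below only builds the lines. It assumes the EMBL feature-table layout (“FT”, the key starting in column 6, the location starting in column 22) and assumes that your Artemis version accepts such a bare feature table through “File -> Read An Entry...”; please check this on a copy of your data first.

```python
from typing import Iterable, List, Tuple

# Scaffold coordinates of the 8 HSPs, as listed above.
HSP_RANGES = [
    (194707, 194877), (195500, 195682), (194409, 194522), (194988, 195170),
    (195249, 195386), (194232, 194348), (194577, 194657), (195680, 195772),
]


def feature_table_lines(ranges: Iterable[Tuple[int, int]],
                        key: str = "BLASTCDS") -> List[str]:
    """Format forward-strand ranges as EMBL-style feature table lines.

    "FT" plus 3 spaces fills columns 1-5, the key is padded so that the
    location starts in column 22.
    """
    return [f"FT   {key:<16}{start}..{end}" for start, end in sorted(ranges)]


for line in feature_table_lines(HSP_RANGES):
    print(line)
```

The lines are sorted by position, so they come out in the order in which the HSPs lie on the scaffold.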
Now double-click on one of the created features in the bottom part of the screen and
Artemis will center on this feature. Your screen should look something like this:
To facilitate the discussion I have numbered the 8 HSP segments from 1 to 8, in the order
in which they lie on the scaffold. There are no stop
codons between 1 and 3 so HSPs 1, 2 and 3 could form 1 exon. From HSP 3 to 4 we go to
another reading frame so there will probably be an intron between these HSPs. The same
goes for HSPs 4 and 5 and for HSPs 6 and 7. We can see there are stop codons between HSPs
5 and 6 so there will be an intron between these. HSPs 7 and 8 overlap so they will be
merged into 1 exon.
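The numbering and the frame changes can be reproduced from the BLAST coordinates alone. The sketch below assumes forward-strand hits and the usual BLAST convention that a hit starting at base s of the subject lies in frame (s - 1) mod 3 + 1; the class and function names are made up for this illustration.

```python
from typing import List, NamedTuple, Tuple


class Hsp(NamedTuple):
    q_start: int  # first aligned residue in the yeast query
    q_end: int    # last aligned residue in the yeast query
    s_start: int  # first aligned base on scaffold_4_1-1000000
    s_end: int    # last aligned base on scaffold_4_1-1000000


# The 8 HSPs in the order in which they appear in the BLAST report.
HSPS = [
    Hsp(147, 203, 194707, 194877),
    Hsp(366, 426, 195500, 195682),
    Hsp(66, 103, 194409, 194522),
    Hsp(221, 281, 194988, 195170),
    Hsp(297, 342, 195249, 195386),
    Hsp(28, 66, 194232, 194348),
    Hsp(122, 148, 194577, 194657),
    Hsp(425, 455, 195680, 195772),
]


def forward_frame(s_start: int) -> int:
    """Reading frame (1, 2 or 3) of a forward-strand hit starting at s_start."""
    return (s_start - 1) % 3 + 1


def number_by_position(hsps: List[Hsp]) -> List[Hsp]:
    """Order the HSPs along the scaffold; list index + 1 is the HSP number."""
    return sorted(hsps, key=lambda hsp: hsp.s_start)


def frame_changes(hsps: List[Hsp]) -> List[Tuple[int, int]]:
    """Pairs of neighbouring HSP numbers that lie in different frames."""
    frames = [forward_frame(hsp.s_start) for hsp in number_by_position(hsps)]
    return [(i, i + 1) for i in range(1, len(frames))
            if frames[i - 1] != frames[i]]


for number, hsp in enumerate(number_by_position(HSPS), start=1):
    print(f"HSP {number}: {hsp.s_start}..{hsp.s_end} "
          f"frame +{forward_frame(hsp.s_start)} query {hsp.q_start}..{hsp.q_end}")
print("frame changes between HSPs:", frame_changes(HSPS))
```

A frame change always needs an intron (or a frameshift) between the two HSPs. The absence of a frame change does not rule one out, as HSPs 5 and 6 show: both are in frame 3 but are separated by stop codons.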
First we need to check if we have one or more ESTs available for this gene. If we do,
this will make our job a lot easier. To check this we download the gff file with the
Laccaria EST data and load it into Artemis. For this sequence this file is called
scaffold_4_1-1000000.lbEST.gff. Load it into Artemis using the “File -> Read An
Entry...” menu. We see there are no ESTs matching this gene so we’ll have to do
everything by hand. To see an example of a gene matched by an EST you can scroll to
position 326000 (or look at the next screenshot). There you can clearly see the
intron/exon boundaries.
We can’t use the EST data for this gene so we can unload the EST gff file from Artemis
by clicking “Entries -> Remove An Entry -> scaffold_4_1-1000000.lbEST.gff”.
Go back to our glutamate dehydrogenase by double-clicking one of the BLASTCDS
entries in the lower part of the Artemis window.
For the purpose of this tutorial we will annotate this gene in the 3’ -> 5’ direction,
contrary to the natural 5’ -> 3’ direction. For most genes we suggest that you start
from the 5’ end. This gene is nevertheless a nice one for a tutorial because it contains both
“easy” and “hard” parts. Unfortunately the easy parts, with which we’ll start, are at the 3’
end.
Let’s start with the last intron (between HSPs 6 and 7). In the BLAST report we see that
HSP 6 stops at position 342 while HSP 7 only starts at position 366 of the yeast
sequence. This means it’s likely that there is still some coding sequence between these 2
HSPs that was not detected by BLAST. Zoom in on the sequence and try to find the intron.
We need to take into account that HSP 6 is in frame 3 and HSP 7 is in frame 2, so we
need to select a GT...AG pair that respects this. (GT...AG introns seem to be the most
common in Laccaria, but there is also a much lower number of GC...AG introns.)
Position 195387 looks like a very good splice donor site. If we take into account the fact
that most Laccaria introns have a length in the 40-70 nt range we have only 2 possible
splice acceptors: 195439 and 195452. The potential splice acceptors at 195473, 195485
and 195521 would give us unusually long introns. If we then take into account the
constraint that our intron needs to respect the reading frames of the 2 flanking exons, we
can disqualify all potential splice acceptors except 195439. If you find 2 AGs within a
few nt of each other you should select the most 5’ AG because of the way the splicing
mechanism works: the spliceosomal machinery binds to the branch point and from there on
starts scanning the sequence in the 5’->3’ direction for the AG splice acceptor. Because of
this scanning mechanism the first AG will usually be selected.
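The two filters used here (intron length and reading frame) are easy to write down. In the sketch below a donor position is taken to be the first base of the intron and an acceptor position the last base, which is how the coordinates in this tutorial are used. An intron of length L moves the downstream exon from frame f to frame (f - 1 + L) mod 3 + 1. The function names and the default 40-70 nt window are choices made for this illustration.

```python
from typing import Iterable, List


def next_frame(frame: int, intron_length: int) -> int:
    """Frame (1-3) of the downstream exon after removing an intron."""
    return (frame - 1 + intron_length) % 3 + 1


def candidate_acceptors(donor: int, acceptors: Iterable[int],
                        frame_before: int, frame_after: int,
                        min_len: int = 40, max_len: int = 70) -> List[int]:
    """Keep the acceptors that give a plausible length and the right frame.

    donor is the first base of the intron, each acceptor its last base.
    """
    kept = []
    for acceptor in acceptors:
        length = acceptor - donor + 1
        if not min_len <= length <= max_len:
            continue  # unusually short or long for a Laccaria intron
        if next_frame(frame_before, length) != frame_after:
            continue  # downstream exon would end up in the wrong frame
        kept.append(acceptor)
    return kept


# Intron between HSP 6 (frame 3) and HSP 7 (frame 2).
print(candidate_acceptors(195387,
                          [195439, 195452, 195473, 195485, 195521],
                          frame_before=3, frame_after=2))
```

For the donor at 195387 only 195439 survives both filters: 195452 gives a 66 nt intron that would leave the next exon in frame 3, and the other three candidates give introns of 87 nt or more.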
We select these nucleotides: click on nucleotide 195387 and drag the mouse to
nucleotide 195439. Now we turn our selection into an intron by clicking “Create ->
Create Feature From Base Range”. Change the key to “intron” and click OK.
Your Artemis window should now look something like this:
Let’s immediately include the last exon. We remember from the BLAST report that the
last HSP should be close to the end of the gene and indeed we see a stop codon 2 codons
beyond the end of this HSP. We select the nucleotides from 195440 to 195781 (do this by
first clicking on 195440, then scrolling to the right and finally shift-clicking on 195781).
Make this an exon by clicking “Create -> Create Feature From Base Range”. Change the
key to “exon” and click OK.
We’ll have a look at the intron between HSPs 5 and 6 now. From the BLAST report we
learn that HSP 5 stops at 281 in the yeast sequence and HSP 6 starts at 297, so again we
expect to include a few amino acids that were not detected by BLAST. Since there is a
stop codon before HSP 6 we expect to find most of these missing amino acids by
extending HSP 5. This stop codon also means we have only 1 valid splice acceptor near
HSP 6: 195252. For splice donors we can choose from 195167, 195198 (a GC splice
donor), 195202 and 195214. Since we think we should extend the HSP we don’t choose
195167 (this one is inside HSP 5). 195198 would not put us in the correct reading frame
when combined with our splice acceptor and 195214 would give us an intron of less than 40
nucleotides, so we choose 195202. Select the intron and annotate it as before. You can
now also select and annotate the second-to-last exon.
Next is the intron between HSPs 4 and 5. Zoom in on this area. Your Artemis window
now looks like this:
The BLAST report tells us that HSP 4 stops at 203 while HSP 5 only starts at 221 in the
yeast sequence, so again we expect to have to extend the HSPs. Position 194881 has the
only clean splice donor. Now we find 4 splice acceptors that put us in the right reading
frame: 194936, 194960, 194975 and 194981. With no other information, we will just
select the first one and check our protein later by aligning it to known homologs.
Let’s move on to the intron between HSPs 3 and 4. From the BLAST report we see that
HSP3 stops at 148 while HSP4 starts at 147 in the yeast sequence. This means we’ll
probably have to remove a few amino acids from the HSPs. The only valid combination
here seems to be 194659..194713 so we select this as our intron.
HSPs 1, 2 and 3 could form 1 large exon because there are no stop codons between
them. To make sure this is correct we have another look at our BLAST report. If we look
at the location of these HSPs on the yeast sequence we see this:
28..(HSP1)..66 66..(HSP2)..103 122..(HSP3)..148
The 18 amino acids missing between HSPs 2 and 3 match the 54 nt that separate them on
the scaffold (194523..194576), whereas HSPs 1 and 2 are contiguous on the yeast sequence
but 60 nt apart on the scaffold (194349..194408).
This leads us to think that HSPs 2 and 3 will indeed form one exon, but that there is likely
to be an intron between HSPs 1 and 2. If we stay close to the HSP boundaries we see
only one valid donor/acceptor combination: 194349..194411. After annotating this intron
and the exons your Artemis window should look like this:
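As a rough cross-check of the reasoning about HSPs 1 to 3, we can compare the distance between two neighbouring HSPs on the yeast protein with their distance on the scaffold. The sketch below is only a heuristic, and the function name is made up for this illustration: it subtracts 3 nt for every missing amino acid from the genomic gap, so that whatever is left over cannot be explained by coding sequence.

```python
from typing import List, Tuple

# (query start, query end, scaffold start, scaffold end) for HSPs 1, 2 and 3.
HSPS_1_TO_3 = [
    (28, 66, 194232, 194348),
    (66, 103, 194409, 194522),
    (122, 148, 194577, 194657),
]


def unexplained_nt(left: Tuple[int, int, int, int],
                   right: Tuple[int, int, int, int]) -> int:
    """Genomic gap between two HSPs minus 3 nt per missing query residue."""
    missing_aa = right[0] - left[1] - 1   # negative if the HSPs overlap
    genomic_nt = right[2] - left[3] - 1
    return genomic_nt - 3 * missing_aa


def gaps(hsps: List[Tuple[int, int, int, int]]) -> List[int]:
    """Unexplained nucleotides for each pair of neighbouring HSPs."""
    return [unexplained_nt(a, b) for a, b in zip(hsps, hsps[1:])]


print(gaps(HSPS_1_TO_3))
```

A value of 0 (HSPs 2 and 3) means the gap can be filled with coding sequence; a clearly positive value (63 nt between HSPs 1 and 2) points to an intron. In this case the value happens to equal the length of the intron selected above, 194349..194411.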
One HSP to go. The BLAST report showed us that this is probably not the real start
of the gene. If we look at the sequence before this HSP we can see no methionine codons
in the same reading frame. This means we are probably missing one intron and one exon.
If we look at the upstream sequence we can find a methionine codon at position 194130 and
another one at 194095. We can find multiple splice acceptors, and for both methionine
codons we can find a splice donor, but never a very clean one. Clearly this is going to be
our most difficult intron so we’ll need a little help. We’ll load the Coprinus tBLASTx
data to help find the first exon. We will use the file filtered to an e-value of 10^-50 since
this will contain fewer false positives. To load this file select “File -> Read An Entry...”
and select the file scaffold_4_1-1000000.coprE-50.gff. Now your Artemis window
will look like this:
You will notice at the top of the screen that a new entry has been loaded. You can turn
the display of each individual entry on or off by clicking the check box in front of its
name. With this information it’s clear which methionine we should select as the start site.
It even seems we can just use the borders of these 2 tBLASTx hits as our intron borders.
The splice donor site does not look perfect but we can’t find a better one in this area. We
annotate 194095..194136 as our first exon and 194137..194183 as our first intron.
Now have another look at the tBLASTx data. First uncheck the first entry and then
check it again (this is just a trick to make sure the first entry is on top in the Artemis
window). Your window now looks like this:
We see that our annotation agrees very well with the tBLASTx data. It looks like we were
right that HSPs 2 and 3 belong to the same exon. It seems we were also right in extending
HSP 5 in both directions and HSP 7 towards the 5’ end. Of course we’ll need to check this
more carefully later.
This is it for the tBLASTx data, so you can uncheck it or remove the entry with “Entries
-> Remove An Entry -> scaffold_4_1-1000000.coprE-50.gff”.
Now that we have all our exons annotated we can combine them into a CDS. Select all the
exons (click on the first and then, while holding shift, click on the other ones) and combine
them by clicking “Edit -> Merge Selected Features”. Confirm that you want to do this
and don’t let Artemis remove the old features (for now). You will notice in the lower part
of the Artemis window that a new exon feature was formed spanning from 194095 to
195781. We need to change this to a CDS. Select this feature (if it isn’t selected already) and
click “Edit -> Edit Selected Features”. Change the key to CDS and click OK. Now your
Artemis window looks like this:
Quality Control
Now it’s time to check if we really annotated this gene correctly. Select the CDS we just
annotated and show the sequence by selecting “View -> View Amino Acids Of Selection
As FASTA”. Select the sequence and copy it to the clipboard.
Go to http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_server.html ,
then choose Blast search and paste your sequence in the window.
Choose BlastP and the non-redundant protein sequence database (you may choose
Uniprot-SwissProt instead, especially if your protein is well known). You may give a name
to your sequence (optional), otherwise it will appear as “unknown_XXXX”. Click on submit.
Wait a bit... Then a window with a graphic display and a list of orthologues is shown.
Most of the time there are too many of them. Keep only the top of the list by typing an
appropriate E threshold in the box below the graph and clicking the “select” button. Scroll
to the bottom of the page and click on “Add query sequence to created database”, then click
on the Extract button. It may take a while before the next page appears (“work with
protein sequence databank”). Click on Align. On the next page (“ClustalW”) click on the
submit button (after choosing a larger window, for example 100, if you like).
From the multiple alignment it’s clear that our annotation agrees very well with the
alignment except for 1 area:
The “CRA” in our annotation seems to be an insertion compared to the other
basidiomycetes. Let’s go back to our artemis window and find this sequence. This is on
the border of the first and second exon.
We can’t remove the CRA sequence from the protein since we don’t have the correct
splice donor and acceptor for this. We can however remove the ACR from the end of the
first exon; this gives exactly the same protein sequence as removing CRA would, and we can
keep our splice acceptor site. We also find an acceptable splice donor at position 194128.
This is one of the rather rare
GC splice donors! Modify the annotation of the first exon by clicking it, selecting
“Edit -> Edit Selected Features” and changing the endpoint to 194127. Now modify the intron
so that it starts at location 194128. Now we have the correct intron/exon structure of this
gene defined. We still need to remove the old CDS: select it and click “Edit -> Remove
Selected Features”. Now make a new CDS by combining the exons as explained earlier.
This time we won’t be modifying the exons anymore so you can remove the old features
when Artemis asks for it. We also won’t need our original BLAST HSPs anymore so you
can remove these as well. Another quite common problem is finding the correct start
codon. With this alignment method it will usually be very obvious if you selected the
wrong ATG.
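Before merging the exons again it can be reassuring to check the arithmetic of the final gene model. The sketch below rebuilds the exons from the CDS span and the 6 introns chosen in this tutorial (with the corrected first intron and the first of the 4 candidate acceptors for the intron between HSPs 4 and 5). The function names are made up for this illustration, and the exon coordinates it prints are derived from the introns, not copied from Artemis.

```python
from typing import List, Tuple

Range = Tuple[int, int]

CDS_START, CDS_END = 194095, 195781  # start codon to stop codon, inclusive
INTRONS: List[Range] = [
    (194128, 194183), (194349, 194411), (194659, 194713),
    (194881, 194936), (195202, 195252), (195387, 195439),
]


def exons_from_introns(start: int, end: int, introns: List[Range]) -> List[Range]:
    """Exons are whatever is left of start..end once the introns are removed."""
    exons = []
    exon_start = start
    for intron_start, intron_end in sorted(introns):
        exons.append((exon_start, intron_start - 1))
        exon_start = intron_end + 1
    exons.append((exon_start, end))
    return exons


def total_length(ranges: List[Range]) -> int:
    """Summed length of inclusive ranges."""
    return sum(end - start + 1 for start, end in ranges)


def embl_join(exons: List[Range]) -> str:
    """EMBL-style location string for a forward-strand multi-exon CDS."""
    return "join(" + ",".join(f"{s}..{e}" for s, e in exons) + ")"


exons = exons_from_introns(CDS_START, CDS_END, INTRONS)
coding = total_length(exons)
print(embl_join(exons))
print("coding length:", coding, "multiple of 3:", coding % 3 == 0)
print("intron lengths:", [e - s + 1 for s, e in INTRONS])
```

A coding length that is not a multiple of 3, or an intron far outside the 40-70 nt range, is a good reason to look at the corresponding splice sites again before doing the alignment check.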
Now we can finish the annotation of this gene by adding some information to the CDS
feature. Select the CDS feature and click “Edit -> Edit Selected Features”.
The first thing we’ll add is the name of the gene. Next to the “Add Qualifier” button,
select the qualifier “gene” and then push the “Add Qualifier” button. In the text area
below, the text /gene=”” will appear. Enter the gene name (gdhA) between the quotes.
Please use the accepted name(s) for this and select the one used for yeast/fungi if there is
more than one. This name, usually 3 letters followed by a capital letter or a number (e.g.
GlnA, Nia1), can usually be found in the entries returned by your Blast search. Likewise, we
can add the “product” qualifier (NADP-dependent glutamate dehydrogenase). We will also
include the best results from our BLASTp against SwissProt by adding the
“blastp_match” qualifier (gi|1706405|sp|P54388|DHE4_LACBI NADP-specific
glutamate deh... 834 0.0. gi|41017051|sp|Q96UJ9|DHE4_HEBCY NADP-specific
glutamate de... 735 0.0. gi|1706404|sp|P54387|DHE4_AGABI NADP-specific
glutamate deh... 714 0.0). Finally, don’t forget to add your name to the annotation so we
can keep track of who did what. Use the “curation” qualifier for this. Have a look at the
other qualifiers that are available. You can find a partial description at
http://www.ncbi.nlm.nih.gov/collab/FT/#7.4 .
1. Keep as close as possible to conventions:
- Give the biochemical function when there is an indication for it, using SwissProt-style
nomenclature (the EC nomenclature for enzymes).
- If only a cellular function is known, check whether it could apply to Laccaria!
Otherwise use a terminology indicating where the function applies, or append “like” to
the description valid for another organism.
- Don’t overdo it! Often the BLAST will point to a very specific function, like
cadmium transporter, uridine diphosphate-N-acetylglucosamine transporter or
6-phosphogluconolactonase. Check if it really applies, or if it should be changed
to a wider meaning: metal transporter, nucleotide-sugar transporter, etc.
In this example we are rather sure it is indeed an NADP-dependent glutamate
dehydrogenase because we found a lot of hits to other
NADP-dependent glutamate dehydrogenases with our BLAST. Otherwise we
could have used “glutamate dehydrogenase”.
2. Only leave this name as such if the gene product is proven to exist (cognate cDNA or
ESTs). Otherwise add the mention “putative” or “probable”, depending on how likely
the gene is to be the one you found: “glutamate dehydrogenase, probable”.
The Artemis feature edit window should now look something like this:
Finally, I suggest you keep as much information as possible in a Word file on your own
computer. This information should include: the location and annotation of the gene (you
can get this from the Artemis Feature Edit window, see previous screenshot), the DNA
and protein sequence of your gene (you can get these by selecting the CDS and selecting
“View -> View Bases Of Selection As FASTA” or “View Amino Acids Of Selection As
FASTA”), relevant information from BLAST reports, the alignment you made in the quality
control step, ... basically anything you think might be important.
All this information will be important if we update the genome and there are problems
with remapping previously annotated genes.
Click OK and don’t forget to save your file (File -> Save An Entry).
You can upload your annotated file to http://