* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Part_of - coccidia.icb.usp.br
Zinc finger nuclease wikipedia , lookup
Y chromosome wikipedia , lookup
Transposable element wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Pathogenomics wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Frameshift mutation wikipedia , lookup
Copy-number variation wikipedia , lookup
Neuronal ceroid lipofuscinosis wikipedia , lookup
Gene therapy wikipedia , lookup
Saethre–Chotzen syndrome wikipedia , lookup
Non-coding DNA wikipedia , lookup
Genetic engineering wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Gene expression profiling wikipedia , lookup
Gene desert wikipedia , lookup
Genomic library wikipedia , lookup
Human genome wikipedia , lookup
Protein moonlighting wikipedia , lookup
Gene expression programming wikipedia , lookup
Metagenomics wikipedia , lookup
History of genetic engineering wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Genome evolution wikipedia , lookup
X-inactivation wikipedia , lookup
Genome (book) wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Neocentromere wikipedia , lookup
Gene nomenclature wikipedia , lookup
Microevolution wikipedia , lookup
Point mutation wikipedia , lookup
Designer baby wikipedia , lookup
Genome editing wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Anotação automática de seqüências biológicas: ontologias e sistemas de pipelines Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP Sequence annotation • Annotation is the process information to a DNA sequence. of adding • The information usually has DNA coordinate. • Features could be repeats, genes, promoters, protein domains…….. • Features can be linked to other databases e.g. Pfam/Pubmed AG-ICB-USP Public databases • • GenBank, EMBL and DDBJ. All databases update each other automatically AG-ICB-USP Feature table • http://www.ncbi.nlm.nih.gov/projects/collab/FT/ • Format definition • Covers DDBJ/EMBL/GenBank • Defines all accepted annotation terms and hierarchy AG-ICB-USP Annotation file Contains: • A header with: • • • • • Information about the sequence Organism Authors References Comments • A feature table containing • Sequence features and co-ordinates AG-ICB-USP Header (EMBL) ID PFMAL1P4 standard; DNA; INV; 66441 BP. XX AC AL031747; XX SV AL031747.8 XX DT 24-SEP-1998 (Rel. 57, Created) DT 27-APR-2000 (Rel. 63, Last updated, Version 13) XX DE Plasmodium falciparum DNA from MAL1P4 XX KW HTG; rifin; telomere; var; var-like hypothetical protein. XX OS Plasmodium falciparum (malaria parasite P. falciparum) OC Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium. XX RN [1] RA Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D., RA Quail M., Rajandream M., Barrell B.; RT ; RL Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases. RL P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome RL Trust Genome Campus, Hinxton, Cambridge CB10 1S. AG-ICB-USP NCBI Header LOCUS PFMAL1P4 66442 bp DNA linear INV 02-DEC-2004 DEFINITION Plasmodium falciparum DNA from MAL1P4, complete sequence. ACCESSION AL031747 AL844501 VERSION AL031747.9 GI:23477012 KEYWORDS HTG; rifin; telomere; var; var-like hypothetical protein. SOURCE Plasmodium falciparum 3D7 ORGANISM Plasmodium falciparum 3D7 Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium. REFERENCE 1 AUTHORS Hall,N., Pain,A., Berriman,M., Churcher,C., Harris,B., Harris,D., TITLE Sequence of Plasmodium falciparum chromosomes 1, 3-9 and 13 JOURNAL Nature 419 (6906), 527-531 (2002) PUBMED 12368867 REFERENCE 2 AUTHORS Oliver,K., Pain,A., Berriman,M., Bowman,S., Churcher,C., Harris,B., Harris,D., Lawson,D., Quail,M., Rajandream,M., Hall,N. and Barrell,B. TITLE Direct Submission JOURNAL Submitted (24-SEP-1998) P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK COMMENT On Oct 2, 2002 this sequence version replaced gi:7670004. For more information about this sequence or the Malaria Project, see http://www.sanger.ac.uk/Projects/P_falciparum. AG-ICB-USP Feature • Region of DNA that was annotated with a key/qualifier • Keys: CDS, intron, miscellaneous, etc. • Qualifier: notes or extra-information about a feature i.e. exon (key) /gene=“adh” (qualifier) AG-ICB-USP Feature keys attenuator C_region CAAT_signal CDS conflict D-loop D_segment enhancer exon GC_signal gene iDNA intron J_segment LTR mat_peptide misc_binding misc_difference misc_feature misc_recomb misc_RNA misc_signal misc_structure modified_base mRNA N_region old_sequence polyA_signal polyA_site precursor_RNA prim_transcript primer_bind promoter protein_bind RBS repeat_region repeat_unit rep_origin rRNA S_region satellite scRNA sig_peptide snRNA snoRNA source stem_loop STS TATA_signal terminator transit_peptide tRNA unsure V_region V_segment variation 3'clip 3'UTR 5'clip 5'UTR -10_signal -35_signal AG-ICB-USP Feature qualifier Additional information about a feature /note="text" /allele="text" /number=unquoted /citation=[number] /codon=(seq:"text",aa:<amino_acid>)/product="text" /protein_id="<identifier>" /codon_start=<1 /db_xref="<database>:<identifier>" /pseudo /standard_name="text" /EC_number="text" /translation="text" /evidence=<evidence_value> /transl_except=(pos:<base_range>,aa:<amino_acid>) /exception="text" /transl_table /function="text" /usedin=accnum:feature_label /gene="text" /label=feature_label /map="text" AG-ICB-USP Features (EMBL) AG-ICB-USP Features (NCBI) FEATURES source Location/Qualifiers 1..66442 /organism="Plasmodium falciparum 3D7" /mol_type="genomic DNA" /isolate="3D7" /db_xref="taxon:36329" /chromosome="1" repeat_region 1..583 /note="telomeric repeat" repeat_region 584..1641 /note="14bp repeat" gene join(29733..34985,36111..37349) /gene="MAL1P4.01" /note="synonyms: PFA0005w, VAR" CDS join(29733..34985,36111..37349) /gene="MAL1P4.01" /note="Subtelomeric var gene Pfam hit to PF03011 Similar to Plasmodium falciparum VaR, mal1p4.01 vaR SWALL:Q9NFB6 (EMBL:AL031747) (2163 aa) fasta scores: E(): 0, 100% id in 2163 aa" /codon_start=1 /product="erythrocyte membrane protein 1 (PfEMP1)" /protein_id="CAB89209.1" /db_xref="GI:7670005" /db_xref="GOA:Q9NFB6" /db_xref="UniProtKB/TrEMBL:Q9NFB6" /translation="MVTQSSGGGAAGSSGEEDAKHVLDEFGQQVYNEKVEKYANSKIY KEALKGDLSQASILSELAGTYKPCALEYEYYKHTNGGGKGKRYPCTELGEKVEPRFSDTLGGQCTNK KIEGNKYIKGKDVGACAPYRRLHLCSHNLESIQ AG-ICB-USP CDS features • • CDS stands for coding sequence and is used to denote genes and pseudogenes. These features are automatically translated on submission and the protein added to the protein databases. AG-ICB-USP /note • Note field contains all the evidence for a gene call……..plus anything else. • • • Similarity (fasta or blast) Domain/motif information (Pfam, TMHMM, etc.) Unusual features (repeats, aa richness) AG-ICB-USP /product • The name of the gene product eg. Alcohol dehydrogenase • Unless there is proof we must qualify... • • • Putative Possible Always be conservative!… eg. Putative dehydrogenase dehyrogenase like protein • Only piece of annotation added to the protein databases. AG-ICB-USP Naming protocols • Hypothetical protein unknown function and no homology • Conserved hypothetical protein unknown function WITH homology • Alcohol dehydrogenase like looks a bit like it, but may not be. • Putative alcohol dehydrogenase probably a alcohol dehydrogenase • Alcohol dehydrogenase this has previously been characterised and shown to be alcohol dehydrogenase in this organism. AG-ICB-USP /gene • The gene name • • • • eg ADH1 Only transfer a gene name if it is meaningful Never transfer a gene name like PfB0024. Is it a gene family? make sure two genes have the same name. AG-ICB-USP Transitive Annotation • • • AKA annotation catastrophe Junk in = Junk out Mis-annotations spread through incorrect database submissions. AG-ICB-USP How can we standardize the annotation terms? AG-ICB-USP Through a dynamic controlled vocabulary AG-ICB-USP AG-ICB-USP So what does that mean? From a practical view, ontology is the representation of something we know about. “Ontologies" consist of a representation of things, that are detectable or directly observable, and the relationships between those things. Ontology Structure cell membrane mitochondrial membrane Directed Acyclic Graph (DAG) - multiple parentage allowed chloroplast chloroplast membrane GO topology • The ontologies are structured as directed acyclic graphs • Similar to hierarchies but differ in that a more specialized term (child) can be related to more than one less specialized term (parent). • For example, hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process. AG-ICB-USP True Path Violations Create Incorrect Definitions ..”the pathway from a child term all the way up to its top-level parent(s) must always be true". nucleus Part_of relationship chromosome True Path Violations ..”the pathway from a child term all the way up to its top-level parent(s) must always be true". chromosome Is_a relationship Mitochondrial chromosome True Path Violations ..”the pathway from a child term all the way up to its top-level parent(s) must always be true". nucleus A mitochondrial chromosome is not part of a nucleus! Part_of relationship chromosome Is_a relationship Mitochondrial chromosome True Path Violations ..”the pathway from a child term all the way up to its top-level parent(s) must always be true". nucleus Part_of relationship Nuclear chromosome chromosome Is_a relationship mitochondrion Part_of relationship Mitochondrial chromosome GO Definitions: Each GO term has 2 Definitions A definition written by a biologist: necessary & sufficient conditions written definition (not computable) Graph structure: necessary conditions formal (computable) Term-term relationship • is_a • The is_a relationship is a simple classsubclass relationship, where A is_a B means that A is a subclass of B • For example, nuclear chromosome is_a chromosome. GO:0043232 : intracellular non-membrane-bound organelle GO:0005694 : chromosome GO:0000228 : nuclear chromosome AG-ICB-USP Term-term relationship • part_of • C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present • For example, periplasmic flagellum part_of periplasmic space GO:0044464 : cell part GO:0042995 : cell projection GO:0019861 : flagellum GO:0009288 : flagellin-based flagellum GO:0055040 : periplasmic flagellum GO:0042597 : periplasmic space GO:0055040 : periplasmic flagellum AG-ICB-USP Current Ontologies • Molecular function: tasks performed by gene product • Biological process: broad biological goals accomplished by ordered assemblies of molecular functions • Cellular component: subcellular structures, locations and macromolecular complexes AG-ICB-USP AG-ICB-USP Search result for toxin AG-ICB-USP Relationships in GO •“is-a” •“part of” AG-ICB-USP GO paths to terms AG-ICB-USP GO definitions AG-ICB-USP Pyruvate dehydrogenase AG-ICB-USP Why the interest in GO? ● ● ● ● Universal ontology Functional classification scheme with many different levels in a DAG Widespread interest from scientific community Already mappings to SP keywords and gene products-annotation on some organisms AG-ICB-USP GO Evidence codes • Experimental Evidence Codes •EXP: Inferred from Experiment •IDA: Inferred from Direct Assay •IPI: Inferred from Physical Interaction •IMP: Inferred from Mutant Phenotype •IGI: Inferred from Genetic Interaction •IEP: Inferred from Expression Pattern • Computational Analysis Evidence Codes •ISS: Inferred from Sequence or Structural Similarity •ISO: Inferred from Sequence Orthology •ISA: Inferred from Sequence Alignment •ISM: Inferred from Sequence Model •IGC: Inferred from Genomic Context •RCA: inferred from Reviewed Computational Analysis • Author Statement Evidence Codes •TAS: Traceable Author Statement •NAS: Non-traceable Author Statement •Curator Statement Evidence Codes •IC: Inferred by Curator • ND: No biological Data available • Automatically-assigned Evidence Codes •IEA: Inferred from Electronic Annotation • Obsolete Evidence Codes • NR: Not Recorded AG-ICB-USP Current Mappings to GO • Consortium mappings -MGD, SGD, FlyBase • Swiss-Prot keywords • EC numbers • InterPro entries • Medline ID • Commercial companies -CompuGen, Proteome AG-ICB-USP AG-ICB-USP AG-ICB-USP AG-ICB-USP InterPro-to-GO EC number-to-GO AG-ICB-USP SP keyword-to-GO AG-ICB-USP GO doesn’t cover… • Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are. • Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene. • Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see Sequence Ontology). • Protein domains or structural features. • Protein-protein interactions. • Environment, evolution and expression. • Anatomical or histological features above the level of cellular components, including cell types. AG-ICB-USP Sequence Ontology • The four major aspects of the complete Sequence Ontology are: • • • • located sequence features for objects that can be located on sequence in coordinates, sequence attributes for describing the properties of features, consequences of mutation for the annotation of the effects of a mutation chromosome variation to describe large scale variations AG-ICB-USP Sequence Ontology • How to edit an ontology file? • OBO-Edit – an ontology editor for biologists • OBO-Edit compliant format AG-ICB-USP Generic feature format 3 • Generic format for sequence annotation interchange • • • Tab-delimited text file Represents features in hierarchical view Uses a controlled vocabulary – is compliant to Sequence Ontology AG-ICB-USP Generic feature format 3 • The tab-delimited file presents 9 columns: • • • • • • • • • Column 1: "seqid" Column 2: "source" Column 3: "type" Columns 4 & 5: "start" and "end" Column 6: "score" Column 7: "strand" The strand of the feature. + for positive strand (relative to the landmark), - for minus strand Column 8: "phase" Column 9: "attributes" AG-ICB-USP Generic feature format 3 • • • • • • • • Column 1: "seqid" Column 2: "source" Column 3: "type" Columns 4 & 5: "start" and "end" Column 6: "score" Column 7: "strand" Column 8: "phase" Column 9: "attributes" How to annotate these splicing variants using Sequence Ontology terms and the GFF3? • The annotated genome region is named “ctg123” • A gene named EDEN extends from coordinates 1 to 9000 • The gene encodes three alternatively-spliced variants: EDEN.1, EDEN.2 and EDEN.3 • Transcript EDEN.3 presents two alternative translation start points • There is a transcriptional factor binding site (a promoter) located 50 bp upstream of the translational start site of EDEN.1 ##gff-version 3 ##sequence-region ctg123 1 1497228 ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001 ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1 ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001;Name=EDEN.2 ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001;Name=EDEN.3 ctg123 . exon ctg123 . exon ctg123 . exon ctg123 . exon ctg123 . exon 1300 1050 3000 5000 7000 1500 1500 3902 5500 9000 . . . . . + + + + + . . . . . ctg123 . CDS ctg123 . CDS ctg123 . CDS ctg123 . CDS 1201 3000 5000 7000 1500 3902 5500 7600 . . . . + + + + 0 0 0 0 ID=exon00001;Parent=mRNA00003 ID=exon00002;Parent=mRNA00001,mRNA00002 ID=exon00003;Parent=mRNA00001,mRNA00003 ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003 ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1 ctg123 . CDS ctg123 . CDS ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2 ctg123 . CDS ctg123 . CDS ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3 5000 5500 . + 2 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3 7000 7600 . + 2 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3 ctg123 . CDS ctg123 . CDS ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4 5000 5500 . + 2 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4 7000 7600 . + 2 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4 Generic feature format 3 • If you writes a GFF file, you can test it! There is an online validator: http://dev.wormbase.org/db/validate_gff3/validate_gff3_online AG-ICB-USP Testing the GFF3 Validator AG-ICB-USP Testing the GFF3 Validator Let’s change the feature names Annotation viewing and editing Artemis • Artemis is a free genome viewer and annotation tool developed by Kim Rutherford (Sanger Institute, UK). • It allows for visualization of sequence features and results of analyses, in the context of the sequence and its six-frame translation. AG-ICB-USP Annotation viewing and editing Artemis • Artemis is written in Java, and is available for UNIX, GNU/Linux, BSD, Macintosh and MSWindows systems. • It can read complete EMBL and GENBANK database entries or sequence in FASTA or raw format. Extra sequence features can be in EMBL, GENBANK or GFF format. AG-ICB-USP AG-FMVZ-USP AG-FMVZ-USP AG-FMVZ-USP AG-FMVZ-USP AG-FMVZ-USP