Genome Bioinformatics
Tyler Alioto
Center for Genomic Regulation
Barcelona, Spain
Jul-01-08 06/16/08 Bioinformatics Workshop - Malaga

Node 1 of the INB
  GN1 Bioinformática y Genómica
  Genome Bioinformatics Lab, CRG
  Roderic Guigó (PI)

Themes
  Gene prediction
    ab initio => GeneID
    dual-genome => SGP2
    U12 introns => GeneID v1.3 and U12DB
    combiner => GenePC
  Genome feature visualization
    gff2ps
  Alternative splicing
    ASTALAVISTA
  Gene expression regulatory elements
    meta and mmeta alignment

Eukaryotic gene structure
  [Figure labels: upstream regulator, promoter, exons, introns, donor, acceptor, downstream regulator]

The Splicing Code

Gene Prediction Strategies
  Expressed sequence (cDNA) or protein sequence available?
    Yes: spliced alignment
      BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
    No: integrated gene prediction
      Informant genome(s) available?
        Yes: dual or n-genome de novo predictors
          SGP2, Twinscan, NSCAN, (Genomescan: same or cross genome protein blastx)
        No: ab initio predictors
          geneid, genscan, augustus, fgenesh, genemark, etc.
  Many newer gene predictors can run in multiple modes depending on the evidence available.

Frameworks for gene prediction
  Hierarchical exon-building and chaining
  Hidden Markov Models (many flavors)
    HMM, GHMM, GPHMM, Phylo-HMM
  Conditional Random Fields (new!)
    Conrad, Contrast...
  and, no doubt, more to come
  All of them involve parsing the optimal path of exons using dynamic programming (e.g. GenAmic, Viterbi algorithms)

How does GeneID approach gene prediction?
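The frameworks slide notes that every approach ends up finding the optimal path of exons by dynamic programming. As a rough illustration of that idea only, the sketch below chains non-overlapping candidate exons so that the summed score is maximal. It is not GeneID's GenAmic algorithm: it ignores reading frame, exon type (first, internal, terminal) and intron-length constraints. The `Exon` type, the `best_chain` function and the example coordinates and scores are all hypothetical.

```python
from bisect import bisect_left
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # e.g. a log-likelihood ratio; may be negative


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return (score, chain) for the best set of non-overlapping exons."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    # best[i] = best total score using only the first i exons (by end position)
    best = [0.0] * (len(ordered) + 1)
    taken = [False] * len(ordered)
    prev = [0] * len(ordered)
    for i, exon in enumerate(ordered):
        # p = number of exons that end strictly before this exon starts
        p = bisect_left(ends, exon.start)
        prev[i] = p
        with_exon = best[p] + exon.score
        if with_exon > best[i]:
            best[i + 1] = with_exon
            taken[i] = True
        else:
            best[i + 1] = best[i]
    # Trace back through the table to recover the chosen exons
    chain: List[Exon] = []
    i = len(ordered)
    while i > 0:
        if taken[i - 1]:
            chain.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chain.reverse()
    return best[-1], chain


# Hypothetical candidate exons (made-up coordinates and scores)
candidates = [
    Exon(100, 200, 3.0),
    Exon(150, 300, 4.0),
    Exon(250, 400, 2.5),
    Exon(350, 500, 1.0),
    Exon(450, 600, 3.5),
]
score, chain = best_chain(candidates)
print(score, [(e.start, e.end) for e in chain])
```

Sorting by end position and looking up the last compatible predecessor with a binary search keeps the chaining step at O(n log n) in this sketch; a real predictor would add the frame and exon-type compatibility checks at the point where `p` is computed.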
The gene prediction problem
  [Figure: sites (acceptors a1-a4, donors d1-d5), exons (e1-e8), genes (e1, e4, e8)]

GeneID
  Geneid follows a hierarchical structure: signal => exon => gene
  Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
  Dynamic programming algorithm: maximize score of assembled exons => assembled gene

Training GeneID
  Aligned donor sites:
    GAGGTAAAC TCCGTAAGT CAGGTTGGA ACAGTCAGT TAGGTCATT
    TAGGTACTG ATGGTAACT CAGGTATAC TGTGTGAGT AAGGTAAGT

  Nucleotide frequency by position:
       1    2    3    4    5    6    7    8    9
    A  0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
    C  0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
    G  0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
    T  0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6

    ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
    GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
    GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
    GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
    ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
    GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC

Running GeneID
  command line or on geneid server

  NAME
    geneid - a program to annotate genomic sequences
  SYNOPSIS
    geneid [-bdaefitnxszr] [-DA] [-Z] [-p gene_prefix] [-G] [-3] [-X] [-M] [-m] [-WCF] [-o] [-j lower_bound_coord] [-k upper_bound_coord] [-O <gff_exons_file>] [-R <gff_annotation-file>] [-S <gff_homology_file>] [-P <parameter_file>] [-E exonweight] [-V evidence_exonweight] [-Bv] [-h] <locus_seq_in_fasta_format>
  RELEASE
    geneid v 1.3
  OPTIONS
    -b: Output Start codons
    -d: Output Donor splice sites
    -a: Output Acceptor splice sites
    -e: Output Stop codons
    -f: Output Initial exons
    -i: Output Internal exons
    -t: Output Terminal exons
    -n: Output introns
    -s: Output Single genes
    -x: Output all predicted exons

GeneID output
  ## gff-version 2
  ## date Mon Nov 26 14:37:15 2007
  ## source-version: geneid v 1.2 -- [email protected]
  # Sequence HS307871 - Length = 4514 bps
  # Optimal Gene Structure. 1 genes. Score = 16.20
  # Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
  HS307871  geneid_v1.2  Internal  1710  1860  -0.11  +  0  HS307871_1
  HS307871  geneid_v1.2  Internal  1976  2055   0.24  +  2  HS307871_1
  HS307871  geneid_v1.2  Internal  2132  2194   0.44  +  0  HS307871_1
  HS307871  geneid_v1.2  Internal  2434  2682   4.66  +  0  HS307871_1
  HS307871  geneid_v1.2  Internal  2749  2910   3.19  +  0  HS307871_1
  HS307871  geneid_v1.2  Internal  3279  3416   0.97  +  0  HS307871_1
  HS307871  geneid_v1.2  Internal  3576  3676   3.23  +  0  HS307871_1
  HS307871  geneid_v1.2  Internal  3780  3846  -0.96  +  1  HS307871_1
  HS307871  geneid_v1.2  Terminal  4179  4340   4.55  +  0  HS307871_1

GFF: a standard annotation format
  Stands for: Gene Finding Format -or- General Feature Format
  Designed as a single line record for describing features on DNA sequence -- originally used for gene prediction output
  9 tab-delimited fields common to all versions:
    seq  source  feature  begin  end  score  strand  frame  group
  The group field differs between versions, but in every case no tabs are allowed
    GFF2: group is a unique description, usually the gene name.
      NCOA1
    GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced, start_codon and stop_codon are required features for CDS
      transcript_id "NM_056789" ; gene_id "NCOA1"
    GFF3: Capitalized tags follow Sequence Ontology (SO) relationships, FASTA seqs can be embedded
      ID=NM_056789_exon1; Parent=NM_056789; note="5' UTR exon"

Visualizing features with gff2ps
  generated by Josep Abril

Visualizing features on UCSC genome browser (custom tracks)
  If "your" genome is served by UCSC, this is a good option because:
    browsing is dynamic
    access to other annotations
    can view DNA sequence
    can do complex intersections and filtering
  gff2ps is good when:
    your genome is not on UCSC
    you want more flexible layout options
    you want to run it 'offline'

Extensions to GeneID
  Syntenic Gene Prediction (dual-genome)
  Evidence-based (constrained) gene prediction
  U12 intron detection
  Combining gene predictions
  Selenoprotein gene prediction

Syntenic Gene Prediction: SGP2

Minor splicing and U12 introns
  U12 introns make up a minor proportion of all introns (~0.33% in human, less in insects)
  But they can be found in 2-3% of genes
  Normally ignored, but this causes annotation problems
  Easy to predict due to highly conserved donor and branch sites

Splice Signal Profiles: major and minor

Gathering U12 Introns
  [Figure: pipeline for Human and Fruit Fly. Labels: predict (genome), aln to EST/mRNA, score, merge, all annotated introns (ENSEMBL?), ortholog search (17 species) + spliced alignment, published, U12 DB. Counts shown: 2084, 563, 568, 385, 658, 597]

Coming Soon: GenePC
  a Gene Prediction Combiner

Tutorial Homepage
  http://genome.imim.es/courses/Malaga08/
GBL Homepage
  http://genome.imim.es/
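As a small take-home exercise tied to the GFF slides, the GeneID output shown earlier can be read back into the nine GFF fields with a few lines of Python. This is only a sketch under stated assumptions: the `Feature` type and `parse_gff2` function are hypothetical names, the parser falls back to splitting on whitespace because the sample is retyped from a slide rather than copied from a tab-delimited file, and it assumes numeric scores as in the GeneID output.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    seq: str
    source: str
    feature: str
    begin: int
    end: int
    score: float
    strand: str
    frame: str   # kept as text; GFF allows "." here
    group: str


def parse_gff2(text: str) -> List[Feature]:
    """Parse GFF2 records, skipping blank lines and '#' comment lines."""
    features = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Real GFF is tab-delimited; whitespace splitting is a fallback
        # for text retyped from a slide (assumption of this sketch).
        fields = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(fields) != 9:
            raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
        seq, source, feature, begin, end, score, strand, frame, group = fields
        features.append(Feature(seq, source, feature, int(begin), int(end),
                                float(score), strand, frame, group))
    return features


# Records copied from the GeneID output slide
SAMPLE = """\
## gff-version 2
# Gene 1 (Forward). 9 exons. 391 aa. Score = 16.20
HS307871 geneid_v1.2 Internal 1710 1860 -0.11 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 1976 2055 0.24 + 2 HS307871_1
HS307871 geneid_v1.2 Internal 2132 2194 0.44 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2434 2682 4.66 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 2749 2910 3.19 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3279 3416 0.97 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3576 3676 3.23 + 0 HS307871_1
HS307871 geneid_v1.2 Internal 3780 3846 -0.96 + 1 HS307871_1
HS307871 geneid_v1.2 Terminal 4179 4340 4.55 + 0 HS307871_1
"""

exons = parse_gff2(SAMPLE)
# GFF coordinates are 1-based and inclusive, hence the +1
coding_length = sum(f.end - f.begin + 1 for f in exons)
total_score = sum(f.score for f in exons)
print(len(exons), coding_length, coding_length // 3, round(total_score, 2))
```

On this sample the nine exons add up to 1173 nt, that is 391 codons, which matches the "391 aa" in the header, and the exon scores sum to 16.21, in line with the reported gene score of 16.20 up to rounding of the printed values.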