Genome Analysis and Genome Comparison

Part 12
Genome Analysis

Outline
• Overview
• Why do comparative genomic analysis?
• Assumptions/Limitations
• Genome Analysis and Annotation Standard Procedure
• General Purpose Databases for Comparative Genomics
• Organism Specific Databases
• Genome Analysis Environments
• Genome Sequence Alignment Programs
• Genomic Comparison Visualization Tools

Some of the prokaryotic genomes

Organism                              Disease/relevance                 Status
Bacteroides fragilis                  Opportunistic                     In progress
Bordetella bronchiseptica             Veterinary                        In progress
Bordetella parapertussis              Whooping cough                    Complete
Bordetella pertussis                  Whooping cough                    Complete
Burkholderia cepacia                  Lung infections in CF             In progress
Burkholderia pseudomallei             Melioidosis                       In progress
Chlamydophila abortus                 Veterinary                        Funded
Clostridium botulinum                 Botulism                          Funded
Clostridium difficile                 Colitis                           In progress
Corynebacterium diphtheriae           Diphtheria                        Complete
Erwinia carotovora                    Plant pathogen                    Funded
Escherichia/Shigella spp. (5)         Various                           In progress
Mycobacterium bovis                   Tuberculosis                      In progress
Mycobacterium marinum                 Various                           In progress
Neisseria meningitidis (serogroup C)  Bacterial meningitis              In progress
Salmonella typhi                      Typhoid fever                     Complete
Salmonella spp. (5)                   Various                           In progress
Staphylococcus aureus (MRSA)          Various (nosocomial)              Complete
Staphylococcus aureus (MSSA)          Various (community acquired)      In progress
Streptococcus pneumoniae              Bacterial meningitis              In progress
Streptococcus pyogenes                Various (ARF-associated)          In progress
Streptococcus suis                    Veterinary                        In progress
Streptococcus uberis                  Veterinary                        In progress
Streptomyces coelicolor               Non-pathogenic                    Complete
Tropheryma whipplei                   Whipple's disease                 In progress
Wolbachia (Culex quinquefasciatus)    Vector (Bancroftian filariasis)   In progress
Wolbachia (Onchocerca volvulus)       River blindness                   Funded
Yersinia enterocolitica               Food poisoning                    In progress
Yersinia pestis                       Plague                            Complete

Some of the eukaryotic genomes

Organism                              Disease/relevance                 Status
Aspergillus fumigatus                 Farmer's lung                     In progress
Dictyostelium discoideum              Soil amoeba                       In progress
Entamoeba histolytica                 Amoebic dysentery                 In progress
Leishmania major                      Leishmaniasis                     In progress
Plasmodium falciparum                 Malaria                           In progress
Schistosoma mansoni                   Bilharzia                         In progress
Schizosaccharomyces pombe             Fission yeast                     Complete
Theileria annulata                    Veterinary                        In progress
Toxoplasma gondii                     Toxoplasmosis                     In progress
Trypanosoma brucei                    Sleeping sickness                 In progress

Bioinformatics Flow Chart
1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene & protein expression data
7. Drug screening: ab initio drug design OR drug compound screening in database of molecules
8. Genetic variability

[Figure: shotgun sequencing workflow. Genomic DNA, shearing/sonication, subclone and sequence, shotgun reads, assembly, contigs, finishing reads, finishing, complete sequence]

Genome Sequencing - Review
• Strategy: clone by clone vs. whole genome shotgun
• Libraries: subcloning; generate small insert libraries
• Sequencing
• Assembly: process of taking raw single-pass reads into contiguous consensus sequence (Phred/Phrap)
• Closure: process of ordering and merging consensus sequences into a single contiguous sequence
• Annotation
  - DNA features (repeats/similarities)
  - Gene finding
  - Peptide features
  - Initial role assignment
  - Others: regulatory regions
• Release: release data to the public, e.g. EMBL or GenBank

• Most genomes will be sequenced and can be sequenced; few problems are unsolvable.
• The problem lies in understanding what you have:
  - Gene finding/gene prediction
  - Annotation

Annotation of eukaryotic genomes
[Figure: from genome to function. Genomic DNA, transcription, unprocessed RNA, RNA processing, mature mRNA (Gm3 ... AAAAAAA), translation, nascent polypeptide, folding, active enzyme, function (Reactant A to Product B). Annotation approaches labelled on the figure: ab initio gene prediction, comparative gene prediction, functional identification]

Why do comparative genomics?
• Many of the genes encoded in each genome from the genome projects had no known or predictable function.
• Analysis of the protein set from completely sequenced genomes.
• Uniform evolutionary conservation of proteins in microbial genomes: 70% of gene products from sequenced genomes have homologs in distant genomes (Koonin et al., 1997).
• The function of many of these genes can be predicted by comparing different genomes of known functional annotation and transferring the functional annotation of proteins from better studied organisms to their orthologs in lesser studied organisms.
• Cross-species comparison helps reveal conserved coding regions.
• No prior knowledge of the sequence motif is necessary.
• Complement to algorithmic analysis.

Assumptions/Limitations
• Homologous genes are relatively well preserved, while noncoding regions tend to show varying degrees of conservation. Conserved noncoding regions are believed to be important in regulating gene expression, maintaining the structural organization of the genome, and most likely other functions.
• Cross-species comparative genomics is influenced by the evolutionary distance between the compared species.

Genome Analysis and Annotation: General Procedure
Basic procedure to determine the functional and structural annotation of uncharacterized proteins:
• Use a sequence similarity search program such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required, programs based on the Smith-Waterman algorithm are preferred, at the cost of greater analysis time.
• Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
• Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity.
• Generate a secondary and (if possible) tertiary structure prediction.
• Annotation:
  - Transfer function information from a well-characterized organism to a lesser studied organism, and/or
  - Use phylogenetic patterns (or profiles), and/or
  - Use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets: differential genome display (Huynen et al., 1997).
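The logical operations on gene sets mentioned in the last annotation step can be sketched with plain Python sets. This is only an illustration: the genome names and COG identifiers below are made up, and real phylogenetic pattern searches run over the actual COG assignments.

```python
# Hypothetical presence/absence data: genome name -> set of COG identifiers.
# Both the genome names and the identifiers are invented for illustration.
cogs_by_genome = {
    "pathogen_A": {"COG0001", "COG0002", "COG0003", "COG0005"},
    "pathogen_B": {"COG0001", "COG0003", "COG0004", "COG0005"},
    "free_living": {"COG0001", "COG0002", "COG0004"},
}


def phylogenetic_pattern(cog, genomes):
    """Return the presence/absence profile of one COG, genomes in sorted name order."""
    return "".join("1" if cog in genomes[name] else "0" for name in sorted(genomes))


a = cogs_by_genome["pathogen_A"]
b = cogs_by_genome["pathogen_B"]
f = cogs_by_genome["free_living"]

shared_by_pathogens = a & b        # AND: present in both pathogens
in_either_pathogen = a | b         # OR: present in at least one pathogen
pathogen_specific = (a & b) - f    # AND combined with NOT: differential genome display

print(sorted(pathogen_specific))
print(phylogenetic_pattern("COG0003", cogs_by_genome))
```

Genes that survive the AND/NOT filter are candidates for functions specific to the selected group of organisms; the profile string is one simple way to write down a phylogenetic pattern.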
Automated Genome Annotation
• GeneQuiz: limited number of searches/day
• MAGPIE: outside users cannot submit their own sequences
• PEDANT: commercial version allows for full capacity
• SEALS: semi-automated

General Databases Useful for Comparative Genomics
• Locus Link/RefSeq
  - http://www.ncbi.nih.gov/LocusLink/
• PEDANT - Protein Extraction Description ANalysis Tool
  - http://pedant.gsf.de/
• MIPS
  - http://mips.gsf.de/
• COGs - Clusters of Orthologous Groups (of proteins)
  - http://www.ncbi.nih.gov/COG/
• KEGG - Kyoto Encyclopedia of Genes and Genomes
  - http://www.genome.ad.jp/kegg/
• MBGD - Microbial Genome Database
  - http://mbgd.genome.ad.jp/
• GOLD - Genome OnLine Database
  - http://wit.integratedgenomics.com/GOLD/
• TOGA
  - http://www.tigr.org/xxxxx

Problems with existing sequence alignment algorithms for genomic analysis
• Most algorithms were developed for comparing single protein sequences or DNA sequences containing a single gene.
• Most algorithms assign a score to all possible alignments (usually the sum of the similarity/identity values for each aligned residue minus a penalty for the introduction of gaps) and then find the optimal or near-optimal alignment under the chosen scoring scheme.
• Unfortunately, most of these programs cannot accurately handle long alignments.
• Linear-space Smith-Waterman variants are too computationally intensive: they either require specialized hardware (memory-limited) or are very time-consuming.
• Trade-off: higher speed vs. increased sensitivity.

Genome-size comparative alignment tools
• ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes
  - ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et al. 1998)
• BLAT
  - http://genome.ucsc.edu/cgi-bin/hgBlat?command=start (Kent xxx)
• DIALIGN - DIagonal ALIGNment
  - http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1998; Morgenstern 1999)
• DBA - DNA Block Aligner
  - http://www.sanger.ac.uk/Software/Wise2/dba.shtml (Jareborg et al. 1999)
• GLASS - GLobal Alignment SyStem
  - http://plover.lcs.mit.edu/ (Batzoglou et al. 2000)
• LSH-ALL-PAIRS - Locality-Sensitive Hashing in ALL PAIRS
  - Email: [email protected] (Buhler 2001)
• MegaBlast
  - http://www.ncbi.nih.gov/blast/ (Zhang 2000)
• MUMmer - Maximal Unique Match (mer)
  - http://www.tigr.org/softlab/ (Delcher et al. 1999)
• PIPMaker - Percent Identity Plot MAKER
  - http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
• SSAHA - Sequence Search and Alignment by Hashing Algorithm
  - http://www.sanger.ac.uk/Software/analysis/SSAHA/
• WABA - Wobble Aware Bulk Aligner
  - http://www.cse.ucsc.edu/~kent/xenoAli/ (Kent & Zahler 2000)

SSAHA
• Sequence Search and Alignment by Hashing Algorithm
• Software tool for very fast matching and alignment of DNA sequences.
• Achieves fast search speed by converting sequence information into a hash table data structure, which can then be searched very rapidly for matches.
• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• Run from the Unix command line.
• Needs a lot of memory (> 1 GB RAM).
• The SSAHA algorithm is best for applications requiring exact or "almost exact" matches between two sequences, e.g. SNP detection, fast sequence assembly, ordering and orientation of contigs.

Genome Analysis Environment
• MAGPIE - Multipurpose Automated Genome Project Investigation Environment
• PEDANT
• SEALS

Problems with Visualizing Genomes
• Alignment program output was often presented as text files, which are difficult to interpret intuitively when comparing genomes.
• Visualization tools are needed to handle the complexity and volume of the data and to present the information in a comprehensive and comprehensible manner to a biologist for interpretation.

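The hash-table idea described on the SSAHA slide above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the SSAHA implementation: the function names are hypothetical, every overlapping k-tuple of the subject is indexed, and only exact k-tuple matches are reported.

```python
from collections import defaultdict


def build_hash_table(sequences, k):
    """Map every k-tuple to the (sequence index, offset) pairs where it occurs."""
    table = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for offset in range(len(seq) - k + 1):
            table[seq[offset:offset + k]].append((seq_index, offset))
    return table


def find_matches(table, query, k):
    """Return sorted (sequence index, diagonal, query offset) hits for a query.

    The diagonal is subject offset minus query offset; runs of hits that
    share a sequence index and a diagonal indicate a longer exact match.
    """
    hits = []
    for q_offset in range(len(query) - k + 1):
        for seq_index, s_offset in table.get(query[q_offset:q_offset + k], []):
            hits.append((seq_index, s_offset - q_offset, q_offset))
    return sorted(hits)


subjects = ["ACGTACGTTAGC", "TTTTGGGGCCCC"]  # toy subject database
table = build_hash_table(subjects, k=4)
hits = find_matches(table, "CGTTAG", k=4)
print(hits)  # three hits on diagonal 5 of sequence 0
```

Building the table costs memory proportional to the database size, which is consistent with the slide's remark about RAM; each query lookup afterwards is a dictionary access rather than a scan of the database.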
Genome Alignment Visualization tools need to provide:
• Interpretable alignments
• Gene prediction and database homologies from different sources
• Interactive features: real-time capabilities, zooming, searching specific regions of homology
• Representation of breaks in synteny
• Display of multiple alignments
• Display of contigs of unfinished genomes against finished genomes
• Handling of various data formats
• Software availability (no black box)

Genome Comparison Visualization Tools
• ACT - Artemis Comparison Tool (displays parsed BLAST alignments; based on Artemis, an annotation tool)
  - http://www.sanger.ac.uk/Software/ACT/
• Alfresco (displays DBA alignments and ...)
  - http://www.sanger.ac.uk/Software/Alfresco/ (Jareborg & Durbin 2000)
• PipMaker (displays BlastZ alignments)
  - http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
• Enteric/Menteric/Maj (displays BlastZ alignments)
  - http://globin.cse.psu.edu/enterix/ (Florea et al. 2000; McClelland et al. 2000)
• Intronerator (displays WABA alignments and ...)
  - http://www.cse.ucsc.edu/~kent/intronerator/ (Kent & Zahler 2000b)
• VISTA - Visualization Tool for Alignment (displays GLASS alignments)
  - http://www-gsd.lbl.gov/vista/
• SynPlot (displays DIALIGN and GLASS alignments)
  - http://www.sanger.ac.uk/Users/igrg/SynPlot/

Artemis Comparison Tool (ACT)
- ACT is a DNA sequence comparison viewer based on Artemis.
- Can read complete EMBL and GenBank entries, or sequences in FASTA or raw format.
- Additional sequence features can be in EMBL, GenBank, or GFF format.
- ACT is free software and is distributed under the GNU General Public License.
- Java-based software.
- The latest release (2.0) has better support for eukaryotic genome comparison.
- http://www.sanger.ac.uk/Software/ACT/

[Figure: Salmonella typhi vs. E. coli, SPI-2. Tracks: G+C content, S. typhi, tRNA, phage/IS genes, pseudogenes, BLAST hits, E. coli]

[Figure: Salmonella typhi and Yersinia pestis type III secretion systems]

[Figure: Salmonella typhi vs. E. coli in ACT. Labelled regions: SPI-2, SPI-9, SPI-1, SPI-7 (Vi), SPI-10. Rows: S. typhi, DNA matches, E. coli]

[Figure: Neisseria meningitidis, A vs. B comparison in ACT]

Extra Slides 1

ASSIRC
• Accelerated Search for SImilarity Regions in Chromosomes
• ASSIRC finds regions of similarity in pair-wise genomic sequence alignments.
• The method involves three steps:
  - (i) identification of short exact chains of fixed size, called 'seeds', common to both sequences, using hashing functions;
  - (ii) extension of these seeds into putative regions of similarity by a 'random walk' procedure (i.e. the four bases are associated ...);
  - (iii) final selection of regions of similarity by assessing alignments of the putative sequences.
• The authors used simulations to estimate the proportion of regions of similarity not detected for particular region sizes, base identity proportions, and seed sizes.
• The approach can be tailored to the user's specifications.
• The authors looked for regions of similarity between two yeast chromosomes (V and IX).
• The efficiency of the approach was compared with that of the conventional programs BLAST and FASTA by assessing the CPU time required and the regions of similarity found for the same data set.
• http://www.biologie.ens.fr/perso/vincens/assirc.html
  ftp://ftp.biologie.ens.fr/pub/molbio/assirc.tar.gz

BLAT
• Only DNA sequences of 25,000 bases or fewer, and protein or translated sequences of 5,000 letters or fewer, will be processed. If multiple sequences are submitted at the same time, the total limit is 50,000 bases or 12,500 letters.
• BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice, DNA BLAT works well on primates, and protein BLAT on land vertebrates.
• BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 11-mers except for those heavily involved in repeats. The index takes up a bit less than a gigabyte of RAM. The genome itself is not kept in memory, allowing BLAT to deliver high performance on a reasonably priced Linux box. The index is used to find areas of probable homology, which are then loaded into memory for a detailed alignment. Protein BLAT works in a similar manner, except with 4-mers rather than 11-mers. The protein index takes a little more than 2 gigabytes.
• BLAT was written by Jim Kent. Like most of Jim's software, interactive use on this web server is free to all. Sources and executables to run batch jobs on your own server are available free for academic, personal, and non-profit purposes. Non-exclusive commercial licenses are also available. Contact Jim for details.
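The non-overlapping k-mer index described for DNA BLAT can be illustrated with a toy sketch. The code below is an assumption-laden simplification, not BLAT itself: the function names are hypothetical, the repeat filter is omitted, and no alignment is performed. It also checks a general property of non-overlapping tiling, namely that any exact match of length 2K-1 or more must fully contain at least one indexed tile; this is offered as intuition only and is not claimed to be BLAT's exact sensitivity rule.

```python
import random


def tile_index(genome, k):
    """Index non-overlapping k-mers (tiles) that start at offsets 0, k, 2k, ..."""
    index = {}
    for start in range(0, len(genome) - k + 1, k):
        index.setdefault(genome[start:start + k], []).append(start)
    return index


def seed_hits(index, query, k):
    """Look up every overlapping k-mer of the query.

    Returns (genome position, query position) pairs; the genome is tiled
    sparsely, so the query must be scanned at every offset.
    """
    return [
        (g_pos, q_pos)
        for q_pos in range(len(query) - k + 1)
        for g_pos in index.get(query[q_pos:q_pos + k], [])
    ]


random.seed(0)
K = 11
genome = "".join(random.choice("ACGT") for _ in range(2000))  # toy "genome"
index = tile_index(genome, K)

# The index holds roughly len(genome) / K entries instead of len(genome).
n_tiles = sum(len(positions) for positions in index.values())

# Any exact substring of length 2K-1 contains one complete tile, so it
# always produces a seed hit on its true diagonal (genome pos - query pos).
window = 2 * K - 1
always_hit = all(
    any(g - q == start for g, q in seed_hits(index, genome[start:start + window], K))
    for start in range(len(genome) - window + 1)
)
print(n_tiles, always_hit)
```

Indexing only every K-th position is what keeps the index small enough to hold in memory, at the price of needing somewhat longer exact matches before a seed is guaranteed.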
