Genome Analysis and Genome Comparison

Part 12 Genome Analysis
Outline
• Overview
• Why do comparative genomic analysis?
• Assumptions/Limitations
• Genome Analysis and Annotation Standard Procedure
• General Purpose Databases for Comparative Genomics
• Organism Specific Databases
• Genome Analysis Environments
• Genome Sequence Alignment Programs
• Genomic Comparison Visualization Tools
Some of the prokaryotic genomes
Organism | Disease / relevance | Status
Bacteroides fragilis | Opportunistic | In progress
Bordetella bronchiseptica | Veterinary | In progress
Bordetella parapertussis | Whooping cough | Complete
Bordetella pertussis | Whooping cough | Complete
Burkholderia cepacia | Lung infections in CF | In progress
Burkholderia pseudomallei | Melioidosis | In progress
Chlamydophila abortus | Veterinary | Funded
Clostridium botulinum | Botulism | Funded
Clostridium difficile | Colitis | In progress
Corynebacterium diphtheriae | Diphtheria | Complete
Erwinia carotovora | Plant pathogen | Funded
Escherichia/Shigella spp. (5) | Various | In progress
Mycobacterium bovis | Tuberculosis | In progress
Mycobacterium marinum | Various | In progress
Neisseria meningitidis (serogroup C) | Bacterial meningitis | In progress
Salmonella typhi | Typhoid fever | Complete
Salmonella spp. (5) | Various | In progress
Staphylococcus aureus (MRSA) | Various (Nosocomial) | Complete
Staphylococcus aureus (MSSA) | Various (Community acquired) | In progress
Streptococcus pneumoniae | Bacterial meningitis | In progress
Streptococcus pyogenes | Various (ARF-associated) | In progress
Streptococcus suis | Veterinary | In progress
Streptococcus uberis | Veterinary | In progress
Streptomyces coelicolor | Non-pathogenic | Complete
Tropheryma whipplei | Whipple's disease | In progress
Wolbachia (Culex quinquefasciatus) | Vector (Bancroftian filariasis) | In progress
Wolbachia (Onchocerca volvulus) | River Blindness | Funded
Yersinia enterocolitica | Food poisoning | In progress
Yersinia pestis | Plague | Complete
Some of the eukaryotic genomes
Organism | Disease / relevance | Status
Aspergillus fumigatus | Farmer's lung | In progress
Dictyostelium discoideum | Soil amoeba | In progress
Entamoeba histolytica | Amoebic dysentery | In progress
Leishmania major | Leishmaniasis | In progress
Plasmodium falciparum | Malaria | In progress
Schistosoma mansoni | Bilharzia | In progress
Schizosaccharomyces pombe | Fission yeast | Complete
Theileria annulata | Veterinary | In progress
Toxoplasma gondii | Toxoplasmosis | In progress
Trypanosoma brucei | Sleeping sickness | In progress
Bioinformatics Flow Chart
1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene & protein expression data
7. Drug screening: ab initio drug design OR drug compound screening in a database of molecules
8. Genetic variability
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
Genome Sequencing - Review
• Strategy: clone by clone vs whole genome shotgun
• Libraries: subcloning; generate small insert libraries
• Sequencing
• Assembly: process of turning raw single-pass reads into contiguous
consensus sequence (Phred/Phrap)
• Closure: process of ordering and merging consensus sequences into a
single contiguous sequence
• Annotation
– DNA features (repeats/similarities)
– Gene finding
– Peptide features
– Initial role assignment
– Others: regulatory regions
• Release: release data to the public, e.g. EMBL or GenBank
• Most genomes will be sequenced and can be sequenced; few problems are
unsolvable.
• The problem lies in understanding what you have:
– Gene prediction/gene finding
– Annotation
Annotation of eukaryotic genomes
Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) ->
Mature mRNA (5' cap, AAAAAAA tail) -> (translation) -> Nascent polypeptide ->
(folding) -> Active enzyme -> Function (Reactant A -> Product B)
Annotation approaches marked on the diagram: ab initio gene prediction,
comparative gene prediction, functional identification
Why do comparative genomics?
• Many of the genes encoded in each genome from the genome
projects had no known or predictable function
• Analysis of protein sets from completely sequenced genomes
• Uniform evolutionary conservation of proteins in microbial genomes,
70% of gene products from sequenced genomes have homologs in
distant genomes (Koonin et al., 1997)
• Function of many of these genes can be predicted by comparing
different genomes of known functional annotation and transferring
functional annotation of proteins from better studied organisms to
their orthologs in lesser studied organisms.
• Cross species comparison to help reveal conserved coding regions
• No prior knowledge of the sequence motif is necessary
• Complement to algorithmic analysis
Assumptions/Limitations
• Homologous genes are relatively well preserved
while noncoding regions tend to show varying
degrees of conservation. Conserved noncoding
regions are believed to be important in
regulating gene expression, maintaining the structural
organization of the genome, and most likely other
functions.
• Cross species comparative genomics is
influenced by the evolutionary distance of the
compared species.
Genome Analysis and Annotation: General Procedure
• Basic procedure to determine the functional and structural annotation of
uncharacterized proteins:
• Use a sequence similarity search program such as BLAST or FASTA to
identify all the functional regions in the sequence. If greater sensitivity is
required, programs based on the Smith-Waterman algorithm are preferred,
at the cost of greater analysis time.
• Identify functional motifs and structural domains by comparing the protein
sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
• Predict structural features of the protein such as signal peptides,
transmembrane segments, coiled-coil regions, and other regions of low
sequence complexity.
• Generate a secondary and, if possible, tertiary structure prediction.
• Annotation:
– Transfer function information from a well-characterized organism to a lesser
studied organism and/or
– Use phylogenetic patterns (or profiles) and/or
– Use the phylogenetic pattern search tools (e.g. through COGs) to perform
systematic formal logical operations (AND, OR, NOT) on gene sets: differential
genome display (Huynen et al., 1997).
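The logical operations on gene sets mentioned in the last annotation step can be sketched with ordinary Python sets. This is only an illustration, not the COGs search tool itself: the group IDs, the genome names, and both helper functions are hypothetical.

```python
# Hypothetical toy data: for each orthologous group, the genomes that
# contain at least one member. IDs and genome names are made up.
phyletic_patterns = {
    "groupA": {"pathogen1", "pathogen2", "commensal"},
    "groupB": {"pathogen1", "pathogen2"},
    "groupC": {"commensal"},
    "groupD": {"pathogen1"},
}


def groups_matching(patterns, present_in=(), absent_from=()):
    """Groups found in ALL of present_in (AND) and NONE of absent_from (NOT)."""
    required = set(present_in)
    excluded = set(absent_from)
    return sorted(
        group
        for group, genomes in patterns.items()
        if required <= genomes and not (excluded & genomes)
    )


def groups_in_any(patterns, genomes_of_interest):
    """Groups found in at least one of the listed genomes (OR)."""
    wanted = set(genomes_of_interest)
    return sorted(g for g, genomes in patterns.items() if wanted & genomes)


# "Differential genome display": groups shared by both pathogens
# but missing from the commensal.
pathogen_specific = groups_matching(
    phyletic_patterns,
    present_in=["pathogen1", "pathogen2"],
    absent_from=["commensal"],
)
print(pathogen_specific)  # ['groupB']
```

Genes in the resulting groups are candidates for functions specific to the chosen genomes; the real tools work on curated ortholog clusters rather than a hand-written dictionary.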
Automated Genome Annotation
• GeneQuiz – limited number of searches per day
• MAGPIE – outside users cannot submit their own sequences
• PEDANT – commercial version allows for full
capacity
• SEALS – semi-automated
General Databases Useful for
Comparative Genomics
• LocusLink/RefSeq: http://www.ncbi.nih.gov/LocusLink/
• PEDANT - Protein Extraction Description ANalysis Tool
http://pedant.gsf.de/
• MIPS – http://mips.gsf.de/
• COGs - Clusters of Orthologous Groups (of proteins)
http://www.ncbi.nih.gov/COG/
• KEGG - Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
• MBGD - Microbial Genome Database
http://mbgd.genome.ad.jp/
• GOLD - Genome OnLine Database
http://wit.integratedgenomics.com/GOLD/
• TOGA – http://www.tigr.org/xxxxx
Problems with existing sequence alignments
algorithms for genomic analysis
• Most algorithms were developed for comparing single protein
sequences or DNA sequences containing a single gene
• Most algorithms were based on assigning a score to all the possible
alignments (usually by the sum of the similarity/identity values for
each aligned residue minus a penalty for the introduction of gaps)
and then finding the optimal or near-optimal alignment based on the
chosen scoring scheme.
• Unfortunately, most of these programs cannot accurately handle
long alignments.
• Linear-space Smith-Waterman variants are too computationally
intensive, requiring either specialized hardware (memory-limited) or a
great deal of time. The trade-off is higher speed vs increased
sensitivity.
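The scoring scheme described above (similarity values for aligned residues minus gap penalties, then a search for the best-scoring alignment) can be sketched as a small Smith-Waterman local alignment scorer. The match, mismatch and gap values are illustrative assumptions, the gap penalty is linear, and only the score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    Only two rows of the dynamic-programming matrix are kept, so memory
    grows with len(b), but every one of the len(a) * len(b) cells is
    still computed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))  # 8 (the shared ACGT)
```

The cell count is what makes this approach costly at genome scale: two sequences of one million bases each already require 10^12 cell updates.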
Genome-size comparative alignment tools
• ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes
– ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et al. 1998)
• BLAT
– http://genome.ucsc.edu/cgi-bin/hgBlat?command=start (Kent xxx)
• DIALIGN - DIagonal ALIGNment
– http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1998; Morgenstern 1999)
• DBA - DNA Block Aligner
– http://www.sanger.ac.uk/Software/Wise2/dba.shtml (Jareborg et al. 1999)
• GLASS - GLobal Alignment SyStem
– http://plover.lcs.mit.edu/ (Batzoglou et al. 2000)
• LSH-ALL-PAIRS - Locality-Sensitive Hashing in ALL PAIRS
– Email: [email protected] (Buhler 2001)
• MegaBlast
– http://www.ncbi.nih.gov/blast/ (Zhang 2000)
• MUMmer - Maximal Unique Match (mer)
– http://www.tigr.org/softlab/ (Delcher et al. 1999)
• PIPMaker - Percent Identity Plot MAKER
– http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
• SSAHA – Sequence Search and Alignment by Hashing Algorithm
– http://www.sanger.ac.uk/Software/analysis/SSAHA/
• WABA - Wobble Aware Bulk Aligner
– http://www.cse.ucsc.edu/~kent/xenoAli/ (Kent & Zahler 2000)
SSAHA
• Sequence Search and Alignment by Hashing Algorithm
• Software tool for very fast matching and alignment of
DNA sequences.
• Achieves fast search speed by converting sequence
information into a hash table data structure which can
then be searched very rapidly for matches
• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• Run from the Unix command line
• Needs more than 1 GB of RAM
• SSAHA algorithm best for applications requiring exact or
“almost exact” matches between two sequences – e.g.
SNP detection, fast sequence assembly, ordering and
orientation of contigs
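The idea of turning sequence information into a hash table that is then searched for matches can be sketched as follows. This is a minimal illustration, not the SSAHA program: the function names and the word length are hypothetical, and indexing only non-overlapping words of the subject is an assumption made here to keep the table small.

```python
from collections import defaultdict


def build_kmer_index(subject, k):
    """Map each non-overlapping k-mer of the subject to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(subject) - k + 1, k):
        index[subject[start:start + k]].append(start)
    return index


def find_hits(query, index, k):
    """Return (query_pos, subject_pos) for every query k-mer found in the index."""
    hits = []
    for q in range(len(query) - k + 1):
        # .get() avoids adding empty entries to the defaultdict
        for s in index.get(query[q:q + k], ()):
            hits.append((q, s))
    return hits


subject = "ACGTACGTTTGACCA"
index = build_kmer_index(subject, 3)
hits = find_hits("GTTTGA", index, 3)
print(hits)  # [(0, 6), (3, 9)]
```

Both hits have the same offset (subject position minus query position is 6), which is the signal that they belong to one contiguous match. The memory cost of holding the whole table is consistent with the RAM requirement noted above.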
Genome Analysis Environment
• MAGPIE - Multipurpose Automated Genome Project
Investigation Environment
• PEDANT
• SEALS
Problems with Visualizing Genomes
• Output from alignment programs was often viewed as a text file, which is
difficult to interpret intuitively when comparing genomes.
• Visualization tools need to handle the complexity and volume of data and
present the information in a comprehensive and comprehensible manner to
a biologist for interpretation.
• Genome alignment visualization tools need to provide:
– Interpretable alignments
– Gene prediction and database homologies from different sources
– Interactive features: real-time capabilities, zooming, searching specific regions of
homology
– Representation of breaks in synteny
– Multiple alignment display
– Display of contigs of unfinished genomes alongside finished genomes
– Handling of various data formats
– Software availability (no black box)
Genome Comparison Visualization Tool
• ACT - Artemis Comparison Tool (displays parsed BLAST
alignments; based on Artemis – an annotation tool)
– http://www.sanger.ac.uk/Software/ACT/
• Alfresco (displays DBA alignments and ...)
– http://www.sanger.ac.uk/Software/Alfresco/ (Jareborg & Durbin 2000)
• PipMaker (displays BlastZ alignments)
– http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
• Enteric/Menteric/Maj (displays Blastz alignments)
– http://glovin.cse.psu.edu/enterix/ (Florea et al. 2000; McClelland et al.
2000)
• Intronerator (displays WABA alignments and ...)
– http://www.cse.ucsc.edu/~kent/intronerator/ (Kent & Zahler 2000b)
• VISTA (Visualization Tool for Alignment) (displays GLASS alignments)
– http://www-gsd.lbl.gov/vista/
• SynPlot (displays DIALIGN and GLASS alignments)
– http://www.sanger.ac.uk/Users/igrg/SynPlot/
Artemis Comparison Tool (ACT)
- ACT is a DNA sequence comparison viewer based on
Artemis
- Can read complete EMBL and GenBank entries or
sequences in FASTA or raw format
- Additional sequence features can be in EMBL, GenBank,
or GFF format
- ACT is free software and is distributed under the GNU
General Public License
- Java-based software
- The latest release, 2.0, better supports eukaryotic genome
comparison
http://www.sanger.ac.uk/Software/ACT/
Salmonella typhi vs. E. coli – SPI-2
[Figure: ACT view; tracks: G+C, S. typhi, tRNA, phage/IS genes, pseudogenes, BLAST hits, E. coli]
Salmonella typhi and Yersinia pestis type III secretion systems
Salmonella typhi vs. E. coli - ACT
[Figure: DNA matches between S. typhi and E. coli; labelled islands: SPI-1, SPI-2, SPI-7 Vi, SPI-9, SPI-10]
Neisseria meningitidis - A vs. B comparison - ACT
Extra Slides 1
ASSIRC
• Accelerated Search for SImilarity Regions in Chromosomes
• ASSIRC finds regions of similarity in pair-wise genomic sequence
alignments.
• The method involves three steps:
– (i) identification of short exact chains of fixed size, called 'seeds', common to
both sequences, using hashing functions;
– (ii) extension of these seeds into putative regions of similarity by a 'random walk'
procedure (i.e. the four bases are associated ...);
– (iii) final selection of regions of similarity by assessing alignments of the putative
sequences.
• The authors used simulations to estimate the proportion of regions of similarity
not detected for particular region sizes, base identity proportions and seed
sizes.
• This approach can be tailored to the user's specifications.
• They looked for regions of similarity between two yeast chromosomes (V
and IX). The efficiency of the approach was compared to that of the
conventional programs BLAST and FASTA, by assessing the CPU time required
and the regions of similarity found for the same data set.
• http://www.biologie.ens.fr/perso/vincens/assirc.html
• ftp://ftp.biologie.ens.fr/pub/molbio/assirc.tar.gz
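Step (i) and a simplified form of step (ii) can be sketched as below. This is not the ASSIRC code: the function names are hypothetical, and the 'random walk' extension is replaced by a plain ungapped extension over identical bases, since the slide does not give enough detail to reproduce the original procedure. Step (iii) is omitted.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, seed_size):
    """Step (i): exact words of fixed size common to both sequences (hash lookup)."""
    positions = defaultdict(list)
    for i in range(len(seq1) - seed_size + 1):
        positions[seq1[i:i + seed_size]].append(i)
    seeds = []
    for j in range(len(seq2) - seed_size + 1):
        for i in positions.get(seq2[j:j + seed_size], ()):
            seeds.append((i, j))
    return seeds


def extend_seed(seq1, seq2, i, j, seed_size):
    """Stand-in for step (ii): grow the seed in both directions while bases match.

    Returns (start in seq1, start in seq2, length of the extended region).
    """
    start1, start2 = i, j
    while start1 > 0 and start2 > 0 and seq1[start1 - 1] == seq2[start2 - 1]:
        start1 -= 1
        start2 -= 1
    end1, end2 = i + seed_size, j + seed_size
    while end1 < len(seq1) and end2 < len(seq2) and seq1[end1] == seq2[end2]:
        end1 += 1
        end2 += 1
    return start1, start2, end1 - start1


def similar_regions(seq1, seq2, seed_size):
    """Extend every seed; overlapping seeds collapse into one region via the set."""
    seeds = find_seeds(seq1, seq2, seed_size)
    return sorted({extend_seed(seq1, seq2, i, j, seed_size) for i, j in seeds})


regions = similar_regions("TTTACGGATCCAAA", "GGGACGGATCCTTT", 5)
print(regions)  # [(3, 3, 8)]
```

The seed size controls the trade-off the slide describes: shorter seeds find more divergent regions but produce more chance matches to extend.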
BLAT
• Only DNA sequences of 25,000 or fewer bases and protein or translated sequences of 5,000 or
fewer letters will be processed. If multiple sequences are submitted at the same time, the total
limit is 50,000 bases or 12,500 letters.
• BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40
bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect
sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins
finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA
BLAT works well on primates, and protein BLAT on land vertebrates.
• BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The
index consists of all non-overlapping 11-mers except for those heavily involved in repeats. The
index takes up a bit less than a gigabyte of RAM. The genome itself is not kept in memory,
allowing BLAT to deliver high performance on a reasonably priced Linux box. The index is used to
find areas of probable homology, which are then loaded into memory for a detailed alignment.
Protein BLAT works in a similar manner, except with 4-mers rather than 11-mers. The protein
index takes a little more than 2 gigabytes.
• BLAT was written by Jim Kent. Like most of Jim's software, interactive use on this web server is
free to all. Sources and executables to run batch jobs on your own server are available free for
academic, personal, and non-profit purposes. Non-exclusive commercial licenses are also
available. Contact Jim for details.
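Before submitting a batch to the web server, the size limits quoted in the first bullet can be checked locally. The sketch below simply encodes those four numbers; the function name is hypothetical, and the limits on the live server may differ from the ones stated on this slide.

```python
# Limits as quoted on this slide (assumed, not read from the server).
DNA_MAX_PER_SEQ = 25_000
PROTEIN_MAX_PER_SEQ = 5_000
DNA_MAX_TOTAL = 50_000
PROTEIN_MAX_TOTAL = 12_500


def within_web_limits(sequences, is_protein=False):
    """True if every sequence and the combined submission fit the quoted limits."""
    per_seq = PROTEIN_MAX_PER_SEQ if is_protein else DNA_MAX_PER_SEQ
    total = PROTEIN_MAX_TOTAL if is_protein else DNA_MAX_TOTAL
    lengths = [len(s) for s in sequences]
    return all(n <= per_seq for n in lengths) and sum(lengths) <= total


# Two 25,000-base DNA sequences fit; three exceed the 50,000-base total.
print(within_web_limits(["A" * 25_000] * 2))  # True
print(within_web_limits(["A" * 25_000] * 3))  # False
```

Larger jobs are the case for which the slide points to running the batch executables on your own server.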