Sequence Analysis (II)
Yuh-Shan Jou (周玉山)
[email protected]
Institute of Biomedical Sciences, Academia Sinica

Other Areas to Cover
• Genomic Data Annotation
• Common Domains prediction
• WWW
• Other Useful Genome Browsers

But first, some vocabulary...
• YACs (Yeast Artificial Chromosomes): yeast linear vector used to propagate large DNA inserts (100 kb to Mb). Uses a yeast centromere and telomeres to propagate the insert as a chromosome.
• BACs (Bacterial Artificial Chromosomes): E. coli circular plasmid designed to carry large inserts (100-300 kb); single copy (reduces occurrence of chimeric clones).
• Cosmid: E. coli circular plasmid that holds 5 to 50 kb inserts; multicopy.
• Plasmids: E. coli circular vectors designed to propagate DNA inserts (~1 to 10,000 bp). Usually have an origin of replication and an antibiotic resistance marker (pUC8).
• M13: E. coli phage adapted for DNA sequencing. Small DNA inserts can be cloned in the double-stranded plasmid version and converted to the single-stranded version for sequencing.
• cDNA (complementary DNA): DNA synthesized from RNA using reverse transcriptase (an RNA-dependent DNA polymerase).
• EST (Expressed Sequence Tag): single-pass DNA sequence run, from one end, of a cDNA insert in a plasmid.
• ORF (Open Reading Frame): a region of at least 100 codons that is uninterrupted by stop codons and thus potentially encodes a protein.
• SNP (Single Nucleotide Polymorphism): a single base that differs among members of a population. Can be detected by “genotyping” by PCR. Responsible for much trait diversity in populations (physical appearance, diseases, drug response).
• Satellite Marker: short tandem repeat (e.g., CACACACACACACACACAC) with length polymorphisms in a population (e.g., 10 CA’s vs. 25). Can be detected by genotyping. Often used for screening affected populations for disease genes (LOD scores).
• STS (Sequence Tagged Site): short (~500 bp) segment of DNA of known sequence mapped to a location.

Ensembl Database and Web Browser
Erin Pleasance
Canada’s Michael Smith Genome Sciences Centre, Vancouver
www.ensembl.org
Lecture 7.1

What is Ensembl?
• Joint project of EBI and Sanger
• Automated annotation of eukaryotic genomes
• Open source software
• Relational database system
• Web interface

“The main aim of this campaign is to encourage scientists across the world - in academia, pharmaceutical companies, and the biotechnology and computer industries - to use this free information.” - Dr. Mike Dexter, Director of the Wellcome Trust

TPMD: http://tpmd.nhri.org.tw
Nucleic Acids Research, 2005, Vol. 33, Database issue D174-D177

Ensembl components (search tools and data)
• Chromosomes (ChromoView, KaryoView, CytoView, MapView)
• Diseases (DiseaseView)
• SNPs and Haplotypes (SNPView, GeneSNPView, HaploView, LDView)
• Functions (GOView)
• Sequence Similarity (BLAST, SSAHA)
• Genes (GeneView, TransView, ExonView, ProtView)
• Families (DomainView, FamilyView)
• Genome Sequence (ContigView)
• Markers (MarkerView)
• Comparative Genomics (ContigView, MultiContigView, SyntenyView, GeneView)
• Text (TextView)
• Other Annotations
• Anything (EnsMart)

Example 1: Exploring Caspase-3
• Aim to demonstrate basic browsing and views
• Caspase-3 is a gene involved in apoptosis (cell suicide)
• We will look at:
– Gene annotation
– SNPs
– Orthologs and genome alignments
– Alternative transcripts and EST genes

Example 1: Exploring Caspase-3 (walk-through)
• http://www.ensembl.org: go to the human homepage
• Species-specific homepage: site map; statistics of current release
• Finding the tool/view: Site Map
• Click Back; Gene Text Search for caspase-3 on the species-specific homepage
• GeneView: links to ContigView, ExportView, SNPView, TransView of transcript, ExonView, ProteinView
• GeneView: orthologs predicted by sequence similarity and synteny; GeneDAS: get data from external sources
• GeneView: on the same page, information is provided for each transcript individually; links to external databases
• GeneView, GeneSNPView

Other SNP/Haplotype tools
• SNPView
• ProteinView (protein sequence with SNP markup)
• LDView: View linkage disequilibrium (only limited regions)
• HaploView: View haplotypes (only limited regions)
Click Back to GeneView

ContigView
• Chromosome and bands
• Sequence contigs

ContigView: Detailed View
• See other tracks, options in menus
• Gene annotations
• Genscan predictions
• Targeted gene predictions (2 alternative transcripts)
• EST genes
• Other tracks: aligned sequences, etc.

ContigView, MultiContigView
• DNA sequence homology
• Rat ortholog

Other Comparative Genomics Tools
• Saw gene orthology, DNA homology
• Other view is SyntenyView
• Also access comparative genomics through EnsMart

Data Mining with EnsMart
• Allows very fast, cross-data source querying
• Search for genes (features, sequences, etc.) or SNPs based on
– Position; function; domains; similarity; expression; etc.
• Accessible from the Ensembl website (MartView) as well as stand-alone
• Extremely powerful for data mining

Example 2: EnsMart
• A new disease locus has been mapped between markers D21S1991 and D21S171. It may be that the gene involved has already been identified as having a role in another disease. What candidates are in this region?
• EnsMart is based on BioMart
• http://www.ensembl.org/Multi/martview OR
• http://www.ebi.ac.uk/BioMart/martview

EnsMart: Choosing your dataset

EnsMart: Filtering
• Chromosome 21, between markers D21S1991 and D21S171

EnsMart: Output
• Note that you can output different types of information, e.g. sequences

Sequence Similarity Searching
• Use SSAHA for exact matches (fast)
• Use BLAST for more distant similarity (slow)

EnsEMBL BLAST

The ideal annotation of “Gene”
• All clones
• All SNPs
• Promoter(s)
• All mRNAs
• All proteins
• All structures
• All protein modifications
• Ontologies
• Interactions (complexes, pathways, networks)
• Expression (where and when, and how much)
• Evolutionary relationships
Lecture 5.1

Gene number in the human genome
• Consortium: 30,000 ~ 40,000 (2001)
• Celera: 27,000 ~ 38,000 (2001)
• Consortium + Celera: 50,000 (Hogenesch et al. 2001)
• DB searches: 65,000 ~ 75,000 (Wright et al., 2001)
• Human Genome Sciences: 90,000 ~ 120,000 (Haseltine, 2001)
• Consortium Build 34: 35,000 ~ 40,000 (April 2003)
• Consortium Build 35: 20,000 ~ 25,000 (Nature 431:931, 2004)

Human Genome Project: “Why sequence junk?!”
• 90% of the human genome (3.3 × 10^9 bp) is in finished status, i.e. 99% of euchromatin.
• 45% of the genome is repeat sequence.
• 5% of the genome encodes genes (1.5% is coding).
• 35,000 ~ 40,000 genes with multiple splicing products per gene (build 34).
• Finished in April 2003; single-chromosome papers published one by one.
• The entire human genome was finished again in Oct. 2004. The Build 35 assembly has 2.85 billion nucleotides interrupted by only 341 gaps. It covers 99% of the euchromatic genome with an error rate of 1 / 100,000 bases. The human genome seems to encode only 20,000–25,000 protein-coding genes (Nature 431:931-945, 2004).
• Cost of genome sequencing: on average US $1 per base, i.e. 3.3 billion US dollars to sequence the human genome.

Ab initio gene identification
• Goals
– Identify coding exons
– Seek gene structure information
– Get a protein sequence for further analysis
• Relevance
– Characterization of anonymous DNA genomic sequences
– Works on all DNA sequences

Gene Finding on the Web
• GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
– http://compbio.ornl.gov/grailexp
• ORFfinder: NCBI
– http://www.ncbi.nlm.nih.gov/gorf/gorf.html
• DNA translation: Univ. of Minnesota Med. School
– http://alces.med.umn.edu/webtrans.html
• GenLang
– http://cbil.humgen.upenn.edu/~sdong/genlang.html
• BCM GeneFinder: Baylor College of Medicine, Houston, TX
– http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
– http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

[Figure: gene structure and expression. DNA: promoter, “Exon 1” | “Intron 1” | “Exon 2” | “Intron 2” | “Exon 3” | “Intron 3” | “Exon 4”, poly-A signal. Transcription gives the primary transcript, with each intron bounded by GU ... AG. Splicing gives the mature mRNA with cap and polyA (nucleus). Translation runs from Start to Stop (cytoplasm).]

Relative Performance (Sn = sensitivity, Sp = specificity)
Program          Sn (%)   Sp (%)   Overall (Claverie 1997)   Overall (Rogic 2000)
Individual exons
MZEF             78       86       0.79
HEXON            71       65       0.64
SorFind          42       47       0.62
GRAIL II         51       57       0.47
Gene structure
GENSCAN          78       81       0.86                      0.91
FGENES           73       78       0.74                      0.83
GRAIL II/Gap     51       52       0.66
GeneParser       35       40       0.54
HMMgene                                                      0.91

What works best when?
• Genome survey (draft) data: expect only a single exon in any given stretch of contiguous sequence
– BLASTN vs. dbEST (3’ UTR)
– BLASTX vs. nr (protein CDS)
• Finished data: large contigs are available, providing context
– GENSCAN
– HMMgene

Things we are looking to annotate?
• CDS
• mRNA
• Alternative RNA
• Promoter and Poly-A Signal
• Pseudogenes
• ncRNA

Pseudogenes
• As many as 20-30% of all genomic sequence predictions could be pseudogenes
• Non-functional copy of a gene
– Processed pseudogene: retro-transposon derived; no 5’ promoters; no introns; often includes polyA tail
– Non-processed pseudogene: gene duplication derived
– Both include events that make the gene non-functional: frameshifts, stop codons
• We assume pseudogenes have no function, but we really don’t know!
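
The vocabulary's ORF definition (at least 100 codons uninterrupted by stop codons) and the stop codons listed as pseudogene-disabling events can both be illustrated with a simple scan. The following is a minimal sketch, not the method of any tool named in these slides: the function name find_orfs and its parameters are hypothetical, only the three forward reading frames are scanned, and a start codon is not required.

```python
# Standard stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) for stop-free stretches of >= min_codons codons.

    Hypothetical helper: scans the three forward frames only.
    Coordinates are 0-based; end is exclusive and excludes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current stop-free run.
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((run_start, pos, frame))
                run_start = pos + 3
        # Close the final run at the last complete codon of this frame.
        end = frame + ((len(dna) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            orfs.append((run_start, end, frame))
    return orfs

if __name__ == "__main__":
    # Toy sequence; a small threshold is used so the example is readable.
    toy = "ATG" + "GCT" * 4 + "TAA" + "GCT" * 2
    for start, end, frame in find_orfs(toy, min_codons=5):
        print(f"frame {frame}: {start}-{end} ({(end - start) // 3} codons)")
```

A premature stop codon or a frameshift in a gene copy shortens or breaks the run reported in the original frame, which is one reason a long intact ORF is treated as evidence for a functional coding sequence.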
Noncoding RNA (ncRNA)
• ncRNAs represent 98% of all transcripts in a mammalian cell
• ncRNAs have not been taken into account in gene counts, which rely on:
– cDNA
– ORF computational prediction
– Comparative genomics looking at ORFs
• ncRNA can be:
– Structural
– Catalytic
– Regulatory

Non-coding genes
Noncoding RNA database: http://biobases.ibch.poznan.pl/ncRNA
[Figure: 09_04.jpg]
The total number of ncRNAs is still unknown because ncRNAs are difficult to predict from genome sequences.

Noncoding RNA (ncRNA)
• tRNA – transfer RNA: involved in translation
• rRNA – ribosomal RNA: structural component of the ribosome, where translation takes place
• snoRNA – small nucleolar RNA: functional/catalytic in RNA maturation
• Antisense RNA: gene silencing
http://rfam.wustl.edu
http://rna.tbi.univie.ac.at/

RNA folding server: input form
• Input (sequence only)
• RNA or DNA parameters
• Fold Algorithm
• Target temperature
• Advanced fold options
• Output formats
• Email (necessary for large sequences)
• Link to your previous run

RNA folding server: output
• Output in bracket notation
• Output - PostScript
• Free energy (∆G)
• Enthalpy (∆S)
• Melting (de-hybridization) temperature

RNAalifold: Predicts consensus secondary structures for sets of aligned RNA (ClustalW files).
Information from the alignment:
1. Conserved nucleotide pairs are shown normally.
2. Pairs with consistent mutations, which support the structure, are marked by circles.
3. Pairs with inconsistent mutations are shown in two shades of gray.

Graphical Representation – Sequence Logo
• Horizontal axis: position of the base in the sequence.
• Vertical axis: amount of information.
• Letter stack: order indicates importance.
• Letter height: indicates frequency.
• Consensus can be read across the top of the letter columns.
http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html

Tools on the Web for motifs
• MEME – Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme/website/
• metaMEME – Uses HMM method. http://meme.sdsc.edu/meme
• MAST – Motif Alignment and Search Tool. http://meme.sdsc.edu/meme
• TRANSFAC – database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/
• eMotif – scans for, builds, and searches motifs at the protein level. http://motif.stanford.edu/emotif/

Websites for Promoter finding
• Promoter Scan: NIH Bioinformatics (BIMAS) http://bimas.dcrt.nih.gov/molbio/proscan/
• Promoter Scan II: Univ. of Minnesota & Axys Pharmaceuticals http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
• Signal Scan: NIH Bioinformatics (BIMAS) http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
• Transcription Element Search (TESS): Center for Bioinformatics, Univ. of Pennsylvania http://www.cbil.upenn.edu/tess/
• Search TransFac at GBF with MatInspector, PatSearch, and FunSiteP http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
• TargetFinder: Telethon Inst. of Genetics and Medicine, Milan, Italy http://hercules.tigem.it/TargetFinder.html

Transcriptional regulatory region
[Figure: transcriptional regulatory region. Labels: TSS, activator, repressor, RNAP II, GTFs, U1 70K, SR, 1st exon, 5’ GU 3’, regulatory promoter region, core promoter region (~100 bp).]
• TFs play a significant role in differentiation in a number of cell types
• The fact that ~5% of the genes are predicted to encode transcription factors underscores the importance of transcriptional regulation in gene expression (Tupler et al. 2001 Nature. 409:832-833)
• The combinatorial nature of transcriptional regulation and the practically unlimited number of cellular conditions significantly complicate the experimental identification of TF binding sites on a genome scale
• Understanding transcriptional regulation is a major challenge
• Computational approaches can identify potential regulatory elements and modules, and derive new, biologically relevant and testable hypotheses

Transcriptional regulatory module
• cis-regulatory elements are sequence-specific regions that transcription factors bind (e.g. CGGTTAAG, GCTAACGC, AGGCTA)
• TFs combinatorially associate with each other to form modules and regulate their target genes

Mammalian Promoter Database (MPromDb)
(http://bioinformatics.med.ohio-state.edu)

MPromDb 1.0 (Mammalian Promoter Database)
• Human, mouse & rat
• Search by gene symbol; GenBank Acc. Num.; UniGene/LocusLink ID; TF binding site name
• Example: BAX gene promoter with TF binding site annotations, with supporting evidence from 3 PubMed records

MPromDb (Mammalian Promoter Database)
• Promoter sequences with annotations of experimentally supported TF binding sites
• Promoter sequences with annotations of computationally predicted TF binding sites
• A platform for statistical analysis & pattern recognition, to predict TF binding sites in uncharacterized promoters and to model combinatorial association of TF binding sites
• A platform for comparative genomics, to reveal conserved regions across genomes of different species
– Identification of core promoters
– Identification of all the human-mouse-rat homologue pairs
– Modeling and identification of TF binding sites & modules

Specific databases of protein sequences and structures
• Swissprot
• PIR
• TREMBL (translated from DNA)
• PDB (Three Dimensional Structures)
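
Returning to the transcriptional regulatory module slide: the idea that cis-regulatory elements are short, sequence-specific sites can be sketched as an exact-match scan of a promoter on both strands. This is only an illustration under stated assumptions. The function names and the toy promoter are hypothetical, the three motifs are the example strings from the slide, and real tools such as those listed under promoter finding typically use weight matrices rather than exact matching.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_sites(promoter, motifs):
    """Return sorted (position, strand, motif) tuples for exact motif matches.

    Hypothetical helper: "+" means the motif occurs as written, "-" means its
    reverse complement occurs. Positions are 0-based on the given strand.
    """
    promoter = promoter.upper()
    hits = []
    for motif in motifs:
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            start = promoter.find(pattern)
            while start != -1:
                hits.append((start, strand, motif))
                start = promoter.find(pattern, start + 1)  # allow overlaps
    return sorted(hits)

if __name__ == "__main__":
    # Toy promoter built for illustration; not taken from any database.
    toy_promoter = "TTCGGTTAAGAAGCGTTAGCTTAGGCTATT"
    for position, strand, motif in scan_sites(
        toy_promoter, ["CGGTTAAG", "GCTAACGC", "AGGCTA"]
    ):
        print(position, strand, motif)
```

Sites that fall close together in the output are candidates for the kind of combinatorial module described on the slide, although co-occurrence alone does not establish that the factors act together.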
Protein Structure
• Primary: amino acid sequence
• Secondary: alpha helices & beta sheets, loops
• Tertiary: packing of secondary elements
• Quaternary: packing of several polypeptide chains

Structure Prediction: Motivation
• Hundreds of thousands of gene sequences translated to proteins (GenBank, SW, PIR)
• Only about 28,000 solved structures (PDB)
• Goal: predict protein structure based on sequence information

Structure Prediction: Motivation
• Understand protein function
– Locate binding sites
• Broaden homology
– Detect similar function where sequence differs
• Explain disease
– See effect of amino acid changes
– Design suitable compensatory drugs

Prediction Approaches
• Primary (sequence) to secondary structure
– Sequence characteristics
• Secondary to tertiary structure
– Fold recognition
– Threading against known structures
• Primary to tertiary structure
– Ab initio modelling

Secondary structure prediction
• AGADIR - An algorithm to predict the helical content of peptides
• APSSP - Advanced Protein Secondary Structure Prediction Server
• GOR - Garnier et al, 1996
• HNN - Hierarchical Neural Network method (Guermeur, 1997)
• Jpred - A consensus method for protein secondary structure prediction at University of Dundee
• JUFO - Protein secondary structure prediction from sequence (neural network)
• nnPredict - University of California at San Francisco (UCSF)
• PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
• Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
• PSA - BioMolecular Engineering Research Center (BMERC) / Boston
• PSIpred - Various protein structure prediction methods at Brunel University
• SOPMA - Geourjon and Deléage, 1995
• SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California
• DLP - Domain linker prediction at RIKEN
http://searchlauncher.bcm.tmc.edu/

Multiple Sequence Alignment

ClustalW Algorithm
Progressive Sequence Alignment (Higgins and Sharp 1988)
• Compute pairwise alignments for all pairs of sequences.
• Use the alignment scores to build a phylogenetic tree such that
– similar sequences are neighbors in the tree
– distant sequences are distant from each other in the tree.
• The sequences are progressively aligned according to the branching order in the guide tree.
• http://www.ebi.ac.uk/clustalw/

ClustalW Input
• Fast alignment?
• Scoring matrix
• Input sequences
• Alignment format
• Fast alignment options
• Gap scoring
• Phylogenetic trees

ClustalW Output (1)
• Input sequences
• Pairwise alignment scores
• Building alignment
• Final score

ClustalW Output (2)
• Sequence names
• Sequence positions
• Match strength in decreasing order: * : .

Phylogenetic Trees
• Represent closeness between many entities
– In our case, genomic or protein sequences
[Figure: example tree with leaves chimp, human, monkey. Labels: distance representation, observed entity, unobserved commonality.]

Progressive Sequence Alignment (Protein sequences example)
• Sequences: NYLS, NKYLS, NFS, NFLS
• NYLS + NKYLS → N K/- Y L S
• NFS + NFLS → N F L/- S
• Both profiles → N K/- Y/F L/- S

MSA Approaches
• Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE
• Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign.
• Statistical methods: Hidden Markov Models. SAM2K
• Genetic algorithm: SAGA

Multiple Alignment tools on the Web (Some URLs)
• EMBL-EBI http://www.ebi.ac.uk/clustalw/
• BCM Search Launcher: Multiple Alignment http://dot.imgen.bcm.tmc.edu:9331/multi-align/multialign.html
• Multiple Sequence Alignment for Proteins (Wash. U. St. Louis) http://www.ibc.wustl.edu/service/msa/

Editing Multiple Alignments
• There are a variety of tools that can be used to modify a multiple alignment.
• These programs can be very useful in formatting and annotating an alignment for publication.
• An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.
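
The first two steps of the ClustalW algorithm slide (score every pair, then derive a branching order) can be sketched on the four example sequences NYLS, NKYLS, NFS and NFLS. This is a simplified illustration, not ClustalW itself: the scoring values (match +1, mismatch -1, gap -1), the sequence labels s1 to s4, the function names, and the use of average-linkage clustering on raw alignment scores are all assumptions made for the example.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def guide_order(seqs):
    """Return the list of cluster merges, most similar clusters first.

    seqs maps a label to a sequence. Clusters are tuples of labels, and the
    similarity of two clusters is the average pairwise alignment score.
    """
    score = {
        frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def average(pair):
        x, y = pair
        total = sum(score[frozenset((i, j))] for i in x for j in y)
        return total / (len(x) * len(y))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        x, y = max(combinations(clusters, 2), key=average)
        merges.append((x, y))
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
    return merges

if __name__ == "__main__":
    example = {"s1": "NYLS", "s2": "NKYLS", "s3": "NFS", "s4": "NFLS"}
    for left, right in guide_order(example):
        print("align", left, "with", right)
```

With these assumed scores the merge order pairs NYLS with NKYLS first and NFS with NFLS second, which is the same grouping shown in the progressive alignment example; the final step aligns the two resulting profiles.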
GCG alignment editors
• Alignments produced with PILEUP (or CLUSTAL) can be adjusted with LINEUP.
• Nicely shaded printouts can be produced with PRETTYBOX.
• GCG's SeqLab X-Windows interface has a superb multiple sequence editor - the best editor of any kind.

SeqVu

Editors on the Web
• Check out CINEMA (Colour INteractive Editor for Multiple Alignments)
– It is an editor created completely in JAVA (old browsers beware)
– It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module
http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/

Questions:
1. Download a protein sequence and predict its domains, secondary structure and post-translational modifications.
2. Download all SARS virus genomes and perform MSA.