LECTURE 91-15
Analysis of Stage-Specific Gene Expression: Expressed Sequence Tags

Petrus Tang, Ph.D.
Graduate Institute of Basic Medical Sciences and Bioinformatics Center, Chang Gung University
[email protected]
http://petang.cgu.edu.tw
27th December 2002

THE WORLD OF GENOMICS
Published Complete Genome Projects: 95 (including 3 chromosomes)
Prokaryotic Ongoing Genome Projects: 310
Eukaryotic Ongoing Genome Projects: 211 (including 11 chromosomes)
Last update: 18 July 2002

GenBank Sequences
GenBank® is the National Institutes of Health genetic sequence database, an annotated collection of all publicly available DNA sequences. As of June 2002 (Release 130) it held approximately 20,648,748,345 bases in 17,471,130 sequence records (12,055,326 sequences in dbEST, 4,500,000 of them from Homo sapiens).

High Throughput Technologies: The Future of Molecular Medicine
High Throughput Technologies (HTTs) were developed to produce huge amounts of information from genome projects, but they also have clear potential in mass screening and diagnosis of infectious diseases. Applying HTTs may revolutionize diagnostic techniques by replacing multiple individual assays.

[Figure: Genome, Transcriptome, Proteome. Gene, mRNA, Protein (gene products). Gene expression and post-translational modification of proteins.]

[Figure: Expression of Gene A, Gene B and Gene C in muscle, skin and nerve cells, and in normal versus cancer cells, under cell growth and external stress.]

Analysis of Stage-Specific Gene Expression
▲ Northern hybridization
▲ RT-PCR
▲ Differential display, subtraction library, Serial Analysis of Gene Expression (SAGE)
▲ Expressed Sequence Tags (EST)
▲ Real-time PCR
▲ Microarray

Analysis of 10,000-50,000 messages in a transcriptome will generate a relevant profile of gene expression within a cell, providing a quantitative measurement of transcripts for gene discovery.

Microarray
10,000 clones per slide

Serial Analysis of Gene Expression (SAGE)
1. Mix 5 µg total RNA with oligo dT magnetic beads
2. Synthesize double-strand cDNA
3. Digest with NlaIII to form one end of the tag
4. Divide in half and ligate 40 bp adapters (A and B) containing the recognition sequence for the type IIS restriction enzyme BsmFI
5. Cleave with BsmFI to form a ~50 bp tag (40 bp adapter plus 13 bp tag)
6. Fill in 5' overhangs and ligate to form a ~100 bp ditag
7. PCR amplify using ditag primers 1 and 2
8. Cut off the 40 bp adapters with NlaIII to release the 26 bp ditag
9. Ligate ditags to form concatemers
10. Clone and sequence

What are ESTs?
Expressed Sequence Tags are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing one or both ends of a cDNA clone of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge of identifying genes from genomic sequence varies among organisms and depends on genome size as well as on the presence or absence of introns, the intervening DNA sequences that interrupt the protein-coding sequence of a gene.
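The tag-forming steps of the SAGE protocol listed above (NlaIII digestion of bead-bound cDNA, then cleavage a fixed distance downstream) can be mimicked in software when predicting which tag a known transcript should produce. The following is a minimal sketch, not the lecture's own software: the function names are hypothetical, and the tag length of 10 bases downstream of the CATG site is an assumption chosen for illustration (the slide quotes a 13 bp tag).

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # NlaIII recognition sequence


def extract_sage_tag(cdna, tag_len=10):
    """Return the predicted SAGE tag of a cDNA written 5' to 3', or None.

    The cDNA is held on oligo-dT beads by its 3' end, so after NlaIII
    digestion only the fragment downstream of the 3'-most CATG remains.
    The tag is taken as the tag_len bases right after that site
    (tag_len is an illustrative assumption).
    """
    cdna = cdna.upper()
    pos = cdna.rfind(ANCHOR_SITE)          # 3'-most NlaIII site
    if pos == -1:
        return None                        # no anchoring site, no tag
    start = pos + len(ANCHOR_SITE)
    tag = cdna[start:start + tag_len]
    return tag if len(tag) == tag_len else None  # too close to the 3' end


def count_tags(cdnas, tag_len=10):
    """Count how often each tag occurs in a collection of cDNA sequences."""
    tags = (extract_sage_tag(seq, tag_len) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)
```

Counting tags over a library then gives the kind of quantitative transcript profile the slides describe, with one count per observed tag.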
Expressed Sequence Tags (EST)
[Figure: an mRNA drawn 5' to 3', showing the 5'-untranslated region (UTR), START (*), coding sequence (CDS), STOP (*), 3'-UTR and poly-A tail (AAAAAAAAAAA). The 5'-EST is read from the 5' end and the 3'-EST from the 3' end.]

Basic Features and Tools of an Automated EST Analysis Pipeline
▲ Relational database (Oracle 8i)
▲ Automatic data validation
▲ Quality score generation
▲ Automatic trimming of low-quality, vector, adaptor, poly-A tail, low-complexity and contaminant sequences
▲ Automatic running of selected BLAST algorithms, with user-defined parameters, user-selected reference databases, and storage of top results (by user-defined cutoffs) in the database
▲ Includes a web interface for viewing the data in the database, according to the permissions allowed to the viewer (by individual, project, lab or institution)
▲ Includes a Java tool for dbEST submission of newly generated ESTs at intervals defined by the users
▲ System can be readily and simply deployed at any of the partners' institutions
▲ Includes methods for defining a UniGene set for a library

Additional functionalities are needed by the members of the current co-development group, including:
▲ Tissue or organism; integration of gene expression data
▲ Annotations: Gene Ontology annotations, functional motif annotation, metabolic pathway annotations, signal transduction pathways

Data Processing: Raw Nucleotide Sequence
[Flow chart: EST or SAGE clones are sequenced on a MegaBACE 1000. PC branch: Chromas, ABI-format sequence, sorted into high quality and poor quality. UNIX branch: FASTA-format sequence, sorted into high quality and poor quality. PHRED algorithm: remove uncalled/miscalled bases and vector sequence.]

Ewing B et al. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8(3):175-85
Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8(3):186-94

FREEWARE
Trace viewers: to look at an SCF trace file you first have to choose a program.
Commonly used programs for viewing sequencing data are CHROMAS (for PC/Windows), TraceViewer (for Mac) and Trev (contained in the Gap4 Database Viewer, for UNIX).

SeqVerter™ is a free sequence file format conversion utility by GeneStudio, Inc. SeqVerter encapsulates a small subset of the features offered by the GeneStudio Pro suite of programs. While the standalone SeqVerter is a simple dialog-based utility, the free SeqVerter component of the GeneStudio suite adds sophisticated viewers and sequence formatting functions, including a viewer for automatic DNA sequencer chromatogram files (traces).
http://www.genestudio.com/seqverter.htm

Octopus is an interactive program designed for the rapid interpretation of BLAST, BLAST-2 and FASTA output text files. It gives both experienced and inexperienced users an easy-to-use graphical interface for sequence comparison analysis based on the widely used BLAST series of programs and FASTA. Octopus can read result files from various BLAST and BLAST2 servers, from GCG's BLAST and from the original FASTA3 program.
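Before reads leave the data-processing stage, the pipeline described earlier generates quality scores and trims low-quality ends and poly-A tails. A Phred quality value Q is related to the base-calling error probability P by Q = -10 log10(P), so Q20 corresponds to a 1 in 100 chance of a wrong call. The sketch below illustrates that relationship and a deliberately simple end-trimming rule; the function names, the Q20 threshold and the minimum poly-A run of 8 are assumptions for illustration, not the rules used by phred or by the lecture's pipeline.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred quality value Q (Q = -10 log10 P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality value Q for an error probability P."""
    return -10 * math.log10(p)


def trim_low_quality_ends(seq, quals, min_q=20):
    """Strip bases with quality below min_q from both ends of a read.

    Simplified illustration only: bases inside the retained region may
    still fall below min_q.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list differ in length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def trim_poly_a_tail(seq, min_run=8):
    """Remove a 3' poly-A tail if it is at least min_run bases long."""
    stripped = seq.rstrip("Aa")
    if len(seq) - len(stripped) >= min_run:
        return stripped
    return seq
```

For example, a read with qualities `[5, 10, 30, 40, 25, 8]` keeps only its three middle bases at the default threshold.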
EST Analysis: Clustering

ALGORITHM    FUNCTION
PHRED        Remove uncalled/miscalled bases & vector sequence
PHRAP        Assemble clones to form contigs
CONSED       Contig viewer & screen for misassemblies
Wu-Blastn    Group contigs to form clusters of related contigs
Blastx       Homology search against self-generated databases

[Figure: contigs, clusters and singletons drawn against a scale of 1, 500, 1000 and 1500.]

Similarity Search: Blastx
BLAST uses a heuristic algorithm that seeks local as opposed to global alignments and is therefore able to detect relationships among sequences that share only isolated regions of similarity (Altschul et al., 1990).
Blastx: nucleotide query translated in six reading frames vs. protein database.

[Flow chart: clone TV007D02. WWW Blastx: Blastx-nr; Blastx-pfam, smart. GCG Blastx: Blastx-GCG format; Blastx-Octopus viewer.]

InterPro provides an integrated view of the commonly used signature databases, and has an intuitive interface for text- and sequence-based searches.

Bioinformatics infrastructural activities are crucial to modern biological research. Complete and up-to-date databases of biological knowledge are vital for increasingly information-dependent biological and biotechnological research. Secondary protein databases on functional sites and domains, such as PROSITE, PRINTS, SMART, Pfam and ProDom, are vital resources for identifying distant relationships in novel sequences, and hence for predicting protein function and structure. Unfortunately, these signature databases do not share the same formats and nomenclature, and each database has its own strengths and weaknesses. To capitalise on these, the following partners: EBI, SIB, University of Manchester, Sanger Institute, GENE-IT, CNRS/INRA, LION bioscience AG and University of Bergen unified PROSITE, PRINTS, ProDom and Pfam into InterPro (Integrated resource of Protein Families, Domains and Sites). The latest databases to join the project were SMART and, more recently, TIGRFAMs.
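The six-frame translation that Blastx applies to a nucleotide query, as noted in the similarity-search slide above, can be sketched in a few lines. This is an illustration under stated assumptions rather than BLAST's implementation: the function names are hypothetical, the standard genetic code is assumed, and any codon containing an ambiguous base (such as N) is written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Each of the six protein strings is then compared with the protein database, which is why an EST of unknown orientation and reading frame can still find its protein match.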
Annotation: GO

GENE ONTOLOGY™ CONSORTIUM
http://www.geneontology.org
The goal of the Gene Ontology™ Consortium is to produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.

Molecular Function: the tasks performed by individual gene products; examples are transcription factor and DNA helicase.
Biological Process: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
Cellular Component: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.

p53

Classification According to Metabolic & Signalling Pathways
▲ Biocarta (http://biocarta.com)
▲ Kyoto Encyclopedia of Genes & Genomes (http://www.genome.ad.jp/kegg/)
▲ The Cancer Genome Anatomy Project (CGAP) (http://cgap.nci.nih.gov/)

Annotation
ESTs are categorized into the following classes:
▲ ESTs that show homology to known protein motifs/domains
▲ Unique ESTs with no matches
▲ ESTs that match known protein sequences exactly

Cell Component
comp_cell; comp_extracellular; comp_external protective structure; comp_obsolete; comp_unlocalized

Molecular Function
func_enzyme; func_ligand binding or carrier; func_structural molecule; func_signal transducer; func_transcription regulator; func_transporter; func_obsolete; func_enzyme regulator; func_chaperone; func_cell adhesion molecule; func_lysin; func_protein tagging; func_anticoagulant

Biological Process
proc_cell growth and/or maintenance; proc_cell communication; proc_viral life cycle; proc_developmental processes; proc_physiological processes; proc_obsolete; proc_death; proc_biological_process unknown

Automated EST Analysis Pipeline
Project Management; Sequence Management; Clustering; Sequence Analysis; Annotation
dbEST: 12,261,869 (August 2002)

EST Databases: dbEST & UniGene
dbEST (http://www.ncbi.nlm.nih.gov/dbEST/index.html) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.
UniGene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene) is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location.

dbEST Record
NCBI dbEST accession numbers for Trichomonas vaginalis ESTs: BQ621379~BQ621732; BQ625216~BQ625229; BQ640771~BQ640943
Search query (http://www.ncbi.nlm.nih.gov/dbEST/index.html): trichomonas vaginalis AND gbdiv_est[PROP]

1: BQ640943. TVEST017.H09 Tv30...[gi:21765401] Taxonomy
Entry Created: Jul 8 2002
Last Updated: Jul 15 2002

IDENTIFIERS
dbEST Id: 12791004
EST name: TVEST017.H09
GenBank Acc: BQ640943
GenBank gi: 21765401

CLONE INFO
Clone Id: (5')
DNA type: cDNA

PRIMERS
PCR forward: T7
PCR backward: T3
Sequencing: T3
PolyA Tail: Unknown

SEQUENCE
ATTACAGCAATTGCCGATGATTGGCTTGGCATCACTGGCTGGCGTATCGAAAACTTTAAG
CTCGTTAAAGTTGCAGAGATGGGCGCCTTCCACACAGGAGATTCTTATTTGTATCTTCAC
GCTTACCTTGNTTGGCACAAGCAAGCTCGTCCATCGTGATATTTACTTCTGGCAGGGCTC
CACATCCACAACAGATGAGCGCGGTGCTGTTGCTATCAAGGCTGTTGAACTTGATGACAG
ATTTGGAGGCTCTCCAAAGCAACACAGAGAAGTCCAGAACCACGAGTCAGACCAGTTCAT
TGGACTCTTCGATCAGTTTGGCGGTGTTCGCTACCTCGATGGCGGTGTTGAATCAGGATT
CCACAAAGTCACAACATCTGCAAAGGTTGAGATGTACAGAATCAAGGGAAGAAAGCGCCC
AATTCTCCAGATCGTTCCAGCTCAGCGCTCCTCCCTCAACCATGGAGATGTTTTCATTAT
CCATGC

PUTATIVE ID
Assigned by submitter: ACTIN-BINDING PROTEIN FRAGMIN P.
LIBRARY
Lib Name: Tv30236_PT cDNA Library
Organism: Trichomonas vaginalis
Cell line: ATCC30236
Develop. stage: Trophozoites at mid-log phase
Lab host: XL1 Blue-MRF'
Vector: Lambda ZAP-Express (Stratagene)
R. Site 1: EcoRI
R. Site 2: XhoI

SUBMITTER
Name: Tang, P.
Lab: Molecular Regulation and Bioinformatics Laboratory, College of Medicine
Institution: Chang Gung University
Address: 259 Wenhwa 1st. Road, Kweishan, Taoyuan 333, Taiwan
Tel: +886 3 3283016 EXT5136
Fax: +886 3 3283031
E-mail: [email protected]

CITATIONS
Title: Analysis of Gene Expression Profile in Trichomonas vaginalis by EST Sequencing
Authors: Zhou,Y., Shu,W.M., Huang,S.C.C., Huang,K.Y., Tang,P.
Year: 2003
Status: Unpublished

EST & SAGE Based Microarray
▲ Not pre-selected
▲ Can identify gene families
▲ Real gene expressed products
▲ cDNA vs cDNA
▲ Abundance = expression level

Bladder Carcinoma-Specific Microarrays
[Figure: genes, mRNAs, cDNA and ESTs from normal and cancer bladder tissue. Sample labels: Normal, U1, U2, U3, U4, Prognosis, Drug Resistant.]
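The statement "Abundance = expression level" can be made concrete: if every EST in a library has been assigned to a gene cluster, the share of ESTs falling in each cluster estimates that gene's relative expression, and two libraries (for instance normal and cancer tissue) can be compared directly. The sketch below assumes a hypothetical input of one cluster identifier per EST; the function names, the gene labels in the usage example and the per-10,000 scaling are illustrative choices, not part of the lecture.

```python
from collections import Counter


def est_abundance(cluster_ids, per=10_000):
    """Relative abundance of each cluster, scaled to ESTs per `per` reads."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: n * per / total for gene, n in counts.items()}


def compare_libraries(lib_a, lib_b, per=10_000):
    """Pair up the scaled abundance of every cluster seen in either library."""
    a = est_abundance(lib_a, per)
    b = est_abundance(lib_b, per)
    return {
        gene: (a.get(gene, 0.0), b.get(gene, 0.0))
        for gene in sorted(set(a) | set(b))
    }


if __name__ == "__main__":
    # Made-up cluster assignments, for illustration only.
    normal = ["actin", "actin", "fragmin", "tubulin"]
    cancer = ["actin", "fragmin", "fragmin", "fragmin"]
    for gene, (n, c) in compare_libraries(normal, cancer, per=100).items():
        print(f"{gene}: normal={n:.1f} cancer={c:.1f}")
```

With small libraries such counts are noisy, so a real comparison would need many more ESTs per library and a statistical test before calling a gene differentially expressed.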