LECTURE 91-15
Analysis of Stage-Specific
Gene Expression:
Expressed Sequence Tags
Petrus Tang, Ph.D.
Graduate Institute of Basic Medical Sciences
and
Bioinformatics Center, Chang Gung University.
[email protected]
http://petang.cgu.edu.tw
27th December 2002
THE WORLD OF GENOMICS
Published Complete Genome Projects: 95
(including 3 chromosomes)
Prokaryotic Ongoing Genome Projects: 310
Eukaryotic Ongoing Genome Projects: 211
(including 11 chromosomes)
Last update: 18 July 2002
GenBank Sequences
GenBank® is the National Institutes
of Health genetic sequence
database, an annotated collection
of all publicly available DNA
sequences.
There are approximately
20,648,748,345 bases in
17,471,130 sequence records as of
June 2002 (Release 130),
including 12,055,326 sequences in dbEST,
of which 4,500,000 are from Homo sapiens.
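As a quick check on what these figures imply, the short sketch below derives the mean record length and the share of records that are ESTs. The three input numbers are the ones quoted above; the variable names are only illustrative, and the calculation assumes the dbEST count is included in the GenBank record total.

```python
# Figures quoted on the slide (GenBank, June 2002).
total_bases = 20_648_748_345
total_records = 17_471_130
dbest_records = 12_055_326

# Average number of bases per sequence record.
mean_record_length = total_bases / total_records

# Fraction of all records that are ESTs (assumes dbEST is counted in the total).
est_share = dbest_records / total_records

print(f"Mean record length: {mean_record_length:.0f} bases")
print(f"EST share of records: {est_share:.1%}")
```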
High Throughput Technologies: The future of Molecular Medicine
High Throughput Technologies (HTTs) were developed to produce huge amounts of
information from genome projects, but they also have clear potential in mass screening
and diagnostics of infectious diseases. The application of HTTs may revolutionize
diagnostic techniques and replace multiple individual assays.
Genome (Gene) → Transcriptome (mRNA) → Proteome (Protein)
Gene Products
Gene Expression & Post-Translational Modification of Proteins
Muscle cell
Skin cell
Gene A
Gene B
Gene C
Cell Growth, External Stress
Gene A
Gene B
Gene C
Nerve cell
Normal cell
Cancer cell
Analysis of Stage-Specific
Gene Expression
Northern Hybridization
RT-PCR
Differential Display, Subtraction Library,
Serial Analysis of Gene Expression (SAGE)
Expressed Sequence Tags (EST)
Real-Time PCR
Microarray
Analysis of 10,000-50,000 messages in a transcriptome will
generate a relevant profile of gene expression within a cell,
providing a quantitative measurement of
transcripts for gene discovery.
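A minimal sketch of how such a profile might be computed, assuming each sequenced clone has already been assigned to a gene. The gene labels and counts below are invented for illustration, and `expression_profile` is a hypothetical helper, not part of any tool named in this lecture.

```python
from collections import Counter

def expression_profile(gene_hits):
    """Return {gene: fraction of all sampled transcripts} from a list of gene labels."""
    counts = Counter(gene_hits)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical sample: each entry is the gene one sequenced clone was assigned to.
sample = ["actin"] * 6 + ["tubulin"] * 3 + ["fragmin"] * 1
profile = expression_profile(sample)

# Print genes from most to least abundant.
for gene, fraction in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(f"{gene}\t{fraction:.2f}")
```

The more messages are sampled, the more reliable the fractions for rare transcripts become, which is why the slide speaks of 10,000-50,000 messages.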
Microarray: 10,000 clones per slide
Serial Analysis of Gene Expression (SAGE)
1. Mix 5 µg total RNA with oligo-dT magnetic beads
2. Synthesize double-stranded cDNA
3. Digest with NlaIII to form one end of the tag
4. Divide in half and ligate 40 bp adapters (A and B) containing the
recognition sequence for the type IIS restriction enzyme BsmFI
5. Cleave with BsmFI to form a ~50 bp tag (40 bp adapter/13 bp tag)
6. Fill in 5' overhangs and ligate to form a ~100 bp ditag
7. PCR amplify using ditag primers 1 and 2
8. Cut 40 bp adapters with NlaIII to release the 26 bp ditag
9. Ligate ditags to form concatemers
10. Clone and sequence
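The tag-forming steps above can be mimicked in silico. The sketch below is only an illustration under stated assumptions: it takes the tag to be the bases immediately 3' of the last NlaIII site (CATG) in a cDNA, uses the 13 bp tag length quoted on the slide as a default (the slide does not say whether that length includes part of the CATG site), and models tail-to-tail ligation by appending the reverse complement of the second tag. The function names and example sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (N is kept as N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def sage_tag(cdna, anchor="CATG", tag_len=13):
    """Return the tag_len bases 3' of the last anchor site, or None if unavailable."""
    seq = cdna.upper()
    pos = seq.rfind(anchor)          # 3'-most NlaIII site
    if pos == -1:
        return None                  # no NlaIII site, no tag
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None

def make_ditag(tag_a, tag_b):
    """Join two tags tail to tail, as in ditag formation."""
    return tag_a + reverse_complement(tag_b)

# Made-up cDNA fragments, each containing at least one CATG site.
cdna_a = "AAA" + "CATG" + "TTTT" + "CATG" + "ACGTACGTACGTA" + "AAAAA"
cdna_b = "GG" + "CATG" + "TTGCAATTGCAAT" + "AAAA"

tag_a = sage_tag(cdna_a)
tag_b = sage_tag(cdna_b)
ditag = make_ditag(tag_a, tag_b)
print(tag_a, tag_b, len(ditag))
```

Counting how often each tag appears in the sequenced concatemers then gives the expression estimate for the corresponding transcript.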
What are ESTs?
Expressed Sequence Tags are small pieces of DNA sequence (usually 200 to 500
nucleotides long) that are generated by sequencing either one or both ends of
an expressed gene. The idea is to sequence bits of DNA that represent genes
expressed in certain cells, tissues, or organs from different organisms and
use these "tags" to fish a gene out of a portion of chromosomal DNA by
matching base pairs. The challenge associated with identifying genes from
genomic sequences varies among organisms and is dependent upon genome size as
well as the presence or absence of introns, the intervening DNA sequences
interrupting the protein coding sequence of a gene.
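The "fishing" idea can be illustrated with a deliberately naive sketch that looks for an exact match of an EST on either strand of a genomic sequence. This is an assumption-laden toy: real ESTs contain sequencing errors and may span introns, so practical tools use spliced, error-tolerant alignment instead. The function name and the sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (N is kept as N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def locate_est(est, genomic):
    """Return (strand, 0-based start) of the first exact match, or None."""
    est, genomic = est.upper(), genomic.upper()
    pos = genomic.find(est)
    if pos != -1:
        return ("+", pos)
    # Try the opposite strand: the EST may have been read from the other end.
    pos = genomic.find(reverse_complement(est))
    if pos != -1:
        return ("-", pos)
    return None

# Made-up genomic fragment and two made-up tags.
genomic = "TTGACCATGGCTAGCTAGGATCCAA"
print(locate_est("GGCTAGCTAG", genomic))   # forward-strand tag
print(locate_est("AGCCATGG", genomic))     # tag from the reverse strand
print(locate_est("TTTTTTTT", genomic))     # no match
```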
Expressed Sequence Tags (EST)
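[Figure: a cDNA drawn 5' to 3': 5'-untranslated region (UTR), START codon, coding sequence (CDS), STOP codon, 3'-UTR, poly-A tail (AAAAAAAAAAA). The 5'-EST is read from the 5' end and the 3'-EST from the 3' end.]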
Basic Features and Tools of an Automated
EST Analysis Pipeline
▲ Relational database (Oracle 8i)
▲ Automatic data validation
▲ Quality score generation
▲ Automatic trimming of low-quality, vector, adaptor, poly-A tails,
low-complexity and contaminant sequences
▲ Automatic running of selected BLAST algorithms, with user-defined parameters,
user-selected reference databases, and storage of top results (by user-defined cutoffs) in the database
▲ Includes a web interface for viewing the data in the database, according to the
permissions allowed to the viewer (by individual, project, lab or institution)
▲ Includes a Java tool for dbEST submission of newly generated ESTs at intervals
defined by the users
▲ System can be readily and simply deployed at any of the partner's institutions
▲ Includes methods for defining a Unigene set for a library.
Additional functionalities are needed by the members of the current
co-development group, including:
▲ Integration of gene expression data by tissue or organism.
▲ Annotations: Gene ontology annotations, functional motif annotation, metabolic
pathways annotations, signal transduction pathways.
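To make the automatic trimming step listed among the pipeline features more concrete, here is a minimal sketch of poly-A tail removal followed by a length check. It is not the pipeline's actual code: the function names, the minimum run of 10 bases, and the 100-base length cutoff are all assumptions chosen for illustration.

```python
import re

def trim_poly_a(seq, min_run=10):
    """Remove a 3' poly-A tail, or a 5' poly-T run from a reverse read,
    when the run is at least min_run bases long."""
    seq = seq.upper()
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)   # 3' poly-A tail
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)   # 5' poly-T (reverse-strand read)
    return seq

def passes_length_filter(seq, min_len=100):
    """Keep only reads that are still long enough after trimming."""
    return len(seq) >= min_len

# Made-up reads for illustration.
read_with_tail = "ACGTACGT" + "A" * 12
read_reverse = "T" * 15 + "GGCC"
print(trim_poly_a(read_with_tail))
print(trim_poly_a(read_reverse))
print(passes_length_filter(trim_poly_a(read_with_tail)))
```

Vector, adaptor and contaminant screening would follow the same pattern, but with a similarity search against the relevant sequences rather than a fixed pattern.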
Data Processing – Raw Nucleotide Sequence
EST or SAGE clones sequenced (MegaBACE 1000)
PC: Chromas, ABI format; sequence split into high quality and poor quality
UNIX: FASTA format; sequence split into high quality and poor quality
Remove uncalled/miscalled bases & vector sequence
PHRED algorithm
Ewing B et al. (1998)
Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8(3):175-85
Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8(3):186-94
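Phred assigns each base a quality value Q that is related to the estimated error probability P by Q = -10 log10(P), so Q20 corresponds to a 1 in 100 chance of a wrong call. The sketch below converts scores to probabilities and trims low-quality ends. The trimming rule (drop bases from each end until one reaches a threshold of 20) is a simplification assumed here for illustration; it is not phred's own trimming procedure, and the function names are hypothetical.

```python
def error_probability(q):
    """Convert a phred quality value to the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)

def trim_by_quality(seq, quals, min_q=20):
    """Trim from both ends until a base with quality >= min_q is reached."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

# Made-up read with poor quality at both ends.
read = "ACGTACGT"
quals = [5, 10, 30, 40, 40, 30, 10, 5]
trimmed_seq, trimmed_quals = trim_by_quality(read, quals)
print(trimmed_seq, trimmed_quals)
print(error_probability(30))
```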
FREEWARE
Trace Viewers: In order to take a look at the SCF file you first have to choose a
program. Very commonly used programs for viewing the sequencing data are
CHROMAS (for PC/Windows), TraceViewer (for MAC) and Trev (contained in the Gap4
Database Viewer, for UNIX).
SeqVerter™ is a free sequence file format conversion utility by GeneStudio,
Inc. SeqVerter encapsulates a small subset of the features offered by the
GeneStudio Pro suite of programs. While the standalone SeqVerter is a
simple dialog-based utility, the free SeqVerter component of the
GeneStudio suite adds sophisticated viewers and sequence formatting
functions, including a viewer for automatic DNA sequencer chromatogram
files (traces). http://www.genestudio.com/seqverter.htm
Octopus is an interactive program designed for the rapid interpretation of BLAST,
BLAST-2 and FASTA output text files. It provides an easy-to-use graphical user interface
for both experienced and inexperienced users with sequence comparison analysis based
on the widely used BLAST series of programs and FASTA. Octopus is able to read results
files coming from various BLAST and BLAST2 servers, the GCG's BLAST and the
original FASTA3 program.
EST Analysis : Clustering
ALGORITHM    FUNCTION
PHRED        Remove uncalled/miscalled bases & vector sequence
PHRAP        Assemble clones to form contigs
CONSED       Contig viewer & screen for misassemblies
Wu-Blastn    Group contigs to form clusters of related contigs
Blastx       Homology search against self-generated databases
[Figure: contigs, clusters and singletons; axis ticks at 1, 500, 1000 and 1500]
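The grouping of contigs into clusters, with unmatched sequences left as singletons, can be sketched as single-linkage clustering over pairwise similarity hits. This is only one plausible way to do it, assumed here for illustration: the pairs would come from something like a BLASTN comparison above a chosen cutoff, and the identifiers and function name are hypothetical.

```python
def cluster_sequences(ids, similar_pairs):
    """Group sequence ids into clusters using union-find over similarity pairs.

    Returns (clusters, singletons): clusters are lists with more than one member,
    singletons are ids that were not linked to anything.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the representative, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)

    clusters = [members for members in groups.values() if len(members) > 1]
    singletons = [members[0] for members in groups.values() if len(members) == 1]
    return clusters, singletons

# Hypothetical contig ids and pairwise hits.
contigs = ["c1", "c2", "c3", "c4", "c5"]
hits = [("c1", "c2"), ("c2", "c3")]
clusters, singletons = cluster_sequences(contigs, hits)
print(clusters, singletons)
```

Single linkage is simple but can chain unrelated sequences together through a shared repeat or chimeric clone, which is one reason the assembled contigs are also screened for misassemblies.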
Similarity Search: Blastx
BLAST uses a heuristic algorithm which seeks local as opposed to global
alignments and is therefore able to detect relationships among
sequences which share only isolated regions of similarity (Altschul et al.,
1990)
Nucleotide query translated to six reading frames
vs protein database
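The six-frame translation that Blastx performs on the query can be sketched as follows. The code uses the standard genetic code and marks any codon containing an ambiguous base as X; it illustrates only the translation step, not the BLAST search itself, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Reverse-complement a DNA string (N is kept as N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def translate(seq):
    """Translate complete codons only; unknown codons become X, stops become *."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Return the translations of frames +1..+3 and -1..-3."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAA")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```

Searching all six frames matters for ESTs because the reading frame and strand of a single-pass read are not known in advance.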
Example query: clone TV007D02
WWW Blastx: Blastx-nr; Blastx-pfam, smart
GCG Blastx: Blastx-GCG format; Blastx-Octopus viewer
InterPro provides an integrated view of the commonly used signature
databases, and has an intuitive interface for text- and sequence-based
searches.
Bioinformatics infrastructural activities are crucial to modern biological research.
Complete and up-to-date databases of biological knowledge are vital for the
increasingly information-dependent biological and biotechnological research.
Secondary protein databases on functional sites and domains like PROSITE,
PRINTS, SMART, Pfam, ProDom, etc. are vital resources for identifying distant
relationships in novel sequences, and hence for predicting protein function and
structure. Unfortunately, these signature databases do not share the same
formats and nomenclature, and each database has its own strengths and
weaknesses.
To capitalise on these, the following partners: EBI, SIB, University of Manchester,
Sanger Institute, GENE-IT, CNRS/INRA, LION bioscience AG and University of
Bergen unified PROSITE, PRINTS, ProDom and Pfam into InterPro (Integrated
resource of Protein Families, Domains and Sites). The latest databases to join
the project were SMART, and more recently, TIGRFAMs.
Annotation - GO
GENE ONTOLOGY™ CONSORTIUM
http://www.geneontology.org
The goal of the Gene Ontology™ Consortium is to produce a dynamic controlled
vocabulary that can be applied to all organisms even as knowledge of gene and
protein roles in cells is accumulating and changing.
Molecular Function
the tasks performed by individual gene products;
examples are transcription factor and DNA helicase.
Biological Process
broad biological goals, such as mitosis or purine
metabolism, that are accomplished by ordered
assemblies of molecular functions.
Cellular Component
subcellular structures, locations, and macromolecular
complexes; examples include nucleus, telomere, and
origin recognition complex.
p53
Classification According to Metabolic & Signalling Pathways
Biocarta
( http://biocarta.com)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
http://www.genome.ad.jp/kegg/
The Cancer Genome Anatomy Project
(CGAP) http://cgap.nci.nih.gov/
Annotation
ESTs are categorized into the following classes:
ESTs that show homology to known protein motifs/domains
Unique ESTs with no matches
ESTs that match known protein sequences exactly
Cell Component
comp_cell
comp_extracellular
comp_external protective structure
comp_obsolete
comp_unlocalized
Molecular Function
func_enzyme
func_ligand binding or carrier
func_structural molecule
func_signal transducer
func_transcription regulator
func_transporter
func_obsolete
func_enzyme regulator
func_chaperone
func_cell adhesion molecule
func_lysin
func_protein tagging
func_anticoagulant
Biological Process
proc_cell growth and/or maintenance
proc_cell communication
proc_viral life cycle
proc_developmental processes
proc_physiological processes
proc_obsolete
proc_death
proc_biological_process unknown
Automated EST Analysis Pipeline
Project Management
Sequence Management
Clustering
Sequence Analysis
Annotation
dbEST: 12,261,869 (Aug 2002)
EST Databases – dbEST & UniGene
dbEST
(http://www.ncbi.nlm.nih.gov/dbEST/index.html) is a division of GenBank
that contains sequence data and other information on "single-pass" cDNA
sequences, or Expressed Sequence Tags, from a number of organisms.
UniGene
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene) is an
experimental system for automatically partitioning GenBank sequences
into a non-redundant set of gene-oriented clusters. Each UniGene cluster
contains sequences that represent a unique gene, as well as related
information such as the tissue types in which the gene has been expressed
and map location.
dbEST Record
NCBI dbEST accession numbers for Trichomonas vaginalis ESTs
BQ621379~BQ621732; BQ625216~BQ625229; BQ640771~BQ640943
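A small sketch for expanding accession ranges like these into individual accession numbers, for example to count or batch-retrieve the records. It assumes each range runs over consecutive numbers with a shared letter prefix, which is how the ranges above are written; the function name is hypothetical.

```python
import re

def expand_range(accession_range):
    """Expand 'BQ621379~BQ621732' into the list of accessions it covers."""
    start, end = [part.strip() for part in accession_range.split("~")]
    m_start = re.fullmatch(r"([A-Z]+)(\d+)", start)
    m_end = re.fullmatch(r"([A-Z]+)(\d+)", end)
    if not m_start or not m_end or m_start.group(1) != m_end.group(1):
        raise ValueError(f"Unexpected accession range: {accession_range!r}")
    prefix = m_start.group(1)
    width = len(m_start.group(2))          # keep any zero padding
    first, last = int(m_start.group(2)), int(m_end.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]

# The ranges listed on the slide.
listed = "BQ621379~BQ621732; BQ625216~BQ625229; BQ640771~BQ640943"
accessions = []
for rng in listed.split(";"):
    accessions.extend(expand_range(rng))

print(len(accessions), accessions[0], accessions[-1])
```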
1: BQ640943. TVEST017.H09 Tv30...[gi:21765401] Taxonomy
Entry Created: Jul 8 2002
Last Updated: Jul 15 2002
IDENTIFIERS
dbEST Id:       12791004
EST name:       TVEST017.H09
GenBank Acc:    BQ640943
GenBank gi:     21765401
CLONE INFO
Clone Id:       (5')
DNA type:       cDNA
PRIMERS
PCR forward:    T7
PCR backward:   T3
Sequencing:     T3
PolyA Tail:     Unknown
SEQUENCE
ATTACAGCAATTGCCGATGATTGGCTTGGCATCACTGGCTGGCGTATCGAAAACTTTAAG
CTCGTTAAAGTTGCAGAGATGGGCGCCTTCCACACAGGAGATTCTTATTTGTATCTTCAC
GCTTACCTTGNTTGGCACAAGCAAGCTCGTCCATCGTGATATTTACTTCTGGCAGGGCTC
CACATCCACAACAGATGAGCGCGGTGCTGTTGCTATCAAGGCTGTTGAACTTGATGACAG
ATTTGGAGGCTCTCCAAAGCAACACAGAGAAGTCCAGAACCACGAGTCAGACCAGTTCAT
TGGACTCTTCGATCAGTTTGGCGGTGTTCGCTACCTCGATGGCGGTGTTGAATCAGGATT
CCACAAAGTCACAACATCTGCAAAGGTTGAGATGTACAGAATCAAGGGAAGAAAGCGCCC
AATTCTCCAGATCGTTCCAGCTCAGCGCTCCTCCCTCAACCATGGAGATGTTTTCATTAT
CCATGC
http://www.ncbi.nlm.nih.gov/dbEST/index.html
trichomonas vaginalis AND gbdiv_est[PROP]
PUTATIVE ID Assigned by submitter
ACTIN-BINDING PROTEIN FRAGMIN P.
LIBRARY
Lib Name:       Tv30236_PT cDNA Library
Organism:       Trichomonas vaginalis
Cell line:      ATCC30236
Develop. stage: Trophozoites at mid-log phase
Lab host:       XL1 Blue-MRF'
Vector:         Lambda ZAP-Express (Stratagene)
R. Site 1:      EcoRI
R. Site 2:      XhoI
SUBMITTER
Name:           Tang, P.
Lab:            Molecular Regulation and Bioinformatics Laboratory, College of Medicine
Institution:    Chang Gung University
Address:        259 Wenhwa 1st. Road, Kweishan, Taoyuan 333, Taiwan
Tel:            +886 3 3283016 EXT5136
Fax:            +886 3 3283031
E-mail:         [email protected]
CITATIONS
Title:          Analysis of Gene Expression Profile in Trichomonas vaginalis by EST Sequencing
Authors:        Zhou,Y., Shu,W.M., Huang,S.C.C., Huang,K.Y., Tang,P.
Year:           2003
Status:         Unpublished
EST & SAGE Based Microarray
Not Pre-selected
Can identify Gene Families
Real Gene Expressed Products
cDNA vs cDNA
Abundance = Expression Level
Normal, U1,U2,U3,U4, Prognosis, Drug Resistant
[Figure: bladder tissue, normal and cancer: genes → mRNAs → cDNA → ESTs → bladder carcinoma-specific microarrays]