Alternative Splicing from ESTs
Eduardo Eyras
Bioinformatics UPF – February 2004

• Intro
• ESTs
• Prediction of Alternative Splicing from ESTs

[Figure: gene (exons, introns) → transcription → pre-mRNA → splicing → mature mRNA (5’ CAP, poly-A tail) → translation → peptide]

[Figure: the same gene with different splicing → different mature mRNA → different peptide]

Alt splicing as a mechanism of gene regulation
• Functional domains can be added/subtracted: protein diversity
• Can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
• It can modify the activity of transcription factors, affecting the expression of genes
• It is observed in nearly all metazoans
• Estimated to occur in 30%-40% of human genes

Forms of alternative splicing
• Exon skipping / inclusion
• Alternative 3’ splice site
• Alternative 5’ splice site
• Mutually exclusive exons
• Intron retention
[Figure legend: constitutive exon; alternatively spliced exons]

How to study alternative splicing?

ESTs (Expressed Sequence Tags)
• Single-pass sequencing of a small (end) piece of cDNA
• Typically 200-500 nucleotides long
• It may contain coding and/or non-coding regions

ESTs
[Figure: cells from a specific organ, tissue or developmental stage → mRNA extraction → add oligo-dT primer → reverse transcriptase (RNA/DNA hybrid) → Ribonuclease H and DNA polymerase → double-stranded cDNA]

ESTs
[Figure: clone cDNA into a vector → multiple cDNA clones → single-pass sequence reads from each end (5’ EST and 3’ EST)]

Alternative Splicing from ESTs
[Figure: genomic → primary transcript → splicing → splice variants → cDNA clones → EST sequences (5’ and 3’)]

Alternative Splicing from ESTs
ESTs can also provide information about potential alternative splicing when aligned to the genome (and when aligned to mRNA data)

EST sequencing
• Is fast and cheap
• Gives direct information about the gene sequence
• Partial information

Resulting ESTs (DB searches)
• Known gene
• Similar to known gene
• Contaminant
• Novel gene

ESTs provide expression data

eVOC Ontologies
http://www.sanbi.ac.za/evoc/
• Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
• Cell Type: The precise cell type from which a sample was prepared. Examples are B-lymphocyte, fibroblast and oocyte.
• Pathology: The pathological state of the tissue from which the sample was prepared. Examples are normal, lymphoma, and congenital.
• Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are embryo, fetus, and adult.
• Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.

Linking the expression vocabulary to gene annotations
[Figure: ESTs linked to genes]

Normalized vs. non-normalized libraries

The down side of the ESTs
• Cannot detect lowly/rarely expressed genes or non-expressed sequences (regulatory)
• Random sampling: the more ESTs we sequence, the fewer new useful sequences we will get

Gene Hunting
• Sequencing of the Human Genome (HGP)
• EST Sequencing

Origin of the ESTs
Science. 1991 Jun 21;252(5013):1651-6
Complementary DNA sequencing: expressed sequence tags and human genome project.
Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al.
Section of Receptor Biochemistry and Molecular Biology, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD.

Automated partial DNA sequencing was conducted on more than 600 randomly selected human brain complementary DNA (cDNA) clones to generate expressed sequence tags (ESTs). ESTs have applications in the discovery of new human genes, mapping of the human genome, and identification of coding regions in genomic sequences. Of the sequences generated, 337 represent new genes, including 48 with significant similarity to genes from other organisms, such as a yeast RNA polymerase II subunit; Drosophila kinesin, Notch, and Enhancer of split; and a murine tyrosine kinase receptor. Forty-six ESTs were mapped to chromosomes after amplification by the polymerase chain reaction. This fast approach to cDNA characterization will facilitate the tagging of most human genes in a few years at a fraction of the cost of complete genomic sequencing, provide new genetic markers, and serve as a resource in diverse biological research fields.

EST-sequencing explosion
• non-exclusivity (1992)
• Merck and WashU (1994): public ESTs
• GenBank dbEST

dbEST release 20 February 2004
Number of public entries: 20,039,613
Summary by organism:

Organism                                 Entries
Homo sapiens (human)                     5,472,005
Mus musculus + domesticus (mouse)        4,056,481
Rattus sp. (rat)                         583,841
Triticum aestivum (wheat)                549,926
Ciona intestinalis                       492,511
Gallus gallus (chicken)                  460,385
Danio rerio (zebrafish)                  450,652
Zea mays (maize)                         391,417
Xenopus laevis (African clawed frog)     359,901
…

EST lengths
~ 450 bp
[Figure: human EST length distribution (dbEST Sep. 2003)]

Recover the mRNA from the ESTs

What is an EST cluster?
A cluster is a set of fragmented EST data (plus mRNA data if known), consolidated according to sequence similarity.
Clusters are indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene. (Burke, Davison, Hide, Genome Research 1999).
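
As a rough illustration of "consolidated according to sequence similarity", the sketch below groups ESTs by single linkage with a union-find structure. It is not how UniGene, the TIGR Gene Index or StackDB work internally; the function name `cluster_ests` and the list of similar pairs are hypothetical, and the pairs are assumed to come from a separate pairwise alignment step.

```python
from collections import defaultdict


def cluster_ests(est_ids, similar_pairs):
    """Group ESTs into clusters: two ESTs share a cluster if a chain of
    pairwise similarity hits connects them (single linkage)."""
    parent = {e: e for e in est_ids}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for e in est_ids:
        clusters[find(e)].append(e)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical input: five ESTs and three similarity hits.
ests = ["est1", "est2", "est3", "est4", "est5"]
hits = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
print(cluster_ests(ests, hits))
```

Single linkage is the simplest choice and also the one most prone to joining paralogs or chimeric reads into one cluster, which is one motivation for the genome-based approach in the slides that follow.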
EST pre-processing
• Vector
• Repeats
• Mitochondrial
• Xenocontaminants

EST Clustering
• UniGene (NCBI) www.ncbi.nlm.nih.gov/UniGene
• TIGR Human Gene Index (The Institute for Genomic Research) www.tigr.org
• StackDB (South African Bioinformatics Institute) www.sanbi.ac.za

UniGene

Species               UniGene Entries
Homo sapiens          118,517
Mus musculus          82,482
Rattus norvegicus     43,942
Sus scrofa            20,426
Gallus gallus         11,970
Xenopus laevis        21,734
Xenopus tropicalis    17,102
…

ESTs and the Genome
ESTs aligned to the genome. Some advantages:
• It defines the location of exons and introns
• We can verify the splice sites of introns (e.g. GT-AG), hence also check the correct strand of spliced ESTs
• It helps prevent chimeras
• It can avoid putting together ESTs from paralogous genes
• We can prevent including pseudogenes in our analysis

Aligning ESTs to the Genome
• Many ESTs: fast programs, fast computers
• Nearly exact matches: Coverage >= 97%, Percent_id >= 97%
• Splice sites: GT-AG, AT-AC, GC-AG

Aligning ESTs to the Genome
Extra pre-processing of ESTs:
• Clip poly A tails / clip 20bp from either end
• Best in genome
• Remove potential processed pseudogenes
• Give preference to ESTs that are spliced

Human ESTGenes
Genomic length distribution of aligned human ESTs
• ~ 400bp
• Tail up to ~ 800kb

The Problem
[Figure: ESTs aligned to the genome]
What are the transcripts represented in this set of mapped ESTs?
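
The alignment criteria listed above (coverage and identity of at least 97%, and introns flanked by GT-AG, AT-AC or GC-AG) can be sketched as a simple filter. This is only an illustration: the `EstAlignment` record, its field names and the decision to keep unspliced ESTs are assumptions, not the pipeline described in the talk. Intron ends are taken as the first and last two bases of each intron read on the forward genomic strand, so an EST from the reverse strand shows the reverse complement of the usual signals.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Donor/acceptor dinucleotide pairs named on the slide.
CANONICAL_SITES = {("GT", "AG"), ("AT", "AC"), ("GC", "AG")}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class EstAlignment:  # hypothetical record, for illustration only
    est_id: str
    coverage: float      # percent of the EST that is aligned
    percent_id: float    # percent identity of the aligned part
    # (first two bases, last two bases) of each intron, forward strand
    introns: List[Tuple[str, str]] = field(default_factory=list)


def splice_strand(introns: List[Tuple[str, str]]) -> Optional[str]:
    """Return '+' or '-' if every intron has allowed splice sites on that
    strand, or None if the EST is unspliced or the sites are inconsistent."""
    if not introns:
        return None
    if all((start, end) in CANONICAL_SITES for start, end in introns):
        return "+"
    if all((revcomp(end), revcomp(start)) in CANONICAL_SITES
           for start, end in introns):
        return "-"
    return None


def keep_alignment(aln: EstAlignment,
                   min_coverage: float = 97.0,
                   min_percent_id: float = 97.0) -> bool:
    """Apply the thresholds; spliced ESTs must also have consistent sites."""
    if aln.coverage < min_coverage or aln.percent_id < min_percent_id:
        return False
    # Assumption: unspliced ESTs pass here and are only given lower priority.
    return not aln.introns or splice_strand(aln.introns) is not None


forward = EstAlignment("a", 99.0, 98.5, [("GT", "AG"), ("GC", "AG")])
reverse = EstAlignment("b", 99.0, 99.0, [("CT", "AC")])
low_cov = EstAlignment("c", 95.0, 99.0, [("GT", "AG")])
mixed = EstAlignment("d", 99.0, 99.0, [("GT", "AG"), ("CT", "AC")])
for aln in (forward, reverse, low_cov, mixed):
    print(aln.est_id, keep_alignment(aln), splice_strand(aln.introns))
```

The strand returned for a spliced EST is what lets the reported orientation of the EST be checked, as noted in the list of advantages.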
Predict Transcripts from ESTs
[Figure: ESTs → transcript predictions]
Merge ESTs according to splicing structure compatibility

Representation
Every 2 ESTs in a Genomic Cluster may represent the same splicing (redundant) or not
The redundancy relation is a graph:
[Figure: Extension and Inclusion relations between ESTs x, y and z]
Sort by the smallest coordinate ascending and by the largest coordinate descending

Criteria of merging
• Allow edge-exon mismatches
• Allow internal mismatches
• Allow intron mismatches

Transitivity
[Figure: Extension and Inclusion relations among ESTs x, y, z and w]
This reduces the number of comparisons needed

ClusterMerge graph
• Each node defines an inclusion sub-tree
• Extensions form acyclic graphs
[Figure: example graphs over ESTs x, y, z and w]

Recovering the Solution
Mergeable sets of ESTs can be recovered as special paths in the graph
[Figure: example graph with nodes 1-9; node 1 is the root, nodes 8 and 9 are leaves]
• Root: does not extend any node
• Leaf: not-extended and root of an inclusion tree
• Any set of ESTs in a path from a root to a leaf is mergeable
• Add the inclusion tree attached to each node in the path
Lists produced: (1,2,3,4,5,6,7,8) (1,2,3,4,5,6,7,9)
This representation minimizes the necessary comparisons between ESTs

How to build the graph
• Mutual Recursion
• Inclusion => go up in the tree
• Search graph (leaves)
• Recursion search along extension branch
• Search sub-graph

How to build the graph: Example
[Figure sequence: ESTs 1-6 and their graph, to which EST 7 is then added. Frame labels, in order: Leaves; Inclusion; Inclusion; Extension; Inclusion; Place 7; Inclusion; tagged as visited - skip]
Possible sub-trees beyond 1 or 3 remain unseen! The representation minimizes the necessary comparisons

Deriving the transcripts from the lists
• Internal Splice Sites: external coordinates of the 5’ and 3’ exons are not allowed to contribute
• Splice Sites: are set to the most common coordinate
• 5’ and 3’ coordinates: are set to the exon coordinate that extends the potential UTR the most

Single exon transcripts
Reject resulting single exon transcripts when using ESTs

Annotation with ESTs
ESTs aligned to the genome can provide information about UTRs and alternative splicing

Annotation with ESTs
EST-Transcripts at www.ensembl.org

Annotation with ESTs
Results for Human and Mouse
• Human EST-genes (assembly ncbi33): 38,581 Genes, 122,247 Transcripts (42% with full CDS)
• Mouse EST-genes (assembly ncbi30): 32,848 Genes, 103,664 Transcripts (36% with full CDS)

How many transcripts are conserved? Is Alternative Splicing conserved?

EST-transcript pairs
42,625 transcript pairs (in 18,242 gene pairs)
Gene pairs:
• 78% with one transcript pair conserved
• 22% with more than one transcript pair conserved
For 22% of the gene pairs some form of alt. splicing is conserved

Conservation of Alt. Splicing
Take gene-pairs with more than one transcript-pair

\[ \%\,\text{conservation} = \frac{\sum \left( \text{number of paired transcripts} - 1 \right)}{\sum \left( \text{number of transcripts} - 1 \right)} \]

∑ = sum over genes in a gene pair with more than one variant (subtract the ‘main’ transcript form)
• 19% of alt. variants in human are conserved in mouse
• 32% of alt. variants in mouse are conserved in human

How many predicted ‘novel’ genes are validated by Human-Mouse comparison?
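
The conservation percentage from the "Conservation of Alt. Splicing" slide can be worked through on a toy example. The sketch assumes that each gene of one species is summarised as a pair (number of transcripts, number of paired transcripts), that only genes with more than one transcript pair enter the sums, and that the ratio is multiplied by 100 to give a percentage. The function name and the counts are hypothetical and are not the data behind the 19% and 32% figures.

```python
def alt_splicing_conservation(genes):
    """genes: iterable of (n_transcripts, n_paired) for the genes of one
    species. Returns the percentage of alternative variants conserved."""
    # Keep only genes in gene pairs with more than one transcript pair.
    selected = [(n, p) for n, p in genes if p > 1]
    # Subtract 1 from each count to leave out the 'main' transcript form.
    paired_alt = sum(p - 1 for _, p in selected)
    total_alt = sum(n - 1 for n, _ in selected)
    return 100.0 * paired_alt / total_alt if total_alt else 0.0


# Toy counts: three genes with 4, 3 and 5 transcripts, of which 2, 3 and 1
# have a conserved counterpart in the other species.
toy_genes = [(4, 2), (3, 3), (5, 1)]
print(alt_splicing_conservation(toy_genes))  # 60.0
```

In the toy data the third gene is excluded because only one of its transcripts is paired, so the result is (1 + 2) / (3 + 2) = 60%. Because the denominator counts the transcripts of one species only, the measure is not symmetric, which is why the human and mouse percentages differ.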
Novel genes
[Figure labels: ESTGenes not in Ensembl; Human ESTGenes validated by comparison to mouse; ESTGenes with at least one complete ORF. Values, in the order they appear: 13,174; 18,242; 24,201]

Novel genes
ESTGenes not in Ensembl validated by comparison to mouse: 984 with a complete ORF

THE END