EXPLORING DEAD GENES
Adrienne Manuel
I400

What are they?
  Dead genes are also called pseudogenes
  Pseudogenes are nonfunctional copies of genes in genomic DNA
  They result from reverse transcription of an mRNA transcript
  Or from gene duplication and subsequent disablement

Expression of Pseudogenes
  Some pseudogenes are evidently transcribed
  Expression of pseudogenes varies
  The snail (Lymnaea stagnalis) is an example of an organism that still has functioning pseudogenes

Good and Bad!
  - Raised expression in tumor cells
  + Useful in studying molecular evolution
  + Helpful in determining rates of genomic DNA loss for an organism

Size and Distribution of Pseudogenes

DEFINING POPULATIONS AND SUBPOPULATIONS
  G is the total population of confirmed and predicted protein-encoding genes
  ΨG is the estimated population of pseudogenes that correspond to G
  GE is the derived set of genes with at least one verifying EST match
  GM is the set of genes deemed to be highly expressed, derived from microarray expression data
  ΨGM is the corresponding predicted pool of pseudogenes

Data Files
  Sanger Sequencing Centre FTP site (ftp://ftp.sanger.ac.uk): holds the complete sequences of the six worm chromosomes
  GFF data files: annotations for genes and other genomic features that correspond to wormpep18
  The pseudogene population was derived by means of a pipeline

Pipelines
  Step 1: Sanger Centre pseudogene annotations
    Start with a list of 332 pseudogenes
    The pseudogene population was derived by looking for gene disablement
  Step 2: FASTA matching to find potential pseudogenes

PIPELINES (continued)
  Worm genes are masked for low-complexity regions with the program SEG
  TFASTX and TFASTY are then used to compare the complete wormpep18 against the worm genome
  After the comparison, pseudogene matches are refined in the next steps

Pipeline (continued)
  Step 3: Reduction for overlaps on the genomic DNA
    Significant matches of protein sequences to the DNA are reduced for redundancy where homologs match the same segment of DNA
    Matches are then sorted
  Step 4: Prevention of over-counting for adjacent matches
    Initial matches may correspond to the same pseudogene
    To avoid over-counting, matches were realigned

Pipeline
  Step 5: Masking against Sanger Centre annotation and transposon library
    Potential pseudogenes are filtered for overlap with any other annotations in the Sanger Centre GFF files, e.g. exons of genes, tandem or inverted repeats
  Step 6: Reduction for possible additional repeat elements
    At this point there is a set of 3814 pseudogenic fragments

Pipeline (final step)
  Step 7: Tightening the match threshold
    The e-value match threshold is reduced from 0.01 to 0.001 (a lower e-value cutoff is a stricter one)

Check the web!
  http://bioinfo.mbb.yale.edu/genome/womr/pseudogene
  The pseudogene population can be viewed either by searching for a protein name or by viewing a specific range on a chromosome

Size of Pseudogene Population
  Composed of 2168 sequences, about 12% of the total gene complement
  Factors that affect the size:
  1. Dead copies of transposable elements
  2. The size of the pseudogene population is underestimated because pseudogenes with less obvious disablement aren't included
  3. Some annotated genes might be pseudogenes whose disablement is undetectable
  4. Some pseudogenes are still part of a functioning gene
  5. Some apparent pseudogenes arise from sequencing errors
  6. Possible genomic repeats

SUBPOPULATIONS
  Highly expressed genes have fewer dead gene copies
  The most reliable subset of the pseudogene population is about half the total for ΨG
  39% of pseudogenes are intronic; these kinds of pseudogenes aren't ailing families of proteins

Chromosomal Distributions
  Pseudogenes are more abundant near the ends of chromosomes (the “arms”)
  For each chromosome, there is a calculated proportion of dead genes

  The data plot above indicates genome to genome over all age.
  The percentage composition for each of the 20 amino acids is graphed in decreasing order of the implied amino acid composition in the pseudogene set.
  In the bottom part of the figure, the G difference for each amino acid composition is indicated by a bar.

  Listed are the largest sequence families in the worm, ranked by number of genes and by number of pseudogenes
  Each family is named for a particular representative
  Four of the top 10 paralog gene families, when ranked by number of genes, are functionally uncharacterized
  Three of the top 10 pseudogene families are also among the biggest families when ranked by number of genes

Pseudofolds
  These charts are ranked in terms of implied structural pseudofolds
  Proteins encoded by the worm genome have been assigned to globular domain folds from the SCOP database

Why was this studied again?
  To provide an initial estimate of the size, distribution and characteristics of the pseudogene population in C. elegans, in an attempt to estimate the total number in humans
  Found few pseudogenes in the worm genome that are apparently due to processing
  Found a large uncharacterized gene family that makes up 2/3 of dead genes
  The arms of the chromosomes are an unreliable location for encoding genes but are more likely to spawn new proteins
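
Appendix: a sketch of the match-refinement steps

As a rough illustration of pipeline Steps 3, 4 and 7 described earlier, the following Python sketch shows one way such filtering could be written. It is not the original authors' code. The `Match` record, the function names, the sample coordinates and the `max_gap` value are all hypothetical. Merging nearby fragments by a fixed gap is a simplified stand-in for the realignment the slides mention for Step 4. Only the 0.001 e-value cutoff comes from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Match:
    """One protein-to-genome match (hypothetical record layout)."""
    protein: str   # query protein identifier
    chrom: str     # chromosome name
    start: int     # genomic start, inclusive
    end: int       # genomic end, inclusive
    evalue: float  # match e-value (lower is more significant)


def overlaps(a: Match, b: Match) -> bool:
    """True if two matches share at least one genomic position."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def reduce_overlaps(matches: List[Match]) -> List[Match]:
    """Step 3 (sketch): where homologs hit the same DNA segment, keep the best match."""
    kept: List[Match] = []
    for m in sorted(matches, key=lambda x: x.evalue):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda x: (x.chrom, x.start))


def merge_adjacent(matches: List[Match], max_gap: int = 1000) -> List[Match]:
    """Step 4 (sketch): join nearby fragments of the same protein so that one
    pseudogene is counted once. max_gap is an assumed value."""
    merged: List[Match] = []
    for m in sorted(matches, key=lambda x: (x.chrom, x.protein, x.start)):
        last = merged[-1] if merged else None
        if (last is not None and last.chrom == m.chrom
                and last.protein == m.protein
                and m.start - last.end <= max_gap):
            merged[-1] = Match(last.protein, last.chrom, last.start,
                               max(last.end, m.end), min(last.evalue, m.evalue))
        else:
            merged.append(m)
    return merged


def filter_by_evalue(matches: List[Match], threshold: float = 0.001) -> List[Match]:
    """Step 7 (sketch): keep only matches at or below the e-value cutoff."""
    return [m for m in matches if m.evalue <= threshold]


def refine(matches: List[Match]) -> List[Match]:
    """Apply the three sketched steps in order."""
    return filter_by_evalue(merge_adjacent(reduce_overlaps(matches)))


if __name__ == "__main__":
    sample = [
        Match("P1", "I", 100, 400, 1e-10),
        Match("P2", "I", 150, 380, 1e-5),     # homolog on the same segment
        Match("P1", "I", 450, 700, 1e-6),     # adjacent fragment of P1
        Match("P3", "II", 1000, 1300, 0.005), # fails the 0.001 cutoff
        Match("P4", "II", 5000, 5200, 1e-4),
    ]
    for candidate in refine(sample):
        print(candidate)
```

With the made-up sample, the P2 match is dropped as redundant, the two P1 fragments are joined into one candidate, and the P3 match is removed by the e-value cutoff, leaving two candidate pseudogenes.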