EXPLORING DEAD GENES
Adrienne Manuel
I400
What are they?
Dead genes are also called pseudogenes
Pseudogenes are nonfunctional copies of genes in the DNA
They result from reverse transcription of an mRNA transcript
Or from gene duplication and subsequent disablement
Expression of Pseudogenes
Some pseudogenes are evidently transcribed
Expression of pseudogenes varies
The snail (Lymnaea stagnalis) is an example of an organism that still has functioning pseudogenes
Pseudogenes, Good and Bad!
- Raised expression in tumor cells
+ Useful in studying molecular evolution
+ Helpful in determining rates of genomic DNA loss for an organism
Size and Distribution of Pseudogenes
DEFINING POPULATIONS AND SUBPOPULATIONS
G is the total population of confirmed and predicted protein-encoding genes
ΨG is the estimated population of pseudogenes that correspond to G
GE is the set of genes with at least one verifying EST match
GM is the set of genes deemed to be highly expressed, derived from microarray expression data
ΨGM is the corresponding predicted pool of pseudogenes
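The populations above can be sketched as plain sets. This is only an illustration: the gene and pseudogene IDs, and the mapping from each pseudogene to its closest matching gene, are made-up assumptions, not data from the study.

```python
# Hypothetical toy data: every ID below is invented for illustration.
G = {"geneA", "geneB", "geneC", "geneD"}    # all confirmed and predicted genes
est_matched = {"geneA", "geneB", "geneC"}   # genes with at least one verifying EST match
highly_expressed = {"geneA", "geneB"}       # genes flagged from microarray data

# Assumption: each pseudogene is assigned to the gene it most closely matches.
pseudogene_parent = {
    "psi1": "geneA",
    "psi2": "geneC",
    "psi3": "geneD",
    "psi4": "geneD",
}

G_E = G & est_matched        # the subpopulation GE
G_M = G & highly_expressed   # the subpopulation GM


def pseudogenes_of(genes, parent_map):
    """Return the pseudogenes whose parent gene lies in `genes`."""
    return {p for p, parent in parent_map.items() if parent in genes}


psi_G = pseudogenes_of(G, pseudogene_parent)      # the population ΨG
psi_G_M = pseudogenes_of(G_M, pseudogene_parent)  # the subpopulation ΨGM

print(len(psi_G), len(psi_G_M))
```

Keeping every subpopulation as a set intersection of G means each pseudogene pool can be derived with the same helper.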
Data Files
Sanger Sequencing Centre FTP site (ftp://ftp.sanger.ac.uk)
This site holds the complete sequences of the six worm chromosomes
GFF data files with annotations for genes and other genomic features that correspond to wormpep18
The pseudogene population was assembled in the form of a pipeline
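GFF files are tab-separated text with nine columns (sequence name, source, feature, start, end, score, strand, frame, attributes). A minimal sketch of reading them is below. The sample lines and the `GffFeature` and `parse_gff` names are hypothetical and are not taken from the Sanger Centre files.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    strand: str
    attributes: str


def parse_gff(lines):
    """Yield one GffFeature per data line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # malformed line: fewer than the eight mandatory columns
        attributes = cols[8] if len(cols) > 8 else ""
        yield GffFeature(cols[0], cols[1], cols[2],
                         int(cols[3]), int(cols[4]), cols[6], attributes)


# Hypothetical example lines, made up for illustration.
sample_gff = [
    "# comment line",
    "\t".join(["CHROMOSOME_I", "curated", "exon", "1000", "1200", ".", "+", ".", "example"]),
    "\t".join(["CHROMOSOME_I", "example_source", "Pseudogene", "5000", "5600", ".", "-", ".", ""]),
]

features = list(parse_gff(sample_gff))
print(features[0].feature, features[0].start, features[0].end)
```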
Pipeline
Step 1: Sanger Centre pseudogene annotations
Start with a list of 332 pseudogenes
The pseudogene population was derived by looking for gene disablement
Step 2: FASTA matching to find potential pseudogenes
Pipeline (continued)
Worm genes are masked for low-complexity regions with the program SEG
TFASTX and TFASTY are then used to compare the complete wormpep18 against the worm genome
After the comparison, pseudogene matches are refined in the following steps
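A rough sketch of how a disablement might be spotted in a translated match. It assumes the translated alignment string marks stop codons with "*" and frameshifts with "/" or "\", as TFASTX/TFASTY-style output does. The function names and test strings are hypothetical, and this is not the authors' actual procedure.

```python
def count_disablements(translated):
    """Count premature stops and frameshifts in a translated match string.

    Assumption: '*' marks a stop codon and '/' or '\\' marks a frameshift.
    """
    body = translated.rstrip("*")  # a stop at the very end is a normal terminator
    stops = body.count("*")
    frameshifts = body.count("/") + body.count("\\")
    return stops, frameshifts


def looks_like_pseudogene(translated):
    """Flag a match that carries at least one premature stop or frameshift."""
    stops, frameshifts = count_disablements(translated)
    return stops + frameshifts > 0


print(count_disablements("MKT*LLV/AGQ*"))
print(looks_like_pseudogene("MKTLLVAGQ*"))
```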
Pipeline (continued)
Step 3: Reduction for overlaps on the genomic DNA
Significant matches of protein sequences to the DNA were reduced for redundancy where homologs match the same segment of DNA
Matches are then sorted
Step 4: Prevention of overcounting for adjacent matches
Initial matches may correspond to the same pseudogene
To avoid overcounting, matches were realigned
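Steps 3 and 4 can be sketched as interval operations. Two assumptions are made here: overlapping matches are resolved by keeping the one with the lowest e-value, and adjacent matches to the same protein are simply merged when they lie within a gap limit. That merging is a simpler stand-in for the realignment described in Step 4. The `Match` record and the `max_gap` value are hypothetical.

```python
from typing import NamedTuple


class Match(NamedTuple):
    protein: str
    chrom: str
    start: int
    end: int
    evalue: float


def overlaps(a, b):
    """True when two matches share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def reduce_overlaps(matches):
    """Step 3 sketch: among overlapping matches, keep the lowest e-value."""
    kept = []
    for m in sorted(matches, key=lambda m: m.evalue):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: (m.chrom, m.start))


def merge_adjacent(matches, max_gap=100):
    """Step 4 sketch: merge nearby matches to the same protein into one."""
    merged = []
    for m in sorted(matches, key=lambda m: (m.chrom, m.protein, m.start)):
        prev = merged[-1] if merged else None
        if (prev is not None and prev.chrom == m.chrom
                and prev.protein == m.protein
                and m.start - prev.end <= max_gap):
            merged[-1] = Match(prev.protein, prev.chrom, prev.start,
                               max(prev.end, m.end), min(prev.evalue, m.evalue))
        else:
            merged.append(m)
    return merged


# Hypothetical matches, made up for illustration.
matches = [
    Match("P1", "I", 100, 400, 1e-20),
    Match("P2", "I", 150, 380, 1e-8),   # overlaps the stronger P1 match
    Match("P1", "I", 450, 700, 1e-10),  # adjacent to the first P1 match
    Match("P3", "II", 50, 200, 1e-5),
]
result = merge_adjacent(reduce_overlaps(matches))
print(result)
```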
Pipeline (continued)
Step 5: Masking against Sanger Centre annotation and transposon library
Potential pseudogenes are filtered for overlap with any other annotations in the Sanger Centre GFF files, e.g. exons of genes, tandem or inverted repeats
Step 6: Reduction for possible additional repeat elements
At this point there is a set of 3814 pseudogenic fragments
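The masking in Step 5 amounts to dropping any candidate that overlaps an annotated feature. A minimal sketch, with made-up coordinates and hypothetical function names:

```python
def overlaps_any(start, end, features):
    """True when [start, end] overlaps any (f_start, f_end) interval."""
    return any(start <= f_end and f_start <= end for f_start, f_end in features)


def mask_candidates(candidates, annotations):
    """Keep only the candidates that overlap no annotated feature.

    candidates:  list of (chrom, start, end)
    annotations: dict mapping chrom -> list of (start, end), e.g. exons or repeats
    """
    return [c for c in candidates
            if not overlaps_any(c[1], c[2], annotations.get(c[0], []))]


# Hypothetical coordinates, made up for illustration.
candidates = [("I", 100, 300), ("I", 1000, 1200), ("II", 50, 90)]
annotations = {"I": [(250, 400)], "II": [(500, 600)]}

print(mask_candidates(candidates, annotations))
```

A linear scan is enough for a sketch. A whole-genome run would more likely use sorted intervals or an interval tree.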
Pipeline (final step)
Step 7: Reducing the e-value match threshold
E-value match threshold reduced from 0.01 to 0.001, a more stringent cutoff
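The effect of the threshold change can be sketched with a few made-up hits. The names and e-values are hypothetical, and treating the threshold as inclusive is an assumption.

```python
# Hypothetical hits: (pseudogene fragment, e-value of its best match).
hits = [("psi1", 5e-4), ("psi2", 5e-3), ("psi3", 1e-10)]


def filter_by_evalue(hits, threshold):
    """Keep the names of hits whose e-value is at or below the threshold."""
    return [name for name, evalue in hits if evalue <= threshold]


loose = filter_by_evalue(hits, 0.01)    # earlier threshold
strict = filter_by_evalue(hits, 0.001)  # final threshold

print(loose)
print(strict)
```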
Check the web!
http://bioinfo.mbb.yale.edu/genome/womr/pseudogene
The pseudogene population can be viewed either by searching for a protein name or by viewing a specific range on a chromosome
Size of Pseudogene Population
Composed of 2168 sequences, about 12% of the total gene complement
Factors that affect the size:
1. Dead copies of transposable elements
2. Size of the pseudogene population is underestimated because pseudogenes with less obvious disablement aren't included
3. Annotated genes might be pseudogenes because the disablement is undetectable
4. Pseudogenes that are still part of a functioning gene
5. Some pseudogenes arise due to sequencing errors
6. Possible genomic repeats
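As a quick check on the figures above: if 2168 pseudogenes are about 12% of the total gene complement, the implied complement is roughly 18,000 genes. Since "about 12%" is a rounded figure, the result is only approximate.

```python
pseudogenes = 2168        # size of the pseudogene population from the slide
fraction_of_genes = 0.12  # "about 12%" of the total gene complement (rounded)

implied_total_genes = pseudogenes / fraction_of_genes
print(round(implied_total_genes))
```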
SUBPOPULATIONS
Highly expressed genes have fewer dead gene copies
The most reliable subset of the pseudogene population is about half the total for ΨG
39% of pseudogenes are intronic; these kinds of pseudogenes aren't ailing families of proteins
Chromosomal Distributions
Pseudogenes are more abundant near the ends of the chromosomes (the “arms”)
For each chromosome, a proportion of dead genes is calculated
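One way the per-chromosome proportion could be computed is sketched below. The slide does not define the proportion, so the definition used here (pseudogenes divided by genes plus pseudogenes) is an assumption, and the chromosome labels are toy data.

```python
from collections import Counter

# Hypothetical data: one chromosome label per gene and per pseudogene.
gene_chroms = ["I", "I", "I", "II", "II", "X"]
pseudogene_chroms = ["I", "II", "II", "X"]


def dead_proportion(gene_chroms, pseudogene_chroms):
    """Return, per chromosome, pseudogenes / (genes + pseudogenes)."""
    genes = Counter(gene_chroms)
    dead = Counter(pseudogene_chroms)
    chroms = sorted(set(genes) | set(dead))
    return {c: dead[c] / (dead[c] + genes[c]) for c in chroms}


proportions = dead_proportion(gene_chroms, pseudogene_chroms)
print(proportions)
```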


The data plot above indicates genome to genome over all age.
The percentage composition for each of the 20 amino acids is graphed in decreasing order of the implied amino acid composition in the pseudogene set.
In the bottom part of the figure, the G difference for each amino acid composition is indicated by a bar.
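The composition plot can be approximated with a short calculation: percentage composition per amino acid for each set, ordered by the pseudogene set, plus the per-residue difference. The sequences here are toy strings, and taking the difference as pseudogene set minus gene set is an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Return the percentage composition of the 20 amino acids in `sequences`."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy sequences, made up for illustration.
gene_seqs = ["MKLV", "AALK"]
pseudogene_seqs = ["MKKL", "LLAA"]

gene_comp = composition(gene_seqs)
pseudo_comp = composition(pseudogene_seqs)

# Order residues by decreasing composition in the pseudogene set.
order = sorted(AMINO_ACIDS, key=lambda aa: -pseudo_comp[aa])

# Assumed definition of the difference: pseudogene set minus gene set.
difference = {aa: pseudo_comp[aa] - gene_comp[aa] for aa in AMINO_ACIDS}

for aa in order[:4]:
    print(aa, pseudo_comp[aa], difference[aa])
```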



Listed are the largest sequence families in the worm, ranked by number of genes and by number of pseudogenes
Each family is named for a particular representative
Four of the top 10 paralog gene families, when ranked by number of genes, are functionally uncharacterized
Three of the top 10 pseudogene families are among the biggest families when ranked by number of genes
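The two rankings and their overlap can be sketched with counters. The family names and memberships are hypothetical, and a top 2 stands in for the top 10 to keep the toy data small.

```python
from collections import Counter

# Hypothetical family assignments, made up for illustration.
gene_family = {"g1": "famA", "g2": "famA", "g3": "famA",
               "g4": "famB", "g5": "famC", "g6": "famC"}
pseudogene_family = {"p1": "famB", "p2": "famB", "p3": "famB", "p4": "famA"}

genes_per_family = Counter(gene_family.values())
pseudogenes_per_family = Counter(pseudogene_family.values())

TOP_N = 2  # the slide uses the top 10
top_by_genes = [fam for fam, _ in genes_per_family.most_common(TOP_N)]
top_by_pseudogenes = [fam for fam, _ in pseudogenes_per_family.most_common(TOP_N)]

# Families that appear near the top of both rankings.
shared = set(top_by_genes) & set(top_by_pseudogenes)

print(top_by_genes, top_by_pseudogenes, shared)
```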
Pseudofolds
These charts are ranked in terms of implied structural pseudofolds
Proteins encoded by the worm genome have been assigned to globular domain folds from the SCOP database
Why was this studied again?
To provide an initial estimate of the size, distribution, and characteristics of the pseudogene population in C. elegans, in an attempt to estimate the total number in humans
Found few pseudogenes in the worm genome that are apparently due to processing
Found a large uncharacterized gene family that makes up 2/3 of dead genes
The arms of the chromosomes are unreliable for encoding genes but more likely to spawn new proteins