Genomics and Gene Recognition
Genes and Blue Genes
November 3, 2004
Eukaryotic Gene Structure
• eukaryotic genomes are
considerably more complex than
those of prokaryotes
– eukaryotic cells have organelles →
a variety of chemical environments
can exist within a cell
– each cell type typically has a
distinct pattern of gene expression
(even though all cells carry the same DNA)
– there is a significant portion of
introns and intergenic space whose
role is mostly unknown
• eukaryotic cells (nuclei) almost
always contain two copies of
each chromosome
[Figure: animal cell]
Chromosome Structure
• a very long, continuous piece of DNA
• contains many genes, regulatory elements
and other intervening nucleotide
sequences
• the uncondensed DNA exists in a quasi-ordered structure inside the nucleus
• it wraps around histones (structural
proteins)
• this composite material is called chromatin
• the sheer size of eukaryotic DNA and the diversity of its regulation
and functions make it very hard to annotate
[Figure: chromosome structure; (1) chromatid, (2) centromere, (3) short arm, (4) long arm]
Eukaryotic Genomes
Transcription in Eukaryotes
• much more complex than in
prokaryotes
– a typical mammalian cell has
about 1,500 times as much DNA as
an E. coli cell
– DNA wrapped around histones
which limits access of
transcription regulatory proteins to
promoters
– eukaryotic transcription requires
“factors” that can recognize the
chromatin so that the transcription
machinery can access promoters
What Is a Transcription Factor?
• a transcription factor is a complex of about 10 proteins
• transcriptional regulation coordinates metabolic activity, cell
division, embryonic development
• transcription start is enabled by
– promoters
– enhancers
– response elements
Promoters
• promoters of eukaryotic genes that encode proteins are defined by
modules of short conserved sequences (e.g. TATA box, CAAT box,
GC box)
– CAAT box is usually located around position –80
– GC box usually contains sequence GGGCGG or its complement
– GC box is usually found upstream of ‘housekeeping genes’ – genes
that encode proteins commonly present in all cells and essential to
normal function (they are expressed at relatively stable level in all
cells)
• sets of various sequence modules are embedded in the upstream
region of genes; collectively they define the promoter
• almost every eukaryotic gene has its own promoter
• RNA polymerase II is responsible for the transcription of the
protein coding genes
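As a rough illustration of how the short conserved modules above can be located, the following Python sketch scans an upstream region for the GC box sequence GGGCGG on either strand. The function names and the example sequence are made up for illustration, and "its complement" is interpreted here as the reverse complement read on the opposite strand, which is an assumption.

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def find_gc_boxes(upstream):
    """Return sorted (position, strand) pairs for GC box hits.

    Positions are 0-based offsets into `upstream`. A "-" strand hit means
    the reverse complement of GGGCGG was found at that offset.
    """
    motif = "GGGCGG"
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        # A lookahead reports overlapping matches as well.
        for match in re.finditer("(?=" + pattern + ")", upstream):
            hits.append((match.start(), strand))
    return sorted(hits)


# Hypothetical upstream fragment containing one hit on each strand.
example_upstream = "TTGGGCGGAACCGCCCAT"
gc_box_hits = find_gc_boxes(example_upstream)
```

A real promoter predictor would combine several such modules (TATA box, CAAT box, GC box) with positional expectations, since a single six-base motif occurs often by chance.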
Enhancers
• also called upstream activation sequences, or UASs
• enhancers are additional regulatory sequences and they assist
transcription initiation
• differ from promoters
– location of enhancers is not fixed
• they may be several thousand nucleotides away from the promoter
• sometimes downstream from the gene
– bidirectional sequences
• function in either orientation
• can be removed and then reinserted in a different orientation without
loss of function
• enhancers are also evolutionarily conserved
• enhancers are promiscuous
– stimulate transcription from any nearby promoter
• enhancer recognition depends on transcription factors
Promoters and Enhancers
Promoter Consensus Sequences
Response Elements
• response elements are promoter modules in genes responsive to
common regulation
• found in the promoter regions of genes whose transcription is
activated in response to a sudden change in the environment
– temperature → heat shock proteins
– toxic heavy metals → metal response elements
• heat shock element sequences are recognized by a specific
transcription factor (HSTF)
– located at about +15 from the transcription start site of genes whose
expression is dramatically enhanced
– consensus sequence for HSE is about 14bp long and it can be in
introns too
Regulatory Influences
• many genes are subject to a multiplicity of regulatory influences
• this is achieved via an array of regulatory elements
RNA Polymerases
• there are 3 RNA polymerases in eukaryotic cells
• RNA polymerases I and III transcribe functional RNA
molecules such as rRNA and tRNA
• RNA polymerase II transcribes protein coding genes
• RNA polymerase II DOES NOT directly recognize promoters
– this task is carried out by transcription factors (e.g. TATA-binding
proteins)
– there are at least 12 TATA associated factors that bind to the
nucleotide sequence in specific order
• transcription initiation site starts with an initiator sequence
– typically about 6 nucleotides long
• subtle differences in transcription factors are known to exist
among different cell types
Transcription Factors
• the majority of transcription factors are
sequence-specific DNA-binding proteins
– recognize consensus
sequences, e.g. TATA box
– recognize enhancers
DNA Looping
• because transcription must respond to a variety of regulatory
signals, multiple proteins are essential for appropriate regulation of
gene expression
• these regulatory proteins are the sensors of cellular circumstances
– how do they work?
– they communicate this information by binding at specific nucleotide
sequences
• DNA is a linear molecule so there is little space for all these proteins
to bind
– all these sites are near transcription initiation site
• DNA looping enables additional proteins to interact with RNA
polymerase II initiation complex
• DNA looping expands the repertoire of transcriptional regulation
mechanisms
Post-Transcriptional Modification of mRNA
• transcription and translation are separated in eukaryotes
• transcription occurs on DNA in the nucleus
• translation occurs on ribosomes in the cytoplasm
• transcript must move from nucleus into cytoplasm
– on its way, pre-mRNA undergoes processing
– this primary transcript (hnRNA) is converted into mature mRNA
• each mRNA encodes ONLY ONE protein (monocistronic RNAs)
– in prokaryotes, some are polycistronic
Post-Transcriptional Processing of mRNA
• hnRNA processing includes capping, poly-adenylation, and splicing
• Capping
– a set of chemical alterations at the 5’ end of all hnRNAs
• Poly-adenylation
– the process of replacing the 3’ end of an hnRNA with approximately 250
A’s that are NOT spelled out in the nucleotide sequence of a gene
– exception: histone mRNAs lack a poly-A tail
• Splicing
– removal of often large segments from the interior of hnRNA
Introns and Exons
• most genes in higher eukaryotes are split into coding and noncoding regions
– coding regions – exons
– non-coding regions – introns
• introns are removed from the primary transcript in the process called
splicing
• tRNA and rRNA also get spliced!!!
• Example:
– yeast actin gene has only one intron 309bp long, after the 3rd amino
acid
– chicken ovalbumin gene has 8 exons and 7 introns
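The removal of introns from a primary transcript can be sketched in a few lines of Python. This is only a toy model: the function name, the half-open coordinate convention, and the short example transcript are assumptions for illustration, and real splicing is carried out by the spliceosome rather than by known coordinates.

```python
def splice(pre_mrna, introns):
    """Join the exons of `pre_mrna` after removing the given introns.

    `introns` is a list of (start, end) pairs, 0-based and end-exclusive,
    assumed not to overlap.
    """
    exons = []
    cursor = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[cursor:start])  # exon before this intron
        cursor = end                          # skip the intron itself
    exons.append(pre_mrna[cursor:])           # final exon
    return "".join(exons)


# Hypothetical transcript: exon 1, one intron starting GU and ending AG, exon 2.
primary_transcript = "AUGGCC" + "GUAAGUUUUCAG" + "GAUUAA"
mature_mrna = splice(primary_transcript, [(6, 18)])
```

With one intron, as in the yeast actin example, the result is simply the two exons joined; a gene with 7 introns would pass seven coordinate pairs.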
Introns and Exons
“mosaic molecules consisting of sequences
complementary to several non-contiguous
segments of the viral genome”
Quote from: Adenovirus amazes at Cold Spring
Harbor (1977) Nature 268: 101-104.
“The notion of the cistron, the genetic unit of function that
one thought corresponded to a polypeptide chain, now must
be replaced by that of a transcription unit containing regions
which will be lost from the mature messenger -- which I
suggest we call introns (for intragenic regions) -- alternating
with regions which will be expressed -- exons. The gene is a
mosaic: expressed sequences held in a matrix of silent
DNA, an intronic matrix”.
Gilbert, W. (1978) Why genes in pieces? Nature 271: 501
Open Reading Frames (ORFs)
• predicting genes is more difficult in eukaryotes than in prokaryotes
– splice sites are hard to predict
– detecting sufficiently long ORFs is not enough to detect a gene
– alternative splicing even further complicates the issue
• ORFs would be useful in eukaryotes ONLY if we had algorithms
that could accurately predict splice sites
• splice sites are very hard to predict, they are tissue specific
– there are at least 8 different splice signals
– GU-AG rule is the most common
– introns are at least 60bp long (to be able to accommodate splicing)
– introns can be tens of thousands of nucleotides long
• exons
– vary in length between about 100 and 2,000bp
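A minimal ORF scan is sketched below in Python under simplifying assumptions: forward strand only, ATG as the only start codon, and the three standard stop codons. The function name and the `min_codons` parameter are hypothetical. On eukaryotic genomic DNA such a scan reports only uninterrupted ORFs, so exons separated by introns are missed or truncated, which is why ORF length alone is not enough.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `start` is the offset of the ATG, `end` is exclusive and includes the
    stop codon. ORFs with fewer than `min_codons` codons before the stop
    are ignored.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with a single short ORF in frame 2.
example_orfs = find_orfs("CCATGAAACCCGGGTAACC")
```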
Alternative Splicing
• majority of eukaryotic genes appear to be processed into a single
mRNA, but...
• 20-40% of human genes give rise to more than one mRNA
sequence
• how?
– via alternative splicing
• alternative splicing depends on a cell type and environmental
circumstances
• splicing apparatus itself is made from a variety of snRNAs and
several proteins
• variations in splice junctions may reflect specific recognition
GC Content in Eukaryotic Genomes
• overall, GC content does not vary as widely as in prokaryotes
• however, there is a large-scale variation of GC content within
eukaryotic genomes
• it is very important for gene recognition algorithms
– eukaryotic ORFs are much harder to recognize
– genes, upstream promoter regions, codon choices, gene length, and
gene density all correlate with regional GC content
• CpG dinucleotides are strongly underrepresented in eukaryotic DNA
compared with other dinucleotides; GC-rich regions where CpG occurs
at the level predicted by chance are termed CpG islands
• CpG islands occur frequently at the 5’ ends of genes (-1,500 to
+500)
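The two quantities a gene finder typically looks at here, GC content and how often CpG occurs relative to chance, can be computed as in the Python sketch below. The observed/expected formula used (CpG count times sequence length, divided by the product of the C and G counts) is a commonly used definition and is an assumption here, since the notes do not specify one; the function names are also illustrative.

```python
def gc_content(seq):
    """Fraction of bases in `seq` that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G)).

    A value near 1 means CpG occurs about as often as chance predicts;
    values well below 1 indicate CpG depletion.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)
```

In practice these values are computed in a sliding window along the genome, and windows with high GC content and a ratio near 1 are flagged as candidate CpG islands; the window size and thresholds are choices the notes leave open.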
CpG Islands
• analysis shows ~45,000 CpG islands
• about half of these islands are associated with housekeeping genes
• many remaining CpG islands are associated with promoters of
tissue specific genes
• CpG islands are rarely found in gene-free regions
– the reason is that methylated CpG’s are prone to chemical modification
into TpG’s and CpA’s
– transcription requires un-methylated DNA
• methylation and acetylation of histones
– help the process of transcription
– histones lose affinity to bind DNA and thus the chromatin becomes less
tightly packed
– the areas become more accessible to RNA polymerases
Codon Usage Bias
• every organism prefers to use some triplets over others (to code for
the same amino acid)
• Example
– in yeast Arg is frequently encoded by AGA (48%) although there are
five other codons (CGT, CGC, CGA, CGG, AGG)
– fruit flies use CGA in 33% of the cases
• How does the bias arise?
– a consequence of the abundance of tRNAs within the organism
– a consequence of avoiding stop codons
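Codon usage percentages like the ones quoted for Arg can be tallied directly from coding sequences. The Python sketch below counts synonymous codons in frame; the function name and the tiny example sequence are hypothetical, and the set of six arginine codons comes from the standard genetic code.

```python
from collections import Counter

# The six arginine codons in the standard genetic code (DNA alphabet).
ARG_CODONS = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}


def codon_fractions(coding_seq, codons):
    """Return the fraction of each codon in `codons` among in-frame codons.

    Only codons belonging to `codons` are counted, so the fractions sum
    to 1. Returns an empty dict if none of them occur.
    """
    counts = Counter()
    for i in range(0, len(coding_seq) - 2, 3):
        codon = coding_seq[i:i + 3]
        if codon in codons:
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: counts[codon] / total for codon in sorted(codons)}


# Hypothetical coding fragment: AGA AGA CGC ATG
arg_usage = codon_fractions("AGAAGACGCATG", ARG_CODONS)
```

Run over all annotated coding sequences of an organism, the same tally gives a genome-wide usage table that gene finders can use to score candidate exons.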
Transposons
• insertion sequences; jumping genes
• mobile genetic material that can be moved from one location in a
genome and inserted at another
• the movement occurs due to the presence of an enzyme which is
encoded within transposon itself
– transposase enzyme coded by one or two genes
– it catalyses its transposition from one part of the genome to another
– the enzyme genes are surrounded by “repeat segments”
• transposition
– conservative – the number of copies of the repeat does not change
– replicative – copy number increases
• transposons are more common in bacteria, but are known to exist in
eukaryotes as well (~1,000 transposons in human genome)
Repetitive Elements
• many DNA regions contain repetitive sequences
• typically, large repetitive chunks are divided into
– tandemly repeated DNA
– repeats that are interspersed throughout the genome
• tandemly repeated DNA
– satellites
– minisatellites and/or microsatellites
• Example:
– 5’ CTCTCTCTCT 3’ sequence in which the repeat unit is ‘CT’
– 5’ ATTCGATTCGATTCG 3’ sequence; the repeat unit is ‘ATTCG’
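The repeat unit of a perfect tandem repeat, as in the two examples above, can be recovered with a short Python function. This is a sketch that assumes the whole input is an exact tandem array; the function name is illustrative, and real repeat finders must also tolerate mismatches and flanking sequence.

```python
def smallest_repeat_unit(seq):
    """Return (unit, copies) for the shortest unit that tiles `seq` exactly.

    If no shorter unit tiles the sequence, returns (seq, 1).
    """
    length = len(seq)
    for size in range(1, length // 2 + 1):
        # The unit length must divide the total length evenly.
        if length % size == 0 and seq[:size] * (length // size) == seq:
            return seq[:size], length // size
    return seq, 1


ct_repeat = smallest_repeat_unit("CTCTCTCTCT")
attcg_repeat = smallest_repeat_unit("ATTCGATTCGATTCG")
```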
Tandem Repeats
• Satellite DNA
– long, simple sequences (up to 10 Mbp) with skewed nucleotide
compositions
– repeating fragments of up to 2,000bp
• Minisatellite DNA
– not so long as satellites (up to 20kbp)
– copies of sequences of up to 25bp
• Microsatellite DNA
– shorter than minisatellites (up to 150bp)
– up to 100 copies of sequences of up to 5bp (typically 2-3)
– “TAGTAGTAGTAGTAGTAGTAG...”
• Example: humans, ‘CA’ repeats
– occur once every 10,000bp
– make 0.5% of human genome
Interspersed Repeats
• scattered randomly throughout genomes
• propagated by the synthesis of an RNA intermediate - process
called retrotransposition
• there are three steps in retrotransposition
– an RNA copy of the transposon is transcribed by RNA polymerase
(regular transcription step)
– RNA copy is converted into a DNA molecule by reverse transcriptase
– reverse transcriptase inserts the DNA copy somewhere else in the
genome
• reverse transcriptase may be acquired through viral infections
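The three steps of retrotransposition can be mimicked on strings to show why the copy number grows. The Python sketch below is a deliberately crude toy: it ignores strands, enzymes, and target-site effects, and every name and coordinate in it is hypothetical.

```python
def transcribe(dna):
    """Step 1: make an RNA copy of the element (T becomes U)."""
    return dna.replace("T", "U")


def reverse_transcribe(rna):
    """Step 2: convert the RNA copy back into DNA (U becomes T)."""
    return rna.replace("U", "T")


def retrotranspose(genome, start, end, target):
    """Step 3: insert a DNA copy of genome[start:end] at offset `target`.

    The original element stays in place, so the copy number increases.
    """
    rna_copy = transcribe(genome[start:end])
    dna_copy = reverse_transcribe(rna_copy)
    return genome[:target] + dna_copy + genome[target:]


# Hypothetical genome with one element, GATTACA, at offsets 3..10.
toy_genome = "AAAGATTACACCC"
new_genome = retrotranspose(toy_genome, 3, 10, 11)
```

After one round the toy genome carries two copies of the element, which is how interspersed repeats accumulate over many rounds.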
Eukaryotic Gene Density
• very low
• in the human genome:
– 3% of DNA codes for genes
– 27% of DNA are promoters, introns, and pseudogenes
– 70% of DNA ??? – often called ‘junk DNA’
• unique sequences
• repetitive sequences
• genes are far apart
– the average distance between genes is about 65,000bp
– in E. coli the average distance between genes is about 120bp