A primer on the structure and function of genes
DNA is very nearly the universal genetic material
The hereditary information of all life on earth is chemically encoded in molecules of NUCLEIC ACID. Nucleic
acids are linear polymers made up of monomers called NUCLEOTIDES (see figure below). Nucleotides are
composed of three subunits: (i) a nitrogenous base; (ii) a pentose sugar; and (iii) a phosphate group.
Because nucleotides are distinguished from one another by their nitrogenous base, they are commonly
referred to simply as “BASES”. Two major forms of nucleic acid serve as genetic material: DNA and RNA.
DNA is very nearly the universal form of genetic material, the exception being a number of viruses that
use RNA. Note that not all viruses use RNA; some use DNA as their genetic material. DNA differs from RNA
in that its pentose is 2’-deoxyribose (i.e., there is no hydroxyl group at the 2’ position), whereas the
pentose is ribose in RNA.
[Figure: a nucleotide, made up of a phosphate group (P), a pentose sugar, and a nitrogenous base.]
Nucleic acids contain 4 types of bases. For DNA these bases are adenine (A), guanine (G), cytosine (C),
and thymine (T). For RNA, uracil (U) is found instead of thymine (T). Chains of nucleic acid
are joined together by hydrogen bonds between specific pairs of bases; hence the length of a DNA molecule
is often given as a number of “base-pairs”. G pairs with C by means of three hydrogen bonds. In DNA, A pairs with T
by two hydrogen bonds; in RNA, A pairs with U, also by two hydrogen bonds. The two chains or “strands”
of DNA bonded together in this way form a double helix. Such DNA is called “double-stranded” (dsDNA).
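The pairing rules above can be tallied for a short duplex. The following is a minimal sketch, not part of the original primer; the function name count_hydrogen_bonds and the example strands are illustrative assumptions, and only the bond counts (three for G-C, two for A-T) come from the text.

```python
# Hydrogen bonds per DNA base pair, as stated in the text.
H_BONDS = {("G", "C"): 3, ("C", "G"): 3, ("A", "T"): 2, ("T", "A"): 2}


def count_hydrogen_bonds(strand, partner):
    """Sum the hydrogen bonds between two aligned DNA strands.

    partner[i] is the base sitting opposite strand[i]; the function
    name and calling convention are assumptions made for this sketch.
    """
    if len(strand) != len(partner):
        raise ValueError("strands must have the same length")
    total = 0
    for a, b in zip(strand, partner):
        if (a, b) not in H_BONDS:
            raise ValueError(f"{a} does not pair with {b}")
        total += H_BONDS[(a, b)]
    return total


# Seven A-T pairs (2 bonds each) plus two G-C pairs (3 bonds each).
print(count_hydrogen_bonds("ATTCAGTAA", "TAAGTCATT"))  # 20
```

Because G-C pairs contribute three bonds rather than two, duplexes rich in G and C are held together by more hydrogen bonds than A-T rich duplexes of the same length.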
[Figure: hydrogen bonds between paired bases. Guanine pairs with cytosine through three hydrogen bonds; adenine pairs with thymine through two.]
The backbone of each polynucleotide chain in nucleic acids is polarized. The sugar-phosphate links in
the backbone are directional, with the 5’ position of one pentose ring connected to the 3’ position of
the next pentose ring (see figure below). Each chain therefore has a 5’ end and a 3’ end, and the two
chains of a double helix are bonded together in opposite, or ANTI-PARALLEL, directions. Note that some
viruses have genomes that consist of only a single strand (ss); there are examples of this for both RNA
and DNA.
[Figure: two anti-parallel strands; one runs from its 5’ end to its 3’ end while its partner runs from its 3’ end to its 5’ end.]
Note: an important convention is to write out DNA in the 5’ to 3’ direction!
5’ – A T T C A G T A A – 3’
is NOT the same as
3’ – A T T C A G T A A – 5’
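To see why orientation matters, the partner strand of a sequence can be derived and written in the same 5’ to 3’ convention. This is a minimal sketch under the pairing rules given earlier; the function name reverse_complement is an illustrative choice, not something defined in the primer.

```python
# Watson-Crick partners for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq_5_to_3):
    """Return the partner strand, also written 5' to 3'.

    The partner runs anti-parallel, so we read the input backwards
    and swap each base for its pairing partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq_5_to_3))


top = "ATTCAGTAA"
bottom = reverse_complement(top)

print("5'-" + top + "-3'")
print("5'-" + bottom + "-3'")
```

Simply reading ATTCAGTAA backwards gives AATGACTTA, a different molecule from both the original strand and its partner, which is why the 5’ and 3’ labels should always be stated or implied by convention.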
Some additional comments about RNA are warranted. RNA is commonly found in nature in both single
and double strand forms. Regions of an RNA molecule, although part of a single polynucleotide
chain, often pair up with other regions of the same chain, forming secondary structures. Also, base
pairing between G and U is possible, whereas pairing between G and T in DNA does not occur.
The structure and function of genes
In the broad sense a GENE is defined as the genetic element which is transmitted from parent to offspring
during the process of reproduction and which influences hereditary traits. It has been more than a century since
the essential characteristics of a gene were defined by Mendel (1865). For much of that time there was
no mechanistic explanation of how a gene actually functioned. It wasn’t until 1941 that Beadle and Tatum
clearly showed that a genetic mutation resulted in a defective enzyme. Their findings became
formalized as the one-gene, one-enzyme hypothesis. Over time it became clear that some enzymes are
assembled from the products of more than one gene, and the hypothesis was changed to one-gene,
one-polypeptide. The definition later grew into one
which included both the coding DNA sequence and the adjacent segments necessary for the use of that
coding sequence. For example, Benjamin Lewin defines the gene in his textbook “Genes V” as follows:
GENE: is the segment of DNA involved in producing a polypeptide chain; it includes
regions preceding and following the coding region (leader and trailer) as well as
intervening sequences (introns) between individual coding segments (exons). [Note this
is also the definition of a CISTRON.]
This is essentially the modern view of the gene in molecular biology. Although somewhat expanded from
the original one-gene, one-enzyme hypothesis, it is still a deterministic view of the gene as a discrete
element of DNA, as it suggests that most, if not all, of the information required to obtain the functional
protein is contained in the local DNA sequence. A consequence of this view was that most functional and
structural diversity arose via local changes in the DNA sequence of genes.
The deterministic view of the gene was not only popular, but productive; without it we could not have
identified the genetic basis of many diseases. In fact, this view of the gene was one of the motivating
factors behind the huge effort and expense of the human genome project (HGP). It was envisioned
that knowing the sequences of genes and then discovering genetic variation in all human genes would
lead to the discovery of the genetic basis of many more diseases as well as provide clues to treatments
and cures. A highly unexpected result of the HGP was the discovery of just 30,000 genes; far
more were believed to be required to encode all the information necessary to build a human being (the
consensus opinion had been around 100,000 human genes). In the simple terms of the absolute number
of genes, it seemed that humans are not much more complex than fruit flies and roundworms (Drosophila
has about 13,000 genes and Caenorhabditis about 19,000 genes), and about the same as the mouse
(Mus has about 30,000 genes). Among other things, the HGP highlighted the deficiencies of our classic
view of the function of a gene and how to define it.
The HGP and other genome projects have revealed that many genes encode more than one protein. It
was well known that the products of genes could be modified at different stages during the process of
producing the mature gene product; hence not all the information required to obtain the final gene product
is encoded in the “gene”. However, in light of the very low gene number of the human genome, it is now
thought that most of the evolutionary changes in functional and structural differences between humans,
chimpanzees, and even mice, occurred at the level of gene regulation (Clark et al. 2003). If we want to
define a gene by what it does, we have to alter our way of thinking about its form in the genome.
What is a gene?
1. a unit of inheritance
2. a location on a chromosome
3. a sequence of base pairs
4. a determinant of phenotype
They are all correct.
Let’s reconsider the broad sense definition of a GENE: the genetic
element which is transmitted from parent to offspring during the
process of reproduction that influences hereditary traits. What
happens if we use such a definition, and some of the information
required to achieve the final function of a protein (e.g., the
propensity for a particular disease) is not encoded in the segment
of DNA that encodes protein?
We can no longer assume a purely deterministic view of a gene. Although there is much interest in
improving the definition of the gene, little progress has been made; we simply need more information
about how the tremendous complexity and plasticity of gene expression is regulated. Recognizing that
one “gene” could represent any number of products, with different functions, Venter et al. (2001) proposed
defining a gene as a “transcriptional unit”; this does not seem to resolve the important issues.
Nevertheless, in order to move forward we need one or more operational definitions. Bearing in mind the
many limitations of our definitions, we will divide genes into three broad categories: (i) protein coding
genes; (ii) regulatory genes; and (iii) RNA encoding genes.
1. Protein coding genes. This type fits the above definition easily, in that these genes are transcribed
into a messenger RNA (mRNA) that is used as a template for making a polypeptide. These genes are sometimes called
STRUCTURAL GENES. Even here we can see the problems with defining a gene as a segment of DNA involved in
producing a polypeptide chain, as the organization of such segments differs among eukaryotes, prokaryotes and viruses.
Prokaryotic protein coding sequences are COLINEAR with the polypeptide; the sequence of nucleotides
corresponds exactly to the sequence of amino acids in the polypeptide. Often several protein coding
genes are regulated and expressed as a single unit; this is called an OPERON (see figure below). The
mRNA for these adjacent coding sequences is synthesized in one piece. The operon includes regulatory
sequence elements that are physically located in the same region as the coding sequences and that mediate
transcription of those sequences. Operons tend to be composed of genes whose functions are related.
For example, it is very common for all the enzymes of a metabolic pathway to be organized into a cluster
of coding sequences that are co-ordinately regulated.
[Figure: the lac operon. Labeled elements on the DNA: promoter for the regulatory gene (Pi), regulatory gene (i), promoter for the lac operon (Plac), operator, and the structural genes z, y and a.]
z = Structural gene for β-galactosidase
y = Structural gene for β-galactoside permease
a = Structural gene for β-galactoside transacetylase
Promoter: A region of DNA extending 150-300 bp upstream from the transcription start site that contains binding sites
for RNA polymerase and a number of proteins that regulate the rate of transcription of the adjacent gene.
Operator: A region of DNA that indicates the starting point for reading the coding sequences of bacterial structural
genes and controls the expression of those genes via interaction with a repressor.
Eukaryotic protein coding genes differ in many ways from prokaryotic ones, the most striking difference
being the presence of introns. INTRONS are regions of DNA within a protein-coding gene that do not code for
amino acids; they are initially copied into the RNA, but are cut out of the final RNA transcript. Some
eukaryotic genes do not possess introns (e.g., histone genes) while others can have dozens. The size of
the introns can be highly variable as well. The figure below presents an example of a eukaryotic protein
coding gene.
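[Figure: a eukaryotic protein coding gene, showing regulatory signals upstream of the RNA start site, three exons (Exon 1 to Exon 3) separated by introns, and the poly-A addition site; the scale on the DNA runs from -220 to +2400.]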
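The removal of introns described above can be sketched in a few lines. This is only an illustration: the toy transcript, the exon coordinates, and the function name splice are all invented for the example and do not correspond to any real gene or to the figure.

```python
def splice(primary_transcript, exons):
    """Join the exon segments of a transcript, dropping the introns.

    `exons` is a list of (start, end) positions, 0-based with the end
    excluded, listed in order along the transcript (a convention
    assumed for this sketch).
    """
    return "".join(primary_transcript[start:end] for start, end in exons)


# Invented toy transcript: uppercase marks exons, lowercase marks introns.
transcript = "AUGGCC" + "guaagu" + "GAAUUC" + "guccag" + "UAA"
exons = [(0, 6), (12, 18), (24, 27)]

mature = splice(transcript, exons)
print(mature)  # AUGGCCGAAUUCUAA
```

The introns are present in the initial copy of the gene but absent from the final transcript, so the mature RNA is shorter than the region of DNA it was copied from.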
2. Regulatory signal genes. These are elements or motifs of DNA that are not transcribed, and serve as
signals to regulate the processing of the DNA molecule. The prominent types of such genes are:
1. Replicator signals: These signal the initiation or termination of DNA replication. Such sites often
function as binding sites for specific molecules that initiate or suppress the DNA replication
process.
2. Telomeres: These are repeats of specific DNA sequences found at the ends of eukaryotic
chromosomes. Because eukaryotic chromosomes are linear, having two ends, they must be
“capped” so that these ends are stable. Telomeres are crucial to the life of the cell, as they
function as the cap. In humans, telomeres can exist in an array of up to 2000 repeat units. Arrays
of telomeres shrink in size with each round of chromosome replication, so their length imposes a
finite life span on a cell.
3. Segregator signals: These determine the specific sites at which the segregation machinery of the
cell attaches to the chromosomes for the process of mitosis and meiosis.
4. Recombination signals: The sequence element that provides a recognition site for a
recombination enzyme.
Our understanding of the diversity and evolution of regulatory genes is far less advanced than that of the
other types of genes. However, the HGP illustrated the importance of gene regulation in the origin and
evolution of complexity. Remember that most protein coding genes are shared by humans, chimpanzees
and mice, and that divergence in the regulation of these genes is believed to be responsible for much of
the difference in complexity of these organisms. Regulatory sequences offer a
tremendous source of variation and many opportunities for the evolution of organism complexity.
Because regulatory genes are modular, complexity can arise from COMBINATORIAL EVOLUTION, in which
case there is much less need for rare beneficial mutations. Let’s look at an example. Consider 50 genes,
each with 2 possible ways of alternative splicing of the exons; this gives us 100 possible gene products. Now
suppose that mixing and matching the regulatory elements allows expression of any 10 of these products
at the same time. The number of unique sets of 10 different gene products is 1.7 × 10^13. Even if only an
extremely small fraction of these gene expression patterns alter the phenotype (say 0.000001), we still
have an immense number of possibilities (>17 million) to work with, all without any mutations in the protein
coding sequences.
Example of combinatorial possibilities:
Let’s take a look at a familiar example. Say you have a deck of 52 cards and are about to play a game of poker. You wonder how
many different 5 card hands are possible.
We will use the notation C(n,r) for the number of combinations of n things taken r at a time. So in this case we have n = 52 things
taken r = 5 at a time.
C(n,r) = n! / [(n-r)! × r!]
C(52,5) = 52! / (47! × 5!)
C(52,5) = 2,598,960
Now in our example of gene combinations we have n = (50 × 2) = 100 gene products, taken r = 10 at a time.
C(n,r) = n! / [(n-r)! × r!]
C(100,10) = 100! / (90! × 10!)
C(100,10) = 1.7310 × 10^13
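These counts are easy to check numerically. The sketch below uses Python's standard math.comb function; the variable names are illustrative, and the fraction 0.000001 is the value assumed in the text for the share of expression patterns that alter the phenotype.

```python
import math

# Number of distinct 5 card hands from a 52 card deck.
poker_hands = math.comb(52, 5)

# Number of distinct sets of 10 gene products chosen from 100.
expression_sets = math.comb(100, 10)

# Fraction of expression patterns assumed in the text to alter the phenotype.
fraction_altering_phenotype = 0.000001

print(poker_hands)      # 2598960
print(expression_sets)  # 17310309456440
print(expression_sets * fraction_altering_phenotype)
```

The last line comes to roughly 17 million, which is the number of potentially phenotype-altering combinations quoted in the argument above.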
Combinatorial gene expression is well studied in the context of cell differentiation. Below is a diagram that
illustrates how combinations of regulatory proteins can be used to determine the development of different cell
types. In this example differential expression of three different regulatory proteins (1, 2 and 3) leads to
eight different cell types.
Figure obtained from the National Health Museum (Access Excellence):
http://www.accessexcellence.org
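The count of eight cell types follows from each of the three regulatory proteins being either present or absent, giving 2 × 2 × 2 combinations. A minimal sketch that enumerates them is given below; it assumes, for illustration only, that every presence/absence pattern corresponds to a distinct cell type.

```python
from itertools import product

# The three regulatory proteins from the example, labeled as in the text.
proteins = ("1", "2", "3")

# Each protein is either absent (False) or present (True).
patterns = list(product((False, True), repeat=len(proteins)))

for pattern in patterns:
    present = [name for name, is_on in zip(proteins, pattern) if is_on]
    print(", ".join(present) if present else "none")

print(len(patterns))  # 8
```

Adding a fourth regulatory protein would double the count to sixteen, which illustrates how quickly combinatorial control scales.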
The difference in the phenotypes of these cells is due to differences in the patterns of gene expression.
Imagine that, rather than relying on point mutations in proteins, we can alter the phenotype of a cell by mixing and
matching the regulatory elements that control the pattern of gene expression. Mutation is an extremely
slow process; but with combinatorial evolution change can be achieved much more quickly via the much
faster process of recombination. The evolutionary dynamics of regulatory genes, and in particular
combinatorial evolution, warrant serious attention.
3. RNA encoding genes. In contrast to the mRNA of protein coding genes, the transcribed RNA is itself
the final product of an RNA gene. RNA molecules specified by such genes fold into complex structures that
associate with proteins to form a sort of “chemical machine”. The three most prominent types of such RNA
molecules are:
1. Transfer RNA (tRNA): Amino acids have no affinity of their own for the mRNA; hence the tRNA
molecule is used as an adaptor molecule. tRNAs function to position a specific amino acid within
the translation complex so that it can be added to the growing polypeptide chain.
2. Ribosomal RNA (rRNA): The ribosomal RNA combines with proteins, to form the ribosome, which
is the site of protein synthesis within the cell. This type of RNA makes up the vast majority of all
RNA in the cell, about 95%.
3. Small nuclear RNA (snRNA): These are responsible for the processing of the mRNA molecule in
the nucleus. They associate with protein molecules to form an RNA splicing complex that removes
introns from mRNA. They are also important in the maintenance of the telomeres, or
chromosomal ends. snRNAs are unique to eukaryotes. snRNAs are always associated with
proteins in complexes called small nuclear ribonucleoproteins (snRNPs, or “snurps”).
Other types of RNA genes are small nucleolar RNA (snoRNA), microRNA, guide RNA (gRNA), and
signal recognition particle RNA.
Some general features of RNA genes:
• In general, RNA specifying genes do not contain introns and are largely similar in structure among
prokaryotes and eukaryotes. There are some exceptions, e.g., ciliates, slime molds, and certain
bacteria, where RNA genes encode introns that are spliced out in order to obtain a functional RNA
molecule.
• Sequence elements that regulate the expression of RNA genes are sometimes found within the
DNA sequence of the gene. Examples include the eukaryotic tRNA genes.
• Many RNA molecules are modified by incorporation of standard and non-standard nucleotides
after the process of transcription is complete. Standard nucleotides can also be modified into
non-standard ones.
• As we have seen in the diagrams above, folding of RNA molecules means that some sites have
evolved to form base-pairs with other sites within the same RNA molecule. This is called RNA
SECONDARY STRUCTURE.