Lecture 7 notes - UC Davis Plant Sciences

The problem: find the genes, regulatory elements, and repetitive elements in
the middle of millions of ATCG
After a genome is completely sequenced, it can then be annotated. This
operation is performed initially using computer programs that are designed to
find genes (exons and introns, plus regulatory sequences such as TATA boxes
and polyA addition sites) and repetitive elements, and to determine their
structure. Once genes are identified, the gene order and structure can be
compared with other species: comparative genomics.
For prokaryote annotation, use Prodigal: http://compbio.ornl.gov/prodigal/server.html
Comparison of genomic sequences with cDNA, using the program GeneSeqer, is a good
strategy to annotate genes (extrinsic information).
The absolute frequency of the pair A(k)A, with k (0 to 50) nucleotides between the two As, in
the first 200 bp in a set of 1761 human genes and 1753 introns. A clear triplet pattern
appears in the coding region but not in the introns. This pattern is a reflection of the
characteristic codon usage seen in coding regions.
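The A(k)A count described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the slide: the function name `aka_counts` and its parameters are hypothetical, and the toy input is not real sequence data.

```python
from collections import Counter


def aka_counts(sequences, max_k=50, window=200):
    """Count A(k)A pairs: two As separated by exactly k nucleotides.

    Only the first `window` bases of each sequence are examined.
    Returns a list whose index k holds the absolute count for spacing k.
    """
    counts = Counter()
    for seq in sequences:
        s = seq.upper()[:window]
        a_positions = [i for i, base in enumerate(s) if base == "A"]
        a_set = set(a_positions)
        for i in a_positions:
            for k in range(max_k + 1):
                # k bases lie between position i and position i + k + 1
                if i + k + 1 in a_set:
                    counts[k] += 1
    return [counts[k] for k in range(max_k + 1)]


# Toy input with an A at every third position, mimicking a codon-like period.
print(aka_counts(["ACGACGACG"], max_k=5))
```

With real coding sequences, peaks would be expected at k = 2, 5, 8, and so on, which is the triplet pattern the slide describes; intron sequences should give a flatter profile.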
Gene prediction algorithms attempt to determine whether a particular DNA sequence
constitutes a working gene. The parameters distinguishing genes from non-genes are not
well understood. Although certain features, such as splice sites and ORFs, are fairly well
defined, other features, such as regulatory regions, are still very difficult to predict. Even
identifying ORFs is not straightforward, particularly in mammalian genomes that are
characterized by small exons and large introns. A further problem in gene prediction is that
our knowledge of identifying features in genes is constantly expanding. Computer scientists
would classify gene prediction as a problem in pattern recognition. Machine learning
algorithms are good for solving problems in pattern recognition because they can be trained
on a sample data set to “learn” what defines the pattern in question when well-defined rules
are not available.
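To make the ORF problem concrete, here is a minimal rule-based ORF scan in Python. It is a sketch under stated assumptions: it scans only the three forward reading frames, treats ATG as the only start codon, and knows nothing about introns. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) half-open coordinates of ATG...stop ORFs.

    Scans the three forward reading frames only. `min_codons` counts
    the start and stop codons as part of the ORF length.
    """
    s = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(s) - 2, 3):
            codon = s[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


print(find_orfs("CCATGAAATAGCC"))
```

A scan like this works reasonably for compact prokaryotic genes, but it breaks down when a coding sequence is split into small exons by large introns, which is why the notes turn to trained statistical models.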
A Markov chain, model, or process refers to a series of observations in which the probability
of an observation depends on a number of previous observations. Markov processes
describe many biological phenomena, including base-pair substitutions resulting from
mutations. In some cases, the states in a Markov process are not known with certainty. The
example of a dishonest gambler is often used to illustrate this point. The gambler may carry
a loaded die that he or she occasionally substitutes for a fair die, but not so often that the
other players would notice. The fair die has a one-in-six chance of showing any particular
number. When using the loaded die, a player will have a 50% chance of rolling a one and a
10% chance of rolling any other number. It is in these types of situations that stochastic
models called hidden Markov models (HMMs) are useful, because they take into account
unknown (or hidden) states. For example, exactly when the cheating gambler is using a fair
or loaded die is hidden from the other players, but insight may still be gained by looking at
the outcome of the cheater’s rolls. If he or she rolls three ones in a row, the loaded die is
the more likely explanation: it produces three ones in a row with probability 0.5 x 0.5 x 0.5 =
12.5%, whereas the fair die does so with probability (1/6) x (1/6) x (1/6), or only about
0.5%. Hidden Markov models describe the
probability of transitions between hidden states, as well as the probabilities associated with
each state. In the example of the cheating gambler, an HMM would describe the
probabilities of rolling particular numbers given the loaded or fair die and the probability that
the dishonest gambler would switch from one die to another. Hidden Markov models can be
used to answer three types of questions. The first type is the likelihood question: Given a
particular HMM, what is the probability of obtaining a particular outcome (e.g., rolling three
ones)? The second type is the decoding question: Given a particular HMM, what is the
most likely sequence of transitions between states for a particular outcome? In the case of
the cheating gambler, this sequence would be the order in which he or she transitioned from
one die to another. The third type is the learning question: Given a particular outcome and
set of assumptions about possible transition states, what are the best model parameters
(e.g., probabilities between transition states)? This third question allows HMMs to be used
for machine learning. Every HMM has a start and end state, denoted by the S and E,
respectively. Hidden states lie between the start and end states. In the figure, the squares
are states, and the lines between them indicate the probability of one state transitioning to
another. The loops on the upper and lower states show the probabilities associated with the
state remaining the same. States transition back and forth until the HMM reaches the end
state. HMM can be used to assess the probability that a particular sequence is a gene or
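
The dishonest-gambler example can be sketched in Python to show the likelihood and decoding questions. The emission probabilities below come from the notes; the start and switching probabilities (`START`, `TRANS`) are assumed values chosen only for illustration, and the sketch omits an explicit end state. The likelihood question is answered with the forward algorithm and the decoding question with the Viterbi algorithm.

```python
import math

STATES = ("fair", "loaded")

# Emission probabilities as given in the notes.
EMIT = {
    "fair": {r: 1 / 6 for r in range(1, 7)},
    "loaded": {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},
}

# ASSUMED values, not from the notes: how the gambler starts and switches dice.
START = {"fair": 0.5, "loaded": 0.5}
TRANS = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}


def single_die_likelihood(state, rolls):
    """Probability of the rolls if one die is used throughout."""
    p = 1.0
    for r in rolls:
        p *= EMIT[state][r]
    return p


def forward(rolls):
    """Likelihood question: total probability of the rolls under the HMM."""
    alpha = {s: START[s] * EMIT[s][rolls[0]] for s in STATES}
    for r in rolls[1:]:
        alpha = {
            s: EMIT[s][r] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(rolls):
    """Decoding question: most likely sequence of hidden states (log space)."""
    score = {s: math.log(START[s] * EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for r in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][r])
            )
        score = new_score
        backpointers.append(pointers)
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


rolls = [1, 1, 1]
print(single_die_likelihood("loaded", rolls))  # 0.5 ** 3
print(single_die_likelihood("fair", rolls))    # (1 / 6) ** 3
print(forward(rolls))
print(viterbi(rolls))
```

The learning question (estimating `TRANS` and `EMIT` from observed rolls) is usually handled with the Baum-Welch algorithm, which is not shown here.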
Dynamic programming: the solution of a general problem is obtained by the
recursive solution of smaller versions of the problem. The main difficulty in
exon assembly is that the number of possible assemblies grows
exponentially with the number of predicted exons. Dynamic programming
makes it possible to find an efficient solution without having to enumerate all
possible combinations.
GenomeScan: http://genes.mit.edu/genomescan.html
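As a rough illustration of the dynamic-programming idea, the sketch below picks the highest-scoring chain of non-overlapping candidate exons. It is a simplified stand-in, not the algorithm used by GenomeScan or any other named program: real exon assembly also checks reading-frame and splice-site compatibility. The function name `best_exon_chain` and the example coordinates and scores are made up.

```python
from bisect import bisect_left


def best_exon_chain(exons):
    """Pick the best-scoring set of non-overlapping candidate exons.

    `exons` is a list of (start, end, score) with inclusive coordinates.
    best[i] holds the best total score using only the first i exons
    (sorted by end), so each subproblem is solved once instead of
    enumerating every possible combination.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)
    came_from = [None] * (len(exons) + 1)
    for i, (start, _end, score) in enumerate(exons, start=1):
        j = bisect_left(ends, start)  # exons that end before this one starts
        if best[j] + score > best[i - 1]:
            best[i] = best[j] + score
            came_from[i] = j  # exon i - 1 is used; continue from subproblem j
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen exons.
    chain, i = [], len(exons)
    while i > 0:
        if came_from[i] is None:
            i -= 1
        else:
            chain.append(exons[i - 1])
            i = came_from[i]
    chain.reverse()
    return best[-1], chain


candidates = [(1, 100, 5.0), (50, 150, 6.0), (160, 250, 4.0), (120, 260, 3.0)]
print(best_exon_chain(candidates))
```

Sorting costs O(n log n) and each exon is then handled with one binary search, whereas brute-force enumeration would consider up to 2^n subsets.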
GRAIL and GRAIL-EXP predictions of URO-D (U30787). The top track
displays the annotated exonic structure. Below it is the GRAIL ab initio
prediction, and below that are the alternative splicing forms identified by
GRAIL-EXP using an external EST database.
The output gives the exon number, strand (forward + and reverse -), start
and stop point of the exon, exon type (start, internal, terminal), its length, its
raw score, and a qualitative description. Here GRAIL correctly predicted 5 of
the ten known exons as well as part of one of the internal exons.
Output: gene, exon number, type, strand, begin and end of prediction, reading
frame, scoring values, and a probability value. In this example GENSCAN
correctly predicted 9 of the ten exons. It missed the first exon even when the
GenomeScan version was used.
In GenomeScan you enter the protein you suspect to be similar, identified
using a previous BLASTX search.
FGENESH predictions on the URO-D sequence. The tabular format of the
output gives the gene number, the strand (+ or -), the exon number within
the gene, the exon type, the start and stop positions for the exon, an exon
score, ORF start and stop positions, and exon length. The amino acid
sequence of the predicted protein product is given below the gene
prediction. The method can also predict TATA boxes and polyA tails. Here,
FGENESH correctly predicted seven of the ten exons, two exons were
partially detected, and the initial exon was missed altogether. The method
detected the presence of a polyA tail as well.
Transposons are DNA sequences with the ability to jump about a host’s
genome. Class I transposons use an RNA intermediate and are capable of
copying themselves many times over. Retroviruses are believed to have
evolved from this class of transposable elements. Class II transposons use a
DNA intermediate and are restricted mainly to extracting themselves from
one genomic location and reinserting themselves in another. MITEs are a
type of class II transposon that has a very high copy number in plants, which
is atypical for this class of transposons.
Transposons can be broadly defined as DNA sequences flanked by repeats
that have the ability to insert themselves throughout the genome. In some
cases, transposable-element (TE) insertion can be faster than chromosome
replication, allowing such sequences to rapidly accumulate within an
organism’s genome. For example, 44% of the human genome is composed
of sequences derived from TEs. Because transposons are able to occupy
random positions within the genome, they can have deleterious effects by
disrupting gene function; in some rare cases, a TE can even evolve into a
beneficial gene. A TE can also alter gene expression by inserting itself near
the regulatory region of a preexisting gene. In eukaryotic genomes with large
amounts of noncoding DNA, TEs will mostly have a neutral effect.
Transposable-element content varies greatly between eukaryotic species.
Approximately one half of the human and maize genomes consists of
transposable elements. Drosophila, on the other hand, has only 15% of its
genome dedicated to TEs, and Arabidopsis, C. elegans, and S. cerevisiae all
have less than 5%. TE content does not appear to be evenly distributed in
genomes. For example, in humans, certain types of TE sequences are found
in much higher concentrations in the X chromosome.
Transposable elements can be classified as autonomous, nonautonomous,
and inactive. Autonomous elements possess all of the genes that are
required for transposition, which allows the sequence to move about the
genome with only the help of host enzymes. Nonautonomous elements are
degraded versions of autonomous sequences that require the proteins of
autonomous elements for transposition. In contrast, inactive elements
Transposable elements can be further categorized as class I and class II.
Two kinds of class I transposons are LTR and non-LTR retrotransposons.
Class I transposable elements (also called retrotransposons, or
retroelements) move via an RNA intermediate. Reverse transcriptase is then
used to generate a cDNA sequence that can reinsert itself into the genome.
This mode of action allows class I elements to copy themselves many times
over, thereby significantly expanding an organism’s genome. A large portion
(42%) of the human genome comes from retrotransposon sequences.
Interestingly, while a reverse-transcriptase gene has been found in almost all
eukaryotic genomes examined to date, it is found in only a minority of
Bacteria and in almost no Archaea.
LTR retrotransposons are flanked by long terminal repeats (LTRs) that are identical
at the time of insertion.
The start and end of each LTR include a few inverted base pairs (sometimes
not an exact inversion).
When the retroelement inserts in the host, it generates a duplication of the
host sequence, which can be used to identify pairs of LTRs from the same
element. This duplication is not part of the transposon and is characteristic of the
insertion site.
The region between the LTRs codes for the proteins.
Several features differentiate class II transposons from class I transposons.
Class II transposons are able to move about the genome using DNA only,
whereas class I transposons require an RNA intermediate to do so. Another
major difference is that class II or DNA transposons are typically unable to
copy themselves. A DNA transposon will extract itself from one region of the
genome in order to insert itself into another. There are, however, some
conditions under which a class II transposon can be replicated. For
example, if a class II transposon moves shortly after its sister DNA
strand is copied, a second copy of the transposon will exist. There are
approximately 200,000 copies of sequences derived from class II
transposons in the human genome. P-elements in Drosophila and
activation–dissociation (Ac/Ds) elements in maize are also examples of
DNA-based transposons. The figure in the slide shows the structure of the
Tc1-mariner class II transposons found in a variety of animal species. The
transposase protein is solely responsible for the excision of the TE from one
region and its integration into another. The transposase gene is flanked by
inverted terminal repeats, and the inserted element by target-site
duplications, on each side.
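The two structural signatures mentioned above, inverted terminal repeats and target-site duplications, can be checked with a short Python sketch. It assumes perfect (mismatch-free) repeats and a known element position. The function names and the toy sequence are hypothetical, not taken from any annotation tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element):
    """Length of the perfect inverted repeat shared by the two ends.

    The left end is compared with the reverse complement of the right
    end, one base at a time, stopping at the first mismatch.
    """
    s = element.upper()
    rc = reverse_complement(s)
    n = 0
    while n < len(s) // 2 and s[n] == rc[n]:
        n += 1
    return n


def has_target_site_duplication(genome, start, end, tsd_len):
    """True if the tsd_len bases on each side of genome[start:end] match."""
    if start < tsd_len:
        return False
    left = genome[start - tsd_len:start].upper()
    right = genome[end:end + tsd_len].upper()
    return len(right) == tsd_len and left == right


# Toy element: 5-bp inverted repeats (CAGTG ... CACTG) around a spacer,
# inserted so that the host dinucleotide TA is duplicated on both sides.
element = "CAGTGGGGGGCACTG"
genome = "GGTA" + element + "TACC"
print(terminal_inverted_repeat_length(element))
print(has_target_site_duplication(genome, 4, 4 + len(element), 2))
```

Real elements often carry imperfect repeats, so a practical search would tolerate some mismatches rather than stopping at the first one.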
Miniature inverted-repeat transposable elements (or MITEs) are an unusual group of TEs found
in a variety of organisms. They are especially common in plants (e.g., 6% of the rice genome).
For many years, MITEs were a mystery to biologists, because their high copy number
implied active (or autonomous) elements, but none had been found. After the sequencing of
the rice genome, three separate groups of researchers published proof that MITEs can move
about the rice genome. Plant MITEs fall into two categories: stowaway-like and tourist-like.
Stowaway-like MITEs are moved about the genome by autonomous class II transposons
similar to the Tc1-mariner element. Tourist-like elements have as their autonomous partners
a newly described class of active MITE called PIF/Pong. The figure in the slide compares
the sequences of Pong and a nonautonomous MITE called Ping. The red and green regions
represent homologous terminal regions among a variety of degraded Ping elements (mPings
in the figure) and Pong elements. Unlike active Pong sequences, Ping elements have
degraded ORF-1 genes. While the function of ORF-1 is not fully understood, the ORF-2
gene is believed to code for the transposase that is responsible for making Ping elements
mobile. The high copy number of MITEs is very unusual for a class II transposon. The small
size of MITEs (typically less than 500 bp) is believed to be the reason that these
transposons are so common. Insertions of MITEs in the genome would presumably be less
disruptive than insertions of larger TEs and hence would not be selected against as strongly.
Science 2009 Vol. 325. no. 5946, pp. 1391 – 1394: We performed a genome-wide screen
of functional interactions between Stowaway MITEs and potential transposases in the rice
genome and identified a transpositionally active MITE that possesses key properties that
enhance transposition. The MITE contains internal sequences that enhance transposition.
MITEs achieve high transposition activity by scavenging transposases encoded by
distantly related and self-restrained autonomous elements.