Simple sequence repeat
● Also known as a low complexity sequence
● Repetition of a nucleotide in genomic sequences
● Can also be repetition of a pair of nucleotides, e.g. ACACACACACACACACACAC ..., a microsatellite used for forensic identification
● Can also be a triplet or more complex repeat
● Filtered from BLAST searches. Two common programs that filter sequences for BLAST searches are seg (amino acids, off by default) and dust (nucleotides, on by default)
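The repeat types listed above (single nucleotide, pair, triplet) can be located with a simple pattern search. The following is a minimal sketch only: the function name `find_simple_repeats` and the thresholds `max_unit` and `min_copies` are assumptions made for illustration, and this is not the method used by dust or seg.

```python
import re

def find_simple_repeats(seq, max_unit=3, min_copies=5):
    """Return (start, unit, copies) for each perfect tandem repeat found.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """
    # The lazy quantifier {1,max_unit}? tries the shortest repeat unit first;
    # \1{n,} then requires at least (min_copies - 1) further copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# A dinucleotide microsatellite embedded in flanking sequence
print(find_simple_repeats("GGTT" + "AC" * 10 + "TTGA"))
# A homopolymer run followed by a triplet repeat
print(find_simple_repeats("CT" + "A" * 8 + "CT" + "CAG" * 6 + "TT"))
```

Only perfect repeats are reported here; an interrupted repeat such as the ones in the protein examples would be split into several hits or missed.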
Simple sequence repeat
● You can also find low complexity regions in protein coding sequences
● Repetition can be in the DNA or in the amino acids of the proteins
● Found in almost all eukaryotic organisms to greater or lesser degrees
Simple sequence repeat
Protein DDB_G0268506 from Dictyostelium discoideum
MSKDHHHQQHQYQQLHPPIPSQHHHHHHHQSQNSDSELNHDNHKKFGHDRIVSNSFSPPPLHQFNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNYNNENNGNNVSFNPHQSNNNNNNPPMSSNEQYKPIYGQPSS
LSHLWFENKSNRASNNNNNNNHNNNNNNHNNNNNNSNNDNNNVSLTESYGPQAHDHPHHHHHPNHHSNNQ
NLFNQFSLQNSTPCNLSNNADMSNSNQHHHSNNSEIVRDRNIDNNNNINNNNNNTTTTNNNNSGNRDRYK
DSVDVLEKSTEKSKITTLGKHNTNINNNNSNKYKQLLPPLPIPNEQYNGIGIDNGLSHSSSNGSLGSADS
LDSPHTPMSSPSLSSLSLSQNLHINNNSNNYNNNNNGNNNNFNNNNYNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNYNNNNNYNSSSNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
GNNNNNNNINYNQSHNYNNYTPPISPLSTPPLTPTGSSISGPIGFLSGSPQNSPRNPNSPRLDPLVVQNN
QMVIYKRQFDQMLTKSMGDVWLKINKDVEEGSPTLPSATSTLPLARIKKIMKSDPGVKMISWEAPILFAK
ACEFFILELAARSWIHTDLSKRRTLQRSDIIHAVARVETFDFLIDVLPRDEIKPKKVDDIKPSYINSPEG
FPISLEPIPINNSGRLNSNNNNNNSNNRALTLTNPSPLNSNLTTQLPNIPTPQHQNQNQNQNQNQNQNQN
QHQHQNQNQNQNQNQNQNQNQNQYQHQHQHQHQHQHQHQHHQHHQHQHHQHHHHQHQHQNQNQNQNQHQQ
HQHQIYQPNQQQIHHINHQLGMHHHNPHQNQNQHPMYSHQFQNYSQVAFNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNSNNSNNSNNSSNNNNNNNNNNSNNNNNNNNNNNNNNNNNNNSNNNNNSNNSNNNN
NNNYNNYNGNNNNYNNYNSSSNNNSNNNNNNNNNNNNNNNNNNNNNNNNNNSNNNNNGNNNFENINPFQP
HNHMQSQYYYNQSINQYQNQNHNNNNNNSNNNNSNNQNSNNIYTRQYENEEDDENEEDQKSSTSGSESES
Simple sequence repeat
● Saccharomyces cerevisiae SRP40
>YKR092C
MASKKIKVDEVPKLSVKEKEIEEKSSSSSSSSSSSSSSSSSSSSSSSSSG
ESSSSSSSSSSSSSSDSSDSSDSESSSSSSSSSSSSSSSSDSESSSESDS
SSSGSSSSSSSSSDESSSESESEDETKKRARESDNEDAKETKKAKTEPES
SSSSESSSSGSSSSSESESGSESDSDSSSSSSSSSDSESDSESDSQSSSS
SSSSDSSSDSDSSSSDSSSDSDSSSSSSSSSSDSDSDSDSSSDSDSSGSS
DSSSSSDSSSDESTSSDSSDSDSDSDSGSSSELETKEATADESKAEETPA
SSNESTPSASSSSSANKLNIPAGTDEIKEGQRKHFSRVDRSKINFEAWEL
TDNTYKGAAGTWGEKANEKLGRVRGKDFTKNKNKMKRGSYRGGSITLESG
SYKFQD*
Fis binding sites
Sequence complexity
● Information theory
Claude Elwood Shannon 1916 - 2001
● Information theory (Shannon entropy)
● Proved that Boolean algebra and binary arithmetic can be used to simplify arrangements of electromechanical relays (master's thesis)
● PhD thesis: “An algebra for theoretical genetics”
● 1948: “A mathematical theory of communication”
From Wikipedia http://en.wikipedia.org/wiki/Claude_Shannon
Sequence complexity
● Information theory
[Diagram: Information source → Encoder → Noise (negative / positive) → Decoder → Destination. From Shannon 1948]
Sequence complexity
● Information theory
[Diagram: Parent DNA → Mutation; natural, sexual selection → Offspring DNA]
Sequence complexity
● Information theory
– In Biology:
● DNA
● RNA
● Proteins
[Diagram: DNA → (transcription) → RNA → (splicing) → mRNA → (translation) → Proteins]
Sequence complexity
● Information theory
– Noise sources:
● heterologous sequences
● rearranged and deleted sequences
● repetitive elements
● sequencing error
● natural polymorphism
● frameshift
● codon usage
● selection
Sequence complexity
● Information theory
– Developed by Shannon and Weaver to describe the transmission of electronic signals
– Used to look for pattern and complexity in DNA and protein sequences
Shannon's Entropy
$$H = -L \sum_i p_i \log_2 p_i$$
$L$: number of elements in the sequence
$p_i$: probability of occurrence of element $i$
$H$: units in “bits”
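As a minimal sketch of the formula above, the function below estimates each $p_i$ from the symbol frequencies in the sequence itself and returns both $H$ and $H/L$. The function name `shannon_entropy` is an assumption made for this example.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Return (H, H/L): total entropy in bits and entropy per symbol."""
    length = len(seq)
    if length == 0:
        return 0.0, 0.0
    # p * log2(1/p) is the same quantity as -p * log2(p)
    per_symbol = sum(
        (count / length) * log2(length / count)
        for count in Counter(seq).values()
    )
    return length * per_symbol, per_symbol

print(shannon_entropy("AAAAAAAA"))  # a single repeated symbol
print(shannon_entropy("ACGTACGT"))  # four symbols, equally represented
```

The same function works for protein sequences, since it makes no assumption about the alphabet.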
Entropy
● Entropy
– A measure of the disorder or randomness in a closed system
– A measure of the loss of information in a transmitted message
● Given a random variable $X$ with probabilities $P(x_i)$ for a discrete set of events $x_1, x_2, \ldots, x_n$, the Shannon entropy is the negative expected value of $\log P(X)$
– $H(X) = -E[\log P(X)]$
Entropy
– From basic probability:
● $E(X) = \sum_i x_i \, p(x_i)$
● $E(h(X)) = \sum_i h(x_i) \, p(x_i)$
● $H(X) = -\sum_i p(x_i) \log p(x_i)$
Entropy
● The entropy measures the prior uncertainty in the outcome of a random experiment described by P, or the information gained when the outcome is observed
● Uses the logarithm base 2, which makes the unit of entropy bits
Entropy
● Properties
– $H(X) \geq 0$
– If we are certain of the outcome of a sample from the distribution ($P(x_k) = 1$, all other $P(x_i) = 0$), then the entropy is 0
– Entropy is maximized when all $n$ of the $P(x_i)$ are equal to $1/n$
– The maximum is then $\log(n)$
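The properties above can be checked numerically. This is a small sketch that assumes base-2 logarithms and treats terms with $p = 0$ as contributing 0; the function name `entropy` and the example distributions are chosen for illustration.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute 0 by convention."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

certain = entropy([1.0, 0.0, 0.0, 0.0])  # one outcome is certain
uniform = entropy([0.25] * 4)            # all n = 4 outcomes equally likely
skewed = entropy([0.7, 0.1, 0.1, 0.1])   # somewhere in between

print(certain, uniform, skewed)
print("log2(4) =", log2(4))
```

For an alphabet of 20 amino acids the same check gives a maximum of log2(20), roughly 4.32 bits per residue.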
Sequence complexity
- For a DNA sequence of length $L$ with 4 nucleotides at $p_i = 0.25$, $H_{max} = L \times \log_2(4) = 2L$ bits
- This corresponds to representing each nucleotide as a 2-bit number (11, 10, 01, 00)
- In case of departure from equal probability: $H < H_{max}$
- If $H/L = 0$: sequence of minimal complexity (a single repeated nucleotide or amino acid)
- If $H/L = 2$: maximal complexity for DNA, all nucleotides are equally represented
ATGTTCTATGGGCCACAAGTCACGAGCT
Counts:      A: 7     T: 7     G: 7     C: 7
Frequencies: A: 0.25  T: 0.25  G: 0.25  C: 0.25
$H = -28 \times (0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25)$
H = 56 bits
H/L = 2
ACTTATATATACCGGAGACTATATGAGA
Counts:      A: 11    T: 8     G: 5     C: 4
Frequencies: A: 0.39  T: 0.29  G: 0.18  C: 0.14
$H = -28 \times (0.39 \log_2 0.39 + 0.29 \log_2 0.29 + 0.18 \log_2 0.18 + 0.14 \log_2 0.14)$
H = 52.92 bits
H/L = 1.89
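The two worked examples can be reproduced with a short script. This is a sketch; the helper name `composition_report` is an assumption. It uses unrounded frequencies, so the total for the second sequence comes out near 52.94 bits rather than the 52.92 bits obtained with frequencies rounded to two decimals, while H/L still rounds to 1.89.

```python
from collections import Counter
from math import log2

def composition_report(seq):
    """Return (counts, frequencies, H in bits, H/L) for a DNA sequence."""
    length = len(seq)
    counts = Counter(seq)
    freqs = {base: counts[base] / length for base in "ATGC"}
    h_per_base = sum(p * log2(1 / p) for p in freqs.values() if p > 0)
    return counts, freqs, length * h_per_base, h_per_base

for seq in ("ATGTTCTATGGGCCACAAGTCACGAGCT", "ACTTATATATACCGGAGACTATATGAGA"):
    counts, freqs, h, h_per_base = composition_report(seq)
    print(seq)
    print("  counts:", {base: counts[base] for base in "ATGC"})
    print("  freqs: ", {base: round(freqs[base], 2) for base in "ATGC"})
    print(f"  H = {h:.2f} bits, H/L = {h_per_base:.2f}")
```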
Sequence complexity
● Uncertainty in the information
– Because of selective pressures acting on sequences (DNA or protein), some departure from expectation can be observed
– $I(X) = H_{expected} - H_{observed}$
– The more conserved the sequence, the higher the information content
Sequence complexity
● Example
– If $p_i = 0.25$, $H_{expected} = 2$ bits
– At a particular position in a number of related sequences we observe only A or G, with $p_A = 0.7$ and $p_G = 0.3$
– $H_{observed} = -0.7 \log_2(0.7) - 0.3 \log_2(0.3) - 0 - 0 = 0.88$ bits
– $I(X) = 2 - 0.88 = 1.12$ bits
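The arithmetic of this example can be sketched as follows, assuming a uniform expectation over the alphabet. The function name `information_content` and its `alphabet_size` parameter are illustrative choices.

```python
from math import log2

def information_content(observed_probs, alphabet_size=4):
    """I = H_expected - H_observed in bits, with a uniform expectation."""
    h_expected = log2(alphabet_size)
    # Nucleotides that are never observed (p = 0) contribute nothing
    h_observed = sum(p * log2(1 / p) for p in observed_probs if p > 0)
    return h_expected - h_observed

# pA = 0.7, pG = 0.3, C and T never observed
print(round(information_content([0.7, 0.3, 0.0, 0.0]), 2))  # 1.12
```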
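Sequence complexity
● Example
GTGTACTCTC CATTTGCGAT
GTGTATTCTC CATTTGCGTT
GTGTTCTCCC CAATTGCTCT
GTGTTTTCTC CATTTGCGGT
Assuming all four nucleotides are equally possible, at a fully conserved position (for example position 1, G in all four sequences):
$H_{expected} = 2$ bits
$H_{observed} = 0$ bits
$I = 2 - 0 = 2$ bits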
Sequence complexity
● Example
GTGTACTCTC CATTTGCGAT
GTGTATTCTC CATTTGCGTT
GTGTTCTCCC CAATTGCTCT
GTGTTTTCTC CATTTGCGGT
Assuming all four nucleotides are equally possible, at a position where two nucleotides each occur in half of the sequences (for example position 5: A, A, T, T):
$H_{expected} = 2$ bits
$H_{observed} = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) - 0 - 0 = 1$ bit
$I = 2 - 1 = 1$ bit
Sequence complexity
● Example
GTGTACTCTC CATTTGCGAT
GTGTATTCTC CATTTGCGTT
GTGTTCTCCC CAATTGCTCT
GTGTTTTCTC CATTTGCGGT
Assuming all four nucleotides are equally possible, at a position where all four nucleotides occur once (for example position 19: A, T, C, G):
$H_{expected} = 2$ bits
$H_{observed} = -0.25 \log_2(0.25) - 0.25 \log_2(0.25) - 0.25 \log_2(0.25) - 0.25 \log_2(0.25) = 2$ bits
$I = 2 - 2 = 0$ bits
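The per-position calculation can be applied to every column of the alignment at once. This is a minimal sketch, assuming the space in each row is only a visual separator and that frequencies are estimated directly from the four sequences; the names `ALIGNMENT` and `column_information` are illustrative.

```python
from collections import Counter
from math import log2

# The four aligned sequences, with the separating space removed
ALIGNMENT = [
    "GTGTACTCTCCATTTGCGAT",
    "GTGTATTCTCCATTTGCGTT",
    "GTGTTCTCCCCAATTGCTCT",
    "GTGTTTTCTCCATTTGCGGT",
]

def column_information(alignment, alphabet_size=4):
    """Return I = H_expected - H_observed (bits) for each alignment column."""
    infos = []
    for column in zip(*alignment):
        n = len(column)
        h_observed = sum((c / n) * log2(n / c) for c in Counter(column).values())
        infos.append(log2(alphabet_size) - h_observed)
    return infos

info = column_information(ALIGNMENT)
for position in (1, 5, 19):  # 1-based positions
    residues = "".join(seq[position - 1] for seq in ALIGNMENT)
    print(position, residues, info[position - 1])
```

With only four sequences the frequency estimates are coarse, so the values are best read as a qualitative ranking of conserved versus variable positions.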
[Figure: Shannon-Weaver H content. Amino acid complexity in Saccharomyces cerevisiae (Nsr1p). Window size: 10 aa]
Sequence window complexity
● Sequence complexity can be investigated using a sliding window analysis
– Shannon-Weaver Index (H/L)
– GC content
● Maximum complexity is expected to be found in the exons
● High GC content is often associated with protein-coding sequences
● High AT content is often associated with non-coding DNA such as introns
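A sliding window analysis of this kind can be sketched as below. The generator name `sliding_windows`, the default window of 10 and the test sequence are assumptions for illustration; the GC fraction only makes sense for nucleotide input, while H/L applies to proteins as well.

```python
from collections import Counter
from math import log2

def sliding_windows(seq, window=10, step=1):
    """Yield (start, H/L in bits, GC fraction) for each full window."""
    for start in range(0, len(seq) - window + 1, step):
        counts = Counter(seq[start:start + window])
        h_per_symbol = sum(
            (c / window) * log2(window / c) for c in counts.values()
        )
        gc_fraction = (counts["G"] + counts["C"]) / window
        yield start, h_per_symbol, gc_fraction

# Illustrative input: a mixed region, a poly-A run, then another mixed region
seq = "ATGTTCTATGGGCCACAAGTCACGAGCT" + "A" * 12 + "ACTTATATATACCGGAGACT"
for start, h, gc in sliding_windows(seq, window=10, step=5):
    print(f"{start:3d}  H/L={h:.2f}  GC={gc:.2f}")
```

Windows that fall inside the poly-A run drop to H/L = 0, which is the signature of a low complexity region in this kind of plot.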
Bit scores
“Raw scores have little meaning without detailed knowledge of the
scoring system used, or more simply its statistical parameters K and
lambda. Unless the scoring system is understood, citing a raw score
alone is like citing a distance without specifying feet, meters, or light
years. By normalizing a raw score using the formula
S' = (λ S - ln(K) ) / ln(2)
one attains a "bit score" S', which has a standard set of units. The
E-value corresponding to a given bit score is simply
E = m n 2^(-S')
Bit scores subsume the statistical essence of the scoring system
employed, so that to calculate significance one needs to know in
addition only the size of the search space.”
From http://www.ncbi.nlm.nih.gov/BLAST/tutorial/#head3
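The two formulas in the quotation translate directly into code. This is a sketch only: the function names are assumptions, and the values of λ, K, m and n in the usage lines are placeholders chosen for illustration, not parameters taken from the source or from any particular scoring matrix.

```python
from math import log

def bit_score(raw_score, lam, k):
    """Normalize a raw score S to a bit score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - log(k)) / log(2)

def e_value(bit, m, n):
    """E = m * n * 2^(-S') for a search space of size m * n."""
    return m * n * 2.0 ** (-bit)

# Placeholder statistical parameters and search-space sizes (assumptions)
s_prime = bit_score(raw_score=100, lam=0.267, k=0.041)
print(f"bit score = {s_prime:.1f}")
print(f"E-value   = {e_value(s_prime, m=250, n=1_000_000):.2e}")
```

Because the bit score already absorbs λ and K, the second function needs only the search-space dimensions, which is the point the quotation makes.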