Low-complexity and Repetitive Regions

OraLee Branch
John Wootton
NCBI
Sequence Composition

DNA Sequences
– What would be the expected number of occurrences of a particular sequence in a genome?
• Size: human genome, $6 \times 10^{9}$ nucleotides considering both strands
• Base frequency: equal
• Sequence length: 20 nucleotides
– Bernoulli model: $\frac{6 \times 10^{9}}{4^{20}} \approx 0.005$
– But: $(GT)_n$ with $n > 10$: about $10^{5}$ occurrences
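The Bernoulli estimate above can be checked with a few lines of Python. This is a minimal sketch, assuming equal base frequencies as on the slide; the function name `expected_occurrences` is made up for illustration.

```python
def expected_occurrences(genome_size, motif_length, alphabet_size=4):
    """Expected count of one specific motif under equal, independent base frequencies."""
    # Each position matches the motif with probability (1/alphabet_size)**motif_length.
    return genome_size / alphabet_size ** motif_length


# Human genome counted on both strands, 20-nucleotide sequence (values from the slide).
estimate = expected_occurrences(6e9, 20)
print(f"{estimate:.4f}")
```

The result is well below one occurrence, which is why an observed count on the order of $10^{5}$ for $(GT)_n$ repeats cannot be explained by the random model.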
Low-complexity Regions

Simple Sequence Regions (SSR)
– MICRO- or MINISATELLITES
– Regions that have significant biases in AA or nucleotide
composition : repeats of simple motifs
– $(GT)_n$, $(AAC)_n$, $(P)_n$, $(NANP)_n$
Low-Complexity Regions/Segments
– Complexity can be measured by Shannon's entropy
• For an amino acid sequence: $-\sum_{i=1}^{20} f_i \ln(f_i)$
– For each composition of a complexity state, there exists
a large number of possible sequences
Low-Complexity Regions

Locally abundant residues may be
– continuous or loosely clustered
– irregular or aperiodic

>25% of the AA in currently sequenced genomes are in LC regions
– non-globular domains  SSR

Examples: myosins, pilins, segments in antigens,
short subsequences of 10-50 residues with
unknown function
– Beta-pleated sheets
– Alpha helices
– Coiled-coils
Detecting Low-Complexity

SEG and PSEG/NSEG algorithms
– Wootton and Federhen
• Methods in Enzymology 266:33 (1996)
• Computers and Chemistry 17:149 (1993)

SEG
– UNIX executable available on NCBI servers
• seg FASTAfile Window TriggerComplexity Extension
• TriggerComplexity = K2(1), Extension = K2(2)
• Longer Window lengths define more sustained
regions, but overlook short biased subsequences
clobber> seg hu.piron.fa 12 2.20 2.50
>gi|730388|sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)

                                  1-49     MANLGCWMLVVFVATWSDLGLCKKRPKPGG
                                           WNTGGSRYPGQGSPGGNRY
   ppqggggwgqphgggwgqphgggwgqphgg 50-86
                          gwgqggg
                                  87-104   THNQWHKPSKPKTSMKHM
          agaaaagavvgglggymlgsams 105-127
                                  128-179  RPLIHFGNDYEDRYYRENMYRYPNQVYYRP
                                           VDQYSNQNNFVHDCVNITIKQH
                   tvttttkgenftet 180-193
                                  194-228  DVKMMERVVEQMCITQYEKESQAYYQRGSS
                                           MVLFS
                 sppvillisflifliv 229-244
                                  245-245  G
clobber> seg hu.piron.fa 12 2.20 2.50 -l
>gi|730388|sp|P40250|PRIO_CERAE(50-86) complexity=1.90 (12/2.20/2.50)
ppqggggwgqphgggwgqphgggwgqphgggwgqggg
>gi|730388|sp|P40250|PRIO_CERAE(105-127) complexity=2.47 (12/2.20/2.50)
agaaaagavvgglggymlgsams
>gi|730388|sp|P40250|PRIO_CERAE(180-193) complexity=2.26 (12/2.20/2.50)
tvttttkgenftet
>gi|730388|sp|P40250|PRIO_CERAE(229-244) complexity=2.50 (12/2.20/2.50)
sppvillisflifliv
SEG on the prion protein with different window lengths
question-based – exploratory tool – optimization step
Detecting Low-Complexity
– Intuitive explanation
• Take a 20-residue long sequence
– (20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
– (11111111111111111111)
– ( 3 3 3 3 3 2 2 2 2 1 1 0 0 0 0 0 0 0 0 0)
– Complexity can be described by Shannon's entropy (K2)
• For an amino acid sequence: $K_2 = -\sum_{i=1}^{N} f_i \ln(f_i)$
– For each composition of a complexity state, there exists a large number of possible sequences (K1)
• $K_1 = \frac{1}{L} \ln\left( \frac{L!}{\prod_{i=1}^{N} n_i!} \right)$

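Both measures can be evaluated directly on a composition vector such as the ones listed above. The following is a minimal sketch that follows the slide's formulas with natural logarithms; the function names `k2` and `k1` are chosen here for illustration, and `math.lgamma` is used only to avoid computing large factorials explicitly.

```python
import math


def k2(composition):
    """Shannon entropy of a composition vector (counts n_i), in nats."""
    L = sum(composition)
    # f_i * ln(1/f_i) is equivalent to -f_i * ln(f_i); zero counts contribute nothing.
    return sum((n / L) * math.log(L / n) for n in composition if n > 0)


def k1(composition):
    """(1/L) * ln( L! / prod(n_i!) ): log-count of sequences sharing this composition."""
    L = sum(composition)
    # lgamma(n + 1) equals ln(n!).
    log_omega = math.lgamma(L + 1) - sum(math.lgamma(n + 1) for n in composition)
    return log_omega / L


homopolymer = [20] + [0] * 19   # one residue type repeated 20 times
all_distinct = [1] * 20         # every residue type used exactly once

for name, comp in [("homopolymer", homopolymer), ("all distinct", all_distinct)]:
    print(name, round(k2(comp), 3), round(k1(comp), 3))
```

The homopolymer gives zero for both measures, while the all-distinct composition gives the maximum for a 20-residue sequence.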
How SEG works

seg FASTAfile Window TriggerComplexity Extension
(TriggerComplexity = K2(1), Extension = K2(2))

Looks within window length: if complexity < K2(1), the window triggers;
the segment is then extended in both directions for as long as complexity stays below K2(2)
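The trigger-and-extend step can be sketched in Python as follows. This is a simplified illustration, not the SEG program itself: it omits SEG's later optimization of each raw segment, the function names are invented here, and entropy is computed in bits (log base 2). The choice of bits is an assumption made because thresholds such as 2.2 and 2.5 only discriminate on that scale for a 12-residue window, whereas the slide's formula writes ln.

```python
import math
from collections import Counter


def entropy_bits(window):
    """Shannon entropy of the residue composition of a window, in bits."""
    L = len(window)
    return sum((n / L) * math.log2(L / n) for n in Counter(window).values())


def low_complexity_segments(seq, window=12, trigger=2.2, extension=2.5):
    """Return half-open (start, end) residue ranges of raw low-complexity segments."""
    if len(seq) < window:
        return []
    ent = [entropy_bits(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    segments = []
    i = 0
    while i < len(ent):
        if ent[i] <= trigger:
            # Extend over neighbouring windows that satisfy the looser threshold.
            lo = i
            while lo > 0 and ent[lo - 1] <= extension:
                lo -= 1
            hi = i
            while hi + 1 < len(ent) and ent[hi + 1] <= extension:
                hi += 1
            start, end = lo, hi + window
            if segments and start <= segments[-1][1]:
                # Merge with the previous segment when the residue ranges touch.
                segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
            else:
                segments.append((start, end))
            i = hi + 1
        else:
            i += 1
    return segments


flank = "ACDEFHIKLMNPQRSTVWY"  # 19 distinct residues, no glycine
print(low_complexity_segments(flank + "G" * 15 + flank))
```

Because whole windows are merged, the raw segment includes some flanking residues on each side of the glycine run; trimming them is the job of the optimization stage that this sketch leaves out.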

Uniform prior probabilities
– A protein sequence database is a heterogeneous statistical mixture, so the initially unknown AA frequencies in low-complexity subsets need have no similarity to the frequencies in the total database
– Unbiased view of low-complexity regions
– Gives equiprobable compositions for any complexity state
How SEG works, continued

How do you correct for the background AA/nuc composition bias?
– After randomly shuffling all the residues, determine the trigger complexity that results in 4% of the database being within low-complexity regions
– Then use this trigger complexity and subtract 4% from the %AA in low-complexity regions
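The calibration described above can be sketched as a small search over candidate trigger values. This is an illustrative approximation under several assumptions: residues are pooled across the whole data set and re-cut into the original sequence lengths, "within low-complexity regions" is approximated by coverage from trigger windows alone (no extension stage), entropy is in bits, and every function name and the candidate grid are invented for this example.

```python
import math
import random
from collections import Counter


def entropy_bits(window):
    """Shannon entropy of a window's residue composition, in bits."""
    L = len(window)
    return sum((n / L) * math.log2(L / n) for n in Counter(window).values())


def shuffle_database(seqs, seed=0):
    """Pool all residues, shuffle them, and re-cut into the original lengths."""
    pool = [r for s in seqs for r in s]
    random.Random(seed).shuffle(pool)
    out, pos = [], 0
    for s in seqs:
        out.append("".join(pool[pos:pos + len(s)]))
        pos += len(s)
    return out


def fraction_low(seqs, trigger, window=12):
    """Fraction of residues covered by at least one window with entropy <= trigger."""
    covered = total = 0
    for s in seqs:
        mask = [False] * len(s)
        for i in range(len(s) - window + 1):
            if entropy_bits(s[i:i + window]) <= trigger:
                mask[i:i + window] = [True] * window
        covered += sum(mask)
        total += len(s)
    return covered / total if total else 0.0


def calibrate_trigger(seqs, target=0.04, candidates=None, window=12, seed=0):
    """Largest candidate trigger whose shuffled-data coverage stays within target."""
    if candidates is None:
        candidates = [round(1.0 + 0.1 * k, 1) for k in range(26)]  # 1.0 .. 3.5
    shuffled = shuffle_database(seqs, seed)
    best = None
    for t in sorted(candidates):
        if fraction_low(shuffled, t, window) <= target:
            best = t
        else:
            break
    return best
```

With a trigger chosen this way, `fraction_low` on the unshuffled sequences minus the target fraction gives a background-corrected estimate in the spirit of the second bullet.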
Detecting Low-complexity with repetitive motif: SSR

PSEG or NSEG
Repetition of residue types or k-grams
Period 3
(n V E n K N n V D n K D n V N n K S n K)
(n m i n m i n m i n m i n m i n m i n m)
(n m E n m N n m D n m D n m N n m S n m)
Sliding window along sequence in single residue
steps
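One way to see why periodicity matters is to compute complexity separately for each phase of a given period. The sketch below is only an illustration of that idea and is not taken from PSEG or NSEG; the function names are invented, entropy is in bits, and the lowercase `n` in the period-3 example is assumed to be the same residue as uppercase `N`.

```python
import math
from collections import Counter


def entropy_bits(residues):
    """Shannon entropy of a residue composition, in bits."""
    L = len(residues)
    return sum((n / L) * math.log2(L / n) for n in Counter(residues).values())


def phase_entropies(seq, period):
    """Entropy of the residues found at each phase (0 .. period-1) of the period."""
    return [entropy_bits(seq[phase::period]) for phase in range(period)]


# Period-3 example from the slide, written in a single case.
example = "NVENKNNVDNKDNVNNKSNK"
for phase, h in enumerate(phase_entropies(example, 3)):
    print(phase, example[phase::3], round(h, 3))
```

The first phase contains a single residue type and so has zero entropy, even though the sequence as a whole looks moderately complex when period is ignored.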
Evolutionary Mechanisms

Evolution of sequences in general
– Evolution rate of $10^{-5}$ to $10^{-9}$
• Base pair substitution ($10^{-9}$)
• Insertion/deletions
• Recombination

In SSR, Low-complexity regions, mutations are in
length – with steps typically +/- one repeat unit
– Evolution rate $10^{-3}$
• Biased nucleotide substitution due to increased recombination
in repetitive regions
• Unequal crossing over (recombination)
• Replication slippage

Alignment of repeats does not imply
relationships/ancestry
Low-Complexity and BLAST searches





Low-complexity regions result in BLAST searches being dominated by low-complexity matches (biased AA/nuc composition)
BLAST added "mask low-complexity" by default
– seg parameters: 12 2.2 2.5
BLAST now also uses a compositional bias filter on the whole database
– Masks compositionally biased regions using seg 10 1.8 2.1
YOU MAY WANT TO TURN THESE OPTIONS OFF and use your own organism-specific seg parameters when doing protein homology searching
YOU WILL NEED TO TURN THESE OPTIONS OFF if you are interested in looking at sequence similarities of repetitive/low-complexity regions.
Example:



Plasmodium falciparum
Using whole genome sequences is
important to limit PCR sequencing bias for
antigens: hydrophilic proteins
Considering GC-content / AA bias
– P. falciparum is approximately 28 % GC
Visualization of individual proteins
A helpful tool here and in general

SEALS: A system for Easy Analysis of Lots of
Sequences, R. Walker and E. Koonin, NCBI

www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/index.html

Demonstrate getting an appropriate data set
– Taxnode2gi, gi2fasta
– Daffy
– Purge
– Gref
– Fanot
Use cleaned data set of P. falciparum proteins

Protein Analysis

Setting the trigger complexity:
– Dbcomp
– Shuffledb
– Seg

Run SEG on P. falciparum MSP1, PfEMP2, Cg2
– Options
• -p (tree form output)
• -l (only report Low-C segs)
• -h (don't report Low-C segs)
• -x (substitute Low-C with x)
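The effect of the -x style of output can be imitated once segment coordinates are known. This is a minimal sketch; `mask_segments` is a made-up helper that takes half-open, 0-based ranges, so 1-based inclusive coordinates as printed by seg (for example 50-86) would be passed as `(49, 86)`.

```python
def mask_segments(seq, segments, mask_char="x"):
    """Replace residues inside each half-open (start, end) range with mask_char."""
    chars = list(seq)
    for start, end in segments:
        # Clamp the range so out-of-bounds coordinates are ignored safely.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


print(mask_segments("MKGGGGGGTR", [(2, 8)]))
```

A masked sequence of this kind is what a similarity search sees when low-complexity filtering is switched on.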
Run PSEG on P. falciparum MSP1, PfEMP2,
Cg2 with different -z (periodicity)
Usefulness of studying Low-Complexity
Within a protein
– secondary structure, homology searches, protein location
– genetic disorders
Within taxa
– microsatellite markers
– polymorphism comparisons between proteins
Between taxa
– synteny, orthologs
– different selection pressures upon different organisms
– parasites: immunogenicity, rapid evolution of antigens, recombination