Low-complexity and Repetitive Regions
OraLee Branch, John Wootton
NCBI
[email protected]

Sequence Composition

DNA Sequences
– What would be the expected number of occurrences of a particular sequence in a genome?
  • Size: human genome, $6 \times 10^9$ nucleotides considering both strands
  • Base frequency: equal
  • Sequence length: 20 nucleotides
– Bernoulli model: $\frac{6 \times 10^9}{4^{20}} \approx 0.005$
– But:
  • $(GT)_n$ with $n > 10$ $= 10^5$ occurrences

Low-complexity Regions

Simple Sequence Regions (SSR)
– MICRO- or MINISATELLITES
– Regions that have significant biases in AA or nucleotide composition: repeats of simple motifs
– $(GT)_n$, $(AAC)_n$, $(P)_n$, $(NANP)_n$
Low-Complexity Regions/Segments
– Complexity can be measured by Shannon's entropy
  • For an amino acid sequence: $-\sum_{i=1}^{20} f_i \ln(f_i)$
– For each composition of a complexity state, there exists a large number of possible sequences

Low-Complexity Regions

Locally abundant residues may be
– continuous or loosely clustered
– irregular or aperiodic
>25% of AA in currently sequenced genomes is in LC regions
– non-globular domains
– SSR
Examples: myosins, pilins, segments in antigens, short subsequences of 10-50 residues with unknown function
– Beta-pleated sheets
– Alpha helices
– Coiled-coils

Detecting Low-Complexity

SEG and PSEG/NSEG algorithms
– Wootton and Federhen
  • Methods in Enzymology 266:33 (1996)
  • Computers and Chemistry 17:149 (1993)
SEG
– UNIX executable available on NCBI servers
  • seg FASTAfile Window TriggerComplexity Extension
                         K2(1)             K2(2)
  • Longer window lengths define more sustained regions, but overlook short biased subsequences

clobber> seg hu.piron.fa 12 2.20 2.50
>gi|730388|sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)

                                1-49      MANLGCWMLVVFVATWSDLGLCKKRPKPGG
                                          WNTGGSRYPGQGSPGGNRY
ppqggggwgqphgggwgqphgggwgqphgg  50-86
                       gwgqggg
                                87-104    THNQWHKPSKPKTSMKHM
       agaaaagavvgglggymlgsams  105-127
                                128-179   RPLIHFGNDYEDRYYRENMYRYPNQVYYRP
                                          VDQYSNQNNFVHDCVNITIKQH
                tvttttkgenftet  180-193
                                194-228   DVKMMERVVEQMCITQYEKESQAYYQRGSS
                                          MVLFS
              sppvillisflifliv  229-244
                                245-245   G

clobber> seg hu.piron.fa 12 2.20 2.50 -l
>gi|730388|sp|P40250|PRIO_CERAE(50-86) complexity=1.90 (12/2.20/2.50)
ppqggggwgqphgggwgqphgggwgqphgggwgqggg
>gi|730388|sp|P40250|PRIO_CERAE(105-127) complexity=2.47 (12/2.20/2.50)
agaaaagavvgglggymlgsams
>gi|730388|sp|P40250|PRIO_CERAE(180-193) complexity=2.26 (12/2.20/2.50)
tvttttkgenftet
>gi|730388|sp|P40250|PRIO_CERAE(229-244) complexity=2.50 (12/2.20/2.50)
sppvillisflifliv

SEG on the prion protein with different window lengths

Question-based
– exploratory tool
– optimization step

Detecting Low-Complexity

– Intuitive explanation
  • Take a 20-residue long sequence; its composition can be, for example:
    – (20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
    – (1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)
    – (3 3 3 3 3 2 2 2 2 1 1 0 0 0 0 0 0 0 0 0)
– Complexity can be described by Shannon's entropy (K2)
  • For an amino acid sequence: $K_2 = -\sum_{i=1}^{N} f_i \ln(f_i)$, where $f_i = n_i / L$ is the fraction of residue type $i$ in a segment of length $L$
– For each composition of a complexity state, there exists a large number of possible sequences (K1)
  • $K_1 = \frac{1}{L} \ln\left( \frac{L!}{\prod_{i=1}^{N} n_i!} \right)$

How SEG works

seg FASTAfile Window TriggerComplexity Extension
                     K2(1)             K2(2)
Looks within the window length: if complexity < K2(1), then extends the region for as long as complexity stays below K2(2)
Uniform prior probabilities
– The protein sequence database is a heterogeneous statistical mixture, such that the initially unknown AA frequencies in low-complexity subsets need have no similarity to frequencies in the total database
– Unbiased view of low-complexity regions
– Gives equiprobable compositions for any complexity state

How SEG works, continued

How do you correct for the background AA/nuc composition bias?
– After randomly shuffling all the residues, determine the trigger complexity that results in 4% of the database being within low-complexity regions
– Then use this trigger complexity and subtract 4% from the %AA in low-complexity regions

Detecting Low-complexity with repetitive motif: SSR

PSEG or NSEG
Repetition of residue types or k-grams
Period 3
  (n V E n K N n V D n K D n V N n K S n K)
  (n m i n m i n m i n m i n m i n m i n m)
  (n m E n m N n m D n m D n m N n m S n m)
Sliding window along sequence in single-residue steps

Evolutionary Mechanisms

Evolution of sequences in general
– Evolution rate of $10^{-5}$ to $10^{-9}$
  • Base pair substitution ($10^{-9}$)
  • Insertion/deletions
  • Recombination
In SSR, low-complexity regions, mutations are in length
– with steps typically +/- one repeat unit
– Evolution rate $10^{-3}$
  • Biased nucleotide substitution due to increased recombination in repetitive regions
  • Unequal crossing over (recombination)
  • Replication slippage
Alignment of repeats does not imply relationships/ancestry

Low-Complexity and BLAST searches

Low-complexity regions result in BLAST searches being dominated by low-complexity regions
– biased AA/nuc composition
BLAST added "mask low-complexity" by default
– Seg parameters: 12 2.2 2.5
BLAST now also uses a compositional bias filter on the whole database
– Masks compositionally biased regions using seg 10 1.8 2.1
YOU MAY WANT TO TURN THESE OPTIONS OFF and use your own organism-specific seg parameters when doing protein homology searching
YOU WILL NEED TO TURN THESE OPTIONS OFF if you are interested in looking at sequence similarities of repetitive/low-complexity regions.

Example: Plasmodium falciparum

Using whole genome sequences is important to limit PCR sequencing bias for antigens: hydrophilic proteins
Considering GC-content / AA bias
– P. falciparum is approximately 28% GC
Visualization of individual proteins
A helpful tool here and in general
SEALS: A System for Easy Analysis of Lots of Sequences, R. Walker and E. Koonin, NCBI
www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/index.html
Demonstrate getting an appropriate data set
– Taxnode2gi, gi2fasta
– Daffy
– Purge
– Gref
– Fanot
Use cleaned data set of P. falciparum proteins

Protein Analysis

Setting the trigger complexity:
– Dbcomp
– Shuffledb
– Seg
Run SEG on P. falciparum MSP1, PfEMP2, Cg2
– Options
  • -p (tree form output)
  • -l (only report Low-C segs)
  • -h (don't report Low-C segs)
  • -x (substitute Low-C with x)
Run PSEG on P. falciparum MSP1, PfEMP2, Cg2 with different -z (periodicity)

Usefulness of studying Low-Complexity

Within a protein
– secondary structure, homology searches, protein location
– genetic disorders
Within taxa
– microsatellite markers
– polymorphism comparisons between proteins
Between taxa
– synteny, orthologs
– different selection pressures upon different organisms
– parasites: immunogenicity, rapid evolution of antigens, recombination
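
Worked example: complexity measures and a window scan

The K2 and K1 definitions from the "Detecting Low-Complexity" slides, and the trigger/extension idea from "How SEG works", can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the seg implementation: the function names k2, k1 and low_complexity_segments are hypothetical, the scan omits any refinement step seg may apply after extension, and the logarithm base for K2 is a parameter. Base 2 is used as the default here because it reproduces the complexity values printed in the seg -l output for PRIO_CERAE shown earlier.

```python
import math
from collections import Counter


def k2(segment, base=2.0):
    """Shannon entropy of the residue composition: -sum(f_i * log(f_i))."""
    length = len(segment)
    counts = Counter(segment.upper())
    return -sum((n / length) * math.log(n / length, base) for n in counts.values())


def k1(segment):
    """(1/L) * ln(L! / prod(n_i!)), via log-gamma to avoid huge factorials."""
    length = len(segment)
    if length == 0:
        return 0.0
    counts = Counter(segment.upper())
    log_multinomial = math.lgamma(length + 1) - sum(
        math.lgamma(n + 1) for n in counts.values()
    )
    return log_multinomial / length


def low_complexity_segments(seq, window=12, trigger=2.2, extension=2.5):
    """Simplified trigger/extension scan (assumption: not seg's exact procedure).

    Returns 0-based, half-open (start, end) residue intervals.
    """
    if len(seq) < window:
        return []
    # K2 of every window, sliding in single-residue steps
    scores = [k2(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    segments = []
    i = 0
    while i < len(scores):
        if scores[i] <= trigger:
            # Extend left and right over windows at or below the extension value
            lo = i
            while lo > 0 and scores[lo - 1] <= extension:
                lo -= 1
            hi = i
            while hi + 1 < len(scores) and scores[hi + 1] <= extension:
                hi += 1
            start, end = lo, hi + window
            if segments and start <= segments[-1][1]:
                # Overlaps the previous segment: merge the two
                segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
            else:
                segments.append((start, end))
            i = hi + 1
        else:
            i += 1
    return segments


if __name__ == "__main__":
    # Low-complexity segments reported by seg for PRIO_CERAE in the slides
    for seg in (
        "ppqggggwgqphgggwgqphgggwgqphgggwgqggg",
        "agaaaagavvgglggymlgsams",
        "tvttttkgenftet",
        "sppvillisflifliv",
    ):
        print(f"{k2(seg):.2f}  {seg}")
```

Run as a script, the loop prints 1.90, 2.47, 2.26 and 2.50 for the four segments, the same complexity values seg reports with parameters 12/2.20/2.50. The segment boundaries returned by low_complexity_segments need not coincide with seg's, since the sketch only merges overlapping windows.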