Download Method and system for computationally identifying clusters within a

US006109776A United States Patent [19] [11] Haas [45] Date of Patent: [54] METHOD AND SYSTEM FOR COMPUTATIONALLY IDENTIFYING CLUSTERS WITHIN A SET OF SEQUENCES 6,109,776 Patent Number: Aug. 29, 2000 Primary Examiner—Kenneth R. Horlick Assistant Examiner—Jeffrey SieW Attorney, Agent, or Firm—Weiss Jensen Ellis & Howard [75] Inventor: Juergen Haas, Gaithersburg, Md. [57] [73] Assignee: Gene Logic, Inc., Gaithersburg, Md. A method and system for computationally analyzing an initial set of patterns in order to identify subsets of patterns, called clusters, that contain common sub-patterns. The pat [21] Appl. No.2 09/063,450 [22] Filed: Apr. 21, 1998 [51] Int. Cl.7 .......................... .. G01N 1/00; G01N 31/00; C12Q 1/68; 6066 7/48 [52] US. Cl. ............................ .. 364/496; 435/6; 364/497; 364/578; 702/127; 702/30 [58] Field of Search ................................... .. 364/496, 497, 364/578; 435/6; 702/127, 30 [56] 5,867,402 ABSTRACT terns of the initial set of patterns are represented as linear sequences of subunits, and the common sub-patterns occur as sub-sequences of subunits Within the linear sequences starting at different positions Within the different linear sequences. Variations in the offset and in the sequence of subunits Within a common sub-pattern are considered in the analysis. In one embodiment, an initial set of oligonucle otide sequences that are produced by various biochemical techniques are computationally analyzed to identify clusters References Cited that may correspond to a number of different binding sites U.S. PATENT DOCUMENTS stranded DNA duplexes. The method places each oligo 2/1999 Schneider et a1. .................... .. 364/496 OTHER PUBLICATIONS Schneider et al NAR vol. 18 No. 20 pp.6097—6100, 1990. Mehldau et al CABIOS vol. 9, No. 3 pp. 299—314, 1993. Pesole et al NAR vol. 20, No. 11 pp. 2871—2875, 1992. Lefevre et al CABIOS vol. 9, No. 3 pp. 349—354, 1993. Staden NAR vol. 12, No. 1 pp. 505—519, 1984. for DNA-binding proteins Within one or more double nucleotide sequence Within a neW cluster and calculates an initial information Weight matrix for that cluster. Then, other sequences from the initial set of sequences are added to the cluster and the information Weight matrix of the cluster is re-computed until the information content of the information Weight matrix falls beloW a threshold value. 41 Claims, 15 Drawing Sheets U.S. Patent Aug. 29,2000 Sheet 2 0f 15 6,109,776 U.S. Patent Aug. 29, 2000 NHZ 415 N/ Sheet 3 0f 15 422 \ KN| N\>A/402 5’ end —» H0002 0 6,109,776 U.S. Patent Aug. 29, 2000 Sheet 4 0f 15 Fig. 5 6,109,776 U.S. Patent Aug. 29,2000 Sheet 6 0f 15 704 Fig. 7A 6,109,776 U.S. Patent Aug. 29, 2000 Sheet 7 0f 15 805 808 KO @[l 81 1 809 L 0 .=. A. A 812 g 804 6,109,776 RNA polymerase II 802? A x ' f . m m U.S. Patent Aug. 29, 2000 Sheet 9 0f 15 6,109,776 I002 1004 1005 > '/ [1032 ,/ 1012 MMAGTCCCCCATMMMM 0 1 2 3 I006 { max-1 171g. 10A 1010 - z0>14 0 1020\_ 1 / 1034 3 2 , . Sequence MClh‘lX (S) 1/1 m°"'/l015 1022A00100000001000007m w24\c0000o11111oo00oo/IW 7026\00001oooooooooooo/lm \T0000100o0001o000/ U.S. Patent Aug. 29,2000 Sheet 10 0f 15 7105 \‘ 6,109,776 fimz Frequency Muirix (F) mox_1 10g. 11A /Il04 lnformu’rion Weigh’r Ma’rrix (I) —IO> ?g. 11B K7207 CATACAATGC‘\1Z02 CCCCCCCCCC CTTGGATAA GTGGGGTAA ACACAATGCG CAGTTCTAGG AAGGAGGCAG ACACGATGCG TCCATGTATT GTGTATGAGC CGCGGATATG AACTATGATC TCATTGTGAG GGATTTAGCT CGTGGGAACT Fig. 12 mox_1 U.S. Patent Aug. 29,2000 Sheet 11 0f 15 C1uster size: 1 1503 / é 6,109,776 7507 Sequence 1: MMM ATACAATGC jg. Frequency Muirix: 0 1 13A 1504 3 (4 2 5 jg. 13B Informo?on weigh’r mo’rrix 0 1 2 3 4 10 11 12 13 14 15 ACGT 0 0 0 0 0 0 0. 0. 04. 0. 0. 0. 0 . 0 .0 . 0. 0. 0. 0. 4.04. 4.04 0. 0. 0 0 0 0 0 0 _2__ .|.. 1.. 2 -2 __2 _2-_ _2 - _2 - F0. 130 ....|. 1.. 2 1.1. 1.. 2 _2__ U.S. Patent Aug. 29,2000 Sheet 12 0f 15 C1uster size: 2 I405 2 Sequence 1: MMMC TACAATGC 6,109,776 lfl402 Sequence 5: MMMMAfACAATGCG ‘\MOI I404 ?g. 14A Frequency Mo’rrix: 0 1 10 12 13 15 ACGTM 01 0 0 1 0 0 1 0 0 05 10 0 0 05 10 0 01 0 10 0 10 0 01 0 01 0 01 0 0 05 0 1 0 0 1 0 Fig. 145’ lnforma’rion weight matrix 0 1 2 3 4 1O 12 13 14 15 ACGT 0 0 0 0 0 0 02 om 1 4.0 DM mn m 0.MtA. UM M O. M HNM MOJM 02 “U0 U0 0 0 _1- .1 -_ 2- -_ 2- -2-_ _2 - -2 _ Fig. 146 _-_2 _-2_ _2-_ -_l_ .1 .3>I U.S. Patent Aug. 29,2000 uS Sheet 13 0f 15 .185 U 8E E0F. 0. SGe CZC e08 ' 1g. 6,109,776 3. .150 M M M CM A TC A C Mm_|e_A|nG0 G0 G 15A Frequency Matrix: 0 1 11 12 13 14 15 ACGTM 01 0 01 0 01 0 0 10 0 0 10 0 0.1 0 0 10 0 01 0 01 0 0.1 0 0 0.63 0 1 0 0.1 0 63 63 36 36 .g. 1515’ lnformu’rion weighi mcl’rrix 0 ACGT 1 2 3 10 11 12 13 14 15 ‘la_2_ 1 0 0 0 0 0 0 0 0.1 5.0 9.1 .04 _1.0 4 2._1 .04 .|_21 .40 4 4 201A.04 M O. 4M0.M 021 .40 .1 58 0 0 0 0 1.1. .1 .1 1.1. .g. 150 __2 __1_ .5.l U.S. Patent Aug. 29, 2000 Sheet 14 0f 15 i findClusiers ) ii receive sef of sequencesS [7602 comprising sequences 50, 51, ---$N+1 ii iniiiulize lisf of found /7604 clusfers i0 NULL if for n=0 io n==N-i [1606 ii“ creqie new ciusfer ihuf /-7603 includes sequence Sn ii inifiui frequency mqfrix /-1510 and informqfion weighf mqfrix ii= findNxfBesi /l6l2 (mums SnxfBe'sf) did findNxfBesf “'4 refurn on SnXfBesf informoiion Cil'id confenf S0, of informuiion weighf mqfrix be >= ihreshoid if new clusier is noi already in ihe lisf of found ciusfers, add new clusier f0 lisf of found clusiers, incremeni n fim odd SnxfBesi f0 clusfer 0nd upduie frequency mufrix and informofion weighf mdfrix 6,109,776 U.S. Patent Aug. 29, 2000 Sheet 15 0f 15 @ Fig. 17 receive sei of sequences 3, curreni fl702 informoiion weighi muirix, curreni clusier i sei highVuiue = O fl 704 sei highSeqNum = —1 i / 17/6 for m=0 io m==N-i fl 706 incremeni m I 708 ii clusier? iniormoiion conieni of Sm > highVolue, considering ihe various possible ulignmenis of ihe sequence Sm and ihe reverse complemeni of ihe sequence Sm? highVolue = highesi informoiion conieni 0f Sm highSequence = m f17l2 6,109,776 6,109,776 1 2 METHOD AND SYSTEM FOR COMPUTATIONALLY IDENTIFYING CLUSTERS WITHIN A SET OF SEQUENCES TECHNICAL FIELD Amino Acid The present invention relates to computational method ologies for identifying common sub-patterns Within a set of patterns and, in particular, to identifying DNA sequences that correspond to protein binding sites by computationally analyZing large sets of relatively short oligonucleotide sequences that represent potential protein binding sites. 10 BACKGROUND OF THE INVENTION The molecular blueprint for a living eukaryotic organism 15 is stored in double-stranded deoxyribose-nucleic acid (“DNA”) molecules Within the nucleus of each cell of the organism. Each double-stranded DNA molecule comprises a large number of templates, called genes, that each speci?es the composition of a protein molecule and a large number of regulatory regions and additional regions for Which a func tionality has not yet been identi?ed. Protein molecules are synthesiZed from the gene templates in a tWo-step process. In the ?rst step, called transcription, the gene is copied to produce a molecule of messenger ribose-nucleic acid 20 Symbol Alanine Ala A Argine Arg R Asparagine Asn N Aspartic acid Asp D Asparagine or aspartic acid Asx B Cysteine Cys C Glutamic acid Glutamine Glutamine or glutamic acid Glu Gln Glx E Q Z Glycine Gly G Histidine Isoleucine Leucine His Ile Leu H I L Lysine Lys K Methionine Met M Phenylalanine Phe F Proline Serine Threonine Pro Ser Thr P S T Tryptophan Tyrosine Trp Tyr W Y Valine Val V 30 amino acid subunit sequence using either the three-letter symbols or the one-letter symbols, listed in Table 1, for the amino acids of the protein, starting from the N-terminal amino acid on the left side and ending With the C-terminal amino acid on the right side. For example, the polypeptide polymer displayed in FIG. 2 can be described either as “ALA-TYR-ASP-GLY” or “AYDG.” Although a protein 35 the protein molecule in solution normally folds into a regions of a double-stranded DNA molecule act as sWitches, genes. Proteins serve as catalysts for the myriad of chemical can be conceptualiZed as a linear sequence of amino acids, actions that occur Within living organisms, as Well as struc trols the development, structure, and dynamic composition of living cells. Both proteins and DNA molecules are long linear poly Symbol A protein can be chemically described by Writing its (“RNA”). In the second step, called translation, a protein tural and mechanical elements from Which living organisms are formed. Thus, the regulation of protein formation via the regulatory regions of double-stranded DNA molecules con One Letter 25 molecule is synthesiZed according to the information con tained in the messenger RNA molecule. The regulatory brakes, and accelerators for controlling the transcription of genes into messenger RNA molecules, thereby controlling the rate of synthesis of the various proteins speci?ed by the Three Letter complex and speci?c three-dimensional shape. FIG. 3 shoWs a representation of the three-dimensional shape of a rela tively small, common protein. 40 DNA molecules, like proteins, are linear polymers. DNA molecules are synthesiZed from only four different types of subunit molecules: (1) deoxy-adenosine, abbreviated “A”; mers synthesiZed from a relatively small number of com (2) deoxy-thymidine, abbreviated “T”; (3); deoxy-cytosine, ponent molecules, or subunits. FIG. 1 shoWs the tWenty amino acid subunits from Which protein molecules are commonly synthesiZed. Each amino acid subunit has an FIG. 4 illustrates a short DNA polymer 400, called an abbreviated “C”; and (4) deoxy-guanosine, abbreviated “G.” 45 ot-carboxyl group (e.g., the ot-carboxyl group 101 of the amino acid lysine 103), an ot-amino group (e.g., the ot amino group 105 of the amino acid lysine 103), and a side chain phosphorylated, these subunits of the DNA molecule are (e.g., the y-amino propyl side chain 107 of the amino acid lysine 103), all attached to an ot-carbon atom (e.g., the ot-carbon 109 of the amino acid lysine 103). FIG. 2 shoWs called nucleotides, and are linked together through phos phodiester bonds 410—415 to form the DNA polymer. The DNA molecule has a 5‘ end 418 and a 3‘ end 420. A DNA a small polypeptide polymer built from four amino acids. The polypeptide polymer 200 has a free ot-amino group 202 at the N-terminal end 204 of the polypeptide polymer 200 and a free ot-carboxyl group 206 at the C-terminal end 208 of the polypeptide polymer 200. The polypeptide polymer 200 is composed from the folloWing amino acids: (1) alanine polymer can be chemically characteriZed by Writing, in sequence from the 5‘ end to the 3‘ end, the single letter abbreviations for the nucleotide subunits that together com 55 shoWn in FIG. 4 can be chemically represented as “ATCG.” 60 and the one-letter symbols corresponding to each of the amino acids: nucleotide 402), and a phosphate group (eg phosphate 426) that links the nucleotide to the next nucleotide in the DNA subunits. The amino acid subunits Within a protein are normally designated by either three-letter symbols or by one-letter symbols. Table 1, beloW, lists both the three-letter symbols pose the DNA polymer. For example, the oligomer 400 A nucleotide comprises a purine or pyrimidine base (e.g. adenine 422 of the deoxy-adenylate nucleotide 402), a deoxy-ribose sugar (e.g. ribose 424 of the deoxy-adenylate 210; (2) tyrosine 212; (3) aspartic acid 214; and (4) glycine 216. Aprotein comprises one or more polypeptide polymers, similar to the polypeptide polymer 200 shoWn in FIG. 2, each generally comprising tens to hundreds of amino acid oligomer, composed of the folloWing subunits: (1) deoxy adenosine 402; (2) deoxy-thymidine 404; (3) deoxy cytosine 406; and (4) deoxy-guanosine 408. When polymer. 65 The DNA polymers that contain the organiZational infor mation for living organisms occur in the nuclei of cells in pairs, called double-stranded DNA helixes. One polymer of the pair is laid out in a 5‘ to 3‘ direction, and the other polymer of the pair is laid out in a 3‘ to 5‘ direction. The tWo 6,109,776 3 4 DNA polymers in the double-stranded DNA helix are there fore described as being anti-parallel. The tWo DNA polymers, or strands, Within a double-stranded DNA helix are bound to each other through hydrogen bonds. Because of a number of chemical and topographic constraints, a deoxy adenylate subunit of one strand must hydrogen bond to a betWeen an amino acid subunit 712 of a DNA-binding protein 714 and a nucleotide subunit 716 of a DNA double helix 718 vieWed doWn the central axis of the DNA double helix. deoxy-thymidylate subunit of the other strand, and a deoxy FIG. 8 illustrates the spatial relationship betWeen a gene and various regulatory regions of a DNA double helix that control transcription of the gene. The gene 802 is generally guanylate subunit of one strand must hydrogen bond to a preceded by a promoter region 804 Where various molecular deoxy-cytidylate subunit of the other strand. FIG. 5 illustrates the hydrogen bonding that joins tWo components 806 are assembled in order to catalyZe the 10 addition, various regulatory DNA-binding proteins or assemblies of regulatory DNA-binding proteins 808—810 anti-parallel DNA strands. The ?rst strand 502 occurs in the 5‘ to 3‘ direction and contains a deoxy-adenylate subunit 504 and a deoxy-guanylate subunit 506. The second, anti parallel strand 508 contains a deoxy-thymidylate subunit 510 and a deoxy-cytidylate subunit 512. The deoxy speci?cally bind to a number of regulatory regions of the DNA double helix 811—813 that are located at various 15 adenylate subunit 504 is joined to the deoxy-thymidylate subunit 510 through hydrogen bonds 514 and 516. The deoxy-guanylate subunit 506 is joined to the deoxy cytidylate subunit 512 through hydrogen bonds 518—522. The tWo DNA strands linked together by hydrogen bonds of gene transcription or decrease the rate of gene 20 transcription, thus controlling the concentration of the pro tein speci?ed by the gene Within the cell. Each type of regulatory DNA-binding protein recogniZes and binds to a speci?c sequence, or pattern, of base pairs Within the regu latory region. These sequences, called binding sites, are generally less than tWenty nucleotides in length. 25 organism largely depends on the regulation of gene tran DNA helix. FIG. 6A illustrates a short section of a DNA double helix 600 comprising a ?rst strand 602 and a second, anti-parallel strand 604. A deoxy-guanylate subunit in one The molecular state of a cell and of an entire living scription by thousands of different regulatory DNA-binding proteins. Only one or several molecules of each different type of regulatory protein may occur in a cell at any given time. A cell thus contains a very complex mixture of subunit in the other strand 612. FIG. 6B shoWs a represen tation of the tWo strands illustrated in FIG. 6A using the single-letter designations for the nucleotide subunits. The ?rst strand 614 (602 in FIG. 6A) is Written in the familiar 5‘ to 3‘ direction, and the second strand 616 (604 in FIG. 6A) 30 regulatory DNA-binding proteins, and each regulatory DNA-binding protein may occur in the mixture at extremely small concentrations. Aberrations in the structures of certain regulatory DNA-binding proteins, or in the concentrations of certain regulatory DNA-binding proteins Within cell is Written in the 3‘ to 5‘ direction in order to clearly shoW the subunit pairings betWeen the tWo strands. These pairings are called base pairs because the hydrogen bonding occurs betWeen the purine and pyrimidine bases of the nucleotide distances along the DNA double helix from the gene 802. In general, the regulatory proteins may either increase the rate form the familiar helix structure of the double-stranded strand 606 is alWays paired With a deoxy-cytidylate subunit 608 in the other strand, and a deoxy-thymidylate subunit in one strand 610 is alWays paired With a deoxy-adenylate synthesis of messenger RNA from the gene template. In 35 nuclei, may underlie many different diseases and disorders, including developmental problems, inherited genetic subunits. Nucleotide subunits are often referred to as bases. disorders, and cancers. It is therefore a goal of biological There is a “C” (e.g., 618) in the second strand directly opposite from each “G” (e.g., 620) in the ?rst strand, an “A” sciences and of the biotechnology industry to identify and characteriZe the many different types of regulatory DNA (e.g., 622) in the second strand directly opposite from each “T” (e.g., 624) in the ?rst strand, a “T” (e.g., 626) in the second strand directly opposite from each “A” (e.g., 628) in 40 binding proteins. There are a number of different approaches to identifying regulatory DNA-binding proteins. One such approach is the ?rst strand, and a “G” (e.g., 630) in the second strand called the multiplex selection technique, or “MuSTTM.” The directly opposite from each “C” (e. g., 632) in the ?rst strand. MuST technique is described in the folloWing patent applications, Which are hereby incorporated by reference in their entirety: US. patent application Ser. No. 08/590,571, ?led Jan. 24, 1996, PCT application Serial No. PCT/ US97101230, ?led Jan. 24, 1997, and Us. application Ser. Thus, knoWing the sequence for the ?rst strand, one can immediately determine and Write doWn the sequence for the second strand. DNA base-pair sequences are alWays Written in the 5‘ to 3‘ direction. The second strand 634 is shoWn properly Written in the 5‘ to 3‘ direction as the last sequence in FIG. 6B. When Written in this fashion, the second strand is said to be the reverse complement of the ?rst strand. Thus, 45 No. 08/906,691 ?led Aug. 6, 1997. In this method, a very large number of relatively short oligonucleotide DNA duplexes having random sequences are prepared and mixed together With a sample that contains various DNA-binding the “G” 636 on the left or 5‘ end of the second strand 634 is proteins. The random-sequence oligonucleotide duplexes paired in the DNA double helix 600 With the “C” 638 at the right or 3‘ end of the ?rst strand 614. As described above, the synthesis of proteins from gene generally have lengths of betWeen eight and tWelve base 55 pairs. After the random-sequence oligonucleotide duplexes 60 are mixed With the DNA-binding proteins, the DNA-binding proteins bind to speci?c oligonucleotide duplexes that con tain base-pair sequences that the DNA-binding proteins recogniZe; or, in other Words, a particular type of DNA binding protein binds to those oligonucleotide duplexes that templates is controlled through regulatory regions of DNA molecules. A large number of different types of DNA binding proteins bind to these regulatory regions of DNA molecules and, by so doing, initiate, promote, inhibit, or prevent the synthesis of one or more speci?c genes. FIG. 7A illustrates the binding of a dimeric, or tWo-polymer DNA contain base-pair sequences identical or similar to the base binding protein 702 to a speci?c regulatory region 704 of a double-stranded DNA helix 706. In general, a number of amino acid subunits of a DNA-binding protein hydrogen bond to nucleotide subunits of the DNA molecule to affect the binding of the DNA-binding protein to the DNA double helix. FIG. 7B illustrates tWo hydrogen bonds 708 and 710 65 pair sequence of the binding site Within the regulatory region of the DNA double helix controlled by that DNA-binding protein. Various biochemical separation techniques are employed to separate the DNA-binding proteins bound to the oligonucleotide duplexes from unbound proteins, unbound oligonucleotide duplexes, and other molecules 6,109,776 5 6 Within the mixture. The bound DNA-binding protein/ oligonucleotide duplex pairs are then separated, the sepa ?rst cluster 904 are indicated by box 906. Note that the common sub-sequence occurs toWards the end of sequence rated oligonucleotide duplexes are ampli?ed by the poly merase chain reaction (“PCR”) technique and, ?nally, the common sub-sequence are missing. It should also be noted 13 (908 in FIG. 9) in Which the ?nal tWo nucleotides of the tWo strands of the oligonucleotide duplexes are separated and identi?ed by sequence analysis. The result of the analy that, in some sequences, one or more nucleotides of the sis is a list of nucleotide sequences of single strands of the For example, sequence 18 (910 in FIG. 9) contains an initial “C” 912 rather than a “G.” Sequence 18 (910 in FIG. 9) is shifted three positions to the right relative to sequence 17 (914 in FIG. 9) and is shifted four places to the right relative to sequence 19 (916 in FIG. 9) in order that the common sub-sequence of sequence 18 aligns With the common sub sequences of sequences 17 and 19. Sequence 22 (918 in FIG. 9) in the original set of sequences 902 does not initially common sub-sequence have been substituted With another. oligonucleotide duplexes that Were bound by DNA-binding proteins in the mixture. DNA-binding proteins have varying speci?cities for base pair sequences. Each different type of DNA-binding protein generally recogniZes and binds to a particular binding site 10 Within a particular regulatory region of a DNA double helix. The binding site comprises a speci?c sequence of base pairs Within the DNA double helix. HoWever, a particular DNA 15 appear to have a portion in common With any of the other binding protein may recogniZe and bind to any number of sequences. HoWever, the reverse complement of sequence sequences similar to the sequence of the binding site Which 22 (918 in FIG. 9) is identical With sequence 1 (920 in FIG. 9) and is therefore included, along With sequence 1, in the the DNA-binding protein normally recogniZes and to Which the DNA-binding protein binds. Base-pair sequence analysis is conducted on single strands of DNA rather than on DNA 20 duplexes. A DNA-binding site for a particular DNA-binding protein Will be therefore characteriZed, folloWing an analysis of oligonucleotide sequences produced by the MuST 904—908 (e.g., line 924) shoW a mapping from the original MuST sequences to the ?ve clusters. It is this mapping betWeen oligonucleotide sequences and clusters, including technique, by a set of similar sequences corresponding to one strand of the duplex regions bound by the DNA-binding ?rst cluster 904. The lines betWeen the sequences in the set of MuST sequences 902 and the sequences Within clusters the alignments and reverse complementation required to 25 match the common sub-sequences Within the sequences of a protein and by a set of similar sequences corresponding to cluster, that is the goal of the computational technique of the the other strand of the duplex regions bound by the DNA described embodiment of the present invention. binding protein. Because the tWo sets of sequences are related by reverse complementation, the original tWo sets are merged into a single set of sequences by applying reverse complementation to the sequences in one of the original tWo original set of MuST sequences 902 represents a potential sets. Because the oligonucleotide duplexes employed in the MuST technique are randomly generated, the ?rst base pair of the sequence recogniZed by a DNA-binding protein may not correspond to the ?rst base pair of the oligonucleotide Each of the clusters 904—908 that are identi?ed from the 30 a cluster may be related to the concentration in the original cell extract mixture of the DNA-binding protein that recog niZes the common sequence Within that cluster. Clusters With one or a feW sequences, such as cluster 2 (905 in FIG. 35 duplex, but may occur at many different positions Within the DNA-binding protein binds, or may possibly represent an artifact arising from experimental methodologies. 40 particular binding site for a particular DNA-binding protein Will be characteriZed Within the set of sequences produced sequences that contain sub-sequences identical or similar to puri?ed from complex mixtures by various biochemical 45 techniques. The sequence of amino acids that together compose the one or more polymers of the DNA-binding protein can be determined from the puri?ed protein by biochemical protein sequence analysis techniques. Once the sequence for a DNA-binding protein has been determined, that sequence can be compared to data bases of knoWn identi?ed by the MuST technique. As commonly applied to cell extracts containing DNA-binding proteins, the MuST technique may produce a set of many thousands of sequences. FIG. 9 is intended to illustrate the general concept of MuST sequence analysis rather than provide an Once a binding site has been identi?ed by analysis of the MuST sequences, that binding site can be compared to data bases of knoWn binding sites to determine Whether the binding site has been previously characteriZed. The DNA binding proteins that bind to a particular binding site can be by the MuST technique by a set of oligonucleotide sub-sequences of the binding site sequence greater than or equal in length to some minimum number of nucleotides. FIG. 9 illustrates the characteriZation of various clusters representing potential DNA-binding sites from a set of sequences produced by the MuST technique. A set of 21 sequences 902 represents the oligonucleotide sequences 9) and cluster 4 (907, in FIG. 9) may represent a binding site to Which an extremely rare or loW-concentration regulatory oligonucleotide duplex. Generally, a DNA-binding protein may bind to some minimum number of base pairs that compose a sub-sequence of the sequence of the binding-site. Because the MuST oligonucleotide sequences are random, a DNA-protein binding site. The number of sequences Within protein sequences or can serve as the basis for the identi? cation of the gene or genes Within an organism’s DNA molecules that serve as a template for the synthesis of that 55 actual example. DNA-binding protein. These various characteriZations of the DNA-binding protein may eventually lead to the identi?ca tion of diseases associated With aberrations in the structure of the protein or in the control of the expression of the gene Examination of the set of MuST sequences 902 does not immediately reveal a pattern of related sequences. HoWever, that is the template for the DNA binding protein. These as a result of an exhaustive comparison of each sequence in the set of sequences 902 to the other sequences in the set of various characteriZations may also lead to various amelio rative therapies that can be employed to treat such diseases. sequences 902 by shifting the sequences relative to one another, and identifying common sub-sequences, ?ve clus 60 SUMMARY OF THE INVENTION An embodiment of the present invention provides a ters of related sequences 904—908 can be identi?ed. Each method and system for computationally analyZing an initial sequence of the ?rst cluster of sequences 904 contains a common seven-base-pair sub-sequence “GTTTACC” or 65 set of oligonucleotide sequences in order to identify groups some very similar variation of that sub-sequence. These or subsets of sequences that contain common sub-sequences. common sub-sequences Within each of the sequences of the The initial set of oligonucleotide sequences may be pro 6,109,776 8 7 duced by various biochemical techniques and may corre FIG. 13C shoWs an information Weight matrix calculated from the values in the frequency matrix of FIG. 13B. spond to a number of different DNA-binding sites Within one or more double-stranded DNA duplexes. The common sub FIG. 14A shoWs a cluster having tWo sequences. sequences may be offset from each other Within the initial FIG. 14B shoWs a frequency matrix calculated from the sequences, requiring the sequences of the initial set of 5 cluster of FIG. 14A. oligonucleotide sequences to be aligned in order to identify FIG. 14C shoWs an information Weight matrix calculated the common sub-sequences. Reverse complementation may from the values of the frequency matrix shoWn in FIG. 14B. also be applied to a sequence in order to reveal the common FIG. 15A shoWs a cluster having three sequences. sub-sequence that is contained Within the sequence. FIG. 15B shoWs a frequency matrix calculated from the In one embodiment of the present invention, the common 10 cluster of FIG. 15A. sub-sequence Within a subset of sequences, or cluster, that is identi?ed as a potential binding site is modeled by a numeri FIG. 15C shoWs an information Weight matrix calculated cal construct called an information Weight matrix. Each from the values in the frequency matrix of FIG. 15B. sequence in the initial set of sequences is separately ana FIG. 16 is a ?oW-control diagram for the routine “?nd lyZed With respect to all other sequences Within the initial set 15 Clusters” that implements one embodiment of the present of sequences. A sequence to be analyZed is placed Within a invention. neW cluster and an initial information Weight matrix is FIG. 17 is a ?oW-control diagram for the routine “?nd calculated for that cluster. Then, other sequences from the NxtBest.” initial set of sequences are added to the cluster and the information Weight matrix of the cluster is re-computed for DETAILED DESCRIPTION OF THE INVENTION the cluster until the information content of the information Weight matrix falls beloW a threshold value. The next sequence chosen for addition to the cluster is a sequence that is not already included in the cluster and that has the highest information content With respect to the information Weight matrix calculated for the cluster. 25 invention is directed, for example, to identifying potential DNA-binding sites by analyZing a set of oligonucleotide sequences and organiZing the set of oligonucleotide BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shoWs tWenty amino acid subunits from Which sequences into subsets of oligonucleotide sequences, called clusters, that contain similar sequences. Each cluster may correspond to a DNA-binding site. The method of this protein molecules are commonly synthesiZed. FIG. 2 shoWs a small polypeptide polymer built from four amino acids. FIG. 3 shoWs a representation of the three-dimensional shape of one type of protein. FIG. 4 illustrates a short DNA oligomer. embodiment separately analyZes each sequence selected from the initial set of sequences. The selected sequence is 35 considered to be the ?rst member of a neW cluster. The cluster is modeled by an abstract numerical construct called FIG. 5 illustrates the hydrogen bonding that joins tWo anti-parallel DNA strands. an information Weight matrix. The method successively chooses additional sequences from the initial set of sequences to add to the cluster. At any point in the analysis of a given cluster, the next sequence chosen to be added to FIG. 6A illustrates a short section of a DNA double helix. FIG. 6B shoWs a representation of the tWo DNA strands illustrated in FIG. 6A using single-letter designations for the nucleotide subunits. FIG. 7A illustrates the binding of a DNA-binding protein to a speci?c regulatory region of a double-stranded DNA helix. FIG. 7B illustrates tWo hydrogen bonds betWeen an amino acid subunit of a DNA-binding protein and a nucleotide 45 subunit of a DNA double helix. FIG. 8 illustrates the spatial relationship betWeen a gene and various regulatory regions of a DNA double helix that control transcription of the gene. FIG. 9 illustrates the characteriZation of clusters repre senting potential various DNA-binding sites from a set of sequences produced by the MuST technique. FIG. 10A shoWs the representation of the oligonucleotide sequence “AGTCCCCCAT” Within a character array. FIG. 10B shoWs a sequence matrix representing the oligonucleotide sequence “AGTCCCCCAT” of FIG. 10A. FIGS. 11A & 11B shoWs a frequency matrix and an information Weight matrix. Embodiments of the present invention provide a method and system for analyZing a set of linear sequences in order to identify subsets of the set of linear sequences that share common sub-sequences. One embodiment of the present the cluster is the sequence that has the highest information content With respect to the current information Weight matrix that describes the cluster. Sequences are successively added to the cluster until the information content of the information Weight matrix that describes the cluster falls beloW a particular threshold value. At that point, the cluster is complete. The analysis then continues With a different sequence selected from the set of initial sequences. The result of the analysis of this embodiment of the present invention is a set of clusters. 55 While the embodiment described beloW is directed toWard identifying potential DNA-binding sites from a set of oli gonucleotide sequences, the method and system of the present invention may be employed, in other embodiments, to analyZe different types of sequences. For example, the sequences of amino acid subunits Within protein polymers might be analyZed by an embodiment of the present inven tion in order to identify common or conserved amino acid subunit sub-sequences Within the polymers that correspond to common structural features Within a family of proteins. As another example, the Words of a language, represented as FIG. 12 shoWs an initial list of sequences obtained from sequences of letters, might be analyZed by a different a biochemical technique, such as the MuST technique. embodiment of the present invention in order to identify FIG. 13A shoWs a ?rst cluster identi?ed from the common root Words from Which families of Words have sequences of FIG. 12. 65 been derived. FIG. 13B shoWs a frequency matrix calculated from the Each oligonucleotide sequence from the initial set of ?rst cluster of FIG. 13A. sequences produced by the MuST technique, or by a similar

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Method and system for computationally identifying clusters within a