Download Method and system for computationally identifying clusters within a

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Restriction enzyme wikipedia , lookup

Protein wikipedia , lookup

Western blot wikipedia , lookup

Nucleosome wikipedia , lookup

Metalloprotein wikipedia , lookup

Zinc finger nuclease wikipedia , lookup

Molecular cloning wikipedia , lookup

DNA supercoil wikipedia , lookup

Bisulfite sequencing wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Gene wikipedia , lookup

Community fingerprinting wikipedia , lookup

Genetic code wikipedia , lookup

Promoter (genetics) wikipedia , lookup

Biochemistry wikipedia , lookup

Gene expression wikipedia , lookup

Transcriptional regulation wikipedia , lookup

Proteolysis wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Endogenous retrovirus wikipedia , lookup

Protein structure prediction wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Biosynthesis wikipedia , lookup

Non-coding DNA wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Point mutation wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Transcript
US006109776A
United States Patent [19]
[11]
Haas
[45] Date of Patent:
[54]
METHOD AND SYSTEM FOR
COMPUTATIONALLY IDENTIFYING
CLUSTERS WITHIN A SET OF SEQUENCES
6,109,776
Patent Number:
Aug. 29, 2000
Primary Examiner—Kenneth R. Horlick
Assistant Examiner—Jeffrey SieW
Attorney, Agent, or Firm—Weiss Jensen Ellis & Howard
[75] Inventor: Juergen Haas, Gaithersburg, Md.
[57]
[73] Assignee: Gene Logic, Inc., Gaithersburg, Md.
A method and system for computationally analyzing an
initial set of patterns in order to identify subsets of patterns,
called clusters, that contain common sub-patterns. The pat
[21] Appl. No.2 09/063,450
[22] Filed:
Apr. 21, 1998
[51]
Int. Cl.7 .......................... .. G01N 1/00; G01N 31/00;
C12Q 1/68; 6066 7/48
[52]
US. Cl. ............................ .. 364/496; 435/6; 364/497;
364/578; 702/127; 702/30
[58]
Field of Search ................................... .. 364/496, 497,
364/578; 435/6; 702/127, 30
[56]
5,867,402
ABSTRACT
terns of the initial set of patterns are represented as linear
sequences of subunits, and the common sub-patterns occur
as sub-sequences of subunits Within the linear sequences
starting at different positions Within the different linear
sequences. Variations in the offset and in the sequence of
subunits Within a common sub-pattern are considered in the
analysis. In one embodiment, an initial set of oligonucle
otide sequences that are produced by various biochemical
techniques are computationally analyzed to identify clusters
References Cited
that may correspond to a number of different binding sites
U.S. PATENT DOCUMENTS
stranded DNA duplexes. The method places each oligo
2/1999
Schneider et a1. .................... .. 364/496
OTHER PUBLICATIONS
Schneider et al NAR vol. 18 No. 20 pp.6097—6100, 1990.
Mehldau et al CABIOS vol. 9, No. 3 pp. 299—314, 1993.
Pesole et al NAR vol. 20, No. 11 pp. 2871—2875, 1992.
Lefevre et al CABIOS vol. 9, No. 3 pp. 349—354, 1993.
Staden NAR vol. 12, No. 1 pp. 505—519, 1984.
for DNA-binding proteins Within one or more double
nucleotide sequence Within a neW cluster and calculates an
initial information Weight matrix for that cluster. Then, other
sequences from the initial set of sequences are added to the
cluster and the information Weight matrix of the cluster is
re-computed until the information content of the information
Weight matrix falls beloW a threshold value.
41 Claims, 15 Drawing Sheets
U.S. Patent
Aug. 29,2000
Sheet 2 0f 15
6,109,776
U.S. Patent
Aug. 29, 2000
NHZ
415
N/
Sheet 3 0f 15
422
\ KN| N\>A/402
5’ end —» H0002
0
6,109,776
U.S. Patent
Aug. 29, 2000
Sheet 4 0f 15
Fig. 5
6,109,776
U.S. Patent
Aug. 29,2000
Sheet 6 0f 15
704
Fig. 7A
6,109,776
U.S. Patent
Aug. 29, 2000
Sheet 7 0f 15
805
808
KO
@[l
81 1
809
L 0 .=.
A.
A
812
g
804
6,109,776
RNA polymerase II
802?
A
x
'
f
.
m
m
U.S. Patent
Aug. 29, 2000
Sheet 9 0f 15
6,109,776
I002
1004
1005
>
'/ [1032
,/
1012
MMAGTCCCCCATMMMM
0
1
2
3
I006 {
max-1
171g. 10A
1010
-
z0>14
0
1020\_
1
/
1034
3
2
,
.
Sequence MClh‘lX (S)
1/1
m°"'/l015
1022A00100000001000007m
w24\c0000o11111oo00oo/IW
7026\00001oooooooooooo/lm
\T0000100o0001o000/
U.S. Patent
Aug. 29,2000
Sheet 10 0f 15
7105
\‘
6,109,776
fimz
Frequency Muirix (F)
mox_1
10g. 11A
/Il04
lnformu’rion Weigh’r Ma’rrix (I)
—IO>
?g. 11B
K7207
CATACAATGC‘\1Z02
CCCCCCCCCC
CTTGGATAA
GTGGGGTAA
ACACAATGCG
CAGTTCTAGG
AAGGAGGCAG
ACACGATGCG
TCCATGTATT
GTGTATGAGC
CGCGGATATG
AACTATGATC
TCATTGTGAG
GGATTTAGCT
CGTGGGAACT
Fig. 12
mox_1
U.S. Patent
Aug. 29,2000
Sheet 11 0f 15
C1uster size: 1
1503
/
é
6,109,776
7507
Sequence 1: MMM ATACAATGC
jg.
Frequency Muirix:
0
1
13A
1504
3 (4
2
5
jg. 13B
Informo?on weigh’r mo’rrix
0
1
2
3
4
10
11
12
13
14
15
ACGT 0 0 0 0 0 0 0. 0. 04. 0. 0. 0. 0 . 0 .0 . 0. 0. 0. 0. 4.04. 4.04 0. 0. 0 0 0 0 0 0
_2__
.|..
1..
2
-2
__2
_2-_
_2
-
_2
-
F0. 130
....|.
1..
2
1.1.
1..
2
_2__
U.S. Patent
Aug. 29,2000
Sheet 12 0f 15
C1uster size: 2
I405
2
Sequence 1: MMMC TACAATGC
6,109,776
lfl402
Sequence 5: MMMMAfACAATGCG ‘\MOI
I404
?g. 14A
Frequency Mo’rrix:
0
1
10
12
13
15
ACGTM 01 0 0 1 0 0 1 0 0 05 10 0 0 05 10 0 01 0 10 0 10 0 01 0 01 0 01 0 0 05 0 1 0 0 1 0
Fig. 145’
lnforma’rion weight matrix
0
1
2
3
4
1O
12
13
14
15
ACGT 0 0 0 0 0 0 02 om 1 4.0 DM mn m 0.MtA. UM M O. M HNM MOJM 02 “U0 U0 0 0
_1- .1
-_
2-
-_
2-
-2-_
_2
-
-2
_
Fig. 146
_-_2
_-2_
_2-_
-_l_ .1
.3>I
U.S. Patent
Aug. 29,2000
uS
Sheet 13 0f 15
.185 U
8E E0F. 0. SGe CZC e08
'
1g.
6,109,776
3.
.150 M M M CM A TC A C Mm_|e_A|nG0 G0 G
15A
Frequency Matrix:
0
1
11
12
13
14
15
ACGTM 01 0 01 0 01 0 0 10 0 0 10 0 0.1 0 0 10 0 01 0 01 0 0.1 0 0 0.63 0 1 0 0.1 0
63
63
36
36
.g. 1515’
lnformu’rion weighi mcl’rrix
0
ACGT
1
2
3
10
11
12
13
14
15
‘la_2_
1
0
0 0 0 0 0 0 0.1 5.0 9.1 .04 _1.0 4 2._1 .04 .|_21 .40 4 4 201A.04 M O. 4M0.M 021 .40 .1 58 0 0 0 0
1.1.
.1
.1
1.1.
.g. 150
__2
__1_
.5.l
U.S. Patent
Aug. 29, 2000
Sheet 14 0f 15
i findClusiers )
ii
receive sef of
sequencesS
[7602
comprising sequences
50, 51, ---$N+1
ii
iniiiulize lisf of found /7604
clusfers i0 NULL
if
for n=0 io n==N-i
[1606
ii“
creqie new ciusfer ihuf /-7603
includes sequence Sn
ii
inifiui frequency mqfrix /-1510
and informqfion weighf
mqfrix
ii=
findNxfBesi
/l6l2
(mums SnxfBe'sf)
did
findNxfBesf
“'4
refurn on
SnXfBesf
informoiion
Cil'id confenf
S0, of
informuiion weighf mqfrix
be >= ihreshoid
if new clusier is noi
already in ihe lisf of
found ciusfers, add
new clusier f0 lisf of
found clusiers,
incremeni n
fim
odd SnxfBesi f0 clusfer
0nd upduie frequency
mufrix and informofion
weighf mdfrix
6,109,776
U.S. Patent
Aug. 29, 2000
Sheet 15 0f 15
@
Fig. 17
receive sei of
sequences 3, curreni fl702
informoiion weighi
muirix, curreni clusier
i
sei highVuiue = O fl 704
sei highSeqNum = —1
i
/ 17/6
for m=0 io m==N-i
fl 706
incremeni m
I 708
ii
clusier?
iniormoiion
conieni of Sm >
highVolue, considering ihe
various possible ulignmenis of
ihe sequence Sm and ihe
reverse complemeni
of ihe sequence
Sm?
highVolue = highesi
informoiion conieni
0f Sm
highSequence = m
f17l2
6,109,776
6,109,776
1
2
METHOD AND SYSTEM FOR
COMPUTATIONALLY IDENTIFYING
CLUSTERS WITHIN A SET OF SEQUENCES
TECHNICAL FIELD
Amino Acid
The present invention relates to computational method
ologies for identifying common sub-patterns Within a set of
patterns and, in particular, to identifying DNA sequences
that correspond to protein binding sites by computationally
analyZing large sets of relatively short oligonucleotide
sequences that represent potential protein binding sites.
10
BACKGROUND OF THE INVENTION
The molecular blueprint for a living eukaryotic organism
15
is stored in double-stranded deoxyribose-nucleic acid
(“DNA”) molecules Within the nucleus of each cell of the
organism. Each double-stranded DNA molecule comprises a
large number of templates, called genes, that each speci?es
the composition of a protein molecule and a large number of
regulatory regions and additional regions for Which a func
tionality has not yet been identi?ed. Protein molecules are
synthesiZed from the gene templates in a tWo-step process.
In the ?rst step, called transcription, the gene is copied to
produce a molecule of messenger ribose-nucleic acid
20
Symbol
Alanine
Ala
A
Argine
Arg
R
Asparagine
Asn
N
Aspartic acid
Asp
D
Asparagine or aspartic acid
Asx
B
Cysteine
Cys
C
Glutamic acid
Glutamine
Glutamine or glutamic acid
Glu
Gln
Glx
E
Q
Z
Glycine
Gly
G
Histidine
Isoleucine
Leucine
His
Ile
Leu
H
I
L
Lysine
Lys
K
Methionine
Met
M
Phenylalanine
Phe
F
Proline
Serine
Threonine
Pro
Ser
Thr
P
S
T
Tryptophan
Tyrosine
Trp
Tyr
W
Y
Valine
Val
V
30
amino acid subunit sequence using either the three-letter
symbols or the one-letter symbols, listed in Table 1, for the
amino acids of the protein, starting from the N-terminal
amino acid on the left side and ending With the C-terminal
amino acid on the right side. For example, the polypeptide
polymer displayed in FIG. 2 can be described either as
“ALA-TYR-ASP-GLY” or “AYDG.” Although a protein
35
the protein molecule in solution normally folds into a
regions of a double-stranded DNA molecule act as sWitches,
genes. Proteins serve as catalysts for the myriad of chemical
can be conceptualiZed as a linear sequence of amino acids,
actions that occur Within living organisms, as Well as struc
trols the development, structure, and dynamic composition
of living cells.
Both proteins and DNA molecules are long linear poly
Symbol
A protein can be chemically described by Writing its
(“RNA”). In the second step, called translation, a protein
tural and mechanical elements from Which living organisms
are formed. Thus, the regulation of protein formation via the
regulatory regions of double-stranded DNA molecules con
One
Letter
25
molecule is synthesiZed according to the information con
tained in the messenger RNA molecule. The regulatory
brakes, and accelerators for controlling the transcription of
genes into messenger RNA molecules, thereby controlling
the rate of synthesis of the various proteins speci?ed by the
Three
Letter
complex and speci?c three-dimensional shape. FIG. 3 shoWs
a representation of the three-dimensional shape of a rela
tively small, common protein.
40
DNA molecules, like proteins, are linear polymers. DNA
molecules are synthesiZed from only four different types of
subunit molecules: (1) deoxy-adenosine, abbreviated “A”;
mers synthesiZed from a relatively small number of com
(2) deoxy-thymidine, abbreviated “T”; (3); deoxy-cytosine,
ponent molecules, or subunits. FIG. 1 shoWs the tWenty
amino acid subunits from Which protein molecules are
commonly synthesiZed. Each amino acid subunit has an
FIG. 4 illustrates a short DNA polymer 400, called an
abbreviated “C”; and (4) deoxy-guanosine, abbreviated “G.”
45
ot-carboxyl group (e.g., the ot-carboxyl group 101 of the
amino acid lysine 103), an ot-amino group (e.g., the ot amino
group 105 of the amino acid lysine 103), and a side chain
phosphorylated, these subunits of the DNA molecule are
(e.g., the y-amino propyl side chain 107 of the amino acid
lysine 103), all attached to an ot-carbon atom (e.g., the
ot-carbon 109 of the amino acid lysine 103). FIG. 2 shoWs
called nucleotides, and are linked together through phos
phodiester bonds 410—415 to form the DNA polymer. The
DNA molecule has a 5‘ end 418 and a 3‘ end 420. A DNA
a small polypeptide polymer built from four amino acids.
The polypeptide polymer 200 has a free ot-amino group 202
at the N-terminal end 204 of the polypeptide polymer 200
and a free ot-carboxyl group 206 at the C-terminal end 208
of the polypeptide polymer 200. The polypeptide polymer
200 is composed from the folloWing amino acids: (1) alanine
polymer can be chemically characteriZed by Writing, in
sequence from the 5‘ end to the 3‘ end, the single letter
abbreviations for the nucleotide subunits that together com
55
shoWn in FIG. 4 can be chemically represented as “ATCG.”
60
and the one-letter symbols corresponding to each of the
amino acids:
nucleotide 402), and a phosphate group (eg phosphate 426)
that links the nucleotide to the next nucleotide in the DNA
subunits.
The amino acid subunits Within a protein are normally
designated by either three-letter symbols or by one-letter
symbols. Table 1, beloW, lists both the three-letter symbols
pose the DNA polymer. For example, the oligomer 400
A nucleotide comprises a purine or pyrimidine base (e.g.
adenine 422 of the deoxy-adenylate nucleotide 402), a
deoxy-ribose sugar (e.g. ribose 424 of the deoxy-adenylate
210; (2) tyrosine 212; (3) aspartic acid 214; and (4) glycine
216. Aprotein comprises one or more polypeptide polymers,
similar to the polypeptide polymer 200 shoWn in FIG. 2,
each generally comprising tens to hundreds of amino acid
oligomer, composed of the folloWing subunits: (1) deoxy
adenosine 402; (2) deoxy-thymidine 404; (3) deoxy
cytosine 406; and (4) deoxy-guanosine 408. When
polymer.
65
The DNA polymers that contain the organiZational infor
mation for living organisms occur in the nuclei of cells in
pairs, called double-stranded DNA helixes. One polymer of
the pair is laid out in a 5‘ to 3‘ direction, and the other
polymer of the pair is laid out in a 3‘ to 5‘ direction. The tWo
6,109,776
3
4
DNA polymers in the double-stranded DNA helix are there
fore described as being anti-parallel. The tWo DNA
polymers, or strands, Within a double-stranded DNA helix
are bound to each other through hydrogen bonds. Because of
a number of chemical and topographic constraints, a deoxy
adenylate subunit of one strand must hydrogen bond to a
betWeen an amino acid subunit 712 of a DNA-binding
protein 714 and a nucleotide subunit 716 of a DNA double
helix 718 vieWed doWn the central axis of the DNA double
helix.
deoxy-thymidylate subunit of the other strand, and a deoxy
FIG. 8 illustrates the spatial relationship betWeen a gene
and various regulatory regions of a DNA double helix that
control transcription of the gene. The gene 802 is generally
guanylate subunit of one strand must hydrogen bond to a
preceded by a promoter region 804 Where various molecular
deoxy-cytidylate subunit of the other strand.
FIG. 5 illustrates the hydrogen bonding that joins tWo
components 806 are assembled in order to catalyZe the
10
addition, various regulatory DNA-binding proteins or
assemblies of regulatory DNA-binding proteins 808—810
anti-parallel DNA strands. The ?rst strand 502 occurs in the
5‘ to 3‘ direction and contains a deoxy-adenylate subunit 504
and a deoxy-guanylate subunit 506. The second, anti
parallel strand 508 contains a deoxy-thymidylate subunit
510 and a deoxy-cytidylate subunit 512. The deoxy
speci?cally bind to a number of regulatory regions of the
DNA double helix 811—813 that are located at various
15
adenylate subunit 504 is joined to the deoxy-thymidylate
subunit 510 through hydrogen bonds 514 and 516. The
deoxy-guanylate subunit 506 is joined to the deoxy
cytidylate subunit 512 through hydrogen bonds 518—522.
The tWo DNA strands linked together by hydrogen bonds
of gene transcription or decrease the rate of gene
20
transcription, thus controlling the concentration of the pro
tein speci?ed by the gene Within the cell. Each type of
regulatory DNA-binding protein recogniZes and binds to a
speci?c sequence, or pattern, of base pairs Within the regu
latory region. These sequences, called binding sites, are
generally less than tWenty nucleotides in length.
25
organism largely depends on the regulation of gene tran
DNA helix. FIG. 6A illustrates a short section of a DNA
double helix 600 comprising a ?rst strand 602 and a second,
anti-parallel strand 604. A deoxy-guanylate subunit in one
The molecular state of a cell and of an entire living
scription by thousands of different regulatory DNA-binding
proteins. Only one or several molecules of each different
type of regulatory protein may occur in a cell at any given
time. A cell thus contains a very complex mixture of
subunit in the other strand 612. FIG. 6B shoWs a represen
tation of the tWo strands illustrated in FIG. 6A using the
single-letter designations for the nucleotide subunits. The
?rst strand 614 (602 in FIG. 6A) is Written in the familiar 5‘
to 3‘ direction, and the second strand 616 (604 in FIG. 6A)
30
regulatory DNA-binding proteins, and each regulatory
DNA-binding protein may occur in the mixture at extremely
small concentrations. Aberrations in the structures of certain
regulatory DNA-binding proteins, or in the concentrations
of certain regulatory DNA-binding proteins Within cell
is Written in the 3‘ to 5‘ direction in order to clearly shoW the
subunit pairings betWeen the tWo strands. These pairings are
called base pairs because the hydrogen bonding occurs
betWeen the purine and pyrimidine bases of the nucleotide
distances along the DNA double helix from the gene 802. In
general, the regulatory proteins may either increase the rate
form the familiar helix structure of the double-stranded
strand 606 is alWays paired With a deoxy-cytidylate subunit
608 in the other strand, and a deoxy-thymidylate subunit in
one strand 610 is alWays paired With a deoxy-adenylate
synthesis of messenger RNA from the gene template. In
35
nuclei, may underlie many different diseases and disorders,
including developmental problems, inherited genetic
subunits. Nucleotide subunits are often referred to as bases.
disorders, and cancers. It is therefore a goal of biological
There is a “C” (e.g., 618) in the second strand directly
opposite from each “G” (e.g., 620) in the ?rst strand, an “A”
sciences and of the biotechnology industry to identify and
characteriZe the many different types of regulatory DNA
(e.g., 622) in the second strand directly opposite from each
“T” (e.g., 624) in the ?rst strand, a “T” (e.g., 626) in the
second strand directly opposite from each “A” (e.g., 628) in
40
binding proteins.
There are a number of different approaches to identifying
regulatory DNA-binding proteins. One such approach is
the ?rst strand, and a “G” (e.g., 630) in the second strand
called the multiplex selection technique, or “MuSTTM.” The
directly opposite from each “C” (e. g., 632) in the ?rst strand.
MuST technique is described in the folloWing patent
applications, Which are hereby incorporated by reference in
their entirety: US. patent application Ser. No. 08/590,571,
?led Jan. 24, 1996, PCT application Serial No. PCT/
US97101230, ?led Jan. 24, 1997, and Us. application Ser.
Thus, knoWing the sequence for the ?rst strand, one can
immediately determine and Write doWn the sequence for the
second strand. DNA base-pair sequences are alWays Written
in the 5‘ to 3‘ direction. The second strand 634 is shoWn
properly Written in the 5‘ to 3‘ direction as the last sequence
in FIG. 6B. When Written in this fashion, the second strand
is said to be the reverse complement of the ?rst strand. Thus,
45
No. 08/906,691 ?led Aug. 6, 1997. In this method, a very
large number of relatively short oligonucleotide DNA
duplexes having random sequences are prepared and mixed
together With a sample that contains various DNA-binding
the “G” 636 on the left or 5‘ end of the second strand 634 is
proteins. The random-sequence oligonucleotide duplexes
paired in the DNA double helix 600 With the “C” 638 at the
right or 3‘ end of the ?rst strand 614.
As described above, the synthesis of proteins from gene
generally have lengths of betWeen eight and tWelve base
55
pairs. After the random-sequence oligonucleotide duplexes
60
are mixed With the DNA-binding proteins, the DNA-binding
proteins bind to speci?c oligonucleotide duplexes that con
tain base-pair sequences that the DNA-binding proteins
recogniZe; or, in other Words, a particular type of DNA
binding protein binds to those oligonucleotide duplexes that
templates is controlled through regulatory regions of DNA
molecules. A large number of different types of DNA
binding proteins bind to these regulatory regions of DNA
molecules and, by so doing, initiate, promote, inhibit, or
prevent the synthesis of one or more speci?c genes. FIG. 7A
illustrates the binding of a dimeric, or tWo-polymer DNA
contain base-pair sequences identical or similar to the base
binding protein 702 to a speci?c regulatory region 704 of a
double-stranded DNA helix 706. In general, a number of
amino acid subunits of a DNA-binding protein hydrogen
bond to nucleotide subunits of the DNA molecule to affect
the binding of the DNA-binding protein to the DNA double
helix. FIG. 7B illustrates tWo hydrogen bonds 708 and 710
65
pair sequence of the binding site Within the regulatory region
of the DNA double helix controlled by that DNA-binding
protein. Various biochemical separation techniques are
employed to separate the DNA-binding proteins bound to
the oligonucleotide duplexes from unbound proteins,
unbound oligonucleotide duplexes, and other molecules
6,109,776
5
6
Within the mixture. The bound DNA-binding protein/
oligonucleotide duplex pairs are then separated, the sepa
?rst cluster 904 are indicated by box 906. Note that the
common sub-sequence occurs toWards the end of sequence
rated oligonucleotide duplexes are ampli?ed by the poly
merase chain reaction (“PCR”) technique and, ?nally, the
common sub-sequence are missing. It should also be noted
13 (908 in FIG. 9) in Which the ?nal tWo nucleotides of the
tWo strands of the oligonucleotide duplexes are separated
and identi?ed by sequence analysis. The result of the analy
that, in some sequences, one or more nucleotides of the
sis is a list of nucleotide sequences of single strands of the
For example, sequence 18 (910 in FIG. 9) contains an initial
“C” 912 rather than a “G.” Sequence 18 (910 in FIG. 9) is
shifted three positions to the right relative to sequence 17
(914 in FIG. 9) and is shifted four places to the right relative
to sequence 19 (916 in FIG. 9) in order that the common
sub-sequence of sequence 18 aligns With the common sub
sequences of sequences 17 and 19. Sequence 22 (918 in FIG.
9) in the original set of sequences 902 does not initially
common sub-sequence have been substituted With another.
oligonucleotide duplexes that Were bound by DNA-binding
proteins in the mixture.
DNA-binding proteins have varying speci?cities for base
pair sequences. Each different type of DNA-binding protein
generally recogniZes and binds to a particular binding site
10
Within a particular regulatory region of a DNA double helix.
The binding site comprises a speci?c sequence of base pairs
Within the DNA double helix. HoWever, a particular DNA
15
appear to have a portion in common With any of the other
binding protein may recogniZe and bind to any number of
sequences. HoWever, the reverse complement of sequence
sequences similar to the sequence of the binding site Which
22 (918 in FIG. 9) is identical With sequence 1 (920 in FIG.
9) and is therefore included, along With sequence 1, in the
the DNA-binding protein normally recogniZes and to Which
the DNA-binding protein binds. Base-pair sequence analysis
is conducted on single strands of DNA rather than on DNA
20
duplexes. A DNA-binding site for a particular DNA-binding
protein Will be therefore characteriZed, folloWing an analysis
of oligonucleotide sequences produced by the MuST
904—908 (e.g., line 924) shoW a mapping from the original
MuST sequences to the ?ve clusters. It is this mapping
betWeen oligonucleotide sequences and clusters, including
technique, by a set of similar sequences corresponding to
one strand of the duplex regions bound by the DNA-binding
?rst cluster 904. The lines betWeen the sequences in the set
of MuST sequences 902 and the sequences Within clusters
the alignments and reverse complementation required to
25
match the common sub-sequences Within the sequences of a
protein and by a set of similar sequences corresponding to
cluster, that is the goal of the computational technique of the
the other strand of the duplex regions bound by the DNA
described embodiment of the present invention.
binding protein. Because the tWo sets of sequences are
related by reverse complementation, the original tWo sets are
merged into a single set of sequences by applying reverse
complementation to the sequences in one of the original tWo
original set of MuST sequences 902 represents a potential
sets. Because the oligonucleotide duplexes employed in the
MuST technique are randomly generated, the ?rst base pair
of the sequence recogniZed by a DNA-binding protein may
not correspond to the ?rst base pair of the oligonucleotide
Each of the clusters 904—908 that are identi?ed from the
30
a cluster may be related to the concentration in the original
cell extract mixture of the DNA-binding protein that recog
niZes the common sequence Within that cluster. Clusters
With one or a feW sequences, such as cluster 2 (905 in FIG.
35
duplex, but may occur at many different positions Within the
DNA-binding protein binds, or may possibly represent an
artifact arising from experimental methodologies.
40
particular binding site for a particular DNA-binding protein
Will be characteriZed Within the set of sequences produced
sequences that contain sub-sequences identical or similar to
puri?ed from complex mixtures by various biochemical
45
techniques. The sequence of amino acids that together
compose the one or more polymers of the DNA-binding
protein can be determined from the puri?ed protein by
biochemical protein sequence analysis techniques. Once the
sequence for a DNA-binding protein has been determined,
that sequence can be compared to data bases of knoWn
identi?ed by the MuST technique. As commonly applied to
cell extracts containing DNA-binding proteins, the MuST
technique may produce a set of many thousands of
sequences. FIG. 9 is intended to illustrate the general
concept of MuST sequence analysis rather than provide an
Once a binding site has been identi?ed by analysis of the
MuST sequences, that binding site can be compared to data
bases of knoWn binding sites to determine Whether the
binding site has been previously characteriZed. The DNA
binding proteins that bind to a particular binding site can be
by the MuST technique by a set of oligonucleotide
sub-sequences of the binding site sequence greater than or
equal in length to some minimum number of nucleotides.
FIG. 9 illustrates the characteriZation of various clusters
representing potential DNA-binding sites from a set of
sequences produced by the MuST technique. A set of 21
sequences 902 represents the oligonucleotide sequences
9) and cluster 4 (907, in FIG. 9) may represent a binding site
to Which an extremely rare or loW-concentration regulatory
oligonucleotide duplex. Generally, a DNA-binding protein
may bind to some minimum number of base pairs that
compose a sub-sequence of the sequence of the binding-site.
Because the MuST oligonucleotide sequences are random, a
DNA-protein binding site. The number of sequences Within
protein sequences or can serve as the basis for the identi?
cation of the gene or genes Within an organism’s DNA
molecules that serve as a template for the synthesis of that
55
actual example.
DNA-binding protein. These various characteriZations of the
DNA-binding protein may eventually lead to the identi?ca
tion of diseases associated With aberrations in the structure
of the protein or in the control of the expression of the gene
Examination of the set of MuST sequences 902 does not
immediately reveal a pattern of related sequences. HoWever,
that is the template for the DNA binding protein. These
as a result of an exhaustive comparison of each sequence in
the set of sequences 902 to the other sequences in the set of
various characteriZations may also lead to various amelio
rative therapies that can be employed to treat such diseases.
sequences 902 by shifting the sequences relative to one
another, and identifying common sub-sequences, ?ve clus
60
SUMMARY OF THE INVENTION
An embodiment of the present invention provides a
ters of related sequences 904—908 can be identi?ed. Each
method and system for computationally analyZing an initial
sequence of the ?rst cluster of sequences 904 contains a
common seven-base-pair sub-sequence “GTTTACC” or 65 set of oligonucleotide sequences in order to identify groups
some very similar variation of that sub-sequence. These
or subsets of sequences that contain common sub-sequences.
common sub-sequences Within each of the sequences of the
The initial set of oligonucleotide sequences may be pro
6,109,776
8
7
duced by various biochemical techniques and may corre
FIG. 13C shoWs an information Weight matrix calculated
from the values in the frequency matrix of FIG. 13B.
spond to a number of different DNA-binding sites Within one
or more double-stranded DNA duplexes. The common sub
FIG. 14A shoWs a cluster having tWo sequences.
sequences may be offset from each other Within the initial
FIG. 14B shoWs a frequency matrix calculated from the
sequences, requiring the sequences of the initial set of 5 cluster of FIG. 14A.
oligonucleotide sequences to be aligned in order to identify
FIG. 14C shoWs an information Weight matrix calculated
the common sub-sequences. Reverse complementation may
from
the values of the frequency matrix shoWn in FIG. 14B.
also be applied to a sequence in order to reveal the common
FIG. 15A shoWs a cluster having three sequences.
sub-sequence that is contained Within the sequence.
FIG. 15B shoWs a frequency matrix calculated from the
In one embodiment of the present invention, the common 10
cluster of FIG. 15A.
sub-sequence Within a subset of sequences, or cluster, that is
identi?ed as a potential binding site is modeled by a numeri
FIG. 15C shoWs an information Weight matrix calculated
cal construct called an information Weight matrix. Each
from the values in the frequency matrix of FIG. 15B.
sequence in the initial set of sequences is separately ana
FIG. 16 is a ?oW-control diagram for the routine “?nd
lyZed With respect to all other sequences Within the initial set 15 Clusters” that implements one embodiment of the present
of sequences. A sequence to be analyZed is placed Within a
invention.
neW cluster and an initial information Weight matrix is
FIG. 17 is a ?oW-control diagram for the routine “?nd
calculated for that cluster. Then, other sequences from the
NxtBest.”
initial set of sequences are added to the cluster and the
information Weight matrix of the cluster is re-computed for
DETAILED DESCRIPTION OF THE
INVENTION
the cluster until the information content of the information
Weight matrix falls beloW a threshold value. The next
sequence chosen for addition to the cluster is a sequence that
is not already included in the cluster and that has the highest
information content With respect to the information Weight
matrix calculated for the cluster.
25
invention is directed, for example, to identifying potential
DNA-binding sites by analyZing a set of oligonucleotide
sequences and organiZing the set of oligonucleotide
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shoWs tWenty amino acid subunits from Which
sequences into subsets of oligonucleotide sequences, called
clusters, that contain similar sequences. Each cluster may
correspond to a DNA-binding site. The method of this
protein molecules are commonly synthesiZed.
FIG. 2 shoWs a small polypeptide polymer built from four
amino acids.
FIG. 3 shoWs a representation of the three-dimensional
shape of one type of protein.
FIG. 4 illustrates a short DNA oligomer.
embodiment separately analyZes each sequence selected
from the initial set of sequences. The selected sequence is
35 considered to be the ?rst member of a neW cluster. The
cluster is modeled by an abstract numerical construct called
FIG. 5 illustrates the hydrogen bonding that joins tWo
anti-parallel DNA strands.
an information Weight matrix. The method successively
chooses additional sequences from the initial set of
sequences to add to the cluster. At any point in the analysis
of a given cluster, the next sequence chosen to be added to
FIG. 6A illustrates a short section of a DNA double helix.
FIG. 6B shoWs a representation of the tWo DNA strands
illustrated in FIG. 6A using single-letter designations for the
nucleotide subunits.
FIG. 7A illustrates the binding of a DNA-binding protein
to a speci?c regulatory region of a double-stranded DNA
helix.
FIG. 7B illustrates tWo hydrogen bonds betWeen an amino
acid subunit of a DNA-binding protein and a nucleotide
45
subunit of a DNA double helix.
FIG. 8 illustrates the spatial relationship betWeen a gene
and various regulatory regions of a DNA double helix that
control transcription of the gene.
FIG. 9 illustrates the characteriZation of clusters repre
senting potential various DNA-binding sites from a set of
sequences produced by the MuST technique.
FIG. 10A shoWs the representation of the oligonucleotide
sequence “AGTCCCCCAT” Within a character array.
FIG. 10B shoWs a sequence matrix representing the
oligonucleotide sequence “AGTCCCCCAT” of FIG. 10A.
FIGS. 11A & 11B shoWs a frequency matrix and an
information Weight matrix.
Embodiments of the present invention provide a method
and system for analyZing a set of linear sequences in order
to identify subsets of the set of linear sequences that share
common sub-sequences. One embodiment of the present
the cluster is the sequence that has the highest information
content With respect to the current information Weight
matrix that describes the cluster. Sequences are successively
added to the cluster until the information content of the
information Weight matrix that describes the cluster falls
beloW a particular threshold value. At that point, the cluster
is complete. The analysis then continues With a different
sequence selected from the set of initial sequences. The
result of the analysis of this embodiment of the present
invention is a set of clusters.
55
While the embodiment described beloW is directed toWard
identifying potential DNA-binding sites from a set of oli
gonucleotide sequences, the method and system of the
present invention may be employed, in other embodiments,
to analyZe different types of sequences. For example, the
sequences of amino acid subunits Within protein polymers
might be analyZed by an embodiment of the present inven
tion in order to identify common or conserved amino acid
subunit sub-sequences Within the polymers that correspond
to common structural features Within a family of proteins. As
another example, the Words of a language, represented as
FIG. 12 shoWs an initial list of sequences obtained from
sequences of letters, might be analyZed by a different
a biochemical technique, such as the MuST technique.
embodiment of the present invention in order to identify
FIG. 13A shoWs a ?rst cluster identi?ed from the
common root Words from Which families of Words have
sequences of FIG. 12.
65 been derived.
FIG. 13B shoWs a frequency matrix calculated from the
Each oligonucleotide sequence from the initial set of
?rst cluster of FIG. 13A.
sequences produced by the MuST technique, or by a similar