DNA RESEARCH 7, 143-150 (2000)
Short Communication
Prediction of the Coding Sequences of Unidentified Human
Genes. XVII. The Complete Sequences of 100 New cDNA Clones
from Brain Which Code for Large Proteins in vitro
Takahiro NAGASE,* Reiko KIKUNO, Ken-ichi ISHIKAWA, Makoto HIROSAWA, and Osamu OHARA
Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan
(Received 31 March 2000)
Abstract
To provide information regarding the coding sequences of unidentified human genes, we have conducted
a sequencing project of human cDNAs which encode large proteins. We herein present the entire sequences
of 100 cDNA clones of unknown human genes, named KIAA1444 to KIAA1543, from two sets of size-fractionated human adult and fetal brain cDNA libraries. The average sizes of the inserts and corresponding
open reading frames of cDNA clones analyzed here were 4.4 kb and 2.6 kb (856 amino acid residues),
respectively. Database searches of the predicted amino acid sequences classified 53 predicted gene products
into the following five functional categories: cell signaling/communication, nucleic acid management, cell
structure/motility, protein management and metabolism. It was also revealed that homologues for 32 KIAA
gene products were detected in the databases, which were similar in sequence through almost their entire
regions. Additionally, the chromosomal loci of the genes were determined by using human-rodent hybrid
panels unless their chromosomal loci were already assigned in the public databases. The expression levels
of the genes were monitored in spinal cord, fetal brain and fetal liver, as well as in 10 human tissues
and 8 brain regions, by reverse transcription-coupled polymerase chain reaction, products of which were
quantified by enzyme-linked immunosorbent assay.
Key words: large proteins; in vitro transcription/translation; cDNA sequencing; expression profile; chromosomal location; brain
In December of 1999, the complete DNA sequence covering almost the entire region of the euchromatic part
of human chromosome 22 was reported.1 Moreover, in
the near future, a working draft sequence, covering at
least 90% of the human genome, is expected to become
available as a result of international collaboration of human genome sequencing. However, it is difficult to conclusively identify genes only from the genomic sequence
with any current computer program because of complicated and extensive splicing events during transcription
and our limited knowledge of the sequence signals which
define the transcriptional start and termination sites.
Therefore, it is still important to collect full-length cDNAs and analyze their sequences, not only for use in functional studies but also for interpretation of
the gene structure, even after the human genome sequence becomes available. Considering this situation,
we have been making efforts to accumulate information
on the coding sequences of unidentified human genes.2,3
Currently, we have focused our sequencing efforts on
the unidentified genes encoding large proteins in human
brain since these gene products appear to play important
roles in the central nervous system.3,4 As an extension
of the preceding studies, we herein report the predicted
coding sequence of 100 new cDNA clones which have the
potential to code for large proteins in vitro. In addition
to the specific features of the newly predicted protein sequences annotated by the database search, the expression
profiles and the chromosomal locations of these 100 new
genes are also reported. The growing catalog of genes
encoding large proteins should provide a wealth of information concerning the primary structures of proteins
present in the human brain that are difficult to identify
by conventional methods of gene discovery.
1. Sequence Analysis and Prediction of Protein-Coding Regions in cDNA Clones
The cDNA clones were isolated from the size-fractionated human adult brain cDNA libraries Nos. 2 to 5 (insert sizes ranging from 4 to 6 kb) and
Communicated by Michio Oishi
* To whom correspondence should be addressed. Tel. +81-438-52-3930, Fax. +81-438-52-3931, E-mail: [email protected]
144 Prediction of Unidentified Human Genes [Vol. 7,
[Figure 1: physical maps of the 100 cDNA clones, KIAA1444 to KIAA1543, drawn against a horizontal scale extending to 7 kb.]
Figure 1. Physical maps of cDNA clones analyzed. The physical maps shown here were constructed from the sequence data of respective
cDNA clones or, when necessary, from the combination of cDNA clones and RT-PCR products. The horizontal scale represents the
cDNA length in kilobases, and the gene numbers corresponding to respective cDNAs are given on the left. The ORFs and untranslated
regions are shown by solid and open boxes, respectively. The positions of the first ATG codons with or without the contexts of
Kozak's rule, are indicated by solid and open triangles, respectively. RepeatMasker, which is a program that screens DNA sequences
for interspersed repeats known to exist in mammalian genomes, was applied to detect repeat sequences in respective cDNA sequences
(Smit, A.F.A. and Green, P., RepeatMasker at http://ftp.genome.washington.edu/RM/RepeatMasker.html). Short interspersed
nucleotide elements (SINEs) including Alu and MIRs sequences and other repetitive sequences thus detected are displayed by dotted
and hatched boxes, respectively.
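The triangles in Figure 1 distinguish first ATG codons that do or do not sit in a Kozak context. The caption does not state which positions were tested, so the following is only a minimal sketch under an assumed criterion: a purine at position -3 or a G at position +4 relative to the A of the ATG. The function name `first_atg_with_context` and the `orf_start` parameter are hypothetical.

```python
def first_atg_with_context(cdna: str, orf_start: int = 0):
    """Find the first in-frame ATG at or after orf_start.

    Returns (position, has_kozak_context) or None if no ATG is found.
    Assumed criterion (not taken from the paper): purine at -3 or G at +4.
    """
    seq = cdna.upper()
    # Step codon by codon so that only ATGs in the ORF's frame are considered.
    for pos in range(orf_start, len(seq) - 2, 3):
        if seq[pos:pos + 3] == "ATG":
            minus3 = seq[pos - 3] if pos >= 3 else ""
            plus4 = seq[pos + 3] if pos + 3 < len(seq) else ""
            return pos, (minus3 in ("A", "G")) or (plus4 == "G")
    return None


if __name__ == "__main__":
    print(first_atg_with_context("GCCACCATGGCT"))
    print(first_atg_with_context("TTTTTTATGCCC"))
```

A stricter variant would require both positions to match; the choice here only illustrates how the solid and open triangles could be assigned.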
No. 2]
T. Nagase et al.
145
Table 1. Information of sequence data and chromosomal locations of the identified genes.
Gene number (KIAA)   Accession number a)   cDNA length (bp) b)   ORF length (amino acid residues)   Chromosomal location c)
1444 f)     AB040877   1248          416    X
1445 f)     AB040878   4559          1202   3
1446        AB040879   5806          645    14
1447        AB040880   5166          1721   17#
1448        AB040881   5536          526    1*
1449        AB040882   5314          607    3*
1450 f)     AB040883   [illegible]   1139   4
1451        AB040884   5556          488    12
1452        AB040885   4800          694    16
1453        AB040886   5879          1123   17 d)
1454        AB040887   5438          1265   15
1455        AB040888   5309          1769   4
1456        AB040889   5759          421    8
1457        AB040890   5517          989    12
1458        AB040891   5843          612    4
1459        AB040892   4417          853    1 d)
1460        AB040893   5273          1400   16
1461        AB040894   5285          1498   2
1462        AB040895   5104          800    10*
1463        AB040896   5437          533    12*
1464        AB040897   5224          621    16*
1465        AB040898   3737          642    15 d)
1466        AB040899   4520          505    [illegible]
1467        AB040900   4114          432    12*
1468        AB040901   4661          985    18*
1469        AB040902   4723          662    20*
1470        AB040903   4028          564    16
1471        AB040904   4038          1345   3
1472        AB040905   2528          601    8*
1473        AB040906   4133          574    19#
1474        AB040907   3692          1079   19
1475        AB040908   3869          986    15*
1476        AB040909   4878          1220   2
1477        AB040910   4638          870    1
1478        AB040911   2818          570    12*
1479        AB040912   4746          464    15*
1480        AB040913   3154          682    X
1481        AB040914   4755          1380   4*
1482        AB040915   4841          818    4*
1483        AB040916   2692          428    6*
1484        AB040917   3174          700    19#
1485        AB040918   4006          1104   8*
1486        AB040919   4498          677    2
1487        AB040920   3992          650    4*
1488        AB040921   3060          852    3*
1489        AB040922   4330          511    7*
1490        AB040923   4115          749    13*
1491        AB040924   2920          757    9*
1492        AB040925   4232          711    2*
1493        AB040926   4768          415    11*
1494        AB040927   4182          638    4*
1495        AB040928   4105          502    X*
1496        AB040929   4594          920    3*
1497        AB040930   6250          730    3
1498        AB040931   5197          1731   12
1499        AB040932   5551          660    17#
1500        AB040933   5425          764    4*
1501        AB040934   2436          735    17
1502        AB040935   4594          560    9 d)
1503        AB040936   4119          827    17
1504        AB040937   3846          447    16
1505        AB040938   4447          693    7
1506        AB040939   4006          1335   7*
1507        AB040940   4360          872    7
1508        AB040941   4368          573    19
1509 e)     AB040942   5283          1326   14
1510 e)     AB040943   7504          1140   20
1511 e)     AB040944   5813          1253   1
1512 e)     AB040945   5326          1692   20
1513 e)     AB040946   5724          649    1 d)
1514 e)     AB040947   5333          1598   17
1515 e)     AB040948   3498          757    11
1516 e)     AB040949   5278          1609   10*
1517 e)     AB040950   4112          998    12
1518 e)     AB040951   5370          876    19*
1519 e)     AB040952   2352          648    5*
1520 e)     AB040953   2157          462    12
1521 e)     AB040954   5276          1003   9*
1522 e)     AB040955   5294          1044   1*
1523 e)     AB040956   2996          687    17
1524 e)     AB040957   3878          884    3
1525 e)     AB040958   4174          973    14*
1526 e)     AB040959   3566          963    9
1527 e)     AB040960   2794          801    4 d)
1528 e)     AB040961   2614          740    7
1529 e)     AB040962   6269          1680   9
1530 e)     AB040963   4562          794    4
1531 e)     AB040964   4798          1060   8*
1532 e,g)   AB040965   3400          601    19*
1533 e,g)   AB040966   2179          651    19*
1534 e,g)   AB040967   3469          865    11
1535 e,g)   AB040968   3496          703    1
1536 e,g)   AB040969   3920          645    12*
1537 e,g)   AB040970   2110          619    10*
1538 e,g)   AB040971   4408          615    17
1539 g)     AB040972   5771          543    9*
1540 g)     AB040973   4109          424    11*
1541 g)     AB040974   6206          477    10*
1542 g)     AB040975   5521          1654   11
1543 g)     AB040976   2800          917    19*
a) Accession numbers of DDBJ, EMBL and GenBank databases.
b) Values excluding poly(A) sequences.
c) Chromosome numbers identified by using the GeneBridge 4 radiation hybrid panel unless specified. The actual primer sequences and the PCR conditions used for the radiation hybrid mapping are accessible through the World Wide Web at http://www.kazusa.or.jp/huge. The chromosomal locations highlighted by asterisks were fetched from the UniGene database. The chromosomal locations highlighted by sharp (#) were taken from the GenBank database because the sequences of the cDNA clones could be found in the genomic sequences whose chromosome numbers were assigned.
d) Chromosome number determined by using the CCR human-rodent hybrid panel.
e) cDNA and ORF lengths were revised by direct analysis of the RT-PCR products.
f) Nucleotide sequences were determined after subcloning of the internal Not I-digested fragment. Therefore, the cDNA lengths of these genes represent those of the internal Not I-digested fragments.
g) cDNA clones were selected by analysis of 5'-end single-pass sequences by the computer-assisted method.
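In Table 1 the gene numbers KIAA1444 to KIAA1543 and the accession numbers AB040877 to AB040976 both run consecutively, so one can be derived from the other. The helper below is a small sketch of that relationship as read from the table; the function name `accession_for` is hypothetical and the mapping holds only for the 100 clones reported here.

```python
FIRST_GENE, LAST_GENE = 1444, 1543
FIRST_ACCESSION = 40877  # numeric part of AB040877, listed for KIAA1444


def accession_for(kiaa_number: int) -> str:
    """Return the accession number listed in Table 1 for a KIAA gene number."""
    if not FIRST_GENE <= kiaa_number <= LAST_GENE:
        raise ValueError("gene number outside the range covered by Table 1")
    # Accessions increase by one for each successive gene number.
    return f"AB{FIRST_ACCESSION + (kiaa_number - FIRST_GENE):06d}"


if __name__ == "__main__":
    for gene in (1444, 1493, 1494, 1543):
        print(f"KIAA{gene}", accession_for(gene))
```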
Table 2. Functional classifications of the gene products.
2-1. Predicted function based on homology search a)
Columns: Gene product; aa res.; OWL ID; % coverage c); homologue aa res.; % identity; Definition. Entries are grouped by function b).

Cell signaling/communication
KIAA1445   1202   X97818     91    1093   93   semaphorin G - mouse
KIAA1446   645    AF064869   92    605    88   brain-enriched guanylate kinase-associated protein 2 mRNA, complete cds. - rat
KIAA1455   1769   AF059485   100   2825   69   DOC4 mRNA, complete cds. - mouse
KIAA1457   989    AF006467   95    1243   52   membrane-associated phosphatidylinositol transfer protein mRNA, complete cds. - mouse
KIAA1459   853    O09127     100   1004   95   ephrin type-A receptor 8 precursor (EC 2.7.1.112) - mouse
KIAA1464   621    AB008515   81    500    66   RanBPM, complete cds. - human
KIAA1471   1345   AF077000   89    1494   87   tyrosine phosphatase TD14 mRNA, complete cds. - rat
KIAA1477   870    Z83868     87    793    90   serine/threonine kinase MARK1 - rat
KIAA1478   570    P40809     40    384    47   GTPase activating protein rotund - Drosophila melanogaster
KIAA1479   464    AF030430   44    888    30   semaphorin VIa mRNA, complete cds. - mouse
KIAA1480   682    U41663     99    848    99   neuroligin 3 mRNA, complete cds. - rat
KIAA1481   1380   Q13796     22    1616   58   APICAL-like protein - human
KIAA1487   650    P39524     83    1355   40   probable calcium-transporting ATPase 3 (EC 3.6.1.38) - Saccharomyces cerevisiae
KIAA1501   735    U02289     28    1439   50   Bristol N2 GTPase-activating protein mRNA, partial cds. - Caenorhabditis elegans
KIAA1516   1609   AF044576   81    1898   32   phospholipase C PLC210 mRNA, complete cds. - C. elegans
KIAA1520   462    Z83328     88    708    39   transport-associated protein Tap2A - Atlantic salmon
KIAA1528   740    AF053700   81    620    47   deltex mRNA, complete cds. - human
KIAA1535   703    AJ225124   100   779    96   hyperpolarization-activated cation channel HAC3 - mouse
KIAA1540   424    P30731     100   423    89   probable G protein-coupled receptor from T-cells precursor - mouse
KIAA1541   477    P36876     88    447    89   protein phosphatase PP2A, 55 kD regulatory subunit, alpha isoform - rat

Nucleic acid management
KIAA1470   564    S71752     43    4861   32   giant protein p619 - human
KIAA1473   574    Q03923     98    595    71   zinc finger protein 85 - human
KIAA1483   428    Q99676     48    726    30   zinc finger protein 184 - human
KIAA1485   1104   I48668     25    710    34   zinc finger protein 51 - mouse
KIAA1488   852    A71434     83    883    34   probable RNA helicase - Arabidopsis thaliana
KIAA1498   1731   S35458     17    769    36   SNF2 protein homolog - human
KIAA1508   573    P52740     88    589    51   zinc finger protein 132 - human
KIAA1517   998    P34305     89    1152   46   putative ATP-dependent RNA helicase - C. elegans
KIAA1542   1654   U49057     85    1473   67   CTD-binding SR-like protein rA9 mRNA, complete cds. - rat

Cell structure/motility
KIAA1448   526    Q60575     94    1150   [illegible]   kinesin-like protein KIF1B - mouse
KIAA1469   662    AB000114   31    421    [illegible]   osteomodulin, complete cds. - human
KIAA1490   749    AF059569   74    593    [illegible]   actin binding protein MAYVEN mRNA, complete cds. - human
KIAA1496   920    A53449     100   1028   91   plasmacytoma-associated neuronal glycoprotein PANG - mouse
KIAA1503   827    U19464     25    4588   [illegible]   outer arm dynein beta heavy chain gene, complete cds. - Paramecium tetraurelia
KIAA1510   1140   P32018     99    1888   30   collagen alpha 1(XIV) chain precursor - chicken
KIAA1512   1692   P12883     100   1935   71   myosin heavy chain, cardiac muscle beta isoform - human
KIAA1522   1044   S22697     21    464    33   extensin - Volvox carteri

Protein management
KIAA1453   1123   U70368     28    545    46   hematopoietic-specific IL-2 deubiquitinating enzyme gene, complete cds. - mouse
KIAA1492   711    P42658     100   865    52   dipeptidyl peptidase IV like protein - human

Metabolism
KIAA1451   488    AB017026   51    410    33   oxysterol-binding protein, complete cds. - mouse
KIAA1534   865    AB017026   27    410    34   oxysterol-binding protein, complete cds. - mouse

a) Homology search was performed by the Smith-Waterman algorithm, using BioView Toolkit and GeneMatcher (revision 3.3, Paracel Inc., USA) against the OWL database (release 31.4). The homologous protein with the highest score was listed when it satisfied the following conditions: i) the protein was functionally annotated, ii) the aligned region exceeded 200 amino acid residues, and iii) percent identity in the aligned region was 30% or greater.
b) Function was classified based on the annotation of the entry of the homologous protein in the database.
c) The values mean the ratio of the length of the aligned region to the original length of the query sequence, in percentage.
the size-fractionated human fetal brain cDNA libraries Nos. 4 and 6 (insert sizes ranging from 4 to 7 kb) previously constructed.3,4 In this report, 20 cDNA clones (KIAA1497-KIAA1508, KIAA1528-KIAA1531, KIAA1538 and KIAA1541-KIAA1543) were selected from the adult brain libraries and the remaining 80 cDNA clones were obtained from the fetal brain cDNA libraries. According to our selection system for unidentified genes, the clones with unidentified sequences at both ends were chosen by single-pass sequencing and homology search against the GenBank database (release 116.0) excluding expressed sequence tags and genomic sequences.4 To select cDNA clones for entire sequencing, we examined their protein-coding capacities with an in vitro transcription/translation system and/or predicted the protein-coding potential of their 5'-end sequences with a computer-based method using GeneMark analysis.4,5 Eighty-eight cDNA clones were selected by the in vitro expression system, while 12 cDNA clones (KIAA1532-KIAA1543) were chosen by the computer-assisted method. Entire sequencing of these clones was performed according to the methods previously described in detail.4 Thirty clones (KIAA1509-KIAA1538) seemed to carry spurious coding interruptions caused by errors of the reverse transcriptase or by retained intron sequences. For these cases, the sequences of the regions interrupting the open reading frame (ORF) were reexamined by direct sequencing of the major products of reverse transcription-coupled polymerase chain reaction (RT-PCR) to predict authentic protein-coding sequences in brain.6 These confirmations revealed spurious interruptions in the following cDNA clones: ORFs in 23 clones (KIAA1509-KIAA1512, KIAA1514-KIAA1517, KIAA1519-KIAA1521, KIAA1523, KIAA1524, KIAA1526-KIAA1528, KIAA1530 and KIAA1532-KIAA1537) were found to carry single or multiple insertions, most of which probably corresponded to retained intronic sequences; ORFs in 10 clones (KIAA1513, KIAA1515, KIAA1520-KIAA1522, KIAA1524, KIAA1525, KIAA1529, KIAA1531 and KIAA1538) were found to carry single or multiple deletions; ORFs in 2 clones (KIAA1523 and KIAA1524) were frame-shifted by a single nucleotide deletion. For those genes, the sequences revised by the RT-PCR experiments, not the actual cloned cDNA sequences, were deposited to GenBank/EMBL/DDBJ
Table 2. Continued.
2-2. Predicted function by motif search a)
Function b) and gene products (aa res.):
Cell signaling/communication: KIAA1465 (642), KIAA1484 (700), KIAA1494 (638), KIAA1497 (730), KIAA1514 (1598), KIAA1531 (1060)
Nucleic acid management: KIAA1460 (1400), KIAA1474 (1079), KIAA1476 (1220), KIAA1509 (1326), KIAA1538 (615)
Protein management: KIAA1515 (757)

Pfam hits, in the order printed:
Pfam ID   E-value c)   Definition
PF01463   1.50E-14   Leucine rich repeat C-terminal domain
PF00560   5.90E-02   Leucine Rich Repeat
PF00560   1.50E-14   Leucine Rich Repeat
PF00047   1.00E-02   Immunoglobulin domain
PF00560   3.50E-01   Leucine Rich Repeat
PF00560   8.50E-05   Leucine Rich Repeat
PF00041   2.40E-06   Fibronectin type III domain
PF00047   4.10E-09   Immunoglobulin domain
PF01463   8.50E-05   Leucine rich repeat C-terminal domain
PF00018   4.60E-16   SH3 domain
PF00018   1.10E-11   SH3 domain
PF00560   6.70E-04   Leucine Rich Repeat
PF00560   8.30E-02   Leucine Rich Repeat
PF00560   1.40E-04   Leucine Rich Repeat
PF00560   1.30E-01   Leucine Rich Repeat
PF00560   9.90E-13   Leucine Rich Repeat
PF01463   9.90E-13   Leucine rich repeat C-terminal domain
PF00047   1.50E-08   Immunoglobulin domain
PF01462   6.70E-04   Leucine rich repeat N-terminal domain
PF00041   4.80E-01   Fibronectin type III domain
PF00041   9.50E-12   Fibronectin type III domain
PF00041   4.90E-14   Fibronectin type III domain
PF00041   1.30E-07   Fibronectin type III domain
PF00041   4.20E-15   Fibronectin type III domain
PF00041   3.90E-11   Fibronectin type III domain
PF00041   3.70E-16   Fibronectin type III domain
PF00041   6.50E-16   Fibronectin type III domain
PF00041   8.40E-14   Fibronectin type III domain
PF00041   5.60E-15   Fibronectin type III domain
PF00041   1.70E-12   Fibronectin type III domain
PF00041   3.80E-20   Fibronectin type III domain
PF00041   7.30E-14   Fibronectin type III domain
PF00047   3.10E-04   Immunoglobulin domain
PF00047   5.90E-10   Immunoglobulin domain
PF00047   5.40E-07   Immunoglobulin domain
PF01825   4.10E-03   Latrophilin/CL-1 like GPS domain
PF00047   7.70E-01   Immunoglobulin domain
PF00002   1.?0E-03   7 transmembrane receptor (Secretin family)
PF00076   1.70E-02   RNA recognition motif
PF00096   4.60E-02   Zinc finger, C2H2 type
PF00096   9.40E-02   Zinc finger, C2H2 type
PF00096   3.50E-03   Zinc finger, C2H2 type
PF00439   3.90E-33   Bromodomain
PF00628   8.80E-18   PHD-finger
PF01486   2.70E-01   K-box region
PF00096   4.30E-03   Zinc finger, C2H2 type
PF00096   5.90E-03   Zinc finger, C2H2 type
PF00443   1.50E-19   Ubiquitin carboxyl-terminal hydrolase family 2
PF00515   8.90E-01   TPR Domain

a) Motif search was performed by HMMER2.1.1 against the Pfam database (release 5.1).
b) Function was classified based on the annotation of the Pfam entry which was hit in the query sequence.
c) Only the entries possessing an expectation value (E-value) less than 1.0 were presented.
databases and used for prediction of their protein-coding sequences unless otherwise stated. The results of the comparison between the cloned DNA and the revised DNA sequences are available through the World Wide Web site at http://www.kazusa.or.jp/huge. Notably, clones for seven genes (KIAA1444, KIAA1447, KIAA1455, KIAA1471, KIAA1498, KIAA1502 and KIAA1506) seemed to lack regions encoding C-terminal portions due to the presence of a Not I site in their coding regions because cDNAs were digested with Not I before ligation to a vector. In contrast, clones for two genes (KIAA1445 and KIAA1450) were found to lack the 5'-portions of the sequences due to the presence of an internal Not I site. For these genes, the nucleotide sequences of only the region between two Not I sites were determined, since their original clones were most likely to harbor two independent cDNAs ligated intermolecularly.7
After these revisions, the average size of the cDNA sequences reached 4.4 kb and that of the predicted coding region was approximately 856 amino acid residues. Physical maps of the 100 cDNA sequences analyzed are shown in Fig. 1, where the ORFs and the first ATG codons in respective ORFs are indicated by solid boxes and triangles, respectively. Repeat sequences are also shown in Fig. 1. Eleven genes had 5'-untranslated regions longer than 1 kb. We could not completely rule out the possibility that these clones retained a 5'-intron upstream of the predicted protein-coding region, since the RNAs used to construct the cDNA libraries contained heterogeneous nuclear RNAs besides the cytoplasmic mRNAs. Chromosomal loci of 54 newly identified genes were determined by using human-rodent hybrid panels, GeneBridge 4 (Research Genetics Inc., USA)8 or CCR (Coriell Cell Repositories, USA), since their mapping data were not available. The chromosomal locations of the 42 genes, which are highlighted by asterisks in Table 1, were fetched from the UniGene database (http://www.ncbi.nlm.nih.gov/UniGene). The chromosomal locations of the remaining four genes, which are highlighted by sharp (#) in Table 1, were taken from the
description of the GenBank database.
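Two of the figures in this paragraph can be cross-checked with simple arithmetic: the mean ORF of 856 residues against the 2.6 kb quoted in the abstract, and the three sources of chromosomal assignments against the total of 100 genes. The snippet below is only a sanity check on the numbers quoted in the text; the variable names are illustrative.

```python
# Mean ORF length reported in the text, in amino acid residues.
mean_orf_residues = 856

# Three nucleotides per codon; the stop codon is not counted here.
mean_orf_kb = mean_orf_residues * 3 / 1000

# Sources of chromosomal assignments as stated in the text.
mapping_sources = {
    "human-rodent hybrid panels": 54,
    "UniGene database": 42,
    "GenBank description": 4,
}
total_mapped = sum(mapping_sources.values())

if __name__ == "__main__":
    print(f"mean ORF length: {mean_orf_kb:.1f} kb")
    print(f"genes with a chromosomal assignment: {total_mapped}")
```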
Table 3. Homologues of the newly identified genes found in various databases.a)
Dal abase"
HUGE and new genes K1AAI455
KIAA1460
KIAA1469
KIAA1473
1769
14(X)
662
574
KIAAI476
KIAAI481)
1220
682
K1AAI528
KIAAI533
KIAA1488
K1AAI499
KIAA1449
KJAA1459
KJAA1488
740
651
852
660
607
853
852
KIAA1516
KIAA1517
K1AA1541
KIAA1445
KIAA1446
1609
998
477
1202
645
853
621
662
1345
574
986
1220
870
682
852
749
711
920
730
660
573
1692
462
740
703
424
477
1654
C.elegims
ID in dalabase
K1AA1127
K1AA1O93
KJAA0405
K1AA0961
KIAAO798
K1AA1198
KIAA0972
KJAA0314
K1AA0951
K1AA1260
K1AA1070
KIAA1021
KIAA0956
KJAAO937
KIAA1201
PR43_YEAST
N'PL4_YEAST
F35GI2.4
M03A1.1
C04H5.6
F56D2.6
F11C3.3
F58G4.I
K12F2.1
R06C7.10
T18D3.4
F52B10.I
F20G4J
K1AAI459
KJAA1464
KIAA1469
K1AA147I
K1AA1473
K1AA1475
KJAA1476
KIAA1477
K1AA1480
K1AA1488
K1AA1490
KIAA1492
KIAA1496
KIAA1497KIAA1499
KIAA15O8
KIAA1512
KJAA152O
KIAA1528
K1AA1535
K1AA1540
KIAA1541
KIAA1542
F31B12.1
C06E1.10
F26E4 1
HSSEMG
AFO64869
EPA8_MOUSE
AF006465
ABOO7865
AFO77OOO
ZN85_HUMAN
DOR_DROME
KIAA0314
RNMARK1
RNU41663
A71434
KELC_DROME
DPP6_HUMAN
A53449
MUSLRRPA
NPL4_YEAST
Z132_HUMAN
MYSB_HUMAN
SSZ83325
AFO5370O
MMJ225124
GCRC_MOUSE
2ABA_RAT
RNU49057
1737
1203
706
530
682
553
702
1240
679
817
826
797
672
653
761
'& Identity
70
39
46
50
42
41
38
37
76
77
72
38
36
45
48
767
580
686
919
1008
739
1963
1974
1992
1938
1968
1956
2020
1922
1152
495
1093
605
1004
653
660
1494
595
1002
1240
793
848
883
689
865
1028
716
580
589
1935
554
620
779
423
447
1473
31
34
43
34
31
30
48
47
47
43
43
35
34
32
46
65
93
88
95
68
46
87
71
35
37
90
99
34
38
52
91
96
34
51
71
39
47
96
89
89
67
^coverage' 1
91
94
96
92
98
89
95
98
99
97
97
84
84
96
82
85
84
100
100
100
99
100
100
99
81
89
88
91
92
100
99
96
89
98
93
98
87
99
83
86
100
100
98
84
88
100
87
81
100
100
88
85
pre-mRNA splicing laclor RNA helicase PRP43
NPL4 protein
hypothetical 77.0 kD Trp-Asp repeats containing protein f35g 12.4
CELMO3A13
CECO1G127
CELF56D25
CEF11C35
CELF58G46
CEK12F21
CEF21C36
CEF45E66
CELF52B1O
CEF20G42
CEF31B12
putative ATP-dependem RNA helicase
CEF26E412
semaphorin G - mouse
brain-enriched guanylate kinasc-associatej protein 2 mRNA, complete cds. • rat
ephrin type-A receptor 8 precursor (EC 2.7.1.112) - mouse
B cell antigen receptor Ig beta associated nrotein 1 mRNA, complete cds. - mouse
KIAAO4O5 mRNA, comptete cds. - humai
protein tyrosine phosphatase TD14 mRNA, complete cds. • rat
zinc finger protein 85 - human
deep orange protein - Drosophila melanogaster
KIAA0314 gene, partial cds. • human
serine/threonine kinase MARK1 - rat
neuroligin 3 mRNA, complete cds. - rat
probable RNA helicase - Arabidopsis thaliana
ring
canal protein - Drosophila melanogaMer
dipepudyl peptidase IV like protein - human
plasmacytoma-associated neuronal gtycorrotein PANG - mouse
NLRR-1 mRNA for leucine-rich-repeat protein, complete cds. • mouse
NPL4 protein - Saccharomyces cerevisiae
zinc finger protein 132. - human
myosin heavy chain, cardiac muscle beta soform - human
tap2A gene - Atlantic salmon
deltex mRNA, complete cds. - human
hyperpolarization-activated cation channel HAC3 - mouse
probable G protein-coupled receptor from T-cells precursor - mouse
protein phosphatase PP2A, 55 kD regulatory subunit, alpha isolorm - rat
CTD-hinding SR-like protein rA9 mRNA complete cds. • rat
a) Homologues are defined here as proteins found in the databases that satisfy the following conditions: i) the length ranged from 80% to 125% of the query sequence; ii) the ratio of the length of the aligned region to that of the original sequence of the query was 80% or greater; iii) percent identity was 30% or greater. The method of homology search was the same as that explained in Table 2-1.
b) The following databases were used: HUGE, our cDNA-encoded protein database (http://www.kazusa.or.jp/huge); yeast, non-redundant peptide database from genome-ftp.stanford.edu:/pub/yeast/yeast.protein/yeast_nrpep.fasta.Z; C. elegans, protein database deduced from the C. elegans full genome sequence (ftp.sanger.ac.uk:/pub/databases/C.elegans_sequences/C_elegans_proteins-1998-10-16.pep) and the entries derived from C. elegans of OWL; and OWL (release 31.4). In the case of the database search against OWL, only the homologue with the highest score to each query was listed.
c) The number of amino acid residues of the gene product.
d) The values mean the ratio of the length of the aligned region to the original length of the query sequence, in percentage.
e) For entries from the yeast and OWL databases, the annotations were listed. For C. elegans, IDs of OWL were listed when sequences identical to the entries from
the full genome were registered in OWL.
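The three conditions in footnote a) translate directly into a filter on alignment results. The sketch below assumes a hypothetical `Hit` record holding the lengths and percent identity; it is not the authors' pipeline, only an illustration of the stated thresholds. The two sample hits use values listed in Table 2-1, with the aligned length back-calculated from the reported coverage, so those lengths are approximate.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_len: int           # residues in the KIAA gene product
    subject_len: int         # residues in the database protein
    aligned_len: int         # length of the aligned region on the query
    percent_identity: float  # identity within the aligned region


def is_homologue(hit: Hit) -> bool:
    """Apply the three conditions of Table 3, footnote a)."""
    length_ratio = hit.subject_len / hit.query_len
    coverage = hit.aligned_len / hit.query_len
    return (
        0.80 <= length_ratio <= 1.25      # i) comparable overall length
        and coverage >= 0.80              # ii) alignment spans most of the query
        and hit.percent_identity >= 30.0  # iii) sufficient identity
    )


if __name__ == "__main__":
    # KIAA1445 vs semaphorin G (1093 residues, 93% identity, about 91% coverage).
    print(is_homologue(Hit(1202, 1093, 1094, 93.0)))
    # KIAA1455 vs DOC4 (2825 residues): the subject is far longer than the query.
    print(is_homologue(Hit(1769, 2825, 1769, 69.0)))
```

The length-ratio condition is what separates a homologue in this sense from a protein that merely shares a well-conserved region with the query.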
2. Functional Classification of Predicted Gene Products
The gene products predicted from the cDNA sequences were classified by homology and/or motif search against the following public databases: the protein sequence database OWL (release 31.4),9 databases of predicted protein sequences from the yeast10 and C. elegans11 genomes [genome-ftp.stanford.edu:/pub/yeast/yeast-protein/yeast_nrpep.fasta.Z, ftp.sanger.ac.uk:/pub/databases/C.elegans_sequences/C_elegans_proteins_1998-10-16.pep], the protein domain database Pfam (release 5.1),12 and our own database, HUGE13 (http://www.kazusa.or.jp/huge). As is shown in Table 2, the functions of 53 gene products were classified into five functional categories. Among them, 41 gene products indicated significant sequence similarity to functionally annotated proteins (Table 2-1). The functions of the other
12 gene products were predicted based on the presence
[Figure 2: color-coded expression matrix for KIAA1444 to KIAA1543 across tissues and brain regions; color scale in fg/ng of poly(A)+ RNA.]
Figure 2. Expression profiles of 100 newly identified genes examined by RT-PCR ELISA. The tissue expression levels of the 100 human
genes were analyzed by using the RT-PCR ELISA according to the methods previously described in detail.14 Gene names are given
as KIAA numbers at the left side of each set of color codes. Tissue and brain region names are indicated above the top sets of
color codes. A color conversion panel shown at the bottom was used for displaying mRNA levels as color codes. The mRNA levels
are expressed in equivalent amounts (fg) of the authentic cDNA plasmids in 1 ng of starting poly(A)+ RNAs. Besides 10 tissues,
9 regions of the adult central nervous system (amygdala, corpus callosum, cerebellum, caudate nucleus, hippocampus, substantia
nigra, subthalamic nucleus, thalamus, and spinal cord) and fetal brain were included in the expression profiling. As a control, mRNA
levels in fetal liver were also examined.
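The unit used in Figure 2, femtograms of plasmid equivalent per nanogram of starting poly(A)+ RNA, can be read as a mass fraction: 1 fg/ng corresponds to one part per million. The helper functions below are a hypothetical illustration of that normalization and are not part of the published RT-PCR ELISA protocol.

```python
def mrna_level_fg_per_ng(plasmid_equivalent_fg: float, poly_a_rna_ng: float) -> float:
    """Normalize a plasmid-equivalent signal (fg) to the poly(A)+ RNA input (ng)."""
    if poly_a_rna_ng <= 0:
        raise ValueError("RNA input must be positive")
    return plasmid_equivalent_fg / poly_a_rna_ng


def as_mass_fraction(level_fg_per_ng: float) -> float:
    """Convert fg/ng to a dimensionless mass fraction (1 fg / 1 ng = 1e-6)."""
    return level_fg_per_ng * 1e-6


if __name__ == "__main__":
    level = mrna_level_fg_per_ng(50.0, 2.0)  # illustrative numbers only
    print(level, as_mass_fraction(level))
```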
of functional motifs/domains, since they did not show
sequence similarity to functionally annotated proteins
(Table 2-2). In total, 48 gene products (91% of genes
functionally annotated here) were suggested to have functions relating to cell signaling/communication, nucleic
acid management or cell structure/motility. To find the
genes structurally conserved in others, we tentatively defined "homologues" as genes with at least 30% amino acid
identity spanning almost the entire region (more than
80% coverage against the query protein sequence). As
shown in Table 3, 32 KIAA gene products were found to
have "homologues" in the databases.
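The 91% figure quoted above follows from the category counts in Table 2. The tally below uses per-category counts read off Table 2-1 and Table 2-2 as printed in this transcript (an assumption, since the table layout is flattened); the totals of 41, 12, 53 and 48 are the ones stated in the text.

```python
# Gene products per functional category, as read from Table 2-1 (homology search).
by_homology = {
    "cell signaling/communication": 20,
    "nucleic acid management": 9,
    "cell structure/motility": 8,
    "protein management": 2,
    "metabolism": 2,
}

# Gene products per functional category, as read from Table 2-2 (motif search).
by_motif = {
    "cell signaling/communication": 6,
    "nucleic acid management": 5,
    "protein management": 1,
}

total = sum(by_homology.values()) + sum(by_motif.values())

top_three = ("cell signaling/communication", "nucleic acid management", "cell structure/motility")
in_top_three = sum(by_homology.get(c, 0) + by_motif.get(c, 0) for c in top_three)
percent = round(100 * in_top_three / total)

if __name__ == "__main__":
    print(total, in_top_three, percent)
```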
[Vol. 7,
3. Expression Profiles of Predicted Genes

The expression profiles of the genes newly identified in this study are represented in Fig. 2 by color codes.14 The expression levels of six genes (KIAA1444, KIAA1446, KIAA1472, KIAA1479, KIAA1484 and KIAA1497) were relatively higher in brain than in other tissues. Among them, KIAA1446 and KIAA1497 were homologues of rat brain-enriched guanylate kinase-associated protein15 and mouse neuronal leucine-rich repeat protein I,16 respectively. The gene product of KIAA1472 exhibited sequence similarity across almost its entire region to KIAA0374 protein, whose expression was relatively high in brain although its function was unknown. KIAA1535 was a human counterpart of mouse hyperpolarization-activated cation channel HAC3, and its expression pattern suggests that this gene might play an important role in development of the human brain, since the expression of KIAA1535 was higher in fetal brain than in adult brain. These expression profiles provide us with important information for identifying biologically important genes characterized in this project.

Acknowledgements: This project was supported by grants from the Kazusa DNA Research Institute. We thank Tomomi Tajino, Keishi Ozawa, Tomomi Kato, Kazuhiro Sato, Akiko Ukigai, Emiko Suzuki, Kazuko Yamada, Kiyoe Sumi, Takashi Watanabe, Kozue Kaneko, Naoko Shibano, Kazutaka Mitsui, Mina Waki and Nobue Kashima for their technical assistance.

References

1. Dunham, I., Shimizu, N., Roe, B. A. et al. 1999, The DNA sequence of human chromosome 22, Nature, 402, 489-495.
2. Nomura, N., Miyajima, N., Sazuka, T. et al. 1994, Prediction of the coding sequences of unidentified human genes. I. The coding sequences of 40 new genes (KIAA0001-KIAA0040) deduced by analysis of randomly sampled cDNA clones from human immature myeloid cell line KG-1, DNA Res., 1, 27-35.
3. Nagase, T., Kikuno, R., Ishikawa, K.-I. et al. 2000, Prediction of the coding sequences of unidentified human genes. XVI. The complete sequences of 150 new cDNA clones from brain which code for large proteins in vitro, DNA Res., 7, 65-73.
4. Ohara, O., Nagase, T., Ishikawa, K.-I. et al. 1997, Construction and characterization of human brain cDNA libraries suitable for analysis of cDNA clones encoding relatively large proteins, DNA Res., 4, 53-59.
5. Hirosawa, M., Nagase, T., Ishikawa, K.-I., Kikuno, R., Nomura, N., and Ohara, O. 1999, Characterization of cDNA clones selected by the GeneMark analysis from size-fractionated cDNA libraries from human brain, DNA Res., 6, 329-336.
6. Ishikawa, K.-I., Nagase, T., Nakajima, D. et al. 1997, Prediction of the coding sequences of unidentified human genes. VIII. 78 new cDNA clones from brain which code for large proteins in vitro, DNA Res., 4, 307-313.
7. Nagase, T., Ishikawa, K.-I., Miyajima, N. et al. 1998, Prediction of the coding sequences of unidentified human genes. IX. 100 new cDNA clones from brain which can code for large proteins in vitro, DNA Res., 5, 31-39.
8. Gyapay, G., Schmitt, K., Fizames, C. et al. 1996, A radiation hybrid map of the human genome, Hum. Mol. Genet., 5, 339-346.
9. Bleasby, A. J., Akrigg, D., and Attwood, T. K. 1994, OWL - a non-redundant composite protein sequence database, Nucleic Acids Res., 22, 3574-3577.
10. Goffeau, A., Barrell, B. G., Bussey, H. et al. 1996, Life with 6000 genes, Science, 274, 546-567.
11. The C. elegans Sequencing Consortium. 1998, Genome sequence of the nematode, C. elegans: A platform for investigating biology, Science, 282, 2012-2018.
12. Bateman, A., Birney, E., Durbin, R. et al. 1999, Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins, Nucleic Acids Res., 27, 260-262.
13. Kikuno, R., Nagase, T., Suyama, M., Waki, M., Hirosawa, M., and Ohara, O. 2000, HUGE: a database for human large proteins identified in Kazusa cDNA sequencing project, Nucleic Acids Res., 28, 331-332.
14. Nagase, T., Ishikawa, K.-I., Suyama, M. et al. 1998, Prediction of the coding sequences of unidentified human genes. XI. The complete sequences of 100 new cDNA clones from brain which code for large proteins in vitro, DNA Res., 5, 277-286.
15. Deguchi, M., Hata, Y., Takeuchi, M. et al. 1998, BEGAIN (brain-enriched guanylate kinase-associated protein), a novel neuronal PSD-95/SAP90-binding protein, J. Biol. Chem., 273, 26269-26272.
16. Taguchi, A., Wanaka, A., Mori, T. et al. 1996, Molecular cloning of novel leucine-rich repeat proteins and their expression in the developing mouse nervous system, Brain Res. Mol. Brain Res., 35, 31-40.