Chen et al., Modeling Multi-typed Structurally Viewed Chemicals with the UMLS Refined SN
Research Paper
Modeling Multi-typed Structurally Viewed Chemicals with the
UMLS Refined Semantic Network
LING CHEN, PHD, C. PAUL MORREY, MS, HUANYING GU, PHD, MICHAEL HALPER, PHD,
YEHOSHUA PERL, PHD
Abstract
Objective: Chemical concepts assigned multiple “Chemical Viewed Structurally” semantic types
(STs) in the Unified Medical Language System (UMLS) are subject to ambiguous interpretation. The multiple
assignments may denote the fact that a specific represented chemical (combination) is a conjugate, derived via a
chemical reaction of chemicals of the different types, or a complex, composed of a mixture of such chemicals. The
previously introduced Refined Semantic Network (RSN) is modified to properly model these varied multi-typed
chemical combinations.
Design: The RSN was previously introduced as an enhanced abstraction of the UMLS’s concepts. It features new
types, called intersection semantic types (ISTs), each of which explicitly captures a unique combination of ST
assignments in one abstract unit. The ambiguous ISTs that combine different “Chemical Viewed Structurally” STs in the
RSN are replaced with two varieties of new types, called conjugate types and complex types, which explicitly
denote the nature of the chemical interactions. Additional semantic relationships help further refine that new
portion of the RSN rooted at the ST “Chemical Viewed Structurally.”
Measurements: The number of new conjugate and complex types and the number of changes to the type
assignments of chemical concepts are presented.
Results: The modified RSN, consisting of 35 types and featuring 22 new conjugate and complex types, is
presented. A total of 800 (about 98%) chemical concepts representing multi-typed chemical combinations from
“Chemical Viewed Structurally” STs are uniquely assigned one of the new types. An additional benefit is the
identification of a number of illegal ISTs and ST assignment errors, some of which are direct violations of
exclusion rules defined by the UMLS Semantic Network.
Conclusion: The modified RSN provides an enhanced abstract view of the UMLS’s chemical content. Its array of
conjugate and complex types provides a more accurate model of the variety of combinations involving chemicals
viewed structurally. This framework will help streamline the process of type assignments for such chemical
concepts and improve user orientation to the richness of the chemical content of the UMLS.
J Am Med Inform Assoc. 2009;16:116-131. DOI 10.1197/jamia.M2604.
Affiliations of the authors: Department of Science, BMCC, City
University of New York (LC), New York, NY; CS Department, New
Jersey Institute of Technology (CPM, YP), Newark, NJ; Department
of Health Informatics, SHRP, University of Medicine and Dentistry
of New Jersey (HG), Newark, NJ; Department of Computer Science,
Kean University (MH), Union, NJ.
Supported in part by the National Library of Medicine under grant
R-01-LM008445-01A2.
The authors thank Jim Cimino for his repeated feedback, and
Olivier Bodenreider for pointing out examples of chemicals composed of carbohydrates and amino acid components that are valid
ISTs, which initiated our analysis of partial ISTs.
Correspondence: Dr. Yehoshua Perl, CS Department, New Jersey
Institute of Technology, Newark, NJ 07102-1982. e-mail: <perl@oak.njit.edu>.
Received for review 08/22/07; accepted for publication: 09/23/08.
Introduction
The Metathesaurus (META)1 and the Semantic Network
(SN)2 are two fundamental knowledge resources of the
Unified Medical Language System (UMLS).3 The SN, consisting of 135 broad categories called semantic types (STs),
provides a high-level categorization of all 1.5 million biomedical concepts residing in the META. Each concept is
assigned one or more of these STs, which serve to denote an
aspect of the concept’s semantics. The extent of an ST is the
entire set of concepts to which it is assigned. If the extent of
an ST contains some concepts also assigned other STs at the
same time—a common occurrence—then the set will elaborate a variegated semantics. In this sense, the high-level
abstract view provided by the SN does not in general show
semantic uniformity for the concepts of the META included
in a particular ST’s extent. For example, two concepts
assigned the ST Steroid are enterodiol and lipoprotein-X
cholesterol (concepts are written in italics; semantic types are
capitalized and written in bold, except in tables and figures).
However, their high-level semantics are not that similar. The
former is assigned only Steroid, whereas the latter is multityped, assigned both Steroid and Amino Acid, Peptide, or
Protein.
The previously introduced Refined Semantic Network
(RSN)4 offers a semantically uniform abstract view of the
META by utilizing reification with respect to combinations
Journal of the American Medical Informatics Association, Volume 16, Number 1, January / February 2009
of ST assignments. By reification, in this context, we mean
the creation of an explicit type at the Semantic-Network
level. In particular, we model all existing ST assignment
combinations as separate types in their own right, called
intersection semantic types (ISTs). For example, because the
concept lipoprotein-X cholesterol is one of 39 concepts assigned both Steroid and Amino Acid, Peptide, or Protein
with respect to the SN, the existence of that assignment
combination causes the RSN to include an IST named
Steroid ∩ Amino Acid, Peptide, or Protein that is lipoprotein-X cholesterol’s sole type assignment. (The symbol “∩” is
set intersection.) All boldface terms are defined in the
Glossary Appendix (available as an online data supplement
at [email protected]).
The largest collection of ISTs, 411 in total, exists for that part
of the UMLS devoted to chemicals, where two or more ST
assignments per concept are common. The 32 ISTs involving
Chemical Viewed Structurally or its descendants have
revealed semantic ambiguities with respect to various ST
assignment combinations. Typically, an assignment of an
IST involving, say, two STs to a given concept indicates that
the concept has the semantics of both. However, with a
multi-typed chemical concept that is viewed structurally, an
ST combination may indicate a simple mixture or some
implied chemical reaction. All chemicals, in fact, can be
categorized as either pure substances (with definite compositions) or mixtures (without definite compositions). A conjugate is a pure substance produced through a chemical
reaction involving two or more compounds (which themselves are also pure substances). The constituent moieties of
a conjugate are linked together by covalent bonds. An
example conjugate is avidin-adenosine monophosphate conjugate, consisting of a protein moiety, avidin, and a nucleotide
moiety, adenosine monophosphate. Of interest is the fact
that the constituent components of a conjugate can only be
separated via a chemical reaction (i.e., a decomposition or
hydrolysis reaction) that undoes the original reaction used
in the conjugate’s creation.
On the other hand, mixtures are made of two or more
chemicals, where the chemicals are not joined by covalent
bonds. Therefore, they can be mixed at different proportions
(i.e., the composition can be varied). When at least one of the
chemicals is a macromolecule, the mixture is called a complex. Theoretically speaking, it is possible for any two
compounds (macromolecules) to form both a conjugate (via
a chemical reaction) and a complex (via physical means).
Virus core is an example of a complex consisting of nucleic
acids and proteins. The nucleic acid is enclosed in a protective coat of protein. In contrast to a conjugate, a complex can
be separated into its constituent substances without having
to resort to a chemical reaction. The two components of the
virus core can easily be separated via physical means, i.e.,
solvent extractions. When a virus infects a host, the protein
coat helps it attach to the cell surface, and the nucleic acid
component is injected into the host cells. With this distinction in mind, we see that an IST may in actuality denote
chemical conjugates or complexes, whose component chemical concepts are assigned different STs in the SN.
In this article, we analyze the possible composite semantics
elaborated by ISTs comprising chemical-viewed-structurally
STs based on the nature of the chemical interactions. Following this analysis, extensions to the RSN are proposed. In
particular, the RSN is augmented with new types derived
from ISTs to explicitly represent conjugates and complexes,
as well as their semantic relationships. Rules expressed by
the UMLS in its ST definitions and usage notes concerning
illegal ST combinations are used to expose concepts with
erroneous ST assignment combinations and suggest proper
re-assignments. Redundant assignments5 are also identified
and corrected. Overall, the resulting RSN more properly
elaborates the semantics of varied multi-typed chemical
compositions. Its abstract view facilitates user orientation to
the richness of the UMLS’s chemical content and provides
maintenance personnel with an easier and more accurate
framework for carrying out chemical concept categorizations.
Practical implications of adding complex types and conjugate types to the RSN are considered. Tradeoffs regarding
the RSN’s overall size and coverage and various implementation options are also discussed. An alternative use of the
RSN as a high-level categorization mechanism for a chemical ontology, such as ChEBI,6 is also considered.
Background
The RSN4,7 consists of two kinds of types. All STs appearing
in the SN are carried over to be types in the RSN. These are
referred to as original semantic types (OSTs). The others are
the intersection semantic types (ISTs), which, as noted above,
are reifications of ST assignment combinations appearing in
the UMLS. An IST exists for every such combination of
multiple ST assignments to a concept, as defined by the
UMLS’s MRSTY table.8 An IST involving two STs, say,
Carbohydrate and Lipid is denoted Carbohydrate ∩ Lipid,
where “∩” is the symbol for set intersection. ISTs may
involve more than two STs, e.g., Steroid ∩ Amino Acid,
Peptide, or Protein ∩ Carbohydrate.
It is important to note that each concept receives a single
type assignment with respect to the RSN. In particular, the
type assignments for OSTs tend to differ from those of their
corresponding STs in the SN. A concept retains an OST
assignment only if that assignment was its sole assignment
previously. For example, enterodiol is assigned only Steroid
in the SN and thus has the exact same assignment in the
RSN. On the other hand, a concept with multiple assignments in the SN will lose all of those assignments in favor of
a single IST assignment in the RSN. For example, lipoprotein-X cholesterol is assigned Steroid and Amino Acid, Peptide, or Protein. Therefore, with respect to the RSN, it will be
assigned only the IST Steroid ∩ Amino Acid, Peptide, or
Protein, not its two OSTs. In this manner, the entire collection of the RSN’s types functions as a partition of the
concepts of the UMLS into disjoint extents of uniform
semantics. The extent of the OST Steroid contains chemical
concepts that are categorized strictly as steroids (including
enterodiol). The extent of the IST Steroid ∩ Amino Acid,
Peptide, or Protein contains chemical concepts categorized
jointly as steroids and amino acids, peptides, or proteins
(including lipoprotein-X cholesterol). Overall, the chemical
concepts in the UMLS 2007AA release are partitioned into 25
OSTs and 411 ISTs. In total, 108,299 concepts are assigned
the 25 OSTs; 196,524 concepts are assigned the 411 ISTs.
There are 84,630 concepts assigned the 11 chemical-viewed-structurally OSTs and another 820 concepts assigned the 32
chemical-viewed-structurally ISTs.
Figure 1. The portion of the RSN rooted at Chemical Viewed Structurally.
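The partition described above can be illustrated with a minimal Python sketch. It assumes a simplified list of (concept, ST) pairs in the spirit of the MRSTY table; the row layout and the names `ASSIGNMENTS`, `rsn_partition`, and `type_name` are hypothetical and are not part of the UMLS or of the RSN implementation.

```python
from collections import defaultdict

# Hypothetical (concept, semantic type) rows in the spirit of MRSTY;
# the real table uses identifiers, which are omitted in this sketch.
ASSIGNMENTS = [
    ("enterodiol", "Steroid"),
    ("lipoprotein-X cholesterol", "Steroid"),
    ("lipoprotein-X cholesterol", "Amino Acid, Peptide, or Protein"),
]


def rsn_partition(rows):
    """Map each RSN type (a frozenset of ST names) to its extent."""
    sts_of = defaultdict(set)
    for concept, st in rows:
        sts_of[concept].add(st)
    extents = defaultdict(set)
    for concept, sts in sts_of.items():
        # One ST gives an OST; two or more give an IST.
        extents[frozenset(sts)].add(concept)
    return dict(extents)


def type_name(sts):
    """Render a type name, joining constituent STs with the intersection sign."""
    return " ∩ ".join(sorted(sts))


partition = rsn_partition(ASSIGNMENTS)
for sts, extent in partition.items():
    kind = "OST" if len(sts) == 1 else "IST"
    print(kind, "|", type_name(sts), "|", sorted(extent))
```

Because every concept appears in exactly one extent, the extents are disjoint by construction, which mirrors the single-type-assignment property of the RSN.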
The IS-A hierarchy of the RSN extends that of the SN to
allow for multiple parents. In fact, every IST has more than
one parent. Although we will not get into all the details of
the derivations of these IS-As (see Gu et al.7), we do note that
an IST always has paths of IS-A relationships leading to each
of its constituent OSTs.
The portion of the RSN rooted at Chemical Viewed
Structurally, including 11 OSTs and 32 ISTs, is shown in
Figure 1. All OSTs appear above the dashed line in the
figure; all ISTs are below it. A type is drawn as a box,
whereas an IS-A is a bold arrow directed from the child to
the parent. Again, note that the IST Amino Acid, Peptide,
or Protein ∩ Carbohydrate ∩ Lipid, for example, has
three IS-As leading to its parents, a situation that would
not occur in the SN.
Methods
Our methodology augments the RSN with new types and
semantic relationships in order to more properly capture
knowledge of chemicals. Because the focus of this work is on
types in the hierarchy rooted at Chemical, let us start by
introducing some terminology concerning such types. An ST
that is a descendant of Chemical is called a chemical ST
(CST). An ST beneath Chemical Viewed Structurally is
called a structurally viewed CST, whereas one beneath
Chemical Viewed Functionally is a functionally viewed
CST. Lastly, an ST under Organic Chemical in the hierarchy
is called an organic CST. As an example, Lipid is a structurally viewed CST; it is also an organic CST. Vitamin is a
functionally viewed CST. The UMLS definition of ST Chemical states that: “Chemicals are viewed from two distinct
perspectives in the network, functionally and structurally.
Almost every chemical concept is assigned at least two
types, generally one from the structure hierarchy and at least
one from the function hierarchy.”9 This implies that ISTs
involving CSTs should be common in the RSN, and in fact 90
of the 100 ISTs with the largest extents do include CSTs. An
IST involving functionally viewed CSTs has the expected
interpretation of a logical “AND” operator. Its assigned
concepts have the semantics of all the types in the conjunctive form. For example, with Vitamin ∩ Pharmacologic
Substance, all concepts represent chemicals that are both
vitamins and pharmacologic substances. If an IST represents
a combination of one structurally viewed CST and one or
more functionally viewed CSTs, then we find the same
interpretation. As an example, Lipid ∩ Vitamin is assigned
to concepts that indeed represent chemicals that are both
lipids and vitamins. In both of these circumstances, the IST’s
IS-A relationships to its constituent OSTs in the RSN help to
reinforce this interpretation.
Conjugate versus Complex
The situation is different, however, in cases where two or
more structurally viewed CSTs are involved. Such an IST
models chemicals obtained from the combination of other
chemicals. When combining two (or more) chemicals
whose corresponding concepts are assigned two (or more)
structurally viewed CSTs, a chemical reaction may occur
and produce an entirely new chemical. Such a chemical is
called a conjugate. A conjugate chemical does not necessarily have all of the properties of its source chemicals,
because some of the original structural components are
expended in its creation. The neutralization reaction of an
acid and a base producing a salt is a simple example of
this scenario. The new chemical, salt, contains parts of
acid and base; however, it is neither an acid nor a base. In
this sense, a conjugate does not exhibit the semantic
combination of the STs underlying its corresponding
concept’s assigned IST. It, in fact, has a brand-new semantics. An example conjugate is N-dodecanoyl serine produced by a reaction of dodecanoic acid (whose concept is
assigned Lipid) and serine (assigned Amino Acid, Peptide, or Protein). This chemical is neither a lipid nor an
amino acid, peptide, or protein.
It is also possible that the chemical combination results in a
new chemical that is a mixture of the originals. In this case,
the new chemical is called a complex. A complex chemical, in
contrast to a conjugate, preserves the properties of its
original chemicals. It has the semantic conjunctive combination of the constituent STs of the IST assigned to its corresponding concept. The concept high density lipoprotein (HDL),
“the good cholesterol,” consisting of cholesterol and lipoprotein, represents a complex.
Let us note that many conjugate and complex chemicals are
derived from two or more chemicals of the same type, e.g.,
two carbohydrates or two lipids. Such “intra-type” conjugates and complexes are not dealt with in this work because
the modeling provided by the SN itself suffices. That is, the
assignment of an OST, say, T to a concept representing an
intra-type complex or conjugate whose component chemical
concepts are all assigned T, too, is appropriate. For example,
Animal fat is a concept denoting a complex comprising only
lipids. Hence, it is fittingly assigned OST Lipid. Another
example is Dietary carbohydrates, which is composed of
different kinds of carbohydrates. It is appropriately assigned
Carbohydrate. In such circumstances, the OSTs offer the
proper type assignments to the respective concepts, and the
RSN does not need to be altered in any special way in order
to accommodate them. In the remainder of this article, we
use “conjugate” and “complex” exclusively for chemicals
whose concepts were originally multi-typed with respect to
the SN and have been assigned a single IST in the RSN.
These are the chemicals that we propose to remodel from a
type perspective.
Different Configurations of ISTs Comprising
Structurally Viewed CSTs
We have identified three distinct cases concerning concepts assigned an IST involving two or more structurally viewed CSTs:
1. All of the concepts assigned the IST represent conjugates.
2. Some (but not all) of the concepts assigned the IST
represent complexes; the remaining concepts represent
conjugates.
3. None of the concepts should actually be assigned the IST
because its combination of the two (or more) structurally
viewed CSTs is semantically invalid. The IST should not
exist.
Although theoretically possible, we did not encounter any
case of an IST in the UMLS where all concepts represented
complexes. The interesting cases, from the perspective of
further refinement of the RSN, are (1) and (2), in which the
IST categorizes a representation of a combination of two or
more chemicals (each of whose concepts is assigned various structurally viewed CSTs) that results in a new chemical.
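The three cases above can be sketched as a small decision function. This is only an illustration: it assumes each concept of an IST has already been labeled by a domain expert as 'conjugate', 'complex', or 'invalid' (a miscategorized concept), and the function name and labels are hypothetical.

```python
def classify_ist(kinds):
    """Return the case number (1, 2, or 3) for an IST.

    kinds: per-concept labels, each 'conjugate', 'complex', or 'invalid'.
    Returns None for the all-complex situation, which is theoretically
    possible but was not encountered in the UMLS.
    """
    kinds = list(kinds)
    if not kinds:
        raise ValueError("an IST must have at least one assigned concept")
    # Miscategorized concepts are set aside before deciding the case.
    valid = [k for k in kinds if k != "invalid"]
    if not valid:
        return 3  # no concept should carry this IST; it should not exist
    if all(k == "conjugate" for k in valid):
        return 1
    if any(k == "conjugate" for k in valid):
        return 2  # a mix of complexes and conjugates
    return None


print(classify_ist(["conjugate", "conjugate"]))
print(classify_ist(["conjugate", "complex", "invalid"]))
print(classify_ist(["invalid"]))
```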
For Case 1, the RSN is augmented to express knowledge of
conjugates explicitly by first adding a new type called
Conjugate, with the following definition:
A compound produced from a chemical reaction of two or
more compounds. Such a compound consists of chemically
(covalently) bonded moieties of each constituent.
Conjugate is also defined to stand in an IS-A relationship to
Organic Chemical, which is a child of Chemical Viewed
Structurally. The reason for this IS-A arrangement is that, as it happens, all conjugates represented in the UMLS
are organic chemicals (without any restrictions on the molecular sizes of the components). Second, each IST satisfying
Case 1 is replaced by a new type whose name expresses the
fact that it denotes a conjugate. The conjugate chemical
concepts originally assigned the particular IST are now
assigned the new conjugate type instead. Finally, the new
conjugate type is given an IS-A relationship to Conjugate. It
does not receive the original IS-As of the IST it is replacing
because the concepts to which it is assigned do not exhibit
the individual semantics of the underlying OSTs. However,
to preserve the link between a conjugate concept and the
constituent chemical concepts underlying the conjugate’s
derivation, a new semantic relationship named has component is substituted for those discarded IS-As.
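As a rough illustration of the Case 1 remodeling, the following Python sketch represents a type as a record with IS-A parents and has component targets. The `RsnType` class and `make_conjugate_type` helper are hypothetical names introduced only for this sketch; the type names are taken from the paper's Eicosanoid example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RsnType:
    """A type in the RSN, reduced to its name and outgoing relationships."""
    name: str
    is_a: List[str] = field(default_factory=list)
    has_component: List[str] = field(default_factory=list)


def make_conjugate_type(name, constituent_osts):
    """Build the replacement for a Case 1 IST.

    The new type has a single IS-A (to Conjugate); the IST's former
    IS-As to its constituent OSTs become has component relationships.
    """
    return RsnType(name, is_a=["Conjugate"], has_component=list(constituent_osts))


# Conjugate itself is defined as a child of Organic Chemical.
CONJUGATE = RsnType("Conjugate", is_a=["Organic Chemical"])

eicosanoic = make_conjugate_type(
    "Eicosanoic-peptide or Eicosanoic-protein Conjugate",
    ["Eicosanoid", "Amino Acid, Peptide, or Protein"],
)
print(eicosanoic)
```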
The name of the conjugate type is constructed by transforming and combining the names of the IST’s constituent OSTs
into a new composite word-form that is used as a modifier
for the word “Conjugate.” The International Union of Pure
and Applied Chemistry’s (IUPAC’s) nomenclature system10
is followed when names are available. Some examples are
Glycolipid, for Carbohydrate ∩ Lipid; Glyco-amino-acid, Glycopeptide, or Glycoprotein, for Carbohydrate ∩ Amino Acid,
Peptide, or Protein; Lipoprotein, for Lipid ∩ Amino Acid,
Peptide, or Protein; and Nucleoprotein, for Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein.11
Names of other conjugates were taken from the chemistry
literature. Samples include: Lipo-amino-acid (in the form of
one word),12 Lipo-nucleo-protein (in the form of one
word),13 Liponucleotide,14 Peptidopolysaccharide,15 Phosphoamino Acid,16 Phosphopeptide,17 Polysaccharide-peptide,18
Proteopolysaccharide,19 Steroid-amino-acid, Peptide, or Protein.20-22
If common name formulations are not available, we combine
the OSTs’ names using the following left-to-right ST prioritization: Carbohydrate; Lipid; Nucleic Acid, Nucleoside, or
Nucleotide; and Amino Acid, Peptide, or Protein. In this
prioritization, we employ the rule that shorter names are
placed to the left (as adjectives) of the longer names.
(Exceptions to this rule occur in two special situations where
the relative sizes of each chemical component matter.)
An IST satisfying Case 2 is divided up and replaced by a pair
of new types, one for its concepts representing conjugates
and one for its concepts representing complexes. The creation of the conjugate type follows the handling of Case 1,
with all the IST’s conjugate chemical concepts being exclusively re-assigned the new conjugate type. The modeling of
the IST’s complex chemical concepts follows a similar pattern to that of the conjugates. First, the RSN is extended to
include the type Complex, whose definition is:
A chemical mixture produced by physically mixing two or
more chemical compounds without chemical reactions. They
are held together by nonspecific intermolecular forces, not by
chemical bonds (covalent bonds).
As with Conjugate, the new type Complex is made a child
of Organic Chemical because all complex chemicals represented in the UMLS consist of mixtures of organic compounds, such as steroids, nucleic acids, proteins, etc. (In Yu
et al.,23 we also find a proposed type “complex” defined as
a child of Chemical Viewed Structurally.) Next, a second
new type, accompanying the conjugate type, is defined to
replace the IST. The complex chemical concepts originally
assigned several CSTs, the intersection of which created the
particular IST, are now exclusively assigned the new complex type instead. The name of the complex type is derived
in a manner similar to that for its companion conjugate type,
with “Complex” used instead of “Conjugate.” Lastly, the
new type is placed in the IS-A hierarchy by giving it the
same IS-As (there must be more than one) as the IST from
which it was derived. It should be noted that the IST’s
existing IS-As are preserved for the complex type because its
assigned concepts, representing chemical complexes, still
exhibit the semantics of the OSTs.
As an example of Case 1, the IST Eicosanoid ∩ Amino Acid,
Peptide, or Protein has assigned concepts that all represent
conjugates. One of them is 6-ketoprostaglandin F1 alpha-thyroglobulin conjugate. The IST is replaced by the new conjugate
type Eicosanoic-peptide or Eicosanoic-protein Conjugate,
which has an IS-A to Conjugate, but not to the IST’s parents,
Eicosanoid or Amino Acid, Peptide, or Protein. Instead,
there are has component relationships directed from Eicosanoic-peptide
or Eicosanoic-protein Conjugate to Eicosanoid and Amino
Acid, Peptide, or Protein. This modeling is illustrated in
Figure 2.
Figure 2. New conjugate type Eicosanoic-peptide or Eicosanoic-protein Conjugate.
An example of Case 2 is Amino Acid, Peptide, or Protein ∩
Lipid, assigned to some concepts that correspond to conjugates and others that represent complexes. Part of its replacement is the new conjugate type Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate. The other part is the
new complex type Lipopeptide or Lipoprotein Complex.
Eighty-two of Amino Acid, Peptide, or Protein ∩ Lipid’s 121
assigned concepts denote conjugates and are thus re-assigned
Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate.
One of them is the conjugate N-dodecanoyl serine, mentioned
above. The remaining 39 assigned concepts stand for complexes and are therefore re-assigned Lipopeptide or Lipoprotein Complex. Virosomes is an example. The complete remodeling of Amino Acid, Peptide, or Protein ∩ Lipid is
illustrated in Figure 3. It should be noted that Lipopeptide
or Lipoprotein Complex has three parents, while Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate has
just one. The conjugate type does have two has component
relationships.
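The Case 2 split can be sketched in Python as follows. The dictionary layout and the function name `split_case2_ist` are hypothetical. The sketch also assumes that the third parent of the complex type is Complex itself, which the text implies but does not state outright; only the two example concepts named above are used, not the full extents.

```python
def split_case2_ist(ist_parents, labelled, conjugate_name, complex_name):
    """Replace a Case 2 IST with a conjugate type and a complex type.

    ist_parents: the IST's original IS-A parents.
    labelled: mapping from concept name to 'conjugate' or 'complex'.
    """
    conjugate = {
        "name": conjugate_name,
        "is_a": ["Conjugate"],                    # single parent
        "has_component": list(ist_parents),       # replaces the IST's IS-As
        "extent": sorted(c for c, k in labelled.items() if k == "conjugate"),
    }
    complex_ = {
        "name": complex_name,
        # Assumption: the IST's own IS-As are kept and Complex is added.
        "is_a": list(ist_parents) + ["Complex"],
        "has_component": [],
        "extent": sorted(c for c, k in labelled.items() if k == "complex"),
    }
    return conjugate, complex_


conj, comp = split_case2_ist(
    ["Amino Acid, Peptide, or Protein", "Lipid"],
    {"N-dodecanoyl serine": "conjugate", "Virosomes": "complex"},
    "Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate",
    "Lipopeptide or Lipoprotein Complex",
)
print(len(conj["is_a"]), len(conj["has_component"]), len(comp["is_a"]))
```

Under these assumptions the parent and relationship counts agree with the text: one parent and two has component relationships for the conjugate type, and three parents for the complex type.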
Different Kinds of Invalid ISTs
Cases 1 and 2 imply changes to the structure of the RSN that
enhance the view it affords. Case 3 represents a violation of
the rules defined in the Semantic Network itself. Therefore,
not only should the RSN be changed in that case, but the
assigned UMLS modeling must be corrected, too. Actually,
we find two distinct scenarios regarding Case 3. One is
where a concept is assigned an IST in explicit violation of the
instructions in the definition or usage notes of an ST. For
example, two chemicals categorized respectively by the
structurally viewed CSTs Carbohydrate and Lipid combine,
most of the time, to create a glycolipid (which is a conjugate). According to the usage note of the ST Carbohydrate,
“glycolipid should only be typed as Lipid.” Hence, the 110
concepts representing glycolipids (e.g., N-octanoylglucosylceramide), among the 126 total concepts assigned Lipid and
Carbohydrate, should be assigned only Lipid.
Figure 3. Remodeling conjugate and complex chemicals originally assigned Amino Acid, Peptide, or Protein ∩ Lipid in the RSN.
A second example can be seen with a concept assigned
Element, Ion, or Isotope and an organic CST, such as
Steroid, Carbohydrate, etc. According to the UMLS definition of Element, Ion, or Isotope, it “does not include organic
ions such as iodoacetate to which the type ‘Organic Chemical’ is assigned.” Therefore, such a concept should only be
assigned the organic CST. For example, the concept isopropyl
trimethacryltitanate is assigned Steroid and Element, Ion, or
Isotope, but it should only be assigned the former. Thus,
Steroid ∩ Element, Ion, or Isotope should not appear in the
RSN.
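The two exclusion rules just described can be expressed as a small correction routine. This is a sketch under stated assumptions: `ORGANIC_CSTS` lists only the organic CSTs named in the text and is deliberately incomplete, and the `is_glycolipid` flag stands in for an expert judgment that cannot be derived from the ST names alone.

```python
ELEMENT = "Element, Ion, or Isotope"

# Partial, illustrative list; the SN contains further organic CSTs.
ORGANIC_CSTS = {"Steroid", "Carbohydrate", "Lipid"}


def corrected_assignment(sts, is_glycolipid=False):
    """Return the ST set after applying the two exclusion rules."""
    sts = set(sts)
    # Rule from the definition of Element, Ion, or Isotope:
    # organic ions keep only the organic CST.
    if ELEMENT in sts and sts & ORGANIC_CSTS:
        sts.discard(ELEMENT)
    # Rule from the usage note of Carbohydrate:
    # a glycolipid should only be typed as Lipid.
    if is_glycolipid and {"Lipid", "Carbohydrate"} <= sts:
        sts.discard("Carbohydrate")
    return sts


print(corrected_assignment({"Steroid", ELEMENT}))
print(corrected_assignment({"Lipid", "Carbohydrate"}, is_glycolipid=True))
```

An IST whose concepts are all corrected in this way ends up with an empty extent and therefore drops out of the RSN.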
There are special cases of UMLS rules violations involving
an ST that is disjunctive in its construction, meaning it has
the form “X, Y, or Z.” Specifically, the problematic ST is
Amino Acid, Peptide, or Protein. This ST lumps together
concepts whose high-level semantics are related but not
identical. This becomes a problem when the chemicals
represented by these concepts are combined with chemicals
whose concepts are from another ST. In particular, this
results in rules being applicable only to a subset of the
assigned concepts. For example, in the definition of Carbohydrate, there is the following stipulation: “Excluded
are . . . glycoproteins . . .”; and in its usage note, we find:
“Glycoproteins should only be typed as ‘Amino Acid, Peptide, or Protein.’” Combinations of carbohydrates and amino
acids or carbohydrates and peptides are perfectly acceptable. Only combinations of carbohydrates and proteins are
expressly forbidden. However, this restriction is not
absolute. Some such combinations are in fact allowed when
they are not glycoproteins, depending on the proportion of
the different kinds of molecules involved. The definition of
Organophosphorus Compound, which similarly intersects
with Amino Acid, Peptide, or Protein, also contains an
explicit exclusion of phosphoproteins, comprising organophosphorus compound and protein portions.
To deal with these special situations, it is necessary to divide
the respective ISTs into multiple ISTs across the disjunctive
form and process each separately. The new ISTs, not actually
appearing in the RSN, are called partial ISTs because each
represents a portion of the underlying intersection. The IST
Carbohydrate ∩ Amino Acid, Peptide, or Protein is divided
into the three partial ISTs Carbohydrate ∩ Amino Acid
(representing the amino acid contribution to the IST, e.g.,
glucose-cysteine and (Man)5(GlcNAc)2Asn), Carbohydrate ∩
Peptide (representing the peptide contribution, e.g., histidylAH-sepharose and 6-chlorofructos-1-yl-glutathione), and Carbohydrate ∩ Protein (representing the protein contribution, e.g.,
Immunoglobulin A-sepharose 4B and N-acetylgalactosamine-bovine
serum albumin conjugate). The IST Organophosphorus Compound ∩ Amino Acid, Peptide, or Protein is broken down
similarly into three partial ISTs. All these newly derived
partial ISTs are further refined into conjugate and complex
types. The analysis, in some cases, leads to a refinement into
multiple families of conjugates. For example, there may be
concepts for two types of conjugates for a partial IST (as with
Carbohydrate ∩ Protein) that are distinguished by the
relative proportions of the two constituent chemicals. In
such a circumstance, the larger component is used as the
primary naming element of the conjugate type. In other
words, the name of the smaller chemical component comes
first as an adjective (prefix) and the name of the larger
chemical component is written last (as the suffix). One
example mentioned above for the partial IST Carbohydrate
∩ Amino Acid, namely, glucose-cysteine, containing one
sugar (carbohydrate) and one amino acid unit, is assigned
Glycoamino-acid Conjugate. Another example for this partial IST, (Man)5(GlcNAc)2Asn, containing a larger carbohydrate portion and one amino acid unit, is assigned Aminoacid-polysaccharide Conjugate.
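The division of an IST across a disjunctive ST can be sketched mechanically. The helpers below are hypothetical and simply split names of the form “X, Y, or Z” on their commas; deciding which ISTs actually need to be divided, and how each partial IST is then refined into conjugate and complex types, remains a matter of chemical analysis.

```python
import re
from itertools import product


def split_disjunctive(st_name):
    """Split an ST name of the form 'X, Y, or Z' into its alternatives."""
    parts = re.split(r",\s*(?:or\s+)?", st_name)
    return [p.strip() for p in parts if p.strip()]


def partial_ists(ist):
    """List the partial ISTs of an IST given as a sequence of ST names."""
    alternatives = [split_disjunctive(st) for st in ist]
    return [" ∩ ".join(combo) for combo in product(*alternatives)]


for name in partial_ists(["Carbohydrate", "Amino Acid, Peptide, or Protein"]):
    print(name)
```

For the IST discussed above, this yields Carbohydrate ∩ Amino Acid, Carbohydrate ∩ Peptide, and Carbohydrate ∩ Protein.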
At this point, we must raise a practical issue. The types of
the RSN, including the conjugate and complex types, are
intended to serve as broad categories that encompass the
concepts of the META. However, some ISTs are assigned
very few concepts—in some cases only one concept. It is
arguable whether such ISTs really represent first-order categories. While one could make arguments from both sides
on this issue, we have chosen to allow for the specification of
a threshold value on types’ extent sizes for the purpose of
determining whether a type qualifies for inclusion in the
RSN. In other words, if a type’s extent size is below the
threshold value, then it is excluded from the network
presentation. A type excluded in this manner is referred to
as a minor category. Let us note that the use of the threshold
value actually leaves those concepts assigned minor categories without a unique type assignment with respect to the
RSN. For those concepts with minor categories, we will need
to keep the assignments of the original types of the SN.
Of course, the choice of the threshold value is arbitrary and
depends on one’s point of view in this matter. Because of
this, we report the results of varying the threshold over a
range of different values. A higher threshold value will
increase the number of minor categories and therefore
reduce the percentage of chemical concepts (those originally assigned structurally viewed CSTs) that have unique type coverage in the RSN.
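As a minimal sketch of the thresholding rule just described, the following Python function splits a set of types into RSN types and minor categories. The function name, the dictionary layout, and the choice of sample data are illustrative assumptions rather than part of any RSN tooling; the extent sizes are the conjugate-type counts reported in Table 1.

```python
def split_by_threshold(extent_sizes, threshold):
    """Split types into RSN types and minor categories.

    A type whose extent size is below the threshold is treated as a
    minor category; all other types are kept in the RSN.
    """
    rsn_types = {}
    minor_categories = {}
    for type_name, size in extent_sizes.items():
        if size < threshold:
            minor_categories[type_name] = size
        else:
            rsn_types[type_name] = size
    return rsn_types, minor_categories


# Conjugate-type extent sizes as reported in Table 1 (illustrative input).
table1_extents = {
    "Glyco-nucleic acid, Nucleoside, or Nucleotide Conjugate": 149,
    "Steroid-nucleic acid, Nucleoside, or Nucleotide Conjugate": 8,
    "Liponucleoside or Liponucleotide Conjugate": 134,
    "Eicosanoic-peptide or Eicosanoic-protein Conjugate": 3,
    "Eicosanoic-nucleotide Conjugate": 2,
    "Glyco-lipo-nucleotide Conjugate": 1,
}

rsn_types, minor_categories = split_by_threshold(table1_extents, threshold=5)
print(sorted(minor_categories))
```

Concepts that fall into a minor category keep their original SN type assignments, so a caller would typically use the second return value to decide which concepts remain multi-typed.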
Results
ISTs with Conjugates Only
The RSN derived from the 2007AA UMLS release contains
32 ISTs involving structurally viewed CSTs as shown in
Figure 1. Six of those ISTs satisfy Case 1, that is, the chemical
concepts categorized by these ISTs all represent conjugates
(excluding miscategorized cases). (The analysis of the chemical concepts was performed by one of the authors (L.C.), a
chemistry professor, utilizing the IUPAC Compendium of
Chemical Terminology24 as a resource.) Table 1 lists these six
ISTs along with their conjugate-type replacements and their
numbers of assigned concepts in parentheses. An example
assigned conjugate concept is included for each type as well.
For example, Nucleic Acid, Nucleoside, or Nucleotide ∩
Carbohydrate is replaced by Glyco-Nucleic Acid, Nucleoside
or Nucleotide Conjugate. DNA-cellulose is one of the 149
concepts now assigned this conjugate type instead of the IST.
Let us point out that the IST Nucleic Acid, Nucleoside, or
Nucleotide ∩ Lipid is replaced by the type Liponucleoside
or Liponucleotide Conjugate, which is missing the “Liponucleic Acid” component from its name. The reason for this
omission is that there are no concepts at
all in the UMLS representing liponucleic acids. We are not
including combinations that do not have any occurrences.
Another example of this can be seen for Eicosanoid ∩
Amino Acid, Peptide, or Protein.
Table 1. ISTs Replaced by Conjugate Types Only
| IST (# Concepts) | Conjugate Type or Minor Category (# Concepts) | Example of Assigned Concept |
|---|---|---|
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Carbohydrate (149) | Glyco-nucleic acid, Nucleoside, or Nucleotide Conjugate (149) | DNA-cellulose |
| Steroid ∩ Nucleic Acid, Nucleoside, or Nucleotide (8) | Steroid-nucleic acid, Nucleoside, or Nucleotide Conjugate (8) | Cortisone-4-ara-C |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Lipid (141) | Liponucleoside or Liponucleotide Conjugate (134) | 2-octynoyl-coenzyme A |
| Eicosanoid ∩ Amino Acid, Peptide, or Protein (3) | Eicosanoic-peptide or Eicosanoic-protein Conjugate (3) | 6-ketoprostaglandin F1 alpha-thyroglobulin conjugate |
| Eicosanoid ∩ Nucleic Acid, Nucleoside, or Nucleotide (2) | Eicosanoic-nucleotide Conjugate (2) | Phytanoyl-coenzyme A |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Carbohydrate ∩ Lipid (1) | Glyco-lipo-nucleotide Conjugate (1) | UDP-3-O-(3-hydroxymyristoyl)-N-acetylglucosamine |
IST = intersection semantic type.
It should also be noted that for Nucleic Acid, Nucleoside, or
Nucleotide ∩ Lipid, the number of chemicals assigned the
conjugate type (134) is lower than the number assigned the
IST (141). The reason for this is that seven of the original
assignments were in error. (Details and explanations for
these seven erroneous concepts—and others—are discussed
below in the section Errors Identified.)
As mentioned, a threshold value is used to distinguish
between types, included in the RSN, and minor categories,
which are excluded. For the sake of demonstration, we
initially discuss the results for a threshold value of five. That
is, a conjugate type or a complex type is deemed a minor
category if it contains fewer than five concepts. Although the
minor categories do not appear in the RSN, we do report on
them, too, for the sake of completeness. They are shaded gray
in Tables 1 through 4. For example, the IST Nucleic Acid,
Nucleoside, or Nucleotide ∩ Carbohydrate ∩ Lipid is replaced by Glyco-Lipo-Nucleotide Conjugate, which is assigned to UDP-3-O-(3-hydroxymyristoyl)-N-acetylglucosamine,
the single concept previously assigned the IST. However,
Glyco-Lipo-Nucleotide Conjugate’s extent size falls under the
threshold for inclusion in the RSN.
Table 2. ISTs Replaced by Conjugate and Complex Types
| IST (# Concepts) | Conjugate Type or Minor Category: Name (# Concepts) | Example Assigned Concept | Complex Type or Minor Category: Name (# Concepts) | Example Assigned Concept |
|---|---|---|---|---|
| Amino Acid, Peptide, or Protein ∩ Lipid (121) | Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate (82) | N-stearoylhistidine | Lipopeptide or Lipoprotein Complex (39) | Virosomes |
| Steroid ∩ Amino Acid, Peptide, or Protein (39) | Steroid-amino-acid, Peptide, or Protein Conjugate (33) | Estradiol-bovine serum albumin | Steroid-peptide, or Steroid-protein Complex (5) | Lipoprotein-X cholesterol |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein (121) | Nucleo-amino-acid, Peptide, or Protein Conjugate (106) | Aspartyl adenylate | Nucleo-amino-acid, Peptide, or Protein Complex (15) | Actinomycin D-dATGCAT complex |
| Steroid ∩ Amino Acid, Peptide, or Protein ∩ Carbohydrate (8) | Steroid-glycoamino-acid or Glycoprotein Conjugate (2) | Tyrosyl-ouabain | Glyco-steroid-amino-acid, Peptide, or Protein Complex (6) | Polyspectran OS |
| Amino Acid, Peptide, or Protein ∩ Carbohydrate ∩ Lipid (7) | Glycolipoprotein Conjugate (1) | Lysozyme-glucose stearic acid monoester | Glycolipoprotein Complex (5) | Low-density lipoprotein-heparin complex |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein ∩ Lipid (2) | Lipo-nucleo-amino-acid Conjugate (1) | s-adenosyl-L-methionine N-ole-1oyltaurate | Lipo-nucleo-protein Complex (1) | RNA-proteolipid complex |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein ∩ Carbohydrate (4) | Glyco-nucleo-amino-acid or Glyco-nucleo-peptide Conjugate (2) | N(alpha)-dansyl-N(omega)-1,N(6)-etheno-ADP-ribosylarginine methyl ester | Glyco-nucleo-amino-acid or Peptide Complex (2) | Foltene |
IST = intersection semantic type.
Table 3. Most of the ISTs Not Appearing in the Remodeled RSN Due to Violation of ST Definitions or Usage Notes
| IST (# Concepts) | Correct Type or Minor Category (# Concepts) | Example Assigned Concept | Comment |
|---|---|---|---|
| Eicosanoid ∩ Carbohydrate (1) | Eicosanoid (1) | Prostaglandin-inositol cyclic phosphate | ST Carbohydrate, by its definition, is excluded from glycolipids. |
| Steroid ∩ Carbohydrate (154) | Steroid (145) | Alpha-(3-hydroxysialyl)cholesterol | ST Carbohydrate, by its definition, is excluded from glycolipids. |
| | Steroid-Polysaccharide Conjugate (2) | Ouabain-sepharose | |
| | Carbohydrate-steroid Complex (6) | Kombetin | |
| Carbohydrate ∩ Lipid (126) | Lipid (110) | N-octanoylglucosylceramide | ST Carbohydrate, by its definition, is excluded from glycolipids. |
| | Lipopolysaccharide Conjugate (13) | Lipopolysaccharide, Escherichia coli O9 | |
| | Carbohydrate-Lipid Complex (2) | Pediatric fat emulsion 4501 | |
| Lipid ∩ Element, Ion, or Isotope (2) | Lipid (2) | 9-tellurium Te 123m heptadecanoic acid | According to its definition, ST Element, Ion, or Isotope does not include organic ions or compounds to which Organic Chemical is assigned. |
| Steroid ∩ Element, Ion, or Isotope (2) | Steroid (2) | 24-telluracholestanol | (same comment) |
| Organic Chemical ∩ Element, Ion, or Isotope (11) | Organic Chemical (11) | 15-(4-iodophenyl)-6-tellurapentadecanoic acid | (same comment) |
| Amino Acid, Peptide, or Protein ∩ Element, Ion, or Isotope (6) | Amino Acid, Peptide, or Protein (6) | Technetium Tc-99m immunoglobulin | (same comment) |
| Steroid ∩ Organophosphorus Compound (1) | Steroid (1) | 3-O-(4-nitrophenylphosphate)lithocholic acid | According to its definition, ST Organophosphorus Compound is excluded from phospholipids, sugar phosphates, phosphoproteins, nucleotides, and nucleic acids. |
| Organophosphorus Compound ∩ Carbohydrate (31) | Carbohydrate (31) | Arabitol-5-phosphate | (same comment) |
| Organophosphorus Compound ∩ Lipid (30) | Lipid (30) | Oleoyl thiophosphate | (same comment) |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Organophosphorus Compound (22) | Nucleic Acid, Nucleoside, or Nucleotide (22) | 5’-O-phosphonylmethylthymidine | (same comment) |
IST = intersection semantic type; RSN = refined semantic network; ST = semantic type.
ISTs with Both Conjugates and Complexes
Seven of those 32 ISTs in Figure 1 satisfy Case 2, that is, the
chemical concepts categorized by these ISTs represent either
conjugates or complexes. Each is thus replaced by corresponding conjugate and complex types. Table 2 lists all
seven of these ISTs along with the types replacing them. The
number of assigned concepts is shown in parentheses for
each. A concept assigned the type is shown, too. An example
is Amino Acid, Peptide, or Protein ∩ Lipid, assigned to a
total of 121 concepts. It is replaced by the conjugate type
Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate
(assigned to 82 concepts representing conjugates) and the
complex type Lipopeptide or Lipoprotein Complex (assigned to 39 concepts denoting complexes). N-stearoylhistidine is an example of a concept assigned the conjugate type,
whereas Virosomes is assigned the companion complex type.
It will be noted that the name of the complex type in this
case is missing the “Lipo-amino Acid” component due to the
fact that no concepts representing lipo-amino acids are
found in the UMLS.
The ISTs Steroid ∩ Amino Acid, Peptide, or Protein and
Amino Acid, Peptide, or Protein ∩ Carbohydrate ∩ Lipid
each were previously assigned one concept in error (see
section Errors Identified). These errors were corrected in the
process of the re-assignments, so the numbers of concepts
assigned the respective conjugate and complex types do not
add up to the numbers originally assigned the IST.
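The bookkeeping behind these re-assignments can be checked mechanically: for each IST, the conjugate count, the complex count, and the number of corrected errors should sum to the original IST count. The sketch below is a hypothetical consistency check using the Table 2 figures and the two errors noted in the text; the tuple layout and the function name are assumptions made for illustration.

```python
# (IST count, conjugate count, complex count, corrected errors), taken from
# Table 2 and from the two assignment errors mentioned in the text.
table2_counts = {
    "Amino Acid, Peptide, or Protein ∩ Lipid": (121, 82, 39, 0),
    "Steroid ∩ Amino Acid, Peptide, or Protein": (39, 33, 5, 1),
    "Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein": (121, 106, 15, 0),
    "Steroid ∩ Amino Acid, Peptide, or Protein ∩ Carbohydrate": (8, 2, 6, 0),
    "Amino Acid, Peptide, or Protein ∩ Carbohydrate ∩ Lipid": (7, 1, 5, 1),
    "Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein ∩ Lipid": (2, 1, 1, 0),
    "Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein ∩ Carbohydrate": (4, 2, 2, 0),
}


def unbalanced_ists(counts):
    """Return the ISTs whose re-assigned counts do not add up to the IST count."""
    return [
        ist
        for ist, (total, conjugates, complexes, errors) in counts.items()
        if conjugates + complexes + errors != total
    ]


mismatches = unbalanced_ists(table2_counts)
print(mismatches)
```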
The extent of the conjugate type for Steroid ∩ Amino Acid,
Peptide, or Protein ∩ Carbohydrate, for example, falls
under the threshold and is designated a minor category. The
same is true for both the conjugate type and the complex
type replacing Nucleic Acid, Nucleoside, or Nucleotide ∩
Amino Acid, Peptide, or Protein ∩ Lipid.
Exclusions and Invalid ISTs
Besides the 13 ISTs (Tables 1 and 2) that are transformed into
either conjugate types only or conjugate and complex types
in the new RSN, the other 19 ISTs involving structurally
viewed CSTs do not appear as legitimate types in the
remodeled RSN due to the noted violations of UMLS modeling rules. A portion of the exclusion rules for Carbohydrate was cited in the Methods section. Additionally, its
usage note stipulates: “Sugar phosphates should only be
typed as ‘Carbohydrate.’ Glycolipids should only be typed
as ‘Lipid.’” Examples of more rules for exclusion of ISTs are
as follows. In the definition of Organophosphorus Compound, we find: “Excluded are phospholipids, sugar phosphates, phosphoproteins, nucleotides, and nucleic acids.”
The usage note of Nucleic Acid, Nucleoside, or Nucleotide
states: “If this type has been assigned, the type ‘Organophosphorus Compound’ will not also be assigned.” And the
usage note of Organic Chemical says: “Salts of organic
chemicals . . . would be considered organic chemicals and
should not also receive the type ‘Inorganic Chemical.’”
Table 3 shows 11 ISTs that do not appear in the revised RSN
due to such violations. For example, the one concept assigned Eicosanoid ∩ Carbohydrate, namely, prostaglandin-inositol cyclic phosphate, represents a glycolipid. But such
concepts, by definition, are not to be assigned Carbohydrate.
In this situation, the assignment should be Eicosanoid only.
It will be noted that the numbers of concepts for the “correct
type” (column 2) are inconsistent with those for the ISTs
themselves in the cases of Steroid ∩ Carbohydrate and
Carbohydrate ∩ Lipid. This is due to the discovery and
subsequent correction of one assignment error with respect
to each. Again, details are given below.
Table 4. Transformations of Two ISTs Involving Amino Acid, Peptide, or Protein
| IST (# Concepts) | Partial IST (# Concepts) | Complex Type or Minor Category (# Concepts) | Example Assigned Concept | Conjugate Type or Minor Category (# Concepts) | Example Assigned Concept |
|---|---|---|---|---|---|
| Carbohydrate ∩ Amino Acid, Peptide, or Protein (303) | Carbohydrate ∩ Amino Acid (75) | Carbohydrate-amino-acid Complex (5) | Calciofix | Glycoamino-acid Conjugate (47) | Glucose-cysteine |
| | | | | Amino-acid-polysaccharide Conjugate (23) | (Man)5(GlcNAc)2Asn |
| | Carbohydrate ∩ Peptide (70) | Polysaccharide-peptide Complex (1) | Protamine heparin aggregate | Peptidopolysaccharide Conjugate (12) | Histidyl-AH-sepharose |
| | | | | Glycopeptide Conjugate (57) | 6-chlorofructos-1-yl-glutathione |
| | Carbohydrate ∩ Protein (151) | Polysaccharide-protein Complex (5) | Dermatan sulfate proteoglycan | Proteopolysaccharide Conjugate (20) | Immunoglobulin A-sepharose 4B |
| | | | | <<Glycoprotein Conjugate>> (126) | N-acetylgalactosamine-bovine serum albumin conjugate |
| Organophosphorus Compound ∩ Amino Acid, Peptide, or Protein (33) | Organophosphorus Compound ∩ Amino Acid (21) | (None) | — | Phosphoamino acid Conjugate (21) | phospho-L-arginine |
| | Organophosphorus Compound ∩ Peptide (9) | (None) | — | Phosphopeptide Conjugate (9) | Thiotepa-glutathione conjugate |
| | Organophosphorus Compound ∩ Protein (3) | (None) | — | <<Phosphoprotein Conjugate>> (3) | Phosphorylcholine-bovine serum albumin |
The symbols << >> indicate that the type is invalid and excluded from the revised RSN.
IST = intersection semantic type; RSN = refined semantic network.
As seen in Table 3, while 110 of the 126 chemical concepts
assigned Carbohydrate ∩ Lipid represent glycolipids and
are re-assigned Lipid according to the UMLS usage note of
Carbohydrate, there are two other cases involving this IST.
There are two concepts denoting complex chemicals (e.g.,
Pediatric fat emulsion 4501) that are re-assigned the new
complex type Carbohydrate-Lipid Complex. There are also
13 concepts denoting lipopolysaccharide conjugates (e.g.,
lipopolysaccharide, E coli O9), characterized by a large carbohydrate molecule, that are re-assigned the new corresponding conjugate type. Overall, Carbohydrate ∩ Lipid’s concepts are re-assigned three different types in the revised
RSN. A similar situation occurs for the IST Steroid ∩
Carbohydrate.
Invalid Partial ISTs
As noted, the two ISTs Carbohydrate ∩ Amino Acid,
Peptide, or Protein and Organophosphorus Compound ∩
Amino Acid, Peptide, or Protein must be treated as special
cases with regard to the exclusion rules due to the disjunctive nature of the ST Amino Acid, Peptide, or Protein. In
particular, each is broken down into three new partial ISTs
which are then analyzed separately. While each partial IST is
analyzed with regard to its conjugates and complexes, it is
the protein-based chemicals (i.e., glycoproteins and phosphoproteins) that are of specific interest due to the stated
exclusions. For the IST Carbohydrate ∩ Amino Acid, Peptide, or Protein, the division yields the partial ISTs Carbohydrate ∩ Amino Acid, Carbohydrate ∩ Peptide, and
Carbohydrate ∩ Protein. The first partial IST, Carbohydrate
∩ Amino Acid, includes both complex (e.g., Calciofix) and
conjugate concepts. A new complex type Carbohydrate-amino-acid Complex is defined. For the conjugate concepts,
further analysis reveals that carbohydrates combined with
amino acids yield two distinct kinds of conjugates. The
distinction between the two is mainly due to the size of the
carbohydrate: monomer (“glyco”) versus polymer (“polysaccharide”). When the carbohydrate is a monomer, we use
the type Glycoamino-acid Conjugate. An example of this is
the concept Glucose-cysteine. When the carbohydrate is a
polymer, we use the type Amino-acid-polysaccharide Conjugate. An example is (Man)5(GlcNAc)2Asn. Table 4 shows
the complete results of the transformation of Carbohydrate
∩ Amino Acid, Peptide, or Protein.
The second partial IST, Carbohydrate ∩ Peptide, also has
both complex and conjugate concepts. For the former, a
new complex type Polysaccharide-peptide Complex is
defined. For the latter, we again have the situation of two
distinct kinds of conjugates. The distinction between the
two is based on the proportional sizes of the two molecular
contributions. Peptidopolysaccharide Conjugate denotes a
smaller peptide (“peptido”) and a large carbohydrate (“polysaccharide”). An example is histidyl-AH-sepharose. Glycopeptide Conjugate denotes a smaller carbohydrate (“glyco”) and a
large peptide portion. An example of this is 6-chlorofructos-1-yl-glutathione.
For the last of the three partial ISTs, Carbohydrate ∩
Protein, there are various concepts representing complexes
and two kinds of conjugates (Table 4). The conjugates are
again distinguished by the proportional sizes of their two
molecular contributions. For example, Proteopolysaccharide is the name of the conjugate type whose concepts
represent chemicals containing a smaller protein portion
(“proteo,” prefix) and a larger carbohydrate portion (“polysaccharide”). An example is Immunoglobulin A-sepharose 4B.
In contrast, Glycoprotein is used when the carbohydrate
portion is smaller: prefix “glyco.” An example is N-acetylgalactosamine-bovine serum albumin conjugate. According to
the UMLS specifications noted above, the conjugate type
Glycoprotein is in fact invalid. The 126 concepts warranting
its assignment should be assigned Amino Acid, Peptide, or
Protein instead. Thus, Glycoprotein does not appear in the
revised RSN. Overall, Carbohydrate ∩ Protein is replaced
by one conjugate type and an accompanying complex type
(Table 4).
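The naming convention used for these two-component conjugate types (smaller component as prefix, larger component as suffix) can be sketched as a small function. The prefix and suffix tables below are assumptions that cover only the carbohydrate, peptide, and protein names discussed in this section; they are not a general nomenclature, and the function name is hypothetical.

```python
# Combining forms taken from the type names discussed above. These tables are
# illustrative and cover only these three kinds of components.
PREFIX_FORM = {"carbohydrate": "Glyco", "peptide": "Peptido", "protein": "Proteo"}
SUFFIX_FORM = {"carbohydrate": "polysaccharide", "peptide": "peptide", "protein": "protein"}


def conjugate_type_name(smaller, larger):
    """Name a two-component conjugate type.

    The smaller component supplies the prefix and the larger component
    supplies the suffix, following the convention described in the text.
    """
    return f"{PREFIX_FORM[smaller]}{SUFFIX_FORM[larger]} Conjugate"


names = [
    conjugate_type_name("peptide", "carbohydrate"),
    conjugate_type_name("carbohydrate", "peptide"),
    conjugate_type_name("protein", "carbohydrate"),
    conjugate_type_name("carbohydrate", "protein"),
]
print(names)
```

The last name, Glycoprotein Conjugate, is generated here only to illustrate the pattern; as explained above, that type is invalid under the UMLS rules and does not appear in the revised RSN.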
The IST Carbohydrate ∩ Amino Acid, Peptide, or Protein
had an additional seven assigned concepts that were discovered to be in error in the course of remodeling and re-assignment. These are listed in Table 5 (with brief explanations), along with the other concept errors found during the
revision of the RSN.
The IST Organophosphorus Compound ∩ Amino Acid,
Peptide, or Protein is also broken down into three partial
ISTs, with particular interest in the phosphoproteins:
Organophosphorus Compound ∩ Amino Acid, Organophosphorus Compound ∩ Peptide, and Organophosphorus Compound ∩ Protein. In all three cases, the chemical
combinations are strictly conjugates, so no complex types
are required. The first two partial ISTs are replaced by the
conjugate types Phosphoamino Acid Conjugate (example
concept: phospho-L-arginine) and Phosphopeptide Conjugate (example: thiotepa-glutathione conjugate), respectively.
The prefix “phospho” is used to convey the organophosphorus portion of the compounds. The last partial IST, Organophosphorus Compound ∩ Protein, is assigned to the
phosphoprotein concepts such as phosphorylcholine-bovine
serum albumin. However, that is in violation of the UMLS
assignment rules. In fact, by the definition of the ST Organophosphorus Compound, only Amino Acid, Peptide, or
Protein should be assigned to the three phosphoprotein
concepts found. So, no type named “Phosphoprotein Conjugate” is included in the RSN. The complete transformation
of Organophosphorus Compound ∩ Amino Acid, Peptide, or Protein is summarized in Table 4.
Errors Identified
Table 5. Type-Assignment Errors Discovered during RSN Remodeling
| IST | Concepts in Error |
|---|---|
| Steroid ∩ Amino Acid, Peptide, or Protein | 3-hydroxy-17-(1H-1,2,3-triazol-1-yl)androsta-5,16-diene |
| Amino Acid, Peptide, or Protein ∩ Carbohydrate ∩ Lipid | Difucosyl lactosamine |
| Carbohydrate ∩ Amino Acid, Peptide, or Protein | Zinc (II)-iminodiacetate agarose; RIG 200; Methyl-2,3,4-tris-O-(N-2,3-di(hydroxyl)benzoyl)aminopropyl)glucopyranoside; Aurantoside D; MEN 4901; Glyceryl glyphosate, 2-propanamine (1:1); Glyceryl glyphosate, disodium salt |
| Steroid ∩ Carbohydrate | 1,2-4,5-di-O-isopropylidene-3-C-(5-phenyl-1,2,4-oxadiazol-3-yl)-beta-D-psicopyranose |
| Carbohydrate ∩ Lipid | 17-glucuronosylestradiol |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Lipid | Dodecaglycerol-thymine; Dodecaglycerol-adenine; (GlyA-dT)10; GlyT-(GlyA-GlyT)9; Formyl-coenzyme A; 2’-deoxycytidine-diphosphate-diglyceride; Guanosine 5’-(5’-deoxyadenosylcobinamide pyrophosphate |
Comment column entries, in source order: No Amino Acid, Peptide, or Protein component; No Carbohydrate component; No Steroid component; Lipid should be replaced by Steroid; No Lipid component.
IST = intersection semantic type; RSN = refined semantic network.
Figure 4. The complex types of the RSN.
As noted previously, in various contexts, existing type-assignment errors were encountered during the re-assignment of conjugate and complex types to concepts in the
revised RSN. Overall, 18 such errors were discovered. Most
of them are due to a lack of one of the asserted component
types. For example, the concept 1,2-4,5-di-O-isopropylidene-3-C-(5-phenyl-1,2,4-oxadiazol-3-yl)-beta-D-psicopyranose was previously assigned the IST Steroid ∩ Carbohydrate. However,
the chemical has no steroid component, and the concept
should never have been assigned Steroid in the first place.
As another example, 3-hydroxy-17-(1H-1,2,3-triazol-1-yl)androsta-5,16-diene was assigned the IST Steroid ∩ Amino Acid,
Peptide, or Protein. However, this chemical has no “amino
acid, peptide, or protein” component. In the case of Glyceryl
glyphosate, disodium salt, assigned Carbohydrate ∩ Amino
Acid, Peptide, or Protein, the chemical has no carbohydrate
component. The concept 17-glucuronosylestradiol was assigned Carbohydrate ∩ Lipid but should have been assigned Carbohydrate ∩ Steroid. Of the 18 errors, 14 (seven
each) are associated with the ISTs Nucleic
Acid, Nucleoside, or Nucleotide ∩ Lipid and Carbohydrate
∩ Amino Acid, Peptide, or Protein. Table 5 lists all 18 errors
along with a brief explanation for each.
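Most of the errors above share one pattern: a concept was assigned an ST for which the chemical has no corresponding component. A hypothetical check for this pattern is sketched below. The component sets are hand-entered assumptions based on the examples in the text, and the function and variable names are invented for illustration; the actual review reported here was performed manually by a domain expert.

```python
def unsupported_types(asserted_types, actual_components):
    """Return asserted STs that have no matching component in the chemical."""
    return sorted(set(asserted_types) - set(actual_components))


# Hand-entered from the examples discussed in the text (assumed inputs).
psicopyranose_asserted = {"Steroid", "Carbohydrate"}
psicopyranose_components = {"Carbohydrate"}  # the text reports no steroid component

estradiol_asserted = {"Carbohydrate", "Lipid"}
estradiol_components = {"Carbohydrate", "Steroid"}  # should have been Carbohydrate ∩ Steroid

print(unsupported_types(psicopyranose_asserted, psicopyranose_components))
print(unsupported_types(estradiol_asserted, estradiol_components))
```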
Proposed RSN
Figure 4 shows part of the new version of the RSN rooted at
Chemical Viewed Structurally with some of the complex
types. Again, this RSN was generated with respect to an
extent-size threshold value of five. That is, ISTs whose
extents have fewer than five concepts are omitted. The issue of
a higher threshold value is considered below. Such a higher
value would imply fewer such types in Figures 4 and 5. All the
structurally viewed CSTs appear in the upper part of the figure
above the broken line. The complex types appear in the
lower part. Complex is a child of Organic Chemical and
the parent of all complex types. As noted, there are eight
complex types. Each complex type preserves the IS-A relationships of the original corresponding IST (in Figure 1) to
its constituent structurally viewed CSTs. We note that the
modeling of the complex types follows the modeling of
Figure 3.
Figure 5 shows the part of the new version of the RSN
rooted at Chemical Viewed Structurally with some of the
conjugate types. All the structurally viewed CSTs appear
in the upper part of the figure above the broken line. The
conjugate types appear in the lower part of the figure.
Conjugate is a child of Organic Chemical and the parent
of all conjugate types. As noted, there are 14 conjugate
types. Each conjugate type has “has component” relationships
to its constituent CSTs. We note that Figure 2 is embedded
in Figure 5.
Figure 5. The conjugate types of the RSN.
Regarding the issue of types with small extents, let us note
that in the current SN, some nonleaf types fall into this
group. In fact, most of the concepts in the META are
assigned leaf types in order to categorize them as specifically
as possible. The RSN’s complex types and conjugate types
are indeed leaf types, and one could reasonably expect them
to be assigned to a relatively significant number of concepts
if they are to warrant the designation of “broad category.”
But even in the SN, there are nine leaf types whose extents
only have between 25 and 100 concepts. As noted, to allow
flexibility, we have adopted the use of a threshold value and
designated types whose extent sizes fall below it as “minor
categories.” The version of the RSN rooted at Chemical
Viewed Structurally reported above was derived with a
threshold of five. We could, of course, have chosen another
threshold value of, say, 10, 25, 50, or 100. There is a tradeoff
between lowering the number of conjugate types and complex types (which follows the raising of the threshold)
and the accuracy of capturing all possible conjugate and
complex chemicals by unique explicit types. Table 6 illustrates this tradeoff. For example, if we choose a minimum
extent size of 10 concepts (third row), then we wind up with
only 25 types as compared to the 45 types of the unrestricted
RSN (first row) and the 33 types of the RSN with a threshold
of five (second row). This represents a 44% reduction and a
24% reduction, respectively, in the number of types. The 25
types are collectively assigned to a total of 86,201 concepts.
This choice thus results in 69 (0.08%) of the concepts not
having a unique type. For such concepts, we would have to
fall back on the original multi-typing arrangement in
order to accommodate them with high-level categorizations.
It is up to a user for whom the conjugate/complex distinction is pertinent to decide on a value that balances the
tradeoff in a way they see fit.
Table 6. Effect of Threshold Values on the Size of the Chemical Viewed Structurally Portion of the RSN
| Extent-size Threshold | # Types (OSTs, Complex Types, and Conjugate Types) | # Concepts Covered | % Covered |
|---|---|---|---|
| 1 | 45 | 86,270 | 100.00 |
| 5 | 33 | 86,250 | 99.98 |
| 10 | 25 | 86,201 | 99.92 |
| 25 | 19 | 86,097 | 99.80 |
| 50 | 16 | 85,978 | 99.66 |
| 100 | 14 | 85,839 | 99.50 |
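The percentages quoted for the threshold of 10 follow directly from the rows of Table 6. The short Python sketch below reproduces that arithmetic; the dictionary layout and function names are illustrative assumptions, while the numbers are those of Table 6.

```python
# Rows of Table 6: threshold -> (number of types, concepts covered).
table6 = {
    1: (45, 86270),
    5: (33, 86250),
    10: (25, 86201),
    25: (19, 86097),
    50: (16, 85978),
    100: (14, 85839),
}


def type_reduction_pct(baseline_threshold, new_threshold):
    """Percentage reduction in the number of types between two thresholds."""
    base_types, _ = table6[baseline_threshold]
    new_types, _ = table6[new_threshold]
    return 100.0 * (base_types - new_types) / base_types


def uncovered(threshold):
    """Concepts left without a unique type, as a count and as a percentage."""
    total = table6[1][1]  # the unrestricted RSN covers every concept
    missing = total - table6[threshold][1]
    return missing, 100.0 * missing / total


print(round(type_reduction_pct(1, 10)), round(type_reduction_pct(5, 10)))
print(uncovered(10)[0], round(uncovered(10)[1], 2))
```

The same two functions can be applied to any other row to see how quickly the type count drops relative to the loss of unique coverage.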
Table 6 abbreviations: OST = original semantic type; RSN = refined semantic network.
Discussion
Significance and Impact
Various kinds of semantic combinations are possible when a
chemical concept is assigned several structurally viewed
CSTs. This led to the creation of new complex and conjugate
types for the RSN, such as Glyco-lipo-nucleotide Conjugate
and Glyco-steroid-amino-acid, peptide, or protein Complex. In this way, the RSN provides a more precise abstraction for legitimate combinations of structurally viewed
CSTs.
When the SN was first introduced, it was noted that it did
not have particularly great depth.2 The expressed expectation was for the creation of additional depth during further
development. The modifications suggested in this article are
a step toward that goal. They constitute a natural increase in
the depth of the subnetwork rooted at Chemical Viewed
Structurally.
Two different kinds of combinations of structurally viewed
CSTs that are violations of UMLS rules were presented and
excluded from the revised RSN. The impact of these exclusions is the prevention of assigning “illegal” ST combinations involving structurally viewed CSTs to concepts representing chemicals that are formed from combinations of
other chemicals. Using the RSN, an editor will only be able
to choose legitimate semantic combinations that appear
explicitly as conjugate or complex types.
As an example of the first kind of violation, we see in Table
3 that 30 concepts were assigned the combination of Organophosphorus Compound and Lipid. By the definition of
Organophosphorus Compound, such a concept should only
be assigned Lipid. A UMLS editor would not be able to
assign such an illegal combination of STs because it is not
reified as a type of its own in the RSN, where unique type
assignments are required. Another example of a violation in
Table 3 is the combination of two exclusive types, Organic
Chemical and Element, Ion, or Isotope, assigned to 11
concepts. By the definition of Element, Ion, or Isotope, such
a combination is considered an organic chemical due to the
organic component of the chemical, and no assignment of
Element, Ion, or Isotope should be made. Again, not having
such a type in the RSN will prevent an editor from assigning
this illegal combination. Similarly, all 361 concepts assigned
structurally viewed CSTs in Table 3 in violation of UMLS
rules would not have been assigned the CST causing the
violation.
Furthermore, when an editor is faced with the task of
assigning structurally viewed CSTs to chemical concepts,
the task will be streamlined by offering only the legitimate
combinations with understandable names for complex
chemicals and conjugate chemicals. In this way, not only
will many errors be prevented, but we expect the labor-intensive type-assignment process to become more efficient.
Toward this end, we have designed a decision tree (Figure 6)
that can be utilized by an editor when trying to determine
the appropriate conjugate type or complex type to be
assigned to a chemical concept representing a chemical
composed of two or more other chemicals. This decision tree
is with respect to the RSN derived using an extent-size
threshold value of five. A slightly revised decision tree
would be required for a different threshold value.
Users will benefit from the correction of existing ST assignment errors and the prevention of new ones for chemical
concepts. In a recent UMLS study,25 there were two questions pertaining to the extent to which a user is bothered by
a list of 12 kinds of errors. Among the errors related to
aspects of a concept, the highest concern was for incorrect
STs. Therefore, auditing the META for ST assignments is
imperative to ensure the overall quality and usability of the
UMLS.
The suggested modeling of compound chemicals in the RSN
framework will also facilitate user comprehension of such
chemical concepts. The suggested categorization of chemical
concepts with a unique type will help users see which chemicals are obtained from a combination of other chemicals
assigned different structurally viewed CSTs. Furthermore, the
categorization will explicitly specify the nature of the combination, either as complex or conjugate.
Figure 6. Decision tree for assigning a type to a chemical concept in the context of the RSN.
Size and Scope of the RSN
Although there are various advantages to the RSN, in
general, and its finer-grained modeling of structurally
viewed CSTs, in particular, one also has to consider practical
consequences of its physical implementation. One such
consequence is that the number of types in the unrestricted
(i.e., threshold value 1) RSN, 690, is about five times the
current number of types of the SN, 135. Although the RSN
qualifies as a compact abstraction of the META’s 1.5 million
concepts, the RSN is not as compact as the SN. The RSN’s
increased size does have implications for its pictorial display, either as a diagram or as an indented list. The threshold value can certainly be increased to reduce the number of
types. For example, thresholds of 100 and 25 yield a total of
214 and 282 types, respectively.
As an even more conservative approach, one could opt
simply to augment the SN itself with just two new types:
Conjugate and Complex. In such an arrangement, a conjugate concept originally assigned, say, STs X and Y would
instead be assigned X, Y, and Conjugate to make its status as
a conjugate explicit. Complexes would be treated analogously. Of course, one would need to provide a set of
conventions and guidelines for categorizing the corresponding concepts along these lines.
An alternative to actually materializing the RSN, with its
ISTs, complex types, and conjugate types, is to implement it
in a virtual manner. That is, concepts will continue to be
assigned multiple types, but additional defined constraints
will forbid some illegal combinations of types, without
having to resort to the creation of explicit new types.
Examples of such constraints might include formalizations
of prohibitions found in ST definitions and usage notes, as
well as those for exclusive types and redundant type assignments.26 In upcoming work, we will describe such a virtual
RSN framework.
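One way such constraints might be checked is sketched below. This is an illustration only, not the virtual RSN framework itself: the names `PARENT`, `EXCLUSIVE`, and `violations` are hypothetical, and the hierarchy is a small fragment (Steroid and Eicosanoid as children of Lipid, itself a descendant of Organic Chemical) used to show the two kinds of checks, exclusive types and redundant assignments.

```python
from itertools import combinations

# Fragment of the SN hierarchy (child -> parent); illustrative, not complete.
PARENT = {
    "Steroid": "Lipid",
    "Eicosanoid": "Lipid",
    "Lipid": "Organic Chemical",
}

# Pairs of types that may not be assigned together (illustrative).
EXCLUSIVE = {frozenset({"Steroid", "Eicosanoid"})}


def ancestors(semantic_type):
    """Return all ancestors of a type within the PARENT fragment."""
    found = set()
    while semantic_type in PARENT:
        semantic_type = PARENT[semantic_type]
        found.add(semantic_type)
    return found


def violations(assigned):
    """List (reason, type_a, type_b) for each forbidden pair in `assigned`."""
    problems = []
    for a, b in combinations(sorted(assigned), 2):
        if frozenset({a, b}) in EXCLUSIVE:
            problems.append(("exclusive", a, b))
        elif a in ancestors(b) or b in ancestors(a):
            problems.append(("redundant", a, b))
    return problems


print(violations({"Steroid", "Eicosanoid"}))
print(violations({"Organic Chemical", "Steroid"}))
print(violations({"Steroid", "Carbohydrate"}))
```

Under this design, concepts keep their multiple assignments and only the checker changes when a new prohibition is formalized.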
Let us compare the distribution of the extents of structurally
viewed CSTs in the SN and the RSN. Table 7 shows, for each
such CST, its extent size in the SN and that of its corresponding OST in the RSN. (Recall that concepts assigned multiple
types in the SN are removed from OST extents in the context
of the RSN.) As we see, only about 30% (85,450 of the total
of 279,995 concepts) of those extents in the SN
carry over to the RSN. Of those, 820 concepts are in ISTs
involving two or more chemically viewed CSTs. In this
article, their modeling has been revised to be conjugate or
complex types. The majority of concepts assigned a structurally viewed CST are also assigned a functionally viewed
CST. In upcoming work, we will discuss the representation
of such intersections with respect to the RSN.
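The "about 30%" figure can be rechecked from the extent sizes reported in Table 7. The short sketch below is only an arithmetic check; the dictionary name `TABLE_7` is hypothetical, and the pairing of each number with its row was reconstructed from the flattened table so that both columns add up to the reported totals.

```python
# (extent size in SN, extent size of corresponding OST in RSN), per Table 7.
TABLE_7 = {
    "Chemical Viewed Structurally": (376, 239),
    "Organic Chemical": (134_424, 47_866),
    "Steroid": (9_271, 4_638),
    "Eicosanoid": (1_163, 527),
    "Nucleic Acid, Nucleoside, or Nucleotide": (7_819, 3_752),
    "Organophosphorus Compound": (2_212, 807),
    "Amino Acid, Peptide, or Protein": (103_188, 16_129),
    "Carbohydrate": (9_376, 5_312),
    "Lipid": (5_753, 3_306),
    "Element, Ion, or Isotope": (1_312, 796),
    "Inorganic Chemical": (5_101, 2_078),
}

sn_total = sum(sn for sn, _ in TABLE_7.values())
rsn_total = sum(rsn for _, rsn in TABLE_7.values())
carry_over = rsn_total / sn_total

print(f"{rsn_total:,} of {sn_total:,} carry over ({carry_over:.1%})")
```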
Table 7. Comparison of Extent Sizes in the SN and the RSN

Structurally Viewed Chemical ST             # Concepts in Extent in SN   # Concepts in Extent in RSN
Chemical Viewed Structurally                376                          239
Organic Chemical                            134,424                      47,866
Steroid                                     9,271                        4,638
Eicosanoid                                  1,163                        527
Nucleic Acid, Nucleoside, or Nucleotide     7,819                        3,752
Organophosphorus Compound                   2,212                        807
Amino Acid, Peptide, or Protein             103,188                      16,129
Carbohydrate                                9,376                        5,312
Lipid                                       5,753                        3,306
Element, Ion, or Isotope                    1,312                        796
Inorganic Chemical                          5,101                        2,078
Total                                       279,995                      85,450

RSN = refined semantic network; SN = semantic network; ST = semantic type.

The distribution of concepts in ISTs involving exactly two
structurally viewed CSTs is shown in Table 8, which is laid
out in two dimensions. (The abbreviations appearing in the
column headings are defined in the corresponding row
labels.) An entry in the table of the form “x, y” indicates that
the IST involving the two respective types had x conjugates
and y complexes. For example, Steroid ∩ Amino Acid,
Peptide, or Protein had 33 conjugates and five complexes
(see also Table 2). An entry of “X” indicates a combination
forbidden due to exclusiveness, e.g., Steroid and Eicosanoid
(the children of Lipid), or redundancy, e.g., Organic Chemical and any of its descendants. Another reason for an “X”
entry is due to the definitions and usage notes of the STs in
the UMLS, e.g., for Organophosphorus Compound with all
STs except Amino Acid, Peptide, or Protein as was shown
in Table 3.
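The entry notation described above is simple enough to read mechanically. The following sketch assumes each table cell is available as a string; the function name `parse_entry` is hypothetical and is offered only to make the “x, y” convention concrete.

```python
def parse_entry(entry):
    """Interpret one cell of the two-dimensional IST table.

    Returns (conjugates, complexes) for an 'x, y' entry, or None for
    'X' (forbidden combination) and 'N/A' (a type paired with itself).
    """
    entry = entry.strip()
    if entry in ("X", "N/A"):
        return None
    conjugates, complexes = entry.split(",")
    return int(conjugates), int(complexes)


# The example from the text: Steroid with Amino Acid, Peptide, or Protein.
print(parse_entry("33, 5"))
print(parse_entry("X"))
```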
Let us further comment on one of the STs of the SN with a
disjunctive form of related chemicals. In the analysis of ISTs
involving Amino Acid, Peptide, or Protein, we needed to
create partial ISTs. This raises the question of whether it is
better to lump those chemicals into a single type, as in the SN,
or to separate them into three types: Amino
Acid, Peptide, and Protein. It seems that a finer-grained categorization with multiple types is better suited
for clarifying the nature of the chemicals and the subtleties
of their interactions with other kinds of chemicals. To a
lesser degree, the question also arises regarding Nucleic
Acid, Nucleoside, or Nucleotide. Such a separation into different types would make the names of some conjugate and
complex types simpler and clearer.
Another Application of the RSN
One possible application of the modeling of the structurally
viewed CSTs in the RSN is as an upper-level categorization
mechanism for an ontology of chemicals, in the same capacity that the SN serves the META. A natural candidate for this
is ChEBI,6 an OBO ontology27 that models chemicals. It
consists of 31,168 concepts. In the following, we examine this
potential RSN usage in more detail.
We found that four of the RSN’s conjugate types, Glycopeptide, Lipopeptide, Lipoprotein, and Lipopolysaccharide,
appear as concepts in ChEBI and also in the IUPAC Gold
Book.24 In ChEBI, the concept Nucleoprotein apparently represents what we have modeled as Nucleo-amino-acid, Peptide, or Protein Complex and Nucleo-amino-acid, Peptide,
or Protein Conjugate. The ChEBI concept Polysaccharide
protein carries the name of another of our complex types.
Obviously, if names of structurally viewed CSTs are used at
the concept level of ChEBI, then naturally those concepts
would be assigned corresponding conjugate types and complex types if a two-level terminology structure were to be
utilized.

Table 8. Conjugates and Complexes in ISTs Involving Two Structurally Viewed CSTs

First Type \ Second Type                        CVS     OC      SRD     EID     NANN    OCD     AAPP    CRB     LPD     EII     IC
Chemical Viewed Structurally (CVS)              N/A     X       X       X       X       X       X       X       X       X       X
Organic Chemical (OC)                                   N/A     X       X       X       X       X       X       X       X       X
Steroid (SRD)                                                   N/A     X       8,0     X       33,5    0,8     X       X       X
Eicosanoid (EID)                                                        N/A     2,0     X       3,0     0,0     X       X       X
Nucleic Acid, Nucleoside, or Nucleotide (NANN)                                  N/A     X       106,15  149,0   134,0   X       X
Organophosphorus Compound (OCD)                                                         N/A     30,0    X       X       X       X
Amino Acid, Peptide, or Protein (AAPP)                                                          N/A     159,11  82,39   X       X
Carbohydrate (CRB)                                                                                      N/A     13,2    X       X
Lipid (LPD)                                                                                                     N/A     X       X
Element, Ion, or Isotope (EII)                                                                                          N/A     X
Inorganic Chemical (IC)                                                                                                         N/A

CST = chemical semantic type; IST = intersection semantic type.
We randomly chose 10 concepts each from three new types
(two conjugate types, Lipopolysaccharide Conjugate and
Nucleo-amino-acid, Peptide, or Protein Conjugate, and one
complex type, Nucleo-amino-acid, Peptide, or Protein
Complex) and checked whether they actually appear in
ChEBI. The results were that all 10 concepts from Lipopolysaccharide Conjugate were present in ChEBI, e.g., lipid-linked oligosaccharides and lipoteichoic acid. Among the 10
concepts from Nucleo-amino-acid, Peptide, or Protein Conjugate, 6 were found, e.g., aspartyl adenylate and pacidamycin
1. Also, we found 6 of the 10 concepts from Nucleo-amino-acid, Peptide, or Protein Complex, including actinomycin
D-dATGCAT complex and enterogenin. As we see, 22 out of
our sample of 30 UMLS concepts are part of ChEBI. Those 22
could readily be assigned conjugate types or complex types
if an overarching network of categories for ChEBI were
desired.
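The tally behind the "22 out of 30" figure is shown below as a small arithmetic check. The counts come from the sample described above; the variable names are hypothetical, and the resulting percentage is a derived figure not stated in the text.

```python
# (concepts found in ChEBI, concepts sampled) for each of the three new types.
SAMPLE = {
    "Lipopolysaccharide Conjugate": (10, 10),
    "Nucleo-amino-acid, Peptide, or Protein Conjugate": (6, 10),
    "Nucleo-amino-acid, Peptide, or Protein Complex": (6, 10),
}

found = sum(hit for hit, _ in SAMPLE.values())
sampled = sum(size for _, size in SAMPLE.values())

print(f"{found} of {sampled} sampled concepts appear in ChEBI "
      f"({found / sampled:.0%})")
```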
Conclusion
The RSN has previously been introduced as a finer-granularity abstraction of the UMLS’s conceptual content. In
particular, it better represents combinations of multiple
semantic-type assignments by defining separate high-level
types, called intersection semantic types, for each. This
elevation of semantic-type combinations to first-class types
in their own right helps convey this important knowledge
more clearly. It also simplifies type assignments, as all are
unique in the context of the RSN.
The portion of the UMLS particularly benefiting from the
RSN pertains to chemicals because it is natural to combine
chemicals of different kinds and obtain new chemicals. In
this article, we further refined that part of the RSN to more
accurately convey the knowledge of chemical combinations
involving chemicals viewed structurally. Combining such
chemicals can yield simple mixtures (referred to as complexes in the field of chemistry) or more complicated chemicals derived via chemical reaction (called conjugates). The
RSN was augmented with new types to capture these
distinctions. In this way, each structurally viewed chemical
concept is assigned a unique type, whether it is an original
chemical type, a conjugate type, a complex type, or an
intersection type with a functionally viewed CST. Such a
categorization will benefit users, who will directly know that
a specific chemical is, say, a glyco lipoprotein conjugate or a
liponucleoprotein complex. Overall, this will enhance user
comprehension of the richness of the UMLS’s chemical
content.
Additionally, various violations of UMLS modeling rules, as
stipulated in semantic type definitions and usage notes,
were discovered and corrected with the removal or replacement of types appearing in the original RSN. The suggested
additions to the Semantic Network will help UMLS maintenance personnel in avoiding future type-assignment errors,
as a new chemical concept will only be permitted a single
assignment of an existing (validated) RSN type: original,
conjugate, or complex. Trade-offs between achieving a fully
accurate and a highly granular categorization of structurally
viewed chemical concepts and practical issues regarding the
number of types in the RSN, including threshold limits on
extent sizes for type-level qualification, were discussed. We
also considered the possibility of using the RSN as an
upper-level categorization network for ChEBI.
References
1. Humphreys BL, Lindberg DAB, Schoolman HM, Barnett GO. The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc 1998;5:1–11.
2. McCray AT, Hole WT. The Scope and Structure of the First Version of the UMLS Semantic Network. Los Alamitos, CA: Proc 14th Annual SCAMC 1990:126–30.
3. Schuyler PL, Hole WT, Tuttle MS, Sherertz DD. The UMLS Metathesaurus: representing different views of biomedical concepts. Bull Med Libr Assoc 1993;81:217–22.
4. Geller J, Gu H, Perl Y, Halper M. Semantic refinement and error correction in large terminological knowledge bases. Data Knowledge Eng 2003;45:1–32.
5. Peng Y, Halper M, Perl Y, Geller J. Auditing the UMLS for redundant classifications. Proc AMIA Annu Symp 2002:612–6.
6. Degtyarenko K, de Matos P, Ennis M, et al. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 2008;36(Database issue):D344–50.
7. Gu H, Perl Y, Geller J, Halper M, Liu L, Cimino JJ. Representing the UMLS as an OODB: modeling issues and advantages. J Am Med Inform Assoc 2000;7:66–80.
8. UMLS Documentation, Section 2–Metathesaurus. Available at: http://www.nlm.nih.gov/research/umls/meta2.html. Accessed August 6, 2007.
9. The UMLS Semantic Network. Available at: http://semanticnetwork.nlm.nih.gov. Accessed August 6, 2007.
10. International Union of Pure and Applied Chemistry. Available at: http://www.iupac.org. Accessed August 1, 2007.
11. IUBMB Biochemical Nomenclature and Related Documents. 2nd ed. London: Portland, 1992.
12. Surovoy A, Flechsler I, Jung G. A novel series of serum-resistant lipoaminoacid compounds for cellular delivery of plasmid DNA. Adv Exp Med Biol 1998;451:61–7.
13. Inoue Y. Studies on conjugated proteins (liponucleoprotein-system). I. The interaction between lecithin and ovalbumin. Acta Scholae Medicinalis Universitatis in Kioto 1957;34:276–84.
14. Yang VC, Turcotte JG, Steim JM. Physical properties of arabinofuranosylcytosine diphosphate diacylglycerol, an antitumor liponucleotide. Biochim Biophys Acta 1982;68:375–84.
15. Baldo BA, Fletcher TC, Pepys J. Isolation of a peptido-polysaccharide from the dermatophyte Epidermophyton floccosum and a study of its reaction with human C-reactive protein and a mouse anti-phosphorylcholine myeloma serum. Immunology 1977;32:831–42.
16. Sickmann A, Meyer HE. Phosphoamino acid analysis. Proteomics 2001;1:200–6.
17. Gatti A. Profiling substrate phosphorylation at the phosphopeptide level. Anal Biochem 2003;312:40–7.
18. You YH, Lin ZB. Antioxidant effect of Ganoderma polysaccharide peptide. Acta Pharmaceutica Sinica 2003;38:85–8.
19. Ji Z, Tang Q, Zhang J, Yang Y, Jia W, Pan Y. Immunomodulation of RAW264.7 macrophages by GLIS, a proteopolysaccharide from Ganoderma lucidum. J Ethnopharmacol 2007;112:445–50.
20. Panda S, Panda G. A new example of a steroid-amino acid hybrid: construction of constrained nine membered D-ring steroids. Org Biomol Chem 2007;5:360–6.
21. Wang C, Peng S, Zhang X, Qiu X. The synthesis and immunosuppressive effects of steroid-peptide linkers. Acta Pharmaceutica Sinica 1998;33:111–6.
22. Uscheva AA, Stankov BM, Zachariev SG, Marinova CP, Kanchev LN. Possible synthesis of steroid-protein for immunization with a fixed narrow range of the hapten-protein ratio. J Steroid Biochem 1986;24:699–702.
23. Yu H, Friedman C, Rzhetsky A, Kra P. Representing genomic knowledge in the UMLS Semantic Network. Proc AMIA Annu Symp 1999:181–5.
24. IUPAC Compendium of Chemical Terminology–The Gold Book (XML version). Available at: http://goldbook.iupac.org/. Accessed February 27, 2008.
25. Chen Y, Perl Y, Geller J, Cimino JJ. UMLS users, uses and future agenda. J Am Med Inform Assoc 2007;14:221–31.
26. McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods Inf Med 1995;34:193–201.
27. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007:1251–5.