116 Chen et al., Modeling Multi-typed Structurally Viewed Chemicals with the UMLS Refined SN

Research Paper

Modeling Multi-typed Structurally Viewed Chemicals with the UMLS Refined Semantic Network

LING CHEN, PHD, C. PAUL MORREY, MS, HUANYING GU, PHD, MICHAEL HALPER, PHD, YEHOSHUA PERL, PHD

Abstract

Objective: Chemical concepts assigned multiple “Chemical Viewed Structurally” semantic types (STs) in the Unified Medical Language System (UMLS) are subject to ambiguous interpretation. The multiple assignments may denote the fact that a specific represented chemical (combination) is a conjugate, derived via a chemical reaction of chemicals of the different types, or a complex, composed of a mixture of such chemicals. The previously introduced Refined Semantic Network (RSN) is modified to properly model these varied multi-typed chemical combinations.

Design: The RSN was previously introduced as an enhanced abstraction of the UMLS’s concepts. It features new types, called intersection semantic types (ISTs), each of which explicitly captures a unique combination of ST assignments in one abstract unit. The ambiguous ISTs of different “Chemical Viewed Structurally” ISTs of the RSN are replaced with two varieties of new types, called conjugate types and complex types, which explicitly denote the nature of the chemical interactions. Additional semantic relationships help further refine that new portion of the RSN rooted at the ST “Chemical Viewed Structurally.”

Measurements: The number of new conjugate and complex types and the amount of changes to the type assignment of chemical concepts are presented.

Results: The modified RSN, consisting of 35 types and featuring 22 new conjugate and complex types, is presented. A total of 800 (about 98%) chemical concepts representing multi-typed chemical combinations from “Chemical Viewed Structurally” STs are uniquely assigned one of the new types.
An additional benefit is the identification of a number of illegal ISTs and ST assignment errors, some of which are direct violations of exclusion rules defined by the UMLS Semantic Network.

Conclusion: The modified RSN provides an enhanced abstract view of the UMLS’s chemical content. Its array of conjugate and complex types provides a more accurate model of the variety of combinations involving chemicals viewed structurally. This framework will help streamline the process of type assignments for such chemical concepts and improve user orientation to the richness of the chemical content of the UMLS.

J Am Med Inform Assoc. 2009;16:116–131. DOI 10.1197/jamia.M2604.

Affiliations of the authors: Department of Science, BMCC, City University of New York (LC), New York, NY; CS Department, New Jersey Institute of Technology (CPM, YP), Newark, NJ; Department of Health Informatics, SHRP, University of Medicine and Dentistry of New Jersey (HG), Newark, NJ; Department of Computer Science, Kean University (MH), Union, NJ.

Supported in part by the National Library of Medicine under grant R-01-LM008445-01A2.

The authors thank Jim Cimino for his repeated feedback, and Olivier Bodenreider for pointing out examples of chemicals composed of carbohydrates and amino acid components that are valid ISTs, which initiated our analysis of partial ISTs.

Correspondence: Dr. Yehoshua Perl, CS Department, New Jersey Institute of Technology, Newark, NJ 07102-1982. e-mail: <perl@oak.njit.edu>.

Received for review 08/22/07; accepted for publication: 09/23/08.

Introduction

The Metathesaurus (META)1 and the Semantic Network (SN)2 are two fundamental knowledge resources of the Unified Medical Language System (UMLS).3 The SN, consisting of 135 broad categories called semantic types (STs), provides a high-level categorization of all 1.5 million biomedical concepts residing in the META. Each concept is assigned one or more of these STs, which serve to denote an aspect of the concept’s semantics.
The extent of an ST is the entire set of concepts to which it is assigned. If the extent of an ST contains some concepts also assigned other STs at the same time, a common occurrence, then the set will elaborate a variegated semantics. In this sense, the high-level abstract view provided by the SN does not in general show semantic uniformity for the concepts of the META included in a particular ST’s extent. For example, two concepts assigned the ST Steroid are enterodiol and lipoprotein-X cholesterol (concepts are written in italics; semantic types are capitalized and written in bold, except in tables and figures). However, their high-level semantics are not that similar. The former is assigned only Steroid, whereas the latter is multityped, assigned both Steroid and Amino Acid, Peptide, or Protein.

Journal of the American Medical Informatics Association Volume 16

The previously introduced Refined Semantic Network (RSN)4 offers a semantically uniform abstract view of the META by utilizing reification with respect to combinations of ST assignments. By reification, in this context, we mean the creation of an explicit type at the Semantic-Network level. In particular, we model all existing ST assignment combinations as separate types in their own right, called intersection semantic types (ISTs). For example, because the concept lipoprotein-X cholesterol is one of 39 concepts assigned both Steroid and Amino Acid, Peptide, or Protein with respect to the SN, the existence of that assignment combination causes the RSN to include an IST named Steroid ∩ Amino Acid, Peptide, or Protein that is lipoprotein-X cholesterol’s sole type assignment. (The symbol “∩” denotes set intersection.) All boldface terms are defined in the Glossary Appendix (available as an online data supplement).

The largest collection of ISTs, 411 in total, exists for that part of the UMLS devoted to chemicals, where two or more ST assignments per concept are common.
The 32 ISTs involving Chemical Viewed Structurally or its descendants have revealed semantic ambiguities with respect to various ST assignment combinations. Typically, an assignment of an IST involving, say, two STs to a given concept indicates that the concept has the semantics of both. However, with a multi-typed chemical concept that is viewed structurally, an ST combination may indicate a simple mixture or some implied chemical reaction. All chemicals, in fact, can be categorized as either pure substances (with definite compositions) or mixtures (without definite compositions). A conjugate is a pure substance produced through a chemical reaction involving two or more compounds (which themselves are also pure substances). The constituent moieties of a conjugate are linked together by covalent bonds. An example conjugate is avidin-adenosine monophosphate conjugate, consisting of a protein moiety, avidin, and a nucleotide moiety, adenosine monophosphate. Of interest is the fact that the constituent components of a conjugate can only be separated via a chemical reaction (i.e., a decomposition or hydrolysis reaction) that undoes the original reaction used in the conjugate’s creation. On the other hand, mixtures are made of two or more chemicals, where the chemicals are not joined by covalent bonds. Therefore, they can be mixed at different proportions (i.e., the composition can be varied). When at least one of the chemicals is a macromolecule, the mixture is called a complex. Theoretically speaking, it is possible for any two compounds (macromolecules) to form both a conjugate (via a chemical reaction) and a complex (via physical means). Virus core is an example of a complex consisting of nucleic acids and proteins. The nucleic acid is enclosed in a protective coat of protein. In contrast to a conjugate, a complex can be separated into its constituent substances without having to resort to a chemical reaction. 
The two components of the virus core can easily be separated via physical means, i.e., solvent extractions. When a virus infects a host, the protein coat helps it attach to the cell surface, and the nucleic acid component is injected into the host cells. With this distinction in mind, we see that an IST may in actuality denote chemical conjugates or complexes, whose component chemical concepts are assigned different STs in the SN.

Number 1 January / February 2009 117

In this article, we analyze the possible composite semantics elaborated by ISTs comprising chemical-viewed-structurally STs based on the nature of the chemical interactions. Following this analysis, extensions to the RSN are proposed. In particular, the RSN is augmented with new types derived from ISTs to explicitly represent conjugates and complexes, as well as their semantic relationships. Rules expressed by the UMLS in its ST definitions and usage notes concerning illegal ST combinations are used to expose concepts with erroneous ST assignment combinations and suggest proper re-assignments. Redundant assignments5 are also identified and corrected. Overall, the resulting RSN more properly elaborates the semantics of varied multi-typed chemical compositions. Its abstract view facilitates user orientation to the richness of the UMLS’s chemical content and provides maintenance personnel with an easier and more accurate framework for carrying out chemical concept categorizations.

Practical implications of adding complex types and conjugate types to the RSN are considered. Tradeoffs regarding the RSN’s overall size and coverage and various implementation options are also discussed. An alternative use of the RSN as a high-level categorization mechanism for a chemical ontology, such as ChEBI,6 is also considered.

Background

The RSN4,7 consists of two kinds of types. All STs appearing in the SN are carried over to be types in the RSN. These are referred to as original semantic types (OSTs).
The others are the intersection semantic types (ISTs), which, as noted above, are reifications of ST assignment combinations appearing in the UMLS. An IST exists for every such combination of multiple ST assignments to a concept, as defined by the UMLS’s MRSTY table.8 An IST involving two STs, say, Carbohydrate and Lipid is denoted Carbohydrate ∩ Lipid, where “∩” is the symbol for set intersection. ISTs may involve more than two STs, e.g., Steroid ∩ Amino Acid, Peptide, or Protein ∩ Carbohydrate.

It is important to note that each concept receives a single type assignment with respect to the RSN. In particular, the type assignments for OSTs tend to differ from those of their corresponding STs in the SN. A concept retains an OST assignment only if that assignment was its sole assignment previously. For example, enterodiol is assigned only Steroid in the SN and thus has the exact same assignment in the RSN. On the other hand, a concept with multiple assignments in the SN will lose all of those assignments in favor of a single IST assignment in the RSN. For example, lipoprotein-X cholesterol is assigned Steroid and Amino Acid, Peptide, or Protein. Therefore, with respect to the RSN, it will be assigned only the IST Steroid ∩ Amino Acid, Peptide, or Protein, not its two OSTs.

In this manner, the entire collection of the RSN’s types functions as a partition of the concepts of the UMLS into disjoint extents of uniform semantics. The extent of the OST Steroid contains chemical concepts that are categorized strictly as steroids (including enterodiol). The extent of the IST Steroid ∩ Amino Acid, Peptide, or Protein contains chemical concepts categorized jointly as steroids and amino acids, peptides, or proteins (including lipoprotein-X cholesterol). Overall, the chemical concepts in the UMLS 2007AA release are partitioned into 25 OSTs and 411 ISTs. In total, 108,299 concepts are assigned the 25 OSTs; 196,524 concepts are assigned the 411 ISTs.
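The partitioning described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `ASSIGNMENTS` rows are a hypothetical, hand-written stand-in for (concept, ST) pairs such as those found in MRSTY, and the function names are invented for this sketch.

```python
from collections import defaultdict

# Hypothetical (concept, semantic type) rows standing in for MRSTY content.
ASSIGNMENTS = [
    ("enterodiol", "Steroid"),
    ("lipoprotein-X cholesterol", "Steroid"),
    ("lipoprotein-X cholesterol", "Amino Acid, Peptide, or Protein"),
]


def rsn_partition(rows):
    """Group concepts by their exact set of ST assignments.

    A singleton set corresponds to an OST; a larger set corresponds to an IST.
    Each concept lands in exactly one extent, so the extents are disjoint.
    """
    sts_by_concept = defaultdict(set)
    for concept, st in rows:
        sts_by_concept[concept].add(st)

    extents = defaultdict(set)
    for concept, sts in sts_by_concept.items():
        extents[frozenset(sts)].add(concept)
    return dict(extents)


def type_name(sts):
    """Render a type name; the ordering of STs here is alphabetical by assumption."""
    return " ∩ ".join(sorted(sts))


partition = rsn_partition(ASSIGNMENTS)
for sts, extent in partition.items():
    kind = "OST" if len(sts) == 1 else "IST"
    print(kind, "|", type_name(sts), "|", sorted(extent))
```

Keying the extents on a frozen set of STs is what makes the result a partition: a concept with a given combination of assignments can only ever appear under that one key.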
There are 84,630 concepts assigned the 11 chemical-viewed-structurally OSTs and another 820 concepts assigned the 32 chemical-viewed-structurally ISTs.

The IS-A hierarchy of the RSN extends that of the SN to allow for multiple parents. In fact, every IST has more than one parent. Although we will not get into all the details of the derivations of these IS-As (see Gu et al.7), we do note that an IST always has paths of IS-A relationships leading to each of its constituent OSTs. The portion of the RSN rooted at Chemical Viewed Structurally, including 11 OSTs and 32 ISTs, is shown in Figure 1. All OSTs appear above the dashed line in the figure; all ISTs are below it. A type is drawn as a box, whereas an IS-A is a bold arrow directed from the child to the parent. Again, note that the IST Amino Acid, Peptide, or Protein ∩ Carbohydrate ∩ Lipid, for example, has three IS-As leading to its parents, a situation that would not occur in the SN.

Figure 1. The portion of the RSN rooted at Chemical Viewed Structurally.

Methods

Our methodology augments the RSN with new types and semantic relationships in order to more properly capture knowledge of chemicals. Because the focus of this work is on types in the hierarchy rooted at Chemical, let us start by introducing some terminology concerning such types. An ST that is a descendant of Chemical is called a chemical ST (CST). An ST beneath Chemical Viewed Structurally is called a structurally viewed CST, whereas one beneath Chemical Viewed Functionally is a functionally viewed CST. Lastly, an ST under Organic Chemical in the hierarchy is called an organic CST. As an example, Lipid is a structurally viewed CST; it is also an organic CST. Vitamin is a functionally viewed CST.

The UMLS definition of ST Chemical states that: “Chemicals are viewed from two distinct perspectives in the network, functionally and structurally.
Almost every chemical concept is assigned at least two types, generally one from the structure hierarchy and at least one from the function hierarchy.”9 This implies that ISTs involving CSTs should be common in the RSN, and in fact 90 of the 100 ISTs with the largest extents do include CSTs.

An IST involving functionally viewed CSTs has the expected interpretation of a logical “AND” operator. Its assigned concepts have the semantics of all the types in the conjunctive form. For example, with Vitamin ∩ Pharmacologic Substance, all concepts represent chemicals that are both vitamins and pharmacologic substances. If an IST represents a combination of one structurally viewed CST and one or more functionally viewed CSTs, then we find the same interpretation. As an example, Lipid ∩ Vitamin is assigned to concepts that indeed represent chemicals that are both lipids and vitamins. In both of these circumstances, the IST’s IS-A relationships to its constituent OSTs in the RSN help to reinforce this interpretation.

Conjugate versus Complex

The situation is different, however, in cases where two or more structurally viewed CSTs are involved. Such an IST models chemicals obtained from the combination of other chemicals. When combining two (or more) chemicals whose corresponding concepts are assigned two (or more) structurally viewed CSTs, a chemical reaction may occur and produce an entirely new chemical. Such a chemical is called a conjugate. A conjugate chemical does not necessarily have all of the properties of its source chemicals, because some of the original structural components are expended in its creation. The neutralization reaction of an acid and a base producing a salt is a simple example of this scenario. The new chemical, salt, contains parts of acid and base; however, it is neither an acid nor a base.
In this sense, a conjugate does not exhibit the semantic combination of the STs underlying its corresponding concept’s assigned IST. It, in fact, has a brand-new semantics. An example conjugate is N-dodecanoyl serine produced by a reaction of dodecanoic acid (whose concept is assigned Lipid) and serine (assigned Amino Acid, Peptide, or Protein). This chemical is neither a lipid nor an amino acid, peptide, or protein. It is also possible that the chemical combination results in a new chemical that is a mixture of the originals. In this case, the new chemical is called a complex. A complex chemical, in contrast to a conjugate, preserves the properties of its original chemicals. It has the semantic conjunctive combination of the constituent STs of the IST assigned to its corresponding concept. The concept high density lipoprotein (HDL), “the good cholesterol,” consisting of cholesterol and lipoprotein, represents a complex. Let us note that many conjugate and complex chemicals are derived from two or more chemicals of the same type, e.g., two carbohydrates or two lipids. Such “intra-type” conjugates and complexes are not dealt with in this work because the modeling provided by the SN itself suffices. That is, the assignment of an OST, say, T to a concept representing an intra-type complex or conjugate whose component chemical concepts are all assigned T, too, is appropriate. For example, Animal fat is a concept denoting a complex comprising only lipids. Hence, it is fittingly assigned OST Lipid. Another example is Dietary carbohydrates, which is composed of different kinds of carbohydrates. It is appropriately assigned Carbohydrate. In such circumstances, the OSTs offer the proper type assignments to the respective concepts, and the RSN does not need to be altered in any special way in order to accommodate them. 
In the remainder of this article, we use “conjugate” and “complex” exclusively for chemicals whose concepts were originally multi-typed with respect to the SN and have been assigned a single IST in the RSN. These are the chemicals that we propose to remodel from a type perspective.

Different Configurations of ISTs Comprising Structurally Viewed CSTs

We have identified three distinct cases concerning concepts assigned an IST involving two or more structurally viewed CSTs:

1. All of the concepts assigned the IST represent conjugates.
2. Some (but not all) of the concepts assigned the IST represent complexes; the remaining concepts represent conjugates.
3. None of the concepts should actually be assigned the IST because its combination of the two (or more) structurally viewed CSTs is semantically invalid. The IST should not exist.

Although theoretically possible, we did not encounter any case of an IST in the UMLS where all concepts represented complexes.

The interesting cases, from the perspective of further refinement of the RSN, are (1) and (2), in which the IST categorizes a representation of a combination of two or more chemicals (each of whose concepts is assigned various structurally viewed CSTs) that results in a new chemical.

For Case 1, the RSN is augmented to express knowledge of conjugates explicitly by first adding a new type called Conjugate, with the following definition:

A compound produced from a chemical reaction of two or more compounds. Such a compound consists of chemically (covalently) bonded moieties of each constituent.

Conjugate is also defined to stand in an IS-A relationship to Organic Chemical, which is a child of Chemical Viewed Structurally. The reason for this IS-A arrangement is that, as it happens, all conjugates represented in the UMLS are organic chemicals (without any restrictions on the molecular sizes of the components).
Second, each IST satisfying Case 1 is replaced by a new type whose name expresses the fact that it denotes a conjugate. The conjugate chemical concepts originally assigned the particular IST are now assigned the new conjugate type instead. Finally, the new conjugate type is given an IS-A relationship to Conjugate. It does not receive the original IS-As of the IST it is replacing because the concepts to which it is assigned do not exhibit the individual semantics of the underlying OSTs. However, to preserve the link between a conjugate concept and the constituent chemical concepts underlying the conjugate’s derivation, a new semantic relationship named has component is substituted for those discarded IS-As.

The name of the conjugate type is constructed by transforming and combining the names of the IST’s constituent OSTs into a new composite word-form that is used as a modifier for the word “Conjugate.” The International Union of Pure and Applied Chemistry’s (IUPAC’s) nomenclature system10 is followed when names are available. Some examples are Glycolipid, for Carbohydrate ∩ Lipid; Glyco-amino-acid, Glycopeptide, or Glycoprotein, for Carbohydrate ∩ Amino Acid, Peptide, or Protein; Lipoprotein, for Lipid ∩ Amino Acid, Peptide, or Protein; and Nucleoprotein, for Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein.11 Names of other conjugates were taken from the chemistry literature. Samples include: Lipo-amino-acid (in the form of one word),12 Lipo-nucleo-protein (in the form of one word),13 Liponucleotide,14 Peptidopolysaccharide,15 Phosphoamino Acid,16 Phosphopeptide,17 Polysaccharide-peptide,18 Proteopolysaccharide,19 Steroid-amino-acid, Peptide, or Protein.20-22 If common name formulations are not available, we combine the OSTs’ names using the following left-to-right ST prioritization: Carbohydrate; Lipid; Nucleic Acid, Nucleoside, or Nucleotide; and Amino Acid, Peptide, or Protein. In this prioritization, we employ the rule that shorter names are placed to the left (as adjectives) of the longer names. (Exceptions to this rule occur in two special situations where the relative sizes of each chemical component matter.)

An IST satisfying Case 2 is divided up and replaced by a pair of new types, one for its concepts representing conjugates and one for its concepts representing complexes. The creation of the conjugate type follows the handling of Case 1, with all the IST’s conjugate chemical concepts being exclusively re-assigned the new conjugate type. The modeling of the IST’s complex chemical concepts follows a similar pattern to that of the conjugates. First, the RSN is extended to include the type Complex, whose definition is:

A chemical mixture produced by physically mixing two or more chemical compounds without chemical reactions. They are held together by nonspecific intermolecular forces, not by chemical bonds (covalent bonds).

As with Conjugate, the new type Complex is made a child of Organic Chemical because all complex chemicals represented in the UMLS consist of mixtures of organic compounds, such as steroids, nucleic acids, proteins, etc. (In Yu et al.,23 we also find a proposed type “complex” defined as a child of Chemical Viewed Structurally.) Next, a second new type, accompanying the conjugate type, is defined to replace the IST. The complex chemical concepts originally assigned several CSTs, the intersection of which created the particular IST, are now exclusively assigned the new complex type instead. The name of the complex type is derived in a manner similar to that for its companion conjugate type, with “Complex” used instead of “Conjugate.” Lastly, the new type is placed in the IS-A hierarchy by giving it the same IS-As (there must be more than one) as the IST from which it was derived. It should be noted that the IST’s existing IS-As are preserved for the complex type because its assigned concepts, representing chemical complexes, still exhibit the semantics of the OSTs.

As an example of Case 1, the IST Eicosanoid ∩ Amino Acid, Peptide, or Protein has assigned concepts that all represent conjugates. One of them is 6-ketoprostaglandin F1 alpha-thyroglobulin conjugate. The IST is replaced by the new conjugate type Eicosanoic-peptide or Eicosanoic-protein Conjugate, which has an IS-A to Conjugate, but not to the IST’s parents, Eicosanoid or Amino Acid, Peptide, or Protein. Instead, there are has component relationships directed from Eicosanoic-peptide or Eicosanoic-protein Conjugate to Eicosanoid and Amino Acid, Peptide, or Protein. This modeling is illustrated in Figure 2.

Figure 2. New conjugate type Eicosanoic-peptide or Eicosanoic-protein Conjugate.

An example of Case 2 is Amino Acid, Peptide, or Protein ∩ Lipid, assigned to some concepts that correspond to conjugates and others that represent complexes. Part of its replacement is the new conjugate type Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate. The other part is the new complex type Lipopeptide or Lipoprotein Complex. Eighty-two of Amino Acid, Peptide, or Protein ∩ Lipid’s 121 assigned concepts denote conjugates and are thus re-assigned Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate. One of them is the conjugate N-dodecanoyl serine, mentioned above. The remaining 39 assigned concepts stand for complexes and are therefore re-assigned Lipopeptide or Lipoprotein Complex. Virosomes is an example. The complete remodeling of Amino Acid, Peptide, or Protein ∩ Lipid is illustrated in Figure 3. It should be noted that Lipopeptide or Lipoprotein Complex has three parents, while Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate has just one.
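The replacement of an IST by conjugate and complex types can be sketched as a small data model. The `RsnType` class and `remodel` function below are hypothetical names introduced only for illustration. The sketch also assumes that a complex type receives an IS-A to Complex in addition to the IST's original IS-As, an inference drawn from the statement that Lipopeptide or Lipoprotein Complex has three parents.

```python
from dataclasses import dataclass, field


@dataclass
class RsnType:
    """A type in the RSN with its outgoing relationships (illustrative only)."""
    name: str
    is_a: list = field(default_factory=list)
    has_component: list = field(default_factory=list)


def remodel(ist_osts, conjugate_name=None, complex_name=None):
    """Return the new types that replace an IST built from `ist_osts`.

    Case 1 supplies only a conjugate name; Case 2 supplies both names.
    """
    new_types = []
    if conjugate_name:
        # The conjugate type drops the IST's IS-As; each constituent OST
        # is linked through has_component instead.
        new_types.append(
            RsnType(conjugate_name, is_a=["Conjugate"], has_component=list(ist_osts))
        )
    if complex_name:
        # The complex type keeps the IST's IS-As (assumed: plus IS-A Complex).
        new_types.append(RsnType(complex_name, is_a=["Complex", *ist_osts]))
    return new_types


case2 = remodel(
    ["Amino Acid, Peptide, or Protein", "Lipid"],
    conjugate_name="Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate",
    complex_name="Lipopeptide or Lipoprotein Complex",
)
for t in case2:
    print(t.name, "| IS-A:", t.is_a, "| has component:", t.has_component)
```

Under these assumptions the conjugate type ends up with one parent and two has component links, while the complex type ends up with three parents and no has component links.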
The conjugate type does have two has component relationships.

Figure 3. Remodeling conjugate and complex chemicals originally assigned Amino Acid, Peptide, or Protein ∩ Lipid in the RSN.

Different Kinds of Invalid ISTs

Cases 1 and 2 imply changes to the structure of the RSN that enhance the view it affords. Case 3 represents a violation of the rules defined in the Semantic Network itself. Therefore, not only should the RSN be changed in that case, but the assigned UMLS modeling must be corrected, too. Actually, we find two distinct scenarios regarding Case 3.

One is where a concept is assigned an IST in explicit violation of the instructions in the definition or usage notes of an ST. For example, two chemicals categorized respectively by the structurally viewed CSTs Carbohydrate and Lipid combine, most of the time, to create a glycolipid (which is a conjugate). According to the usage note of the ST Carbohydrate, “glycolipid should only be typed as Lipid.” Hence, the 110 concepts representing glycolipids (e.g., N-octanoylglucosylceramide), among the 126 total concepts assigned Lipid and Carbohydrate, should be assigned only Lipid.

A second example can be seen with a concept assigned Element, Ion, or Isotope and an organic CST, such as Steroid, Carbohydrate, etc. According to the UMLS definition of Element, Ion, or Isotope, it “does not include organic ions such as iodoacetate to which the type ‘Organic Chemical’ is assigned.” Therefore, such a concept should only be assigned the organic CST. For example, the concept isopropyl trimethacryltitanate is assigned Steroid and Element, Ion, or Isotope, but it should only be assigned the former. Thus, Steroid ∩ Element, Ion, or Isotope should not appear in the RSN.
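The second exclusion rule lends itself to a mechanical check, sketched below. This is a minimal illustration and not a description of the authors' tooling: `ORGANIC_CSTS` is a deliberately incomplete, hand-picked subset of organic CSTs named in the text, and `suggest_reassignment` is a hypothetical helper.

```python
# Illustrative subset only; the real set would be every ST under Organic Chemical.
ORGANIC_CSTS = {"Steroid", "Carbohydrate", "Lipid"}
ELEMENT = "Element, Ion, or Isotope"


def suggest_reassignment(assigned):
    """Apply the Element, Ion, or Isotope exclusion to one concept's ST set.

    If the concept carries both ELEMENT and at least one organic CST,
    the suggestion is to keep only the remaining assignments.
    Otherwise the assignment set is returned unchanged.
    """
    assigned = set(assigned)
    if ELEMENT in assigned and assigned & ORGANIC_CSTS:
        return assigned - {ELEMENT}
    return assigned


# The combination reported for isopropyl trimethacryltitanate in the text.
print(suggest_reassignment({"Steroid", ELEMENT}))
```

The glycolipid rule is not encoded here because it depends on knowing whether a given concept is a glycolipid, which cannot be decided from the ST combination alone.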
There are special cases of UMLS rules violations involving an ST that is disjunctive in its construction, meaning it has the form “X, Y, or Z.” Specifically, the problematic ST is Amino Acid, Peptide, or Protein. This ST lumps together concepts whose high-level semantics are related but not identical. This becomes a problem when the chemicals represented by these concepts are combined with chemicals whose concepts are from another ST. In particular, this results in rules being applicable only to a subset of the assigned concepts. For example, in the definition of Carbohydrate, there is the following stipulation: “Excluded are . . . glycoproteins . . .”; and in its usage note, we find: “Glycoproteins should only be typed as ‘Amino Acid, Peptide, or Protein.’” Combinations of carbohydrates and amino acids or carbohydrates and peptides are perfectly acceptable. Only combinations of carbohydrates and proteins are expressly forbidden. But actually this restriction is not an absolute. Some such combinations are in fact allowed when they are not glycoproteins, depending on the proportion of the different kinds of molecules involved. The definition of Organophosphorus Compound, which similarly intersects with Amino Acid, Peptide, or Protein, also contains an explicit exclusion of phosphoproteins, comprising organophosphorus compound and protein portions. To deal with these special situations, it is necessary to divide the respective ISTs into multiple ISTs across the disjunctive form and process each separately. The new ISTs, not actually appearing in the RSN, are called partial ISTs because each represents a portion of the underlying intersection. 
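The division of an IST across a disjunctive ST can be sketched as simple string handling. The helper names `disjuncts` and `partial_ists` are hypothetical, and the sketch assumes that every disjunctive ST name follows the literal “X, Y, or Z” pattern.

```python
import re


def disjuncts(st_name):
    """Split a disjunctive ST name of the form 'X, Y, or Z' into its parts."""
    parts = re.split(r",\s*(?:or\s+)?", st_name)
    return [p.strip() for p in parts if p.strip()]


def partial_ists(other_st, disjunctive_st):
    """Form one partial IST per disjunct of `disjunctive_st`."""
    return [f"{other_st} ∩ {part}" for part in disjuncts(disjunctive_st)]


for name in partial_ists("Carbohydrate", "Amino Acid, Peptide, or Protein"):
    print(name)
```

Each partial IST can then be examined separately against the rules that apply only to that disjunct, such as the exclusion of glycoproteins from Carbohydrate.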
The IST Carbohydrate ∩ Amino Acid, Peptide, or Protein is divided into the three partial ISTs Carbohydrate ∩ Amino Acid (representing the amino acid contribution to the IST, e.g., glucose-cysteine and (Man)5(GlcNAc)2Asn), Carbohydrate ∩ Peptide (representing the peptide contribution, e.g., histidylAH-sepharose and 6-chlorofructos-1-yl-glutathione), and Carbohydrate ∩ Protein (representing the protein contribution, e.g., Immunoglobulin A-sepharose 4B and N-acetylgalactosamine-bovine serum albumin conjugate). The IST Organophosphorus Compound ∩ Amino Acid, Peptide, or Protein is broken down similarly into three partial ISTs.

All these newly derived partial ISTs are further refined into conjugate and complex types. The analysis, in some cases, leads to a refinement into multiple families of conjugates. For example, there may be concepts for two types of conjugates for a partial IST (as with Carbohydrate ∩ Protein) that are distinguished by the relative proportions of the two constituent chemicals. In such a circumstance, the larger component is used as the primary naming element of the conjugate type. In other words, the name of the smaller chemical component comes first as an adjective (prefix) and the name of the larger chemical component is written last (as the suffix). One example mentioned above for the partial IST Carbohydrate ∩ Amino Acid, namely, glucose-cysteine, containing one sugar (carbohydrate) and one amino acid unit, is assigned Glyco-amino-acid Conjugate. Another example for this partial IST, (Man)5(GlcNAc)2Asn, containing a larger carbohydrate portion and one amino acid unit, is assigned Aminoacid-polysaccharide Conjugate.

At this point, we must raise a practical issue. The types of the RSN, including the conjugate and complex types, are intended to serve as broad categories that encompass the concepts of the META. However, some ISTs are assigned very few concepts, in some cases only one concept.
It is arguable whether such ISTs really represent first-order categories. While one could make arguments from both sides on this issue, we have chosen to allow for the specification of a threshold value on types’ extent sizes for the purpose of determining whether a type qualifies for inclusion in the RSN. In other words, if a type’s extent size is below the threshold value, then it is excluded from the network presentation. A type excluded in this manner is referred to as a minor category. Let us note that the use of the threshold value actually leaves those concepts assigned minor categories without a unique type assignment with respect to the RSN. For those concepts with minor categories, we will need to keep the assignments of the original types of the SN. Of course, the choice of the threshold value is arbitrary and depends on one’s point of view in this matter. Because of this, we report the results of varying the threshold over a range of different values. A higher threshold value will increase the number of minor categories and therefore reduce the percentage of chemical concepts, originally assigned structurally viewed CSTs, having unique type coverage in the RSN.

Results

ISTs with Conjugates Only

The RSN derived from the 2007AA UMLS release contains 32 ISTs involving structurally viewed CSTs as shown in Figure 1. Six of those ISTs satisfy Case 1, that is, the chemical concepts categorized by these ISTs all represent conjugates (excluding miscategorized cases). (The analysis of the chemical concepts was performed by one of the authors (L.C.), a chemistry professor, utilizing the IUPAC Compendium of Chemical Terminology24 as a resource.) Table 1 lists these six ISTs along with their conjugate-type replacements and their numbers of assigned concepts in parentheses. An example assigned conjugate concept is included for each type as well.
For example, Nucleic Acid, Nucleoside, or Nucleotide ∩ Carbohydrate is replaced by Glyco-Nucleic Acid, Nucleoside or Nucleotide Conjugate. DNA-cellulose is one of the 149 concepts now assigned this conjugate type instead of the IST. Let us point out that the IST Nucleic Acid, Nucleoside, or Nucleotide ∩ Lipid is replaced by the type Liponucleoside or Liponucleotide Conjugate, which is missing the “Liponucleic Acid” component from its name. The reason for this omission is that there are no concepts at all in the UMLS representing liponucleic acids. We are not including combinations that do not have any occurrences. Another example of this can be seen for Eicosanoid ∩ Amino Acid, Peptide, or Protein.

Table 1. ISTs Replaced by Conjugate Types Only

| IST (# Concepts) | Conjugate Type or Minor Category (# Concepts) | Example of Assigned Concept |
|---|---|---|
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Carbohydrate (149) | Glyco-nucleic acid, Nucleoside, or Nucleotide Conjugate (149) | DNA-cellulose |
| Steroid ∩ Nucleic Acid, Nucleoside, or Nucleotide (8) | Steroid-nucleic acid, Nucleoside, or Nucleotide Conjugate (8) | Cortisone-4-ara-C |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Lipid (141) | Liponucleoside or Liponucleotide Conjugate (134) | 2-octynoyl-coenzyme A |
| Eicosanoid ∩ Amino Acid, Peptide, or Protein (3) | Eicosanoic-peptide or Eicosanoic-protein Conjugate (3) | 6-ketoprostaglandin F1 alpha-thyroglobulin conjugate |
| Eicosanoid ∩ Nucleic Acid, Nucleoside, or Nucleotide (2) | Eicosanoic-nucleotide Conjugate (2) | Phytanoyl-coenzyme A |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Carbohydrate ∩ Lipid (1) | Glyco-lipo-nucleotide Conjugate (1) | UDP-3-O-(3-hydroxymyristoyl)-N-acetylglucosamine |

IST = intersection semantic type.
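The replacement of an IST by a conjugate type can be pictured as a lookup keyed by the set of semantic types that the IST combines. The following is only an illustrative sketch, not the authors' implementation: the names `CONJUGATE_ONLY` and `conjugate_type_for` are hypothetical, and the entries are simply the six rows reported in Table 1.

```python
# Hypothetical sketch: an IST is keyed by the *set* of semantic types (STs)
# it combines, so a lookup does not depend on the order in which STs are given.
NANN = "Nucleic Acid, Nucleoside, or Nucleotide"
AAPP = "Amino Acid, Peptide, or Protein"

# IST -> (conjugate type name, number of assigned concepts), as reported in Table 1.
CONJUGATE_ONLY = {
    frozenset({NANN, "Carbohydrate"}): (
        "Glyco-nucleic acid, Nucleoside, or Nucleotide Conjugate", 149),
    frozenset({"Steroid", NANN}): (
        "Steroid-nucleic acid, Nucleoside, or Nucleotide Conjugate", 8),
    frozenset({NANN, "Lipid"}): (
        "Liponucleoside or Liponucleotide Conjugate", 134),
    frozenset({"Eicosanoid", AAPP}): (
        "Eicosanoic-peptide or Eicosanoic-protein Conjugate", 3),
    frozenset({"Eicosanoid", NANN}): (
        "Eicosanoic-nucleotide Conjugate", 2),
    frozenset({NANN, "Carbohydrate", "Lipid"}): (
        "Glyco-lipo-nucleotide Conjugate", 1),
}


def conjugate_type_for(semantic_types):
    """Return the conjugate type replacing the IST formed by these STs, or None."""
    entry = CONJUGATE_ONLY.get(frozenset(semantic_types))
    return entry[0] if entry else None


# The order of the STs does not matter.
print(conjugate_type_for(["Carbohydrate", NANN]))
```

Using a `frozenset` as the key is one way to reflect that an IST is an unordered combination of STs; a combination with no entry (such as Steroid with Lipid) simply yields `None`.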
It should also be noted that for Nucleic Acid, Nucleoside, or Nucleotide ∩ Lipid, the number of chemicals assigned the conjugate type (134) is lower than the number assigned the IST (141). The reason for this is that seven of the original assignments were in error. (Details and explanations for these seven erroneous concepts, and others, are discussed below in the section Errors Identified.)

As mentioned, a threshold value is used to distinguish between types, which are included in the RSN, and minor categories, which are excluded. For the sake of demonstration, we initially discuss the results for a threshold value of five. That is, a conjugate type or a complex type is deemed a minor category if it contains fewer than five concepts. Although the minor categories do not appear in the RSN, we do report on them, too, for the sake of completeness. They are shaded gray in Tables 1 through 4. For example, the IST Nucleic Acid, Nucleoside, or Nucleotide ∩ Carbohydrate ∩ Lipid is replaced by Glyco-Lipo-Nucleotide Conjugate, which is assigned to UDP-3-O-(3-hydroxymyristoyl)-N-acetylglucosamine, the single concept previously assigned the IST.
However, Glyco-Lipo-Nucleotide Conjugate’s extent size falls under the threshold for inclusion in the RSN.

ISTs with Both Conjugates and Complexes

Seven of those 32 ISTs in Figure 1 satisfy Case 2, that is, the chemical concepts categorized by these ISTs represent either conjugates or complexes. Each is thus replaced by corresponding conjugate and complex types. Table 2 lists all seven of these ISTs along with the types replacing them. The number of assigned concepts is shown in parentheses for each. A concept assigned the type is shown, too.

Table 2. ISTs Replaced by Conjugate and Complex Types

| IST (# Concepts) | Conjugate Type or Minor Category (# Concepts) | Example Assigned Concept | Complex Type or Minor Category (# Concepts) | Example Assigned Concept |
|---|---|---|---|---|
| Amino Acid, Peptide, or Protein ∩ Lipid (121) | Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate (82) | N-stearoylhistidine | Lipopeptide or Lipoprotein Complex (39) | Virosomes |
| Steroid ∩ Amino Acid, Peptide, or Protein (39) | Steroid-amino-acid, Peptide, or Protein Conjugate (33) | Estradiol-bovine serum albumin | Steroid-peptide, or Steroid-protein Complex (5) | Lipoprotein-X cholesterol |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein (121) | Nucleo-amino-acid, Peptide, or Protein Conjugate (106) | Aspartyl adenylate | Nucleo-amino-acid, Peptide, or Protein Complex (15) | Actinomycin D-dATGCAT complex |
| Steroid ∩ Amino Acid, Peptide, or Protein ∩ Carbohydrate (8) | Steroid-glycoamino-acid or Glycoprotein Conjugate (2) | Tyrosyl-ouabain | Glyco-steroid-amino-acid, Peptide, or Protein Complex (6) | Polyspectran OS |
| Amino Acid, Peptide, or Protein ∩ Carbohydrate ∩ Lipid (7) | Glycolipoprotein Conjugate (1) | Lysozyme-glucose stearic acid monoester | Glycolipoprotein Complex (5) | Low-density lipoprotein-heparin complex |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein ∩ Lipid (2) | Lipo-nucleo-amino-acid Conjugate (1) | s-adenosyl-L-methionine N-ole-1oyltaurate | Lipo-nucleo-protein Complex (1) | RNA-proteolipid complex |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein ∩ Carbohydrate (4) | Glyco-nucleo-amino-acid or Glyco-nucleo-peptide Conjugate (2) | N(alpha)-dansyl-N(omega)-1,N(6)-etheno-ADP-ribosylarginine methyl ester | Glyco-nucleo-amino-acid or Peptide Complex (2) | Foltene |

IST = intersection semantic type.

Table 3. Most of the ISTs Not Appearing in the Remodeled RSN Due to Violation of ST Definitions or Usage Notes

| IST (# Concepts) | Correct Type or Minor Category (# Concepts) | Example Assigned Concept | Comment |
|---|---|---|---|
| Eicosanoid ∩ Carbohydrate (1) | Eicosanoid (1) | Prostaglandin-inositol cyclic phosphate | (a) |
| Steroid ∩ Carbohydrate (154) | Steroid (145) | Alpha-(3-hydroxysialyl)cholesterol | (a) |
| | Steroid-Polysaccharide Conjugate (2) | Ouabain-sepharose | |
| | Carbohydrate-steroid Complex (6) | Kombetin | |
| Carbohydrate ∩ Lipid (126) | Lipid (110) | N-octanoylglucosylceramide | (a) |
| | Lipopolysaccharide Conjugate (13) | Lipopolysaccharide, Escherichia coli O9 | |
| | Carbohydrate-Lipid Complex (2) | Pediatric fat emulsion 4501 | |
| Lipid ∩ Element, Ion, or Isotope (2) | Lipid (2) | 9-tellurium Te 123m heptadecanoic acid | (b) |
| Steroid ∩ Element, Ion, or Isotope (2) | Steroid (2) | 24-telluracholestanol | (b) |
| Organic Chemical ∩ Element, Ion, or Isotope (11) | Organic Chemical (11) | 15-(4-iodophenyl)-6-tellurapentadecanoic acid | (b) |
| Amino Acid, Peptide, or Protein ∩ Element, Ion, or Isotope (6) | Amino Acid, Peptide, or Protein (6) | Technetium Tc-99m immunoglobulin | (b) |
| Steroid ∩ Organophosphorus Compound (1) | Steroid (1) | 3-O-(4-nitrophenylphosphate)lithocholic acid | (c) |
| Organophosphorus Compound ∩ Carbohydrate (31) | Carbohydrate (31) | Arabitol-5-phosphate | (c) |
| Organophosphorus Compound ∩ Lipid (30) | Lipid (30) | Oleoyl thiophosphate | (c) |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Organophosphorus Compound (22) | Nucleic Acid, Nucleoside, or Nucleotide (22) | 5’-O-phosphonylmethylthymidine | (c) |

Comments: (a) ST Carbohydrate, by its definition, is excluded from glycolipids. (b) According to its definition, ST Element, Ion, or Isotope does not include organic ions or compounds to which Organic Chemical is assigned. (c) According to its definition, ST Organophosphorus Compound is excluded from phospholipids, sugar phosphates, phosphoproteins, nucleotides, and nucleic acids.

IST = intersection semantic type; RSN = refined semantic network; ST = semantic type.

An example is Amino Acid, Peptide, or Protein ∩ Lipid, assigned to a total of 121 concepts. It is replaced by the conjugate type Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate (assigned to 82 concepts representing conjugates) and the complex type Lipopeptide or Lipoprotein Complex (assigned to 39 concepts denoting complexes). N-stearoylhistidine is an example of a concept assigned the conjugate type, whereas Virosomes is assigned the companion complex type. It will be noted that the name of the complex type in this case is missing the “Lipo-amino Acid” component due to the fact that no concepts representing lipo-amino acids are found in the UMLS.
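The threshold rule stated earlier (a type with fewer than five concepts is a minor category) can be applied mechanically to the replacement types of Table 2. The sketch below is a minimal illustration under that assumption; the function name `split_by_threshold` and the list `TABLE_2_TYPES` are hypothetical, and the extent sizes are those reported in Table 2.

```python
THRESHOLD = 5  # demonstration value used in the text

# (replacement type, extent size) pairs as reported in Table 2
TABLE_2_TYPES = [
    ("Lipo-amino-acid, Lipopeptide, or Lipoprotein Conjugate", 82),
    ("Lipopeptide or Lipoprotein Complex", 39),
    ("Steroid-amino-acid, Peptide, or Protein Conjugate", 33),
    ("Steroid-peptide, or Steroid-protein Complex", 5),
    ("Nucleo-amino-acid, Peptide, or Protein Conjugate", 106),
    ("Nucleo-amino-acid, Peptide, or Protein Complex", 15),
    ("Steroid-glycoamino-acid or Glycoprotein Conjugate", 2),
    ("Glyco-steroid-amino-acid, Peptide, or Protein Complex", 6),
    ("Glycolipoprotein Conjugate", 1),
    ("Glycolipoprotein Complex", 5),
    ("Lipo-nucleo-amino-acid Conjugate", 1),
    ("Lipo-nucleo-protein Complex", 1),
    ("Glyco-nucleo-amino-acid or Glyco-nucleo-peptide Conjugate", 2),
    ("Glyco-nucleo-amino-acid or Peptide Complex", 2),
]


def split_by_threshold(types, threshold):
    """Separate types kept in the RSN from minor categories (extent < threshold)."""
    included = [name for name, size in types if size >= threshold]
    minor = [name for name, size in types if size < threshold]
    return included, minor


included, minor = split_by_threshold(TABLE_2_TYPES, THRESHOLD)
print(len(included), "types kept;", len(minor), "minor categories")
# prints: 8 types kept; 6 minor categories
```

With a threshold of five, the three conjugate types and five complex types with at least five concepts are kept, while the remaining six replacement types of Table 2 fall into minor categories.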
The ISTs Steroid ∩ Amino Acid, Peptide, or Protein and Amino Acid, Peptide, or Protein ∩ Carbohydrate ∩ Lipid each were previously assigned one concept in error (see section Errors Identified). These errors were corrected in the process of the re-assignments, so the numbers of concepts assigned the respective conjugate and complex types do not add up to the numbers originally assigned the IST. The extent of the conjugate type for Steroid ∩ Amino Acid, Peptide, or Protein ∩ Carbohydrate, for example, falls under the threshold and is designated a minor category. The same is true for both the conjugate type and the complex type replacing Nucleic Acid, Nucleoside, or Nucleotide ∩ Amino Acid, Peptide, or Protein ∩ Lipid.

Exclusions and Invalid ISTs

Besides the 13 ISTs (Tables 1 and 2) that are transformed into either conjugate types only or conjugate and complex types in the new RSN, the other 19 ISTs involving structurally viewed CSTs do not appear as legitimate types in the remodeled RSN due to the noted violations of UMLS modeling rules. A portion of the exclusion rules for Carbohydrate was cited in the Methods section. Additionally, its usage note stipulates: “Sugar phosphates should only be typed as ‘Carbohydrate.’ Glycolipids should only be typed as ‘Lipid.’” Examples of more rules for exclusion of ISTs are as follows.
In the definition of Organophosphorus Compound, we find: “Excluded are phospholipids, sugar phosphates, phosphoproteins, nucleotides, and nucleic acids.” The usage note of Nucleic Acid, Nucleoside, or Nucleotide states: “If this type has been assigned, the type ‘Organophosphorus Compound’ will not also be assigned.” And the usage note of Organic Chemical says: “Salts of organic chemicals . . . would be considered organic chemicals and should not also receive the type ‘Inorganic Chemical.’”

Table 3 shows 11 ISTs that do not appear in the revised RSN due to such violations. For example, the one concept assigned Eicosanoid ∩ Carbohydrate, namely, prostaglandin-inositol cyclic phosphate, represents a glycolipid. But such concepts, by definition, are not to be assigned Carbohydrate. In this situation, the assignment should be Eicosanoid only. It will be noted that the numbers of concepts for the “correct type” (column 2) are inconsistent with those for the ISTs themselves in the cases of Steroid ∩ Carbohydrate and Carbohydrate ∩ Lipid. This is due to the discovery and subsequent correction of one assignment error with respect to each. Again, details are given below.

As seen in Table 3, while 110 of the 126 chemical concepts assigned Carbohydrate ∩ Lipid represent glycolipids and are re-assigned Lipid according to the UMLS usage note of Carbohydrate, there are two other cases involving this IST. There are two concepts denoting complex chemicals (e.g., Pediatric fat emulsion 4501) that are re-assigned the new complex type Carbohydrate-Lipid Complex. There are also 13 concepts denoting lipopolysaccharide conjugates (e.g., lipopolysaccharide, E. coli O9), characterized by a large carbohydrate molecule, that are re-assigned the new corresponding conjugate type. Overall, Carbohydrate ∩ Lipid’s concepts are re-assigned three different types in the revised RSN. A similar situation occurs for the IST Steroid ∩ Carbohydrate.

Table 4. Transformations of Two ISTs Involving Amino Acid, Peptide, or Protein

| IST (# Concepts) | Partial IST (# Concepts) | Complex Type or Minor Category (# Concepts) | Example Assigned Concept | Conjugate Type or Minor Category (# Concepts) | Example Assigned Concept |
|---|---|---|---|---|---|
| Carbohydrate ∩ Amino Acid, Peptide, or Protein (303) | Carbohydrate ∩ Amino Acid (75) | Carbohydrate-amino-acid Complex (5) | Calciofix | Glycoamino-acid Conjugate (47) | Glucose-cysteine |
| | | | | Amino-acid-polysaccharide Conjugate (23) | (Man)5(GlcNAc)2Asn |
| | Carbohydrate ∩ Peptide (70) | Polysaccharide-peptide Complex (1) | Protamine heparin aggregate | Peptidopolysaccharide Conjugate (12) | Histidyl-AH-sepharose |
| | | | | Glycopeptide Conjugate (57) | 6-chlorofructos-1-yl-glutathione |
| | Carbohydrate ∩ Protein (151) | Polysaccharide-protein Complex (5) | Dermatan sulfate proteoglycan | Proteopolysaccharide Conjugate (20) | Immunoglobulin A-sepharose 4B |
| | | | | <<Glycoprotein Conjugate>> (126) | N-acetylgalactosamine-bovine serum albumin conjugate |
| Organophosphorus Compound ∩ Amino Acid, Peptide, or Protein (33) | Organophosphorus Compound ∩ Amino Acid (21) | (None) | | Phosphoamino acid Conjugate (21) | Phospho-L-arginine |
| | Organophosphorus Compound ∩ Peptide (9) | (None) | | Phosphopeptide Conjugate (9) | Thiotepa-glutathione conjugate |
| | Organophosphorus Compound ∩ Protein (3) | (None) | | <<Phosphoprotein Conjugate>> (3) | Phosphorylcholine-bovine serum albumin |

The symbols << >> indicate that the type is invalid and excluded from the revised RSN. IST = intersection semantic type; RSN = refined semantic network.

Invalid Partial ISTs

As noted, the two ISTs Carbohydrate ∩ Amino Acid, Peptide, or Protein and Organophosphorus Compound ∩ Amino Acid, Peptide, or Protein must be treated as special cases with regard to the exclusion rules due to the disjunctive nature of the ST Amino Acid, Peptide, or Protein. In particular, each is broken down into three new partial ISTs which are then analyzed separately.
While each partial IST is analyzed with regard to its conjugates and complexes, it is the protein-based chemicals (i.e., glycoproteins and phosphoproteins) that are of specific interest due to the stated exclusions.

For the IST Carbohydrate ∩ Amino Acid, Peptide, or Protein, the division yields the partial ISTs Carbohydrate ∩ Amino Acid, Carbohydrate ∩ Peptide, and Carbohydrate ∩ Protein. The first partial IST, Carbohydrate ∩ Amino Acid, includes both complex (e.g., Calciofix) and conjugate concepts. A new complex type Carbohydrate-amino-acid Complex is defined. For the conjugate concepts, further analysis reveals that carbohydrates combined with amino acids yield two distinct kinds of conjugates. The distinction between the two is mainly due to the size of the carbohydrate: monomer (“glyco”) versus polymer (“polysaccharide”). When the carbohydrate is a monomer, we use the type Glycoamino-acid Conjugate. An example of this is the concept Glucose-cysteine. When the carbohydrate is a polymer, we use the type Amino-acid-polysaccharide Conjugate. An example is (Man)5(GlcNAc)2Asn. Table 4 shows the complete results of the transformation of Carbohydrate ∩ Amino Acid, Peptide, or Protein.

The second partial IST, Carbohydrate ∩ Peptide, also has both complex and conjugate concepts. For the former, a new complex type Polysaccharide-peptide Complex is defined. For the latter, we again have the situation of two distinct kinds of conjugates. The distinction between the two is based on the proportional sizes of the two molecular contributions. Peptidopolysaccharide Conjugate denotes a smaller peptide (“peptido”) and a large carbohydrate (“polysaccharide”). An example is histidyl-AH-sepharose. Glycopeptide Conjugate denotes a smaller carbohydrate (“glyco”) and a large peptide portion. An example of this is 6-chlorofructos-1-yl-glutathione.
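The naming convention used for these conjugate types (smaller component as prefix, larger component as suffix) can be sketched as a small function. This is an illustration only: the function `conjugate_type_name` and the two lookup tables are hypothetical, and they cover just the carbohydrate, amino acid, and peptide cases named in the text so far.

```python
# Hypothetical lookup tables: the form a component takes when it is the
# smaller part (prefix) and when it is the larger part (suffix).
PREFIX = {
    "carbohydrate": "Glyco",
    "amino acid": "Amino-acid-",
    "peptide": "Peptido",
}
SUFFIX = {
    "carbohydrate": "polysaccharide",
    "amino acid": "amino-acid",
    "peptide": "peptide",
}


def conjugate_type_name(smaller, larger):
    """Name a conjugate type: smaller component first, larger component last."""
    return PREFIX[smaller] + SUFFIX[larger] + " Conjugate"


# A single sugar attached to an amino acid versus a polysaccharide carrying one.
print(conjugate_type_name("carbohydrate", "amino acid"))  # Glycoamino-acid Conjugate
print(conjugate_type_name("amino acid", "carbohydrate"))  # Amino-acid-polysaccharide Conjugate
```

The same two chemical classes thus give two different type names depending on which contribution dominates, which is the distinction the partial ISTs are meant to expose.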
For the last of the three partial ISTs, Carbohydrate ∩ Protein, there are various concepts representing complexes and two kinds of conjugates (Table 4). The conjugates are again distinguished by the proportional sizes of their two molecular contributions. For example, Proteopolysaccharide is the name of the conjugate type whose concepts represent chemicals containing a smaller protein portion (“proteo,” prefix) and a larger carbohydrate portion (“polysaccharide”). An example is Immunoglobulin A-sepharose 4B. In contrast, Glycoprotein is used when the carbohydrate portion is smaller: prefix “glyco.” An example is N-acetylgalactosamine-bovine serum albumin conjugate. According to the UMLS specifications noted above, the conjugate type Glycoprotein is in fact invalid. The 126 concepts warranting its assignment should be assigned Amino Acid, Peptide, or Protein instead. Thus, Glycoprotein does not appear in the revised RSN. Overall, Carbohydrate ∩ Protein is replaced by one conjugate type and an accompanying complex type (Table 4).

The IST Carbohydrate ∩ Amino Acid, Peptide, or Protein had an additional seven assigned concepts that were discovered to be in error in the course of remodeling and reassignment. These are listed in Table 5 (with brief explanations), along with the other concept errors found during the revision of the RSN.

The IST Organophosphorus Compound ∩ Amino Acid, Peptide, or Protein is also broken down into three partial ISTs, with particular interest in the phosphoproteins: Organophosphorus Compound ∩ Amino Acid, Organophosphorus Compound ∩ Peptide, and Organophosphorus Compound ∩ Protein. In all three cases, the chemical combinations are strictly conjugates, so no complex types are required. The first two partial ISTs are replaced by the conjugate types Phosphoamino Acid Conjugate (example concept: phospho-L-arginine) and Phosphopeptide Conjugate (example: thiotepa-glutathione conjugate), respectively.
The prefix “phospho” is used to convey the organophosphorus portion of the compounds. The last partial IST, Organophosphorus Compound ∩ Protein, is assigned to the phosphoprotein concepts such as phosphorylcholine-bovine serum albumin. However, that is in violation of the UMLS assignment rules. In fact, by the definition of the ST Organophosphorus Compound, only Amino Acid, Peptide, or Protein should be assigned to the three phosphoprotein concepts found. So, no type named “Phosphoprotein Conjugate” is included in the RSN. The complete transformation of Organophosphorus Compound ∩ Amino Acid, Peptide, or Protein is summarized in Table 4.

Errors Identified

As noted previously, in various contexts, existing type-assignment errors were encountered during the re-assignment of conjugate and complex types to concepts in the revised RSN. Overall, 18 such errors were discovered. Most of them are due to a lack of one of the asserted component types. For example, the concept 1,2-4,5-di-O-isopropylidene-3-C-(5-phenyl-1,2,4-oxadiazol-3-yl)-beta-D-psicopyranose was previously assigned the IST Steroid ∩ Carbohydrate. However, the chemical has no steroid component, and the concept should never have been assigned Steroid in the first place. As another example, 3-hydroxy-17-(1H-1,2,3-triazol-1-yl)androsta-5,16-diene was assigned the IST Steroid ∩ Amino Acid, Peptide, or Protein.
However, this chemical has no “amino acid, peptide, or protein” component. In the case of Glyceryl glyphosate, disodium salt, assigned Carbohydrate ∩ Amino Acid, Peptide, or Protein, the chemical has no carbohydrate component. The concept 17-glucuronosylestradiol was assigned Carbohydrate ∩ Lipid but should have been assigned Carbohydrate ∩ Steroid. A total of 14 errors (seven each) of the 18 errors are associated with the ISTs Nucleic Acid, Nucleoside, or Nucleotide ∩ Lipid and Carbohydrate ∩ Amino Acid, Peptide, or Protein. Table 5 lists all 18 errors along with a brief explanation for each.

Figure 4. The complex types of the RSN.

Table 5. Type-Assignment Errors Discovered during RSN Remodeling

| IST | Concepts in Error | Comment |
|---|---|---|
| Steroid ∩ Amino Acid, Peptide, or Protein | 3-hydroxy-17-(1H-1,2,3-triazol-1-yl)androsta-5,16-diene | No Amino Acid, Peptide, or Protein component |
| Amino Acid, Peptide, or Protein ∩ Carbohydrate ∩ Lipid | Difucosyl lactosamine | No Amino Acid, Peptide, or Protein component |
| Carbohydrate ∩ Amino Acid, Peptide, or Protein | Zinc (II)-iminodiacetate agarose; RIG 200; Methyl-2,3,4-tris-O-(N-2,3-di(hydroxyl)benzoyl)aminopropyl)glucopyranoside; Aurantoside D; MEN 4901; Glyceryl glyphosate, 2-propanamine (1:1); Glyceryl glyphosate, disodium salt | No Carbohydrate component |
| Steroid ∩ Carbohydrate | 1,2-4,5-di-O-isopropylidene-3-C-(5-phenyl-1,2,4-oxadiazol-3-yl)-beta-D-psicopyranose | No Steroid component |
| Carbohydrate ∩ Lipid | 17-glucuronosylestradiol | Lipid should be replaced by Steroid |
| Nucleic Acid, Nucleoside, or Nucleotide ∩ Lipid | Dodecaglycerol-thymine; Dodecaglycerol-adenine; (GlyA-dT)10; GlyT-(GlyA-GlyT)9; Formyl-coenzyme A; 2’-deoxycytidine-diphosphate-diglyceride; Guanosine 5’-(5’-deoxyadenosylcobinamide pyrophosphate | No Lipid component |

IST = intersection semantic type; RSN = refined semantic network.
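The reported error counts can be cross-checked against the extent sizes given in Tables 1 through 4: for each affected IST, the original number of concepts minus the number of errors should equal the number of concepts re-assigned. The sketch below is only an arithmetic check on the published figures; the names `COUNTS` and `unreconciled` are hypothetical, and "&" stands in for the intersection symbol.

```python
# (IST, concepts originally assigned, errors found, concept counts after re-assignment)
# All numbers are taken from Tables 1 through 5; "&" denotes intersection.
COUNTS = [
    ("Nucleic Acid, Nucleoside, or Nucleotide & Lipid", 141, 7, [134]),
    ("Steroid & Amino Acid, Peptide, or Protein", 39, 1, [33, 5]),
    ("Amino Acid, Peptide, or Protein & Carbohydrate & Lipid", 7, 1, [1, 5]),
    ("Steroid & Carbohydrate", 154, 1, [145, 2, 6]),
    ("Carbohydrate & Lipid", 126, 1, [110, 13, 2]),
    # The three partial ISTs of Table 4: Amino Acid, Peptide, Protein.
    ("Carbohydrate & Amino Acid, Peptide, or Protein", 303, 7, [75, 70, 151]),
]


def unreconciled(rows):
    """Return the ISTs whose counts do not satisfy original - errors == re-assigned."""
    return [ist for ist, original, errors, parts in rows
            if original - errors != sum(parts)]


total_errors = sum(errors for _, _, errors, _ in COUNTS)
print("total errors:", total_errors)           # total errors: 18
print("unreconciled ISTs:", unreconciled(COUNTS))  # unreconciled ISTs: []
```

All six ISTs reconcile, and the per-IST error counts sum to the 18 errors stated in the text.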
Proposed RSN

Figure 4 shows part of the new version of the RSN rooted at Chemical Viewed Structurally with some of the complex types. Again, this RSN was generated with respect to an extent-size threshold value of five. That is, ISTs whose extents have fewer than five concepts are omitted. The issue of a higher threshold value is considered below. Such a higher value would imply fewer such types in Figures 4 and 5. All the structurally viewed CSTs appear in the upper part of the figure above the broken line. The complex types appear in the lower part. Complex is a child of Organic Chemical and the parent of all complex types. As noted, there are eight complex types. Each complex type preserves the IS-A relationships of the original corresponding IST (in Figure 1) to its constituent structurally viewed CSTs. We note that the modeling of the complex types follows the modeling of Figure 3.

Figure 5 shows the part of the new version of the RSN rooted at Chemical Viewed Structurally with some of the conjugate types. All the structurally viewed CSTs appear in the upper part of the figure above the broken line. The conjugate types appear in the lower part of the figure. Conjugate is a child of Organic Chemical and the parent of all conjugate types. As noted, there are 14 conjugate types. Each conjugate type has “has component” relationships to its constituent CSTs. We note that Figure 2 is embedded in Figure 5.

Regarding the issue of types with small extents, let us note that in the current SN, some nonleaf types fall into this group. In fact, most of the concepts in the META are assigned leaf types in order to categorize them as specifically as possible. The RSN’s complex types and conjugate types are indeed leaf types, and one could reasonably expect them to be assigned to a relatively significant number of concepts if they are to warrant the designation of “broad category.” But even in the SN, there are nine leaf types whose extents only have between 25 and 100 concepts.
As noted, to allow flexibility, we have adopted the use of a threshold value and designated types whose extent sizes fall below it as “minor categories.” The version of the RSN rooted at Chemical Viewed Structurally reported above was derived with a threshold of five. We could, of course, have chosen another threshold value of, say, 10, 25, 50, or 100. There is a tradeoff between lowering the number of conjugate types and complex types (which follows the raising of the threshold) versus the accuracy of capturing all possible conjugate and complex chemicals by unique explicit types. Table 6 illustrates this tradeoff. For example, if we choose a minimum extent size of 10 concepts (third row), then we wind up with only 25 types as compared to the 45 types of the unrestricted RSN (first row) and the 33 types of the RSN with a threshold of five (second row). This represents a 44% reduction and a 24% reduction, respectively, in the number of types. The 25 types are collectively assigned to a total of 86,201 concepts. This choice thus results in 69 (0.08%) of the concepts not having a unique type. For such concepts, we would have to resort back to the original multi-typing arrangement in order to accommodate them with high-level categorizations. It is up to a user for whom the conjugate/complex distinction is pertinent to decide on a value that balances the tradeoff in a way they see fit.

Figure 5. The conjugate types of the RSN.

Table 6. Effect of Threshold Values on the Size of the Chemical Viewed Structurally Portion of the RSN

| Extent-size Threshold | # Types (OSTs, Complex Types, and Conjugate Types) | # Concepts Covered | % Covered |
|---|---|---|---|
| 1 | 45 | 86,270 | 100.00 |
| 5 | 33 | 86,250 | 99.98 |
| 10 | 25 | 86,201 | 99.92 |
| 25 | 19 | 86,097 | 99.80 |
| 50 | 16 | 85,978 | 99.66 |
| 100 | 14 | 85,839 | 99.50 |

OST = original semantic type; RSN = refined semantic network.

Discussion

Significance and Impact
Various kinds of semantic combinations are possible when a chemical concept is assigned several structurally viewed CSTs. This led to the creation of new complex and conjugate types for the RSN, such as Glyco-lipo-nucleotide Conjugate and Glyco-steroid-amino-acid, peptide, or protein Complex. In this way, the RSN provides a more precise abstraction for legitimate combinations of structurally viewed CSTs.

When the SN was first introduced, it was noted that it did not have particularly great depth [2]. The expressed expectation was for the creation of additional depth during further development. The modifications suggested in this article are a step toward that goal. They constitute a natural increase in the depth of the subnetwork rooted at Chemical Viewed Structurally.

Two different kinds of combinations of structurally viewed CSTs that are violations of UMLS rules were presented and excluded from the revised RSN. The impact of these exclusions is the prevention of assigning “illegal” ST combinations involving structurally viewed CSTs to concepts representing chemicals that are formed from combinations of other chemicals. Using the RSN, an editor will only be able to choose legitimate semantic combinations that appear explicitly as conjugate or complex types. As an example of the first kind of violation, we see in Table 3 that 30 concepts were assigned the combination of Organophosphorus Compound and Lipid. By the definition of Organophosphorus Compound, such a concept should only be assigned Lipid. A UMLS editor would not be able to assign such an illegal combination of STs because it is not reified as a type of its own in the RSN, where unique type assignments are required. Another example of a violation in Table 3 is the combination of two exclusive types, Organic Chemical and Element, Ion, or Isotope, assigned to 11 concepts.
By the definition of Element, Ion, or Isotope, such a combination is considered an organic chemical due to the organic component of the chemical, and no assignment of Element, Ion, or Isotope should be made. Again, not having such a type in the RSN will prevent an editor from assigning this illegal combination. Similarly, all 361 concepts assigned structurally viewed CSTs in Table 3 in violation of UMLS rules would not have been assigned the CST causing the violation.

Furthermore, when an editor is faced with the task of assigning structurally viewed CSTs to chemical concepts, the task will be streamlined by offering only the legitimate combinations with understandable names for complex chemicals and conjugate chemicals. In this way, not only will many errors be prevented, but we expect the labor-intensive type-assignment process to become more efficient. Toward this end, we have designed a decision tree (Figure 6) that can be utilized by an editor when trying to determine the appropriate conjugate type or complex type to be assigned to a chemical concept representing a chemical composed of two or more other chemicals. This decision tree is with respect to the RSN derived using an extent-size threshold value of five. A slightly revised decision tree would be required for a different threshold value.

Figure 6. Decision tree for assigning a type to a chemical concept in the context of the RSN.

Users will benefit from the correction of existing ST assignment errors and the prevention of new ones for chemical concepts. In a recent UMLS study [25], there were two questions pertaining to the extent to which a user is bothered by a list of 12 kinds of errors. Among the errors related to aspects of a concept, the highest concern was for incorrect STs. Therefore, auditing the META for ST assignments is imperative to ensure the overall quality and usability of the UMLS.

The suggested modeling of compound chemicals in the RSN framework will also facilitate user comprehension of such chemical concepts. The suggested categorization of chemical concepts with a unique type will help users see which chemicals are obtained from a combination of other chemicals assigned different structurally viewed CSTs. Furthermore, the categorization will explicitly specify the nature of the combination, either as complex or conjugate.

Size and Scope of the RSN

Although there are various advantages to the RSN, in general, and its finer-grained modeling of structurally viewed CSTs, in particular, one also has to consider practical consequences of its physical implementation. One such consequence is that the number of types in the unrestricted (i.e., threshold value 1) RSN, 690, is about five times the current number of types of the SN, 135. Although the RSN qualifies as a compact abstraction of the META’s 1.5 million concepts, the RSN is not as compact as the SN. The RSN’s increased size does have implications for its pictorial display, either as a diagram or as an indented list. The threshold value can certainly be increased to reduce the number of types. For example, thresholds of 100 and 25 yield a total of 214 and 282 types, respectively.

As an even more conservative approach, one could opt simply to augment the SN itself with just two new types: Conjugate and Complex. In such an arrangement, a conjugate concept originally assigned, say, STs X and Y would instead be assigned X, Y, and Conjugate to make its status as a conjugate explicit. Complexes would be treated analogously. Of course, one would need to provide a set of conventions and guidelines for categorizing the corresponding concepts along these lines.

An alternative to actually materializing the RSN, with its ISTs, complex types, and conjugate types, is to implement it in a virtual manner.
That is, concepts will continue to be assigned multiple types, but additional defined constraints will forbid some illegal combinations of types, without having to resort to the creation of explicit new types. Examples of such constraints might include formalizations of prohibitions found in ST definitions and usage notes, as well as those for exclusive types and redundant type assignments [26]. In upcoming work, we will describe such a virtual RSN framework.

Let us compare the distribution of the extents of structurally viewed CSTs in the SN and the RSN. Table 7 shows for each such CST its extent size in the SN and that of its corresponding OST in the RSN. (Recall that concepts assigned multiple types in the SN are removed from OST extents in the context of the RSN.) As we see, only about 30% (85,450 of the total number of 279,995 concepts) of those extents in the SN carry over to the RSN. Of those, 820 concepts are in ISTs involving two or more chemically viewed CSTs. In this article, their modeling has been revised to be conjugate or complex types. The majority of concepts assigned a structurally viewed CST are also assigned a functionally viewed CST. In upcoming work, we will discuss the representation of such intersections with respect to the RSN.

The distribution of concepts in ISTs involving exactly two structurally viewed CSTs is shown in Table 8, which is laid out in two dimensions. (The abbreviations appearing in the column headings are defined in the corresponding row labels.)
An entry in Table 8 of the form “x, y” indicates that the IST involving the two respective types had x conjugates and y complexes. For example, Steroid ∩ Amino Acid, Peptide, or Protein had 33 conjugates and five complexes (see also Table 2). An entry of “X” indicates a combination forbidden due to exclusiveness, e.g., Steroid and Eicosanoid (the children of Lipid), or redundancy, e.g., Organic Chemical and any of its descendants. Another reason for an “X” entry is due to the definitions and usage notes of the STs in the UMLS, e.g., for Organophosphorus Compound with all STs except Amino Acid, Peptide, or Protein as was shown in Table 3.

Table 7. Comparison of Extent Sizes in the SN and the RSN

| Structurally Viewed Chemical ST | # Concepts in Extent in SN | # Concepts in Extent in RSN |
|---|---|---|
| Chemical Viewed Structurally | 376 | 239 |
| Organic Chemical | 134,424 | 47,866 |
| Steroid | 9,271 | 4,638 |
| Eicosanoid | 1,163 | 527 |
| Nucleic Acid, Nucleoside, or Nucleotide | 7,819 | 3,752 |
| Organophosphorus Compound | 2,212 | 807 |
| Amino Acid, Peptide, or Protein | 103,188 | 16,129 |
| Carbohydrate | 9,376 | 5,312 |
| Lipid | 5,753 | 3,306 |
| Element, Ion, or Isotope | 1,312 | 796 |
| Inorganic Chemical | 5,101 | 2,078 |
| Total | 279,995 | 85,450 |

RSN = refined semantic network; SN = semantic network; ST = semantic type.

Let us further comment on one of the STs of the SN with a disjunctive form of related chemicals. In the analysis of ISTs involving Amino Acid, Peptide, or Protein, we needed to create partial ISTs. It raises the question of whether it is better to lump those chemicals into a single type as in the SN or to separate them into three separate types, one Amino Acid, one Peptide, and one Protein. It seems that a finer-grained categorization with multiple types is better suited for clarifying the nature of the chemicals and the subtleties of their interactions with other kinds of chemicals. To a lesser degree, the question also arises regarding Nucleic Acid, Nucleoside, or Nucleotide.
This separation into different types would make the names of some conjugate and complex types simpler and clearer.

Table 8. Conjugates and Complexes in ISTs Involving Two Structurally Viewed CSTs (rows: first type; columns: second type)

| First Type | CVS | OC | SRD | EID | NANN | OCD | AAPP | CRB | LPD | EII | IC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chemical Viewed Structurally (CVS) | N/A | X | X | X | X | X | X | X | X | X | X |
| Organic Chemical (OC) | | N/A | X | X | X | X | X | X | X | X | X |
| Steroid (SRD) | | | N/A | X | 8, 0 | X | 33, 5 | 0, 8 | X | X | X |
| Eicosanoid (EID) | | | | N/A | 2, 0 | X | 3, 0 | 0, 0 | X | X | X |
| Nucleic Acid, Nucleoside, or Nucleotide (NANN) | | | | | N/A | X | 106, 15 | 149, 0 | 134, 0 | X | X |
| Organophosphorus Compound (OCD) | | | | | | N/A | 30, 0 | X | X | X | X |
| Amino Acid, Peptide, or Protein (AAPP) | | | | | | | N/A | 159, 11 | 82, 39 | X | X |
| Carbohydrate (CRB) | | | | | | | | N/A | 13, 2 | X | X |
| Lipid (LPD) | | | | | | | | | N/A | X | X |
| Element, Ion, or Isotope (EII) | | | | | | | | | | N/A | X |
| Inorganic Chemical (IC) | | | | | | | | | | | N/A |

CST = chemical semantic type; IST = intersection semantic type.

Another Application of the RSN

One possible application of the modeling of the structurally viewed CSTs in the RSN is as an upper-level categorization mechanism for an ontology of chemicals, in the same capacity that the SN serves the META. A natural candidate for this is ChEBI [6], an OBO ontology [27] that models chemicals. It consists of 31,168 concepts. In the following, we examine this potential RSN usage in more detail.

We found that four of the RSN’s conjugate types, Glycopeptide, Lipopeptide, Lipoprotein, and Lipopolysaccharide, appear as concepts in ChEBI and also in the IUPAC Gold Book [24]. In ChEBI, the concept Nucleoprotein apparently represents what we have modeled as Nucleo-amino-acid, Peptide, or Protein Complex and Nucleo-amino-acid, Peptide, or Protein Conjugate. The ChEBI concept Polysaccharide protein carries the name of another of our complex types.
Obviously, if names of structurally viewed CSTs are used at the concept level of ChEBI, then those concepts would naturally be assigned the corresponding conjugate types and complex types if a two-level terminology structure were to be utilized.

We randomly chose 10 concepts each from three new types (two conjugate types, Lipopolysaccharide Conjugate and Nucleo-amino-acid, Peptide, or Protein Conjugate, and one complex type, Nucleo-amino-acid, Peptide, or Protein Complex) and checked whether they actually appear in ChEBI. All 10 concepts from Lipopolysaccharide Conjugate were present in ChEBI, e.g., lipid-linked oligosaccharides and lipoteichoic acid. Among the 10 concepts from Nucleo-amino-acid, Peptide, or Protein Conjugate, 6 were found, e.g., aspartyl adenylate and pacidamycin 1. We also found 6 of the 10 concepts from Nucleo-amino-acid, Peptide, or Protein Complex, including actinomycin D-dATGCAT complex and enterogenin. In all, 22 of our sample of 30 UMLS concepts are part of ChEBI. Those 22 could readily be assigned conjugate types or complex types if an overarching network of categories for ChEBI were desired.

Conclusion

The RSN has previously been introduced as a finer-granularity abstraction of the UMLS’s conceptual content. In particular, it better represents combinations of multiple semantic-type assignments by defining separate high-level types, called intersection semantic types, for each. This elevation of semantic-type combinations to first-class types in their own right helps convey this important knowledge more clearly. It also simplifies type assignments, as all are unique in the context of the RSN. The portion of the UMLS particularly benefiting from the RSN pertains to chemicals because it is natural to combine chemicals of different kinds and obtain new chemicals.
In this article, we further refined that part of the RSN to more accurately convey the knowledge of chemical combinations involving chemicals viewed structurally. Combining such chemicals can yield simple mixtures (referred to as complexes in the field of chemistry) or more complicated chemicals derived via chemical reaction (called conjugates). The RSN was augmented with new types to capture these distinctions. In this way, each structurally viewed chemical concept is assigned a unique type, whether it is an original chemical type, a conjugate type, a complex type, or an intersection type with a functionally viewed CST. Such a categorization will benefit users, who will directly know that a specific chemical is, say, a glycolipoprotein conjugate or a liponucleoprotein complex. Overall, this will enhance user comprehension of the richness of the UMLS’s chemical content.

Additionally, various violations of UMLS modeling rules, as stipulated in semantic type definitions and usage notes, were discovered and corrected with the removal or replacement of types appearing in the original RSN. The suggested additions to the Semantic Network will help UMLS maintenance personnel avoid future type-assignment errors, as a new chemical concept will only be permitted a single assignment of an existing (validated) RSN type: original, conjugate, or complex. Trade-offs between achieving a fully accurate and a highly granular categorization of structurally viewed chemical concepts and practical issues regarding the number of types in the RSN, including threshold limits on extent sizes for type-level qualification, were discussed. We also considered the possibility of using the RSN as an upper-level categorization network for ChEBI.

References

1. Humphreys BL, Lindberg DAB, Schoolman HM, Barnett GO. The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc 1998;5:1–11.
2. McCray AT, Hole WT. The Scope and Structure of the First Version of the UMLS Semantic Network. Los Alamitos, CA: Proc 14th Annual SCAMC 1990:126–30.
3. Schuyler PL, Hole WT, Tuttle MS, Sherertz DD. The UMLS Metathesaurus: representing different views of biomedical concepts. Bull Med Libr Assoc 1993;81:217–22.
4. Geller J, Gu H, Perl Y, Halper M. Semantic refinement and error correction in large terminological knowledge bases. Data Knowledge Eng 2003;45:1–32.
5. Peng Y, Halper M, Perl Y, Geller J. Auditing the UMLS for redundant classifications. Proc AMIA Annu Symp 2002:612–6.
6. Degtyarenko K, de Matos P, Ennis M, et al. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 2008;36(Database issue):D344–50.
7. Gu H, Perl Y, Geller J, Halper M, Liu L, Cimino JJ. Representing the UMLS as an OODB: modeling issues and advantages. J Am Med Inform Assoc 2000;7:66–80.
8. UMLS Documentation, Section 2–Metathesaurus. Available at: http://www.nlm.nih.gov/research/umls/meta2.html. Accessed August 6, 2007.
9. The UMLS Semantic Network. Available at: http://semanticnetwork.nlm.nih.gov. Accessed August 6, 2007.
10. International Union of Pure and Applied Chemistry. Available at: http://www.iupac.org. Accessed August 1, 2007.
11. IUBMB Biochemical Nomenclature and Related Documents. 2nd ed. London: Portland, 1992.
12. Surovoy A, Flechsler I, Jung G. A novel series of serum-resistant lipoaminoacid compounds for cellular delivery of plasmid DNA. Adv Exp Med Biol 1998;451:61–7.
13. Inoue Y. Studies on conjugated proteins (liponucleoprotein-system). I. The interaction between lecithin and ovalbumin. Acta Scholae Medicinalis Universitatis in Kioto 1957;34:276–84.
14. Yang VC, Turcotte JG, Steim JM. Physical properties of arabinofuranosylcytosine diphosphate diacylglycerol, an antitumor liponucleotide. Biochim Biophys Acta 1982;68:375–84.
15. Baldo BA, Fletcher TC, Pepys J. Isolation of a peptido-polysaccharide from the dermatophyte Epidermophyton floccosum and a study of its reaction with human C-reactive protein and a mouse anti-phosphorylcholine myeloma serum. Immunology 1977;32:831–42.
16. Sickmann A, Meyer HE. Phosphoamino acid analysis. Proteomics 2001;1:200–6.
17. Gatti A. Profiling substrate phosphorylation at the phosphopeptide level. Anal Biochem 2003;312:40–7.
18. You YH, Lin ZB. Antioxidant effect of Ganoderma polysaccharide peptide. Acta Pharmaceutica Sinica 2003;38:85–8.
19. Ji Z, Tang Q, Zhang J, Yang Y, Jia W, Pan Y. Immunomodulation of RAW264.7 macrophages by GLIS, a proteopolysaccharide from Ganoderma lucidum. J Ethnopharmacol 2007;112:445–50.
20. Panda S, Panda G. A new example of a steroid-amino acid hybrid: construction of constrained nine membered D-ring steroids. Org Biomol Chem 2007;5:360–6.
21. Wang C, Peng S, Zhang X, Qiu X. The synthesis and immunosuppressive effects of steroid-peptide linkers. Acta Pharmaceutica Sinica 1998;33:111–6.
22. Uscheva AA, Stankov BM, Zachariev SG, Marinova CP, Kanchev LN. Possible synthesis of steroid-protein for immunization with a fixed narrow range of the hapten-protein ratio. J Steroid Biochem 1986;24:699–702.
23. Yu H, Friedman C, Rzhetsky A, Kra P. Representing genomic knowledge in the UMLS Semantic Network. Proc AMIA Annu Symp 1999:181–5.
24. IUPAC Compendium of Chemical Terminology–The Gold Book (XML version). Available at: http://goldbook.iupac.org/. Accessed February 27, 2008.
25. Chen Y, Perl Y, Geller J, Cimino JJ. UMLS users, uses and future agenda. J Am Med Inform Assoc 2007;14:221–31.
26. McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods Inf Med 1995;34:193–201.
27. Smith B, Ashburner M, Rosse C, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007:1251–5.