Part 3B: Text Indexing, Term Lists & Taxonomies

Value space continuum of expressivity…
  [Figure: a continuum running from less to more expressive, with increasing control over form, relationships and meaning. Labels on the figure: Thesauri, Text indexing, Term lists, Ontology, Faceted Classification, Taxonomies, Tagging, Enumerated Classification, Analytico-synthetic Classification.]

Text Indexing
  ▪ Full-text and inverted files/indexes

Inverted files…
  Primary form of index developed for full-text retrieval in information systems.
  It is called an “inverted file” because the normal rows (documents) and columns (words) of a database are inverted: rows represent words and columns represent documents.

Example inverted file…
  Main Data File
    ID  HOUSE                         PRICE
    1   1208 Twin Oaks Way            $100,000
    2   100 Sutton Heights            $200,000
    3   10 Pine Street                $150,000
    4   8539 Billings Circle          $100,000
    5   9537 Highway 101 North        $100,000
    6   10 Capitol Hill Avenue North  $150,000

  Inverted File or Inverted Index
    PRICE     IDs
    $100,000  1, 4, 5
    $150,000  3, 6
    $200,000  2

Inverted file (document level)…
  Document  Text
    1       Gold silver truck
    2       Shipment of gold damaged in a fire
    3       Delivery of silver arrived in a silver truck
    4       Shipment of gold arrived in a truck

  Each entry is <Times; Documents>: the number of documents containing the term, then the document numbers.
    Number  Term      <Times; Documents>
    1       a         <3; 2,3,4>
    2       arrived   <2; 3,4>
    3       damaged   <1; 2>
    4       delivery  <1; 3>
    5       fire      <1; 2>
    6       gold      <3; 1,2,4>
    7       of        <3; 2,3,4>
    8       in        <3; 2,3,4>
    9       shipment  <2; 2,4>
    10      silver    <2; 1,3>
    11      truck     <3; 1,3,4>

Inverted file (term-level)…
  Proximity operator support
  Same four documents as above. Each entry is <Times; (Document; Words)>, where Words gives the word positions of the term within that document. Storing positions is what allows proximity operators to be evaluated from the index.
    Number  Term      <Times; (Document; Words)>
    1       a         <3; (2;6),(3;6),(4;6)>
    2       arrived   <2; (3;4),(4;4)>
    3       damaged   <1; (2;4)>
    4       delivery  <1; (3;1)>
    5       fire      <1; (2;7)>
    6       gold      <3; (1;1),(2;3),(4;3)>
    7       of        <3; (2;2),(3;2),(4;2)>
    8       in        <3; (2;5),(3;5),(4;5)>
    9       shipment  <2; (2;1),(4;1)>
    10      silver    <2; (1;2),(3;3,7)>
    11      truck     <3; (1;3),(3;8),(4;7)>

Inverted file (document level)…
  Stop words
  Same four documents and the same 11-term document-level index as shown earlier.

Term Lists

Term lists…
  The simplest forms of controlled value spaces are term lists: lists of controlled terms ordered by some principle (frequently alphabetical).
  Examples:
    Infants, Ankle biters, Rug rats -> Infants (preferred term)
    The list of authorized U.S. state abbreviations
    An alphabetic list of enumerated subject terms

Simple (yet powerful) lists…
  A list (also sometimes called a pick list) is a limited set of terms arranged as a simple alphabetical list or in some other logically evident way.
  Lists are used to describe aspects of entities that have a limited number of possibilities. Examples include geography (e.g., country, state, city), language (e.g., English, French, Swedish), or format (e.g., text, image, sound).
  Simple alphabetical list: Alabama, Alaska, Arkansas, California, Connecticut, Delaware
  Simple logical list: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto*

Taxonomies
  ▪ Yahoo! Directory

Dominant form on the Web…
  Hierarchical tree structure
  Example: Yahoo! Directory
  Frequently permit polyhierarchy (multiple parents)
  No general principles guiding design of taxonomies
  “A collection of controlled vocabulary terms organized into a hierarchical structure. Each term in a taxonomy is in one or more parent/child (broader/narrower) relationships to other terms in the taxonomy.” [NISO/Z39.19] [emphasis added]

Polyhierarchy… [NISO/Z39.19]
  Based on generic relationship:
    musical instruments
      stringed instruments, percussion instruments
        piano (child of both)
  Based on whole-part relationship:
    biology, chemistry
      biochemistry (child of both)
  Based on multiple types of relationship:
    bones, head
      skull (child of both)

Node Labels
  milk
  . <milk by source animal>
  .. buffalo milk
  .. cow milk
  .. goat milk
  .. sheep milk
  . <milk by region>
  .. United States
  .. India
  .. China
  Node labels are non-indexable concepts used to organize other concepts in meaningful ways.
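Sketch: building the inverted files from the Text Indexing slides…
  The following is a minimal Python sketch, not part of the original slides, of how the term-level inverted file for the four sample documents could be built and then reduced to the document-level form. The function names (build_term_level_index, document_level_postings, within) are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is an assumption chosen so that the word positions match those on the slide. The within function illustrates one simple way a proximity operator could use the stored positions.

```python
from collections import defaultdict

# The four sample documents from the slides, keyed by document number.
DOCS = {
    1: "Gold silver truck",
    2: "Shipment of gold damaged in a fire",
    3: "Delivery of silver arrived in a silver truck",
    4: "Shipment of gold arrived in a truck",
}


def build_term_level_index(docs):
    """Map each term to {document number: [1-based word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumed tokenization: lowercase, split on whitespace.
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


def document_level_postings(index, term):
    """Return (number of documents, document numbers), i.e. <Times; Documents>."""
    doc_ids = sorted(index.get(term, {}))
    return len(doc_ids), doc_ids


def within(index, term_a, term_b, max_distance):
    """Documents where the two terms occur at most max_distance words apart."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    # Only documents containing both terms can match.
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(pa - pb) <= max_distance
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return hits


index = build_term_level_index(DOCS)
print(document_level_postings(index, "gold"))  # (3, [1, 2, 4])
print(index["silver"])                         # {1: [2], 3: [3, 7]}
print(within(index, "silver", "truck", 1))     # [1, 3]
```

  The document-level form is derived by discarding the positions, which is why it cannot answer the proximity query on its own.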