Design and Creation of Ontologies for Environmental (Multimedia) Information Retrieval*

Vipul Kashyap
National Library of Medicine
[email protected]
Workshop on Science and the Semantic Web
October 24, 2002

* Work done by the author when at MCC and LSDIS Lab, UGA

Outline

Ontologies for Information Retrieval: The InfoSleuth System
The Ontology Design Process:
  – “Reverse Engineering” from a database schema
  – Ontology refinement based on user queries
  – Using a data dictionary and Thesaurus
Ontology-based Multimedia Information Retrieval
  – Information Extraction from Textual Data
  – Information Extraction from Image Data
Conclusions and Future Work

Science on the Semantic Web Workshop – 2

Ontologies for Information Retrieval: The InfoSleuth System

[Diagram labels: ontology-based retrieval query; KQML/OKBC agents; three data sources]
  – Image Database: features, patterns, semantic objects
  – Document Database, e.g., Verity
  – Structured Database, e.g., Oracle

A Multimedia GIS Query using an ontological model

Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres having an urban land cover and such that all the nearby fires have excellent containment

[Ontology diagram]
  Region: county, block, area, population, spatial_location, land_cover
  Fire: name, containment
  Region isLocatedNear Fire

  select county, block, spatial_location
  from region
  where area > 50 and population > 500 and land_cover = "urban"
    and region.isLocatedNear.containment = "excellent"

Ontologies for Information Retrieval

Provide a concise, uniform, declarative description of semantic information
Independent of syntactic representations, conceptual models of the underlying information bases
Domain models provide wider access by supporting multiple world views on the same underlying data
EDEN ontology defined in the context of the InfoSleuth system:
  – important and crucial to capture elements of environmental information

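As a rough illustration of the multimedia GIS query stated earlier, the sketch below evaluates it against a tiny in-memory version of the ontological model. The `Region` and `Fire` classes, the sample data, and the reading of the path expression `region.isLocatedNear.containment` as "every nearby fire" are assumptions made for this example; they are not the InfoSleuth implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fire:
    name: str
    containment: str


@dataclass
class Region:
    county: str
    block: str
    area: float            # acres
    population: int
    land_cover: str
    spatial_location: str
    is_located_near: List[Fire] = field(default_factory=list)


def query_regions(regions):
    """Return (county, block, spatial_location) for regions matching the query."""
    return [
        (r.county, r.block, r.spatial_location)
        for r in regions
        if r.area > 50
        and r.population > 500
        and r.land_cover == "urban"
        # Assumption: the path expression means *all* nearby fires qualify.
        # Note that all() is True for a region with no nearby fires.
        and all(f.containment == "excellent" for f in r.is_located_near)
    ]


# Hypothetical sample data, invented for this example.
REGIONS = [
    Region("Travis", "B1", 80.0, 900, "urban", "loc-1",
           [Fire("F1", "excellent")]),
    Region("Travis", "B2", 80.0, 900, "urban", "loc-2",
           [Fire("F1", "excellent"), Fire("F2", "poor")]),
    Region("Hays", "B3", 20.0, 900, "urban", "loc-3",
           [Fire("F1", "excellent")]),
]

RESULT = query_regions(REGIONS)
```

Only the first sample region satisfies every condition: the second has a nearby fire with poor containment, and the third is too small.
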
Sources for Ontology construction

Pre-existing Database Schemas
  – data directed component
Collection of representative set of queries possibly parameterized based on application user interface
  – application directed component
Thesauri and Vocabularies (e.g., EEA Thesaurus)
  – knowledge directed component
Ontology = knowledge-based middle ground between applications and data !!!

The Ontology Design Process

[Flowchart; box labels as recovered from the slide]
Ontology from Database Schema:
  – Choose new Database Schema
  – Abstract details from Database Schema
  – Determine entities and attributes
  – Determine Relationships
  – Group information, Analyze foreign keys and dependencies
Ontology from Queries:
  – Choose new query
  – Add new entities and attributes
  – Add new subclasses and superclasses
  – Drop entities and attributes
  – No more queries
Implement and Test
Evaluate Ontology

Environmental Databases

CERCLIS 3
  – http://www.epa.gov/enviro/html/cerclis/
ITT HAZDAT
  – http://www.atsdr.cdc.gov/hazdat.html
ERPIMS
  – http://ns1.ktc.com/personal/larnold/erpims.htm
Basel Convention Database
  – http://www.unep.ch/basel

Grouping Information in Multiple Tables

Database Schema:
  Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id
  Site_Characteristic: site_id (PK, FK to Site), rsic_code (PK, FK to Ref_Sic), sc_date
  Ref_Sic: rsic_code (PK), rsic_code_desc
  Site_Alias: site_id (PK, FK to Site), site_alias_id (PK), sa_name
Ontology:
  Site: description, name, code, date, alias_name

Identifying Relationships

Database Schema:
  Site: site_id (PK), site_name, site_ifms_ssid_code, site_rcra_id, site_epa_id
  Ref_action_type: rat_code (PK), rat_name, rat_def
  Action: site_id (PK, FK to Site), rat_code (PK, FK to ref_action_type), act_code_id (PK)
  Waste_Src_Media_Contaminated: wsmrc_nmbr (PK), site_id (PK, FK to Action), rat_code (FK to Action), act_code_id (FK to Action)
  Remedial_Response: site_id, act_code_id, rat_code
Ontology:
  Entities: Site, RemedialResponse, Contaminant
  Attribute: actionName
  Relationship: PerformedAt

Ontology refinement based on user queries

Addition of New Attributes
  – At NPL sites with a land use category of INDUSTRIAL, what is the cleanup level range for LEAD ….
  – Add an attribute landUseCategory to the entity Site in the ontology
Addition of new Relationships
  – What is the range of concentrations for ARSENIC as a contaminant of concern in the SURFACE SOIL at NPL sites?
  – Add a relationship HasContaminant between the entities Site and Contaminant in the ontology
Addition of class-subclass relationships and new entities
  – How many Superfund sites are in Edison County, New Jersey?
  – Add an entity SuperFundSite as a subclass of Site in the ontology

Using a data dictionary (EDR) to enhance the ontology

[Diagram]
  Site: state
  Map: coding_scheme1, coding_scheme2, coding_scheme3
  Value sets: StateName { "Texas", "California" }, StateCode, StateAbbr { "TX", "CA" }

  select * from Site where state = 'TX' or state = 'California'
  select coding_scheme1 from Map where coding_scheme3 = 'TX'

Enhancing the Ontology by using a Thesaurus

Thesaurus entry: abandoned site
  – THEME: POLLUTION
  – BT: land setup
  – NT: disused military site
Ontology classes: LandSetup, Site, SuperfundSite, AbandonedSite, DisusedMilitarySite

Information Extraction from Text and Multimedia Data

Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres having an urban land cover and such that all the nearby fires have excellent containment

[Ontology diagram]
  Region: county, block, area, population, spatial_location, land_cover
  Fire: name, containment
  Region isLocatedNear Fire

  select county, block, spatial_location
  from region
  where area > 50 and population > 500 and land_cover = "urban"
    and region.isLocatedNear.containment = "excellent"

Information Extraction from Textual Data

[Diagram labels: Fire (fire.name, containment) isLocatedNear Region (county, block, state)]

containment = "excellent"
  <ACCRUE>(<SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)),
           <PHRASE>(full, containment, <STEM>(was), expected),
           <PHRASE>(the, fire, <STEM>(is), contained))
region.county
  <ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), [region.county]),
                      <OR>(county, block, state)))
isLocatedNear
  <PARAGRAPH>(FIRE, REGION)

Mapping “domain specific” model elements to media specific metadata

county(x,y) gets mapped to:
  – word(x), phrase(x), accrue(<list-of-subtrees>)
containment(x, "excellent") gets mapped to:
  – sentence(<set-of-words>), stem(x), accrue(<list-of-subtrees>)
isLocatedNear(x, y) gets mapped to:
  – paragraph(x,y)

Mapping SQL queries to Topic Expressions

  select county from region where isLocatedNear.containment = "excellent"

  <PARAGRAPH>(
    <ACCRUE>(<SENTENCE>(<AND>(<NUMBER>(X), X < 25), <WORD>(%), <WORD>(active)),
             <PHRASE>(full, containment, <STEM>(was), expected),
             <PHRASE>(the, fire, <STEM>(is), contained)),
    <ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), [region.county]), county))
  )

Limitations of Current Indexing Technologies: “selection operation”

  select county from region

  <ACCRUE>(<SENTENCE>(<PHRASE>(<OR>(New, Las, San), WILDCARD), <OR>(county, block, state)))

=> post-processing of patterns returned (WILDCARD as place-holder)
Problem:
  – WILDCARD may match a lot of words in the same sentence
  – WILDCARD may match different words in different sentences

Using NLP and statistical techniques

WILDCARD matches a number of words in the same sentence
  – Yeltsin was appointed the Prime Minister when sleeping
  – [part-of-speech labels on the slide: article, noun, conjunction, verb]
=> Use part of speech tagging to reduce number of possibilities
WILDCARD matches different words in different sentences
  – Yeltsin was appointed Prime Minister
  – Yeltsin was appointed President
=> use frequency statistics to give a level of confidence

Definition Support

[Screenshot: an extraction definition overlaid on a fire incident report]
  Phrase: SIMELS, Galena District, BLM.
  Slot: fire.name
  Value: SIMELS
  Structure: <name> , <place> , <unit> .

Report text visible in the screenshot (lines are cut off at the window edge):
  INCIDENT MANAGEMENT SITUATION REPORT
  Friday August 1, 1997 - 0530 MDT
  NATIONAL PREPAREDNESS LEVEL II
  CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires ha
  staffed for structure protection.
  SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena
  The fire is active on the southern perimeter, which is burning into a continuous stand of black s
  SIMELS fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern
  35% contained, while protection of the historic cabin continues.
  CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Weh
  assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up.
  All crews and overhead will mop-up wher
  burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this we
  depending on the results of infrared scanning.

MIDAS*: Information Extraction from Multimedia Data

Query: Get me all regions (blocks, counties) having a population greater than 500 and area greater than 50 acres having an urban land cover

  select county, block, area, population, spatial_location, land_cover
  from region
  where area > 50 and population > 500 and land_cover = 'urban' and relief = 'moderate'

* Media Independent DomAin Specific correlation

Get me all regions (counties, blocks) having 50 < population < 100, 25 < area < 50 and low density urban area land cover ...
  – media independent correlation across domain specific metadata
  – correlation across image and structured data at an intensional domain level

[Screenshot: query result form, grouped by data source]
SQL queries to structured data (Census DB)
  – Population, Area
SQL Gateway to textual data (TIGER/Line DB)
  – Boundaries
Image Processing routines for Image Data
  – Land cover, Relief

Mapping “domain specific” model elements to media specific metadata

contained(<concept>, <image>) gets mapped to:
  – latitude/longitude, image-coordinates
  – bounding box of region
  – image type: LULC, DEM
land_cover(x, "low density urban") gets mapped to:
  – percentage(<pixel-color>, <bounding-box>)
relief(x, "moderate") gets mapped to:
  – standard-deviation(<pixel-value>, <bounding-box>)

Need for characterization of Domain Vocabularies

Geological Region (by land cover)
  – Urban: Industrial, Residential, Commercial
  – Water: Lakes, Reservoirs, Streams and Canals
  – Forest Land: Evergreen, Deciduous, Mixed
Geological Region (by administrative unit)
  – State, County, City, Rural Area, Tract, Block Group, Block
Another source of domain ontology construction:
  – Classification Standards

Conclusions and Future Work

Role of semantic content in handling data/information overload
  – Domain Specific ontologies: an approach for capturing semantic content
Design and construction of domain ontologies
  – labor intensive, time consuming, difficult endeavor
  – Re-use readily available information: schemas, queries, data dictionaries, thesauri
  – minimize the involvement of the domain expert
Metadata is the key for MultiMedia Information Retrieval
  – Use an expanded notion of metadata as schema and declarative SQL like query language
  – Pragmatic Incorporation of NLP/Image+Speech+Video Processing/Computer Vision techniques
  – Exploit synergy across multiple media for better precision and performance
Extrapolate this technique into other domains:
  – Medical and Bio-Informatics
  – telecommunication – IP networks (use of CIM information model by DMTF)
Ontology Extraction from Textual Data:
  – Clustering techniques to identify central concepts and taxonomic relationships
  – NLP techniques to identify concept associations
  – Consensus analysis techniques to establish ontologies
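
The image mappings listed earlier, land_cover to percentage(<pixel-color>, <bounding-box>) and relief to standard-deviation(<pixel-value>, <bounding-box>), can be sketched in a few lines. This is only an illustration: the list-of-rows raster, the class code 11, and the thresholds of 50 percent and 5.0 are hypothetical values chosen for the example and do not come from the slides.

```python
from statistics import pstdev


def crop(grid, bbox):
    """Flatten the pixels of `grid` (a list of rows) inside the bounding box.

    bbox = (row0, col0, row1, col1); the upper bounds are exclusive.
    """
    row0, col0, row1, col1 = bbox
    return [value for row in grid[row0:row1] for value in row[col0:col1]]


def percentage(grid, pixel_class, bbox):
    """Percentage of pixels in the bounding box that carry `pixel_class`."""
    pixels = crop(grid, bbox)
    return 100.0 * sum(1 for p in pixels if p == pixel_class) / len(pixels)


def standard_deviation(grid, bbox):
    """Population standard deviation of the pixel values in the bounding box."""
    return pstdev(crop(grid, bbox))


# Hypothetical land-use/land-cover (LULC) raster; 11 stands for "low density urban".
LULC = [
    [11, 11, 41, 41],
    [11, 11, 41, 41],
    [52, 52, 41, 41],
]
# Hypothetical elevation (DEM) raster on the same grid.
DEM = [
    [100, 102, 300, 310],
    [104, 106, 305, 320],
    [90, 95, 315, 330],
]

BBOX = (0, 0, 2, 2)  # top-left 2x2 block of pixels

# Assumed thresholds for the two domain-level predicates.
LOW_DENSITY_URBAN = percentage(LULC, 11, BBOX) > 50.0
MODERATE_RELIEF = standard_deviation(DEM, BBOX) < 5.0
```

Each domain-level predicate reduces to one aggregate over the pixels of a region's bounding box, so the same declarative query could be answered from image data without the user naming any image-processing routine.
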