CS 245A - Syllabus (2005)
- Knowledge Discovery in Databases
- Query Processing With Domain Semantics
- Capture Database Semantics by Rule Induction
- Intensional Query Answering
- Fault Tolerant DDBMS Via Data Inference
- Intelligent Dictionary Directory
- Uncertainty Management Using Rough Sets
- Data Mining Techniques (Ch 4-7, H & K)
- Active Databases
- Mediators in Information Systems
- KQML: A Language and Protocol for Knowledge and Information Exchange

CS 245A - Syllabus (cont’d)
- CoBase
- CoSent
- Relaxation for XML Documents
- Query Formation From High-level Concepts
- Knowledge Acquisition for Query Relaxation
- Principles of Case-based Reasoning
- A Case-based Reasoning Approach to AQA
- CoXML
- Data Mining for Sequence Data
- Extracting Key Features from Free Text
- Knowledge-based Approach for Free Text Retrieval
- Content-based Information Retrieval
- Digital Library

References
- Course notes: Intelligent Information Systems, CS245A, Course Reader Material, 1141 Westwood Blvd, 310-443-3303.
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.
- Wesley Chu and T.Y. Lin (eds.), Foundations and Advances in Data Mining, Springer, 2005.

CS 245A Intelligent Information Systems
Wesley W. Chu
Computer Science Department, U. of California, Los Angeles, CA

Knowledge Discovery In Databases
- Information explosion
  - Information doubles every 20 months
  - Increase in the number and size of DBs
    - NASA: Earth observation satellites, 1 picture/sec
    - Human genome: several billion genetic bases
    - US census data: lifestyle and subculture of the US
- How to analyze these databases (raw data)?
  - There is a gap between data generation and data understanding
  - Intelligent data analysis will be useful and valuable
    - AA uses its frequent flyer DB to find its better customers for specific market promotions

Knowledge Discovery In Databases (Cont’d)
    - Banks use customers’ loan and credit information to derive better loan approval and bankruptcy protection
    - Packaged-goods manufacturers use scanned supermarket data to measure the effect of their promotions and to look for shopping patterns
- Techniques
  - Machine Learning
  - Statistics
  - Information Theory
  - Fuzzy Sets

Knowledge Discovery
- Extraction of implicit, previously unknown, and potentially useful information from data
- Given a set of facts (data) F, a language L, and a measure of certainty C, a pattern is a statement S in L that describes the relationship among a subset Fs of F with certainty C, such that S is a simpler representation than the enumeration of all facts in Fs
- Discovered knowledge: the output of a program that monitors the set of facts in a DB and produces patterns

Patterns
- Expressed in a high-level language
  - Understood and used directly by people
  - Can be input to another program (e.g., an expert system)
- e.g., If age < 25 and Driver-Education-Course = No Then At-Fault-Accident = Yes with likelihood = 0.3

Patterns (Cont’d)
- Patterns that are completely unrelated to current goals are not considered knowledge.
  - e.g., a pattern relating at-fault accidents to a driver’s age is not useful for analyzing auto sales figures.
- Pattern + interesting results = knowledge
  - Age > 16 is not an interesting pattern for drivers, since all drivers are required to have age > 16.
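The likelihood attached to a pattern, and the test for an uninteresting pattern, can be illustrated with a small sketch. This is only an illustration under stated assumptions: the records in `DRIVERS`, the field names, and the helper names `rule_likelihood` and `holds_for_all` are hypothetical, and likelihood is taken here to mean the fraction of records matching the condition that also match the conclusion.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Predicate = Callable[[Record], bool]


def rule_likelihood(records: List[Record], condition: Predicate,
                    conclusion: Predicate) -> float:
    """Fraction of records satisfying `condition` that also satisfy `conclusion`."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if conclusion(r)) / len(matching)


def holds_for_all(records: List[Record], pattern: Predicate) -> bool:
    """A pattern true of every record tells us nothing new (e.g., age > 16 for drivers)."""
    return all(pattern(r) for r in records)


# Hypothetical driver records, invented for this example.
DRIVERS: List[Record] = [
    {"age": 19, "driver_ed": False, "at_fault": True},
    {"age": 22, "driver_ed": False, "at_fault": False},
    {"age": 24, "driver_ed": False, "at_fault": False},
    {"age": 17, "driver_ed": False, "at_fault": False},
    {"age": 23, "driver_ed": True, "at_fault": False},
    {"age": 40, "driver_ed": False, "at_fault": False},
]

# If age < 25 and Driver-Education-Course = No Then At-Fault-Accident = Yes
likelihood = rule_likelihood(
    DRIVERS,
    condition=lambda r: r["age"] < 25 and not r["driver_ed"],
    conclusion=lambda r: bool(r["at_fault"]),
)
trivial = holds_for_all(DRIVERS, lambda r: r["age"] > 16)
print(likelihood, trivial)
```

On this toy data, one of the four young drivers without driver education had an at-fault accident, so the rule gets likelihood 0.25, while the pattern Age > 16 holds for every record and would be discarded as uninteresting.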
Knowledge Discovery in DB Exhibits Four Main Characteristics
- High-level language: understood by human users
- Accuracy: expressed by a measure of uncertainty
- Interesting results: patterns are novel and potentially useful
- Efficiency: running times for large DBs are predictable and acceptable

Efficiency
- The discovery process should be efficiently implemented on a computer.
- An algorithm is considered efficient if its run time and space are a low-degree polynomial function of the input length.
- e.g., efficient algorithms exist for restricted concept classes:
  - Conjunctive concepts: (A ∧ B ∧ C)
  - Conjunctions of clauses, each a disjunction of no more than k literals: (A ∨ B) ∧ (C ∨ D) ∧ (E ∨ F), k = 2

Machine Learning
- A learning algorithm takes the data set and its accompanying information as input and returns a statement (e.g., a concept) representing the results of the learning as output.
- Data sets can be a file of records in a DB.
- Problems in learning from DBs. DBs are:
  - Dynamic
  - Incomplete
  - Noisy
  - Much larger than typical machine learning data sets
- Much of the work on learning from DBs focuses on overcoming these complications!
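As an illustration of a learning algorithm that takes a data set and returns a concept, here is a minimal sketch of a learner for the conjunctive concept class mentioned under Efficiency. It uses the classic elimination approach (start with every literal, drop those contradicted by a positive example), which is not taken from the slides. It assumes boolean attributes, noise-free data, and positive examples only; the names `learn_conjunction`, `satisfies`, `ATTRIBUTES`, and `POSITIVES` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Example = Dict[str, bool]
Literal = Tuple[str, bool]  # (attribute, required value)


def learn_conjunction(attributes: List[str],
                      positives: Iterable[Example]) -> Set[Literal]:
    """Learn a conjunction of literals consistent with all positive examples.

    Starts from the most specific hypothesis (every literal and its negation)
    and removes each literal that some positive example contradicts. The work
    is proportional to (number of examples) x (number of attributes).
    """
    hypothesis: Set[Literal] = {(a, v) for a in attributes for v in (True, False)}
    for example in positives:
        hypothesis = {(a, v) for (a, v) in hypothesis if example[a] == v}
    return hypothesis


def satisfies(example: Example, hypothesis: Set[Literal]) -> bool:
    """True if the example meets every literal in the conjunction."""
    return all(example[a] == v for a, v in hypothesis)


# Hypothetical records whose common description is A and B and C.
ATTRIBUTES = ["A", "B", "C", "D"]
POSITIVES = [
    {"A": True, "B": True, "C": True, "D": False},
    {"A": True, "B": True, "C": True, "D": True},
]

concept = learn_conjunction(ATTRIBUTES, POSITIVES)
print(sorted(concept))
```

The sketch ignores the complications listed above (dynamic, incomplete, and noisy data); a single mislabeled record would remove a literal that should have been kept.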
Related Approaches
- DB Management
  - Integrity
  - Querying in DB
  - Deduction in DB
  - OODBM
- Expert Systems
  - Expert-generated knowledge is usually of higher quality than the data in a DB
  - Only covers the important cases
  - Experts are available to confirm the validity and usefulness of discovered patterns
  - Autonomy of discovery is lacking in expert systems

Related Approaches (Cont’d)
- Statistics
  - Ill suited for nominal and structured data types
  - Precludes the use of domain knowledge
  - Difficult to interpret
  - Requires the guidance of the user to specify when and how to analyze the data

Scientific Discovery
- DBKD is less purposeful and less controlled than SD
- Scientists can reformulate and rerun their experiments should they find the initial design was inadequate
- Database managers rarely have the luxury of redesigning their data fields and recollecting the data

A Framework for Knowledge Discovery
- Input
  - Raw data from the DB
  - Information from the data dictionary
  - Additional domain knowledge
  - User-defined biases that provide high-level focus
- Output
  - New domain knowledge
  - Feedback of the discovered knowledge to generate new knowledge
- DB issues
  - Dynamic data (time sensitive; e.g., weight & height, pulse rate)
  - Irrelevant fields (zip codes, pulse rate, sex)
  - Missing data
  - Noise and uncertainty
  - Missing fields

Translation Between Database Management and Machine Learning Terms

Conflicting Viewpoints Between Database Management and Machine Learning

A Framework for Knowledge Discovery in Databases

Database and Knowledge
- Domain knowledge assists discovery by narrowing the search scope
  - Data dictionary
  - Inter-field knowledge, e.g., weight and height
  - Inter-instance knowledge, e.g., age + height = seniority, age + weight = seniority
- Contradictory: can rule out valuable discovery
  - “Trucks don’t drive over water” eliminates a potentially interesting solution, “Trucks drive over frozen lakes in winter.”

Discovered Knowledge
- Form
  - Inter-field patterns: related values of fields in the same record, e.g.
    (procedure = surgery implies days in hospital > 5)
  - Inter-record patterns: aggregated over groups of records, or identifying useful clusters (e.g., profit-making companies)
  - Rules: X > Y1, A => B form causal chains or networks

Discovered Knowledge (cont’d)
- Representation
  - Discovery must be represented in a form appropriate for the intended user.
    - Human: natural language, formal logic, visual depictions of information
    - Computer program (expert system shells): programming language, declarative formalisms
    - Discovery system: feedback as domain knowledge
  - Need a common representation
- Uncertainty
  - Patterns are often probabilistic rather than deterministic, because of:
    - missing and erroneous data
    - inherent indeterminism of the underlying real-world causes (50% chance of rain tomorrow)
    - sampling

Discovered Knowledge (cont’d)
- Measures
  - Proof of success
  - Standard deviation
  - Belief measures
  - Linguistic uncertainty: fuzzy sets
  - Visual presentations by density, size, and shading
- Sampling technique for large DBs
  - Accuracy of results depends on sample size

Discovery Algorithms
- Machine learning:
  - Unsupervised learning
  - Supervised learning
- Unsupervised learning:
  - Pattern identification: identifying interesting patterns and describing them in a concise and meaningful manner
  - Examples
    - customers with income > $25,000/yr
    - questionable insurance claims

Discovery Algorithms (Cont’d)
- Methods:
  - Traditional clustering
    - Minimize similarity between classes
    - Maximize similarity within classes
    - Drawbacks
      - Based on Euclidean distance; works well only on numerical data
      - Inability to use background information such as likely cluster shape
  - Conceptual clustering
    - Based on attribute similarity and conceptual cohesiveness (defined by background information)
  - Interactive clustering
    - Combines the human user’s knowledge with the computational power of the computer

Discovery Algorithms (Cont’d)
- Supervised learning:
  - Description process: summarizes relevant qualities of the identified class
- In discovery systems, user supervision can occur in either the
identification or description process.

Concept Description (Supervised Concept Learning)
- Discovery in large, complex databases requires both empirical methods to detect the statistical regularity of patterns and knowledge-based approaches to incorporate available domain knowledge.
- Discovery tasks
  - Summarization: summarize class records by describing their common or characteristic features
  - Discrimination: describe qualities sufficient to discriminate records of one class from another
  - Comparison: describe the class in a way that facilitates comparison and analysis with other records

Future Directions
- Domain knowledge: how to effectively use domain knowledge to discover knowledge
- Efficient algorithms
  - Restrict rule type
  - Heuristic and approximate algorithms
  - Sampling
  - Parallel computing
  - OODBM
  - Deductive DB
- Incremental methods
  - Efficiently keep pace with changes in data
  - Incremental discovery systems reuse their discoveries to make more complex discoveries

Future Directions (cont’d)
- Interactive systems
  - Knowledge analyst included in the discovery loop
  - Use human judgement and machine computation power
  - Need information to be presented in a human-oriented form (text, sound, visuals)
- Integration

Applications of Discovery in DB
- Medicine
- Finance
- Agriculture
- Social
- Marketing & Sales
- Insurance
- Engineering
- Physics & Chemistry
- Military
- Law Enforcement
- Space Science
- Publishing

Applications of Discovery in DB (Cont’d)
- Discovery of Quantitative Laws
- Data Driven Discovery of Quantitative Laws
- Using Knowledge in Discovery
- Data Summarization
- Domain Specific Discovery Methods
- Integrated & Multi-Paradigm Systems
- Methodology and Application Issues

Query Processing With Domain Semantics
Wesley W. Chu

Query Optimization Problem
- To find a sequence of operations that has the minimal processing cost.
Conventional Query Optimization (CQO)
For a given query:
- Generate a set of queries that are equivalent to the given query
- Determine the processing cost of each such query
- Select the lowest-cost query processing strategy among these equivalent queries

Limitations of CQO
- There are certain queries that cannot be optimized by Conventional Query Optimization.
- For example, given the query “Which ships have deadweight greater than 200 thousand tons?”, a search of the entire database may be required to answer it.

The Use of Knowledge
- ASSUMING THE EXPERT KNOWS THAT:
  1. The SHIP relation is indexed on ShipType, and there are about 10 different ship types, and
  2. the ship must be a “SuperTanker” (one of the ShipTypes) if the deadweight is greater than 150K tons.
- AUGMENTED QUERY: “Which SuperTankers have deadweight greater than 200K tons?”
- RESULT: About 90% of the time is saved in searching for the answers.
- The technique of improving queries with semantic knowledge is called Semantic Query Optimization.

Semantic Query Optimization (SQO)
- Uses domain knowledge to transform the original query into a more efficient query that still yields the same answer.
- Assuming a set of integrity constraints is available as the domain knowledge:
  - Represent each integrity constraint as Pi --> Ci, where 1 ≤ i ≤ n.
  - Translate (augment) the original query Q into Q’ subject to C1, C2, ..., Cn, such that Q’ yields lower processing cost than Q.
- Query Optimization Problem: find C1, C2, ..., Cm that yield minimal query processing cost; that is,
  C(Q’) = min over Ci of C(Q ∧ C1 ∧ ... ∧ Cm)

Semantic Equivalence
- Domain knowledge of the database application may be used to transform the original query into semantically equivalent queries.
- Semantic Equivalence: Two queries are considered to be semantically equivalent if they result in the same answer in any state of the database that conforms to the Integrity Constraints.
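The SuperTanker example above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the optimizer described in the slides: the `SHIPS` tuples, the in-memory `TYPE_INDEX`, and the function names are hypothetical, and cost is approximated by the number of tuples examined. The constraint is applied only when the query predicate implies its antecedent, which is what keeps the augmented query semantically equivalent to the original.

```python
from collections import defaultdict
from typing import List, Tuple

Ship = Tuple[str, str, int]  # (name, ship_type, deadweight in thousand tons)

# Hypothetical SHIP relation; every tuple obeys the constraint below.
SHIPS: List[Ship] = [
    ("Alpha", "SuperTanker", 250),
    ("Bravo", "SuperTanker", 180),
    ("Charlie", "Tanker", 90),
    ("Delta", "Freighter", 40),
    ("Echo", "Tanker", 120),
    ("Foxtrot", "SuperTanker", 210),
    ("Golf", "Cruiser", 15),
    ("Hotel", "Freighter", 60),
]

# Stand-in for the index on ShipType.
TYPE_INDEX = defaultdict(list)
for ship in SHIPS:
    TYPE_INDEX[ship[1]].append(ship)

# Integrity constraint P --> C: deadweight > 150 implies ShipType = "SuperTanker".
CONSTRAINT_MIN_DEADWEIGHT = 150
CONSTRAINT_SHIP_TYPE = "SuperTanker"


def conventional_plan(min_deadweight: int) -> Tuple[List[str], int]:
    """Scan the whole relation; returns (answer, tuples examined)."""
    answer = sorted(name for name, _, dw in SHIPS if dw > min_deadweight)
    return answer, len(SHIPS)


def semantic_plan(min_deadweight: int) -> Tuple[List[str], int]:
    """Add ShipType = SuperTanker when the query implies deadweight > 150."""
    if min_deadweight >= CONSTRAINT_MIN_DEADWEIGHT:
        candidates = TYPE_INDEX[CONSTRAINT_SHIP_TYPE]  # use the index
    else:
        candidates = SHIPS  # constraint not applicable; fall back to a scan
    answer = sorted(name for name, _, dw in candidates if dw > min_deadweight)
    return answer, len(candidates)


print(conventional_plan(200))
print(semantic_plan(200))
```

Both plans return the same ships for a threshold of 200, but the augmented plan examines only the tuples reached through the ShipType index. For a threshold of 100 the constraint does not apply and both plans scan the full relation.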
- Integrity Constraints: A set of if-then rules that force the database to be an accurate instance of the real-world database application.
- Examples of constraints include:
  - state snapshot constraints: e.g., if deadweight > 150K then ShipType = “SuperTanker.”
  - state transition constraints: e.g., salary can only be increased, i.e., salary (new) > salary (old)

Limitations of Current Approach
- The current approach to SQO uses:
  - Integrity constraints as knowledge
  - Conventional data models

Limitations of Integrity Constraints
- Integrity constraints are often too general to be useful in SQO, because:
  - Integrity constraints describe every possible database state
  - The user is only concerned with the current database content.
- Most databases do not provide integrity checking due to:
  - Unavailability of integrity constraints
  - Overhead of checking the integrity
- Thus, the usefulness of integrity constraints in SQO is quite limited.

Limitations Of Conventional Data Models
- Conventional data models lack expressive capability for modeling convenience.
- Many useful semantics are ignored.
- Therefore, limited knowledge is collected.
- FOR EXAMPLE: “Which employee earns more than 70K a year?”
  The integrity constraint “The salary range of employees is between 20K and 90K” is useless in improving this query.

Augmentation Of SQO With Semantic Data Models
If the employees are divided into three categories, MANAGERS, ENGINEERS, and STAFF, and each category is associated with some constraints:
  1. The salary range of MANAGERS is from 35K to 90K.
  2. The salary range of ENGINEERS is from 25K to 60K.
  3. The salary range of STAFF is from 20K to 35K.
A better query can be obtained: “Which managers earn more than 70K a year?”

CLASS = (Type, Class, Name, Displacement, Draft, Enlist)

Rule Statistics

  Rule Set                Rule Size   CM
  Class --> Type          168         78
  Name --> Type           3           9
  Displacement --> Type   2           7
  Draft --> Type          1           4
  Enlist --> Type         36          35

SQP Performance for Selected Database Structure

                    CQP                 SQP
  Type Hierarchy    cpu (ms)   #dio     cpu (ms)   #dio
  order by Class    429        12       426        12
  order by Type     444        10       392        7

Performance Improvement for Selected Attributes

                    CQP                 SQP
  attribute         cpu (ms)   #dio     cpu (ms)   #dio
  Class             505        11       129        3
  Enlist            432        11       130        4

Summary
- Contributions: Providing a model-based methodology for acquiring knowledge from the database by rule induction.
- Applications:
  1. Semantic Query Processing - use semantic knowledge to improve query processing performance.
  2. Deductive Database Systems - use induced rules to provide intensional answers.
  3. Data Inference Applications - use rules to improve data availability by inferring inaccessible data from accessible data.

Capture Database Semantics By Rule Induction
Wesley W. Chu & Rei-Chi Lee

Database Semantics
- Database semantics can be classified into:
  - Database Structure: the description of the interrelationships between database objects.
  - Database Characteristics: defines the characteristics and properties of each object type.
- However, only tools for modeling database structure are available. Very few tools exist for gathering and maintaining the database characteristics.

An Example of Database Characteristics
- The following table illustrates the US Navy battleship characteristics that classify ships into ship types with different displacement ranges.

Knowledge Acquisition
- A major problem in the development of a knowledge-based data processing system.
  - Knowledge Engineers: persons with expertise in the use of expert system tools
  - Domain Experts: persons with expertise in the application domain
- The Process:
  - Studying literature to obtain fundamental background.
  - Interacting with domain experts to get their expertise.
  - Translating the expertise into knowledge representation.
  - Refining the knowledge base through testing and further interaction with domain experts.
- A VERY TIME-CONSUMING TASK!

Knowledge Acquisition from Database
- The database schema is defined according to database semantics, and
- Database instances are constrained by the database characteristics.
- Thus,
  - Database characteristics can be induced as the semantic knowledge from the database.
  - The database schema can be a useful tool to guide the knowledge acquisition.

Knowledge Acquisition By Rule Induction
- Given an object hierarchy and a set of database instances contained in the object hierarchy, a set of classification rules can be induced by inductive learning techniques.
- Given:
  - H - an object type hierarchy: H1, ..., Hn
  - S - object schema
  - I - database instances representing H
- Find:
  - D - a set of descriptions D1, ..., Dn such that for all x in I, if Di(x) is true, then x ISA Hi
- Example: SUBMARINES contains SSN, SSBN
  - D_SSN: 2145 ≤ Displacement ≤ 6955
  - D_SSBN: 7250 ≤ Displacement ≤ 30000

Model-Based Knowledge Acquisition Methodology
- The methodology consists of:
  - a Knowledge-based ER (KER) Model,
  - a knowledge acquisition methodology, and
  - a rule induction algorithm.
- KER is used as a knowledge acquisition tool when no knowledge specification is provided, or the database already exists.

Knowledge-Based ER (KER) Model
- To capture the database characteristics, a Knowledge-based Entity Relationship (KER) model is proposed to extend the basic ER model to provide knowledge specification capability.
- A KER schema is defined by the following constructs:
  1. has-attribute/with (aggregation): links an object with other objects and specifies certain properties of the object.
  2. isa/with (generalization): specifies a type/subtype relationship between object types.
  3. has-instance (classification): links a type to an object that is an instance of that type.
- The knowledge specification is represented by the with-constraint specification.

Components of the KER Diagram

A KER Diagram Example

Classification of Semantic Knowledge
- Domain Knowledge: specifies the static properties of entities and relationships.
  e.g., displacement in the range of (0 - 30,000).
- Intra-Structure Knowledge: specifies the relationships between attributes within an object (an entity or a relationship).
  e.g., if the displacement is less than 7000, then it is a nuclear submarine.
- Inter-Structure Knowledge: specifies the relationship between attributes of several entities in an aggregation relationship.
  e.g., the instructor’s department must be the same as the department of the class offered.

Knowledge Acquisition Methodology
- To provide a systematic way of collecting domain knowledge guided by the database schema.
- It consists of three steps:
  1. Schema Generation - using KER
     a. Identify entities and associated attributes.
     b. Identify type hierarchies by determining the class attributes of each type hierarchy.
     c. Identify aggregation relationships. Define each referential key as a class attribute.
  2. Rule Induction
  3. Knowledge Base Refinement

Rule Induction Algorithm
- Semantic rules for pair-wise attributes (X --> Y) are induced using relational operations.
- Sketch of the algorithm:
  1. Retrieving (X,Y) value pairs. Retrieve the instances of the (X,Y) pair from the database. Let S be the result.
  2. Removing inconsistent (X,Y) value pairs. Retrieve all the (X,Y) pairs in which the same value of X has multiple values of Y. Let T be the result. Let S = S - T.
  3. Constructing rules. For each distinct value of Y in S, say y, determine the value range of X and create a rule of the form: if x1 ≤ X ≤ x2 then Y = y.

Examples Of Induced Rules
- A prototype system was implemented at UCLA using a naval ship database as a test bed.
- Examples of rules induced are:
  Entity: SUBMARINE
  x isa SUBMARINE
  R1: if 0101 ≤ x.Class ≤ 0103 then x isa SSBN
  R2: if 0201 ≤ x.Class ≤ 0215 then x isa SSN
  R3: if Skate ≤ x.ClassName ≤ Thresher then x isa SSN
  R4: if 2145 ≤ x.Displacement ≤ 6955 then x isa SSN
  R5: if 7250 ≤ x.Displacement ≤ 30000 then x isa SSBN

Examples of Induced Rules (Cont’d)
  Relationship: INSTALL
  x isa SUBMARINE and y isa SONAR
  R1: if SSN582 ≤ x.Id ≤ SSN601 then y isa BQS
  R2: if SSN604 ≤ x.Id ≤ SSN671 then y isa BQQ
  R3: if x.Class = 0203 then y isa BQQ
  R4: if 0205 ≤ x.Class ≤ 0207 then y isa BQQ
  R5: if 0208 ≤ x.Class ≤ 0215 then y isa BQS
  R6: if y.Sonar = BQS-04 then x isa SSN

Pruning the Rule Set
- When the number of rules generated becomes too large, the system must reduce the size of the knowledge base.
- Two criteria for rule pruning:
  1. Coverage: keep the rules that are satisfied by more than Nc instances and drop those rules that are satisfied by fewer than Nc instances.
  2. Completeness: keep a rule schema (X --> Y) if the total number of instances satisfied by the rules of the same schema is greater than a coverage threshold Cc.

Induced Rules from Relation “PORT”

Summary
- Contributions: Providing a model-based methodology for acquiring knowledge from the database by rule induction.
- Applications:
  1. Semantic Query Processing - use semantic knowledge to improve query processing performance.
  2. Deductive Database Systems - use induced rules to provide intensional answers.
  3. Data Inference Applications - use rules to improve data availability by inferring inaccessible data from accessible data.

Rule Induction

Generate the Rules
- Select targets
  - Targets are the RHS attributes of rules.
  - Method of selection:
    - Use indices as targets
    - Use selectivity: selectivity = (# of tuples with distinct values) / (total # of tuples)
    - Targets are chosen based on the database schema (e.g., type hierarchy).
- Generate rules for each target
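Target selection by selectivity and the three-step rule induction sketch given earlier can be combined in a short illustration. This is a minimal sketch under stated assumptions, not the UCLA prototype: the tuples in `ROWS` are hypothetical (their displacement bounds echo the submarine examples), the 0.5 selectivity cutoff and the choice to prefer low-selectivity attributes as targets are assumptions, range bounds are treated as inclusive, and overlapping ranges for different target values are not resolved.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, object]

# Hypothetical submarine tuples, invented for this example.
ROWS: List[Row] = [
    {"Id": "S1", "Displacement": 2145, "Type": "SSN"},
    {"Id": "S2", "Displacement": 4300, "Type": "SSN"},
    {"Id": "S3", "Displacement": 6955, "Type": "SSN"},
    {"Id": "S4", "Displacement": 7250, "Type": "SSBN"},
    {"Id": "S5", "Displacement": 16600, "Type": "SSBN"},
    {"Id": "S6", "Displacement": 30000, "Type": "SSBN"},
]


def selectivity(rows: List[Row], attr: str) -> float:
    """Number of distinct values of `attr` divided by the number of tuples."""
    return len({r[attr] for r in rows}) / len(rows)


def select_targets(rows: List[Row], max_selectivity: float = 0.5) -> List[str]:
    """Assumed policy: attributes with few distinct values make good rule targets."""
    return [a for a in rows[0] if selectivity(rows, a) <= max_selectivity]


def induce_rules(rows: List[Row], x: str, y: str) -> Dict[object, Tuple[object, object]]:
    """Induce rules 'if x1 <= X <= x2 then Y = y' for the schema X --> Y."""
    # Step 1: retrieve the distinct (X, Y) value pairs (the set S).
    pairs = {(r[x], r[y]) for r in rows}
    # Step 2: remove pairs whose X value maps to more than one Y value (S = S - T).
    ys_per_x = defaultdict(set)
    for xv, yv in pairs:
        ys_per_x[xv].add(yv)
    consistent = [(xv, yv) for xv, yv in pairs if len(ys_per_x[xv]) == 1]
    # Step 3: for each distinct Y value, record the observed range of X.
    xs_per_y = defaultdict(list)
    for xv, yv in consistent:
        xs_per_y[yv].append(xv)
    return {yv: (min(xs), max(xs)) for yv, xs in xs_per_y.items()}


targets = select_targets(ROWS)
rules = induce_rules(ROWS, "Displacement", "Type")
for y_value, (low, high) in sorted(rules.items()):
    print(f"if {low} <= Displacement <= {high} then Type = {y_value}")
```

On this data only Type falls under the selectivity cutoff, and the induced ranges for Displacement --> Type match the SSN and SSBN displacement rules listed among the example rules. A coverage-based pruning step would then count how many tuples satisfy each rule and drop those below the threshold Nc.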