CS245A – Syllabus (2005)
Knowledge Discovery in Databases
Query Processing With Domain Semantics
Capture Database Semantics by Rule Induction
Intensional Query Answering
Fault Tolerant DDBMS Via Data Inference
Intelligent Dictionary Directory
Uncertainty Management Using Rough Sets
Data Mining Techniques (Ch 4-7, H & K)
Active Databases
Mediators in Information Systems
KQML: A Language and Protocol for Knowledge and Information Exchange
1
CS 245A - Syllabus (cont’d)
CoBase
CoSent
Relaxation for XML Documents
Query Formation From High-level Concepts
Knowledge Acquisition for Query Relaxation
Principles of Case-based Reasoning
A Case-based Reasoning Approach to AQA
CoXML
Data Mining for Sequence Data
Extracting key features from Free Text
Knowledge based Approach for Free Text Retrieval
Content-based Information Retrieval
Digital Library
2
References
Course notes: Intelligent Information Systems,
CS245A, Course Reader Material, 1141 Westwood
Blvd, 310-443-3303
Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann,
August 2000.
Wesley Chu & T.Y. Lin (eds.), Foundations and
Advances in Data Mining, Springer, 2005.
3
CS 245A
Intelligent Information Systems
Wesley W. Chu
Computer Science Department
U. of California
Los Angeles, CA
Knowledge Discovery In Databases
Information Explosion


Information doubles every 20 months
Increase in the number and size of DBs



NASA - Earth observation satellites, 1 picture/sec
Human genome - several billion genetic bases
US census data - lifestyle and subculture of the US
How to analyze these databases (raw data)?
There is a gap between data generation and data understanding
Intelligent data analysis will be useful and valuable

AA uses frequent flyer DB to find its better customers for
specific market promotions
5
Knowledge Discovery In Databases (Cont’d)


Banks use customers’ loan and credit information to
improve loan approval and bankruptcy protection
Packaged-goods manufacturers use scanned supermarket
data to measure the effect of their promotions and to look for
shopping patterns
Techniques




Machine Learning
Statistics
Information Theory
Fuzzy Set
6
Knowledge Discovery
Extraction of implicit, previously unknown and
potentially useful information from Data
Given a set of facts (data) F, a language L, and a measure of
certainty C:
Pattern: a statement S in L that describes the relationships
among a subset Fs of F with certainty C, such that S is
simpler than the enumeration of all facts in Fs
Discovered Knowledge:
The output of a program that monitors the set of facts in a DB
and produces patterns.
7
Patterns
Expressed by high level language
Understood and used directly by people
Usable as input to another program (e.g., an expert
system)
e.g.
If age < 25 and Driver-Education-Course = No
Then At-Fault-Accident = Yes
with likelihood = 0.3
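The likelihood attached to such a rule can be estimated from the data itself. The following is a minimal Python sketch, assuming the likelihood is the fraction of records that satisfy the "If" part and also satisfy the "Then" part; the `Pattern` class, the field names, and the driver records are hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Pattern:
    condition: Callable[[Record], bool]   # the "If" part of the rule
    conclusion: Callable[[Record], bool]  # the "Then" part of the rule

    def likelihood(self, records: List[Record]) -> float:
        """Fraction of records matching the condition that also match the conclusion."""
        matching = [r for r in records if self.condition(r)]
        if not matching:
            return 0.0
        return sum(1 for r in matching if self.conclusion(r)) / len(matching)


# Hypothetical driver records (not from the course material).
drivers = [
    {"age": 22, "driver_ed": "No", "at_fault": "Yes"},
    {"age": 19, "driver_ed": "No", "at_fault": "No"},
    {"age": 24, "driver_ed": "No", "at_fault": "No"},
    {"age": 21, "driver_ed": "No", "at_fault": "No"},
    {"age": 23, "driver_ed": "Yes", "at_fault": "No"},
    {"age": 40, "driver_ed": "No", "at_fault": "Yes"},
]

young_untrained = Pattern(
    condition=lambda r: r["age"] < 25 and r["driver_ed"] == "No",
    conclusion=lambda r: r["at_fault"] == "Yes",
)
print(young_untrained.likelihood(drivers))
```

Because the rule is stored as data plus two predicates, the same object could be shown to a person or handed to another program.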
8
Patterns (Cont’d)
Patterns that are completely unrelated to current
goals are not considered knowledge.
e.g.
A pattern relating at-fault accidents to a driver’s age
is not useful for analyzing auto sales figures.
Pattern + interesting results = knowledge
Age > 16 is not an interesting pattern for drivers, since all
drivers are required to have age > 16.
9
Knowledge Discovery in DB Exhibits Four
Main Characteristics:
High-Level Language

Understood by human users
Accuracy

Expressed by measure of uncertainty
Interesting Results

Patterns are novel and potentially useful
Efficiency

Running times for large-sized DB are predictable and
acceptable
10
Efficiency
The discovery process should be efficiently
implemented on a computer.
An algorithm is considered efficient if its run time
and space are a low-degree polynomial function
of the input length.
e.g.
efficient algorithms exist for restricted concept classes
Conjunctive concepts, e.g., $A \wedge B \wedge C$
Conjunctions of clauses, each a disjunction of no more than
k literals, e.g., $(A \vee B) \wedge (C \vee D) \wedge (E \vee F)$ for k = 2.
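As a rough illustration of the second restricted class, the Python sketch below checks whether an assignment of truth values satisfies a conjunction of clauses in which each clause is a disjunction of at most k literals. The function names and the sample assignment are assumptions made for this example, and negated literals are left out for brevity.

```python
from typing import Dict, Sequence

Clause = Sequence[str]  # a disjunction of positive literals, e.g. ("A", "B")


def is_k_cnf(clauses: Sequence[Clause], k: int) -> bool:
    """True if every clause has no more than k literals."""
    return all(len(clause) <= k for clause in clauses)


def satisfies(clauses: Sequence[Clause], assignment: Dict[str, bool]) -> bool:
    """True if every clause contains at least one literal that is true."""
    return all(any(assignment[lit] for lit in clause) for clause in clauses)


# (A or B) and (C or D) and (E or F), the k = 2 example
concept = [("A", "B"), ("C", "D"), ("E", "F")]
example = {"A": True, "B": False, "C": False, "D": True, "E": False, "F": False}
print(is_k_cnf(concept, 2), satisfies(concept, example))
```

Each check visits every literal at most once, so evaluating a concept against a record is linear in the size of the concept.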
11
Machine Learning
A learning algorithm takes the data set and its
accompanying information as input and returns a
statement (e.g., a concept) representing the results
of the learning as output
Data sets can be a file of records in DB
Problems in learning DB

DB are




Dynamic
Incomplete
Noisy
Much larger than typical machine learning data sets
Much of work in learning DB focuses on
overcoming these complications!
12
Related Approaches
DB Management




Integrity
Querying in DB
Deduction in DB
OODBM
Expert Systems




Expert-generated knowledge is usually of higher quality
than the data in a DB
Covers only the important cases
Experts are available to confirm the validity and
usefulness of discovered patterns
Autonomy of discovery is lacking in expert systems
13
Related Approaches (Cont’d)
Statistics




Ill suited for the nominal and structured data types
Precluding the use of domain knowledge
Difficult to interpret
Require the guidance of the user to specify when and
how to analyze the data
14
Scientific Discovery
Knowledge discovery in DB (DBKD) is less purposeful and less
controlled than scientific discovery (SD)
Scientists can reformulate and rerun their
experiments should they find the initial design was
inadequate
Database managers rarely have the luxury of
redesigning their data fields and recollecting the data
15
A Framework for Knowledge Discovery
Input




Raw data from DB
Information from data dictionary
Additional domain knowledge
User defined biases that provide high level focus
Output


New Domain Knowledge
Feedback of the discovered knowledge to generate new knowledge
DB issues





Dynamic data (time sensitive; e.g., weight, height, pulse rate)
Irrelevant fields (zip codes, pulse rate, sex)
Missing data
Noise and uncertainty
Missing field
16
Translation Between Database Management and
Machine Learning Terms
17
Conflicting Viewpoints Between Database
Management and Machine Learning
18
A Framework for Knowledge Discovery in
Databases
19
Database and Knowledge
Domain knowledge assists discovery by narrowing the
search scope


Data Dictionary
Inter-field Knowledge


e.g., weight and height
Inter-instance knowledge

e.g., age + height = seniority
age + weight = seniority
Caveat - domain knowledge can rule out valuable discoveries
“Trucks don’t drive over water”
eliminates the potentially interesting solution,
“Trucks drive over frozen lakes in winter.”
20
Discovered Knowledge
Form

Inter-field patterns - related values of field in the same
record



e.g. (procedure = surgery implies days in hospital > 5)
Inter-record patterns - aggregated over group of records
or identify useful clusters (e.g., profit making companies)
Rules: X > Y1, A => B
Rules can form causal chains or networks
21
Discovered Knowledge (cont’d)
Representation

Discovery must be represented in a form appropriate for
the intended user.



Human: natural language, formal logic, visual depictions of
information
Computer program (expert system shells): Programming
language, declarative formalisms
Discovery System: Feedback as domain knowledge

Need common representation
Uncertainty

Patterns are often probabilistic rather than deterministic



missing and erroneous data
inherent indeterminism of the underlying real world causes
(50% chance of rain tomorrow)
sampling
22
Discovered Knowledge (cont’d)
Measures






Proof of success
Standard deviation
Belief measures
Linguistic uncertainty - fuzzy sets
Visual presentations by density, size, and shading
Sampling techniques for large DBs: accuracy of results
depends on sample size
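One common way to make the dependence on sample size concrete is the standard error of an estimated proportion, sqrt(p(1 - p) / n). The sketch below assumes simple random sampling and a pattern whose certainty is a proportion; it is an illustration under those assumptions, not a measure prescribed by the course notes, and the value 0.3 is only an example.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from a simple random sample of size n."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# The uncertainty shrinks with the square root of the sample size.
for n in (100, 400, 1600):
    print(n, round(standard_error(0.3, n), 4))
```

Quadrupling the sample halves the standard error, which is why a sample far smaller than the full DB can still give usable estimates.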
23
Discovery Algorithms
Machine Learning:


Unsupervised Learning
Supervised Learning
Unsupervised Learning:


Pattern identification: identifying interesting patterns and
describing them in a concise and meaningful manner
Examples


customer with income > $25,000/yr
questionable insurance claims
24
Discovery Algorithms (Cont’d)
Methods:
Traditional clustering:
Minimize similarity between classes
Maximize similarity within classes
Drawbacks:
Based on Euclidean distance, works well only on numerical data
Inability to use background information such as likely cluster shape
Conceptual clustering:
Based on attribute similarity and conceptual cohesiveness (defined by
background information)
Interactive clustering:
Combines the human user’s knowledge with the computational power of the
computer
25
Discovery Algorithms (Cont’d)
Supervised Learning:

Description process

Summarizes relevant qualities of the identified class
In discovery systems, user supervision can occur in
either the identification or description process.
26
Concept Description
(Supervised Concept Learning)
Discovery in large, complex database requires both
empirical methods to detect the statistical
regularity of patterns and knowledge-based
approaches to incorporate available domain
knowledge.
Discovery tasks
Summarization - Summarize class records by describing
their common or characteristic features
Discrimination - Describe qualities sufficient to
discriminate records of one class from another
Comparison - Describe the class in a way that facilitates
comparison and analysis with other records
27
Future Directions
Domain Knowledge - how to effectively use
domain knowledge to discover knowledge
Efficient Algorithms






Restrict rule type
Heuristic and approximate algorithms
Sampling
Parallel computing
OODBM
Deductive DB
Incremental methods


Efficiently keep pace with changes in data
Incremental discovery systems reuse their discoveries
and make more complex discoveries
28
Future Directions (cont’d)
Interactive systems



Knowledge analyst included in the discovery loop
Use human judgement and machine computation power
Need information to be presented in a human-oriented
form (text, sound, visuals)
Integration
29
Applications of Discovery in DB
Medicine
Finance
Agriculture
Social
Marketing & Sales
Insurance
Engineering
Physics & Chemistry
Military
Law Enforcement
Space Science
Publishing
30
Applications of Discovery in DB (Cont’d)
Discovery of Quantitative Laws
Data Driven Discovery of Quantitative Laws
Using Knowledge in Discovery
Data Summarization
Domain Specific Discovery Methods
Integrated & Multi-Paradigm Systems
Methodology and Application Issues
31
Query Processing With
Domain Semantics
Wesley W. Chu
Query Optimization Problem
To find a sequence of operations that has the
minimal processing cost.
33
Conventional Query Optimization (CQO)
For a given query:
Generate a set of queries that are equivalent to the
given query
Determine the processing cost of each such query
Select the lowest-cost query processing strategy
among these equivalent queries
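The three steps above can be sketched as a small enumeration over join orders. This is a toy illustration only: the relation names, their sizes, the join selectivity, and the cost model (total size of intermediate results) are all assumptions made for this example, not the cost model of any particular DBMS.

```python
from itertools import permutations

# Hypothetical relation sizes (number of tuples).
SIZES = {"SHIP": 1000, "PORT": 50, "VISIT": 20000}
# Assumed selectivity: each join keeps 1 in 1000 tuples of the cross product.
JOIN_DIVISOR = 1000


def plan_cost(order):
    """Cost of a left-deep join order = sum of intermediate result sizes."""
    rows = SIZES[order[0]]
    cost = 0
    for name in order[1:]:
        rows = rows * SIZES[name] // JOIN_DIVISOR
        cost += rows
    return cost


def cheapest_plan(relations):
    # Step 1: generate the equivalent queries (here, all join orders).
    # Step 2: cost each one.  Step 3: select the cheapest.
    return min(permutations(relations), key=plan_cost)


best = cheapest_plan(["SHIP", "PORT", "VISIT"])
print(best, plan_cost(best))
```

Exhaustive enumeration grows factorially with the number of relations, so practical optimizers prune this search; the sketch only shows the generate, cost, and select structure.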
34
Limitations of CQO
There are certain queries that cannot be optimized by
Conventional Query Optimization.
For example, given the query:
“Which ships have deadweight greater than 200
thousand tons?”
A search of the entire database may be required to
answer this query.
35
The Use of Knowledge
ASSUMING AN EXPERT KNOWS THAT:
1. The SHIP relation is indexed on ShipType. There are about 10
different ship types, and
2. the ship must be a “SuperTanker” (one of the ShipTypes) if the
deadweight is greater than 150K tons.
AUGMENTED QUERY:
“Which SuperTankers have deadweight greater than 200K tons?”
RESULT:
About 90% of the search time is saved.
The technique of improving queries with semantic
knowledge is called Semantic Query Optimization.
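The 90% figure follows from the index: with about 10 ship types of similar size, restricting the search to one type examines roughly one tenth of the tuples. A minimal Python sketch of this, assuming a hypothetical in-memory SHIP relation with 10 equally sized types (the data, field names, and function names are invented for illustration):

```python
from collections import defaultdict

# Hypothetical SHIP relation: 10 ship types with 100 ships each.
# By construction, only SuperTankers exceed 150K tons (the expert's rule).
SHIP_TYPES = ["SuperTanker"] + [f"Type{i}" for i in range(1, 10)]
ships = []
for ship_type in SHIP_TYPES:
    base = 150_001 if ship_type == "SuperTanker" else 1_000
    for n in range(100):
        ships.append({"type": ship_type, "deadweight": base + 1_000 * n})

# Index on ShipType (assumption 1 above).
by_type = defaultdict(list)
for ship in ships:
    by_type[ship["type"]].append(ship)


def original_query(min_dw):
    """Scan every ship; return (answers, number of tuples examined)."""
    return [s for s in ships if s["deadweight"] > min_dw], len(ships)


def augmented_query(min_dw):
    """Add 'ShipType = SuperTanker' when the rule guarantees the same answer."""
    if min_dw < 150_000:  # the rule does not apply; fall back to the full scan
        return original_query(min_dw)
    candidates = by_type["SuperTanker"]
    return [s for s in candidates if s["deadweight"] > min_dw], len(candidates)


answers, examined = original_query(200_000)
answers_aug, examined_aug = augmented_query(200_000)
print(answers == answers_aug, examined, examined_aug)
```

Both queries return the same ships, but the augmented one examines 100 tuples instead of 1000. The guard on `min_dw` matters: the added condition is only safe when the query's threshold is at least the 150K bound in the rule.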
36
Semantic Query Optimization (SQO)
Uses domain knowledge to transform the original
query into a more efficient query that still yields the
same answer.
Assuming a set of integrity constraints is available as
the domain knowledge,
Represent each integrity constraint as $P_i \rightarrow C_i$, where $1 \le i \le n$.
Translate (augment) the original query $Q$ into $Q'$ subject to $C_1, C_2, \dots, C_n$,
such that $Q'$ yields a lower processing cost than $Q$.
Query Optimization Problem: Find $C_1, C_2, \dots, C_m$ that yield the
minimal query processing cost; that is,
$$C(Q') = \min_{C_i} C(Q \wedge C_1 \wedge \dots \wedge C_m)$$
37
Semantic Equivalence
Domain knowledge of the database application may be used to transform
the original query into semantically equivalent queries.
Semantic Equivalence:
Two queries are considered to be semantically equivalent if they result in
the same answer in any state of the database that conforms to the
Integrity Constraints.
Integrity Constraints:
A set of if-then rules that force the database to be an accurate instance
of the real-world database application. Examples of constraints include:

state snapshot constraints:
e.g., if deadweight > 150K then ShipType = “SuperTanker.”

state transition constraints:
e.g., salary can only be increased,
i.e., salary (new) > salary (old)
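The difference between the two kinds of constraint is that a snapshot constraint inspects one database state, while a transition constraint compares the state before and after an update. A minimal Python sketch, with hypothetical function and field names:

```python
def satisfies_snapshot(ship):
    """State snapshot constraint: if deadweight > 150K then ShipType = SuperTanker."""
    return ship["deadweight"] <= 150_000 or ship["ship_type"] == "SuperTanker"


def satisfies_transition(old_salary, new_salary):
    """State transition constraint: salary (new) > salary (old)."""
    return new_salary > old_salary


print(satisfies_snapshot({"deadweight": 210_000, "ship_type": "SuperTanker"}))
print(satisfies_transition(50_000, 48_000))
```

The implication "if P then C" is checked as "not P or C", which is why a ship at or below 150K tons satisfies the snapshot constraint whatever its type.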
38
Limitations of Current Approach
Current approach of SQO using:
Integrity constraints as knowledge
Conventional data models
39
Limitations of Integrity Constraints
Integrity constraints are often too general to be useful
in SQO, because:


Integrity constraints describe every possible database state
User is only concerned with the current database content.
Most databases do not provide integrity checking due
to:


Unavailability of integrity constraints
Overhead of checking the integrity
Thus, the usefulness of integrity constraints in SQO is
quite limited.
40
Limitations Of Conventional Data Models
Conventional data models lack expressive capability for
modeling convenience. Many useful semantics are
ignored. Therefore, only limited knowledge is collected.
FOR EXAMPLE:
“Which employees earn more than 70K a year?”
The integrity constraint:
“The salary range of employees is between 20K and 90K.”
is useless in improving this query.
41
Augmentation Of SQO With Semantic Data Models
If the employees are divided into three categories:
MANAGERS, ENGINEERS, STAFF
and each category is associated with some constraints:
1. The salary range of MANAGERS is from 35K to 90K.
2. The salary range of ENGINEERS is from 25K to 60K.
3. The salary range of STAFF is from 20K to 35K.
A better query can be obtained:
“Which managers earn more than 70K a year?”
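The narrowing step can be sketched as a range check: a category can be dropped from the query when its salary range cannot contain any value above the threshold. The salary ranges below come from the three constraints; the function name and data layout are assumptions made for this example.

```python
# Salary ranges taken from the three category constraints.
SALARY_RANGES = {
    "MANAGERS": (35_000, 90_000),
    "ENGINEERS": (25_000, 60_000),
    "STAFF": (20_000, 35_000),
}


def categories_for_salary_above(threshold):
    """Categories whose salary range can contain a value greater than threshold."""
    return [name for name, (low, high) in SALARY_RANGES.items() if high > threshold]


print(categories_for_salary_above(70_000))
```

For a 70K threshold only MANAGERS remains, which gives the rewritten query; for a 30K threshold all three categories remain and the rewrite brings no benefit.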
42
43
CLASS = (Type, Class, Name, Displacement, Draft, Enlist)
44
Rule Statistics
Rule Set               Rule Size   CM
Class --> Type         168         78
Name --> Type          3           9
Displacement --> Type  2           7
Draft --> Type         1           4
Enlist --> Type        36          35
45
SQP Performance for Selected Database Structure
Type Hierarchy    CQP cpu (ms)   CQP #dio   SQP cpu (ms)   SQP #dio
order by Class    429            12         426            12
order by Type     444            10         392            7
46
Performance Improvement for Selected Attributes
attribute   CQP cpu (ms)   CQP #dio   SQP cpu (ms)   SQP #dio
Class       505            11         129            3
Enlist      432            11         130            4
47
48
Summary
Contributions:
Providing a model-based methodology for acquiring knowledge
from the database by rule induction.
Applications:
1. Semantic Query Processing – use semantic knowledge to
improve query processing performance.
2. Deductive Database Systems – use induced rules to provide
intensional answers.
3. Data Inference Applications – use rules to improve data
availability by inferring inaccessible data from accessible data.
49
Capture Database Semantics
By Rule Induction
Wesley W. Chu
&
Rei-Chi Lee
Database Semantics
Database semantics can be classified into:
Database Structure - the description of the
interrelationships between database objects.
Database Characteristics - defines the characteristics
and properties of each object type.
However, only tools for modeling database structure
are available. Very few tools exist for gathering and
maintaining the database characteristics.
51
An Example of Database Characteristics
The following table illustrates the US Navy battleship
characteristics that classify ships into ship types with
different displacement ranges.
52
Knowledge Acquisition
A major problem in the development of a knowledge-based data
processing system.


Knowledge Engineers - persons skilled in the use of expert system tools
Domain Experts - persons with the expertise of the application
domain
The Process:




Studying literature to obtain fundamental background.
Interacting with domain experts to get their expertise.
Translating the expertise into knowledge representation.
Refining knowledge base through testing and further interacting
with domain experts.
A VERY TIME-CONSUMING TASK!
53
Knowledge Acquisition from Database
Database schema is defined according to database
semantics, and
Database instances are constrained by the database
characteristics.
Thus,
 Database characteristics can be induced as the
semantic knowledge from the database.
 Database schema can be a useful tool to guide the
knowledge acquisition.
54
Knowledge Acquisition By Rule Induction
Given an object hierarchy and a set of database instances contained in the
object hierarchy, a set of classification rules can be induced by inductive
learning techniques.
Given:
H - an object type hierarchy: H1, ..., Hn
S - object schema
I - database instances representing H
Find:
D - a set of descriptions D1, ..., Dn such that
for all x in I,
if Di(x) is true, then x ISA Hi
Example:
SUBMARINES contains SSN, SSBN
D_SSN: 2145 ≤ Displacement ≤ 6955
D_SSBN: 7250 ≤ Displacement ≤ 30000
55
Model-Based Knowledge Acquisition Methodology
The methodology consists of:
a Knowledge-based ER (KER) Model,
a knowledge acquisition methodology, and
a rule induction algorithm.
KER is used as a knowledge acquisition tool when
 no knowledge specification is provided, or
 the database already exists.
56
Knowledge-Based ER (KER) Model
To capture the database characteristics, a Knowledge-based Entity
Relationship (KER) is proposed to extend the basic ER model to provide
knowledge specification capability.
A KER schema is defined by the following constructs:
1. has-attribute/with (aggregation)
This construct links an object with other objects and specifies certain
properties of the object.
2. isa/with (generalization)
This construct specifies a type/subtype relationship between object types.
3. has-instance (classification)
This construct links a type to an object that is an instance of that type.
The knowledge specification is represented by the with-constraint specification.
57
Components of the KER Diagram
58
A KER Diagram Example
59
Classification of Semantic Knowledge
Domain Knowledge:

Specifying the static properties of entities and relationships.

e.g., displacement in the range of (0 - 30,000).
Intra-Structure Knowledge:

Specifying the relationships between attributes within an object (an
entity or a relationship).

e.g., if the displacement is less than 7000, then it is a nuclear submarine.
Inter-Structure Knowledge:

Specifying the relationship that is related to attributes of several
entities of the aggregation relationship.

e.g., the instructor’s department must be the same as the department of
the class offered.
60
Knowledge Acquisition Methodology
To provide a systematic way of collecting domain
knowledge guided by the database schema. It
consists of three steps:
1. Schema Generation - using KER
a. Identify entities and associated attributes.
b. Identify type hierarchies by determining the class attributes
of each type hierarchy.
c. Identify aggregation relationships. Define each referential
key as a class attribute.
2. Rule Induction
3. Knowledge Base Refinement
61
Rule Induction Algorithm
Semantic rules for pair-wise attributes (X --> Y) are induced using
relational operations.
Sketch of the Algorithm:
1. Retrieving (X,Y) value pairs.
Retrieve the instances of the (X,Y) pair from the database.
Let S be the result.
2. Removing inconsistent (X,Y) value pairs.
Retrieve all the (X,Y) pairs in which the same value of X has multiple values of
Y. Let T be the result.
Let S = S - T.
3. Constructing rules.
For each distinct value of Y in S, say y, determine the value range [x1, x2] of X and
create a rule of the form
if x1 ≤ X ≤ x2 then Y = y.
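The three steps can be sketched in Python over an in-memory list of (X, Y) pairs rather than relational operations. This is a simplified illustration, not the prototype's implementation: the function name and sample data are hypothetical, the range bounds are treated as inclusive, and the sketch assumes the X ranges of different Y values do not interleave.

```python
from collections import defaultdict


def induce_rules(pairs):
    """Induce rules 'if x1 <= X <= x2 then Y = y' from (x, y) value pairs.

    Returns a dict mapping each y to its (x1, x2) range.
    """
    # Step 1: S is the set of distinct (X, Y) value pairs.
    s = set(pairs)

    # Step 2: T holds the pairs whose X value appears with more than one Y value.
    ys_for_x = defaultdict(set)
    for x, y in s:
        ys_for_x[x].add(y)
    t = {(x, y) for x, y in s if len(ys_for_x[x]) > 1}
    s -= t

    # Step 3: one rule per distinct Y, covering the observed range of X.
    xs_for_y = defaultdict(list)
    for x, y in s:
        xs_for_y[y].append(x)
    return {y: (min(xs), max(xs)) for y, xs in xs_for_y.items()}


# Hypothetical (Displacement, Type) pairs; 5000 is inconsistent and gets removed.
pairs = [
    (2145, "SSN"), (4000, "SSN"), (6955, "SSN"),
    (7250, "SSBN"), (18700, "SSBN"), (30000, "SSBN"),
    (5000, "SSN"), (5000, "SSBN"),
]
rules = induce_rules(pairs)
for y, (x1, x2) in sorted(rules.items()):
    print(f"if {x1} <= Displacement <= {x2} then Type = {y}")
```

If the ranges of two Y values did interleave, a single min/max range per Y would produce overlapping rules, so a fuller version would split each Y into runs of consecutive X values.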
62
Examples Of Induced Rules
A prototype system was implemented at UCLA using a naval
ship database as a test bed. Examples of rules induced are:
Entity: SUBMARINE
x isa SUBMARINE
R1: if 0101 ≤ x.Class ≤ 0103 then x isa SSBN
R2: if 0201 ≤ x.Class ≤ 0215 then x isa SSN
R3: if Skate ≤ x.ClassName ≤ Thresher then x isa SSN
R4: if 2145 ≤ x.Displacement ≤ 6955 then x isa SSN
R5: if 7250 ≤ x.Displacement ≤ 30000 then x isa SSBN
63
Examples of Induced Rules (Cont’d)
Relationship: INSTALL
x isa SUBMARINE and y isa SONAR
R1: if SSN582 ≤ x.Id ≤ SSN601 then y isa BQS
R2: if SSN604 ≤ x.Id ≤ SSN671 then y isa BQQ
R3: if x.Class = 0203 then y isa BQQ
R4: if 0205 ≤ x.Class ≤ 0207 then y isa BQQ
R5: if 0208 ≤ x.Class ≤ 0215 then y isa BQS
R6: if y.Sonar = BQS-04 then x isa SSN
64
Pruning the Rule Set
When the number of rules generated becomes too
large, the system must reduce the size of the
knowledge base.
Two Criteria for Rule Pruning:
1. Coverage
Keep the rules that are satisfied by more than Nc
instances and drop those rules that are satisfied by fewer
than Nc instances.
2. Completeness
Keep a rule scheme (X --> Y) only if the total number
of instances satisfied by the rules of that scheme is
greater than a coverage threshold Cc.
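The two criteria can be sketched as two filters. The source does not say whether the criteria are applied in sequence or whether Cc is an absolute count or a fraction; the sketch below assumes sequential application and absolute counts, drops rules with exactly Nc instances, and uses hypothetical rule records with a `scheme` and a `support` field.

```python
from collections import defaultdict


def prune(rules, n_c, c_c):
    """Apply the coverage criterion, then the completeness criterion."""
    # Coverage: keep only rules satisfied by more than n_c instances.
    kept = [r for r in rules if r["support"] > n_c]

    # Completeness: keep a scheme only if its surviving rules together
    # are satisfied by more than c_c instances.
    total = defaultdict(int)
    for r in kept:
        total[r["scheme"]] += r["support"]
    return [r for r in kept if total[r["scheme"]] > c_c]


# Hypothetical rule set: scheme (X, Y) and number of satisfying instances.
rules = [
    {"scheme": ("Class", "Type"), "support": 40},
    {"scheme": ("Class", "Type"), "support": 3},
    {"scheme": ("Draft", "Type"), "support": 6},
    {"scheme": ("Draft", "Type"), "support": 2},
]
print(prune(rules, n_c=5, c_c=10))
```

With Nc = 5 and Cc = 10, coverage removes the two low-support rules, and completeness then removes the Draft scheme because its one surviving rule covers only 6 instances.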
65
Induced Rules from Relation “PORT”
66
Summary
Contributions:
Providing a model-based methodology for acquiring
knowledge from the database by rule induction.
Applications:
1. Semantic query processing – use semantic knowledge to
improve query processing performance.
2. Deductive Database Systems – use induced rules to provide
intensional answers.
3. Data Inference Applications – use rules to improve data
availability by inferring inaccessible data from accessible
data.
67
Rule Induction
68
69
Generate the Rules
Select targets
Targets are the RHS attributes of rules.
Method of selection:
Use indices as targets
Use selectivity
selectivity = (# of tuples with distinct values) / (total # of tuples)
Targets are chosen based on the database schema
(e.g., type hierarchy).
Generate rules for each target
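The selectivity measure can be computed per attribute as sketched below. The sketch reads "# of tuples with distinct value" as the number of distinct values of the attribute, which is an interpretation; the function name and the sample rows are hypothetical. How the resulting value is thresholded to pick targets is not stated in the source and is left open here.

```python
def selectivity(rows, attribute):
    """Number of distinct values of the attribute divided by the number of tuples."""
    if not rows:
        raise ValueError("selectivity is undefined for an empty relation")
    values = [row[attribute] for row in rows]
    return len(set(values)) / len(values)


# Hypothetical SUBMARINE tuples.
submarines = [
    {"id": "SSN582", "type": "SSN"},
    {"id": "SSN601", "type": "SSN"},
    {"id": "SSBN130", "type": "SSBN"},
    {"id": "SSN604", "type": "SSN"},
]
print(selectivity(submarines, "type"), selectivity(submarines, "id"))
```

A key-like attribute such as `id` has selectivity 1.0, while a classifying attribute such as `type` has a much lower value.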
70