Download l1 - Webcourse

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Advanced Topics in Computer Science
Course 236605
Lecture 1: Introduction
Faculty of Computer Science
Technion – Israel Institute of Technology
Spring 2016
Assumed Background
• Databases
– Relational model, database querying, SQL,
relational algebra, schema, integrity
constraints (e.g., functional dependencies)
• Algorithms and complexity
– Asymptotic running time, polynomial time, NP,
completeness, reduction
• Basic probability theory
– Probability spaces, random variables,
conditional probability
2
Requirements
1. Home assignments
– 3 x dry (15% each), 1 x wet (25%)
2. Paper presentation
3. Mandatory attendance
– Contact me in advance if you are having a
problem attending a specific lecture
3
Lecture 1: Introduction
UNCERTAINTY IN DATABASES
4
Some Modern Database Content
Knowledge Bases
Business
DBs
Sensing Data
5
Integration
Signal / Image Processing
Text Analytics / NLP
Web Pages
Social Media
Financial Reports
OCR / Image
Gov Reports
Med Reports
Knowledge Bases
Attribute
Concept
Value
Instance
Instance
Concept
country
0.4
Probability
Relationship
Relationship
Israel
location
0.35
Person
0.2
• Microsoft Probase
• MPI YAGO
• Google Knowledge Graph
• CMU NELL
• Stanford DeepDive
• Freebase
• ...
6
Relating to Big Data
• Missing information
• Conflicting Information
• Probabilistic information
7
Popular Topics in DB Research
• VLDB 2014 Ten Year Best Paper
– Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on
Probabilistic Databases
• PODS 2014 Keynote
– Leonid Libkin: Incomplete data: what went wrong, & how to
fix it
• SIGMOD/PODS 2014 Workshop on Big
Uncertain Data
– Kimelfeld (DB) and Kersting (AI)
• ICDT 2013 Test-of-Time Award
– Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian
Popa: Data Exchange: Semantics and Query Answering
8
In the Course
• Principled, application-independent paradigms
to managing uncertainty in data
– Incomplete / inconsistent / probabilistic databases
• Two key aspects for every paradigm:
– Representation & Semantics
• How do we represent what we know? what is missing? what
is conflicting? what is our confidence?
– Query evaluation
• What is the meaning of query answering in the presence of
uncertainty? What is the computational complexity?
9
Course Logo
Probabilistic
Database
Incomplete
Inconsistent
10
Lecture 1: Introduction
DATABASE RESEARCH
11
Publication Venues for DB Research
• Conferences:
– General:
•
•
•
•
SIGMOD: ACM Special Interest Group on Management of Data (since 1975)
VLDB: Intl. Conf. on Very Large Databases (since 1975)
ICDE: IEEE Intl. Conf. on Data Engineering (since 1984)
EDBT: Intl. Conference on Extending Database Technology (since 1988)
– Theory oriented:
• PODS: ACM Symp. on Principles of Database Systems (since 1982)
• ICDT: Intl. Conference on Database Theory (since 1986)
• Journals:
– TODS: ACM Transactions on Database Systems (since 1976)
– VLDBJ: The VLDB Journal (since 1992)
– SIGMOD REC: ACM SIGMOD Record (since 1969)
12
Selected Database Research Topics*
System Design
Database Security
• Distributed, storage,
in-memory, recovery
Views
• View-based access
• Incremental maintain
Query Languages
• Codasyl, SQL,
recursion, nesting
System Optimization
• Caching & replication
• Indexing
• Clustering
Schema Design
• ER models, normal
forms, dependency
Benchmarking
Transaction & concur.
DB Performance
• Query process & opt.
• Evaluation methods
Data Models
• OO, geo, temporal
Logic
• Deductive (Datalog)
• Integrity/constraints
Incompleteness (null)
1980
Heterogeneity
• Data Integration
• Interoperability
Analytics (OLAP)
Data Models
• Multimedia, DNA
• Text, XML
Mining & Discovery
• Discovering
association rules
1990
Further XML
• Query eval / optimize
• Compression
Database Privacy
Data Models
• Streaming data
• Graph data
DB Uncertainty
• Inconsistency &
cleaning
• Probabilistic DB
DB & IR
• DB for search
• Search for DB
Entity Resolution
Information Extraction
from Web/text
Crowdsourcing
• Utilizing crowd input
in databases
Social Networks &
Social Media
Data Models
• Semantic Web (RDF,
ontologies)
• NoSQL (doc, graph,
key-value)
DB & ML & AI
Schema Matching &
Discovery
Provenance/ lineage
Data Exchange
Ranking &
personalization
Cloud Databases
2000
* Based on SIGMOD session topics from DBLP
• Model / compute
Column Stores
13
Turing Awards for DB Technology
1973
1981
2014
1998
14
Lecture 1: Introduction
INCOMPLETE DATABASES
15
Missing Information
• Problem: pieces of data missing, but we need to
keep whatever partial knowledge we have
Registrations
Courses
student
course
course
lecturer
Ahuva
PL
PL
Eran
• A source tells us that Alon is a student of Keren
– How can we represent it in our DB?
Registrations
⊥=NULL
Courses
student
course
course
lecturer
Ahuva
PL
PL
Eran
Alon
⊥
⊥
Keren
16
SQL’s NULL
• NULL is SQL’s special “missing value”
• Same queries as complete tables, but SQL
assigns a special behavior to logic over NULL
– “Three-valued logic”: true, false, unknown
• Alas, there are some issues...
17
Try It Yourself (psql)
CREATE TABLE Registrations(
student varchar(40),
course varchar(40));
CREATE TABLE Courses(
course varchar(40),
lecturer varchar(40));
INSERT INTO Registrations VALUES
('Ahuva','PL'), ('Alon',NULL);
INSERT INTO Courses VALUES
('PL','Eran'), (NULL,'Keren');
Registrations
Courses
student
course
course
lecturer
Ahuva
PL
PL
Eran
Alon
⊥
⊥
Keren
SELECT student, lecturer
FROM Registrations R, Courses C
WHERE R.course = C.course;
student
lecturer
Ahuva
Eran
Of course, we've lost our initial association (join)...
18
Try More Yourself (psql)
Courses
Registrations
student
course
course
lecturer
Ahuva
PL
PL
Eran
Alon
⊥
⊥
Keren
SELECT student
FROM Registrations;
student
SELECT student
FROM Registrations
WHERE course='PL';
Ahuva
student
Alon
Ahuva
Inconsistent logic...
real problem!
SELECT student
FROM Registrations
WHERE course!='PL';
student
SELECT student
FROM Registrations
WHERE course='PL' OR course!='PL';
student
Ahuva
Alon??
19
Labeled Nulls in “Naive” Tables
• Just like nulls, but each null has a name
– We do not know what the value is, but we do know
that two nulls with the same name are the same
Registrations
Courses
student
course
course
lecturer
Ahuva
PL
PL
Eran
Alon
⊥1
⊥1
Keren
Ahuva
⊥2
⊥2
Shaul
⨝
=
student
course
lecture
r
Ahuva
PL
Eran
Alon
⊥1
Keren
Ahuva
⊥2
Shaul
?
?
?
?
?
?
20
Possible Worlds
Registrations
Closed-World
Assumption:
Registrations
student course
Ahuva
PL
Alon
⊥1
Ahuva
⊥2
student course
Registrations
Open-World
Assumption:
student
course
Ahuva
PL
Ahuva
PL
Alon
PL
Alon
PL
Ahuva
DB
Ahuva
DB
Anna
AI
Registrations
student course
Ahuva
PL
Alon
DB
Ahuva
DB
Registrations
student course
Ahuva
PL
Alon
Ahuva
Registrations
course
⊥1
Ahuva
PL
⊥2
Alon
DB
Ahuva
DB
Ahuva
AI
Avi
ML
...
...
student
21
Semantics of Query Answering
Incomplete DB
Possible Worlds
22
Semantics of Query Answering
Incomplete DB
Possible Worlds
23
Semantics of Query Answering
Incomplete DB
Certain answers
(“weak)
Represent as an
incomplete relation
(“strong”)
Possible Worlds
24
FQL Table Schema
Application: Data Exchange
status
link
PK
uid
status_id
time
source
message
group
PK
nid
pic_small
pic_big
pic
description
group_type
group_subtype
recent_news
creator
update_time
office
website
venue
privacy
uid
name
value
expires
path
note
PK
note_id
uid
created_time
updated_time
content
title
comment
xid
post_id
fromid
time
text
id
username
reply_xid
Messages
Users
Associations
Global Schema
friend_request
uid_from
uid_to
friend
source_id
target_id
target_type
is_following
updated_time
is_deleted
user
PK
Mappingname
link_id
owner
created_time
title
summary
url
image_urls
cookies
gid
connection
page
uid
first_name
last_name
name
pic_small
pic_big
pic_square
pic
affiliations
profile_update_time
timezone
religion
birthday
birthday_date
sex
hometown_location
meeting_sex
meeting_for
relationship_status
significant_other_id
political
current_location
activities
interests
is_app_user
music
tv
movies
books
quotes
about_me
hs_info
education_history
work_history
notes_count
wall_count
status
has_added_app
online_presence
locale
proxied_email
profile_url
email_hashes
pic_small_with_logo
PK
standard_user_info
uid
first_name
last_name
name
locale
affiliations
profile_url
timezone
birthday
sex
proxied_email
profile
id
name
url
pic
pic_square
pic_small
pic_big
type
page_admin
uid
page_id
type
25
page_id
name
pic_small
pic_big
pic_square
pic
pic_large
page_url
type
website
has_added
founded
company_o
mission
products
location
parking
public_tran
hours
attire
payment_o
culinary_te
general_ma
price_range
restaurant_
restaurant_
release_da
genre
starring
screenplay
directed_by
produced_b
studio
awards
plot_outline
network
season
schedule
written_by
band_mem
hometown
current_loc
record_labe
booking_ag
The Clio Project
IBM + U. Toronto – tool for data exchange
Commercialized in IBM DB2
26
Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema
T, and a set Σ of logical assertions stating how S relates to T
S
T
StudLecturer
student
lecturer
Courses
Registrations
student
course
course
lecturer
Σ
StudLecturer(x,y)  ∃z Registrations(x,z) ⋀ Courses(z,y)
StudLecturer
student
course
Ahuva
Shaul
Alon
Keren
source instance
?? We don’t have z! So 2 options:
1) Abort
2) Do our best to max usability
Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema
T, and a set Σ of logical assertions stating how S relates to T
S
T
StudLecturer
student
lecturer
Courses
Registrations
student
course
course
lecturer
Σ
StudLecturer(x,y)  ∃z Registrations(x,z) ⋀ Courses(z,y)
StudLecturer
Courses
Registrations
student
course
student
course
course
lecturer
Ahuva
Shaul
Ahuva
⊥1
⊥1
Shaul
Alon
Keren
Alon
⊥2
⊥2
Keren
source instance
solution
28
Problems Studied in Data Exchange
• Materialization
– Many solutions exist; what makes one solution
“better” than another? If there a “best” solution? How
to find it?
• Target query answering
– Given a source instance and a query over the target,
evaluate the query (semantics / complexity)
• Manipulating schema mappings
– Composition and inversion of mappings
29
Lecture 1: Introduction
INCONSISTENT DATABASES
30
Inconsistency
• An inconsistent database contains inconsistent
(or impossible) information
– Two students have the same ID
– A student gets credit for the same course twice
– A student takes a course that is not listed in the
course database
– A student has a grade for this course but a grade is
missing for an assignment
• Modeling: (D,Σ) where D is a database and Σ is
a set of required logical integrity constraints over
DBs; alas, D violates Σ
31
Query Answering
Grades
Courses
student
course
grade
course
lecturer
Ahuva
PL
90
PL
Eran
Alon
PL
86
DC
Keren
Alon
PL
81
Database D
Functional Dependency:
student, course  grade
Integrity Constraints Σ
SELECT student
FROM Grades G, Courses C
WHERE
G.grade >= 85 AND
G.course = C.course AND
C.lecturer=‘Eran’
Ahuva
Alon
32
Query Answering
Grades
Courses
student
course
grade
course
lecturer
Ahuva
PL
90
PL
Eran
Alon
PL
86
DC
Keren
Alon
PL
81
Database D
Functional Dependency:
student, course  grade
Integrity Constraints Σ
SELECT student
FROM Grades G, Courses C
WHERE
G.grade >= 87 AND
G.course = C.course AND
C.lecturer=‘Eran’
Ahuva
Alon
33
Query Answering
Grades
Courses
student
course
grade
course
lecturer
Ahuva
PL
90
PL
Eran
Alon
PL
86
DC
Keren
Alon
PL
81
Database D
Functional Dependency:
student, course  grade
Integrity Constraints Σ
SELECT student
FROM Grades G, Courses C
WHERE
G.grade >= 80 AND
G.course = C.course AND
C.lecturer=‘Eran’
Ahuva
Alon
34
Minimal Repairs
[Arenas, Bertossi, Chomicki 99]:
DEFINITION: Let (D,Σ) be an inconsistent DB. A
repair is a DB D', such that:
1. DB D' is consistent (with respect to Σ)
2. DB D' differs from D in a “minimal way”
Grades
Grades
student
course
grade
Ahuva
PL
90
Alon
PL
86
Alon
PL
81
Inconsistent database D
student
course
grade
Ahuva
PL
90
Alon
PL
86
Repair
D'1
Grades
student
course
grade
Ahuva
PL
90
Alon
PL
81
Repair
D'2
35
Semantics of Query Answering
Inconsistent DB
Repairs (consistent DBs)
36
Semantics of Query Answering
Inconsistent DB
Repairs (consistent DBs)
37
Semantics of Query Answering
Inconsistent DB
Consistent Answers
Repairs (consistent DBs)
38
Algorithms / Complexity
Koutris & Wijsen [2015]: For consistent query answering
with key constraints, we know how Select-Project-Join
(SPJ) w/o self joins are classified into 3 categories:
1.
Inconsistent DB
2.
3.
Inconsistent DB
coNP-complete
(exptime under standard
complexity assumptions)
ignore
inconsistency
Rewriting
Graph algorithm
39
Incorporating Preferences
Functional dependencies:
course  lecturer
lecturer  course
Courses
course
lecturer
DB
Keren
DC
Keren
DC
Eran
DB
Eran
What if we
trust some
tuples more
than others?
Staworko, Chomicki, Marcinkowski: Prioritized
repairing and consistent query answering in
relational databases. Ann. Math. Artif. Intell. 64(2-3):
40
Lecture 1: Introduction
PROBABILISTIC DATABASES
41
How to accommodate the probabilistic nature
of data at the database & query level?
Student
University
Ahuva
Technion
Alon
Technion
Employee
Employer
Role
Eng
Ahuva
Intel
PM
VP
HaifaU
Alon
Yahoo!
Eng
Google
Eng
Intel
PM
• Find the students that are employed as engineers
• How many students work at Intel?
• Is any PM a Technion student?
42
How to accommodate the probabilistic nature
of data at the database & query level?
Student
University
Pr
Ahuva
Technion
1.0
Technion
0.7
HaifaU
0.3
Alon
Employee
Ahuva
Alon
Employer
Role
Pr
Eng
0.7
PM
0.2
VP
0.1
Yahoo!
Eng
0.4
Google
Eng
0.4
Intel
PM
0.1
Intel
• Find the students that are employed as engineers
- Ahuva (0.7), Alon (0.8)
• How many students work at Intel?
- Expectation = 1 + 0.1
• Is any PM a Technion student?
- Yes w/ prob 1-((1-0.2)*(1-0.7*0.1))
43
Semantics
Probabilistic DB
p1
p2
p3
p4
pn
Space of ordinary DBs
44
Semantics of Query Answering
Probabilistic Database
p1
p2
p3
p4
pn
Space of ordinary DBs
45
Semantics of Query Answering
Probabilistic Database
p1
p1
p2
p2
p3
p3
p4
p4
pn
pn
Space of ordinary DBs
46
Semantics of Query Answering
Probabilistic Database
p1
p1
p2
p2
p3
p3
p4
Rep of the
probability space
Mapping tuple 
marginal probability
p4
pn
pn
Space of ordinary DBs
47
Algorithms for Query Answering
• Dalvi & Suciu dichotomy: SPJ queries can be
fully classified into:
– Queries that can be solved in polynomial time
• By repeated decomposition into simpler queries
– Queries for which answering is #P-hard
• Hence, cannot be computed in polynomial time under
standard complexity assumptions
• Heuristic via BDDs [Olteanu+]
• Guaranteed approximation via sampling
– Additive approx. p±𝜀 is simple
– Multiplicative approx. (1±𝜀)p requires more work
48
Probabilistic XML
university
department
0 .8
0.9
position
name
position
Paul
3
0.
chair f. prof a. prof
0.5
0.
6
name
ph.d. studs
Nicole
4
0.
0.8
0.5
0.
7
member
0.6
0.7
member
chair f. prof a. prof name name name
David
Amy
Emily
[Abiteboul, Kimelfeld, Sagiv, Senellart]:
Representation systems and XPath evaluation
49
Lecture 1: Introduction
PLANNED SCHEDULE
50
1
2
3
4
5
6
7
8
9
10
11
12
13
14/3
No Lecture
20/3
Intro
21/3
DB Essentials
28/3
Incompleteness
4/4
Data Exchange
11/4
No Lecture
18/4
Inconsistent DBs
Assignment 1 due
25/04
Passover
02/05
Consistent Q Answering
9/05
Consistent Q Answering
16/05
Pref. Repairs + Misc
Assignment 2 due
23/05
Probabilistic DB
30/05
Query Inference
Assignment 3 due
06/06
Query Inference
13/06
Probabilistic DB Extras
20/06
Perspectives
Assignment 4 due
51
Related documents