TDD: Topics in Distributed Databases
(Querying and cleaning big data)
Wenfei Fan
University of Edinburgh
1
What is big data?
2
Big data: What is it anyway?
Everyone talks about big data. But what is it?
 Volume: horrendously large
   • PB (10^15 B)
   • EB (10^18 B)
 Variety: heterogeneous, semi-structured or unstructured
   • 9:1 ratio of unstructured data vs. structured data
   • collecting 95% of restaurants requires at least 5,000 sources
     (cf. Online Ordering of Overlapping Data Sources, Mariam Salloum,
     Xin Luna Dong, Divesh Srivastava, Vassilis J. Tsotras, PVLDB 7(3), 2013)
 Velocity: dynamic
   • think of the Web and Facebook
 Veracity: trust in its quality
   • real-life data is typically dirty!
A departure from our familiar data management!
3
Why is the data so big?
 Worldwide information volume is growing annually at a minimum rate of 59% (Gartner, 2011)
 A single jet engine produces 20 TB (10^12 B) of data per hour
 Facebook has 1.38 billion users, 140 billion links, and about 300 PB of data
 Human genome data: sampling, biochemistry, immunology, imaging, genetic, phenotypic data
   • 1 person: 1 PB (10^15 B)
   • 1,000 people: 1 EB (10^18 B)
   • 1 billion people: 1 YB (10^24 B)
Big data is a relative notion: 1 TB is already too big for your laptop
4
Why do we care about big data?
5
Example: Medicare
 Google Flu Trends:
   • advance indication in the 2007-08 flu season
   • the 2009 H1N1 outbreak (Nature, 2009)
 IBM: Predict Heart Disease Through Big Data Analytics
   • traditional: EKGs, heart rate, blood pressure
   • big data analysis: connecting
     • exercise and fitness tests
     • diet
     • fat and muscle composition
     • genetics and environment
     • social media and wellness: share information
     • …
A new game: a large number of data sources of big volume
6
Big data is needed everywhere
 Social media marketing:
   • 78% of consumers trust peer (friend, colleague and family member)
     recommendations; only 14% trust ads
   • if three close friends of person X like items P and W, and X also
     likes P, then the chances are that X likes W too
 Social event monitoring:
   • prevent terrorist attacks
   • The Net Project, Shenzhen, China (Audaque)
 Scientific research:
   • a new and more effective way to develop theory, by exploring and
     discovering correlations among seemingly disconnected factors
The world is becoming data-driven, like it or not!
7
The big data market is BIG
 US HEALTH CARE ($300 B): increase industry value per year by $300 B
 US RETAIL (60+%): increase net margin by 60+%
 MANUFACTURING (-50%): decrease development and assembly costs by 50%
 GLOBAL PERSONAL LOCATION DATA ($100 B): increase service provider revenue by $100 B
 EUROPE PUBLIC SECTOR ADMIN (250 B Euro): increase industry value per year by 250 B Euro
McKinsey Global Institute, May 2011:
Big Data: The next frontier for innovation, competition and productivity
8
Why study big data?
Want to find a job?
• Research and development of big data systems:
  ETL, distributed systems (e.g., Hadoop), visualization tools, data
  warehouses, OLAP, data integration, data quality control, …
• Big data applications:
  social marketing, healthcare, …
• Data analysis: to get value out of big data
  discovering and applying patterns, predictive analysis, business
  intelligence, complexity theory, algorithms, distributed databases,
  query answering, data quality, privacy and security, …
Prepares you for
• graduate study: current research and practical issues;
• the job market: skills/knowledge in demand
Big data = Big $$$
9
What challenges are introduced by big data?
10
Big data: Through the eyes of computation
 Computer science is, at its core, about computing a function f(x)
 Big data: the data parameter x is horrendously large: PB or EB
What challenges does this introduce to query answering?
Fallacies:
 Big data introduces no fundamental problems
 Big data = MapReduce (Hadoop)
 Big data = data quantity (scalability)
Are these true?
11
Flashback: Relational queries
Questions:
 What is a relational schema? A relation? A relational database?
 What is a query? What is relational algebra?
 What does relational completeness mean?
 What is a conjunctive query?
[Diagram: updates and queries go to the DBMS, which stores data in the DB and returns query answers]
The bible for database researchers: Foundations of Databases
12
Traditional database management systems
 A database is a collection of data, typically containing the
information about one or more related organizations.
 A database management system (DBMS) is a software
package designed to store and manage databases.
 Database: local
 DBMS: centralized; single processor (CPU); managing local
databases (single memory, disk)
[Diagram: the same centralized architecture: updates and queries in, query answers out, data stored in the DB]
13
Facebook: Graph Search
 Find me restaurants in New York my friends have been to in 2013
   • friend(pid1, pid2)
   • person(pid, name, city)
   • dine(pid, rid, dd, mm, yy)
 SQL query (in fact, a conjunctive query, or an SPC query):
   select rid
   from   friend(pid1, pid2), person(pid, name, city),
          dine(pid, rid, dd, mm, yy)
   where  pid1 = p0 and pid2 = person.pid and
          pid2 = dine.pid and city = NYC and yy = 2013
Facebook: more than 1.38 billion nodes, and over 140 billion links
Is it feasible on big data?
14
Example queries: Graph pattern matching
 Input: A pattern graph Q and a graph G
 Output: All the matches of Q in G, i.e., all subgraphs of G that
are isomorphic to Q
   (Isomorphism: a bijective function f on nodes such that (u, u') ∈ Q iff
   (f(u), f(u')) ∈ G; a brute-force matching sketch follows this slide)
 Applications
   • pattern recognition
   • intelligence analysis
   • transportation network analysis
   • Web site classification
   • social position detection
   • user-targeted advertising
   • knowledge base disambiguation
   • …
What other graph queries do you know?
15
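To make the matching semantics concrete, here is a minimal Python sketch (a hypothetical illustration, not course code) that enumerates injective node mappings by brute force; it also shows why exact subgraph isomorphism is hopeless at the scale quoted on the next slide.

from itertools import permutations

# Brute-force (induced) subgraph isomorphism: try every injective mapping of
# the pattern nodes into the data graph, keeping mappings that preserve edges
# in both directions.
def matches(q_nodes, q_edges, g_nodes, g_edges):
    q_edges, g_edges = set(q_edges), set(g_edges)
    found = []
    for image in permutations(g_nodes, len(q_nodes)):
        f = dict(zip(q_nodes, image))
        mapped = {(f[u], f[v]) for (u, v) in q_edges}
        induced = {(a, b) for (a, b) in g_edges if a in image and b in image}
        if mapped == induced:   # (u, u') in Q iff (f(u), f(u')) in G
            found.append(f)
    return found

# Toy usage: a directed triangle pattern matched against a 4-node graph.
print(matches(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")],
              [1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)]))

Even this toy version enumerates on the order of |G|^|Q| injective mappings, which is why matching on graphs of Facebook's size is studied under restricted semantics (e.g., graph simulation, mentioned later as a project topic).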
Graph pattern matching
Find all matches of a pattern in a graph
Identify suspects in a drug ring
[Figure: a pattern graph with nodes labeled B, A1 … Am, S and W, matched
against a large social graph]
Is this feasible? Facebook: more than 1.38 billion nodes, and over 140 billion links
cf. "Understanding the structure of drug trafficking organizations"
16
Querying big data: New challenges
Given a query Q and a dataset D, compute Q(D)
Q(D) over a traditional database  vs.  Q(D) over big data (PB or EB)
What are the new challenges introduced by querying big data?
 Does querying big data introduce new fundamental problems? Why?
 What new methodology do we need to cope with the sheer size of big data D?
A departure from classical theory and traditional techniques
17
The good, the bad and the ugly
 Traditional computational complexity theory of almost 50 years:
   • The good: polynomial time computable (PTIME)
   • The bad: NP-hard (intractable)
   • The ugly: PSPACE-hard, EXPTIME-hard, undecidable …
What happens when it comes to big data? How long does it take?
Using an SSD at 6 GB/s, a linear scan of a data set D would take
   • 1.9 days when D is of 1 PB (10^15 B)   (What query is this?)
   • 5.28 years when D is of 1 EB (10^18 B)
   (a back-of-the-envelope check follows this slide)
O(n) time is already beyond reach on big data in practice!
Polynomial time queries become intractable on big data!
18
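The figures above follow from simple arithmetic; below is a small back-of-the-envelope check in Python, assuming the same sustained 6 GB/s sequential read rate quoted on the slide.

# Linear scan time = data size / throughput (illustrative arithmetic only).
def scan_time_seconds(size_bytes, throughput_bytes_per_s=6e9):
    return size_bytes / throughput_bytes_per_s

for label, size in [("1 PB", 1e15), ("1 EB", 1e18)]:
    s = scan_time_seconds(size)
    print(f"{label}: {s / 86400:.2f} days = {s / (86400 * 365):.2f} years")
# 1 PB -> about 1.93 days; 1 EB -> about 5.28 years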
Tractability revisited for big data
[Figure: a hierarchy of query classes: parallel polylog time, P, NP and beyond,
separating the BD-tractable queries from the not BD-tractable ones]
Yes, querying big data comes with new and hard fundamental problems
BD-tractable queries: properly contained in P unless P = NC
19
Challenges: query evaluation is costly
 Graph pattern matching by subgraph isomorphism
   • NP-complete to decide whether there exists a match
   • possibly exponentially many matches
 Membership problem for relational queries (what is its complexity?)
    Input: a query Q, a database D, and a tuple t
    Question: is t in Q(D)?
   • NP-complete if Q is a conjunctive query (SPC)
   • PSPACE-complete if Q is in relational algebra (SQL)
Intractable even in traditional complexity theory;
already beyond reach in practice even when the data is not very big
20
Is it still feasible to query big data?
Can we do better if we are given more resources?
Parallel and distributed query processing – TDD
Using 10,000 SSDs at 6 GB/s each, a linear scan of D might take:
 1.9 days / 10,000 = 16 seconds when D is of 1 PB (10^15 B)
 5.28 years / 10,000 = 4.63 hours when D is of 1 EB (10^18 B)
Only ideally!
[Figure: 10,000 processors, each with its own memory and DB partition,
connected by an interconnection network]
Yes, parallel query processing. But how?
21
The two sides of a coin
Data = quantity + quality
When we talk about big data, we typically mean its quantity:
 What capacity does a system provide to cope with the sheer size of the data?
 Is a query feasible on big data within our available resources?
 How can we make our queries tractable on big data?
 ...
Can we trust the answers to our queries?
 Dirty data routinely lead to misleading financial reports and strategic
   business planning decisions, and hence to loss of revenue, credibility
   and customers, with disastrous consequences
The study of data quality is as important as data quantity
22
Data consistency
FN      LN      address      AC    city
Mary    Smith   2 Small St   908   NYC
Mary    Dupont  10 Elm St    610   PHI
Mary    Dupont  6 Main St    212   NYC
Bob     Luth    8 Cowan St   215   PHI
Robert  Luth    6 Drum St    212   NYC
 Q1: how many employees are in the NY office?
 3 may not be the correct answer: the AC and city in the first tuple are
   inconsistent! (a small rule-based check follows this slide)
Error rates: 10% - 75% (telecommunication)
23
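Inconsistencies of this kind are what the data quality rules introduced later in the course (e.g., conditional functional dependencies) are designed to catch. The sketch below is purely illustrative: it assumes that area code determines city, and the lookup table is made up.

# Hypothetical rule "AC determines city"; the lookup table is illustrative only.
AC_TO_CITY = {"908": "MH", "212": "NYC", "215": "PHI", "610": "PHI"}

employees = [
    {"FN": "Mary",   "LN": "Smith",  "AC": "908", "city": "NYC"},
    {"FN": "Mary",   "LN": "Dupont", "AC": "610", "city": "PHI"},
    {"FN": "Mary",   "LN": "Dupont", "AC": "212", "city": "NYC"},
    {"FN": "Bob",    "LN": "Luth",   "AC": "215", "city": "PHI"},
    {"FN": "Robert", "LN": "Luth",   "AC": "212", "city": "NYC"},
]

# Flag tuples that violate the rule; Q1 should not count them as NYC staff.
for t in employees:
    expected = AC_TO_CITY.get(t["AC"])
    if expected and expected != t["city"]:
        print("inconsistent:", t, "- AC", t["AC"], "implies city", expected)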
Information completeness
FN      LN      address      AC    city
Mary    Smith   2 Small St   908   NYC
Mary    Dupont  10 Elm St    610   PHI
Mary    Dupont  6 Main St    212   NYC
Bob     Luth    8 Cowan St   215   PHI
Robert  Luth    6 Drum St    212   NYC
 Q2: how many distinct employees have the first name Mary?
 3 may not be the correct answer:
   • the first three tuples refer to the same person
   • the information may be incomplete
"Information perceived as being needed for clinical decisions was
unavailable 13.6% - 81% of the time" (2005)
24
Data currency
FN      LN      address      salary  status
Mary    Smith   2 Small St   50k     single
Mary    Dupont  10 Elm St    50k     married
Mary    Dupont  6 Main St    80k     married
Bob     Luth    8 Cowan St   80k     married
Robert  Luth    6 Drum St    55k     married
Entities: Mary and Robert (the data is consistent, complete, and was once correct)
 Q3: what is Mary's current salary? 80k
 In the real world, salary is monotonically increasing
"In a customer file, within two years about 50% of records may become obsolete" (2002)
25
Data fusion
FN      LN      address      salary  status
Mary    Smith   2 Small St   50k     single
Mary    Dupont  10 Elm St    50k     married
Mary    Dupont  6 Main St    80k     married
Bob     Luth    8 Cowan St   80k     married
Robert  Luth    6 Drum St    55k     married
 Q4: what is Mary's current last name? Dupont
 In real life:
   • marital status only changes from single → married → divorced
   • tuples with the most current marital status also have the most current last name
   (a small sketch follows this slide)
Deduce the true values of an entity
26
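The reasoning on the two slides above can be written as a small sketch (hypothetical code, not from the course): treat "salary is monotonically increasing" and "single → married → divorced" as currency rules, and copy the last name from the tuple with the most current status.

# Illustrative currency rules: salary only increases; marital status follows
# single -> married -> divorced; the most current status carries the most
# current last name.
STATUS_ORDER = {"single": 0, "married": 1, "divorced": 2}

mary = [
    {"LN": "Smith",  "salary": 50, "status": "single"},
    {"LN": "Dupont", "salary": 50, "status": "married"},
    {"LN": "Dupont", "salary": 80, "status": "married"},
]

current_salary = max(t["salary"] for t in mary)
most_current = max(mary, key=lambda t: STATUS_ORDER[t["status"]])
print(current_salary, most_current["LN"])   # 80 Dupont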
Data in real-life is often dirty
 Pentagon asked 200+ dead officers to re-enlist
 81 million National Insurance numbers, but only 60 million eligible citizens
 98,000 deaths each year caused by errors in medical data
 500,000 dead people retain active Medicare cards
 Data error rates in industry: 1% - 30% (Redman, 1998)
Dirty data: inconsistent, inaccurate, incomplete, stale
27
Dirty data are costly
 Poor data cost US businesses $611 billion annually
 Erroneously priced data in retail databases cost US customers $2.5 billion each year (2000)
 1/3 of system development projects were forced to delay or cancel due to poor data quality (2001)
 30%-80% of the development time and budget for data warehousing goes to data cleaning (1998)
 CIA dirty data about WMD in Iraq!
Can we trust the answers to our queries on dirty data?
The scale of the data quality problem is far worse on big data!
28
What does this course cover?
Big data = quantity + quality
 Volume (quantity)
 Veracity (quality)
29
Basic topic 1:
Parallel database management systems
Recall traditional DBMS:
 Database: “single” memory, disk
 DBMS: centralized; single processor (CPU)
Can we do better if provided with multiple processors?
Parallel DBMS: exploring parallelism
 Improved performance
 Reliability and availability
 (a partitioning sketch follows this slide)
[Figure: processors (P), each with a memory (M) and a DB partition, connected
by an interconnection network]
30
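One standard way a parallel DBMS exploits the architecture in the figure is to partition data horizontally across sites; the sketch below is a hypothetical illustration of hash partitioning (range and round-robin partitioning are the usual alternatives; data partition is one of the parallel DBMS topics listed later).

# Hash-partition tuples on a key across n sites (single-process illustration).
def hash_partition(tuples, key, n_sites):
    parts = [[] for _ in range(n_sites)]
    for t in tuples:
        parts[hash(t[key]) % n_sites].append(t)
    return parts

people = [{"pid": i, "city": c} for i, c in enumerate(["NYC", "PHI", "EDI", "NYC"])]
for site, fragment in enumerate(hash_partition(people, "pid", 3)):
    print("site", site, ":", fragment)
# Each site scans only its own fragment, so a full scan ideally speeds up n-fold.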
Basic topic 2:
Distributed databases
Data is stored in several sites, each with an independent DBMS
 Local ownership: physically stored across different sites
 Increased availability and reliability
 Performance
[Figure: queries and answers go through a global schema over several sites
connected by a network; each site has a local schema, its own DBMS and its own DB]
31
Advanced topic 1:
MapReduce
 A programming model with two primitive functions:
 Map: <k1, v1> → list(k2, v2)
 Reduce: <k2, list(v2)> → list(k3, v3)
   (a word-count sketch follows this slide)
 Connection between MapReduce and parallel query processing
 Other parallel programming models
   • BSP (Bulk Synchronous Parallel)
   • vertex-centric
   • partial evaluation
Applications in cloud computing
32
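To make the two primitives concrete, here is a minimal single-machine simulation of the model, with word count as the usual example (hypothetical helper code, not Hadoop's API).

from collections import defaultdict

def map_fn(k1, v1):                 # <k1, v1> -> list(k2, v2)
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):          # <k2, list(v2)> -> list(k3, v3)
    return [(k2, sum(values))]

def run_mapreduce(inputs):
    groups = defaultdict(list)      # "shuffle": group intermediate pairs by k2
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2, vs in groups.items():
        output.extend(reduce_fn(k2, vs))
    return output

print(run_mapreduce([("doc1", "big data big value"), ("doc2", "big graphs")]))
# [('big', 3), ('data', 1), ('value', 1), ('graphs', 1)]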
Advanced topic 2:
Querying big data
Foundations for querying big data
 Tractability revised for querying big data
 Parallel scalability
 Bounded evaluability of queries
Techniques for querying big data
 Develop parallel algorithms for querying big data
 Bounded evaluability and access constraints
 Query preserving compression
 Query answering using views
 Bounded incremental query processing (a toy sketch follows this slide)
Querying big data: theory and practice
33
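As a taste of the last technique listed above, the toy sketch below (hypothetical code) maintains the answer to a simple counting query under updates, so that each change is absorbed in time proportional to the update rather than to D; bounded incremental processing develops this idea with formal guarantees.

# Maintain Q(D) = count of dine tuples with yy = 2013 under updates, in time
# proportional to the size of the change rather than the size of D.
class IncrementalCount:
    def __init__(self, data, year):
        self.year = year
        self.count = sum(1 for t in data if t["yy"] == year)   # initial Q(D)

    def insert(self, t):            # apply an insertion from the update
        if t["yy"] == self.year:
            self.count += 1

    def delete(self, t):            # apply a deletion from the update
        if t["yy"] == self.year:
            self.count -= 1

D = [{"rid": 1, "yy": 2013}, {"rid": 2, "yy": 2012}]
q = IncrementalCount(D, 2013)
q.insert({"rid": 3, "yy": 2013})
print(q.count)   # 2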
Advanced topic 3:
Data quality management
Big data = quantity + quality!
Central issues for data quality
 Object identification (data fusion): do two objects refer to the same
real-world entity? What is the true value of the entity?
 Data consistency: do our data values have conflicts?
 Data accuracy: is one value more accurate than another for a real-world entity?
 Data currency: is our data out of date?
 Information completeness: does D have enough information to
answer our queries?
TDD: the Veracity of big data
Make our data consistent, accurate, complete and up to date!
34
Advanced topic 4:
Dependencies as data quality rules
 Data quality rules:
   • conditional (functional and inclusion) dependencies to capture data
     inconsistencies (data consistency: do our data values have conflicts?)
   • matching dependencies for record matching
   • there are also quality rules for data accuracy, data currency and
     information completeness – in the textbook
A revision of classical dependencies
 Fundamental problems for data quality rules:
   • consistency: are the data quality rules "dirty" themselves?
   • implication: can we optimize the rules by removing redundant ones?
A uniform logic framework for improving data quality
35
Advanced topic 5:
Data cleaning
[Figure: a cleaning cycle: discover rules → reasoning → detect errors → repair]
• Discover data quality rules
• Validate the rules discovered
• Detect errors with the rules
• Repair data with the rules
• Certain fixes
• Deduce the true values of entities
Semi-automated systems for improving data quality
36
Putting together
Basic technology
 Parallel DBMS: architectures, data partition, (intra/inter) operator
parallelism, parallel query processing and optimization
 Distributed DBMS: architectures, fragmentation, replication
Advanced topics
 Big data: the Volume
– MapReduce and other parallel programming models
– Querying big data: theory and practice
 Big data: the Veracity
   – Central issues for data quality
   – Dependencies as data quality rules
   – Cleaning distributed data: rule discovery, rule validation, error
     detection, data repairing, certain fixes
How the topics map to the four V's:
 Volume (quantity)
 Veracity (quality)
 Variety (entity resolution, conflict resolution)
 Velocity (incremental computation)
Prerequisites: relational query processing, algebra/SQL, basic complexity
and algorithmic background (e.g., NP, undecidability)
37
Prerequisites
Course format
38
Basic information
 Web site:
http://homepages.inf.ed.ac.uk/wenfei/tdd/home.html
– Syllabus
– Announcements
– Lecture notes
– deadlines
 TA: Chao Tian
– [email protected]
 Office hours:
– Informatics Forum 5.23, 11:00-12:00, Thursday
39
Course format
 Seminar course: there will be no exam!
– Lectures: background.
http://homepages.inf.ed.ac.uk/wenfei/tdd/lecture/lecture-notes.html
– Textbook:
 R. Ramakrishnan, J. Gehrke: Database Management
Systems. WCB/McGraw-Hill 2003 (3rd edition). Chap 22
 Database System Concepts, 4th edition, A. Silberschatz,
H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed
Database Systems)
 W. Fan and F. Geerts. Foundations of Data Quality
Management. Morgan & Claypool, 2012 (Chapters 1-4;
e-copy available upon request)
– Research papers or chapters related to the topics (3-4 each)
40
Grading
 Reviews of research papers (8 in total, down from 12 in 2012): 40%
 Project (report): 45%
 Project presentation: 15%
Homework:
 Four sets of homework, starting from week 4; deadlines:
• 9am, Thursday, February 5, week 4
• 9am, Thursday, February 19, week 6
• 9am, Thursday, March 5, week 8
• 9am, Thursday, March 19, week 10
– Papers: choose two each time (two reviews), not chapters; the papers are
  listed at the end of ln3 – ln8
– 5% for each paper, and 10% for each homework
41
Review Evaluation
Pick 2 research papers each time from the lecture notes, to be covered in
the next two weeks, starting from Week 4.
Write a one-page review for each of the papers, 10 marks
 Summary: 2 marks
• A clear problem statement: input, question/output
• The need for this line of research: motivation
• A summary of key ideas, techniques and contributions
 Evaluation: 5 marks
– Criteria for the line of research (e.g., expressive power,
complexity, accuracy, scalability, etc)
– Evaluation based on your criteria; justify your evaluation
• 3 strong points
• 3 weak points
 Suggest possible extensions: 3 marks
42
Project – Research and development
(recommended)
 Research and development:
– Topic: pick one from lecture notes (ln3 – ln8)
Example: A MapReduce algorithm for graph simulation
 Development
– Pick a research paper from the reading list of ln3—ln8
– Implement its main algorithms
– Conduct its experimental study
You are encouraged to come up with your own project – talk to me first
Multiple people may work on the same project independently
Start early!
43
Grading – design and development
 Distribution:
– Algorithms: technical depth, performance guarantees: 20%
– Proofs of correctness, complexity analysis and performance guarantees of
  your algorithms: 15%
– Justification (experimental evaluation): 10%
 Report: in the form of technical report/research paper
– Introduction: problem statement, motivation
– Related work: survey
– Techniques: algorithms, illustration via intuitive examples
– Correctness/complexity/property proofs
– Experimental evaluation
– Possible extensions
44
Project – survey
 Topic: pick one topic from a lecture note (ln3 – ln8)
Example: techniques for conflict resolution
 Distribution:
– Select 5-6 representative papers, independently: 10%
– Develop a set of criteria: the most important issues in that line of
  research, based on your own understanding; justify your criteria: 10%
– Evaluate each of the papers based on your criteria: 15%
– A table to summarize the assessment based on your criteria; draw and
  justify your conclusion and recommendation for various applications: 10%
Sample survey: "A Brief Survey of Automatic Methods for Author Name
Disambiguation"; find and download it from Google
Your understanding of the topic
45
Project report and presentation – 15%
 A clear problem statement
 Motivation and challenges
 Key ideas, techniques/approaches
 Key results – what you have got, intuitive examples
 Findings/recommendations for different applications
 Demonstration: a must if you do a development project
 Presentation: question handling (show that you have developed
a good understanding of the line of work)
Learn how to present your work
46
Summary and Review
 What is big data?
 What is the volume of big data? Variety? Velocity? Veracity?
 Why do we care about big data?
 Is there any fundamental challenge introduced by querying big
data?
 Why study data quality?
 What is consistency? Information completeness? Data
currency? Data accuracy? Object identification?
47
Reading list
 For next week, parallel databases, before the next lecture
– Database Management Systems, 2nd edition, R.
Ramakrishnan and J. Gehrke, Chapter 22.
– Database System Concepts, 4th edition, A. Silberschatz, H.
Korth, S. Sudarshan, Part 6 (Parallel and Distributed
Database Systems)
 About relational databases:
– Foundations of Databases, S. Abiteboul, R. Hull, V. Vianu
 About big data
– W. Fan and J. Huai. Querying Big Data: Theory and
Practice, JCST 2014
http://homepages.inf.ed.ac.uk/wenfei/papers/JCST14.pdf
48