Download 28modern.pps - CMU-CS 15-415/615 Database Applications (Fall

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Concurrency control wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

SQL wikipedia , lookup

Database wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Functional Database Model wikipedia , lookup

Clusterpoint wikipedia , lookup

Relational algebra wikipedia , lookup

Relational model wikipedia , lookup

Database model wikipedia , lookup

Transcript
CMU SCS
Carnegie Mellon Univ.
Dept. of Computer Science
15-415/615 - DB Applications
C. Faloutsos – A. Pavlo
Lecture#28: Modern Database Systems
CMU SCS
Administrivia – Final Exam
•
•
•
•
•
Who: You
What: R&G Chapters 15-22
When: Tuesday May 6th 5:30pm‐ 8:30pm
Where: WEH 7500
Why: Databases will help your love life.
Faloutsos/Pavlo
CMU SCS 15-415/615
2
CMU SCS
Administrivia – Final Exam
Handwritten Notes
Printed Notes
CMU SCS
Today’s Class
• Distributed OLAP
• OldSQL vs. NoSQL vs. NewSQL
• How to scale a database system
Faloutsos/Pavlo
CMU SCS 15-415/615
4
CMU SCS
OLTP vs. OLAP
• On-line Transaction Processing:
– Short-lived txns.
– Small footprint.
– Repetitive operations.
• On-line Analytical Processing:
– Long running queries.
– Complex joins.
– Exploratory queries.
Faloutsos/Pavlo
CMU SCS 15-415/615
5
CMU SCS
Operation Complexity
Workload Characterization
Complex
OLAP
Social
Networks
OLTP
Simple
Writes
Reads
Workload Focus
Michael Stonebraker – “Ten Rules For Scalable Performance In Simple Operation' Datastores”
http://cacm.acm.org/magazines/2011/6/108651
CMU SCS
Relational Database Backlash
• New Internet start-ups hit the limits of
single-node DBMSs.
• Early companies used custom middleware
to shard databases across multiple DBMSs.
• Google was a pioneer in developing nonrelational DBMS architectures.
Faloutsos/Pavlo
CMU SCS 15-415/615
7
CMU SCS
MapReduce
• Simplified parallel computing paradigm for
large-scale data analysis.
• Originally proposed by Google in 2004.
• Hadoop is the current leading open-source
implementation.
Faloutsos/Pavlo
CMU SCS 15-415/615
8
CMU SCS
MapReduce Example
Calculate total order amount per day after Jan 1st.
2009-03-01
$53.00
2009-03-01
$10.00
$25.00
$53.00
REDUCE(key,
MAP(key,
value)
values)
{
{
if (key
sum
= 0;>= “2009-01-01”) {
DATE
AMOUNT
DATE
AMOUNT
2009-03-02 $10.00 value);
while
output(key,
(values.hasNext())
{
2009-03-02 $10.00
2007-12-13 $25.00
2007-12-13 $25.00
$53.00
} sum +=2008-04-19
values.next();
2008-04-19 $53.00
2008-01-19 $12.00
2008-01-19 $12.00
2008-05-20 $45.00
} }
2008-05-20 $45.00
2009-03-21 $99.00
2009-03-21 $99.00
2009-01-18sum);
$15.00
output(key,
2009-01-18 $15.00
}
2009-03-02
2009-03-02
$93.00
$30.00
$85.00
$93.00
2009-03-03
2009-03-03
$69.00
$44.00
$62.00
$69.00
Map
ReduceOutput
Map Workers
Output
Reduce
Workers
Faloutsos/Pavlo
CMU SCS 15-415/615
9
CMU SCS
What MapReduce Does Right
• Since all intermediate results are written to
HDFS, if one node crashes the entire query
does not need to be restarted.
• Easy to load data and start running queries.
• Great for semi-structured data sets.
Faloutsos/Pavlo
CMU SCS 15-415/615
10
CMU SCS
What MapReduce Did Wrong
• Have to parse/cast values every time:
– Multi-attribute values handled by user code.
– If data format changes, code must change.
• Expensive execution:
– Have to send data to executors.
– A simple join requires multiple MR jobs.
Faloutsos/Pavlo
CMU SCS 15-415/615
11
CMU SCS
Join Example
• Find sourceIP that generated most
adRevenue along with its average
pageRank.
12
CMU SCS
Join Example – SQL
SELECT INTO Temp sourceIP,
AVG(pageRank) AS avgPageRank,
SUM(adRevenue) AS totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate BETWEEN “2000-01-15” AND “2000-01-22”
GROUP BY UV.sourceIP;
SELECT sourceIP, totalRevenue, avgPageRank
FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
Faloutsos/Pavlo
CMU SCS 15-415/615
13
CMU SCS
Join Example – MapReduce
Phase 1:
Filter
Map:
Emit all records for
Rankings.
Filter UserVisits data.
Reduce:
Compute cross product.
Faloutsos/Pavlo
Phase 2:
Aggregation
Phase 3:
Search
Map:
Emit all tuples (i.e.,
passthrough)
Map:
Emit all tuples (i.e.,
passthrough)
Reduce:
Compute avg pageRank
for each sourceIP.
Reduce:
Scan entire input and emit
the record with greatest
adRevenue sum.
CMU SCS 15-415/615
14
CMU SCS
Join Example – Results
• Find sourceIP that generated most
adRevenue along with its average
pageRank.
Hadoop
Vertica
DBMS-X
1400
1200
1000
800
600
400
200
32.0
29.2
35.4
29.4
55.0
31.9
15
0
25 nodes
50 nodes
100 nodes
CMU SCS
Distributed Joins Are Hard
SELECT * FROM table1, table2
WHERE table1.val = table2.val
• Assume tables are horizontally partitioned:
– Table1 Partition Key → table1.key
– Table2 Partition Key → table2.key
• Q: How to execute?
• Naïve solution is to send all partitions to a
single node and compute join.
Faloutsos/Pavlo
CMU SCS 15-415/615
16
CMU SCS
Semi-Joins
• First distribute the join attributes between
nodes and then recreate the full tuples in the
final output.
– Send just enough data from each table to
compute which rows to include in output.
• Lots of choices make this problem hard:
– What to materialize?
– Which table to send?
Faloutsos/Pavlo
CMU SCS 15-415/615
17
CMU SCS
MapReduce in 2014
• SQL/Declarative Query Support
• Table Schemas
• Column-oriented storage.
Faloutsos/Pavlo
CMU SCS 15-415/615
18
CMU SCS
Column Stores
• Store tables as sections of columns of data
rather than as rows of data.
Faloutsos/Pavlo
CMU SCS 15-415/615
19
CMU SCS
Column Stores
SELECT sex, AVG(GPA) FROM student
GROUP BY sex
sid
name
login
1001
Faloutsos
1002
Row-oriented Storage
age
gpa
sex
christos@cs
45
4.0
M
<sid,name,login,age,gpa,sex>
Bieber
jbieber@cs
21
3.9
M
<sid,name,login,age,gpa,sex>
1003
Tupac
shakur@cs
26
3.5
M
<sid,name,login,age,gpa,sex>
1004
Ke$sha
kesha@cs
22
4.0
F
<sid,name,login,age,gpa,sex>
1005
LadyGaGa gaga@cs
24
3.5
F
<sid,name,login,age,gpa,sex>
1006
Obama
50
3.7
M
obama@cs
<sid,name,login,age,gpa,sex>
<sid,name,login,age,gpa,sex>
Faloutsos/Pavlo
CMU SCS 15-415/615
20
CMU SCS
Column Stores
SELECT sex, AVG(GPA) FROM student
GROUP BY sex
sid
name
login
age
gpa
sex
1001
Faloutsos
christos@cs
45
4.0
M
1002
Bieber
jbieber@cs
21
3.9
M
1003
Tupac
shakur@cs
26
3.5
M
1004
Ke$sha
kesha@cs
22
4.0
F
1005
LadyGaGa gaga@cs
24
3.5
F
1006
Obama
50
3.7
M
obama@cs
Faloutsos/Pavlo
Column-oriented Storage
sid name login age
CMU SCS 15-415/615
gpa
sex
21
CMU SCS
Column Stores
• Only scan the columns that a query needs.
• Allows for amazing compression ratios:
– Values for the same query are usually similar.
• Main goal is delay materializing a record
back to its row-oriented format for as long
as possible inside of the DBMS.
• Inserts/Updates/Deletes are harder…
Faloutsos/Pavlo
CMU SCS 15-415/615
22
CMU SCS
Column Store Systems
• Many column-store DBMSs
– Examples: Vertica, Sybase IQ, MonetDB
• Hadoop storage library:
– Example: Parquet, RCFile
Faloutsos/Pavlo
CMU SCS 15-415/615
23
CMU SCS
NoSQL
• In addition to MapReduce, Google created a
distributed DBMS called BigTable.
– It used a GET/PUT API instead of SQL.
– No support for txns.
• Newer systems have been created that
follow BigTable’s anti-relational spirit.
Faloutsos/Pavlo
CMU SCS 15-415/615
24
CMU SCS
NoSQL Systems
Key/Value
Faloutsos/Pavlo
Column-Family Documents
CMU SCS 15-415/615
25
CMU SCS
NoSQL Drawbacks
• Developers write code to handle eventually
consistent data, lack of transactions, and
joins.
• Not all applications can give up strong
transactional semantics.
Faloutsos/Pavlo
CMU SCS 15-415/615
26
CMU SCS
NewSQL
• Next generation of relational DBMSs that
can scale like a NoSQL system but without
giving up SQL or txns.
Faloutsos/Pavlo
CMU SCS 15-415/615
27
CMU SCS
Aslett White Paper
[Systems that] deliver the
scalability and flexibility
promised by NoSQL while
retaining the support for SQL
queries and/or ACID, or to
improve performance for
appropriate workloads.
Matt Aslett – 451 Group (April 4th, 2011)
https://www.451research.com/report-short?entityId=66963
28
CMU SCS
Wikipedia Article
A class of modern relational
database systems that provide
the same scalable
performance of NoSQL
systems for OLTP workloads
while still maintaining the
ACID guarantees of a
traditional database system.
Wikipedia (April 2014)
http://en.wikipedia.org/wiki/NewSQL
29
CMU SCS
NewSQL Systems
New Design MySQL Engines Middleware
Faloutsos/Pavlo
CMU SCS 15-415/615
30
CMU SCS
NewSQL Implementations
• Distributed Concurrency Control
• Main Memory Storage
• Hybrid Architectures
– Support OLTP and OLAP in single DBMS.
• Query Code Compilation
Faloutsos/Pavlo
CMU SCS 15-415/615
31