Tashkent:
Uniting Durability & Ordering
in Replicated Databases

Sameh Elnikety, EPFL
Steven Dropsho, EPFL
Fernando Pedone, USI
Write-Many Replicated Database
[Figure: update transactions Tx A and Tx B arrive at Replicas 1-3; each replica handles its own durability]
• All replicas agree on
  – which update txs commit
  – their commit order
• Total order
  – determined by middleware
  – followed by each replica
Order Determined Outside DB
[Figure: the Replication MW (global ordering) receives Tx A and Tx B, fixes the commit order A then B, and ships the ordered stream "A B" to Replicas 1-3, each still doing its own durability]
Enforce External Commit Order
[Figure: the middleware dictates commit order A then B; at one replica, the proxy submits Tx A and Tx B through the SQL interface to two database tasks, but the durability writes could land as B before A]
• Cannot commit A & B concurrently!
Enforce Order = Serial Commit
[Figure: same replica; to guarantee the order, the proxy commits A, waits for its durability write, and only then commits B, so the log is written A then B]
Commit Serialization is Slow
[Figure: to enforce order A, B, C the proxy must issue Commit A, wait for Ack A, then Commit B, wait for Ack B, then Commit C; each commit triggers its own durability disk write (A, then AB, then ABC) while the CPU sits idle]
• Root cause: durability & ordering separated → serial disk writes (sketched below)
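To make the cost concrete, here is a minimal sketch of the serial-commit pattern the proxy is forced into. It is illustrative only (`append_log_record` and `serial_commit` are our names, not PostgreSQL or Tashkent APIs): every commit must reach disk with its own flush before the next ordered commit may be issued.

```python
import os

# Hypothetical stand-ins for a replica's commit path; names are ours.
def append_log_record(log_fd, tx_id):
    os.write(log_fd, f"COMMIT {tx_id}\n".encode())

def serial_commit(log_fd, ordered_txs):
    """Commit transactions one at a time in the externally dictated order."""
    for tx in ordered_txs:
        append_log_record(log_fd, tx)
        os.fsync(log_fd)   # one disk flush PER transaction
        # only after this flush may the proxy ack tx and issue the next commit

log_fd = os.open("replica.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
serial_commit(log_fd, ["A", "B", "C"])   # three commits -> three fsyncs
os.close(log_fd)
```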
Solution: Unite Durability & Ordering
1- Pass order info to DB
[Figure: the middleware (ordering) hands the commit order to each replica, which performs durability and ordering together]
2- Move durability to MW
[Figure: the middleware (ordering) also performs durability; durability is turned OFF at every replica]
1- Unite Dur. & Ord. in Database
• Solution 1: pass order info to DB
[Figure: the proxy issues "Commit A at 1", "Commit B at 2", "Commit C at 3" without waiting in between; the database makes A, B, C durable in one ordered log write, then sends Ack A, Ack B, Ack C]
• Durability & ordering in database → group commit (sketched below)
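A minimal sketch of the resulting group commit, assuming a hypothetical commit-with-position interface of the kind solution 1 adds: a whole batch of ordered commits is appended to the log and made durable with a single flush.

```python
import os

def group_commit(log_fd, ordered_batch):
    """Make a whole batch of ordered commits durable with one disk write."""
    # Append every commit record in the externally dictated order...
    for position, tx in enumerate(ordered_batch, start=1):
        os.write(log_fd, f"COMMIT {tx} at {position}\n".encode())
    os.fsync(log_fd)      # ...then ONE flush covers the entire batch
    return ordered_batch  # all transactions can now be acked together

log_fd = os.open("replica.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
group_commit(log_fd, ["A", "B", "C"])   # three commits -> one fsync
os.close(log_fd)
```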
2- Unite D. & O. in Middleware
• Solution 2: move durability to MW
[Figure: the middleware logs A, B, C durably itself (Durability ABC, one group write), then sends Commit A, Commit B, Commit C to the database, which runs with durability OFF and returns Ack A, Ack B, Ack C quickly]
• Durability & ordering in middleware → group commit
Roadmap
• Durability & ordering
  – separated → serial commit → slow
  – united → group commit → fast
• Two implementations
  – Tashkent-API: united in DB
  – Tashkent-MW: united in MW
• Tashkent-MW
  – implementation
  – recovery
  – performance
Tashkent-MW
[Figure: Tx A, Tx B, Tx C arrive at Replicas 1-3, all running with durability OFF; the Replication MW (global ordering) makes the ordered batch "A B C" durable and distributes it to every replica]
Tashkent-MW: Durability & Ordering in Middleware
• Middleware logs tx effects
  – durability of update txs is guaranteed in the middleware
  – durability is turned off at the database
• Middleware performs durability & ordering
  – united → group commit → fast
• Database commits update txs serially
  – commit = quick main-memory operation (see the sketch below)
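A minimal sketch of this commit path under the slide's assumptions (the class and function names are ours): the middleware makes the ordered batch durable with one group-commit flush to its own log, and only then lets the replicas commit, each commit being a pure in-memory operation because the database's durability is off.

```python
import json, os

class Replica:
    """Toy in-memory replica: durability OFF, commit is a memory update."""
    def __init__(self):
        self.state = {}
    def apply(self, tx_id, writeset):
        self.state.update(writeset)   # quick main-memory operation

def mw_commit_batch(mw_log_fd, replicas, ordered_batch):
    # 1. Durability + ordering united in the middleware: one group flush.
    for tx_id, writeset in ordered_batch:
        record = json.dumps({"tx": tx_id, "writeset": writeset}) + "\n"
        os.write(mw_log_fd, record.encode())
    os.fsync(mw_log_fd)
    # 2. Replicas then commit serially in the agreed order, with no disk I/O.
    for replica in replicas:
        for tx_id, writeset in ordered_batch:
            replica.apply(tx_id, writeset)

mw_log = os.open("mw.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
mw_commit_batch(mw_log, [Replica(), Replica()],
                [("A", {"x": 1}), ("B", {"x": 2}), ("C", {"y": 3})])
os.close(mw_log)
```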
Recovery in Tashkent-MW
[Figure: Replicas 1-3 run with durability OFF; only the Replication MW (global ordering) holds a durable log]
Standard Database I/O
[Figure: Tx A updates pages in memory, producing a dirty data page and a log record; the log record for A is flushed to the on-disk log while the on-disk data page is still stale ("A bad"); a crash then loses only memory]
• Log flushed for
  1- durability
  2- allowing dirty data pages to be cleaned (physical integrity)
(The write-ahead rule behind point 2 is sketched below.)
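A toy sketch of that write-ahead discipline (standard WAL behavior, not code from the paper; all names are ours): a dirty data page may be written back only after the log records covering it have been flushed, which keeps the on-disk state physically explainable after a crash.

```python
class Log:
    """Toy in-memory log with an explicit flushed horizon."""
    def __init__(self):
        self.records, self.flushed_lsn = [], -1
    def append(self, page_id, data):
        self.records.append((page_id, data))
        return len(self.records) - 1          # LSN of the new record
    def flush_up_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)   # simulate fsync

class BufferPool:
    """Toy buffer pool enforcing the write-ahead-log rule."""
    def __init__(self, log):
        self.log, self.pages = log, {}        # page_id -> (data, page_lsn)
    def update(self, page_id, data):
        lsn = self.log.append(page_id, data)  # log record first (in memory)
        self.pages[page_id] = (data, lsn)     # page is now dirty
    def clean(self, page_id, disk):
        data, page_lsn = self.pages.pop(page_id)
        # WAL rule: flush the log through this page's LSN before writing the
        # page; otherwise a crash could leave a page no log record explains.
        self.log.flush_up_to(page_lsn)
        disk[page_id] = data

disk = {}
pool = BufferPool(Log())
pool.update("p1", "A")      # Tx A dirties page p1
pool.clean("p1", disk)      # log flushed first, then the page reaches disk
```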
Database I/O with Durability = OFF
[Figure: with durability off at the replica, Tx A's record is made durable only in the middleware (Durability A, ordered A, B, C); after a crash, the replica's own on-disk log and data cannot be trusted]
• Simple solution: recover from a data dump (checkpoint), sketched below
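A minimal sketch of that recovery path under the deck's setup (the log format and names come from our earlier sketches, not the paper): restore the replica from its last checkpoint, then replay the transactions the middleware logged durably, in their global order.

```python
import json

def recover_replica(checkpoint, mw_log_path):
    """Rebuild a crashed replica's state from a data dump plus the MW log."""
    state = dict(checkpoint)            # 1. load the checkpoint (data dump)
    with open(mw_log_path) as log:
        for line in log:                # 2. replay ordered tx effects
            record = json.loads(line)   #    one JSON record per committed tx
            state.update(record["writeset"])
    return state

# e.g. with the mw.log written in the earlier sketch:
# recover_replica({}, "mw.log")   # -> {"x": 2, "y": 3}
```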
Performance - Setup
• Metrics:
  – throughput
  – response time
• Workloads:
  – AllUpdates: tx = {1 update}, mix = 100% updates
  – TPC-B: tx = {4 updates, 1 read}, mix = 100% updates
  – TPC-W: mix of long & short txs
• System configuration:
  – Linux cluster running PostgreSQL
AllUpdates Throughput
[Figure, built up over three slides: throughput (tps) vs. number of replicas (1-15); first the Separated curve alone (y-axis 0-800), then the Standalone baseline added, then Tashkent-MW added with the y-axis rescaled to 0-4000]
AllUpdates Response Time
[Figure: response time (msec, 0-200) vs. number of replicas (1-15) for Separated, Tashkent-MW, and Standalone]
In the Paper
• Design & implementation
  – Tashkent-API
• Performance results
  – TPC-B & TPC-W
  – recovery times
  – other I/O subsystems
Conclusions
• Durability & ordering
  – separated → serial commit → slow
  – united → group commit → fast
• Two implementations
  – Tashkent-API: united in DB
  – Tashkent-MW: united in MW
• Tashkent-MW system
  – pure middleware replication
  – significant performance improvement
Concurrency Control
• Generalized Snapshot Isolation (GSI)
• Conclusions valid whenever replicas agree
  1- on which update transactions commit
  2- on their commit order
• Example (bank database), demonstrated below:
  – T1: set balance = $1000
  – T2: set balance = $2000
  – Replica 1 sees T1 then T2 → balance = $2000
  – Replica 2 sees T2 then T1 → balance = $1000
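A two-line demonstration of the bank example: the same two committed transactions applied in different orders leave the replicas divergent, which is exactly why replicas must agree on a single commit order.

```python
def apply_in_order(txs):
    balance = 0
    for new_balance in txs:   # each tx is "set balance = $x"
        balance = new_balance
    return balance

T1, T2 = 1000, 2000
assert apply_in_order([T1, T2]) == 2000   # Replica 1: T1 then T2
assert apply_in_order([T2, T1]) == 1000   # Replica 2: T2 then T1 -> diverged
```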
Durability and Ordering 1/2
[Figure: Replica 1's proxy runs T4 and T9 through the database, so DB1's log records T4, T9; the certifier's log also records T4, T9]
• → Scalability problem: one write per transaction
Durability and Ordering 2/2
[Figure: per-transaction logging writes each of T1-T9 individually to the replica DB logs (DB1: T4, T9; DB2: T3) and to the certifier's log; batching the Ti's groups the records (T1,T2,T3 / T4 / T5,T6,T7,T8 / T9) into a single log]
• Scalability problem: two writes per transaction
• → Batching gives one disk write
29
AllUpdates 1-Replica Throughput
600
500
tps
400
300
low replication overhead,
1-replica == standalone DB
200
100
0
Standalone
Base
Tashkent-MW Tashkent-API
Tashkent-API
30
no Cert
AllUpdates Response Time (backup)
[Figure: response time (msec, 0-200) vs. number of replicas (1-15) for Separated, Tashkent-MW, and Standalone]
TPC-B Throughput
[Figure: throughput (tps, 0-800) vs. number of replicas (1-15) for Tashkent-United, Tashkent-Sep, and Standalone]
• Low replication overhead: the 1-replica system == standalone DB
• Performance scales with multiple replicas