Tashkent:
Uniting Durability & Ordering
in Replicated Databases

Sameh Elnikety, EPFL
Steven Dropsho, EPFL
Fernando Pedone, USI
Write-Many Replicated Database
[Figure: update transactions Tx A and Tx B arrive at Replicas 1-3; each replica handles its own durability]
• All replicas agree on
  – which update txs commit
  – their commit order
• Total order
  – determined by middleware
  – followed by each replica
Order Determined Outside DB
[Figure: the Replication MW (global ordering) receives Tx A and Tx B, fixes the commit order A then B, and ships the ordered stream "A B" to Replicas 1-3, each still doing its own durability]
Enforce External Commit Order
[Figure: the middleware dictates commit order A then B; at one replica, the proxy submits Tx A and Tx B through the SQL interface to two database tasks, but the durability writes could land as B before A]
• Cannot commit A & B concurrently!
Enforce Order = Serial Commit
[Figure: same replica; to guarantee the order, the proxy commits A, waits for its durability write, and only then commits B, so the log is written A then B]
Commit Serialization is Slow
[Figure: to enforce order A, B, C the proxy must issue Commit A, wait for Ack A, then Commit B, wait for Ack B, then Commit C; each commit triggers its own durability disk write (A, then AB, then ABC) while the CPU sits idle]
• Root cause: durability & ordering separated → serial disk writes (sketched below)
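To make the cost concrete, here is a minimal sketch of the serial-commit pattern the proxy is forced into. It is illustrative only (`append_log_record` and `serial_commit` are our names, not PostgreSQL or Tashkent APIs): every commit must reach disk with its own flush before the next ordered commit may be issued.

```python
import os

# Hypothetical stand-ins for a replica's commit path; names are ours.
def append_log_record(log_fd, tx_id):
    os.write(log_fd, f"COMMIT {tx_id}\n".encode())

def serial_commit(log_fd, ordered_txs):
    """Commit transactions one at a time in the externally dictated order."""
    for tx in ordered_txs:
        append_log_record(log_fd, tx)
        os.fsync(log_fd)   # one disk flush PER transaction
        # only after this flush may the proxy ack tx and issue the next commit

log_fd = os.open("replica.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
serial_commit(log_fd, ["A", "B", "C"])   # three commits -> three fsyncs
os.close(log_fd)
```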
Solution: Unite Durability & Ordering
1- Pass order info to DB
[Figure: the middleware (ordering) hands the commit order to each replica, which performs durability and ordering together]
2- Move durability to MW
[Figure: the middleware (ordering) also performs durability; durability is turned OFF at every replica]
1- Unite Dur. & Ord. in Database
• Solution 1: pass order info to DB
[Figure: the proxy issues "Commit A at 1", "Commit B at 2", "Commit C at 3" without waiting in between; the database makes A, B, C durable in one ordered log write, then sends Ack A, Ack B, Ack C]
• Durability & ordering in database → group commit (sketched below)
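A minimal sketch of the resulting group commit, assuming a hypothetical commit-with-position interface of the kind solution 1 adds: a whole batch of ordered commits is appended to the log and made durable with a single flush.

```python
import os

def group_commit(log_fd, ordered_batch):
    """Make a whole batch of ordered commits durable with one disk write."""
    # Append every commit record in the externally dictated order...
    for position, tx in enumerate(ordered_batch, start=1):
        os.write(log_fd, f"COMMIT {tx} at {position}\n".encode())
    os.fsync(log_fd)      # ...then ONE flush covers the entire batch
    return ordered_batch  # all transactions can now be acked together

log_fd = os.open("replica.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
group_commit(log_fd, ["A", "B", "C"])   # three commits -> one fsync
os.close(log_fd)
```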
2- Unite D. & O. in Middleware
• Solution 2: move durability to MW
[Figure: the middleware logs A, B, C durably itself (Durability ABC, one group write), then sends Commit A, Commit B, Commit C to the database, which runs with durability OFF and returns Ack A, Ack B, Ack C quickly]
• Durability & ordering in middleware → group commit
Roadmap
• Durability & ordering
  – separated → serial commit → slow
  – united → group commit → fast
• Two implementations
  – Tashkent-API: united in DB
  – Tashkent-MW: united in MW
• Tashkent-MW
  – implementation
  – recovery
  – performance
Tashkent-MW
[Figure: Tx A, Tx B, Tx C arrive at Replicas 1-3, all running with durability OFF; the Replication MW (global ordering) makes the ordered batch "A B C" durable and distributes it to every replica]
Tashkent-MW: Durability & Ordering in Middleware
• Middleware logs tx effects
  – durability of update txs is guaranteed in the middleware
  – durability is turned off at the database
• Middleware performs durability & ordering
  – united → group commit → fast
• Database commits update txs serially
  – commit = quick main-memory operation (see the sketch below)
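A minimal sketch of this commit path under the slide's assumptions (the class and function names are ours): the middleware makes the ordered batch durable with one group-commit flush to its own log, and only then lets the replicas commit, each commit being a pure in-memory operation because the database's durability is off.

```python
import json, os

class Replica:
    """Toy in-memory replica: durability OFF, commit is a memory update."""
    def __init__(self):
        self.state = {}
    def apply(self, tx_id, writeset):
        self.state.update(writeset)   # quick main-memory operation

def mw_commit_batch(mw_log_fd, replicas, ordered_batch):
    # 1. Durability + ordering united in the middleware: one group flush.
    for tx_id, writeset in ordered_batch:
        record = json.dumps({"tx": tx_id, "writeset": writeset}) + "\n"
        os.write(mw_log_fd, record.encode())
    os.fsync(mw_log_fd)
    # 2. Replicas then commit serially in the agreed order, with no disk I/O.
    for replica in replicas:
        for tx_id, writeset in ordered_batch:
            replica.apply(tx_id, writeset)

mw_log = os.open("mw.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
mw_commit_batch(mw_log, [Replica(), Replica()],
                [("A", {"x": 1}), ("B", {"x": 2}), ("C", {"y": 3})])
os.close(mw_log)
```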
Recovery in Tashkent-MW
[Figure: Replicas 1-3 run with durability OFF; only the Replication MW (global ordering) holds a durable log]
Standard Database I/O
[Figure: Tx A updates pages in memory, producing a dirty data page and a log record; the log record for A is flushed to the on-disk log while the on-disk data page is still stale ("A bad"); a crash then loses only memory]
• Log flushed for
  1- durability
  2- allowing dirty data pages to be cleaned (physical integrity)
(The write-ahead rule behind point 2 is sketched below.)
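A toy sketch of that write-ahead discipline (standard WAL behavior, not code from the paper; all names are ours): a dirty data page may be written back only after the log records covering it have been flushed, which keeps the on-disk state physically explainable after a crash.

```python
class Log:
    """Toy in-memory log with an explicit flushed horizon."""
    def __init__(self):
        self.records, self.flushed_lsn = [], -1
    def append(self, page_id, data):
        self.records.append((page_id, data))
        return len(self.records) - 1          # LSN of the new record
    def flush_up_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)   # simulate fsync

class BufferPool:
    """Toy buffer pool enforcing the write-ahead-log rule."""
    def __init__(self, log):
        self.log, self.pages = log, {}        # page_id -> (data, page_lsn)
    def update(self, page_id, data):
        lsn = self.log.append(page_id, data)  # log record first (in memory)
        self.pages[page_id] = (data, lsn)     # page is now dirty
    def clean(self, page_id, disk):
        data, page_lsn = self.pages.pop(page_id)
        # WAL rule: flush the log through this page's LSN before writing the
        # page; otherwise a crash could leave a page no log record explains.
        self.log.flush_up_to(page_lsn)
        disk[page_id] = data

disk = {}
pool = BufferPool(Log())
pool.update("p1", "A")      # Tx A dirties page p1
pool.clean("p1", disk)      # log flushed first, then the page reaches disk
```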
Database I/O with Durability = OFF
[Figure: with durability off at the replica, Tx A's record is made durable only in the middleware (Durability A, ordered A, B, C); after a crash, the replica's own on-disk log and data cannot be trusted]
• Simple solution: recover from a data dump (checkpoint), sketched below
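A minimal sketch of that recovery path under the deck's setup (the log format and names come from our earlier sketches, not the paper): restore the replica from its last checkpoint, then replay the transactions the middleware logged durably, in their global order.

```python
import json

def recover_replica(checkpoint, mw_log_path):
    """Rebuild a crashed replica's state from a data dump plus the MW log."""
    state = dict(checkpoint)            # 1. load the checkpoint (data dump)
    with open(mw_log_path) as log:
        for line in log:                # 2. replay ordered tx effects
            record = json.loads(line)   #    one JSON record per committed tx
            state.update(record["writeset"])
    return state

# e.g. with the mw.log written in the earlier sketch:
# recover_replica({}, "mw.log")   # -> {"x": 2, "y": 3}
```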
Performance - Setup
• Metrics:
  – throughput
  – response time
• Workloads:
  – AllUpdates: tx = {1 update}, mix = 100% updates
  – TPC-B: tx = {4 updates, 1 read}, mix = 100% updates
  – TPC-W: mix of long & short txs
• System configuration:
  – Linux cluster running PostgreSQL
AllUpdates Throughput
[Figure, built up over three slides: throughput (tps) vs. number of replicas (1-15); first the Separated curve alone (y-axis 0-800), then the Standalone baseline added, then Tashkent-MW added with the y-axis rescaled to 0-4000]
AllUpdates Response Time
[Figure: response time (msec, 0-200) vs. number of replicas (1-15) for Separated, Tashkent-MW, and Standalone]
In the Paper
• Design & implementation
  – Tashkent-API
• Performance results
  – TPC-B & TPC-W
  – recovery times
  – other I/O subsystems
Conclusions
• Durability & ordering
  – separated → serial commit → slow
  – united → group commit → fast
• Two implementations
  – Tashkent-API: united in DB
  – Tashkent-MW: united in MW
• Tashkent-MW system
  – pure middleware replication
  – significant performance improvement
Concurrency Control
• Generalized Snapshot Isolation (GSI)
• Conclusions valid whenever replicas agree
  1- on which update transactions commit
  2- on their commit order
• Example (bank database), demonstrated below:
  – T1: set balance = $1000
  – T2: set balance = $2000
  – Replica 1 sees T1 then T2 → balance = $2000
  – Replica 2 sees T2 then T1 → balance = $1000
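A two-line demonstration of the bank example: the same two committed transactions applied in different orders leave the replicas divergent, which is exactly why replicas must agree on a single commit order.

```python
def apply_in_order(txs):
    balance = 0
    for new_balance in txs:   # each tx is "set balance = $x"
        balance = new_balance
    return balance

T1, T2 = 1000, 2000
assert apply_in_order([T1, T2]) == 2000   # Replica 1: T1 then T2
assert apply_in_order([T2, T1]) == 1000   # Replica 2: T2 then T1 -> diverged
```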
Durability and Ordering 1/2
[Figure: Replica 1's proxy runs T4 and T9 through the database, so DB1's log records T4, T9; the certifier's log also records T4, T9]
• → Scalability problem: one write per transaction
Durability and Ordering 2/2
[Figure: per-transaction logging writes each of T1-T9 individually to the replica DB logs (DB1: T4, T9; DB2: T3) and to the certifier's log; batching the Ti's groups the records (T1,T2,T3 / T4 / T5,T6,T7,T8 / T9) into a single log]
• Scalability problem: two writes per transaction
• → Batching gives one disk write
29
AllUpdates 1-Replica Throughput
600
500
tps
400
300
low replication overhead,
1-replica == standalone DB
200
100
0
Standalone
Base
Tashkent-MW Tashkent-API
Tashkent-API
30
no Cert
AllUpdates Response Time (backup)
[Figure: response time (msec, 0-200) vs. number of replicas (1-15) for Separated, Tashkent-MW, and Standalone]
TPC-B Throughput
[Figure: throughput (tps, 0-800) vs. number of replicas (1-15) for Tashkent-United, Tashkent-Sep, and Standalone]
• Low replication overhead: the 1-replica system == standalone DB
• Performance scales with multiple replicas