Replex: A Multi-Index, Highly Available Data Store
Amy Tai*§, Michael Wei*, Michael J. Freedman§, Ittai Abraham*, Dahlia Malkhi*
*VMware Research, §Princeton University
SQL vs NoSQL
SQL: rich querying model, transaction support, but poor performance
NewSQL (ex: CockroachDB, Yesquel, H-Store): rich querying model, transaction support, scalable, high performance
NoSQL / key-value stores (ex: HyperDex, Cassandra): simple, key-based querying, no or limited transaction support, scalable
NoSQL Scales with Database Partitioning
[Figure: a table with columns col0, col1, col2, ..., colc; each row r has fields r[0], r[1], ..., r[c]; col0 is the default partitioning key]
NoSQL Scales with Database Partitioning
[Figure: the rows are split into partitions with respect to the partitioning index]
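To make the partitioning step concrete, here is a minimal sketch (not from the talk; the names, hash choice, and partition count are assumptions): a row is routed to a partition by hashing its default partitioning key, col0.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed partition count, for illustration only

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partitioning-key value (col0) to a partition id with a stable hash."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# A row r = (r[0], r[1], ..., r[c]); col0 (r[0]) is the default partitioning key.
row = ("user42", "Alice", "alice@example.com")
print(partition_for(row[0]))  # the partition that owns this row
```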
How to enable richer queries without sacrificing shared-nothing scale-out?
Key Observations
1. Indexing enables richer queries
Local Indexing on the Partitioning Index
[Figure: a query select * where col0 = "foo" is routed to the single partition that owns "foo" under the partitioning index on col0]
Index updates are local
Index lookups are local
Local Indexing
Index stored locally at each partition
[Figure: insert (r[0] r[1] r[2] ... r[c]) goes to the partition chosen by the partitioning key, which updates its own local index]
Index updates are local
Local Indexing
Each partition builds a local index
[Figure: a query select * where col2 = "foo" on a non-partitioning column]
Index updates are local
Index lookups must be broadcast to all partitions
Potential synchronization across partitions
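A small self-contained sketch of local secondary indexing (hypothetical classes; in-memory dicts stand in for real per-partition indexes): index updates stay local to the partition that owns the row, but a lookup on a non-partitioning column such as col2 must be broadcast to every partition.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count

class Partition:
    """One partition: its rows plus a local secondary index on col2."""
    def __init__(self):
        self.rows = []
        self.col2_index = defaultdict(list)  # col2 value -> rows in THIS partition only

    def insert(self, row):
        self.rows.append(row)                # index update is local
        self.col2_index[row[2]].append(row)

partitions = [Partition() for _ in range(NUM_PARTITIONS)]

def insert(row):
    # Route by col0, the partitioning key; only one partition is touched.
    partitions[hash(row[0]) % NUM_PARTITIONS].insert(row)

def select_where_col2(value):
    # col2 is not the partitioning column, so the lookup is broadcast to
    # every partition and the partial results are merged.
    results = []
    for p in partitions:
        results.extend(p.col2_index.get(value, []))
    return results
```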
Global Indexing
Distributed data structure spans all partitions
[Figure: insert (r[0] r[1] r[2] ... r[c]) updates both the base data partition and the global index structure]
Index updates are multi-hop
Index lookups executed on a single data structure
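For contrast, a sketch of global secondary indexing under the same assumptions: the index on col2 is itself a partitioned structure, so an insert takes an extra hop to the index partition, while a lookup contacts a single index partition instead of broadcasting.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count

def h(key):
    return hash(key) % NUM_PARTITIONS  # toy hash-partitioning function

data_partitions = [dict() for _ in range(NUM_PARTITIONS)]              # base data, keyed by col0
index_partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]  # global index on col2

def insert(row):
    data_partitions[h(row[0])][row[0]] = row             # hop 1: data partition (by col0)
    index_partitions[h(row[2])][row[2]].append(row[0])   # hop 2: index partition (by col2)

def select_where_col2(value):
    # One index partition answers the lookup; matching rows are then fetched
    # from their data partitions.
    keys = index_partitions[h(value)].get(value, [])
    return [data_partitions[h(k)][k] for k in keys]
```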
Key Observations
1. Indexing enables richer queries
2. Indexes more efficient if data is
partitioned according to that index
Partitioning by Index → Storage Overheads
[Figure: the table (col0, col1, col2, ..., colc) with col0 as the default partitioning key]
Partitioning by index → must store the data again
Partitioning by Index → Storage Overheads
[Figure: HyperDex takes this approach, storing another copy of the table partitioned by the secondary index]
Key Observations
1. Indexing enables richer queries
2. Indexes more efficient if data is
partitioned according to that index
3. Rethink replication paradigm
Replex
New replication unit: a replex is a data replica partitioned and sorted with respect to an associated index
Replace data replicas with replexes
A replex serves both data replication and indexing
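A hedged sketch of what a single replex might look like (a hypothetical Replex class; the real system is built on HyperDex, not this Python): a copy of the rows, partitioned and sorted by the replex's associated index column, so a lookup on that column touches exactly one partition.

```python
import bisect

class Replex:
    """One replex: a copy of the rows, partitioned and sorted by one index column."""
    def __init__(self, index_col, num_partitions):
        self.index_col = index_col
        self.partitions = [[] for _ in range(num_partitions)]  # each list kept sorted

    def partition_for(self, row):
        return hash(row[self.index_col]) % len(self.partitions)

    def insert_local(self, row):
        part = self.partitions[self.partition_for(row)]
        bisect.insort(part, (row[self.index_col], row))  # sorted by the associated index

    def lookup(self, value):
        # A lookup on this replex's index column is local to one partition.
        part = self.partitions[hash(value) % len(self.partitions)]
        i = bisect.bisect_left(part, (value,))
        out = []
        while i < len(part) and part[i][0] == value:
            out.append(part[i][1])
            i += 1
        return out
```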
System Architecture
[Figure: the table's columns col0, col1, col2, ..., colc, with a replex built for each indexed column]
Inserts in Replex
insert (r[0] r[1] r[2] . . . r[c])
Partition Determined by a Replex's Index
The inserted row goes to one partition in each replex; within each replex, the target partition is determined by that replex's associated index (see the sketch below).
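Continuing the hypothetical Replex sketch above, inserting a row into the table means sending one copy to each replex, with the target partition inside each replex chosen by that replex's own index column.

```python
class ReplexTable:
    """A table stored as several replexes; an insert places one copy of the row
    in one partition of every replex."""
    def __init__(self, index_cols, num_partitions):
        self.replexes = [Replex(c, num_partitions) for c in index_cols]

    def insert(self, row):
        for rx in self.replexes:      # one copy per replex
            rx.insert_local(row)      # partition chosen by that replex's index

# Example: two replexes, indexed on col0 and col2 respectively.
table = ReplexTable(index_cols=[0, 2], num_partitions=4)
table.insert(("user42", "Alice", "alice@example.com"))
```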
Data Chains in Replex
[Figure: rows x = (x[0] ... x[c]), y = (y[0] ... y[c]), z = (z[0] ... z[c]) stored across the replexes]
Partitions in different replexes make up a data chain if they share rows
All pairs of partitions are potential data chains
Main Challenge in Replex
Increased failure complexity

Partition Failures
1. Partition Recovery
2. Client Requests
The index is unavailable, but the data is still available. Where is the data?
Finding Data After Failure
Partition Recovery: recovery can proceed in parallel from the partitions of another replex → fast recovery time
Client Requests: requests must be multicast to all partitions, and lookups reduce to a linear scan at each partition → high request amplification
Tradeoff?
Hybrid Replex: Key Contributions
1. Recovery time vs request amplification tradeoff
2. Improve failure availability of multiple replexes
3. Storage vs recovery performance tradeoff
4. Graceful failure degradation
Hybrid Replex: What is it?
• Shared across r replexes
• Not associated with an index
• Partitioning function dependent on these r replexes (one possible such function is sketched below)
[Figure: the hybrid replex alongside the replexes it is shared across]
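One way such a partitioning function could look, sketched for r = 2 (this grid scheme is an assumption for illustration, not necessarily the exact function Replex uses): hybrid partitions form a sqrt(n) x sqrt(n) grid, and a row's cell depends on the hashes of both replexes' index columns, so a lookup that loses one replex only needs to contact one row of the grid.

```python
import math

NUM_PARTITIONS = 16                # assumed; a perfect square keeps the grid simple
SIDE = math.isqrt(NUM_PARTITIONS)  # 4 x 4 grid of hybrid partitions

def hybrid_partition(row, col_a=0, col_b=2):
    """Hybrid partitioning that depends on BOTH replexes' index columns."""
    i = hash(row[col_a]) % SIDE
    j = hash(row[col_b]) % SIDE
    return i * SIDE + j

def candidate_partitions(value_a, col_a=0):
    # If the replex on col_a is down, a lookup on col_a only has to contact the
    # SIDE hybrid partitions in one grid row, not all NUM_PARTITIONS of them.
    i = hash(value_a) % SIDE
    return [i * SIDE + j for j in range(SIDE)]
```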
Data Chains with Hybrid Replex
insert (r[0] r[1] r[2] . . . r[c])
1. Recovery Time vs Request Amplification
Recovery is less parallel, but request amplification is lower.
2. Improve failure availability of multiple replexes
Improving failure availability without hybrid replexes: 2x storage!
3. Storage vs. recovery performance
1.5x storage, but worse client requests
Generalizing Hybrid Replexes
k hybrid replexes may be shared across r replexes
For the special case when r = 2, request amplification will be O(n^(1/(k+1)))
For the special case when k = 1, request amplification will be O(n^((r-1)/r))
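A quick numeric check of these two special cases (the partition count n = 16 is just an example, matching the grid sketch earlier):

```python
def amplification_r2(n, k):
    """Request amplification when hybrid replexes are shared across r = 2 replexes."""
    return n ** (1 / (k + 1))

def amplification_k1(n, r):
    """Request amplification with k = 1 hybrid replex shared across r replexes."""
    return n ** ((r - 1) / r)

n = 16
print(amplification_r2(n, k=1))  # 4.0: contact ~sqrt(n) partitions instead of all n
print(amplification_k1(n, r=2))  # 4.0: the two formulas agree when r = 2 and k = 1
```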
Implementation
• Built on top of HyperDex, ~700 LOC
• Implemented partition recovery and request re-routing on failure
• Implemented a variety of hybrid replex configurations
Evaluation
1. What is the impact of replexes on steady-state performance?
2. How do hybrid replexes affect failure performance?
Evaluation Setup
• Table with two indexes
• 12 server machines, 4 client machines
• All machines colocated in the same rack, connected via a 1 Gb top-of-rack switch
• 8 CPUs, 16 GB memory per machine
Systems Under Evaluation
Replex-2:  storage overhead 2x, network hops per insert 2, failures tolerated 1
Replex-3:  storage overhead 3x, network hops per insert 3, failures tolerated 2
HyperDex:  storage overhead 6x, network hops per insert 6, failures tolerated 2
Steady State Latency
[Figure: latency CDFs for reads and inserts, Replex-3 vs HyperDex; insert latency reflects 3 network hops for Replex-3 vs 6 network hops for HyperDex]
Reads against either index have similar latency, but we report reads against the primary index.
Replex-2 is not included because it has a lower fault-tolerance threshold.
Single Failure Performance
[Figure: throughput (operations/s) over test time (s) for Replex-2, Replex-3, and HyperDex, with a single failure injected]
Experiment
• Load with 10M 100-byte rows
• Split reads 50:50 between each index
Single Failure: Recovery Time
[Figure: throughput (operations/s) over test time (s) for Replex-2, Replex-3, and HyperDex, highlighting the recovery window]
Recovery Time
1. HyperDex recovers slowest because it has 2-3x more data to recover
2. Replex-2 recovers fastest because it has the least data and recovery is parallel
Single Failure: Failure Throughput
[Figure: throughput (operations/s) over test time (s); annotation: increased throughput with the hybrid replex]
Failure Throughput
1. Replex-2 has low throughput because of high request amplification
2. Replex-3 has throughput comparable to HyperDex
Two Failure Performance
[Figure: throughput (operations/s) over test time (s) for Replex-3 and HyperDex, with two failures injected]
Experiment
• Load with 1M 1 KB rows
• 50:50 reads against each index
*Replex-2 not included because it only tolerates 1 failure
Two Failure Performance
[Figure: throughput (operations/s) over test time (s) for Replex-3 and HyperDex after two failures]
1. Recovery Time: Replex-3 recovers nearly 3x faster, due to less data
2. Failure Throughput: HyperDex has half the throughput, because nearly every partition is participating in recovery
Summary
1. Replacing replicas with replexes decreases
index storage AND steady-state overheads
2. Hybrid replexes enable a rich tradeoff space
during failures
Questions?