SCALABLE DECENTRALIZED
DE-DUPLICATION STORE
Prakash Chandrasekaran – Anand Gupta
Gautham Narayanasamy – Vijayaraghavan Subbaiah
Motivation

Importance of storage space
Finding enough storage to meet customer demand is a major
challenge for cloud providers.
Deduplication can also save significant resources during web
crawling, indexing, and search.

Backup Strategies


Data must be backed up and replicated across many
geographic locations.
Hence the need for techniques that use storage space
more efficiently.
Deduplication


Removing duplicate copies of files and storing only the
pointers to the original copy.
Block-level deduplication
Works at a finer granularity and hence offers a greater
reduction in storage space.
Requires more processing power than file-level deduplication.
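
To make the block-level idea concrete, here is a minimal sketch in Python. It assumes fixed-size blocks and SHA-1 block hashes, and uses an in-memory dict as the block store; the names `CHUNK_SIZE`, `dedup_store`, and `rebuild` are hypothetical and not taken from the slides.

```python
import hashlib

# Tiny block size for illustration only (an assumption; real systems use KB-sized blocks).
CHUNK_SIZE = 4

def dedup_store(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, keep each unique block once,
    and return the ordered list of block hashes needed to rebuild the data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        block = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha1(block).hexdigest()
        store.setdefault(digest, block)  # stored only if not already present
        recipe.append(digest)            # the "pointer" to the stored copy
    return recipe

def rebuild(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from its list of block hashes."""
    return b"".join(store[d] for d in recipe)

store = {}
recipe_a = dedup_store(b"AAAABBBBCCCC", store)
recipe_b = dedup_store(b"AAAABBBBDDDD", store)
# Six logical blocks were written, but only four unique blocks are stored.
```

The two inputs differ as whole files, so file-level deduplication would store both in full, while the block-level store shares the two blocks they have in common. The extra hashing per block is the processing cost mentioned above.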


Use case
Storage of snapshots of virtual machine (VM) images in a
virtualized cloud environment.
Detecting exact duplicates and near duplicates in web
pages.
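
Near-duplicate detection of the kind mentioned above is commonly done with MinHash signatures. The sketch below shows the general technique in Python; it is not necessarily how this system computes its minhash values. The word-shingle size, the number of hash functions, the use of salted SHA-1, and all function names are assumptions made for illustration.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each of num_hashes salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature

def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

page_a = "the quick brown fox jumps over the lazy dog near the river bank"
page_b = "the quick brown fox jumps over the lazy cat near the river bank"
page_c = "completely unrelated words appear in this other short sample document here"

sig_a = minhash_signature(shingles(page_a))
sig_b = minhash_signature(shingles(page_b))
sig_c = minhash_signature(shingles(page_c))
sim_near = estimated_similarity(sig_a, sig_b)  # pages differing by one word
sim_far = estimated_similarity(sig_a, sig_c)   # pages sharing no shingles
```

Exact duplicates produce identical signatures, while near duplicates agree on a fraction of positions that approximates the overlap of their shingle sets.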

Architecture
Cassandra Schema

create keyspace minhash;
create column family minhash_chunks with column_type=Super;
create column family minhash_filerecipe with column_type=Super;
create column family minhash_fullhash;

create keyspace files;
create column family files_minhash;
Data Distribution
[Figure labels: Client / Application, Cassandra Cluster, Load Balancing, Cassandra Nodes]
Data Flow in Cassandra
[Flowchart spanning Client and Cassandra Cluster. Recoverable step labels: Start; File input to Client (OS snapshot file / web page); Chunking Process; Chunks; Compute MinHash and full hash; Check file already exists; Check full hash already exists; Match; Insert. Recoverable record labels: <File Name, minhash>; <fileid, <minhash, filerecipe>>; <minhash, chunkData>; <minhash, fullhash>.]
System Implementation
Sequence - put
Sequence - get
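
The put and get sequences can be sketched in Python as follows. This is one reading of the data flow and schema, not the authors' implementation. Plain dicts stand in for the column families `minhash_chunks`, `minhash_filerecipe`, `minhash_fullhash`, and `files_minhash`. The sketch assumes that a file's minhash is the smallest of its chunk hashes, that chunks are fixed-size, and that the file id is the full hash; the slides confirm none of these details.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: fixed-size chunks, tiny for illustration

# Dicts standing in for the Cassandra column families (layout is an assumption).
minhash_chunks = {}      # minhash -> {chunk_hash: chunk_data}
minhash_filerecipe = {}  # minhash -> {file_id: [chunk_hash, ...]}
minhash_fullhash = {}    # minhash -> {full_hash: file_id}
files_minhash = {}       # file name -> (minhash, file_id)

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def put(name: str, data: bytes) -> str:
    """Chunk the file, compute its minhash and full hash, and store what is new."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    hashes = [sha1_hex(c) for c in chunks]
    full_hash = sha1_hex(data)
    minhash = min(hashes) if hashes else full_hash  # assumption: min chunk hash

    known = minhash_fullhash.setdefault(minhash, {})
    if full_hash in known:
        # Exact duplicate: record only the name, pointing at the existing file.
        files_minhash[name] = (minhash, known[full_hash])
        return "duplicate"

    file_id = full_hash  # assumption: the full hash doubles as the file id
    stored = minhash_chunks.setdefault(minhash, {})
    for h, c in zip(hashes, chunks):
        stored.setdefault(h, c)  # a chunk already stored under this minhash is reused
    minhash_filerecipe.setdefault(minhash, {})[file_id] = hashes
    known[full_hash] = file_id
    files_minhash[name] = (minhash, file_id)
    return "stored"

def get(name: str) -> bytes:
    """Look up the file's minhash, fetch its recipe, and reassemble the chunks."""
    minhash, file_id = files_minhash[name]
    recipe = minhash_filerecipe[minhash][file_id]
    return b"".join(minhash_chunks[minhash][h] for h in recipe)

first = put("snapshot1.img", b"AAAABBBBCCCC")
second = put("snapshot1-copy.img", b"AAAABBBBCCCC")
restored = get("snapshot1-copy.img")
```

Keying chunks, recipes, and full hashes by minhash means the lookups for one file touch rows under a single key, which is one plausible reason for the `minhash_*` naming in the schema.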
System Efficiency



Calculate the total amount of space saved.
Demonstrate the extent of similarity across various
snapshots and web pages.
Measure the overhead associated with file storage and
retrieval in our system.
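
One simple way to quantify the space saved is to compare the logical size of everything written with the size actually stored after deduplication. The Python sketch below shows that arithmetic; the function names and the example figures are made up for illustration and are not results from this system.

```python
def space_saved_fraction(logical_bytes: int, stored_bytes: int) -> float:
    """Fraction of space saved: 1 - stored / logical (0.0 when nothing was written)."""
    if logical_bytes == 0:
        return 0.0
    return 1.0 - stored_bytes / logical_bytes

def dedup_ratio(logical_bytes: int, stored_bytes: int) -> float:
    """Deduplication ratio: logical size divided by stored size."""
    return logical_bytes / stored_bytes

# Hypothetical example: ten snapshots of 100 units each, 250 units stored.
logical = 10 * 100
stored = 250
saved = space_saved_fraction(logical, stored)
ratio = dedup_ratio(logical, stored)
```

For a fair comparison, the stored size would also need to include metadata such as file recipes and hash indexes, since that is part of the storage overhead listed above.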
Questions?