SCALABLE DECENTRALIZED DE-DUPLICATION STORE

Prakash Chandrasekaran, Anand Gupta, Gautham Narayanasamy, Vijayaraghavan Subbaiah

Motivation
- Importance of storage space
  - Finding enough space to meet customer demand has been a huge challenge for cloud providers.
  - Saving significant resources during web crawling, indexing, and search.
- Backup strategies
  - Backing up data and replicating it across many geographical locations.
- Need for devising ingenious techniques to use storage space more efficiently.

Deduplication
- Removing duplicate copies of files and storing only pointers to the original copy.
- Block-level deduplication
  - Allows finer granularity and hence offers a greater reduction in storage space.
  - Requires more processing power than file-level deduplication.

Use case
- Storage of snapshots of virtual machine (VM) images in a virtualized cloud environment.
- Detecting exact duplicates and near duplicates in web pages.

Architecture

Cassandra Schema
    create keyspace minhash;
    create column family minhash_chunks with column_type=Super;
    create column family minhash_filerecipe with column_type=Super;
    create column family minhash_fullhash;

    create keyspace files;
    create column family files_minhash;

Data Distribution
[Figure labels: Client / Application; Load Balancing; Cassandra Cluster; Cassandra Nodes]

Data Flow in Cassandra
[Figure: flowchart spanning Client and Cassandra Cluster. Labels: Start; File input to Client (OS snapshot file / web page); Chunking process (chunks); Compute fullhash and minhash; Check file already exists; Check full hash already exists; Match; Insert <fileid File Name, minhash>; Insert <minhash, filerecipe>; Insert <minhash, chunkData>; Insert <minhash, fullhash>]

System Implementation
[Figure: sequence diagram, put]
[Figure: sequence diagram, get]

System Efficiency
- Calculating the total amount of space saved.
- Demonstrating the extent of similarity in various snapshots and web pages.
- Measuring the overhead associated with file storage and retrieval in our system.

Questions?
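The put and get paths named under Data Flow in Cassandra and System Implementation can be illustrated with a minimal in-memory sketch. It is not the authors' implementation. Plain dictionaries stand in for the column families files_minhash, minhash_chunks, and minhash_filerecipe, and several details are assumptions the slides do not specify: fixed-size chunking, SHA-1 as the full hash, a one-hash MinHash (the smallest chunk hash of the file) as the row key, and the hypothetical names DedupStore, split_chunks, and full_hash.

```python
import hashlib


def split_chunks(data, size):
    """Fixed-size chunking (assumption: the slides do not name a chunking method)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def full_hash(chunk):
    """Full hash of one chunk (assumption: SHA-1)."""
    return hashlib.sha1(chunk).hexdigest()


class DedupStore:
    """In-memory stand-in for the Cassandra column families (hypothetical class)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.files_minhash = {}       # file name -> minhash
        self.minhash_chunks = {}      # minhash -> {fullhash: chunk bytes}
        self.minhash_filerecipe = {}  # minhash -> {file name: [fullhash, ...]}
        self.file_sizes = {}          # file name -> logical size in bytes

    def put(self, name, data):
        chunks = split_chunks(data, self.chunk_size)
        hashes = [full_hash(c) for c in chunks]
        # One-hash MinHash: the smallest chunk hash identifies the bucket.
        # Files with the same set of chunks always land in the same bucket.
        minhash = min(hashes) if hashes else full_hash(b"")
        bucket = self.minhash_chunks.setdefault(minhash, {})
        for h, c in zip(hashes, chunks):
            bucket.setdefault(h, c)   # store chunk data only if the full hash is new
        self.minhash_filerecipe.setdefault(minhash, {})[name] = hashes
        self.files_minhash[name] = minhash
        self.file_sizes[name] = len(data)

    def get(self, name):
        minhash = self.files_minhash[name]
        recipe = self.minhash_filerecipe[minhash][name]
        bucket = self.minhash_chunks[minhash]
        # Reassemble the file by following the recipe of full hashes in order.
        return b"".join(bucket[h] for h in recipe)

    def stored_bytes(self):
        return sum(len(c) for b in self.minhash_chunks.values() for c in b.values())

    def space_saved(self):
        """Fraction of logical bytes that did not need to be stored."""
        logical = sum(self.file_sizes.values())
        return 0.0 if logical == 0 else 1.0 - self.stored_bytes() / logical


if __name__ == "__main__":
    store = DedupStore(chunk_size=4)
    store.put("snapshot-1", b"AAAABBBBCCCC")
    store.put("snapshot-2", b"CCCCAAAABBBBAAAA")  # same chunks, different order
    print(store.get("snapshot-2"))
    print(store.stored_bytes(), round(store.space_saved(), 3))
```

In this sketch chunks are deduplicated only within a minhash bucket, so two files that share chunks but differ in their smallest chunk hash would store those chunks twice. Whether the original system behaves the same way cannot be determined from the slides.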
