SCALABLE DECENTRALIZED DE-DUPLICATION STORE
Prakash Chandrasekaran, Anand Gupta, Gautham Narayanasamy, Vijayaraghavan Subbaiah

Motivation
- Importance of storage space
    - Finding enough space to meet customer demand has been a huge challenge for cloud providers.
    - Saving significant resources during web crawling, indexing, and search.
- Backup strategies
    - Backing up data and replicating it across many geographical locations.
- Need for ingenious techniques that use storage space more efficiently.

Deduplication
- Removing duplicate copies of files and storing only pointers to the original copy.
- Block-level deduplication
    - Allows more granularity and hence offers a greater reduction in storage space.
    - Requires more processing power than file-level deduplication.

Use case
- Storage of snapshots of virtual machine (VM) images in a virtualized cloud environment.
- Detecting exact duplicates and near duplicates in web pages.

Architecture

Cassandra Schema
    create keyspace minhash;
    create column family minhash_chunks with column_type=Super;
    create column family minhash_filerecipe with column_type=Super;
    create column family minhash_fullhash;
    create keyspace files;
    create column family files_minhash;

Data Distribution
[Diagram: Client / Application connected to the Cassandra Cluster, with load balancing across the Cassandra nodes.]

Data Flow in Cassandra
[Diagram, client side: Start; an OS snapshot file or web page is the file input to the client; the chunking process produces chunks; the client computes the minhash and the full hash.]
[Diagram, Cassandra cluster side: check whether the file already exists; check whether the full hash already exists (branch label: Match). Insert labels recovered from the figure: <fileid / File Name, minhash>, <minhash, filerecipe>, <minhash, chunkData>, <minhash, fullhash>.]

System Implementation
- Sequence: put
- Sequence: get

System Efficiency
- Calculating the total amount of space saved.
- Demonstrating the extent of similarity in various snapshots and web pages.
- Measuring the overhead associated with file storage and retrieval in our system.

Questions?
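
Supplementary sketch

The put and get flow outlined in the Data Flow and System Implementation slides can be illustrated with a minimal Python sketch. It is not the authors' implementation. The slides do not say how chunks are cut, which hash function is used, or how the minhash is derived, so the sketch assumes fixed-size 4 KiB chunks, SHA-1 digests, and a one-value MinHash (the smallest chunk digest). Plain dictionaries stand in for the four Cassandra column families; the class name DedupStore and all method names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks; the slides do not give a size


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the input into consecutive fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def min_hash(chunks: list[bytes]) -> str:
    """Assumed one-value MinHash: the smallest chunk digest of the file."""
    return min((sha1_hex(c) for c in chunks), default=sha1_hex(b""))


class DedupStore:
    """In-memory stand-in for the column families named in the schema slide."""

    def __init__(self) -> None:
        self.files_minhash: dict[str, tuple[str, str]] = {}             # file name -> (minhash, fullhash)
        self.minhash_fullhash: dict[str, set[str]] = {}                 # minhash -> known full hashes
        self.minhash_filerecipe: dict[str, dict[str, list[str]]] = {}   # minhash -> fullhash -> chunk digests
        self.minhash_chunks: dict[str, dict[str, bytes]] = {}           # minhash -> chunk digest -> chunk data

    def put(self, name: str, data: bytes) -> None:
        chunks = split_chunks(data)
        fh, mh = sha1_hex(data), min_hash(chunks)
        known = self.minhash_fullhash.setdefault(mh, set())
        if fh not in known:
            # New content: store only the chunks not yet present under this minhash.
            bucket = self.minhash_chunks.setdefault(mh, {})
            recipe = []
            for c in chunks:
                digest = sha1_hex(c)
                bucket.setdefault(digest, c)
                recipe.append(digest)
            self.minhash_filerecipe.setdefault(mh, {})[fh] = recipe
            known.add(fh)
        # Exact duplicates reach this point without writing any chunk data.
        self.files_minhash[name] = (mh, fh)

    def get(self, name: str) -> bytes:
        mh, fh = self.files_minhash[name]
        recipe = self.minhash_filerecipe[mh][fh]
        return b"".join(self.minhash_chunks[mh][d] for d in recipe)

    def stored_bytes(self) -> int:
        return sum(len(c) for b in self.minhash_chunks.values() for c in b.values())
```

Storing the same snapshot under two names writes its chunks once, and get rebuilds either file from its recipe. Two near-duplicate files share chunk storage only when they land on the same minhash, which is why the choice of minhash matters for the space savings the System Efficiency slide sets out to measure.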