Bigtable
A Distributed Storage System for Structured Data
Presented by: Itamar Hazor
Link to the article

Agenda
• Size Evaluation
• Motivation
• Databases
• SQL-NoSQL
• Bigtable
• Data Model
• Query Data
• Structure Implementation
• Performance Experiment
• Summary

Introduction

Difficulties I had before creating the presentation:
1. Previous knowledge – I believe that some understanding of, and experience working with, database systems is essential to understand the challenges and the motivation for creating such a data storage system
2. Concurrency – why is it important here?
3. Anthony's and Naama's presentations

My Goals
• General presentation of the storage area
• Get a feel for the challenges and difficulties that Bigtable (and other NoSQL systems) come to handle
• Class participation
• Grade of 100

Size Evaluation

6th Grade Question: > ? < 5e1,000,000

52! = 8.06e67

Motivation

“Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data…”

Byte -> KB -> MB -> GB -> TB -> PB
1 PB = 1,000,000,000,000,000 B

Databases

Database
So what is a database? From Wikipedia:
A database is an organized collection of data. It is the collection of schemas, tables, queries, reports, views, and other objects. The data are typically organized to model aspects of reality in a way that supports processes requiring information, such as modelling the availability of rooms in hotels in a way that supports finding a hotel with vacancies.

Motivation – What about concurrency?
• Why do we need concurrency in a database?
• Are the concurrency issues here different from the ones we met in the course?

CRUD
The basic database operations:
• Create (write)
• Read
• Update
• Delete

Bigtable Use
As of 2006, more than sixty Google products use Bigtable.
Including:
• Web indexing
• Google Earth
• Google Finance

SQL-NoSQL

SQL vs. NoSQL
The major difference between SQL and NoSQL is: SQL data is stored in relational tables with a fixed schema, while NoSQL data is stored in other models, such as key-value pairs, documents, graphs, or wide-column maps like Bigtable.

Example – 1st degree grades table
JOIN ON ID

Example – Contact info
Link to the full table

CAP Scheme

Bigtable

Data Model
A Bigtable is a 3-dimensional map, indexed by row, column, and timestamp.

3D-Map: Value Example

Row
• The row keys in a table are arbitrary strings
• Rows are sorted in lexicographic order

Column
Column keys are grouped into sets called column families, which form the basic unit of access control.
• Column Family “contents”
• Column Family “anchor”

Timestamp
• Bigtable timestamps are 64-bit integers
• Each version of a cell must have a unique timestamp

Tablet
• Each row range is called a tablet
• As a result, reads of short row ranges are efficient
• Default tablet size of 100-200 MB

Query Data

Writing to Bigtable Example

    // Open the table
    Table *T = OpenOrDie("/bigtable/web/webtable");

    // Write a new anchor
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:cnnsi.com", "CNN");
    Operation op;
    Apply(&op, &r1);

Reading from Bigtable Example

    // Open the table
    Table *T = OpenOrDie("/bigtable/web/webtable");

    // Scan every version of every anchor in one row
    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
      printf("%s %s %lld %s\n",
             scanner.RowName(),
             stream->ColumnName(),
             stream->MicroTimestamp(),
             stream->Value());
    }

Structure Implementation

Tablets Hierarchy
The first level is a file stored in Chubby that contains the location of the root tablet.
The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but it is treated specially (it is never split) to ensure that the tablet location hierarchy has no more than three levels.

Tablet Server
• Manages a set of tablets
• Handles read/write requests from the client
• Once a tablet has grown too large, splits it

Master Server
• Assigning tablets to tablet servers
• Balancing tablet server load
• Garbage collection

Performance Experiment

Performance Indication

Summary

“Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance…”

“…These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.”

References
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. To appear in OSDI 2006.

Comments? Questions?
Thank you!
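
Backup: Data Model Sketch
The Data Model slides describe a map from (row, column, timestamp) to a value, with rows kept in lexicographic order. The block below is a minimal in-memory sketch of that idea only. ToyTable, Set, ReadLatest, and RowsInRange are hypothetical names made up for illustration; they are not part of the Bigtable client API shown on the Query Data slides, and the nested std::map layout is an assumption, not how Bigtable stores data.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy model of the 3-dimensional map; NOT the Bigtable API.
class ToyTable {
 public:
  // Store one version of a cell, addressed by (row, column, timestamp).
  void Set(const std::string& row, const std::string& column,
           std::int64_t timestamp, const std::string& value) {
    cells_[row][column][timestamp] = value;
  }

  // Return the newest version of a cell, or nullptr if the cell is absent.
  const std::string* ReadLatest(const std::string& row,
                                const std::string& column) const {
    auto r = cells_.find(row);
    if (r == cells_.end()) return nullptr;
    auto c = r->second.find(column);
    if (c == r->second.end() || c->second.empty()) return nullptr;
    return &c->second.begin()->second;  // versions are ordered newest first
  }

  // Row keys in [start, end). Because rows are sorted, a short row range
  // is one contiguous walk over the map.
  std::vector<std::string> RowsInRange(const std::string& start,
                                       const std::string& end) const {
    std::vector<std::string> rows;
    for (auto it = cells_.lower_bound(start);
         it != cells_.end() && it->first < end; ++it) {
      rows.push_back(it->first);
    }
    return rows;
  }

 private:
  // timestamp -> value, largest timestamp first
  using Versions =
      std::map<std::int64_t, std::string, std::greater<std::int64_t>>;
  // column key -> versions
  using Columns = std::map<std::string, Versions>;
  // row key -> columns, sorted lexicographically by row key
  std::map<std::string, Columns> cells_;
};
```

For example, calling Set twice on row "com.cnn.www", column "anchor:cnnsi.com", with timestamps 1 and 2, and then calling ReadLatest on that cell yields the value written at timestamp 2. Both versions stay in the map, which is the behaviour the Timestamp slide relies on.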