Survey
"The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb"
Authors: Jim Gray, Goetz Graefe
Reviewed by: Nagapramod Mandagere, Biplob Debnath

Outline
• Problem Statement
• Motivation, Importance and Relevance
• Main Contributions and Validation
• Key Ideas
• Illustrations
• New Metrics
• Assumptions
• Re-write Today
• Questions

Problem Statement
• Broader problem: view developments over a long period of time and try to extract important technology trends.
• Specific instance: infer rules of thumb for buffer replacement policies in a number of settings, including RAID environments.
• Given: trends over time for parameters such as memory cost, disk cost, and tape cost.
• Find: rules of thumb for deciding where to store data and when to replace data in the memory buffer.
• Objectives: simple, extensible rules.
• Constraints: hierarchical storage model.

The Typical Database Administrator's Dilemma
• Should I cache on the client?
• Should I store data back on disk (local or network disk)?
• Should I cache this data in memory?
• Should I move data to tape?
• The performance isn't good. Am I doing something wrong?

Importance & Relevance
• Different parameters change at very different rates:
  - Seeks/second and disk capacity: 10x to 100x
  - Disk MB/k$ and DRAM MB/k$: 1000x
• The location of data is very important:
  - Main memory: very fast, expensive, limited size
  - Disk storage: much slower than main memory, inexpensive, close to unlimited size
  - Tape storage: slowest, dirt cheap, unlimited capacity
• How can one decide what data resides where?
  - The system learns from data access patterns and adapts (admins hate to give up control), or
  - The administrator controls data locality using experience or historical performance information (rules of thumb).

Main Contributions & Validation
• The five-minute rule:
  - Randomly accessed buffer pages can be replaced if unused for more than 5 minutes.
  - Sequentially accessed buffer pages can be replaced if unused for more than 1 minute.
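The five-minute threshold falls out of a break-even calculation: the rent on a page's worth of RAM versus the cost of the disk accesses that caching the page avoids. A minimal sketch of that arithmetic, using the 1997 system parameters cited in the paper's illustration (128 pages/MB, 64 accesses/sec/disk, $2000/disk, $15/MB of DRAM):

```python
# Break-even reference interval from the five-minute rule:
#   interval_sec = (PagesPerMBofRAM / AccessesPerSecondPerDisk)
#                * (PricePerDiskDrive / PricePerMBofDRAM)

def break_even_interval(pages_per_mb_ram, accesses_per_sec_per_disk,
                        price_per_disk, price_per_mb_dram):
    """Seconds a page may go unreferenced before its RAM rent
    exceeds the cost of re-fetching it from disk."""
    return (pages_per_mb_ram / accesses_per_sec_per_disk) * \
           (price_per_disk / price_per_mb_dram)

# 1997 numbers from the paper's illustration:
interval_1997 = break_even_interval(128, 64, 2000, 15)
print(f"{interval_1997:.0f} s")  # ~266-267 seconds, i.e. roughly five minutes
```

Pages referenced more often than this interval are cheaper to keep in RAM; pages referenced less often are cheaper to re-read from disk.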
Metrics for Storage Performance Characterization
• Cost/access
• Maps: megabyte accesses per second
• Scan: time it takes to sequentially read or write all the data on the device

Validation Methodology: Examples
• Random access
• One-pass sort
• Two-pass sort
• Trends observed over a period of time

Key Ideas
• Tradeoff between the cost of RAM and the cost of disk accesses: caching pages in extra memory can save disk I/Os.
• The break-even point is reached when the rent on the extra memory for cache ($/page/sec) exactly matches the savings in disk accesses per second ($/disk_access/sec).

Illustration: Typical System in 1997
• PagesPerMBofRAM = 128 pages/MB (8 KB pages)
• AccessesPerSecondPerDisk = 64 accesses/sec/disk
• PricePerDiskDrive = $2000/disk (9 GB + controller)
• PricePerMBofDRAM = $15/MB of DRAM
• The break-even inter-reference interval is 266 seconds, roughly 5 minutes.

Illustration: One-Pass Algorithms
• A one-pass algorithm reads data and never references it again, so there is no need to cache the data in RAM.
• The system needs only enough buffer memory to allow data to stream from disk to main memory.
• Typically, two or three one-track buffers (~100 KB) per disk are adequate to buffer disk operations and allow the device to stream data to the application.

Illustration: Two-Pass Algorithms
• Sequential operations that read a large dataset and then revisit parts of the data.
• Examples: database join, cube, rollup, and sort operators.
• Sorting uses two passes if the memory size is smaller than the dataset size.
• The inter-reference time is typically about a minute (sequential data access).

Illustration: Two-Pass Sort
• A one-pass sort needs a larger amount of memory.
• The memory needed grows faster with the size of the input file.
• For files bigger than memory, a two-pass sort is the only option.

Disk vs. Tape Tradeoff
• Tape carries a larger penalty: slower access, but the lowest cost.
• Solution: a larger break-even point and a bigger page size.

New Metrics
• Motivation: data-flow applications that stream huge amounts of data, such as data mining and multimedia applications.
• Kaps: kilobyte accesses per second.
• Maps: megabyte accesses per second.
• Scan: time taken to sequentially read or write all the data on a device.
• Combined with rent costs, these metrics provide a price/performance measure.

Assumptions
• All disk storage has the same cost/performance characteristics; the paper assumes a homogeneous disk storage system and does not consider the more recent shift towards hierarchical/heterogeneous storage systems.
• The tradeoff considers only performance; security and fault-tolerance issues are assumed to be uniform throughout.

Re-write Today
• Re-evaluate the rules of thumb with more recent costs and more recent trends in storage systems, such as heterogeneous/hierarchical storage.
• Take SAN and NAS characteristics into account.

Questions
• Does the five-minute rule still hold today? No (with reservations).
• If one increases the page size towards the megabyte range, the five-minute rule still applies:
  - Pages/MB of RAM = 16 (64 KB pages)
  - Accesses/sec/disk = 64
  - Price/disk drive = $400
  - Price/MB of RAM = $0.10
  - Break-even point ≈ 1000 s
• Further evidence: Jim Gray, keynote at FAST 2004, http://www.usenix.org/events/fast05/
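The re-evaluated break-even point and the Kaps/Maps/Scan metrics can be made concrete with a short sketch. The break-even parameters are the ones quoted on the slide; the disk used for the metric calculations (10 ms average access, 40 MB/s transfer, 70 GB capacity) is an illustrative assumption, not a figure from the paper:

```python
# Re-check of the large-page break-even quoted on the slide.
pages_per_mb_ram = 16        # 64 KB pages
accesses_per_sec = 64
price_per_disk = 400.0       # dollars per drive
price_per_mb_dram = 0.1      # dollars per MB of RAM

interval = (pages_per_mb_ram / accesses_per_sec) * \
           (price_per_disk / price_per_mb_dram)
print(interval)              # 1000.0 seconds, matching the slide

# Kaps / Maps / Scan for a hypothetical disk (parameters are assumptions):
access_s = 0.010             # average seek + rotational delay
transfer_mb_s = 40.0         # sustained transfer rate
capacity_mb = 70_000.0       # ~70 GB drive

kaps = 1.0 / (access_s + (1.0 / 1024.0) / transfer_mb_s)  # 1 KB accesses/sec
maps = 1.0 / (access_s + 1.0 / transfer_mb_s)             # 1 MB accesses/sec
scan_s = capacity_mb / transfer_mb_s                      # full sequential scan
print(kaps, maps, scan_s)
```

Note how Kaps is dominated by the access delay while Maps and Scan are dominated by the transfer rate, which is why the metrics reward larger transfer units.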
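Returning to the two-pass sort illustration earlier in the deck: a rough model shows why one-pass memory grows linearly with file size while two-pass memory grows only with its square root. The model and the 1 MB I/O transfer unit below are assumptions for illustration, not figures from the slides:

```python
import math

# Rough external-sort memory model: with memory M and I/O transfer
# unit B, a two-pass sort can merge about M/B runs of length M,
# i.e. sort roughly M**2 / B bytes, so M ~ sqrt(file_size * B).
MB = 1 << 20
B = 1 * MB  # assumed I/O transfer unit

def one_pass_memory(file_bytes):
    """One pass: the whole file must fit in memory."""
    return file_bytes

def two_pass_memory(file_bytes):
    """Two passes: memory grows with the square root of file size."""
    return math.sqrt(file_bytes * B)

for size_gb in (1, 10, 100):
    f = size_gb * 1024 * MB
    print(f"{size_gb} GB file: one-pass {one_pass_memory(f) // MB} MB, "
          f"two-pass ~{two_pass_memory(f) / MB:.0f} MB")
```

This is the slide's point in numbers: past a modest file size, the one-pass memory requirement becomes impractical and two passes are the only option.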