Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
1. The Age of Infinite Storage

Section 1 #1: The Age of Infinite Storage has begun
Many of us have enough money in our pockets right now to buy all the storage we will be able to fill for the next 5 years. Having the storage capacity is no longer a problem; managing it is, especially when the volume gets large.

Section 1 #2: How much data is there?
Terabytes (TBs) are here:
* 1 TB costs ~1 k$ to buy
* 1 TB costs ~300 k$/year to own; management and curation are the expensive part
* Searching 1 TB takes hours

I'm terrified by terabytes (we are here). I'm petrified by petabytes. I'm completely exafied by exabytes. I'm too old to ever be zettafied by zettabytes, but you may be in your lifetime. You may be yottafied by yottabytes. You may not be googified by googolbytes (a googol is 10^100), but the next generation may be.

Prefixes: kilo 10^3, mega 10^6, giga 10^9, tera 10^12, peta 10^15, exa 10^18, zetta 10^21, yotta 10^24.

Section 1 #3: How much information is there?
Soon everything can be recorded and indexed. Most of it will never be seen by humans, so data summarization, trend detection, anomaly detection, and data mining are key technologies.

Approximate scale of collections:
* a book: kilo to mega
* a photo: mega to giga
* a movie: giga to tera
* all books (words): ~tera
* all books, multimedia: ~peta
* everything ever recorded: ~exa

(Sub-unit prefixes: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto.)

Section 1 #4: First Disk, in 1956
The IBM 305 RAMAC: 4 MB on 50 24-inch disks spinning at 1200 rpm (revolutions per minute), 100 millisecond (ms) access time, 35 k$/year to rent. It included a computer and accounting software (tubes, not transistors).

Section 1 #5: [photo; scale: 1.6 meters] 10 years later: 30 MB.

Section 1 #6: In 2003, the cost of storage was about 1 k$/TB. It has gone steadily down since then.
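The claim that searching 1 TB takes hours is easy to sanity-check with a back-of-the-envelope sequential scan. A minimal sketch, assuming a sustained read rate of ~50 MB/s for a circa-2003 disk (my assumption, not a figure from the slides):

```python
SCAN_RATE_MB_S = 50      # assumed sustained sequential read rate (not from the slides)
TB_IN_MB = 1_000_000     # 1 TB = 10^6 MB in decimal units

seconds = TB_IN_MB / SCAN_RATE_MB_S   # time for one full sequential pass
hours = seconds / 3600
print(f"Scanning 1 TB at {SCAN_RATE_MB_S} MB/s takes about {hours:.1f} hours")
```

Even this best-case sequential pass takes several hours; random-access lookups without an index would be far worse, which is the point of the slide.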
[Figure: "Price vs disk capacity" scatter plots dated 12/1/1999, 9/1/2000, 9/1/2001, 4/1/2002, and 11/4/2003, plotting price ($) against raw disk unit size (GB) for IDE and SCSI drives. The fitted k$/TB slopes fall from y = 17.9x and y = 13x in 1999 to y = 2x (SCSI) and y = x (IDE), i.e., roughly 1 to 2 k$/TB, by late 2003.]

Section 1 #7: Disk Evolution
[Chart spanning kilo, mega, giga, tera, peta, exa, zetta, yotta.]

Section 1 #8: Memex
"As We May Think", Vannevar Bush, 1945:
"A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility"
"yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can enter material freely"

Section 1 #9: Can you fill a terabyte in a year?

Item                                 Items/TB   Items/day
a 300 KB JPEG image                  3 M        9,800
a 1 MB document                      1 M        2,900
a 1 hour, 256 kb/s MP3 audio file    9 K        26
a 1 hour MPEG video                  290        0.8

Section 1 #10: On a Personal Terabyte, How Will We Find Anything?
* We need queries, indexing, data mining, scalability, replication...
* If you don't use a DBMS, you will implement one of your own!
* The need for data mining and machine learning is more important than ever!
Of the digital data in existence today:
* 80% is personal/individual
* 20% is corporate/governmental (DBMS)

Section 1 #11: We're awash with data!
* Network data: 10 exabytes by 2010 (~10^19 bytes); 10 zettabytes by 2015 (~10^22 bytes); 10 yottabytes by 2020 (~10^25 bytes)
* WWW (and other text collections): ~10^16 bytes
* Sensor data (including micro- and nano-sensor networks): 15 petabytes by 2007
* National Virtual Observatory (aggregated astronomical data): ~10^13 bytes
* US EROS Data Center archives, Earth Observing System (near Sioux Falls, SD), remotely sensed satellite and aerial imagery: 10 terabytes by 2004
* Genomic/proteomic/metabolomic data (microarrays, genechips, genome sequences): 10 "gazillabytes" by 2030 (~10^28 bytes?)
* Stock market prediction data (prices + all the above?): 10 "supragazillabytes" by 2040 (~10^31 bytes?)

(I made up these names! Projected data sizes are overrunning our ability to name their orders of magnitude.)

Useful information must be teased out of these large volumes of raw data. AND these are only some of the 1/5th of data that is corporate or governmental; the other 4/5ths of data sets are personal!

Section 1 #12: Parkinson's Law (for data)
* Data expands to fill available storage.
* Disk-storage version of Moore's Law: available storage doubles every 9 months!
How do we get the information we need from the massive volumes of data we will have?
* Querying (for the information we know is there)
* Data mining (for the answers to questions we don't know how to ask precisely)

Section 3 #13: Thank you.
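The "doubles every 9 months" rate in the disk-storage version of Moore's Law compounds quickly; a one-liner makes the arithmetic concrete:

```python
def storage_growth(months: float) -> float:
    """Capacity multiplier if available storage doubles every 9 months."""
    return 2 ** (months / 9)

print(storage_growth(9))    # one doubling period -> 2x
print(storage_growth(36))   # 3 years = 4 doublings -> 16x
print(storage_growth(60))   # 5 years -> ~102x
```

A hundred-fold capacity increase in five years is why the slides argue that managing and mining data, not buying storage, is the real problem.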