How much information?
Adapted from a presentation by:
Jim Gray, Microsoft Research (http://research.microsoft.com/~gray)
Alex Szalay, Johns Hopkins University (http://tarkus.pha.jhu.edu/~szalay/)

How much information is there in the world?
1. What can we store?
2. What is stored?
3. Why are we interested?

Infinite Storage?
• The Terror Bytes are here
  – 1 TB costs 1k$ to buy
  – 1 TB costs 300k$/y to own
• Management & curation are expensive
  – Searching 1 TB takes minutes or hours
• Petrified by Peta Bytes?
• But people can "afford" them, so:
  – Even though the bytes can never actually be seen in your lifetime
  – Automate the process
[Figure: ladder of SI prefixes (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) with a "We are here" marker at Tera]

How much information is there?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans
• Data summarization, trend detection, and anomaly detection are key technologies
• See Mike Lesk, "How much information is there?": http://www.lesk.com/mlesk/ksg97/ksg.html
• See Lyman & Varian, "How much information?": http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: scale ladder from Kilo to Yotta annotated with examples: A Book, A Photo, A Movie, All books (words), All Books MultiMedia, Everything Recorded. Footnote: the small prefixes run 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto]

First Disk, 1956
• IBM 305 RAMAC
• 4 MB
• Fifty 24" disks
• 1,200 rpm
• 100 ms access time
• 35k$/y rent
• Included computer & accounting software (tubes, not transistors)

Storage capacity beating Moore's law
• Improvements per year: capacity 60%, bandwidth 40%, access time 16%
• 1,000 $/TB today; 100 $/TB in 2007
• Moore's law: 58.7%/year; disk TB growth: 112.3%/year since 1993; price decline: 50.7%/year since 1993
• Most (80%) data is personal (not enterprise), and this will likely remain true
[Figure: disk TB shipped per year, 1988-2000, log scale 1E+3 to 1E+7 (approaching an exabyte); disk TB growth 112%/y vs. Moore's law 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

Disk Storage Cheaper Than Paper
• File cabinet:
  – Cabinet (4 drawer): 250$
  – Paper (24,000 sheets): 250$
  – Space (2' x 3' @ 10$/ft²): 180$
  – Total: 700$, about 0.03 $/sheet (3 pennies per page)
• Disk:
  – Disk (250 GB): 250$
  – ASCII: 100 M pages at 2e-6 $/sheet (10,000x cheaper): a micro-dollar per page
  – Images: 1 M photos at 3e-4 $/photo (100x cheaper): a milli-dollar per photo
• Conclusion: store everything on disk
• Note: disk is 100x to 1000x cheaper than RAM

Trying to fill a terabyte in a year (a back-of-envelope check of this table follows below)

  Item                         Items/TB   Items/day
  300 KB JPEG                  3 M        9,800
  1 MB Doc                     1 M        2,900
  1 hour 256 kb/s MP3 audio    9 K        26
  1 hour 1.5 Mb/s MPEG video   290        0.8
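A quick sanity check of the table, sketched in Python under assumptions the slide leaves implicit (1 TB = 10^12 bytes, a 365-day year): the JPEG, document, and audio rows come out within rounding of the slide's figures. Taken literally, the 1.5 Mb/s video rate gives about 1,500 items/TB rather than 290, so the slide's video row evidently assumes a heavier effective rate of roughly 3.4 GB per hour.

```python
# Back-of-envelope check of the "fill a terabyte in a year" table.
# Assumptions (implicit on the slide): 1 TB = 1e12 bytes, 365 days/year.

TB = 1e12   # bytes
DAYS = 365

item_sizes = {                                  # bytes per item
    "300 KB JPEG":              300e3,
    "1 MB Doc":                 1e6,
    "1 hr 256 kb/s MP3 audio":  256e3 / 8 * 3600,   # ~115 MB
    "1 hr 1.5 Mb/s MPEG video": 1.5e6 / 8 * 3600,   # ~675 MB
}

for name, size in item_sizes.items():
    per_tb = TB / size
    print(f"{name:26s} {per_tb:12,.0f} /TB {per_tb / DAYS:10,.1f} /day")
```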
Portable Computer, 2006?
• 100 Gips processor
• 1 GB RAM
• 1 TB disk
• 1 Gbps network
• "Some" of your software
• Finding things on it is a data-mining challenge

80% of data is personal/individual. But what about the other 20%?
• Business
  – Wal-Mart online: 1 PB and growing
  – Paradox: most "transaction" systems are < 1 PB
  – Have to go to image/data monitoring for big data
• Government
  – Government is the biggest business
• Science
  – LOTS of data

Q: Where will the data come from? A: Sensor applications
• Earth observation
  – 15 PB by 2007
• Medical images & information + health monitoring
  – Potentially 1 GB/patient/y → 1 EB/y
• Video monitoring
  – ~1E8 video cameras @ 1E5 Bps → 10 TB/s → 100 EB/y, filtered???
• Airplane engines
  – 1 GB sensor data per flight
  – 100,000 engine hours/day
  – 30 PB/y
• Smart Dust: ?? EB/y
  – http://robotics.eecs.berkeley.edu/~pister/SmartDust/
  – http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html

Premise: DataGrid Computing
• Store exabytes twice (for redundancy)
• Access them from anywhere
• Implies huge archive/data centers
• Supercomputer centers become super data centers
• Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, ...

Thesis
• Most new information is digital (and old information is being digitized)
• An information-science grand challenge: capture, organize, summarize, and visualize this information
• Optimize human attention as a resource
• Improve information quality

The Evolution of Science
• Observational science
  – Scientist gathers data by direct observation
  – Scientist analyzes the data
• Analytical science
  – Scientist builds an analytical model
  – Makes predictions
• Computational science
  – Simulate the analytical model
  – Validate the model and make predictions
• Data-exploration science
  – Data captured by instruments, or generated by simulators
  – Processed by software
  – Placed in databases/files
  – Scientist analyzes the databases/files

Computational Science Evolves
• Historically, computational science = simulation
• New emphasis on informatics: capturing, organizing, summarizing, analyzing, visualizing
• Largely driven by observational science, but also needed by simulations
• Too soon to say whether comp-X and X-info will unify or compete
[Images: BaBar, Stanford; P&E gene sequencer, from http://www.genome.uci.edu/; space telescope]

Next-Generation Data Analysis
• Looking for
  – Needles in haystacks: the Higgs particle
  – Haystacks: dark matter, dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N², likelihood techniques N³
• As data and computers grow at the same rate, we can only keep up with N log N
• A way out?
  – Discard the notion of optimal (data is fuzzy, answers are approximate)
  – Don't assume infinite computational resources or memory
• Requires a combination of statistics & computer science

Smart Data (active databases)
• If there is too much data to move around, take the analysis to the data! (a minimal sketch follows below)
• Do all data manipulation in the database
  – Build custom procedures and functions in the database
• Automatic parallelism guaranteed
• Easy to build in custom functionality
  – Databases & procedures are being unified
  – Examples: temporal and spatial indexing, pixel processing
• Easy to reorganize the data
  – Multiple views, each optimal for certain types of analyses
  – Building hierarchical summaries is trivial
• Scalable to petabyte datasets
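To make "take the analysis to the data" concrete: a minimal sketch, assuming SQLite as a stand-in for an archive server; the observations table, its columns, and the mag() user-defined function are hypothetical, invented for illustration. Only small summary rows come back to the client, and the custom function runs inside the engine, next to the data.

```python
# Sketch: push the aggregation (and a custom function) into the database
# rather than shipping raw rows to the client. SQLite stands in for a
# real archive server; the table, columns, and mag() are hypothetical.
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (region TEXT, flux REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [("north", 1.2), ("north", 0.9), ("south", 2.1), ("south", 1.7)],
)

# A user-defined function registered with the engine runs next to the data.
conn.create_function("mag", 1, lambda flux: -2.5 * math.log10(flux))

# Only small summary rows cross the "network", not the raw observations.
query = """
    SELECT region, COUNT(*), AVG(flux), mag(AVG(flux))
    FROM observations
    GROUP BY region
"""
for row in conn.execute(query):
    print(row)
```

In a production archive the same pattern appears as stored procedures and user-defined functions running inside the database server, which is where the slide's claim of automatic parallelism comes from.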
Challenge: Make Data Publication & Access Easy
• Augment FTP with data query: return intelligent data subsets
• Make it easy to
  – Publish: record structured data
  – Find: find data anywhere in the network, and get just the subset you need
  – Explore datasets interactively
• Realistic goal: make it as easy as publishing and reading web sites today

Data Federations of Web Services
• Massive datasets live near their owners:
  – Near the instrument's software pipeline
  – Near the applications
  – Near data knowledge and curation
  – Supercomputer centers become super data centers
• Each archive publishes a web service
  – Schema: documents the data
  – Methods on objects (queries)
• Scientists get "personalized" extracts
• Uniform access to multiple archives
  – A common global schema: federation
• Challenge: what is the object model for your science?

Web Services: The Key?
• Web SERVER:
  – Given a URL + parameters
  – Returns a web page (often dynamic)
• Web SERVICE:
  – Given an XML document (SOAP message)
  – Returns an XML document
  – Tools make this look like an RPC: F(x,y,z) returns (u,v,w) (a minimal sketch follows at the end of this section)
  – Distributed objects for the web, plus naming, discovery, security, ...
• Internet-scale distributed computing
[Diagram: your program calling a web server vs. your program calling a web service, with the result data landing in your address space]

Emerging technologies
• Look at science
• High-end computation and storage
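To ground the "Web Services: The Key?" slide, a minimal sketch of how RPC tooling hides the XML exchange. It uses Python's standard-library XML-RPC (a simpler cousin of SOAP, chosen only because it is self-contained); the query_subset method, its parameters, and the in-process server are hypothetical stand-ins for a real archive service.

```python
# Sketch: an XML document goes over the wire in both directions, but the
# tooling makes the exchange read like an ordinary function call.
# XML-RPC (stdlib) stands in for SOAP; query_subset is hypothetical.
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def query_subset(ra, dec, radius):
    # Stand-in for an archive method: returns a small structured subset.
    return {"center": [ra, dec], "radius": radius, "matches": 42}

server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(query_subset)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the XML marshalling is invisible; F(x, y, z) returns a struct.
proxy = ServerProxy(f"http://localhost:{port}")
print(proxy.query_subset(180.0, 2.5, 0.1))
server.shutdown()
```

This shape, schema-described methods returning structured subsets, is what the federation slide assumes each archive would publish.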