Database Activities and Trends
Jim Gray, Microsoft Research
2 June 2006, Microsoft, TechNet, London

Outline
• What I have been doing (15 minutes)
• BIG Changes in DBs (15 minutes)
• Q&A (20 minutes)

Scalability Projects
• TerraServer: Geospatial data online
  – Now part of Virtual Earth http://local.live.com/
• SATA disk evaluation
  – Copy 1.5 Petabytes (count types of errors) MSR-TR-2005-166
• Disk and Network performance: move 1GB/s from CERN to Pasadena
• Bricks MSR-TR-2004-62
  – BI-Bricks: cheap boxes/disks for BI
  – Server Bricks: TerraServer Bricks: MSR-TR-2004-107

DB Projects
• Spatial data access inside SQL
  – Gives a good example of using CLR to extend SQL
  – Sample is part of SQL 2005 programming samples.
  – Many papers …, MSR-TR-2005-122, MSR-TR-2006-52
• To Blob or NOT to Blob?
  – Explored what is the break-even point of Blobs vs Files. Guess what! Almost all files should be blobs. MSR-TR-2006-45
• GPU TeraSort: (Not me, but very cool!)
  – You have been hearing about Many-Core from Intel
  – Nvidia & ATI give you 100 cores today (2x next year)
    10x the operations per second of the CPU
    10x the memory bandwidth of the CPU
  – How to program them?
  – Sort represents IO, memory, processing.
  – GPU TeraSort demos this MSR-TR-2005-183
  – Accelerator: C# extension is a GPU compiler. MSR-TR-2005-184

eScience Projects
• SkyServer: Astronomy data online http://skyserver.sdss.org/
  – A real Data Grid app
  – Web services are popular
  – SkyQuery and CasJobs use web services. http://casjobs.sdss.org/CasJobs/
  – Spatial access built as SQL 2005 C# extensions.
• Doing Finite Element Analysis with a DB and Vis tools
  “Supporting Finite Element Analysis with a Relational Database Backend; Part I: There is Life beyond Files” MSR-TR-2005-49
• Ecological sensors (soil, water, ocean,…)
  – Only public thing so far: http://lifeunderyourfeet.org/
  – Many papers coming
• Starting BioInfo efforts (Portable PubMed Central, ….)

Portable PubMedCentral
• “Information at your fingertips”
• Helping build PortablePubMedCentral
• Deployed in US, China, England, Italy, South Africa (Japan soon).
• Each site can accept documents
• Archives replicated
• Federate thru web services
• Working to integrate Word/Excel/… with PubmedCentral
  – e.g. WordML, XSD,
• To be clear: NCBI is doing 99% of the work, but it is very cool and very significant.

Outline
• What I have been doing (15 minutes)
• BIG Changes in DBs (15 minutes)
• Q&A (20 minutes)

DB System Architecture
• The classic DBMS model …
  [Diagram: layered stack: utilities, sets, records, os]
  Worked, but applications wanted to query other data types
• Added:
  + Text, Time, Space
  + Cubes, Data mining
  + XML, XQuery
  + Programming Languages
  + Triggers and queues
  + Replication, Pub/sub
  + Extract-Transform-Load
  + Many more extensions coming
  [Diagram: the same stack surrounded by Notification, Space, Time, Data Mine, Cubes, Text, ETL, Replication, XML, Queues, Procedures]
• A Mess?

DB Systems evolved to be containers for information services
development, deployment, and execution environment
• Classic ++
  – + Programming Languages
  – + Triggers and queues
  – + Replication, Pub/sub
  – + Extract-Transform-Load
  – + Text, Time, Space
  – + Cubes, Data mining
  – + XML, XQuery
  – + Many more extensions coming
• DBMS is an ecosystem
  OO is the key structuring strategy:
  – Everything is a class
  – Database is a complex object
  – Core object is DataSet
  – Classes publish/consume them
  – Depends on strong Object Model
  [Diagram: stack of utilities, sets, records, os; label: DataSet]

The Object-Relational World
marry programming languages and DBMSs
• Stored procedures evolve to “real” languages: Java, C#,.. with real object models.
• Data encapsulated: a class with methods
• Classes may be persistent
• Tables are enumerable & index-able record sets with foreign keys
• Records are vectors of objects
• Opaque or transparent types
• Set operators on transparent classes
• Transactions:
  – Preserve invariants
  – A composition strategy
  – An exception strategy
• Ends Inside-DB Outside-DB dichotomy
[Diagram label: Business Objects]

Ask not “How to add objects to databases?”, Ask “What kind of object is a database?”
Q: Given an object model, what is a DB?
A: DataSet class and methods (nested relation with metadata)
This is the basis for the ecosystem:
  Distributed DB
  Extensible DB
  Interoperable DB
This was implicit in ODBC but is now explicit within the DBMS ecosystem
  Input: Command (any language)
  Output: Dataset
  Entity Set in ADO.NET 3.0
[Diagram: Question → Dataset: Tables or Text …. or cube Or…..]

Queues & Workflows
SODA (Service Oriented Data Architecture)
“Service Oriented Database Architecture: App Server-Lite?” MSR-TR-2005-129
• Apps are loosely connected via Queued messages
• Queues are databases.
• Basis for workflow
• Queues: the first class to add to an OR DBMS
• Queues fire triggers. Active databases
• Synergy with DBMS security, naming, persistence, types, query,…
[Diagram: Workflow: Script, Execute, Administer & Expedite; all built on queues]

Text, Temporal, and Spatial Data Access
• Q: What comes after queues?
  A: Basic types: text, time, space,…
• Great application of OR technology

    select Title, Abstract, Rank
    from Books join FreeTextTable(Title, Abstract, 'XML semistructured') T
      on BookID = T.Key

• Key idea: table valued functions == indices
  An index is a table, organized differently
  Query executor uses index to map: Key → set (aka sequence of rows)

    select galaxy, distance from GetNearbyObjEQ(22,37)

• Table valued function can do this map. Optimizer can use it.
• +extras: cost function, cardinality,…

    select store, holiday, sum(sales)
    from Sales join HolidayDates(2004) T
      on Sales.day = T.day
    group by store, holiday

• BIG DEAL: Approximate answers: Rank and Support

What’s new here?
• DBMS have tight-integration with language classes (Java, C#, VB,.. )
• The DB is a class
• You can add classes to DB.
• Adding indices is “easy” If you have a new idea.
• Now have solid Queue systems
  Adding workflow is “easy” If you have a new idea.
• This is a vehicle for publishing data on the Web.
[Diagram: Question → Dataset: Tables or Text or cube Or…..]

Cubes
• Data cubes now standard
• MDX is very powerful (Multi-Dimensional eXpressions)

    SELECT <axis_spec> FROM <cube_spec> WHERE <slicer_spec>

• Dimension, Measure, Operator concepts highly evolved beyond snowflake schema
• Cube stores cohabit with row stores
  ROLAP + MOLAP + (x xOLAP)
  (relational + multidimensional online analytic processing)
• Very sophisticated algorithms
• A big part of the ecosystem
[Diagram: cube with faces labelled RED, WHITE, BLUE]

Semi-Structured Data
• “Everyone starts with the same schema: <stuff/>. Then they refine it.” J. Widom
• We are a “strong schema” community
• That has pros-and-cons.
• Files <stuff/> and XML <<foo/> <bar/>> are here to stay. Get over it!
• File directories are becoming databases;
  – Pivot on any attribute
  – Folders are standing queries.
  – Freetext+schema search (better precision/recall)
• XSD (xml schema) and xQuery are transitional; But we have to do them to get to the real answer.
• Cohabit with row-stores.
• Challenge: figure out what comes after XSD+xQuery

Data Mining and Machine Learning
• Tasks: classification, association, prediction
• Tools: Decision trees, Bayes, A Priori, clustering, regression, Neural net,…
• now unified with DBs
  – Create table T (x,y,z,u,v,w)
    Learn “x,y,z” from “u,v,w” using <algorithm>
  – Train T with data.
  – Then can ask:
    • Probability x,y,z,u,v,w
    • What are the u,v,w probabilities given x,y,z
  – Example: Learn height from age.
• Anyone with a data mining algorithm has full access to the DBMS infrastructure.
• Challenge: Better learning algorithms.

DM – DB Synergy
DB verbs to drive Modeler

Create the model:

    CREATE MINING MODEL HeightFromAgeSex (
      ID long key,
      Gender text discrete,
      Age long continuous,
      Height long continuous PREDICT)
    USING Decision_Trees

Train a data mining model (learn height from Gender + Age):

    INSERT INTO HeightFromAgeSex
    SELECT ID, Gender, Age, Height
    FROM People

Predict height from model (Probabilistic Reasoning):

    SELECT height, PredictProbability(height)
    FROM HeightFromAgeSex PREDICTION JOIN New
      ON New.Gender = HeightFromAgeSex.Gender
      AND New.Age = HeightFromAgeSex.Age

Notification, Stream Processing, and Sensor Processing
• Traditionally: Query billions of facts
• Streams: millions of queries one new fact
  – New protein compare to all DNA
  – Change in price or time
• Implications
  – New aggregation operators (extension)
  – New programming style
  – Streams in products:
    • Queries represented as records
    • New query optimizations.
• Sensor networks
  – push queries out to sensors.
  – Simpler programming model
  – Optimizes power & bandwidth
[Diagram: traditional: Q? → facts → A!; streams: fact, fact, fact… → Q Q Q Q Q Q Q; label: Notification]

Restatement: DB Systems evolved to be containers for information services
development, deployment, and execution environment
• DBMS is an ecosystem
  Key structuring strategy:
  – Everything is a class
  – Database is a complex object
  – Core object is DataSet
• The architecture lets you add your new ideas.
[Diagram: stack of utilities, sets, records, os]
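
The stream-processing slide above says that in a stream system the queries are represented as records and each new fact is matched against them. A minimal Python sketch of that inversion follows, using the standard-library sqlite3 module. The table name standing_query, its columns, the sample subscribers, and the notify function are all hypothetical names chosen for illustration; none of them come from the talk or from any product.

```python
import sqlite3

def notify(con, symbol, price):
    """Match ONE new fact (symbol, price) against ALL stored queries.

    The standing queries are ordinary rows, so finding the interested
    subscribers is itself just a relational query over those rows.
    """
    rows = con.execute(
        "SELECT subscriber FROM standing_query "
        "WHERE symbol = ? AND threshold < ? "
        "ORDER BY subscriber",
        (symbol, price),
    )
    return [subscriber for (subscriber,) in rows]

# Hypothetical schema: each row means "tell <subscriber> when <symbol>
# trades above <threshold>".
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE standing_query (subscriber TEXT, symbol TEXT, threshold REAL)"
)
con.executemany(
    "INSERT INTO standing_query VALUES (?, ?, ?)",
    [("alice", "MSFT", 25.0), ("bob", "MSFT", 30.0), ("carol", "IBM", 80.0)],
)
# An index on the query table plays the role of the "new query
# optimizations" bullet: the fact probes the index instead of scanning.
con.execute("CREATE INDEX standing_query_idx ON standing_query (symbol, threshold)")

print(notify(con, "MSFT", 27.5))  # one new fact arrives
```

With the sample rows above, a new MSFT price of 27.5 satisfies only the first standing query. Keeping the queries in a table means they inherit the DBMS's persistence, security, and indexing, which is the synergy the slides argue for.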

Language + DB Integration (the Microsoft contribution)
• LINQ is a BIG deal (SQL and XML) http://msdn.microsoft.com/data
• Entity Sets are next step in Data Sets
  ADO.NET V3 automates entities

Data access today
[Diagram: Explicit DB connections; command text; Opaque Command; Connection, Command, DataReader over the Relational Engine; Rows; Untyped resultsets; Entities ≠ Rows (Customer, SalesPerson)]

    void EmpsByDate(DateTime date) {
      using( SqlConnection con = new SqlConnection(
               Settings.Default.AdventureWorksSQL)) {
        con.Open();
        SqlCommand cmd = con.CreateCommand();
        cmd.CommandText = @"
          SELECT SalesPersonID, FirstName, HireDate
          FROM SalesPerson sp
          INNER JOIN Employee e ON sp.SalesPersonID = e.EmployeeID
          INNER JOIN Contact c ON e.EmployeeID = c.ContactID
          WHERE e.HireDate < @date";
        cmd.Parameters.AddWithValue("@date", date);
        DbDataReader r = cmd.ExecuteReader();
        while(r.Read()) {
          Console.WriteLine( "{0:d}:\t{1}", r["HireDate"], r["FirstName"]);
        }
      }
    }

Data access tomorrow
[Diagram: Auto-Gen classes; No explicit connections; Query<T>; Strongly typed commands; Strongly typed results; ObjectContext with MapConnection, MapCommand, MapDataReader over Connection, Command, DataReader and the Relational Engine; Rows → Entities → Objects; Domain Objects: Order, SalesData, Customer, SalesPerson]

    void EmpsByDate(DateTime date) {
      using (AdventureWorksDB aw = new AdventureWorksDB()) {
        var people = from p in aw.SalesPeople
                     where p.HireDate < date
                     select p;
        foreach (SalesPerson p in people) {
          Console.WriteLine("{0:d}\t{1}", p.HireDate, p.FirstName );
        }
      }
    }

    public partial class AdventureWorksDB : System.Data.Objects.ObjectContext {
      public System.Data.Objects.Query<SalesOrder> SalesOrders { … }
      public System.Data.Objects.Query<SalesPerson> SalesPeople { … }
    }

Outline
• What I have been doing (15 minutes)
• BIG Changes in DBs (15 minutes)
• Q&A (20 minutes)
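
Supplementary sketch for the “Data access today” and “Data access tomorrow” slides: the contrast between untyped result rows and strongly typed entities can be tried out in a few lines of Python with the standard-library sqlite3 module. This is only an illustration of the idea, not ADO.NET or LINQ. The SalesPerson dataclass, the load_people and emps_by_date helpers, and the two sample rows are assumptions made for the example, and the filter here runs in memory rather than being translated to SQL as a LINQ provider would do.

```python
import sqlite3
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SalesPerson:
    """Typed entity: fields are checked by name, not by string lookup."""
    first_name: str
    hire_date: date

def load_people(con):
    """Map untyped rows to typed entities in one place (the 'mapping' layer)."""
    rows = con.execute(
        "SELECT FirstName, HireDate FROM SalesPerson ORDER BY HireDate"
    )
    return [SalesPerson(name, date.fromisoformat(hired)) for name, hired in rows]

def emps_by_date(people, cutoff):
    """Typed query: a misspelled attribute fails loudly instead of silently."""
    return [p for p in people if p.hire_date < cutoff]

# Hypothetical sample data standing in for the AdventureWorks tables.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE SalesPerson (FirstName TEXT, HireDate TEXT)")
con.executemany(
    "INSERT INTO SalesPerson VALUES (?, ?)",
    [("Ann", "2001-03-15"), ("Raj", "2005-07-01")],
)

people = load_people(con)
for p in emps_by_date(people, date(2003, 1, 1)):
    print(f"{p.hire_date:%Y-%m-%d}\t{p.first_name}")
```

The application code after load_people never touches a connection, a command string, or a column name in quotes, which is the shift the two slides describe.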