* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Data Structure Database - Tufts Computer Science
Survey
Document related concepts
Tandem Computers wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Microsoft Access wikipedia , lookup
Oracle Database wikipedia , lookup
Concurrency control wikipedia , lookup
Functional Database Model wikipedia , lookup
Ingres (database) wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Versant Object Database wikipedia , lookup
Clusterpoint wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Relational algebra wikipedia , lookup
Transcript
1/60 Lecture 8: Databases November 3, 2015 COMP 150-2 Visualization Slides courtesy of Stanford, Toronto, and UIUC 2/60 Quick Note about VIS • http://ieeevis.org/year/2015/info/overviewamp-topics/papers-sessions • Animated Transitioning • http://www3.cs.stonybrook.edu/~pruchikachor/a nalogy/ • Memorability • Database Workshop 3/60 Why Databases? • In-memory visualization systems are not scalable. • When datasets become large, maintaining interactivity is not easy. • Why do we care about databases? • Different databases have different “properties” • • i.e. things that they are good at and things that they are not good at. Consider the library example • • • Current structure – “write once, read often” What if, “write often, read rarely”? Twitter? 4/60 Overview 1. Relational algebra 2. Writing a SQL query i. ii. iii. iv. Insert, update, delete Select, from, where (Inner) join Other syntax 3. Stages of a query: i. ii. iii. Parsing (query optimization) PreparedStatement 4. Technical Stuff a) Database (ODBC) connection i. b) Using the “device driver” model ResultSet 5. Using Databases with Visualizations 5/60 Relational Algebra • Definitions • • • Relational Schema Relational Instance A Relation • Bags vs. Sets • Operations • Operation tree • Limitations 6/60 Different Types of Models 7/60 The Relational Model: Schemata Part (I) Relational Schema: Students(sid: string, name: string, gpa: float) Relation name Attributes string, float, int are the domains of the attributes 8/60 The Relational Model II: Data Students(sid: string, name: string, gpa: float) Attributes Tuples / Records # of tuples is the cardinality Sid 101 121 123 135 Name Gpa Bob 3.2 Joe 2.8 Mary 3.8 Alice 3.7 # of attributes is the arity NB: In practice, DB systems relax the set requirement. Why? A relational instance is a set of tuples all conforming to the same schema 9/60 To reiterate A relational schema describes the data that is contained in a relational instance A relational instance is a set of tuples of the same type. Let R(f1:D1,…,fm:Dm) be a relational schema then, an instance of R is a subset of Dom1 x Dom2 x … x Domn A tuple viewed as a total function from attribute names to types (names important) 10/60 One more time A relational schema describes the data that is contained in a relational instance A relation R is of arity t is a function: R : D1 x … x Dt {0,1} The schema is simply the signature of the function. Note that the 0, 1 represents whether the instance exists 11/60 A relational database A relational database schema is a set of relational schemata, one for each relation A relational database instance is a set of relational instances, one for each relation Two conventions: 1. We call instances as simply databases 2. We assume all instances are valid, i.e., satisfy the domain constraints 12/60 Operations on Bags A bag = a set with repeated elements All operations need to be defined carefully on bags {a,b,b,c}{a,b,b,b,e,f,f}={a,a,b,b,b,b,b,c,e,f,f} {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b} sC(R): preserve the number of occurrences PA(R): no duplicate elimination Cartesian product, join: no duplicate elimination Important ! Relational Engines work on bags, not sets ! 13/60 Relational Algebra Five operators: Union: Difference: Selection: s Projection: P Cartesian Product: Derived or auxiliary operators: Intersection, complement Joins (natural, equi-join, theta join, semi-join) Renaming: r Division 14/60 Cartesian Product Example Employee Name John Tony SSN 999999999 777777777 Dependents EmployeeSSN 999999999 777777777 Dname Emily Joe Employee x Dependents Name SSN EmployeeSSN John 999999999 999999999 John 999999999 777777777 Tony 777777777 999999999 Tony 777777777 777777777 Dname Emily Joe Emily Joe 15/60 Natural Join Notation: R1 || R2 Meaning: R1 || R2 = PA(sC(R1 R2)) Where: The selection sC checks equality of all common attributes The projection eliminates the duplicate common attributes 16/60 Natural Join Example Employee Name SSN John 999999999 Tony 777777777 Dependents SSN Dname 999999999 Emily 777777777 Joe Employee Dependents = PName, SSN, Dname(s SSN=SSN2(Employee x rSSN2, Dname(Dependents)) Name SSN Dname John 999999999 Emily Tony 777777777 Joe 17/60 Natural Join R= R || S= S= A B B C X Y Z U X Z V W Y Z Z V Z V A B C X Z U X Z V Y Z U Y Z V Z V W 18/60 Virtues of the model Physical independence (logical too), Declarative Simple, elegant clean: Everything is a relation Why did it take multiple years? Doubted it could be done efficiently. 19/60 What is an Algebra? 20/60 Logical Equivalence of Plans R(A,B) S(B,C) s[A=5] (p[A] R) = p[A] (s[A=5] R) Here, projection and selection commute p[B] (s[A=5] R)? Can we play the same game here? 21/60 A simple plan p[B] R(A,B) S(B,C) What SQL query does this correspond to? 22/60 Pushing down projection p[B] R(A,B) S(B,C) p[B] p[B] R(A,B) S(B,C) Why might we prefer this plan? 23/60 Example Limitations • SQL/RA has its problems • Reusability • Modularity • Readability, etc. • “Computation” or math (e.g., average, min, max, stdev, etc.) is not a part of RA • But there are also “real” limitations 24/60 Finally: RA has Limitations ! Cannot compute “transitive closure” Name1 Name2 Relationship Fred Mary Father Mary Joe Cousin Mary Bill Spouse Nancy Lou Sister Find all direct and indirect relatives of Fred Cannot express in RA !!! Need to write C program, use a graph engine, or modern SQL… 25/60 SQL Query • Four Types of Operations • • • • INSERT UPDATE DELETE SELECT 26/60 Ok, but you said this was operational… SQL Data Definition Language (DDL) is one method of defining relations In SQL, Relation → table Students(sid: string, name: string, gpa: float) CREATE TABLE Students ( sid CHAR(20), Name CHAR(50), Gpa REAL); 27/60 Table name Tables in SQL Attribute names Product PName Price Category Manufacturer Gizmo $19.99 Gadgets GizmoWorks Powergizmo $29.99 Gadgets GizmoWorks SingleTouch $149.99 Photography Canon MultiTouch $203.99 Household Hitachi Tuples or rows 27 28/60 Tables Explained The schema of a table is the table name and its attributes: Product(PName, Price, Category, Manfacturer) A key is an attribute whose values are unique; we underline a key Product(PName, Price, Category, Manfacturer) 28 29/60 Data Types in SQL Atomic types: Characters: CHAR(20), VARCHAR(50) Numbers: INT, BIGINT, SMALLINT, FLOAT Others: MONEY, DATETIME, … Every attribute must have an atomic type Hence tables are flat Why ? 29 30/60 Tables Explained A tuple = a record Restriction: all attributes are of atomic type A table = a (multi)-set of tuples Like a list… …but it is unordered: no first(), no next(), no last(). 30 31/60 Outline Data in SQL (Review) Simple Queries in SQL Queries with more than one relation 31 32/60 SQL Query Basic form: (there are many many more bells and whistles) SELECT <attributes> FROM <one or more relations> WHERE <conditions> Call this: An SFW query. 32 33/60 Simple SQL Query Product PName Gizmo Price $19.99 Powergizmo SingleTouch MultiTouch $29.99 $149.99 $203.99 Category Gadgets Manufacturer GizmoWorks Gadgets GizmoWorks Photography Canon Household Hitachi SELECT * FROM Product WHERE category=‘Gadgets’ “selection” PName Price Category Manufacturer Gizmo $19.99 Gadgets GizmoWorks Powergizmo $29.99 Gadgets 33 GizmoWorks 34/60 Simple SQL Query Product PName Gizmo Price $19.99 Powergizmo SingleTouch MultiTouch $29.99 $149.99 $203.99 Category Gadgets Manufacturer GizmoWorks Gadgets GizmoWorks Photography Canon Household Hitachi SELECT Pname, Price,Manufacturer FROM Product WHERE category=‘Gadgets’ “selection” “projection” PName Price Manufacturer Gizmo $19.99 GizmoWorks Powergizmo $29.99 GizmoWorks 34 35/60 The IN operator SELECT * FROM Products WHERE PName in (‘Gadgets’, ‘Photography’); This is functionally equivalent to: SELECT * FROM Products WHERE PName = ‘Gadgets’ OR PName =‘Photography’; 35 36/60 The LIKE operator SELECT * FROM Products WHERE PName LIKE ‘%gizmo%’ s LIKE p: pattern matching on strings p may contain two special symbols: % = any sequence of characters _ = any single character 36 37/60 Eliminating Duplicates SELECT DISTINCT category FROM Product Category Gadgets Photography Household Compare to: SELECT category FROM Product Category Gadgets Gadgets Photography Household 37 38/60 Ordering the Results SELECT pname, price, manufacturer FROM Product WHERE category=‘gizmo’ AND price > 50 ORDER BY price, pname Ties are broken by the second attribute on the ORDER BY list, etc. Ordering is ascending, unless you specify the DESC keyword. 38 39/60 Employee table LastName DepartmentID Rafferty 31 Jones 33 Heisenberg 33 Robinson 34 Smith 34 Williams NULL Department table DepartmentID DepartmentName 31 Sales 33 Engineering 34 Clerical 35 Marketing 40/60 Inner Join SELECT * FROM employee INNER JOIN department ON employee.DepartmentID = department.DepartmentID; SELECT * FROM employee, department WHERE employee.DepartmentID = department.DepartmentID; 41/60 Inner Join Employee.LastNam e Employee.Departm entID Department.Depart Department.Depart mentName mentID Robinson 34 Clerical 34 Jones 33 Engineering 33 Smith 34 Clerical 34 Heisenberg 33 Engineering 33 Rafferty 31 Sales 31 42/60 (Left) Outer Join SELECT * FROM employee LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID; Employee.LastNam e Employee.Departm entID Department.Depart Department.Depart mentName mentID Jones 33 Engineering 33 Rafferty 31 Sales 31 Robinson 34 Clerical 34 Smith 34 Clerical 34 Williams NULL NULL NULL Heisenberg 33 Engineering 33 43/60 Subqueries Returning Relations Company(name, city) Product(pname, maker) Purchase(id, product, buyer) Return cities where one can find companies that manufacture products bought by Joe Blow SELECT Company.city FROM Company WHERE Company.name IN (SELECT Product.maker FROM Purchase , Product WHERE Product.pname=Purchase.product AND Purchase .buyer = ‘Joe Blow‘); 43 44/60 Aggregation SELECT avg(price) FROM Product WHERE maker=“Toyota” SELECT count(*) FROM Product WHERE year > 1995 SQL supports several aggregation operations: sum, count, min, max, avg Except count, all aggregations apply to a single attribute 44 45/60 Aggregation: Count COUNT applies to duplicates, unless otherwise stated: SELECT Count(category) FROM Product WHERE year > 1995 same as Count(*) We probably want: SELECT Count(DISTINCT category) FROM Product WHERE year > 1995 45 46/60 Stages of a Query • Parsing • Query Plan Generation • Query Optimization • (Using PreparedStatement) 47/60 Example Query Plan / Optimizer Given a database with two tables: dept (dno, floor) emp (name, age, sal, dno) Consider the following SQL query: select name, floor from employ, dept where employ.dno = dept.dno and employ.sal > 100k Example taken from “Query Optimization” by Ioannidis, 1997 48/60 Possible Query Plans 49/60 Cost of the Query For a database with 100,000 employees (stored across 20,000 page files), the three query plans can have significantly different execution time (in 1997): T3: <1 sec T1: >1 hour T2: ~1 day 50/60 PreparedStatement • For repeated queries java.sql.PreparedStatement stmt = connection.prepareStatement( "SELECT * FROM users WHERE USERNAME = ? AND ROOM = ?"); stmt.setString(1, ‘Tom’); stmt.setInt(2, 24); stmt.executeQuery(); 51/60 Technical Stuff • Think of ODBC (Open Database Connectivity) and JDBC as a device driver / API • Just like a printer driver: 1. For each printer (database software), you need to install a specific driver 2. Once the driver is installed, your software talks it the printer (database) via a common API / interface 3. The printer (database) takes the command and converts it to “native” functions and calls 52/60 Example Database Connection Code import java.sql.*; public class FirstExample { // JDBC driver name and database URL static final String JDBC_DRIVER = "com.mysql.jdbc.Driver"; static final String DB_URL = "jdbc:mysql://localhost/EMP"; // Database credentials static final String USER = "username"; static final String PASS = "password"; public static void main(String[] args) { Connection conn = null; Statement stmt = null; try{ //STEP 2: Register JDBC driver Class.forName("com.mysql.jdbc.Driver"); //STEP 3: Open a connection System.out.println("Connecting to database..."); conn = DriverManager.getConnection(DB_URL,USER,PAS S); //STEP 4: Execute a query System.out.println("Creating statement..."); stmt = conn.createStatement(); String sql; sql = "SELECT id, first, last, age FROM Employees"; ResultSet rs = stmt.executeQuery(sql); : : } 53/60 ResultSet String sql; sql = "SELECT id, first, last, age FROM Employees"; ResultSet rs = stmt.executeQuery(sql); //STEP 5: Extract data from result set while(rs.next()){ //Retrieve by column name int id = rs.getInt("id"); int age = rs.getInt("age"); String first = rs.getString("first"); String last = rs.getString("last"); } //STEP 6: Clean-up environment rs.close(); 54/60 Database Background • SQL Query. Example: SELECT person FROM datasetTable WHERE (weight > 130) AND (weight < 180) AND (height > 5.8) AND (height < 6) 55/60 Database and Visualization • This is the equivalent of a selection box in a scatter plot visualization Height = 6 Height = 5.8 weight = 130 weight = 180 56/60 Database and Visualization • For a specific type of interaction (e.g. select, filter, zoom, etc.), there is often a 1:1 mapping between the interactions and database queries • Homefinder Example 57/60 Database and Visualization • There is a nice mapping between each visual element with an attribute in the database SELECT latitude, longitude FROM homesTable WHERE (distance > 5) AND (distance < 10) AND (price > 100,000) AND (price < 300,000) AND (bedrooms > 3) AND (garage = TRUE) AND (fireplace = FALSE) 58/60 Query Relaxation • “Generalized Selection via Interactive Query Relaxation” (Heer et al. CHI 2008) • http://vis.berkeley.edu/papers/generalized_s election/ 59/60 Database Strength and Weakness • Strengths: • • • Generalizable Separates data from visualization cleanly SQL is a very stable language that is very complete for interacting with data stored in relational databases • Weaknesses: • • • Dynamic query construction is tricky DB overhead Speed is a problem • • • Network speed DB searching speed Parsing DB returned tuples 60/60 Conjunctive Form • A selection / filtering / zooming interaction == database query • It can also be thought of as a conjunctive form !(A1 V A2) ^ A3 V (A4 V A5 ^ A6) V … where A1 could be the clause (price > 200) 61/60 Conjunctive Form • Very similar to SQL query, but can be built in software • Has some of the similar limitations as SQL in that the clauses need to be pre-built • Such that a data attribute needs to be “hard-coded” to a visual element • Consider a slider that connects to two or more attributes (http://gravis.cs.unibas.ch/publications/2007/VIS07_S mith.pdf)