Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Databases A/Prof Alan Fekete University of Sydney Overview • Data, databases, and database management systems • Querying a database with SQL • Creating a database with a RDBMS • Tradeoffs Data • Facts about the world that are worth knowing – Eg NGC 450 is at 18.876,-0.861 – Eg The AAT has a 3.9 m reflector – The redshift of NGC 450 was measured with the AAT’s 2dF spectrograph • Data is very valuable – Data can be analysed to produce theories, or to test them – It takes a lot of effort to gather data Persistence • Values of variables in a running program are stored in the memory (RAM) – They are lost when the program ends, or the machine crashes • Computers have persistent storage (hard disk, flash memory etc) whose values persist • The operating system organizes the persistent storage into files – So store valuable data in files Sharing • Once someone has collected data and got it into a computer, they can use it for many purposes, and many others can use it • The naïve approach: copy the file around – Or email, or set up an ftp site, etc • This is very bad, if the data might change – More data (more galaxies, more facts about each galaxy!) – Corrections to existing data • Copies can become inconsistent; so share one store instead A Database • A store of data relevant to some domain, stored in a computer, and shared (between application programs or between users) • Typically, the data is structured – Many facts in each of a few types – Eg many galaxies, each has name, ra, dec, redshift etc Clients and servers • The database server is the machine where the database is stored • Each user must run some presentation client code on their machine, to display the data – Usually this is just a web browser • There may be other machines in-between – Eg web server, application server – Note that a machine may be a server which receives some requests, and also a client of others as it tries to respond! • We focus on the database server, and the “database client” where the application code is running that understands the data Two approaches • Purpose-written application code, accessing data stored in files, using knowledge of the structure of the files This is common in science • Use a database management system (DBMS) which manages the data and provides higher-level facilities to access it – You may write applications that call the DBMS, or even use it directly, interactively This is standard in industry, and is becoming more frequent in science DBMS Features • A query language – Describe the data to retrieve – Can also describe updates etc • A data definition language – Describe the logical structure of the data • Storage management – Control the layout in files, and with facilities to improve performance • Access control Different Data Models • Each database has a model or schema which tells what sort of data is stored – Eg each galaxy has id, name, ra, dec, etc • Using a DBMS means that the data model must fit the style supported by the DBMS • The most common DBMSs all support the relational style of data model Relational Data Models • Database consists of tables, each with a name • Each table has many rows, each with exactly the same structure – several attributes, each with a name and type (integer, float, string of length 20, etc) – Each row has a value for each attribute • Rows are distinguished by the values of a primary key (one or more attributes) Example Note use of “foreign key” to connect information in different tables • Schema: – galaxy(gal_id INTEGER, gal_ra FLOAT, gal_decl FLOAT, gtype STRING) – observation(obs_id INTEGER, obs_ra FLOAT, obs_decl FLOAT, obs_flux FLOAT, gal_id INTEGER, source STRING) • An Instance 1 2 3 4 1 13.1 7.2 S 2 14.7 5.3 C 3 11.2 4.8 S 4 15.3 4.6 S 14.6 11.2 14.5 11.1 13.4 9.6 13.3 9.7 5.4 4.9 4.3 4.8 2 3 2 3 A A B C Views • You can define a view on a database • Just like a table, but the data isn’t inserted or stored explicitly – It is computed when needed from the other tables • A view can be used in queries just like a table Queries • User can write a query, and run it – Answer is a collection of relevant data • Query may be typed interactively, or written into a program for repeated use – Perhaps with parameters that vary • Answer may be viewed on the screen, imported into a program, printed, or stored back in the database SQL • A standardized language in which queries can be written against databases in relational DBMSs – But each platform has slight variations • SQL is declarative – Query describes what data is needed – RDBMS has the job of working out how to trawl through the database to compute the answer Select-from-where SELECT gal_id, gtype FROM galaxy WHERE gal_ra between 13.0 and 16.0 AND gal_decl between 4.0 and 6.0; 1 13.1 7.2 S 2 14.7 5.3 C 3 11.2 4.8 S 4 15.3 4.6 S Answer is: 2 C 4 S Remember • FROM tells which table to use • WHERE tells which rows you are interested in – A boolean condition that is evaluated for each row • SELECT says which attributes (columns) you want in the answer Note query is stated SELECT-FROM-WHERE Limiting the answers • Real scientific databases have tables with many many rows • A query which returns too many is expensive to run – And it’s hard to understand the answer • Especially when exploring the data, get a small sample first – Ask for the top few rows, perhaps with respect to some order on the selected rows • Syntax details are not standard – MySQL has “LIMIT n” clause at end of query Joins • A query can combine data from several tables SELECT observation.obs_id, galaxy.gal_ra, galaxy.gal_decl, galaxy.gtype FROM galaxy JOIN observation – use the foreign keys ON observation.gal_id = galaxy.gal_id to pick values from WHERE observation.source=‘A’; the related row From observation 1 2 3 4 14.6 11.2 14.5 11.1 5.4 4.9 4.3 4.8 13.4 9.6 13.3 9.7 2 3 2 3 From galaxy A A B C 14.7 11.2 14.7 11.2 5.3 4.8 5.3 4.8 C S C S Unstructured joins • Sometimes you want to take all combinations of rows from tables, not just the rows related through foreign key – Then filter by a condition SELECT galaxy.gal_id, observation.obs_id FROM galaxy, observation WHERE ABS(galaxy.gal_ra - observation.obs_ra)<0.2 AND ABS(galaxy.gal_decl - observation.obs_decl)<0.3; Aggregation • Combine the values from a column, among all the rows that satisfy WHERE clause – MAX, MIN, SUM, AVG, COUNT – COUNT(DISTINCT colname) is usefully different if values are duplicated in the column SELECT MAX(obs_flux) FROM observation WHERE gal_id = 3; Find the largest flux in any observation of galaxy 3 Grouping • Divide the rows into groups, based on the values in some column(s) • Aggregate the values in some other column(s), for each group • Display the grouping column along with the aggregate SELECT source, COUNT(DISTINCT obs_id) FROM observation WHERE obs_flux > 10.0 GROUP BY source; For each source, report how many observations that have high flux were made by that source. Subqueries • SQL is block-structured • A subquery can be used in the FROM clause, instead of an existing table • A subquery can be used to produce a value used in the WHERE clause – eg in a comparison SELECT obs_id FROM observation WHERE obs_flux = (SELECT MAX(obs_flux) FROM observation); Find the observation with the highest flux Creating a database • Decide on the platform to use • Decide on the schema that will represent your data • Create a database and tables • Load the data • Allow users to access it Relational DBMS platforms • Enterprise solutions (Oracle, IBM DB2, SQL Server) are very expensive – And require professional administration too • Vendors also offer light-weight cheap/free variants • Free open-source platforms: PostreSQL, MySQL, Cloudscape Table Design • One table for each type of object in the domain – Columns for each attribute of importance – Make sure there is a primary key • Invent a new identifier for this purpose if necessary • Use foreign keys to represent relationships between objects – Or have a separate “association table” where one object can be related to many, in each direction Normalisation • It is important that the schema doesn’t allow the same facts to be stored repeatedly in different rows, or different tables – Eg observation(obs_id, obs_ra,…,gal_id, gal_ra, …, gtype, …) would have same gal_ra stored in several rows of observations of this galaxy – This would risk inconsistency when updates occur • There is theory about this, but sensibly defining one table per object will avoid the problems Declaring tables • Syntax isn’t quite standard – Details of types etc • MySQL variant is CREATE TABLE galaxy ( gal_id INT UNSIGNED, gal_ra FLOAT, gal_decl FLOAT, gtype VARCHAR(3), PRIMARY KEY (gal_id) ); Indices • The performance of queries can vary greatly depending on how the data is stored in the DBMS • Having an index on a column usually speeds up queries that select rows with given values (or ranges of values) in that column – SQL has CREATE INDEX statement Permissions • Each DBMS can control which users are allowed to run different sorts of statements on each table separately • Typically, for scientific data one allows SELECT access to all users (or all registered users) – INSERT access to a few trusted users – DELETE, UPDATE only for the administrators • Usually have a separate account, so even the admin people can’t do this accidentally while acting as scientists Transactions • Nasty problems can occur if some users are modifying data while others are looking at it – Or worse, if several modify concurrently • This is serious for enterprise applications, but not common in scientific situations, as updates are rare • A transaction is a whole collection of SQL statements that should be done as a single block – Not interrupted by others, and completely undone unless the whole collection succeeds Loading data • SQL has INSERT statement which adds rows to a table INSERT INTO galaxy (gal_id, gal_ra, gal_decl, gtype) VALUES (1, 13.1, 7.2, 'S'), (2, 14.7, 5.3, 'C'), (3, 11.2, 4.8, 'S'), (4, 15.3, 4.6, 'S'); Bulk load • If data is already in a file in a wellknown format (eg csv), then DBMS has non-standard commands to import it directly into a table • MySQL LOAD DATA LOCAL INFILE “gal.csv” INTO TABLE galaxy Trade-offs Compared to writing applications against data stored directly in files, using DBMS to manage the data: • Allows users to perform unpredicted (ad-hoc) queries – without being programmers, and without knowing the structure in files • Gives better support as data schema evolves – Existing queries can continue to run • Has more predictable performance – Easier to avoid very slow execution – But you may not be able to get very fast execution • Has better security – But considerable overhead in administration