Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Microsoft Jet Database Engine wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Concurrency control wikipedia , lookup
Functional Database Model wikipedia , lookup
Relational model wikipedia , lookup
Healthcare Cost and Utilization Project wikipedia , lookup
Introduction Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems September 4, 2003 Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan Welcome! To CIS 550, officially “Database & Information Systems” in the course catalog… … A tour of the data management field… … A question for you: what does this really mean? What is this course (and the field) about? 2 What The Course Is Not… (A Few Warnings) This is not a course on Oracle or SQL It may not directly impact your marketability It’s an investigation into the principles of data management – which will improve your understanding This course will not be a cakewalk! The data management field is broad, and we’ll touch on many subjects at a rapid pace 8 homeworks, paper summaries, term project, midterm, final, … This course is not suitable for people with a limited programming background; need skills in: Algorithms & data structures Logic Programming languages Perhaps even a little complexity theory! 3 What The Course Will Do Most CS courses concentrate on code – now you’ll understand data management and representation In the end, it’s all about the data! Background in most of the important areas: Data design, modeling Understanding of DB system internals, performance Understanding of data-driven systems (e.g., web sites) An understanding of the complexities of integrating data – perhaps the biggest CS problem today Understanding of what research topics in data management are 4 Administrivia Instructor: Zachary Ives, zives@cis Levine 611 (until end of Sept.; then 5th floor GRW) Office hours: Tuesday, 3:00-4:00PM (after class) TA: Dinkar Gupta, [email protected] Office hours on Monday; time 3:00-4:00PM (office TBA) Newsgroup: upenn.cis.cis550 Home page: www.seas.upenn.edu/~zives/cis550/ Text(s): Ramakrishnan & Gerke, Database Systems, 3rd ed. Supplementary papers (to be handed out in class) Other books may be useful (see web page) 5 Course Format and Grading We’ll cover roughly one major topic area per week to two weeks Readings in the text & research papers Occasionally, summaries/commentary on papers (5%) Homework assignment for each topic area (30%) One midterm (10%), one final exam (20%) Project (30%) – groups of 3-4: Build a blogging system on top of a database, or Build a P2P data sharing system for XML data (Or propose your own idea) General participation, discussion, intangibles (5%) 6 Diving In… What is a database and a DBMS? Why do we need a DBMS? Database and data management architectures Process of building a DB DBMS components 7 What’s Data Management? In the 1960’s and early 70’s: file formats, traversals, indexes In the 1980’s (mid- to late-70’s in research) and 90’s: Separation of logical + physical data representations Well-defined general purpose, declarative data manipulation language (DML) and data definition language (DDL) Reliable, consistent storage + concurrency control Sophisticated system that takes DDL statements and knowledge of physical data representations and produces answers in “optimized” way Today: All that plus managing and manipulating data in many models and representations 8 What is a DBMS? A database (DB) is a large, integrated collection of data Generally is cohesive in “some” way A DB models a real-world organization or unit A database management system (DBMS) is a software package designed to store and manage databases Reliable storage & recovery of 100s of GB Querying/updating interface and API Support for many concurrent users 9 Connection to Other Areas of CS… Programming languages and software engineering (obviously) Algorithms (obviously) Logic, discrete math, and theory of computation Systems: concurrency, operating systems, file organization and networks, peer-to-peer, … Web (and Semantic Web), information retrieval, digital libraries, software agents, … AI planning and machine learning 10 But 80% of the World’s Data is Not in a DB! Examples: scientific data (large images, complex programs that analyze the data) personal data WWW Data management is expanding to tackle these problems Flexibility – data management imposes many constraints to make problems solvable Must deal with entities outside our control In this course, we’ll start by focusing on databases, but eventually look “outside the box” 11 Why Not “Program up” Databases As Needed? For simple (single-concept) and small databases this is often the best solution Flat files and grep get us a long way But there are limits: The structure is complicated (more than a simple table) The database gets large (e.g., bigger than RAM) Many people want to use it simultaneously Need for reliable recovery from crashes Updates generally require complete rewrite of file 12 Example: Palm-Style Calendar We might start by building a file with the following structure: Event Day When Who Where Lunch 10/24 CS123 10/25 Biking 10/26 Dinner 10/26 1pm 9am 9am 6PM Joe’s Diner Morris234 Jane’s house Café Le Boeuf Rick Dr. Egghead Jane Jane This text file is easy to deal with. So there's no need for a DBMS! Right…? 13 Problem 1: Data Organization Consider the all-important who field. Do we also want to keep e-mail addresses, telephone numbers etc.? Expand our file to look like: Event When Who-name Who-email … Who-tel …. Where Now we are keeping our address book in our calendar and doing so redundantly 14 “Link” Calendar with Address Book? Two conceptual “entities” – contact information and calendar – with a relationship between them, linking people in the calendar to their contact information This link could be based on something as simple as the person's name 15 Problem 2: Efficiency Size of personal address book is probably less than one hundred entries, but there are things we'd like to do quickly and efficiently “Give me all appointments on 10/28” “When am I next meeting Jim?” “Program” these as quickly as possible (and make them resilient to data format changes) Have these programs executed efficiently What would happen if you were using a corporate calendar with hundreds of thousands of entries? 16 Problem 3: Concurrency and Reliability Suppose other people are allowed access to your calendar and are allowed to modify it? How do we stop two people changing the file at the same time and leaving it in a physical (or logical) mess? Suppose the system crashes while we are changing the calendar. How do we recover our work? 17 Transactions Key concept for concurrency is that of a transaction : an atomic sequence of database actions (read/write) on data items (e.g. calendar entry). Key concept for recoverability is that of a log : keeping track of all actions carried out by the db. Sounds like operating systems all over again! 18 Database Architecture:The Traditional View It is common to describe databases in two ways: Logical structure: What users see. The program or query language interface. Physical structure: How files are organized. What indexing mechanisms are used. The logical level is further split into two components: Overall database design (conceptual; seen by the DB designer) Views that various users get to see 19 Three-level Architecture View 1 View 2 Schema … View N Conceptual Level Physical Level (file organization, indexing) 20 Data Independence A user of a relational database system should be able to use query the database without knowing about how the precisely how data is stored, e.g. SELECT When, Where FROM Calendar WHERE Who = "Bill" After all, you don't worry much how numbers are stored when you program some arithmetic or use a computer-based calculator 21 More on Data Independence Logical data independence Protects the user from changes in the logical structure of the data: could reorganize the calendar “schema” without changing how I query it Physical data independence Protects the user from changes in the physical structure of data: could add an index on who (or sort by when) without changing how the user would write the query, but the query would execute faster (query optimization) 22 That's the Traditional View, But ... Three-level architecture is not always achievable: when databases get big, queries must be carefully written to achieve efficiency Also, may need a 4th tier… Sometimes this is called middleware 23 Combining Databases with Mediators (a kind of middleware) “Mediated Schema” XML May need to add further layers to combine multiple databases/sources on the Web Some of these are databases over which we have no control Some must be accessed in special ways 24 Data-Driven Web Sites: Consumers of Database Output HTML view Processing “Data driven web sites” also add an HTML “presentation” layer on top of what we’ve seen 25 The Process of Building a Database Start with a conceptual model Design & implement schema Write applications using DBMS and other tools Many ways of doing this where the hard problems are taken care of by other people (DBMS, API writers, library authors, web server, etc.) 26 Conceptual Design fid PROFESSOR name Teaches STUDENT sid name COURSE Takes cid name semester 27 Designing a Schema (Set of Relations) STUDENT COURSE Takes sid name sid cid cid name sem 1 Jill 1 550-0103 550-0103 DB F03 2 Qun 1 700-1003 700-1003 AI S03 3 Nitin 3 500-0103 501-0103 Arch F03 Convert to tables + constraints Then need to do “physical” design: the layout on disk, indices, etc. PROFESSOR Teaches fid name fid cid 1 Ives 1 550-0103 2 Saul 2 700-1003 8 Roth 8 501-0103 28 Applications Use Queries in SQL Structured Query Language Based on restricted first-order logic expressions over relations Not procedural – defines constraints on the output Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine <html> <body> <!-- hypotheticalEmbeddedSQL: SELECT * FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid --> </body> </html> 29 Processing the Query Web Server / UI / etc Hash STUDENT Optimizer Takes by cid Execution Engine Merge COURSE by cid Storage Subsystem SELECT * FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid 30 DBMS in a Bit More Detail API/GUI (Simplification!) Query Optimizer Stats Physical plan Exec. Engine Catalog Schemas Data/etc Logging, recovery Requests Index/file/rec Mgr Data/etc Requests Buffer Mgr Pages Pages Storage Mgr Data Requests Storage 31 DBMSs in the Wild A huge industry for 20% of the world’s data! Big, mature relational databases IBM, Oracle, Microsoft “Middleware” above these SAP, PeopleSoft, dozens of special-purpose apps “Application servers” Integration and warehousing systems Trends: More integration; web services; XML everywhere Smarter, self-tuning systems 32 The Research World Conventional databases aren’t interesting! Understanding what’s possible to do with XML Better query processing Better languages for meta-info (e.g., constraints) Data streams Peer-to-peer Integrating data from different formats Lots of theory and systems-building 33 In this Course... Study relational databases, their design, how to query, what forms of indices to use. Beyond relational algebra: a logical model of data (Datalog), recursion XML and semi-structured data models Understanding DB internals How DBs are built Performance implications Integrating and mediating between databases (a huge problem today) 34 Questions? Dilbert, 8/9/2003 (via online archive) 35