* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Views
Survey
Document related concepts
Transcript
Introduction to Database Systems 1 © Michael Lang, National University of Ireland, Galway Today’s session • Some history • Origins of DB • Organisational structure • Organisational information systems • Data storage • DBMS • Data models 2 © Michael Lang, National University of Ireland, Galway What is “Data”? • ANSI definition: Data A representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means. Any representation such as characters or analog quantities to which meaning is or might be assigned. Generally, we perform operations on data or data items to supply some information about an entity. • Volatile vs. persistent data 3 Our concern is primarily with persistent data © Michael Lang, National University of Ireland, Galway Early Data Management Ancient History • Data are not stored on disk • Programmer defines both logical data structure and physical structure (storage structure, access methods, I/O modes, etc) • One data set per program. High data redundancy. PROGRAM 1 Data Management DATA SET 1 PROGRAM 2 Data Management DATA SET 2 PROGRAM 3 Data Management 4 DATA SET 3 © Michael Lang, National University of Ireland, Galway Problems • There is no persistence. All data is transient and disappears when the program terminates. • Random access memory (RAM) is expensive and limited All data may not fit available memory • Programmer productivity low 5 The programmer has to do a lot of tedious work. © Michael Lang, National University of Ireland, Galway • File Processing Recent (and Current) History Data are stored in files with interface between programs and files. • Various access methods exist (e.g., sequential, indexed, random) • One file corresponds to one or several programs. Data Management PROGRAM 2 Data Management PROGRAM 3 FILE 1 File System Services FILE 2 Redundant Data PROGRAM 1 Data Management 6 © Michael Lang, National University of Ireland, Galway File System Functions • Mapping between logical files and physical files Logical files: a file viewed by users and programs. • Data may be viewed as a collection of bytes or as a collection of records (collection of bytes with a particular structure) • Programs manipulate logical files Physical files: a file as it actually exists on a storage device. • Data usually viewed as a collection of bytes located at a physical address on the device • Operating systems manipulate physical files. • A set of services and an interface (usually called application independent interface – API) 7 © Michael Lang, National University of Ireland, Galway Problems With File Systems • Data are highly redundant sharing limited and at the file level • Data is unstructured “flat” files • High maintenance costs data dependence; accessing data is difficult ensuring data consistency and controlling access to data • Sharing granularity is very coarse • Difficulties in developing new applications 8 © Michael Lang, National University of Ireland, Galway Database Approach PROGRAM 1 DBMS Query Processor PROGRAM 1 Transaction Mgr … Integrated Database PROGRAM 2 9 © Michael Lang, National University of Ireland, Galway Origins of Databases • c1455: Gutenberg invents printing. Explosive interest in publication of books (analogous with explosive growth of Web in early 1990s) leads to public libraries • Libraries were first to introduce standards for information storage and retrieval • These paper-based systems were extended and enhanced, and filing, indexing and classification schemes were developed 10 • Second World War: accelerated R&D in computing technologies spawned capability to computerise maintenance of records © Michael Lang, National University of Ireland, Galway Computerised Data Storage • Advantages of computerised data storage over paper-based systems include: 11 ability to store data compactly (e.g Britannica CD) enhanced data retrieval ability to access data remotely, e.g. from a mobile workstation, off-site location, or distant branch ability to share data amongst multiple users with concurrent access facility to automate regular, speedy back-ups enhanced data editing • Most significant disadvantage is vulnerability e.g system crash, corrupted data, viruses, hackers, etc. © Michael Lang, National University of Ireland, Galway Information: A Corporate Asset • Information is a vital corporate asset. Without accurate, current, relevant information, mistakes and misjudgements may be made • Data management is an essential capability in the modern business environment / information society • A knowledge organisation is one in which the primary asset is information; its competitive advantage is derived from effective use of documented knowledge. Examples: accounting firms, marketing companies, software houses 12 • Organisational memory extends and amplifies information / knowledge by capturing, organising, disseminating, and reusing it © Michael Lang, National University of Ireland, Galway Organisational Memory Organisational Memory Database (logical grouping of related files) File / Table (collection of related records) Record (collection of related fields, bound together as single units) Field (part of a record reserved for a particular data item) Byte (group of eight binary bits - can represent a single character) 13 © Michael Lang, National University of Ireland, Galway Organisational Memory Organisational memory Data Informed decisions Improved products and services 14 © Michael Lang, National University of Ireland, Galway Attributes of Organisational Memory • Current • Timely real-time systems are commonplace in modern business environment • Relevant data is only useful if relevant to task in hand • Shareable • Complete • Accurate and consistent 15 © Michael Lang, National University of Ireland, Galway Attributes of Organisational Memory • Transportable authorised personnel should have access to data anywhere, anytime • Secure against unauthorised access 16 © Michael Lang, National University of Ireland, Galway Organisational Information Systems • Generally, there is an alignment between business units and core operational systems • Typical core systems are: 17 Sales & Marketing Department: Customer management system, Order processing system Operations Unit: Purchasing system, Inventory control system Finance Department: Accounts payable and receivable systems, Credit Management system • There are interdependencies between these systems; hence the need for an integrated data management approach © Michael Lang, National University of Ireland, Galway Data Storage • Information systems create, read, update and delete data • Data can be stored in conventional files or databases Files are collections of similar records Databases are collections of interrelated files. • Records can be linked through specified relationships to records in other files 18 © Michael Lang, National University of Ireland, Galway Conventional Files • In the file environment, data storage is built around the applications that will use the files • Essentially, the file “belongs” to a specific application. This is termed program-data dependence File 19 • As applications are developed, customised files are created which may be unusable by other applications File Information System File © Michael Lang, National University of Ireland, Galway File Conventional Files • First attempts at computerised storage of records followed traditional paper-based metaphors (“Flat file” systems) • Flat files were inefficient for data retrieval: it might be necessary to search entire file for a record (which, it may transpire, does not exist). Remedy: index files • Indexing improved data retrieval, but conventional files have other disadvantages: 20 Program-Data dependence Proprietary file formats (closed systems) Poor scalability © Michael Lang, National University of Ireland, Galway Conventional Files • Disadvantages (Cont’d): Duplicated and redundant data • ambiguity: same thing being referred to by different names in different places • inconsistency: conflicting / unsynchronised data • wasted effort Separation and isolation of data • data dispersed amongst many files complicates processing Inflexibility • cumbersome data structures and report layouts • not responsive to ad hoc queries • excessive program maintenance 21 Development environment • procedural -v- non-procedural (3GL -v- 4GL) © Michael Lang, National University of Ireland, Galway Conventional Files • Advantages: Historically, conventional files have been faster to process than DBMS applications • As legacy file-based systems become candidates for reengineering, the trend is to replace them with database systems 22 © Michael Lang, National University of Ireland, Galway Database Management Systems • A database is a large, integrated collection of data which models a real-world enterprise Information System • A Database Management System (DBMS) is a software package designed to store and manage databases 23 • In a DBMS environment, applications are built around an integrated adaptable database Information System Database Information System Information System © Michael Lang, National University of Ireland, Galway Database Management Systems • Advantages: 24 ability to share the same data across multiple applications and systems data independence control of redundancy enforced data integrity improved data security uniform data administration concurrent access improved backup and recovery facility flexible data structures © Michael Lang, National University of Ireland, Galway Database Management Systems • Advantages (Cont’d): 25 databases allow the use of the data in ways not originally specified by the end-users (ad hoc queries) database definition can be extended without impacting existing programs that use it economies of scale © Michael Lang, National University of Ireland, Galway Database Management Systems • Disadvantages: database technology is more complex than file technology • requires more sophisticated hardware and software (DBMS) DBMS’s can still be slower than file-based systems database technology requires a significant investment • database administration • operating costs and ongoing maintenance • end-user training 26 higher impact of system failure © Michael Lang, National University of Ireland, Galway Database Management Systems • Roles in a DBMS environment 27 Data Administrator Database Administrator Database Designer Application Programmer End-User © Michael Lang, National University of Ireland, Galway Database Architectures • Hierarchical Data Model • Network Data Model • Relational Data Model • Object-Relational Model 28 © Michael Lang, National University of Ireland, Galway Data Relationships • One-to-One (1:1) Example: Bank Manager manages one and only one Bank Branch; Bank Branch is managed by one and only one Bank Manager • One-to-Many (1:M) Example: An Employee works in a designated Department; in any Department, there may be many Employees • Many-to-Many (M:M) 29 Example: A Student registers for one or more Courses; for any Course, there may be one or more registered Students © Michael Lang, National University of Ireland, Galway Hierarchical Data Model 30 © Michael Lang, National University of Ireland, Galway Hierarchical Data Model • Arose as a solution to a practical problem Managing millions of parts for the space program (standard “Bill of Materials” problem) • The basic structure is a hierarchy or tree Parent-child relationship • Relationships are represented by pointers • Restrictions: 31 Each segment has at most one parent All relationships are 1:M © Michael Lang, National University of Ireland, Galway Hierarchical Data Model • Problem: how to represent a M:M relationship ? 32 A hierarchical structure (tree) can only support 1:M relationships. Therefore, to represent M:M, must create multiple hierarchies … but, this means that records are duplicated in different hierarchies duplication gives rise to data anomalies duplication can be eliminated using pointers; must decide in which table to store data, and in which table to store pointer. This is a very awkward means of implementing M:M © Michael Lang, National University of Ireland, Galway Network Data Model 33 © Michael Lang, National University of Ireland, Galway Network Data Model • Objective was to overcome shortcomings of the hierarchical data model, in particular, representation of M:M relationship • Like the hierarchical model, it can be likened to trees; unlike hierarchical model, several trees can share branches • In practice, enjoyed little commercial success Too complex: suited to use by programmers as opposed to end-users 34 Overtaken by the relational model No clear theoretical base © Michael Lang, National University of Ireland, Galway Network Model: Structures • Data item a field or attribute • Record • • • • a collection of data items Vectors (repeating structures) are permitted Relationships are represented by sets Sets have owners (parent) and members (children) A member cannot have two parents in the one set cannot directly represent M:M relationships • Member records of a set can be ordered 35 © Michael Lang, National University of Ireland, Galway Shortcomings of Early Models • Languages of both hierarchical and network are procedural and record-at-a-time • To retrieve data … you must navigate (find a path) to the required record issue multiple statements directing the system to traverse that path • It was necessary to issue multiple requests to the DBMS to retrieve a data item • It was necessary to have a detailed understanding of how data was stored and structured 36 this is contrary to data independence principle © Michael Lang, National University of Ireland, Galway Relational Data Model • Relational model, developed by E. F. Codd, has a strong theoretical basis and overcomes shortcomings of network / hierarchical models • Relational model is non-procedural (What?, not How?), and set-at-a-time • Navigation is automatic • Relations 2-dimensional data set consisting of N columns (fields / attributes) and M rows (records) • All rows in a relation are unique • A relational database is a set of relations 37 © Michael Lang, National University of Ireland, Galway Relational Model: Structures 38 © Michael Lang, National University of Ireland, Galway Relational Model: Structures 39 © Michael Lang, National University of Ireland, Galway Relational Model: Structures • Primary key A unique identifier of a row in a relation; can be composite • Candidate key An attribute that could be a primary key • Alternative key A candidate key that is not selected as the primary key • Foreign key An attribute of a relation that is the primary key of another relation; can be composite 40 © Michael Lang, National University of Ireland, Galway Relational Algebra and SQL • A major strength of the relational model is that it supports SQL, a flexible data retrieval language which facilitates ad hoc queries • Relational algebra is a standard for judging a data retrieval language Relational Algebra Restrict A where condition SELECT * FROM A WHERE condition Project Product A [X] A times B SELECT X FROM A SELECT * FROM A, B Union A union B SELECT * FROM A UNION SELECT * FROM B Difference 41 SQL A minus B SELECT * FROM A WHERE NOT EXISTS (SELECT * FROM B WHERE A.X = B.X AND A.Y=B.Y AND ... © Michael Lang, National University of Ireland, Galway Relational Databases • A truly relational database supports structures (domains and relations) integrity rules a manipulation language • The word “relational” is sometimes used too freely; many commercial systems are not fully relational because they do not support domains and integrity rules • E. F. Codd has set forward 12 rules that a database must satisfy before it can be said to be truly relational 42 © Michael Lang, National University of Ireland, Galway Codd’s 12 Rules • Information representation All data, including metadata (data definition, constraints, user names, etc.), is represented solely and explicitly as values in a table • Guaranteed access Every value in a database is accessed by specifying a combination of table name, column name, and value of the primary key of the row in which it is stored No artificial paths, such as linked lists • Systematic treatment of null values 43 There must be a distinct representation for unknown data, irrespective of data type Null values are not equivalent to zero or the empty string © Michael Lang, National University of Ireland, Galway Codd’s 12 Rules • Dynamic on-line catalog based on the relational model database description is represented at the logical level in the same way as ordinary data thus, only one manipulation language needs to be learned • Comprehensive data sublanguage rule A relational system may support several languages and various modes of terminal use However, there must be at least one language that supports data definition, data manipulation, security and integrity constraints, and transaction processing operations • View updating 44 If the base tables of a view are updated, then the view itself should be updated © Michael Lang, National University of Ireland, Galway Codd’s 12 Rules • High-level insert, update, and delete The system must support set-at-a-time operations; for example, a set of rows can be deleted by a single statement • Physical data independence Changes to storage representation or access methods will not affect application programs • Logical data independence 45 Implementation of changes to base tables will not affect application programs; for example, if a table is restructured or split, applications should be immune to change (views are beneficial here) © Michael Lang, National University of Ireland, Galway Codd’s 12 Rules • Integrity independence Integrity constraints should be part of a database's definition rather than embedded within application programs It must be possible to change integrity constraints without affecting any existing application programs • Distribution independence Introduction of a distributed DBMS or redistributing existing distributed data should have no impact on existing applications • Nonsubversion 46 It must not be possible to use low-level record-at-a-time interface to by-pass high-level set-at-a-time security or integrity constraints © Michael Lang, National University of Ireland, Galway Codd’s Rule 0 • A relational DBMS must be able to manage databases entirely through its relational capacities • A DBMS is either totally relational or it is not relational 47 © Michael Lang, National University of Ireland, Galway Tasks in the seminars 1 - 3 • In order to do MMDB you need to be able to use SQL – self study sessions with tutor support 48 © Michael Lang, National University of Ireland, Galway