1 of 37 Lecture 6 Data Model Design (continued)
Jeffery S. Horsburgh
Hydroinformatics
Fall 2013
This work was funded by National Science Foundation Grants EPS 1135482 and EPS 1208732

2 of 37 Objectives
• Identify and describe important entities and relationships to model data
• Develop data models to represent, organize, and store data
• Design and use relational databases to organize, store, and manipulate data

3 of 37
Present and discuss your preliminary designs

4 of 37 Naming Database Objects
• Names should:
  – Be unique
  – Have some meaning to the user
  – Be short
  – Contain no spaces or reserved characters
• Entity and attribute names = nouns
• Relationship names = verbs
Example: Many Observations are made at a Site.

5 of 37 More on Attributes
• Attribute values should be atomic
  – Present a single fact
• Allows for:
  – Simpler programming
  – Greater reusability of data
  – Easier implementation of changes

6 of 37 Atomic Attribute Example
Instead of one overloaded attribute:
  VariableName = “Dissolved Oxygen, mg/L, surface water”
You might use three:
  VariableName = “Dissolved Oxygen”
  Units = “mg/L”
  SampleMedium = “surface water”

7 of 37 Common Attribute Atomicity Violations
• Simple aggregation: Address = “8200 Old Main Hill, Logan, UT, 84322”
• Complex codes: VariableCode = “DO_mgL_Avg”
• Text fields: Free-form text. Overreliance on free text may mean that the model does not meet the data requirements.
• Mixed domains: The value of an attribute can have different meanings under different conditions.

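The atomic attribute example above can be sketched in a few lines of Python. This is only an illustration: the function name split_overloaded_variable is hypothetical, and it assumes the overloaded value always has exactly three comma-separated parts in the order name, units, sample medium.

```python
def split_overloaded_variable(overloaded: str) -> dict:
    """Split an overloaded 'name, units, medium' string into atomic attributes.

    Assumes exactly three comma-separated parts (an assumption of this sketch).
    """
    parts = [part.strip() for part in overloaded.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated parts, got {len(parts)}")
    name, units, medium = parts
    return {"VariableName": name, "Units": units, "SampleMedium": medium}


record = split_overloaded_variable("Dissolved Oxygen, mg/L, surface water")
print(record)
```

Once the facts are stored separately, a query such as "all variables measured in mg/L" no longer needs string parsing. Note that the same splitting approach would fail on the Address example, where the number of parts is less predictable.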
8 of 37 Primary Keys
• Attribute or set of attributes that uniquely identifies a specific instance of an entity (a row in the table)
• Primary keys must:
  – Have a non-null value for each instance of an entity
  – Have a unique value for each instance of an entity
  – Have values that do not change or become null

9 of 37 Normalization
• Organizing the fields and tables in a relational database to minimize redundancy and dependency
  – Dividing large tables into smaller tables (with relationships)
• Isolate data so that additions, deletions, and modifications of a field or record can be made in one place
• Reduce the need for restructuring the database as new types of data are introduced

10 of 37 Unnormalized Data Example
SiteID  SiteName      VariableID  VariableName  DateTime  Value
1       Logan River   1           Temperature   1/1/2012  5
1       Logan River   1           Temperature   1/2/2012  5
1       Logan River   2           pH            1/1/2012  8
1       Logan River   2           pH            1/2/2012  8
2       Spring Creek  1           Temperature   1/1/2012  7
2       Spring Creek  1           Temperature   1/2/2012  7
2       Spring Creek  2           pH            1/1/2012  7.5
2       Spring Creek  2           pH            1/2/2012  7.5

11 of 37 Issues with Unnormalized Data
SiteID  SiteName      VariableID  VariableName  DateTime  Value
1       Logan River   1           Temperature   1/1/2012  5
1       Logan River   1           Temperature   1/2/2012  5
1       Logan River   2           pH            1/1/2012  8
1       Logan River   2           pH            1/2/2012  8
• INSERT: The fact that a site or variable exists cannot be asserted until a measurement has been made.
• DELETE: If a row is deleted, information may be lost about not only the measurement, but also the variable and the site.
• UPDATE: If a SiteName or VariableName changes, multiple records have to be updated with the new information.

12 of 37 Normalization Example
Cardinality: one site (1) to many data values (*); one variable (1) to many data values (*)

Site table:
SiteID  SiteName
1       Logan River
2       Spring Creek

Data value table:
SiteID  VariableID  DateTime  Value
1       1           1/1/2012  5
1       1           1/2/2012  5
1       2           1/1/2012  8
1       2           1/2/2012  8
2       1           1/1/2012  7
2       1           1/2/2012  7
2       2           1/1/2012  7.5
2       2           1/2/2012  7.5

Variable table:
VariableID  VariableName
1           Temperature
2           pH

13 of 37 Normalization Tradeoffs
• Pros:
  – Eliminates redundant data
  – Saves space and can improve storage efficiency
  – Inserts and updates are done in one place
  – Can improve efficiency
• Cons:
  – May complicate the code of common queries
  – Abstracts tables using keys, which can make it harder for a human to “see” the data

14 of 37 Data Integrity Rules
• Entity Integrity
  – Primary key must exist, be unique, and not be null

Site table:
SiteID  SiteName
1       Logan River
2       Spring Creek

Variable table:
VariableID  VariableName
1           Temperature
2           pH

Data value table:
ValueID  SiteID  VariableID  DateTime  Value
101      1       1           1/1/2012  5
102      1       1           1/2/2012  5
103      1       2           1/1/2012  8
104      1       2           1/2/2012  8
105      2       1           1/1/2012  7
106      2       1           1/2/2012  7
107      2       2           1/1/2012  7.5
108      2       2           1/2/2012  7.5

15 of 37 Data Integrity Rules
• Referential Integrity
  – Every foreign key value must match a primary key value in an associated table
  – Ensures that we can navigate relationships

Data value table:
ValueID  SiteID  VariableID  DateTime  Value
101      1       1           1/1/2012  5
102      1       1           1/2/2012  5
103      1       2           1/1/2012  8
104      1       2           1/2/2012  8
105      2       1           1/1/2012  7
106      2       1           1/2/2012  7
107      2       2           1/1/2012  7.5
108      2       2           1/2/2012  7.5

Site table:
SiteID  SiteName
1       Logan River
2       Spring Creek

Variable table:
VariableID  VariableName
1           Temperature
2           pH

16 of 37 Data Integrity Rules
• Insert and Delete Rules
  – What happens to a parent entity when child entities are deleted?
  – What happens to child entities when a parent is deleted?

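The entity integrity, referential integrity, and delete rules above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the table names Sites and Variables are not given on the slides, the ON DELETE CASCADE and ON DELETE RESTRICT choices are just one possible answer to the two questions on the last slide, and dates are stored as ISO 8601 text for simplicity.

```python
import sqlite3

# In-memory database. Table names Sites and Variables are assumed for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
CREATE TABLE Sites (
    SiteID   INTEGER PRIMARY KEY,   -- entity integrity: unique and never null
    SiteName TEXT NOT NULL
);
CREATE TABLE Variables (
    VariableID   INTEGER PRIMARY KEY,
    VariableName TEXT NOT NULL
);
CREATE TABLE DataValues (
    ValueID    INTEGER PRIMARY KEY,
    -- Assumed delete rule: deleting a site also deletes its data values.
    SiteID     INTEGER NOT NULL REFERENCES Sites(SiteID) ON DELETE CASCADE,
    -- Assumed delete rule: a variable that still has data values cannot be deleted.
    VariableID INTEGER NOT NULL REFERENCES Variables(VariableID) ON DELETE RESTRICT,
    DateTime   TEXT NOT NULL,       -- ISO 8601 text, an assumption of this sketch
    Value      REAL NOT NULL
);
""")

conn.executemany("INSERT INTO Sites VALUES (?, ?)",
                 [(1, "Logan River"), (2, "Spring Creek")])
conn.executemany("INSERT INTO Variables VALUES (?, ?)",
                 [(1, "Temperature"), (2, "pH")])
conn.executemany("INSERT INTO DataValues VALUES (?, ?, ?, ?, ?)", [
    (101, 1, 1, "2012-01-01", 5.0), (102, 1, 1, "2012-01-02", 5.0),
    (103, 1, 2, "2012-01-01", 8.0), (104, 1, 2, "2012-01-02", 8.0),
    (105, 2, 1, "2012-01-01", 7.0), (106, 2, 1, "2012-01-02", 7.0),
    (107, 2, 2, "2012-01-01", 7.5), (108, 2, 2, "2012-01-02", 7.5),
])


def rejected(sql: str) -> bool:
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Entity integrity: a second site with SiteID 1 is refused.
duplicate_key_rejected = rejected("INSERT INTO Sites VALUES (1, 'Duplicate Site')")

# Referential integrity: SiteID 3 does not exist, so the data value is refused.
orphan_rejected = rejected(
    "INSERT INTO DataValues VALUES (109, 3, 1, '2012-01-01', 6.0)")

# Delete rule (CASCADE): removing site 2 removes its four data values.
conn.execute("DELETE FROM Sites WHERE SiteID = 2")
remaining_values = conn.execute("SELECT COUNT(*) FROM DataValues").fetchone()[0]

# Delete rule (RESTRICT): variable 1 still has data values, so the delete is refused.
variable_delete_blocked = rejected("DELETE FROM Variables WHERE VariableID = 1")

print(duplicate_key_rejected, orphan_rejected, remaining_values, variable_delete_blocked)
```

Whether a delete should cascade or be blocked is a design decision that depends on the data. For observational data, blocking the delete is often the safer default because it prevents accidental loss of measurements.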
17 of 37 Data Integrity Rules
• Value Domains
  – Valid set of values for an attribute
  – Controlled vocabulary, data type, length, range, constraints, default value

Data value table:
ValueID  SiteID  VariableID  DateTime  Value
101      1       1           1/1/2012  5.5
102      1       1           1/2/2012  5.678
103      1       2           1/1/2012  8.0
104      1       2           1/2/2012  8.9

Variable table:
VariableID  VariableName
1           Temperature
2           pH

Figure labels: Integer Fields (ValueID, SiteID, VariableID), Controlled Domain (the VariableID / VariableName table), Date Field (DateTime), Double (Value)

18 of 37 Specialization
• Designating entity subgroups within a higher level entity
• Entity subgroups have attributes or relationships that do not apply to the higher level entity
• Attributes are inherited
  – A lower level entity inherits all of the attributes and relationship participation of the higher level entity to which it is linked

19 of 37 Specialization Example
• A car is a vehicle
• A truck is a vehicle

20 of 37 Generalization
• Combine a number of entities that share features into a higher level entity
• Specialization and generalization are inverses of each other
[Figure labels: Specialization, Generalization]

21 of 37 Constraints on Specialization/Generalization
• Constraints on which entities can be members of a given lower-level entity set
  – Condition-defined: “all vehicles with a towing capacity of more than 10,000 lbs are trucks”
• Constraints on whether entities can belong to more than one lower-level entity set
  – Disjoint: an entity can belong to only one
  – Overlapping: an entity can belong to more than one
• Completeness constraint: must every higher level entity belong to at least one lower level entity?

22 of 37 Mapping Specialization to Tables
• Option 1: Put everything in one table
• There will be NULL values where attributes don’t apply

23 of 37 Mapping Specialization to Tables
• Option 2: Form tables for the higher level entity and the lower level entities
• Each lower level entity includes the primary key of the higher level entity set

24 of 37 Mapping Specialization to Tables
• Option 3: Model only the lower level entities
• Repeats attributes

25 of 37 Steps in Data Model Design
1. Identify entities
2. Identify relationships among entities
3. Determine the cardinality and participation of relationships
4. Designate keys / identifiers for entities
5. List attributes of entities
6. Identify constraints and business rules
7. Map 1-6 to a physical implementation

26 of 37 Physical Data Model
• “Physical” means a specific implementation of the data model
  – Choice of hardware and operating system
  – Choice of relational database management system
  – Implementation of tables, relationships, constraints, triggers, indices, data types
  – Database access and security
  – Performance
  – Storage

27 of 37 Relational Database Management Systems (RDBMS)
• File vs. server based
• Free vs. commercial
• Different data types
• Potentially different syntax for SQL queries
• Security models and concurrent users

28 of 37 Reduction of an ER Diagram to Tables
• Converting an ER diagram to table format is the basis for deriving a relational database
  – Primary keys allow entities to be expressed as tables that contain data
  – A database is a collection of tables
  – Tables are assigned the same name as the entity
  – Each table has columns that correspond to attributes, and each column has a unique name
  – Each column must have a single data type

29 of 37 Advanced Database Objects
• Views
• Stored procedures
• Triggers
• Constraints
• Implementation of these objects may depend on your choice of RDBMS software

30 of 37 Database Views
• A view is equivalent to a table, but is defined by a SQL query
• Used to present a set of desired information, independent of the underlying database structure
• Can be used to hide complexities of the underlying data model from the user
  – One way to address the cons of normalization

31 of 37 Stored Procedures
• A set of Structured Query Language (SQL) statements that are stored and executed on the server
• Useful for repetitive tasks
• Encapsulate functionality and isolate users from data tables
• Can provide a security layer: software applications have no direct access to the database, but can execute stored procedures

32 of 37 Triggers
• Special kind of stored procedure
• Automatically executes on a table or view when an event occurs in the database
• Events include: CREATE, ALTER, INSERT, UPDATE, DELETE
• Mostly used to maintain the integrity of information in the database

33 of 37 Constraints
• Common way to enforce data integrity
• Examples:
  – Not NULL: value in a column must not be NULL
  – Unique: value(s) in the specified column(s) must be unique for each row in a table
  – Primary Key: value(s) in the specified column(s) must be unique for each row in the table and not be NULL
  – Foreign Key: value(s) in the specified column(s) must reference an existing record in another table via its primary key
  – Check: an expression that validates data and must not be FALSE

34 of 37 Data Types
• Each attribute of an entity (column in a database table) must have a single data type
• Data types are enforced by RDBMS software

Table: DataValues
Attribute   Data Type  Sample Data
ValueID     Integer    1
SiteID      Integer    5
VariableID  Integer    5
DateTime    Date/Time  8/15/2013 4:30 PM
DataValue   Double     4.567

35 of 37 Data Types
• Data types can be specific to RDBMS software

MS SQL Server
  Integer:        TINYINT, SMALLINT, INT, BIGINT
  Floating point: FLOAT, REAL
  Decimal:        NUMERIC, DECIMAL, SMALLMONEY, MONEY
  String:         CHAR, VARCHAR, TEXT, NCHAR, NVARCHAR, NTEXT
  Date/Time:      DATE, DATETIMEOFFSET, DATETIME2, SMALLDATETIME, DATETIME, TIME

MySQL
  Integer:        TINYINT (8-bit), SMALLINT (16-bit), MEDIUMINT (24-bit), INT (32-bit), BIGINT (64-bit)
  Floating point: FLOAT (32-bit), DOUBLE (aka REAL) (64-bit)
  Decimal:        DECIMAL
  String:         CHAR, BINARY, VARCHAR, VARBINARY, TEXT, TINYTEXT, MEDIUMTEXT, LONGTEXT
  Date/Time:      DATETIME, DATE, TIMESTAMP, YEAR

PostgreSQL
  Integer:        SMALLINT (16-bit), INTEGER (32-bit), BIGINT (64-bit)
  Floating point: REAL (32-bit), DOUBLE PRECISION (64-bit)
  Decimal:        DECIMAL, NUMERIC
  String:         CHAR, VARCHAR, TEXT
  Date/Time:      DATE, TIME (with/without TIMEZONE), TIMESTAMP (with/without TIMEZONE), INTERVAL

Quick summary from: http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems

36 of 37 Summary of 3 Levels of Data Model Design
Feature               Conceptual  Logical  Physical
Entity Names          X           X
Entity Relationships  X           X
Attributes                        X
Primary Keys                      X        X
Foreign Keys                      X        X
Table Names                                X
Column Names                               X
Column Data Types                          X
Views                                      X
Stored Procedures                          X
Triggers                                   X
Constraints                                X

37 of 37 Summary
• Simple rules for naming objects and specifying domains can help protect the integrity of data
• Normalization can help reduce redundancy, increase storage efficiency, and protect data integrity, but there are tradeoffs
• Data integrity rules include relationships and domains and protect the integrity of data in the database
• Specialization and generalization require special consideration in implementation
• A physical database implementation requires choices about hardware, software, security, formats and storage, and other factors
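As a closing illustration of the advanced database objects covered above, the following sketch shows a view, a trigger, and a CHECK constraint using Python's built-in sqlite3 module. Several things here are assumptions rather than part of the lecture: the names Sites, Variables, DataValuesFlat, ChangeLog, and LogValueUpdate are hypothetical, the CHECK range is arbitrary, and dates are stored as ISO 8601 text. SQLite does not support stored procedures, so that object type is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Sites (SiteID INTEGER PRIMARY KEY, SiteName TEXT NOT NULL UNIQUE);
CREATE TABLE Variables (VariableID INTEGER PRIMARY KEY, VariableName TEXT NOT NULL UNIQUE);
CREATE TABLE DataValues (
    ValueID    INTEGER PRIMARY KEY,
    SiteID     INTEGER NOT NULL REFERENCES Sites(SiteID),
    VariableID INTEGER NOT NULL REFERENCES Variables(VariableID),
    DateTime   TEXT NOT NULL,
    -- Hypothetical value domain, chosen only to demonstrate a CHECK constraint.
    DataValue  REAL NOT NULL CHECK (DataValue BETWEEN -1000 AND 1000)
);
-- Hypothetical audit table filled by the trigger below.
CREATE TABLE ChangeLog (ValueID INTEGER, OldValue REAL, NewValue REAL);

-- View: presents the normalized tables as one flat, human-readable table.
CREATE VIEW DataValuesFlat AS
SELECT d.ValueID, s.SiteName, v.VariableName, d.DateTime, d.DataValue
FROM DataValues AS d
JOIN Sites AS s ON s.SiteID = d.SiteID
JOIN Variables AS v ON v.VariableID = d.VariableID;

-- Trigger: runs automatically whenever a stored data value is updated.
CREATE TRIGGER LogValueUpdate AFTER UPDATE OF DataValue ON DataValues
BEGIN
    INSERT INTO ChangeLog VALUES (OLD.ValueID, OLD.DataValue, NEW.DataValue);
END;

INSERT INTO Sites VALUES (1, 'Logan River'), (2, 'Spring Creek');
INSERT INTO Variables VALUES (1, 'Temperature'), (2, 'pH');
INSERT INTO DataValues VALUES
    (101, 1, 1, '2012-01-01', 5.0),
    (103, 1, 2, '2012-01-01', 8.0),
    (105, 2, 1, '2012-01-01', 7.0);
""")

# The view hides the joins, which addresses one of the cons of normalization.
flat_rows = conn.execute(
    "SELECT SiteName, VariableName, DataValue FROM DataValuesFlat ORDER BY ValueID"
).fetchall()

# The CHECK constraint refuses a value outside the (hypothetical) valid range.
try:
    conn.execute("INSERT INTO DataValues VALUES (109, 2, 2, '2012-01-03', 5000.0)")
    out_of_range_rejected = False
except sqlite3.IntegrityError:
    out_of_range_rejected = True

# Updating a value fires the trigger, which records the old and new values.
conn.execute("UPDATE DataValues SET DataValue = 5.5 WHERE ValueID = 101")
change_log = conn.execute("SELECT ValueID, OldValue, NewValue FROM ChangeLog").fetchall()

print(flat_rows)
print(out_of_range_rejected, change_log)
```

The exact syntax for views, triggers, and constraints differs between RDBMS products, so this sketch would need adjusting for MS SQL Server, MySQL, or PostgreSQL.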