Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
IT For Engineers Entity-Relationship Diagrams INFO 203 Dr. Jennifer Booker INFO 203 Week #4 1 Databases Storing and manipulating data are the heart of an information system Data must be INFO 203 Available when needed Accurate Consistent in its definition Efficiently stored and retrieved Presented in some useful form for planning, making decisions, etc. Week #4 2 Databases There are two approaches for database design – flat file and relational A flat (or ‘conventional’) file database keeps separate files for each application A relational database stores data in a single structure for use by many applications INFO 203 COBOL systems, and FileMaker Pro Like using Microsoft Excel as a database Every major database system; Oracle, DB2, Access, MySQL, MS SQL Server, Informix, Sybase, etc. Week #4 3 Databases Flat file systems are, by definition, redundant Data is copied to each report table only when the report is generated, so later changes make the table obsolete Relational databases keep the connections (relationships) to each set of data INFO 203 Each record exists once in the system Is much more flexible to generate custom views of the data, without changing its structure Week #4 4 Reality, Data, Metadata A database models some collection of data which appears in reality Reality is the thing being modeled Data is used to describe the key attributes of that thing INFO 203 About what aspects of that thing do we wish to keep track? Metadata (data about data) describes the characteristics of the data Week #4 5 Data Modeling Data modeling (or database or information modeling) is a way of organizing and describing the data in a system It is a logical model to describe the specific data fields (elements) we wish to capture, and how they are related to each other INFO 203 Week #4 6 Where to start? Recall that data modeling starts with thinking about the things involved in your system These things are formally called “entities” – nouns, if you will Start by identifying all of the places, people, events, and ideas which are affected by your system INFO 203 Week #4 7 Permanent vs. Transient Data A key for relational data modeling is that we are primarily concerned with data we need to keep permanently Data which is only needed briefly isn’t modeled in an ERD INFO 203 That’s a major difference between relational and object-oriented analysis Week #4 8 Lots of Types of Files Master files contain permanent information Transaction files describe business events Document files contain copies of historic data Archival files contain master and transaction file records offline, which are no longer needed Table look-up files contain static data for reference calculations (tax tables, Zip codes) Audit files record changes to other files INFO 203 Week #4 9 Characterize Entities Then examine each entity and determine the attributes which you are interested in – what data do you need to know in order to describe one such entity meaningfully? Consider if some attributes can be readily grouped together, thereby forming compound attributes (e.g. name) INFO 203 Week #4 10 Characterize Entities Entities are generally one of two types: A set of data you want to keep permanently (customer orders, product information, etc.), or A lookup list or table (types of status codes, shipping rates, tax rates, etc.) Data which is transient is generally kept in local variables, and doesn’t appear in an ERD (e.g. change of address info) INFO 203 Week #4 11 Keep it or not? In trying to decide if data needs to be kept, consider whether someone might want to analyze that data in the future For examples, to look for sales patterns, trace relocation history, keep record of data changes (who modified what data and when?) When in doubt, keep it for now INFO 203 Week #4 12 Attributive Entities An entity which depends on the existence of another entity, but only indicates multiple examples of that ‘parent’ entity, is an attributive entity INFO 203 Hence it’s referred to as describing ‘repeating groups of data’, or they are ‘parent/child’ entities Shown using part of an oval inside the entity box (p. 365) Week #4 13 Database Notation In week 1 we covered the basic notation for cardinality Only one Zero or one One or many Zero, one, or many Strictly many Now we want to put all the two-entity pairs together into a single diagram INFO 203 Week #4 14 ERD The Entity-Relationship Diagram (ERD) captures all the entities for our system, and shows the relationships among them (the lines connecting them) (e.g. p. 366) The relationships all need to have A verb phrase in at least one direction Cardinality at both ends Now we can add attributes and keys INFO 203 Week #4 15 Attributes Each entity is almost certain to have at least two attributes (often more) At least one of them is a unique identifier for each record in the entity, called the primary key Other attributes describe significant characteristics of the entity INFO 203 So, about what aspects of the entity do you want to retain knowledge? What is relevant for this system? Attributes are also called fields, data items, etc. Week #4 16 Characterize Attributes For each attribute, define its data type: Text (“Fred”) [and the character set (Latin)] Number (real (3.56) or integer (124)) Date and/or time Yes/No (a.k.a. T/F, binary, or Boolean) A fixed set of possible values (e.g. grades) INFO 203 Formally known as an enumerated list Multimedia: photos, drawings, movies, sounds Week #4 17 Attributes & Records Attributes can be fixed or variable length Length of integer (8-64 bits, in powers of 2), Lengths of text fields (the number of characters or bytes it contains) Length of real number fields and the number of decimals (8.1 only has six digits, -12345.6) We won’t worry about attribute lengths A record is a set of attributes from one entity INFO 203 Typically corresponding to one row in a data table Week #4 18 Relevant Data Type Standards Character sets ISO/IEC 8859 – a set of ten 8-bit character sets, each used for a different family of languages (Latin, Cyrillic, Greek, Arabic, Hebrew, etc.) Unicode – a 32-bit character set to represent almost all languages Representation of dates and times INFO 203 ISO 8601 Week #4 19 Characterize Attributes Other traits to look for, beyond the scope of this course: Identify the domain of each attribute What is the range of allowable values? Determine if there is a default value for each attribute Is each attribute mandatory (required) for each entity? (Avoid many mandatory fields) Ask yourself: could an attribute be a key? INFO 203 Week #4 20 Key Attributes An attribute or group of attributes may be a unique identifier, or key, for each entity Examples are Social Security Number, driver’s license number, ISBN, Student ID If a group of attributes is used, it is a concatenated (or composite or compound) key INFO 203 {Course number, section number, and term} could form a concatenated key Week #4 21 Many Keys Possible There might be more than one usable key for an entity Each possible key is called a candidate key One candidate key is selected to be the primary key (PK) All others are alternate keys (AK) INFO 203 Example: the electric company may use a customer ID or account number as a primary key, and your phone number or email address as an alternate key Week #4 22 Primary Key may be Meaningless A primary key may correspond to some recognizable attribute Or it may be completely meaningless SSN, student ID, ISBN, etc. A sequential number, called Order_ID, or customer ID, or account number As long as the primary key is unique for every record, either kind is acceptable INFO 203 Week #4 23 Foreign Keys A foreign key is an attribute which establishes the connection between two entities A Sales entity might have a foreign key attribute Customer_ID, which points to all of the data in the Customer entity for the particular customer who placed a given sale If that were the case, the primary key in the Customer entity must be Customer_ID This type of connection is the heart of relational database modeling (!) INFO 203 Week #4 24 Foreign Keys A foreign key (FK) is an attribute which exists, in an entity other than where it is a primary key (PK), to establish the relationship between the two entities INFO 203 Primary key must be unique for each record, but a foreign key value may appear many times Only one PK-FK connection is used for the relationship to exist The entity with FK generally has a PK of its own Week #4 25 Other Key Traits A PK attribute may also be a FK Especially for 1:1 relationships An associative entity builds a concatenated primary key from more than one entity INFO 203 Uses a diamond shape inside the normal box to show its special nature Week #4 26 Other Relationships A many-to-many (non-specific) relationship implies a lot of one-to-many relationships Often use an associative entity to bridge between them An identifying relationship is when a parent entity’s PK is used as part of the PK for a child entity INFO 203 Child entity is then considered “weak” because it depends on the parent Week #4 27 Summary of Key Traits The bottom line for keys is: Each entity must have at least one PK INFO 203 More than one PK implies a concatenated key Alternate keys are completely optional Each entity may have from zero to many FK’s Each FK is a PK in another, related entity Only one PK-FK relationship is needed to relate two entities Some keys are not inherently meaningful data Week #4 28 Data Normalization Analysis of a data model for implementation is done using data normalization Normalization organizes data attributes to form simple, non-redundant, flexible, adaptive entities There are five levels of data normalization, of which three are ever used INFO 203 They build on each other; so to get to second level you have to comply with first level also Week #4 29 First Normal Form (1NF) An entity is in first normal form if there are no attributes which can have more than one value for each instance (record) of the entity Attributes which could have more than one value for a given entity belong to a different kind of entity In other words, every attribute appears only once for each record INFO 203 Week #4 30 Second Normal Form (2NF) Look at concatenated keys only! Must be first normal form, and: Each non-primary-key attribute is uniquely determined by the entire primary key Non-primary-key attributes may not be dependent on only part of the primary key If any are, move them to another table which uses only that part of the primary key If there are no concatenated keys, second normal form comes automatically with 1NF INFO 203 Week #4 31 Third Normal Form (3NF) Must be second normal form, and: The value of each non-primary-key attribute is not dependent upon any other non-primary-key attribute Every attribute depends only on the primary key The two ways to look for this are derived attributes and transitive dependencies... INFO 203 Week #4 32 Third Normal Form (3NF) Derived attributes (data) are fields calculated or logically derived from other fields Exception: OK to keep attribute if multiple entities are involved in deriving an attribute Transitive dependencies may exist for non-concatenated keyed tables; is when a non-key attribute depends on another non-key attribute INFO 203 Week #4 33 Third Normal Form (3NF) Or in brief, for third normal form… An entity is in third normal form if every non-primary key attribute is dependent on the primary key, the whole primary key, and nothing but the primary key (as in, “Do you swear to tell the truth…”) INFO 203 Week #4 34 Further Normalization Additional improvement in data structure is possible through “Simplification by Inspection” - look for other redundancies or simplifications possible Many CASE tools can also inspect for first level normalization, but generally no further Just for the record, here are the 4th and 5th normal forms INFO 203 I’ve never seen them used in industry Week #4 35 Fourth and Fifth Normal Forms Fourth normal form (4NF) involves removing multivalued dependencies If a pair of records has two matching attributes, decompose the data structure to remove that Fifth normal form (5NF) involves removing join dependencies (nearly impossible to do) INFO 203 This is when business rules define a connection among many entities (e.g. if you replace a tire, you must also replace the valve stem) Week #4 36 Other database concepts Integrity constraints keep the database consistent when records are changed or deleted Anomalies can occur when creating new database tables We won’t go into those in detail You can forget about 4NF and 5NF, too INFO 203 Week #4 37 Data Warehouses A data warehouse is a set of databases that organize data around subjects, generally drawing from databases that were created independently INFO 203 Data is stored in summary form not raw records Data are accessed via data mining applications (siftware?) to look for hidden patterns May answer complex questions using online analytical processing (OLAP) Week #4 38 Business Intelligence (BI) BI is a decision support system (DSS) that draws from data warehouses or big data sources INFO 203 May need to use text analytics to extract information from unstructured data (not in a data warehouse) Much of the challenge for these systems is designing the user interfaces Week #4 39 Summary: How to create an ERD Define the entities about which you’ll want to keep information Define the relationships between those entities Define the attributes for each entity INFO 203 Must have a verb phrase and both cardinalities At a minimum, define the name and data type of each attribute Identify a primary key for each entity Identify foreign keys as needed Normalize to third normal form (and fix keys as needed) Week #4 40