IEM 5743 INFORMATION SYSTEMS AND TECHNOLOGY
Lecture Notes: Databases and DBMS
Week #5, 2/9/00, Part 1
Prepared by Dr. M. Kamath

Types of Data

The five primary types of data in today’s information systems are:

1. Predefined Data Items: Numerical or alphabetical items whose meaning and format are specified explicitly, for example a credit card number, a date, or a social security number. They are common in transaction-oriented systems, where programs are written based on the precise meaning and format of these data items. This dependence on a fixed format is the reason for the importance of the Y2K problem.
2. Text: A series of characters whose meaning is not important to the program. An example is a document that is interpreted by the user.
3. Images: Photographs, hand sketches, graphs, etc., which are stored, modified, and transmitted much like text.
4. Audio: Data in the form of sounds.
5. Video: Pictures and sounds displayed over time.

Logical vs. Physical Views of Data

A logical view of data expresses the way a user thinks about the data. It is usually expressed in terms of a data model, which describes business entities such as “ORDER” and “CUSTOMER” in terms of their attributes, as well as the business relationships between entities, such as “CUSTOMERs place ORDERs.”

A physical view of data is the way computers handle the data, i.e., how it is stored and retrieved. It is usually stated in terms of specific locations on storage devices plus the techniques used to access the data.

Data Access Methods

DBMSs use three common methods to store and retrieve data.

1. Sequential Access: Individual records within a file are processed in sequence, in the order in which they are stored. It is the only method available for data on tape, but it can also be used on a direct access device such as a disk. It is impractical when immediate access to data is needed.
2. Direct Access: The individual item in the file is accessed directly. The record key (an identifying attribute) is translated into a disk storage address using a hashing scheme or randomizing formula.
A collision occurs when the same physical address is generated for multiple records. An overflow area is used in this case, which adds to the access time.
3. Indexed Access: Uses a table (index) to locate the required piece of data. Often called ISAM (Indexed Sequential Access Method). Just as with a phone book index, the program uses the index to decide where to start searching for the record. Again, an overflow area is used if the space pointed to by the index is full. Database performance degrades as more data goes into the overflow area.

Transaction Processing

Transaction processing involves a lot of data manipulation. The programmer writing SQL-type code makes only logical references to the data item needed and does not specify how to find the data. For example, “Find next inventory record.” The DBMS converts the logical reference into a physical address.

Another issue in transaction processing is controlling access to data items when many transactions are occurring simultaneously. To avoid data integrity problems, DBMSs support record locking, the ability to lock a specific record temporarily so that no other process can access it until it is unlocked.

Distributed Databases

Ideally a database should exist in one location and should always be updated to assure data integrity. In practice this could result in huge data transmission costs for updates and retrieval from remote locations.

Discussion: In response to a comment from an on-campus student that transmission costs may not be significant with T1 line connections, a student at one of the remote sites gave an example in which data is transferred from remote field sites via satellite.

A distributed database is a database, parts of which exist in different locations. Inconsistency can easily arise in a distributed database. For example, when an order is shipped, both the billing data and the item availability data must be updated. If one of these updates is not made, the database is inconsistent.
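The shipping example can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than part of the lecture: the `Site` class, the `ship_order` function, the customer and item names, and the idea of simulating an unreachable location with an `online` flag. It shows how a naive, uncoordinated update leaves the billing part and the inventory part disagreeing when one location cannot be reached.

```python
class SiteUnavailable(Exception):
    """Raised when a remote part of the database cannot be reached."""


class Site:
    """One location holding part of the distributed database (hypothetical)."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.online = True

    def update(self, key, delta):
        if not self.online:
            raise SiteUnavailable(self.name)
        self.data[key] = self.data.get(key, 0) + delta


def ship_order(billing, inventory, customer, item, quantity, unit_price):
    """Naive shipment: update each site in turn, with no coordination."""
    billing.update(customer, quantity * unit_price)  # amount owed goes up
    inventory.update(item, -quantity)                # stock on hand goes down


UNIT_PRICE = 5
STARTING_STOCK = 10
billing = Site("billing", {"C100": 0})
inventory = Site("inventory", {"widget": STARTING_STOCK})

# First shipment: both sites are reachable, so both parts are updated.
ship_order(billing, inventory, "C100", "widget", 2, UNIT_PRICE)

# Second shipment: the inventory site is down, so only billing is updated.
inventory.online = False
failed_site = None
try:
    ship_order(billing, inventory, "C100", "widget", 3, UNIT_PRICE)
except SiteUnavailable as exc:
    failed_site = exc.args[0]

billed_units = billing.data["C100"] // UNIT_PRICE
shipped_units = STARTING_STOCK - inventory.data["widget"]
consistent = billed_units == shipped_units
print(billing.data, inventory.data, "consistent:", consistent)
```

After the second shipment the customer has been billed for five units while the inventory records show only two units leaving stock, which is exactly the kind of inconsistency described above.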
A technique used to maintain consistency across the databases is the two-phase commit, in which the DBMS checks the readiness of the local parts involved before executing a transaction.

Database replication is a common alternative to distributed databases. Complete or partial copies of the master database are kept at remote locations. How frequently the central master database should be updated is an issue.

In general, the tradeoffs between centralized and distributed database architectures include issues such as the cost of data transmission, the cost of synchronizing the distributed parts of the database, and the degree to which the entire database must be current at all times.

Data Warehouse

As we have seen before, there are many types of information systems, and databases are a key component of each of them. However, the nature of the information system places very different requirements on its databases. In a transaction processing system (TPS), quick access to detailed data is needed. Databases that support TPSs are called operational databases. Decision support systems, on the other hand, typically do not have real-time constraints and do not need on-line access to detailed transaction data. A different type of database, called a data warehouse, is used for this purpose.

A data warehouse is a large database that is a collection of data from smaller (operational) databases. Data warehouses are used to support decision making and are updated less frequently than transaction databases. They are structured to support fast online queries and quick summaries for managers. Data warehouses usually have a global view. A data mart is a subset of a data warehouse that provides data about a specific function or department.

Users can “drill down” into several layers of data to locate a problem or an opportunity. Queries are often multidimensional in nature. The data warehouse structure is optimized to support these types of multidimensional queries.
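As a rough illustration of what a multidimensional query and a “drill down” look like, the sketch below totals a handful of sales records along different dimensions. The records, the field names, and the `summarize` helper are all hypothetical and chosen only for this example. A real data warehouse would answer such queries inside the DBMS, not in application code.

```python
from collections import defaultdict

# Hypothetical sales records: (region, state, city, product, quarter, amount)
FIELDS = ("region", "state", "city", "product", "quarter", "amount")
SALES = [
    ("South", "OK", "Stillwater", "Pens", "Q1", 120),
    ("South", "OK", "Tulsa", "Pens", "Q1", 200),
    ("South", "TX", "Dallas", "Pens", "Q1", 300),
    ("South", "TX", "Dallas", "Paper", "Q2", 150),
    ("West", "CA", "Fresno", "Paper", "Q1", 90),
    ("West", "CA", "Fresno", "Pens", "Q2", 60),
]


def summarize(records, group_by, where=None):
    """Total the amount for each combination of the group_by fields.

    `where` is an optional dict of field -> required value used as a filter.
    """
    where = where or {}
    totals = defaultdict(int)
    for record in records:
        row = dict(zip(FIELDS, record))
        if all(row[field] == value for field, value in where.items()):
            key = tuple(row[field] for field in group_by)
            totals[key] += row["amount"]
    return dict(totals)


# Summary view: total sales per region.
by_region = summarize(SALES, ["region"])

# Drill down: look inside the South region, one level lower (by state).
south_by_state = summarize(SALES, ["state"], where={"region": "South"})

# Multidimensional view: sales by location (region) and by time (quarter).
region_by_quarter = summarize(SALES, ["region", "quarter"])

print(by_region)
print(south_by_state)
print(region_by_quarter)
```

The same helper serves all three questions because each one only changes which dimensions are grouped and which are filtered. That is the sense in which such queries are multidimensional.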
The star schema is one structure used to support multidimensional queries: it maps multidimensional DSS data into a relational database. The star schema still preserves the relational structure on which many of the operational databases are built. The basic star schema has four components: facts, dimensions, attributes, and attribute hierarchies.

Facts are numeric measurements or values that represent a specific business aspect or activity. Facts commonly used in business data analysis are units, costs, prices, and revenues. The fact table, which is the center of the star schema, contains facts that are linked through their dimensions. The fact table is updated periodically with data from operational databases.

Dimensions are qualifying characteristics that provide additional perspectives on a given fact. For instance, we might compare sales by product from region to region and from one time period to the next. Here sales has product, location, and time dimensions. Such dimensions are normally stored in dimension tables.

Each dimension table contains attributes. Attributes are often used to search, filter, and classify facts. For the sales example we have:

Product dimension: product ID, description, product type, manufacturer
Location dimension: region, state, city, and store
Time dimension: year, quarter, month, week, and date

Attributes within dimensions can be ordered in a well-defined attribute hierarchy, e.g., region, state, city, and store in the location dimension. An attribute hierarchy allows the user to perform “drill-down” and “roll-up” searches.

Sales Star Schema

SALES (fact table): Time_ID, Location_ID, Product_ID, Quantity, Price, Amount
LOCATION: Location_ID, Description, Region_ID, State, City
TIME: Time_ID, Year, Quarter, Month, Week, Day
PRODUCT: Product_ID, Description, Brand, Color, Size, Package

We will cover more on data warehouses under “data management.”

References

Alter, S. (1999), Information Systems: A Management Perspective, Third Edition, Prentice-Hall, Inc., New Jersey. (Chapter 4, pp. 114-139)
Gupta, U. (2000), Information Systems Success in the 21st Century, Prentice-Hall, Inc., New Jersey. (Chapter 5)
Rob, P. and C. Coronel (1997), Database Systems: Design, Implementation, and Management, Course Technology, Cambridge, MA.
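As a supplement to the Sales star schema in the Data Warehouse section, the following sketch builds the four tables in an in-memory SQLite database and runs one roll-up query and one drill-down query. The column names follow the schema in these notes. The table names `time_dim`, `location_dim`, and `product_dim`, the column types, and all of the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables surround the central SALES fact table.
# Table names carry a _dim suffix here (an assumption, not from the notes).
conn.executescript("""
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, description TEXT,
                           region_id TEXT, state TEXT, city TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER,
                       quarter INTEGER, month INTEGER, week INTEGER,
                       day INTEGER);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, description TEXT,
                          brand TEXT, color TEXT, size TEXT, package TEXT);
CREATE TABLE sales (time_id INTEGER REFERENCES time_dim(time_id),
                    location_id INTEGER REFERENCES location_dim(location_id),
                    product_id INTEGER REFERENCES product_dim(product_id),
                    quantity INTEGER, price REAL, amount REAL);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "Store 1", "South", "OK", "Stillwater"),
    (2, "Store 2", "South", "TX", "Dallas"),
    (3, "Store 3", "West", "CA", "Fresno"),
])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 2000, 1, 1, 2, 10),
    (2, 2000, 2, 4, 15, 12),
])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Pens", "Acme", "Blue", "Medium", "Box"),
    (2, "Paper", "Acme", "White", "Letter", "Ream"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 2.0, 20.0),
    (1, 2, 1, 5, 2.0, 10.0),
    (2, 2, 2, 3, 4.0, 12.0),
    (1, 3, 2, 2, 4.0, 8.0),
    (2, 3, 1, 4, 2.0, 8.0),
])

# Roll-up: total sales amount by region and quarter.
rollup = conn.execute("""
    SELECT l.region_id, t.quarter, SUM(s.amount)
    FROM sales s
    JOIN location_dim l ON l.location_id = s.location_id
    JOIN time_dim t ON t.time_id = s.time_id
    GROUP BY l.region_id, t.quarter
    ORDER BY l.region_id, t.quarter
""").fetchall()

# Drill-down: move one level down the location hierarchy within one region.
drilldown = conn.execute("""
    SELECT l.state, SUM(s.amount)
    FROM sales s
    JOIN location_dim l ON l.location_id = s.location_id
    WHERE l.region_id = ?
    GROUP BY l.state
    ORDER BY l.state
""", ("South",)).fetchall()

print(rollup)
print(drilldown)
```

Each query joins the fact table to only the dimension tables it needs, and the attribute hierarchy (region, then state) determines which column is grouped at each level.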