IEM 5743
INFORMATION SYSTEMS AND TECHNOLOGY
Lecture Notes
Databases and DBMS
Week #5, 2/9/00, Part 1
Prepared by Dr. M. Kamath
Types of Data
The five primary types of data in today’s information systems are:
1. Predefined Data Items: Numerical/alphabetical items whose meaning and format are
specified explicitly. For example, credit card number, date, and social security
number. These are common in transaction-oriented systems, where programs are written
based on the precise meaning and format of the data items. This dependence on a fixed
format is why the Y2K problem matters.
2. Text: A series of characters whose meaning is not important to the program. An
example is a document that is interpreted by the user.
3. Images: Photographs, hand sketches, graphs, etc. which are stored, modified and
transmitted much like text.
4. Audio: Data in the form of sounds.
5. Video: Pictures and sounds displayed over time.
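To illustrate why programs depend on the exact format of predefined data items, here is a minimal Python sketch. The `YYMMDD` and `YYYYMMDD` field layouts and both function names are assumptions made for illustration; they are not from the course material.

```python
def years_between_2digit(start_yymmdd: str, end_yymmdd: str) -> int:
    """Year difference using only a two-digit year field (the Y2K flaw)."""
    return int(end_yymmdd[:2]) - int(start_yymmdd[:2])


def years_between_4digit(start_yyyymmdd: str, end_yyyymmdd: str) -> int:
    """Same calculation with an unambiguous four-digit year field."""
    return int(end_yyyymmdd[:4]) - int(start_yyyymmdd[:4])


# An account opened in January 1995 and checked in January 2000.
print(years_between_2digit("950115", "000115"))      # -95
print(years_between_4digit("19950115", "20000115"))  # 5
```

The program logic is identical in both functions; only the predefined format of the date item changes the answer.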
Logical Vs Physical Views of Data
A logical view of data expresses the way a user thinks about the data. It is usually expressed in
terms of a data model, which describes business entities such as “ORDER” and “CUSTOMER”
in terms of their attributes, as well as the business relationships between entities, such as
“CUSTOMERs place ORDERs.”
A physical view of data is the way computers handle the data, i.e., the storage and retrieval of it.
It is usually stated in terms of specific locations on storage devices plus the techniques used to access them.
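The difference between the two views can be sketched in a few lines of Python. The `Customer` class, the 24-byte record layout, and the helper names below are hypothetical choices for illustration only; a real DBMS uses far more elaborate storage structures.

```python
import struct
from dataclasses import dataclass


@dataclass
class Customer:
    """Logical view: a business entity described by its attributes."""
    customer_id: int
    name: str


# Physical view (assumed layout): each record is a fixed 24-byte slot,
# a 4-byte integer ID followed by a 20-byte name field.
RECORD_FORMAT = "<i20s"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def write_records(customers):
    """Pack customers one after another, as they might sit on a disk."""
    return b"".join(
        struct.pack(RECORD_FORMAT, c.customer_id, c.name.encode("ascii"))
        for c in customers
    )


def read_record(storage: bytes, slot: int) -> Customer:
    """Fetch the record in a given slot by computing its byte offset."""
    offset = slot * RECORD_SIZE
    cid, raw_name = struct.unpack_from(RECORD_FORMAT, storage, offset)
    return Customer(cid, raw_name.rstrip(b"\x00").decode("ascii"))


storage = write_records([Customer(101, "Ada"), Customer(102, "Grace")])
print(RECORD_SIZE)              # 24
print(read_record(storage, 1))  # Customer(customer_id=102, name='Grace')
```

The user of `Customer` never sees offsets or byte counts; those belong to the physical view.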
Data Access Methods
DBMSs use three common methods to store and retrieve data.
1. Sequential Access: Individual records within a file are processed in sequence. Data
are processed in the order they are stored. It is the only method for data on tape, but
can also be used on a direct access device such as a disk. Impractical when
immediate access of data is needed.
2. Direct Access: An individual item in the file is accessed directly. The record key (an
identifying attribute) is translated into a disk storage address using a hashing scheme
or randomizing formula. A collision occurs when the same physical address is
generated for multiple records. An overflow area is used in this case, which adds to
the access time.
3. Indexed Access: Uses a table (index) to locate the required piece of data. Often
called ISAM (Indexed Sequential Access Method). Just like a phone book index, the
program uses the index to decide where to start searching for the record. Again, an
overflow area is used if the space pointed to by the index is full. Database
performance degrades as more data goes into the overflow area.
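The direct and indexed methods above can be sketched in Python as follows. This is a simplified in-memory model, not how any particular DBMS implements them: the slot count, the `key % NUM_SLOTS` randomizing formula, and the names `DirectFile` and `indexed_lookup` are all assumptions for illustration.

```python
import bisect

NUM_SLOTS = 5  # size of the primary storage area (assumed)


class DirectFile:
    """Direct access: the record key is hashed to a slot address."""

    def __init__(self):
        self.slots = [None] * NUM_SLOTS  # primary area
        self.overflow = []               # overflow area for collisions

    def address(self, key: int) -> int:
        return key % NUM_SLOTS           # simple randomizing formula

    def store(self, key: int, record: str) -> None:
        addr = self.address(key)
        if self.slots[addr] is None:
            self.slots[addr] = (key, record)
        else:                            # collision: slot already taken
            self.overflow.append((key, record))

    def fetch(self, key: int):
        """Return (record, number of locations examined)."""
        slot = self.slots[self.address(key)]
        if slot is not None and slot[0] == key:
            return slot[1], 1
        for probes, (k, rec) in enumerate(self.overflow, start=2):
            if k == key:
                return rec, probes
        return None, 1 + len(self.overflow)


def indexed_lookup(index_keys, blocks, key):
    """Indexed access: the index lists the first key of each sorted block."""
    block_no = bisect.bisect_right(index_keys, key) - 1
    if block_no < 0:
        return None
    return blocks[block_no].get(key)


f = DirectFile()
f.store(12, "bolt")  # 12 % 5 == 2
f.store(17, "nut")   # 17 % 5 == 2, a collision, so it goes to overflow
print(f.fetch(12))   # ('bolt', 1)
print(f.fetch(17))   # ('nut', 2)

blocks = [{10: "a", 20: "b"}, {30: "c", 40: "d"}]
print(indexed_lookup([10, 30], blocks, 40))  # d
```

The second value returned by `fetch` shows the cost of a collision: a record in the overflow area takes more than one probe to find.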
Transaction Processing
Transaction processing involves a lot of data manipulation. The programmer writing SQL-type
code makes only logical references to the data items needed and does not specify how to find the
data. For example, “Find next inventory record.” The DBMS converts a logical reference into
a physical address.
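As a small illustration of a logical reference, the sketch below uses Python's built-in `sqlite3` module. The `inventory` table, its columns, and the sample rows are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [(1, "bolt", 500), (2, "nut", 0), (3, "washer", 120)],
)

# A purely logical request: which items are out of stock?
# Nothing here names a file, a disk block, or an access method;
# the DBMS decides how to locate the matching records.
rows = conn.execute(
    "SELECT name FROM inventory WHERE qty = 0 ORDER BY item_id"
).fetchall()
print(rows)  # [('nut',)]
conn.close()
```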
Another issue in transaction processing is controlling access to data items when many
transactions are occurring simultaneously. DBMSs support record locking, the ability to lock
the specific record temporarily to prevent access by any other process until it is unlocked, to
avoid data integrity problems.
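The idea of record locking can be imitated with threads in Python. This is only an analogy under stated assumptions: the `balances` dictionary stands in for stored records, `record_locks` holds one lock per record, and `deposit` plays the role of a transaction. A real DBMS manages its locks internally.

```python
import threading

balances = {"A100": 0}                     # record key -> stored value
record_locks = {"A100": threading.Lock()}  # one lock per record


def deposit(key: str, amount: int, times: int) -> None:
    for _ in range(times):
        with record_locks[key]:               # lock the record,
            current = balances[key]           # read it,
            balances[key] = current + amount  # and write it back
        # the lock is released here, so other transactions may proceed


workers = [
    threading.Thread(target=deposit, args=("A100", 1, 10_000))
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(balances["A100"])  # 40000
```

Without the lock, two threads could read the same old value and one update would be lost, which is the kind of data integrity problem record locking prevents.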
Distributed Databases
Ideally a database should exist in one location and should always be updated to assure data
integrity. In practice this could result in huge data transmission costs for updates and retrieval
from remote locations.
Discussion: A student at one of the remote sites gave an example where data is transferred
from remote field sites via satellite in response to a comment from an on-campus student that
transmission cost may not be significant with T1 line connections.
A distributed database is a database, parts of which exist in different locations. Inconsistency can
easily arise in a distributed database. For example, when an order is shipped, both the billing
and item availability data must be updated. If one is not done then we have inconsistency. A
technique used to maintain consistency across the databases is the two-phase commit, in which
the DBMS checks for the readiness of the local parts involved before executing a transaction.
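A much simplified sketch of the two-phase commit idea is given below. The `Site` class, its `available` flag, and the `two_phase_commit` function are hypothetical; real protocols also deal with logging, timeouts, and recovery, which are left out here.

```python
class Site:
    """One local part of a distributed database (hypothetical)."""

    def __init__(self, name: str, available: bool = True):
        self.name = name
        self.available = available
        self.data = {}
        self.pending = None

    def prepare(self, update: dict) -> bool:
        """Phase 1: report whether this site is ready to apply the update."""
        if not self.available:
            return False
        self.pending = update
        return True

    def commit(self) -> None:
        """Phase 2 (all ready): make the pending update permanent."""
        self.data.update(self.pending)
        self.pending = None

    def rollback(self) -> None:
        """Phase 2 (someone not ready): discard the pending update."""
        self.pending = None


def two_phase_commit(updates) -> bool:
    """updates: list of (site, update) pairs that form one transaction."""
    votes = [site.prepare(update) for site, update in updates]
    if all(votes):
        for site, _ in updates:
            site.commit()
        return True
    for site, _ in updates:
        site.rollback()
    return False


billing, stock = Site("billing"), Site("stock")
ok = two_phase_commit(
    [(billing, {"order42": "invoiced"}), (stock, {"widget": 9})]
)
print(ok, billing.data, stock.data)

stock.available = False  # the stock site cannot take part this time
ok2 = two_phase_commit(
    [(billing, {"order43": "invoiced"}), (stock, {"widget": 8})]
)
print(ok2, billing.data)
```

In the second call neither site is changed, so the billing and item availability data stay consistent with each other.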
Database replication is a common alternative to distributed databases. Complete or partial
copies of the master database are kept at remote locations. How frequently the central
master database should be updated is an issue.
In general the tradeoffs between centralized and distributed database architectures include issues
such as cost of data transmission, costs of synchronizing distributed parts of the database, and the
degree to which the entire database must be current at all times.
Data Warehouse
As we have seen before, there are many types of information systems and databases are a key
component of each of these. However, the nature of the information system puts very different
requirements on the databases. In a transaction processing system (TPS), quick access to
detailed data is needed. Databases that support TPSs are called operational databases. On the
other hand, decision support systems do not typically have real-time constraints and do not need
on-line access to detailed transaction data. A different type of database, called a data
warehouse, is used for this purpose. A data warehouse is a large database that is a collection of
data from smaller (operational) databases. Data warehouses are used to support decision
making and are updated less frequently than transaction databases. Data warehouses are
structured to provide support for fast online queries and quick summaries for managers.
Data warehouses usually have a global view. A data mart is a subset of a data warehouse that
provides data about a specific function or a department. Users can “drill down” into several
layers of data to locate a problem or an opportunity. Queries are often multidimensional in
nature. The data warehouse structure is optimized to support these types of multidimensional
queries. For example, the star schema is used to map multidimensional DSS data into a
relational database. The star schema still preserves the relational structure on which many of the
operational databases are built. The basic star schema has four components: facts, dimensions,
attributes, and attribute hierarchies.
Facts are numeric measurements or values that represent a specific business aspect or activity.
Facts commonly used in business data analysis are units, costs, prices, and revenues. The fact
table, which is the center of the star schema, contains facts that are linked through their dimensions.
The fact table is updated periodically with data from operational databases.
Dimensions are qualifying characteristics that provide additional perspectives to a given fact.
For instance, we might compare sales by product from region to region, and from one time
period to the next. Here sales has product, location, and time dimensions. Such dimensions are
normally stored in dimension tables.
Each dimension table contains attributes. Attributes are often used to search, filter, and classify
facts. For the sales example we have:
• Product dimension: product ID, description, product type, manufacturer
• Location dimension: region, state, city, and store
• Time dimension: year, quarter, month, week, and date
Attributes within dimensions can be ordered in a well-defined attribute hierarchy – e.g., region,
state, city, and store in the location dimension. An attribute hierarchy allows the user to perform
“drill-down” and “roll-up” searches.
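Roll-up and drill-down along the location hierarchy can be sketched in plain Python. The sales figures, the `HIERARCHY` list, and the `summarize` function are made up for illustration.

```python
from collections import defaultdict

# Each fact: (region, state, city, store, sales amount); sample data only
sales = [
    ("South", "OK", "Stillwater", "S1", 100),
    ("South", "OK", "Tulsa", "S2", 250),
    ("South", "TX", "Austin", "S3", 300),
    ("West", "CA", "Fresno", "S4", 400),
]
HIERARCHY = ["region", "state", "city", "store"]


def summarize(facts, level, within=None):
    """Total sales at one hierarchy level, optionally limited to the
    part of the hierarchy selected by `within` (e.g. one region)."""
    depth = HIERARCHY.index(level)
    within = within or {}
    totals = defaultdict(int)
    for fact in facts:
        attrs = dict(zip(HIERARCHY, fact[:4]))
        if all(attrs[k] == v for k, v in within.items()):
            totals[fact[depth]] += fact[4]
    return dict(totals)


# Roll-up: summarize everything at the top of the hierarchy.
print(summarize(sales, "region"))                      # {'South': 650, 'West': 400}
# Drill-down: look inside the South region, one level lower.
print(summarize(sales, "state", {"region": "South"}))  # {'OK': 350, 'TX': 300}
```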
Sales Star Schema
SALES (fact table): Time_ID, Location_ID, Product_ID, Quantity, Price, Amount
LOCATION (dimension table): Location_ID, Description, Region_ID, State, City
TIME (dimension table): Time_ID, Year, Quarter, Month, Week, Day
PRODUCT (dimension table): Product_ID, Description, Brand, Color, Size, Package
We will cover more on data warehouses under “data management.”
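The Sales star schema can be tried out with Python's built-in `sqlite3` module. The sketch below keeps only a subset of the columns from the figure, and all rows are invented sample data; the query is one possible multidimensional summary (sales amount by region and year), not a prescribed one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LOCATION (Location_ID INTEGER PRIMARY KEY,
                       Region_ID TEXT, State TEXT, City TEXT);
CREATE TABLE PRODUCT  (Product_ID INTEGER PRIMARY KEY,
                       Description TEXT, Brand TEXT);
CREATE TABLE "TIME"   (Time_ID INTEGER PRIMARY KEY,
                       Year INTEGER, Quarter INTEGER);
-- The fact table sits at the center and points to each dimension table.
CREATE TABLE SALES (
    Time_ID     INTEGER REFERENCES "TIME"(Time_ID),
    Location_ID INTEGER REFERENCES LOCATION(Location_ID),
    Product_ID  INTEGER REFERENCES PRODUCT(Product_ID),
    Quantity INTEGER, Price REAL, Amount REAL
);
INSERT INTO LOCATION VALUES (1, 'South', 'OK', 'Stillwater'),
                            (2, 'West', 'CA', 'Fresno');
INSERT INTO PRODUCT  VALUES (1, 'Widget', 'Acme'), (2, 'Gadget', 'Acme');
INSERT INTO "TIME"   VALUES (1, 1999, 4), (2, 2000, 1);
INSERT INTO SALES VALUES (1, 1, 1, 10, 2.0, 20.0),
                         (2, 1, 1,  5, 2.0, 10.0),
                         (2, 2, 2,  3, 5.0, 15.0);
""")

# Sales amount by region and year: the facts are qualified by two dimensions.
rows = conn.execute("""
    SELECT l.Region_ID, t.Year, SUM(s.Amount)
    FROM SALES s
    JOIN LOCATION l ON l.Location_ID = s.Location_ID
    JOIN "TIME" t   ON t.Time_ID = s.Time_ID
    GROUP BY l.Region_ID, t.Year
    ORDER BY l.Region_ID, t.Year
""").fetchall()
print(rows)  # [('South', 1999, 20.0), ('South', 2000, 10.0), ('West', 2000, 15.0)]
conn.close()
```

Every query against the schema has the same shape: join the fact table to the dimension tables of interest, then group by the chosen attributes.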