Security/Controls of Databases

Abdul-hakim M. Warsame
Richard A. Nyatangi
Timothy K. Kamau
James Kungu
James Kitonga
D61/72809/2012
D61/68249/2011
D61/67302/2011
D61/72729/2012
D61/72713/2012
Introduction
Computers store and retrieve information using file systems and databases, but the two are designed to handle data in different ways.
Databases and file systems are software-based
and can be used on both personal computers
and large mainframes. Both systems have
largely replaced their paper-based equivalents,
and without them many tasks that computers
do would be impossible.
A file consists of a number of records.
Record: A record is a collection of related fields.
Field: A field contains a set of related characters.
Character: A character is the smallest element in a file.
A file system is a system that collects data or files and stores them in a physical location such as a hard disk or tape.
File systems are containers of collections. Collections are commonly called directories, and they contain a set of data units commonly called files.
A file manager is used to store all the relationships among directories in a file system.
File management systems record the location of files
saved on a computer's hard disk rather than individual
data records.
They store information about the location of files on
the hard disk, their size, and associated attributes, such as whether the file is read-only.
1. Master file: These are files of a fairly permanent nature, e.g. payroll, inventory, customer. They require regular updating to show a current position. The master file will contain some data that is static in nature and some data that keeps on changing.
2. Transaction/movement file: It is made up of the various transactions created from the source documents. This file is used to update the master file.
3. Reference file: A file with a reasonable amount of permanency.
There are two basic strategies for processing transactions against the files:
1. Transaction processing – processing each transaction as it occurs (real-time processing).
2. Batch processing – collecting transactions together over some interval of time and then processing the whole batch.
File-based systems were an early attempt to computerize the manual filing system. A file-based system is a collection of application programs that perform services for the end-users, with each program defining and managing its own data. However, several types of problems occur with the file-based approach:
Separation and isolation of data
When data is isolated in separate files, it is more difficult to access data that should be available. The application programmer is required to synchronize the processing of two or more files to ensure that the correct data is extracted.
Duplication of data
When employing the decentralized file-based approach, uncontrolled duplication of data occurs. Uncontrolled duplication is undesirable because it wastes storage and can leave copies of the same data inconsistent.
Data dependence
In a file-based system, the physical structure and storage of the data files and records are defined in the application program code. This characteristic is known as program-data dependence. Making changes to an existing structure is rather difficult and will lead to modification of the programs. Such maintenance activities are time-consuming and subject to error.
Incompatible file formats
The structure of a file is dependent on the application programming language. A file structure provided in one programming language, such as the direct or indexed-sequential files available in COBOL, may differ from the structure generated by another programming language such as C. This incompatibility makes the files difficult to process jointly.
Fixed queries / proliferation of application programs
File-based systems are very dependent upon the application programmer. Any required queries or reports have to be written by the application programmer. Normally, only a fixed-format query or report can be entertained, and no facility for ad hoc queries is offered.
File-based systems also place tremendous pressure on data-processing staff, with users complaining about programs that are inadequate or inefficient in meeting their demands. Documentation may be limited, and maintenance of the system is difficult.
Provision for security, integrity and recovery capability is very limited.
In order to overcome the limitations of the file-based approach, the concept of the database emerged.
What is A Database?
A database is a structured collection of records or
data that is stored in a computer system.
A database is a single organized collection of
structured data, with controlled redundancy
A database is basically a computerized record-
keeping system. It is a repository or container for a
collection of computerized data files.
The data in the database is integrated and
shared:
Integrated means that the database can be thought of as a
unification of several distinct files with controlled redundancy.
Shared means that individual pieces of data in the database can be shared among different users. Any given user will be concerned with only some aspects of the total database.
Independence of the database and programs using it means
that one can be changed without changing the other.
In order for a database to be truly functional, it must not only store large amounts of records well, but also be easy to access. In addition, new information and changes should be fairly easy to input. In order to have a highly efficient database system, a program that manages the queries and information stored on the system must be incorporated. This is usually referred to as a DBMS, or Database Management System. Besides these features, all databases that are created should be built with high data integrity and the ability to recover data if hardware fails.
1. Hardware
The DBMS and the applications require hardware to
run on. The hardware can range from a single personal
computer, to a single mainframe, to a network of
computers. The particular hardware depends on the
organization's requirements and the DBMS used.
2. Software
The software component comprises the DBMS and the
application programs, together with the operating
system, including network software if the DBMS is
being used over a network.
3. People
Data and database administrators, application
developers, and the end-users.
DBMS is a software system that enables users to define,
create, maintain database and control access to the
database.
The DBMS is the software that interacts with the users'
application programs and the database.
It thus provides the controlled interface between the user
and the data in the database.
It allows users to define the database, usually
through a Data Definition Language (DDL).
It allows users to insert, update, delete, and retrieve
data from the database, usually through a Data
Manipulation Language (DML).
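As a brief sketch in standard SQL (the table and column names here are illustrative, not from the source):

-- DDL: defining the database structure
CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    name   VARCHAR(50) NOT NULL,
    salary DECIMAL(10,2)
);

-- DML: inserting, retrieving, updating and deleting data
INSERT INTO employee (emp_id, name, salary) VALUES (1, 'J. Otieno', 45000.00);
SELECT name, salary FROM employee WHERE salary > 40000;
UPDATE employee SET salary = 48000.00 WHERE emp_id = 1;
DELETE FROM employee WHERE emp_id = 1;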
DBMS also provides security through:
Protecting data against unauthorized access
Safeguarding data against corruption
Providing recovery and restart facilities after hardware or
software failure.
Application development.
There are several people involved in databases. These
include:
1.End users who interact with the system from their
workstations/terminals.
2.Application programmers who are responsible for the
development of application programs. They make use of
programming languages.
3. Database administrator (DBA), who is responsible for the following:
• Development of the database
• Maintenance of the database
• Maintenance of the data dictionary
• Manuals
• Security of the database
• Appraisal of the database performance
• Ensuring adherence to data protection
There are several common types of databases. Each type
of database has its own data model (how the data is
structured). They include
1)Flat Model
2)Hierarchical Model
3)Relational Model, and
4)Network Model.
1. Flat Model
In a flat-model database, data is held in a two-dimensional (flat) array. For instance, there is one column of information, and within this column each data item is assumed to be related to the others.
2. Hierarchical Model
The hierarchical model database resembles a tree-like structure, such as how Microsoft Windows organizes folders and files. In a hierarchical model database, each upward link is nested in order to keep data organized in a particular order on the same-level list.
3. Relational Model
The relational model is the most popular type of database
and an extremely powerful tool, not only to store
information, but to access it as well. Relational databases
are organized as tables. The beauty of a table is that the
information can be accessed or added without
reorganizing the tables. A table can have many records
and each record can have many fields.
4. Network Model
In a network model, the defining feature is that a record is stored with links to other records – in effect networked. These links (sometimes referred to as pointers) can be any type of information, such as node numbers or even a disk address.
A number of advantages of applying database approach in
application system are obtained including:
1. Control of data redundancy
The database approach attempts to eliminate redundancy by integrating the files. Although the database approach does not eliminate redundancy entirely, it controls the amount of redundancy inherent in the database.
2. Data consistency
By eliminating or controlling redundancy, the database approach reduces the risk of inconsistencies occurring. It ensures that all copies of the data are kept consistent.
3. More information from the same amount of data
With the integration of the operational data in the database approach, it may be possible to derive additional information from the same data.
4. Sharing of data
The database belongs to the entire organization and can be shared by all authorized users.
5. Improved data integrity
Database integrity refers to the validity and consistency of stored data. Integrity is usually expressed in terms of constraints, which are consistency rules that the database is not permitted to violate.
6. Improved security
The database approach provides protection of the data from unauthorized users. It may take the form of user names and passwords to identify user types and their access rights for operations including retrieval, insertion, updating and deletion.
7. Enforcement of standards
The integration of the database enforces the necessary standards
including data formats, naming conventions, documentation
standards, update procedures and access rules.
8.Economy of scale
Cost savings can be obtained by combining all organization's operational data
into one database with applications to work on one source of data.
9. Balance of conflicting requirements
By having a structural design in the database, conflicts between users or departments can be resolved. Decisions will be based on the best use of resources for the organization as a whole rather than for an individual entity.
10. Improved data accessibility and responsiveness
With integration in the database approach, data access can cross departmental boundaries. This feature provides more functionality and better services to the users.
11. Increased productivity
The database approach provides all the low-level file-handling routines. The
provision of these functions allows the programmer to concentrate more on the
specific functionality required by the users. The fourth-generation
environment provided by the database can simplify the database application
development.
12. Improved maintenance
The database approach provides data independence. As a change to the data structure in the database will not affect the application programs, it simplifies database application maintenance.
13. Increased concurrency
The database can manage concurrent data access effectively. It ensures that interference between users does not result in any loss of information or loss of integrity.
14. Improved backup and recovery services
Modern database management systems provide facilities to minimize the amount of processing that can be lost following a failure, by using the transaction approach.
In spite of the large number of advantages of the database approach, it is not without challenges. The following disadvantages can be identified:
1. Complexity
A database management system is an extremely complex piece of software. All parties must be familiar with its functionality in order to take full advantage of it. Therefore, training for the administrators, designers and users is required.
2. Size
The database management system consumes a substantial amount of main memory as well as a large amount of disk space in order to run efficiently.
3. Cost of DBMS
A multi-user database management system may be very
expensive. Even after the installation, there is a high recurrent
annual maintenance cost on the software
4. Cost of conversion
When moving from a file-base system to a database system, the
company is required to have additional expenses on hardware
acquisition and training cost.
5. Performance
As the database approach is to cater for many applications rather
than exclusively for a particular one, some applications may not
run as fast as before.
6. Higher impact of a failure
The database approach increases the vulnerability of the system due to centralization. As all users and applications rely on the availability of the database, the failure of any component can bring operations to a halt and seriously affect services to the customer.
Database design is the process of producing a detailed data model of a database to meet end users' requirements.
The ability to design databases and associated
applications is critical to the success of the modern
enterprise.
Database design requires understanding both the
operational and business requirements of an organization
as well as the ability to model and realize those
requirements using a database.
A good database design:
• Reflects the real-world structure of the problem
• Can represent all expected data over time
• Avoids redundancy and ensures consistency
• Provides efficient access to data
• Supports the maintenance of data integrity over time
Good design practice is to:
• Work interactively with users
• Follow a structured methodology
• Employ a data-driven approach
• Include structural and integrity considerations
• Combine conceptualization, normalization, and transaction validation techniques
• Use diagrams
• Use a Database Design Language (DBDL)
• Build a data dictionary
• Be willing to repeat steps
The most critical aspect of specification is the
gathering and compilation of system and user
requirements. This process is normally done in
conjunction with managers and users.
The major goal in requirements gathering is to:
collect the data used by the organization,
identify relationships in the data,
identify future data needs,
and determine how the data is used and
generated.
The starting place for data collection is
gathering existing forms and reviewing policies
and systems. Then, ask users what the data
means, and determine their daily processes.
These things are especially critical:
Identification of unique fields (keys)
Data dependencies, relationships, and
constraints (high-level)
The data sizes and their growth rates
Fact-finding is using interviews and questionnaires to collect facts about systems, requirements, and preferences. Five fact-finding techniques:
• Examining documentation
• Interviewing
• Observing the enterprise in operation
• Research
• Questionnaires
The requirements gathering and specification provides you
with a high-level understanding of the organization, its
data, and the processes that you must model in the
database.
Database design involves constructing a suitable model of
this information. Since the design process is complicated,
especially for large databases, database design is divided
into three phases:
Conceptual database design
Logical database design
Physical database design
It is a process of constructing a data model for each view of the real-world problem, independent of physical considerations.
Conceptual database design involves modelling
the collected information at a high-level of
abstraction without using a particular data
model or DBMS.
Independent of DBMS.
Allows for easy communication between end-
users and developers.
Has a clear method to convert from high-level
model to relational model.
The conceptual schema is a permanent description of the database requirements.
• Construct the ER model
• Check the model for redundancy
• Validate the model against user transactions to ensure all the scenarios are supported
ER – Entity Relationship
A pictorial representation of real-world problems in terms of entities and the relationships between them is referred to as an ER diagram.
Most popular conceptual model for database design
Basis for many other models
Describes the data in a system and how that data is
related
Describes data as entities, attributes and relationships
Entities: A class of distinct, identifiable objects or concepts, e.g. a person, an account, a course.
Relationships: Associations among entities are referred to as relationships.
Attributes: Properties or characteristics of entities, e.g. Person – String, Account – Decimal.
Entity Set: A collection of similar entities e.g.
all employees. All entities in an entity set have
the same set of attributes.
Each entity set has a key.
Each attribute has a domain.
Used for the description of the conceptual
schema of the database.
Not used for database implementation.
Formal notation.
Close to natural language.
Can be mapped to various data models i.e.
relational, object-oriented, object-relational,
XML, description logics.
Schema: Should hold stable information.
Instance: Should consider the changing nature of information.
Avoid redundancy (each fact should be
represented once).
No need to store information that can be
computed.
Keys should be as small as possible.
Introduce artificial keys only if no simple,
natural keys are available.
1. Creating relation schemas from entity types.
2. Creating relation schemas from relationship
types.
3. Identifying keys.
4. Identifying foreign keys.
5. Schema optimization.
ER design is subjective. There are often many
ways to model a given scenario. Analyzing
alternatives can be tricky, especially for a
large enterprise.
We must convert the written database requirements into an E-R diagram. There is a need to determine the entities, attributes and relationships:
– nouns = entities
– adjectives = attributes
– verbs = relationships
Weak entities do not have key attributes of
their own.
Weak entities cannot exist without a
relationship to another entity.
A partial key is the portion of the key that
comes from the weak entity. The rest of the key
comes from the other entity in the relationship.
Weak entities always have total participation
as they cannot exist without the identifying
relationship.
Each entity has a set of associated properties
that describes the entity. These properties are
known as attributes.
Attributes can be:
Simple or composite
Single or Multi-valued
Stored or Derived
Null
Candidate key: an attribute or set of attributes that uniquely identifies individual occurrences of an entity type.
Composite key: the terms composite key and compound key are used to describe primary keys that contain multiple attributes. When dealing with a composite primary key, it is important to understand that it is the combination of values for all attributes that must be unique.
Primary key: an attribute or combination of attributes that uniquely identifies an instance of the entity. In other words, no two instances of an entity may have the same value for the primary key.
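A minimal sketch of these key types in SQL (the tables are hypothetical):

CREATE TABLE student (
    student_no  INTEGER PRIMARY KEY,   -- primary key: chosen from the candidate keys
    national_id VARCHAR(20) UNIQUE,    -- another candidate key, not chosen as primary
    name        VARCHAR(50)
);

-- Composite (compound) primary key: only the combination of the two
-- attribute values must be unique, not each attribute on its own.
CREATE TABLE enrolment (
    student_no INTEGER,
    course_no  INTEGER,
    PRIMARY KEY (student_no, course_no)
);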
Defines a set of associations between various
entities
Can have attributes to define them
Are limited by:
Participation
Cardinality Ratio
For example, a SECTION entity might be related to a
COURSES entity, or an EMPLOYEES entity might be
related to an OFFICES entity. An ellipse connected by lines
to the related entities indicates a relationship in an ERD.
Degree of relationship: the number of
participating entities in a relationship, e. g. unary,
binary, ternary etc.
A unary relationship is a relationship involving
a single entity.
A relationship between two entities is called a
binary relationship.
When three entities are involved in a
relationship, a ternary relationship exists.
Relationships that involve more than three
entities are referred to as n-ary relationships, where
n is the number of entities involved in the
relationship.
Relationships can also differ in terms of their
cardinality.
Maximum cardinality refers to the maximum
number of instances of one entity that can be
associated with a single instance of a related
entity.
Minimum cardinality refers to the minimum
number of instances of one entity that must be
associated with a single instance of a related
entity.
If one CUSTOMER can be related to only one ACCOUNT and one ACCOUNT can be related to only a single CUSTOMER, the cardinality of the CUSTOMER-ACCOUNT relationship is one-to-one (1:1).
If an ADVISOR can be related to one or more
STUDENTS, but a STUDENT can be related to only a single
ADVISOR, the cardinality is one-to-many (1:N).
The cardinality of the relationship is many-to-many
(M:N) if a single STUDENT can be related to zero or more
COURSES and a single COURSE can be related to zero or
more STUDENTS.
An ER schema can then be drawn for a database to store information about professors, courses and course sections.
It is a process of constructing a model of
information, which can then be mapped into storage
objects supported by the Database Management
System.
Once the relationships and dependencies amongst
the various pieces of information have been
determined, it is possible to arrange the data into a
logical structure which can then be mapped into the
storage objects supported by the database
management system. In the case of relational
databases the storage objects are tables which store
data in rows and columns.
Each table may represent an implementation of
either a logical object or a relationship joining one or
more instances of one or more logical objects.
Relationships between tables may then be stored as
links connecting child tables with parents. Since
complex logical relationships are themselves tables
they will probably have links to more than one parent.
In an Object database the storage objects correspond
directly to the objects used by the Object-oriented
programming language used to write the applications
that will manage and access the data. The relationships
may be defined as attributes of the object classes
involved or as methods that operate on the object
classes.
This logical database design step involves:
i. Table generation from the ER model
ii. Normalization of tables
The cardinality of the relationships among the entities is considered while deriving the tables from the ER model:
One-to-one:
Entities with one-to-one relationships should be merged into
a single entity.
Each remaining entity is modeled by a table with a primary
key and attributes, some of which may be foreign keys.
One-to-many:
One-to-many relationships are modeled by a foreign key
attribute in the table. This foreign key would refer to another
table that would contain the other side of the relation.
Many-to-many
Many-to-many relationships among two entities are modeled
by a third table that has foreign keys that refer to the entities.
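The latter two cases can be sketched in SQL as follows (hypothetical tables):

-- One-to-many: a foreign key in the "many" table refers to the "one" table
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    VARCHAR(50)
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    VARCHAR(50),
    dept_id INTEGER REFERENCES department(dept_id)  -- many employees per department
);

-- Many-to-many: a third table holds foreign keys referring to both entities
CREATE TABLE student (student_no INTEGER PRIMARY KEY, name VARCHAR(50));
CREATE TABLE course  (course_no  INTEGER PRIMARY KEY, title VARCHAR(50));
CREATE TABLE enrolment (
    student_no INTEGER REFERENCES student(student_no),
    course_no  INTEGER REFERENCES course(course_no),
    PRIMARY KEY (student_no, course_no)
);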
Normalization is a process of eliminating redundancy and other anomalies in the system. In most cases in the enterprise world, normalization up to Third Normal Form will suffice.
For some transactions it is desirable that certain tables be denormalised for efficiency in querying the database tables; in those cases, tables can be left in denormalised form.
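As a small illustration (hypothetical tables), normalization replaces one table that repeats customer details on every order with separate tables in which each fact is stored once:

-- Unnormalized: the customer's city is repeated on every order row,
-- so updating a city risks leaving inconsistent copies:
--   orders(order_no, customer_name, customer_city, item, qty)

-- Normalized (roughly 3NF): each fact is stored once
CREATE TABLE customer (
    cust_id INTEGER PRIMARY KEY,
    name    VARCHAR(50),
    city    VARCHAR(50)
);
CREATE TABLE customer_order (
    order_no INTEGER PRIMARY KEY,
    cust_id  INTEGER REFERENCES customer(cust_id)
);
CREATE TABLE order_line (
    order_no INTEGER REFERENCES customer_order(order_no),
    item     VARCHAR(50),
    qty      INTEGER,
    PRIMARY KEY (order_no, item)
);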
The physical design of the database specifies
the physical configuration of the database on
the storage media. This includes detailed
specification of data elements, data types,
indexing options and other parameters residing
in the DBMS data dictionary. It is the detailed design of the system, including its modules and the hardware and software specifications.
It involves describing the base relations, file organizations, and index designs used to achieve efficient access to the data, and any associated integrity constraints and security measures.
Other functions include:
Consider typical workloads and further refine the
database design.
Consider the introduction of controlled redundancy
Design security mechanisms
Monitor and tune the operational system
Design user views
Poor design/planning
Far too often, a proper planning phase is ignored in favor
of just "getting it done". The project heads off in a certain
direction and when problems inevitably arise – due to the
lack of proper designing and planning – there is "no time"
to go back and fix them properly, using proper techniques.
Ignoring Normalization
Normalization defines a set of methods to break down
tables to their constituent parts until each table represents
one and only one "thing", and its columns serve to fully
describe only the one "thing" that the table represents.
Poor naming standards
Names, while a personal choice, are the first and
most important line of documentation for your
application. The names you choose are not just to
enable you to identify the purpose of an object, but to
allow all future programmers, users, and so on to
quickly and easily understand how a component part
of your database was intended to be used, and what
data it stores. No future user of your design should
need to wade through a 500 page document to
determine the meaning of some wacky name.
Introduction
• Ensuring the security of a database is a complex issue for organizations.
• The more complex the databases are, the more complex the security measures that must be applied.
• Network and internet connections to databases may complicate things even further.
• Every additional internal user added to the database can create further serious security problems.
• An organization's valuable information stored in a computer system database is its most precious asset and must be protected.
Definition of Database Security
• These are mechanisms that protect the database against intentional or accidental threats.
• Security considerations apply not only to the data held in the database; breaches to other parts of the system can also affect the database.
• Database security encompasses hardware, software, people and data. Effective implementation of security requires appropriate controls, which are defined in specific mission objectives for the system.
• The need for security is driven by the increasing amounts of crucial corporate data being stored on computers and the acceptance that any loss or unavailability of this data could prove disastrous.
Database security risk
Security of database is considered in relation to
the following situations
• Theft and fraud
• Loss of confidentiality (Secrecy)
• Loss of privacy
• Loss of integrity
• Loss of availability
Fraud or loss of privacy may arise because of either
intentional or unintentional acts and do not
necessarily result in any detectable changes to the
database or the computer system.
Theft and fraud affect not only the database environment but also the entire organization. They do not necessarily alter data, as is also true of activities that result in loss of confidentiality or loss of privacy.
Confidentiality refers to the need to maintain secrecy over data, usually only that which is critical to the organization, whereas privacy refers to the need to protect data about individuals. Breaches of security resulting in loss of confidentiality could, for instance, lead to loss of competitiveness, and loss of privacy could lead to legal action being taken against the organization.
Loss of data integrity results in invalid or
corrupted data, which may seriously affect the
operation of an organization.
Loss of availability means that the data or the
system or both cannot be accessed, which can
seriously affect an organization’s financial
performance.
NB: Database security aims to minimize losses caused by anticipated events in a cost-effective manner without unduly constraining the users.
Threats: Any situation or event, whether
intentional or accidental, that may adversely
affect a system and consequently the
organization.
It may be caused by a situation or event involving a person, action, or circumstance that is likely to bring harm to an organization.
The harm may be tangible, such as loss of hardware, software or data, or intangible, such as loss of credibility or client confidence. The problem facing an organization is to identify all possible threats.
Summary of potential threats to computer systems
Hardware
• Fire/floods/bombs
• Data corruption due to power loss or surge
• Failure of security mechanisms giving greater access
• Theft of equipment
• Physical damage to equipment
• Electronic interference and radiation
DBMS and application software
• Failure of security mechanisms giving greater access
• Program alteration
• Theft of programs
Communication networks
• Wire tapping
• Breaking or disconnection of cables
• Electronic interference and radiation
Database
• Unauthorized amendment or copying of data
• Theft of data
• Data corruption due to power loss or surge
Users
• Using another person's means of access
• Viewing and disclosing unauthorized data
• Inadequate staff training
• Illegal entry by hackers
• Blackmail
• Introduction of viruses
Programmers/operators
• Creating trapdoors
• Program alteration
• Inadequate staff training
• Inadequate security policies and procedures
• Staff shortages or strikes
Data/database administrator
• Inadequate security policies and procedures
Countermeasures- Computer-Based Controls
The types of controls range from physical to administrative
measures.
Types of controls:
• Authorization
• Access controls
• Views
• Backup and recovery
• Integrity
• Encryption
• RAID technology
Authorization
It refers to the granting of a right or privilege that enables a subject to have legitimate access to a system or a system's objects.
It can be built into the software and govern not only what system or object a specified user can access, but also what the user can do with it.
The process of authorization involves authentication of subjects requesting access to objects, where a subject represents a user or program and an object represents a database table, view, procedure, trigger or any other object that can be created within the system.
Authentication: a mechanism that determines whether a user is who he or she claims to be.
A system administrator is usually responsible for allowing users to have access to a computer system by creating individual user accounts.
Access controls
Typical way to provide access controls for a
database system is based on the granting and
revoking of privileges.
A privilege allows a user to create or access (i.e.
read, write, or modify) some database object
(such as a relation, view, or index) or to run
certain DBMS utilities. Privileges are granted to
users to accomplish the tasks required for their
jobs.
The DBMS keeps track of how privileges are granted to users and possibly revoked, and ensures that at all times only users with the necessary privileges can access an object.
Discretionary Access Control (DAC)
Most commercial DBMSs provide an approach to managing privileges through SQL called Discretionary Access Control (DAC). The SQL standard supports DAC through the GRANT and REVOKE commands. The GRANT command gives privileges to users, and the REVOKE command takes privileges away.
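For example (hypothetical table and user names):

-- Give a user the right to read and insert rows in the staff table
GRANT SELECT, INSERT ON staff TO clerk1;

-- Allow a user to pass the privilege on to other users
GRANT SELECT ON staff TO supervisor1 WITH GRANT OPTION;

-- Take a privilege away
REVOKE INSERT ON staff FROM clerk1;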
Discretionary access control, while effective, has certain weaknesses. An unauthorized user can trick an authorized user into disclosing sensitive data.
Mandatory Access control (MAC)
This is based on system wide policies that cannot be
changed by individual users. In this approach each
database object is assigned a security class and each
user is assigned a clearance for a security class and
rules are imposed on reading and writing of
database objects by users.
The DBMS determines whether a given user can
read or write a given object based on certain rules
that involve the security level of an object and the
clearance of the user.
The rules seek to ensure that sensitive data can never be passed on to another user without the necessary clearance.
The SQL standard does not include support for
MAC.
Views
A view is a dynamic result of one or more relational
operations operating on the base relations to
produce another relation. A view is a virtual relation
that does not actually exist in the database, but is
produced upon request by a particular user, at the
time of request.
It provides a powerful and flexible security
mechanism by hiding parts of the database from
certain users. The user is not aware of the existence
of any attributes or rows that are missing from the
view.
It can be defined over several relations, with a user being granted the appropriate privilege to use it, but not to use the base relations.
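A sketch of a view used as a security mechanism (the relation, branch value and user name are hypothetical):

-- The base relation holds sensitive salary data
CREATE TABLE staff (
    staff_no INTEGER PRIMARY KEY,
    name     VARCHAR(50),
    branch   VARCHAR(30),
    salary   DECIMAL(10,2)
);

-- The view hides the salary column and the rows of other branches
CREATE VIEW branch_staff AS
    SELECT staff_no, name, branch
    FROM staff
    WHERE branch = 'Nairobi';

-- Users are granted access to the view, not to the base relation
GRANT SELECT ON branch_staff TO branch_clerk;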
Backup and Recovery
This is the process of periodically taking a copy
of the database and log file (and possibly
programs) on to offline storage media
A DBMS should provide backup facilities to assist with the recovery of a database following a failure. It is always advisable to make backup copies of the database and the log file at regular intervals and to ensure that the copies are in a secure location.
In the event of a failure that renders the database
unusable, the backup copy and the details
captured in the log file are used to restore the
database to the latest possible consistent state.
Journaling-the process of keeping and maintaining the
log file (or journal) of all changes made to the database to
enable recovery to be undertaken effectively in the event
of a failure.
A DBMS should provide logging facilities which keep
track of the current state of transactions and database
changes, to provide support for recovery procedures.
The advantage of journaling is that, in the event of a
failure, the database can be recovered to its last known
consistent state using a backup copy of the database and
the information contained in the log file.
If no journaling is enabled on a failed system, the only
means of recovery is to restore the database using the
latest backup version of the database
Without a log file, any changes made after the last backup
to the database will be lost.
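Backup commands are DBMS-specific; as one hedged example, Microsoft SQL Server expresses the backup, journal and restore cycle like this (database name and file paths are illustrative):

-- Periodic full backup of the database to offline storage
BACKUP DATABASE sales_db TO DISK = 'E:\backups\sales_db_full.bak';

-- Back up the transaction log (the journal) so later changes can be replayed
BACKUP LOG sales_db TO DISK = 'E:\backups\sales_db_log.bak';

-- After a failure: restore the backup copy, then roll forward from the log
RESTORE DATABASE sales_db FROM DISK = 'E:\backups\sales_db_full.bak' WITH NORECOVERY;
RESTORE LOG sales_db FROM DISK = 'E:\backups\sales_db_log.bak' WITH RECOVERY;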
Integrity
Integrity constraints also contribute to a secure database system by preventing data from becoming invalid and hence giving misleading or incorrect results.
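A short sketch of integrity constraints in standard SQL (hypothetical tables):

CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name VARCHAR(50));

CREATE TABLE account (
    acct_no INTEGER PRIMARY KEY,                -- entity integrity: key is unique, never null
    cust_id INTEGER NOT NULL
            REFERENCES customer(cust_id),       -- referential integrity
    balance DECIMAL(12,2) CHECK (balance >= 0)  -- domain constraint: no negative balances
);

-- The DBMS rejects invalid data, e.g. this insert fails the CHECK constraint:
-- INSERT INTO account VALUES (1, 1, -50.00);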
Encryption
The encoding of the data by a special algorithm that
renders the data unreadable by any program without
the decryption key.
Symmetric encryption:-Uses the same key for both
encryption and decryption and relies on safe
communication lines for exchanging the key.
Data Encryption Standard (DES): uses one key for both encryption and decryption, which must be kept secret, although the algorithm need not be.
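As one hedged illustration, MySQL provides built-in AES functions for symmetric encryption of stored values (the table and the key shown are purely illustrative; in practice the key must be kept secret and managed carefully):

CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    diagnosis  VARBINARY(255)   -- holds the encrypted value
);

-- Encrypt on insert; the stored bytes are unreadable without the key
INSERT INTO patient VALUES (1, AES_ENCRYPT('hypertension', 'illustrative-key'));

-- Decrypt on retrieval, using the same (secret) key
SELECT AES_DECRYPT(diagnosis, 'illustrative-key') FROM patient WHERE patient_id = 1;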
RAID (Redundant Array of Independent Disks)
The hardware that the DBMS is running on must be fault
tolerant, meaning that the DBMS should continue to
operate even if one of the hardware components fails.
RAID works on having a large disk array comprising an
arrangement of several independent disks that are
organized to improve reliability and at the same time
increase performance.
Performance is increased through data striping: the data
is segmented into equal-size partitions (the striping unit)
which are transparently distributed across multiple disks.
This gives the appearance of a single large, fast disk where
in actual fact data is distributed across several smaller
disks.
Striping improves overall I/O performance by allowing multiple I/O requests to be serviced in parallel. Data striping also balances the load among disks.
DBMS AND Web Security
This focuses on how to make a DBMS secure on the web.
Internet communication relies on TCP/IP as the underlying protocol.
However TCP/IP and HTTP were not designed with security in
mind. Without any special software all internet traffic travels in ‘the
clear’ and anyone who monitors traffic can read it.
Internet challenges
The challenge on the internet is to transmit and receive information while ensuring that:
• It is inaccessible to anyone but the sender and receiver (privacy)
• It has not been changed during transmission (integrity)
• The receiver can be sure it came from the sender (authenticity)
• The sender can be sure the receiver is genuine (non-fabrication)
• The sender cannot deny he or she sent it (non-repudiation)
With a multi-tier architecture such as the web environment, ensuring secure access to and from the database becomes more complex.
Security must be addressed if the information transmitted contains
executable content.
Dangers of executable content
• Corrupt data or the execution state of programs
• Reformat complete disks
• Perform a total system shutdown
• Collect and download confidential data, such as files or passwords, to another site
• Usurp identity and impersonate the user or user's computer to attack other targets on the network
• Lock up resources, making them unavailable for legitimate users and programs
• Cause non-fatal but unwelcome effects, especially on the output devices
Database security (Internet Environment)
Proxy servers: A proxy server is a computer that sits between a web browser and a web server. It intercepts all requests to the web to determine whether it can fulfill them itself. If not, it forwards the requests to the web server.
Purposes of a proxy server:
• To improve performance: it saves the results of all requests for a certain amount of time, thus significantly improving performance for a group of users.
• To filter requests, e.g. preventing employees from accessing a particular web site.
Firewalls
A firewall is a system designed to prevent unauthorized access to or from a private network. It can be implemented in hardware, software, or a combination of both.
Firewalls are frequently used to prevent unauthorized internet users from accessing private networks connected to the internet, especially intranets.
All messages entering or leaving the intranet pass through the firewall, which examines each message and blocks those that do not meet the specified security criteria.
Types of firewall techniques
• Packet filter: looks at each packet entering or leaving the network and accepts or rejects it based on user-defined rules.
• Application gateway: applies security mechanisms to specific applications, such as FTP and telnet servers.
• Circuit-level gateway: applies security mechanisms when a TCP or UDP connection is established. Once the connection has been made, packets can flow between the hosts without further checking.
• Proxy server: intercepts all messages entering or leaving the network.
Message digest algorithms and digital signatures
A message digest algorithm, or one-way hash function, takes an arbitrary-sized string (the message) and generates a fixed-length string (the digest or hash) which has the following characteristics:
• It should be computationally infeasible to find another message that will generate the same digest.
• The digest does not reveal anything about the message.
A digital signature consists of two pieces of information: a string of bits that is computed from the data being signed, along with the private key of the individual or organization wishing to sign. The signature can be used to verify that the data comes from this individual or organization.
Digital certificates
This is an attachment to an electronic message used
for security purposes, most commonly to verify that
a user sending a message is who he or she claims to
be and to provide the receiver with the means to
encode a reply.
Kerberos
This is a server of secured user names and passwords. It provides one centralized security server for all data resources on the network. Database access, login, authorization control and other security features are centralized on the trusted Kerberos server.
It has a similar function to that of a certificate server: to identify and validate a user.
Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990, who defined it in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:
Subject Oriented:
Data that gives information about a particular subject instead
of about a company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety
of sources and merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular
time period.
Non-volatile
Data is stable in a data warehouse. More data is added but
data is never removed. This enables management to gain a
consistent picture of the business.
Data warehousing is a process of transforming data into information and making it available to users in a timely enough manner to make a difference. It helps answer business questions such as:
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
A typical data warehouse architecture: source data (relational databases, ERP systems, purchased data, legacy data) passes through extraction and cleansing into an optimized loader, which feeds the data warehouse engine and its metadata repository; users then analyze and query the warehouse.
Putting information technology to work to help the knowledge worker make faster and better decisions:
• Which of my customers are most likely to go to the competition?
• What product promotions have the biggest impact on revenue?
• How did the share price of software companies correlate with profits over the last 10 years?
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can
be ad-hoc
Used by managers and end-users to
understand the business and make judgements
Enterprise data warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.
Data mart: departmental subsets that focus on selected subjects.
Decision Support System (DSS)
Information technology to help the knowledge
worker (executive, manager, analyst) make faster &
better decisions
Online Analytical Processing (OLAP)
an element of decision support systems (DSS)
Terabytes (10^12 bytes): Walmart – 24 terabytes
Petabytes (10^15 bytes): geographic information systems
Exabytes (10^18 bytes): national medical records
Zettabytes (10^21 bytes): weather images
Yottabytes (10^24 bytes): intelligence agency videos
Industry – Application
Finance – Credit card analysis
Insurance – Claims and fraud analysis
Telecommunication – Call record analysis
Transport – Logistics management
Consumer goods – Promotion analysis
Data service providers – Value-added data
Utilities – Power usage analysis
A Data Warehouse Delivers Enhanced Business
Intelligence
By providing data from various sources, managers and
executives will no longer need to make business decisions
based on limited data or their gut. In addition, “data
warehouses and related BI can be applied directly to
business processes including marketing segmentation,
inventory management, financial management, and sales.”
A Data Warehouse Saves Time
Since business users can quickly access critical data from a
number of sources—all in one place—they can rapidly
make informed decisions on key initiatives. They won’t
waste precious time retrieving data from multiple sources.
Not only that, but business executives can query the data themselves with little or no support from IT, saving more time and more money.
A Data Warehouse Enhances Data Quality and Consistency
A data warehouse implementation includes the conversion
of data from numerous source systems into a common
format. Since the data from the various departments is standardized, each department will produce results that are in line with all the other departments.
confidence in the accuracy of your data. And accurate data is
the basis for strong business decisions.
A Data Warehouse Provides Historical Intelligence
A data warehouse stores large amounts of historical data so
you can analyze different time periods and trends in order to
make future predictions. Such data typically cannot be stored
in a transactional database or used to generate reports from a
transactional system.
1. Data warehouses are not the optimal environment for
unstructured data.
2. Because data must be extracted, transformed and loaded
into the warehouse, there is an element of latency in data
warehouse data.
3. Over their life, data warehouses can have high costs.
Data warehouses can get outdated relatively quickly.
4. There is a cost of delivering suboptimal information to
the organization.
5. There is often a fine line between data warehouses and
operational systems. Duplicate, expensive functionality
may be developed. Or, functionality may be developed in
the data warehouse that, in retrospect, should have been
developed in the operational systems.
What is Data Mining?
Data Mining, a specialized part of Knowledge-Discovery in
Databases (KDD), is the process of automatically searching
large volumes of data for patterns and rules.
It helps in extracting meaningful new patterns that cannot necessarily be found by merely querying or processing data or metadata in the data warehouse.
A few examples arising from data mining include:
• A pattern showing that whenever a customer buys video equipment, he or she also buys another electronic gadget.
• A customer buys a camera, within three months buys photographic supplies, and within six months buys an accessory item.
• Customer classification by frequency of visits, amount of purchase, items purchased, payment mode, etc.
What is Data Mining?
Data mining access of the database differs from traditional access in three major areas:
1. Query: The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what they want to see.
2. Data: The data accessed is usually a different version from that of the operational database (it typically comes from a data warehouse). The data must be cleansed and modified to better support mining operations.
3. Output: The output of the data mining query probably is not a subset of the database. Instead it is the output of some analysis of the contents of the database.
Why mine data?
Prediction - show how certain attributes within the
data will behave in the future.
Identification - Data patterns can be used to identify
the existence of an item, an event, or an activity.
Classification - Partitioning the data so that different categories can be identified based on combinations of parameters.
Optimization - optimize the use of limited
resources to maximize output variables e.g. sales or
profits.
Stage 1: Exploration. Data preparation which
involves cleaning data, data transformations, and
selecting subsets of records.
Stage 2: Model building and validation.
Considering various models & choosing the best one
based on their predictive performance.
Stage 3: Deployment. Applying the model to new data in order to generate predictions or estimates of the expected outcome.
The knowledge discovered during data
mining can be described in five ways, as
follows:
Association rules—These rules correlate the
presence of a set of items with another range
of values for another set of variables.
Example: When a female retail shopper buys a handbag, she is likely to buy shoes.
Classification hierarchies—The goal is
to work from an existing set of events or
transactions to create a hierarchy of classes.
Example: A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions.
Sequential patterns—A sequence of
actions or events is sought. Example: If a
patient underwent cardiac bypass surgery for
blocked arteries and later developed high
blood sugar within a year of surgery, he or she
is likely to suffer from kidney failure within the
next 18 months.
Detection of sequential patterns is equivalent
to detecting association among events with
certain temporal relationships.
Categorization and segmentation—A given
population of events or items can be partitioned
(segmented) into sets of "similar" elements.
Example: The adult population in Kenya may be
categorized into five groups from "most likely to
buy" to "least likely to buy" a new product.
For most applications, the desired knowledge is a
combination of the above types.
Data mining models and some typical tasks (not an exhaustive listing):
• Predictive models: classification, regression, prediction, time-series analysis
• Descriptive models: clustering, summarization, sequence discovery, association rules
Combinations of these tasks yield more sophisticated mining operations.
Classification:
Given a set of items that belong to several classes, and given past items whose class is known, classification is the process of predicting the class of a new item.
Technique used: decision-tree classifiers.
Also: artificial neural networks – predictive models that learn through training and resemble biological neural networks in structure.
A tree induction example – classification using an induction tree:
Customer renting property more than 2 years?
– No: rent property
– Yes: Customer over 25 years old?
   – No: rent property
   – Yes: buy property
This predictive model classifies customers into one of two categories: renters and buyers. The model predicts that customers who are over 25 years old and have rented for more than 2 years will buy property; others will rent.
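The same decision rules can be written as a simple SQL CASE expression (the customer_history table and its columns are hypothetical):

-- Apply the induction tree's rules to classify each customer
SELECT name,
       CASE
           WHEN years_renting > 2 AND age > 25 THEN 'buy property'
           ELSE 'rent property'
       END AS predicted_class
FROM customer_history;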
Clustering is similar to classification except that the
groups are not predefined, but rather defined by the
data alone.
Clustering is alternatively referred to as unsupervised
learning or segmentation (actually, segmentation is a
special case of clustering although many people refer
to them synonymously).
Clustering algorithms find groups of items that are
similar.
Technique used: Nearest neighbor. A classification
technique that classifies each record based on the
records most similar to it in an historical database.
Regression - Predictive
Regression is used to map a data item to a real-valued prediction variable: the prediction of a value, rather than a class.
Regression assumes that the target data fit into some known
type of function (i.e., linear, logistic, etc.) and then
determines the best function of this type that models the
given data.
Some type of error analysis is used to determine which function
is “best”, i.e., produces the least total error.
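As a sketch, the least-squares fit of a straight line y = a + b*x can even be computed with plain SQL aggregates (the observations table with numeric columns x and y is hypothetical):

SELECT
    -- slope b = (n*Sum(x*y) - Sum(x)*Sum(y)) / (n*Sum(x*x) - Sum(x)^2)
    (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y)) * 1.0
        / (COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) AS slope_b,
    -- intercept a = avg(y) - b * avg(x)
    AVG(y) - (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y)) * 1.0
        / (COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) * AVG(x) AS intercept_a
FROM observations;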
The problem with linear regression is that the technique only
works well with linear data and is sensitive to the presence of
outliers (data values which do not conform to the expected
norm). Nonlinear regression avoids the main problems of linear regression, but not fully.
Data mining requires statistical methods that can
accommodate nonlinearity, outliers, and non-numeric data.
An association algorithm creates rules that describe how
often events have occurred together.
E.g. when a customer buys a hammer, then 90% of the time they will also buy nails.
Support: "a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule", e.g. people buy hotdog buns and hotdog sausages together in 99% of cases.
Confidence: "a measure of how often the consequent is true when the antecedent is true", e.g. 90% of hotdog bun purchases are accompanied by hotdog sausages.
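Under stated assumptions (a hypothetical purchase table with one row per item per transaction), support and confidence for the rule "buns -> sausages" can be computed directly:

SELECT
    -- support: fraction of all transactions containing both items
    SUM(CASE WHEN has_buns = 1 AND has_saus = 1 THEN 1 ELSE 0 END) * 1.0
        / COUNT(*) AS support,
    -- confidence: of the transactions with buns, the fraction also with sausages
    SUM(CASE WHEN has_buns = 1 AND has_saus = 1 THEN 1 ELSE 0 END) * 1.0
        / NULLIF(SUM(has_buns), 0) AS confidence
FROM (
    SELECT tx_id,
           MAX(CASE WHEN item = 'hotdog buns'     THEN 1 ELSE 0 END) AS has_buns,
           MAX(CASE WHEN item = 'hotdog sausages' THEN 1 ELSE 0 END) AS has_saus
    FROM purchase
    GROUP BY tx_id
) AS t;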
• When using association rules, one must remember that these are not causal relationships. They do not represent any relationship inherent in the actual data, as is the case with functional dependencies, or in the real world.
• There is probably no relationship between the items and no guarantee that this association will apply in the future.
Data mining technologies can be applied to a large variety of
decision-making contexts in business. In particular, areas of
significant payoffs are expected to include the following:
Marketing—Applications include analysis of consumer
behavior based on buying patterns; determination of
marketing strategies including advertising, store location,
and targeted mailing; segmentation of customers, stores, or
products; and design of catalogs, store layouts, and
advertising campaigns.
Finance—Applications include analysis of creditworthiness
of clients, segmentation of account receivables, performance
analysis of finance investments like stocks, bonds, and
mutual funds; evaluation of financing options; and fraud
detection.
Manufacturing—Applications involve optimization of
resources like machines, manpower, and materials;
optimal design of manufacturing processes, shop-floor
layouts, and product design, such as for automobiles
based on customer requirements.
Health Care—Applications include an analysis of
effectiveness of certain treatments; optimization of
processes within a hospital, relating patient wellness
data with doctor qualifications; and analyzing side
effects of drugs.
Science – Predict environmental change
Provides new knowledge from existing data.
Public databases
Government sources
Company Databases
Old data can be used to develop new knowledge.
New knowledge can be used to improve services or
products.
Improvements lead to:
Bigger profits
More efficient service
Research: Insight.
Privacy issues:
A person's life story can be painted from the collected data, e.g. by linking shopping history, credit history, bank history and employment history. For example, according to the Washington Post, in 1998 American Express sold their customers' credit card purchases to another company.
Security issues: Companies hold a lot of personal information online, and they do not guarantee to protect it.
Misuse of information: Service discrimination, e.g. some companies will answer your phone call based on your purchase history.
Missing data: During the preprocessing phase of KDD, missing data may be replaced with estimates, which can result in invalid results.
Irrelevant data: Some attributes in the database might not
be of interest to the data mining task being developed.
Changing data: Databases cannot be assumed to be static.
However, most data mining algorithms do assume a static
database. This requires that the algorithms be completely
rerun anytime the database changes.
Application: Determining the intended use of the information obtained from the data mining function is a challenge. How business executives can effectively use the output and modify their firms accordingly is sometimes considered the more difficult part.
THE KDD PROCESS:
Consists of the following five basic steps (a sketch of steps 2 and 3 follows the list):
1. Selection: The data needed for the data mining process is obtained from many different and heterogeneous data sources.
2. Preprocessing: The data to be used by the process may have incorrect or missing data. Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools).
3. Transformation: Data from different sources must be converted into a common format for processing. Some data may be encoded or transformed into more usable formats. Data reduction may be used to reduce the number of possible data values being considered.
4. Data mining: Based on the data mining task being performed, this step applies the algorithms to the transformed data to generate the desired results.
5. Interpretation/evaluation: How the data mining results are presented to the users is extremely important because the usefulness of the results depends on it.
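A minimal sketch of steps 2 and 3 in standard SQL (the customer_raw table is hypothetical; real KDD pipelines usually use dedicated tools, and some DBMSs restrict updating a table referenced in its own subquery):

-- Preprocessing: replace missing ages with a simple estimate (the average age)
UPDATE customer_raw
SET age = (SELECT AVG(age) FROM customer_raw AS c WHERE c.age IS NOT NULL)
WHERE age IS NULL;

-- Transformation: convert to a common format and reduce values into ranges
SELECT cust_id,
       UPPER(TRIM(name)) AS name,            -- common format
       CASE WHEN age < 25 THEN 'young'
            WHEN age < 60 THEN 'adult'
            ELSE 'senior' END AS age_band    -- data reduction into ranges
FROM customer_raw;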
The concept of Data Mining is becoming
increasingly popular as a business
information management tool where it is
expected to reveal knowledge structures that
can guide decisions in conditions of limited
certainty.