Security/controls of Databases
Abdul-hakim M. Warsame Richard A. Nyatangi Timothy K. Kamau James Kungu James Kitonga D61/72809/2012 D61/68249/2011 D61/67302/2011 D61/72729/2012 D61/72713/2012

Introduction: Computers store and retrieve information using file systems and databases, but the two are designed to handle data in different ways. Databases and file systems are software-based and can be used on both personal computers and large mainframes. Both systems have largely replaced their paper-based equivalents, and without them many tasks that computers do would be impossible.

A file consists of a number of records. Record: a record is a collection of related fields. Field: a field contains a set of related characters. Character: a character is the smallest element in a file.

A file system is a system that collects data or files and stores them in a physical location such as a hard disk or tape. File systems are containers of collections. Collections are commonly called directories, and they contain a set of data units commonly called files. A "file manager" is used to keep track of the relationships between directories and files in a file system. File management systems record the location of files saved on a computer's hard disk rather than individual data records. They store information about the location of files on the hard disk, their size, and associated attributes – such as whether the file is read-only.

Types of files: Master file – files of a fairly permanent nature, e.g. payroll, inventory, customer. They require regular updating to show the current position; the master file will contain some data that is static in nature and some data that keeps on changing. Transaction/Movement file – made up of the various transactions created from the source documents. This file is used to update the master file. Reference file – a file with a reasonable amount of permanency.

There are two basic strategies for processing transactions against the files: 1. Transaction processing – processing each transaction as it occurs (real-time processing). 2. Batch processing – collecting transactions together over some interval of time and then processing the whole batch.

File-based systems were an early attempt to computerize the manual filing system. A file-based system is a collection of application programs that perform services for the end-users, where each program defines and manages its own data.
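As an illustration of the master/transaction file organization and batch processing described above, the following is a minimal Python sketch; the file names, CSV format and the (account_id, balance/amount) field layout are hypothetical, chosen only for the example.

```python
import csv

# Minimal sketch of a batch run: a transaction file collected over a period
# is applied in one pass to update the master file.
def batch_update(master_path, transactions_path, new_master_path):
    # Load the master file: one record per account (fairly permanent data).
    with open(master_path, newline="") as f:
        master = {row["account_id"]: float(row["balance"]) for row in csv.DictReader(f)}

    # Apply every transaction record (created from source documents) to the master data.
    with open(transactions_path, newline="") as f:
        for txn in csv.DictReader(f):
            master[txn["account_id"]] = master.get(txn["account_id"], 0.0) + float(txn["amount"])

    # Write the updated master file showing the current position.
    with open(new_master_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["account_id", "balance"])
        writer.writeheader()
        for account_id, balance in master.items():
            writer.writerow({"account_id": account_id, "balance": balance})

if __name__ == "__main__":
    batch_update("master.csv", "transactions.csv", "master_updated.csv")
```

Real-time (transaction) processing would instead apply each transaction as it arrives rather than accumulating a batch.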
However, several types of problems occur when the file-based approach is used:

Separation and isolation of data: when data is isolated in separate files, it is more difficult to access data that should be available. The application programmer is required to synchronize the processing of two or more files to ensure the correct data is extracted.

Duplication of data: with the decentralized file-based approach, uncontrolled duplication of data occurs. Uncontrolled duplication is undesirable because it wastes storage space and allows copies of the same data to become inconsistent.

Data dependence: in a file-based system, the physical structure and storage of the data files and records are defined in the application program code. This characteristic is known as program-data dependence. Making changes to an existing structure is difficult and leads to modification of the programs; such maintenance activities are time-consuming and subject to error.

Incompatible file formats: the structure of a file depends on the application programming language. A file structure provided in one programming language, such as the direct or indexed-sequential files available in COBOL, may differ from the structure generated by another programming language such as C. This direct incompatibility makes such files difficult to process jointly.

Fixed queries / proliferation of application programs: file-based systems are very dependent upon the application programmer. Any required queries or reports have to be written by the application programmer; normally only fixed-format queries or reports can be entertained, and no facility for ad-hoc queries is offered. File-based systems also place tremendous pressure on data-processing staff, with users complaining about programs that are inadequate or inefficient in meeting their demands. Documentation may be limited and maintenance of the system is difficult. Provision for security, integrity and recovery capability is very limited.

In order to overcome the limitations of the file-based approach, the concept of the database was developed.

What is a Database? A database is a structured collection of records or data that is stored in a computer system. A database is a single organized collection of structured data, with controlled redundancy. A database is basically a computerized record-keeping system: it is a repository or container for a collection of computerized data files. The data in the database is integrated and shared. Integrated means that the database can be thought of as a unification of several distinct files with controlled redundancy. Shared means that individual pieces of data in the database can be shared among different users; any given user will be concerned only with some aspects of the total database. Independence of the database and the programs using it means that one can be changed without changing the other.

In order for a database to be truly functional, it must not only store large amounts of records well, but also be easy to access. In addition, new information and changes should be fairly easy to input. In order to have a highly efficient database system, a program that manages the queries and the information stored on the system must be incorporated. This is usually referred to as a DBMS, or Database Management System. Besides these features, all databases that are created should be built with high data integrity and the ability to recover data if hardware fails.

The database environment has three main components. 1. Hardware: the DBMS and the applications require hardware to run on. The hardware can range from a single personal computer, to a single mainframe, to a network of computers. The particular hardware depends on the organization's requirements and the DBMS used. 2. Software: the software component comprises the DBMS and the application programs, together with the operating system, including network software if the DBMS is being used over a network. 3. People: data and database administrators, application developers, and the end-users.

A DBMS is a software system that enables users to define, create and maintain the database and to control access to it. The DBMS is the software that interacts with the users' application programs and the database; it thus provides the controlled interface between the user and the data in the database. It allows users to define the database, usually through a Data Definition Language (DDL), and to insert, update, delete and retrieve data from the database, usually through a Data Manipulation Language (DML).
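As a minimal illustration of DDL and DML, the sketch below uses Python's built-in sqlite3 module; the table and column names are hypothetical.

```python
import sqlite3

# Minimal sketch: DDL defines the database structure, DML manipulates its data.
conn = sqlite3.connect(":memory:")

# DDL: define a table with a primary key and an integrity constraint.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        balance     REAL CHECK (balance >= 0)
    )
""")

# DML: insert, update, query and delete rows.
conn.execute("INSERT INTO customer (customer_id, name, balance) VALUES (?, ?, ?)",
             (1, "Alice", 100.0))
conn.execute("UPDATE customer SET balance = balance + 50 WHERE customer_id = ?", (1,))
for row in conn.execute("SELECT customer_id, name, balance FROM customer"):
    print(row)                      # (1, 'Alice', 150.0)
conn.execute("DELETE FROM customer WHERE customer_id = ?", (1,))
conn.close()
```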
A DBMS also provides security by protecting data against unauthorized access, safeguarding data against corruption, and providing recovery and restart facilities after hardware or software failure, and it supports application development.

There are several people involved in databases. These include: 1. End users, who interact with the system from their workstations/terminals. 2. Application programmers, who are responsible for the development of application programs; they make use of programming languages. 3. The database administrator (DBA), who is responsible for the following: development of the database; maintenance of the database; maintenance of the data dictionary; manuals; security of the database; appraisal of database performance; and ensuring adherence to data protection.

There are several common types of databases. Each type of database has its own data model (how the data is structured). They include: 1) the flat model, 2) the hierarchical model, 3) the relational model, and 4) the network model.

1. Flat model: in a flat model database there is a two-dimensional (flat structure) array of data. For instance, there is one column of information, and within this column it is assumed that each data item is related to the others.

2. Hierarchical model: the hierarchical model database resembles a tree-like structure, similar to how Microsoft Windows organizes folders and files. In a hierarchical model database, each upward link is nested in order to keep data organized in a particular order on a same-level list.

3. Relational model: the relational model is the most popular type of database and an extremely powerful tool, not only to store information but to access it as well. Relational databases are organized as tables. The beauty of a table is that the information can be accessed or added without reorganizing the tables. A table can have many records, and each record can have many fields.

4. Network model: in a network model, the defining feature is that a record is stored with a link to other records – in effect, networked. These links (sometimes referred to as pointers) can be any type of information, such as node numbers or even a disk address.

A number of advantages are obtained by applying the database approach in application systems, including:

1. Control of data redundancy: the database approach attempts to eliminate redundancy by integrating the files. Although the database approach does not eliminate redundancy entirely, it controls the amount of redundancy inherent in the database.

2. Data consistency: by eliminating or controlling redundancy, the database approach reduces the risk of inconsistencies occurring. It ensures all copies of the data are kept consistent.

3. More information from the same amount of data: with the integration of the operational data in the database approach, it may be possible to derive additional information from the same data.

4. Sharing of data: the database belongs to the entire organization and can be shared by all authorized users.

5. Improved data integrity: database integrity provides the validity and consistency of stored data. Integrity is usually expressed in terms of constraints, which are consistency rules that the database is not permitted to violate.

6. Improved security: the database approach provides protection of the data from unauthorized users.
It may take the form of user names and passwords to identify the type of user and their access rights for operations including retrieval, insertion, updating and deletion.

7. Enforcement of standards: the integration of the database enforces the necessary standards, including data formats, naming conventions, documentation standards, update procedures and access rules.

8. Economy of scale: cost savings can be obtained by combining all of the organization's operational data into one database, with applications working on one source of data.

9. Balance of conflicting requirements: by having a structured design of the database, conflicts between users or departments can be resolved. Decisions will be based on the best use of resources for the organization as a whole rather than for an individual entity.

10. Improved data accessibility and responsiveness: with the integration of the database approach, data access can cross departmental boundaries. This feature provides more functionality and better services to the users.

11. Increased productivity: the database approach provides all the low-level file-handling routines. The provision of these functions allows the programmer to concentrate on the specific functionality required by the users. The fourth-generation environment provided by the database can simplify database application development.

12. Improved maintenance: the database approach provides data independence. As a change to the data structure in the database will not affect the application programs, it simplifies database application maintenance.

13. Increased concurrency: the database can manage concurrent data access effectively, ensuring that interference between users does not result in any loss of information or loss of integrity.

14. Improved backup and recovery services: modern database management systems provide facilities to minimize the amount of processing that can be lost following a failure, by using the transaction approach.

In spite of the large number of advantages of the database approach, it is not without challenges. The following disadvantages can be found:

1. Complexity: a database management system is an extremely complex piece of software. All parties must be familiar with its functionality to take full advantage of it; therefore, training for the administrators, designers and users is required.

2. Size: the database management system consumes a substantial amount of main memory as well as a large amount of disk space in order to run efficiently.

3. Cost of DBMS: a multi-user database management system may be very expensive. Even after installation, there is a high recurrent annual maintenance cost for the software.

4. Cost of conversion: when moving from a file-based system to a database system, the company is required to incur additional expenses on hardware acquisition and training.

5. Performance: as the database approach is designed to cater for many applications rather than exclusively for a particular one, some applications may not run as fast as before.

6. Higher impact of a failure: the database approach increases the vulnerability of the system due to centralization. As all users and applications rely on the availability of the database, the failure of any component can bring operations to a halt and seriously affect the services provided to customers.

Database design is the process of producing a detailed data model of a database to meet end users' requirements.
The ability to design databases and associated applications is critical to the success of the modern enterprise. Database design requires understanding both the operational and business requirements of an organization, as well as the ability to model and realize those requirements using a database.

A good database design reflects the real-world structure of the problem, can represent all expected data over time, avoids redundancy and ensures consistency, provides efficient access to data, and supports the maintenance of data integrity over time. A sound design methodology should: work interactively with users; follow a structured methodology; employ a data-driven approach; include structural and integrity considerations; combine conceptualization, normalization and transaction validation techniques; use diagrams; use a Database Design Language (DBDL); build a data dictionary; and be willing to repeat steps.

The most critical aspect of specification is the gathering and compilation of system and user requirements. This process is normally done in conjunction with managers and users. The major goals in requirements gathering are to collect the data used by the organization, identify relationships in the data, identify future data needs, and determine how the data is used and generated. The starting place for data collection is gathering existing forms and reviewing policies and systems. Then, ask users what the data means, and determine their daily processes. These things are especially critical: identification of unique fields (keys); data dependencies, relationships and constraints (high-level); and the data sizes and their growth rates.

Fact-finding is the use of interviews and questionnaires to collect facts about systems, requirements and preferences. Five fact-finding techniques: examining documentation; interviewing; observing the enterprise in operation; research; and questionnaires.

The requirements gathering and specification provides a high-level understanding of the organization, its data, and the processes that must be modelled in the database. Database design involves constructing a suitable model of this information. Since the design process is complicated, especially for large databases, database design is divided into three phases: conceptual database design, logical database design and physical database design.

Conceptual database design is the process of constructing a data model for each view of the real-world problem, independent of physical considerations. It involves modelling the collected information at a high level of abstraction without using a particular data model or DBMS. It is independent of the DBMS, allows for easy communication between end-users and developers, and has a clear method for converting the high-level model to the relational model. The conceptual schema is a permanent description of the database requirements. The steps are: construct the ER model, check the model for redundancy, and validate the model against user transactions to ensure all the scenarios are supported.

ER – Entity Relationship. A pictorial representation of the real-world problem in terms of entities and the relations between those entities is referred to as an ER diagram. It is the most popular conceptual model for database design and the basis for many other models. It describes the data in a system and how that data is related, describing data as entities, attributes and relationships. Entities: a class of distinct identifiable objects or concepts, e.g. a person, an account, a course. Relations: associations among entities are referred to as relations. Attributes: properties or characteristics of entities, e.g. Person-String, Account-Decimal.
Entity Set: A collection of similar entities e.g. all employees. All entities in an entity set have the same set of attributes. Each entity set has a key. Each attribute has a domain. Used for the description of the conceptual schema of the database. Not used for database implementation. Formal notation. Close to natural language. Can be mapped to various data models i.e. relational, object-oriented, object-relational, XML, description logics. Schema: Should have stable information. Instance: Consider changing nature of information. Avoid redundancy (each fact should be represented once). No need to store information that can be computed. Keys should be as small as possible. Introduce artificial keys only if no simple, natural keys are available. 1. Creating relation schemas from entity types. 2. Creating relation schemas from relationship types. 3. Identifying keys. 4. Identifying foreign keys. 5. Schema optimization. ER design is subjective. There are often many ways to model a given scenario. Analyzing alternatives can be tricky, especially for a large enterprise. We must convert the written database requirements into an E-R diagram. There is need to determine the entities, attributes and relationships. – nouns = entities – adjectives = attributes – verbs = relationships Weak entities do not have key attributes of their own. Weak entities cannot exist without a relationship to another entity. A partial key is the portion of the key that comes from the weak entity. The rest of the key comes from the other entity in the relationship. Weak entities always have total participation as they cannot exist without the identifying relationship. Each entity has a set of associated properties that describes the entity. These properties are known as attributes. Attributes can be: Simple or composite Single or Multi-valued Stored or Derived Null Candidate key: an attribute or set of attributes that uniquely identifies individual occurrences of an entity type. Composite key: The terms composite key and compound key are also used to describe primary keys that contain multiple attributes. When dealing with a composite primary key it is important to understand that it is the combination of values for all attributes that must be unique. Primary key: The primary key is an attribute or combination of attributes that uniquely identifies an instance of the entity. In other words, no two instances of an entity may have the same value for the primary key. Defines a set of associations between various entities Can have attributes to define them Are limited by: Participation Cardinality Ratio For example, a SECTION entity might be related to a COURSES entity, or an EMPLOYEES entity might be related to an OFFICES entity. An ellipse connected by lines to the related entities indicates a relationship in an ERD. Degree of relationship: the number of participating entities in a relationship, e. g. unary, binary, ternary etc. A unary relationship is a relationship involving a single entity. A relationship between two entities is called a binary relationship. When three entities are involved in a relationship, a ternary relationship exists. Relationships that involve more than three entities are referred to as n-ary relationships, where n is the number of entities involved in the relationship. Relationships can also differ in terms of their cardinality. Maximum cardinality refers to the maximum number of instances of one entity that can be associated with a single instance of a related entity. 
Minimum cardinality refers to the minimum number of instances of one entity that must be associated with a single instance of a related entity. If one CUSTOMER can be related to only one ACCOUNT and one ACCOUNT can be related to only a single CUSTOMER, the cardinality of the CUSTOMER-ACCOUNT relationship is one-to-one (1:1). If an ADVISOR can be related to one or more STUDENTS, but a STUDENT can be related to only a single ADVISOR, the cardinality is one-to-many (1:N). The cardinality of the relationship is many-to-many (M:N) if a single STUDENT can be related to zero or more COURSES and a single COURSE can be related to zero or more STUDENTS. (An ER schema for a database storing information about professors, courses and course sections would illustrate these constructs.)

Logical database design is the process of constructing a model of the information which can then be mapped into storage objects supported by the database management system. Once the relationships and dependencies amongst the various pieces of information have been determined, it is possible to arrange the data into a logical structure which can then be mapped into the storage objects supported by the database management system. In the case of relational databases, the storage objects are tables which store data in rows and columns. Each table may represent an implementation of either a logical object or a relationship joining one or more instances of one or more logical objects. Relationships between tables may then be stored as links connecting child tables with parents. Since complex logical relationships are themselves tables, they will probably have links to more than one parent. In an object database, the storage objects correspond directly to the objects used by the object-oriented programming language used to write the applications that will manage and access the data. The relationships may be defined as attributes of the object classes involved or as methods that operate on the object classes.

The logical database design step involves: i. table generation from the ER model, and ii. normalization of the tables. The cardinality of the relationships among the entities is considered when deriving the tables from the ER model. One-to-one: entities with one-to-one relationships should be merged into a single entity; each remaining entity is modeled by a table with a primary key and attributes, some of which may be foreign keys. One-to-many: one-to-many relationships are modeled by a foreign key attribute in the table; this foreign key refers to another table that contains the other side of the relation. Many-to-many: many-to-many relationships between two entities are modeled by a third table that has foreign keys referring to the two entities.

Normalization is a process of eliminating redundancy and other anomalies in the system. In most cases in the enterprise world, normalization up to Third Normal Form would suffice. In certain cases, or for some transactions, it is desirable that certain tables be denormalised for efficiency in querying the database tables; in those cases tables can be left in denormalised form.
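A minimal sketch of this mapping, using Python's sqlite3: the ADVISOR, STUDENT and COURSE names follow the cardinality examples above, while the column names are hypothetical. The 1:N ADVISOR-STUDENT relationship becomes a foreign key in the student table, and the M:N STUDENT-COURSE relationship becomes a third (junction) table.

```python
import sqlite3

# Sketch of deriving relational tables from an ER model (column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE advisor (
        advisor_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );

    -- 1:N (one ADVISOR advises many STUDENTs): a foreign key in the student table.
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        advisor_id INTEGER REFERENCES advisor(advisor_id)
    );

    CREATE TABLE course (
        course_id  INTEGER PRIMARY KEY,
        title      TEXT NOT NULL
    );

    -- M:N (STUDENT takes COURSE): a third table whose composite primary key
    -- is made up of foreign keys referencing the two entities.
    CREATE TABLE enrolment (
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
""")
conn.close()
```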
The physical design of the database specifies the physical configuration of the database on the storage media. This includes detailed specification of data elements, data types, indexing options and other parameters residing in the DBMS data dictionary. It is the detailed design of a system that includes the modules and the hardware and software specifications of the system. It involves describing the base relations, file organizations and index designs used to achieve efficient access to the data, and any associated integrity constraints and security measures. Other activities include: considering typical workloads and further refining the database design; considering the introduction of controlled redundancy; designing security mechanisms; monitoring and tuning the operational system; and designing user views.

Common database design mistakes include the following. Poor design/planning: far too often, a proper planning phase is ignored in favor of just "getting it done". The project heads off in a certain direction and when problems inevitably arise – due to the lack of proper designing and planning – there is "no time" to go back and fix them properly, using proper techniques. Ignoring normalization: normalization defines a set of methods to break down tables to their constituent parts until each table represents one and only one "thing", and its columns serve to fully describe only the one "thing" that the table represents. Poor naming standards: names, while a personal choice, are the first and most important line of documentation for your application. The names you choose are not just to enable you to identify the purpose of an object, but to allow all future programmers, users, and so on to quickly and easily understand how a component part of your database was intended to be used, and what data it stores. No future user of your design should need to wade through a 500-page document to determine the meaning of some wacky name.

Introduction to database security: ensuring the security of a database is a complex issue for organizations. The more complex the databases are, the more complex the security measures that must be applied. Network and internet connections to databases may complicate things even further, and every additional internal user given access to the database can create further serious security problems. An organization's valuable information stored in a computer system database is its most precious asset and must be protected.

Definition of database security: these are mechanisms that protect the database against intentional or accidental threats. Security considerations apply not only to the data held in a database; breaches to other parts of the system can also affect the database. Database security therefore encompasses hardware, software, people and data. Effective implementation of security requires appropriate controls, which are defined in specific mission objectives for the system. The need for security is driven by the increasing amounts of crucial corporate data being stored on computers and the acceptance that any loss or unavailability of this data could prove to be disastrous.

Database security risks: the security of a database is considered in relation to the following situations: theft and fraud; loss of confidentiality (secrecy); loss of privacy; loss of integrity; loss of availability. Fraud or loss of privacy may arise from either intentional or unintentional acts and does not necessarily result in any detectable changes to the database or the computer system. Theft and fraud affect not only the database environment but also the entire organization; they do not necessarily alter data, as is the case for activities that result in either loss of confidentiality or loss of privacy. Confidentiality refers to the need to maintain secrecy over data, usually only that which is critical to the organization, whereas privacy refers to the need to protect data about individuals.
Breaches of security resulting in loss of confidentiality could, for instance, lead to loss of competitiveness, and loss of privacy could lead to legal action being taken against the organization. Loss of data integrity results in invalid or corrupted data, which may seriously affect the operation of an organization. Loss of availability means that the data, or the system, or both cannot be accessed, which can seriously affect an organization's financial performance. NB: database security aims to minimize losses caused by anticipated events in a cost-effective manner without unduly constraining the users.

Threats: a threat is any situation or event, whether intentional or accidental, that may adversely affect a system and consequently the organization. It may be caused by a situation or event involving a person, action or circumstance that is likely to bring harm to an organization. The harm may be tangible, such as loss of hardware, software or data, or intangible, such as loss of credibility or client confidence. The problem facing the organization is to identify all possible threats.

Summary of potential threats to computer systems:
Hardware: fire/floods/bombs; data corruption due to power loss or surge; failure of security mechanisms giving greater access; theft of equipment; physical damage to equipment; electronic interference and radiation.
DBMS and application software: failure of security mechanisms giving greater access; program alteration; theft of programs.
Communication networks: wire tapping; breaking or disconnection of cables; electronic interference and radiation.
Database: unauthorized amendment or copying of data; theft of data; data corruption due to power loss or surge.
Users: using another person's means of access; viewing and disclosing unauthorized data; inadequate staff training; illegal entry by hackers; blackmail; introduction of viruses.
Programmers/operators: creating trapdoors; program alteration; inadequate staff training; inadequate security policies and procedures; staff shortages or strikes.
Data/database administrator: inadequate security policies and procedures.

Countermeasures – computer-based controls: the types of controls range from physical to administrative measures. Types of controls: authorization; access controls; views; backup and recovery; integrity; encryption; RAID technology.

Authorization refers to the granting of a right or privilege that enables a subject to have legitimate access to a system or a system's objects. It can be built into the software and governs not only what system or object a specified user can access, but also what the user can do with it. The process of authorization involves authentication of subjects requesting access to objects, where a subject represents a user or program and an object represents a database table, view, procedure, trigger or any other object that can be created within the system. Authentication: a mechanism that determines whether a user is who he or she claims to be. A system administrator is usually responsible for allowing users to have access to a computer system by creating individual user accounts.
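As a minimal sketch of password-based authentication, using only Python's standard library: the in-memory account store, salt length and iteration count are illustrative assumptions, not a prescription for any particular DBMS.

```python
import hashlib
import hmac
import os

# Illustrative in-memory "account store": user name -> (salt, password hash).
# A real DBMS or application would keep these in a protected system catalogue.
accounts = {}

def create_account(username, password):
    salt = os.urandom(16)                                   # per-user random salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    accounts[username] = (salt, digest)

def authenticate(username, password):
    # Determine whether the user is who he or she claims to be.
    if username not in accounts:
        return False
    salt, stored = accounts[username]
    attempt = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(attempt, stored)             # constant-time comparison

create_account("alice", "s3cret")
print(authenticate("alice", "s3cret"))   # True
print(authenticate("alice", "wrong"))    # False
```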
Access controls: the typical way to provide access controls for a database system is based on the granting and revoking of privileges. A privilege allows a user to create or access (i.e. read, write or modify) some database object (such as a relation, view or index) or to run certain DBMS utilities. Privileges are granted to users to accomplish the tasks required for their jobs. The DBMS keeps track of how privileges are granted to users, and possibly revoked, and ensures that at all times only users with the necessary privileges can access an object.

Discretionary Access Control (DAC): most commercial DBMSs provide an approach to managing privileges that uses SQL, called Discretionary Access Control (DAC). The SQL standard supports DAC through the GRANT and REVOKE commands: the GRANT command gives privileges to users and the REVOKE command takes privileges away. Discretionary access control, while effective, has certain weaknesses: for example, an unauthorized user can trick an authorized user into disclosing sensitive data.

Mandatory Access Control (MAC): this is based on system-wide policies that cannot be changed by individual users. In this approach each database object is assigned a security class, each user is assigned a clearance for a security class, and rules are imposed on the reading and writing of database objects by users. The DBMS determines whether a given user can read or write a given object based on rules that involve the security level of the object and the clearance of the user. The rules seek to ensure that sensitive data can never be passed on to another user without the necessary clearance. The SQL standard does not include support for MAC.

Views: a view is the dynamic result of one or more relational operations operating on the base relations to produce another relation. A view is a virtual relation that does not actually exist in the database but is produced upon request by a particular user, at the time of request. It provides a powerful and flexible security mechanism by hiding parts of the database from certain users; the user is not aware of the existence of any attributes or rows that are missing from the view. A view can be defined over several relations, with a user being granted the appropriate privilege to use it, but not to use the base relations.

Backup and recovery: this is the process of periodically taking a copy of the database and log file (and possibly programs) onto offline storage media. A DBMS should provide backup facilities to assist with the recovery of a database following a failure. It is always advisable to make backup copies of the database and the log file at regular intervals and to ensure that the copies are kept in a secure location. In the event of a failure that renders the database unusable, the backup copy and the details captured in the log file are used to restore the database to the latest possible consistent state. Journaling is the process of keeping and maintaining a log file (or journal) of all changes made to the database, so that recovery can be undertaken effectively in the event of a failure. A DBMS should provide logging facilities, which keep track of the current state of transactions and database changes, to provide support for recovery procedures. The advantage of journaling is that, in the event of a failure, the database can be recovered to its last known consistent state using a backup copy of the database and the information contained in the log file. If no journaling is enabled on a failed system, the only means of recovery is to restore the database using the latest backup version; without a log file, any changes made after the last backup will be lost.

Integrity: integrity constraints also contribute to a secure database system by preventing data from becoming invalid, and hence from giving misleading or incorrect results.
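A minimal sketch of the discretionary access control and view mechanisms described above. The staff table, view and user names are hypothetical; the GRANT/REVOKE statements are standard SQL that would be issued on a multi-user DBMS (SQLite itself has no user accounts, so they are shown here only as statement text), while the view is created and queried with Python's sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (
        staff_id INTEGER PRIMARY KEY,
        name     TEXT,
        branch   TEXT,
        salary   REAL          -- sensitive column to be hidden from most users
    );
    INSERT INTO staff VALUES (1, 'Otieno', 'Nairobi', 90000),
                             (2, 'Achieng', 'Mombasa', 75000);

    -- A view hiding the sensitive salary column from the users who query it.
    CREATE VIEW staff_public AS
        SELECT staff_id, name, branch FROM staff;
""")
print(conn.execute("SELECT * FROM staff_public").fetchall())
conn.close()

# Discretionary access control on a multi-user DBMS (statement text only;
# the user 'clerk' and the privileges shown are illustrative):
grant_sql = "GRANT SELECT ON staff_public TO clerk;"
revoke_sql = "REVOKE SELECT ON staff_public FROM clerk;"
```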
Encryption: the encoding of data by a special algorithm that renders the data unreadable by any program without the decryption key. Symmetric encryption uses the same key for both encryption and decryption and relies on safe communication lines for exchanging the key. The Data Encryption Standard (DES) is one example: it uses one key for both encryption and decryption, which must be kept secret, although the algorithm need not be.

RAID (Redundant Array of Independent Disks): the hardware that the DBMS runs on must be fault tolerant, meaning that the DBMS should continue to operate even if one of the hardware components fails. RAID works by having a large disk array comprising an arrangement of several independent disks organized to improve reliability and, at the same time, increase performance. Performance is increased through data striping: the data is segmented into equal-size partitions (the striping unit) which are transparently distributed across multiple disks. This gives the appearance of a single large, fast disk where in actual fact the data is distributed across several smaller disks. Striping improves overall I/O performance by allowing multiple I/O requests to be serviced in parallel, and it also balances the load among the disks.
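A minimal sketch of symmetric encryption as described above, assuming the third-party Python package cryptography is installed; Fernet is used here as one example of a symmetric scheme, and the message is illustrative.

```python
# Symmetric encryption sketch: the same key encrypts and decrypts, so the key
# itself must be exchanged and stored securely.
# Assumes the third-party package:  pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # secret key shared by sender and receiver
cipher = Fernet(key)

token = cipher.encrypt(b"account 1: balance 150.00")   # unreadable without the key
print(token)
print(cipher.decrypt(token))         # b'account 1: balance 150.00'
```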
DBMS and web security: this focuses on how to make a DBMS secure on the web. Internet communication relies on TCP/IP as the underlying protocol; however, TCP/IP and HTTP were not designed with security in mind. Without any special software, all internet traffic travels "in the clear", and anyone who monitors the traffic can read it.

Internet challenges: the challenge on the internet is to transmit and receive information while ensuring that: it is inaccessible to anyone but the sender and receiver (privacy); it has not been changed during transmission (integrity); the receiver can be sure it came from the sender (authenticity); the sender can be sure the receiver is genuine (non-fabrication); and the sender cannot deny he or she sent it (non-repudiation). With a multi-tier architecture such as the web environment, the complexity of ensuring secure access to and from the database increases. Security must also be addressed if the information transmitted contains executable content. Dangers of executable content: corrupting data or the execution state of programs; reformatting complete disks; performing a total system shutdown; collecting and downloading confidential data such as files or passwords to another site; usurping identity and impersonating the user or the user's computer to attack other targets on the network; locking up resources, making them unavailable for legitimate users and programs; and causing non-fatal but unwelcome effects, especially on output devices.

Database security in the internet environment:

Proxy servers: a proxy server is a computer that sits between a web browser and a web server. It intercepts all requests to the web to determine whether it can fulfil the request itself; if not, it forwards the request to the web server. Purposes of a proxy server: to improve performance (it saves the results of all requests for a certain amount of time, thus significantly improving performance for a group of users) and to filter requests (e.g. preventing employees from accessing a particular web site).

Firewalls: a firewall is a system designed to prevent unauthorized access to or from a private network. It can be implemented in hardware, software, or a combination of both. Firewalls are frequently used to prevent unauthorized internet users from accessing private networks connected to the internet, especially intranets. All messages entering or leaving the intranet pass through the firewall, which examines each message and blocks those that do not meet the specified security criteria. Types of firewall techniques: packet filter – looks at each packet entering or leaving the network and accepts or rejects it based on user-defined rules; application gateway – applies security mechanisms to specific applications such as FTP and telnet servers; circuit-level gateway – applies security mechanisms when a TCP or UDP connection is established (once the connection has been made, packets can flow between the hosts without further checking); proxy server – intercepts all messages entering or leaving the network.

Message digest algorithms and digital signatures: a message digest algorithm, or one-way hash function, takes an arbitrary-sized string (the message) and generates a fixed-length string (the digest or hash) with the following characteristics: it should be computationally infeasible to find another message that will generate the same digest, and the digest does not reveal anything about the message. A digital signature consists of two pieces of information: a string of bits that is computed from the data being signed, along with the private key of the individual or organization wishing to sign. The signature can be used to verify that the data comes from this individual or organization.

Digital certificates: a digital certificate is an attachment to an electronic message used for security purposes, most commonly to verify that a user sending a message is who he or she claims to be and to provide the receiver with the means to encode a reply.

Kerberos: this is a server of secured user names and passwords. It provides one centralized security server for all data resources on the network. Database access, login, authorization control and other security features are centralized on the trusted Kerberos server. It has a similar function to that of a certificate server: to identify and validate a user.
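A minimal sketch of the message digest behaviour described above, using Python's standard hashlib module; SHA-256 is used as the example hash function, and the messages are illustrative.

```python
import hashlib

# A one-way hash function maps an arbitrary-sized message to a fixed-length digest.
message = b"Transfer 100 to account 42"
digest = hashlib.sha256(message).hexdigest()
print(digest)                       # 64 hex characters, regardless of message size

# Even a one-character change in the message produces a completely different digest,
# which is what allows a receiver to detect tampering during transmission.
tampered = b"Transfer 900 to account 42"
print(hashlib.sha256(tampered).hexdigest())
print(digest == hashlib.sha256(message).hexdigest())   # True: same message, same digest
```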
Data Warehouse: the term Data Warehouse was coined by Bill Inmon in 1990, who defined it in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows. Subject-oriented: data that gives information about a particular subject instead of about a company's ongoing operations. Integrated: data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Time-variant: all data in the data warehouse is identified with a particular time period. Non-volatile: data is stable in a data warehouse; more data is added, but data is never removed. This enables management to gain a consistent picture of the business.

Data warehousing is a process of transforming data into information and making it available to users in a timely enough manner to make a difference. It helps answer questions such as: Which are our lowest/highest margin customers? Who are my customers and what products are they buying? What is the most effective distribution channel? What product promotions have the biggest impact on revenue? Which customers are most likely to go to the competition? What impact will new products/services have on revenue and margins?

[Architecture diagram: operational sources such as relational databases, ERP systems, purchased data and legacy data pass through extraction and cleansing into the data warehouse engine, supported by a metadata repository, and are then analyzed and queried.]

A data warehouse puts information technology to work to help the knowledge worker make faster and better decisions: Which of my customers are most likely to go to the competition? What product promotions have the biggest impact on revenue? How did the share price of software companies correlate with profits over the last 10 years? It is used to manage and control the business; its data is historical or point-in-time; it is optimized for inquiry rather than update; use of the system is loosely defined and can be ad-hoc; and it is used by managers and end-users to understand the business and make judgements.

An enterprise data warehouse collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization. A data mart is a departmental subset that focuses on selected subjects. A decision support system (DSS) is information technology that helps the knowledge worker (executive, manager, analyst) make faster and better decisions; online analytical processing (OLAP) is an element of decision support systems (DSS).

Data volumes range widely: terabytes (10^12 bytes), e.g. Walmart at 24 terabytes; petabytes (10^15 bytes); exabytes (10^18 bytes), e.g. geographic information systems and national medical records; zettabytes (10^21 bytes), e.g. weather images; yottabytes (10^24 bytes), e.g. intelligence agency videos.

Typical industries and applications include: Finance – credit card analysis; Insurance – claims and fraud analysis; Telecommunication – call record analysis; Transport – logistics management; Consumer goods – promotion analysis; Data service providers – value-added data; Utilities – power usage analysis.

A data warehouse delivers enhanced business intelligence: by providing data from various sources, managers and executives will no longer need to make business decisions based on limited data or their gut. In addition, "data warehouses and related BI can be applied directly to business processes including marketing segmentation, inventory management, financial management, and sales."

A data warehouse saves time: since business users can quickly access critical data from a number of sources – all in one place – they can rapidly make informed decisions on key initiatives. They won't waste precious time retrieving data from multiple sources. Not only that, but business executives can query the data themselves with little or no support from IT, saving more time and more money.

A data warehouse enhances data quality and consistency: a data warehouse implementation includes the conversion of data from numerous source systems into a common format. Since the data from the various departments is standardized, each department will produce results that are in line with all the other departments, so you can have more confidence in the accuracy of your data; and accurate data is the basis for strong business decisions.

A data warehouse provides historical intelligence: a data warehouse stores large amounts of historical data, so you can analyze different time periods and trends in order to make future predictions. Such data typically cannot be stored in a transactional database or used to generate reports from a transactional system.

Data warehouses also have drawbacks: 1. Data warehouses are not the optimal environment for unstructured data. 2. Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data. 3. Over their life, data warehouses can have high costs, and they can get outdated relatively quickly. 4.
There is a cost of delivering suboptimal information to the organization. 5. There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems. What is Data Mining? Data Mining, a specialized part of Knowledge-Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns and rules. It helps in extracting meaningful new patterns that cannot be found necessarily by merely querying or processing data or metadata in the data warehouse. A few examples arising from data mining may include: •pattern showing whenever a customer buys video equipment, he or she also buys another electronic gadget. •suppose a customer buys a camera, and within three months he or she buys photographic supplies, and within six months an accessory item? •Customer classification by frequency of visits, by amount of purchase, item purchase, payment mode etc… What is Data Mining? Data mining access of the database differs from this traditional access in three major areas: 1. Query: The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what they want to see. 2. Data: The data access is usually a different version from that of the operational database (it typically comes from a data warehouse). The data must be cleansed and modified to better support mining operations. 3. Output: The output of the data mining query probably is not a subset of the database. Instead it is the output of some analysis of the contents of the database. Why mine data? Prediction - show how certain attributes within the data will behave in the future. Identification - Data patterns can be used to identify the existence of an item, an event, or an activity. classification - Partitioning the data so that different categories can be identified based on combinations of parameters Optimization - optimize the use of limited resources to maximize output variables e.g. sales or profits. Stage 1: Exploration. Data preparation which involves cleaning data, data transformations, and selecting subsets of records. Stage 2: Model building and validation. Considering various models & choosing the best one based on their predictive performance. Stage 3: Deployment. applying the model to new data in order to generate predictions or estimates of the expected outcome. The knowledge discovered during data mining can be described in five ways, as follows: Association rules—These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. Classification hierarchies—The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. Sequential patterns—A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and later developed high blood sugar within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting association among events with certain temporal relationships. 
Categorization and segmentation—A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Example: the adult population in Kenya may be categorized into five groups from "most likely to buy" to "least likely to buy" a new product. For most applications, the desired knowledge is a combination of the above types.

Data mining models fall into descriptive models (clustering, summarization, sequence discovery, association rules) and predictive models (classification, regression, prediction, time-series analysis). This is not an exhaustive listing, and combinations of these tasks yield more sophisticated mining operations.

Classification: given a set of items that belong to several classes, and given the past associations between items and classes, classification is the process of predicting the class of a new item. Technique used: decision-tree classifiers. Also: artificial neural networks – predictive models that learn through training and resemble biological neural networks in structure. A tree-induction example: has the customer been renting property for more than 2 years? If not, predict "rent property"; if so, is the customer over 25 years old? If not, predict "rent property"; if so, predict "buy property". This predictive model classifies customers into one of two categories, renters and buyers: it predicts that customers who are over 25 years old and have rented for more than 2 years will buy property, and that the others will rent.

Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation (actually, segmentation is a special case of clustering, although many people refer to them synonymously). Clustering algorithms find groups of items that are similar. Technique used: nearest neighbour – a classification technique that classifies each record based on the records most similar to it in an historical database.

Regression (predictive): regression is used to map a data item to a real-valued prediction variable – the prediction of a value rather than a class. Regression assumes that the target data fit some known type of function (i.e. linear, logistic, etc.) and then determines the best function of this type that models the given data. Some type of error analysis is used to determine which function is "best", i.e. produces the least total error. The problem with linear regression is that the technique only works well with linear data and is sensitive to the presence of outliers (data values which do not conform to the expected norm). Nonlinear regression avoids the main problems of linear regression, but not fully. Data mining requires statistical methods that can accommodate nonlinearity, outliers, and non-numeric data.

Association rules: an association algorithm creates rules that describe how often events have occurred together, e.g. when a customer buys a hammer, then 90% of the time they will buy nails. Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule, e.g. people who buy hotdog buns also buy hotdog sausages in 99% of cases. Confidence is a measure of how often the consequent is true when the antecedent is true, e.g. 90% of hotdog bun purchases are accompanied by hotdog sausages. When using association rules, one must remember that these are not causal relationships. They do not represent any relationship inherent in the actual data, as is the case with functional dependencies, or in the real world. There is probably no relationship between the items, and there is no guarantee that the association will apply in the future.
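A minimal pure-Python sketch of two of the techniques above: the induction-tree rule from the rent/buy example, and the support and confidence measures for a toy hammer-and-nails association rule. The transaction data is invented purely for illustration.

```python
# Classification: the induction-tree rule from the rent/buy example above.
def classify_customer(years_renting, age):
    if years_renting > 2 and age > 25:
        return "buy property"
    return "rent property"

print(classify_customer(years_renting=3, age=30))   # buy property
print(classify_customer(years_renting=1, age=30))   # rent property

# Association rule "hammer -> nails": support and confidence over toy transactions.
transactions = [
    {"hammer", "nails", "tape"},
    {"hammer", "nails"},
    {"hammer", "saw"},
    {"nails"},
]
antecedent, consequent = {"hammer"}, {"nails"}

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
antecedent_only = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)      # fraction satisfying antecedent and consequent
confidence = both / antecedent_only     # how often the consequent holds when the antecedent does
print(support, confidence)              # 0.5 0.666...
```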
Data mining technologies can be applied to a large variety of decision-making contexts in business. In particular, areas of significant payoff are expected to include the following. Marketing—applications include analysis of consumer behavior based on buying patterns; determination of marketing strategies including advertising, store location, and targeted mailing; segmentation of customers, stores, or products; and design of catalogs, store layouts, and advertising campaigns. Finance—applications include analysis of the creditworthiness of clients, segmentation of account receivables, performance analysis of finance investments like stocks, bonds, and mutual funds; evaluation of financing options; and fraud detection. Manufacturing—applications involve optimization of resources like machines, manpower, and materials, and optimal design of manufacturing processes, shop-floor layouts, and product design, such as for automobiles based on customer requirements. Health care—applications include analysis of the effectiveness of certain treatments; optimization of processes within a hospital; relating patient wellness data with doctor qualifications; and analyzing side effects of drugs. Science—applications include predicting environmental change.

Data mining provides new knowledge from existing data, such as public databases, government sources and company databases. Old data can be used to develop new knowledge, and new knowledge can be used to improve services or products. Improvements lead to bigger profits, more efficient service, and research insight.

Data mining also raises concerns. Privacy issues: a person's life story can be painted from the collected data, e.g. by linking shopping history, credit history, bank history and employment history. For example, according to the Washington Post, in 1998 American Express sold their customers' credit card purchases to another company. Security issues: companies hold a lot of personal information online, and they do not guarantee to protect it. Misuse of information: service discrimination, e.g. some companies will answer your phone call based on your purchase history.

There are also technical challenges. Missing data: during the preprocessing phase of KDD, missing data may be replaced with estimates, resulting in invalid estimates. Irrelevant data: some attributes in the database might not be of interest to the data mining task being developed. Changing data: databases cannot be assumed to be static, yet most data mining algorithms do assume a static database; this requires that the algorithms be completely rerun any time the database changes. Application: determining the intended use for the information obtained from the data mining function is a challenge; how business executives can effectively use the output and modify the firm accordingly is sometimes considered the more difficult part.

THE KDD PROCESS: consists of the following five basic steps. 1. Selection: the data needed for the data mining process is obtained from many different and heterogeneous data sources.
2. Preprocessing: the data to be used by the process may have incorrect or missing values. Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools). 3. Transformation: data from different sources must be converted into a common format for processing. Some data may be encoded or transformed into more usable formats; data reduction may be used to reduce the number of possible data values being considered. 4. Data mining: based on the data mining task being performed, this step applies the algorithms to the transformed data to generate the desired results. 5. Interpretation/evaluation: how the data mining results are presented to the users is extremely important, because the usefulness of the results depends on it.

The concept of data mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty.

References:
http://en.wikipedia.org/wiki/Data_mining
http://www.statsoft.com/textbook/stdatmin.html
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
Ramez Elmasri and Shamkant B. Navathe: Fundamentals of Database Systems.