Data Resource Management
Fall 2004

Data Management
[Diagram: data management stages COLLECT, MANAGE, STORE, USE. Labels: Data Sources; Represent; Databases; Data Warehouses; Inhouse vs Outsourcing; Processing (Transaction versus Web); Data Mining; Target Marketing]

Data and their Sources
• Types of data
  – public
  – private
• Sources
  – internal
  – external
• Implications?

e-Commerce architecture
[Diagram: e-commerce environment in three layers. 1st Layer: Presentation Layer (client sends requests and receives information). 2nd Layer: Application Layer (Web server, e-commerce application such as an e-shop). 3rd Layer: Data Layer (database). Web server, application, and database form the server environment.]

Databases
• Database: a non-redundant collection of logically related records or files. It enables a common pool of data records to serve many processing applications.
• Database management software: a mechanism for storing and organizing data for sophisticated queries and manipulation of data.
• Relational database: the most popular kind; data is organized into tables (Microsoft SQL Server, Oracle).

Database Approach
• Data occupies the central position and is referenced as needed.
• Data sharing: data is not the property of one person; there is one representation for each piece of data, which avoids (minimizes) redundancy.
• User views: a user can have a view of the database that differs from the view used by others; the user is isolated from changes in the data and programs.
• Query language: an English-like language designed for end users to query the database.
• Database administrator: the specialist who manages the database.

Database System for Bank
[Diagram: users and client application programs (savings system, loan system, investment system) reach one shared database through the database management system. CUSTOMER DATA: cust name, SSN, address, savings account #, loan account #, investment account #. SAVINGS DATA: savings account #, account balance. LOAN DATA: loan account #, account balance. INVESTMENT DATA: investment account #, account balance.]
Source: Dorit Nevo

Database Development
• Objective: develop a database that accurately represents the real world (i.e. “model” the real world).
• Database: a model of an organization.
• Any results that the database gives you must be true in the real world.
• Any relevant results about the real world must be obtainable from the database.

Redundant Data
Consider the following table that stores data about auto parts and suppliers. This seemingly harmless table contains many potential problems.

Part#  Description  Supplier      Address             City     State
100    Coil         Dynar         45 Eastern Ave.     Denver   CO
101    Muffler      GlassCo       1638 S. Front       Seattle  WA
102    Wheel Cover  A1 Auto       7441 E. 4th Street  Detroit  MI
103    Battery      Dynar         45 Estern Ave.      Denver   CO
104    Radiator     United Parts  346 Taylor Drive    Austin   TX
105    Manifold     GlassCo       1638 S. Front       Seattle  WA
106    Converter    GlassCo       1638 S. Front       Seattle  WA

Suppose you want to add another part?

107    Tail Pipe    GlassCo       1638 S. Front       Seattle  WA

Disk space is wasted by duplicating data about the supplier. Every time a new part is entered for a particular supplier, all of the supplier data is repeated. Imagine the problems if several suppliers supply hundreds of auto parts each.

Modification Anomaly
What if GlassCo moves to Olympia? How many rows have to change in order to ensure that the new address is recorded?

Part#  Description  Supplier      Address             City     State
100    Coil         Dynar         45 Eastern Ave.     Denver   CO
101    Muffler      GlassCo       1638 S. Front       Seattle  WA
102    Wheel Cover  A1 Auto       7441 E. 4th Street  Detroit  MI
103    Battery      Dynar         45 Estern Ave.      Denver   CO
104    Radiator     United Parts  346 Taylor Drive    Austin   TX
105    Manifold     GlassCo       1638 S. Front       Seattle  WA
106    Converter    GlassCo       1638 S. Front       Seattle  WA
107    Tail Pipe    GlassCo       1638 S. Front       Seattle  WA

Again, imagine the issues surrounding modifications of hundreds of rows of data for one supplier. When changes are made, they must be made to all copies of the data. Think about the confusion that results from changing only a subset of the duplicate data.

Deletion Anomaly
Suppose you no longer carry part number 102 and decide to delete that row from the table.
Part#  Description  Supplier      Address             City     State
100    Coil         Dynar         45 Eastern Ave.     Denver   CO
101    Muffler      GlassCo       1638 S. Front       Seattle  WA
102    Wheel Cover  A1 Auto       7441 E. 4th Street  Detroit  MI
103    Battery      Dynar         45 Estern Ave.      Denver   CO
104    Radiator     United Parts  346 Taylor Drive    Austin   TX
105    Manifold     GlassCo       1638 S. Front       Seattle  WA
106    Converter    GlassCo       1638 S. Front       Seattle  WA
107    Tail Pipe    GlassCo       1638 S. Front       Seattle  WA

Now, looking at the remaining data below, what is the address of A1 Auto?

Part#  Description  Supplier      Address             City     State
100    Coil         Dynar         45 Eastern Ave.     Denver   CO
101    Muffler      GlassCo       1638 S. Front       Seattle  WA
103    Battery      Dynar         45 Estern Ave.      Denver   CO
104    Radiator     United Parts  346 Taylor Drive    Austin   TX
105    Manifold     GlassCo       1638 S. Front       Seattle  WA
106    Converter    GlassCo       1638 S. Front       Seattle  WA
107    Tail Pipe    GlassCo       1638 S. Front       Seattle  WA

A deletion anomaly means that we lose more information than we want. We lose facts about more than one subject with one deletion.

Insertion Anomaly
Next, you want to add a new supplier, CarParts, but you have not yet ordered parts from that supplier. What do you add?

Part#  Description  Supplier      Address             City     State
100    Coil         Dynar         45 Eastern Ave.     Denver   CO
101    Muffler      GlassCo       1638 S. Front       Seattle  WA
103    Battery      Dynar         45 Estern Ave.      Denver   CO
104    Radiator     United Parts  346 Taylor Drive    Austin   TX
105    Manifold     GlassCo       1638 S. Front       Seattle  WA
106    Converter    GlassCo       1638 S. Front       Seattle  WA
107    Tail Pipe    GlassCo       1638 S. Front       Seattle  WA
???    ????????     CarParts      101 Mariposa        Orlando  FL

This situation is called an insertion anomaly. Negatively stated, we cannot add a fact about one subject until we have additional data about another subject.
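
One common remedy for the three anomalies above is to store each supplier's facts once, in a table of their own, and let every part refer to its supplier. The sketch below is an illustration only: it assumes Python's built-in sqlite3 module and hypothetical table and column names (supplier, part, part_no) that do not appear in the slides, and it records a single address for Dynar.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# Hypothetical schema: supplier facts are stored once, parts point at them.
conn.execute("""CREATE TABLE supplier (
    name TEXT PRIMARY KEY, address TEXT, city TEXT, state TEXT)""")
conn.execute("""CREATE TABLE part (
    part_no INTEGER PRIMARY KEY, description TEXT,
    supplier TEXT REFERENCES supplier(name))""")

conn.executemany("INSERT INTO supplier VALUES (?, ?, ?, ?)", [
    ("Dynar", "45 Eastern Ave.", "Denver", "CO"),
    ("GlassCo", "1638 S. Front", "Seattle", "WA"),
    ("A1 Auto", "7441 E. 4th Street", "Detroit", "MI"),
    ("United Parts", "346 Taylor Drive", "Austin", "TX"),
])
conn.executemany("INSERT INTO part VALUES (?, ?, ?)", [
    (100, "Coil", "Dynar"), (101, "Muffler", "GlassCo"),
    (102, "Wheel Cover", "A1 Auto"), (103, "Battery", "Dynar"),
    (104, "Radiator", "United Parts"), (105, "Manifold", "GlassCo"),
    (106, "Converter", "GlassCo"), (107, "Tail Pipe", "GlassCo"),
])

# Modification: GlassCo moves to Olympia. Only the city is changed here,
# because the slides do not give a new street address.
rows_changed = conn.execute(
    "UPDATE supplier SET city = 'Olympia' WHERE name = 'GlassCo'").rowcount

# Deletion: dropping part 102 no longer erases what we know about A1 Auto.
conn.execute("DELETE FROM part WHERE part_no = 102")
a1_auto = conn.execute(
    "SELECT address, city, state FROM supplier WHERE name = 'A1 Auto'").fetchone()

# Insertion: CarParts can be recorded before any part is ordered from it.
conn.execute("INSERT INTO supplier VALUES ('CarParts', '101 Mariposa', 'Orlando', 'FL')")
carparts_parts = conn.execute(
    "SELECT COUNT(*) FROM part WHERE supplier = 'CarParts'").fetchone()[0]

print(rows_changed, a1_auto, carparts_parts)
```

With this split, the GlassCo move touches one row instead of five, and supplier data no longer depends on the existence of a part row.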
Data Management
[Diagram: data management stages COLLECT, MANAGE, STORE, USE. Labels: Data Sources; Represent (Entity Relationship Model, Relational Model); Databases; Data Warehouses; Inhouse vs Outsourcing; Transaction Processing; Data Mining; Target Marketing]

Database Design Representation: Entity-Relationship Model
• ENTITY: Person, place, thing, event about which data must be kept
• ATTRIBUTE: Description of a particular ENTITY
• KEY FIELD: Field used to retrieve, update, sort RECORD
Source: ©2002 Prentice Hall

KEY FIELD
• Field in each record
• Uniquely identifies THIS record
• For RETRIEVAL, UPDATING, SORTING
Source: ©2002 Prentice Hall

TYPES OF RELATIONSHIPS
[Diagram: ONE-TO-ONE (STUDENT, CLASS); ONE-TO-MANY (Mother to STUDENT A, STUDENT B, STUDENT C); MANY-TO-MANY (CLASS 1 and CLASS 2 to STUDENT A, STUDENT B, STUDENT C)]

Example: Consulting Company Database
You have been asked to create a database for a small consulting company. The company wants to keep track of which employees are assigned to which project and what dates they start and stop working on them. An employee can work on more than one project at a time (as many students know). You also need to keep track of which client sponsors which project(s). Each project usually requires a set of skills, so you need to know what skills an employee has and when he or she obtained them. Employees are encouraged to find clients and receive extra compensation for doing so.
Entity-Relationship Model
[Diagram: Employee finds Client (1:N); Employee has Skill (N:M); Employee assigned to Project (N:M, with attributes start-date and end-date); Client sponsors Project (1:N); Project requires Skill (N:M)]

Employee
Emp#  name             address          dob
741   Fred Smith       12 Peachtree Rd  1 Jan 1960
852   Sarah Thomas     807 Piedmont Rd  6 May 1964
963   Daniel McCarthy  4321 Cobb Dr     15 Oct 1971
357   Ellen Lewis      15 Peachtree Rd  14 Feb 1979

Client
Client-Id  name          phone     date-signed  hourlyrate  emp#
1150       Joe Johnson   555-7412  14 Dec 2001  75          852
1151       Stacey Smith  555-8523  7 Jan 2002   85          741
1152       Donald Davis  888-3699  26 Jan 2002  69          852
1153       Ed Edwards    777-9513  28 Feb 2002  55          963

Project
Project#  name               date-began   date-completed  client-id
9357      Virtual Courtyard  30 Jan 2001                  1152
9159      Metro              8 Jan 2001   1 Mar 2001      1151
9752      Pontiac            5 Mar 2001                   1153
9684      Looking Glass      30 Dec 2000  15 Feb 2001     1150

Skill
Skill-name                description
Relational database       Relational db design and implementation
Object-oriented database  Object-oriented db design and implementation
Data Mining               Implementing data mining systems
Electronic Commerce       Intranet development and ecommerce applications

Has-skill
Emp#  Skill-name
852   Electronic Commerce
852   Relational database
741   Relational database
852   Data Mining
963   Data Mining

Requires
Project#  Skill-name
9357      Electronic Commerce
9684      Relational database
9159      Relational database
9684      Data Mining
9357      Data Mining
9752      Data Mining
[date-required column: 7 Jan 2001, 30 Dec 2000, 15 Jan 2001, 10 Jan 2001, 15 Mar 2000; five values for six rows, row alignment not recoverable]

Assigned
Emp#  Project#  start-date
852   9357      30 Jan 2002
741   9159      8 Jan 2002
963   9752      5 Mar 2002
852   9684      30 Dec 2001
[end-date column: 1 Mar 2002, 15 Feb 2002; two values for four rows, row alignment not recoverable]

Typical Queries
• What date was the project called “Metro” completed?
• What is the name of the client who sponsors the project called “Pontiac”?
• What skills are required for the project called “Virtual Courtyard”?
Note: Minimal redundancy in database design

Relational Data Model
• One basic construct: the relation.
• Relations represent both entities and relationships.
• Data Manipulation Language: English-like.
• Dominant database structure.
  – DB2 by IBM
  – ACCESS by Microsoft
  – Oracle

Translate E-R Model into Relational Model
• Each entity is represented by an (entity) relation.
• An N:M relationship is represented by a separate (relationship) relation.
  – Key is the concatenation (joining together) of the entity keys.
  – Relationship attributes are non-keys.
• A 1:N relationship is represented by a foreign key, i.e. the key of the entity on the “1” side appears as a non-key in the relation for the entity on the “N” side.

Example: Student-Course
Design a database to keep track of what courses a student takes and the grade he or she receives.
Entities:
  Student: [SSN, name, address]
  Course: [Course-Id, description]
Relationships:
  Student takes Course: [grade], N:M

Student Relation
SSN          name         address
111-22-3333  M. Tomkins   17 Oak St.
444-71-2222  L. Richardo  22 Tilly Court
795-44-1111  H. McEnroe   33 Star St.

Course Relation
Course-Id  description
MBA 401    Mgt. Information Systems
CIS 481    Strategic Systems
CIS 721    Database Mgt. Systems

Takes Relation
SSN          Course-Id  grade
795-44-1111  CIS 721    A
444-71-2222  CIS 481    B
111-22-3333  MBA 401    B

What is Data Quality?
1. Data is accurate, e.g. customer’s name spelled correctly; address correct.
2. Data is stored according to data type, e.g. as character, integer.
3. Data has integrity: backup and recovery procedures.
4. Data is not redundant.
5. Data follows business rules, e.g. loan balance may never be negative.
6. Data corresponds to established domains, e.g. employee age 16-65.
7. Data is timely, e.g. monthly, weekly, daily, real-time.
8. Data satisfies needs of the business, e.g. marketing (customers, demographics), accounts payable (vendors, products).
9. Data is complete, e.g. all line items for an invoice captured.
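
The translation rules and the Student-Course example from the preceding slides can be sketched in code. This is a minimal illustration, assuming Python's built-in sqlite3 module; the lower-case table and column names (student, course, takes, ssn, course_id) are hypothetical renderings of the slide's relation names, and the rejected row uses a made-up SSN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# One relation per entity, keyed by the entity's key field.
conn.execute("CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT, address TEXT)")
conn.execute("CREATE TABLE course (course_id TEXT PRIMARY KEY, description TEXT)")

# The N:M relationship "takes" becomes its own relation. Its key is the
# concatenation of the two entity keys; grade is a non-key attribute.
conn.execute("""CREATE TABLE takes (
    ssn       TEXT REFERENCES student(ssn),
    course_id TEXT REFERENCES course(course_id),
    grade     TEXT,
    PRIMARY KEY (ssn, course_id))""")

conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [
    ("111-22-3333", "M. Tomkins", "17 Oak St."),
    ("444-71-2222", "L. Richardo", "22 Tilly Court"),
    ("795-44-1111", "H. McEnroe", "33 Star St."),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    ("MBA 401", "Mgt. Information Systems"),
    ("CIS 481", "Strategic Systems"),
    ("CIS 721", "Database Mgt. Systems"),
])
conn.executemany("INSERT INTO takes VALUES (?, ?, ?)", [
    ("795-44-1111", "CIS 721", "A"),
    ("444-71-2222", "CIS 481", "B"),
    ("111-22-3333", "MBA 401", "B"),
])

# Join the three relations to list who took what and with which grade.
transcript = conn.execute("""
    SELECT s.name, c.description, t.grade
    FROM takes t
    JOIN student s ON s.ssn = t.ssn
    JOIN course c ON c.course_id = t.course_id
    ORDER BY s.name""").fetchall()

# A grade for a student who does not exist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO takes VALUES ('000-00-0000', 'CIS 721', 'C')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

for row in transcript:
    print(row)
print("unknown student rejected:", rejected)
```

The composite primary key also means the same student-course pair cannot be stored twice, which is one way the design keeps data non-redundant.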
What Managers Should Know About Data Modeling
• Database operators represent ways in which data can be manipulated to assist in managerial decision-making.
  – Without some sense of the possibilities of queries and reports, managers will have a misconception of what they can expect from a database.
• Data modeling is a technique used expertly by professionals.
  – Nevertheless, general managers need to understand the general design issues involved in order to appreciate the effort involved and the value of excellent data modeling.

Privacy Issues
• Is there information in my files that should not be there?
• Is information being used for the purpose it was originally intended?
• Is information being shared appropriately (both inside and outside the firm)?
• Is information being combined in appropriate ways?
• Are decisions that require human judgment being made appropriately?
• Are appropriate procedures in place for preventing and correcting errors?
Source: Cash, J.I., McFarlan, F.W., McKenney, J.L., and Applegate, L.M., Corporate Information Systems Management: Text and Cases, Homewood, IL.

What should managers know about Database Management?
• Management of databases is an important issue.
  – Although once a technical issue, the management of databases has become increasingly important throughout all types of organizations.
• Organizations store and use large quantities of data.
  – Sheer volume of data alone means that proper management is essential.
• Data are a valuable resource that must be managed.
  – Value is assured by capturing, validating, and protecting the data.
• The wrong approach to managing data adds complexity in the management of organizations.
  – The management of data should be part of the solution, not part of the problem.
• You have a right to influence the management of data you need.
  – The management of databases is not an activity that should occur in isolation.
    Those who rely on the data captured and stored in an organization have a need and, in fact, an obligation to be involved in the decisions that affect their use of the data.

Appendix
• File organization
• Components of database management system
• SQL (Structured Query Language)

FILE ORGANIZATION
• BIT: Binary Digit (0,1; Y,N; On,Off)
• BYTE: Combination of BITS which represent a CHARACTER
• FIELD: Collection of BYTES which represent a DATUM or Fact
• RECORD: Collection of FIELDS which reflect a TRANSACTION
• FILE: A Collection of similar RECORDS
• DATABASE: An Organization’s Electronic Library of FILES organized to serve business applications
Source: ©2002 Prentice Hall

COMPONENTS OF DBMS
• DATA DEFINITION LANGUAGE
  – Defines data elements in database
• DATA MANIPULATION LANGUAGE
  – Manipulates data for applications
• DATA DICTIONARY
  – Formal definitions of all variables in database, controls variety of database contents, data elements
Source: ©2002 Prentice Hall

STRUCTURED QUERY LANGUAGE (SQL)
• DE FACTO STANDARD
• DATA MANIPULATION LANGUAGE FOR RELATIONAL DATABASES
Source: ©2002 Prentice Hall

ELEMENTS OF SQL
• SELECT: List of columns from tables desired
• FROM: Identifies tables from which columns will be selected
• WHERE: Includes conditions for selecting specific rows, conditions for joining multiple tables
Source: ©2002 Prentice Hall
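
To make the three SQL elements concrete, here is a minimal sketch that answers two of the typical queries posed for the consulting company database. It assumes Python's built-in sqlite3 module, hypothetical lower-case table and column names, and only the columns the two queries need. The rows are transcribed from the slides on the assumption that each table's columns line up in the order listed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down versions of the Client, Project, and Requires relations.
conn.execute("CREATE TABLE client (client_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE project (
    project_no INTEGER PRIMARY KEY, name TEXT, client_id INTEGER)""")
conn.execute("""CREATE TABLE requires (
    project_no INTEGER, skill_name TEXT, PRIMARY KEY (project_no, skill_name))""")

conn.executemany("INSERT INTO client VALUES (?, ?)", [
    (1150, "Joe Johnson"), (1151, "Stacey Smith"),
    (1152, "Donald Davis"), (1153, "Ed Edwards"),
])
conn.executemany("INSERT INTO project VALUES (?, ?, ?)", [
    (9357, "Virtual Courtyard", 1152), (9159, "Metro", 1151),
    (9752, "Pontiac", 1153), (9684, "Looking Glass", 1150),
])
conn.executemany("INSERT INTO requires VALUES (?, ?)", [
    (9357, "Electronic Commerce"), (9684, "Relational database"),
    (9159, "Relational database"), (9684, "Data Mining"),
    (9357, "Data Mining"), (9752, "Data Mining"),
])

# SELECT names the column wanted, FROM names the tables, and WHERE holds
# both the join condition and the row-selection condition.
pontiac_client = conn.execute("""
    SELECT client.name
    FROM client, project
    WHERE client.client_id = project.client_id
      AND project.name = 'Pontiac'""").fetchone()[0]

courtyard_skills = [row[0] for row in conn.execute("""
    SELECT requires.skill_name
    FROM requires, project
    WHERE requires.project_no = project.project_no
      AND project.name = 'Virtual Courtyard'
    ORDER BY requires.skill_name""")]

print(pontiac_client)
print(courtyard_skills)
```

The join condition is written in the WHERE clause to match the slide's description; the same queries could equally be written with explicit JOIN ... ON syntax.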