Resource Management: Data
Fall 2004
Data Management
[Diagram: data management overview. Stages: COLLECT, MANAGE, STORE, USE. Other labels: Data Sources; Represent; Databases; Data Warehouses; Inhouse vs Outsourcing; Processing (Transaction versus Web); Data Mining; Target Marketing.]
Data and their Sources
• Types of data
– public
– private
• Sources
– internal
– external
• Implications?
e-Commerce architecture
[Diagram: e-commerce environment. A client sends "Request Info" to, and gets "Receive Info" from, the server environment, which contains the web server, the e-commerce application (e-shop etc.), and the database. Layers labeled: 1st layer, presentation layer; 2nd layer, application layer; 3rd layer, data layer.]
Databases
• Database: a non-redundant collection of
logically related records or files. It enables a
common pool of data records to serve many
processing applications.
• Database management software: a mechanism
for storing and organizing data for sophisticated
queries and manipulation of data.
• Relational database: the most popular kind; data
is organized into tables (e.g. Microsoft SQL Server,
Oracle).
Database Approach
• Data occupies the central position; referenced as needed.
• Data sharing: data is not the property of one person; one
representation for each piece of data; avoids (minimizes)
redundancy.
• User Views: allows a user to have a view of the database
that is different from the view used by others; user isolated
from changes in the data/programs.
• Query Language: An English-like language designed for
end-users to query the database.
• Database Administrator: Specialist who manages the
database.
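The "user views" idea can be sketched with Python's standard-library sqlite3 module. This is only an illustration: the customer table, its columns, and the marketing_customer view are hypothetical names, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (          -- hypothetical shared table
        cust_id INTEGER PRIMARY KEY,
        name    TEXT,
        ssn     TEXT,
        city    TEXT
    );
    INSERT INTO customer VALUES (1, 'A. Example', '000-00-0000', 'Denver');

    -- A view for marketing users that leaves out the SSN column.
    CREATE VIEW marketing_customer AS
        SELECT cust_id, name, city FROM customer;
""")

# Column names visible through the view
cursor = conn.execute("SELECT * FROM marketing_customer")
view_columns = [d[0] for d in cursor.description]
print(view_columns)  # ['cust_id', 'name', 'city']
```

Marketing users query the view like a table, while the underlying customer table stays shared by every application.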
Database System for Bank
[Diagram: a user works through a client and application programs (SAVINGS SYSTEM, LOAN SYSTEM, INVESTMENT SYSTEM); these reach a single database through the Database Management System. The database holds the data groups listed below.]
CUSTOMER DATA: Cust name, SSN, Address, Savings Account #, Loan Account #, Investment Account #
SAVINGS DATA: Savings Account #, Account balance
LOAN DATA: Loan Account #, Account balance
INVESTMENT DATA: Investment Account #, Account balance
Source: Dorit Nevo
Database Development
• Objective: develop a database that accurately represents
the real world (i.e. “model” the real world).
• Database: a model of an organization.
• Any results that the database gives you must
be true in the real world.
• Any relevant results about the real world
must be obtainable from the database.
Redundant Data
Consider the following table that stores data about auto parts and suppliers. This
seemingly harmless table contains many potential problems.
Part#  Description  Supplier      Address             City     State
100    Coil         Dynar         45 Eastern Ave.     Denver   CO
101    Muffler      GlassCo       1638 S. Front       Seattle  WA
102    Wheel Cover  A1 Auto       7441 E. 4th Street  Detroit  MI
103    Battery      Dynar         45 Estern Ave.      Denver   CO
104    Radiator     United Parts  346 Taylor Drive    Austin   TX
105    Manifold     GlassCo       1638 S. Front       Seattle  WA
106    Converter    GlassCo       1638 S. Front       Seattle  WA
Suppose you want to add another part?
107    Tail Pipe    GlassCo       1638 S. Front       Seattle  WA
Disk space is wasted by duplicating data about the supplier. Every time a new part is entered for
a particular supplier, all of the supplier data is repeated. Imagine the problems if several
suppliers supply hundreds of auto parts each.
Modification Anomaly
What if GlassCo moves to Olympia? How many rows have to change in order to
ensure that the new address is recorded?
Part#  Description  Supplier      Address             City     State
100    Coil         Dynar         45 Eastern Ave.     Denver   CO
101    Muffler      GlassCo       1638 S. Front       Seattle  WA
102    Wheel Cover  A1 Auto       7441 E. 4th Street  Detroit  MI
103    Battery      Dynar         45 Estern Ave.      Denver   CO
104    Radiator     United Parts  346 Taylor Drive    Austin   TX
105    Manifold     GlassCo       1638 S. Front       Seattle  WA
106    Converter    GlassCo       1638 S. Front       Seattle  WA
107    Tail Pipe    GlassCo       1638 S. Front       Seattle  WA
Again, imagine the issues surrounding modifications of hundreds of rows of data for
one supplier. When changes are made, they must be made to all copies of the data.
Think about the confusion that results from changing only a subset of the duplicate
data.
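The row count can be checked with a small sketch that loads the parts table above into an in-memory SQLite database through Python's sqlite3 module. The table name part_flat and the column names are assumptions made for the example. Only the city is changed, since the slide gives no new street address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part_flat (part_no INTEGER, description TEXT, "
    "supplier TEXT, address TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO part_flat VALUES (?, ?, ?, ?, ?, ?)",
    [
        (100, "Coil", "Dynar", "45 Eastern Ave.", "Denver", "CO"),
        (101, "Muffler", "GlassCo", "1638 S. Front", "Seattle", "WA"),
        (102, "Wheel Cover", "A1 Auto", "7441 E. 4th Street", "Detroit", "MI"),
        (103, "Battery", "Dynar", "45 Estern Ave.", "Denver", "CO"),
        (104, "Radiator", "United Parts", "346 Taylor Drive", "Austin", "TX"),
        (105, "Manifold", "GlassCo", "1638 S. Front", "Seattle", "WA"),
        (106, "Converter", "GlassCo", "1638 S. Front", "Seattle", "WA"),
        (107, "Tail Pipe", "GlassCo", "1638 S. Front", "Seattle", "WA"),
    ],
)

# Partial update: only one of the GlassCo rows is changed.
conn.execute("UPDATE part_flat SET city = 'Olympia' WHERE part_no = 101")
conflicting_cities = conn.execute(
    "SELECT COUNT(DISTINCT city) FROM part_flat WHERE supplier = 'GlassCo'"
).fetchone()[0]

# Complete update: every GlassCo row has to be touched.
rows_changed = conn.execute(
    "UPDATE part_flat SET city = 'Olympia' WHERE supplier = 'GlassCo'"
).rowcount
print(conflicting_cities, rows_changed)  # 2 4
```

After the partial update the table reports two different cities for the same supplier; recording the move consistently means changing all four GlassCo rows.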
Deletion Anomaly
Suppose you no longer carried part number 102 and decided
to delete that row from the table?
Part#  Description  Supplier      Address             City     State
100    Coil         Dynar         45 Eastern Ave.     Denver   CO
101    Muffler      GlassCo       1638 S. Front       Seattle  WA
102    Wheel Cover  A1 Auto       7441 E. 4th Street  Detroit  MI
103    Battery      Dynar         45 Estern Ave.      Denver   CO
104    Radiator     United Parts  346 Taylor Drive    Austin   TX
105    Manifold     GlassCo       1638 S. Front       Seattle  WA
106    Converter    GlassCo       1638 S. Front       Seattle  WA
107    Tail Pipe    GlassCo       1638 S. Front       Seattle  WA
Now, looking at the remaining data below, what is the address
of A1 Auto?
Part#  Description  Supplier      Address           City     State
100    Coil         Dynar         45 Eastern Ave.   Denver   CO
101    Muffler      GlassCo       1638 S. Front     Seattle  WA
103    Battery      Dynar         45 Estern Ave.    Denver   CO
104    Radiator     United Parts  346 Taylor Drive  Austin   TX
105    Manifold     GlassCo       1638 S. Front     Seattle  WA
106    Converter    GlassCo       1638 S. Front     Seattle  WA
107    Tail Pipe    GlassCo       1638 S. Front     Seattle  WA
A deletion anomaly means that we lose more information than we want.
We lose facts about more than one subject with one deletion.
Insertion Anomaly
Next, you want to add a new supplier, CarParts, but you have
not yet ordered parts from that supplier. What do you add?
Part#  Description  Supplier      Address           City     State
100    Coil         Dynar         45 Eastern Ave.   Denver   CO
101    Muffler      GlassCo       1638 S. Front     Seattle  WA
103    Battery      Dynar         45 Estern Ave.    Denver   CO
104    Radiator     United Parts  346 Taylor Drive  Austin   TX
105    Manifold     GlassCo       1638 S. Front     Seattle  WA
106    Converter    GlassCo       1638 S. Front     Seattle  WA
107    Tail Pipe    GlassCo       1638 S. Front     Seattle  WA
???    ????????     CarParts      101 Mariposa      Orlando  FL
The situation is called an insertion anomaly. Negatively stated, we cannot
add a fact about one subject until we have additional data about another
subject.
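One common remedy, which the slide itself does not spell out, is to keep supplier facts and part facts in separate tables. The sketch below uses Python's sqlite3 module; the table and column names (supplier, part, part_no) are assumptions made for the example. It shows a supplier being inserted with no parts, and a part being deleted without losing its supplier's address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (
        name    TEXT PRIMARY KEY,
        address TEXT,
        city    TEXT,
        state   TEXT
    );
    CREATE TABLE part (
        part_no     INTEGER PRIMARY KEY,
        description TEXT,
        supplier    TEXT REFERENCES supplier(name)
    );
    INSERT INTO supplier VALUES ('A1 Auto', '7441 E. 4th Street', 'Detroit', 'MI');
    INSERT INTO part VALUES (102, 'Wheel Cover', 'A1 Auto');

    -- Insertion: a supplier can be recorded before any part is ordered.
    INSERT INTO supplier VALUES ('CarParts', '101 Mariposa', 'Orlando', 'FL');

    -- Deletion: removing part 102 leaves the A1 Auto address intact.
    DELETE FROM part WHERE part_no = 102;
""")

a1_address = conn.execute(
    "SELECT address, city, state FROM supplier WHERE name = 'A1 Auto'"
).fetchone()
supplier_names = [r[0] for r in conn.execute("SELECT name FROM supplier ORDER BY name")]
print(a1_address)      # ('7441 E. 4th Street', 'Detroit', 'MI')
print(supplier_names)  # ['A1 Auto', 'CarParts']
```

Each fact about a supplier now lives in exactly one row, so a change of address would also touch one row only.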
Data Management
[Diagram: the data management overview again, with stages COLLECT, MANAGE, STORE, USE and labels Data Sources; Represent; Databases; Data Warehouses; Inhouse vs Outsourcing; Transaction Processing; Data Mining; Target Marketing. Two labels are added: Entity Relationship Model and Relational Model.]
Database Design Representation
Entity-Relationship Model
• ENTITY: Person, place, thing, event about which data
must be kept
• ATTRIBUTE: Description of a particular ENTITY
• KEY FIELD: Field used to retrieve, update, sort
RECORD
Source: ©2002 Prentice Hall
KEY FIELD
• Field in each record
• Uniquely Identifies THIS Record
• For RETRIEVAL, UPDATING, SORTING
Source: ©2002 Prentice Hall
TYPES OF RELATIONSHIPS
[Diagram: examples of ONE-TO-ONE, ONE-TO-MANY, and MANY-TO-MANY relationships, drawn with the boxes STUDENT, CLASS, Mother, STUDENT A, STUDENT B, STUDENT C, CLASS 1, and CLASS 2.]
Example: Consulting Company Database
You have been asked to create a database for a small
consulting company. The company wants to keep track
of which employees are assigned to which project and
what dates they start and stop working on them. An
employee can work on more than one project at a time
(as many students know). You also need to keep track
of which client sponsors which project(s). Each project
usually requires a set of skills so you need to know
what skills an employee has and when he or she
obtained them. Employees are encouraged to find
clients and receive extra compensation for doing so.
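Entity-Relationship Model
[Diagram: entities Employee, Client, Skill, and Project, linked by the relationships finds (Employee 1 : N Client), has (Employee N : M Skill), assigned to (Employee N : M Project, with attributes start-date and end-date), sponsors (Client 1 : M Project), and requires (Project N : M Skill).]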
Employee
Emp#  name             address          dob
741   Fred Smith       12 Peachtree Rd  1 Jan 1960
852   Sarah Thomas     807 Piedmont Rd  6 May 1964
963   Daniel McCarthy  4321 Cobb Dr     15 Oct 1971
357   Ellen Lewis      15 Peachtree Rd  14 Feb 1979
Client
Client-Id  name          phone     date-signed  hourlyrate  emp#
1150       Joe Johnson   555-7412  14 Dec 2001  75          852
1151       Stacey Smith  555-8523  7 Jan 2002   85          741
1152       Donald Davis  888-3699  26 Jan 2002  69          852
1153       Ed Edwards    777-9513  28 Feb 2002  55          963
Project
Project#  name               date-began   date-completed  client-id
9357      Virtual Courtyard  30 Jan 2001                  1152
9159      Metro              8 Jan 2001   1 Mar 2001      1151
9752      Pontiac            5 Mar 2001                   1153
9684      Looking Glass      30 Dec 2000  15 Feb 2001     1150
Skill
Skill-name                description
Relational database       Relational db design and implementation
Object-oriented database  Object-oriented db design and implementation
Data Mining               Implementing data mining systems
Electronic Commerce       Intranet development and ecommerce applications
Has-skill
Emp#  Skill-name
852   Electronic Commerce
852   Relational database
741   Relational database
852   Data Mining
963   Data Mining
Requires
Project#  Skill-name
9357      Electronic Commerce
9684      Relational database
9159      Relational database
9684      Data Mining
9357      Data Mining
9752      Data Mining
date-required (five values survive; their row alignment is unclear): 7 Jan 2001, 30 Dec 2000, 15 Jan 2001, 10 Jan 2001, 15 Mar 2000
Assigned
Emp#  Project#  start-date   end-date
852   9357      30 Jan 2002
741   9159      8 Jan 2002   1 Mar 2002
963   9752      5 Mar 2002
852   9684      30 Dec 2001  15 Feb 2002
Typical Queries
• What date was the project called “Metro”
completed?
• What is the name of the client who sponsors
the project called “Pontiac?”
• What skills are required for the project called
“Virtual Courtyard”?
Note: Minimal redundancy in database design
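As a hedged sketch, the three queries above can be run against a small SQLite database through Python's sqlite3 module. The rows follow one reading of the flattened Project, Client, and Requires tables from the earlier slides. Column names use underscores in place of the slide's hyphens and "#", and only the columns the queries need are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client  (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE project (project_no INTEGER PRIMARY KEY, name TEXT,
                          date_completed TEXT,
                          client_id INTEGER REFERENCES client(client_id));
    CREATE TABLE requires (project_no INTEGER REFERENCES project(project_no),
                           skill_name TEXT);

    INSERT INTO client VALUES (1150, 'Joe Johnson'), (1151, 'Stacey Smith'),
                              (1152, 'Donald Davis'), (1153, 'Ed Edwards');
    INSERT INTO project VALUES (9357, 'Virtual Courtyard', NULL, 1152),
                               (9159, 'Metro', '1 Mar 2001', 1151),
                               (9752, 'Pontiac', NULL, 1153),
                               (9684, 'Looking Glass', '15 Feb 2001', 1150);
    INSERT INTO requires VALUES (9357, 'Electronic Commerce'),
                                (9684, 'Relational database'),
                                (9159, 'Relational database'),
                                (9684, 'Data Mining'),
                                (9357, 'Data Mining'),
                                (9752, 'Data Mining');
""")

# What date was the project called "Metro" completed?
metro_completed = conn.execute(
    "SELECT date_completed FROM project WHERE name = 'Metro'"
).fetchone()[0]

# What is the name of the client who sponsors the project called "Pontiac"?
pontiac_client = conn.execute(
    "SELECT c.name FROM client c JOIN project p ON p.client_id = c.client_id "
    "WHERE p.name = 'Pontiac'"
).fetchone()[0]

# What skills are required for the project called "Virtual Courtyard"?
courtyard_skills = sorted(
    r[0] for r in conn.execute(
        "SELECT r.skill_name FROM requires r "
        "JOIN project p ON p.project_no = r.project_no "
        "WHERE p.name = 'Virtual Courtyard'"
    )
)
print(metro_completed, pontiac_client, courtyard_skills)
```

The second and third queries need a join because the client name and the project name are each stored only once, which is the minimal redundancy the note refers to.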
Relational Data Model
• One basic construct: the relation.
• Relations represent both entities and
relationships.
• Data Manipulation Language: English-like.
• Dominant database structure.
– DB2 by IBM
– ACCESS by Microsoft
– Oracle
Translate E-R Model into Relational Model
• Each entity represented by an (entity) relation
• N:M relationship represented by a separate
(relationship) relation
– Key is concatenation (joining together) of
entity keys.
– Relationship attributes are non keys.
• 1:N relationship represented by foreign key, i.e.
key of entity on “1” side appears as non key in
relation for the entity on the “N” side.
Example: Student-Course
Design a database to keep track of what courses
a student takes and the grade he or she receives.
Entities:
Student: [SSN, name, address]
Course: [Course-Id, description]
Relationships:
Student takes Course (N : M): [grade]
Student Relation
SSN          name         address
111-22-3333  M. Tomkins   17 Oak St.
444-71-2222  L. Richardo  22 Tilly Court
795-44-1111  H. McEnroe   33 Star St.
Course Relation
Course-Id  description
MBA 401    Mgt. Information Systems
CIS 481    Strategic Systems
CIS 721    Database Mgt. Systems
Takes Relation
SSN          Course-Id  grade
795-44-1111  CIS 721    A
444-71-2222  CIS 481    B
111-22-3333  MBA 401    B
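The three relations can be sketched in SQLite through Python's sqlite3 module. Table and column names are lower-cased adaptations of the slide's names. Following the translation rules, the key of the takes relation is the concatenation of the two entity keys, and grade is a non-key attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (ssn TEXT PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE course  (course_id TEXT PRIMARY KEY, description TEXT);

    -- N:M relationship relation: key = student key + course key.
    CREATE TABLE takes (
        ssn       TEXT REFERENCES student(ssn),
        course_id TEXT REFERENCES course(course_id),
        grade     TEXT,
        PRIMARY KEY (ssn, course_id)
    );

    INSERT INTO student VALUES ('111-22-3333', 'M. Tomkins', '17 Oak St.'),
                               ('444-71-2222', 'L. Richardo', '22 Tilly Court'),
                               ('795-44-1111', 'H. McEnroe', '33 Star St.');
    INSERT INTO course VALUES ('MBA 401', 'Mgt. Information Systems'),
                              ('CIS 481', 'Strategic Systems'),
                              ('CIS 721', 'Database Mgt. Systems');
    INSERT INTO takes VALUES ('795-44-1111', 'CIS 721', 'A'),
                             ('444-71-2222', 'CIS 481', 'B'),
                             ('111-22-3333', 'MBA 401', 'B');
""")

# Join the three relations to list each student's course and grade.
transcript = conn.execute("""
    SELECT s.name, c.description, t.grade
    FROM student s
    JOIN takes t  ON t.ssn = s.ssn
    JOIN course c ON c.course_id = t.course_id
    ORDER BY s.name
""").fetchall()
for row in transcript:
    print(row)
```

Because the key is the pair (ssn, course_id), a student can take many courses and a course can have many students, but the same pairing cannot be stored twice.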
What is Data Quality?
1. Data is accurate, e.g. customer’s name spelled
correctly; address correct.
2. Data is stored according to data type, e.g. as
character, integer.
3. Data has integrity: backup and recovery
procedures.
4. Data is not redundant.
5. Data follows business rules, e.g. loan balance
may never be negative.
What is Data Quality (cont’d)
6. Data corresponds to established domains,
e.g. employee age 16-65.
7. Data is timely, e.g. monthly, weekly, daily,
real-time.
8. Data satisfies needs of the business, e.g.
marketing (customers, demographics), accounts
payable (vendors, products).
9. Data is complete, e.g. all line items for an
invoice captured.
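Some of these rules can be enforced by the database itself. The sketch below uses Python's sqlite3 module; the employee and loan tables and the helper try_insert are hypothetical. It declares column types, an age domain of 16 to 65, and a non-negative loan balance as CHECK constraints, then shows that rows violating them are rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        emp_no INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        age    INTEGER CHECK (age BETWEEN 16 AND 65)  -- established domain
    );
    CREATE TABLE loan (
        loan_no INTEGER PRIMARY KEY,
        balance REAL NOT NULL CHECK (balance >= 0)    -- business rule
    );
""")

def try_insert(sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

ok_employee = try_insert("INSERT INTO employee VALUES (?, ?, ?)", (1, "A. Example", 30))
bad_employee = try_insert("INSERT INTO employee VALUES (?, ?, ?)", (2, "B. Example", 70))
bad_loan = try_insert("INSERT INTO loan VALUES (?, ?)", (1, -50.0))
print(ok_employee, bad_employee, bad_loan)  # True False False
```

Constraints cover only part of the list: accuracy, timeliness, and completeness still depend on how the data is captured.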
What Managers Should Know About
Data Modeling
• Database operators represent ways in which data can be
manipulated to assist in managerial decision-making
– Without some sense of the possibilities of queries and reports,
managers will have a misconception of what they can expect
from a database.
• Data modeling is a technique used expertly by professionals
– Nevertheless, general managers need to understand the general
design issues involved in order to appreciate the effort
involved and value of excellent data modeling.
Privacy Issues
• Is there information in my files that should not be there?
• Is information being used for the purpose it was originally
intended?
• Is information being shared appropriately (both inside and
outside the firm?)
• Is information being combined in appropriate ways?
• Are decisions that require human judgment being made
appropriately?
• Are appropriate procedures in place for preventing and
correcting errors?
Source: Cash, J.I., McFarlan, F.W., McKenney, J.L., and Applegate, L.M.,
Corporate Information Systems Management: Text and Cases, Homewood, IL.
What should managers know
about Database Management?
• Management of databases is an important issue
– Although once a technical issue, the management of databases
has become increasingly important throughout all types of
organizations.
• Organizations store and use large quantities of data
– Sheer volume of data alone means that proper management is
essential.
• Data are a valuable resource that must be managed
– Value is assured by capturing, validating, and protecting the
data.
What should managers know
about Database Management?
• The wrong approach to managing data adds complexity in the
management of organizations.
– The management of data should be part of the solution, not
part of the problem.
• You have a right to influence the management of data you need.
– The management of databases is not an activity that should
occur in isolation. Those who rely on the data captured and
stored in an organization have a need and, in fact, an
obligation to be involved in the decisions that affect their use
of the data.
Appendix
• File organization
• Components of database management
system
• SQL (Structured Query Language)
FILE ORGANIZATION
• BIT: Binary Digit (0,1; Y,N; On,Off)
• BYTE: Combination of BITS which
represent a CHARACTER
• FIELD: Collection of BYTES which
represent a DATUM or Fact
• RECORD: Collection of FIELDS which
reflect a TRANSACTION
Source: ©2002 Prentice Hall
FILE ORGANIZATION
• FILE: A Collection of similar RECORDS
• DATABASE: An Organization’s Electronic
Library of FILES organized to serve business
applications
Source: ©2002 Prentice Hall
COMPONENTS OF DBMS
• DATA DEFINITION LANGUAGE
– Defines data elements in database
• DATA MANIPULATION LANGUAGE
– Manipulates data for applications
• DATA DICTIONARY
– Formal definitions of all variables in database,
controls variety of database contents, data
elements
Source: ©2002 Prentice Hall
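The three components can be illustrated with SQLite through Python's sqlite3 module. This is a sketch under assumptions: the part table is hypothetical, and SQLite's sqlite_master catalog stands in for a data dictionary, although it is much narrower than a full one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition language (DDL): defines the data elements of a table.
conn.execute("CREATE TABLE part (part_no INTEGER PRIMARY KEY, description TEXT)")

# Data manipulation language (DML): inserts and reads rows.
conn.execute("INSERT INTO part VALUES (100, 'Coil')")
descriptions = [r[0] for r in conn.execute("SELECT description FROM part")]

# Data dictionary: SQLite records each table definition in sqlite_master.
catalog = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(descriptions)   # ['Coil']
print(catalog[0][0])  # part
```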
STRUCTURED QUERY LANGUAGE
(SQL)
• DE FACTO STANDARD
• DATA MANIPULATION LANGUAGE
FOR RELATIONAL DATABASES
Source: ©2002 Prentice Hall
ELEMENTS OF SQL
• SELECT: List of columns from tables
desired
• FROM: Identifies tables from which
columns will be selected
• WHERE: Includes conditions for selecting
specific rows, conditions for joining
multiple tables
Source: ©2002 Prentice Hall
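A minimal sketch of the three clauses working together, run through Python's sqlite3 module. The supplier and part tables are hypothetical, loosely modeled on the auto parts example. The ORDER BY clause is an addition beyond the three elements listed above, included only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (name TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE part (part_no INTEGER PRIMARY KEY, description TEXT, supplier TEXT);
    INSERT INTO supplier VALUES ('Dynar', 'Denver'), ('GlassCo', 'Seattle');
    INSERT INTO part VALUES (100, 'Coil', 'Dynar'),
                            (101, 'Muffler', 'GlassCo'),
                            (105, 'Manifold', 'GlassCo');
""")

query = """
    SELECT part.description, supplier.city   -- columns desired
    FROM   part, supplier                    -- tables they come from
    WHERE  part.supplier = supplier.name     -- condition joining the tables
      AND  supplier.city = 'Seattle'         -- condition selecting rows
    ORDER BY part.part_no
"""
seattle_parts = conn.execute(query).fetchall()
print(seattle_parts)  # [('Muffler', 'Seattle'), ('Manifold', 'Seattle')]
```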