Database Systems
Data Mining

Slides based on the slides created by Prof. Mitch Cherniack
Brandeis University
http://www.cs.brandeis.edu/~cs127b/

What is a Database System?
• Database: a very large collection of related data
• Models a real-world enterprise:
  - Entities (e.g., teams, games / students, courses)
  - Relationships (e.g., the Patriots are playing in the Super Bowl!)
  - Even active components (e.g., business logic)
• DBMS: a software package/system that can be used to store, manage, and retrieve data from databases
• Database System: DBMS + data (+ applications)

Why Study Databases?
• Shift from computation to information
  - Always true for corporate computing
  - More and more true in the scientific world
  - And, of course, the Web
• DBMS encompasses much of CS in a practical discipline
  - OS, languages, theory, AI, logic

Why Databases?
• Why not store everything in flat files: use the file system of the OS, cheap/simple...

    Name, Course, Grade
    John Smith, CS112, B
    Mike Stonebraker, CS234, A
    Jim Gray, CS560, A
    John Smith, CS560, B+
    ...

• Yes, but not scalable...

Problem 1
• Data redundancy and inconsistency
    * Multiple file formats, duplication of information in different files

    Name, Course, Email, Grade
    John Smith, [email protected], CS112, B
    Mike Stonebraker, [email protected], CS234, A
    Jim Gray, CS560, [email protected], A
    John Smith, CS560, [email protected], B+

Why is this a problem?
    * Wasted space
    * Potential inconsistencies (multiple formats, John Smith vs. Smith J.)

Problem 2
• Data retrieval:
  - Find the students who took CS560
  - Find the students with GPA > 3.5
  For every query we need to write a program!
• We need the retrieval to be:
  - Easy to write
  - Efficient to execute
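
To make "for every query we need to write a program" concrete, here is a minimal Python sketch of what answering "find the students who took CS560" looks like against the flat file from the earlier slide. The function name and the in-memory file are assumptions made for illustration; they are not part of the slides.

```python
import csv
import io

# The "Name, Course, Grade" flat file from the slide, held in memory here
# so the sketch is self-contained (a real program would open a file on disk).
FLAT_FILE = """\
John Smith, CS112, B
Mike Stonebraker, CS234, A
Jim Gray, CS560, A
John Smith, CS560, B+
"""


def students_in_course(flat_file_text, course):
    """Scan every record and keep the names of students enrolled in `course`."""
    names = []
    reader = csv.reader(io.StringIO(flat_file_text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        name, row_course, _grade = (field.strip() for field in row)
        if row_course == course:
            names.append(name)
    return names


print(students_in_course(FLAT_FILE, "CS560"))
```

A different question, such as the GPA query, would need another hand-written function, and the code silently depends on the column order, which differs between the rows of the file shown under Problem 1.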

Problem 3
• Data integrity
  - No support for sharing:
      * Prevent simultaneous modifications
  - No coping mechanisms for system crashes
  - No means of preventing data entry errors (checks must be hard-coded in the programs)
  - Security problems
• Database systems offer solutions to all the above problems

Data Organization
• Two levels of data modeling
• Conceptual or logical level: describes the data stored in the database, and the relationships among the data.

    type customer = record
        name : string;
        street : string;
        city : integer;
    end;

• Physical level: describes how a record (e.g., customer) is stored.
• Also, view level: application programs hide details of data types. Views can also hide information (e.g., salary) for security purposes.

View of Data
[Figure: a logical architecture for a database system]

Database Schema
• Similar to types and variables in programming languages
• Schema: the structure of the database
  - E.g., the database consists of information about a set of customers and accounts and the relationship between them
  - Analogous to type information of a variable in a program
  - Physical schema: database design at the physical level
  - Logical schema: database design at the logical level

Data Organization
• Data models: a framework for describing
  - data
  - data relationships
  - data semantics
  - data constraints
• Entity-Relationship model
• We will concentrate on the relational model
• Other models:
  - object-oriented model
  - semi-structured data models, XML

Entity-Relationship Model
[Figure: example of a schema in the entity-relationship model]

Entity-Relationship Model (Cont.)
• E-R model of the real world
  - Entities (objects)
      * E.g., customers, accounts, bank branch
  - Relationships between entities
      * E.g., account A-101 is held by customer Johnson
      * Relationship set depositor associates customers with accounts
• Widely used for database design
  - A database design in the E-R model is usually converted to a design in the relational model (coming up next), which is used for storage and processing

Relational Model
• Example of tabular data in the relational model (the column headings are the attributes)

    customer-id   customer-name  customer-street  customer-city  account-number
    192-83-7465   Johnson        Alma             Palo Alto      A-101
    019-28-3746   Smith          North            Rye            A-215
    192-83-7465   Johnson        Alma             Palo Alto      A-201
    321-12-3123   Jones          Main             Harrison       A-217
    019-28-3746   Smith          North            Rye            A-201

Data Organization
• Data storage: where can data be stored?
    * Main memory
    * Secondary memory (hard disks)
    * Optical storage (DVDs)
    * Tertiary store (tapes)
• Move data? Determined by the buffer manager
• Mapping data to files? Determined by the file manager

Database Architecture (data organization)
[Figure labels: DBA, DDL Commands, DDL Interpreter, Storage Manager (File Manager, Buffer Manager), Secondary Storage (Data, Metadata, Schema)]

Data retrieval
• Queries
  Query = declarative data retrieval: describes what data, not how to retrieve it
  Ex. "Give me the students with GPA > 3.5"
      vs.
      "Scan the student file and retrieve the records with gpa > 3.5"
• Why?
  1. Easier to write
  2. Efficient to execute (why?)

Data retrieval
[Figure labels: Query, Query Processor (Query Optimizer, Plan, Query Evaluator), Data]
• Query Optimizer: compiler for queries (aka DML compiler)
  Plan ~ assembly language program
  The optimizer does better with declarative queries:
  1. Algorithmic query (e.g., in C) ⇒ 1 plan to choose from
  2. Declarative query (e.g., in SQL) ⇒ n plans to choose from
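
The contrast between the two phrasings of the GPA query can be sketched in Python. The student rows, the table layout, and both function names below are hypothetical; SQLite (through Python's standard `sqlite3` module) stands in for a DBMS only because it is readily available.

```python
import sqlite3

# Hypothetical student records: (sid, name, gpa).
STUDENTS = [(101, "Ann", 3.9), (102, "Bob", 3.2), (103, "Cho", 3.7)]


def scan_for_high_gpa(students, threshold):
    """Procedural: spell out HOW to find the records (one full scan)."""
    return [name for _sid, name, gpa in students if gpa > threshold]


def query_for_high_gpa(students, threshold):
    """Declarative: state WHAT is wanted and let the engine choose a plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, gpa REAL)"
    )
    conn.executemany("INSERT INTO student VALUES (?, ?, ?)", students)
    cur = conn.execute(
        "SELECT name FROM student WHERE gpa > ? ORDER BY sid", (threshold,)
    )
    result = [row[0] for row in cur]
    conn.close()
    return result


print(scan_for_high_gpa(STUDENTS, 3.5))
print(query_for_high_gpa(STUDENTS, 3.5))
```

Both calls return the same names. The difference is that the scan is fixed in the program text, while the SQL text leaves the engine free to use an index on `gpa` if one is later created, without the query changing.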

SQL
• SQL: widely used (declarative) non-procedural language
  - E.g., find the name of the customer with customer-id 192-83-7465

        select customer.customer-name
        from customer
        where customer.customer-id = '192-83-7465'

  - E.g., find the balances of all accounts held by the customer with customer-id 192-83-7465

        select account.balance
        from depositor, account
        where depositor.customer-id = '192-83-7465' and
              depositor.account-number = account.account-number

• Procedural languages: C++, Java, relational algebra

Data retrieval: Indexing
• How do we quickly answer the query: find the student with SID = 101?
• One approach is to scan the student table, check every student, and return the one with id = 101... very slow for large databases
• Any better idea?
  1st: Keep the student records sorted on SID and do a binary search... but what about updates?
  2nd: Use a dynamic search tree! It allows insertions, deletions, and updates while keeping the records sorted. In databases we use the B+-tree (a multiway search tree).
  3rd: Use a hash table. Much faster for exact-match queries... but it cannot support range queries. (Also, special hashing schemes are needed for dynamic data.)
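
The three lookup strategies from the indexing slide can be compared in a few lines of Python. This is a minimal sketch over hypothetical in-memory records; a sorted list with `bisect` stands in for the sorted file, and a `dict` stands in for the hash index.

```python
import bisect

# Hypothetical student records (sid, name), kept sorted on SID.
RECORDS = [(sid, f"student-{sid}") for sid in range(1, 1001)]
SIDS = [sid for sid, _name in RECORDS]
HASH_INDEX = {sid: name for sid, name in RECORDS}


def find_by_scan(sid):
    """Check every record in turn: time grows linearly with the table size."""
    for record_sid, name in RECORDS:
        if record_sid == sid:
            return name
    return None


def find_by_binary_search(sid):
    """Binary search over the sorted SIDs: logarithmic number of probes."""
    pos = bisect.bisect_left(SIDS, sid)
    if pos < len(SIDS) and SIDS[pos] == sid:
        return RECORDS[pos][1]
    return None


def find_by_hash(sid):
    """Hash lookup: constant expected time, exact match only."""
    return HASH_INDEX.get(sid)


def range_query(low, high):
    """Range lookups rely on sorted order, which a hash table does not keep."""
    start = bisect.bisect_left(SIDS, low)
    end = bisect.bisect_right(SIDS, high)
    return [name for _sid, name in RECORDS[start:end]]


print(find_by_scan(101), find_by_binary_search(101), find_by_hash(101))
print(range_query(100, 102))
```

Inserting into the sorted list costs linear time because later entries must shift, which is the "updates" concern raised in the first idea and the motivation for a dynamic structure such as the B+-tree.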

B+Tree Example (B = 4)

    Root:            [120]
    Internal nodes:  [30 | 100]   [150 | 180]
    Leaves:          [3 5 11] [30 35] [100 101 110] [120 130] [150 156 179] [180 200]

Database Architecture (data retrieval)
[Figure labels: DB Programmer, User, DBA, Code w/ embedded queries, Query, DDL Commands, DML Precompiler, Query Processor (Query Optimizer, Query Evaluator), DDL Interpreter, Storage Manager (File Manager, Buffer Manager), Secondary Storage (Indices, Data, Statistics, Metadata, Schema)]
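
A lookup in the B+-tree example can be sketched as follows. The node layout is an assumption reconstructed from the keys on the slide (root 120, internal nodes 30/100 and 150/180, six leaves), and the convention that keys equal to a separator go to the right child is also an assumption. Only search is shown; insertion and node splitting are omitted.

```python
import bisect

# Leaves are plain sorted lists of keys; internal nodes are (keys, children).
LEAVES = [
    [3, 5, 11], [30, 35], [100, 101, 110],
    [120, 130], [150, 156, 179], [180, 200],
]
LEFT = ([30, 100], LEAVES[0:3])
RIGHT = ([150, 180], LEAVES[3:6])
ROOT = ([120], [LEFT, RIGHT])


def search(node, key):
    """Descend from the root to one leaf; return (found, nodes_visited)."""
    visited = 1
    while isinstance(node, tuple):  # still at an internal node
        keys, children = node
        # Keys >= a separator live in the subtree to its right.
        node = children[bisect.bisect_right(keys, key)]
        visited += 1
    return key in node, visited


print(search(ROOT, 101))
print(search(ROOT, 99))
```

Every lookup touches one node per level, three in this tree, regardless of which key is requested; that bounded path length is what makes the structure attractive when each node is a disk page.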

Data Integrity: Transaction processing
• Why must concurrent access to data be managed?
  John and Jane withdraw $50 and $100 from a common account...

  John:
  1. get balance
  2. if balance > $50
  3. balance = balance - $50
  4. update balance

  Jane:
  1. get balance
  2. if balance > $100
  3. balance = balance - $100
  4. update balance

  Initial balance $300. Final balance = ?
  It depends...

Data Integrity: Recovery
Transfer $50 from account A ($100) to account B ($200)
  1. Get balance for A
  2. If balanceA > $50
  3. balanceA = balanceA - 50
  4. Update balanceA in database
  5. Get balance for B
     System crashes...
  6. balanceB = balanceB + 50
  7. Update balanceB in database
Recovery management
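
Why the answer to the John and Jane withdrawal question is "it depends" can be shown by replaying different interleavings of their steps. The sketch below is a deterministic simulation, not real concurrency: the `run` function and the three schedules are hypothetical names, and each person's four steps are collapsed into a "read" (step 1) and a "write" (steps 2 to 4).

```python
AMOUNTS = {"John": 50, "Jane": 100}


def run(schedule, initial=300):
    """Replay a list of (person, step) pairs against one shared balance."""
    shared = initial  # the balance stored in the database
    local = {}        # each person's private copy, taken at "read" time
    for person, step in schedule:
        if step == "read":
            local[person] = shared
        elif step == "write" and local[person] > AMOUNTS[person]:
            # The write is based on the possibly stale private copy.
            shared = local[person] - AMOUNTS[person]
    return shared


SERIAL = [("John", "read"), ("John", "write"),
          ("Jane", "read"), ("Jane", "write")]
JOHN_WRITES_LAST = [("John", "read"), ("Jane", "read"),
                    ("Jane", "write"), ("John", "write")]
JANE_WRITES_LAST = [("John", "read"), ("Jane", "read"),
                    ("John", "write"), ("Jane", "write")]

for name, schedule in [("serial", SERIAL),
                       ("John writes last", JOHN_WRITES_LAST),
                       ("Jane writes last", JANE_WRITES_LAST)]:
    print(name, run(schedule))
```

Only the serial schedule ends at $150, the balance after both withdrawals. In the other two schedules one update overwrites the other (a lost update), leaving $250 or $200.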

Database Architecture
[Figure labels: DB Programmer, User, DBA, Code w/ embedded queries, Query, DDL Commands, DML Precompiler, Query Processor (Query Optimizer, Query Evaluator), DDL Interpreter, Storage Manager (File Manager, Transaction Manager, Recovery Manager, Buffer Manager), Secondary Storage (Indices, Data, Metadata, Integrity Constraints, Statistics, Schema)]

What is Data Mining?
• Data Mining is:
  (1) The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
  (2) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

Overview of terms
• Data: a set of facts (items) $D$, usually stored in a database
• Pattern: an expression $E$ in a language $L$ that describes a subset of facts
• Attribute: a field in an item $i$ in $D$
• Interestingness: a function $I_{D,L}$ that maps an expression $E$ in $L$ into a measure space $M$

Overview of terms
• The Data Mining Task:
  For a given dataset $D$, language of facts $L$, interestingness function $I_{D,L}$, and threshold $c$, find the expression $E$ such that $I_{D,L}(E) > c$, efficiently.

Knowledge Discovery

Examples of Large Datasets
• Government: IRS, NGA, ...
• Large corporations
  - WALMART: 20M transactions per day
  - MOBIL: 100 TB geological databases
  - AT&T: 300M calls per day
  - Credit card companies
• Scientific
  - NASA, EOS project: 50 GB per hour
  - Environmental datasets

Examples of Data Mining Applications
1. Fraud detection: credit cards, phone cards
2. Marketing: customer targeting
3. Data Warehousing: Walmart
4. Astronomy
5. Molecular biology

How Data Mining is Used
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results

The Data Mining Process
1. Understand the domain
2. Create a dataset:
   - Select the interesting attributes
   - Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to 2

Data Mining Tasks
1. Classification: learning a function that maps an item into one of a set of predefined classes
2. Regression: learning a function that maps an item to a real value
3. Clustering: identify a set of groups of similar items

Data Mining Tasks (Cont.)
4. Dependencies and associations: identify significant dependencies between data attributes
5. Summarization: find a compact description of the dataset or a subset of the dataset

Data Mining Methods
1. Decision Tree Classifiers: used for modeling, classification
2. Association Rules: used to find associations between sets of attributes
3. Sequential patterns: used to find temporal associations in time series
4. Hierarchical clustering: used to group customers, web users, etc.

Why Data Preprocessing?
• Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - Data warehouse needs consistent integration of quality data
  - Required for both OLAP and Data Mining!

Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of transactions, so they were not recorded!
• Data not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data

Data Cleaning
• Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Classification: Definition
• Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
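
The train/test procedure in the definition above can be sketched in Python. Everything here is an illustrative assumption: the 70/30 split, the fixed seed, the toy records, and in particular the majority-class "model", which is a deliberately trivial stand-in for a real classifier.

```python
import random
from collections import Counter


def split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into (training, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_majority_classifier(training):
    """Stand-in model: always predict the most common training class."""
    majority = Counter(label for _attrs, label in training).most_common(1)[0][0]
    return lambda attrs: majority


def accuracy(model, test):
    """Fraction of test records whose predicted class matches the true one."""
    return sum(model(attrs) == label for attrs, label in test) / len(test)


# Hypothetical records: (attributes, class).
RECORDS = [((i,), "No") for i in range(7)] + [((i,), "Yes") for i in range(7, 10)]

training_set, test_set = split(RECORDS)
model = train_majority_classifier(training_set)
print(len(training_set), len(test_set), accuracy(model, test_set))
```

The point of the sketch is the separation of roles: the model never sees the test records while it is being built, so the accuracy measured on them estimates how it will behave on previously unseen records.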

Classification Example

Training Set (Home Owner and Marital Status are categorical, Taxable Income is continuous, Default is the class):

    Tid  Home Owner  Marital Status  Taxable Income  Default
    1    Yes         Single          125K            No
    2    No          Married         100K            No
    3    No          Single          70K             No
    4    Yes         Married         120K            No
    5    No          Divorced        95K             Yes
    6    No          Married         60K             No
    7    Yes         Divorced        220K            No
    8    No          Single          85K             Yes
    9    No          Married         75K             No
    10   No          Single          90K             Yes

Test Set:

    Home Owner  Marital Status  Taxable Income  Default
    No          Single          75K             ?
    Yes         Married         50K             ?
    No          Married         150K            ?
    Yes         Divorced        90K             ?
    No          Single          40K             ?
    No          Married         80K             ?

Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set.

Example of a Decision Tree

Training Data: the ten-record table shown above.

Model: Decision Tree (splitting attributes: HO, MarSt, TaxInc)

    HO
    ├─ Yes: NO
    └─ No:  MarSt
            ├─ Married: NO
            └─ Single, Divorced: TaxInc
                                 ├─ < 80K: NO
                                 └─ > 80K: YES
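
The decision tree above translates directly into nested conditions. The sketch below encodes it as a Python function, checks it against the ten training records, and applies it to the six test records. The function name is hypothetical, and since the slide only labels the income branches "< 80K" and "> 80K", sending exactly 80K to the NO branch is an assumption.

```python
def predict_default(home_owner, marital_status, taxable_income_k):
    """Walk the tree: HO first, then MarSt, then TaxInc (in thousands)."""
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # Assumption: exactly 80K falls on the NO side.
    return "Yes" if taxable_income_k > 80 else "No"


# (Home Owner, Marital Status, Taxable Income in K, Default) from the slide.
TRAINING = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

training_accuracy = sum(
    predict_default(*record[:3]) == record[3] for record in TRAINING
) / len(TRAINING)
test_predictions = [predict_default(*record) for record in TEST]

print(training_accuracy)
print(test_predictions)
```

The tree classifies all ten training records correctly. Fitting the training data perfectly says nothing by itself about accuracy on unseen records, which is why the test set is kept separate.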

Another Example of Decision Tree

Training Data: the same ten-record table (Tid, Home Owner, Marital Status, Taxable Income, Default) as in the classification example.

    MarSt
    ├─ Married: NO
    └─ Single, Divorced: HO
                         ├─ Yes: NO
                         └─ No:  TaxInc
                                 ├─ < 80K: NO
                                 └─ > 80K: YES

There could be more than one tree that fits the same data!

Clustering Definition
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
• Similarity Measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.
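
The clustering criterion above can be checked numerically with Euclidean distance. In this minimal sketch the two clusters of 3-D points are made-up data and the helper names are hypothetical; it only measures a given grouping and does not implement a clustering algorithm.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_intra_distance(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(cluster_a, cluster_b):
    """Average distance over all pairs with one point from each cluster."""
    total = sum(euclidean(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical points in 3-D space, already grouped into two clusters.
CLUSTER_A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
CLUSTER_B = [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)]

print(mean_intra_distance(CLUSTER_A))
print(mean_inter_distance(CLUSTER_A, CLUSTER_B))
```

A good grouping shows a small intra-cluster average and a large inter-cluster average, as this toy data does.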

Illustrating Clustering
• Euclidean distance based clustering in 3-D space.
  - Intracluster distances are minimized
  - Intercluster distances are maximized

Clustering: Application 1
• Market Segmentation:
  - Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
      * Collect different attributes of customers based on their geographical and lifestyle related information.
      * Find clusters of similar customers.
      * Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2
• Document Clustering:
  - Goal: to find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: to identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  - Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection:
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk

  Rules Discovered:
    {Milk} --> {Coke}
    {Diaper, Milk} --> {Beer}
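
How strongly the five transactions back the two discovered rules can be quantified. The slides do not name a measure, so the sketch below assumes the two standard ones: support (the fraction of transactions containing every item of the rule) and confidence (among transactions containing the antecedent, the fraction that also contain the consequent). The function names are hypothetical.

```python
# The five transactions from the slide, as sets of items.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= transaction for transaction in TRANSACTIONS)


def support(antecedent, consequent):
    """Fraction of all transactions containing the whole rule."""
    return count(antecedent | consequent) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the share with the consequent."""
    return count(antecedent | consequent) / count(antecedent)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs, rhs),
          "confidence:", confidence(lhs, rhs))
```

On this data {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.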

Association Rule Discovery: Application 1
• Marketing and Sales Promotion:
  - Let the rule discovered be
        {Bagels, ...} --> {Potato Chips}
  - Potato Chips as consequent => can be used to determine what should be done to boost its sales.
  - Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
  - Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!

Data Compression
[Figure: Original Data is reduced to Compressed Data; lossless compression recovers the Original Data, while lossy compression recovers only an approximation (Original Data Approximated).]

Clustering
• Partitions the data set into clusters, and models it by one representative from each cluster
• Can be very effective if data is clustered but not if data is smeared
• There are many choices of clustering definitions and clustering algorithms, more later!