Database Systems, Data Mining
Slides based on the slides created by Prof. Mitch Cherniack, Brandeis University
http://www.cs.brandeis.edu/~cs127b/

What is a Database System?
Database: a very large collection of related data.
Models a real-world enterprise:
- Entities (e.g., teams, games / students, courses)
- Relationships (e.g., the Patriots are playing in the Super Bowl!)
- Even active components (e.g., business logic)
DBMS: a software package/system that can be used to store, manage, and retrieve data from databases.
Database System: DBMS + data (+ applications)

Why Study Databases?
Shift from computation to information:
- Always true for corporate computing
- More and more true in the scientific world
- And of course, the Web
A DBMS encompasses much of CS in a practical discipline:
- OS, languages, theory, AI, logic

Why Databases?
Why not store everything in flat files? Use the file system of the OS: cheap and simple...

  Name, Course, Grade
  John Smith, CS112, B
  Mike Stonebraker, CS234, A
  Jim Gray, CS560, A
  John Smith, CS560, B+
  ...

Yes, but not scalable...

Problem 1: Data redundancy and inconsistency
Multiple file formats, duplication of information in different files.

  Name, Course, Email, Grade
  John Smith, [email protected], CS112, B
  Mike Stonebraker, [email protected], CS234, A
  Jim Gray, CS560, [email protected], A
  John Smith, CS560, [email protected], B+

Why is this a problem?
- Wasted space
- Potential inconsistencies (multiple formats, John Smith vs. Smith J.)

Problem 2: Data retrieval
- Find the students who took CS560
- Find the students with GPA > 3.5
For every query we need to write a program!
We need the retrieval to be:
- Easy to write
- Efficient to execute

Data Organization
Two levels of data modeling:
- Conceptual or logical level: describes the data stored in the database and the relationships among the data.

Problem 3: Data Integrity
- No support for sharing: simultaneous modifications must be prevented
- No coping mechanisms for system crashes
- No means of preventing data entry errors (checks must be hard-coded in the programs)
- Security problems
Database systems offer solutions to all the above problems.

Data Organization (Cont.)
Example of a record type at the logical level:

  type customer = record
    name : string;
    street : string;
    city : integer;
  end;

- Physical level: describes how a record (e.g., customer) is stored.
- Also, view level: application programs hide details of data types. Views can also hide information (e.g., salary) for security purposes.

View of Data
[Figure: a logical architecture for a database system]

Database Schema
Similar to types and variables in programming languages.
Schema: the structure of the database
- E.g., the database consists of information about a set of customers and accounts and the relationship between them
- Analogous to type information of a variable in a program
- Physical schema: database design at the physical level
- Logical schema: database design at the logical level

Data Organization
Data models: a framework for describing
- data
- data relationships
- data semantics
- data constraints
Entity-Relationship model
We will concentrate on the relational model.
Other models:
- object-oriented model
- semi-structured data models, XML

Entity-Relationship Model
[Figure: example of a schema in the entity-relationship model]

Relational Model
Example of tabular data in the relational model (the table columns are the attributes).

Entity-Relationship Model (Cont.)
E-R model of the real world:
- Entities (objects), e.g., customers, accounts, bank branch
- Relationships between entities, e.g.:
  Account A-101 is held by customer Johnson.
  Relationship set depositor associates customers with accounts.
Widely used for database design:
- A database design in the E-R model is usually converted to a design in the relational model (coming up next), which is used for storage and processing.

Relational Model (Cont.)

  customer-id   customer-name  customer-street  customer-city  account-number
  192-83-7465   Johnson        Alma             Palo Alto      A-101
  019-28-3746   Smith          North            Rye            A-215
  192-83-7465   Johnson        Alma             Palo Alto      A-201
  321-12-3123   Jones          Main             Harrison       A-217
  019-28-3746   Smith          North            Rye            A-201

Data Organization: Data Storage
Where can data be stored?
- Main memory
- Secondary memory (hard disks)
- Optical storage (DVDs)
- Tertiary store (tapes)
Move data? Determined by the buffer manager.
Mapping data to files? Determined by the file manager.

Database Architecture (data organization)
[Figure: components shown are the DBA, DDL Commands, the DDL Interpreter, the Storage Manager (File Manager, Buffer Manager), and Secondary Storage (Data, Metadata, Schema)]

Data retrieval: Queries
Query = declarative data retrieval: describes what data, not how to retrieve it.
Ex.: "Give me the students with GPA > 3.5"
 vs. "Scan the student file and retrieve the records with gpa > 3.5"
Why?
1. Easier to write
2. Efficient to execute (why?)

Data retrieval: Query Processor
[Figure: Query, Query Processor (Query Optimizer, Plan, Query Evaluator), Data]
Query Optimizer: a "compiler" for queries (aka DML Compiler)
Plan ~ assembly language program
The optimizer does better with declarative queries:
1. Algorithmic query (e.g., in C) ⇒ 1 plan to choose from
2. Declarative query (e.g., in SQL) ⇒ n plans to choose from

SQL
SQL: widely used (declarative) non-procedural language
- E.g., find the name of the customer with customer-id 192-83-7465:
    select customer.customer-name
    from customer
    where customer.customer-id = '192-83-7465'
- Another example:
  find the balances of all accounts held by the customer with customer-id 192-83-7465:
    select account.balance
    from depositor, account
    where depositor.customer-id = '192-83-7465'
      and depositor.account-number = account.account-number
Procedural languages: C++, Java, relational algebra

Data retrieval: Indexing
How can we quickly answer the query "Find the student with SID = 101"?
One approach is to scan the student table, check every student, and return the one with id = 101... very slow for large databases.
Any better idea?
1st: Keep the student records sorted on SID and do a binary search... but what about updates?
2nd: Use a dynamic search tree! It allows insertions, deletions, and updates while keeping the records sorted. In databases we use the B+-tree (a multiway search tree).
3rd: Use a hash table. Much faster for exact-match queries... but it cannot support range queries. (Also, special hashing schemes are needed for dynamic data.)

B+Tree Example (B = 4)

  Root:            [120]
  Internal nodes:  [30, 100]                              [150, 180]
  Leaves:          [3, 5, 11]  [30, 35]  [100, 101, 110]  [120, 130]  [150, 156, 179]  [180, 200]

Database Architecture (data retrieval)
[Figure: components shown are the DB Programmer (code w/ embedded queries), the User (query), and the DBA (DDL commands); the DML Precompiler, the Query Processor (Query Optimizer, Query Evaluator), and the DDL Interpreter; the Storage Manager (File Manager, Buffer Manager); and Secondary Storage (Indices, Data, Statistics, Metadata, Schema)]

Data Integrity: Recovery
Transfer $50 from account A ($100) to account B ($200):
1. Get balance for A
2. If balanceA > $50
3. balanceA = balanceA - 50
4. Update balance A in database
   System crashes...
5. Get balance for B
6. balanceB = balanceB + 50
7. Update balance B in database

Data Integrity: Transaction processing
Why must concurrent access to data be managed?
John and Jane withdraw $50 and $100 from a common account...
John:
1. get balance
2. if balance > $50
3. balance = balance - $50
4. update balance
Jane:
1. get balance
2. if balance > $100
3. balance = balance - $100
4. update balance
Initial balance $300. Final balance = ?
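The John and Jane withdrawals can be replayed in a few lines to see why the final balance is not fixed. The following is a minimal sketch, not a model of a real DBMS: the function name `run`, the schedule encoding, and the collapsing of steps 1 and 2-4 into a "read" and a "write" action are assumptions made for illustration.

```python
# Hypothetical simulation of the John/Jane example: each person first reads the
# shared balance into a private copy, then later writes back (copy - amount).
AMOUNTS = {"John": 50, "Jane": 100}

def run(schedule, initial=300):
    """Execute a list of (person, action) pairs against one shared balance."""
    db_balance = initial   # value stored in the database
    local = {}             # each person's private copy of the balance
    for person, action in schedule:
        if action == "read":                      # step 1: get balance
            local[person] = db_balance
        elif action == "write":                   # steps 2-4: check, subtract, update
            if local[person] > AMOUNTS[person]:
                db_balance = local[person] - AMOUNTS[person]
    return db_balance

# John finishes before Jane starts (a serial execution).
serial = [("John", "read"), ("John", "write"), ("Jane", "read"), ("Jane", "write")]
# Both read 300 first; whoever writes last overwrites the other's update.
john_writes_last = [("John", "read"), ("Jane", "read"), ("Jane", "write"), ("John", "write")]
jane_writes_last = [("John", "read"), ("Jane", "read"), ("John", "write"), ("Jane", "write")]

print(run(serial))            # 150: both withdrawals are applied
print(run(john_writes_last))  # 250: Jane's withdrawal is lost
print(run(jane_writes_last))  # 200: John's withdrawal is lost
```

Only the serial schedule leaves the expected $150; the two interleaved schedules each lose one update, which is the kind of anomaly that managed concurrent access is meant to rule out.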
It depends... on how the steps of the two withdrawals interleave.
Coping with a crash in the middle of a transfer is the task of recovery management.

Database Architecture
[Figure: components shown are the DB Programmer (code w/ embedded queries), the User (query), and the DBA (DDL commands); the DML Precompiler, the Query Processor (Query Optimizer, Query Evaluator), and the DDL Interpreter; the Storage Manager (File Manager, Transaction Manager, Recovery Manager, Buffer Manager); and Secondary Storage (Indices, Data, Metadata, Integrity Constraints, Statistics, Schema)]

What is Data Mining?
Data Mining is:
(1) The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
(2) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

Overview of terms
- Data: a set of facts (items) D, usually stored in a database
- Attribute: a field in an item i in D
- Pattern: an expression E in a language L that describes a subset of facts
- Interestingness: a function I_{D,L} that maps an expression E in L into a measure space M

Overview of terms (Cont.)
The Data Mining Task:
For a given dataset D, language of facts L, interestingness function I_{D,L}, and threshold c, efficiently find the expression E such that I_{D,L}(E) > c.

Knowledge Discovery
[Figure]

Examples of Large Datasets
Government: IRS, NGA, ...
Large corporations:
- WALMART: 20M transactions per day
- MOBIL: 100 TB geological databases
- AT&T: 300M calls per day
- Credit card companies
Scientific:
- NASA, EOS project: 50 GB per hour
- Environmental datasets

Examples of Data Mining Applications
1. Fraud detection: credit cards, phone cards
2. Marketing: customer targeting
3. Data warehousing: Walmart
4. Astronomy
5. Molecular biology

How Data Mining is Used
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results

The Data Mining Process
1. Understand the domain

Data Mining Tasks
1. Classification: learning a function that maps an item into one of a set of predefined classes
2. Regression: learning a function that maps an item to a real value
3. Clustering: identifying a set of groups of similar items

The Data Mining Process (Cont.)
2. Create a dataset:
   - Select the interesting attributes
   - Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to 2

Data Mining Tasks (Cont.)
4. Dependencies and associations: identify significant dependencies between data attributes
5. Summarization: find a compact description of the dataset or a subset of the dataset

Data Mining Methods
1. Decision tree classifiers: used for modeling, classification
2. Association rules: used to find associations between sets of attributes
3. Sequential patterns: used to find temporal associations in time series
4. Hierarchical clustering: used to group customers, web users, etc.

Why Data Preprocessing?
Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
- Quality decisions must be based on quality data
- A data warehouse needs consistent integration of quality data
- Required for both OLAP and Data Mining!

Why can Data be Incomplete?
- Attributes of interest are not available (e.g., customer information for sales transaction data)
- Data were not considered important at the time of the transactions, so they were not recorded!
- Data were not recorded because of misunderstanding or malfunctions
- Data may have been recorded and later deleted!
- Missing/unknown values for some data

Classification: Definition
Given a collection of records (training set):
- Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

Classification Example
Attribute types: Home Owner (categorical), Marital Status (categorical), Taxable Income (continuous), Default (class)

Training Set:

  Tid  Home Owner  Marital Status  Taxable Income  Default
  1    Yes         Single          125K            No
  2    No          Married         100K            No
  3    No          Single          70K             No
  4    Yes         Married         120K            No
  5    No          Divorced        95K             Yes
  6    No          Married         60K             No
  7    Yes         Divorced        220K            No
  8    No          Single          85K             Yes
  9    No          Married         75K             No
  10   No          Single          90K             Yes

Test Set:

  Home Owner  Marital Status  Taxable Income  Default
  No          Single          75K             ?
  Yes         Married         50K             ?
  No          Married         150K            ?
  Yes         Divorced        90K             ?
  No          Single          40K             ?
  No          Married         80K             ?
Classification Example (Cont.)
[Figure: the Training Set feeds "Learn Classifier", which produces the Model; the Test Set is run through the Model]

Example of a Decision Tree
Training Data:

  Tid  Home Owner  Marital Status  Taxable Income  Default
  1    Yes         Single          125K            No
  2    No          Married         100K            No
  3    No          Single          70K             No
  4    Yes         Married         120K            No
  5    No          Divorced        95K             Yes
  6    No          Married         60K             No
  7    Yes         Divorced        220K            No
  8    No          Single          85K             Yes
  9    No          Married         75K             No
  10   No          Single          90K             Yes

Model: Decision Tree (splitting attributes: HO, MarSt, TaxInc)

  HO = Yes -> NO
  HO = No  -> MarSt
      MarSt = Single, Divorced -> TaxInc
          TaxInc < 80K -> NO
          TaxInc > 80K -> YES
      MarSt = Married -> NO

Another Example of Decision Tree
Same training data, different tree:

  MarSt = Married -> NO
  MarSt = Single, Divorced -> HO
      HO = Yes -> NO
      HO = No  -> TaxInc
          TaxInc < 80K -> NO
          TaxInc > 80K -> YES

There could be more than one tree that fits the same data!

Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.
Similarity measures:
- Euclidean distance if attributes are continuous.
- Other problem-specific measures.

Illustrating Clustering
Euclidean distance based clustering in 3-D space.
- Intracluster distances are minimized
- Intercluster distances are maximized

Clustering: Application 1
Market Segmentation:
- Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
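The "intracluster distances are minimized, intercluster distances are maximized" criterion can be checked numerically for a given grouping. The sketch below is illustrative only: the six 3-D points, their cluster labels, and the function names are made up for the example, and averaging pairwise distances is just one simple way to summarize the two quantities.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_intra_inter(points, labels):
    """Return (mean distance within clusters, mean distance across clusters).

    Assumes at least two clusters and at least one cluster with two or more
    points, so that both averages are defined.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(euclidean(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical data: two well-separated groups in 3-D space.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = mean_intra_inter(points, labels)
print(f"mean intracluster distance: {intra:.2f}")
print(f"mean intercluster distance: {inter:.2f}")
```

For a good clustering the first number should be much smaller than the second; swapping a few labels between the two groups makes the gap shrink.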
Clustering: Application 2
Document Clustering:
- Goal: to find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
- Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection:
- Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}

Association Rule Discovery: Application 1
Marketing and Sales Promotion:
- Let the rule discovered be {Bagels, ...} --> {Potato Chips}
- Potato Chips as consequent => can be used to determine what should be done to boost its sales.
- Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
- Bagels in the antecedent and Potato Chips in the consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!

Data Compression
[Figure: Original Data and Compressed Data, labeled lossless; Original Data Approximated, labeled lossy]

Clustering
- Partitions the data set into clusters, and models it by one representative from each cluster
- Can be very effective if data is clustered but not if data is "smeared"
- There are many choices of clustering definitions and clustering algorithms, more later!
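Returning to the five-transaction table in the association rule definition above, the two discovered rules can be checked by simple counting. The slides do not say how the rules were scored; support and confidence are the usual measures, so using them here is an assumption, and the function names are hypothetical.

```python
# The five market-basket transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent together."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs, rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

Under these assumed measures, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.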